

Timezone: America/Chicago

Oral Session 3A: 3D Computer Vision Sat 14 Jun 09:00 a.m.  

Oral
Zhengqi Li · Richard Tucker · Forrester Cole · Qianqian Wang · Linyi Jin · Vickie Ye · Angjoo Kanazawa · Aleksander Holynski · Noah Snavely

[ Karl Dean Ballroom ]

Abstract
We present a system that allows for accurate, fast, and robust estimation of camera parameters and depth maps from casual monocular videos of dynamic scenes. Most conventional structure from motion and monocular SLAM techniques assume input videos that feature predominantly static scenes with large amounts of parallax. Such methods tend to produce erroneous estimates in the absence of these conditions. Recent neural network based approaches attempt to overcome these challenges; however, such methods are either computationally expensive or brittle when run on dynamic videos with uncontrolled camera motion or unknown field of view. We demonstrate the surprising effectiveness of the deep visual SLAM framework, and with careful modifications to its training and inference schemes, this system can scale to real-world videos of complex dynamic scenes with unconstrained camera paths, including videos with little camera parallax. Extensive experiments on both synthetic and real videos demonstrate that our system is significantly more accurate and robust at camera pose and depth estimation when compared with prior and concurrent work, with faster or comparable running times.
Oral
Linyi Jin · Richard Tucker · Zhengqi Li · David Fouhey · Noah Snavely · Aleksander Holynski

[ Karl Dean Ballroom ]

Abstract
Learning to understand dynamic 3D scenes from imagery is crucial for applications ranging from robotics to scene reconstruction. Yet, unlike other problems where large-scale supervised training has enabled rapid progress, directly supervising methods for recovering 3D motion remains challenging due to the fundamental difficulty of obtaining ground truth annotations. We present a system for mining high-quality 4D reconstructions from internet stereoscopic, wide-angle videos. Our system fuses and filters the outputs of camera pose estimation, stereo depth estimation, and temporal tracking methods into high-quality dynamic 3D reconstructions. We use this method to generate large-scale data in the form of world-consistent, pseudo-metric 3D point clouds with long-term motion trajectories. We demonstrate the utility of this data by training a variant of DUSt3R to predict structure and 3D motion from real-world image pairs, showing that training on our reconstructed data enables generalization to diverse real-world scenes.
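The fusion step above ultimately rests on a standard geometric operation: unprojecting per-frame depth into a shared world frame using the estimated camera pose. The sketch below illustrates only that step; it is not the authors' pipeline, and the function name and conventions (OpenCV-style intrinsics, camera-to-world pose) are assumptions.

```python
import numpy as np

def depth_to_world_points(depth, K, cam_to_world):
    """Unproject a metric depth map into a world-space point cloud.

    depth: (H, W) depth in meters; K: (3, 3) pinhole intrinsics;
    cam_to_world: (4, 4) homogeneous camera pose.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)
    rays = pixels @ np.linalg.inv(K).T          # rays in camera coordinates
    pts_cam = rays * depth.reshape(-1, 1)       # scale each ray by its depth
    pts_h = np.concatenate([pts_cam, np.ones((pts_cam.shape[0], 1))], axis=1)
    return (pts_h @ cam_to_world.T)[:, :3]      # points in world coordinates
```

Accumulating such per-frame clouds across a video, together with long-term tracks, is what yields the world-consistent, pseudo-metric reconstructions described above.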
Oral
Qianqian Wang · Yifei Zhang · Aleksander Holynski · Alexei A. Efros · Angjoo Kanazawa

[ Karl Dean Ballroom ]

Abstract
We propose a novel unified framework capable of solving a broad range of 3D tasks. At the core of our approach is an online stateful recurrent model that continuously updates its state representation with each new observation. Given a stream of images, our method leverages the evolving state to generate metric-scale pointmaps for each input in an online manner. These pointmaps reside within a common coordinate system, accumulating into a coherent 3D scene reconstruction. Our model captures rich priors of real-world scenes: not only can it predict accurate pointmaps from image observations, but it can also infer unseen structures beyond the coverage of the input images through a raymap probe. Our method is simple yet highly flexible, naturally accepting varying lengths of image sequences and working seamlessly with both video streams and unordered photo collections. We evaluate our method on various 3D/4D tasks, including monocular/video depth estimation, camera estimation, and multi-view reconstruction, achieving competitive or state-of-the-art performance. Additionally, we showcase intriguing behaviors enabled by our state representation.
Oral
Yiran Wang · Jiaqi Li · Chaoyi Hong · Ruibo Li · Liusheng Sun · Xiao Song · Zhe Wang · Zhiguo Cao · Guosheng Lin

[ Karl Dean Ballroom ]

Abstract
Radar-Camera depth estimation aims to predict dense and accurate metric depth by fusing input images and Radar data. Model efficiency is crucial for this task in pursuit of real-time processing on autonomous vehicles and robotic platforms. However, due to the sparsity of Radar returns, the prevailing methods adopt multi-stage frameworks with intermediate quasi-dense depth, which are time-consuming and not robust. To address these challenges, we propose TacoDepth, an efficient and accurate Radar-Camera depth estimation model with one-stage fusion. Specifically, the graph-based Radar structure extractor and the pyramid-based Radar fusion module are designed to capture and integrate the graph structures of Radar point clouds, delivering superior model efficiency and robustness without relying on intermediate depth results. Moreover, TacoDepth is flexible across different inference modes, providing a better balance between speed and accuracy. Extensive experiments are conducted to demonstrate the efficacy of our method. Compared with the previous state-of-the-art approach, TacoDepth improves depth accuracy and processing speed by 12.8% and 91.8%, respectively. Our work provides a new perspective on efficient Radar-Camera depth estimation.
Oral
Anagh Malik · Benjamin Attal · Andrew Xie · Matthew O’Toole · David B. Lindell

[ Karl Dean Ballroom ]

Abstract
We present the first system for physically based, neural inverse rendering from multi-viewpoint videos of propagating light. Our approach relies on a time-resolved extension of neural radiance caching --- a technique that accelerates inverse rendering by storing infinite-bounce radiance arriving at any point from any direction. The resulting model accurately accounts for direct and indirect light transport effects and, when applied to captured measurements from a flash lidar system, enables state-of-the-art 3D reconstruction in the presence of strong indirect light. Further, we demonstrate view synthesis of propagating light, automatic decomposition of captured measurements into direct and indirect components, as well as novel capabilities such as multi-view transient relighting of captured scenes.

Oral Session 3C: Vision and Language Sat 14 Jun 09:00 a.m.  

Oral
Xinyu Tian · Shu Zou · Zhaoyuan Yang · Jing Zhang

[ Davidson Ballroom ]

Abstract
The evolution of Large Vision-Language Models (LVLMs) has progressed from single-image understanding to multi-image reasoning. Despite this advancement, our findings indicate that LVLMs struggle to robustly utilize information across multiple images, with predictions significantly affected by the alteration of image positions. To further explore this issue, we introduce Position-wise Question Answering (PQA), a meticulously designed task to quantify reasoning capabilities at each position. Our analysis reveals a pronounced position bias in LVLMs: open-source models excel in reasoning with images positioned later but underperform with those in the middle or at the beginning, while proprietary models like GPT-4o show improved comprehension for images at the beginning and end but struggle with those in the middle. Motivated by these insights, we propose SoFt Attention (SoFA), a simple, training-free approach that mitigates this bias by employing linear interpolation between inter-image causal attention and bidirectional counterparts. Experimental results demonstrate that SoFA effectively reduces position bias and significantly enhances the reasoning performance of existing LVLMs.
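The interpolation described for SoFA can be pictured as blending two attention patterns. The sketch below is a minimal, single-head illustration of that idea, assuming a precomputed block-causal mask over per-image token spans; it is not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def sofa_attention(q, k, v, block_causal_mask, alpha=0.5):
    """Blend inter-image causal and bidirectional attention weights.

    q, k, v: (n_tokens, d) projections for one attention head.
    block_causal_mask: (n, n) bool, True where attention is allowed under the
        inter-image causal pattern (each image attends to itself and earlier images).
    alpha: 0 -> purely causal across images, 1 -> fully bidirectional.
    """
    scores = q @ k.T / q.shape[-1] ** 0.5
    causal_scores = scores.masked_fill(~block_causal_mask, float("-inf"))
    attn_causal = F.softmax(causal_scores, dim=-1)
    attn_bidir = F.softmax(scores, dim=-1)                  # no inter-image masking
    attn = (1 - alpha) * attn_causal + alpha * attn_bidir   # linear interpolation
    return attn @ v
```

Because the blend happens in the attention weights, the approach is training-free: it only changes how an existing LVLM mixes information across image positions at inference time.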
Oral
Zhihe Yang · Xufang Luo · Dongqi Han · Yunjian Xu · Dongsheng Li

[ Davidson Ballroom ]

Abstract
Hallucination remains a major challenge for Large Vision-Language Models (LVLMs). Direct Preference Optimization (DPO) has gained increasing attention as a simple solution to hallucination issues. It directly learns from constructed preference pairs that reflect the severity of hallucinations in responses to the same prompt and image. Nonetheless, different data construction methods in existing works bring notable performance variations. We identify a crucial factor here: outcomes are largely contingent on whether the constructed data aligns on-policy w.r.t. the initial (reference) policy of DPO. Theoretical analysis suggests that learning from off-policy data is impeded by the presence of KL-divergence between the updated policy and the reference policy. From the perspective of dataset distribution, we systematically summarize the inherent flaws in existing algorithms that employ DPO to address hallucination issues. To alleviate the problems, we propose the On-Policy Alignment (OPA)-DPO framework, which uniquely leverages expert feedback to correct hallucinated responses and aligns both the original and expert-revised responses in an on-policy manner. Notably, with only 4.8k data, OPA-DPO achieves an additional reduction in the hallucination rate of LLaVA-1.5-7B: 13.26% on the AMBER benchmark and 5.39% on the Object-Hal benchmark, compared to the previous SOTA algorithm trained with 16k samples.
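For reference, the standard DPO objective that this line of work builds on scores each preference pair by its implicit reward margin relative to the frozen reference policy; the anchoring to that reference policy is exactly what makes off-policy data construction problematic. A minimal PyTorch sketch of the vanilla DPO loss (not the OPA-DPO training recipe):

```python
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO objective on a batch of preference pairs.

    logp_*: summed log-probabilities of the chosen / rejected responses under the
    policy being trained; ref_logp_*: the same quantities under the frozen
    reference policy. All inputs are 1-D tensors of equal length.
    """
    # Implicit reward margins, measured relative to the reference policy.
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    # Maximize the probability that the chosen response is preferred.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```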
Oral
Zicheng Zhang · Tengchuan Kou · Chunyi Li · Shushi Wang · Wei Sun · Wei Wang · Xiaoyu Li · ZongYu Wang · Xuezhi Cao · Xiongkuo Min · Xiaohong Liu · Guangtao Zhai

[ Davidson Ballroom ]

Abstract
Evaluating text-to-vision content hinges on two crucial aspects: visual quality and alignment. While significant progress has been made in developing objective models to assess these dimensions, the performance of such models heavily relies on the scale and quality of human annotations. According to the scaling law, increasing the number of human-labeled instances follows a predictable pattern that enhances the performance of evaluation models. Therefore, we introduce a comprehensive dataset designed to Evaluate Visual quality and Alignment Level for text-to-vision content (Q-EVAL-100K), featuring the largest collection of human-labeled Mean Opinion Scores (MOS) for the two aspects mentioned. The Q-EVAL-100K dataset encompasses both text-to-image and text-to-video models, with 960K human annotations specifically focused on visual quality and alignment for 100K instances (60K images and 40K videos). Leveraging this dataset with context prompts, we propose Q-Eval-Score, a unified model capable of evaluating both visual quality and alignment, with special improvements for handling long-text prompt alignment. Experimental results indicate that the proposed Q-Eval-Score achieves superior performance on both visual quality and alignment, with strong generalization capabilities across other benchmarks. These findings highlight the significant value of the Q-EVAL-100K dataset. The data and code will be released to help promote generation models.
Oral
Jihan Yang · Shusheng Yang · Anjali W. Gupta · Rilyn Han · Li Fei-Fei · Saining Xie

[ Davidson Ballroom ]

Abstract
Humans possess the visual-spatial intelligence to remember spaces from sequential visual observations. However, can Multimodal Large Language Models (MLLMs) trained on million-scale video datasets also "think in space" from videos? We present a novel video-based visual-spatial intelligence benchmark (VSI-Bench) of over 5,000 question-answer pairs, and find that MLLMs exhibit competitive—though subhuman—visual-spatial intelligence. We probe models to express how they think in space both linguistically and visually and find that while spatial reasoning capabilities remain the primary bottleneck for MLLMs to reach higher benchmark performance, local world models and spatial awareness do emerge within these models. Notably, prevailing linguistic reasoning techniques (e.g., chain-of-thought, self-consistency, tree-of-thoughts) fail to improve performance, whereas explicitly generating cognitive maps during question-answering enhances MLLMs' spatial distance awareness.
Oral
Andrew Szot · Bogdan Mazoure · Omar Attia · Aleksei Timofeev · Harsh Agrawal · R Devon Hjelm · Zhe Gan · Zsolt Kira · Alexander Toshev

[ Davidson Ballroom ]

Abstract
We examine the capability of Multimodal Large Language Models (MLLMs) to tackle diverse domains that extend beyond the traditional language and vision tasks these models are typically trained on. Specifically, our focus lies in areas such as Embodied AI, Games, UI Control, and Planning. To this end, we introduce a process of adapting an MLLM to a Generalist Embodied Agent (GEA). GEA is a single unified model capable of grounding itself across these varied domains through a multi-embodiment action tokenizer. GEA is trained with supervised learning on a large dataset of embodied experiences and with online RL in interactive simulators. We explore the data and algorithmic choices necessary to develop such a model. Our findings reveal the importance of training with cross-domain data and online RL for building generalist agents. The final GEA model achieves strong generalization performance to unseen tasks across diverse benchmarks compared to other generalist models and benchmark-specific approaches.

Oral Session 3B: Multimodal Computer Vision Sat 14 Jun 09:00 a.m.  

Oral
Kaiyu Li · Ruixun Liu · Xiangyong Cao · Xueru Bai · Feng Zhou · Deyu Meng · Wang Zhi

[ ExHall A2 ]

Abstract
Current remote sensing semantic segmentation methods are mostly built on the close-set assumption, meaning that the model can only recognize pre-defined categories that exist in the training set. However, in practical Earth observation, there are countless unseen categories, and manual annotation is impractical. To address this challenge, we first attempt to introduce training-free open-vocabulary semantic segmentation (OVSS) into the remote sensing context. However, due to the sensitivity of remote sensing images to low-resolution features, the predicted masks exhibit distorted target shapes and ill-fitting boundaries. To tackle these issues, we propose a simple and universal upsampler, i.e., SimFeatUp, to restore lost spatial information of deep features. Specifically, SimFeatUp only needs to learn from a few unlabeled images, and can upsample arbitrary remote sensing image features. Furthermore, based on the observation of the abnormal response of patch tokens to the [CLS] token in CLIP, we propose to execute a simple subtraction operation to alleviate the global bias in patch tokens. Extensive experiments are conducted on 17 remote sensing datasets of 4 tasks, including semantic segmentation, building extraction, road detection, and flood detection. Our method achieves an average of 5.8%, 8.2%, 4.0%, and 15.3% improvement over state-of-the-art methods on the 4 …
Oral
Ding Qi · Jian Li · Junyao Gao · Shuguang Dou · Ying Tai · Jianlong Hu · Bo Zhao · Yabiao Wang · Chengjie Wang · Cai Rong Zhao

[ ExHall A2 ]

Abstract
Dataset distillation (DD) condenses key information from large-scale datasets into smaller synthetic datasets, reducing storage and computational costs for training networks. However, recent research has primarily focused on image classification tasks, with limited expansion to detection and segmentation. Two key challenges remain: (i) Task Optimization Heterogeneity, where existing methods focus on class-level information and fail to address the diverse needs of detection and segmentation, and (ii) Inflexible Image Generation, where current generation methods rely on global updates for single-class targets and lack localized optimization for specific object regions. To address these challenges, we propose a universal dataset distillation framework, named UniDD, a task-driven diffusion model for diverse DD tasks, as illustrated in Fig. 1. Our approach operates in two stages: Universal Task Knowledge Mining, which captures task-relevant information through task-specific proxy model training, and Universal Task-Driven Diffusion, where these proxies guide the diffusion process to generate task-specific synthetic images. Extensive experiments across ImageNet-1K, Pascal VOC, and MS COCO demonstrate that UniDD consistently outperforms state-of-the-art methods. In particular, on ImageNet-1K with IPC-10, UniDD surpasses previous diffusion-based methods by 6.1%, while also reducing deployment costs.
Oral
Jingyi Xu · Siwei Tu · Weidong Yang · Ben Fei · Shuhao Li · Keyi Liu · Yeqi Luo · Lipeng Ma · Lei Bai

[ ExHall A2 ]

Abstract
Variation of Arctic sea ice has significant impacts on polar ecosystems, transport routes, coastal communities, and global climate. Tracing the change of sea ice at a finer scale is paramount for both operational applications and scientific studies. Recent pan-Arctic sea ice forecasting methods that leverage advances in artificial intelligence have made promising progress over numerical models. However, forecasting sea ice at higher resolutions is still under-explored. To bridge the gap, we propose a two-module cooperative deep learning framework, IceDiff, to forecast sea ice concentration at finer scales. IceDiff first leverages a vision transformer to generate coarse yet superior forecasting results over previous methods on a regular 25 km grid. This high-quality sea ice forecast serves as reliable guidance for the next module. Subsequently, an unconditional diffusion model pre-trained on low-resolution sea ice concentration maps is utilized for sampling down-scaled sea ice forecasts via a zero-shot guided sampling strategy and a patch-based method. For the first time, IceDiff demonstrates sea ice forecasting at a 6.25 km resolution. IceDiff extends the boundary of existing sea ice forecasting models and, more importantly, its capability to generate high-resolution sea ice concentration data is vital for practical use and research.
Oral
Kunyu Wang · Xueyang Fu · Xin Lu · Chengjie Ge · Chengzhi Cao · Wei Zhai · Zheng-Jun Zha

[ ExHall A2 ]

Abstract
Continual test-time adaptive object detection (CTTA-OD) aims to online adapt a source pre-trained detector to ever-changing environments during inference under continuous domain shifts. Most existing CTTA-OD methods prioritize effectiveness while overlooking computational efficiency, which is crucial for resource-constrained scenarios. In this paper, we propose an efficient CTTA-OD method via pruning. Our motivation stems from the observation that not all learned source features are beneficial; certain domain-sensitive feature channels can adversely affect target domain performance. Inspired by this, we introduce a sensitivity-guided channel pruning strategy that quantifies each channel based on its sensitivity to domain discrepancies at both image and instance levels. We apply weighted sparsity regularization to selectively suppress and prune these sensitive channels, focusing adaptation efforts on invariant ones. Additionally, we introduce a stochastic channel reactivation mechanism to restore pruned channels, enabling recovery of potentially useful features and mitigating the risks of early pruning. Extensive experiments on three benchmarks show that our method achieves superior adaptation performance while reducing computational overhead by 12% in FLOPs compared to the recent SOTA method.
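One common way to realize a weighted sparsity regularizer over channels is an L1 penalty on per-channel scale factors (e.g., BatchNorm gammas), weighted by a sensitivity score. The sketch below shows that generic form under the assumption that sensitivity scores are precomputed; it is illustrative only and not the paper's exact formulation.

```python
import torch

def weighted_channel_sparsity(bn_scales, sensitivity, lam=1e-4):
    """Weighted L1 sparsity on per-channel scale factors.

    bn_scales: (C,) learnable scale factors acting as channel gates;
    sensitivity: (C,) nonnegative domain-sensitivity scores (assumed precomputed).
    Channels judged more domain-sensitive are pushed harder toward zero and
    become candidates for pruning.
    """
    return lam * (sensitivity.detach() * bn_scales.abs()).sum()
```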
Oral
Jiaxin Cai · Jingze Su · Qi Li · Wenjie Yang · Shu Wang · Tiesong Zhao · Shengfeng He · Wenxi Liu

[ ExHall A2 ]

Abstract
Multimodal semantic segmentation is a critical challenge in computer vision, with early methods suffering from high computational costs and limited transferability due to full fine-tuning of RGB-based pre-trained parameters. Recent studies, while leveraging additional modalities as supplementary prompts to RGB, still predominantly rely on RGB, which restricts the full potential of other modalities. To address these issues, we propose a novel symmetric parameter-efficient fine-tuning framework for multimodal segmentation, featuring a modality-aware prompting and adaptation scheme, to simultaneously adapt the capabilities of a powerful pre-trained model to both RGB and X modalities. Furthermore, prevalent approaches use the global cross-modality correlations of the attention mechanism for modality fusion, which inadvertently introduces noise across modalities. To mitigate this noise, we propose a dynamic sparse cross-modality fusion module to facilitate effective and efficient cross-modality fusion. To further strengthen the above two modules, we propose a training strategy that leverages accurately predicted dual-modality results to self-teach the single-modality outcomes. In comprehensive experiments, we demonstrate that our method outperforms previous state-of-the-art approaches across six multimodal segmentation scenarios with minimal computation cost.

Poster Session 3 Sat 14 Jun 10:30 a.m.  

Poster
Peiwen Lai · Weizhi Zhong · Yipeng Qin · Xiaohang Ren · Baoyuan Wang · Guanbin Li

[ ExHall D ]

Abstract
Generating natural listener responses in conversational scenarios is crucial for creating engaging digital humans and avatars. Recent work has shown that large language models (LLMs) can be effectively leveraged for this task, demonstrating remarkable capabilities in generating contextually appropriate listener behaviors. However, current LLM-based methods face two critical limitations: they rely solely on speech content, overlooking other crucial communication signals, and they entangle listener identity with response generation, compromising output fidelity and generalization. In this work, we present a novel framework that addresses these limitations while maintaining the advantages of LLMs. Our approach introduces a Multimodal-LM architecture that jointly processes speech content, prosody, and speaker emotion, capturing the full spectrum of communication cues. Additionally, we propose an identity disentanglement strategy using instance normalization and adaptive instance normalization in a VQ-VAE framework, enabling high-fidelity listening head synthesis with flexible identity control. Extensive experiments demonstrate that our method significantly outperforms existing approaches in terms of response naturalness and fidelity, while enabling effective identity control without retraining.
Poster
Yongming Zhu · Longhao Zhang · Zhengkun Rong · Tianshu Hu · Shuang Liang · Zhipengge

[ ExHall D ]

Abstract
Imagine having a conversation with a socially intelligent agent. It can attentively listen to your words and offer visual and linguistic feedback promptly. This seamless interaction allows for multiple rounds of conversation to flow smoothly and naturally. In pursuit of actualizing it, we propose INFP, a novel audio-driven head generation framework for dyadic interaction. Unlike previous head generation works that only focus on single-sided communication, or require manual role assignment and explicit role switching, our model drives the agent portrait to alternate dynamically between speaking and listening states, guided by the input dyadic audio. Specifically, INFP comprises a Motion-Based Head Imitation stage and an Audio-Guided Motion Generation stage. The first stage learns to project facial communicative behaviors from real-life conversation videos into a low-dimensional motion latent space, and uses the motion latent codes to animate a static image. The second stage learns the mapping from the input dyadic audio to motion latent codes through denoising, leading to audio-driven head generation in interactive scenarios. To facilitate this line of research, we introduce DyConv, a large-scale dataset of rich dyadic conversations collected from the Internet. Extensive experiments and visualizations demonstrate the superior performance and effectiveness of our method.
Poster
Jiazhi Guan · Kaisiyuan Wang · Zhiliang Xu · Quanwei Yang · Yasheng SUN · Shengyi He · Borong Liang · Yukang Cao · Yingying Li · Haocheng Feng · Errui Ding · Jingdong Wang · Youjian Zhao · Hang Zhou · Ziwei Liu

[ ExHall D ]

Abstract
Despite the recent progress of audio-driven video generation, existing methods mostly focus on driving facial movements, leading to non-coherent head and body dynamics. Moving forward, it is desirable yet challenging to generate holistic human videos with both accurate lip-sync and delicate co-speech gestures w.r.t. the given audio. In this work, we propose AudCast, a generalized audio-driven human video generation framework adopting a cascade Diffusion-Transformers (DiTs) paradigm, which synthesizes holistic human videos based on a reference image and a given audio. 1) Firstly, an audio-conditioned Holistic Human DiT architecture is proposed to directly drive the movements of any human body with vivid gesture dynamics. 2) Then, to enhance hand and face details that are notoriously difficult to handle, a Regional Refinement DiT leverages regional 3D fitting as the bridge to reform the signals, producing the final results. Extensive experiments demonstrate that our framework generates high-fidelity audio-driven holistic human videos with temporal coherence and fine facial and hand details.
Poster
Jiahe Li · Jiawei Zhang · Xiao Bai · Jin Zheng · Jun Zhou · Lin Gu

[ ExHall D ]

Abstract
Despite exhibiting impressive performance in synthesizing lifelike personalized 3D talking heads, prevailing methods based on radiance fields suffer from high demands for training data and time for each new identity. This paper introduces InsTaG, a 3D talking head synthesis framework that enables fast learning of a realistic personalized 3D talking head from few training samples. Built upon a lightweight 3DGS person-specific synthesizer with universal motion priors, InsTaG achieves high-quality and fast adaptation while preserving high-level personalization and efficiency. As preparation, we first propose an Identity-Free Pre-training strategy that enables the pre-training of the person-specific model and encourages the collection of universal motion priors from a long-video data corpus. To fully exploit the universal motion priors to learn an unseen new identity, we then present a Motion-Aligned Adaptation strategy to adaptively align the target head to the pre-trained field, and constrain a robust dynamic head structure under limited training data. Extensive experiments demonstrate our outstanding performance and efficiency under various data scenarios in rendering high-quality personalized talking head videos.
Poster
Bohao Zhang · Xuejiao Wang · Changbo Wang · Gaoqi He

[ ExHall D ]

Abstract
Micro-expression recognition (MER) aims to uncover genuine emotions and underlying psychological states. However, existing MER methods struggle with three main challenges: 1) scarcity of micro-expression samples; 2) difficulty in modeling nearly imperceptible facial movements; and 3) reliance on apex frame annotations. To address these issues, we propose a Self-supervised Oriented Deformation model for Apex-free Micro-expression Recognition (SODA4MER). Our approach enhances local deformation perception using muscle-group priors and amplifies subtle features through Dynamic Stereotype Theory (DST) based enhancement, while contrastive learning eliminates the need for manual apex annotations. Specifically, the oriented deformation estimator of SODA4MER is first pre-trained in a self-supervised manner. Second, a Gated Temporal Variance Gaussian model (GTVG) is introduced to adaptively integrate facial muscle-group priors, enhancing local deformation perception and mitigating noise from head movements. Then, contrastive learning is employed to achieve apex detection by identifying the frame with the most significant local deformation. Finally, guided by DST, we introduce a feature enhancement strategy that models the temporal dynamics of local deformation in the activation and decay phases, leading to richer deformation features. Our rigorous experiments confirm the competitive performance and practical applicability of SODA4MER.
Poster
Shengze Wang · Xueting Li · Chao Liu · Matthew Chan · Michael Stengel · Henry Fuchs · Shalini De Mello · Koki Nagano

[ ExHall D ]

Abstract
Recent breakthroughs in single-image 3D portrait reconstruction have enabled telepresence systems to stream 3D portrait videos from a single camera in real-time, democratizing telepresence. However, per-frame 3D reconstruction exhibits temporal inconsistency and forgets the user's appearance. On the other hand, self-reenactment methods can render coherent 3D portraits by driving a 3D avatar built from a single reference image, but fail to faithfully preserve the user's per-frame appearance (e.g., instantaneous facial expression and lighting). As a result, none of these two frameworks is an ideal solution for democratized 3D telepresence. In this work, we address this dilemma and propose a novel solution that maintains both coherent identity and dynamic per-frame appearance to enable the best possible realism. To this end, we propose a new fusion-based method that takes the best of both worlds by fusing a canonical 3D prior from a reference view with dynamic appearance from per-frame input views, producing temporally stable 3D videos with faithful reconstruction of the user's per-frame appearance. Trained only using synthetic data produced by an expression-conditioned 3D GAN, our encoder-based method achieves both state-of-the-art 3D reconstruction and temporal consistency on in-studio and in-the-wild datasets.
Poster
Jianchuan Chen · Jingchuan Hu · Gaige Wang · Zhonghua Jiang · Tiansong Zhou · Zhiwen Chen · Chengfei Lv

[ ExHall D ]

Abstract
Realistic 3D full-body talking avatars hold great potential in AR, with applications ranging from e-commerce live streaming to holographic communication. Despite advances in 3D Gaussian Splatting (3DGS) for lifelike avatar creation, existing methods struggle with fine-grained control of facial expressions and body movements in full-body talking tasks. Additionally, they often lack sufficient details and cannot run in real-time on mobile devices. We present TaoAvatar, a high-fidelity, lightweight, 3DGS-based full-body talking avatar driven by various signals. Our approach starts by creating a personalized clothed human parametric template that binds Gaussians to represent appearances. We then pre-train a StyleUnet-based network to handle complex pose-dependent non-rigid deformation, which can capture high-frequency appearance details but is too resource-intensive for mobile devices. To overcome this, we "bake" the non-rigid deformations into a lightweight MLP-based network using a distillation technique and develop blend shapes to compensate for details. Extensive experiments show that TaoAvatar achieves state-of-the-art rendering quality while running in real-time across various devices, maintaining 90 FPS on high-definition stereo devices such as the Apple Vision Pro.
Poster
Wojciech Zielonka · Stephan J. Garbin · Alexandros Lattas · George Kopanas · Paulo Gotardo · Thabo Beeler · Justus Thies · Timo Bolkart

[ ExHall D ]

Abstract
We present SynShot, a novel method for few-shot inversion of a drivable head avatar based on a synthetic prior. We tackle two major challenges. First, training a controllable 3D generative network requires a large number of diverse sequences, for which pairs of images and high-quality tracked meshes are not always available. Second, state-of-the-art monocular avatar models struggle to generalize to new views and expressions, lacking a strong prior and often overfitting to a specific viewpoint distribution. Inspired by machine learning models trained solely on synthetic data, we propose a method that learns a prior model from a large dataset of synthetic heads with diverse identities, expressions, and viewpoints. With few input images, SynShot fine-tunes the pretrained synthetic prior to bridge the domain gap, modeling a photorealistic head avatar that generalizes to novel expressions and viewpoints. We model the head avatar using 3D Gaussian splatting and a convolutional encoder-decoder that outputs Gaussian parameters in UV texture space. To account for the different modeling complexities over parts of the head (e.g., skin vs hair), we embed the prior with explicit control for upsampling the number of per-part primitives. Compared to SOTA monocular methods that require thousands of real training images, SynShot significantly …
Poster
Linzhou Li · Yumeng Li · Yanlin Weng · Youyi Zheng · Kun Zhou

[ ExHall D ]

Abstract
We present Reduced Gaussian Blendshapes Avatar (RGBAvatar), a method for reconstructing photorealistic, animatable head avatars at speeds sufficient for on-the-fly reconstruction. Unlike prior approaches that utilize linear bases from 3D morphable models (3DMM) to model Gaussian blendshapes, our method maps tracked 3DMM parameters into reduced blendshape weights with an MLP, leading to a compact set of blendshape bases. The learned compact base composition effectively captures essential facial details for specific individuals, and does not rely on the fixed base composition weights of 3DMM, leading to enhanced reconstruction quality and higher efficiency. To further expedite the reconstruction process, we develop a novel color initialization estimation method and a batch-parallel Gaussian rasterization process, achieving state-of-the-art quality with training throughput of about 630 images per second. Moreover, we propose a local-global sampling strategy that enables direct on-the-fly reconstruction, immediately reconstructing the model as video streams in real time while achieving quality comparable to offline settings.
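The core mapping described above, from tracked 3DMM parameters to weights over a compact set of learned Gaussian blendshape bases, can be sketched as follows; all dimensions and the attribute layout are illustrative assumptions, not the released model.

```python
import torch
import torch.nn as nn

class ReducedBlendshapeHead(nn.Module):
    """Map tracked 3DMM parameters to blend weights over learned Gaussian bases.

    Dimensions are placeholders: n_3dmm tracked parameters, n_bases reduced
    blendshapes, n_gaussians primitives with attr_dim attributes each
    (position, rotation, scale, opacity, color would live in attr_dim channels).
    """
    def __init__(self, n_3dmm=100, n_bases=20, n_gaussians=50_000, attr_dim=11):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_3dmm, 256), nn.ReLU(),
            nn.Linear(256, n_bases),
        )
        self.neutral = nn.Parameter(torch.zeros(n_gaussians, attr_dim))
        self.bases = nn.Parameter(torch.zeros(n_bases, n_gaussians, attr_dim))

    def forward(self, p3dmm):
        w = self.mlp(p3dmm)                               # (n_bases,) blend weights
        # Compose Gaussian attributes as neutral + weighted sum of basis offsets.
        return self.neutral + torch.einsum("b,bga->ga", w, self.bases)
```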
Poster
Hongyu Liu · Xuan Wang · Ziyu Wan · Yue Ma · Jingye Chen · Yanbo Fan · Yujun Shen · Yibing Song · Qifeng Chen

[ ExHall D ]

Abstract
This work focuses on open-domain 4D avatarization, with the purpose of creating a 4D avatar from a portrait image in an arbitrary style. We select parametric triplanes as the intermediate 4D representation, and propose a practical training paradigm that takes advantage of both generative adversarial networks (GANs) and diffusion models. Our design stems from the observation that 4D GANs excel at bridging images and triplanes without supervision yet usually face challenges in handling diverse data distributions. A robust 2D diffusion prior emerges as the solution, assisting the GAN in transferring its expertise across various domains. The synergy between these experts permits the construction of a multi-domain image-triplane dataset, which drives the development of a general 4D avatar creator. Extensive experiments suggest that our model, termed \method, is capable of producing high-quality 4D avatars with strong robustness to various source image domains. The code, the data, and the models will be made publicly available to facilitate future studies.
Poster
Dimitrios Gerogiannis · Foivos Paraperas Papantoniou · Rolandos Alexandros Potamias · Alexandros Lattas · Stefanos Zafeiriou

[ ExHall D ]

Abstract
Inspired by the effectiveness of 3D Gaussian Splatting (3DGS) in reconstructing detailed 3D scenes within multi-view setups and the emergence of large 2D human foundation models, we introduce Arc2Avatar, the first SDS-based method utilizing a human face foundation model as guidance with just a single image as input. To achieve that, we extend such a model for diverse-view human head generation by fine-tuning on synthetic data and modifying its conditioning. Our avatars maintain a dense correspondence with a human face mesh template, allowing blendshape-based expression generation. This is achieved through a modified 3DGS approach, connectivity regularizers, and a strategic initialization tailored for our task. Additionally, we propose an optional efficient SDS-based correction step to refine the blendshape expressions, enhancing realism and diversity. Experiments demonstrate that Arc2Avatar achieves state-of-the-art realism and identity preservation, effectively addressing color issues by allowing the use of very low guidance, enabled by our strong identity prior and initialization strategy, without compromising detail.
Poster
Zhiyang Guo · Jinxu Xiang · Kai Ma · Wengang Zhou · Houqiang Li · Ran Zhang

[ ExHall D ]

Abstract
3D characters are essential to modern creative industries, but making them animatable often demands extensive manual work in tasks like rigging and skinning. Existing automatic rigging tools face several limitations, including the necessity for manual annotations, rigid skeleton topologies, and limited generalization across diverse shapes and poses. An alternative approach generates animatable avatars pre-bound to a rigged template mesh. However, this method often lacks flexibility and is typically limited to realistic human shapes. To address these issues, we present Make-It-Animatable, a novel data-driven method to make any 3D humanoid model ready for character animation in less than one second, regardless of its shapes and poses. Our unified framework generates high-quality blend weights, bones, and pose transformations. By incorporating a particle-based shape autoencoder, our approach supports various 3D representations, including meshes and 3D Gaussian splats. Additionally, we employ a coarse-to-fine representation and a structure-aware modeling strategy to ensure both accuracy and robustness, even for characters with non-standard skeleton structures. We conducted extensive experiments to validate our framework's effectiveness. Compared to existing methods, our approach demonstrates significant improvements in both quality and speed. The source code will be made publicly available.
Poster
Tianyi Xie · Yiwei Zhao · Ying Jiang · Chenfanfu Jiang

[ ExHall D ]

Abstract
Creating hand-drawn animation sequences is labor-intensive and demands professional expertise. We introduce PhysAnimator, a novel approach for generating physically plausible yet anime-stylized animation from static anime illustrations. Our method seamlessly integrates physics-based simulations with data-driven generative models to produce dynamic and visually compelling animations. To capture the fluidity and exaggeration characteristic of anime, we perform image-space deformable body simulations on extracted mesh geometries. We enhance artistic control by introducing customizable energy strokes and incorporating rigging point support, enabling the creation of tailored animation effects such as wind interactions. Finally, we extract and warp sketches from the simulation sequence, generating a texture-agnostic representation, and employ a sketch-guided video diffusion model to synthesize high-quality animation frames. The resulting animations exhibit temporal consistency and visual plausibility, demonstrating the effectiveness of our method in creating dynamic anime-style animations.
Poster
Sohyun Jeong · Taewoong Kang · Hyojin Jang · Jaegul Choo

[ ExHall D ]

Abstract
With growing demand in media and social networks for personalized images, the need for advanced head-swapping techniques—integrating an entire head from the head image with the body from the body image—has increased. However, traditional head-swapping methods heavily rely on face-centered cropped data with primarily frontal-facing views, which limits their effectiveness in real-world applications. Additionally, their masking methods, designed to indicate regions requiring editing, are optimized for these types of dataset but struggle to achieve seamless blending in complex situations, such as when the original data includes features like long hair extending beyond the masked area. To overcome these limitations and enhance adaptability in diverse and complex scenarios, we propose a novel head swapping method, HID, that is robust to images including the full head and the upper body, and handles from frontal to side views, while automatically generating context-aware masks. For automatic mask generation, we introduce the IOMask, which enables seamless blending of the head and body, effectively addressing integration challenges. We further introduce the hair injection module to capture hair details with greater precision. Our experiments demonstrate that the proposed approach achieves state-of-the-art performance in head swapping, providing visually consistent and realistic results across a wide range of challenging …
Poster
Zhiyu Qu · Yunqi Miao · Zhensong Zhang · Jifei Song · Jiankang Deng · Yi-Zhe Song

[ ExHall D ]

Abstract
We present CaricatureBooth, a system that transforms caricature creation into a simple interactive experience -- as easy as using a photo booth! A key challenge in caricature generation is two-fold: the scarcity of high-quality caricature data and the difficulty in enabling precise creative control over the exaggeration process while maintaining identity. Prior approaches either require large-scale caricature and photo data or lack intuitive mechanisms for users to guide the deformation without losing identity. We address the data scarcity by synthesising training data through Thin Plate Spline (TPS) deformation of standard face images. For creative control, we design a Bézier curve interface where users can easily manipulate facial features, with these edits then driving TPS transformations at inference time. When combined with a pre-trained ID-preserving diffusion model, our system maintains both identity preservation and creative flexibility. Through extensive experiments, we demonstrate that CaricatureBooth achieves state-of-the-art quality while making the joy of caricature creation as accessible as taking a photo -- just walk in and walk out with your personalised caricature! Code will be made available at the first instance to facilitate follow-up efforts.
Poster
Kwan Yun · Chaelin Kim · Hangyeul Shin · Junyong Noh

[ ExHall D ]

Abstract
Recent 3D face editing methods using masks have produced high-quality edited images by leveraging Neural Radiance Fields (NeRF). Despite their impressive performance, existing methods often provide limited user control due to the use of pre-trained segmentation masks. To utilize masks with a desired layout, an extensive training dataset is required, which is challenging to gather. We present FFaceNeRF, a NeRF-based face editing technique that can overcome the challenge of limited user control due to the use of fixed mask layouts. Our method employs a geometry adapter with feature injection, allowing for effective manipulation of geometry attributes. Additionally, we adopt latent mixing for tri-plane augmentation, which enables training with fewer samples. This facilitates rapid model adaptation to desired mask layouts, crucial for applications in fields like personalized medical imaging or creative face editing. Our comparative evaluations indicate that FFaceNeRF surpasses existing mask-based face editing methods in terms of flexibility, control, and generated image quality, paving the way for future advancements in customized and high-fidelity 3D face editing.
Poster
Honghu Chen · Bo Peng · Yunfan Tao · Juyong Zhang

[ ExHall D ]

Abstract
We introduce D3-Human, a method for reconstructing Dynamic Disentangled Digital Human geometry from monocular videos. Past monocular video human reconstruction primarily focuses on reconstructing undecoupled clothed human bodies or only reconstructing clothing, making it difficult to apply directly in applications such as animation production. The challenge in reconstructing decoupled clothing and body lies in the occlusion caused by clothing over the body. To this end, the details of the visible area and the plausibility of the invisible area must be ensured during the reconstruction process. Our proposed method combines explicit and implicit representations to model the decoupled clothed human body, leveraging the robustness of explicit representations and the flexibility of implicit representations. Specifically, we reconstruct the visible region as SDF and propose a novel human manifold signed distance field (hmSDF) to segment the visible clothing and visible body, and then merge the visible and invisible body. Extensive experimental results demonstrate that, compared with existing reconstruction schemes, D3-Human can achieve high-quality decoupled reconstruction of the human body wearing different clothing, and can be directly applied to clothing transfer and animation production.
Poster
Radu Alexandru Rosu · Keyu Wu · Yao Feng · Youyi Zheng · Michael J. Black

[ ExHall D ]

Abstract
We address the task of reconstructing 3D hair geometry from a single image, which is challenging due to the diversity of hairstyles and the lack of paired image-to-3D hair data. Previous methods are primarily trained on synthetic data and cope with the limited amount of such data by using low-dimensional intermediate representations, such as guide strands and scalp-level embeddings, that require post-processing to decode, upsample, and add realism. These approaches fail to reconstruct detailed hair, struggle with curly hair, or are limited to handling only a few hairstyles. To overcome these limitations, we propose DiffLocks, a novel framework that enables detailed reconstruction of a wide variety of hairstyles directly from a single image. First, we address the lack of 3D hair data by automating the creation of the largest synthetic hair dataset to date, containing 40K hairstyles. Second, we leverage the synthetic hair dataset to learn an image-conditioned diffusion-transformer model that reconstructs accurate 3D strands from a single frontal image. By using a pretrained image backbone, our method generalizes to in-the-wild images despite being trained only on synthetic data. Our diffusion model predicts a scalp texture map in which any point in the map contains the latent code for an individual hair strand. …
Poster
Hang Shao · lei luo · Jianjun Qian · Mengkai Yan · Shuo Chen · Jian Yang

[ ExHall D ]

Abstract
Physiological activities can be manifested by sensitive changes in facial imaging. While they are barely observable to our eyes, computer vision methods can detect them, and the derived remote photoplethysmography (rPPG) has shown considerable promise. However, existing studies mainly rely on spatial skin recognition and temporal rhythmic interactions, so they focus on identifying explicit features under ideal light conditions, but perform poorly in the wild with intricate obstacles and extreme illumination exposure. In this paper, we propose an end-to-end video transformer model for rPPG. It strives to eliminate complex and unknown external time-varying interferences, whether they are sufficient to occupy subtle biosignal amplitudes or exist as periodic perturbations that hinder network training. In the specific implementation, we utilize global interference sharing, subject background reference, and self-supervised disentanglement to eliminate interference, and further guide learning based on spatiotemporal filtering, reconstruction guidance, and frequency-domain and biological prior constraints to achieve effective rPPG. To the best of our knowledge, this is the first robust rPPG model for real outdoor scenarios based on natural face videos, and it is lightweight to deploy. Extensive experiments show the competitiveness and performance of our model in rPPG prediction across datasets and scenes.
Poster
Chen-Wei Chang · Cheng-De Fan · Chia-Che Chang · Yi-Chen Lo · Yu-Chee Tseng · Jiun-Long Huang · Yu-Lun Liu

[ ExHall D ]

Abstract
Color constancy methods often struggle to generalize across different camera sensors due to varying spectral sensitivities. We present GCC, which leverages diffusion models to inpaint color checkers into images for illumination estimation. Our key innovations include (1) a single-step deterministic inference approach that inpaints color checkers reflecting scene illumination, (2) a Laplacian composition technique that preserves checker structure while allowing illumination-dependent color adaptation, and (3) a mask-based data augmentation strategy for handling imprecise color checker annotations. GCC demonstrates superior robustness in cross-camera scenarios, achieving state-of-the-art worst-25% error rates of 5.22° and 4.32° in bi-directional evaluations. These results highlight our method's stability and generalization capability across different camera characteristics without requiring sensor-specific training, making it a versatile solution for real-world applications.
Poster
Daniel Feijoo · Juan C. Benito · Alvaro Garcia · Marcos Conde

[ ExHall D ]

Abstract
Photography at night or in dark conditions typically suffers from noise, low light, and blurring issues due to the dim environment and the common use of long exposure. Although Deblurring and Low-light Image Enhancement (LLIE) are related under these conditions, most approaches in image restoration solve these tasks separately. In this paper, we present an efficient and robust neural network for multi-task low-light image restoration. Instead of following the current trend toward Transformer-based models, we propose new attention mechanisms to enhance the receptive field of efficient CNNs. Our method reduces the computational costs in terms of parameters and MAC operations compared to previous methods. Our model, DarkIR, achieves new state-of-the-art results on the popular LOLBlur, LOLv2, and Real-LOLBlur datasets, and generalizes to real-world night images.
Poster
Mingde Yao · Menglu Wang · King Man Tam · Lingen Li · Tianfan Xue · Jinwei Gu

[ ExHall D ]

Abstract
Reflection removal is challenging due to complex light interactions, where reflections obscure important details and hinder scene understanding. Polarization naturally provides a powerful cue to distinguish between reflected and transmitted light, enabling more accurate reflection removal. However, existing methods often rely on small-scale or synthetic datasets, which fail to capture the diversity and complexity of real-world scenarios. To this end, we construct a large-scale dataset, PolarRR, for polarization-based reflection removal, which enables us to train models that generalize effectively across a wide range of real-world scenarios. The PolarRR dataset contains 6,500 well-aligned mixed-transmission image pairs, 8x larger than existing polarization datasets, and is the first to include both RGB and polarization images captured across diverse indoor and outdoor environments with varying lighting conditions. Furthermore, to fully exploit the potential of polarization cues for reflection removal, we introduce PolarFree, which leverages a diffusion process to generate reflection-free cues for accurate reflection removal. Extensive experiments show that PolarFree significantly enhances image clarity in difficult reflective scenarios, setting a new benchmark for polarized imaging and reflection removal. Code and dataset will be public after acceptance.
Poster
Benquan Wang · Ruyi An · Jin-Kyu So · Sergei Kurdiumov · Eng Aik Chan · Giorgio Adamo · Yuhan Peng · Yewen Li · Bo An

[ ExHall D ]

Abstract
Observing objects of small size has always been a charming pursuit of human beings. However, due to the physical phenomenon of diffraction, the optical resolution is restricted to approximately half the wavelength of light, which impedes the observation of subwavelength objects, typically smaller than 200 nm. This constrains its application in numerous scientific and industrial fields that aim to observe objects beyond the diffraction limit, such as native state coronavirus inspection. Fortunately, deep learning methods have shown remarkable potential in uncovering underlying patterns within data, promising to overcome the diffraction limit by revealing the mapping pattern between diffraction images and their corresponding ground truth object localization images. However, the absence of suitable datasets has hindered progress in this field - collecting high-quality optical data of subwavelength objects is very challenging as these objects are inherently invisible under conventional microscopy, making it impossible to perform standard visual calibration and drift correction. Therefore, in collaboration with top optical scientists, we provide the first general optical imaging dataset based on the "LEGO" concept for addressing the diffraction limit. Drawing an analogy to the modular construction of the LEGO blocks, we construct a comprehensive optical imaging dataset comprising subwavelength fundamental elements, i.e., small square units that …
Poster
liqun.chen · Yuxuan Li · Jun Dai · Jinwei Gu · Tianfan Xue

[ ExHall D ]

Abstract
Accurate blur estimation is essential for high-performance imaging across various applications. Blur is typically represented by the point spread function (PSF). In this paper, we propose a physics-informed PSF learning framework for imaging systems, consisting of a simple calibration followed by a learning process. Our framework achieves both high accuracy and universal applicability. Inspired by the Seidel PSF model for representing spatially varying PSFs, we identify its limitations in optimization and introduce a novel wavefront-based PSF model accompanied by an optimization strategy, which both reduces optimization complexity and improves estimation accuracy. Moreover, our wavefront-based PSF model is independent of lens parameters, eliminating the need for prior knowledge of the lens. To validate our approach, we compare it with recent PSF estimation methods (Degradation Transfer and Fast Two-step) through a deblurring task, where all the estimated PSFs are used to train state-of-the-art deblurring algorithms. Our approach demonstrates improvements in image quality in simulation and also shows noticeable visual quality improvements on real captured images. Code and models are public.
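As background on wavefront-based PSF models, the standard Fourier-optics relation computes an incoherent PSF as the squared magnitude of the Fourier transform of the pupil function, whose phase is set by the wavefront error. The snippet below shows only that textbook relation; the function name and sampling conventions are assumptions, not the authors' code.

```python
import numpy as np

def psf_from_wavefront(wavefront_opd, pupil_mask, wavelength=550e-9):
    """Compute an incoherent PSF from a pupil-plane wavefront error map.

    wavefront_opd: (N, N) optical path difference in meters over the pupil;
    pupil_mask: (N, N) 0/1 aperture mask; wavelength in meters.
    """
    phase = 2 * np.pi / wavelength * wavefront_opd
    pupil = pupil_mask * np.exp(1j * phase)          # complex pupil function
    field = np.fft.fftshift(np.fft.fft2(np.fft.ifftshift(pupil)))
    psf = np.abs(field) ** 2                         # intensity PSF
    return psf / psf.sum()                           # normalize to unit energy
```

In a learning framework of this kind, the wavefront map would be the optimized quantity, which is why the model needs no explicit lens parameters.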
Poster
Kevin Zhang · Jia-Bin Huang · Jose Echevarria · Stephen DiVerdi · Aaron Hertzmann

[ ExHall D ]

Abstract
We introduce MaDCoW, a method for correcting marginal distortion of arbitrary objects in wide-angle photography. People often use wide-angle photography to convey natural scenes—smartphones typically default to wide-angle photography—but depicting very wide-field-of-view scenes produces distorted object appearance, particularly marginal distortion in linear projections. With MaDCoW, a user annotates regions-of-interest to correct, along with straight lines. For each region, MaDCoW solves for a local-linear perspective projection and then jointly solves for a projection for the whole photograph that minimizes distortion. We show that our method can produce good results in cases where previous methods yield visible distortions.
Poster
Hadi Alzayer · Philipp Henzler · Jonathan T. Barron · Jia-Bin Huang · Pratul P. Srinivasan · Dor Verbin

[ ExHall D ]

Abstract
Reconstructing the geometry and appearance of objects from photographs taken in different environments is difficult as the illumination and therefore the object appearance vary across captured images. This is particularly challenging for more specular objects whose appearance strongly depends on the viewing direction. Some prior approaches model appearance variation across images using a per-image embedding vector, while others use physically-based rendering to recover the materials and per-image illumination. Such approaches fail at faithfully recovering view-dependent appearance given significant variation in input illumination and tend to produce mostly diffuse results. We present an approach that reconstructs objects from images taken under different illuminations by first relighting the images under a single reference illumination with a multiview relighting diffusion model and then reconstructing the object's geometry and appearance with a radiance field architecture that is robust to the small remaining inconsistencies among the relit images. We validate our proposed approach on both simulated and real datasets and demonstrate that it greatly outperforms existing techniques at reconstructing high-fidelity appearance from images taken under extreme illumination variation. Moreover, our approach is particularly effective at recovering view-dependent "shiny" appearance which cannot be reconstructed by prior methods.
Poster
Chun Gu · Xiaofei Wei · Zixuan Zeng · Yuxuan Yao · Li Zhang

[ ExHall D ]

Abstract
In inverse rendering, accurately modeling visibility and indirect radiance for incident light is essential for capturing secondary effects. Due to the absence of a powerful Gaussian ray tracer, previous 3DGS-based methods have either adopted a simplified rendering equation or used learnable parameters to approximate incident light, resulting in inaccurate material and lighting estimations. To this end, we introduce the inter-reflective Gaussian splatting (IRGS) framework for inverse rendering. To capture inter-reflection, we apply the full rendering equation without simplification and compute incident radiance on the fly using the proposed differentiable 2D Gaussian ray tracing. Additionally, we present an efficient optimization scheme to handle the computational demands of Monte Carlo sampling for rendering equation evaluation. Furthermore, we introduce a novel strategy for querying the indirect radiance of incident light when relighting the optimized scenes. Extensive experiments on multiple standard benchmarks validate the effectiveness of IRGS, demonstrating its capability to accurately model complex inter-reflection effects.
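For context, evaluating the full rendering equation without simplification means estimating an integral over the hemisphere, typically with Monte Carlo sampling. The sketch below is a generic single-point estimator with uniform hemisphere sampling, intended only to illustrate the quantity being evaluated; it is not the IRGS ray tracer, and `brdf` and `incident_radiance` are hypothetical callables.

```python
import numpy as np

def mc_outgoing_radiance(brdf, incident_radiance, normal, wo, n_samples=64):
    """Monte Carlo estimate of L_o(wo) = ∫ f_r(wi, wo) L_i(wi) (n·wi) dwi.

    Uniform hemisphere sampling, pdf = 1 / (2*pi). `brdf(wi, wo)` and
    `incident_radiance(wi)` are assumed to return (n_samples, 3) RGB arrays;
    `normal` and `wo` are unit 3-vectors.
    """
    # Uniformly sample directions on the hemisphere around the normal.
    u1, u2 = np.random.rand(n_samples), np.random.rand(n_samples)
    z = u1                                        # cos(theta), uniform in [0, 1]
    r = np.sqrt(np.maximum(0.0, 1.0 - z * z))
    phi = 2 * np.pi * u2
    local = np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=-1)
    # Build an orthonormal frame around the normal and rotate samples into it.
    t = np.cross(normal, [0.0, 1.0, 0.0])
    if np.linalg.norm(t) < 1e-6:
        t = np.cross(normal, [1.0, 0.0, 0.0])
    t /= np.linalg.norm(t)
    b = np.cross(normal, t)
    wi = local[:, :1] * t + local[:, 1:2] * b + local[:, 2:3] * normal
    cos_theta = np.clip(wi @ normal, 0.0, None)
    pdf = 1.0 / (2.0 * np.pi)
    contrib = brdf(wi, wo) * incident_radiance(wi) * cos_theta[:, None] / pdf
    return contrib.mean(axis=0)
```

The computational cost of querying `incident_radiance` for every sample is exactly what motivates the efficient optimization scheme described in the abstract.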
Poster
Chinmay Talegaonkar · Yash Belhe · Ravi Ramamoorthi · Nicholas Antipa

[ ExHall D ]

Abstract
Recently, 3D Gaussian Splatting (3DGS) has enabled photorealistic view synthesis at high inference speeds. However, its splatting-based rendering model makes several approximations to the rendering equation, reducing physical accuracy. We show that splatting and its approximations are unnecessary, even within a rasterizer; we instead volumetrically integrate 3D Gaussians directly to compute the transmittance across them analytically. We use this analytic transmittance to derive more physically accurate alpha values than 3DGS, which can directly be used within their framework. The result is a method that more closely follows the volume rendering equation (similar to ray tracing) while enjoying the speed benefits of rasterization. Our method represents opaque surfaces with higher accuracy and fewer points than 3DGS. This enables it to outperform 3DGS for view synthesis (measured in SSIM and LPIPS). Being volumetrically consistent also enables our method to work out of the box for tomography. We match the state-of-the-art 3DGS-based tomography method with fewer points.
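For readers who want a concrete picture of what an analytic transmittance looks like, the following is a minimal sketch of the standard closed-form integral of a single 3D Gaussian along a ray; it illustrates the kind of quantity the abstract refers to, and the exact density scaling and alpha definition used by the authors may differ. Along a ray \(\mathbf{x}(t)=\mathbf{o}+t\mathbf{d}\), a Gaussian with mean \(\boldsymbol{\mu}\), covariance \(\boldsymbol{\Sigma}\), and peak density \(\sigma_0\) contributes the optical depth

$$\tau=\sigma_0\int_{-\infty}^{\infty}\exp\!\Big(-\tfrac{1}{2}\big(\mathbf{x}(t)-\boldsymbol{\mu}\big)^{\top}\boldsymbol{\Sigma}^{-1}\big(\mathbf{x}(t)-\boldsymbol{\mu}\big)\Big)\,dt=\sigma_0\sqrt{\frac{2\pi}{a}}\,\exp\!\Big(-\tfrac{1}{2}\Big(c-\frac{b^{2}}{a}\Big)\Big),$$

where \(a=\mathbf{d}^{\top}\boldsymbol{\Sigma}^{-1}\mathbf{d}\), \(b=\mathbf{d}^{\top}\boldsymbol{\Sigma}^{-1}(\mathbf{o}-\boldsymbol{\mu})\), and \(c=(\mathbf{o}-\boldsymbol{\mu})^{\top}\boldsymbol{\Sigma}^{-1}(\mathbf{o}-\boldsymbol{\mu})\). The transmittance through that Gaussian is then \(T=e^{-\tau}\), and an alpha value can be taken as \(\alpha=1-e^{-\tau}\), with no splatting approximation involved.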
Poster
Federico Lincetto · Gianluca Agresti · Mattia Rossi · Pietro Zanuttigh

[ ExHall D ]

Abstract
Neural Radiance Fields (NeRF) have shown impressive performances in the rendering of 3D scenes from arbitrary viewpoints. While RGB images are widely preferred for training volume rendering models, the interest in other radiance modalities is also growing. However, the capability of the underlying implicit neural models to learn and transfer information across heterogeneous imaging modalities has seldom been explored, mostly due to the limited training data availability. For this purpose, we present MultimodalStudio (MMS): it encompasses MMS-DATA and MMS-FW. MMS-DATA is a multimodal multi-view dataset containing 32 scenes acquired with 5 different imaging modalities: RGB, monochrome, near-infrared, polarization and multispectral. MMS-FW is a novel modular multimodal NeRF framework designed to handle multimodal raw data and able to support an arbitrary number of multi-channel devices. Through extensive experiments, we demonstrate that MMS-FW trained on MMS-DATA can transfer information between different imaging modalities and produce higher quality renderings than using single modalities alone. We publicly release the dataset and the framework, to promote the research on multimodal volume rendering and beyond.
Poster
Anagh Malik · Benjamin Attal · Andrew Xie · Matthew O’Toole · David B. Lindell

[ ExHall D ]

Abstract
We present the first system for physically based, neural inverse rendering from multi-viewpoint videos of propagating light. Our approach relies on a time-resolved extension of neural radiance caching --- a technique that accelerates inverse rendering by storing infinite-bounce radiance arriving at any point from any direction. The resulting model accurately accounts for direct and indirect light transport effects and, when applied to captured measurements from a flash lidar system, enables state-of-the-art 3D reconstruction in the presence of strong indirect light. Further, we demonstrate view synthesis of propagating light, automatic decomposition of captured measurements into direct and indirect components, as well as novel capabilities such as multi-view transient relighting of captured scenes.
Poster
Sean Wu · Shamik Basu · Tim Broedermann · Luc Van Gool · Christos Sakaridis

[ ExHall D ]

Abstract
We tackle the ill-posed inverse rendering problem in 3D reconstruction with a Neural Radiance Field (NeRF) approach informed by Physics-Based Rendering (PBR) theory, named PBR-NeRF. Our method addresses a key limitation in most NeRF and 3D Gaussian Splatting approaches: they estimate view-dependent appearance without modeling scene materials and illumination. To address this limitation, we present an inverse rendering (IR) model capable of jointly estimating scene geometry, materials, and illumination. Our model builds upon recent NeRF-based IR approaches, but crucially introduces two novel physics-based priors that better constrain the IR estimation. Our priors are rigorously formulated as intuitive loss terms and achieve state-of-the-art material estimation without compromising novel view synthesis quality. Our method is easily adaptable to other inverse rendering and 3D reconstruction frameworks that require material estimation. We demonstrate the importance of extending current neural rendering approaches to fully model scene properties beyond geometry and view-dependent appearance. Code will be made publicly available.
Poster
Haoyuan Wang · Zhenwei Wang · Xiaoxiao Long · Cheng Lin · Gerhard Hancke · Rynson W.H. Lau

[ ExHall D ]

Abstract
With advances in deep learning models and the availability of large-scale 3D datasets, we have recently witnessed significant progress in single-view 3D reconstruction. However, existing methods often fail to reconstruct physically based material properties given a single image, limiting their applicability in complicated scenarios. This paper presents a novel approach (MAGE) for generating 3D geometry with realistic decomposed material properties given a single image as input. Our method leverages inspiration from traditional computer graphics deferred rendering pipelines to introduce a multi-view G-buffer estimation model. The proposed model estimates G-buffers for various views as multi-domain images, including XYZ coordinates, normals, albedo, roughness, and metallic properties from the single-view RGB. Furthermore, to address the inherent ambiguity and inconsistency in generating G-buffers simultaneously, we formulate a deterministic network from the pretrained diffusion models and propose a lighting response loss that enforces consistency across these domains using PBR principles. We also propose a large-scale synthetic dataset rich in material diversity for our model training. Experimental results demonstrate the effectiveness of our method in producing high-quality 3D meshes with rich material properties. We will release the dataset and code.
Poster
Haolin Li · Jinyang Liu · Mario Sznaier · Octavia Camps

[ ExHall D ]

Abstract
Photo-realistic 3D Reconstruction is a fundamental problem in 3D computer vision. This domain has seen considerable advancements owing to the advent of recent neural rendering techniques. These techniques predominantly focus on learning volumetric representations of 3D scenes and refining these representations via loss functions derived from rendering. Among these, 3D Gaussian Splatting (3D-GS) has emerged as a significant method, surpassing Neural Radiance Fields (NeRFs). 3D-GS uses parameterized 3D Gaussians for modeling both spatial locations and color information, combined with a tile-based fast rendering technique. Despite its superior rendering performance and speed, the use of 3D Gaussian kernels has inherent limitations in accurately representing discontinuous functions, notably at edges and corners for shape discontinuities, and across varying textures for color discontinuities. To address this problem, we propose to employ 3D Half-Gaussian (3D-HGS) kernels, which can be used as a plug-and-play kernel. Our experiments demonstrate their capability to improve the performance of current 3D-GS related methods and achieve state-of-the-art rendering performance on various datasets without compromising rendering speed. The code and trained models will be available on GitHub.
Poster
Junha Hyung · Kinam Kim · Susung Hong · Min-Jung Kim · Jaegul Choo

[ ExHall D ]

Abstract
Diffusion models have emerged as a powerful tool for generating high-quality images, videos, and 3D content. While sampling guidance techniques like CFG improve quality, they reduce diversity and motion. Autoguidance mitigates these issues but demands extra weak model training, limiting its practicality for large-scale models. In this work, we introduce Spatiotemporal Skip Guidance (STG), a simple training-free sampling guidance method for enhancing transformer-based video diffusion models. STG employs an implicit weak model via self-perturbation, avoiding the need for external models or additional training. By selectively skipping spatiotemporal layers, STG produces an aligned, degraded version of the original model to boost sample quality without compromising diversity or dynamic degree. Our contributions include: (1) introducing STG as an efficient, high-performing guidance technique for video diffusion models, (2) eliminating the need for auxiliary models by simulating a weak model through layer skipping, and (3) ensuring quality-enhanced guidance without compromising sample diversity or dynamics, unlike CFG.
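As a rough illustration of guidance with an implicit weak model, here is a minimal sketch assuming hypothetical `denoise_full` and `denoise_skip` callables that stand in for the video diffusion model with and without the selected spatiotemporal layers; STG's actual layer-selection rule and guidance scale are not reproduced here.

```python
# Hedged sketch: the general shape of guidance with an "implicit weak model".
# `denoise_full` and `denoise_skip` are hypothetical callables; the real model,
# layer-skipping mechanism, and scale differ.
import numpy as np

def skip_layer_guidance(x_t, t, denoise_full, denoise_skip, scale=1.5):
    """Combine a full prediction with a self-perturbed (layer-skipped) one.

    Mirrors the CFG-style update eps = eps_weak + s * (eps_full - eps_weak),
    where the weak branch comes from skipping layers instead of dropping
    the conditioning.
    """
    eps_full = denoise_full(x_t, t)   # prediction of the unmodified model
    eps_weak = denoise_skip(x_t, t)   # prediction with selected layers skipped
    return eps_weak + scale * (eps_full - eps_weak)

# Toy usage with stand-in predictors (real use would call the video diffusion model).
if __name__ == "__main__":
    x = np.zeros((4, 8, 8))
    f = lambda x, t: x + 1.0
    w = lambda x, t: x + 0.5
    print(skip_layer_guidance(x, t=10, denoise_full=f, denoise_skip=w).mean())
```

The update has the same shape as CFG, but the weak branch is replaced by the self-perturbed model, which is why no auxiliary network or retraining is needed.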
Poster
Zhuoman Liu · Weicai Ye · Yan Luximon · Pengfei Wan · Di ZHANG

[ ExHall D ]

Abstract
Realistic simulation of dynamic scenes requires accurately capturing diverse material properties and modeling complex object interactions grounded in physical principles. However, existing methods are constrained to basic material types with limited predictable parameters, making them insufficient to represent the complexity of real-world materials. We introduce a novel approach that leverages multi-modal foundation models and video diffusion to achieve enhanced 4D dynamic scene simulation. Our method utilizes multi-modal models to identify material types and initialize material parameters through image queries, while simultaneously inferring 3D Gaussian splats for detailed scene representation. We further refine these material parameters using video diffusion with a differentiable Material Point Method (MPM) and optical flow guidance rather than render loss or Score Distillation Sampling (SDS) loss. This integrated framework enables accurate prediction and realistic simulation of dynamic interactions in real-world scenarios, advancing both accuracy and flexibility in physics-based simulations.
Poster
Briac Toussaint · Diego Thomas · Jean-Sébastien Franco

[ ExHall D ]

Abstract
SDF-based differentiable rendering frameworks have achieved state-of-the-art multiview 3D shape reconstruction. In this work, we re-examine this family of approaches by minimally reformulating its core appearance model in a way that simultaneously yields faster computation and increased performance. To this end, we exhibit a physically-inspired minimal radiance parametrization decoupling angular and spatial contributions, by encoding them with a small number of features stored in two respective volumetric grids of different resolutions. Requiring as little as four parameters per voxel, and a tiny MLP call inside a single fully fused kernel, our approach enhances performance on both surface and image (PSNR) metrics, while providing a significant training speedup and real-time rendering. We show this performance to be consistently achieved on real data over two widely different and popular application fields, generic object and human subject shape reconstruction, using four representative and challenging datasets.
Poster
Zhiyuan Ma · Xinyue Liang · Rongyuan Wu · Xiangyu Zhu · Zhen Lei · Lei Zhang

[ ExHall D ]

Abstract
It is desirable to obtain a model that can generate high-quality 3D meshes from text prompts in just seconds. While recent attempts have adapted pre-trained text-to-image diffusion models, such as Stable Diffusion (SD), into generators of 3D representations (e.g., Triplane), they often suffer from poor quality due to the lack of sufficient high-quality 3D training data. Aiming at overcoming the data shortage, we propose a novel training scheme, termed Progressive Rendering Distillation (PRD), which eliminates the need for 3D ground-truths by distilling multi-view diffusion models, and adapts SD into a native 3D generator. In each iteration of training, PRD uses the U-Net to progressively denoise the latent from random noise for a few steps, and in each step it decodes the denoised latent into a 3D output. Multi-view diffusion models, including MVDream and RichDreamer, are used jointly with SD to distill text-consistent textures and geometries to the 3D outputs through score distillation. Our PRD scheme also accelerates the inference speed by training the model to generate 3D content in just four steps. We use PRD to train a Triplane generator, namely TriplaneTurbo, which adds only 2.5% trainable parameters to adapt SD for Triplane generation. TriplaneTurbo outperforms previous text-to-3D generators …
Poster
Fangyu Wu · Yuhao Chen

[ ExHall D ]

Abstract
In the real world, objects reveal internal textures when sliced or cut, yet this behavior is not well-studied in 3D generation tasks today. For example, slicing a virtual 3D watermelon should reveal flesh and seeds. Given that no available dataset captures an object's full internal structure and collecting data from all slices is impractical, generative methods become the obvious approach. However, current 3D generation and inpainting methods often focus on visible appearance and overlook internal textures. To bridge this gap, we introduce FruitNinja, the first method to generate internal textures for 3D objects undergoing geometric and topological changes. Our approach produces objects via 3D Gaussian Splatting (3DGS) with both surface and interior textures synthesized, enabling real-time slicing and rendering without additional optimization. FruitNinja leverages a pre-trained diffusion model to progressively inpaint cross-sectional views and applies voxel-grid-based smoothing to achieve cohesive textures throughout the object. Our OpaqueAtom GS strategy overcomes 3DGS limitations by employing densely distributed opaque Gaussians, avoiding biases toward larger particles that destabilize training and sharp color transitions for fine-grained textures. Experimental results show that FruitNinja substantially outperforms existing approaches, showcasing unmatched visual quality in real-time rendered internal views across arbitrary geometry manipulations.
Poster
Wang Zhao · Yan-Pei Cao · Jiale Xu · Yue-Jiang Dong · Ying Shan

[ ExHall D ]

Abstract
Procedural Content Generation (PCG) is powerful in creating high-quality 3D contents, yet controlling it to produce desired shapes is difficult and often requires extensive parameter tuning. Inverse Procedural Content Generation aims to automatically find the best parameters under the input condition. However, existing sampling-based and neural network-based methods still suffer from numerous sample iterations or limited controllability. In this work, we present DI-PCG, a novel and efficient method for Inverse PCG from general image conditions. At its core is a lightweight diffusion transformer model, where PCG parameters are directly treated as the denoising target and the observed images as conditions to control parameter generation. DI-PCG is efficient and effective. With only 7.6M network parameters and 30 GPU hours to train, it demonstrates superior performance in recovering parameters accurately, and generalizing well to in-the-wild images. Quantitative and qualitative experiment results validate the effectiveness of DI-PCG in inverse PCG and image-to-3D generation tasks. DI-PCG offers a promising approach for efficient inverse PCG and represents a valuable exploration step towards a 3D generation path that models how to construct a 3D asset using parametric models.
Poster
Chen Cheng · Jiacheng Wei · Tianrun Chen · Chi Zhang · Xiaofeng Yang · Shangzhan Zhang · Bingchen Yang · Chuan-Sheng Foo · Guosheng Lin · Qixing Huang · Fayao Liu

[ ExHall D ]

Abstract
Creating CAD digital twins from the physical world is crucial for manufacturing, design, and simulation. However, current methods typically rely on costly 3D scanning with labor-intensive post-processing. To provide a streamlined and user-friendly design process, we explore the problem of reverse engineering from unconstrained real-world CAD images that can be easily captured by users of all experiences. However, the scarcity of real-world CAD data poses challenges in directly training such models. To tackle these challenges, we propose CADCrafter, an image to parametric CAD model generation framework that trains a latent diffusion network solely on synthetic textureless CAD data while testing on real-world images. To bridge the significant representation disparity between images and parametric CAD models, we introduce a geometry encoder to improve the network's capability to accurately capture diverse geometric features. Moreover, the texture-invariant properties of the geometric features can also facilitate the generalization to real-world scenarios. Since compiling CAD parameter sequences into explicit CAD models is a non-differentiable process, the network training inherently lacks explicit geometric supervision. To impose geometric validity constraints on our model, we employ direct preference optimization to fine-tune the diffusion model with the automatic code checker feedback on CAD sequence quality. Furthermore, we collected a …
Poster
Jinnan Chen · Lingting Zhu · Zeyu HU · Shengju Qian · Yugang Chen · Xin Wang · Gim Hee Lee

[ ExHall D ]

Abstract
Recent advances in auto-regressive transformers have revolutionized generative modeling across domains, from language processing to visual generation, demonstrating remarkable capabilities. However, applying these advances to 3D generation presents three key challenges: the unordered nature of 3D data conflicts with sequential prediction paradigms, conventional vector quantization approaches incur substantial compression loss when applied to 3D meshes, and efficient scaling strategies for higher resolutions are lacking. To address these limitations, we introduce MAR-3D, which integrates a pyramid variational autoencoder with a cascaded masked auto-regressive transformer (Cascaded MAR) for progressive latent token denoising. Our architecture employs random masking during training and auto-regressive denoising in random order during inference, naturally accommodating the unordered property of 3D latent tokens. Additionally, we propose a cascaded training strategy with condition augmentation that enables efficient up-scaling of the latent token resolution. Extensive experiments demonstrate that MAR-3D not only achieves superior performance and generalization capabilities compared to existing methods but also exhibits enhanced scaling properties over joint distribution modeling approaches like diffusion transformers in 3D generation.
Poster
Haohan Weng · Zibo Zhao · Biwen Lei · Xianghui Yang · Jian Liu · Zeqiang Lai · Zhuo Chen · Liu Yuhong · Jie Jiang · Chunchao Guo · Tong Zhang · Shenghua Gao · C.L.Philip Chen

[ ExHall D ]

Abstract
We propose a compressive yet effective mesh tokenization, Blocked and Patchified Tokenization (BPT), facilitating the generation of meshes exceeding 8k faces. BPT compresses mesh sequences by employing block-wise indexing and patch aggregation, reducing their length by approximately 75% compared to the vanilla sequences. This compression milestone unlocks the potential to utilize mesh data with significantly more faces, thereby enhancing detail richness and improving generation robustness. Empowered with BPT, we have built a foundation mesh generative model trained on scaled mesh data to support flexible control from point clouds and images. Our model demonstrates the capability to generate meshes with intricate details and accurate topology, achieving SoTA performance on mesh generation and reaching a level suitable for direct product use.
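To make "block-wise indexing" concrete, here is a toy sketch of splitting quantized vertex coordinates into a coarse block index and a local offset; the grid resolution, block size, and token layout below are illustrative assumptions rather than BPT's actual vocabulary.

```python
# Hedged toy sketch of a block-index/offset split for quantized coordinates.
# Grid size, block size, and the token layout here are assumptions.

def block_index(coord: int, block: int = 8) -> tuple[int, int]:
    """Split a quantized coordinate into (block id, offset within block)."""
    return coord // block, coord % block

def encode_vertex(xyz: tuple[int, int, int], block: int = 8) -> list[int]:
    """Flatten a vertex into block ids followed by offsets (one possible layout)."""
    blocks, offsets = zip(*(block_index(c, block) for c in xyz))
    return list(blocks) + list(offsets)

if __name__ == "__main__":
    print(encode_vertex((130, 7, 255)))   # -> [16, 0, 31, 2, 7, 7]
```

One intuition for the compression is that nearby vertices can share coarse block indices, so the fine offsets carry most of the per-vertex information; how BPT actually exploits this, together with patch aggregation, is described in the paper.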
Poster
Qitong Yang · Mingtao Feng · Zijie Wu · Weisheng Dong · Fangfang Wu · Yaonan Wang · Ajmal Mian

[ ExHall D ]

Abstract
3D content creation has achieved significant progress in terms of both quality and speed. Although current Gaussian Splatting-based methods can produce 3D objects within seconds, they are still limited by complex preprocessing or low controllability. In this paper, we introduce a novel framework designed to efficiently and controllably generate high-resolution 3D models from text prompts or images. Our key insights are three-fold: 1) Hierarchical Gaussian Mixture Model Splatting: We propose a hybrid hierarchical representation that extracts a fixed number of fine-grained Gaussians with multiscale details from a textured object and establishes a part-level representation of Gaussian primitives. 2) Mamba with adaptive tree topology: We present a diffusion Mamba with a tree topology that adaptively generates Gaussians with disordered spatial structures, without the need for complex preprocessing, while maintaining linear-complexity generation. 3) Controllable Generation: Building on the HGMM tree, we introduce a cascaded diffusion framework combining controllable implicit latent generation, which progressively generates condition-driven latents, and explicit splatting generation, which transforms latents into high-quality Gaussian primitives. Extensive experiments demonstrate the high fidelity and efficiency of our approach.
Poster
SeonHwa Kim · Jiwon Kim · Soobin Park · Donghoon Ahn · Jiwon Kang · Seungryong Kim · Kyong Hwan Jin · Eunju Cha

[ ExHall D ]

Abstract
Score distillation sampling (SDS) demonstrates a powerful capability for text-conditioned 2D image and 3D object generation by distilling the knowledge from learned score functions. However, SDS often suffers from blurriness caused by noisy gradients. When SDS is applied to image editing, such degradations can be reduced by adjusting bias shifts using reference pairs, but these de-biasing techniques are still corrupted by erroneous gradients. To this end, we introduce Identity-preserving Distillation Sampling (IDS), which compensates for the gradient leading to undesired changes in the results. Based on the analysis that these errors come from the text-conditioned scores, a new regularization technique, called fixed-point iterative regularization (FPR), is proposed to modify the score itself, preserving identity, including poses and structures. Thanks to the self-correction provided by FPR, the proposed method produces clear and unambiguous representations corresponding to the given prompts in image-to-image editing and editable neural radiance fields (NeRF). The structural consistency between the source and the edited data is maintained noticeably better than with other state-of-the-art methods.
Poster
Martin Spitznagel · Jan Vaillant · Janis Keuper

[ ExHall D ]

Abstract
The image-to-image translation abilities of generative learning models have recently made significant progress in the estimation of complex (steered) mappings between image distributions. While appearance based tasks like image in-painting or style transfer have been studied at length, we propose to investigate the potential of generative models in the context of physical simulations. Providing a dataset of 300k image-pairs and baseline evaluations for three different physical simulation tasks, we propose a benchmark to investigate the following research questions: i) are generative models able to learn complex physical relations from input-output image pairs? ii) what speedups can be achieved by replacing differential equation based simulations? While baseline evaluations of different current models show the potential for high speedups (ii), these results also show strong limitations toward the physical correctness (i). This underlines the need for new methods to enforce physical correctness.
Poster
Dong In Lee · Hyeongcheol Park · Jiyoung Seo · Eunbyung Park · Hyunje Park · Ha Dam Baek · Shin sangheon · sangmin kim · Sangpil Kim

[ ExHall D ]

Abstract
Recent advancements in 3D editing have highlighted the potential of text-driven methods in real-time, user-friendly AR/VR applications. However, current methods rely on 2D diffusion models without adequately considering multi-view information, resulting in multi-view inconsistency. While 3D Gaussian Splatting (3DGS) significantly improves rendering quality and speed, its 3D editing process encounters difficulties with inefficient optimization, as pre-trained Gaussians retain excessive source information, hindering optimization. To address these limitations, we propose EditSplat, a novel 3D editing framework that integrates Multi-view Fusion Guidance (MFG) and Attention-Guided Trimming (AGT). Our MFG ensures multi-view consistency by incorporating essential multi-view information into the diffusion process, leveraging classifier-free guidance from the text-to-image diffusion model and the geometric properties of 3DGS. Additionally, our AGT leverages the explicit representation of 3DGS to selectively prune and optimize 3D Gaussians, enhancing optimization efficiency and enabling precise, semantically rich local edits. Through extensive qualitative and quantitative evaluations, EditSplat achieves superior multi-view consistency and editing quality over existing methods, significantly enhancing overall efficiency.
Poster
Youyu Chen · Junjun Jiang · Kui Jiang · Xiao Tang · Zhihao Li · Xianming Liu · Yinyu Nie

[ ExHall D ]

Abstract
3D Gaussian Splatting (3DGS) renders pixels by rasterizing Gaussian primitives, where the rendering resolution and the primitive number, which together constitute the optimization complexity, dominate the time cost in primitive optimization. In this paper, we propose DashGaussian, a scheduling scheme over the optimization complexity of 3DGS that strips redundant complexity to accelerate 3DGS optimization. Specifically, we formulate 3DGS optimization as progressively fitting 3DGS to higher levels of frequency components in the training views, and propose a dynamic rendering resolution scheme that largely reduces the optimization complexity based on this formulation. In addition, we argue that a specific rendering resolution should cooperate with a proper primitive number for a better balance between computing redundancy and fitting quality, and we therefore schedule the growth of the primitives to synchronize with the rendering resolution. Extensive experiments show that our method accelerates the optimization of various 3DGS backbones by 45.7% on average while preserving the rendering quality.
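A minimal sketch of the kind of coarse-to-fine schedule the abstract describes, where rendering resolution and the primitive budget grow together over training; the ramp shape and constants below are placeholders, not the paper's actual scheduler.

```python
# Hedged sketch: render at low resolution early and grow resolution and the
# primitive budget together. Schedule shape and constants are illustrative
# assumptions, not DashGaussian's actual scheduler.
def schedule(iteration: int, max_iters: int,
             full_res: tuple[int, int], max_primitives: int,
             start_frac: float = 0.25):
    """Return (render resolution, primitive budget) for this iteration."""
    p = min(1.0, iteration / max_iters)
    frac = start_frac + (1.0 - start_frac) * p      # linear ramp as a placeholder
    h, w = full_res
    res = (max(1, int(h * frac)), max(1, int(w * frac)))
    budget = int(max_primitives * frac)
    return res, budget

if __name__ == "__main__":
    for it in (0, 5000, 15000, 30000):
        print(it, schedule(it, 30000, (1080, 1920), 3_000_000))
```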
Poster
Zhenqi Dai · Ting Liu · Yanning Zhang

[ ExHall D ]

Abstract
Efficient 3D scene representation has become a key challenge with the rise of 3D Gaussian Splatting (3DGS), particularly when incorporating semantic information into the scene representation. Existing 3DGS-based methods embed both color and high-dimensional semantic features into a single field, leading to significant storage and computational overhead. To mitigate this, we propose Decoupled Feature 3D Gaussian Splatting (DF-3DGS), a novel method that decouples the color and semantic fields, thereby reducing the number of 3D Gaussians required for semantic representation. We then introduce a hierarchical compression strategy that first employs our novel quantization approach with dynamic codebook evolution to reduce data size, followed by a scene-specific autoencoder for further compression of the semantic feature dimensions. This multi-stage approach results in a compact representation that enhances both storage efficiency and reconstruction speed. Experimental results demonstrate that DF-3DGS outperforms previous 3DGS-based methods, achieving faster training and rendering times while requiring less storage, without sacrificing performance—in fact, it improves performance in the novel view semantic segmentation task. Specifically, DF-3DGS achieves remarkable improvements over Feature 3DGS, reducing training time by 10× and storage by 20×, while improving the mIoU of novel view semantic segmentation by 4%. The code will be publicly available.
Poster
Jiahui Zhang · Fangneng Zhan · Ling Shao · Shijian Lu

[ ExHall D ]

Abstract
Anchor-based 3D Gaussian splatting (3D-GS) exploits anchor features in 3D Gaussian prediction, which has achieved impressive 3D rendering quality with reduced Gaussian redundancy. On the other hand, it often faces a trade-off among anchor feature size, model size, and rendering quality – large anchor features lead to large 3D models and high-quality rendering, whereas reducing anchor features degrades Gaussian attribute prediction, which leads to clear artifacts in the rendered textures and geometries. We design SOGS, an anchor-based 3D-GS technique that introduces second-order anchors to achieve superior rendering quality and reduced anchor features and model size simultaneously. Specifically, SOGS incorporates covariance-based second-order statistics and correlation across feature dimensions to augment features within each anchor, compensating for the reduced feature size and improving rendering quality effectively. In addition, it introduces a selective gradient loss to enhance the optimization of scene textures and scene geometries, leading to high-quality rendering with small anchor features. Extensive experiments over multiple widely adopted benchmarks show that SOGS achieves superior rendering quality in novel view synthesis with clearly reduced model size.
Poster
Yuanjian Qiao · Mingwen Shao · Lingzhuang Meng · Kai Xu

[ ExHall D ]

Abstract
3D Gaussian Splatting (3DGS) has recently achieved remarkable progress in novel view synthesis. However, existing methods rely heavily on high-quality data for rendering and struggle to handle degraded scenes with multi-view inconsistency, leading to inferior rendering quality. To address this challenge, we propose RestorGS, a novel depth-aware Gaussian Splatting method for efficient 3D scene restoration that flexibly restores multiple degraded scenes within a unified framework. Specifically, RestorGS consists of two core designs: Appearance Decoupling and Depth-Guided Modeling. The former exploits appearance learning over spherical harmonics to decouple clear and degraded Gaussians, thus separating the clear views from the degraded ones. Collaboratively, the latter leverages the depth information to guide the degradation modeling, thereby facilitating the decoupling process. Benefiting from the above optimization strategy, our method achieves high-quality restoration while enabling real-time rendering speed. Extensive experiments show that our RestorGS outperforms existing methods significantly in underwater, nighttime, and hazy scenes.
Poster
Yufan Zhang · Yu Ji · Yu Guo · Jinwei Ye

[ ExHall D ]

Abstract
We present a snapshot imaging technique for recovering 3D surrounding views of miniature scenes. Due to their intricacy, miniature scenes with objects sized in millimeters are difficult to reconstruct, yet miniatures are common in life and their 3D digitalization is desirable. We design a catadioptric imaging system with a single camera and eight pairs of planar mirrors for snapshot 3D reconstruction from a dollhouse perspective. We place paired mirrors on nested pyramid surfaces for capturing surrounding multi-view images in a single shot. Our mirror design is customizable based on the size of the scene for optimized view coverage. We use the 3D Gaussian Splatting (3DGS) representation for scene reconstruction and novel view synthesis. We overcome the challenge posed by our sparse view input by integrating visual hull-derived depth constraint. Our method demonstrates state-of-the-art performance on a variety of synthetic and real miniature scenes.
Poster
Long Ma · Yuxin Feng · Yan Zhang · Jinyuan Liu · Weimin Wang · Guang-Yong Chen · Chengpei Xu · Zhuo Su

[ ExHall D ]

Abstract
Learning-based image dehazing algorithms have shown remarkable success in synthetic domains. However, real image dehazing remains an open problem due to computational resource constraints and the diversity of real-world scenes. Therefore, there is an urgent need for an algorithm that excels in both efficiency and adaptability to address real image dehazing effectively. This work proposes a Compression-and-Adaptation (CoA) computational flow to tackle these challenges from a divide-and-conquer perspective. First, model compression is performed in the synthetic domain to develop a compact dehazing parameter space, satisfying efficiency demands. Then, a bilevel adaptation in the real domain is introduced to handle unknown real environments robustly by aggregating the synthetic dehazing capabilities during the learning process. Leveraging a succinct design free from additional constraints, our CoA exhibits domain-agnostic stability and model-agnostic flexibility, effectively bridging the model chasm between synthetic and real domains to further improve its practical utility. Extensive evaluations and analyses underscore the approach's superiority and effectiveness. The code will be made publicly available upon acceptance of this work.
Poster
Yutong Liu · Wenming Weng · Yueyi Zhang · Zhiwei Xiong

[ ExHall D ]

Abstract
To our knowledge, S2D-LFE is the first method to enable arbitrary novel view synthesis from only sparse-view light field event (LFE) data, and it addresses three critical challenges for the LFE generation task: simplicity, controllability, and consistency. The simplicity aspect eliminates the dependency on the frame-based modality, which often suffers from motion blur and low frame-rate limitations. The controllability aspect enables precise view synthesis under sparse LFE conditions with view-related constraints. The consistency aspect ensures both cross-view and temporal coherence in the generated results. To realize S2D-LFE, we develop a novel diffusion-based generation network with two key components. First, we design an LFE-customized variational auto-encoder that effectively compresses and reconstructs LFE by integrating cross-view information. Second, we design an LFE-aware injection adaptor to extract comprehensive geometric and texture priors. Furthermore, we construct a large-scale synthetic LFE dataset containing 162 one-minute sequences using a simulator, and capture a real-world test set using our custom-built sparse LFE acquisition system, covering diverse indoor and outdoor scenes. Extensive experiments demonstrate that S2D-LFE successfully generates up to 9×9 dense LFE from 2×2 sparse inputs and outperforms existing methods on both synthetic and real-world data.
Poster
Li Fang · Hao Zhu · Longlong Chen · Fei Hu · Long Ye · Zhan Ma

[ ExHall D ]

Abstract
Recent advancements in generalizable novel view synthesis have achieved impressive quality through interpolation between nearby views. However, rendering high-resolution images remains computationally intensive due to the need for dense sampling of all rays. Observing the piecewise smooth nature of natural scenes, we find that sampling all rays is redundant for novel view synthesis. Inspired by plenoptic sampling theory, we propose a bundle sampling strategy. By grouping adjacent rays into a bundle and sampling them collectively, a shared representation is generated for decoding all rays within the bundle. For regions with high-frequency content, such as edges and depth discontinuities, more samples along depth are used to capture finer details. To further optimize efficiency, we introduce a depth-guided adaptive sampling strategy, which dynamically allocates samples based on depth confidence—concentrating more samples in complex regions and reducing them in smoother areas. This dual approach significantly accelerates rendering. Applied to ENeRF, our method achieves up to a 1.27 dB PSNR improvement and a 47% increase in FPS on the DTU dataset. Extensive experiments on synthetic and real-world datasets demonstrate state-of-the-art rendering quality and up to 2× faster rendering compared to existing generalizable methods. Code and trained models will be released upon acceptance.
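A minimal sketch of depth-guided adaptive sample allocation, assuming per-bundle depth variance as a stand-in confidence measure; the paper's actual confidence definition and budget policy are not reproduced here.

```python
# Hedged sketch: allocate depth samples per ray bundle from a depth-confidence
# score, in the spirit of the depth-guided adaptive sampling described above.
import numpy as np

def allocate_samples(depth_var: np.ndarray, min_samples=2, max_samples=32):
    """Map per-bundle depth variance to a per-bundle sample count.

    Low variance (confident, smooth region) -> few samples;
    high variance (edges / depth discontinuities) -> many samples.
    """
    v = depth_var / (depth_var.max() + 1e-8)            # normalize to [0, 1]
    return (min_samples + v * (max_samples - min_samples)).astype(int)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    bundle_depth = rng.uniform(1.0, 5.0, size=(64, 4))   # 64 bundles of 4 rays
    counts = allocate_samples(bundle_depth.var(axis=1))
    print(counts.min(), counts.max())
```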
Poster
Chin-Yang Lin · Chung-Ho Wu · Changhan Yeh · Shih Han Yen · Cheng Sun · Yu-Lun Liu

[ ExHall D ]

Abstract
Neural Radiance Fields (NeRF) face significant challenges in extreme few-shot scenarios, primarily due to overfitting and long training times. Existing methods, such as FreeNeRF and SparseNeRF, use frequency regularization or pre-trained priors but struggle with complex scheduling and bias. We introduce FrugalNeRF, a novel few-shot NeRF framework that leverages weight-sharing voxels across multiple scales to efficiently represent scene details. Our key contribution is a cross-scale geometric adaptation scheme that selects pseudo ground truth depth based on reprojection errors across scales. This guides training without relying on externally learned priors, enabling full utilization of the training data. It can also integrate pre-trained priors, enhancing quality without slowing convergence. Experiments on LLFF, DTU, and RealEstate-10K show that FrugalNeRF outperforms other few-shot NeRF methods while significantly reducing training time, making it a practical solution for efficient and accurate 3D scene reconstruction.
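A minimal sketch of the cross-scale selection step described above, assuming depth candidates from several scales and a precomputed reprojection error for each; how the errors are computed (warping, photometric metric) is left out.

```python
# Hedged sketch: pick, per pixel, the candidate depth with the lowest
# reprojection error across scales. The error computation is assumed given.
import numpy as np

def select_pseudo_depth(depth_candidates, errors):
    """depth_candidates: (S, H, W) depths from S scales (upsampled to full res)
    errors:            (S, H, W) reprojection error of each candidate
    Returns a (H, W) pseudo ground-truth depth map."""
    best = np.argmin(errors, axis=0)                 # (H, W) winning scale per pixel
    h_idx, w_idx = np.indices(best.shape)
    return depth_candidates[best, h_idx, w_idx]

if __name__ == "__main__":
    S, H, W = 3, 8, 8
    rng = np.random.default_rng(1)
    depths = rng.uniform(1, 4, (S, H, W))
    errs = rng.uniform(0, 1, (S, H, W))
    print(select_pseudo_depth(depths, errs).shape)   # (8, 8)
```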
Poster
Ankit Dhiman · Manan Shah · R. Venkatesh Babu

[ ExHall D ]

Abstract
Diffusion models have become central to various image editing tasks, yet they often fail to fully adhere to physical laws, particularly with effects like shadows, reflections, and occlusions. In this work, we address the challenge of generating photorealistic mirror reflections using diffusion-based generative models. Despite extensive training data, existing diffusion models frequently overlook the nuanced details crucial to authentic mirror reflections. Recent approaches have attempted to resolve this by creating synthetic datasets and framing reflection generation as an inpainting task; however, they struggle to generalize across different object orientations and positions relative to the mirror. Our method overcomes these limitations by introducing key augmentations into the synthetic data pipeline: (1) random object positioning, (2) randomized rotations, and (3) grounding of objects, significantly enhancing generalization across poses and placements. To further address spatial relationships and occlusions in scenes with multiple objects, we implement a strategy to pair objects during dataset generation, resulting in a dataset robust enough to handle these complex scenarios. Achieving generalization to real-world scenes remains a challenge, so we introduce a three-stage training curriculum to train a conditional generative model, aimed at improving real-world performance. We provide extensive qualitative and quantitative evaluations to support our approach, and the code …
Poster
Yuanxun Lu · Jingyang Zhang · Tian Fang · Jean-Daniel Nahmias · Yanghai Tsin · Long Quan · Xun Cao · Yao Yao · Shiwei Li

[ ExHall D ]

Abstract
We present Matrix3D, a unified model that performs several photogrammetry subtasks, including pose estimation, depth prediction, and novel view synthesis, all with the same model. Matrix3D utilizes a multi-modal diffusion transformer (DiT) to integrate transformations across several modalities, such as images, camera parameters, and depth maps. The key to Matrix3D's large-scale multi-modal training lies in the incorporation of a mask learning strategy. This enables full-modality model training even with partially complete data, such as bi-modality data of image-pose and image-depth pairs, thus significantly increasing the pool of available training data. Matrix3D demonstrates state-of-the-art performance in pose estimation and novel view synthesis tasks. Additionally, it offers fine-grained control through multi-round interactions, making it an innovative tool for 3D content creation.
Poster
Guibiao Liao · Qing Li · Zhenyu Bao · Guoping Qiu · KANGLIN LIU

[ ExHall D ]

Abstract
3D Gaussian Splatting-based indoor open-world free-view synthesis approaches have shown significant performance with dense input images. However, they exhibit poor performance when confronted with sparse inputs, primarily due to the sparse distribution of Gaussian points and insufficient view supervision. To relieve these challenges, we propose SPC-GS, leveraging Scene-layout-based Gaussian Initialization (SGI) and Semantic-Prompt Consistency (SPC) Regularization for open-world free view synthesis with sparse inputs. Specifically, SGI provides a dense, scene-layout-based Gaussian distribution by utilizing view-changed images generated from the video generation model and view-constrained Gaussian point densification. Additionally, SPC mitigates limited view supervision by employing semantic-prompt-based consistency constraints developed with SAM2. This approach leverages available semantics from training views, serving as instructive prompts, to optimize visually overlapping regions in novel views with 2D and 3D consistency constraints. Extensive experiments demonstrate the superior performance of SPC-GS across Replica and ScanNet benchmarks. Notably, our SPC-GS achieves a 3.06 dB gain in PSNR for reconstruction quality and a 7.3% improvement in mIoU for open-world semantic segmentation.
Poster
Hyunho Ha · Lei Xiao · Christian Richardt · Thu Nguyen-Phuoc · Changil Kim · Min H. Kim · Douglas Lanman · Numair Khan

[ ExHall D ]

Abstract
We introduce a novel geometry-guided online video view synthesis method with enhanced view and temporal consistency. Traditional approaches achieve high-quality synthesis from dense multi-view camera setups but require significant computational resources. In contrast, selective-input methods reduce this cost but often compromise quality, leading to multi-view and temporal inconsistencies such as flickering artifacts. Our method addresses this challenge to deliver efficient, high-quality novel-view synthesis with view and temporal consistency. The key innovation of our approach lies in using global geometry to guide an image-based rendering pipeline. To accomplish this, we progressively refine depth maps using color difference masks across time. These depth maps are then accumulated through truncated signed distance fields (TSDF) in the synthesized view's image space. This depth representation is view and temporally consistent, and is used to guide a pre-trained blending network that fuses multiple forward-rendered input-view images. Thus, the network is encouraged to output geometrically consistent synthesis results across multiple views and time. Our approach achieves consistent, high-quality video synthesis, while running efficiently in an online manner.
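As an illustration of the TSDF accumulation step, here is a minimal single-view sketch using the standard truncated-SDF weighted running average; the real pipeline fuses progressively refined depth maps across time and views in the synthesized view's image space, which is not shown.

```python
# Hedged sketch of fusing a depth map into a per-pixel truncated signed
# distance field with the usual weighted running average.
import numpy as np

def update_tsdf(tsdf, weight, z_grid, depth, trunc=0.05):
    """Fuse one depth map into a per-pixel TSDF sampled at depths `z_grid`.

    tsdf, weight: (H, W, D) running estimates
    z_grid:       (D,) sample depths along each pixel ray
    depth:        (H, W) observed depth for this frame
    """
    sdf = depth[..., None] - z_grid[None, None, :]       # signed distance to surface
    d = np.clip(sdf / trunc, -1.0, 1.0)                  # truncate
    valid = sdf > -trunc                                 # ignore samples far behind the surface
    w_new = valid.astype(float)
    tsdf = (tsdf * weight + d * w_new) / np.maximum(weight + w_new, 1e-8)
    return tsdf, weight + w_new

if __name__ == "__main__":
    H, W, D = 4, 4, 16
    z = np.linspace(0.5, 3.0, D)
    tsdf, w = np.zeros((H, W, D)), np.zeros((H, W, D))
    tsdf, w = update_tsdf(tsdf, w, z, depth=np.full((H, W), 1.5))
```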
Poster
Sheng Miao · Jiaxin Huang · Dongfeng Bai · Xu Yan · Hongyu Zhou · Yue Wang · Bingbing Liu · Andreas Geiger · Yiyi Liao

[ ExHall D ]

Abstract
Novel view synthesis of urban scenes is essential for autonomous driving-related applications. Existing NeRF and 3DGS-based methods show promising results in achieving photorealistic renderings but require slow, per-scene optimization. We introduce EVolSplat, an efficient 3D Gaussian Splatting model for urban scenes that works in a feed-forward manner. Unlike existing feed-forward, pixel-aligned 3DGS methods, which often suffer from issues like multi-view inconsistencies and duplicated content, our approach predicts 3D Gaussians across multiple frames within a unified volume using a 3D convolutional network. This is achieved by initializing 3D Gaussians with noisy depth predictions, and then refining their geometric properties in 3D space and predicting color based on 2D textures. Our model also handles distant views and the sky with a flexible hemisphere background model. This enables us to perform fast, feed-forward reconstruction while achieving real-time rendering. Experimental evaluations on the KITTI-360 and Waymo datasets show that our method achieves state-of-the-art quality compared to existing feed-forward 3DGS- and NeRF-based methods.
Poster
Yuhan Wang · Fangzhou Hong · Shuai Yang · Liming Jiang · Wayne Wu · Chen Change Loy

[ ExHall D ]

Abstract
Multiview diffusion models have shown considerable success in image-to-3D generation for general objects. However, when applied to human data, existing methods have yet to deliver promising results, largely due to the challenges of scaling multiview attention to higher resolutions. In this paper, we explore human multiview diffusion models at the megapixel level and introduce a solution called mesh attention to enable training at 1024×1024 resolution. Using a clothed human mesh as a central coarse geometric representation, the proposed mesh attention leverages rasterization and projection to establish direct cross-view coordinate correspondences. This approach significantly reduces the complexity of multiview attention while maintaining cross-view consistency. Building on this foundation, we devise a mesh attention block and combine it with keypoint conditioning to create our human-specific multiview diffusion model, MEAT. In addition, we present valuable insights into applying multiview human motion videos for diffusion training, addressing the longstanding issue of data scarcity. Extensive experiments show that MEAT effectively generates dense, consistent multiview human images at the megapixel level, outperforming existing multiview diffusion methods. Code and model will be publicly available.
Poster
Jiang Wu · Rui Li · Yu Zhu · Rong Guo · Jinqiu Sun · Yanning Zhang

[ ExHall D ]

Abstract
We present a Gaussian Splatting method for surface reconstruction using sparse input views. Previous methods relying on dense views struggle with extremely sparse Structure-from-Motion points for initialization. While learning-based Multi-view Stereo (MVS) provides dense 3D points, directly combining it with Gaussian Splatting leads to suboptimal results due to the ill-posed nature of sparse-view geometric optimization. We propose Sparse2DGS, an MVS-initialized Gaussian Splatting pipeline for complete and accurate reconstruction. Our key insight is to incorporate geometry-prioritized enhancement schemes, allowing for direct and robust geometric learning under ill-posed conditions. As the first method of this kind, Sparse2DGS outperforms existing methods by notable margins, achieving a Chamfer Distance error of 1.13 on the DTU dataset using 3 views, compared to 2.81 for 2DGS. Meanwhile, our method is 2× faster than the NeRF-based fine-tuning approach.
Poster
Wenyuan Zhang · Emily Yue-ting Jia · Junsheng Zhou · Baorui Ma · Kanle Shi · Yu-Shen Liu · Zhizhong Han

[ ExHall D ]

Abstract
Recently, it has been shown that priors are vital for neural implicit functions to reconstruct high-quality surfaces from multi-view RGB images. However, current priors require large-scale pre-training, and merely provide geometric clues without considering the importance of color. In this paper, we present NeRFPrior, which adopts a neural radiance field as a prior to learn signed distance fields using volume rendering for surface reconstruction. Our NeRF prior can provide both geometric and color clues, and can be trained quickly on the same scene without additional data. Based on the NeRF prior, we learn a signed distance function (SDF) by explicitly imposing a multi-view consistency constraint on each ray intersection for surface inference. Specifically, at each ray intersection, we use the density in the prior as a coarse geometry estimation, while using the color near the surface as a clue to check its visibility from another view angle. For the textureless areas where the multi-view consistency constraint does not work well, we further introduce a depth consistency loss with confidence weights to infer the SDF. Our experimental results outperform the state-of-the-art methods on widely used benchmarks. The source code will be publicly available.
Poster
Mingjun Zheng · Long Sun · Jiangxin Dong · Jinshan Pan

[ ExHall D ]

Abstract
Latency is a key driver for real-time rendering applications, making super-resolution techniques increasingly popular to accelerate rendering processes. In contrast to existing methods that directly concatenate low-resolution frames and G-buffers as input without discrimination, we develop an asymmetric UNet-based super-resolution network with decoupled G-buffer guidance, dubbed RDG, to facilitate spatial and temporal feature exploration while minimizing performance overheads and latency. We first propose a dynamic feature modulator (DFM) to selectively encode spatial information for capturing precise structural information. We then incorporate auxiliary G-buffer information to guide the decoder to generate detail-rich, temporally stable results. Specifically, we adopt a high-frequency feature booster (HFB) to adaptively transfer the high-frequency information from the normal and bidirectional reflectance distribution function (BRDF) components of the G-buffer, enhancing the details of the generated results. To further enhance the temporal stability, we design a cross-frame temporal refiner (CTR) with depth and motion vector constraints to aggregate the previous and current frames. Extensive experimental results reveal that our proposed method is capable of generating high-quality and temporally stable results in real-time rendering. The proposed RDG-s produces 1080P rendering results on an RTX 3090 GPU at 126 FPS.
Poster
Sangwoon Kwak · Joonsoo Kim · Jun Young Jeong · Won-Sik Cheong · Jihyong Oh · Munchurl Kim

[ ExHall D ]

Abstract
3D Gaussian Splatting (3DGS) has made significant strides in scene representation and neural rendering, with intense efforts to adapt it for dynamic scenes. While achieving high rendering quality and speed, the existing methods struggle with storage demands and representing complex real-world motions. To tackle these issues, we propose MoDec-GS, a memory-efficient Gaussian splatting framework for reconstructing novel views in challenging scenarios with complex motions. We introduce Global-to-Local Motion Decomposition (GLMD) to effectively capture dynamic motions in a coarse-to-fine manner. This approach leverages Global Canonical Scaffolds (Global CS) and Local Canonical Scaffolds (Local CS), extending static Scaffold representation to dynamic video reconstruction. For Global CS, we propose Global Anchor Deformation (GAD) to efficiently represent global dynamics along complex motions, by directly deforming the implicit Scaffold attributes which are anchor position, offset, and local context features. Next, we finely adjust local motions via the Local Gaussian Deformation (LGD) of Local CS explicitly. Additionally, we introduce Temporal Interval Adjustment (TIA) to automatically control the temporal coverage of each Local CS during training, allowing MoDec-GS to find optimal interval assignments based on the specified number of temporal segments. Extensive evaluations demonstrate that MoDec-GS achieves an average 70% reduction in model size over state-of-the-art methods …
Poster
Yuheng Jiang · Zhehao Shen · Chengcheng Guo · Yu Hong · Zhuo Su · Yingliang Zhang · Marc Habermann · Lan Xu

[ ExHall D ]

Abstract
Human-centric volumetric videos offer immersive free-viewpoint experiences, yet existing methods either focus on replaying general dynamic scenes or animating human avatars under new motions, limiting their ability to re-perform general dynamic scenes. In this paper, we present RePerformer, a novel Gaussian-based representation that unifies playback and re-performance for high-fidelity human-centric volumetric videos. Specifically, we hierarchically disentangle the dynamic scenes into motion Gaussians and appearance Gaussians which are associated in the canonical space. We further employ a Morton-based parameterization to efficiently encode the appearance Gaussians into 2D position and attribute maps. For enhanced generalization, we adopt 2D CNNs to map position maps to attribute maps, which can be assembled into appearance Gaussians for high-fidelity rendering of the dynamic scenes. For re-performance, we develop a semantic-aware alignment module and apply deformation transfer on motion Gaussians, enabling photo-real rendering under novel motions. Extensive experiments validate the robustness and effectiveness of RePerformer, setting a new benchmark for playback-then-reperformance paradigm in human-centric volumetric videos.
Poster
Miaowei Wang · Yibo Zhang · Rui Ma · Weiwei Xu · Changqing Zou · Daniel Morris

[ ExHall D ]

Abstract
We present DecoupledGaussian, a novel system that decouples static objects from their contacted surfaces in in-the-wild videos, a key prerequisite for realistic Newtonian-based physical simulations. Unlike prior methods focused on synthetic data or elastic jittering along the contact surface, which prevent objects from fully detaching or moving independently, DecoupledGaussian allows for significant positional changes without being constrained by the initial contacted surface. Recognizing the limitations of current 2D inpainting tools for restoring 3D locations, our approach uses joint Poisson fields to repair and expand the Gaussians of both objects and contacted scenes after separation. This is complemented by a multi-carve strategy to refine the object's geometry. Our system enables realistic simulations of decoupling motions, collisions, and fractures driven by user-specified impulses, supporting complex interactions within and across multiple scenes. We validate DecoupledGaussian through a comprehensive user study and quantitative benchmarks. This system enhances digital interaction with objects and scenes in real-world environments, benefiting industries such as VR, robotics, and autonomous driving. The code is available in the supplementary material and will be released publicly upon acceptance.
Poster
Navami Kairanda · Marc Habermann · Shanthika Shankar Naik · Christian Theobalt · Vladislav Golyanik

[ ExHall D ]

Abstract
3D reconstruction of highly deformable surfaces (e.g. cloths) from monocular RGB videos is a challenging problem, and no solution provides a consistent and accurate recovery of fine-grained surface details. To account for the ill-posed nature of the setting, existing methods use deformation models with statistical, neural, or physical priors. They also predominantly rely on nonadaptive discrete surface representations (e.g. polygonal meshes), perform frame-by-frame optimisation leading to error propagation, and suffer from poor gradients of the mesh-based differentiable renderers. Consequently, fine surface details such as cloth wrinkles are often not recovered with the desired accuracy. In response to these limitations, we propose Thin-Shell-SfT, a new method for non-rigid 3D tracking that represents a surface as an implicit and continuous spatiotemporal neural field. We incorporate continuous thin shell simulation based on the Kirchhoff-Love model for spatial regularisation, which starkly contrasts the discretised alternatives of earlier works. Lastly, we leverage 3D Gaussian splatting to differentiably render the surface into image space and optimise the deformations based on analysis-by-synthesis principles. Our Thin-Shell-SfT method outperforms prior work qualitatively and quantitatively thanks to our continuous surface formulation in conjunction with a specially tailored simulation prior and joint space-time optimisation.
Poster
Xinjie Li · Ziyi Chen · Xinlu Yu · Iek-Heng Chu · Peng Chang · Jing Xiao

[ ExHall D ]

Abstract
Co-speech gestures are essential to non-verbal communication, enhancing both the naturalness and effectiveness of human interaction. Although recent methods have made progress in generating co-speech gesture videos, many rely on strong visual controls, such as pose images or TPS keypoint movements, which often lead to artifacts like blurry hands and distorted fingers. In response to these challenges, we present the Implicit Motion-Audio Entanglement (IMAE) method for co-speech gesture video generation. IMAE strengthens audio control by entangling implicit motion parameters, including pose and expression, with audio inputs. Our method utilizes a two-branch framework that combines an audio-to-motion generation branch with a video diffusion branch, enabling realistic gesture generation without requiring additional inputs during inference. To improve training efficiency, we propose a two-stage slow-fast training strategy that balances memory constraints while facilitating the learning of meaningful gestures from long frame sequences. Extensive experimental results demonstrate that our method achieves state-of-the-art performance across multiple metrics.
Poster
Natacha Kuete Meli · Vladislav Golyanik · Marcel Seelbach Benkner · Michael Moeller

[ ExHall D ]

Abstract
There is growing interest in solving computer vision problems such as mesh or point set alignment using Adiabatic Quantum Computing (AQC). Unfortunately, modern experimental AQC devices such as D-Wave only support Quadratic Unconstrained Binary Optimization (QUBO) problems, which severely limits their applicability. This paper proposes a new way to overcome this limitation and introduces QuCOOP, an optimization framework extending the scope of AQC to composite and binary-parameterized, possibly non-quadratic problems. The key idea of QuCOOP is to iteratively approximate the original objective function by a sequence of local (intermediate) QUBO forms, whose binary parameters can be sampled on AQC devices. We experiment with quadratic assignment problems, shape matching and point set registration without knowing the correspondences in advance. Our approach achieves state-of-the-art results across multiple instances of tested problems.
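A minimal sketch of the overall alternation the abstract describes: repeatedly build a local QUBO approximation of a harder binary objective around the current iterate and solve it, here by brute force as a stand-in for an AQC sampler. `build_local_qubo` is a hypothetical callback; the paper's construction of the intermediate QUBOs is not reproduced.

```python
# Hedged sketch: iterate between a local QUBO approximation and a QUBO solve.
# The brute-force solver stands in for sampling on an AQC device.
import itertools
import numpy as np

def solve_qubo_bruteforce(Q):
    """Minimize x^T Q x over x in {0,1}^n (only feasible for small n)."""
    n = Q.shape[0]
    best_x, best_val = None, np.inf
    for bits in itertools.product((0, 1), repeat=n):
        x = np.array(bits, dtype=float)
        val = x @ Q @ x
        if val < best_val:
            best_x, best_val = x, val
    return best_x

def qucoop_like_loop(objective, build_local_qubo, x0, iters=10):
    x = x0.copy()
    for _ in range(iters):
        Q = build_local_qubo(x)              # local QUBO approximation at x
        cand = solve_qubo_bruteforce(Q)      # stand-in for the AQC sampler
        if objective(cand) < objective(x):   # keep only improving steps
            x = cand
    return x

if __name__ == "__main__":
    # Toy non-quadratic objective over 6 binary variables, approximated by
    # linearizing its cubic term around the current point (a crude choice).
    rng = np.random.default_rng(0)
    A = rng.normal(size=(6, 6))
    A = (A + A.T) / 2
    objective = lambda x: float(x @ A @ x + 0.5 * np.sum(x) ** 3)
    build_local_qubo = lambda x: A + 1.5 * np.sum(x) ** 2 * np.eye(6)
    print(qucoop_like_loop(objective, build_local_qubo, np.zeros(6)))
```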
Poster
Shashwath Bharadwaj · Ruangrawee Kitichotkul · Akshay Agarwal · Vivek K Goyal

[ ExHall D ]

Abstract
Readout multiplexing is a promising solution to overcome hardware limitations and data bottlenecks in imaging with single-photon detectors. Conventional multiplexed readout processing creates an upper bound on photon counts at a very fine time scale, where measurements with multiple detected photons must either be discarded or allowed to introduce significant bias. We formulate multiphoton coincidence resolution as an inverse imaging problem and introduce a solution framework to probabilistically resolve the spatial locations of photon incidences. Specifically, we develop a theoretical abstraction of row-column multiplexing and a model of photon events that make readouts ambiguous. Using this, we propose a novel estimator that spatially resolves up to four coincident photons. Our estimator achieves a 3 to 4 dB increase in the peak signal-to-noise ratio of image reconstruction compared to traditional methods at higher incidence photon fluxes. Additionally, this method achieves a ~4× reduction in the required number of readout frames to achieve the same mean-squared error as other methods. Finally, our solution matches the Cramér-Rao bound for detection probability estimation for a wider range of incident flux values compared to conventional methods. While demonstrated for a specific detector type and readout architecture, this method can be extended to more general multiplexing …
Poster
Yuanlin Wang · Yiyang Zhang · Ruiqin Xiong · Jing Zhao · Jian Zhang · Xiaopeng Fan · Tiejun Huang

[ ExHall D ]

Abstract
Spike camera is a kind of neuromorphic camera that records dynamic scenes by firing a stream of binary spikes with extremely high temporal resolution. It demonstrates great potential for vision tasks in high-speed scenarios. One limitation in its current implementation is the relatively low spatial resolution. This paper develops a network called Spk2SRImgNet to super-resolve high-resolution images from a low-resolution spike stream. However, fluctuations in the spike stream hinder the performance of spike camera super-resolution. To address this issue, we propose a motion aligned collaborative filtering (MACF) module, which is motivated by key ideas in classic image restoration schemes to mitigate fluctuations in spike data. MACF leverages the temporal similarity of the spike stream to acquire similar features from neighboring moments via motion alignment. To separate disturbances from features, MACF filters these similar features jointly in transform domain to exploit representation sparsity, and generates refinement features that will be used to update initial fluctuated features. Specifically, MACF designs an inverse motion alignment operation to map these refinement features back to their original positions. The initial features are aggregated with the repositioned refinement features to enhance reliability. Experimental results demonstrate that the proposed method achieves state-of-the-art performance compared with existing methods. …
Poster
Bohan Yu · Jin Han · Boxin Shi · Imari Sato

[ ExHall D ]

Abstract
Simultaneous acquisition of the surface normal and reflectance parameters is a crucial but challenging task in the field of computer vision and graphics. Existing methods require capturing multiple high dynamic range (HDR) images using frame-based cameras. In this paper, we propose EventPSR, the first work to recover surface normal and reflectance parameters (e.g., metallic and roughness) simultaneously using an event camera. Compared with the existing methods based on photometric stereo or neural radiance fields, EventPSR is a robust and efficient approach that works consistently with different materials. Thanks to the extremely high temporal resolution and high dynamic range coverage of event cameras, EventPSR can recover accurate surface normal and reflectance of objects with various materials in 10 seconds. Extensive experiments on both synthetic data and real objects show that compared with existing methods using more than 100 HDR images, EventPSR recovers comparable surface normal and reflectance parameters with only about 30% of the data rate.
Poster
Cheng Zhang · Haofei Xu · Qianyi Wu · Camilo Cruz Gambardella · Dinh Phung · Jianfei Cai

[ ExHall D ]

Abstract
With the advent of portable 360° cameras, panorama has gained significant attention in applications like virtual reality (VR), virtual tours, robotics, and autonomous driving. As a result, wide-baseline panorama view synthesis has emerged as a vital task, where high resolution, fast inference, and memory efficiency are essential. Nevertheless, existing methods typically focus on lower resolutions (512×1024) due to demanding memory and computational requirements. In this paper, we present PanSplat, a generalizable, feed-forward approach that efficiently supports resolution up to 4K (2048×4096). Our approach features a tailored spherical 3D Gaussian pyramid with a Fibonacci lattice arrangement, enhancing image quality while reducing information redundancy. To accommodate the demands of high resolution, we propose a pipeline that integrates a hierarchical spherical cost volume and localized Gaussian heads, enabling two-step deferred backpropagation for memory-efficient training on a single A100 GPU. Experiments demonstrate that PanSplat achieves state-of-the-art results with superior efficiency and image quality across both synthetic and real-world datasets.
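As a small illustration of the Fibonacci lattice arrangement mentioned above, the snippet below places nearly uniform points on the unit sphere using the standard golden-angle construction; the per-level point counts and any Gaussian-specific parameters are not taken from the paper, and the function name is illustrative only.

```python
import numpy as np

def fibonacci_lattice(n):
    """Nearly uniform directions on the unit sphere via the golden-angle (Fibonacci) lattice."""
    k = np.arange(n)
    golden_angle = np.pi * (3.0 - np.sqrt(5.0))     # ~2.39996 rad
    z = 1.0 - 2.0 * (k + 0.5) / n                   # evenly spaced heights in [-1, 1]
    r = np.sqrt(1.0 - z ** 2)
    theta = golden_angle * k
    return np.stack([r * np.cos(theta), r * np.sin(theta), z], axis=1)

pts = fibonacci_lattice(2048)
print(pts.shape, np.allclose(np.linalg.norm(pts, axis=1), 1.0))
```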
Poster
Xuan Shen · Weize Ma · Jing Liu · Changdi Yang · Rui Ding · Quanyi Wang · Henghui Ding · Wei Niu · Yanzhi Wang · Pu Zhao · Jun Lin · Jiuxiang Gu

[ ExHall D ]

Abstract
Monocular Depth Estimation (MDE) has emerged as a pivotal task in computer vision, supporting numerous real-world applications. However, deploying high-performing depth estimation models on resource-constrained edge devices, especially Application-Specific Integrated Circuits (ASICs), remains a formidable challenge due to the substantial computational and memory demands of state-of-the-art models. Recent advancements in foundational depth estimation deliver impressive results but further amplify the difficulty of deployment on ASICs. To address this, we propose **QuartDepth** which adopts post-training quantization to optimize and accelerate MDE models specifically for ASICs. Our approach involves quantizing both weights and activations to 4-bit precision, significantly reducing the model size and computation cost. To mitigate the performance degradation typically associated with aggressive quantization, we introduce an activation polishing and compensation algorithm applied before and after activation quantization, as well as a weight reconstruction method for minimizing errors in weight quantization. Furthermore, we design a novel flexible and programmable hardware accelerator by supporting kernel fusion and customized instruction programmability, enhancing throughput and efficiency. Experimental results demonstrate that our proposed framework achieves competitive accuracy while enabling fast inference and higher energy efficiency on ASICs, bridging the gap between high-performance depth estimation and practical edge-device applicability.
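For reference, the sketch below shows plain symmetric 4-bit post-training quantisation of a weight tensor. It omits QuartDepth's activation polishing, compensation, and weight-reconstruction steps, and the function names are illustrative only.

```python
import numpy as np

def quantize_4bit(w, per_channel_axis=0):
    """Symmetric per-channel 4-bit quantisation (a generic baseline, not QuartDepth's method)."""
    qmax = 7                                              # int4 symmetric range [-8, 7]
    axes = tuple(a for a in range(w.ndim) if a != per_channel_axis)
    scale = np.abs(w).max(axis=axes, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)              # guard against all-zero channels
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(64, 128).astype(np.float32)
q, s = quantize_4bit(w)
print("mean abs reconstruction error:", np.abs(dequantize(q, s) - w).mean())
```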
Poster
Jianhao Zheng · Zihan Zhu · Valentin Bieri · Marc Pollefeys · Songyou Peng · Iro Armeni

[ ExHall D ]

Abstract
We present WildGS-SLAM, a robust and efficient monocular RGB SLAM system designed to handle dynamic environments by leveraging uncertainty-aware geometric mapping. Unlike traditional SLAM systems, which assume static scenes, our approach integrates depth and uncertainty information to enhance tracking, mapping, and rendering performance in the presence of moving objects. We introduce an uncertainty map, predicted by a shallow multi-layer perceptron and DINOv2 features, to guide dynamic object removal during both tracking and mapping. This uncertainty map enhances dense bundle adjustment and Gaussian map optimization, improving reconstruction accuracy. Our system is evaluated on multiple datasets and demonstrates artifact-free view synthesis. Results showcase WildGS-SLAM's superior performance in dynamic environments compared to state-of-the-art methods.
Poster
Qianqian Wang · Yifei Zhang · Aleksander Holynski · Alexei A. Efros · Angjoo Kanazawa

[ ExHall D ]

Abstract
We propose a novel unified framework capable of solving a broad range of 3D tasks. At the core of our approach is an online stateful recurrent model that continuously updates its state representation with each new observation. Given a stream of images, our method leverages the evolving state to generate metric-scale pointmaps for each input in an online manner. These pointmaps reside within a common coordinate system, accumulating into a coherent 3D scene reconstruction. Our model captures rich priors of real-world scenes: not only can it predict accurate pointmaps from image observations, but it can also infer unseen structures beyond the coverage of the input images through a raymap probe. Our method is simple yet highly flexible, naturally accepting varying lengths of image sequences and working seamlessly with both video streams and unordered photo collections. We evaluate our method on various 3D/4D tasks including monocular/video depth estimation, camera estimation, multi-view reconstruction, and achieve competitive or state-of-the-art performance. Additionally, we showcase intriguing behaviors enabled by our state representation.
Poster
Zhengqi Li · Richard Tucker · Forrester Cole · Qianqian Wang · Linyi Jin · Vickie Ye · Angjoo Kanazawa · Aleksander Holynski · Noah Snavely

[ ExHall D ]

Abstract
We present a system that allows for accurate, fast, and robust estimation of camera parameters and depth maps from casual monocular videos of dynamic scenes. Most conventional structure from motion and monocular SLAM techniques assume input videos that feature predominantly static scenes with large amounts of parallax. Such methods tend to produce erroneous estimates in the absence of these conditions. Recent neural network based approaches attempt to overcome these challenges; however, such methods are either computationally expensive or brittle when run on dynamic videos with uncontrolled camera motion or unknown field of view. We demonstrate the surprising effectiveness of the deep visual SLAM framework, and with careful modifications to its training and inference schemes, this system can scale to real-world videos of complex dynamic scenes with unconstrained camera paths, including videos with little camera parallax. Extensive experiments on both synthetic and real videos demonstrate that our system is significantly more accurate and robust at camera pose and depth estimation when compared with prior and concurrent work, with faster or comparable running times.
Poster
Hoang Chuong Nguyen · Wei Mao · Jose M. Alvarez · Miaomiao Liu

[ ExHall D ]

Abstract
Neural Radiance Fields (NeRF) have demonstrated a superior capability to represent 3D geometry but require accurately precomputed camera poses during training. To mitigate this requirement, existing methods jointly optimize camera poses and NeRF, often relying on good pose initialisation or depth priors. However, these approaches struggle in challenging scenarios, such as large rotations, as they map each camera to a world coordinate system. We propose a novel method that eliminates prior dependencies by modeling continuous camera motions as time-dependent angular velocity and velocity. Relative motions between cameras are learned first via velocity integration, while camera poses can be obtained by aggregating such relative motions up to a world coordinate system defined at a single time step within the video. Specifically, accurate continuous camera movements are learned through a time-dependent NeRF, which captures local scene geometry and motion by training from neighboring frames for each time step. The learned motions enable fine-tuning the NeRF to represent the full scene geometry. Experiments on Co3D and Scannet show our approach achieves superior camera pose and depth estimation and comparable novel-view synthesis performance compared to state-of-the-art methods.
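A minimal Euler-integration sketch of the underlying idea: chaining time-dependent angular and linear velocities into camera poses anchored at a single time step. The paper learns these velocities jointly with a time-dependent NeRF; the helper below and its constant-velocity example are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def integrate_motion(omegas, velocities, dt):
    """Chain per-step angular velocities (rad/s) and linear velocities (m/s) into
    camera-to-world poses, anchored at the identity pose of the first time step."""
    Rs, ts = [R.identity()], [np.zeros(3)]
    for w, v in zip(omegas, velocities):
        dR = R.from_rotvec(np.asarray(w) * dt)       # relative rotation over dt
        Rs.append(Rs[-1] * dR)                       # compose in the body frame
        ts.append(ts[-1] + Rs[-1].apply(np.asarray(v)) * dt)
    return Rs, ts

# constant yaw rate plus forward motion over 10 steps
Rs, ts = integrate_motion([[0, 0, 0.1]] * 10, [[0, 0, 1.0]] * 10, dt=0.1)
print(Rs[-1].as_rotvec(), ts[-1])
```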
Poster
Jinneyong Kim · Seung-Hwan Baek

[ ExHall D ]

Abstract
Integrating RGB and NIR imaging provides complementary spectral information, enhancing robotic vision in challenging lighting conditions. However, existing datasets and imaging systems lack pixel-level alignment between RGB and NIR images, posing challenges for downstream tasks. In this paper, we develop a robotic vision system equipped with two pixel-aligned RGB-NIR stereo cameras and a LiDAR sensor mounted on a mobile robot. The system simultaneously captures RGB stereo images, NIR stereo images, and temporally synchronized LiDAR point cloud. Utilizing the mobility of the robot, we present a dataset containing continuous video frames with pixel-aligned RGB and NIR stereo pairs under diverse lighting conditions. We introduce two methods that utilize our pixel-aligned RGB-NIR images: an RGB-NIR image fusion method and a feature fusion method. The first approach enables existing RGB-pretrained vision models to directly utilize RGB-NIR information without fine-tuning. The second approach fine-tunes existing vision models to more effectively utilize RGB-NIR information. Experimental results demonstrate the effectiveness of using pixel-aligned RGB-NIR images across diverse lighting conditions.
Poster
Sergio Izquierdo · Mohamed Sayed · Michael Firman · Guillermo Garcia-Hernando · Daniyar Turmukhambetov · Javier Civera · Oisin Mac Aodha · Gabriel Brostow · Jamie Watson

[ ExHall D ]

Abstract
Computing accurate depth from multiple views is a fundamental and longstanding challenge in computer vision. However, most existing approaches do not generalize well across different domains and scene types (e.g. indoor vs outdoor). Training a general-purpose multi-view stereo model is challenging and raises several questions, e.g. how to best make use of transformer-based architectures, how to incorporate additional metadata when there is a variable number of input views, and how to estimate the range of valid depths which can vary considerably across different scenes and is typically not known a priori? To address these issues, we introduce MVSA, a novel and versatile Multi-View Stereo architecture that aims to work Anywhere by generalizing across diverse domains and depth ranges. MVSA combines monocular and multi-view cues with an adaptive cost volume to deal with scale-related issues. We demonstrate state-of-the-art zero-shot depth estimation on the Robust Multi-View Depth Benchmark, surpassing existing multi-view stereo and monocular baselines.
Poster
Yaqing Ding · Viktor Kocur · Zuzana Berger Haladova · Qianliang Wu · Shen Cai · Jian Yang · Zuzana Kukelova

[ ExHall D ]

Abstract
In this paper, we propose a novel approach for recovering focal lengths from three-view homographies. By examining the consistency of normal vectors between two homographies, we derive new explicit constraints between the focal lengths and homographies using an elimination technique. We demonstrate that three-view homographies provide two additional constraints, enabling the recovery of one or two focal lengths. We discuss four possible cases, including three cameras having an unknown equal focal length, three cameras having two different unknown focal lengths, three cameras where one focal length is known, and the other two cameras have equal or different unknown focal lengths. All the problems can be converted into solving polynomials in one or two unknowns, which can be efficiently solved using the Sturm sequence or the hidden-variable technique. Evaluation using both synthetic and real data shows that the proposed solvers are both faster and more accurate than methods relying on existing two-view solvers.
Poster
Ji Zhao · Banglei Guan · Zibin Liu · Laurent Kneip

[ ExHall D ]

Abstract
For event cameras, current sparse geometric solvers for egomotion estimation assume that the rotational displacements are known, such as those provided by an IMU. Thus, they can only recover the translational motion parameters. Recovering full-DoF motion parameters using a sparse geometric solver is a more challenging task, and has not yet been investigated. In this paper, we propose several solvers to estimate both rotational and translational velocities within a unified framework. Our method leverages event manifolds induced by line segments. The problem formulations are based on either an incidence relation for lines or a novel coplanarity relation for normal vectors. We demonstrate the possibility of recovering full-DoF egomotion parameters for both angular and linear velocities without requiring extra sensor measurements or motion priors. To achieve efficient optimization, we exploit the Adam framework with a first-order approximation of rotations for quick initialization. Experiments on both synthetic and real-world data demonstrate the effectiveness of our method. The code will be made publicly available.
Poster
Haifeng Wu · Shuhang Gu · Lixin Duan · Wen Li

[ ExHall D ]

Abstract
Self-supervised monocular depth estimation has long been treated as a point-wise prediction problem, where the depth of each pixel is usually estimated independently. However, artifacts are often observed in the estimated depth map, e.g., depth values for points located in the same region may jump dramatically. To address this issue, we propose a novel self-supervised monocular depth estimation framework called GeoDepth, where we explore the intrinsic geometric representation in the 3D scene for producing accurate and continuous depth maps. In particular, we model the complex 3D scene as a collection of planes with varying sizes, where each plane is characterized by a unique set of parameters, namely planar normal (indicating plane orientation) and planar offset (defining the perpendicular distance from the camera center to the plane). Under this modeling, points in the same plane are enforced to share a unique representation, and their depth variations depend only on pixel coordinates; this geometric relationship can thus be exploited to regularize the depth variations of these points. To this end, we design a structured plane generation module that introduces temporal-spatial geometric cues and the plane uniqueness principle to recover the correct scene plane representation. In addition, we develop a depth discontinuity module to …
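The planar parameterisation implies a closed-form depth for every pixel on a plane, which is what lets the representation regularise per-point depth. The snippet below evaluates that relation, z = offset / (n · K^{-1}[u, v, 1]^T), for an assumed ground plane; the intrinsics and plane values are made up for illustration and are not from the paper.

```python
import numpy as np

def plane_depth(uv, K, n, offset):
    """Depth of pixels lying on a plane with unit normal n and perpendicular
    distance `offset` from the camera centre."""
    uv1 = np.concatenate([uv, np.ones((len(uv), 1))], axis=1)   # homogeneous pixels
    rays = (np.linalg.inv(K) @ uv1.T).T                         # back-projected rays
    return offset / (rays @ n)

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
n = np.array([0.0, 1.0, 0.0])          # ground plane, y-down camera convention
uv = np.array([[320, 400], [320, 440], [320, 470]], dtype=float)
print(plane_depth(uv, K, n, offset=1.5))   # depth decreases toward the bottom of the image
```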
Poster
Xudong Jiang · Fangjinhua Wang · Silvano Galliani · Christoph Vogel · Marc Pollefeys

[ ExHall D ]

Abstract
Learning-based visual localization methods that use scene coordinate regression (SCR) offer the advantage of smaller map sizes. However, on datasets with complex illumination changes or image-level ambiguities, it remains a less robust alternative to feature matching methods. This work aims to close the gap. We introduce a covisibility graph-based global encoding learning and data augmentation strategy, along with a depth-adjusted reprojection loss to facilitate implicit triangulation. Additionally, we revisit the network architecture and local feature extraction module. Our method achieves state-of-the-art results on challenging large-scale datasets without relying on network ensembles or 3D supervision. On Aachen Day-Night, we are 10× more accurate than previous SCR methods with similar map sizes and require at least 5× smaller map sizes than any other SCR method while still delivering superior accuracy. Code will be available upon acceptance.
Poster
Ron Ferens · Yosi Keller

[ ExHall D ]

Abstract
In this work, we propose HyperPose, which utilizes hypernetworks in absolute camera pose regressors. The inherent appearance variations in natural scenes, attributable to environmental conditions, perspective, and lighting, induce a significant domain disparity between the training and test datasets. This disparity degrades the precision of contemporary localization networks. To mitigate this, we advocate for incorporating hypernetworks into single-scene and multiscene camera pose regression models. During inference, the hypernetwork dynamically computes adaptive weights for the localization regression heads based on the particular input image, effectively narrowing the domain gap. Using indoor and outdoor datasets, we evaluate the HyperPose methodology across multiple established absolute pose regression architectures. In particular, we introduce and share the Extended Cambridge Landmarks (ECL), which is a novel localization dataset, based on the Cambridge Landmarks dataset, showing it in multiple seasons with significantly varying appearance conditions. Our empirical experiments demonstrate that HyperPose yields notable performance enhancements for both single- and multi-scene architectures. We have made our source code, pre-trained models, and ECL dataset openly available.
Poster
Nicole Damblon · Marc Pollefeys · Daniel Barath

[ ExHall D ]

Abstract
This paper introduces a novel approach to improve camera position estimation in global Structure-from-Motion (SfM) frameworks by filtering inaccurate pose graph edges, representing relative translation estimates, before applying translation averaging. In SfM, pose graph vertices represent cameras, and edges represent relative poses (rotation and translation) between cameras. We formulate the edge filtering problem as vertex filtering in the dual graph: a line graph whose vertices stem from edges in the original graph and whose edges arise from cameras shared between them. Exploiting such a representation, we frame the problem as a binary classification over nodes in the dual graph. To learn such a classification and find outlier edges, we employ a Transformer architecture-based technique. To address the challenge of memory overflow often caused by converting to a line graph, we introduce a clustering-based graph processing approach, enabling the application of our method to arbitrarily large pose graphs. The proposed method outperforms existing relative translation filtering techniques in terms of final camera position accuracy and can be seamlessly integrated with any other filter. The code will be made public.
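The dual-graph construction itself is a standard line graph and can be reproduced with networkx, as in the sketch below; the transformer-based classifier and the clustering step are of course not shown, and the toy pose graph is an assumption.

```python
import networkx as nx

# Toy pose graph: vertices are cameras, edges carry relative-pose estimates.
G = nx.Graph()
G.add_edges_from([(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)])

# Dual (line) graph: each relative-pose edge becomes a node; two nodes are
# connected iff the original edges share a camera. Edge filtering in G then
# becomes binary node classification on L.
L = nx.line_graph(G)
print(sorted(L.nodes()))   # each node is an edge of G, e.g. (0, 1)
print(sorted(L.edges()))
```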
Poster
Linyi Jin · Richard Tucker · Zhengqi Li · David Fouhey · Noah Snavely · Aleksander Holynski

[ ExHall D ]

Abstract
Learning to understand dynamic 3D scenes from imagery is crucial for applications ranging from robotics to scene reconstruction. Yet, unlike other problems where large-scale supervised training has enabled rapid progress, directly supervising methods for recovering 3D motion remains challenging due to the fundamental difficulty of obtaining ground truth annotations. We present a system for mining high-quality 4D reconstructions from internet stereoscopic, wide-angle videos. Our system fuses and filters the outputs of camera pose estimation, stereo depth estimation, and temporal tracking methods into high-quality dynamic 3D reconstructions. We use this method to generate large-scale data in the form of world-consistent, pseudo-metric 3D point clouds with long-term motion trajectories. We demonstrate the utility of this data by training a variant of DUSt3r to predict structure and 3D motion from real-world image pairs, showing that training on our reconstructed data enables generalization to diverse real-world scenes.
Poster
Max Kahl · Sebastian Stricker · Lisa Hutschenreiter · Florian Bernard · Carsten Rother · Bogdan Savchynskyy

[ ExHall D ]

Abstract
Multi-graph matching is an important problem in computer vision. Our task comes from bioimaging, where a set of 29 3D-microscopic images of worms has to be brought into correspondence. Surprisingly, virtually all existing methods are not applicable to this large-scale, real-world problem since they assume either a complete or a dense problem setting and have so far only been applied to small-scale, toy or synthetic problems. Despite claims in the literature that methods addressing complete multi-graph matching are applicable in an incomplete setting, our first contribution is to prove that their runtime would be excessive and impractical. Our second contribution is a new method for incomplete multi-graph matching that applies to real-world, larger-scale problems. We experimentally show that for our bioimaging application we are able to attain results in less than two minutes, whereas the only competing approach requires at least half an hour while producing far worse results. Furthermore, even for small-scale, dense or complete problem instances we achieve results that are at least on par with the leading methods, but an order of magnitude faster.
Poster
Qiyang Qian · Hansheng Chen · Masayoshi Tomizuka · Kurt Keutzer · Qianqian Wang · Chenfeng Xu

[ ExHall D ]

Abstract
Finding semantic correspondences between images is a challenging problem in computer vision, particularly under significant viewpoint changes. Previous methods rely on semantic features from pre-trained 2D models like Stable Diffusion and DINOv2, which often struggle to extract viewpoint-invariant features. To overcome this, we propose a novel approach that integrates geometric and semantic reasoning. Unlike prior methods relying on heuristic geometric enhancements, our framework fine-tunes DUSt3R on synthetic cross-instance data to reconstruct distinct objects in an aligned 3D space. By learning to deform these objects into similar shapes using semantic supervision, we enable efficient KNN-based geometric matching, followed by sparse semantic matching within local KNN candidates. While trained on synthetic data, our method generalizes effectively to real-world images, achieving up to 7.4-point improvements in zero-shot settings on the rigid-body subset of Spair-71K and up to 19.6-point gains under extreme viewpoint variations. Additionally, it accelerates runtime by up to 40 times, demonstrating both its robustness to viewpoint changes and its efficiency for practical applications.
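A rough sketch of the two-stage matching idea, assuming the two objects have already been reconstructed into an aligned 3D space: k-nearest-neighbour search proposes geometric candidates, and per-point features then pick the best candidate. The function names and the dot-product scoring are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_candidates(src_pts, tgt_pts, k=8):
    """For each source 3D point, retrieve its k nearest target points;
    semantic matching is then restricted to these candidates."""
    dists, idx = cKDTree(tgt_pts).query(src_pts, k=k)
    return dists, idx

def match_within_candidates(src_feat, tgt_feat, cand_idx):
    """Per source point, pick the candidate with the highest feature similarity."""
    sims = np.einsum('nd,nkd->nk', src_feat, tgt_feat[cand_idx])   # dot-product scores
    best = sims.argmax(axis=1)
    return cand_idx[np.arange(len(cand_idx)), best]

src = np.random.rand(500, 3); tgt = np.random.rand(600, 3)
fs = np.random.rand(500, 64); ft = np.random.rand(600, 64)
_, cand = knn_candidates(src, tgt, k=8)
print(match_within_candidates(fs, ft, cand)[:5])
```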
Poster
Aviral Chharia · Wenbo Gou · Haoye Dong

[ ExHall D ]

Abstract
Though single-view 3D human pose estimation has gained much attention, 3D multi-view multi-person pose estimation faces several challenges including the presence of occlusions and generalizability to new camera arrangements or scenarios. Existing transformer-based approaches often struggle to accurately model joint spatial sequences, especially in occluded scenarios. To address this, we present a novel Multi-View State Space Modeling framework, named MV-SSM, for robustly reconstructing 3D human poses, by explicitly modeling the joint spatial sequence at two distinct levels: the feature level from multi-view images and the joint level of the person. Specifically, we propose a Projective State Space (PSS) block to learn the joint spatial sequences using state space modeling. Furthermore, we modify Mamba's unidirectional scanning into an effective Grid token-guided Bidirectional scan (GTBS) which is integral to the PSS block. Experiments on multiple challenging benchmarks demonstrate that MV-SSM achieves highly accurate 3D pose estimation and is generalizable across the number of cameras (+10.8 on AP25 on the challenging 3-camera setting in CMU Panoptic), varying camera arrangements (+7.0 on AP25), and cross-datasets (+15.3 PCP on Campus A1), significantly outperforming state-of-the-art methods. The code has been submitted and will be open-sourced with model weights upon acceptance.
Poster
Chamuditha Jayanga Galappaththige · Jason Lai · Lloyd Windrim · Donald G. Dansereau · Niko Suenderhauf · Dimity Miller

[ ExHall D ]

Abstract
Autonomous agents often require accurate methods for detecting and localizing changes in their environment, particularly when observations are captured from unconstrained and inconsistent viewpoints. We propose a novel label-free, pose-agnostic change detection method that integrates information from multiple viewpoints to construct a change-aware 3D Gaussian Splatting (3DGS) representation of the scene. With as few as 5 images of the post-change scene, our approach can learn additional change channels in a 3DGS and produce change masks that outperform single-view techniques. Our change-aware 3D scene representation additionally enables the generation of accurate change masks for unseen viewpoints. Experimental results demonstrate state-of-the-art performance in complex multi-object scenes, achieving a 1.7× and 1.6× improvement in Mean Intersection Over Union and F1 score respectively over other baselines. We also contribute a new real-world dataset to benchmark change detection in diverse challenging scenes in the presence of lighting variations. Our code and dataset will be made publicly available upon acceptance.
Poster
Yihan Chen · Wenfei Yang · Huan Ren · Shifeng Zhang · Tianzhu Zhang · Feng Wu

[ ExHall D ]

Abstract
Relative pose estimation provides a promising way for achieving object-agnostic pose estimation. Despite the success of existing 3D correspondence-based methods, the reliance on explicit feature matching suffers from small overlaps in visible regions and unreliable feature estimation for invisible regions. Inspired by humans' ability to assemble two object parts that have small or no overlapping regions by considering object structure, we propose a novel Structure-Aware Correspondence Learning method for Relative Pose Estimation, which consists of two key modules. First, a structure-aware keypoint extraction module is designed to locate a set of keypoints that can represent the structure of objects with different shapes and appearances, under the guidance of a keypoint based image reconstruction loss. Second, a structure-aware correspondence estimation module is designed to model the intra-image and inter-image relationships between keypoints to extract structure-aware features for correspondence estimation. By jointly leveraging these two modules, the proposed method can naturally estimate 3D-3D correspondences for unseen objects without explicit feature matching for precise relative pose estimation. Experimental results on the CO3D, Objaverse and LineMOD datasets demonstrate that the proposed method significantly outperforms prior methods, i.e., with a 5.7 reduction in mean angular error on the CO3D dataset.
Poster
Sungphill Moon · Hyeontae Son · Dongcheol Hur · Sangwook Kim

[ ExHall D ]

Abstract
We propose Co-op, a novel method for accurately and robustly estimating the 6DoF pose of objects unseen during training from a single RGB image. Our method requires only the CAD model of the target object and can precisely estimate its pose without any additional fine-tuning. While existing model-based methods suffer from inefficiency due to using a large number of templates, our method enables fast and accurate estimation with a small number of templates. This improvement is achieved by finding semi-dense correspondences between the input image and the pre-rendered templates. Our method achieves strong generalization performance by leveraging a hybrid representation that combines patch-level classification and offset regression. Additionally, our pose refinement model estimates probabilistic flow between the input image and the rendered image, refining the initial estimate to an accurate pose using a differentiable PnP layer. We demonstrate that our method not only estimates object poses rapidly but also outperforms existing methods by a large margin on the seven core datasets of the BOP Challenge, achieving state-of-the-art accuracy.
Poster
Taeyeop Lee · Korea Advanced Institute of Science and Technology · Minjun Kang · Korea Advanced Institute of Science and Technology · Korea Advanced Institute of Science & Technology · Korea Advanced Institute of Science and Technology

[ ExHall D ]

Abstract
We introduce Any6D, a model-free framework for 6D object pose estimation that requires only a single RGB-D anchor image to estimate both the 6D pose and size of unknown objects in novel scenes. Unlike existing methods that rely on textured 3D models or multiple viewpoints, Any6D leverages a joint object alignment process to enhance 2D-3D alignment and metric size estimation for improved pose accuracy. Our approach integrates a render-and-compare strategy to generate and refine pose hypotheses, enabling robust performance in scenarios with occlusions, non-overlapping views, diverse lighting conditions, and large cross-environment variations. We evaluate our method on four challenging datasets: REAL275, Toyota-Light, HO3D, and YCBINEOAT, demonstrating its effectiveness in significantly outperforming state-of-the-art methods for novel object pose estimation.
Poster
Jingnan Shi · Rajat Talak · Harry Zhang · David Jin · Luca Carlone

[ ExHall D ]

Abstract
We consider the problem of estimating object pose and shape from an RGB-D image. Our first contribution is to introduce CRISP, a category-agnostic object pose and shape estimation pipeline. The pipeline implements an encoder-decoder model for shape estimation. It uses FiLM-conditioning for implicit shape reconstruction and a DPT-based network for estimating pose-normalized points for pose estimation. As a second contribution, we propose an optimization-based pose and shape corrector that can correct estimation errors caused by a domain gap. Observing that the shape decoder is well behaved in the convex hull of known shapes, we approximate the shape decoder with an active shape model, and show that this reduces the shape correction problem to a constrained linear least squares problem, which can be solved efficiently by an interior point algorithm. Third, we introduce a self-training pipeline to perform self-supervised domain adaptation of CRISP. The self-training is based on a correct-and-certify approach, which leverages the corrector to generate pseudo-labels at test time, and uses them to self-train CRISP. We demonstrate CRISP (and the self-training) on YCBV, SPE3R, and NOCS datasets. CRISP shows high performance on all the datasets. Moreover, our self-training is capable of bridging a large domain gap. Finally, CRISP also …
Poster
Jingshun Huang · Haitao Lin · Tianyu Wang · Yanwei Fu · Xiangyang Xue · Yi Zhu

[ ExHall D ]

Abstract
This paper tackles category-level pose estimation of articulated objects in robotic manipulation tasks and introduces a new benchmark dataset. While recent methods estimate part poses and sizes at the category level, they often rely on geometric cues and complex multi-stage pipelines that first segment parts from the point cloud, followed by Normalized Part Coordinate Space (NPCS) estimation for 6D poses. These approaches overlook dense semantic cues from RGB images, leading to suboptimal accuracy, particularly for objects with small parts. To address these limitations, we propose a single-stage Network, CAP-Net, for estimating the 6D poses and sizes of Categorical Articulated Parts. This method combines RGB-D features to generate instance segmentation and NPCS representations for each part in an end-to-end manner. CAP-Net uses a unified network to simultaneously predict point-wise class labels, centroid offsets, and NPCS maps. A clustering algorithm then groups points of the same predicted class based on their estimated centroid distances to isolate each part. Finally, the NPCS region of each part is aligned with the point cloud to recover its final pose and size. To bridge the sim-to-real domain gap, we introduce the RGBD-Art dataset, the largest RGB-D articulated dataset to date, featuring photorealistic RGB images and depth noise …
Poster
Yizheng Xie · Viktoria Ehm · Paul Roetzer · Nafie El Amrani · Maolin Gao · Florian Bernard · Daniel Cremers

[ ExHall D ]

Abstract
Finding correspondences between 3D shapes is a crucial problem in computer vision and graphics. While most research has focused on finding correspondences in settings where at least one of the shapes is complete, the realm of partial-to-partial shape matching remains under-explored. Yet it is of importance since, in many applications, shapes are only observed partially due to occlusion or scanning.Finding correspondences between partial shapes comes with an additional challenge: We not only want to identify correspondences between points on either shape but also have to determine which points of each shape actually have a partner.To tackle this challenging problem, we present EchoMatch, a novel framework for partial-to-partial shape matching that incorporates the concept of correspondence reflection to enable an overlap prediction within a functional map framework.With this approach, we show that we can outperform current SOTA methods in challenging partial-to-partial shape matching problems.
Poster
Sayak Nag · Udita Ghosh · Calvin-Khang Ta · Sarosij Bose · Jiachen Li · Amit K. Roy-Chowdhury

[ ExHall D ]

Abstract
Scene Graph Generation (SGG) aims to represent visual scenes by identifying objects and their pairwise relationships, providing a structured understanding of image content. However, inherent challenges like long-tailed class distributions and prediction variability necessitate uncertainty quantification in SGG for its practical viability. In this paper, we introduce a novel Conformal Prediction (CP) based framework, adaptive to any existing SGG method, for quantifying their predictive uncertainty by constructing well-calibrated prediction sets over their generated scene graphs. These scene graph prediction sets are designed to achieve statistically rigorous coverage guarantees. Additionally, to ensure these prediction sets contain the most practically interpretable scene graphs, we design an effective MLLM-based post-processing strategy for selecting the most visually and semantically plausible scene graphs within these prediction sets. We show that our proposed approach can produce diverse possible scene graphs from an image, assess the reliability of SGG methods, and improve overall SGG performance.
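For readers unfamiliar with CP, the snippet below shows generic split conformal prediction on class probabilities, the statistical machinery the framework builds on; the scene-graph-specific score design and the MLLM-based selection step described above are not reproduced, and all names are illustrative.

```python
import numpy as np

def conformal_prediction_sets(cal_scores, cal_labels, test_scores, alpha=0.1):
    """Split conformal prediction: calibrate a score threshold on held-out data so that
    prediction sets cover the true label with probability >= 1 - alpha."""
    n = len(cal_labels)
    # nonconformity = 1 - probability assigned to the true class
    nonconf = 1.0 - cal_scores[np.arange(n), cal_labels]
    q = np.quantile(nonconf, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")
    # include every class whose nonconformity is below the calibrated quantile
    return [np.where(1.0 - s <= q)[0] for s in test_scores]

rng = np.random.default_rng(0)
cal_p = rng.dirichlet(np.ones(10), size=200)    # stand-in class probabilities
cal_y = rng.integers(0, 10, size=200)
test_p = rng.dirichlet(np.ones(10), size=3)
print(conformal_prediction_sets(cal_p, cal_y, test_p))
```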
Poster
Kyujin Shim · Kangwook Ko · YuJin Yang · Changick Kim

[ ExHall D ]

Abstract
Multi-object tracking (MOT) is a critical task in computer vision, requiring the accurate identification and continuous tracking of multiple objects across video frames. However, current state-of-the-art methods mainly rely on global optimization techniques and multi-stage cascade association strategies, and those approaches often overlook the specific characteristics of the assignment task in MOT as well as useful detection results that may represent occluded objects. To address these challenges, we propose a novel Track-Focused Online Multi-Object Tracker (TrackTrack) with two key strategies: Track-Perspective-Based Association (TPA) and Track-Aware Initialization (TAI). The TPA strategy associates each track with the most suitable detection result by choosing the one with the minimum distance from all available detection results in a track-perspective manner. On the other hand, the TAI method precludes the generation of spurious tracks in a track-aware manner by suppressing track initialization of detection results that heavily overlap with current active tracks and more confident detection results. Extensive experiments on MOT17, MOT20, and DanceTrack demonstrate that our TrackTrack outperforms current state-of-the-art trackers, offering improved robustness and accuracy across diverse and challenging tracking scenarios.
Poster
Hyunseop Kim · Hyo-Jun Lee · Yonguk Lee · Jinu Lee · Hanul Kim · Yeong Jun Koh

[ ExHall D ]

Abstract
Recently, 3D multi-object tracking (MOT) has widely adopted the standard tracking-by-detection paradigm, which solves the association problem between detections and tracks. Many tracking-by-detection approaches establish constrained relationships between detections and tracks using a distance threshold to reduce confusion during association. However, this approach does not effectively and comprehensively utilize the information regarding objects due to the constraints of the distance threshold. In this paper, we propose GRAE-3DMOT, Geometry Relation-Aware Encoder 3D Multi-Object Tracking, which contains a geometric relation-aware encoder to produce informative features for association. The geometric relation-aware encoder consists of three components: a spatial relation-aware encoder, a spatiotemporal relation-aware encoder, and a distance-aware feature fusion layer. The spatial relation-aware encoder effectively aggregates detection features by comprehensively exploiting as many detections as possible. The spatiotemporal relation-aware encoder provides spatiotemporal relation-aware features by combining spatial and temporal relation features, where the spatiotemporal relation-aware features are transformed into association scores for MOT. The distance-aware feature fusion layer is integrated into both encoders to enhance the relation features of physically proximate objects. Experimental results demonstrate that the proposed GRAE-3DMOT outperforms the state-of-the-art on nuScenes. Our approach achieves 73.7% and 70.2% AMOTA on the nuScenes validation and test sets using CenterPoint detections.
Poster
Weizhuo Li · Yue Xi · Wenjing Jia · zehao zhang · Fei Li · Xiangzeng Liu · Qiguang Miao

[ ExHall D ]

Abstract
Point-Supervised Object Detection (PSOD) in a discriminative style has recently gained significant attention for its impressive detection performance and cost-effectiveness. However, accurately predicting high-quality pseudo-box labels for drone-view images, which often feature densely packed small objects, remains a challenge. This difficulty arises primarily from the limitation of rigid sampling strategies, which hinder the optimization of pseudo-boxes. To address this, we propose PointSR, an effective and robust point-supervised object detection framework with self-regularized sampling that integrates temporal and informative constraints throughout the pseudo-box generation process. Specifically, the framework comprises three key components: Temporal-Ensembling Encoder (TE Encoder), Coarse Pseudo-box Prediction, and Pseudo-box Refinement. The TE Encoder builds an anchor prototype library by aggregating temporal information for dynamic anchor adjustment. In Coarse Pseudo-box Prediction, anchors are refined using the prototype library, and a set of informative samples is collected for subsequent refinement. During Pseudo-box Refinement, these informative negative samples are used to suppress low-confidence candidate positive samples, thereby improving the quality of the pseudo boxes. Experimental results on benchmark datasets demonstrate that PointSR significantly outperforms state-of-the-art methods, achieving up to 3.3%–7.2% higher AP50 using only point supervision. Additionally, it exhibits strong robustness to perturbations in human-labeled points.
Poster
Sijie Wang · Rui She · Qiyu Kang · Siqi Li · Disheng Li · Tianyu Geng · Shangshu Yu · Wee Peng Tay

[ ExHall D ]

Abstract
Place recognition (PR) aims at retrieving the query place from a database and plays a crucial role in various applications, including navigation, autonomous driving, and augmented reality. While previous multi-modal PR works have mainly focused on the same-view scenario in which ground-view descriptors are matched with a database of ground-view descriptors during inference, the multi-modal cross-view scenario, in which ground-view descriptors are matched with aerial-view descriptors in a database, remains under-explored. We propose AGPlace, a model that effectively integrates information from multi-modal ground sensors (cameras and LiDARs) to achieve accurate aerial-ground PR. AGPlace achieves effective aerial-ground cross-view PR by leveraging a manifold-based neural ordinary differential equation (ODE) framework with a multi-domain alignment loss. It outperforms existing state-of-the-art cross-view PR models on large-scale datasets. As most existing PR models are designed for ground-ground PR, we adapt these baselines into our cross-view pipeline. Experiments demonstrate that this direct adaptation performs worse than our overall model architecture AGPlace. AGPlace represents a significant advancement in multi-modal aerial-ground PR, with promising implications for real-world applications.
Poster
Huan Lei

[ ExHall D ]

Abstract
Neural surface reconstruction has been dominated by implicit representations with marching cubes for explicit surface extraction. However, those methods typically require high-quality normals for accurate reconstruction. We propose OffsetOPT, a method that reconstructs explicit surfaces directly from 3D point clouds and eliminates the need for point normals. The approach comprises two stages: first, we train a neural network to predict surface triangles based on local point geometry, given isometrically distributed input points. Next, we apply the frozen network to reconstruct surfaces from unseen point clouds by optimizing a per-point offset to maximize the accuracy of triangle predictions. Compared to state-of-the-art methods, OffsetOPT not only excels at reconstructing overall surfaces but also significantly preserves sharp surface features. We demonstrate its accuracy on popular benchmarks, including small-scale shapes and large-scale open surfaces.
Poster
Chen Zhang · Wentao Wang · Ximeng Li · Xinyao Liao · Wanjuan Su · Wenbing Tao

[ ExHall D ]

Abstract
Recently, learning signed distance functions (SDFs) from point clouds has become popular for reconstruction. To ensure accuracy, most methods require using high-resolution Marching Cubes for surface extraction. However, this results in redundant mesh elements, making the mesh inconvenient to use. To solve the problem, we propose an adaptive meshing method to extract resolution-adaptive meshes based on surface curvature, enabling the recovery of high-fidelity lightweight meshes. Specifically, we first use a point-based representation to perceive implicit surfaces and calculate surface curvature. A vertex generator is designed to produce curvature-adaptive vertices with any specified number on the implicit surface, preserving the overall structure and high-curvature features. Then we develop a Delaunay meshing algorithm to generate meshes from vertices, ensuring geometric fidelity and correct topology. In addition, to obtain accurate SDFs for adaptive meshing and achieve better lightweight reconstruction, we design a hybrid representation combining a feature grid and a feature tri-plane for better detail capture. Experiments demonstrate that our method can generate high-quality lightweight meshes from point clouds. Compared with methods from various categories, our approach achieves superior results, especially in capturing more details with fewer elements.
Poster
Zhaiyu Chen · Yuqing Wang · Liangliang Nan · Xiao Xiang Zhu

[ ExHall D ]

Abstract
Existing polygonal surface reconstruction methods heavily depend on input completeness and struggle with incomplete point clouds. We argue that while current point cloud completion techniques may recover missing points, they are not optimized for polygonal surface reconstruction, where the parametric representation of underlying surfaces remains overlooked. To address this gap, we introduce parametric completion, a novel paradigm for point cloud completion, which recovers parametric primitives instead of individual points to convey high-level geometric structures. Our presented approach, PaCo, enables high-quality polygonal surface reconstruction by leveraging plane proxies that encapsulate both plane parameters and inlier points, proving particularly effective in challenging scenarios with highly incomplete data. Comprehensive evaluation of our approach on the ABC dataset establishes its effectiveness with superior performance and sets a new standard for polygonal surface reconstruction from incomplete data.
Poster
Aocheng Li · James R. Zimmer-Dauphinee · Rajesh Kalyanam · Ian Lindsay · Parker VanValkenburgh · Steven Wernke · Daniel Aliaga

[ ExHall D ]

Abstract
Point cloud completion helps restore partial incomplete point clouds suffering from occlusions. Current self-supervised methods fail to give high-fidelity completion for large objects with missing surfaces and unbalanced distribution of available points. In this paper, we present a novel method for restoring large-scale point clouds with limited and imbalanced ground-truth. Using rough boundary annotations for a region of interest, we project the original point clouds into a multiple-center-of-projection (MCOP) image, where fragments are projected to images of 5 channels (RGB, depth, and rotation). Completion of the original point cloud is reduced to inpainting the missing pixels in the MCOP images. Due to lack of complete structures and an unbalanced distribution of existing parts, we develop a self-supervised scheme which learns to infill the MCOP image with points resembling existing "complete" patches. Special losses are applied to further enhance the regularity and consistency of completed MCOP images, which are mapped back to 3D to form the final restoration. Extensive experiments demonstrate the superiority of our method in completing 600+ incomplete and unbalanced archaeological structures in Peru.
Poster
Kexue Fu · Ming'zhi Yuan · Changwei Wang · Weiguang Pang · Jing Chi · Manning Wang · Longxiang Gao

[ ExHall D ]

Abstract
Recently, coarse-to-fine methods for point cloud registration have achieved great success, but few works deeply explore the impact of feature interaction at both coarse and fine scales. By visualizing attention scores and correspondences, we find that existing methods fail to achieve effective feature aggregation at the two scales during the feature interaction. To tackle this issue, we propose a Dual Focus-Attention Transformer framework, which only focuses on points relevant to the current point for feature interaction, avoiding interactions with irrelevant points. For the coarse scale, we design a superpoint focus-attention transformer guided by sparse keypoints, which are selected from the neighborhood of superpoints. For the fine scale, we only perform feature interaction between the point sets that belong to the same superpoint. Experiments show that our method achieves state-of-the-art performance on three standard benchmarks. The code and pre-trained models will be available on GitHub.
Poster
Changhao Peng

[ ExHall D ]

Abstract
Gaussian and Laplacian entropy models have proved effective in learned point cloud attribute compression, as they assist in arithmetic coding of latents. However, we demonstrate through experiments that there is still unutilized information in entropy parameters estimated by neural networks in current methods, which can be used for more accurate probability estimation. Thus we introduce a generalized Gaussian entropy model, which controls the tail shape through a shape parameter to more accurately estimate the probability of latents. Meanwhile, to the best of our knowledge, existing methods use fixed likelihood intervals for each integer during arithmetic coding, which limits model performance. We propose a Mean Error Discriminator (MED) to determine whether the entropy parameter estimation is accurate and then dynamically adjust likelihood intervals. Experiments show that our method significantly improves rate-distortion (RD) performance on three VAE-based models for point cloud attribute compression, and our method can be applied to other compression tasks, such as image and video compression.
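A minimal sketch of what a generalized Gaussian entropy model computes: the probability mass assigned to each integer latent via CDF differences, with the shape parameter beta controlling the tails (beta=2 recovers the Gaussian, beta=1 the Laplacian). The unit interval width and the clipping constant are conventional choices from learned compression, not details taken from the paper.

```python
import numpy as np
from scipy.stats import gennorm

def integer_pmf(mu, sigma, beta, symbols):
    """Probability of each integer symbol under a generalized Gaussian with mean mu,
    scale sigma and shape beta, using likelihood intervals [s - 0.5, s + 0.5]."""
    upper = gennorm.cdf(symbols + 0.5, beta, loc=mu, scale=sigma)
    lower = gennorm.cdf(symbols - 0.5, beta, loc=mu, scale=sigma)
    return np.clip(upper - lower, 1e-9, 1.0)

symbols = np.arange(-8, 9)
for beta in (2.0, 1.0, 0.7):                       # heavier tails as beta decreases
    p = integer_pmf(mu=0.0, sigma=1.0, beta=beta, symbols=symbols)
    print(beta, "bits for symbol 4:", -np.log2(p[symbols == 4][0]))
```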
Poster
Yiran Wang · Jiaqi Li · Chaoyi Hong · Ruibo Li · Liusheng Sun · Xiao Song · Zhe Wang · Zhiguo Cao · Guosheng Lin

[ ExHall D ]

Abstract
Radar-Camera depth estimation aims to predict dense and accurate metric depth by fusing input images and Radar data. Model efficiency is crucial for this task in pursuit of real-time processing on autonomous vehicles and robotic platforms. However, due to the sparsity of Radar returns, the prevailing methods adopt multi-stage frameworks with intermediate quasi-dense depth, which are time-consuming and not robust. To address these challenges, we propose TacoDepth, an efficient and accurate Radar-Camera depth estimation model with one-stage fusion. Specifically, the graph-based Radar structure extractor and the pyramid-based Radar fusion module are designed to capture and integrate the graph structures of Radar point clouds, delivering superior model efficiency and robustness without relying on the intermediate depth results. Moreover, TacoDepth can be flexible for different inference modes, providing a better balance of speed and accuracy. Extensive experiments are conducted to demonstrate the efficacy of our method. Compared with the previous state-of-the-art approach, TacoDepth improves depth accuracy and processing speed by 12.8% and 91.8%. Our work provides a new perspective on efficient Radar-Camera depth estimation.
Poster
Dekai Zhu · Yan Di · Stefan Gavranovic · Slobodan Ilic

[ ExHall D ]

Abstract
Denoising diffusion probabilistic models have achieved significant success in point cloud generation, enabling numerous downstream applications, such as generative data augmentation and 3D model editing. However, little attention has been given to generating point clouds with point-wise segmentation labels, as well as to developing evaluation metrics for this task. Therefore, in this paper, we present SeaLion, a novel diffusion model designed to generate high-quality and diverse point clouds with fine-grained segmentation labels. Specifically, we introduce the semantic part-aware latent point diffusion technique, which leverages the intermediate features of the generative models to jointly predict the noise for perturbed latent points and associated part segmentation labels during the denoising process, and subsequently decodes the latent points to point clouds conditioned on part segmentation labels. To effectively evaluate the quality of generated point clouds, we introduce a novel point cloud pairwise distance calculation method named part-aware Chamfer distance (p-CD). This method enables existing metrics, such as 1-NNA, to measure both the local structural quality and inter-part coherence of generated point clouds. Experiments on the large-scale synthetic dataset ShapeNet and the real-world medical dataset IntrA demonstrate that SeaLion achieves remarkable performance in generation quality and diversity, outperforming the existing state-of-the-art model, DiffFacto, by 13.33% …
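One plausible reading of a part-aware Chamfer distance is sketched below: compute the Chamfer distance per shared part label and average, so that both local structure and inter-part correspondence enter the comparison. The paper's exact definition may differ, and the helper names are hypothetical.

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer(a, b):
    """Symmetric Chamfer distance between two point sets."""
    d_ab, _ = cKDTree(b).query(a)
    d_ba, _ = cKDTree(a).query(b)
    return d_ab.mean() + d_ba.mean()

def part_aware_chamfer(pts_a, lbl_a, pts_b, lbl_b):
    """Average the Chamfer distance over part labels present in both shapes."""
    parts = np.intersect1d(np.unique(lbl_a), np.unique(lbl_b))
    return np.mean([chamfer(pts_a[lbl_a == p], pts_b[lbl_b == p]) for p in parts])

a = np.random.rand(1024, 3); la = np.random.randint(0, 4, 1024)
b = np.random.rand(1024, 3); lb = np.random.randint(0, 4, 1024)
print(part_aware_chamfer(a, la, b, lb))
```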
Poster
Ali Bahri · Moslem Yazdanpanah · Mehrdad Noori · Sahar Dastani · Milad Cheraghalikhani · David OSOWIECHI · Gustavo Vargas Hakim · Farzad Beizaee · Ismail Ben Ayed · Christian Desrosiers

[ ExHall D ]

Abstract
State Space Models (SSMs) have shown significant promise in Natural Language Processing (NLP) and, more recently, computer vision. This paper introduces a new methodology leveraging Mamba and Masked Autoencoder (MAE) networks for point cloud data in both supervised and self-supervised learning. We propose three key contributions to enhance Mamba's capability in processing complex point cloud structures. First, we exploit the spectrum of a graph Laplacian to capture patch connectivity, defining an isometry-invariant traversal order that is robust to viewpoints and better captures shape manifolds than traditional 3D grid-based traversals. Second, we adapt segmentation via a recursive patch partitioning strategy informed by Laplacian spectral components, allowing finer integration and segment analysis. Third, we address token placement in MAE for Mamba by restoring tokens to their original positions, which preserves essential order and improves learning. Extensive experiments demonstrate our approach’s improvements in classification, segmentation, and few-shot tasks over state-of-the-art (SOTA) baselines.
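The first contribution can be illustrated with a generic spectral ordering: build a k-NN graph over patch centres, take its graph Laplacian, and sort patches by the Fiedler vector, so the traversal depends on intrinsic connectivity rather than a 3D grid. The sketch below assumes patch centres as input and is not the paper's exact construction.

```python
import numpy as np
from scipy.spatial import cKDTree

def spectral_traversal_order(centers, k=6):
    """Order patches by the Fiedler vector of a k-NN graph Laplacian."""
    n = len(centers)
    _, idx = cKDTree(centers).query(centers, k=k + 1)
    W = np.zeros((n, n))
    rows = np.repeat(np.arange(n), k)
    cols = idx[:, 1:].ravel()                  # skip the self-neighbour
    W[rows, cols] = 1.0
    W = np.maximum(W, W.T)                     # symmetrise the adjacency
    L = np.diag(W.sum(1)) - W                  # combinatorial graph Laplacian
    vals, vecs = np.linalg.eigh(L)
    fiedler = vecs[:, 1]                       # eigenvector of the 2nd-smallest eigenvalue
    return np.argsort(fiedler)

centers = np.random.rand(64, 3)                # e.g. 64 patch centres
print(spectral_traversal_order(centers)[:10])
```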
Poster
TANUJ SUR · Samrat Mukherjee · Kaizer Rahaman · Subhasis Chaudhuri · Muhammad Haris Khan · Biplab Banerjee

[ ExHall D ]

Abstract
3D point cloud segmentation is essential across a range of applications; however, conventional methods often struggle in evolving environments, particularly when tasked with identifying novel categories under limited supervision. Few-Shot Learning (FSL) and Class Incremental Learning (CIL) have been adapted previously to address these challenges in isolation, yet the combined paradigm of Few-Shot Class Incremental Learning (FSCIL) remains largely unexplored for point cloud segmentation. To address this gap, we introduce Hyperbolic Ideal Prototypes Optimization (HiPo), a novel framework that harnesses hyperbolic embeddings for FSCIL in 3D point clouds. HiPo employs the Poincaré Hyperbolic Sphere as its embedding space, integrating Ideal Prototypes enriched by CLIP-derived class semantics, to capture the hierarchical structure of 3D data. By enforcing orthogonality among prototypes and maximizing representational margins, HiPo constructs a resilient embedding space that mitigates forgetting and enables the seamless integration of new classes, thereby effectively countering overfitting. Extensive evaluations on S3DIS, ScanNetv2, and cross-dataset scenarios demonstrate HiPo's strong performance, significantly surpassing existing approaches in both in-domain and cross-dataset FSCIL tasks for 3D point cloud segmentation. Code will be released upon acceptance.
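For context, the geodesic distance in the Poincaré ball model, the quantity such hyperbolic embeddings are compared against, is computed below; the prototype values are toy numbers for illustration, not learned embeddings from HiPo.

```python
import numpy as np

def poincare_distance(u, v, eps=1e-7):
    """Geodesic distance in the Poincaré ball model of hyperbolic space;
    embeddings near the boundary (norm -> 1) become exponentially far apart."""
    uu = np.clip(np.sum(u * u, -1), 0, 1 - eps)
    vv = np.clip(np.sum(v * v, -1), 0, 1 - eps)
    duv = np.sum((u - v) ** 2, -1)
    return np.arccosh(1 + 2 * duv / ((1 - uu) * (1 - vv)))

proto = np.array([0.0, 0.9])      # a prototype pushed toward the boundary
x_near = np.array([0.05, 0.85])
x_far = np.array([0.0, -0.9])
print(poincare_distance(proto, x_near), poincare_distance(proto, x_far))
```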
Poster
Jianhui Zhang · Luo Yizhi · Zicheng Zhang · Xuecheng Nie · Bonan Li

[ ExHall D ]

Abstract
Local feature aggregation and global information perception are fundamental to point cloud segmentation. However, existing works often fall short in effectively identifying semantically relevant neighbors and face challenges in endowing each point with high-level information. Here, we propose CamPoint, an innovative method that employs virtual cameras to solve the above problems. The core of CamPoint lies in introducing a novel camera visibility feature for points, where each dimension encodes the visibility of that point from a specific camera. Leveraging this feature, we propose the camera perspective slice distance for accurate relevant-neighbor searching and design a camera parameter embedding to deliver rich feature representations for global interaction. Specifically, the camera perspective slice distance between two points is defined as a similarity metric derived from their camera visibility features, whereby an increased number of shared cameras observing both points corresponds to a reduced distance between them. To effectively facilitate global semantic perception, we assign each camera an optimizable embedding and then integrate these embeddings into the original spatial features based on visibility attributes, thereby obtaining high-level features enriched with camera priors. Additionally, a state space model with linear computational complexity is employed as the operator to achieve global learning with …
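The abstract defines the camera perspective slice distance only qualitatively (more shared observing cameras means a smaller distance). A minimal, hypothetical realization with binary visibility vectors is sketched below; the actual metric used in CamPoint may differ.

```python
import numpy as np

def camera_slice_distance(vis_a: np.ndarray, vis_b: np.ndarray, eps: float = 1e-7) -> float:
    """Hypothetical distance from binary visibility vectors (one entry per virtual camera):
    the more cameras see both points, the smaller the distance (a Jaccard-style measure)."""
    shared = np.logical_and(vis_a, vis_b).sum()
    union = np.logical_or(vis_a, vis_b).sum()
    return 1.0 - shared / (union + eps)

# Usage: the neighbors of a point are the k points with the smallest camera_slice_distance to it.
```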
Poster
Radu Berdan · Beril Besbinar · Christoph Reinders · Junji Otsuka · Daisuke Iso

[ ExHall D ]

Abstract
Edge-based computer vision models running on compact, resource-limited devices benefit greatly from using unprocessed, detail-rich RAW sensor data instead of processed RGB images. Training these models, however, necessitates large labeled RAW datasets, which are costly and often impractical to obtain. Thus, converting existing labeled RGB datasets into sensor-specific RAW images becomes crucial for effective model training. In this paper, we introduce ReRAW, an RGB-to-RAW conversion model that achieves state-of-the-art reconstruction performance across five diverse RAW datasets. This is accomplished through ReRAW's novel multi-head architecture, which predicts RAW image candidates in gamma space. The performance is further boosted by a stratified sampling-based training data selection heuristic, which helps the model better reconstruct brighter RAW pixels. We finally demonstrate that pretraining compact models on a combination of high-quality synthetic RAW datasets (such as those generated by ReRAW) and ground-truth RAW images for downstream tasks like object detection outperforms both standard RGB pipelines and RAW fine-tuning of RGB-pretrained models on the same task.
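Predicting RAW candidates in gamma space amounts to regressing a gamma-compressed version of the linear sensor signal, which allocates more numerical resolution to dark pixels. A minimal sketch of the encode/decode pair is given below; the gamma value of 2.2 is an assumption, not a detail taken from the paper.

```python
import numpy as np

def to_gamma_space(raw_linear: np.ndarray, gamma: float = 2.2) -> np.ndarray:
    """Compress linear RAW intensities (normalized to [0, 1]) so dark pixels get more resolution."""
    return np.clip(raw_linear, 0.0, 1.0) ** (1.0 / gamma)

def to_linear_space(raw_gamma: np.ndarray, gamma: float = 2.2) -> np.ndarray:
    """Invert the gamma encoding to recover linear RAW values."""
    return np.clip(raw_gamma, 0.0, 1.0) ** gamma

# A network predicting in gamma space would regress to_gamma_space(raw_gt) during training
# and be decoded with to_linear_space at inference time; gamma = 2.2 is an assumption here.
```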
Poster
Zhuochen Yu · Bijie Qiu · Andy W. H. Khong

[ ExHall D ]

Abstract
The sparsity of point clouds poses challenges to current LiDAR-only 3D object detection methods. Recently, methods that convert RGB images into virtual points via depth completion, to be fused with LiDAR points, have alleviated this issue. Although these methods can achieve outstanding results, they often introduce significant computational overhead due to the high density of virtual points, as well as noise due to inevitable errors in depth completion. At the same time, they do not fully leverage the semantic information from images. In this paper, we propose ViKIENet (Virtual Key Instance Enhanced Network), a highly efficient and effective multi-modal feature fusion framework that fuses the features of virtual key instances (VKIs) with those of LiDAR points in multiple stages. We observe that using only VKIs enhances detection performance to a level similar to using all virtual points. ViKIENet has three main components: Semantic Key Instance Selection (SKIS), Virtual Instance Focused Fusion (VIFF) and Virtual-Instance-to-Real Attention (VIRA). ViKIENet-R and VIFF-R are extended versions of ViKIENet and VIFF that include rotationally equivariant features. ViKIENet and ViKIENet-R achieve considerable improvements in detection performance on the KITTI, JRDB and nuScenes datasets. On the KITTI dataset, ViKIENet and ViKIENet-R run at 22.7 and 15.0 FPS, respectively. We …
Poster
Hala Djeghim · Nathan Piasco · Moussab Bennehar · Luis Guillermo Roldao Jimenez · Dzmitry Tsishkou · Désiré Sidibé

[ ExHall D ]

Abstract
Neural implicit surface representation methods have recently shown impressive 3D reconstruction results. However, existing solutions struggle to reconstruct driving scenes due to their large size, highly complex nature, and limited visual observation overlap. Hence, to achieve accurate reconstructions, additional supervision data such as LiDAR, strong geometric priors, and long training times are required. To tackle such limitations, we present ViiNeuS, a new hybrid implicit surface learning method that efficiently initializes the signed distance field to reconstruct large driving scenes from 2D street view images. ViiNeuS's hybrid architecture models two separate implicit fields: one representing the volumetric density of the scene, and another one representing the signed distance to the surface. To accurately reconstruct urban outdoor driving scenarios, we introduce a novel volume-rendering strategy that relies on self-supervised probabilistic density estimation to sample points near the surface and transition progressively from volumetric to surface representation. Our solution permits a proper and fast initialization of the signed distance field without relying on any geometric prior on the scene, compared to concurrent methods. By conducting extensive experiments on four outdoor driving datasets, we show that ViiNeuS can learn an accurate and detailed 3D surface scene representation in various driving scenarios while being two times faster to train compared …
Poster
Jichun Zhao · Haiyong Jiang · Haoxuan Song · Jun Xiao · Dong Gong

[ ExHall D ]

Abstract
Adapting pre-trained LiDAR segmentation models to dynamic domain shifts during testing is of paramount importance for the safety of autonomous driving. Most existing methods neglect the influence of domain changes on continual test-time adaptation (CTTA) and require backpropagation and large batch sizes for stable adaptation. We approach this problem with three insights: 1) the distance of a point to the LiDAR sensor is highly relevant to its local density; 2) the feature distribution of different domains varies, and domain-aware parameters can alleviate domain gaps; 3) features are highly correlated, making segmentation of different labels confusing. To this end, this work presents D^3CTTA, an online backpropagation-free framework for 3D continual test-time adaptation for LiDAR segmentation. D^3CTTA consists of a distance-aware prototype learning module that integrates a LiDAR-based geometry prior and a domain-dependent decorrelation module that reduces feature correlations among different domains and categories. Extensive experiments on three benchmarks show that our method achieves state-of-the-art performance compared to both backpropagation-based and backpropagation-free methods.
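One hedged way to picture the distance-aware design is to bin LiDAR points by their range from the sensor and keep separate prototypes per bin, since point density falls off with distance. The sketch below only shows this binning step; the bin count, maximum range, and the prototype update rule are illustrative assumptions rather than the paper's module.

```python
import numpy as np

def range_bins(points: np.ndarray, num_bins: int = 4, max_range: float = 50.0) -> np.ndarray:
    """Assign each LiDAR point (N, >=3) to a range bin so that prototypes can be maintained
    per bin, reflecting how local density changes with distance to the sensor."""
    r = np.linalg.norm(points[:, :3], axis=1)                 # distance of each point to the sensor
    edges = np.linspace(0.0, max_range, num_bins + 1)
    return np.clip(np.digitize(r, edges) - 1, 0, num_bins - 1)  # bin index in [0, num_bins)
```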
Poster
Alexey Nekrasov · Malcolm Burdorf · Stewart Worrall · Bastian Leibe · Julie Stephany Berrio Perez

[ ExHall D ]

Abstract
For safe operation, autonomous vehicles (AVs) must detect and handle unexpected objects or anomalies on the road. While anomaly detection and segmentation have been explored in 2D images, a gap remains for similar tasks in 3D LiDAR point clouds. Existing datasets lack high-quality multimodal data typically found in AVs. This paper presents a novel dataset for anomaly segmentation in driving scenarios. To the best of our knowledge, it is the first publicly available dataset focused on road anomaly segmentation with dense 3D semantic labeling, incorporating both LiDAR and camera data, as well as sequential information to enable anomaly detection across various ranges. This capability is critical for the safe navigation of autonomous vehicles. We adapted and benchmarked several baseline models for 3D segmentation, highlighting the challenges of 3D anomaly detection in driving environments. Our dataset and evaluation code will be openly accessible, facilitating testing and performance comparison across diverse approaches.
Poster
Daizong Liu · Wei Hu

[ ExHall D ]

Abstract
Deep learning models for 3D data have been shown to be vulnerable to adversarial attacks, which have received increasing attention in various safety-critical applications such as autonomous driving and robotic navigation. Existing 3D attackers mainly put effort into attacking simple 3D classification models by perturbing point cloud objects in the white/black-box setting. However, real-world 3D applications focus on tackling more complicated scene-based data while sharing no information about the model parameters and logits with users. Therefore, directly applying previous naive 3D attack methods to these applications does not work. To this end, this paper attempts to address the challenging hard-label 3D scene attack with access only to the input/output of the 3D models. To make the attack effective and stealthy, we propose to generate universal adversarial objects, which will mislead scene-aware 3D models to predict attacker-chosen labels whenever these objects are placed on any scene input. Specifically, we inject an imperceptible object trigger with further perturbations into all scenes and learn to mislead their reasoning by only querying the 3D model. We start by initializing the trigger pattern with a realistic object and searching for an appropriate location to place it naturally in the scene data. Then, we design a …
Poster
Houzhang Fang · Xiaolin Wang · Zengyang Li · Lu Wang · Qingshan Li · Yi Chang · Luxin Yan

[ ExHall D ]

Abstract
Infrared unmanned aerial vehicle (UAV) images captured using thermal detectors are often affected by temperature-dependent low-frequency nonuniformity, which significantly reduces the contrast of the images. Detecting UAV targets under nonuniform conditions is crucial in UAV surveillance applications. Existing methods typically treat infrared nonuniformity correction (NUC) as a preprocessing step for detection, which leads to suboptimal performance. Balancing the two tasks while enhancing detection-beneficial information remains challenging. In this paper, we present a detection-friendly union framework, termed UniCD, that simultaneously addresses both infrared NUC and UAV target detection tasks in an end-to-end manner. We first model NUC as the estimation of a small number of parameters, jointly driven by priors and data, to generate detection-conducive images. Then, we incorporate a new auxiliary loss with target mask supervision into the backbone of the infrared UAV target detection network to strengthen target features while suppressing the background. To better balance correction and detection, we introduce a detection-guided self-supervised loss to reduce feature discrepancies between the two tasks, thereby enhancing detection robustness to varying nonuniformity levels. Additionally, we construct a new benchmark, called IRBFD, composed of 50,000 infrared images with target annotations, covering various nonuniformity types, multi-scale UAV targets, and rich backgrounds. Extensive experiments on …
Poster
Shihang Du · Sanqing Qu · Tianhang Wang · Xudong Zhang · Yunwei Zhu · Jian Mao · Fan Lu · Qiao Lin · Guang Chen

[ ExHall D ]

Abstract
Collaborative perception enhances single-vehicle perception by integrating sensory data from multiple connected vehicles. However, existing studies often assume ideal conditions, overlooking resilience to real-world challenges, such as adverse weather and sensor malfunctions, which is critical for safe deployment. To address this gap, we introduce RCP-Bench, the first comprehensive benchmark designed to evaluate the robustness of collaborative detection models under a wide range of real-world corruptions. RCP-Bench includes three new datasets (i.e., OPV2V-C, V2XSet-C, and DAIR-V2X-C) that simulate six collaborative cases and 14 types of camera corruption resulting from external environmental factors, sensor failures, and temporal misalignments. Extensive experiments on 10 leading collaborative perception models reveal that, while these models perform well under ideal conditions, they are significantly affected by corruptions. To improve robustness, we propose two simple yet effective strategies, RCP-Drop and RCP-Mix, based on training regularization and feature augmentation. Additionally, we identify several critical factors influencing robustness, such as backbone architecture, camera number, feature fusion methods, and the number of connected vehicles. We hope that RCP-Bench, along with these strategies and insights, will stimulate future research toward developing more robust collaborative perception models. Our benchmark toolkit will be made public.
Poster
Jiahui Fu · Yue Gong · Luting Wang · Shifeng Zhang · Xu Zhou · Si Liu

[ ExHall D ]

Abstract
Collaborative perception aims to address the constraints of single-agent perception by exchanging information among multiple agents. Previous works primarily focus on collaborative object detection, exploring compressed transmission and fusion prediction tailored to sparse object features. However, these strategies are not well-suited for the dense features used in collaborative BEV semantic segmentation. Therefore, we propose CoGMP, a novel Collaborative framework that leverages Generative Map Priors to achieve efficient compression and robust fusion. CoGMP introduces two key innovations: Element Format Feature Compression (EFFC) and Structure Guided Feature Fusion (SGFF). Specifically, EFFC leverages map element priors from a codebook to encode BEV features as discrete element indices, compressing the transmitted information. Meanwhile, SGFF utilizes a diffusion model with structural priors to coherently integrate multi-agent features, thereby achieving consistent fusion predictions. Evaluations on the OPV2V dataset show that CoGMP achieves a 6.89/7.64 Road/Lane IoU improvement and a 32-fold reduction in communication volume. The code can be found in the supplementary materials.
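Encoding BEV features as discrete element indices is, at its core, a vector-quantization step against a shared codebook, so only integer indices need to be transmitted between agents. The sketch below illustrates that idea under the assumption of a pre-learned codebook; it is a simplified stand-in, not the paper's EFFC module.

```python
import numpy as np

def encode_to_indices(features: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each feature vector (N, C) to the index of its nearest codebook entry (K, C).
    Only these integer indices need to be transmitted between agents."""
    d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=-1)  # (N, K)
    return d.argmin(axis=1).astype(np.int32)

def decode_from_indices(indices: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Reconstruct approximate features on the receiving agent from the shared codebook."""
    return codebook[indices]
```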
Poster
Xiyue Guo · Jiarui Hu · Junjie Hu · Hujun Bao · Guofeng Zhang

[ ExHall D ]

Abstract
Recently, camera-based solutions have been extensively explored for semantic scene completion (SSC). Despite their success in visible areas, existing methods struggle to capture complete scene semantics due to frequent visual occlusions. To address this limitation, this paper presents the first satellite-ground cooperative SSC framework, i.e., SGFormer, exploring the potential of satellite-ground image pairs in the SSC task. Specifically, we propose a dual-branch architecture that encodes orthogonal satellite and ground views in parallel, unifying them into a common domain. Additionally, we design a ground-view guidance strategy that pre-corrects satellite image biases during feature encoding, addressing misalignment between satellite and ground views. Moreover, we develop an adaptive weighting strategy that balances contributions from satellite and ground views. Experiments demonstrate that SGFormer outperforms the state of the art on the SemanticKITTI and SSCBench-KITTI-360 datasets. We will make our source code publicly available soon.
Poster
Jongseong Bae · Junwoo Ha · Ha Young Kim

[ ExHall D ]

Abstract
Camera-based Semantic Scene Completion (SSC) is gaining attention in the 3D perception field. However, properties such as perspective and occlusion lead to the underestimation of geometry in distant regions, posing a critical issue for safety-focused autonomous driving systems. To tackle this, we propose ScanSSC, a novel camera-based SSC model composed of a Scan Module and a Scan Loss, both designed to enhance distant scenes by leveraging context from near-viewpoint scenes. The Scan Module uses axis-wise masked attention, where each axis employs near-to-far cascade masking that enables distant voxels to capture relationships with preceding voxels. In addition, the Scan Loss computes the cross-entropy along each axis between cumulative logits and the corresponding class distributions in a near-to-far direction, thereby propagating rich context-aware signals to distant voxels. Leveraging the synergy between these components, ScanSSC achieves state-of-the-art performance, with IoUs of 44.54 and 48.29, and mIoUs of 17.40 and 20.14, on the SemanticKITTI and SSCBench-KITTI-360 benchmarks.
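A minimal sketch of near-to-far cascade masking along one axis is shown below: voxels ordered from near to far may attend only to positions at or before their own depth bin, so distant voxels inherit context from nearer ones. How the mask is wired into the attention layers and combined across the three axes is omitted, and the boolean-mask form is an assumption.

```python
import numpy as np

def near_to_far_mask(num_bins: int) -> np.ndarray:
    """Boolean attention mask along one spatial axis ordered from near (index 0) to far:
    position i may attend to positions j <= i, so distant voxels see all nearer context."""
    return np.tril(np.ones((num_bins, num_bins), dtype=bool))

# Example: a query at depth bin 5 can attend to bins 0..5 but not to farther bins.
mask = near_to_far_mask(8)
assert mask[5, :6].all() and not mask[5, 6:].any()
```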
Poster
Heng Li · Yuenan Hou · Xiaohan Xing · Yuexin Ma · Xiao Sun · Yanyong Zhang

[ ExHall D ]

Abstract
Training deep learning models for semantic occupancy prediction is challenging due to factors such as the large number of occupancy cells, severe occlusion, limited visual cues, and complicated driving scenarios. Recent methods often adopt transformer-based architectures given their strong capability in learning input-conditioned weights and long-range relationships. However, transformer-based networks are notorious for their quadratic computational complexity, seriously undermining their efficacy and deployment in semantic occupancy prediction. Inspired by the global modeling and linear computational complexity of the Mamba architecture, we present the first Mamba-based network for semantic occupancy prediction, termed OccMamba. Specifically, we first design the hierarchical Mamba module and local context processor to better aggregate global and local contextual information, respectively. In addition, to bridge the inherent domain gap between the linguistic and 3D domains, we present a simple yet effective 3D-to-1D reordering scheme, i.e., height-prioritized 2D Hilbert expansion. It can maximally retain the spatial structure of 3D voxels as well as facilitate the processing of Mamba blocks. Endowed with the aforementioned designs, our OccMamba is capable of directly and efficiently processing large volumes of dense scene grids, achieving state-of-the-art performance across three prevalent occupancy prediction benchmarks, including OpenOccupancy, SemanticKITTI, and SemanticPOSS. Notably, on OpenOccupancy, our OccMamba outperforms the …
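The height-prioritized 2D Hilbert expansion can be pictured as visiting each horizontal (x, y) slice along a Hilbert curve and stacking the slices over height, so spatially adjacent voxels stay close in the 1D token sequence fed to Mamba. The sketch below uses the textbook Hilbert index-to-coordinate routine and assumes a power-of-two square grid; whether height is the outer or inner loop in OccMamba is an assumption made here for illustration.

```python
import numpy as np

def hilbert_d2xy(n: int, d: int) -> tuple:
    """Map a Hilbert-curve index d to (x, y) on an n x n grid (n a power of two)."""
    x = y = 0
    t, s = d, 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                       # rotate/flip the s x s sub-square
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def height_prioritized_hilbert_order(n_xy: int, n_z: int) -> np.ndarray:
    """Flatten an (n_xy, n_xy, n_z) voxel grid to 1D: for each height slice, visit the
    (x, y) plane along a Hilbert curve so nearby voxels stay nearby in the sequence."""
    plane = [hilbert_d2xy(n_xy, d) for d in range(n_xy * n_xy)]
    order = [(x, y, z) for z in range(n_z) for (x, y) in plane]
    return np.array(order)                # (n_xy * n_xy * n_z, 3) coordinates in traversal order
```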
Poster
Haoyi Jiang · Liu Liu · Tianheng Cheng · Xinjie wang · Tianwei Lin · Zhizhong Su · Wenyu Liu · Xinggang Wang

[ ExHall D ]

Abstract
3D Semantic Occupancy Prediction is pivotal for spatial understanding as it provides a comprehensive semantic cognition of surrounding environments. However, prevalent approaches primarily rely on extensive labeled data and computationally intensive voxel-based modeling, restricting the scalability and generalizability of 3D representation learning. In this paper, we introduce GaussTR, a novel Gaussian Transformer that aligns with foundation models to enhance self-supervised 3D spatial understanding. GaussTR adopts a Transformer architecture to predict sparse sets of 3D Gaussians representing scenes in a feed-forward manner. Through the alignment of rendered Gaussian features with diverse knowledge from pre-trained foundation models, GaussTR facilitates the learning of versatile 3D representations, thereby enabling open-vocabulary occupancy prediction without explicit annotations. Empirical evaluations on the Occ3D-nuScenes dataset demonstrate GaussTR's state-of-the-art zero-shot performance, achieving 11.70 mIoU while reducing training duration by approximately 50%. These results highlight the significant potential of GaussTR for advancing scalable and holistic 3D spatial understanding, with promising implications for autonomous driving and embodied agents. The code will be made publicly available in due course.
Poster
Bohan Li · Jiazhe Guo · Hongsi Liu · Yingshuang Zou · Yikang Ding · Xiwu Chen · Hu ZHU · Feiyang Tan · Chi Zhang · Tiancai Wang · Shuchang Zhou · Li Zhang · Xiaojuan Qi · Hao Zhao · Mu Yang · Wenjun Zeng · Xin Jin

[ ExHall D ]

Abstract
Generating high-fidelity, controllable, and annotated training data is critical for autonomous driving. Existing methods typically generate a single data form directly from a coarse scene layout, which not only fails to output rich data forms required for diverse downstream tasks but also struggles to model the direct layout-to-data distribution. In this paper, we introduce UniScene, the first unified framework for generating three key data forms — semantic occupancy, video, and LiDAR — in driving scenes. UniScene employs a progressive generation process that decomposes the complex task of scene generation into two hierarchical steps: (a) first generating semantic occupancy from a customized scene layout as a meta scene representation rich in both semantic and geometric information, and then (b) conditioned on occupancy, generating video and LiDAR data, respectively, with two novel transfer strategies of Gaussian-based Joint Rendering and Prior-guided Sparse Modeling. This occupancy-centric approach reduces the generation burden, especially for intricate scenes, while providing detailed intermediate representations for the subsequent generation stages. Extensive experiments demonstrate that UniScene outperforms previous SOTAs in the occupancy, video, and LiDAR generation, which also indeed benefits downstream driving tasks. The code is available in the supplementary.
Poster
Georg Hess · Carl Lindström · Maryam Fatemi · Christoffer Petersson · Lennart Svensson

[ ExHall D ]

Abstract
Ensuring the safety of autonomous robots, such as self-driving vehicles, requires extensive testing across diverse driving scenarios. Simulation is a key ingredient for conducting such testing in a cost-effective and scalable way. Neural rendering methods have gained popularity, as they can build simulation environments from collected logs in a data-driven manner. However, existing neural radiance field (NeRF) methods for sensor-realistic rendering of camera and lidar data suffer from low rendering speeds, limiting their applicability for large-scale testing. While 3D Gaussian Splatting (3DGS) enables real-time rendering, current methods are limited to camera data and are unable to render lidar data essential for autonomous driving. To address these limitations, we propose SplatAD, the first 3DGS-based method for realistic, real-time rendering of dynamic scenes for both camera and lidar data. SplatAD accurately models key sensor-specific phenomena such as rolling shutter effects, lidar intensity, and lidar ray dropouts, using purpose-built algorithms to optimize rendering efficiency. Evaluation across three autonomous driving datasets demonstrates that SplatAD achieves state-of-the-art rendering quality with up to +2 PSNR for NVS and +3 PSNR for reconstruction while increasing rendering speed over NeRF-based methods by an order of magnitude. Code to be released upon publication.
Poster
Katrin Renz · Long Chen · Elahe Arani · Oleg Sinavski

[ ExHall D ]

Abstract
Integrating large language models (LLMs) into autonomous driving has attracted significant attention with the hope of improving generalization and explainability. However, existing methods often focus on either driving or vision-language understanding, but achieving both high driving performance and extensive language understanding remains challenging. In addition, the dominant approach to tackling vision-language understanding is visual question answering. However, for autonomous driving, this is only useful if it is grounded in the action space. Otherwise, the model's answers could be inconsistent with its behavior. Therefore, we propose a model that can handle three different tasks: (1) closed-loop driving, (2) vision-language understanding, and (3) language-action alignment. Our model, SimLingo, is based on a vision-language model (VLM) and works using only cameras, excluding expensive sensors such as LiDAR. SimLingo obtains state-of-the-art performance on the widely used CARLA simulator on the Leaderboard 2.0 and the Bench2Drive benchmarks. Additionally, we achieve strong results in a wide variety of language-related tasks while maintaining high driving performance. We will release code, data and models upon acceptance.
Poster
Lue Fan · Hao ZHANG · Qitai Wang · Hongsheng Li · Zhaoxiang Zhang

[ ExHall D ]

Abstract
We propose FreeSim, a camera simulation method for driving scenes. FreeSim emphasizes high-quality rendering from viewpoints beyond the recorded ego trajectories. In such viewpoints, previous methods have unacceptable degradation because the training data of these viewpoints is unavailable. To address such data scarcity, we first propose a generative enhancement model with a matched data construction strategy. The resulting model can generate high-quality images in a viewpoint slightly deviated from the recorded trajectories, conditioned on the degraded rendering of this viewpoint. We then propose a progressive reconstruction strategy, which progressively adds generated images in unrecorded views into the reconstruction process, starting from slightly off-trajectory viewpoints and moving progressively farther away. With this progressive generation-reconstruction pipeline, FreeSim supports high-quality off-trajectory view synthesis under large deviations of more than 3 meters.
Poster
Guosheng Zhao · Chaojun Ni · Xiaofeng Wang · Zheng Zhu · Xueyang Zhang · Yida Wang · Guan Huang · xinze chen · Boyuan Wang · Youyi Zhang · Wenjun Mei · Xingang Wang

[ ExHall D ]

Abstract
Closed-loop simulation is essential for advancing end-to-end autonomous driving systems. Contemporary sensor simulation methods, such as NeRF and 3DGS, rely predominantly on conditions closely aligned with training data distributions, which are largely confined to forward-driving scenarios. Consequently, these methods face limitations when rendering complex maneuvers (e.g., lane change, acceleration, deceleration). Recent advancements in autonomous-driving world models have demonstrated the potential to generate diverse driving videos. However, these approaches remain constrained to 2D video generation, inherently lacking the spatiotemporal coherence required to capture the intricacies of dynamic driving environments. In this paper, we introduce DriveDreamer4D, which enhances 4D driving scene representation by leveraging world model priors. Specifically, we utilize the world model as a data machine to synthesize novel trajectory videos, where structured conditions are explicitly leveraged to control the spatial-temporal consistency of traffic elements. In addition, a cousin data training strategy is proposed to facilitate merging real and synthetic data when optimizing 4DGS. To our knowledge, DriveDreamer4D is the first to utilize video generation models for improving 4D reconstruction in driving scenarios. Experimental results reveal that DriveDreamer4D significantly enhances generation quality under novel trajectory views, achieving relative FID improvements of 32.1%, 46.4%, and 16.3% compared to PVG, S3Gaussian, and Deformable-GS. Moreover, DriveDreamer4D markedly …
Poster
Tai-Yu Daniel Pan · Sooyoung Jeon · Mengdi Fan · Jinsu Yoo · Zhenyang Feng · Mark Campbell · Kilian Q Weinberger · Bharath Hariharan · Wei-Lun Chao

[ ExHall D ]

Abstract
Self-driving cars relying solely on ego-centric perception face limitations in sensing, often failing to detect occluded, faraway objects. Collaborative autonomous driving (CAV) seems like a promising direction, but collecting data for development is non-trivial. It requires placing multiple sensor-equipped agents in a real-world driving scene, simultaneously! As such, existing datasets are limited in locations and agents. We introduce a novel surrogate to the rescue: generating realistic perception from different viewpoints in a driving scene, conditioned on a real-world sample, the ego-car's sensory data. This surrogate has huge potential: it could turn any ego-car dataset into a collaborative driving one, scaling up the development of CAV. We present the very first solution, using a combination of synthetic collaborative data and real ego-car data. Our method, Transfer Your Perspective (TYP), learns a conditioned diffusion model whose output samples are not only realistic but also consistent in both semantics and layouts with the given ego-car data. Empirical results demonstrate TYP's effectiveness in a CAV setting. In particular, TYP enables us to (pre-)train collaborative perception algorithms like early and late fusion with little or no real-world collaborative data, greatly facilitating downstream CAV applications.
Poster
Bencheng Liao · Shaoyu Chen · haoran yin · Bo Jiang · Cheng Wang · Sixu Yan · xinbang zhang · Xiangyu Li · ying zhang · Qian Zhang · Xinggang Wang

[ ExHall D ]

Abstract
Recently, the diffusion model has emerged as a powerful generative technique for robotic policy learning, capable of modeling multi-mode action distributions. Leveraging its capability for end-to-end autonomous driving is a promising direction. However, the numerous denoising steps in the robotic diffusion policy and the more dynamic, open-world nature of traffic scenes pose substantial challenges for generating diverse driving actions at real-time speed. To address these challenges, we propose a novel truncated diffusion policy that incorporates prior multi-mode anchors and truncates the diffusion schedule, enabling the model to learn denoising from an anchored Gaussian distribution to the multi-mode driving action distribution. Additionally, we design an efficient cascade diffusion decoder for enhanced interaction with conditional scene context. The proposed model, DiffusionDrive, demonstrates a 10× reduction in denoising steps compared to the vanilla diffusion policy, delivering superior diversity and quality in just 2 steps. On the planning-oriented NAVSIM dataset, with the aligned ResNet-34 backbone, DiffusionDrive achieves 88.1 PDMS without bells and whistles, setting a new record, while running at a real-time speed of 45 FPS on an NVIDIA 4090. Qualitative results on challenging scenarios further confirm that DiffusionDrive can robustly generate diverse plausible driving actions. Code and model will be available for future research.
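Conceptually, a truncated diffusion policy starts sampling from anchor trajectories perturbed by modest Gaussian noise and runs only a couple of denoising steps, rather than denoising from pure noise. The sketch below illustrates that idea with a hypothetical denoiser callable and a two-level noise schedule; it is a simplified stand-in under stated assumptions, not DiffusionDrive's actual sampler or cascade decoder.

```python
import numpy as np

def truncated_diffusion_sample(denoiser, anchors: np.ndarray, scene_ctx,
                               sigmas=(0.5, 0.1)) -> np.ndarray:
    """Sample driving trajectories by denoising from anchor-centred Gaussians instead of
    pure noise, using a short truncated schedule (here 2 steps).
    `denoiser(x, sigma, scene_ctx)` is a hypothetical trained network that predicts clean
    trajectories from noisy ones; `anchors` are prior multi-mode trajectories (M, T, 2)."""
    x = anchors + sigmas[0] * np.random.randn(*anchors.shape)      # start near the anchors
    for i, sigma in enumerate(sigmas):
        x0_pred = denoiser(x, sigma, scene_ctx)                    # predict clean trajectories
        next_sigma = sigmas[i + 1] if i + 1 < len(sigmas) else 0.0
        # re-noise toward the next, smaller noise level; the final step returns the prediction
        x = x0_pred + next_sigma * np.random.randn(*anchors.shape)
    return x
```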
Poster
Zhiying Song · Lei Yang · Fuxi Wen · Jun Li

[ ExHall D ]

Abstract
Cooperative perception presents significant potential for enhancing the sensing capabilities of individual vehicles; however, inter-agent latency remains a critical challenge. Latencies cause misalignments in both spatial and semantic features, complicating the fusion of real-time observations from the ego vehicle with delayed data from others. To address these issues, we propose TraF-Align, a novel framework that learns the flow path of features by predicting the feature-level trajectory of objects from past observations up to the ego vehicle's current time. By generating temporally ordered sampling points along these paths, TraF-Align directs attention from the current-time query to relevant historical features along each trajectory, supporting the reconstruction of current-time features and promoting semantic interaction across multiple frames. This approach corrects spatial misalignment and ensures semantic consistency across agents, effectively compensating for motion and achieving coherent feature fusion. Experiments on real-world datasets, V2V4Real and DAIR-V2X-Seq, show that TraF-Align sets a new benchmark for asynchronous cooperative perception. Notably, our method shows minimal average precision (AP50) drops of only 4.87% and 5.68% at 400 ms latency on the two datasets, respectively.
Poster
Yizhou Huang · Yihua Cheng · Kezhi Wang

[ ExHall D ]

Abstract
Motion prediction is crucial for autonomous driving systems, as it enables accurate forecasting of future vehicle trajectories based on historical motion data. This paper introduces Trajectory Mamba (Tamba), a novel, efficient trajectory prediction framework based on the selective state space model (SSM). Conventional attention-based models face the challenge of computational costs that grow quadratically with the number of targets, hindering their application in highly dynamic environments. To address this, Tamba leverages the SSM module to redesign the self-attention mechanism in the encoder-decoder architecture, thereby achieving linear time complexity. To address the potential reduction in prediction accuracy resulting from modifications to the attention mechanism, we propose a joint polyline encoding strategy to better capture the associations between static and dynamic contexts, ultimately enhancing prediction accuracy. In addition, to achieve a better balance between prediction accuracy and inference speed, we adopt a decoder structure that differs entirely from the encoder. Through cross-state space attention, all target agents share the scene context, allowing the SSM to interact with the shared scene representation during decoding, thus inferring different trajectories over the next prediction steps. Our model achieves state-of-the-art (SOTA) results in terms of inference speed and parameter efficiency on both the Argoverse 1 and Argoverse …
Poster
Xuesong Chen · Linjiang Huang · Tao Ma · Rongyao Fang · Shaoshuai Shi · Hongsheng Li

[ ExHall D ]

Abstract
The integration of Vision-Language Models (VLMs) into autonomous driving systems has shown promise in addressing key challenges such as learning complexity, interpretability, and common-sense reasoning. However, existing approaches often struggle with efficient integration and real-time decision-making due to computational demands. In this paper, we introduce SOLVE, an innovative framework that synergizes VLMs with end-to-end (E2E) models to enhance autonomous vehicle planning. Our approach emphasizes knowledge sharing at the feature level through a shared visual encoder, enabling comprehensive interaction between VLM and E2E components. We propose a Trajectory Chain-of-Thought (T-CoT) paradigm, which progressively refines trajectory predictions, reducing uncertainty and improving accuracy. By employing a temporal decoupling strategy, SOLVE achieves efficient asynchronous cooperation, aligning high-quality VLM outputs with E2E real-time performance. Evaluated on the nuScenes dataset, our method demonstrates significant improvements in trajectory prediction accuracy, paving the way for more robust and interpretable autonomous driving systems.
Poster
Xinshuai Song · weixing chen · Yang Liu · Weikai Chen · Guanbin Li · Liang Lin

[ ExHall D ]

Abstract
Existing Vision-Language Navigation (VLN) methods primarily focus on single-stage navigation, limiting their effectiveness in multi-stage and long-horizon tasks within complex and dynamic environments. To address these limitations, we propose a novel VLN task, named Long-Horizon Vision-Language Navigation (LH-VLN), which emphasizes long-term planning and decision consistency across consecutive subtasks. Furthermore, to support LH-VLN, we develop an automated data generation platform, NavGen, which constructs datasets with complex task structures and improves data utility through a bidirectional, multi-granularity generation approach. To accurately evaluate complex tasks, we construct the Long-Horizon Planning and Reasoning in VLN (LHPR-VLN) benchmark, consisting of 3,260 tasks with an average of 150 task steps, serving as the first dataset specifically designed for the long-horizon vision-language navigation task. In addition, we propose Independent Success Rate (ISR), Conditional Success Rate (CSR), and CSR weighted by Ground Truth (CGT) metrics to provide fine-grained assessments of task completion. To improve model adaptability in complex tasks, we propose a novel Multi-Granularity Dynamic Memory (MGDM) module that integrates short-term memory blurring with long-term memory retrieval to enable flexible navigation in dynamic environments. Our platform, benchmark and method supply LH-VLN with a robust data generation pipeline, comprehensive model evaluation dataset, reasonable metrics, and a novel VLN model, establishing …
Poster
Zhi-Yuan Zhang · Xiaofan Li · Zhihao Xu · Wenjie Peng · Zijian Zhou · Miaojing Shi · Shuangping Huang

[ ExHall D ]

Abstract
Autonomous driving visual question answering (AD-VQA) aims to answer questions related to perception, prediction, and planning based on given driving scene images, heavily relying on the model's spatial perception capabilities. Previous works typically express spatial comprehension through textual representations of spatial coordinates, resulting in semantic gaps between visual coordinate representations and textual descriptions. This oversight hinders the accurate transmission of spatial information and increases the expressive burden. To address this, we propose the Marker-based Prompt Learning framework (MPDrive), which transforms spatial coordinates into concise visual markers, ensuring linguistic consistency and enhancing the accuracy of visual perception and spatial expression in AD-VQA. Specifically, MPDrive converts complex spatial coordinates into text-based visual marker predictions, simplifying the expression of spatial information for autonomous decision-making. Moreover, we introduce visual marker images as conditional inputs and integrate object-level fine-grained features to further enhance multi-level spatial perception abilities. Extensive experiments on the DriveLM and CODA-LM datasets show that MPDrive performs at state-of-the-art levels, particularly in cases requiring sophisticated spatial understanding.
Poster
Hao Ren · Yiming Zeng · Zetong Bi · Zhaoliang Wan · Junlong Huang · Hui Cheng

[ ExHall D ]

Abstract
Recent advancements in diffusion-based imitation learning, which shows impressive performance in modeling multimodal distributions and training stability, have led to substantial progress in various robot learning tasks. In visual navigation, previous diffusion-based policies typically generate action sequences by denoising from Gaussian noise. However, the target action distribution often diverges significantly from Gaussian noise, leading to redundant denoising steps and increased learning complexity. Additionally, the sparsity of effective action distributions makes it challenging for the policy to generate accurate actions without guidance. To address these issues, we propose NaviBridger, a novel, unified visual navigation framework that leverages denoising diffusion bridge models. This approach enables action generation by initiating from any informative prior actions, enhancing guidance and efficiency in the denoising process. We explore how diffusion bridges can enhance imitation learning in visual navigation tasks and further examine three source policies for generating prior actions. Extensive experiments in both simulated and real-world indoor and outdoor scenarios demonstrate that NaviBridger accelerates policy inference and outperforms the baselines in generating target action sequences. Minimal implementation code is available in the supplementary materials.
Poster
Steeven JANNY · Hervé Poirier · Leonid Antsfeld · Guillaume Bono · Gianluca Monaci · Boris Chidlovskii · Francesco Giuliari · Alessio Del Bue · Christian Wolf

[ ExHall D ]

Abstract
Progress in Embodied AI has made it possible for end-to-end-trained agents to navigate in photo-realistic environments with high-level reasoning and zero-shot or language-conditioned behavior, but evaluations and benchmarks are still dominated by simulation. In this work, we focus on the fine-grained behavior of fast-moving real robots and present a large-scale experimental study involving \numepisodes{} navigation episodes in a real environment with a physical robot, where we analyze the type of reasoning emerging from end-to-end training. In particular, we study the presence of realistic dynamics that the agent has learned for open-loop forecasting, and their interplay with sensing. We analyze the way the agent uses latent memory to hold elements of the scene structure and information gathered during exploration. We probe the planning capabilities of the agent, and find in its memory evidence for somewhat precise plans over a limited horizon. Furthermore, we show in a post-hoc analysis that the value function learned by the agent relates to long-term planning. Put together, our experiments paint a new picture of how using tools from computer vision and sequential decision making has led to new capabilities in robotics and control. An interactive tool is available at https://visual-navigation-reasoning.github.io
Poster
Shaofei Cai · Zihao Wang · Kewei Lian · Zhancun Mu · Xiaojian Ma · Anji Liu · Yitao Liang

[ ExHall D ]

Abstract
Vision-language models (VLMs) have excelled in multimodal tasks, but adapting them to embodied decision-making in open-world environments presents challenges. One critical issue is bridging the gap between discrete entities in low-level observations and the abstract concepts required for effective planning. A common solution is building hierarchical agents, where VLMs serve as high-level reasoners that break down tasks into executable sub-tasks, typically specified using language. However, language suffers from the inability to communicate detailed spatial information. We propose visual-temporal context prompting, a novel communication protocol between VLMs and policy models. This protocol leverages object segmentation from past observations to guide policy-environment interactions. Using this approach, we train ROCKET-1, a low-level policy that predicts actions based on concatenated visual observations and segmentation masks, supported by real-time object tracking from SAM-2. Our method unlocks the potential of VLMs, enabling them to tackle complex tasks that demand spatial reasoning. Experiments in Minecraft show that our approach enables agents to achieve previously unattainable tasks, with a 76% absolute improvement in open-world interaction performance. Codes and demos will be released.
Poster
Can Zhang · Gim Hee Lee

[ ExHall D ]

Abstract
This work presents IAAO, a novel framework that builds an explicit 3D model for intelligent agents to gain understanding of articulated objects in their environment through interaction. Unlike prior methods that rely on task-specific networks and assumptions about movable parts, our IAAO leverages large foundation models to estimate interactive affordances and part articulations in three stages. We first build hierarchical features and label fields for each state using 3D Gaussian Splatting (3DGS) by distilling mask features and view-consistent labels from multi-view images. We then perform object- and part-level queries on the 3D Gaussian primitives to identify static and articulated elements, estimating global transformations and local articulation parameters along with affordances. Finally, scenes from different states are merged and refined based on the estimated transformations, enabling robust affordance-based interaction and manipulation of objects. Experimental results demonstrate the effectiveness of our method. We will make our code open-source upon paper acceptance.
Poster
Xin Wen · Bingchen Zhao · Yilun Chen · Jiangmiao Pang · Xiaojuan Qi

[ ExHall D ]

Abstract
Pre-trained vision models (PVMs) are fundamental to modern robotics, yet their optimal configuration remains unclear. Through systematic evaluation, we find that while DINO and iBOT outperform MAE across visuomotor control and perception tasks, they struggle when trained on non-(single-)object-centric (NOC) data, a limitation strongly correlated with their diminished ability to learn object-centric representations. This investigation indicates that the ability to form object-centric representations from non-object-centric robotics data is key to the success of PVMs. Motivated by this discovery, we designed SlotMIM, a method that induces object-centric representations by introducing a semantic bottleneck that reduces the number of prototypes to encourage the emergence of objectness, together with cross-view consistency regularization that encourages multi-view invariance. Our experiments encompass pre-training on object-centric, scene-centric, web-crawled, and ego-centric data. Across all settings, our approach learns transferable representations and achieves significant improvements over prior work in image recognition, scene understanding, and robot learning evaluations. When scaled up with million-scale datasets, our method also demonstrates superior data efficiency and scalability. We will make our code and model artifacts publicly available.
Poster
Yanbang Li · ZiYang Gong · Haoyang Li · Xiaoqi Huang · Haolan Kang · Guangpingbai · Xianzheng Ma

[ ExHall D ]

Abstract
Recently, natural language has been the primary medium for human-robot interaction. However, its inherent lack of spatial precision for robotic control introduces challenges such as ambiguity and verbosity. To address these limitations, we introduce the Robotic Visual Instruction (RoVI), a novel paradigm to guide robotic tasks through an object-centric, hand-drawn symbolic representation. RoVI effectively encodes spatial-temporal information into human-interpretable visual instructions through 2D sketches, utilizing arrows, circles, colors, and numbers to direct 3D robotic manipulation. To enable robots to understand RoVI better and generate precise actions based on RoVI, we present the Visual Instruction Embodied Workflow (VIEW), a pipeline formulated for RoVI-conditioned policies. This approach leverages Vision-Language Models (VLMs) to interpret RoVI inputs, decode spatial and temporal constraints from 2D pixel space via keypoint extraction, and then transform them into executable 3D action sequences. We additionally curate a specialized dataset of 15K instances to fine-tune small VLMs for edge deployment, enabling them to effectively learn RoVI capabilities. Our approach is rigorously validated across 11 novel tasks in both real and simulated environments, demonstrating significant generalization capability. Notably, VIEW achieves an 87.5% success rate in real-world scenarios involving unseen tasks that feature multi-step actions, with disturbances, and trajectory-following requirements. Code and Datasets …
Poster
Sangmin Lee · Sungyong Park · Heewon Kim

[ ExHall D ]

Abstract
Creating robotic manipulation datasets is traditionally labor-intensive and expensive, requiring extensive manual effort. To alleviate this problem, we introduce PhaseScene, which generates realistic and diverse dynamic scenes (i.e., robotic manipulation data) from text instructions for Embodied AI. PhaseScene employs a phase-specific data representation by dividing dynamic scenes into static environments and robot movements. Each phase utilizes a diffusion-based method to generate phase-specific data, incorporating data refinement and augmentation techniques. Our experiments demonstrate that PhaseScene outperforms manual creation on standard metrics, achieving about 20 times faster generation, 1.84 times higher accuracy, and 28% higher action diversity. Additionally, the generated scenes enable accurate agent training, with average success rate improvements of 7.96% for PerAct and 11.23% for PerAct-PSA.
Poster
Sen Wang · Le Wang · Sanping Zhou · Jingyi Tian · lijiayi · Haowen Sun · Wei Tang

[ ExHall D ]

Abstract
Robotic manipulation in high-precision tasks is essential for numerous industrial and real-world applications where accuracy and speed are required. Yet current diffusion-based policy learning methods generally suffer from low computational efficiency due to the iterative denoising process during inference. Moreover, these methods do not fully explore the potential of generative models for enhancing information exploration in 3D environments. In response, we propose FlowRAM, a novel framework that leverages generative models to achieve region-aware perception, enabling efficient multimodal information processing. Specifically, we devise a Dynamic Radius Schedule, which enables adaptive perception, facilitating transitions from global scene comprehension to fine-grained geometric details. Furthermore, we employ state space models to integrate multimodal information while preserving linear computational complexity. In addition, we employ conditional flow matching to learn action poses by regressing deterministic vector fields, which simplifies the learning process while maintaining performance. We verify the effectiveness of FlowRAM on RLBench, an established manipulation benchmark, and achieve state-of-the-art performance. The results demonstrate that FlowRAM achieves a remarkable improvement, particularly in high-precision tasks, where it outperforms previous methods by 12.0% in average success rate. Additionally, FlowRAM is able to generate physically plausible actions for a variety of real-world tasks in less than 4 …
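Conditional flow matching trains a network to regress the velocity of a simple probability path, typically a straight line between a noise (or prior) sample and the ground-truth target. The sketch below shows the standard training objective with linear interpolation paths; the vector_field callable and its conditioning are placeholders, not FlowRAM's architecture.

```python
import numpy as np

def conditional_flow_matching_loss(vector_field, x0: np.ndarray, x1: np.ndarray, cond) -> float:
    """Standard conditional flow matching objective with linear interpolation paths:
    x_t = (1 - t) * x0 + t * x1 and target velocity u = x1 - x0, for batches of shape (B, D).
    `vector_field(x_t, t, cond)` is a hypothetical network predicting the velocity;
    x0 is noise (or a prior pose) and x1 the ground-truth action pose."""
    t = np.random.rand(x0.shape[0], 1)            # one time value per sample in the batch
    x_t = (1.0 - t) * x0 + t * x1                 # point on the straight-line path
    target = x1 - x0                              # constant velocity along that path
    pred = vector_field(x_t, t, cond)
    return float(np.mean((pred - target) ** 2))   # regress the deterministic vector field
```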
Poster
Ning Gao · Yilun Chen · Shuai Yang · Xinyi Chen · Yang Tian · Hao Li · Haifeng Huang · Hanqing Wang · Tai Wang · Jiangmiao Pang

[ ExHall D ]

Abstract
Robotic manipulation in real-world settings presents significant challenges, particularly in achieving reliable performance across diverse real-world conditions. However, existing simulation platforms often lack the necessary support for studying policy generalization across varied tasks and conditions, falling behind the growing interest in leveraging foundation models. To address these limitations, we introduce GenManip, a realistic tabletop simulation platform designed to study policy generalization. The platform features a task-oriented scene graph-based scenario generation driven by GPT capabilities, enabling large-scale everyday task synthesis using 10K 3D assets. To investigate the generalization of robotic manipulation, we introduce GenManip-Bench, a benchmark comprising 250 task scenarios derived from generated tasks and refined through human-in-the-loop correction. We focus on two key areas: a modular manipulation system that employs foundation models for component-specific analysis, and end-to-end policy exploration using the scalable data collection pipeline. Experimental results show that while data scaling benefits learning-based policies, their generalization remains limited compared to modular approaches using foundation models. We expect this platform to offer critical insights for advancing policy generalizability in realistic settings. All code will be made publicly available.
Poster
Wenbo Wang · Fangyun Wei · Lei Zhou · Xi Chen · Lin Luo · Xiaohan Yi · Yizhong Zhang · Yaobo Liang · Chang Xu · Yan Lu · Jiaolong Yang · Baining Guo

[ ExHall D ]

Abstract
We introduce UniGraspTransformer, a universal Transformer-based network for dexterous robotic grasping that simplifies training while enhancing scalability and performance. Unlike prior methods such as UniDexGrasp++, which require complex, multi-step training pipelines, UniGraspTransformer follows a streamlined process: first, dedicated policy networks are trained for individual objects using reinforcement learning to generate successful grasp trajectories; then, these trajectories are distilled into a single, universal network. Our approach enables UniGraspTransformer to scale effectively, incorporating up to 12 self-attention blocks for handling thousands of objects with diverse poses. Additionally, it generalizes well to both idealized and real-world inputs, evaluated in state-based and vision-based settings. Notably, UniGraspTransformer generates a broader range of grasping poses for objects in various shapes and orientations, resulting in more diverse grasp strategies. Experimental results demonstrate significant improvements over state-of-the-art, UniDexGrasp++, across various object categories, achieving success rate gains of 3.5%, 7.7%, and 10.1% on seen objects, unseen objects within seen categories, and completely unseen objects, respectively, in the vision-based setting.
Poster
Youxin Pang · Ruizhi Shao · Jiajun Zhang · Hanzhang Tu · Yun Liu · Boyao Zhou · Hongwen Zhang · Yebin Liu

[ ExHall D ]

Abstract
In this paper, we introduce ManiVideo, a novel method for generating consistent and temporally coherent bimanual hand-object manipulation videos from given motion sequences of hands and objects. The core idea of ManiVideo is the construction of a multi-layer occlusion (MLO) representation that learns 3D occlusion relationships from occlusion-free normal maps and occlusion confidence maps. By embedding the MLO structure into the UNet in two forms, the model enhances the 3D consistency of dexterous hand-object manipulation. To further achieve the generalizable grasping of objects, we integrate Objaverse, a large-scale 3D object dataset, to address the scarcity of video data, thereby facilitating the learning of extensive object consistency. Additionally, we propose an innovative training strategy that effectively integrates multiple datasets, supporting downstream tasks such as human-centric hand-object manipulation video generation. Through extensive experiments, we demonstrate that our approach not only achieves video generation with plausible hand-object interaction and generalizable objects, but also outperforms existing SOTA methods.
Poster
Shijian Jiang · Qi Ye · Rengan Xie · Yuchi Huo · Jiming Chen

[ ExHall D ]

Abstract
This work aims to reconstruct the 3D geometry of a rigid object manipulated by one or both hands using monocular RGB video. Previous methods rely on Structure-from-Motion or hand priors to estimate relative motion between the object and camera, which typically assume textured objects or single-hand interactions. To accurately recover object geometry in dynamic hand-object interactions, we incorporate priors from 3D generation models into object pose estimation and propose semantic consistency constraints to solve the challenge of shape and texture discrepancy between the generated priors and observations. The poses are initialized, followed by joint optimization of the object poses and implicit neural representation. During the optimization, a novel pose outlier voting strategy with inter-view consistency is proposed to correct large pose errors. Experiments on three datasets demonstrate that our method significantly outperforms the state-of-the-art in reconstruction quality for both single- and two-hand scenarios.
Poster
Yinqiao Wang · Hao Xu · Pheng-Ann Heng · Chi-Wing Fu

[ ExHall D ]

Abstract
Estimating the 3D pose of a hand and a potential hand-held object from monocular images is a longstanding challenge. Yet, existing methods are specialized, focusing on either bare hands or hands interacting with objects. No method can flexibly handle both scenarios, and performance degrades when a method is applied to the other scenario. In this paper, we propose UniHOPE, a unified approach for general 3D hand-object pose estimation that flexibly adapts to both scenarios. Technically, we design a grasp-aware feature fusion module to integrate hand-object features, with an object switcher to dynamically control the hand-object pose estimation according to grasping status. Further, to improve the robustness of hand pose estimation regardless of object presence, we generate realistic de-occluded image pairs to train the model to learn object-induced hand occlusions, and formulate multi-level feature enhancement techniques for learning occlusion-invariant features. Extensive experiments on three commonly-used benchmarks demonstrate UniHOPE's SOTA performance in addressing hand-only and hand-object scenarios. Code will be publicly released upon publication.
Poster
Rolandos Alexandros Potamias · Jinglei Zhang · Jiankang Deng · Stefanos Zafeiriou

[ ExHall D ]

Abstract
In recent years, 3D hand pose estimation methods have garnered significant attention due to their extensive applications in human-computer interaction, virtual reality, and robotics. In contrast, there has been a notable gap in hand detection pipelines, posing significant challenges in constructing effective real-world multi-hand reconstruction systems. In this work, we present a data-driven pipeline for efficient multi-hand reconstruction in the wild. The proposed pipeline is composed of two components: a real-time fully convolutional hand localization model and a high-fidelity transformer-based 3D hand reconstruction model. To tackle the limitations of previous methods and build a robust and stable detection network, we introduce a large-scale dataset with over 2M in-the-wild hand images with diverse lighting, illumination, and occlusion conditions. Our approach outperforms previous methods in both efficiency and accuracy on popular 2D and 3D benchmarks. Finally, we showcase the effectiveness of our pipeline in achieving smooth 3D hand tracking from monocular videos, without utilizing any temporal components. Code, models and dataset will be made publicly available.
Poster
Zhuoran ZHAO · Linlin Yang · Pengzhan Sun · Pan Hui · Angela Yao

[ ExHall D ]

Abstract
Recent synthetic 3D human datasets for the face, body, and hands have pushed the limits of photorealism. Face recognition and body pose estimation have achieved state-of-the-art performance using synthetic training data alone, but for the hand, a large synthetic-to-real gap remains. This paper presents the first systematic study of the synthetic-to-real gap in 3D hand pose estimation. We analyze the gap and identify key components such as the forearm, image frequency statistics, hand pose, and object occlusions. To facilitate our analysis, we propose a data synthesis pipeline to synthesize high-quality data. We demonstrate that synthetic hand data can achieve the same level of accuracy as real data when our identified components are integrated, paving the way toward using synthetic data alone for hand pose estimation. Source code and data will be released upon acceptance.
Poster
Sirui Xu · Hung Yu Ling · Yu-Xiong Wang · Liangyan Gui

[ ExHall D ]

Abstract
Achieving realistic simulations of humans engaging in a wide range of object interactions has long been a fundamental goal in animation. Extending physics-based motion imitation techniques to complex human-object interactions (HOIs) is particularly challenging due to the intricate coupling between human-object dynamics and the variability in object geometries and properties. Moreover, motion capture data often contain artifacts such as inaccurate contacts and insufficient hand details, which hinder the learning process. We introduce InterMimic, a framework that overcomes these challenges by enabling a single policy to robustly learn from imperfect motion capture sequences encompassing tens of hours of diverse full-body interaction skills with dynamic and varied objects. Our key insight is employing a curriculum strategy: perfecting first, then scaling up. We first train subject-specific teacher policies to mimic, retarget, and refine the motion capture data, effectively correcting imperfections. Then, we distill a student policy from these teachers; the teachers act as online experts providing direct supervision and supplying clean references. This ensures that the student policy learns from high-quality guidance despite imperfections in the original dataset. Our experiments demonstrate that InterMimic produces realistic and diverse interactions across various HOI datasets. Notably, the learned policy exhibits zero-shot generalization, allowing seamless integration with …
Poster
Uyoung Jeong · Jonathan Freer · Seungryul Baek · Hyung Jin Chang · Kwang In Kim

[ ExHall D ]

Abstract
Human pose estimation is in increasing demand across diverse applications, from avatar generation to human-robot interaction. However, the domains of these applications often diverge from standard human pose estimation datasets, leading to limited domain transfer. In particular, multi-dataset training (MDT) must cope with variations in skeleton types and limited comprehensive supervision across them. We propose a novel MDT framework, called PoseBH, that integrates poses beyond humans. Our method addresses keypoint heterogeneity and limited supervision through two primary techniques. First, we introduce nonparametric keypoint prototypes that are learned in a unified embedding space, enabling seamless integration across arbitrary skeleton types and facilitating robust domain transfer. Second, we introduce a cross-modal self-supervision mechanism that aligns keypoint predictions with keypoint embedding prototypes, thus enhancing supervision without relying on teacher-student models or additional augmentations. PoseBH demonstrates significant generalization improvements on whole-body and animal pose datasets (COCO-WholeBody, AP-10K, APT-36K), while maintaining performance on the standard human pose benchmarks (COCO, MPII, AIC). Our learned keypoint embeddings also transfer well to the hand shape (InterHand2.6M) and human shape (3DPW) domains.
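The prototype alignment idea can be pictured with a small, hedged sketch: predicted keypoint embeddings are matched against a bank of nonparametric prototypes, and pseudo-assignments from an auxiliary branch supervise the embedding branch. The loss form, branch names, and temperature below are assumptions, not the released PoseBH code.

```python
# Illustrative sketch: align keypoint embeddings with a shared prototype bank using
# pseudo-assignments computed from an auxiliary branch (all shapes are assumed).
import torch
import torch.nn.functional as F

def prototype_alignment_loss(kpt_embed, aux_embed, prototypes, tau: float = 0.1):
    # kpt_embed, aux_embed: (N, D) per-keypoint embeddings; prototypes: (K, D) bank
    with torch.no_grad():
        sim = F.normalize(aux_embed, dim=-1) @ F.normalize(prototypes, dim=-1).t()
        target = sim.argmax(dim=-1)                     # (N,) pseudo prototype labels
    logits = F.normalize(kpt_embed, dim=-1) @ F.normalize(prototypes, dim=-1).t() / tau
    return F.cross_entropy(logits, target)

loss = prototype_alignment_loss(torch.randn(17, 64), torch.randn(17, 64), torch.randn(100, 64))
```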
Poster
Qingzheng Xu · Ru Cao · Xin Shen · Heming Du · Sen Wang · Xin Yu

[ ExHall D ]

Abstract
Human pose estimation is a critical task in computer vision for applications in sports analysis, healthcare monitoring, and human-computer interaction. However, existing human pose datasets are either collected in custom-configured laboratories with complex devices or only include data on single individuals, and both types typically capture daily activities. In this paper, we introduce the M3GYM dataset, a large-scale multimodal, multi-view, and multi-person pose dataset collected from a real gym to address the limitations of existing datasets. Specifically, we collect videos for 82 sessions from the gym, each lasting 40 to 60 minutes. These videos are captured by 8 cameras and include over 50 subjects and 47 million frames. The sessions comprise 51 normal fitness exercise sessions as well as 17 Pilates and 14 Yoga sessions. The exercises cover a wide range of poses and typical fitness activities, particularly in Yoga and Pilates, featuring poses with stretches, bends, and twists, e.g., humble warrior, fire hydrants, and knee hover side twists. Each session involves multiple subjects, leading to significant self-occlusion and mutual occlusion in single views. Moreover, the gym has two symmetric floor mirrors, a feature not seen in previous datasets, and seven lighting conditions. We provide frame-level multimodal annotations, including 2D & 3D keypoints, …
Poster
Mohammadhossein Bahari · Saeed Saadatnejad · Amirhossein Askari Farsangi · Seyed-Mohsen Moosavi-Dezfooli · Alex Alahi

[ ExHall D ]

Abstract
Predicting human trajectories is essential for the safe operation of autonomous vehicles, yet current data-driven models often lack robustness to noisy inputs such as adversarial examples or imperfect observations. Although some trajectory prediction methods have been developed to provide empirical robustness, these methods are heuristic and do not offer guaranteed robustness. In this work, we propose a certification approach tailored for trajectory prediction that provides guaranteed robustness. To this end, we address the unique challenges associated with trajectory prediction, such as unbounded outputs and multi-modality. To mitigate the inherent performance drop caused by certification, we propose a diffusion-based trajectory denoiser and integrate it into our method. Moreover, we introduce new certified performance metrics to reliably measure trajectory prediction performance. Through comprehensive experiments, we demonstrate the accuracy and robustness of the certified predictors and highlight their advantages over non-certified ones. The code will be released upon publication.
Poster
Ming Yan · Xincheng Lin · Yuhua Luo · Shuqi Fan · Yudi Dai · Qixin Zhong · Lincai Zhong · Yuexin Ma · Lan Xu · Chenglu Wen · Siqi Shen · Cheng Wang

[ ExHall D ]

Abstract
Human Motion Recovery (HMR) research mainly focuses on ground-based motions such as running. The study of capturing climbing, an off-ground motion, is sparse. This is partly due to the limited availability of climbing motion datasets, especially large-scale and challenging 3D-labeled datasets. To address this insufficiency, we collect AscendMotion, a large-scale, well-annotated, and challenging climbing motion dataset. It consists of 412k RGB frames, LiDAR frames, and IMU measurements, covering the challenging climbing motions of 22 professional climbing coaches across 12 different rocks. Capturing climbing motion is challenging because it requires precise recovery not only of complex poses but also of the global positions of climbers. Although multiple global HMR methods have been proposed, they cannot faithfully capture climbing motions. To address the limitations of HMR methods for climbing, we propose ClimbingCap, a motion recovery method that reconstructs continuous 3D human climbing motion in a global coordinate system. One key insight is to use the RGB and LiDAR modalities to separately reconstruct motions in camera coordinates and global coordinates and to optimize them jointly. We demonstrate the quality of the AscendMotion dataset and present promising results from ClimbingCap. The AscendMotion dataset and the source code …
Poster
Hiromu Taketsugu · Takeru Oba · Takahiro Maeda · Shohei Nobuhara · Norimichi Ukita

[ ExHall D ]

Abstract
Humans can predict future human trajectories even from momentary observations by using human pose-related cues. However, previous Human Trajectory Prediction (HTP) methods leverage pose cues only implicitly, resulting in implausible predictions. To address this, we propose Locomotion Embodiment, a framework that explicitly evaluates the physical plausibility of a predicted trajectory through locomotion generation under the laws of physics. While the plausibility of locomotion is learned with a non-differentiable physics simulator, it is replaced by our differentiable Locomotion Value function to train an HTP network in a data-driven manner. In particular, our proposed Embodied Locomotion loss is beneficial for efficiently training a stochastic HTP network with multiple heads. Furthermore, the Locomotion Value filter is proposed to filter out implausible trajectories at inference. Experiments demonstrate that our method further enhances even state-of-the-art HTP methods across diverse datasets and problem settings. Our code will be publicly available.
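A minimal sketch of how a learned locomotion value could act as an inference-time filter over a multi-head predictor's hypotheses; the scorer, threshold, and fallback rule below are assumptions for illustration, not the paper's formulation.

```python
# Illustrative sketch: score each trajectory hypothesis with a plausibility network and
# keep only those above a threshold, falling back to the single best one if needed.
import torch
import torch.nn as nn

def filter_trajectories(trajs, value_net, threshold: float = 0.5):
    # trajs: (H, T, 2) hypotheses of future xy positions
    scores = value_net(trajs.flatten(1)).squeeze(-1)     # (H,) plausibility scores
    keep = scores >= threshold
    if not keep.any():                                   # never return an empty set
        keep = scores == scores.max()
    return trajs[keep]

value_net = nn.Sequential(nn.Linear(12 * 2, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
plausible = filter_trajectories(torch.randn(20, 12, 2), value_net)
```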
Poster
Ting Yu · Yi Lin · Jun Yu · Zhenyu Lou · Qiongjie Cui

[ ExHall D ]

Abstract
Recent advances in human motion prediction (HMP) have shifted focus from isolated motion data to integrating human-scene correlations. In particular, the latest methods leverage human gaze points, using their spatial coordinates to indicate intent—where a person might move within a 3D environment. Despite promising trajectory results, these methods often produce inaccurate poses by overlooking the semantic implications of gaze, specifically the affordances of observed objects, which indicate the possible interactions. To address this, we propose GAP3DS, an affordance-aware HMP model that utilizes gaze-informed object affordances to improve HMP in complex 3D environments. GAP3DS incorporates a gaze-guided affordance learner to identify relevant objects in the scene and infer their affordances based on human gaze, thus contextualizing future human-object interactions. This affordance information, enriched with visual features and gaze data, conditions the generation of multiple human-object interaction poses, which are subsequently decoded into final motion predictions. Extensive experiments on two real-world datasets demonstrate that GAP3DS outperforms state-of-the-art methods in both trajectory and pose accuracy, producing more physically consistent and contextually grounded predictions.
Poster
Dongyang Jin · Chao Fan · Jingzhe Ma · Jingkai Zhou · Weihua Chen · Shiqi Yu

[ ExHall D ]

Abstract
Capturing individual gait patterns while excluding identity-irrelevant cues in walking videos, such as clothing texture and color, remains a persistent challenge for vision-based gait recognition. Traditional silhouette- and pose-based methods, though theoretically effective at removing such distractions, often fall short of high accuracy due to their sparse and less informative inputs. To address this, emerging end-to-end methods focus on directly denoising RGB videos using global optimization and human-defined priors. Building on this trend, we propose a novel gait denoising method, DenoisingGait. Inspired by the philosophy that “what I cannot create, I do not understand”, we turn to generative diffusion models, uncovering how these models can partially filter out irrelevant factors for improved gait understanding. On top of this generation-driven denoising, we introduce feature matching, a popular geometric constraint in optical flow and depth estimation, to condense multi-channel float-encoded RGB information into two-channel direction vectors that represent local structural features, where within-frame matching captures spatial details and cross-frame matching conveys temporal dynamics. Experiments on the CCPG, CASIA-B*, and SUSTech1K datasets demonstrate that DenoisingGait achieves new SoTA performance in most cases for both within-domain and cross-domain evaluations. All code will be released.
Poster
Lingan Zeng · Guohong Huang · Yi-Lin Wei · Shengbo Gu · Yu-Ming Tang · Jingke Meng · Wei-Shi Zheng

[ ExHall D ]

Abstract
We propose ChainHOI, a novel approach for text-driven human-object interaction (HOI) generation that explicitly models interactions at both the joint and kinetic chain levels. Unlike existing methods that implicitly model interactions using full-body poses as tokens, we argue that explicitly modeling joint-level interactions is more natural and effective for generating realistic HOIs, as it directly captures the geometric and semantic relationships between joints, rather than modeling interactions in the latent pose space. To this end, ChainHOI introduces a novel joint graph to capture potential interactions with objects, and a Generative Spatiotemporal Graph Convolution Network to explicitly model interactions at the joint level. Furthermore, we propose a Kinematics-based Interaction Module that explicitly models interactions at the kinetic chain level, ensuring more realistic and biomechanically coherent motions. Evaluations on two public datasets demonstrate that ChainHOI significantly outperforms previous methods, generating more realistic and semantically consistent HOIs.
Poster
Tao Wang · Zhihua Wu · Qiaozhi He · Jiaming Chu · Ling Qian · Yu Cheng · Junliang Xing · Jian Zhao · Lei Jin

[ ExHall D ]

Abstract
Text-to-motion generation, which translates textual descriptions into human motions, has been challenging in accurately capturing detailed user-imagined motions from simple text inputs. This paper introduces StickMotion, an efficient diffusion-based network designed for multi-condition scenarios, which generates desired motions based on traditional text and our proposed stickman conditions for global and local control of these motions, respectively. We address the challenges introduced by the user-friendly stickman from three perspectives: 1) Data generation. We develop an algorithm to automatically generate hand-drawn stickmen across different dataset formats. 2) Multi-condition fusion. We propose a multi-condition module that integrates into the diffusion process and obtains outputs for all possible condition combinations, reducing computational complexity and enhancing StickMotion's performance compared to conventional approaches with a self-attention module. 3) Dynamic supervision. We empower StickMotion to make minor adjustments to the stickman's position within the output sequences, generating more natural movements through our proposed dynamic supervision strategy. Through quantitative experiments and user studies, we find that sketching stickmen saves users about 51.5% of the time needed to generate motions consistent with their imagination. Our code, demos, and relevant data will be released to facilitate further research and validation within the scientific community.
Poster
Pablo Ponce Ponce · German Barquero · Cristina Palmero · Sergio Escalera · Jose Garcia-Rodriguez

[ ExHall D ]

Abstract
Generating human motion guided by conditions such as textual descriptions is challenging due to the need for datasets with pairs of high-quality motion and their corresponding conditions. The difficulty increases when aiming for finer control in the generation. To that end, prior works have proposed to combine several motion diffusion models pre-trained on datasets with different types of conditions, thus allowing control with multiple conditions. However, the proposed merging strategies overlook that the optimal way to combine the generation processes might depend on the particularities of each pre-trained generative model and also the specific textual descriptions. In this context, we introduce MixerMDM, the first learnable model composition technique for combining pre-trained text-conditioned human motion diffusion models. Unlike previous approaches, MixerMDM provides a dynamic mixing strategy that is trained in an adversarial fashion to learn to combine the denoising process of each model depending on the set of conditions driving the generation. By using MixerMDM to combine single- and multi-person motion diffusion models, we achieve fine-grained control on the dynamics of every person individually, and also on the overall interaction. Furthermore, we propose a new evaluation technique that, for the first time in this task, measures the interaction and individual quality …
Poster
Boyuan Wang · Xiaofeng Wang · Chaojun Ni · Guosheng Zhao · Zhiqin Yang · Zheng Zhu · Muyang Zhang · YuKun Zhou · xinze chen · Guan Huang · lihong liu · Xingang Wang

[ ExHall D ]

Abstract
Human-motion video generation has been a challenging task, primarily due to the difficulty inherent in learning human body movements. While some approaches have attempted to drive human-centric video generation explicitly through pose control, these methods typically rely on poses derived from existing videos, thereby lacking flexibility. To address this, we propose HumanDreamer, a decoupled human video generation framework that first generates diverse poses from text prompts and then leverages these poses to generate human-motion videos. Specifically, we propose MotionVid, the largest dataset for human-motion pose generation. Based on this dataset, we present MotionDiT, which is trained to generate structured human-motion poses from text prompts. Besides, a novel LAMA loss is introduced; together, these contribute to a significant improvement in FID of 62.4%, along with respective enhancements in R-precision for top1, top2, and top3 of 41.8%, 26.3%, and 18.3%, thereby advancing both Text-to-Pose control accuracy and the FID metric. Our experiments across various Pose-to-Video baselines demonstrate that the poses generated by our method can produce diverse and high-quality human-motion videos. Furthermore, our model can facilitate other downstream tasks, such as pose sequence prediction and 2D-to-3D motion lifting.
Poster
Neerja Thakkar · Tara Sadjadpour · Jathushan Rajasegaran · Shiry Ginosar · Jitendra Malik

[ ExHall D ]

Abstract
We introduce a simple framework for predicting the behavior of an ego agent in multi-agent settings. In contrast to autoregressive (AR) tasks, such as language processing, our focus is on scenarios with multiple agents whose interactions are shaped by physical constraints and internal motivations. To this end, we propose Poly-Autoregressive (PAR) modeling, which forecasts an ego agent’s future behavior by reasoning about the ego agent’s state history and the current state of other interacting agents. At its core, PAR represents the behavior of all agents as a sequence of tokens, each representing an agent’s state at a specific timestep. With minimal data pre-processing changes, we show that PAR can be applied to three different problems: human action prediction in social situations, trajectory prediction for autonomous vehicles, and object pose prediction during hand-object interaction. Using a small proof-of-concept transformer backbone, PAR outperforms AR across our three scenarios.
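A toy version of the token layout can make the idea concrete: every agent state at a timestep is one token, and the ego agent's next state is decoded from its own history plus the other agents' current states. The tokenization details and backbone below are assumptions, not the paper's exact design.

```python
# Illustrative sketch of poly-autoregressive-style conditioning: ego history tokens plus
# other-agent tokens go through a small transformer; the ego's next state is predicted.
import torch
import torch.nn as nn

class TinyPAR(nn.Module):
    def __init__(self, state_dim: int = 4, d_model: int = 64, nhead: int = 4, nlayers: int = 2):
        super().__init__()
        self.embed = nn.Linear(state_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, nlayers)
        self.head = nn.Linear(d_model, state_dim)

    def forward(self, ego_hist, others_now):
        # ego_hist: (B, T, state_dim); others_now: (B, A, state_dim)
        tokens = torch.cat([ego_hist, others_now], dim=1)   # one token per agent state
        h = self.encoder(self.embed(tokens))
        return self.head(h[:, ego_hist.size(1) - 1])        # predicted ego state at t+1

pred = TinyPAR()(torch.randn(2, 8, 4), torch.randn(2, 3, 4))
```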
Poster
Baixuan Lv · Yaohua Zha · Tao Dai · Xue Yuerong · Ke Chen · Shu-Tao Xia

[ ExHall D ]

Abstract
Point cloud video understanding is becoming increasingly important in fields such as robotics, autonomous driving, and augmented reality, as they can accurately represent object motion and environmental changes. Despite the progress made in self-supervised learning methods for point cloud video understanding, the limited availability of 4D data and the high computational cost of training 4D-specific models remain significant obstacles. In this paper, we investigate the potential of transferring pre-trained static 3D point cloud models to the 4D domain, identifying the limitations of static models that capture only spatial information while neglecting temporal dynamics. To address this, we propose a novel Cross-frame Spatio-temporal Adaptation (CSA) strategy by introducing the Point Tube Adapter as the embedding layer and the Geometric Constraint Temporal Adapter (GCTA) to enforce temporal consistency across frames. This strategy extracts both short-term and long-term temporal dynamics, effectively integrating them with spatial features and enriching the model’s understanding of temporal changes in point cloud videos. Extensive experiments on 3D action and gesture recognition tasks demonstrate that our method achieves state-of-the-art performance, establishing its effectiveness for point cloud video understanding.
Poster
Jaeah Lee · Changwoon Choi · Young Min Kim · Jaesik Park

[ ExHall D ]

Abstract
Understanding 3D motion from videos presents inherent challenges due to the diverse types of movement, ranging from rigid and deformable objects to articulated structures. To overcome this, we propose Liv3Stroke, a novel approach for abstracting objects in motion with deformable 3D strokes. The detailed movements of an object may be represented by unstructured motion vectors or a set of motion primitives using a pre-defined articulation from a template model. Just as a free-hand sketch can intuitively visualize scenes or intentions with a sparse set of lines, we utilize a set of parametric 3D curves to capture a set of spatially smooth motion elements for general objects with unknown structures. We first extract noisy, 3D point cloud motion guidance from video frames using semantic features, and our approach deforms a set of curves to abstract essential motion features as a set of explicit 3D representations. Such abstraction enables an understanding of prominent components of motions while maintaining robustness to environmental factors. Our approach allows direct analysis of 3D object movements from video, tackling the uncertainty that typically occurs when translating real-world motion into recorded footage.
Poster
Jinxi Li · Ziyang Song · Siyuan Zhou · Bo Yang

[ ExHall D ]

Abstract
In this paper, we aim to model 3D scene geometry, appearance, and the underlying physics purely from multi-view videos. By applying various governing PDEs as PINN losses or incorporating physics simulation into neural networks, existing works often fail to learn complex physical motions at boundaries or require object priors such as masks or types. We propose NGV to learn the physics of complex dynamic 3D scenes without needing any object priors. The key to our approach is to introduce a physics code followed by a carefully designed divergence-free module for estimating a per-Gaussian velocity field, without relying on inefficient PINN losses. Extensive experiments on two public datasets and a newly collected, challenging real-world dataset demonstrate the superior performance of our method for future frame extrapolation and motion segmentation. Most notably, our investigation into the learned physics codes reveals that they truly learn meaningful 3D physical motion patterns in the absence of any human labels during training.
Poster
Chris Rockwell · Joseph Tung · Tsung-Yi Lin · Ming-Yu Liu · David Fouhey · Chen-Hsuan Lin

[ ExHall D ]

Abstract
Annotating camera poses on dynamic Internet videos at scale is critical for advancing fields like realistic video generation and simulation. However, collecting such a dataset is difficult, as most Internet videos are unsuitable for pose estimation. Furthermore, annotating dynamic Internet videos presents significant challenges even for state-of-the-art methods. In this paper, we introduce DynPose-100K, a large-scale dataset of dynamic Internet videos annotated with camera poses. Our collection pipeline addresses filtering using a carefully combined set of task-specific and generalist models. For pose estimation, we combine the latest techniques in point tracking, dynamic masking, and structure-from-motion to achieve improvements over state-of-the-art approaches. Our analysis and experiments demonstrate that DynPose-100K is both large-scale and diverse across several key attributes, opening up avenues for advancements in various downstream applications.
Poster
Jingxi Chen · Brandon Y. Feng · Haoming Cai · Tianfu Wang · Levi Burner · Dehao Yuan · Cornelia Fermuller · Christopher Metzler · Yiannis Aloimonos

[ ExHall D ]

Abstract
Video Frame Interpolation aims to recover realistic missing frames between observed frames, generating a high-frame-rate video from a low-frame-rate video. However, without additional guidance, large motion between frames makes this problem ill-posed. Event-based Video Frame Interpolation (EVFI) addresses this challenge by using sparse, high-temporal-resolution event measurements as motion guidance. This guidance allows EVFI methods to significantly outperform frame-only methods. However, to date, EVFI methods have relied upon a limited set of paired event-frame training data, severely limiting their performance and generalization capabilities. In this work, we overcome the limited data challenge by adapting pre-trained video diffusion models trained on internet-scale datasets to EVFI. We experimentally validate our approach on real-world EVFI datasets, including a new one we introduce. Our method outperforms existing methods and generalizes across cameras far better than existing approaches.
Poster
Rick Akkerman · Haiwen Feng · Michael J. Black · Dimitrios Tzionas · Victoria Abrevaya

[ ExHall D ]

Abstract
Predicting the dynamics of interacting objects is essential for both humans and intelligent systems. However, existing approaches are limited to simplified, toy settings and lack generalizability to complex, real-world environments. Recent advances in generative models have enabled the prediction of state transitions based on interventions, but focus on generating a single future state which neglects the continuous motion and subsequent dynamics resulting from the interaction. To address this gap, we propose InterDyn, a novel framework that generates videos of interactive dynamics given an initial frame and a control signal encoding the motion of a driving object or actor. Our key insight is that large video foundation models can act as both neural renderers and implicit physics “simulators” by learning interactive dynamics from large-scale video data. To effectively harness this capability, we introduce an interactive control mechanism that conditions the video generation process on the motion of the driving entity. Qualitative results demonstrate that InterDyn generates plausible, temporally consistent videos of complex object interactions, while generalizing to unseen objects. Quantitative evaluations show that InterDyn outperforms baselines that focus on static state transitions. This work highlights the potential of leveraging video generative models as implicit physics engines. Code and trained models will …
Poster
Emanuele Aiello · Umberto Michieli · Diego Valsesia · Mete Ozay · Enrico Magli

[ ExHall D ]

Abstract
Personalized image generation requires text-to-image generative models that capture the core features of a reference subject to allow for controlled generation across different contexts. Existing methods face challenges due to complex training requirements, high inference costs, limited flexibility, or a combination of these issues. In this paper, we introduce DreamCache, a scalable approach for efficient and high-quality personalized image generation. By caching a small number of reference image features from a subset of layers and a single timestep of the pretrained diffusion denoiser, DreamCache enables dynamic modulation of the generated image features through lightweight, trained conditioning adapters. DreamCache achieves state-of-the-art image and text alignment, utilizes an order of magnitude fewer extra parameters, and is both more computationally efficient and more versatile than existing models.
Poster
Hanlin Wang · Hao Ouyang · Qiuyu Wang · Wen Wang · Ka Leong Cheng · Qifeng Chen · Yujun Shen · Limin Wang

[ ExHall D ]

Abstract
The intuitive nature of drag-based interaction has led to its growing adoption for controlling object trajectories in image-to-video synthesis. Still, existing methods that perform dragging in the 2D space usually face ambiguity when handling out-of-plane movements. In this work, we augment the interaction with a new dimension, i.e., the depth dimension, such that users are allowed to assign a relative depth for each point on the trajectory. That way, our new interaction paradigm not only inherits the convenience from 2D dragging, but facilitates trajectory control in the 3D space, broadening the scope of creativity. We propose a pioneering method for 3D trajectory control in image-to-video synthesis by abstracting object masks into a few cluster points. These points, accompanied by the depth information and the instance information, are finally fed into a video diffusion model as the control signal. Extensive experiments validate the effectiveness of our approach, dubbed LeviTor, in precisely manipulating the object movements when producing photo-realistic videos from static images.
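A hedged preprocessing sketch of the control-signal construction: an object mask is abstracted into a few cluster points, each carrying a user-assigned relative depth and an instance id. The clustering choice and point format below are assumptions for illustration, not the released pipeline.

```python
# Illustrative sketch: turn an object mask into k control points (x, y, depth, instance_id)
# that could serve as a sparse 3D trajectory control signal for a video diffusion model.
import numpy as np
from sklearn.cluster import KMeans

def mask_to_control_points(mask, depth: float, instance_id: int, k: int = 5):
    ys, xs = np.nonzero(mask)                              # pixel coordinates inside the mask
    pts = np.stack([xs, ys], axis=1).astype(np.float32)
    centers = KMeans(n_clusters=k, n_init=10).fit(pts).cluster_centers_
    return np.concatenate([centers,
                           np.full((k, 1), depth),         # user-assigned relative depth
                           np.full((k, 1), instance_id)], axis=1)

mask = np.zeros((128, 128), dtype=bool); mask[40:80, 30:90] = True
ctrl = mask_to_control_points(mask, depth=0.7, instance_id=1)
```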
Poster
Xi Chen · Zhifei Zhang · He Zhang · Yuqian Zhou · Soo Ye Kim · Qing Liu · Yijun Li · Jianming Zhang · Nanxuan Zhao · Yilin Wang · Hui Ding · Zhe Lin · Hengshuang Zhao

[ ExHall D ]

Abstract
We introduce UniReal, a unified framework designed to address various image generation and editing tasks. Existing solutions often vary by tasks, yet share fundamental principles: preserving consistency between inputs and outputs while capturing visual variations. Inspired by recent video generation models that effectively balance consistency and variation across frames, we propose a unifying approach that treats image-level tasks as discontinuous video generation. Specifically, we treat varying numbers of input and output images as frames, enabling seamless support for tasks such as image generation, editing, composition, etc. Although designed for image-level tasks, we leverage videos as a scalable source for universal supervision. UniReal learns world dynamics from large-scale videos, demonstrating advanced capability in handling shadows, reflections, pose variation, and object interaction, while also exhibiting emergent capability for novel applications.
Poster
Jie Tian · Xiaoye Qu · Zhenyi Lu · Wei Wei · Sichen Liu · Yu Cheng

[ ExHall D ]

Abstract
Image-to-Video (I2V) generation aims to synthesize a video clip according to a given image and condition (e.g., text). The key challenge of this task lies in simultaneously generating natural motions while preserving the original appearance of the images. However, current I2V diffusion models (I2V-DMs) often produce videos with limited motion degrees or exhibit uncontrollable motion that conflicts with the textual condition. In this paper, we propose a novel Extrapolating and Decoupling framework to mitigate these issues. Specifically, our framework consists of three separate stages: (1) Starting with a base I2V-DM, we explicitly inject the textual condition into the temporal module using a lightweight, learnable adapter and fine-tune the integrated model to improve motion controllability. (2) We introduce a training-free extrapolation strategy to amplify the dynamic range of the motion, effectively reversing the fine-tuning process to significantly enhance the motion degree. (3) With the above two-stage models excelling in motion controllability and motion degree, we decouple the relevant parameters associated with each type of motion ability and inject them into the base I2V-DM. Since the I2V-DM handles different levels of motion controllability and dynamics at various denoising time steps, we adjust the motion-aware parameters accordingly over time. Extensive qualitative and quantitative experiments have been …
Poster
Yao-Chih Lee · Erika Lu · Sarah Rumbley · Michal Geyer · Jia-Bin Huang · Tali Dekel · Forrester Cole

[ ExHall D ]

Abstract
Given a video and a set of input object masks, an omnimatte method aims to decompose the video into semantically meaningful layers containing individual objects along with their associated effects, such as shadows and reflections. Existing omnimatte methods assume a static background or accurate pose and depth estimation and produce poor decompositions when these assumptions are violated. Furthermore, due to the lack of a generative prior on natural videos, existing methods cannot complete dynamic occluded regions. We present a novel generative layered video decomposition framework to address the omnimatte problem. Our method does not assume a stationary scene or require camera pose or depth information and produces clean, complete layers, including convincing completions of occluded dynamic regions. Our core idea is to train a video diffusion model to identify and remove scene effects caused by a specific object. We show that this model can be finetuned from an existing video inpainting model with a small, carefully curated dataset, and demonstrate high-quality decompositions and editing results for a wide range of casually captured videos containing soft shadows, glossy reflections, splashing water, and more.
Poster
Uri Gadot · Shie Mannor · Assaf Shocher · Gal Chechik · Assaf Hallak

[ ExHall D ]

Abstract
Video encoders optimize compression for human perception by minimizing reconstruction error under bit-rate constraints. In many modern applications such as autonomous driving, an overwhelming majority of videos serve as input for AI systems performing tasks like object recognition or segmentation, rather than being watched by humans. It is therefore useful to optimize the encoder for a downstream task instead of for perceptual image quality. However, a major challenge is how to combine such downstream optimization with existing standard video encoders, which are highly efficient and popular. Here, we address this challenge by controlling the Quantization Parameters (QPs) at the macro-block level to optimize the downstream task. This granular control allows us to prioritize encoding for task-relevant regions within each frame. We formulate this optimization problem as a Reinforcement Learning (RL) task, where the agent learns to balance the long-term implications of choosing QPs on both task performance and bit-rate constraints. Notably, our policy does not require the downstream task as an input during inference, making it suitable for streaming applications and edge devices such as vehicles. We demonstrate significant improvements in two tasks: car detection and ROI (saliency) encoding. Our approach improves task performance for a given bit rate compared to …
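One plausible reward shaping for such an agent (an assumption for illustration, not taken from the paper) trades off the drop in downstream task quality against overshooting the bit budget, with per-macro-block QPs as the actions:

```python
# Illustrative reward sketch: penalize loss of downstream task quality (e.g. detection mAP
# measured on the encoded frame vs. the original) and penalize exceeding the bit budget.
def reward(task_score_encoded: float, task_score_original: float,
           bits_used: float, bit_budget: float, lam: float = 1.0) -> float:
    task_drop = task_score_original - task_score_encoded      # quality lost to compression
    overshoot = max(0.0, bits_used - bit_budget) / bit_budget  # relative budget violation
    return -task_drop - lam * overshoot

r = reward(task_score_encoded=0.71, task_score_original=0.74,
           bits_used=1.2e6, bit_budget=1.0e6)
```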
Poster
Zhaoyang Jia · Bin Li · Jiahao Li · Wenxuan Xie · Linfeng Qi · Houqiang Li · Yan Lu

[ ExHall D ]

Abstract
We introduce a practical real-time neural video codec (NVC) designed to deliver high compression ratio, low latency and broad versatility. In practice, the coding speed of NVCs is influenced by 1) computational costs, and 2) non-computational operational costs, such as memory I/O and the number of function calls. While most efficient NVCs prioritize reducing computational cost, we identify operational cost as the primary bottleneck to achieving higher coding speed. Leveraging this insight, we introduce a set of efficiency-driven design improvements focused on minimizing operational costs. Specifically, we employ implicit temporal modeling to eliminate complex explicit motion modules, and use single low-resolution latent representations rather than progressive downsampling. These innovations significantly accelerate NVC without sacrificing compression quality. Additionally, we implement model integerization for consistent cross-device coding and a module-bank-based rate control scheme to improve practical adaptability. Experiments show our NVC achieves an impressive average encoding/decoding speed at 125.2/112.8 fps (frames per second) for 1080p video, while saving an average of 21% in bitrate compared to H.266/VTM. The code will be released.
Poster
Chuanbo Tang · Zhuoyuan Li · Yifan Bian · Li Li · Dong Liu

[ ExHall D ]

Abstract
Efficient video coding is highly dependent on exploiting the temporal redundancy, which is usually achieved by extracting and leveraging the temporal context in the emerging conditional coding-based neural video codec (NVC). Although the latest NVC has achieved remarkable progress in improving the compression performance, the inherent temporal context propagation mechanism lacks the ability to sufficiently leverage the reference information, limiting further improvement. In this paper, we address the limitation by modulating the temporal context with the reference frame in two steps. Specifically, we first propose the flow orientation to mine the inter-correlation between the reference frame and prediction frame for generating the additional oriented temporal context. Moreover, we introduce the context compensation to leverage the oriented context to modulate the propagated temporal context generated from the propagated reference feature. Through the synergy mechanism and decoupling loss supervision, the irrelevant propagated information can be effectively eliminated to ensure better context modeling. Experimental results demonstrate that our codec achieves on average 22.7% bitrate reduction over the advanced traditional video codec H.266/VVC, and offers an average 10.1% bitrate saving over the previous state-of-the-art NVC DCVC-FM.
Poster
Zeyu Xiao · Xinchao Wang

[ ExHall D ]

Abstract
Exploiting temporal correlations is crucial for video super-resolution (VSR). Recent approaches enhance this by incorporating event cameras. In this paper, we introduce MamEVSR, a Mamba-based network for event-based VSR that leverages the selective state space model Mamba. MamEVSR stands out by offering global receptive field coverage with linear computational complexity, thus addressing the limitations of convolutional neural networks and Transformers. The key components of MamEVSR include: (1) the interleaved Mamba (iMamba) block, which interleaves tokens from adjacent frames and applies multi-directional selective state space modeling, enabling efficient feature fusion and propagation across bi-directional frames while maintaining linear complexity; (2) the cross-modality Mamba (cMamba) block, which facilitates further interaction and aggregation between event information and the output from the iMamba block. The cMamba block can leverage complementary spatio-temporal information from both modalities and allows MamEVSR to capture finer motion details. Experimental results show that the proposed MamEVSR achieves superior performance on various datasets both quantitatively and qualitatively.
Poster
Shuaizhen Yao · Xiaoya Zhang · Xin Liu · Mengyi Liu · Zhen Cui

[ ExHall D ]

Abstract
The diffusion probabilistic model is becoming a cornerstone of data generation, especially for generating high-quality images. As an extension, video diffusion generation is in urgent need of a principled temporal-sequence diffusion formulation, whereas spatial-domain diffusion dominates most video diffusion methods. In this work, we propose an explicit Spatio-Temporal Dual Diffusion (STDD) method that principledly extends the standard diffusion model to a spatio-temporal diffusion model for joint spatial and temporal noise propagation/reduction. Mathematically, an analysable dual diffusion process is derived to accumulate noise/information over the temporal sequence as well as the spatial domain. Correspondingly, we theoretically derive a spatio-temporal probabilistic reverse diffusion process and propose an accelerated sampling scheme to reduce the inference cost. In principle, the spatio-temporal dual diffusion enables information from previous frames to be transferred to the current frame, which could thus be beneficial for video consistency. Extensive experiments demonstrate that our proposed STDD compares favorably with state-of-the-art methods on video generation/prediction as well as text-to-video generation.
Poster
Jingyi Xu · Siwei Tu · Weidong Yang · Ben Fei · Shuhao Li · Keyi Liu · Yeqi Luo · Lipeng Ma · Lei Bai

[ ExHall D ]

Abstract
Variation in Arctic sea ice has significant impacts on polar ecosystems, transport routes, coastal communities, and the global climate. Tracing the change of sea ice at a finer scale is paramount for both operational applications and scientific studies. Recent pan-Arctic sea ice forecasting methods that leverage advances in artificial intelligence have made promising progress over numerical models. However, forecasting sea ice at higher resolutions is still under-explored. To bridge the gap, we propose a two-module cooperative deep learning framework, IceDiff, to forecast sea ice concentration at finer scales. IceDiff first leverages a vision transformer to generate coarse yet superior forecasts compared with previous methods on a regular 25 km grid. This high-quality sea ice forecast is utilized as reliable guidance for the next module. Subsequently, an unconditional diffusion model pre-trained on low-resolution sea ice concentration maps is utilized to sample down-scaled sea ice forecasts via a zero-shot guided sampling strategy and a patch-based method. For the first time, IceDiff demonstrates sea ice forecasting at a 6.25 km resolution. IceDiff extends the boundary of existing sea ice forecasting models and, more importantly, its capability to generate high-resolution sea ice concentration data is vital for practical usage and research.
Poster
Xiaofeng Mao · Zhengkai Jiang · Fu-Yun Wang · Jiangning Zhang · Hao Chen · Mingmin Chi · Yabiao Wang · Wenhan Luo

[ ExHall D ]

Abstract
Video diffusion models have shown great potential in generating high-quality videos, making them an increasingly popular focus. However, their inherent iterative nature leads to substantial computational and time costs. Although techniques such as consistency distillation and adversarial training have been employed to accelerate video diffusion by reducing inference steps, these methods often simply transfer generation approaches from image diffusion models to video diffusion models. As a result, they frequently fall short in terms of both performance and training stability. In this work, we introduce a two-stage training framework that effectively combines consistency distillation with adversarial training to address these challenges. Additionally, we propose a novel video discriminator design, which eliminates the need to decode the video latents and improves the final performance. Our model is capable of producing high-quality videos in merely one step, with the flexibility to perform multi-step refinement for further performance enhancement. Our quantitative evaluation on the OpenWebVid-1M benchmark shows that our model significantly outperforms existing methods. Notably, our 1-step performance (FVD 171.15) exceeds the 8-step performance of the consistency-distillation-based method AnimateLCM (FVD 184.79) and approaches the 25-step performance of the advanced Stable Video Diffusion (FVD 156.94).
Poster
Dongnan Gui · Xun Guo · Wengang Zhou · Yan Lu

[ ExHall D ]

Abstract
Recent advances in image-to-video generation have enabled the animation of still images and offered pixel-level controllability. While these models hold great potential to transform single images into vivid and dynamic videos, they also carry risks of misuse that could impact privacy, security, and copyright protection. This paper proposes a novel approach that applies imperceptible perturbations to images to degrade the quality of the generated videos, thereby protecting images from misuse in white-box image-to-video diffusion models. Specifically, we formulate our approach as an adversarial attack, incorporating spatial, temporal, and diffusion attack modules. The spatial attack shifts image features from their original distribution to a lower-quality target distribution, reducing visual fidelity. The temporal attack disrupts coherent motion by interfering with the temporal attention maps that guide motion generation. To enhance the robustness of our approach across different models, we further propose a diffusion attack module leveraging a contrastive loss. Our approach can be easily integrated with mainstream diffusion-based I2V models. Extensive experiments on SVD, CogVideoX, and ControlNeXt demonstrate that our method significantly impairs generation quality in terms of visual clarity and motion consistency, while introducing only minimal artifacts to the images. To the best of our knowledge, we are the first to explore adversarial attacks …
Poster
Zhaolin Wan · Han Qin · Zhiyang Li · Xiaopeng Fan · Wangmeng Zuo · Debin Zhao

[ ExHall D ]

Abstract
Omnidirectional videos (ODVs) present distinct challenges for accurate audio-visual saliency prediction due to their immersive nature, which combines spatial audio with panoramic visuals to enhance the user experience. While auditory cues are crucial for guiding visual attention across the panoramic scene, the interaction between audio and visual stimuli in ODVs remains underexplored. Existing models primarily focus on spatiotemporal visual cues and treat audio signals separately from their spatial and temporal contexts, often leading to misalignments between audio and visual content and undermining temporal consistency across frames. To bridge these gaps, we propose a novel audio-induced saliency prediction model for ODVs that holistically integrates audio and visual inputs through a multi-modal encoder, an audio-visual interaction module, and an audio-visual transformer. Unlike conventional methods that isolate audio cue locations and attributes, our model employs a query-based framework, where learnable audio queries capture comprehensive audio-visual dependencies, thus enhancing saliency prediction by dynamically aligning with audio cues. Besides, we introduce a novel consistency loss to enforce temporal coherence in saliency regions across frames. Extensive experiments demonstrate that our model outperforms state-of-the-art methods in predicting audio-visual salient regions in ODVs, establishing its robustness and superior performance.
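The query-based design can be pictured with a small cross-attention block in which learnable audio queries gather context from concatenated audio-visual tokens; the dimensions and single-block form are assumptions for illustration, not the paper's architecture.

```python
# Illustrative sketch: learnable audio queries cross-attend over audio + visual tokens;
# their outputs could then condition the saliency decoder.
import torch
import torch.nn as nn

class AudioQueryBlock(nn.Module):
    def __init__(self, dim: int = 128, n_queries: int = 8, heads: int = 4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, av_tokens):                       # (B, N, dim) audio-visual tokens
        q = self.queries.unsqueeze(0).expand(av_tokens.size(0), -1, -1)
        out, _ = self.attn(q, av_tokens, av_tokens)     # queries gather audio-visual context
        return out                                      # (B, n_queries, dim)

ctx = AudioQueryBlock()(torch.randn(2, 196, 128))
```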
Poster
Zhiyuan Yan · Yandan Zhao · Shen Chen · Mingyi Guo · Xinghe Fu · Taiping Yao · Shouhong Ding · Yunsheng Wu · Li Yuan

[ ExHall D ]

Abstract
Three key challenges hinder the development of current deepfake video detection: (1) Temporal features can be complex and diverse: how can we identify general temporal artifacts to enhance model generalization? (2) Spatiotemporal models often lean heavily on one type of artifact and ignore the other: how can we ensure balanced learning from both? (3) Videos are naturally resource-intensive: how can we tackle efficiency without compromising accuracy? This paper attempts to tackle the three challenges jointly. First, inspired by the notable generality of using image-level blending data for image forgery detection, we investigate whether and how video-level blending can be effective in video. We then perform a thorough analysis and identify a previously underexplored temporal forgery artifact: Facial Feature Drift (FFD), which commonly exists across different forgeries. To reproduce FFD, we then propose a novel Video-level Blending data (VB), where VB is implemented by blending the original image and its warped version frame-by-frame, serving as a hard negative sample to mine more general artifacts. Second, we carefully design a lightweight Spatiotemporal Adapter (StA) to equip a pre-trained image model with the ability to capture both spatial and temporal features jointly and efficiently. StA is designed with two-stream 3D-Conv with varying kernel …
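A minimal sketch of the video-level blending idea for reproducing facial feature drift, assuming a simple temporally varying translation as the per-frame warp (the actual warping and blending recipe may differ):

```python
# Illustrative sketch: blend each frame with a slightly shifted copy of itself so facial
# features "drift" over time, yielding a hard negative clip with no real forgery.
import numpy as np
import cv2

def blend_with_warp(frame, t: int, alpha: float = 0.5, max_shift: float = 3.0):
    h, w = frame.shape[:2]
    dx = max_shift * np.sin(0.3 * t)                    # small, temporally varying shift
    dy = max_shift * np.cos(0.3 * t)
    M = np.float32([[1, 0, dx], [0, 1, dy]])
    warped = cv2.warpAffine(frame, M, (w, h), borderMode=cv2.BORDER_REFLECT)
    return cv2.addWeighted(frame, 1 - alpha, warped, alpha, 0)

clip = [np.random.randint(0, 255, (224, 224, 3), np.uint8) for _ in range(8)]
pseudo_fake = [blend_with_warp(f, t) for t, f in enumerate(clip)]
```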
Poster
Jingkai Wang · Jue Gong · Lin Zhang · Zheng Chen · Xing Liu · Hong Gu · Yutong Liu · Yulun Zhang · Xiaokang Yang

[ ExHall D ]

Abstract
Diffusion models have demonstrated impressive performance in face restoration. Yet, their multi-step inference process remains computationally intensive, limiting their applicability in real-world scenarios. Moreover, existing methods often struggle to generate face images that are harmonious, realistic, and consistent with the subject’s identity. In this work, we propose OSDFace, a novel one-step diffusion model for face restoration. Specifically, we propose a visual representation embedder (VRE) to better capture prior information and understand the input face. In VRE, low-quality faces are processed by a visual tokenizer and subsequently embedded with a vector-quantized dictionary to generate visual prompts. Additionally, we incorporate a facial identity loss derived from face recognition to further ensure identity consistency. We further employ a generative adversarial network (GAN) as a guidance model to encourage distribution alignment between the restored face and the ground truth. Experimental results demonstrate that OSDFace surpasses current state-of-the-art (SOTA) methods in both visual quality and quantitative metrics, generating high-fidelity, natural face images with high identity consistency. The code and model will be released soon.
Poster
Mengqiu XU · Kaixin Chen · Heng Guo · Yixiang Huang · Ming Wu · Zhenwei Shi · Chuang Zhang · Jun Guo

[ ExHall D ]

Abstract
Deep learning approaches for marine fog detection and forecasting have outperformed traditional methods, demonstrating significant scientific and practical importance. However, the limited availability of open-source datasets remains a major challenge. Existing datasets, often focused on a single region or satellite, restrict the ability to evaluate model performance across diverse conditions and hinder the exploration of intrinsic marine fog characteristics. To address these limitations, we introduce MFogHub, the first multi-regional and multi-satellite dataset to integrate annotated marine fog observations from 15 coastal fog-prone regions and six geostationary satellites, comprising over 68,000 high-resolution samples. By encompassing diverse regions and satellite perspectives, MFogHub facilitates rigorous evaluation of both detection and forecasting methods under varying conditions. Extensive experiments with 16 baseline models demonstrate that MFogHub can reveal generalization fluctuations due to regional and satellite discrepancies, while also serving as a valuable resource for the development of targeted and scalable fog prediction techniques. Through MFogHub, we aim to advance both the practical monitoring and scientific understanding of marine fog dynamics on a global scale. The dataset and code are available in the supplementary materials.
Poster
Qi Zang · Dong Zhao · Shuang Wang · Dou Quan · Licheng Jiao · Zhun Zhong

[ ExHall D ]

Abstract
Change detection (CD) holds significant implications for Earth observation, in which pseudo-changes between bitemporal images induced by imaging environmental factors are a key challenge. Existing methods mainly regard pseudo-changes as a kind of style shift and alleviate them by transforming bitemporal images into the same style using generative adversarial networks (GANs). Nevertheless, their efforts are limited by the complexity of optimizing GANs and the absence of guidance from physical properties. This paper finds that the spectrum transformation (ST) has the potential to mitigate pseudo-changes by aligning in the frequency domain, which carries style. However, the benefit of ST is largely constrained by two drawbacks: 1) limited transformation space and 2) inefficient parameter search. To address these limitations, we propose Feature Spectrum learning (FeaSpect), which adaptively eliminates pseudo-changes in the latent space. For drawback 1), FeaSpect directs the transformation towards style-aligned discriminative features via feature spectrum transformation (FST). For drawback 2), FeaSpect allows FST to be trainable, efficiently discovering optimal parameters via an extraction box with adaptive attention and an extraction box with learnable strides. Extensive experiments on challenging datasets demonstrate that our method remarkably outperforms existing methods and achieves a commendable trade-off between accuracy and efficiency. Importantly, our method can …
Poster
Yinghui Xing · Qu Li Tao · Shizhou Zhang · Di Xu · YingkunYang · Yanning Zhang

[ ExHall D ]

Abstract
Pansharpening aims at integrating complementary information from panchromatic and multispectral images. Available deep-learning-based pansharpening methods typically perform exceptionally well on particular satellite datasets. At the same time, it has been observed that these models also exhibit scene dependence; for example, if the majority of the training samples come from urban scenes, the model's performance may decline in river scenes. To address the domain gap produced by varying satellite sensors and distinct scenes, we propose a dual-granularity semantic-guided sparse routing diffusion model for general pansharpening. By utilizing large Vision Language Models (VLMs) in the field of geoscience, i.e., GeoChat, we introduce dual-granularity semantics to generate dynamic sparse routing scores for adaptation to different satellite sensors and scenes. This scene-level and region-level dual-granularity semantic information serves as guidance for dynamically activating specialized experts within the diffusion model. Extensive experiments on the WorldView-3, QuickBird, and GaoFen-2 datasets show the effectiveness of our proposed method. Notably, the proposed method outperforms the comparison approaches in adapting to new satellite sensors and scenes. The code will be available.
Poster
Jin-Liang Xiao · Ting-Zhu Huang · Liang-Jian Deng · Guang Lin · Zihan Cao · Chao Li · Qibin Zhao

[ ExHall D ]

Abstract
Hyperspectral pansharpening refers to fusing a panchromatic image (PAN) and a low-resolution hyperspectral image (LR-HSI) to obtain a high-resolution hyperspectral image (HR-HSI). Recently, guiding pre-trained diffusion models (DMs) has demonstrated significant potential in this area, leveraging their powerful representational abilities while avoiding complex training processes. However, these DMs are often trained on RGB images, making them not well-suited for pansharpening tasks and limited in adapting to hyperspectral images. In this work, we propose a novel guided diffusion scheme with zero-shot guidance and neural spatial-spectral decomposition (NSSD) to iteratively generate an RGB detail image and map it to the target HR-HSI. Specifically, the zero-shot guidance employs an auxiliary neural network, trained only with a PAN and an LR-HSI, to guide pre-trained DMs in generating the RGB detail image, informed by specific prior knowledge. Then, NSSD establishes a spectral mapping from the generated RGB detail image to the final HR-HSI. Extensive experiments are conducted on the Pavia, Washington DC, and Chikusei datasets to demonstrate that the proposed method significantly enhances the performance of DMs for hyperspectral pansharpening tasks, outperforming existing methods across multiple metrics and achieving improvements in visualization results.
Poster
Yuchen Wang · Hongyuan Wang · Lizhi Wang · Xin Wang · Lin Zhu · Wanxuan Lu · Hua Huang

[ ExHall D ]

Abstract
Existing single-image denoising algorithms often struggle to restore details when dealing with complex noisy images. The introduction of near-infrared (NIR) images offers new possibilities for RGB image denoising. However, due to the inconsistency between NIR and RGB images, existing works still struggle to balance the contributions of the two fields during image fusion. To this end, we develop a cross-field Frequency Correlation Exploiting Network (FCENet) for NIR-assisted image denoising. We first propose the frequency correlation prior based on an in-depth statistical frequency analysis of NIR-RGB image pairs. The prior reveals the complementary correlation of NIR and RGB images in the frequency domain. Leveraging the frequency correlation prior, we then establish a frequency learning framework composed of a Frequency Dynamic Selection Mechanism (FDSM) and a Frequency Exhaustive Fusion Mechanism (FEFM). FDSM dynamically selects complementary information from NIR and RGB images in the frequency domain, and FEFM strengthens the control of common and differential features during the fusion of NIR and RGB features. Extensive experiments on simulated and real data validate that our method outperforms various state-of-the-art (SOTA) methods in terms of image quality and computational efficiency. The code is available at https://github.com/11679-hub/11679.
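An assumed, simplified form of the frequency-domain selection: NIR and RGB features are combined per frequency with a soft mask after an FFT and mapped back to the spatial domain (the actual FDSM is learned and more elaborate than this sketch):

```python
# Illustrative sketch: per-frequency soft mixing of NIR and RGB features via FFT/IFFT.
import torch

def frequency_select(rgb_feat, nir_feat, alpha):
    # rgb_feat, nir_feat: (B, C, H, W) features; alpha: (1, C, H, W) soft mask in [0, 1]
    Frgb = torch.fft.fft2(rgb_feat)
    Fnir = torch.fft.fft2(nir_feat)
    fused = alpha * Frgb + (1.0 - alpha) * Fnir          # complementary per-frequency mix
    return torch.fft.ifft2(fused).real

alpha = torch.sigmoid(torch.randn(1, 8, 32, 32))         # stands in for a learned mask
out = frequency_select(torch.randn(2, 8, 32, 32), torch.randn(2, 8, 32, 32), alpha)
```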
Poster
Ning Ni · Libao Zhang

[ ExHall D ]

Abstract
Currently, the demand for higher video quality has grown significantly. However, satellite video has low resolution, complex motion, and weak textures. Haze interference further exacerbates the loss of motion information and texture details, hindering effective spatiotemporal feature fusion and fine-grained feature mining. This presents significant challenges for subsequent super-resolution (SR) reconstruction, especially at continuous scales. To address these problems, this paper models the double-degradation process of hazy low-quality satellite videos and proposes a novel network to learn the optimal joint degradation pattern (ODPNet) for continuous-scale SR of hazy satellite videos. First, we design a prior-based feature soft dehazing module to eliminate haze interference at the feature level. Second, we develop a spatiotemporal self-attention (SSA) to capture long-range feature dependencies, thereby achieving effective spatiotemporal feature fusion. Third, we devise a tri-branch cross-aggregation block (TCB) to enhance feature representations of weak textures in satellite videos by effectively aggregating contextual information. Finally, we propose a cross-scale feature Top-k selection Transformer (CFTST), which aims to adaptively select and aggregate cross-scale latent codes to learn feature representations of satellite videos at arbitrary resolutions, thus enabling continuous-scale SR. Experiments show that ODPNet outperforms existing methods and achieves a better balance between model parameters and performance.
Poster
Jiayi Fu · Siyu Liu · Zikun Liu · Chun-Le Guo · Hyunhee Park · Rui-Qi Wu · Guoqing Wang · Chongyi Li

[ ExHall D ]

Abstract
We propose a novel real-world image dehazing method, abbreviated as IPC-Dehaze, that leverages the high-quality codebook prior encapsulated in a pre-trained VQGAN. Unlike previous codebook-based methods that rely on one-shot decoding, our method utilizes high-quality codes obtained in the previous iteration to guide the prediction of the Code-Predictor in the subsequent iteration, improving code prediction accuracy and ensuring stable dehazing performance. Our idea stems from the observations that 1) the degradation of hazy images varies with haze density and scene depth, and 2) clear regions provide crucial cues for restoring dense haze regions. However, it is nontrivial to progressively refine the obtained codes in subsequent iterations, owing to the difficulty in determining which codes should be retained or replaced at each iteration. Another key insight of our study is to propose a Code-Critic to capture interrelations among codes. The Code-Critic is used to evaluate code correlations and then resample a set of codes with the highest mask scores, i.e., a higher score indicates that the code is more likely to be rejected, which helps retain more accurate codes and predict difficult ones. Extensive experiments demonstrate the superiority of our method over state-of-the-art methods in real-world dehazing. Our code will be …
Poster
Lingshun Kong · Jiangxin Dong · Jinhui Tang · Ming-Hsuan Yang · Jinshan Pan

[ ExHall D ]

Abstract
Convolutional neural networks (CNNs) and Vision Transformers (ViTs) have achieved excellent performance in image restoration. While ViTs generally outperform CNNs by effectively capturing long-range dependencies and input-specific characteristics, their computational complexity increases quadratically with image resolution. This limitation hampers their practical application in high-resolution image restoration. In this paper, we propose a simple yet effective visual state space model (EVSSM) for image deblurring, leveraging the benefits of state space models (SSMs) for visual data. In contrast to existing methods that employ several fixed-direction scans for feature extraction, which significantly increases the computational cost, we develop an efficient visual scan block that applies various geometric transformations before each SSM-based module, capturing useful non-local information while maintaining high efficiency. In addition, to more effectively capture and represent local information, we propose an efficient discriminative frequency domain-based feedforward network (EDFFN) which can effectively estimate useful frequency information for latent clear image restoration. Extensive experimental results show that the proposed EVSSM performs favorably against state-of-the-art methods on benchmark datasets and real-world images.
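The efficient-scan idea, applying a cheap geometric transformation before each scan-based block instead of running multiple fixed-direction scans, can be illustrated generically as below. The scan module here is a placeholder GRU, not the paper's SSM block, and the transform schedule is an assumption.

```python
import torch
import torch.nn as nn

def geometric_transform(x, idx):
    """Apply one of several cheap geometric transforms to a (B, C, H, W) feature map.
    Cycling through transforms across blocks lets a single fixed-direction scan
    see the image from different orientations (a sketch of the idea only)."""
    if idx % 4 == 1:
        return torch.flip(x, dims=[-1])          # horizontal flip
    if idx % 4 == 2:
        return torch.flip(x, dims=[-2])          # vertical flip
    if idx % 4 == 3:
        return x.transpose(-1, -2)               # transpose H and W (square inputs here)
    return x

class ScanBlock(nn.Module):
    """Placeholder for an SSM-based module: a GRU over the flattened sequence."""
    def __init__(self, channels):
        super().__init__()
        self.rnn = nn.GRU(channels, channels, batch_first=True)

    def forward(self, x):
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)       # (B, H*W, C), single scan direction
        out, _ = self.rnn(seq)
        return out.transpose(1, 2).reshape(b, c, h, w)

x = torch.randn(1, 16, 8, 8)
blocks = nn.ModuleList([ScanBlock(16) for _ in range(4)])
for i, block in enumerate(blocks):
    x = geometric_transform(x, i)                # vary orientation instead of adding scans
    x = block(x)
print(x.shape)
```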
Poster
Hanze Liu · Jiahong Fu · Qi Xie · Deyu Meng

[ ExHall D ]

Abstract
Self-supervised image denoising methods have garnered significant research attention in recent years, as they reduce the requirement for large training datasets. Compared to supervised methods, self-supervised methods rely more on the prior embedded in the deep networks themselves. As a result, most self-supervised methods are designed with Convolutional Neural Network (CNN) architectures, which capture one of the most important image priors, the translation-equivariance prior. Inspired by the great success achieved by the introduction of translational equivariance, in this paper we explore ways to further incorporate another important image prior. Specifically, we first apply high-accuracy rotation-equivariant convolution to self-supervised image denoising. Through rigorous theoretical analysis, we prove that simply replacing all the convolution layers with rotation-equivariant convolution layers modifies the network into its rotation-equivariant version. To the best of our knowledge, this is the first time that the rotation-equivariant image prior is introduced to self-supervised image denoising at the network architecture level with a comprehensive theoretical analysis of equivariance errors, which offers a new perspective to the field of self-supervised image denoising. Moreover, to further improve the performance, we design a new mask mechanism to fuse the output of the rotation-equivariant network and a vanilla CNN-based …
Poster
Xuyi He · Yuhui Quan · Ruotao Xu · Hui Ji

[ ExHall D ]

Abstract
Structured artifacts are semi-regular, repetitive patterns that closely intertwine with genuine image content, making their removal highly challenging. In this paper, we introduce the Scale-Adaptive Deformable Transformer, a network architecture specifically designed to eliminate such artifacts from images. The proposed network features two key components: a scale-enhanced deformable convolution module for modeling local patterns with varying sizes, orientations, and distortions, and a scale-adaptive deformable attention mechanism for capturing long-range relationships among repetitive patterns with different sizes and non-uniform spatial distributions. Extensive experiments show that our network consistently outperforms state-of-the-art methods on several structured artifact removal tasks, including image deraining, image demoiréing, and image debanding.
Poster
Du CHEN · Tianhe Wu · Kede Ma · Lei Zhang

[ ExHall D ]

Abstract
Most full-reference image quality assessment (FR-IQA) models assume that the reference image is of perfect quality. However, this assumption is flawed because many reference images in existing IQA datasets are of subpar quality. Moreover, recent generative image enhancement methods are capable of producing images of higher quality than their original counterparts. These factors challenge the effectiveness and applicability of current FR-IQA models. To address this limitation, we build a large-scale IQA database, namely DiffIQA, which contains approximately 180,000 images generated by a diffusion-based image enhancer with adjustable hyper-parameters. Each image is annotated by human subjects as either worse, similar, or better quality compared to its reference. Building on this, we present a generalized FR-IQA model, namely Adaptive FIdelity-Naturalness Evaluator (A-FINE), to accurately assess and adaptively combine the fidelity and naturalness of the test image. A-FINE aligns well with standard FR-IQA when the reference image is much more natural than the test image. We demonstrate by extensive experiments that A-FINE surpasses existing FR-IQA models on well-established IQA datasets and our newly created DiffIQA. To further validate A-FINE, we additionally construct a super-resolution IQA benchmark (SRIQA-Bench), encompassing test images derived from ten state-of-the-art SR methods with reliable human quality annotations. Tests on …
Poster
Eduard Zamfir · Zongwei Wu · Nancy Mehta · Yuedong Tan · Danda Paudel · Yulun Zhang · Radu Timofte

[ ExHall D ]

Abstract
Recent advancements in all-in-one image restoration models have revolutionized the ability to address diverse degradations through a unified framework. However, parameters tied to specific tasks often remain inactive for other tasks, making mixture-of-experts (MoE) architectures a natural extension. Despite this, MoEs often show inconsistent behavior, with some experts unexpectedly generalizing across tasks while others struggle within their intended scope. This hinders leveraging MoEs' computational benefits by bypassing irrelevant experts during inference. We attribute this undesired behavior to the uniform and rigid architecture of traditional MoEs. To address this, we introduce "complexity experts" -- flexible expert blocks with varying computational complexity and receptive fields. A key challenge is assigning tasks to each expert, as degradation complexity is unknown in advance. Thus, we execute tasks with a simple bias toward lower complexity. To our surprise, this preference effectively drives task-specific allocation, assigning tasks to experts with the appropriate complexity. Extensive experiments validate our approach, demonstrating the ability to bypass irrelevant experts during inference while maintaining superior performance. The proposed MoCE-IR model outperforms state-of-the-art methods, affirming its efficiency and practical applicability. The source code will be made available upon acceptance.
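A minimal way to picture the "bias toward lower complexity" in routing is to subtract a cost-proportional penalty from the gating logits, as in the hypothetical sketch below; the costs and penalty strength are illustrative values, not settings from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ComplexityBiasedGate(nn.Module):
    """Sketch of routing with a simple bias toward lower-complexity experts.
    Expert costs and the bias strength are hypothetical, not the paper's values."""

    def __init__(self, dim, expert_costs, bias_strength=0.1):
        super().__init__()
        self.gate = nn.Linear(dim, len(expert_costs))
        # Higher-cost experts receive a larger penalty on their routing logits.
        self.register_buffer("cost_penalty", bias_strength * torch.tensor(expert_costs))

    def forward(self, x):
        logits = self.gate(x) - self.cost_penalty     # bias toward cheaper experts
        return F.softmax(logits, dim=-1)

gate = ComplexityBiasedGate(dim=64, expert_costs=[1.0, 2.0, 4.0, 8.0])
probs = gate(torch.randn(5, 64))
print(probs.sum(dim=-1))   # each row sums to 1
```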
Poster
Haina Qin · Wenyang Luo · Zewen Chen · Yufan Liu · Bing Li · Weiming Hu · libin wang · DanDan Zheng · Yuming Li

[ ExHall D ]

Abstract
Image restoration tasks, such as deblurring, denoising, and dehazing, typically require separate models for each degradation type, limiting their generalization in real-world scenarios where mixed or unknown degradations may occur. In this work, we propose \textbf{Defusion}, a novel all-in-one image restoration framework that utilizes visual instruction-guided degradation diffusion. Unlike existing methods that rely on task-specific models or ambiguous text-based priors, Defusion constructs explicit \textbf{visual instructions} that align with the visual degradation patterns. These instructions are grounded by applying degradations to standardized visual elements, capturing intrinsic degradation features while remaining agnostic to image semantics. Defusion then uses these visual instructions to guide a diffusion-based model that operates directly in the degradation space, where it reconstructs high-quality images by denoising the degradation effects with enhanced stability and generalizability. Comprehensive experiments demonstrate that Defusion outperforms state-of-the-art methods across diverse image restoration tasks, including complex and real-world degradations.
Poster
Zhu Li Bo · Jianze Li · Haotong Qin · Wenbo Li · Yulun Zhang · Yong Guo · Xiaokang Yang

[ ExHall D ]

Abstract
Diffusion-based image super-resolution (SR) models have shown superior performance at the cost of multiple denoising steps. However, even when denoising is reduced to a single step, they still incur high computational costs and storage requirements, making deployment on hardware devices difficult. To address these issues, we propose PassionSR, a novel post-training quantization approach with adaptive scale for one-step diffusion (OSD) image SR. First, we simplify the OSD model to two core components, UNet and Variational Autoencoder (VAE), by removing the CLIPEncoder. Second, we propose a Learnable Boundary Quantizer (LBQ) and Learnable Equivalent Transformation (LET) to optimize the quantization process and manipulate activation distributions for better quantization. Finally, we design a Distributed Quantization Calibration (DQC) strategy that stabilizes the training of quantized parameters for rapid convergence. Comprehensive experiments demonstrate that PassionSR at 8-bit and 6-bit obtains visual results comparable to the full-precision model. Moreover, PassionSR achieves significant advantages over recent leading low-bit quantization methods for image SR. Our code will be released.
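A learnable-boundary quantizer can be sketched as uniform quantization between trainable clipping bounds with a straight-through estimator, as below. This is a generic sketch under that assumption; the paper's exact LBQ/LET formulations may differ.

```python
import torch
import torch.nn as nn

class LearnableBoundaryQuantizer(nn.Module):
    """Minimal sketch of a uniform quantizer with learnable clipping boundaries and a
    straight-through estimator (STE); the paper's LBQ formulation may differ."""

    def __init__(self, n_bits=8, init_bound=4.0):
        super().__init__()
        self.n_levels = 2 ** n_bits - 1
        self.lower = nn.Parameter(torch.tensor(-init_bound))
        self.upper = nn.Parameter(torch.tensor(init_bound))

    def forward(self, x):
        lo, hi = self.lower, self.upper
        step = (hi - lo).clamp(min=1e-6) / self.n_levels
        x_clip = torch.maximum(torch.minimum(x, hi), lo)            # learnable boundaries
        x_quant = torch.round((x_clip - lo) / step) * step + lo
        # STE: forward uses quantized values, backward passes gradients through the clip.
        return x_clip + (x_quant - x_clip).detach()

quant = LearnableBoundaryQuantizer(n_bits=6)
act = torch.randn(4, 16, requires_grad=True)
out = quant(act)
out.sum().backward()          # gradients reach both the activations and the boundaries
print(out.shape, quant.lower.grad is not None)
```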
Poster
Isma Hadji · Mehdi Noroozi · Victor Escorcia · Anestis Zaganidis · Brais Martinez · Georgios Tzimiropoulos

[ ExHall D ]

Abstract
There has been immense progress recently in the visual quality of Stable Diffusion-based Super Resolution (SD-SR). However, deploying large diffusion models on computationally restricted devices such as mobile phones remains impractical due to the large model size and high latency. This is compounded for SR as it often operates at high res (e.g. 4K×3K). In this work, we introduce Edge-SD-SR, the first parameter efficient and low latency diffusion model for image super-resolution. Edge-SD-SR consists of ∼ 169M parameters, including UNet, encoder and decoder, and has a complexity of only ∼ 142 GFLOPs. To maintain a high visual quality on such low compute budget, we introduce a number of training strategies: (i) A novel conditioning mechanism on the low-resolution input, coined bidirectional conditioning, which tailors the SD model for the SR task. (ii) Joint training of the UNet and encoder, while decoupling the encodings of the HR and LR images and using a dedicated schedule. (iii) Finetuning the decoder using the UNet’s output to directly tailor the decoder to the latents obtained at inference time. Edge-SD-SR runs efficiently on device, e.g. it can upscale a 128×128 patch to 512×512 in 38 msec while running on a Samsung S24 DSP, and of …
Poster
Feiyang Shen · Hongping Gan

[ ExHall D ]

Abstract
Deep Unfolding Networks (DUNs) have risen to prominence due to their interpretability and superior performance for image Compressive Sensing (CS). However, existing DUNs still face significant issues, such as the insufficient representation capability of single-scale image information during the iterative reconstruction phase and loss of feature information, which fundamentally limit the further enhancement of image CS performance. In this paper, we propose the Homotopy Unfolding Network (HUNet) for image CS, which enables phase-by-phase reconstruction of images along a homotopy path. Specifically, each iteration step of the traditional homotopy algorithm is mapped to a Multi-scale Homotopy Iterative Module (MHIM), which includes U-shaped stacked Window-based Transformer Blocks capable of efficient feature extraction. Within the MHIM, we design the Deep Homotopy Continuation Strategy to ensure the interpretability of the homotopy algorithm and facilitate feature learning. Additionally, we introduce a Dual-path Feature Fusion Module to mitigate the loss of high-dimensional feature information during the transmission between iterative phases, thereby maximizing the preservation of details in the reconstructed image. Extensive experiments indicate that HUNet achieves superior image reconstruction results compared to existing state-of-the-art methods.
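For readers unfamiliar with deep unfolding, the generic pattern is to map each step of an iterative optimizer to a learnable phase (data-consistency gradient step plus a learned proximal mapping). The sketch below shows that general pattern only; it does not reproduce HUNet's homotopy iterations or its multi-scale Transformer modules, and the sensing setup is a toy assumption.

```python
import torch
import torch.nn as nn

class UnfoldedCS(nn.Module):
    """Generic deep-unfolding sketch for compressive sensing: each phase performs a
    gradient step toward data consistency followed by a learned proximal mapping."""

    def __init__(self, n, m, phases=5):
        super().__init__()
        # Fixed random sensing matrix Phi (toy assumption).
        self.phi = nn.Parameter(torch.randn(m, n) / n ** 0.5, requires_grad=False)
        self.steps = nn.Parameter(torch.full((phases,), 0.5))        # learned step sizes
        self.prox = nn.ModuleList(
            [nn.Sequential(nn.Linear(n, n), nn.ReLU(), nn.Linear(n, n)) for _ in range(phases)]
        )

    def forward(self, y):
        # y: (B, m) measurements; recover x: (B, n)
        x = y @ self.phi                                  # simple initialization Phi^T y
        for step, prox in zip(self.steps, self.prox):
            grad = (x @ self.phi.t() - y) @ self.phi      # gradient of 0.5 * ||Phi x - y||^2
            x = prox(x - step * grad)                     # gradient step + learned proximal map
        return x

model = UnfoldedCS(n=256, m=64)
y = torch.randn(2, 64)
print(model(y).shape)   # torch.Size([2, 256])
```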
Poster
Dehong Kong · Fan Li · Zhixin Wang · Jiaqi Xu · Renjing Pei · Wenbo Li · Wenqi Ren

[ ExHall D ]

Abstract
Recent state-of-the-art image restoration methods mostly adopt latent diffusion models with U-Net backbones, yet they still face challenges in achieving high-quality restoration due to their limited capabilities. Diffusion transformers (DiTs), like SD3, are emerging as a promising alternative because of their better quality and scalability. However, previous conditional control methods for U-Net-based diffusion models, such as ControlNet, are not well suited for DiTs. In this paper, we introduce DPIR (Dual Prompting Image Restoration), a novel DiT-based image restoration method that effectively extracts conditional information of low-quality images from multiple perspectives. Specifically, DPIR consists of two branches, a low-quality image prior conditioning branch and a dual prompting control branch, which inject conditional information into the DiT with high training efficiency. More importantly, we believe that in image restoration, the image's textual description alone cannot fully capture its rich visual characteristics. Therefore, a dual prompting module is designed to provide the DiT with additional visual cues, capturing both global context and local appearance. The extracted global-local visual prompts, used as extra conditional control together with text prompts, greatly enhance the quality and fidelity of the restoration. Extensive experimental results demonstrate that DPIR delivers superior image restoration performance with broad applicability.
Poster
Jiaming Liu · Qi Zheng · Zihao Liu · Yilian Zhong · Peiye Liu · Tao Liu · Shusong Xu · Yanheng Lu · Sicheng Li · Dimin Niu · Yibo Fan

[ ExHall D ]

Abstract
Compression artifacts removal (CAR), an effective post-processing method to reduce compression distortion in edge-side codecs, demonstrates remarkable results when convolutional neural networks (CNNs) are deployed on the high-computational-power cloud side. Traditional image compression reduces redundancy in the frequency domain, and we observe that CNNs also exhibit a frequency-domain bias when handling compression distortions. However, no prior research leverages this frequency bias to design compression methods tailored to CAR CNNs, or vice versa. In this paper, we present a synergistic design that bridges the gap between image compression and learnable compensation for CAR. Our investigation reveals that different compensation networks have varying effects on low and high frequencies. Building upon these insights, we propose a pioneering redesign of the quantization process, a fundamental component in lossy image compression, to more effectively compress low-frequency information. Additionally, we devise a novel compensation framework that applies different neural networks for reconstructing different frequencies, incorporating a basis attention block to prioritize intentionally dropped low-frequency information, thereby enhancing the overall compensation. We instantiate two compensation networks based on this synergistic design and conduct extensive experiments on three image compression standards, demonstrating that our approach significantly reduces bitrate consumption while delivering high perceptual quality.
Poster
Beilin Chu · Xuan Xu · Xin Wang · Yufei Zhang · Weike You · Linna Zhou

[ ExHall D ]

Abstract
The rapid advancement of diffusion models has significantly improved high-quality image generation, making generated content increasingly challenging to distinguish from real images and raising concerns about potential misuse. In this paper, we observe that diffusion models struggle to accurately reconstruct mid-band frequency information in real images, suggesting the limitation could serve as a cue for detecting diffusion model generated images. Motivated by this observation, we propose a novel method called Frequency-guIded Reconstruction Error (FIRE), which, to the best of our knowledge, is the first to investigate the influence of frequency decomposition on reconstruction error. FIRE assesses the variation in reconstruction error before and after the frequency decomposition, offering a robust method for identifying diffusion model generated images. Extensive experiments show that FIRE generalizes effectively to unseen diffusion models and maintains robustness against diverse perturbations.
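The core measurement, comparing reconstruction error before and after a frequency decomposition, can be sketched with a simple mid-band annulus filter as below. The band limits and the stand-in "reconstruction" are assumptions for illustration, not the paper's settings.

```python
import torch

def band_mask(h, w, low_frac=0.1, high_frac=0.5, device="cpu"):
    """Boolean mask selecting a mid-frequency annulus in the centered 2-D spectrum.
    The band fractions are illustrative choices."""
    yy, xx = torch.meshgrid(
        torch.linspace(-1, 1, h, device=device),
        torch.linspace(-1, 1, w, device=device),
        indexing="ij",
    )
    radius = torch.sqrt(xx ** 2 + yy ** 2)
    return (radius >= low_frac) & (radius <= high_frac)

def midband_component(img):
    """Keep only mid-band frequencies of an image tensor (B, C, H, W)."""
    spec = torch.fft.fftshift(torch.fft.fft2(img, norm="ortho"), dim=(-2, -1))
    spec = spec * band_mask(img.shape[-2], img.shape[-1], device=img.device)
    return torch.fft.ifft2(torch.fft.ifftshift(spec, dim=(-2, -1)), norm="ortho").real

def frequency_guided_error(img, recon):
    """Compare reconstruction error on the full image vs. its mid-band component.
    A disproportionately large mid-band error is used as a cue that the image may
    be diffusion-generated (sketch of the idea only)."""
    full_err = (img - recon).abs().mean()
    mid_err = (midband_component(img) - midband_component(recon)).abs().mean()
    return full_err, mid_err

image = torch.rand(1, 3, 64, 64)
reconstruction = image + 0.05 * torch.randn_like(image)   # stand-in for a DM reconstruction
print(frequency_guided_error(image, reconstruction))
```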
Poster
Huayuan Ye · Shenzhuo Zhang · Shiqi Jiang · Jing Liao · Shuhang Gu · Dejun Zheng · Changbo Wang · Chenhui Li

[ ExHall D ]

Abstract
Image steganography can hide information in a host image and obtain a stego image that is perceptually indistinguishable from the original one. This technique has tremendous potential in scenarios like copyright protection, information retrospection, etc. Some previous studies have proposed to enhance the robustness of the methods against image disturbances to increase their applicability. However, they generally cannot achieve a satisfying balance between steganography quality and robustness. Instead of image-in-image steganography, we focus on the issue of message-in-image embedding that is robust to various real-world image distortions. This task aims to embed information into a natural image, and the decoding result is required to be completely accurate, which increases the difficulty of data concealing and revealing. Inspired by the recent developments in transformer-based vision models, we discover that the tokenized representation of an image is naturally suitable for the steganography task. In this paper, we propose a novel message embedding framework, called **R**obust **M**essage **Steg**anography (RMSteg), which is competent to hide a message via QR Code in a host image based on a normalizing flow-based model. The stego image derived by our method has imperceptible changes and the encoded message can be accurately restored even if the image is printed out and …
Poster
Jingbo Lu · Leheng Zhang · Xingyu Zhou · Mu Li · Wen Li · Shuhang Gu

[ ExHall D ]

Abstract
Learned image compression methods have attracted great research interest and exhibited superior rate-distortion performance to the best classical image compression standards available today. The entropy model plays a key role in learned image compression: it estimates the probability distribution of the latent representation for subsequent entropy coding. Most existing methods employ hyper-prior and auto-regressive architectures to form their entropy models. However, they only aim to explore the internal dependencies of the latent representation while neglecting the importance of extracting priors from the training data. In this work, we propose a novel entropy model named the Dictionary-based Cross Attention Entropy model, which introduces a learnable dictionary that summarizes the typical structures occurring in the training dataset to enhance the entropy model. Extensive experimental results demonstrate that the proposed model strikes a better balance between performance and latency, achieving state-of-the-art results on various benchmark datasets.
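A dictionary-enhanced entropy model can be pictured as latent tokens cross-attending to a learnable dictionary whose output conditions the Gaussian entropy parameters. The sketch below is a generic reading of that idea with assumed sizes and an assumed parameter head, not the paper's architecture.

```python
import torch
import torch.nn as nn

class DictionaryCrossAttention(nn.Module):
    """Sketch: latent tokens query a learnable dictionary of typical structures via
    cross-attention, and the result conditions the entropy parameters."""

    def __init__(self, dim=64, dict_size=128):
        super().__init__()
        self.dictionary = nn.Parameter(torch.randn(dict_size, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.to_params = nn.Linear(dim, 2 * dim)      # predicts per-channel mean and scale

    def forward(self, latent_tokens):
        # latent_tokens: (B, N, dim) flattened latent representation
        b = latent_tokens.size(0)
        dict_kv = self.dictionary.unsqueeze(0).expand(b, -1, -1)
        ctx, _ = self.attn(latent_tokens, dict_kv, dict_kv)   # query=latent, key/value=dictionary
        mean, log_scale = self.to_params(ctx).chunk(2, dim=-1)
        return mean, log_scale.exp()                          # Gaussian entropy parameters

model = DictionaryCrossAttention()
tokens = torch.randn(2, 100, 64)
mu, sigma = model(tokens)
print(mu.shape, sigma.shape)   # both torch.Size([2, 100, 64])
```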
Poster
Weinan Jia · Mengqi Huang · Nan Chen · Lei Zhang · Zhendong Mao

[ ExHall D ]

Abstract
Diffusion models are widely recognized for their ability to generate high-fidelity images. Despite the excellent performance and scalability of the Diffusion Transformer (DiT) architecture, it applies fixed compression across different image regions during the diffusion process, disregarding the naturally varying information densities present in these regions. However, large compression leads to limited local realism, while small compression increases computational complexity and compromises global consistency, ultimately impacting the quality of generated images. To address these limitations, we propose dynamically compressing different image regions by recognizing their importance, and introduce a novel two-stage framework designed to enhance the effectiveness and efficiency of image generation: (1) Dynamic VAE (DVAE) at the first stage employs a hierarchical encoder to encode different image regions at different downsampling rates, tailored to their specific information densities, thereby providing more accurate and natural latent codes for the diffusion process. (2) Dynamic Diffusion Transformer (D2iT) at the second stage generates images by predicting multi-grained noise, consisting of coarse-grained (fewer latent codes in smooth regions) and fine-grained (more latent codes in detailed regions), through an innovative combination of the Dynamic Grain Transformer and the Dynamic Content Transformer. The strategy of combining rough prediction of noise with fine-grained regions correction …
Poster
Anubhav Jain · Yuya Kobayashi · Takashi Shibuya · Yuhta Takida · Nasir Memon · Julian Togelius · Yuki Mitsufuji

[ ExHall D ]

Abstract
Diffusion models are prone to exactly reproduce images from the training data. This exact reproduction of the training data is concerning as it can lead to copyright infringement and/or leakage of privacy-sensitive information. In this paper, we present a novel way to understand the memorization phenomenon, and propose a simple yet effective approach to mitigate memorization. We argue that memorization occurs because of an attraction basin in the denoising process which steers the diffusion trajectory towards a memorized image. However, this can be mitigated by guiding the diffusion trajectory away from the attraction basin by not applying classifier-free guidance until an ideal transition point occurs. This leads to the generation of non-memorized images that are high in image quality and well aligned with the conditioning mechanism. To further improve on this, we present a new guidance technique, opposite guidance, that escapes the attraction basin sooner in the denoising process. We demonstrate the existence of attraction basins in various scenarios in which memorization occurs, and we show that our proposed approach successfully mitigates memorization.
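The mitigation strategy, withholding classifier-free guidance until a transition point in the denoising trajectory, can be sketched as a sampling loop like the one below. The denoiser interface, the Euler-style update, and the transition step are all hypothetical stand-ins, not the paper's exact procedure.

```python
import torch

def sample_with_delayed_cfg(model, cond, steps=50, transition=15, guidance_scale=7.5,
                            shape=(1, 4, 64, 64)):
    """Sketch: classifier-free guidance is withheld for the first `transition` steps,
    so the trajectory can drift away from a memorization attraction basin before
    conditioning kicks in. `model(x, t, cond)` is a hypothetical denoiser interface."""
    x = torch.randn(shape)
    for i, t in enumerate(torch.linspace(1.0, 0.0, steps)):
        eps_uncond = model(x, t, None)
        if i < transition:
            eps = eps_uncond                      # unconditional: no CFG yet
        else:
            eps_cond = model(x, t, cond)
            eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
        x = x - (1.0 / steps) * eps               # placeholder Euler-style update
    return x

# Toy denoiser standing in for a real diffusion model.
toy_model = lambda x, t, c: 0.1 * x if c is None else 0.1 * x + 0.01
print(sample_with_delayed_cfg(toy_model, cond="a photo of a dog").shape)
```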
Poster
Lei Wang · Senmao Li · Fei Yang · Jianye Wang · Ziheng Zhang · Yuhan Liu · Yaxing Wang · Jian Yang

[ ExHall D ]

Abstract
Diffusion models, in their early stages, focus on constructing basic image structures, while refined details, including local features and textures, are generated in later stages. Thus the same network layers are forced to learn both structural and textural information simultaneously, which differs significantly from traditional deep learning architectures (e.g., ResNet or GANs) that capture or generate image semantic information at different layers. This difference inspires us to explore time-wise diffusion models. We initially investigate the key contributions of the U-Net parameters to the denoising process and identify that properly zeroing out certain parameters (including large parameters) contributes to denoising, substantially improving generation quality on the fly. Capitalizing on this discovery, we propose a simple yet effective method, termed “MaskUNet”, that enhances generation quality with a negligible number of additional parameters. Our method fully leverages timestep- and sample-dependent effective U-Net parameters. To optimize MaskUNet, we offer two fine-tuning strategies: a training-based approach and a training-free approach, including tailored networks and optimization functions. In zero-shot inference on the COCO dataset, MaskUNet achieves the best FID score and further demonstrates its effectiveness in downstream task evaluations.
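The basic operation, zeroing out a subset of a layer's weights at inference time, can be illustrated with the toy sketch below. Here the mask is random purely for demonstration, whereas MaskUNet uses timestep- and sample-dependent masks; the helper name and keep ratio are hypothetical.

```python
import torch
import torch.nn as nn

def apply_parameter_mask(module, keep_ratio=0.9, generator=None):
    """Toy sketch: zero out a random subset of a module's weight entries.
    MaskUNet instead selects timestep- and sample-dependent masks."""
    with torch.no_grad():
        for name, param in module.named_parameters():
            if param.dim() < 2:          # skip biases/norm parameters in this toy example
                continue
            mask = (torch.rand(param.shape, generator=generator) < keep_ratio).to(param.dtype)
            param.mul_(mask)             # zero out the masked entries in place

layer = nn.Linear(16, 16)
apply_parameter_mask(layer, keep_ratio=0.8)
zeroed = (layer.weight == 0).float().mean().item()
print(f"fraction of weights zeroed: {zeroed:.2f}")
```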
Poster
Hui Zhang · Tingwei Gao · Jie Shao · Zuxuan Wu

[ ExHall D ]

Abstract
Diffusion models have demonstrated impressive generation capabilities, particularly with recent advancements leveraging transformer architectures to improve both visual and artistic quality. However, Diffusion Transformers (DiTs) continue to encounter challenges related to low inference speed, primarily due to the iterative denoising process. To address this issue, we propose BlockDance, a training-free approach that explores feature similarities at adjacent time steps to accelerate DiTs. Unlike previous feature-reuse methods that lack tailored reuse strategies for features at different scales, BlockDance prioritizes the identification of the most structurally similar features, referred to as Structurally Similar Spatio-Temporal (STSS) features. These features are primarily located within the structure-focused blocks of the transformer during the later stages of denoising. BlockDance caches and reuses these highly similar features to mitigate redundant computation, thereby accelerating DiTs while maximizing consistency with the generated results of the original model. Furthermore, considering the diversity of generated content and the varying distributions of redundant features, we introduce BlockDance-Ada, a lightweight decision-making network tailored for instance-specific acceleration. BlockDance-Ada dynamically allocates resources and provides superior content quality. Both BlockDance and BlockDance-Ada have proven effective across various generation tasks and models, achieving accelerations between 25% and 50% while maintaining generation quality.
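Caching and reusing features at adjacent time steps can be illustrated with a wrapper that recomputes a block only every few denoising steps. The fixed-interval policy below is purely illustrative; BlockDance instead identifies which features are structurally similar enough to reuse, and BlockDance-Ada learns an instance-specific policy.

```python
import torch
import torch.nn as nn

class CachedBlock(nn.Module):
    """Sketch of adjacent-step feature reuse: recompute the wrapped block only every
    `reuse_interval` denoising steps and return the cached output otherwise."""

    def __init__(self, block, reuse_interval=2):
        super().__init__()
        self.block = block
        self.reuse_interval = reuse_interval
        self._cache = None

    def forward(self, x, step):
        if self._cache is None or step % self.reuse_interval == 0:
            self._cache = self.block(x)          # full computation on selected steps
        return self._cache                        # cheap reuse on intermediate steps

block = CachedBlock(nn.Linear(32, 32))
x = torch.randn(4, 32)
for step in range(6):                             # stand-in for a denoising loop
    y = block(x, step)
print(y.shape)
```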
Poster
Xinyin Ma · Runpeng Yu · Songhua Liu · Gongfan Fang · Xinchao Wang

[ ExHall D ]

Abstract
In this paper, we introduce a novel self-distillation paradigm for improving the performance of diffusion models. Previous studies have shown that introducing a teacher to distill the diffusion model can enhance its sampling efficiency. We raise an intriguing question: can the diffusion model itself serve as its own teacher to further improve its performance? To this end, we propose a new paradigm called Self Step-Distillation (SSD). The core idea of SSD is to integrate the predictions or intermediate activations of the diffusion model at each timestep with those of its preceding timestep through a fusion mechanism. We propose two forms, explicit SSD and implicit SSD (iSSD), to perform N-step to N-step distillation from the diffusion model itself and achieve improved image quality. We further elucidate the connection between SSD and high-order solvers, highlighting their underlying relationship. The effectiveness of SSD is validated through extensive experiments on diffusion transformers of various sizes and across different sampling steps. Our results show that this novel self-distillation paradigm can significantly enhance performance. Additionally, our method is compatible with distillation methods designed for few-step inference. Notably, with iSSD trained for less than one epoch, we obtain a 32-step DiT-XL/2 achieving an FID of 1.99, outperforming …
Poster
Zhiwei Jia · Yuesong Nan · Huixi Zhao · Gengdai Liu

[ ExHall D ]

Abstract
Recent research has shown that fine-tuning diffusion models (DMs) with arbitrary rewards, including non-differentiable ones, is feasible with reinforcement learning (RL) techniques, offering great flexibility in model alignment. However, it is challenging to apply existing RL methods to timestep-distilled DMs for ultra-fast (2-step) image generation. Our analysis suggests several limitations of policy-based RL methods such as PPO or DPO for improving 2-step image generation. Based on these insights, we propose to fine-tune DMs with learned differentiable surrogate rewards. Our method, named \textbf{LaSRO}, learns surrogate reward models in the latent space of SDXL to convert arbitrary rewards into differentiable ones for efficient reward gradient guidance. LaSRO leverages pre-trained latent DMs for reward modeling and specifically targets 2-step image generation for reward optimization, enhancing generalizability and efficiency. We show that LaSRO is effective and stable for improving ultra-fast image generation with different reward objectives, outperforming popular RL methods including those based on PPO or DPO. We further show LaSRO's connection to value-based RL, providing theoretical insights behind it.
Poster
Xin Ding · Lei Yu · Xin Li · Zhijun Tu · Hanting Chen · Jie Hu · Zhibo Chen

[ ExHall D ]

Abstract
Recent years have witnessed the great success of denoising diffusion samplers in improving the generative capability and sampling efficiency of pre-trained diffusion models. However, most sampling schedulers in diffusion models lack sampling dynamics and the planning capability for future generation results, leading to suboptimal solutions. To overcome this, we propose the Reinforced Active Sampling Scheduler, termed RaSS, which aims to find the optimal sampling trajectory by actively planning and adjusting the sampling steps for each sampling process on the fly. Concretely, RaSS divides the whole sampling process into five stages and introduces a reinforcement learning (RL) agent to continuously monitor the generated instance and perceive the potential generation results, thereby achieving optimal instance- and state-adaptive sampling-step decisions. Meanwhile, a sampling reward is designed to assist the planning capability of the RL agent by balancing sampling efficiency and generation quality. RaSS is a plug-and-play module, applicable to multiple denoising diffusion samplers of diffusion models. Extensive experiments on different benchmarks have shown that our RaSS can consistently improve generation quality and efficiency across various tasks, without introducing significant computational overhead.
Poster
Kai Wang · Mingjia Shi · YuKun Zhou · Zekai Li · Xiaojiang Peng · Zhihang Yuan · Yuzhang Shang · Hanwang Zhang · Yang You

[ ExHall D ]

Abstract
Training diffusion models is always a computation-intensive task. In this paper, we introduce a novel speed-up method for diffusion model training, called SpeeD, which is based on a closer look at time steps. Our key findings are: i) Time steps can be empirically divided into acceleration, deceleration, and convergence areas based on the process increment. ii) These time steps are imbalanced, with many concentrated in the convergence area. iii) The concentrated steps provide limited benefits for diffusion training. To address this, we design an asymmetric sampling strategy that reduces the frequency of steps from the convergence area while increasing the sampling probability for steps from other areas. Additionally, we propose a weighting strategy to emphasize the importance of time steps with rapid-change process increments. As a plug-and-play and architecture-agnostic approach, SpeeD consistently achieves 3-times acceleration across various diffusion architectures, datasets, and tasks. Notably, due to its simple design, our approach significantly reduces the cost of diffusion model training with minimal overhead. Our research enables more researchers to train diffusion models at a lower cost.
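The asymmetric sampling strategy can be pictured as a non-uniform distribution over training timesteps that down-weights the convergence area, as in the sketch below. The area boundary and suppression factor are illustrative assumptions, not the paper's settings, and the additional weighting strategy is omitted.

```python
import torch

def asymmetric_timestep_probs(num_steps=1000, convergence_start=700, suppress=0.2):
    """Sketch of an asymmetric distribution over diffusion timesteps: steps in the
    (assumed) convergence area are drawn less often during training."""
    weights = torch.ones(num_steps)
    weights[convergence_start:] *= suppress          # rarely sample near-convergence steps
    return weights / weights.sum()

probs = asymmetric_timestep_probs()
t = torch.multinomial(probs, num_samples=8, replacement=True)  # timesteps for one batch
print(t)
```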
Poster
Rahul Ravishankar · Zeeshan Patel · Jathushan Rajasegaran · Jitendra Malik

[ ExHall D ]

Abstract
In this paper, we argue that iterative computation with diffusion models offers a powerful paradigm for not only generation but also visual perception tasks. We unify tasks such as depth estimation, optical flow, and amodal segmentation under the framework of image-to-image translation, and show how diffusion models benefit from scaling training and test-time compute for these perceptual tasks. Through a careful analysis of these scaling properties, we formulate compute-optimal training and inference recipes to scale diffusion models for visual perception tasks. Our models achieve competitive performance to state-of-the-art methods using significantly less data and compute.
Poster
Yuqing Wang · Shuhuai Ren · Zhijie Lin · Yujin Han · Haoyuan Guo · Zhenheng Yang · Difan Zou · Jiashi Feng · Xihui Liu

[ ExHall D ]

Abstract
Autoregressive models have emerged as a powerful approach for visual generation but suffer from slow inference speed due to their sequential token-by-token prediction process. In this paper, we propose a simple yet effective approach for parallelized autoregressive visual generation that improves generation efficiency while preserving the advantages of autoregressive modeling. Our key insight is that the feasibility of parallel generation is closely tied to visual token dependencies: while tokens with weak dependencies can be generated in parallel, adjacent tokens with strong dependencies are hard to generate together, as independent sampling of strongly correlated tokens may lead to inconsistent decisions. Based on this observation, we develop a parallel generation strategy that generates distant tokens with weak dependencies in parallel while maintaining sequential generation for strongly dependent local tokens. Specifically, we first generate initial tokens in each region sequentially to establish the global structure, then enable parallel generation across distant regions while maintaining sequential generation within each region. Our approach can be seamlessly integrated into standard autoregressive models without modifying the architecture or tokenizer. Experiments on ImageNet and UCF-101 demonstrate that our method achieves a 3.6× speedup with comparable quality and up to 9.5× speedup with minimal quality degradation across both image and video …
Poster
Chengyue Wu · Xiaokang Chen · Zhiyu Wu · Yiyang Ma · Xingchao Liu · Zizheng Pan · Wen Liu · Zhenda Xie · Xingkai Yu · Chong Ruan · Ping Luo

[ ExHall D ]

Abstract
We introduce Janus, an autoregressive framework that unifies multimodal understanding and generation. Prior research often relies on a single visual encoder for both tasks, such as Chameleon. However, due to the differing levels of information granularity required by multimodal understanding and generation, this approach can lead to suboptimal performance, particularly in multimodal understanding. To address this issue, we decouple visual encoding into separate pathways, while still leveraging a single, unified transformer architecture for processing. The decoupling not only alleviates the conflict between the visual encoder's roles in understanding and generation, but also enhances the framework's flexibility. For instance, both the multimodal understanding and generation components can independently select their most suitable encoding methods. Experiments show that Janus surpasses previous unified model and matches or exceeds the performance of task-specific models. The simplicity, high flexibility, and effectiveness of Janus make it a strong candidate for next-generation unified multimodal models.
Poster
Shenghai Yuan · Jinfa Huang · Xianyi He · Yunyang Ge · Yujun Shi · Liuhan Chen · Jiebo Luo · Li Yuan

[ ExHall D ]

Abstract
Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with consistent human identity. It is an important task in video generation but remains an open problem for generative models. This paper pushes the technical frontier of IPT2V in two directions that have not been resolved in the literature: (1) A tuning-free pipeline without tedious case-by-case finetuning, and (2) A frequency-aware heuristic identity-preserving Diffusion Transformer (DiT)-based control scheme. To achieve these goals, we propose **ConsisID**, a tuning-free DiT-based controllable IPT2V model to keep human-**id**entity **consis**tent in the generated video. Inspired by prior findings in frequency analysis of vision/diffusion transformers, it employs identity-control signals in the frequency domain, where facial features can be decomposed into low-frequency global features (e.g., profile, proportions) and high-frequency intrinsic features (e.g., identity markers that remain unaffected by pose changes). First, from a low-frequency perspective, we introduce a global facial extractor, which encodes the reference image and facial key points into a latent space, generating features enriched with low-frequency information. These features are then integrated into the shallow layers of the network to alleviate training challenges associated with DiT. Second, from a high-frequency perspective, we design a local facial extractor to capture high-frequency details and inject them into …
Poster
Weixi Feng · Chao Liu · Sifei Liu · William Yang Wang · Arash Vahdat · Weili Nie

[ ExHall D ]

Abstract
Existing video generation models struggle to follow complex text prompts and synthesize multiple objects, raising the need for additional grounding input for improved controllability. In this work, we propose to decompose videos into visual primitives -- blob video representation, a general representation for controllable video generation. Based on blob conditions, we develop a blob-grounded video diffusion model named BlobGEN-Vid that allows users to control object motions and fine-grained object appearance. In particular, we introduce a masked 3D attention module that effectively improves regional consistency across frames. In addition, we introduce a learnable module to interpolate text embeddings so that users can control semantics in specific frames and obtain smooth object transitions. We show that our framework is model-agnostic and build BlobGEN-Vid based on both U-Net and DiT-based video diffusion models. Extensive experimental results show that BlobGEN-Vid achieves superior zero-shot video generation ability and state-of-the-art layout controllability on multiple benchmarks. When combined with a Large Language Model for layout planning, our framework even outperforms proprietary text-to-video generators in terms of compositional accuracy.
Poster
Jiazi Bu · Pengyang Ling · Pan Zhang · Tong Wu · Xiaoyi Dong · Yuhang Zang · Yuhang Cao · Dahua Lin · Jiaqi Wang

[ ExHall D ]

Abstract
The text-to-video (T2V) generation models, offering convenient visual creation, have recently garnered increasing attention. Despite their substantial potential, the generated videos may present artifacts, including structural implausibility, temporal inconsistency, and a lack of motion, often resulting in near-static video. In this work, we have identified a correlation between the disparity of temporal attention maps across different blocks and the occurrence of temporal inconsistencies. Additionally, we have observed that the energy contained within the temporal attention maps is directly related to the magnitude of motion amplitude in the generated videos. Based on these observations, we present ByTheWay, a training-free method to improve the quality of text-to-video generation without introducing additional parameters or increasing memory or sampling time. Specifically, ByTheWay is composed of two principal components: 1) Temporal Self-Guidance improves the structural plausibility and temporal consistency of generated videos by reducing the disparity between the temporal attention maps across various decoder blocks. 2) Fourier-based Motion Enhancement enhances the magnitude and richness of motion by amplifying the energy of the temporal attention map. Extensive experiments demonstrate that ByTheWay significantly improves the quality of text-to-video generation with negligible additional cost.
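The Fourier-based motion enhancement component can be pictured as amplifying the spectral energy of a temporal attention map while preserving its mean component and renormalizing, as in the hedged sketch below. The scaling rule, DC handling, and renormalization are assumptions for illustration, not the paper's exact operation.

```python
import torch

def amplify_attention_energy(attn_map, beta=1.5, keep_dc=True):
    """Sketch: boost the non-DC frequency components of a temporal attention map
    (..., T, T) so it carries more motion energy, then renormalize rows."""
    spec = torch.fft.fft2(attn_map, norm="ortho")
    if keep_dc:
        dc = spec[..., :1, :1].clone()
    spec = spec * beta                               # amplify overall spectral energy
    if keep_dc:
        spec[..., :1, :1] = dc                       # leave the mean component untouched
    out = torch.fft.ifft2(spec, norm="ortho").real
    out = out.clamp(min=0)
    # Each query should still distribute an attention mass of 1 over the time axis.
    return out / out.sum(dim=-1, keepdim=True).clamp(min=1e-6)

attn = torch.softmax(torch.randn(2, 8, 16, 16), dim=-1)   # (B, heads, T, T) toy map
print(amplify_attention_energy(attn).shape)
```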
Poster
Yuwei Guo · Ceyuan Yang · Anyi Rao · Chenlin Meng · Omer Bar-Tal · Shuangrui Ding · Maneesh Agrawala · Dahua Lin · Bo Dai

[ ExHall D ]

Abstract
Video inpainting, which aims to fill missing regions with visually coherent content, has emerged as a crucial technique for editing and virtual tour applications. While existing approaches achieve either visual consistency or text-guided generation, they often struggle to balance between coherence and creative diversity. In this work, we introduce VideoRepainter, a two-stage framework that first allows users to inpaint a keyframe using established image-level techniques, and then propagates the corresponding change to other frames. Our approach can leverage state-of-the-art image diffusion models for keyframe manipulation, thereby easing the burden of the video-inpainting process. To this end, we integrate an image-to-video model with a symmetric condition mechanism to address ambiguity caused by direct mask downsampling. We further explore efficient strategies for mask synthesis and parameter optimization to reduce costs in data processing and model training. Evaluations demonstrate our method achieves superior results in both visual fidelity and content diversity compared to existing approaches, providing a practical solution for high-quality video editing and creation.
Poster
Jaerin Lee · Daniel Jung · Kanggeon Lee · Kyoung Mu Lee

[ ExHall D ]

Abstract
We introduce SemanticDraw, a new paradigm of interactive content creation where high-quality images are generated in near real time from multiple hand-drawn regions, each encoding a prescribed semantic meaning. To maximize the productivity of content creators and fully realize their artistic imagination, tools must offer both quick interactive interfaces and fine-grained regional controls. Despite astonishing generation quality from recent diffusion models, we find that existing approaches for regional controllability are very slow (52 seconds for a 512 x 512 image) and incompatible with acceleration methods such as LCM, blocking their huge potential in interactive content creation. From this observation, we build our solution for interactive content creation in two steps: (1) we establish compatibility between region-based controls and acceleration techniques for diffusion models, maintaining high fidelity of multi-prompt image generation with a 10× reduction in the number of inference steps, (2) we increase the generation throughput with our new multi-prompt stream batch pipeline, enabling low-latency generation from multiple, region-based text prompts on a single RTX 2080 Ti GPU. Our proposed framework is generalizable to any existing diffusion models and acceleration schedulers, allowing a sub-second (0.64 seconds) image content creation application built upon well-established image diffusion models. The demo application can …
Poster
Ryugo Morita · Stanislav Frolov · Brian Bernhard Moser · Takahiro Shirakawa · Ko Watanabe · Andreas Dengel · Jinjia Zhou

[ ExHall D ]

Abstract
Diffusion models have enabled the generation of high-quality images with a strong focus on realism and textual fidelity. Yet, large-scale text-to-image models, such as Stable Diffusion, struggle to generate images where foreground objects are placed over a chroma key background, limiting their ability to separate foreground and background elements without fine-tuning. To address this limitation, we present a novel Training-Free Chroma Key Content Generation Diffusion Model (TKG-DM), which optimizes the initial random noise to produce images with foreground objects on a specifiable color background. Our proposed method is the first to explore the manipulation of the color aspects in initial noise for controlled background generation, enabling precise separation of foreground and background without fine-tuning. Extensive experiments demonstrate that our training-free method outperforms existing methods in both qualitative and quantitative evaluations, matching or surpassing fine-tuned models. Finally, we successfully extend it to other tasks (e.g., consistency models and text-to-video), highlighting its transformative potential across various generative applications where independent control of foreground and background is crucial.
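Steering the initial noise toward a chroma-key background color can be pictured as blending a per-channel offset into the Gaussian initialization, as in the sketch below. The blend rule, the shift factor, and the per-channel target values are illustrative assumptions rather than the paper's actual noise-manipulation procedure.

```python
import torch

def color_shifted_init_noise(shape, target_mean, shift=0.3, generator=None):
    """Sketch: nudge the channel means of the initial noise toward a target value
    (e.g., derived from the chroma-key color) before running the diffusion sampler."""
    noise = torch.randn(shape, generator=generator)               # (B, C, H, W)
    offset = torch.tensor(target_mean).view(1, -1, 1, 1)
    return noise + shift * (offset - noise.mean(dim=(-2, -1), keepdim=True))

init = color_shifted_init_noise((1, 4, 64, 64), target_mean=[0.5, -0.2, 0.1, 0.0])
print(init.shape, init.mean(dim=(0, 2, 3)))   # channel means pulled toward the targets
```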
Poster
Ziheng Ouyang · Zhen Li · Qibin Hou

[ ExHall D ]

Abstract
Recent studies have explored combining different LoRAs to jointly generate learned style and content. However, existing methods either fail to effectively preserve both the original subject and style simultaneously or require additional training. In this paper, we argue that the intrinsic properties of LoRA can effectively guide diffusion models in merging learned subject and style. Building on this insight, we propose K-LoRA, a simple yet effective training-free LoRA fusion approach. In each attention layer, K-LoRA compares the Top-K elements in each LoRA to be fused, determining which LoRA to select for optimal fusion. This selection mechanism ensures that the most representative features of both subject and style are retained during the fusion process, effectively balancing their contributions. Experimental results demonstrate that the proposed method effectively integrates the subject and style information learned by the original LoRAs, outperforming state-of-the-art training-based approaches in both qualitative and quantitative results.
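The per-layer Top-K comparison can be pictured as scoring each LoRA's weight delta by the sum of its largest-magnitude entries and keeping the dominant one, as in the simplified sketch below. This scoring rule is one reading of the idea, not the paper's exact criterion, and the layer sizes are toy assumptions.

```python
import torch

def select_lora_topk(delta_a, delta_b, k=64):
    """Sketch of a K-LoRA-style selection rule for one attention layer: compare the
    top-k absolute values of the two LoRA weight deltas and keep the dominant one."""
    score_a = delta_a.abs().flatten().topk(k).values.sum()
    score_b = delta_b.abs().flatten().topk(k).values.sum()
    return delta_a if score_a >= score_b else delta_b

# Toy per-layer LoRA deltas (B @ A for a hypothetical 128x128 projection).
subject_delta = torch.randn(128, 16) @ torch.randn(16, 128) * 0.01
style_delta = torch.randn(128, 16) @ torch.randn(16, 128) * 0.02
chosen = select_lora_topk(subject_delta, style_delta, k=32)
print(chosen.shape)   # torch.Size([128, 128])
```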
Poster
Chunnan Shang · Zhizhong Wang · Hongwei Wang · Xiangming Meng

[ ExHall D ]

Abstract
Attention-based arbitrary style transfer methods, including CNN-based, Transformer-based, and Diffusion-based, have flourished and produced high-quality stylized images. However, they perform poorly on the content and style images with the same semantics, i.e., the style of the corresponding semantic region of the generated stylized image is inconsistent with that of the style image. We argue that the root cause lies in their failure to consider the relationship between local regions and semantic regions. To address this issue, we propose a plug-and-play semantic continuous-sparse attention, dubbed SCSA, for arbitrary semantic style transfer—each query point considers certain key points in the corresponding semantic region. Specifically, semantic continuous attention ensures each query point fully attends to all the continuous key points in the same semantic region that reflect the overall style characteristics of that region; Semantic sparse attention allows each query point to focus on the most similar sparse key point in the same semantic region that exhibits the specific stylistic texture of that region. By combining the two modules, the resulting SCSA aligns the overall style of the corresponding semantic regions while transferring the vivid textures of these regions. Qualitative and quantitative results prove that SCSA enables attention-based arbitrary style transfer methods to …
Poster
Ta-Ying Cheng · Prafull Sharma · Mark Boss · Varun Jampani

[ ExHall D ]

Abstract
Editing materials of objects in images based on exemplar images is an active area of research in computer vision and graphics. We propose MARBLE, a method for performing material blending and recomposing fine-grained material properties by finding material embeddings in CLIP-space and using that to control pre-trained text-to-image models. We improve exemplar-based material editing by finding a block in the denoising UNet responsible for material attribution. Given two material exemplar-images, we find directions in the CLIP-space for blending the materials. Further, we can achieve parametric control over fine-grained material attributes such as roughness, metallic, transparency, and glow using a shallow network to predict the direction for the desired material attribute change. We perform qualitative and quantitative analysis to demonstrate the efficacy of our proposed method. We also present the ability of our method to perform multiple edits in a single forward pass and applicability to painting.
Poster
Zichen Liu · Yue Yu · Hao Ouyang · Qiuyu Wang · Ka Leong Cheng · Wen Wang · Zhiheng Liu · Qifeng Chen · Yujun Shen

[ ExHall D ]

Abstract
As a highly practical application, image editing encounters a variety of user demands and thus prioritizes excellent ease of use. In this paper, we unveil MagicQuill, an integrated image editing system designed to support users in swiftly actualizing their creativity. Our system starts with a streamlined yet functionally robust interface, enabling users to articulate their ideas (e.g., inserting elements, erasing objects, altering color, etc.) with just a few strokes. These interactions are then monitored by a multimodal large language model (MLLM) to anticipate user intentions in real time, bypassing the need for prompt entry. Finally, we apply the powerful diffusion prior, enhanced by a carefully learned two-branch plug-in module, to process the editing request with precise control. We will release the entire system to facilitate the community.
Poster
Yusuf Dalva · Kavana Venkatesh · Pinar Yanardag

[ ExHall D ]

Abstract
Rectified flow models have emerged as a dominant approach in image generation, showcasing impressive capabilities in high-quality image synthesis. However, despite their effectiveness in visual generation, understanding their inner workings remains a significant challenge due to their "black box" nature. Recent research has focused on identifying a representation space that facilitates semantic manipulation of generated images, but these models generally lack a GAN-like linear latent space that allows straightforward control over image generation. In this paper, we introduce FluxSpace, a domain-agnostic image editing method leveraging a representation space with the ability to control the semantics of images generated by rectified flow transformers, such as Flux. By leveraging the representations learned by the transformer blocks within rectified flow models, we propose a set of semantically interpretable representations that enable a wide range of image editing tasks, from fine-grained image editing to artistic creation. This work both offers a scalable and effective image editing approach and significantly enhances the interpretability of rectified flow transformers.
Poster
Jun Zhou · Jiahao Li · Zunnan Xu · Hanhui Li · Yiji Cheng · Fa-Ting Hong · Qin Lin · qinglin lu · Xiaodan Liang

[ ExHall D ]

Abstract
Currently, instruction-based image editing methods have made significant progress by leveraging the powerful cross-modal understanding capabilities of visual language models (VLMs). However, they still face challenges in three key areas: 1) complex scenarios; 2) semantic consistency; and 3) fine-grained editing. To address these issues, we propose FireEdit, an innovative \textbf{F}ine-grained \textbf{I}nstruction-based image editing framework that exploits a REgion-aware VLM. FireEdit is designed to accurately comprehend user instructions and ensure effective control over the editing process. We employ a VLM to precisely localize the desired editing regions within complex scenes. To enhance the fine-grained visual perception capabilities of the VLM, we introduce additional region tokens that complement the holistic image features and are integrated into the user's instructions. Relying solely on the output of the large language model (LLM) to guide the diffusion model may result in suboptimal editing outcomes. Therefore, we propose a Time-Aware Target Injection module and a Hybrid Visual Cross Attention module. The former dynamically adjusts the guidance strength at various denoising stages by integrating timestep embeddings with the text embeddings. The latter enhances visual details for image editing, thereby preserving semantic consistency between the edited result and the source image. By combining the VLM enhanced with fine-grained region tokens …
Poster
Zhengyao Fang · Pengyuan Lyu · Jingjing Wu · Chengquan Zhang · Jun Yu · Guangming Lu · Wenjie Pei

[ ExHall D ]

Abstract
Scene text editing aims to modify text content within scene images while maintaining style consistency. Traditional methods achieve this by explicitly disentangling style and content from the source image and then fusing the style with the target content, while ensuring content consistency using a pre-trained recognition model. Despite notable progress, these methods suffer from complex pipelines, leading to suboptimal performance in complex scenarios. In this work, we introduce Recognition-Synergistic Scene Text Editing (RS-STE), a novel approach that fully exploits the intrinsic synergy of text recognition for editing. Our model seamlessly integrates text recognition with text editing within a unified framework, and leverages the recognition model's ability to implicitly disentangle style and content while ensuring content consistency. Specifically, our approach employs a multi-modal parallel decoder based on transformer architecture, which predicts both text content and stylized images in parallel. Additionally, our cyclic self-supervised fine-tuning strategy enables effective training on unpaired real-world data without ground truth, enhancing style and content consistency through a twice-cyclic generation process. Built on a relatively simple architecture, RS-STE achieves state-of-the-art performance on both synthetic and real-world benchmarks, and further demonstrates the effectiveness of leveraging the generated hard cases to boost the performance of downstream recognition tasks. Code …
Poster
Mengtian Li · Jinshu Chen · Wanquan Feng · Bingchuan Li · Fei Dai · Songtao Zhao · Qian HE

[ ExHall D ]

Abstract
Personalized portrait synthesis, essential in domains like social entertainment, has recently made significant progress. Person-wise fine-tuning based methods, such as LoRA and DreamBooth, can produce photorealistic outputs but need training on individual samples, which consumes time and resources and poses a risk of instability. Adapter-based techniques such as IP-Adapter freeze the foundational model parameters and employ a plug-in architecture to enable zero-shot inference, but they often exhibit a lack of naturalness and authenticity, which are not to be overlooked in portrait synthesis tasks. In this paper, we introduce a parameter-efficient adaptive generation method, namely HyperLoRA, that uses an adaptive plug-in network to generate LoRA weights, merging the superior performance of LoRA with the zero-shot capability of the adapter scheme. Through our carefully designed network structure and training strategy, we achieve zero-shot personalized portrait generation (supporting both single and multiple image inputs) with high photorealism, fidelity, and editability.
Poster
Atharva Sehgal · Patrick Yuan · Ziniu Hu · Yisong Yue · Jennifer J. Sun · Swarat Chaudhuri

[ ExHall D ]

Abstract
We study the problem of building a visual concept library for visual recognition. Building effective visual concept libraries is challenging, as manual definition is labor-intensive, while relying solely on LLMs for concept generation can result in concepts that lack discriminative power or fail to account for the complex interactions between them. Our approach, ESCHER, takes a library learning perspective to iteratively discover and improve visual concepts. ESCHER uses a vision-language model (VLM) as a critic to iteratively refine the concept library, including accounting for interactions between concepts and how they affect downstream classifiers. By leveraging the in-context learning abilities of LLMs and the history of performance using various concepts, ESCHER dynamically improves its concept generation strategy based on the VLM critic's feedback. Finally, ESCHER does not require any human annotations, and is thus an automated plug-and-play framework. We empirically demonstrate the ability of ESCHER to learn a concept library for zero-shot, few-shot, and fine-tuning visual classification tasks. This work represents, to our knowledge, the first application of concept library learning to real-world visual tasks.
Poster
Zixuan Wang · DUO PENG · Feng Chen · Yuwei Yang · Yinjie Lei

[ ExHall D ]

Abstract
Image synthesis is a crucial task with broad applications, such as artistic creation and virtual reality. However, challenges in achieving control over generated images have underscored the need for the task of conditional image synthesis. Current methods for conditional image synthesis, nevertheless, remain limited, as they are often task-oriented with a narrow scope, handling a restricted condition with constrained applicability. In this paper, we propose a novel approach that treats conditional image synthesis as the modular combination of fundamental condition units. This perspective allows us to develop a framework for modular conditional generation, significantly enhancing the model's adaptability to diverse conditional generation tasks and greatly expanding its application range. Specifically, we divide conditions into three primary units: text, layout, and drag. To enable effective control over these conditions, we design a dedicated alignment module for each. For the text condition, we introduce a Dense Concept Alignment (DCA) module, which achieves dense visual-text alignment by drawing on diverse textual concepts. For the layout condition, we propose a Dense Geometry Alignment (DGA) module to impose comprehensive geometric constraints that ensure adherence to spatial configuration of the layout condition. For the drag condition, we introduce a Dense Motion Alignment (DMA) module to apply …
Poster
Feng Liang · Haoyu Ma · Zecheng He · Tingbo Hou · Ji Hou · Kunpeng Li · Xiaoliang Dai · Felix Juefei-Xu · Samaneh Azadi · Animesh Sinha · Peizhao Zhang · Peter Vajda · Diana Marculescu

[ ExHall D ]

Abstract
Video personalization, which generates customized videos using reference images, has gained significant attention. However, prior methods typically focus on single-concept personalization, limiting broader applications that require multi-concept integration. Attempts to extend these models to multiple concepts often lead to identity blending, which results in composite characters with fused attributes from multiple sources. This challenge arises due to the lack of a mechanism to link each concept with its specific reference image. We address this with anchored prompts, which embed image anchors as unique tokens within text prompts, guiding accurate referencing during generation. Additionally, we introduce concept embeddings to encode the order of reference images. Our approach, Movie Weaver, seamlessly weaves multiple concepts—including face, body, and animal images—into one video, allowing flexible combinations in a single model. The evaluation shows that Movie Weaver outperforms existing methods for multi-concept video personalization in identity preservation and overall quality.
Poster
Xixi Hu · Keyang Xu · Bo Liu · Hongliang Fei · Qiang Liu

[ ExHall D ]

Abstract
Achieving precise alignment between textual instructions and generated images in text-to-image generation is a significant challenge, particularly in rendering written text within images. Open-source models like Stable Diffusion 3 (SD3), Flux, and AuraFlow often struggle with accurate text depiction, resulting in misspelled or inconsistent text. We introduce a training-free method with minimal computational overhead that significantly enhances text rendering quality. Specifically, we introduce an overshooting sampler for a pretrained rectified-flow (RF) model that alternates between over-simulating the learned ordinary differential equation (ODE) and reintroducing noise. Compared to the Euler sampler, the overshooting sampler effectively introduces an extra Langevin dynamics term that can help correct the compounding error from successive Euler steps and therefore improve the text rendering. However, when the overshooting strength is high, we observe over-smoothing artifacts on the generated images. To address this issue, we adaptively control the strength of the overshooting for each image patch according to its attention score with the text content. We name the proposed sampler Attention Modulated Overshooting sampler (AMO). AMO demonstrates a 32.3% and 35.9% improvement in text rendering accuracy on SD3 and Flux without compromising overall image quality or increasing inference cost.
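The overshoot-and-renoise idea can be sketched generically for a rectified-flow interpolation x_t = t*x1 + (1-t)*eps: take an Euler step past the target time, then rescale and add fresh noise so the marginal noise level matches the target time again. The velocity field, the schedule, and the overshoot factor c below are placeholders; this is not the AMO sampler, which additionally modulates the overshoot per patch using text attention.

```python
import numpy as np

def velocity(x, t):
    """Placeholder velocity field; a real sampler would call the pretrained RF model."""
    return -x  # toy dynamics, for illustration only

def overshoot_step(x, t, t_next, c=0.5, rng=np.random.default_rng(0)):
    """One overshoot-and-renoise step under x_t = t*x1 + (1-t)*eps, t in [0, 1].

    We integrate past `t_next` to `t_over`, then jump back to `t_next` by
    rescaling and adding fresh noise so the marginal noise level matches again.
    """
    t_over = min(t_next + c * (t_next - t), 1.0 - 1e-4)
    # Euler step of the ODE, deliberately overshooting the target time.
    x_over = x + (t_over - t) * velocity(x, t)
    # Rescale back toward t_next and reintroduce noise (the extra Langevin-like term).
    scale = t_next / t_over
    residual_std = np.sqrt(max((1 - t_next) ** 2 - (scale * (1 - t_over)) ** 2, 0.0))
    return scale * x_over + residual_std * rng.standard_normal(x.shape)


# Toy trajectory over a few steps, starting from pure noise at t = 0.
x = np.random.default_rng(1).standard_normal((4, 4))
ts = np.linspace(0.0, 1.0, 11)
for t, t_next in zip(ts[:-1], ts[1:]):
    x = overshoot_step(x, t, t_next)
```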
Poster
Shuya Yang · Shaozhe Hao · Yukang Cao · Kwan-Yee K. Wong

[ ExHall D ]

Abstract
Subject-driven text-to-image generation has witnessed remarkable advancements in its ability to learn and capture characteristics of a subject using only a limited number of images. However, existing methods commonly rely on high-quality images for training and may struggle to generate reasonable images when the input images are blemished by artifacts. This is primarily attributed to the inadequate capability of current techniques in distinguishing subject-related features from disruptive artifacts. In this paper, we introduce ArtiFade to tackle this issue and successfully generate high-quality artifact-free images from blemished datasets. Specifically, ArtiFade exploits fine-tuning of a pre-trained text-to-image model, aiming to remove artifacts. The elimination of artifacts is achieved by utilizing a specialized dataset that encompasses both unblemished images and their corresponding blemished counterparts during fine-tuning. ArtiFade also ensures the preservation of the original generative capabilities inherent within the diffusion model, thereby enhancing the overall performance of subject-driven methods in generating high-quality and artifact-free images. We further devise evaluation benchmarks tailored for this task. Through extensive qualitative and quantitative experiments, we demonstrate the generalizability of ArtiFade in effective artifact removal under both in-distribution and out-of-distribution scenarios.
Poster
Shufan Li · Konstantinos Kallidromitis · Akash Gokul · Zichun Liao · Yusuke Kato · Kazuki Kozuka · Aditya Grover

[ ExHall D ]

Abstract
We introduce OmniFlow, a novel generative model designed for any-to-any generation tasks such as text-to-image, text-to-audio, and audio-to-image synthesis. OmniFlow advances the rectified flow (RF) framework used in text-to-image models to handle the joint distribution of multiple modalities. It outperforms previous any-to-any models on a wide range of tasks, such as text-to-image and text-to-audio synthesis. Our work offers three key contributions: First, we extend RF to a multi-modal setting and introduce a novel guidance mechanism, enabling users to flexibly control the alignment between different modalities in the generated outputs. Second, we propose a novel architecture that extends the text-to-image MMDiT architecture of Stable Diffusion 3 and enables audio and text generation. The extended modules can be efficiently pretrained individually and merged with the vanilla text-to-image MMDiT for fine-tuning. Lastly, we conduct a comprehensive study on the design choices of rectified flow transformers for large-scale audio and text generation, providing valuable insights into optimizing performance across diverse modalities.
Poster
Enis Simsar · Thomas Hofmann · Federico Tombari · Pinar Yanardag

[ ExHall D ]

Abstract
Recent advances in text-to-image customization have enabled high-fidelity, context-rich generation of personalized images, allowing specific concepts to appear in a variety of scenarios. However, current methods struggle with combining multiple personalized models, often leading to attribute entanglement or requiring separate training to preserve concept distinctiveness. We present LoRACLR, a novel approach for multi-concept image generation that merges multiple LoRA models, each fine-tuned for a distinct concept, into a single, unified model without additional individual fine-tuning. LoRACLR uses a contrastive objective to align and merge the weight spaces of these models, ensuring compatibility while minimizing interference. By enforcing distinct yet cohesive representations for each concept, LoRACLR enables efficient, scalable model composition for high-quality, multi-concept image synthesis. Our results highlight the effectiveness of LoRACLR in accurately merging multiple concepts, advancing the capabilities of personalized image generation.
Poster
Zhanhao Liang · Yuhui Yuan · Shuyang Gu · Bohan CHEN · Tiankai Hang · Mingxi Cheng · Ji Li · Liang Zheng

[ ExHall D ]

Abstract
Generating visually appealing images is fundamental to modern text-to-image generation models. A potential solution to better aesthetics is direct preference optimization (DPO), which has been applied to diffusion models to improve general image quality including prompt alignment and aesthetics. Popular DPO methods propagate preference labels from clean image pairs to all the intermediate steps along the two generation trajectories. However, preference labels provided in existing datasets are blended with layout and aesthetic opinions, which can conflict with purely aesthetic preference. Even if aesthetic labels were provided (at substantial cost), it would be hard for the two-trajectory methods to capture nuanced visual differences at different steps. To improve aesthetics economically, this paper uses existing generic preference data and introduces step-by-step preference optimization (SPO) that discards the propagation strategy and allows fine-grained image details to be assessed. Specifically, at each denoising step, we 1) sample a pool of candidates by denoising from a shared noise latent, 2) use a step-aware preference model to find a suitable win-lose pair to supervise the diffusion model, and 3) randomly select one from the pool to initialize the next denoising step. This strategy ensures that the diffusion model focuses on the subtle, fine-grained visual differences instead …
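The step-wise selection loop can be sketched with placeholder denoise_step and preference_score functions; the real SPO method uses a trained step-aware preference model and a DPO-style update on each win/lose pair.

```python
import random

def denoise_step(latent, step, seed):
    """Placeholder for one stochastic denoising step of a diffusion model."""
    rng = random.Random(step * 1000 + seed)
    return latent + rng.uniform(-1.0, 1.0)   # toy latent update

def preference_score(latent, step):
    """Placeholder for a step-aware preference model."""
    return -abs(latent)                      # toy score: prefers latents near zero

def spo_rollout(latent, num_steps=5, pool_size=4):
    """Collect win/lose pairs step by step, as sketched in the abstract above.

    At each denoising step: expand the shared latent into a pool of candidates,
    keep the best/worst pair as supervision for a DPO-style update, and continue
    the trajectory from a randomly chosen candidate.
    """
    win_lose_pairs = []
    for step in range(num_steps):
        pool = [denoise_step(latent, step, seed) for seed in range(pool_size)]
        ranked = sorted(pool, key=lambda z: preference_score(z, step))
        win_lose_pairs.append((ranked[-1], ranked[0]))   # (win, lose)
        latent = random.choice(pool)                     # unbiased continuation
    return win_lose_pairs

print(len(spo_rollout(latent=3.0)))  # 5
```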
Poster
Harsh Rangwani · Aishwarya Agarwal · Kuldeep Kulkarni · R. Venkatesh Babu · Srikrishna Karanam

[ ExHall D ]

Abstract
Image composition and generation are processes where the artists need control over various parts of the generated images. However, the current state-of-the-art generation models, like Stable Diffusion, cannot handle fine-grained part-level attributes in the text prompts. Specifically, when additional attribute details are added to the base text prompt, these text-to-image models either generate an image vastly different from the image generated from the base prompt or ignore the attribute details. To mitigate these issues, we introduce PartComposer, a training-free method that enables image generation based on fine-grained part-level attributes specified for objects in the base text prompt. This allows more control for artists and enables novel object compositions by combining distinctive object parts. PartComposer first localizes object parts by denoising the object region from a specific diffusion process. This enables each part token to be localized to the right region. After obtaining part masks, we run a localized diffusion process in each part region based on fine-grained part attributes and combine them to produce the final image. All stages of PartComposer are based on repurposing a pre-trained diffusion model, which enables it to generalize across domains. We demonstrate the effectiveness of part-level control provided by PartComposer through qualitative visual examples …
Poster
Xin Xie · Dong Gong

[ ExHall D ]

Abstract
Text-to-image diffusion model alignment is critical for improving the alignment between the generated images and human preferences. While training-based methods are constrained by high computational costs and dataset requirements, training-free alignment methods remain underexplored and are often limited by inaccurate guidance. We propose a plug-and-play training-free alignment method, DyMO, for aligning the generated images and human preferences during inference. Apart from text-aware human preference scores, we introduce an objective that enhances semantic alignment in the early stages of diffusion, relying on the fact that the attention maps are effective reflections of the semantics in noisy images. We propose dynamic scheduling of multiple objectives and intermediate recurrent steps to reflect the requirements at different steps. Experiments with diverse pre-trained diffusion models and metrics demonstrate the effectiveness and robustness of the proposed method.
Poster
Stefan Andreas Baumann · Felix Krause · Michael Neumayr · Nick Stracke · Melvin Sevi · Vincent Tao Hu · Björn Ommer

[ ExHall D ]

Abstract
Recent advances in text-to-image (T2I) diffusion models have significantly improved the quality of generated images. However, providing efficient control over individual subjects, particularly the attributes characterizing them, remains a key challenge. While existing methods have introduced mechanisms to modulate attribute expression, they typically provide either detailed, object-specific localization of such a modification or fine-grained, nuanced control of attributes. No current approach offers both simultaneously, resulting in a gap when trying to achieve precise continuous and subject-specific attribute modulation in image generation. In this work, we demonstrate that token-level directions exist within commonly used CLIP text embeddings that enable fine-grained, subject-specific control of high-level attributes in T2I models. We introduce two methods to identify these directions: a simple, optimization-free technique and a learning-based approach that utilizes the T2I model to characterize semantic concepts more specifically. Our methods allow the augmentation of the prompt text input, enabling fine-grained control over multiple attributes of individual subjects simultaneously, without requiring any modifications to the diffusion model itself. This approach offers a unified solution that fills the gap between global and localized control, providing competitive flexibility and precision in text-guided image generation.
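Applying such a token-level direction amounts to shifting one token's embedding before it is passed to the T2I model. The numpy sketch below assumes a pre-computed direction attr_dir and a known token index; it does not reproduce the authors' procedures for identifying the directions.

```python
import numpy as np

def apply_attribute_direction(token_embs, token_idx, attr_dir, strength):
    """Shift one subject token along a (normalized) attribute direction.

    token_embs: (L, d) CLIP text-token embeddings fed to the T2I model.
    attr_dir:   (d,) assumed pre-computed direction for an attribute.
    strength:   signed scalar controlling how strongly the attribute is expressed.
    """
    edited = token_embs.copy()
    edited[token_idx] = edited[token_idx] + strength * attr_dir / np.linalg.norm(attr_dir)
    return edited

# Toy usage: nudge token 5 of a 77-token prompt along a random "attribute" direction.
L, d = 77, 768
embs = np.random.randn(L, d).astype(np.float32)
direction = np.random.randn(d).astype(np.float32)
edited = apply_attribute_direction(embs, token_idx=5, attr_dir=direction, strength=2.0)
```

Because only the chosen token is moved, other subjects in the prompt are left untouched, which is the subject-specific aspect the abstract highlights.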
Poster
Lital Binyamin · Yoad Tewel · Hilit Segev · Eran Hirsch · Royi Rassin · Gal Chechik

[ ExHall D ]

Abstract
Despite the unprecedented success of text-to-image diffusion models, controlling the number of depicted objects using text is surprisingly hard. This is important for various applications, from technical documents to children's books to illustrating cooking recipes. Generating object-correct counts is fundamentally challenging because the generative model needs to keep a sense of separate identity for every instance of the object, even if several objects look identical or overlap, and then carry out a global computation implicitly during generation. It is still unknown if such representations exist. To address count-correct generation, we first identify features within the diffusion model that can carry the object identity information. We then use them to separate and count instances of objects during the denoising process and detect over-generation and under-generation. We fix the latter by training a model that predicts both the shape and location of a missing object, based on the layout of existing ones, and show how it can be used to guide denoising with correct object count. Our approach, CountGen, does not depend on an external source to determine object layout, but rather uses the prior from the diffusion model itself, creating prompt-dependent and seed-dependent layouts. Evaluated on two benchmark datasets, we find that …
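One way to picture the over/under-generation check is to threshold an identity-carrying feature or attention map and count connected components. The snippet below does this on a toy map with scipy; it only illustrates the counting idea, not CountGen's actual feature extraction or layout prediction.

```python
import numpy as np
from scipy import ndimage

def count_instances(obj_map, threshold=0.5, min_area=4):
    """Count blobs in a per-object saliency/attention map (toy instance counting)."""
    binary = obj_map > threshold
    labeled, num = ndimage.label(binary)
    if num == 0:
        return 0
    sizes = ndimage.sum(binary, labeled, index=np.arange(1, num + 1))
    return int(np.sum(np.asarray(sizes) >= min_area))

# Toy map with two bright blobs.
m = np.zeros((32, 32))
m[4:9, 4:9] = 1.0
m[20:26, 18:24] = 1.0
print(count_instances(m))  # 2

# If the prompt asked for 3 objects, count_instances(m) < 3 would signal
# under-generation, which a CountGen-style method corrects by predicting a
# layout for the missing object and guiding the denoising accordingly.
```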
Poster
Feifei Li · Mi Zhang · Yiming Sun · Min Yang

[ ExHall D ]

Abstract
Text-to-image diffusion models have achieved state-of-the-art results in synthesis tasks; however, there is a growing concern about their potential misuse in creating harmful content. To mitigate these risks, post-hoc model intervention techniques, such as concept unlearning and safety guidance, have been developed.However, fine-tuning model weights or adapting the hidden states of the diffusion model operates in an uninterpretable way, making it unclear which part of the intermediate variables is responsible for unsafe generation. These interventions severely affect the sampling trajectory when erasing harmful concepts from complex, multi-concept prompts, thus hindering their practical use in real-world settings.Despite their effectiveness on single-concept prompts, current methods still face challenges when as they struggle to precisely remove harmful concepts without disrupting the semantics of benign ones. In this work, we propose the safe generation framework Detect-and-Guide (DAG), leveraging the internal knowledge of diffusion models to perform self-diagnosis and fine-grained self-regulation during the sampling process.DAG first detects harmful concepts from noisy latents using refined cross-attention maps of optimized tokens, then applies safety guidance with adaptive strength and editing regions to negate unsafe generation.The optimization only requires a small annotated dataset and can provide precise detection maps with generalizability and concept specificity. Moreover, DAG does not …
Poster
Mingcheng Li · Xiaolu Hou · Ziyang Liu · Dingkang Yang · Ziyun Qian · Jiawei Chen · Jinjie Wei · Yue Jiang · Qingyao Xu · Lihua Zhang

[ ExHall D ]

Abstract
Diffusion models have shown excellent performance in text-to-image generation. However, existing methods often suffer from performance bottlenecks when dealing with complex prompts involving multiple objects, characteristics, and relations. Therefore, we propose a Multi-agent Collaboration-based Compositional Diffusion (MCCD) for text-to-image generation of complex scenes. Specifically, we design a multi-agent collaboration-based scene parsing module that generates an agent system containing multiple agents with different tasks, using MLLMs to adequately extract multiple scene elements. In addition, the Hierarchical Compositional Diffusion module utilizes Gaussian masks and filtering to refine bounding-box regions and highlights objects through region enhancement for accurate and high-fidelity generation of complex scenes. Comprehensive experiments demonstrate that our MCCD significantly improves the performance of the baseline models in a training-free manner, giving it a clear advantage in complex scene generation. The code will be open-sourced on GitHub.
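The Gaussian-mask region refinement can be pictured as building a soft mask from a bounding box and using it to boost the latent inside that region. The sketch below uses assumed tensor shapes and is only an illustration, not the MCCD implementation.

```python
import numpy as np

def gaussian_box_mask(h, w, box, sigma_scale=0.5):
    """Soft 2D Gaussian mask centred on a bounding box (x0, y0, x1, y1) in pixels."""
    x0, y0, x1, y1 = box
    cy, cx = (y0 + y1) / 2.0, (x0 + x1) / 2.0
    sy, sx = sigma_scale * (y1 - y0), sigma_scale * (x1 - x0)
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-(((ys - cy) / sy) ** 2 + ((xs - cx) / sx) ** 2) / 2.0)

def enhance_region(latent, box, boost=0.3):
    """Blend a boosted copy of the latent into the box region (toy region enhancement)."""
    mask = gaussian_box_mask(latent.shape[-2], latent.shape[-1], box)
    return latent * (1.0 + boost * mask)

latent = np.random.randn(4, 64, 64)   # assumed (C, H, W) latent
out = enhance_region(latent, box=(10, 12, 40, 50))
print(out.shape)  # (4, 64, 64)
```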
Poster
Xiaoqian Shen · Mohamed Elhoseiny

[ ExHall D ]

Abstract
Recent generative models have demonstrated impressive capabilities in generating realistic and visually pleasing images grounded on textual prompts. Nevertheless, a significant challenge remains in applying these models to the more intricate task of story visualization, since it requires resolving pronouns (he, she, they) in the frame descriptions, i.e., anaphora resolution, and ensuring consistent character and background synthesis across frames. Yet, emerging Large Language Models (LLMs) showcase robust reasoning abilities to navigate through ambiguous references and process extensive sequences. Therefore, we introduce StoryGPT-V, which leverages the merits of the latent diffusion model (LDM) and the LLM to produce images with consistent and high-quality characters grounded on given story descriptions. First, we train a character-aware LDM, which takes character-augmented semantic embedding as input and includes the supervision of the cross-attention map using character segmentation masks, aiming to enhance character generation accuracy and faithfulness. In the second stage, we enable an alignment between the output of the LLM and the character-augmented embedding residing in the input space of the first-stage model. This harnesses the reasoning ability of the LLM to address ambiguous references and its comprehension capability to memorize the context. We conduct comprehensive experiments on two story visualization benchmarks. Our model reports superior quantitative …
Poster
Chengyou Jia · Changliang Xia · Zhuohang Dang · Weijia Wu · Hangwei Qian · Minnan Luo

[ ExHall D ]

Abstract
Despite the significant advancements in text-to-image (T2I) generative models, users often face a trial-and-error challenge in practical scenarios. This challenge arises from the complexity and uncertainty of tedious steps such as crafting suitable prompts, selecting appropriate models, and configuring specific arguments, making users resort to labor-intensive attempts for desired images. This paper proposes Automatic T2I generation, which aims to automate these tedious steps, allowing users to simply describe their needs in a freestyle chatting way. To systematically study this problem, we first introduce ChatGenBench, a novel benchmark designed for Automatic T2I. It features high-quality paired data with diverse freestyle inputs, enabling comprehensive evaluation of automatic T2I models across all steps. Additionally, recognizing Automatic T2I as a complex multi-step reasoning task, we propose ChatGen-Evo, a multi-stage evolution strategy that progressively equips models with essential automation skills. Through extensive evaluation across step-wise accuracy and image quality, ChatGen-Evo significantly enhances performance over various baselines. Our evaluation also uncovers valuable insights for advancing automatic T2I. All our data, code and models will be publicly available.
Poster
Shitao Xiao · Yueze Wang · Junjie Zhou · Huaying Yuan · Xingrun Xing · Ruiran Yan · Chaofan Li · Shuting Wang · Tiejun Huang · Zheng Liu

[ ExHall D ]

Abstract
The emergence of Large Language Models (LLMs) has unified language generation tasks and revolutionized human-machine interaction. However, in the realm of image generation, a unified model capable of handling various tasks within a single framework remains largely unexplored. In this work, we introduce OmniGen, a new diffusion model for unified image generation. OmniGen is characterized by the following features: 1) Unification: OmniGen not only demonstrates text-to-image generation capabilities but also inherently supports various downstream tasks, such as image editing, subject-driven generation, and visual-conditional generation. 2) Simplicity: The architecture of OmniGen is highly simplified, eliminating the need for additional plugins. Moreover, compared to existing diffusion models, it is more user-friendly and can complete complex tasks end-to-end through instructions without the need for extra intermediate steps, greatly simplifying the image generation workflow. 3) Knowledge Transfer: Benefiting from learning in a unified format, OmniGen effectively transfers knowledge across different tasks, manages unseen tasks and domains, and exhibits novel capabilities. We also explore the model's reasoning capabilities and potential applications of the chain-of-thought mechanism. This work represents the first attempt at a general-purpose image generation model, and we will open-source the related resources to foster advancements in this field.
Poster
Dmitrii M Petrov · Pradyumn Goyal · Divyansh Shivashok · Yuanming Tao · Melinos Averkiou · Evangelos Kalogerakis

[ ExHall D ]

Abstract
We introduce ShapeWords, an approach for synthesizing images based on 3D shape guidance and text prompts. ShapeWords incorporates target 3D shape information within specialized tokens embedded together with the input text, effectively blending 3D shape awareness with textual context to guide the image synthesis process. Unlike conventional shape guidance methods that rely on depth maps restricted to fixed viewpoints and often overlook full 3D structure or textual context, ShapeWords generates diverse yet consistent images that reflect both the target shape's geometry and the textual description. Experimental results show that ShapeWords produces images that are more text-compliant and aesthetically plausible, while also maintaining 3D shape awareness.
Poster
Jingxuan Wei · Cheng Tan · Qi Chen · Gaowei Wu · Siyuan Li · Zhangyang Gao · Linzhuang Sun · Bihui Yu · Ruifeng Guo

[ ExHall D ]

Abstract
We introduce the task of text-to-diagram generation, which focuses on creating structured visual representations directly from textual descriptions. Existing approaches in text-to-image and text-to-code generation lack the logical organization and flexibility needed to produce accurate, editable diagrams, often resulting in outputs that are either unstructured or difficult to modify. To address this gap, we introduce DiagramGenBenchmark, a comprehensive evaluation framework encompassing eight distinct diagram categories, including flowcharts, model architecture diagrams, and mind maps. Additionally, we present DiagramAgent, an innovative framework with four core modules—Plan Agent, Code Agent, Check Agent, and Diagram-to-Code Agent—designed to facilitate both the generation and refinement of complex diagrams. Our extensive experiments, which combine objective metrics with human evaluations, demonstrate that DiagramAgent significantly outperforms existing baseline models in terms of accuracy, structural coherence, and modifiability. This work not only establishes a foundational benchmark for the text-to-diagram generation task but also introduces a powerful toolset to advance research and applications in this emerging area.
Poster
Shivam Duggal · Yushi Hu · Oscar Michel · Aniruddha Kembhavi · William Freeman · Noah A. Smith · Ranjay Krishna · Antonio Torralba · Ali Farhadi · Wei-Chiu Ma

[ ExHall D ]

Abstract
Despite the unprecedented progress in the field of 3D generation, current systems still often fail to produce high-quality 3D assets that are visually appealing and geometrically and semantically consistent across multiple viewpoints. To effectively assess the quality of the generated 3D data, there is a need for a reliable 3D evaluation tool. Unfortunately, existing 3D evaluation metrics often overlook the geometric quality of generated assets or merely rely on black-box multimodal large language models for coarse assessment. In this paper, we introduce Eval3D, a fine-grained, interpretable evaluation tool that can faithfully evaluate the quality of generated 3D assets based on various distinct yet complementary criteria. Our key observation is that many desired properties of 3D generation, such as semantic and geometric consistency, can be effectively captured by measuring the consistency among various foundation models and tools. We thus leverage a diverse set of models and tools as probes to evaluate the inconsistency of generated 3D assets across different aspects. Compared to prior work, Eval3D provides pixel-wise measurement, enables accurate 3D spatial feedback, and aligns more closely with human judgments. We comprehensively evaluate existing 3D generation models using Eval3D and highlight the limitations and challenges of current models.
Poster
Ming Li · Jike Zhong · Tianle Chen · Yuxiang Lai · Konstantinos Psounis

[ ExHall D ]

Abstract
Recent studies on large language models (LLMs) and large multimodal models (LMMs) have demonstrated promising skills in various domains including science and mathematics. However, their capability in more challenging and real-world related scenarios like engineering has not been systematically studied. To bridge this gap, we propose EEE-Bench, a multimodal benchmark aimed at assessing LMMs' capabilities in solving practical engineering tasks, using electrical and electronics engineering (EEE) as the testbed. Our benchmark consists of 2860 hand-picked and carefully curated problems spanning 10 essential subdomains such as analog circuits, control systems, etc. Compared to other domains, engineering problems are intrinsically 1) more visually complex and versatile and 2) less deterministic in solutions. Successful solutions to these problems often demand more-than-usual rigorous integration of visual and textual information as models need to understand intricate images like abstract circuits and system diagrams while taking professional instructions. Alongside EEE-Bench, we provide extensive quantitative evaluations, fine-grained analysis, and improvement methods using 17 widely-used open- and closed-sourced LLMs and LMMs and 7 popular prompting techniques. Our results reveal notable deficiencies in current foundation models for EEE, including an average performance ranging from 19.48% to 46.78% and a tendency toward "laziness" in overlooking essential visual context. In summary, …
Poster
Haoyu Wang · Le Wang · Sanping Zhou · Jingyi Tian · Zheng Qin · Yabing Wang · Gang Hua · Wei Tang

[ ExHall D ]

Abstract
Embodied localization based on vision and natural language dialogues presents a persistent challenge in embodied intelligence. Existing methods often approach this task as an image translation problem, leveraging encoder-decoder architectures to predict heatmaps. However, these methods frequently experience a deficiency in accuracy, largely due to their heavy reliance on resolution. To address this issue, we introduce CGD, a novel framework that utilizes a causality-guided diffusion model to directly model coordinate distributions. Specifically, CGD employs a denoising network to regress coordinates, while integrating causal learning modules, namely back-door adjustment (BDA) and front-door adjustment (FDA), to mitigate confounders during the diffusion process. This approach reduces the dependency on high resolution for improving accuracy, while effectively minimizing spurious correlations, thereby promoting unbiased learning. By guiding the denoising process with causal adjustments, CGD offers flexible control over intensity, ensuring seamless integration with diffusion models. Experimental results demonstrate that CGD outperforms state-of-the-art methods across all metrics. Additionally, we also evaluate CGD in a multi-shot setting, achieving consistently high accuracy.
Poster
Eunji Kim · Siwon Kim · Minjun Park · Rahim Entezari · Sungroh Yoon

[ ExHall D ]

Abstract
Recent advancements in text-to-image models, such as Stable Diffusion, show significant demographic biases. Existing de-biasing techniques rely heavily on additional training, which imposes high computational costs and risks of compromising core image generation functionality. This hinders them from being widely adopted in real-world applications. In this paper, we explore Stable Diffusion's overlooked potential to reduce bias without requiring additional training. Through our analysis, we uncover that initial noises associated with minority attributes form "minority regions" rather than being scattered. We view these minority regions as opportunities in SD to reduce bias. To unlock this potential, we propose a novel de-biasing method called "weak guidance," carefully designed to guide a random noise to the minority regions without compromising semantic integrity. Through analysis and experiments on various versions of SD, we demonstrate that our proposed approach effectively reduces bias without additional training, achieving both efficiency and preservation of core image generation functionality.
Poster
Mengfei Xia · Nan Xue · Yujun Shen · Ran Yi · Tieliang Gong · Yong-Jin Liu

[ ExHall D ]

Abstract
Classifier-Free Guidance (CFG), which combines the conditional and unconditional score functions with two coefficients summing to one, serves as a practical technique for diffusion model sampling. Theoretically, however, denoising with CFG cannot be expressed as a reciprocal diffusion process, which may consequently leave some hidden risks during use. In this work, we revisit the theory behind CFG and rigorously confirm that the improper configuration of the combination coefficients (*i.e.*, the widely used summing-to-one version) brings about expectation shift of the generative distribution. To rectify this issue, we propose ReCFG with a relaxation on the guidance coefficients such that denoising with ReCFG strictly aligns with the diffusion theory. We further show that our approach enjoys a **closed-form** solution given the guidance strength. That way, the rectified coefficients can be readily pre-computed via traversing the observed data, leaving the sampling speed barely affected. Empirical evidence on real-world data demonstrates the compatibility of our post-hoc design with existing state-of-the-art diffusion models, including both class-conditioned ones (*e.g.*, EDM2 on ImageNet) and text-conditioned ones (*e.g.*, SD3 on CC12M), without any retraining. We will open-source the code to facilitate further research.
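Concretely, standard CFG combines the conditional and unconditional predictions with coefficients (1 + w) and -w, which sum to one; the relaxation discussed above amounts to decoupling those two coefficients. The snippet is a generic illustration, not ReCFG's closed-form coefficient rule.

```python
import numpy as np

def cfg_combine(eps_cond, eps_uncond, w):
    """Standard classifier-free guidance: coefficients (1 + w) and -w sum to one."""
    return (1.0 + w) * eps_cond - w * eps_uncond

def relaxed_combine(eps_cond, eps_uncond, a, b):
    """Relaxed combination with independent coefficients (a, b); a + b need not be 1."""
    return a * eps_cond + b * eps_uncond

# Toy noise/score predictions from the conditional and unconditional branches.
eps_c = np.random.randn(4, 64, 64)
eps_u = np.random.randn(4, 64, 64)
guided = cfg_combine(eps_c, eps_u, w=7.5)
relaxed = relaxed_combine(eps_c, eps_u, a=8.2, b=-6.9)  # coefficients chosen freely
```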
Poster
Lijun Li · Zhelun Shi · Xuhao Hu · Bowen Dong · Yiran Qin · Xihui Liu · Lu Sheng · Jing Shao

[ ExHall D ]

Abstract
Text-to-image (T2I) models have rapidly advanced, enabling the generation of high-quality images from text prompts across various domains. However, these models present notable safety concerns, including the risk of generating harmful, biased, or private content. Current research on assessing T2I safety remains in its early stages. While some efforts have been made to evaluate models on specific safety dimensions, many critical risks remain unexplored. To address this gap, we introduce T2ISafety, a safety benchmark that evaluates T2I models across three key domains: toxicity, fairness, and bias. We build a detailed hierarchy of 12 tasks and 44 categories based on these three domains, and meticulously collect 70K corresponding prompts. Based on this taxonomy and prompt set, we build a large-scale T2I dataset with 68K manually annotated images and train an evaluator capable of detecting critical risks that previous work has failed to identify, including risks that even ultra-large proprietary models like GPTs cannot correctly detect. We evaluate 12 prominent diffusion models on T2ISafety and reveal several concerns including persistent issues with racial fairness, a tendency to generate toxic content, and significant variation in privacy protection across the models, even with defense methods like concept erasing.
Poster
Naveen George · Karthik Nandan Dasaraju · Rutheesh Reddy Chittepu · Konda Reddy Mopuri

[ ExHall D ]

Abstract
Text-to-image models such as Stable Diffusion, DALL·E, and Midjourney have gained immense popularity lately. However, they are trained on vast amounts of data that may include private, explicit, or copyrighted material used without permission, raising serious legal and ethical concerns. In light of the recent regulations aimed at protecting individual data privacy, there has been a surge in Machine Unlearning methods designed to remove specific concepts from these models. However, we identify a critical flaw in these unlearning techniques: unlearned concepts will revive when the models are fine-tuned, even with general or unrelated prompts. In this paper, for the first time, through an extensive study, we demonstrate the unstable nature of existing unlearning methods in text-to-image diffusion models. We introduce a framework that includes a couple of measures for analyzing the stability of existing unlearning methods. Further, the paper offers preliminary insights into the plausible explanation for the instability of the mapping-based unlearning methods that can guide future research toward more robust unlearning techniques. Anonymized codes for implementing the proposed framework are provided.
Poster
Ding Qi · Jian Li · Junyao Gao · Shuguang Dou · Ying Tai · Jianlong Hu · Bo Zhao · Yabiao Wang · Chengjie Wang · Cai Rong Zhao

[ ExHall D ]

Abstract
Dataset distillation (DD) condenses key information from large-scale datasets into smaller synthetic datasets, reducing storage and computational costs for training networks. However, recent research has primarily focused on image classification tasks, with limited expansion to detection and segmentation. Two key challenges remain: (i) Task Optimization Heterogeneity, where existing methods focus on class-level information and fail to address the diverse needs of detection and segmentation; and (ii) Inflexible Image Generation, where current generation methods rely on global updates for single-class targets and lack localized optimization for specific object regions. To address these challenges, we propose a universal dataset distillation framework, named UniDD, a task-driven diffusion model for diverse DD tasks, as illustrated in Fig. 1. Our approach operates in two stages: Universal Task Knowledge Mining, which captures task-relevant information through task-specific proxy model training, and Universal Task-Driven Diffusion, where these proxies guide the diffusion process to generate task-specific synthetic images. Extensive experiments across ImageNet-1K, Pascal VOC, and MS COCO demonstrate that UniDD consistently outperforms state-of-the-art methods. In particular, on ImageNet-1K with IPC-10, UniDD surpasses previous diffusion-based methods by 6.1%, while also reducing deployment costs.
Poster
Peter Sushko · Ayana Bharadwaj · Zhi Yang Lim · Vasily Ilin · Ben Caffee · Dongping Chen · Reza Salehi · Cheng-Yu Hsieh · Ranjay Krishna

[ ExHall D ]

Abstract
Existing image editing models struggle to meet real-world demands; despite excelling on academic benchmarks, they have yet to be adopted to solve real user needs. The datasets that power these models use artificial edits, lacking the scale and ecological validity necessary to address the true diversity of user requests. In response, we introduce RealEdit, a large-scale image editing dataset with authentic user requests and human-made edits sourced from Reddit. RealEdit contains a test set of 9.3K examples that the community can use to evaluate models on real user requests. Our results show that existing models fall short on these tasks, implying a need for realistic training data. We therefore introduce 48K training examples, with which we train our RealEdit model. Our model achieves substantial gains—outperforming competitors by up to 165 Elo points in human judgment and 92% relative improvement on the automated VIEScore metric on our test set. We deploy our model back on Reddit, testing it on new requests, and receive positive feedback. Beyond image editing, we explore RealEdit's potential in detecting edited images by partnering with a deepfake detection non-profit. Finetuning their model on RealEdit data improves its F1-score by 14 percentage points, underscoring the dataset's value for broad, …
Poster
Long Xu · Jiakai Wang · Haojie Hao · Haotong Qin · Jiejie Zhao · Xianglong Liu

[ ExHall D ]

Abstract
Though achieving significant success in personalized image synthesis, Latent Diffusion Models (LDMs) pose substantial social risks caused by unauthorized misuse (e.g., face theft). To counter these threats, the Anti-Customization (AC) method that exploits adversarial perturbations has been proposed. Unfortunately, existing AC methods show insufficient defense ability because they ignore hierarchical characteristics, i.e., global feature correlations and local facial attributes, leading to weak resistance to concept transfer and semantic theft by customization methods. To address this problem, we propose a **G**lobal-l**o**cal c**o**llaborate**d** Anti-Customization (GoodAC) framework to generate powerful adversarial perturbations by disturbing both feature correlations and facial attributes. To enhance the ability to resist concept transfer, we disrupt the spatial correlation of perceptual features that form the basis of model generation at a global level, thereby creating highly concept-transfer-resistant adversarial camouflage. To improve the ability to resist semantic theft, leveraging the fact that facial attributes are personalized, we design a personalized and precise facial attribute distortion strategy locally, focusing the attack on the individual's image structure to generate strong camouflage. Extensive experiments on various LDMs, including DreamBooth, LoRA, and textual inversion, have strongly demonstrated that our GoodAC outperforms other state-of-the-art approaches by large margins, e.g., over 50% improvements on ISM.
Poster
Haonan An · Guang Hua · Zhengru Fang · Guowen Xu · Susanto Rahardja · Yuguang Fang

[ ExHall D ]

Abstract
The intellectual property of deep image-to-image models can be protected by so-called box-free watermarking. It uses an encoder to embed invisible copyright marks into the model's output images and a decoder to extract them. Prior works have improved watermark robustness, focusing on the design of better watermark encoders. In this paper, we reveal an overlooked vulnerability of the unprotected watermark decoder, which is jointly trained with the encoder and can be exploited to train a watermark removal network. To defend against such an attack, we propose the decoder gradient shield (DGS) as a protection layer in the decoder API to prevent gradient-based watermark removal with a closed-form solution. The fundamental idea is inspired by the classical adversarial attack, but is utilized for the first time as a defensive mechanism in box-free model watermarking. We then demonstrate that DGS can reorient and rescale the gradient directions of watermarked queries and stop the watermark remover's training loss from converging to the level without DGS, while retaining decoder output image quality. Experimental results verify the effectiveness of the proposed method. The code will be made available upon acceptance.
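The idea of a protection layer in the decoder API can be sketched as an identity-in-forward layer that rewrites gradients in the backward pass, here by rescaling them and adding noise. This is a generic gradient-shielding sketch; the closed-form reorientation used by DGS itself is not reproduced.

```python
import torch

class GradientShield(torch.autograd.Function):
    """Identity in the forward pass; rescales and perturbs gradients in backward.

    A generic sketch of gradient shielding: an attacker training a removal
    network through this layer receives distorted gradients, while normal
    decoder outputs (the forward pass) are untouched.
    """

    @staticmethod
    def forward(ctx, x, scale=10.0, noise_std=1.0):
        ctx.scale = scale
        ctx.noise_std = noise_std
        return x

    @staticmethod
    def backward(ctx, grad_output):
        distorted = ctx.scale * grad_output + ctx.noise_std * torch.randn_like(grad_output)
        return distorted, None, None   # no gradients for scale / noise_std

# Toy usage: gradients reaching `x` are no longer the true gradients of the loss.
x = torch.randn(2, 3, requires_grad=True)
y = GradientShield.apply(x, 10.0, 1.0)
y.sum().backward()
print(x.grad.shape)  # torch.Size([2, 3])
```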
Poster
Yuan Gan · Jiaxu Miao · Yunze Wang · Yi Yang

[ ExHall D ]

Abstract
Advances in talking-head animation based on Latent Diffusion Models (LDM) enable the creation of highly realistic, synchronized videos. These fabricated videos are indistinguishable from real ones, increasing the risk of potential misuse for scams, political manipulation, and misinformation. Hence, addressing these ethical concerns has become a pressing issue in AI security. Recent proactive defense studies focused on countering LDM-based models by adding perturbations to portraits. However, these methods are ineffective at protecting reference portraits from advanced image-to-video animation. The limitations are twofold: 1) they fail to prevent images from being manipulated by audio signals, and 2) diffusion-based purification techniques can effectively eliminate protective perturbations. To address these challenges, we propose Silencer, a two-stage method designed to proactively protect the privacy of portraits. First, a nullifying loss is proposed to ignore audio control in talking-head generation. Second, we apply anti-purification loss in LDM to optimize the inverted latent feature to generate robust perturbations. Extensive experiments demonstrate the effectiveness of Silencer in proactively protecting portrait privacy. We hope this work will raise awareness among the AI security community regarding critical ethical issues related to talking-head generation techniques.
Poster
Zexi Jia · Chuanwei Huang · Yeshuang Zhu · Hongyan Fei · Xiaoyue Duan · Yuan Zhiqiang · Ying Deng · Jiapei Zhang · Jinchao Zhang · Jie Zhou

[ ExHall D ]

Abstract
The rapid advancement of Generative Adversarial Networks (GANs) and diffusion models significantly enhances the realism of synthetic images, driving progress in image processing and creative design. However, this progress also necessitates the development of effective detection methods, as synthetic images become increasingly difficult to distinguish from real ones. This difficulty leads to various societal issues, such as the spread of misinformation, identity theft, and online fraud. While previous detection methods perform well on public benchmarks, they struggle with our proposed benchmark, FakeART, particularly when dealing with the latest models and cross-domain tasks (e.g., photo-to-painting). To address this challenge, we develop a new synthetic image detection technique based on color distribution. Unlike real images, synthetic images often exhibit uneven color distribution. By employing color quantization and restoration techniques, we analyze the color differences before and after image restoration. We discover and prove that these differences closely relate to the uniformity of color distribution. Based on this finding, we extract effective color features and combine them with image features to create a detection model with only 1.4 million parameters. This model achieves state-of-the-art results across various evaluation benchmarks, including the challenging FakeART dataset.
Poster
Siyuan Cheng · Lingjuan Lyu · Zhenting Wang · Xiangyu Zhang · Vikash Sehwag

[ ExHall D ]

Abstract
With the rapid advancement of generative AI, it is now possible to synthesize high-quality images in a few seconds. Despite the power of these technologies, they raise significant concerns regarding misuse. Current efforts to distinguish between real and AI-generated images may lack generalization, being effective for only certain types of generative models and susceptible to post-processing techniques like JPEG compression. To overcome these limitations, we propose a novel framework, CO-SPY, that first enhances existing semantic features (e.g., the number of fingers in a hand) and artifact features (e.g., pixel value differences), and then adaptively integrates them to achieve more general and robust synthetic image detection. Additionally, we create CO-SPYBench, a comprehensive dataset comprising 5 real image datasets and 22 state-of-the-art generative models, including the latest models like FLUX. We also collect 50k synthetic images in the wild from the Internet to enable evaluation in a more practical setting. Our extensive evaluations demonstrate that our detector outperforms existing methods under identical training conditions, achieving an average accuracy improvement of approximately 11% to 34%.
Poster
Ian Huang · Yanan Bao · Karen Truong · Howard Zhou · Cordelia Schmid · Leonidas Guibas · Alireza Fathi

[ ExHall D ]

Abstract
Scene generation with 3D assets presents a complex challenge, requiring both high-level semantic understanding and low-level geometric reasoning. While Multimodal Large Language Models (MLLMs) excel at semantic tasks, their application to 3D scene generation is hindered by their limited grounding on 3D geometry. In this paper, we investigate how to best work with MLLMs in an object placement task. Towards this goal, we introduce a novel framework, FirePlace, that applies existing MLLMs in (1) 3D geometric reasoning and the extraction of relevant geometric details from the 3D scene, (2) constructing and solving geometric constraints on the extracted low-level geometry, and (3) pruning for final placements that conform to common sense. By combining geometric reasoning with real-world understanding of MLLMs, our method can propose object placements that satisfy both geometric constraints as well as high-level semantic common-sense considerations. Our experiments show that these capabilities allow our method to place objects more effectively in complex scenes with intricate geometry, surpassing the quality of prior work.
Poster
Chamin Hewa Koneputugodage · Yizhak Ben-Shabat · Sameera Ramasinghe · Stephen Gould

[ ExHall D ]

Abstract
Implicit Neural Representations (INRs) are a versatile and powerful tool for encoding various forms of data, including images, videos, sound, and 3D shapes. A critical factor in the success of INRs is the initialization of the network, which can significantly impact the convergence and accuracy of the learned model. Unfortunately, commonly used neural network initializations are not widely applicable for many activation functions, especially those used by INRs. In this paper, we improve upon previous initialization methods by deriving an initialization that has stable variance across layers, and applies to any activation function. We show that this generalizes many previous initialization methods, and has even better stability for well studied activations. We also show that our initialization leads to improved results with INR activation functions in multiple signal modalities. Our approach is particularly effective for Gaussian INRs, where we demonstrate that the theory of our initialization matches with task performance in multiple experiments, allowing us to achieve improvements in image, audio, and 3D surface reconstruction.
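A generic recipe for variance-stable initialization with an arbitrary activation is to estimate the activation's second moment under unit-Gaussian pre-activations and scale the weights accordingly. The Monte-Carlo sketch below follows that standard gain-based argument under simplifying independence assumptions; it is not the paper's derivation.

```python
import numpy as np

def stable_init_std(act, fan_in, n_samples=200_000, seed=0):
    """Weight std that keeps pre-activation variance roughly 1 across layers.

    Assumes pre-activations are approximately N(0, 1) and weights are zero-mean
    and independent of activations, so Var(next pre-act) = fan_in * w_var * E[act(z)^2].
    Setting w_var = 1 / (fan_in * E[act(z)^2]) keeps the variance stable; this is a
    Monte-Carlo version of gain-based initialization, not the paper's exact result.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n_samples)
    second_moment = np.mean(act(z) ** 2)
    return 1.0 / np.sqrt(fan_in * second_moment)

# Examples: ReLU recovers the familiar He scaling sqrt(2 / fan_in);
# a sine activation (SIREN-style) and a Gaussian bump get their own scales.
relu = lambda x: np.maximum(x, 0.0)
sine = lambda x: np.sin(30.0 * x)
gauss = lambda x: np.exp(-x**2 / (2 * 0.3**2))
for name, act in [("relu", relu), ("sine", sine), ("gauss", gauss)]:
    print(name, stable_init_std(act, fan_in=256))
```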
Poster
Lo-Wei Tai · Ching-En Li · Cheng-Lin Chen · Chih-Jung Tsai · Hwann-Tzong Chen · Tyng-Luh Liu

[ ExHall D ]

Abstract
Principal Component Analysis (PCA), a classical dimensionality reduction technique, and Gaussian Splatting, a recent high-quality image synthesis method, represent fundamentally different approaches to image representation. Despite these significant differences, we present EigenGS, a novel method that bridges these two paradigms. By establishing an efficient transformation pipeline between eigenspace and image-space Gaussian representations, our approach enables instant initialization of Gaussian parameters for new images without requiring per-image training from scratch. Our method also introduces a frequency-aware learning mechanism that encourages Gaussians to adapt to different scales in order to better model spatial frequencies, effectively preventing artifacts in high-resolution reconstruction. Extensive experiments demonstrate that EigenGS not only achieves superior reconstruction quality but also dramatically accelerates convergence. The results highlight EigenGS's effectiveness and its ability to generalize across images with varying resolutions and diverse categories. This makes high-quality Gaussian Splatting practically viable for real-time applications.
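The eigenspace-to-image bridge can be pictured as fitting PCA on a set of images and describing any new image by its projection coefficients, which can then seed a per-image representation without training from scratch. The sketch below shows only the PCA side via SVD; the mapping from coefficients to Gaussian parameters is the paper's learned transformation and is not reproduced.

```python
import numpy as np

def fit_pca(images, k):
    """Fit a k-component PCA ("eigen-images") on flattened images of shape (N, H*W)."""
    mean = images.mean(axis=0)
    centered = images - mean
    # Rows of vt are orthonormal principal directions in image space.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:k]

def project(image, mean, components):
    """Coefficients of a new image in the eigenspace."""
    return components @ (image - mean)

def reconstruct(coeffs, mean, components):
    """Approximate the image back from its eigenspace coefficients."""
    return mean + components.T @ coeffs

# Toy usage on random "images".
rng = np.random.default_rng(0)
data = rng.standard_normal((100, 32 * 32))
mean, comps = fit_pca(data, k=16)
coeffs = project(data[0], mean, comps)        # (16,) instant per-image description
approx = reconstruct(coeffs, mean, comps)     # (1024,) coarse reconstruction
print(coeffs.shape, approx.shape)
```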
Poster
Ruoyu Xue · Jingyi Xu · Sounak Mondal · Hieu Le · Gregory Zelinsky · Minh Hoai · Dimitris Samaras

[ ExHall D ]

Abstract
A personalized model for scanpath prediction provides insights into the visual preferences and attention patterns of individual subjects. However, existing methods for training scanpath prediction models are data-intensive and cannot be effectively personalized to new individuals with only a few available examples. In this paper, we propose the few-shot personalized scanpath prediction task (FS-PSP) and a novel method to address it, which aims to predict scanpaths for an unseen subject using minimal support data of that subject's scanpath behavior. The key to our method's adaptability is the Subject-Embedding Network (SE-Net), specifically designed to capture unique, individualized representations for each user's scanpaths. SE-Net generates subject embeddings that effectively distinguish between subjects while minimizing variability among scanpaths from the same individual. The personalized scanpath prediction model is then conditioned on these subject embeddings to produce accurate, personalized results. Experiments on multiple eye-tracking datasets demonstrate that our method excels in FS-PSP settings and does not require any fine-tuning steps at test time.
Poster
Pierre Vuillecard · Jean-marc Odobez

[ ExHall D ]

Abstract
Accurate 3D gaze estimation in unconstrained real-world environments remains a significant challenge due to variations in appearance, head pose, occlusion, and the limited availability of in-the-wild 3D gaze datasets. To address these challenges, we introduce a novel Self-Training Weakly-Supervised Gaze Estimation framework (ST-SWGE). This two-stage learning framework leverages diverse 2D gaze datasets, such as gaze-following data, which offer rich variations in appearances, natural scenes, and gaze distributions, and proposes an approach to generate 3D pseudo-labels and enhance model generalization. Furthermore, traditional modality-specific models, designed separately for images or videos, limit the effective use of available training data. To overcome this, we propose the Gaze Transformer (GaT), a modality-agnostic architecture capable of simultaneously learning static and dynamic gaze information from both image and video datasets. By combining 3D video datasets with 2D gaze target labels from gaze following tasks, our approach achieves the following key contributions: (i) Significant state-of-the-art improvements in within-domain and cross-domain generalization on unconstrained benchmarks like Gaze360 and GFIE, with notable cross-modal gains in video gaze estimation; (ii) Superior cross-domain performance on datasets such as MPIIFaceGaze and Gaze360 compared to frontal face methods. Code and pre-trained models will be released to the community.
Poster
Zhifeng Xie · Qile He · Youjia Zhu · Qiwei He · Mengtian Li

[ ExHall D ]

Abstract
In this work, we implement music production for silent film clips using an LLM-driven method. Given the strong professional demands of film music production, we propose FilmComposer, which simulates the actual workflows of professional musicians. FilmComposer is the first to combine large generative models with a multi-agent approach, leveraging the advantages of both waveform music and symbolic music generation. Additionally, FilmComposer is the first to focus on the three core elements of music production for film—audio quality, musicality, and musical development—and introduces various controls, such as rhythm, semantics, and visuals, to enhance these key aspects. Specifically, FilmComposer consists of the visual processing module, rhythm-controllable MusicGen, and multi-agent assessment, arrangement and mixing. In addition, our framework can seamlessly integrate into the actual music production pipeline and allows user intervention in every step, providing strong interactivity and a high degree of creative freedom. Furthermore, given the lack of a professional, high-quality film music dataset, we propose MusicPro-7k, which includes 7,000 film clips, music, descriptions, rhythm spots and main melodies. Finally, both the standard metrics and the new specialized metrics we propose demonstrate that the music generated by our model achieves state-of-the-art performance in terms of quality, consistency with video, diversity, musicality, and …
Poster
Saksham Kushwaha Kushwaha · Yapeng Tian

[ ExHall D ]

Abstract
Recent advances in audio generation have focused on text-to-audio (T2A) and video-to-audio (V2A) tasks. However, T2A and V2A methods alone cannot generate holistic sounds (both onscreen and offscreen): T2A cannot generate sounds that align with onscreen objects, while V2A cannot generate semantically complete audio (offscreen sounds are missing). In this work, we address the task of holistic audio generation: given a video and a text prompt, we aim to generate both onscreen and offscreen sounds that are temporally synchronized with the video and semantically aligned with text and video. Previous approaches for joint text and video-to-audio generation often suffer from modality bias, favoring one modality over the other. To overcome this limitation, we introduce VinTAGe, a flow-based transformer model that jointly considers text and video to guide audio generation. Our framework comprises two key components: a Visual-Text Encoder and a Joint VT-SiT model. To reduce modality bias and improve generation quality, we employ pretrained uni-modal text-to-audio and video-to-audio generation models for additional guidance. Due to the lack of appropriate benchmarks, we also introduce VinTAGe-Bench, a dataset of 636 video-text-audio pairs containing both onscreen and offscreen sounds. Our comprehensive experiments on VinTAGe-Bench demonstrate that joint text and visual interaction is necessary for holistic …
Poster
Hyeonggon Ryu · Seongyu Kim · Joon Chung · Arda Senocak

[ ExHall D ]

Abstract
We present a unified model capable of simultaneously grounding both spoken language and non-speech sounds within a visual scene, addressing key limitations in current audio-visual grounding models. Existing approaches are typically limited to handling either speech or non-speech sounds independently, or at best, together but sequentially without mixing. This limitation prevents them from capturing the complexity of real-world audio sources that are often mixed. Our approach introduces a "mix-and-separate" framework with audio-visual alignment objectives that jointly learn correspondence and disentanglement using mixed audio. Through these objectives, our model learns to produce distinct embeddings for each audio type, enabling effective disentanglement and grounding across mixed audio sources. Additionally, we created a new dataset to evaluate simultaneous grounding of mixed audio sources, demonstrating that our model outperforms prior methods. Our approach also achieves state-of-the-art performance in standard segmentation and cross-modal retrieval tasks, highlighting the benefits of our mix-and-separate approach.
Poster
Ruohao Guo · Xianghua Ying · Yaru Chen · Dantong Niu · Guangyao Li · Liao Qu · Yanyu Qi · Jinxing Zhou · Bowei Xing · Wenzhen Yue · Ji Shi · Qixun Wang · Peiliang Zhang · Buwen Liang

[ ExHall D ]

Abstract
In this paper, we propose a new multi-modal task, termed audio-visual instance segmentation (AVIS), which aims to simultaneously identify, segment and track individual sounding object instances in audible videos. To facilitate this research, we introduce a high-quality benchmark named AVISeg, containing over 90K instance masks from 26 semantic categories in 926 long videos. Additionally, we propose a strong baseline model for this task. Our model first localizes sound sources within each frame, and condenses object-specific contexts into concise tokens. Then it builds long-range audio-visual dependencies between these tokens using window-based attention, and tracks sounding objects across the entire video sequence. Extensive experiments reveal that our method performs best on AVISeg, surpassing the existing methods from related tasks. We further conduct the evaluation on several multi-modal large models; however, they exhibit subpar performance on instance-level sound source localization and temporal perception. We expect that AVIS will inspire the community towards a more comprehensive multi-modal understanding.
Poster
Yung-Hsuan Lai · Janek Ebbers · Yu-Chiang Frank Wang · François Germain · Michael J. Jones · Moitreya Chatterjee

[ ExHall D ]

Abstract
Audio-Visual Video Parsing (AVVP) entails the challenging task of localizing both unimodal events, i.e., those occurring either exclusively in the visual or acoustic modalities of a video, and multimodal events, i.e., those occurring in both modalities concurrently. Moreover, the prohibitive cost of annotating the training data with the class labels of all these events, along with their start and end times, imposes constraints on the scalability of AVVP techniques unless they can be trained in a weakly-supervised setting, e.g. where only modality-agnostic, video-level labels might be assumed to be available in the training data. To this end, recently proposed approaches seek to generate segment-level pseudo-labels to better guide the training of these methods. However, the lack of inter-segment consistency of these pseudo-labels and the general bias towards predicting labels that are absent in a segment, limit their performance. This work proposes a novel approach towards overcoming these weaknesses called Uncertainty-weighted Weakly-supervised Audio-visual Video Parsing (UWAV). Additionally, our innovative approach factors in the uncertainty associated with these estimated pseudo-labels and incorporates a feature mixup based training regularization for improved training. Empirical evaluations show that UWAV outperforms the current state-of-the-art for the AVVP task on multiple metrics, across two different datasets, attesting to …
Poster
Bo Fang · Wenhao Wu · Qiangqiang Wu · YuXin Song · Antoni B. Chan

[ ExHall D ]

Abstract
Audio Descriptions (ADs) aim to provide a narration of a movie in text form, describing non-dialogue-related narratives, such as characters, actions, or scene establishment. Automatic generation of ADs remains challenging due to: i) the domain gap between movie-AD data and existing data used to train vision-language models, and ii) the issue of contextual redundancy arising from highly similar neighboring visual clips in a long movie. In this work, we propose **DistinctAD**, a novel two-stage framework for generating ADs that emphasize distinctiveness to produce better narratives. To address the domain gap, we introduce a CLIP-AD adaptation strategy that does not require additional AD corpora, enabling more effective alignment between movie and AD modalities at both global and fine-grained levels. In Stage-II, DistinctAD incorporates two key innovations: (i) a Contextual Expectation-Maximization Attention (EMA) module that reduces redundancy by extracting common bases from consecutive video clips, and (ii) an explicit distinctive word prediction loss that filters out repeated words in the context, ensuring the prediction of unique terms specific to the current AD. Comprehensive evaluations on MAD-Eval, CMD-AD, and TV-AD benchmarks demonstrate the superiority of DistinctAD, with the model consistently outperforming baselines, particularly in Recall@k/N, highlighting its effectiveness in producing high-quality, distinctive ADs.
Poster
Kumar Ashutosh · Tushar Nagarajan · Georgios Pavlakos · Kris Kitani · Kristen Grauman

[ ExHall D ]

Abstract
Feedback is essential for learning a new skill or improving one's current skill-level. However, current methods for skill-assessment from video only provide scores or compare demonstrations, leaving the burden of knowing what to do differently on the user. We introduce a novel method to generate _actionable feedback_ from video of a person doing a physical activity, such as basketball or soccer. Our method takes a video demonstration and its accompanying 3D body pose and generates (1) free-form expert commentary describing what the person is doing well and what they could improve, and (2) a visual expert demonstration that incorporates the required corrections. We show how to leverage Ego-Exo4D's videos of skilled activity and expert commentary together with a strong language model to create a weakly-supervised training dataset for this task, and we devise a multimodal video-language model to infer coaching feedback. Our method is able to reason across multi-modal input combinations to output full-spectrum, actionable coaching---expert commentary, expert video retrieval, and expert pose generation---outperforming strong vision-language models on both established metrics and human preference studies. Code and data will be publicly released.
Poster
Rong Gao · Xin Liu · Zhuozhao Hu · Bohao Xing · Baiqiang XIA · Zitong YU · Heikki Kälviäinen

[ ExHall D ]

Abstract
Figure skating, known as the “Art on Ice,” is among the most artistic sports, challenging to understand due to its blend of technical elements (like jumps and spins) and overall artistic expression. Existing figure skating datasets mainly focus on single tasks, such as action recognition or scoring, lacking comprehensive annotations for both technical and artistic evaluation. Current sports research is largely centered on ball games, with limited relevance to artistic sports like figure skating. To address this, we introduce FSAnno, a large-scale dataset advancing artistic sports understanding through figure skating. FSAnno includes an open-access training and test dataset, alongside a benchmark dataset, FSBench, for fair model evaluation. FSBench consists of FSBench-Text, with multiple-choice questions and explanations, and FSBench-Motion, containing multimodal data and Question and Answer (QA) pairs, supporting tasks from technical analysis to performance commentary. Initial tests on FSBench reveal significant limitations in existing models’ understanding of artistic sports. We hope FSBench will become a key tool for evaluating and enhancing model comprehension of figure skating.
Poster
Yuying Ge · Yizhuo Li · Yixiao Ge · Ying Shan

[ ExHall D ]

Abstract
In recent years, there has been a significant surge of interest in unifying image comprehension and generation within Large Language Models (LLMs). This growing interest has prompted us to explore extending this unification to videos. The core challenge lies in developing a versatile video tokenizer that captures both the spatial characteristics and temporal dynamics of videos to obtain representations for LLMs, and the representations can be further decoded into realistic video clips to enable video generation. In this work, we introduce Divot, a Diffusion-Powered Video Tokenizer, which leverages the diffusion process for self-supervised video representation learning. We posit that if a video diffusion model can effectively de-noise video clips by taking the features of a video tokenizer as the condition, then the tokenizer has successfully captured robust spatial and temporal information. Additionally, the video diffusion model inherently functions as a de-tokenizer, decoding videos from their representations. Building upon the Divot tokenizer, we present Divot-Vicuna through video-to-text autoregression and text-to-video generation by modeling the distributions of continuous-valued Divot features with a Gaussian Mixture Model. Experimental results demonstrate that our diffusion-based video tokenizer, when integrated with a pre-trained LLM, achieves competitive performance across various video comprehension and generation benchmarks. The instruction tuned …
Poster
Tianyi Xiong · Xiyao Wang · Dong Guo · Qinghao Ye · Haoqi Fan · Quanquan Gu · Heng Huang · Chunyuan Li

[ ExHall D ]

Abstract
We introduce LLaVA-Critic, the first open-source large multimodal model (LMM) designed as a generalist evaluator to assess performance across a wide range of multimodal tasks. LLaVA-Critic is trained using a high-quality critic instruction-following dataset that incorporates diverse evaluation criteria and scenarios. Our experiments demonstrate the model's effectiveness in two key areas: (i) LMM-as-a-Judge, where LLaVA-Critic provides reliable evaluation scores, performing on par with or surpassing GPT models on multiple evaluation benchmarks; and (ii) Preference Learning, where it generates reward signals for preference learning, enhancing model alignment capabilities. This work underscores the potential of open-source LMMs in self-critique and evaluation, setting the stage for future research into scalable, superhuman alignment feedback mechanisms for LMMs.
Poster
Yiping Wang · Xuehai He · Kuan Wang · Luyao Ma · Jianwei Yang · Shuohang Wang · Simon Shaolei Du · yelong shen

[ ExHall D ]

Abstract
The current state-of-the-art video generative models can produce commercial-grade videos with highly realistic details. However, they still struggle to coherently present multiple sequential events in specific short stories, which is foreseeably an essential capability for future long video generation scenarios. While existing detail-oriented benchmarks primarily focus on fine-grained metrics like aesthetic quality and spatial-temporal consistency, they fall short of evaluating models' abilities to handle event-level story presentation. To address this gap, we introduce StoryEval, a story-oriented benchmark specifically designed to assess text-to-video (T2V) models' story-completion capabilities. StoryEval features 423 prompts spanning 7 classes, each representing short stories composed of 2–4 consecutive events. We employ Vision-Language Models, such as GPT-4V and LLaVA-OV-Chat-72B, to verify the completion of each event in the generated videos, applying a unanimous voting method to enhance reliability. Our methods ensure high alignment with human evaluations, and the evaluation of 11 models reveals its challenge, with none exceeding an average story-completion rate of 50%. StoryEval provides a new benchmark for advancing T2V models and highlights the challenges and opportunities in developing next-generation solutions for coherent story-driven video generation.
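A minimal sketch of the unanimous-voting step described above, assuming each verifier returns a per-event yes/no verdict; the `verify_event`-style callables stand in for VLM judges such as GPT-4V or LLaVA-OV and are placeholders, not the benchmark's actual interface.

```python
def story_completion_rate(videos, verifiers):
    """videos: list of dicts with 'frames' and a list of textual 'events'.
    verifiers: callables (frames, event) -> bool, e.g. wrappers around VLM judges.
    An event counts as completed only if every verifier agrees (unanimous voting);
    a video's score is the fraction of its events completed."""
    scores = []
    for video in videos:
        completed = 0
        for event in video["events"]:
            votes = [verify(video["frames"], event) for verify in verifiers]
            if all(votes):          # unanimous agreement required
                completed += 1
        scores.append(completed / len(video["events"]))
    return sum(scores) / len(scores)  # average story-completion rate
```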
Poster
Zihui Xue · Joungbin An · Xitong Yang · Kristen Grauman

[ ExHall D ]

Abstract
While image captioning provides isolated descriptions for individual images, and video captioning offers one single narrative for an entire video clip, our work explores an important middle ground: progress-aware video captioning at the frame level. This novel task aims to generate temporally fine-grained captions that not only accurately describe each frame but also capture the subtle progression of actions throughout a video sequence. Despite the strong capabilities of existing leading vision language models, they often struggle to discern the nuances of frame-wise differences. To address this, we propose ProgressCaptioner, a captioning model designed to capture the fine-grained temporal dynamics within an action sequence. Alongside, we develop the FrameCap dataset to support training and the FrameCapEval benchmark to assess caption quality. The results demonstrate that ProgressCaptioner significantly surpasses leading captioning models, producing precise captions that accurately capture action progression and set a new standard for temporal precision in video captioning. Finally, we showcase practical applications of our approach, specifically in aiding keyframe selection and advancing video understanding, highlighting its broad utility.
Poster
Tengda Han · Dilara Gokay · Joseph Heyward · Chuhan Zhang · Daniel Zoran · Viorica Patraucean · Joao Carreira · Dima Damen · Andrew Zisserman

[ ExHall D ]

Abstract
We address the challenge of representation learning from a continuous stream of video as input, in a self-supervised manner. This differs from the standard approaches to video learning where videos are chopped and shuffled during training in order to create a non-redundant batch that satisfies the independently and identically distributed (IID) sample assumption expected by conventional training paradigms. When videos are only available as a continuous stream of input, the IID assumption is evidently broken, leading to poor performance. We demonstrate the drop in performance when moving from shuffled to sequential learning on three systems: the one-video representation learning method DoRA, standard VideoMAE, and the task of future video prediction. To address this drop, we propose a geometric modification to standard optimizers, to decorrelate batches by utilising orthogonal gradients during training. The proposed modification can be applied to any optimizer -- we demonstrate it with Stochastic Gradient Descent (SGD) and AdamW. Our proposed orthogonal optimizer allows models trained from streaming videos to alleviate the drop in representation learning performance, as evaluated on downstream tasks. On three scenarios (DoRA, VideoMAE, future prediction), we show our orthogonal optimizer outperforms the strong AdamW in all three cases.
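The abstract describes a geometric modification that decorrelates consecutive, highly correlated streaming batches by orthogonalizing gradients. A minimal sketch of one plausible instantiation, projecting the current gradient onto the orthogonal complement of the previous step's gradient before handing it to any standard optimizer; this illustrates the idea, not the paper's exact rule.

```python
import torch

@torch.no_grad()
def orthogonalize_gradients(model, prev_grads):
    """Project each parameter's gradient to be orthogonal to the previous gradient.

    prev_grads: dict mapping parameter -> gradient tensor from the previous step
    (updated in place). Works with any optimizer, e.g. SGD or AdamW.
    """
    for p in model.parameters():
        if p.grad is None:
            continue
        g = p.grad
        g_prev = prev_grads.get(p)
        if g_prev is not None:
            denom = (g_prev * g_prev).sum()
            if denom > 0:
                # remove the component of g along the previous gradient direction
                g -= (g * g_prev).sum() / denom * g_prev
        prev_grads[p] = g.clone()

# usage with a standard optimizer:
# prev_grads = {}
# loss.backward()
# orthogonalize_gradients(model, prev_grads)
# optimizer.step(); optimizer.zero_grad()
```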
Poster
Jihan Yang · Shusheng Yang · Anjali W. Gupta · Rilyn Han · Li Fei-Fei · Saining Xie

[ ExHall D ]

Abstract
Humans possess the visual-spatial intelligence to remember spaces from sequential visual observations. However, can Multimodal Large Language Models (MLLMs) trained on million-scale video datasets also "think in space" from videos? We present a novel video-based visual-spatial intelligence benchmark (VSI-Bench) of over 5,000 question-answer pairs, and find that MLLMs exhibit competitive—though subhuman—visual-spatial intelligence. We probe models to express how they think in space both linguistically and visually and find that while spatial reasoning capabilities remain the primary bottleneck for MLLMs to reach higher benchmark performance, local world models and spatial awareness do emerge within these models. Notably, prevailing linguistic reasoning techniques (e.g., chain-of-thought, self-consistency, tree-of-thoughts) fail to improve performance, whereas explicitly generating cognitive maps during question-answering enhances MLLMs' spatial distance awareness.
Poster
Jungin Park · Jiyoung Lee · Kwanghoon Sohn

[ ExHall D ]

Abstract
View-invariant representation learning from egocentric (first-person, ego) and exocentric (third-person, exo) videos is a promising approach toward generalizing video understanding systems across multiple perspectives. However, this area has been underexplored due to the substantial differences in perspective, motion patterns, and context between ego and exo views. In this paper, we propose a novel fine-grained view-invariant video representation learning from unpaired ego-exo videos, called Bootstrap Your Own Videos (BYOV). We highlight the importance of capturing the compositional nature of human actions as a basis for robust cross-view understanding. To this end, we introduce a masked ego-exo modeling that promotes both causal temporal dynamics and cross-view alignment. Specifically, self-causal masking and cross-view masking predictions are learned concurrently to facilitate view-invariant and powerful representations across viewpoints. Experimental results demonstrate that our BYOV significantly surpasses existing approaches with notable gains across all metrics in four downstream ego-exo video tasks. The code is available at https://anonymous.4open.science/r/byov-D967.
Poster
Bozheng Li · Yongliang Wu · YI LU · Jiashuo Yu · Licheng Tang · Jiawang Cao · Wenqing Zhu · Yuyang Sun · Jay Wu · Wenbo Zhu

[ ExHall D ]

Abstract
Widely shared videos on the internet are often edited. Although Video Large Language Models (Vid-LLMs) have recently made great progress in general video understanding tasks, their capabilities in video editing understanding (VEU) tasks remain unexplored. To address this gap, in this paper, we introduce VEU-Bench (Video Editing Understanding Benchmark), a comprehensive benchmark that categorizes video editing components across various dimensions, from intra-frame features like shot size to inter-shot attributes such as cut types and transitions. Unlike previous video editing understanding benchmarks that focus mainly on editing element classification, VEU-Bench encompasses 19 fine-grained tasks across three stages: recognition, reasoning, and judging. To enhance the annotation of VEU automatically, we built an annotation pipeline integrated with an ontology-based knowledge base. Through extensive experiments with 11 state-of-the-art Vid-LLMs, our findings reveal that current Vid-LLMs face significant challenges in VEU tasks, with some performing worse than random choice. To alleviate this issue, we develop Oscars (named after the Academy Awards), a VEU expert model fine-tuned on the curated VEU-Bench dataset. It outperforms existing open-source Vid-LLMs on VEU-Bench by over 28.3% in accuracy and achieves performance comparable to commercial models like GPT-4o. We also demonstrate that incorporating VEU data significantly enhances the performance of Vid-LLMs on …
Poster
Hongyeob Kim · Inyoung Jung · Dayoon Suh · Youjia Zhang · Sangmin Lee · Sungeun Hong

[ ExHall D ]

Abstract
Audio-Visual Question Answering (AVQA) requires not only question-based multimodal reasoning but also precise temporal grounding to capture subtle dynamics for accurate prediction. However, existing methods mainly use question information implicitly, limiting focus on question-specific details. Furthermore, most studies rely on uniform frame sampling, which can miss key question-relevant frames. Although recent Top-K frame selection methods aim to address this, their discrete nature still overlooks fine-grained temporal details. This paper proposes QA-TIGER, a novel framework that explicitly incorporates question information and models continuous temporal dynamics. Our key idea is to use Gaussian-based modeling to adaptively focus on both consecutive and non-consecutive frames based on the question, while explicitly injecting question information and applying progressive refinement. We leverage a Mixture of Experts (MoE) to flexibly implement multiple Gaussian models, activating temporal experts specifically tailored to the question. Extensive experiments on multiple AVQA benchmarks show that QA-TIGER consistently achieves state-of-the-art performance.
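A rough sketch of the Gaussian-based temporal weighting with a Mixture of Experts described above: each expert predicts a Gaussian (center, width) over the T frames from the question embedding, and a router mixes the resulting temporal weights. Layer names, sizes, and the pooling at the end are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GaussianTemporalExperts(nn.Module):
    """Question-conditioned Gaussian weighting over T frames with a Mixture of Experts."""
    def __init__(self, q_dim=512, num_experts=4):
        super().__init__()
        self.experts = nn.Linear(q_dim, num_experts * 2)  # (mu, log_sigma) per expert
        self.router = nn.Linear(q_dim, num_experts)       # mixing weights

    def forward(self, frame_feats, q_emb):
        # frame_feats: (B, T, D), q_emb: (B, q_dim)
        B, T, _ = frame_feats.shape
        params = self.experts(q_emb).view(B, -1, 2)           # (B, E, 2)
        mu = torch.sigmoid(params[..., 0])                    # centers in [0, 1]
        sigma = torch.exp(params[..., 1]).clamp(min=1e-3)     # widths
        gate = torch.softmax(self.router(q_emb), dim=-1)      # (B, E)

        t = torch.linspace(0, 1, T, device=frame_feats.device)  # normalized time axis
        gauss = torch.exp(-0.5 * ((t[None, None, :] - mu[..., None]) / sigma[..., None]) ** 2)
        weights = (gate[..., None] * gauss).sum(dim=1)           # (B, T)
        weights = weights / (weights.sum(dim=-1, keepdim=True) + 1e-8)
        return (weights[..., None] * frame_feats).sum(dim=1)     # (B, D) pooled feature
```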
Poster
Junjie Zhou · Yan Shu · Bo Zhao · Boya Wu · Zhengyang Liang · Shitao Xiao · Minghao Qin · Xi Yang · yongping xiong · Bo Zhang · Tiejun Huang · Zheng Liu

[ ExHall D ]

Abstract
The evaluation of Long Video Understanding (LVU) performance poses an important but challenging research problem. Despite previous efforts, the existing video understanding benchmarks are severely constrained by several issues, especially the insufficient lengths of videos, a lack of diversity in video types and evaluation tasks, and the inappropriateness for evaluating LVU performances. To address the above problems, we propose a new benchmark called MLVU (Multi-task Long Video Understanding Benchmark) for the comprehensive and in-depth evaluation of LVU. MLVU presents the following critical values: 1) The substantial and flexible extension of video lengths, which enables the benchmark to evaluate LVU performance across a wide range of durations. 2) The inclusion of various video genres, e.g., movies, surveillance footage, egocentric videos, cartoons, game videos, etc., which reflects the models' LVU performances in different scenarios. 3) The development of diversified evaluation tasks, which enables a comprehensive examination of MLLMs' key abilities in long-video understanding. The empirical study with 23 latest MLLMs reveals significant room for improvement in today's technique, as all existing methods struggle with most of the evaluation tasks and exhibit severe performance degradation when handling longer videos. Additionally, it suggests that factors such as context length, image-understanding ability, and the choice …
Poster
Kai Hu · Feng Gao · Xiaohan Nie · Peng Zhou · Son Dinh Tran · Tal Neiman · Lingyun Wang · Mubarak Shah · Raffay Hamid · Bing Yin · Trishul Chilimbi

[ ExHall D ]

Abstract
Recent advances in \acf{mllms} show promising results in video reasoning. Popular \ac{mllm} frameworks usually apply naive uniform sampling to reduce the number of video frames that are fed into an \ac{mllm}, particularly for long context videos. However, it could lose crucial context in certain periods of a video, so that the downstream \ac{mllm} may not have sufficient visual information to answer a question. To attack this pain point, we propose a light-weight \ac{mllm}-based frame selection method that adaptively select frames that are more relevant to users' queries. The selected frames are then digested by a frozen downstream \acf{videollm} for visual reasoning and question answering. In order to train the proposed frame selector, we introduce two supervision signals (i) Spatial signal, where single frame importance score by prompting a \ac{mllm}; (ii) Temporal signal, in which multiple frames selection by prompting \ac{llm} using the captions of all frame candidates. Empirical results show that the proposed \ac{mllm} video frame selector improves the performances various downstream \ac{videollm} across medium (ActivityNet, NExT-QA) and long (EgoSchema, LongVideoBench) context video question answering benchmarks.
Poster
Minjoon Jung · Junbin Xiao · Byoung-Tak Zhang · Angela Yao

[ ExHall D ]

Abstract
Video large language models (Video-LLMs) can temporally ground language queries and retrieve video moments. Yet, such temporal comprehension capabilities are neither well-studied nor understood. So we conduct a study on prediction consistency -- a key indicator for robustness and trustworthiness of temporal grounding. After the model identifies an initial moment within the video content, we apply a series of probes to check if the model's responses align with this initial grounding as an indicator of reliable comprehension. Our results reveal that current Video-LLMs are sensitive to variations in video contents, language queries, and task settings, unveiling severe deficiencies in maintaining consistency. We further explore common prompting and instruction-tuning methods as potential solutions, but find that their improvements are often unstable. To that end, we propose event temporal verification tuning that explicitly accounts for consistency, and demonstrate significant improvements for both grounding and consistency. Our data and code will be publicly released.
Poster
Chaoyu Li · Eun Woo Im · Pooyan Fazli

[ ExHall D ]

Abstract
Multimodal large language models (MLLMs) have recently shown significant advancements in video understanding, excelling in content reasoning and instruction-following tasks. However, the problem of hallucination, where models generate inaccurate or misleading content, remains underexplored in the video domain. Building on the observation that the visual encoder of MLLMs often struggles to differentiate between video pairs that are visually distinct but semantically similar, we introduce VidHalluc, the largest benchmark designed to examine hallucinations in MLLMs for video understanding tasks. VidHalluc assesses hallucinations across three critical dimensions: (1) action, (2) temporal sequence, and (3) scene transition. VidHalluc consists of 5,002 videos, paired based on semantic similarity and visual differences, focusing on cases where hallucinations are most likely to occur. Through comprehensive testing, our experiments show that most MLLMs are vulnerable to hallucinations across these dimensions. Furthermore, we propose DINO-HEAL, a training-free method that reduces hallucinations by incorporating spatial saliency information from DINOv2 to reweight visual features during inference. Our results demonstrate that DINO-HEAL consistently improves performance on VidHalluc, achieving an average improvement of 3.02% in mitigating hallucinations among all tasks. Both the VidHalluc benchmark and DINO-HEAL code will be publicly released.
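The DINO-HEAL idea above (training-free reweighting of visual features by DINOv2 spatial saliency at inference) can be sketched roughly as follows; here the saliency map is assumed to be given, e.g. derived from DINOv2 attention over the same patch grid, and the normalization and interpolation choices are illustrative rather than the paper's.

```python
import torch

def saliency_reweight(patch_feats, saliency, alpha=1.0):
    """Reweight patch features by a spatial saliency map at inference time.

    patch_feats: (B, N, D) visual tokens from the MLLM's vision encoder.
    saliency:    (B, N) non-negative saliency scores, e.g. from DINOv2 attention
                 over the same patch grid (how to obtain them is model-specific).
    alpha:       interpolation strength between original and reweighted features.
    """
    w = saliency / (saliency.sum(dim=-1, keepdim=True) + 1e-8)   # normalize per image
    w = w * saliency.shape[-1]                                   # keep mean weight ~1
    reweighted = patch_feats * w.unsqueeze(-1)
    return (1 - alpha) * patch_feats + alpha * reweighted
```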
Poster
Anxhelo Diko · Tinghuai Wang · Wassim Swaileh · Shiyan Sun · Ioannis Patras

[ ExHall D ]

Abstract
Vision-Language Models (VLMs) are crucial for real-world applications that require understanding textual and visual information. However, existing VLMs face multiple challenges in processing long videos, including computational inefficiency, memory limitations, and difficulties maintaining coherent understanding across extended sequences. These issues stem partly from the quadratic scaling of self-attention w.r.t. number of tokens but also encompass broader challenges in temporal reasoning and information integration over long sequences. To address these challenges, we introduce ReWind, a novel two-stage framework for long video understanding. In the first stage, ReWind maintains a dynamic memory that stores and updates instruction-relevant visual information as the video unfolds. Memory updates leverage novel read and write mechanisms utilizing learnable queries and cross-attentions between memory contents and the input stream. This approach maintains low memory requirements as the cross-attention layers scale linearly w.r.t. number of tokens. In the second stage, the memory content guides the selection of a few relevant frames, represented at high spatial resolution, which are combined with the memory contents and fed into an LLM to generate the final answer. We empirically demonstrate ReWind's superiority in visual question answering (VQA) and temporal grounding tasks, surpassing previous methods on long video benchmarks. Notably, ReWind achieves a +13% score …
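A compact sketch of the read/write memory mechanism described above: a fixed set of learnable slots is updated by cross-attending to incoming frame tokens (write), and later probed by instruction tokens (read), keeping cost linear in the number of input tokens. This is a generic approximation of the design, with made-up sizes, not the authors' code.

```python
import torch
import torch.nn as nn

class StreamingMemory(nn.Module):
    """Fixed-size memory updated by cross-attention; cost is linear in input tokens."""
    def __init__(self, dim=768, num_slots=64, num_heads=8):
        super().__init__()
        self.slots = nn.Parameter(torch.randn(num_slots, dim) * 0.02)
        self.write_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.read_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def init_memory(self, batch_size):
        return self.slots.unsqueeze(0).expand(batch_size, -1, -1).contiguous()

    def write(self, memory, frame_tokens):
        # memory: (B, S, D) current memory, frame_tokens: (B, N, D) new visual tokens
        update, _ = self.write_attn(query=memory, key=frame_tokens, value=frame_tokens)
        return memory + update                     # residual memory update

    def read(self, memory, queries):
        # queries: (B, Q, D), e.g. instruction/text tokens probing the memory
        out, _ = self.read_attn(query=queries, key=memory, value=memory)
        return out
```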
Poster
Kyungho Bae · Jinhyung Kim · Sihaeng Lee · Soonyoung Lee · Gunhee Lee · Jinwoo Choi

[ ExHall D ]

Abstract
In this work, we tackle action-scene hallucination in Video Large Language Models (Video-LLMs), where models incorrectly predict actions based on the scene context or scenes based on observed actions. We observe that existing Video-LLMs often suffer from action-scene hallucination due to two main factors. First, existing Video-LLMs intermingle spatial and temporal features by applying an attention operation across all tokens. Second, they use the standard Rotary Position Embedding (RoPE), which causes the text tokens to overemphasize certain types of tokens depending on their sequential orders. To address these issues, we introduce MASH-VLM, Mitigating Action-Scene Hallucination in Video-LLMs through disentangled spatial-temporal representations. Our approach includes two key innovations: (1) DST-attention, a novel attention mechanism that disentangles the spatial and temporal tokens within the LLM by using masked attention to restrict direct interactions between the spatial and temporal tokens; (2) Harmonic-RoPE, which extends the dimensionality of the positional IDs, allowing the spatial and temporal tokens to maintain balanced positions relative to the text tokens. To evaluate the action-scene hallucination in Video-LLMs, we introduce the UNSCENE benchmark with 1,320 videos and 4,078 QA pairs. Extensive experiments demonstrate that MASH-VLM achieves state-of-the-art results on the UNSCENE benchmark, as well as on existing video understanding …
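The DST-attention idea of restricting direct interactions between spatial and temporal tokens can be illustrated by building an attention mask over the concatenated token sequence; the exact token layout below ([spatial | temporal | text]) is an assumption made for the sketch, not the paper's layout.

```python
import torch

def dst_attention_mask(num_spatial, num_temporal, num_text):
    """Boolean mask (True = blocked) for a [spatial | temporal | text] token sequence.

    Spatial and temporal tokens may not attend to each other directly;
    text tokens attend to everything, and every token may attend to text.
    """
    n = num_spatial + num_temporal + num_text
    mask = torch.zeros(n, n, dtype=torch.bool)
    s = slice(0, num_spatial)
    t = slice(num_spatial, num_spatial + num_temporal)
    mask[s, t] = True   # spatial tokens cannot attend to temporal tokens
    mask[t, s] = True   # temporal tokens cannot attend to spatial tokens
    return mask         # pass as attn_mask to e.g. nn.MultiheadAttention
```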
Poster
Yongliang Wu · Xinting Hu · Yuyang Sun · Yizhou Zhou · Wenbo Zhu · Fengyun Rao · Bernt Schiele · Xu Yang

[ ExHall D ]

Abstract
Video Large Language Models (Vid-LLMs) have made remarkable advancements in comprehending video content for QA dialogue. However, they struggle to extend this visual understanding to tasks requiring precise temporal localization, known as Video Temporal Grounding (VTG). To address this gap, we introduce Number-Prompt (NumPro), a novel method that empowers Vid-LLMs to bridge visual comprehension with temporal grounding by adding unique numerical identifiers to each video frame. Treating a video as a sequence of numbered frame images, NumPro transforms VTG into an intuitive process: flipping through manga panels in sequence. This allows Vid-LLMs to "read" event timelines, accurately linking visual content with corresponding temporal information. Our experiments demonstrate that NumPro significantly boosts VTG performance of top-tier Vid-LLMs without additional computational cost. Furthermore, fine-tuning on a NumPro-enhanced dataset defines a new state-of-the-art for VTG, surpassing previous top-performing methods by up to 6.9% in mIoU for moment retrieval and 8.5% in mAP for highlight detection. The code will be made publicly available.
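The NumPro augmentation is simple to reproduce in spirit: stamp each sampled frame with its index before passing frames to the Vid-LLM. A minimal sketch using Pillow; the font, position, and colors are arbitrary choices here, not the paper's settings.

```python
from PIL import Image, ImageDraw

def add_frame_numbers(frames):
    """Overlay a numerical identifier on each frame.

    frames: list of PIL.Image objects (sampled video frames).
    Returns new images with the frame index drawn in a corner.
    """
    numbered = []
    for idx, frame in enumerate(frames):
        frame = frame.copy()
        draw = ImageDraw.Draw(frame)
        # draw the index with a simple one-pixel shadow for visibility
        x, y = frame.width - 60, frame.height - 40
        draw.text((x + 1, y + 1), str(idx), fill="black")
        draw.text((x, y), str(idx), fill="red")
        numbered.append(frame)
    return numbered
```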
Poster
Andong Deng · Zhongpai Gao · Anwesa Choudhuri · Benjamin Planche · Meng Zheng · Bin Wang · Terrence Chen · Chen Chen · Ziyan Wu

[ ExHall D ]

Abstract
Temporal awareness is essential for video large language models (LLMs) to understand and reason about events within long videos, enabling applications like dense video captioning and temporal video grounding in a unified system. However, the scarcity of long videos with detailed captions and precise temporal annotations limits their temporal awareness. In this paper, we propose Seq2Time, a data-oriented training paradigm that leverages sequences of images and short video clips to enhance temporal awareness in long videos. By converting sequence positions into temporal annotations, we transform large-scale image and clip captioning datasets into sequences that mimic the temporal structure of long videos, enabling self-supervised training with abundant time-sensitive data. To enable sequence-to-time knowledge transfer, we introduce a novel time representation that unifies positional information across image sequences, clip sequences, and long videos. Experiments demonstrate the effectiveness of our method, achieving a 27.6% improvement in F1 score and 44.8% in CIDEr on the YouCook2 benchmark and a 14.7% increase in recall on the Charades-STA benchmark compared to the baseline.
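The core data transformation described above, converting a position in an image or clip sequence into a pseudo timestamp so that captioning data mimics a timestamped long video, can be sketched like this; the total duration and the textual time format are assumptions for illustration.

```python
def sequence_to_timestamps(num_items, pseudo_duration=120.0):
    """Map each item index in a sequence to a pseudo time interval.

    num_items: number of images or short clips concatenated into a pseudo long video.
    pseudo_duration: total duration (seconds) assigned to the synthetic video.
    Returns (start, end) times per item, which can be rendered as text annotations.
    """
    step = pseudo_duration / num_items
    return [(round(i * step, 1), round((i + 1) * step, 1)) for i in range(num_items)]

# e.g. caption k of an 8-image sequence becomes
# "From {start}s to {end}s: <caption_k>" in the training text.
```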
Poster
Zichen Liu · Kunlun Xu · Bing Su · Xu Zou · Yuxin Peng · Jiahuan Zhou

[ ExHall D ]

Abstract
Pre-trained on tremendous image-text pairs, vision-language models like CLIP have demonstrated promising zero-shot generalization across numerous image-based tasks. However, extending these capabilities to video tasks remains challenging due to limited labeled video data and high training costs. Recent video prompting methods attempt to adapt CLIP for video tasks by introducing learnable prompts, but they typically rely on a single static prompt for all video sequences, overlooking the diverse temporal dynamics and spatial variations that exist across frames. This limitation significantly hinders the model’s ability to capture essential temporal information for effective video understanding. To address this, we propose an integrated Spatial-TempOral dynamic Prompting (STOP) model which consists of two complementary modules, the intra-frame spatial prompting and inter-frame temporal prompting. Our intra-frame spatial prompts are designed to adaptively highlight discriminative regions within each frame by leveraging intra-frame attention and temporal variation, allowing the model to focus on areas with substantial temporal dynamics and capture fine-grained spatial details. Additionally, to highlight the varying importance of frames for video understanding, we further introduce inter-frame temporal prompts, dynamically inserting prompts between frames with high temporal variance as measured by frame similarity. This enables the model to prioritize key frames and enhances its capacity to …
Poster
Enrico Pallotta · Sina Mokhtarzadeh Azar · Shuai Li · Olga Zatsarynna · Jürgen Gall

[ ExHall D ]

Abstract
Predicting future video frames is essential for decision-making systems, yet RGB frames alone often lack the information needed to fully capture the underlying complexities of the real world. To address this limitation, we propose a multi-modal framework for Synchronous Video Prediction (SyncVP) that incorporates complementary data modalities, enhancing the richness and accuracy of future predictions. SyncVP builds on pre-trained modality-specific diffusion models and introduces an efficient spatio-temporal cross-attention module to enable effective information sharing across modalities. We evaluate SyncVP against other video prediction methods on standard benchmark datasets, such as Cityscapes and BAIR, using depth as an additional modality, and demonstrate modality-agnostic generalization on SYNTHIA with semantic segmentation. Notably, SyncVP achieves state-of-the-art performance, even in scenarios where depth conditioning is absent, demonstrating its robustness and potential for a wide range of applications.
Poster
Hao Du · Bo Wu · Yan Lu · Zhendong Mao

[ ExHall D ]

Abstract
Vision-language temporal alignment is a crucial capability for human recognition and cognition in real-world scenarios. Although existing works have designed methods to capture vision-language correlations, they are limited by benchmark issues, including biased temporal distributions, imprecise annotations, and inadequate compositionality. To achieve fair evaluation and comprehensive exploration, our objective is to investigate and evaluate the ability of models to achieve alignment from a temporal perspective, specifically focusing on their capacity to synchronize visual scenarios with linguistic context in a temporally coherent manner. As a preliminary, we first present the statistical analysis of existing benchmarks and reveal the existing challenges from a decomposed perspective. To this end, we introduce SVLTA, a synthetic, large-scale, and compositional benchmark for vision-language temporal alignment derived via a well-designed and feasible control generation method within a simulation environment. The approach considers commonsense knowledge, process permutation, and constrained filtering, which generates reasonable, diverse, and balanced data distributions for diagnostic evaluations. Our experiments reveal diagnostic insights through the evaluations in temporal question answering, distributional shift sensitiveness, and temporal alignment adaptation.
Poster
Jirui Tian · Jinrong Zhang · Shenglan Liu · Luhao Xu · Zhixiong Huang · Gao Huang

[ ExHall D ]

Abstract
Existing multimodal large language models (MLLMs) face significant challenges in Referring Video Object Segmentation (RVOS). We identify three critical challenges: (C1) insufficient quantitative representation of textual numerical data, (C2) repetitive and degraded response templates for spatiotemporal referencing, and (C3) loss of visual information in video sampling queries lacking textual guidance. To address these, we propose a novel framework, **Dynamic Time Object Sensing (DTOS)**, specifically designed for RVOS. To tackle (C1) and (C2), we introduce specialized tokens to construct multi-answer response templates, enabling regression of event boundaries and target localization. This approach improves the accuracy of numerical regression while mitigating the issue of repetitive degradation. To address (C3), we propose a Text-guided Clip Sampler (TCS) that selects video clips aligned with user instructions, preventing visual information loss and ensuring consistent temporal resolution. TCS is also applicable to Moment Retrieval tasks, with enhanced multimodal input sequences preserving spatial details and maximizing temporal resolution. DTOS demonstrates exceptional capability in flexibly localizing multiple spatiotemporal targets based on user-provided textual instructions. Extensive experiments validate the effectiveness of our approach, with DTOS achieving state-of-the-art performance in J&F scores: an improvement of +4.36 on MeViS, +4.48 on Ref-DAVIS17, and +3.02 on Ref-YT-VOS. Additionally, our TCS demonstrates exceptional performance …
Poster
Hao Fang · Runmin Cong · Xiankai Lu · Xiaofei Zhou · Sam Kwong · Wei Zhang

[ ExHall D ]

Abstract
Motion expression video segmentation aims to segment objects based on input motion descriptions. Compared with traditional referring video object segmentation, it focuses on motion and multi-object expressions and is more challenging. Previous works achieved it by simply injecting text information into the video instance segmentation (VIS) model. However, this requires retraining the entire model and optimization is difficult. In this work, we propose DMVS, a simple structure built on top of an off-the-shelf query-based VIS model, emphasizing decoupling the task into video instance segmentation and motion expression understanding. Firstly, we use a video instance segmenter as a means of distilling object-specific contexts into frame-level and video-level queries. Secondly, we interact two levels of queries with static and motion cues, respectively, to further encode visually enhanced motion expressions. Furthermore, we propose a novel query initialization strategy that uses video queries guided by classification priors to initialize motion queries, greatly reducing the difficulty of optimization. Without bells and whistles, DMVS achieves the state-of-the-art on the challenging MeViS dataset at a lower training cost. Extensive experiments verify the effectiveness and efficiency of our framework. The code will be publicly released.
Poster
Chong Zhou · Chenchen Zhu · Yunyang Xiong · Saksham Suri · Fanyi Xiao · Lemeng Wu · Raghuraman Krishnamoorthi · Bo Dai · Chen Change Loy · Vikas Chandra · Bilge Soran

[ ExHall D ]

Abstract
On top of Segment Anything Model (SAM), SAM 2 further extends its capability from image to video inputs through a memory bank mechanism and obtains a remarkable performance compared with previous methods, making it a foundation model for video segmentation task. In this paper, we aim at making SAM 2 much more efficient so that it even runs on mobile devices while maintaining a comparable performance. Despite several works optimizing SAM for better efficiency, we find they are not sufficient for SAM 2 because they all focus on compressing the image encoder, while our benchmark shows that the newly introduced memory attention blocks are also the latency bottleneck. Given this observation, we propose EdgeTAM, which leverages a novel 2D Spatial Perceiver to reduce the computational cost. In particular, the proposed 2D Spatial Perceiver encodes the densely stored frame-level memories with a lightweight Transformer that contains a fixed set of learnable queries. Given that video segmentation is a dense prediction task, we find preserving the spatial structure of the memories is essential so that the queries are split into global-level and patch-level groups. We also propose a distillation pipeline that further improves the performance without inference overhead. As a result, EdgeTAM …
Poster
Huaxin Zhang · Xiaohao Xu · Xiang Wang · Jialong Zuo · Xiaonan Huang · Changxin Gao · Shanjun Zhang · Li Yu · Nong Sang

[ ExHall D ]

Abstract
How can we enable models to comprehend video anomalies occurring over varying temporal scales and contexts? Traditional Video Anomaly Understanding (VAU) methods focus on frame-level anomaly prediction, often missing the interpretability of complex and diverse real-world anomalies. Recent multimodal approaches leverage visual and textual data but lack hierarchical annotations that capture both short-term and long-term anomalies. To address this challenge, we introduce HIVAU-70k, a large-scale benchmark for hierarchical video anomaly understanding across any granularity. We develop a semi-automated annotation engine that efficiently scales high-quality annotations by combining manual video segmentation with recursive free-text annotation using large language models (LLMs). This results in over 70,000 multi-granular annotations organized at clip-level, event-level, and video-level segments. For efficient anomaly detection in long videos, we propose the Anomaly-focused Temporal Sampler (ATS). ATS integrates an anomaly scorer with a density-aware sampler to adaptively select frames based on anomaly scores, ensuring that the multimodal LLM concentrates on anomaly-rich regions, which significantly enhances both efficiency and accuracy. Extensive experiments demonstrate that our hierarchical instruction data markedly improves anomaly comprehension. The integrated ATS and visual-language model outperform traditional methods in processing long videos. Our benchmark and model will be publicly available.
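A rough sketch of anomaly-guided frame sampling in the spirit of ATS: given per-frame anomaly scores, frames are drawn with probability proportional to temperature-smoothed scores so that anomaly-rich regions receive more of the MLLM's frame budget. The smoothing, budget, and sampling scheme are illustrative assumptions, not the paper's sampler.

```python
import numpy as np

def sample_anomaly_frames(scores, budget=32, temperature=0.5, seed=0):
    """Select frame indices with probability proportional to anomaly scores.

    scores: 1D array of per-frame anomaly scores from an anomaly scorer.
    budget: number of frames the multimodal LLM can ingest.
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=np.float64)
    probs = np.exp(scores / temperature)
    probs /= probs.sum()
    k = min(budget, len(scores))
    idx = rng.choice(len(scores), size=k, replace=False, p=probs)
    return np.sort(idx)   # keep temporal order for the LLM
```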
Poster
Valentin Gabeff · Haozhe Qi · Brendan Flaherty · Gencer Sumbul · Alexander Mathis · Devis Tuia

[ ExHall D ]

Abstract
Monitoring wildlife is essential for ecology and especially in light of the increasing human impact on ecosystems. Camera traps have emerged as habitat-centric sensors enabling the study of wildlife-environment interactions at scale with minimal disturbance. While computer vision models are becoming more powerful for general video understanding tasks, they struggle comparatively with camera trap videos. This gap in terms of performance and applicability can be partly attributed to the lack of annotated video datasets. To advance research in wild animal behavior monitoring we present MammAlps, a multimodal and multi-view dataset of wildlife behavior monitoring from 9 camera-traps in the Swiss National Park. MammAlps contains over 14 hours of video with audio, 2D segmentation maps and 8.5 hours of individual tracks densely labeled for species and behavior. Behaviors were annotated at two levels of complexity: actions representing simple behaviors and high-level activities. Based on 6,135 single animal clips, we propose the first hierarchical and multimodal animal behavior recognition benchmark using audio, video and reference scene segmentation maps as inputs. To enable future ecology research, we also propose a second benchmark aiming at identifying activities, species, number of individuals and meteorological conditions from 397 multi-view and long-term ecological events, including false positive …
Poster
Mengnan Liu · Le Wang · Sanping Zhou · Kun Xia · Xiaolong Sun · Gang Hua

[ ExHall D ]

Abstract
Point-supervised Temporal Action Localization poses significant challenges due to the difficulty of identifying complete actions with a single-point annotation per action. Existing methods typically employ Multiple Instance Learning, which struggles to capture global temporal context and requires heuristic post-processing. In research on fully-supervised tasks, DETR-based structures have effectively addressed these limitations. However, it is nontrivial to merely adapt DETR to this task, encountering two major bottlenecks: (1) how to integrate point label information into the model and (2) how to select optimal decoder proposals for training in the absence of complete action segment annotations. To address these issues, we introduce an end-to-end framework by integrating Query Reformation and Optimal Transport (QROT). Specifically, we encode point labels through a set of semantic consensus queries, enabling effective focus on action-relevant snippets. Furthermore, we integrate an optimal transport mechanism to generate high-quality pseudo labels. These pseudo-labels facilitate precise proposal selection based on the Hungarian algorithm, significantly enhancing localization accuracy in point-supervised settings. Extensive experiments on the THUMOS14 and ActivityNet-v1.3 datasets demonstrate that our method outperforms existing MIL-based approaches, offering more stable and accurate temporal action localization in point-level supervision. The code will be publicly available.
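The proposal-selection step described above, matching decoder proposals to pseudo-labels with the Hungarian algorithm, can be sketched with SciPy; the cost combining localization and classification terms is a generic placeholder rather than the paper's exact formulation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_proposals(proposals, pseudo_labels, cls_scores, loc_weight=1.0, cls_weight=1.0):
    """Assign decoder proposals to pseudo action segments via Hungarian matching.

    proposals:     (P, 2) predicted (start, end) segments.
    pseudo_labels: (G, 2) pseudo ground-truth segments (e.g. from optimal transport).
    cls_scores:    (P,) class confidence of each proposal for the target action.
    Returns index pairs (proposal_idx, pseudo_idx) of the optimal assignment.
    """
    loc_cost = np.abs(proposals[:, None, :] - pseudo_labels[None, :, :]).sum(-1)  # (P, G) L1 distance
    cls_cost = -cls_scores[:, None]                                               # favor confident proposals
    cost = loc_weight * loc_cost + cls_weight * cls_cost
    row, col = linear_sum_assignment(cost)
    return list(zip(row.tolist(), col.tolist()))
```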
Poster
Anqi Zhu · Jingmin Zhu · James Bailey · Mingming Gong · Qiuhong Ke

[ ExHall D ]

Abstract
Skeleton-based human action recognition has emerged as a promising approach due to its privacy preservation, robustness to visual challenges, and computational efficiency. While current methods predominantly rely on fully supervised learning, the practical necessity to recognize unseen actions has led to increased interest in zero-shot skeleton-based action recognition (ZSSAR). Existing ZSSAR approaches often rely on manually crafted action descriptions and movement assumptions, limiting their flexibility across diverse action classes. To overcome this, we introduce Semantic-guided Cross-Model Prompt Learning (SCoPLe), a novel framework that replaces manual guidance with data-driven prompt learning for skeletal and textual knowledge refinement and alignment. Specifically, we introduce a dual-stream language prompting module that selectively preserves the original semantic context, effectively enhancing the prompting features. We also introduce a semantic-guided adaptive skeleton prompting module that learns joint-level prompts for skeleton features and incorporates an adaptive visual representation sampler that leverages text semantics to strengthen the cross-modal prompting interactions during skeleton-to-text embedding projection. Experimental results on the NTU-RGB+D 60, NTU-RGB+D 120, and PKU-MMD datasets demonstrate the state-of-the-art performance of our method in both ZSSAR and Generalized ZSSAR scenarios.
Poster
Hongkai Wei · YANG YANG · Shijie Sun · Mingtao Feng · Xiangyu Song · Qi Lei · Hongli Hu · Rong Wang · Huansheng Song · Naveed Akhtar · Ajmal Mian

[ ExHall D ]

Abstract
Visual-Language Tracking (VLT) is emerging as a promising paradigm to bridge the human-machine performance gap. For single objects, VLT broadens the problem scope to text-driven video comprehension. Yet, this direction is still confined to 2D spatial extents, currently lacking the ability to deal with 3D tracking in the confines of monocular video. Unfortunately, advances in 3D tracking mainly rely on expensive sensor inputs, e.g., point clouds, depth measurements, radar. The absence of a language counterpart for the outputs of these mildly democratized sensors in the literature also hinders VLT expansion to 3D tracking. Addressing that, we make the first attempt towards extending VLT to 3D tracking based on monocular video. We present a comprehensive framework, introducing (i) the Monocular-Video-based 3D Visual Language Tracking (Mono3DVLT) task, (ii) a large-scale dataset for the task, called Mono3DVLT-V2X, and (iii) a customized neural model for the task. Our dataset is carefully curated, leveraging a Large Language Model (LLM) followed by human verification, composing natural language descriptions for 79,158 video sequences aiming at single object tracking, providing 2D and 3D bounding box annotations. Our neural model, termed Mono3DVLT-MT, is the first targeted approach for the Mono3DVLT task. Comprising the pipeline of multi-modal feature extractor, visual-language encoder, tracking …
Poster
Manfred Georg · Garrett Tanzer · Esha Uboweja · Saad Hassan · Maximus Shengelia · Sam Sepah · Sean Forbes · Thad Starner

[ ExHall D ]

Abstract
Progress in machine understanding of sign languages has been slow and hampered by limited data. In this paper, we present FSboard, an American Sign Language fingerspelling dataset situated in a mobile text entry use case, collected from 147 paid and consenting Deaf signers using Pixel 4A selfie cameras in a variety of environments. Fingerspelling recognition is an incomplete solution that is only one small part of sign language translation, but it could provide some immediate benefit to Deaf/Hard of Hearing signers while more broadly capable technology develops. At >3 million characters in length and >250 hours in duration, FSboard is the largest fingerspelling recognition dataset to date by a factor of >10x. As a simple baseline, we finetune 30 Hz MediaPipe Holistic landmark inputs into ByT5-Small and achieve 11.1% Character Error Rate (CER) on a test set with unique phrases and signers. This quality degrades gracefully when decreasing frame rate and excluding face/body landmarks---plausible optimizations to help with on-device performance.
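For reference, the Character Error Rate quoted above is the edit distance between predicted and reference character sequences divided by the reference length; a minimal implementation of the metric itself:

```python
def character_error_rate(reference, hypothesis):
    """CER = Levenshtein distance / number of reference characters."""
    r, h = list(reference), list(hypothesis)
    # dynamic-programming edit distance, one row at a time
    dist = list(range(len(h) + 1))
    for i, rc in enumerate(r, start=1):
        prev, dist[0] = dist[0], i
        for j, hc in enumerate(h, start=1):
            cur = min(dist[j] + 1,          # deletion
                      dist[j - 1] + 1,      # insertion
                      prev + (rc != hc))    # substitution (0 cost if chars match)
            prev, dist[j] = dist[j], cur
    return dist[len(h)] / max(len(r), 1)

# character_error_rate("HELLO", "HELO") -> 0.2
```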
Poster
Chanhui Lee · Yeonghwan Song · Jeany Son

[ ExHall D ]

Abstract
Data-free Universal Adversarial Perturbation (UAP) is an image-agnostic adversarial attack that deceives deep neural networks using a single perturbation generated solely from random noise, without any data priors. However, traditional data-free UAP methods often suffer from limited transferability due to the absence of semantic information in random noise. To address this, we propose a novel data-free universal attack approach that generates a pseudo-semantic prior recursively from the UAPs, enriching semantic contents within the data-free UAP framework. Our method is based on the observation that UAPs inherently contain latent semantic information, enabling the generated UAP to act as an alternative data prior, by capturing a diverse range of semantics through region sampling. We further introduce a sample reweighting technique to emphasize hard examples by focusing on samples that are less affected by the UAP. By leveraging the semantic information from the pseudo-semantic prior, we also incorporate input transformations, typically ineffective in data-free UAPs due to the lack of semantic content in random priors, to boost black-box transferability. Comprehensive experiments on ImageNet show that our method achieves state-of-the-art performance in average fooling rate by a substantial margin, significantly improves attack transferability across various CNN architectures compared to existing data-free UAP methods, and even surpasses data-dependent UAP methods. Extensive experiments …
Poster
Qian Wang · Chen Li · Yuchen Luo · Hefei Ling · Shijuan Huang · Ruoxi Jia · Ning Yu

[ ExHall D ]

Abstract
As a defense strategy against adversarial attacks, adversarial detection aims to identify and filter out adversarial data from the data flow based on discrepancies in distribution and noise patterns between natural and adversarial data. Although previous detection methods achieve high performance in detecting gradient-based adversarial attacks, new attacks based on generative models with imbalanced and anisotropic noise patterns evade detection. Even worse, the significant inference time overhead and limited performance against unseen attacks make existing techniques impractical for real-world use. In this paper, we explore the proximity relationship among adversarial noise distributions and demonstrate the existence of an open covering for these distributions. By training on the open covering of adversarial noise distributions, a detector with strong generalization performance against various types of unseen attacks can be developed. Based on this insight, we heuristically propose Perturbation Forgery, which includes noise distribution perturbation, sparse mask generation, and pseudo-adversarial data production, to train an adversarial detector capable of detecting any unseen gradient-based, generative-based, and physical adversarial attacks. Comprehensive experiments conducted on multiple general and facial datasets, with a wide spectrum of attacks, validate the strong generalization of our method.
Poster
Jikang Cheng · Zhiyuan Yan · Ying Zhang · Li Hao · Jiaxin Ai · Qin Zou · Chen Li · Zhongyuan Wang

[ ExHall D ]

Abstract
The rapid advancement of face forgery techniques has introduced a growing variety of forgeries. Incremental Face Forgery Detection (IFFD), involving gradually adding new forgery data to fine-tune the previously trained model, has been introduced as a promising strategy to deal with evolving forgery methods. However, a naively trained IFFD model is prone to catastrophic forgetting when new forgeries are integrated, as treating all forgeries as a single "Fake" class in the Real/Fake classification can cause different forgery types to override one another, thereby resulting in the forgetting of unique characteristics from earlier tasks and limiting the model's effectiveness in learning forgery specificity and generality. In this paper, we propose to stack the latent feature distributions of previous and new tasks brick by brick, i.e., achieving aligned feature isolation. In this manner, we aim to preserve learned forgery information and accumulate new knowledge by minimizing distribution overriding, thereby mitigating catastrophic forgetting. To achieve this, we first introduce Sparse Uniform Replay (SUR) to obtain the representative subsets that could be treated as the uniformly sparse versions of the previous global distributions. We then propose a Latent-space Incremental Detector (LID) that leverages SUR data to isolate and align distributions. For evaluation, we construct a more advanced and comprehensive benchmark tailored …
Poster
Minchul Kim · Dingqiang Ye · Yiyang Su · Feng Liu · Xiaoming Liu

[ ExHall D ]

Abstract
Existing human recognition systems often rely on separate, specialized models for face and body analysis, limiting their effectiveness in real-world scenarios where pose, visibility, and context vary widely. This paper introduces SapiensID, a unified model that bridges this gap, achieving robust performance across diverse settings. SapiensID introduces (i) Retina Patch (RP), a dynamic patch generation scheme that adapts to subject scale and ensures consistent tokenization of regions of interest; (ii) Semantic Attention Head (SAH), an attention mechanism that learns pose-invariant representations by pooling features around key body parts; and (iii) a masked recognition model (MRM) that learns from variable token length. To facilitate training, we introduce WebBody4M, a large-scale dataset capturing diverse poses and scale variations. Extensive experiments demonstrate that SapiensID achieves state-of-the-art results on various body ReID benchmarks, outperforming specialized models in both short-term and long-term scenarios while remaining competitive with dedicated face recognition systems. Furthermore, SapiensID establishes a strong baseline for the newly introduced challenge of Cross Pose-Scale ReID, demonstrating its ability to generalize to complex, real-world conditions. The dataset, code and models will be released.
Poster
Donghyun Lee · Yuhang Li · Youngeun Kim · Shiting Xiao · Priyadarshini Panda

[ ExHall D ]

Abstract
Spike-based Transformer presents a compelling and energy-efficient alternative to traditional Artificial Neural Network (ANN)-based Transformers, achieving impressive results through sparse binary computations. However, existing spike-based transformers predominantly focus on spatial attention while neglecting crucial temporal dependencies inherent in spike-based processing, leading to suboptimal feature representation and limited performance. To address this limitation, we propose Spiking Transformer with Spatial-Temporal Attention (STAtten), a simple and straightforward architecture that efficiently integrates both spatial and temporal information in the self-attention mechanism. STAtten introduces a block-wise computation strategy that processes information in spatial-temporal chunks, enabling comprehensive feature capture while maintaining the same computational complexity as previous spatial-only approaches. Our method can be seamlessly integrated into existing spike-based transformers without architectural overhaul. Extensive experiments demonstrate that STAtten significantly improves the performance of existing spike-based transformers across both static and neuromorphic datasets, including CIFAR10/100, ImageNet, CIFAR10-DVS, and N-Caltech101.
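The block-wise spatial-temporal attention described above can be sketched as reshaping the (timestep, token) axes into chunks and attending within each chunk, so that attention spans space and local time at a cost comparable to spatial-only attention. This is a dense (non-spiking) illustration of the chunking idea only; chunk size and dimensions are arbitrary.

```python
import torch
import torch.nn as nn

class BlockSpatioTemporalAttention(nn.Module):
    """Attend jointly over small spatial-temporal chunks of a (T, N) token grid."""
    def __init__(self, dim=256, num_heads=4, t_chunk=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.t_chunk = t_chunk

    def forward(self, x):
        # x: (B, T, N, D) -- T timesteps, N spatial tokens per timestep
        B, T, N, D = x.shape
        assert T % self.t_chunk == 0, "T must be divisible by the chunk size"
        # group t_chunk timesteps together so attention covers space and local time
        x = x.reshape(B, T // self.t_chunk, self.t_chunk * N, D)
        x = x.reshape(-1, self.t_chunk * N, D)       # (B * chunks, t_chunk * N, D)
        out, _ = self.attn(x, x, x)
        out = out.reshape(B, T // self.t_chunk, self.t_chunk, N, D)
        return out.reshape(B, T, N, D)
```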
Poster
Tianqing Zhang · Kairong Yu · Xian Zhong · Hongwei Wang · Qi Xu · Qiang Zhang

[ ExHall D ]

Abstract
Spiking Neural Networks (SNNs) have gained significant attention due to their biological plausibility and energy efficiency, making them promising alternatives to Artificial Neural Networks (ANNs). However, the performance gap between SNNs and ANNs remains a substantial challenge hindering the widespread adoption of SNNs. In this paper, we propose a Spatial-Temporal Attention Aggregator SNN (STAA-SNN) framework, which dynamically focuses on and captures both spatial and temporal dependencies. First, we introduce a spike-driven self-attention mechanism specifically designed for SNNs. Additionally, we are the first to incorporate position encoding to integrate latent temporal relationships into the incoming features. For spatial-temporal information aggregation, we employ step attention to selectively amplify relevant features at different steps. Finally, we implement a time-step random dropout strategy to avoid local optima. As a result, STAA-SNN effectively captures both spatial and temporal dependencies, enabling the model to analyze complex patterns and make accurate predictions. The framework demonstrates exceptional performance across diverse datasets and exhibits strong generalization capabilities. Notably, STAA-SNN achieves state-of-the-art results on the neuromorphic dataset CIFAR10-DVS, with remarkable performances of 97.14%, 82.05% and 70.40% on the static datasets CIFAR-10, CIFAR-100 and ImageNet, respectively. Furthermore, our model exhibits improved performance ranging from 0.33% to 2.80% with fewer time steps. The code for the …
Poster
Soikat Hasan Ahmed · Jan Finkbeiner · Emre Neftci

[ ExHall D ]

Abstract
Event cameras offer high temporal resolution and dynamic range with minimal motion blur, making them promising for robust object detection. While Spiking Neural Networks (SNNs) on neuromorphic hardware are often considered for energy-efficient and low-latency event-based data processing, they often fall short of Artificial Neural Networks (ANNs) in accuracy and flexibility. Here, we introduce Attention-based Hybrid SNN-ANN backbones for event-based object detection to leverage the strengths of both SNN and ANN architectures. A novel Attention-based SNN-ANN bridge module is proposed to capture sparse spatial and temporal relations from the SNN layer and convert them into dense feature maps for the ANN part of the backbone. Additionally, we present a variant that integrates DWConvLSTMs into the ANN blocks to capture slower dynamics. This multi-timescale network combines fast SNN processing for short timesteps with long-term dense RNN processing, effectively capturing both fast and slow dynamics. Experimental results demonstrate that our proposed method surpasses SNN-based approaches by significant margins, with results comparable to existing ANN- and RNN-based methods. Unlike ANN-only networks, the hybrid setup allows us to implement the SNN blocks on digital neuromorphic hardware to investigate the feasibility of our approach. Extensive ablation studies and implementation on neuromorphic hardware confirm the effectiveness of …
Poster
Xin Liang · Yogesh S. Rawat

[ ExHall D ]

Abstract
In this work, we focus on clothes-changing person re-identification (CC-ReID), which aims to recognize individuals under different clothing scenarios. Current CC-ReID approaches either concentrate on modeling body shape using additional modalities including silhouette, pose, and body mesh, potentially causing the model to overlook other critical biometric traits such as gender, age, and style, or they incorporate supervision through additional labels that the model tries to disregard or emphasize, such as clothing or personal attributes. However, these annotations are discrete in nature and do not capture comprehensive descriptions. In this work, we propose DIFFER: Disentangle Identity Features From Entangled Representations, a novel adversarial learning method that leverages textual descriptions to disentangle identity features. Recognizing that image features inherently mix inseparable information, DIFFER introduces NBDetach, a mechanism that utilizes the separable nature of text descriptions as disentanglement supervision to partition the feature space into distinct subspaces, enabling the effective separation of identity-related features from non-biometric features through gradient reversal. We evaluate DIFFER on 4 different benchmark datasets (LTCC, PRCC, CelebreID-Light, and CCVID) to demonstrate its effectiveness and provide state-of-the-art performance across all the benchmarks. DIFFER consistently outperforms the baseline method, with improvements in top-1 accuracy of 3.6% on LTCC, 3.4% on PRCC, 2.5% …
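For readers unfamiliar with the gradient reversal mechanism that the DIFFER abstract relies on, the following is a minimal, generic PyTorch sketch of a gradient reversal layer paired with an identity head and an auxiliary attribute head. It illustrates the general adversarial-disentanglement technique only; the feature dimension, head sizes, and the auxiliary attribute head are illustrative assumptions, not the paper's NBDetach module.

```python
# Generic gradient reversal layer (GRL) sketch; dimensions are assumptions.
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Identity on the forward pass, negated (scaled) gradient on the backward
        # pass, so the encoder is pushed to remove what the auxiliary head predicts.
        return -ctx.lambd * grad_output, None


def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)


class DisentangleHead(nn.Module):
    """Identity head trained normally; attribute head trained through a GRL."""

    def __init__(self, feat_dim=512, num_ids=1000, num_attrs=20):
        super().__init__()
        self.id_head = nn.Linear(feat_dim, num_ids)
        self.attr_head = nn.Linear(feat_dim, num_attrs)

    def forward(self, feats, lambd=1.0):
        id_logits = self.id_head(feats)                            # keep identity cues
        attr_logits = self.attr_head(grad_reverse(feats, lambd))   # suppress non-biometric cues
        return id_logits, attr_logits
```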
Poster
Kaiyu Li · Ruixun Liu · Xiangyong Cao · Xueru Bai · Feng Zhou · Deyu Meng · Wang Zhi

[ ExHall D ]

Abstract
Current remote sensing semantic segmentation methods are mostly built on the close-set assumption, meaning that the model can only recognize pre-defined categories that exist in the training set. However, in practical Earth observation, there are countless unseen categories, and manual annotation is impractical. To address this challenge, we first attempt to introduce training-free open-vocabulary semantic segmentation (OVSS) into the remote sensing context. However, because remote sensing images are sensitive to low-resolution features, the predicted masks exhibit distorted target shapes and ill-fitting boundaries. To tackle these issues, we propose a simple and universal upsampler, i.e., SimFeatUp, to restore lost spatial information of deep features. Specifically, SimFeatUp only needs to learn from a few unlabeled images, and can upsample arbitrary remote sensing image features. Furthermore, based on the observation of the abnormal response of patch tokens to the [CLS] token in CLIP, we propose to execute a simple subtraction operation to alleviate the global bias in patch tokens. Extensive experiments are conducted on 17 remote sensing datasets of 4 tasks, including semantic segmentation, building extraction, road detection, and flood detection. Our method achieves an average of 5.8%, 8.2%, 4.0%, and 15.3% improvement over state-of-the-art methods on the 4 …
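As a rough illustration of the kind of subtraction the abstract describes, the sketch below removes the component of each CLIP patch token that is aligned with the global [CLS] token. The exact operation used in the paper (whether a plain scaled subtraction or a projection, and the weight `alpha`) is not specified here, so this is only an assumption for intuition.

```python
# Illustrative global-bias removal from CLIP patch tokens; alpha is an assumption.
import torch


def debias_patch_tokens(patch_tokens: torch.Tensor,
                        cls_token: torch.Tensor,
                        alpha: float = 1.0) -> torch.Tensor:
    """patch_tokens: (N, D) per-image patch features; cls_token: (D,) global feature."""
    cls_dir = cls_token / cls_token.norm()
    # Project each patch token onto the global [CLS] direction and subtract it.
    proj = (patch_tokens @ cls_dir)[:, None] * cls_dir[None, :]
    return patch_tokens - alpha * proj
```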
Poster
Minsu Kim · Seungryong Kim · Kwanghoon Sohn

[ ExHall D ]

Abstract
Existing techniques for domain adaptive person search commonly utilize a unified framework for jointly localizing and identifying persons across domains. This framework, however, inevitably results in the gradient conflict problem, particularly in cross-domain scenarios with contradictory objectives, as the unified framework employs shared parameters to simultaneously address person detection and re-identification tasks across the domains. To overcome this, we present a novel mixture of submodules framework, dubbed MoS, that dynamically modulates the combination of submodules depending on the specific task to perform person detection and re-identification separately. We further design mixtures of submodules that vary depending on the domain, enabling domain-specific knowledge transfer. Specifically, we decompose the main model into several submodules and employ diverse mixtures of submodules that vary depending on the tasks and domains through a conditional routing policy. In addition, we also present counterpart domain sample generation, which synthesizes augmented samples and uses them to learn domain-invariant representations for person re-identification through contrastive domain alignment. We conduct experiments to demonstrate the effectiveness of our MoS over existing domain adaptive person search methods and provide ablation studies.
Poster
Xiaofei Hui · Haoxuan Qu · Hossein Rahmani · Jun Liu

[ ExHall D ]

Abstract
Human-object interaction (HOI) detection often faces high levels of ambiguity and indeterminacy, as the same interaction can appear vastly different across different human-object pairs. Additionally, the indeterminacy can be further exacerbated by issues such as occlusions and cluttered backgrounds. To handle such a challenging task, in this work, we begin with a key observation: the output of HOI detection for each human-object pair can be recast as an image. Thus, inspired by the strong image generation capabilities of image diffusion models, we propose a new framework, HOI-IDiff. In HOI-IDiff, we tackle HOI detection from a novel perspective, using an Image-like Diffusion process to generate HOI detection outputs as images. Furthermore, recognizing that our recast images differ in certain properties from natural images, we enhance our framework with a customized HOI diffusion process and a slice patchification model architecture, which are specifically tailored to generate our recast "HOI images". Extensive experiments demonstrate the efficacy of our framework.
Poster
Haoliang Meng · Xiaopeng Hong · Zhengqin Lai · Miao Shang

[ ExHall D ]

Abstract
This paper addresses multi-modal crowd counting with a novel 'free lunch' training enhancement strategy that requires no additional data, parameters, or increased inference complexity. First, we introduce a cross-modal alignment technique as a plug-in post-processing step for the pre-trained backbone network, enhancing the model’s ability to capture shared information across modalities. Second, we incorporate a regional density supervision mechanism during the fine-tuning stage, which differentiates features in regions with varying crowd densities. Extensive experiments on three multi-modal crowd counting datasets validate our approach, making it the first to achieve an MAE below 10 on RGBT-CC.
Poster
Ruibin Li · Tao Yang · Song Guo · Lei Zhang

[ ExHall D ]

Abstract
Despite the significant advancements, existing object removal methods struggle with incomplete removal, incorrect content synthesis and blurry synthesized regions, resulting in low success rates. Such issues are mainly caused by the lack of high-quality paired training data, as well as the self-supervised training paradigm adopted in these methods, which forces the model to in-paint the masked regions, leading to ambiguity between synthesizing the masked objects and restoring the background. To address these issues, we propose a semi-supervised learning strategy with human-in-the-loop to create high-quality paired training data, aiming to train a Robust Object Remover (RORem). We first collect 60K training pairs from open-source datasets to train an initial object removal model for generating removal samples, and then utilize human feedback to select a set of high-quality object removal pairs, with which we train a discriminator to automate the following training data generation process. By iterating this process for several rounds, we finally obtain a substantial object removal dataset with over 200K pairs. Fine-tuning the pre-trained stable diffusion model with this dataset, we obtain our RORem, which demonstrates state-of-the-art object removal performance in terms of both reliability and image quality. Particularly, RORem improves the object removal success rate over previous methods …
Poster
Hao Zhu · Yan Zhu · Jiayu Xiao · Tianxiang Xiao · Yike Ma · Yucheng Zhang · Feng Dai

[ ExHall D ]

Abstract
Automated crop mapping through Satellite Image Time Series (SITS) has emerged as a crucial avenue for agricultural monitoring and management. However, due to the low resolution and unclear parcel boundaries, annotating pixel-level masks is exceptionally complex and time-consuming in SITS. This paper embraces the weakly supervised paradigm (i.e., only image-level categories available) to liberate the crop mapping task from the exhaustive annotation burden. The unique characteristics of SITS give rise to several challenges in weakly supervised learning: (1) noise perturbation from spatially neighboring regions, and (2) erroneous semantic bias from anomalous temporal periods. To address the above difficulties, we propose a novel method, termed exploring space-time perceptive clues (Exact). First, we introduce a set of spatial clues to explicitly capture the representative patterns of different crops from the most class-relative regions. Besides, we leverage the temporal-to-class interaction of the model to emphasize the contributions of pivotal clips, thereby enhancing the model's perception of crop regions. Building upon the space-time perceptive clues, we derive clue-based CAMs to effectively supervise the SITS segmentation network. Our method demonstrates impressive performance on various SITS benchmarks. Remarkably, the segmentation network trained on Exact-generated masks achieves 95% of its fully supervised performance, showing the bright …
Poster
Chenxi Xie · Minghan LI · Hui Zeng · Jun Luo · Lei Zhang

[ ExHall D ]

Abstract
High-resolution semantic segmentation is essential for applications like image editing, bokeh imaging, and AR/VR. Unfortunately, existing datasets often have limited resolution and lack precise mask details and boundaries. In this work, we build a large-scale, matting-level semantic segmentation dataset, named MaSS13K, which consists of 13,348 real-world images, all at 4K resolution. MaSS13K provides high-quality mask annotations for a number of objects, which are categorized into seven categories: human, vegetation, ground, sky, water, building, and others. MaSS13K features precise masks, with an average mask complexity 20-50 times higher than existing semantic segmentation datasets. We consequently present a method specifically designed for high-resolution semantic segmentation, namely MaSSFormer, which employs an efficient pixel decoder that aggregates high-level semantic features and low-level texture features across three stages, aiming to produce high-resolution masks with minimal computational cost. Finally, we propose a new learning paradigm, which integrates the high-quality masks of the seven given categories with pseudo labels from new classes, enabling MaSSFormer to transfer its accurate segmentation capability to other classes of objects. Our proposed MaSSFormer is comprehensively evaluated on the MaSS13K benchmark together with 14 representative segmentation models. We expect that our meticulously annotated MaSS13K dataset and the MaSSFormer model can facilitate …
Poster
Wonseok Roh · Hwanhee Jung · Giljoo Nam · Dong In Lee · Hyeongcheol Park · Sang Ho Yoon · Jungseock Joo · Sangpil Kim

[ ExHall D ]

Abstract
Recent 3D Instance Segmentation methods typically encode hundreds of instance-wise candidates with instance-specific information in various ways and refine them into final masks. However, they have yet to fully explore the benefit of these candidates. They overlook the valuable cues encoded in multiple candidates that represent different parts of the same instance, resulting in fragments. Also, they often fail to capture the precise spatial range of 3D instances, primarily due to inherent noise from sparse and unordered point clouds. In this work, to address these challenges, we propose a novel instance-wise knowledge enhancement approach. We first introduce an Instance-wise Knowledge Aggregation to associate scattered single-instance details by optimizing correlations among candidates representing the same instance. Moreover, we present an Instance-wise Structural Guidance to enhance the spatial understanding of candidates using structural cues from ambiguity-reduced features. Here, we utilize a simple yet effective truncated singular value decomposition algorithm to minimize inherent noise in 3D features. In our extensive experiments on large-scale benchmarks, ScanNetV2, ScanNet200, S3DIS, and STPLS3D, our method outperforms existing works. We also demonstrate the effectiveness of our modules based on both kernel and transformer architectures.
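The truncated singular value decomposition mentioned above is a standard denoising tool; the sketch below shows the generic operation on a per-instance feature matrix. The rank `k` and the (num_points, feat_dim) layout are assumptions for illustration, not the paper's exact recipe.

```python
# Generic truncated-SVD denoising sketch; rank k is an assumption.
import numpy as np


def truncated_svd_denoise(features: np.ndarray, k: int = 16) -> np.ndarray:
    """Keep only the top-k singular components of a (num_points, feat_dim) matrix."""
    U, S, Vt = np.linalg.svd(features, full_matrices=False)
    S[k:] = 0.0  # discard the low-energy components that mostly carry noise
    return (U * S) @ Vt


# Usage: clean_feats = truncated_svd_denoise(point_features, k=16)
```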
Poster
Xinyu Zhao · Jun Xie · Shengzhe Chen · Jun Liu

[ ExHall D ]

Abstract
Multi-center star shape is a prevalent object shape feature, which has proven effective in model-based image segmentation methods. However, the shape field function induced by the multi-center star shape is non-smooth, and directly applying it to the data-driven image segmentation network architecture design may lead to instability in backpropagation. This paper proposes a convex combination star (CCS) shape, possessing multi-center star shape properties, and has the advantage of effectively controlling the shape of the region through a smooth field function. The sufficient condition of the proposed CCS shape can be combined into the image segmentation neural network structure design through the bridge between the variational segmentation model and the activation function of the data-driven method. Taking Segment Anything Model (SAM) and its improved version as backbone networks, we have shown that the segmentation network architecture with CCS shape properties can greatly improve the accuracy of segmentation results.
Poster
Haijie Li · Yanmin Wu · Jiarui Meng · Qiankun Gao · Zhiyao Zhang · Ronggang Wang · Jian Zhang

[ ExHall D ]

Abstract
3D scene understanding has become an essential area of research with applications in autonomous driving, robotics, and augmented reality. Recently, 3D Gaussian Splatting (3DGS) has emerged as a powerful approach, combining explicit modeling with neural adaptability to provide efficient and detailed scene representations. However, three major challenges remain in leveraging 3DGS for scene understanding: 1) an imbalance between appearance and semantics, where dense Gaussian usage for fine-grained texture modeling does not align with the minimal requirements for semantic attributes; 2) inconsistencies between appearance and semantics, as purely appearance-based Gaussians often misrepresent object boundaries; and 3) reliance on top-down instance segmentation methods, which struggle with uneven category distributions, leading to over- or under-segmentation. In this work, we propose InstanceGaussian, a method that jointly learns appearance and semantic features while adaptively aggregating instances. Our contributions include i) a novel Semantic-Scaffold-GS representation balancing appearance and semantics to improve feature representations and boundary delineation; ii) a progressive appearance-semantic joint training strategy to enhance stability and segmentation accuracy; and iii) a bottom-up, category-agnostic instance aggregation approach that addresses segmentation challenges through farthest point sampling and connected component analysis. Our approach achieves state-of-the-art performance in category-agnostic, open-vocabulary 3D point-level segmentation, highlighting the effectiveness of the proposed representation …
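Farthest point sampling, one of the two classic ingredients named for the bottom-up aggregation above, can be sketched as follows. The input/output conventions (an (N, 3) array and returned indices) are assumptions for illustration; the paper's pipeline also involves connected-component analysis, which is not reproduced here.

```python
# Generic farthest point sampling (FPS) sketch; I/O conventions are assumptions.
import numpy as np


def farthest_point_sampling(points: np.ndarray, num_samples: int) -> np.ndarray:
    """points: (N, 3) array; returns indices of num_samples well-spread points."""
    n = points.shape[0]
    selected = np.zeros(num_samples, dtype=np.int64)
    dist = np.full(n, np.inf)
    selected[0] = 0  # start from an arbitrary seed point
    for i in range(1, num_samples):
        # Update each point's squared distance to the nearest already-selected point.
        diff = points - points[selected[i - 1]]
        dist = np.minimum(dist, np.einsum("ij,ij->i", diff, diff))
        selected[i] = int(np.argmax(dist))
    return selected
```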
Poster
Andrew Szot · Bogdan Mazoure · Omar Attia · Aleksei Timofeev · Harsh Agrawal · R Devon Hjelm · Zhe Gan · Zsolt Kira · Alexander Toshev

[ ExHall D ]

Abstract
We examine the capability of Multimodal Large Language Models (MLLMs) to tackle diverse domains that extend beyond the traditional language and vision tasks these models are typically trained on. Specifically, our focus lies in areas such as Embodied AI, Games, UI Control, and Planning. To this end, we introduce a process of adapting an MLLM to a Generalist Embodied Agent (GEA). GEA is a single unified model capable of grounding itself across these varied domains through a multi-embodiment action tokenizer. GEA is trained with supervised learning on a large dataset of embodied experiences and with online RL in interactive simulators. We explore the data and algorithmic choices necessary to develop such a model. Our findings reveal the importance of training with cross-domain data and online RL for building generalist agents. The final GEA model achieves strong generalization performance to unseen tasks across diverse benchmarks compared to other generalist models and benchmark-specific approaches.
Poster
Junha Lee · Chunghyun Park · Jaesung Choe · Yu-Chiang Frank Wang · Jan Kautz · Minsu Cho · Chris Choy

[ ExHall D ]

Abstract
We tackle open-vocabulary 3D scene understanding by introducing a novel data generation pipeline and training framework. Our method addresses three critical requirements for effective training: precise 3D region segmentation, comprehensive textual descriptions, and sufficient dataset scale. By leveraging state-of-the-art open-vocabulary image segmentation models and region-aware Vision-Language Models (VLM), we develop an automatic pipeline that generates high-quality 3D mask-text pairs. Applying this pipeline to multiple 3D scene datasets, we create Mosaic3D-5.6M, a dataset of over 30K annotated scenes with 5.6M mask-text pairs—significantly larger than existing datasets. Building upon this data, we propose Mosaic3D, a foundation model combining a 3D encoder trained with contrastive learning and a lightweight mask decoder for open-vocabulary 3D semantic and instance segmentation. Our approach achieves state-of-the-art results on open-vocabulary 3D semantic and instance segmentation tasks including ScanNet200, Matterport3D, and ScanNet++, with ablation studies validating the effectiveness of our large-scale training data.
Poster
Xingchen Liu · Piyush Tayal · Jianyuan Wang · Jesus Zarzar · Tom Monnier · Konstantinos Tertikas · Jiali Duan · Antoine Toisoul · Jason Y. Zhang · Natalia Neverova · Andrea Vedaldi · Roman Shapovalov · David Novotny

[ ExHall D ]

Abstract
We introduce Uncommon Objects in 3D (uCO3D), a new object-centric dataset for 3D deep learning and 3D generative AI. uCO3D is the largest publicly-available collection of high-resolution videos of objects with 3D annotations that ensures full-360 coverage. uCO3D is significantly more diverse than MVImgNet and CO3Dv2, covering more than 1,000 object categories. It is also of higher quality, due to extensive quality checks of both the collected videos and the 3D annotations. Similar to analogous datasets, uCO3D contains annotations for 3D camera poses, depth maps and sparse point clouds. In addition, each object is equipped with a caption and a 3D Gaussian Splat reconstruction. We train several large 3D models on MVImgNet, CO3Dv2, and uCO3D and obtain superior results using the latter, showing that uCO3D is better for learning applications.
Poster
Hongjia Zhai · Hai Li · Zhenzhe Li · Xiaokun Pan · Yijia He · Guofeng Zhang

[ ExHall D ]

Abstract
Recently, 3D Gaussian Splatting (3DGS) has shown encouraging performance for open vocabulary scene understanding tasks. However, previous methods cannot distinguish 3D instance-level information and usually predict a heatmap between the scene feature and the text query. In this paper, we propose PanoGS, a novel and efficient 3D panoptic open vocabulary scene understanding approach. Technically, to learn accurate 3D language features that can scale to large indoor scenarios, we adopt pyramid tri-planes to model the latent continuous parametric feature space and use a 3D feature decoder to regress the multi-view fused 2D feature cloud. Besides, we propose language-guided graph cuts that synergistically leverage reconstructed geometry and learned language cues to group 3D Gaussian primitives into a set of super-primitives. To obtain 3D-consistent instances, we perform graph-clustering-based segmentation with SAM-guided edge affinity computation between different super-primitives. Extensive experiments on widely used datasets show better or more competitive performance on 3D panoptic open vocabulary scene understanding.
Poster
Yan Wang · Baoxiong Jia · Ziyu Zhu · Siyuan Huang

[ ExHall D ]

Abstract
Open-vocabulary 3D scene understanding is pivotal for enhancing physical intelligence, as it enables embodied agents to interpret and interact dynamically within real-world environments. This paper introduces MPEC, a novel Masked Point-Entity Contrastive learning method for open-vocabulary 3D semantic segmentation that leverages both 3D entity-language alignment and point-entity consistency across different point cloud views to foster entity-specific feature representations. Our method improves semantic discrimination and enhances the differentiation of unique instances, achieving state-of-the-art results on ScanNet for open-vocabulary 3D semantic segmentation and demonstrating superior zero-shot scene understanding capabilities. Extensive fine-tuning experiments on 8 datasets, spanning from low-level perception to high-level reasoning tasks, showcase the potential of learned 3D features, driving consistent performance gains across varied 3D scene understanding tasks.
Poster
JUNSEONG KIM · GeonU Kim · Kim Yu-Ji · Yu-Chiang Frank Wang · Jaesung Choe · Tae-Hyun Oh

[ ExHall D ]

Abstract
We introduce Dr. Splat, a novel approach for open-vocabulary 3D scene understanding leveraging 3D Gaussian Splatting. Unlike existing language-embedded 3DGS methods, which rely on a rendering process, our method directly associates language-aligned CLIP embeddings with 3D Gaussians for holistic 3D scene understanding. The key to our method is a language feature registration technique in which CLIP embeddings are assigned to the dominant Gaussians intersected by each pixel-ray. Moreover, we integrate Product Quantization (PQ), trained on general large-scale image data, to compactly represent embeddings without per-scene optimization. Experiments demonstrate that our approach significantly outperforms existing approaches on 3D perception benchmarks, such as open-vocabulary 3D semantic segmentation, 3D object localization, and 3D object selection tasks. Code will be publicly available if accepted.
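Product quantization, mentioned above as the compression mechanism for per-Gaussian embeddings, can be sketched generically as follows. The sub-vector count, codebook size, and the use of scikit-learn's KMeans are illustrative assumptions, not the paper's configuration.

```python
# Generic product quantization (PQ) sketch; hyperparameters are assumptions.
import numpy as np
from sklearn.cluster import KMeans


def train_pq(embeddings: np.ndarray, num_subvectors: int = 8, num_centroids: int = 256):
    """Split each D-dim embedding into sub-vectors and learn one codebook per split."""
    d = embeddings.shape[1]
    assert d % num_subvectors == 0
    sub_dim = d // num_subvectors
    codebooks, codes = [], []
    for m in range(num_subvectors):
        chunk = embeddings[:, m * sub_dim:(m + 1) * sub_dim]
        km = KMeans(n_clusters=num_centroids, n_init=10).fit(chunk)
        codebooks.append(km.cluster_centers_)
        codes.append(km.labels_.astype(np.uint8))   # one byte per sub-vector
    return codebooks, np.stack(codes, axis=1)       # codes: (N, num_subvectors)


def decode_pq(codebooks, codes):
    """Approximately reconstruct embeddings from their PQ codes."""
    return np.concatenate([cb[codes[:, m]] for m, cb in enumerate(codebooks)], axis=1)
```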
Poster
Hanxun Yu · Wentong Li · Song Wang · Junbo Chen · Jianke Zhu

[ ExHall D ]

Abstract
Despite encouraging progress in 3D scene understanding, it remains challenging to develop an effective Large Multi-modal Model (LMM) that is capable of understanding and reasoning in complex 3D environments. Most previous methods typically encode 3D point and 2D image features separately, neglecting interactions between 2D semantics and 3D object properties, as well as the spatial relationships within the 3D environment. This limitation not only hinders comprehensive representations of the 3D scene, but also compromises training and inference efficiency. To address these challenges, we propose a unified Instance-aware 3D Large Multi-modal Model (Inst3D-LMM) to deal with multiple 3D scene understanding tasks simultaneously. To obtain fine-grained instance-level visual tokens, we first introduce a novel Multi-view Cross-Modal Fusion (MCMF) module to inject the multi-view 2D semantics into their corresponding 3D geometric features. For scene-level relation-aware tokens, we further present a 3D Instance Spatial Relation (3D-ISR) module to capture the intricate pairwise spatial relationships among objects. Additionally, we perform end-to-end multi-task instruction tuning simultaneously without subsequent task-specific fine-tuning. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods across 3D scene understanding, reasoning and grounding tasks. Our full implementation will be publicly available.
Poster
Shengqiong Wu · Hao Fei · Tat-seng Chua

[ ExHall D ]

Abstract
Scene graph (SG) representations can neatly and efficiently describe scene semantics, which has driven sustained intensive research in SG generation. In the real world, multiple modalities often coexist, with different types, such as images, text, video, and 3D data, expressing distinct characteristics. Unfortunately, current SG research is largely confined to single-modality scene modeling, preventing the full utilization of the complementary strengths of different modality SG representations in depicting holistic scene semantics. To this end, we introduce Universal SG (USG), a novel representation capable of fully characterizing comprehensive semantic scenes from any given combination of modality inputs, encompassing modality-invariant and modality-specific scenes. Further, we tailor a niche-targeting USG parser, USG-Par, which effectively addresses two key bottlenecks: cross-modal object alignment and out-of-domain challenges. We design USG-Par with a modular architecture for end-to-end USG generation, in which we devise an object associator to relieve the modality gap for cross-modal object alignment. Further, we propose a text-centric scene contrastive learning mechanism to mitigate domain imbalances by aligning multimodal objects and relations with textual SGs. Through extensive experiments, we demonstrate that USG offers a stronger capability for expressing scene semantics than standalone SGs, and also that our USG-Par achieves higher efficacy and performance.
Poster
Jingzhou Luo · Yang Liu · weixing chen · Zhen Li · Yaowei Wang · Guanbin Li · Liang Lin

[ ExHall D ]

Abstract
3D Question Answering (3D QA) requires the model to comprehensively understand its situated 3D scene described by the text, then reason about its surrounding environment and answer a question under that situation. However, existing methods usually rely on global scene perception from pure 3D point clouds and overlook the importance of rich local texture details from multi-view images. Moreover, due to the inherent noise in camera poses and complex occlusions, significant feature degradation and reduced feature robustness arise when aligning 3D point clouds with multi-view images. In this paper, we propose a Dual-vision Scene Perception Network (DSPNet) to comprehensively integrate multi-view and point cloud features and improve robustness in 3D QA. Our Text-guided Multi-view Fusion (TGMF) module prioritizes image views that closely match the semantic content of the text. To adaptively fuse back-projected multi-view images with point cloud features, we design the Adaptive Dual-vision Perception (ADVP) module, enhancing 3D scene comprehension. Additionally, our Multimodal Context-guided Reasoning (MCGR) module facilitates robust reasoning by integrating contextual information across visual and linguistic modalities. Experimental results on the SQA3D and ScanQA datasets demonstrate the superiority of our DSPNet.
Poster
Shijie Zhou · Hui Ren · Yijia Weng · Shuwang Zhang · Zhen Wang · Dejia Xu · Zhiwen Fan · Suya You · Zhangyang Wang · Leonidas Guibas · Achuta Kadambi

[ ExHall D ]

Abstract
Recent advancements in 2D and multi-modal models have achieved remarkable success by leveraging large-scale training on extensive datasets. However, extending these achievements to enable free-form interactions and high-level semantic operations with complex 3D/4D scenes remains challenging. This difficulty stems from the limited availability of large-scale, annotated 3D/4D or multi-view datasets, which are crucial for generalizable vision and language tasks such as open-vocabulary and prompt-based segmentation, language-guided editing, and visual question answering (VQA). In this paper, we introduce Feature4X, a universal framework designed to extend any functionality of 2D vision foundation models into the 4D realm, using only monocular video input, which is widely available from user-generated content. The "X" in Feature4X represents its versatility, enabling any task through adaptable, model-conditioned 4D feature field distillation. At the core of our framework is a dynamic optimization strategy that unifies multiple model capabilities into a single, task-dependent representation. Additionally, to the best of our knowledge, we are the first method to distill and lift the features of video foundation models (e.g., SAM2, InternVideo2) into an explicit 4D feature field using Gaussian Splatting. Our experiments showcase novel view segment anything, geometric and appearance scene editing, and free-form VQA across all time steps, empowered by LLMs in …
Poster
Zihan Wang · Gim Hee Lee

[ ExHall D ]

Abstract
We introduce Generalizable 3D-Language Feature Fields (g3D-LF), a 3D representation model pre-trained on large-scale 3D-language dataset for embodied tasks. Our g3D-LF processes posed RGB-D images from agents to encode feature fields for: 1) Novel view representation predictions from any position in the 3D scene; 2) Generations of BEV maps centered on the agent; 3) Querying targets using multi-granularity language within the above-mentioned representations. Our representation can be generalized to unseen environments, enabling real-time construction and dynamic updates. By volume rendering latent features along sampled rays and integrating semantic and spatial relationships through multiscale encoders, our g3D-LF produces representations at different scales and perspectives, aligned with multi-granularity language, via multi-level contrastive learning. Furthermore, we prepare a large-scale 3D-language dataset to align the representations of the feature fields with language. Extensive experiments on Vision-and-Language Navigation under both Panorama and Monocular settings, Zero-shot Object Navigation, and Situated Question Answering tasks highlight the significant advantages and effectiveness of our g3D-LF for embodied tasks. Our source code and dataset will be made open-source upon paper acceptance.
Poster
Jianwei Yang · Reuben Tan · Qianhui Wu · Ruijie Zheng · Baolin Peng · Yongyuan Liang · Yu Gu · Mu Cai · Seonghyeon Ye · Joel Jang · Yuquan Deng · Jianfeng Gao

[ ExHall D ]

Abstract
This paper presents a new foundation model, called Magma, for multimodal AI agents in both the digital and physical worlds. Magma is a significant extension of vision-language (VL) models in that the former not only retains the VL understanding ability (verbal intelligence) of the latter, but is also equipped with the ability to plan and act in the visual-spatial world (spatial intelligence) to complete agentic tasks ranging from UI navigation to robot manipulation. Magma is pre-trained on large amounts of heterogeneous VL datasets, where the actionable visual objects (e.g., clickable buttons in GUI) in images are labeled by Set of Marks (SoM) and the object movements (e.g., the trace of a robotic arm) in videos are labeled by Trace of Mark (ToM). Evaluation shows that SoM and ToM facilitate acquisition of spatial intelligence from training data. Magma creates new state-of-the-art results on UI navigation and robotic manipulation tasks, outperforming previous models that are tailored specifically to these tasks. On VL tasks, Magma also compares favorably to popular VL models that are trained on much larger datasets.
Poster
Jing Zhu · Yuhang Zhou · Shengyi Qian · Zhongmou He · Tong Zhao · Neil Shah · Danai Koutra

[ ExHall D ]

Abstract
Graph machine learning has made significant strides in recent years, yet the integration of visual information with graph structures remains an underexplored area. To address this critical gap, we introduce the Multimodal Graph Benchmark (MM-GRAPH), a pioneering benchmark that incorporates both visual and textual information into graph learning tasks. MM-GRAPH extends beyond existing text-attributed graph benchmarks, offering a more comprehensive evaluation framework for multimodal graph neural networks (GNNs). Our benchmark comprises seven diverse datasets of varying scales, designed to assess graph learning algorithms across different tasks in real-world scenarios. These datasets feature rich multimodal node attributes, including visual data, which enables a more holistic evaluation of GNN performance in complex, multimodal environments. To support advancements in this emerging field, we provide an extensive empirical study on the performance of various graph learning frameworks when presented with features from multiple modalities, particularly emphasizing the impact of visual information. This study offers valuable insights into the challenges and opportunities of integrating visual data into graph learning algorithms.
Poster
Zihao Zhang · Aming Wu · Yahong Han

[ ExHall D ]

Abstract
Recently, the task of Single-Domain Generalized Object Detection (Single-DGOD) has been proposed, aiming to generalize a detector to multiple unknown domains never seen before during training. Due to the unavailability of target-domain data, some methods leverage the multimodal capabilities of vision-language models, using textual prompts to estimate cross-domain information and enhance the model's generalization capability. These methods typically use a single textual prompt, often referred to as the one-step prompt method. However, when dealing with complex styles such as the combination of rain and night, we observe that the performance of the one-step prompt method tends to be relatively weak. The reason may be that many scenes incorporate not just a single style but a combination of multiple styles. The one-step prompt method may not effectively synthesize combined information involving various styles. To address this limitation, we propose a new method, i.e., Style Evolving along Chain-of-Thought, which aims to progressively integrate and expand style information along the chain of thought, enabling the continual evolution of styles. Specifically, by progressively refining style descriptions and guiding the diverse evolution of styles, this approach enables more accurate simulation of various style characteristics and helps the model gradually learn and adapt to subtle differences between styles. …
Poster
Yuanze Lin · Yunsheng Li · Dongdong Chen · Weijian Xu · Ronald Clark · Philip H.S. Torr

[ ExHall D ]

Abstract
We introduce Olympus, a new approach that transforms Multimodal Large Language Models (MLLMs) into a unified framework capable of handling a wide array of computer vision tasks. Utilizing a controller MLLM, Olympus delegates over 20 specialized tasks across images, videos, and 3D objects to dedicated modules. This instruction-based routing enables complex workflows through chained actions without the need for training heavy generative models. Olympus easily integrates with existing MLLMs, expanding their capabilities with comparable performance. Experimental results demonstrate that Olympus achieves an average routing accuracy of 94.75% across 20 tasks and precision of 91.82% in chained action scenarios, showcasing its effectiveness as a universal task router that can solve a diverse range of computer vision tasks.
Poster
Bardia Safaei · Faizan Siddiqui · Jiacong Xu · Vishal M. Patel · Shao-Yuan Lo

[ ExHall D ]

Abstract
Visual instruction tuning (VIT) for large vision-language models (LVLMs) requires training on expansive datasets of image-instruction pairs, which can be costly. Recent efforts in VIT data selection aim to select a small subset of high-quality image-instruction pairs, reducing VIT runtime while maintaining performance comparable to full-scale training. However, a major challenge often overlooked is that generating instructions from unlabeled images for VIT is highly expensive. Most existing VIT datasets rely heavily on human annotations or paid services like the GPT API, which limits users with constrained resources from creating VIT datasets for custom applications. To address this, we introduce Pre-Instruction Data Selection (PreSel), a more practical data selection paradigm that directly selects the most beneficial unlabeled images and generates instructions only for the selected images. PreSel first estimates the relative importance of each vision task within VIT datasets to derive task-wise sampling budgets. It then clusters image features within each task, selecting the most representative images with the budget. This approach reduces computational overhead for both instruction generation during VIT data formation and LVLM fine-tuning. By generating instructions for only 15% of the images, PreSel achieves performance comparable to full-data VIT on the LLaVA-1.5 and Vision-Flan datasets. Code will be …
Poster
JiHyeok Jung · EunTae Kim · SeoYeon Kim · Joo Ho Lee · Bumsoo Kim · Buru Chang

[ ExHall D ]

Abstract
Multimodal large language models (MLLMs) act as essential interfaces, connecting humans with AI technologies in multimodal applications. However, current MLLMs face challenges in accurately interpreting object orientation in images due to inconsistent orientation annotations in training data, hindering the development of a coherent orientation understanding. To overcome this, we propose egocentric instruction tuning, which aligns MLLMs' orientation understanding with the user’s perspective, based on a consistent annotation standard derived from the user’s egocentric viewpoint. We first generate egocentric instruction data that leverages MLLMs' ability to recognize object details and applies prior knowledge for orientation understanding. Using this data, we perform instruction tuning to enhance the model’s capability for accurate orientation interpretation. In addition, we introduce EgoOrientBench, a benchmark that evaluates MLLMs' orientation understanding across three tasks using images collected from diverse domains. Experimental results on this benchmark show that egocentric instruction tuning significantly improves orientation understanding without compromising overall MLLM performance. The instruction data and benchmark dataset are available on our project page at \url{https://anonymous.4open.science/r/EgocentricInstructionTuning-E189}.
Poster
Yunze Man · De-An Huang · Guilin Liu · Shiwei Sheng · Shilong Liu · Liangyan Gui · Jan Kautz · Yu-Xiong Wang · Zhiding Yu

[ ExHall D ]

Abstract
Recent advances in multimodal large language models (MLLMs) have demonstrated remarkable capabilities in vision-language tasks, yet they often struggle with vision-centric scenarios where precise visual focus is needed for accurate reasoning. In this paper, we introduce Argus to address these limitations with a new visual attention grounding mechanism. Our approach employs object-centric grounding as visual chain-of-thought signals, enabling more effective goal-conditioned visual attention during multimodal reasoning tasks. Evaluations on diverse benchmarks demonstrate that Argus excels in both multimodal reasoning tasks and referring object grounding tasks. Extensive analysis further validates various design choices of Argus, and reveals the effectiveness of explicit language-guided visual region-of-interest engagement in MLLMs, highlighting the importance of advancing multimodal intelligence from a visual-centric perspective.
Poster
Xuanbai Chen · Xiang Xu · Zhihua Li · Tianchen Zhao · Pietro Perona · Qin ZHANG · Yifan Xing

[ ExHall D ]

Abstract
How can we troubleshoot a deep visual model, i.e., understand why it makes certain mistakes and take action to correct its behavior? We design a Model Diagnosis and Correction system (MDC), an automated framework that analyzes the pattern of errors, proposes candidate causal attributes, conducts hypothesis testing via attribute editing, and ultimately generates counterfactual training samples to improve the performance of the model. Unlike previous methods, in addition to linguistic attributes, our method also incorporates analysis of implicit causal attributes, those that cannot be accurately described by language. To achieve this, we propose an image editing module capable of leveraging both implicit and linguistic attributes to generate counterfactual images depicting error patterns and to further experimentally validate causality relationships. Lastly, we enrich the training set with synthetic samples depicting verified causal attributes and retrain the model, further boosting accuracy and robustness. Extensive experiments on fine-grained classification and face security applications demonstrate the superiority of our approach in model diagnosis and correction. Specifically, we achieve an average relative improvement of 62.01% in HTER for the face security application over state-of-the-art methods.
Poster
Sébastien Piérard · Anaïs Halin · Anthony Cioppa · Adrien Deliege · Marc Van Droogenbroeck

[ ExHall D ]

Abstract
Ranking entities such as algorithms, devices, methods, or models based on their performances, while accounting for application-specific preferences, is a challenge. To address this challenge, we establish the foundations of a universal theory for performance-based ranking. First, we introduce a rigorous framework built on top of both the probability and order theories. Our new framework encompasses the elements necessary to (1) define and manipulate performances, (2) express which performances are worse than or equivalent to others, (3) model tasks through a variable called satisfaction, (4) consider properties of the evaluation, (5) define scores, and (6) specify application-specific preferences through a variable called importance. On top of this framework, we propose the first axiomatic definition of performance orderings and performance-based rankings. Then, we introduce a universal parametric family of scores, called ranking scores, that can be used to establish rankings satisfying our axioms, while considering application-specific preferences. Finally, we show, in the case of two-class classification, that the family of ranking scores encompasses well-known performance scores, including the accuracy, the true positive rate (recall), the positive predictive value (precision), Jaccard's coefficient (intersection over union), and Fβ scores. However, we also show that some other scores commonly used to compare classifiers are unsuitable …
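The familiar two-class scores the abstract names as special cases can be written directly from confusion-matrix counts, as in the short sketch below. This reproduces only the standard formulas (accuracy, recall, precision, Jaccard, F-beta); the paper's parametric family of ranking scores itself is not reproduced, and the counts are assumed to be positive so the ratios are defined.

```python
# Standard two-class scores from confusion-matrix counts (tp, fp, fn, tn assumed > 0).
def two_class_scores(tp: int, fp: int, fn: int, tn: int, beta: float = 1.0) -> dict:
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    recall = tp / (tp + fn)                 # true positive rate
    precision = tp / (tp + fp)              # positive predictive value
    jaccard = tp / (tp + fp + fn)           # intersection over union
    b2 = beta ** 2
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return {"accuracy": accuracy, "recall": recall, "precision": precision,
            "jaccard": jaccard, f"f_{beta}": f_beta}


# Example: two_class_scores(tp=80, fp=10, fn=20, tn=90, beta=1.0)
```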
Poster
Sagar Soni · Akshay Dudhane · Hiyam Debary · Mustansar Fiaz · Muhammad Akhtar Munir · Muhammad Sohail Danish · Paolo Fraccaro · Campbell D Watson · Levente Klein · Fahad Shahbaz Khan · Salman Khan

[ ExHall D ]

Abstract
Automated analysis of vast Earth observation data via interactive Vision-Language Models (VLMs) can unlock new opportunities for environmental monitoring, disaster response, and resource management. Existing generic VLMs do not perform well on Remote Sensing data, while recent Geo-spatial VLMs remain restricted to a fixed resolution and few sensor modalities. In this paper, we introduce EarthDial, a conversational assistant specifically designed for Earth Observation (EO) data, transforming complex, multi-sensory Earth observations into interactive, natural language dialogues. EarthDial supports multi-spectral, multi-temporal, and multi-resolution imagery, enabling a wide range of remote sensing tasks, including classification, detection, captioning, question answering, visual reasoning, and visual grounding. To achieve this, we introduce an extensive instruction tuning dataset comprising over 11.11M instruction pairs covering RGB, Synthetic Aperture Radar (SAR), and multispectral modalities such as Near-Infrared (NIR) and infrared. Furthermore, EarthDial handles bi-temporal and multi-temporal sequence analysis for applications like change detection. Our extensive experimental results on 37 downstream applications demonstrate that EarthDial outperforms existing generic and domain-specific models, achieving better generalization across various EO tasks. Our codes and data will be publicly released.
Poster
Yiyang Fang · Wenke Huang · Guancheng Wan · Kehua Su · Mang Ye

[ ExHall D ]

Abstract
Multimodal Emotion Recognition (MER) aims to predict human emotions by leveraging multiple modalities, such as vision, acoustics, and language. However, due to the heterogeneity of these modalities, MER faces two key challenges: modality balance dilemma and modality specialization disappearance. Existing methods often overlook the varying importance of modalities across samples in tackling the modality balance dilemma. Moreover, mainstream decoupling methods, while preserving modality-specific information, often neglect the predictive capability of unimodal data. To address these, we propose a novel model, Modality-Specific Enhanced Dynamic Emotion Experts (EMOE), consisting of: (1) Mixture of Modality Experts for dynamically adjusting modality importance based on sample features, and (2) Unimodal Distillation to retain single-modality predictive ability within fused features. EMOE enables adaptive fusion by learning a unique modality weight distribution for each sample, enhancing multimodal predictions with single-modality predictions to balance invariant and specific features in emotion recognition. Experimental results on benchmark datasets show that EMOE achieves superior or comparable performance to state-of-the-art methods. Additionally, we extend EMOE to Multimodal Intent Recognition (MIR), further demonstrating its effectiveness and versatility.
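The per-sample modality weighting described in the EMOE abstract above can be illustrated with a generic gating sketch: a small gate predicts one weight per modality for each sample and fuses the unimodal features with those weights. The dimensions and the softmax gate are assumptions for illustration, not EMOE's exact architecture.

```python
# Generic per-sample modality gating sketch; dimensions are assumptions.
import torch
import torch.nn as nn


class ModalityGate(nn.Module):
    def __init__(self, feat_dim=256, num_modalities=3):
        super().__init__()
        self.gate = nn.Linear(feat_dim * num_modalities, num_modalities)

    def forward(self, feats):           # feats: list of (B, feat_dim) tensors
        concat = torch.cat(feats, dim=-1)
        weights = torch.softmax(self.gate(concat), dim=-1)    # (B, num_modalities)
        stacked = torch.stack(feats, dim=1)                   # (B, M, feat_dim)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)   # weighted fusion
```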
Poster
Fengxiang Wang · hongzhen wang · Zonghao Guo · Di Wang · Yulin Wang · Mingshuo Chen · Qiang Ma · Long Lan · Wenjing Yang · Jing Zhang · Zhiyuan Liu · Maosong Sun

[ ExHall D ]

Abstract
The astonishing breakthrough of multimodal large language models (MLLMs) has necessitated new benchmarks to quantitatively assess their capabilities, reveal their limitations, and indicate future research directions. However, this is challenging in the context of remote sensing (RS), since the imagery features ultra-high resolution that incorporates extremely complex semantic relationships. Existing benchmarks usually adopt notably smaller image sizes than real-world RS scenarios, suffer from limited annotation quality, and consider insufficient dimensions of evaluation. To address these issues, we present XLRS-Bench: a comprehensive benchmark for evaluating the perception and reasoning capabilities of MLLMs in ultra-high-resolution RS scenarios. XLRS-Bench boasts the largest average image size (8500×8500) observed thus far, with all evaluation samples meticulously annotated manually, assisted by a novel semi-automatic captioner on ultra-high-resolution RS images. On top of the XLRS-Bench, 16 sub-tasks are defined to evaluate MLLMs' 6 kinds of perceptual abilities and 4 kinds of reasoning capabilities, with a primary emphasis on advanced cognitive processes that facilitate real-world decision-making and the capture of spatiotemporal changes. The results of both general and RS-focused MLLMs on XLRS-Bench indicate that further efforts are needed to enhance their performance in real RS scenarios. We will open source XLRS-Bench to support further research of developing more …
Poster
Erjian Guo · Zhen Zhao · Zicheng Wang · Tong Chen · YUNYI LIU · Luping Zhou

[ ExHall D ]

Abstract
Medical Visual Question Answering (Med-VQA) systems benefit the interpretation of medical images containing critical clinical information. However, the challenge of noisy labels and limited high-quality datasets remains underexplored. To address this, we establish the first benchmark for noisy labels in Med-VQA by simulating human mislabeling with semantically designed noise types. More importantly, we introduce the DiN framework, which leverages a diffusion model to handle noisy labels in Med-VQA. Unlike the dominant classification-based VQA approaches that directly predict answers, our Answer Diffuser (AD) module employs a coarse-to-fine process, refining answer candidates with a diffusion model for improved accuracy. The Answer Condition Generator (ACG) further enhances this process by generating task-specific conditional information via integrating answer embeddings with fused image-question features. To address label noise, our Noisy Label Refinement (NLR) module introduces a robust loss function and dynamic answer adjustment to further boost the performance of the AD module. Our DiN framework consistently outperforms existing methods across multiple benchmarks with varying noise levels.
Poster
Xiaofu Chen · Yaxin Luo · Luo · Jiayi Ji · Henghui Ding · Yiyi Zhou

[ ExHall D ]

Abstract
In this paper, we focus on weakly supervised referring expression comprehension (REC), and identify that the lack of fine-grained visual capability greatly limits the upper performance bound of existing methods. To address this issue, we propose a novel framework for weakly supervised REC, namely Dynamic Visual routing Network (DViN), which overcomes the visual shortcomings from the perspective of feature combination and alignment. In particular, DViN is equipped with a novel sparse routing mechanism to efficiently combine features of multiple visual encoders in a dynamic manner, thus improving the visual descriptive power. Besides, we further propose an innovative weakly supervised objective, namely Routing-based Feature Alignment (RFA), which facilitates the visual understanding of routed features through the intra-modal and inter-modal alignment. To validate DViN, we conduct extensive experiments on four REC benchmark datasets. Experiments demonstrate that DViN achieves state-of-the-art results on four benchmarks while maintaining competitive inference efficiency. Besides, the strong generalization ability of DViN is also validated on weakly supervised referring expression segmentation. Source codes are anonymously released at: https://anonymous.4open.science/r/DViN-7736.
Poster
Heng Yin · Yuqiang Ren · Ke Yan · Shouhong Ding · Yongtao Hao

[ ExHall D ]

Abstract
Multimodal large language models (MLLMs) have demonstrated strong language understanding and generation capabilities, excelling in visual tasks like referring and grounding. However, due to task type limitations and dataset scarcity, existing MLLMs only ground objects present in images and cannot reject non-existent objects effectively, resulting in unreliable predictions. In this paper, we introduce ROD-MLLM, a novel MLLM for Reliable Object Detection using free-form language. We propose a query-based localization mechanism to extract low-level object features. By aligning global and object-level visual information with text space, we leverage the large language model (LLM) for high-level comprehension and final localization decisions, overcoming the language understanding limitations of normal detectors. To enhance language-based object detection, we design an automated data annotation pipeline and construct the dataset ROD. This pipeline uses the referring capabilities of existing MLLMs and chain-of-thought techniques to generate diverse expressions corresponding to zero or multiple objects, addressing the shortage of training data. Experiments across various tasks, including referring, grounding, and language-based object detection, show that ROD-MLLM achieves state-of-the-art performance among MLLMs. Notably, in language-based object detection, our model achieves a +13.7 mAP improvement over existing MLLMs and surpasses most specialized detection models, especially in scenarios requiring advanced complex language understanding.
Poster
Guofeng Mei · Wei Lin · Luigi Riz · Yujiao Wu · Fabio Poiesi · Yiming Wang

[ ExHall D ]

Abstract
Enabling Large Language Models (LLMs) to understand the 3D physical world is an emerging yet challenging research direction. Current strategies for processing point clouds typically downsample the scene or divide it into smaller parts for separate analysis. However, both approaches risk losing key local details or global contextual information. In this paper, we introduce PerLA, a 3D language assistant designed to be more perceptive to both details and context, making visual representations more informative for the LLM. PerLA captures high-resolution (local) details in parallel from different point cloud areas and integrates them with (global) context obtained from a lower-resolution whole point cloud. We present a novel algorithm that preserves point cloud locality through the Hilbert curve and effectively aggregates local-to-global information via cross-attention and a graph neural network. Lastly, we introduce a novel loss for local representation consensus to promote training stability. PerLA outperforms state-of-the-art 3D language assistants, with gains of up to +1.34 CiDEr on ScanQA for question answering, and +4.22 on ScanRefer and +3.88 on Nr3D for dense captioning.
Poster
Zhantao Yang · Ruili Feng · Keyu Yan · Huangji Wang · Zhicai Wang · Shangwen Zhu · Han Zhang · Jie Xiao · Pingyu Wu · Kai Zhu · Jixuan Chen · Chen-Wei Xie · Yue Yang · Hongyang Zhang · Yu Liu · Fan Cheng

[ ExHall D ]

Abstract
Advancements in large Vision-Language Models have brought precise, accurate image captioning, vital for advancing multi-modal image understanding and processing. Yet these captions often carry lengthy, intertwined contexts that are difficult to parse and frequently overlook essential cues, posing a great barrier for models like GroundingDINO and SDXL, which lack the strong text encoding and syntax analysis needed to fully leverage dense captions. To address this, we propose BACON, a prompting method that breaks down VLM-generated captions into disentangled, structured elements such as objects, relationships, styles, and themes. This approach not only minimizes confusion from handling complex contexts but also allows for efficient transfer into a JSON dictionary, enabling models without linguistic processing capabilities to easily access key information. We annotated 100,000 image-caption pairs using BACON with GPT-4V and trained an LLaVA captioner on this dataset, enabling it to produce BACON-style captions without relying on costly GPT-4V resources. Evaluations of overall quality, precision, and recall, as well as user studies, demonstrate that the resulting caption model consistently outperforms other state-of-the-art VLM models in generating high-quality captions. Additionally, we show that BACON-style captions exhibit better clarity when applied to various models, enabling them to accomplish previously unattainable tasks or surpass existing SOTA solutions without training. For example, …
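To make the idea of a disentangled, JSON-style caption concrete, the sketch below shows one plausible structure with objects, relationships, style, and theme fields. The exact field names and schema are assumptions; BACON's real output format may differ.

```python
# Illustrative structured-caption dictionary; field names are assumptions.
import json

structured_caption = {
    "objects": [
        {"name": "dog", "attributes": ["brown", "small"]},
        {"name": "ball", "attributes": ["red"]},
    ],
    "relationships": [
        {"subject": "dog", "predicate": "chasing", "object": "ball"},
    ],
    "style": "photograph, shallow depth of field",
    "theme": "a pet playing in a park",
}

# Downstream models without strong text encoders can read individual fields
# instead of parsing one long free-form caption.
print(json.dumps(structured_caption, indent=2))
```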
Poster
Yang Qin · Chao Chen · Zhihang Fu · Dezhong Peng · Xi Peng · Peng Hu

[ ExHall D ]

Abstract
Despite remarkable advancements in text-to-image person re-identification (TIReID) facilitated by the breakthrough of cross-modal embedding models, existing methods often struggle to distinguish challenging candidate images due to intrinsic limitations, such as network architecture and data quality. To address these issues, we propose an Interactive Cross-modal Learning framework (ICL), which leverages human-centered interaction to enhance the discriminability of text queries through external multimodal knowledge. To achieve this, we propose a plug-and-play Test-time Human-centered Interaction (TUI) module, which performs visual question answering focused on human characteristics, facilitating multi-round interactions with a multimodal large language model (MLLM) to align query intent with latent target images. Specifically, TUI refines user queries based on the MLLM responses to reduce the gap to the best-matching images, thereby boosting ranking accuracy. Additionally, to address the limitation of low-quality training texts, we introduce a novel Reorganization Data Augmentation (RDA) strategy based on information enrichment and diversity enhancement to enhance query discriminability by enriching, decomposing, and reorganizing person descriptions. Extensive experiments on four TIReID benchmarks, i.e., CUHK-PEDES, ICFG-PEDES, RSTPReid, and UFine6926, demonstrate that our method achieves remarkable performance with substantial improvement. The code will be released publicly.
Poster
Zicheng Zhang · Tengchuan Kou · Chunyi Li · Shushi Wang · Wei Sun · Wei Wang · Xiaoyu Li · ZongYu Wang · Xuezhi Cao · Xiongkuo Min · Xiaohong Liu · Guangtao Zhai

[ ExHall D ]

Abstract
Evaluating text-to-vision content hinges on two crucial aspects: **visual quality** and **alignment**. While significant progress has been made in developing objective models to assess these dimensions, the performance of such models heavily relies on the scale and quality of human annotations. According to **Scaling Law**, increasing the number of human-labeled instances follows a predictable pattern that enhances the performance of evaluation models. Therefore, we introduce a comprehensive dataset designed to **E**valuate **V**isual quality and **A**lignment **L**evel for text-to-vision content (**Q-EVAL-100K**), featuring the largest collection of human-labeled Mean Opinion Scores (MOS) for the two aspects mentioned. The **Q-EVAL-100K** dataset encompasses both text-to-image and text-to-video models, with 960K human annotations specifically focused on visual quality and alignment for 100K instances (60K images and 40K videos). Leveraging this dataset with a context prompt, we propose **Q-Eval-Score**, a unified model capable of evaluating both visual quality and alignment, with special improvements for handling long-text prompt alignment. Experimental results indicate that the proposed **Q-Eval-Score** achieves superior performance on both visual quality and alignment, with strong generalization capabilities across other benchmarks. These findings highlight the significant value of the **Q-EVAL-100K** dataset. **The data and code will be released** to help promote the development of generation models.
Poster
Yuanmin Tang · Jue Zhang · Xiaoting Qin · Jing Yu · Gaopeng Gou · Gang Xiong · Qingwei Lin · Saravan Rajmohan · Dongmei Zhang · Qi Wu

[ ExHall D ]

Abstract
Composed Image Retrieval (CIR) aims to retrieve target images that closely resemble a reference image while integrating user-specified textual modifications, thereby capturing user intent more precisely. This dual-modality approach is especially valuable in internet search and e-commerce, facilitating tasks like scene image search with object manipulation and product recommendations with attribute changes. Existing training-free zero-shot CIR (ZS-CIR) methods often employ a two-stage process: they first generate a caption for the reference image and then use Large Language Models for reasoning to obtain a target description. However, these methods suffer from missing critical visual details and limited reasoning capabilities, leading to suboptimal retrieval performance. To address these challenges, we propose a novel, training-free one-stage method, One-Stage Reflective Chain-of-Thought Reasoning for ZS-CIR (OSrCIR), which employs Multimodal Large Language Models to retain essential visual information in a single-stage reasoning process, eliminating the information loss seen in two-stage methods. Our Reflective Chain-of-Thought framework further improves interpretative accuracy by aligning manipulation intent with contextual cues from reference images. OSrCIR achieves performance gains of 1.80% to 6.44% over existing training-free methods across multiple tasks, setting new state-of-the-art results in ZS-CIR and enhancing its utility in vision-language applications. Our code is available at https://anonymous.4open.science/r/osrcir24/.
Poster
Zhaoran Zhao · Peng Lu · Anran Zhang · Pei Pei Li · Xia Li · Xuannan Liu · Yang Hu · Shiyi Chen · Liwei Wang · Wenhao Guo

[ ExHall D ]

Abstract
With the rapid growth of social media and digital photography, visually appealing images have become essential for effective communication and emotional engagement. Among the factors influencing aesthetic appeal, composition—the arrangement of visual elements within a frame—plays a crucial role. In recent years, specialized models for photographic composition have achieved impressive results across various aesthetic tasks. Meanwhile, rapidly advancing multimodal large language models (MLLMs) have excelled in several visual perception tasks. However, their ability to embed and understand compositional information remains underexplored, primarily due to the lack of suitable evaluation datasets. To address this gap, we introduce the Photographic Image Composition Dataset (PICD), a large-scale dataset consisting of 36,857 images categorized into 24 composition categories across 355 diverse scenes. We demonstrate the advantages of PICD over existing datasets in terms of data scale, composition category, label quality, and scene diversity. Building on PICD, we establish benchmarks to evaluate the composition embedding capabilities of specialized models and the compositional understanding ability of MLLMs. To enable efficient and effective evaluation, we propose a novel Composition Discrimination Accuracy (CDA) metric. Our evaluation highlights the limitations of current models and provides insights into directions for improving their ability to embed and understand composition.
Poster
Vishaal Udandarao · Nikhil Parthasarathy · Muhammad Ferjad Naeem · Talfan Evans · Samuel Albanie · Federico Tombari · Yongqin Xian · Alessio Tonioni · Olivier J Henaff

[ ExHall D ]

Abstract
Knowledge distillation (KD) is the de facto standard for compressing large-scale models into smaller ones. Prior works have explored ever more complex KD strategies involving different objective functions, teacher-ensembles, and weight inheritance. In this work we explore an alternative, yet simple approach: active data curation as effective distillation for contrastive multimodal pretraining. Our simple online batch selection method, ACID, outperforms strong KD baselines across various model-, data- and compute-configurations. Further, we find such an active data curation strategy to in fact be complementary to standard KD, and the two can be effectively combined to train highly performant inference-efficient models. Our simple and scalable pretraining framework, ACED, achieves state-of-the-art results across 27 zero-shot classification and retrieval tasks with up to 11% less inference FLOPs. We further demonstrate that our ACED models yield strong vision-encoders for training generative multimodal models in the LiT-Decoder setting, outperforming larger vision encoders for image-captioning and visual question-answering tasks.
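The abstract does not spell out the batch-selection rule, so the following is only a minimal sketch of one common learnability-based criterion for active data curation; the function name, the scoring rule, and the batch sizes are our assumptions, not ACID's released recipe.

```python
# Hedged sketch of learnability-based online batch selection (assumed criterion,
# not the paper's exact method): prefer examples with high learner loss but low
# reference-model loss, i.e. examples the learner can still improve on.
import torch

def select_learnable_subbatch(learner_losses: torch.Tensor,
                              reference_losses: torch.Tensor,
                              keep: int) -> torch.Tensor:
    """Return indices of the `keep` most 'learnable' examples in a candidate batch."""
    learnability = learner_losses - reference_losses
    return torch.topk(learnability, k=keep).indices

# usage: score a large candidate batch, then train only on the selected sub-batch
cand_learner = torch.rand(4096)
cand_reference = torch.rand(4096)
selected = select_learnable_subbatch(cand_learner, cand_reference, keep=1024)
```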
Poster
Thao Nguyen · Krishna Kumar Singh · Jing Shi · Trung Bui · Yong Jae Lee · Yuheng Li

[ ExHall D ]

Abstract
Large Multimodal Models (e.g., GPT-4, Gemini, Chameleon) have evolved into powerful tools with millions of users. However, they remain generic models and lack personalized knowledge of specific user concepts. Previous work has explored personalization for text generation, yet it remains unclear how these methods can be adapted to new modalities, such as image generation. In this paper, we introduce Yo'Chameleon, the first attempt to study personalization for large multimodal models. Given 3-5 images of a particular concept, Yo'Chameleon leverages soft-prompt tuning to embed subject-specific information to (i) answer questions about the subject and (ii) recreate pixel-level details to produce images of the subject in new contexts. Yo'Chameleon is trained with (i) a self-prompting optimization mechanism to balance performance across multiple modalities, and (ii) a "soft-positive" image generation approach to enhance image quality in a few-shot setting.
Poster
Zi-Han Jiang · Chien-Wei Lin · WeiHua Li · Hsuan-Tung Liu · Yi-Ren Yeh · Chu-Song Chen

[ ExHall D ]

Abstract
Despite advances in Large Language Models (LLMs) and Multimodal LLMs (MLLMs) for visual document understanding (VDU), visual information extraction (VIE) from relation-rich documents remains challenging due to layout diversity and limited training data. While existing synthetic document generators attempt to address data scarcity, they either rely on manually designed layouts and templates, or adopt rule-based approaches that limit layout diversity. Besides, current layout generation methods focus solely on topological patterns without considering textual content, making them impractical for generating documents with complex associations between contents and layouts. In this paper, we propose a Relation-rIch visual Document GEnerator (RIDGE) that addresses these limitations through a two-stage approach: (1) Content Generation, which leverages LLMs to generate document content using a carefully designed Hierarchical Structure Text format that captures entity categories and relationships, and (2) Content-driven Layout Generation, which learns to create diverse, plausible document layouts solely from easily available Optical Character Recognition (OCR) results, requiring no human labeling or annotation effort. Experimental results have demonstrated that our method significantly enhances the performance of document understanding models on various VIE benchmarks.
Poster
Zining Wang · Tongkun Guan · Pei Fu · Chen Duan · Qianyi Jiang · Zhentao Guo · Shan Guo · Junfeng Luo · Wei Shen · Xiaokang Yang

[ ExHall D ]

Abstract
Multi-modal Large Language Models (MLLMs) have introduced a novel dimension to document understanding, i.e., they endow large language models with visual comprehension capabilities; however, how to design a suitable image-text pre-training task for bridging the visual and language modality in document-level MLLMs remains underexplored. In this study, we introduce a novel visual-language alignment method that casts the key issue as a Visual Question Answering with Mask generation (VQAMask) task, optimizing two tasks simultaneously: VQA-based text parsing and mask generation. The former allows the model to implicitly align images and text at the semantic level. The latter introduces an additional mask generator (discarded during inference) to explicitly ensure alignment between visual texts within images and their corresponding image regions at a spatially-aware level. Together, they can prevent model hallucinations when parsing visual text and effectively promote spatially-aware feature representation learning. To support the proposed VQAMask task, we construct a comprehensive image-mask generation pipeline and provide a large-scale dataset with 6M data (MTMask6M). Subsequently, we demonstrate that introducing the proposed mask generation task yields competitive document-level understanding performance. Leveraging the proposed VQAMask, we introduce Marten, a training-efficient MLLM tailored for document-level understanding. Extensive experiments show that our Marten consistently achieves significant improvements …
Poster
Zhaoqing Zhu · Chuwei Luo · Zirui Shao · Feiyu Gao · Hangdi Xing · Qi Zheng · Ji Zhang

[ ExHall D ]

Abstract
Recent methods that integrate spatial layouts with text for document understanding in large language models (LLMs) have shown promising results. A commonly used method is to represent layout information as text tokens and interleave them with text content as inputs to the LLMs. However, such a method still demonstrates limitations, as it requires additional position IDs for tokens that are used to represent layout information. Due to the constraint on max position IDs, assigning them to layout information reduces those available for text content, reducing the capacity for the model to learn from the text during training, while also introducing a large number of potentially untrained position IDs during long-context inference, which can hinder performance on document understanding tasks. To address these issues, we propose LayTokenLLM, a simple yet effective method for document understanding. LayTokenLLM represents layout information as a single token per text segment and uses a specialized positional encoding scheme. It shares position IDs between text and layout tokens, eliminating the need for additional position IDs. This design maintains the model's capacity to learn from text while mitigating long-context issues during inference. Furthermore, a novel pre-training objective called Next Interleaved Text and Layout Token Prediction (NTLP) is devised to …
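To make the position-ID sharing concrete, here is a minimal sketch of how one layout token per text segment could reuse the position ID of that segment's first text token; the `<layout>` placeholder token and the exact sharing rule are our assumptions, not the released LayTokenLLM implementation.

```python
# Hedged sketch: one layout token per segment shares the position ID of the segment's
# first text token, so layout information consumes no extra position IDs.
def build_position_ids(segments):
    """segments: list of token lists, one per text segment."""
    input_tokens, position_ids, next_pos = [], [], 0
    for seg in segments:
        input_tokens.append("<layout>")      # single layout token for this segment
        position_ids.append(next_pos)        # shared with the first text token below
        for i, tok in enumerate(seg):
            input_tokens.append(tok)
            position_ids.append(next_pos + i)
        next_pos += len(seg)                 # position IDs advance only with text tokens
    return input_tokens, position_ids

toks, pos = build_position_ids([["Invoice", "No."], ["Total:", "$42"]])
print(list(zip(toks, pos)))
```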
Poster
Zhiyuan You · Xin Cai · Jinjin Gu · Tianfan Xue · Chao Dong

[ ExHall D ]

Abstract
With the rapid advancement of Multi-modal Large Language Models (MLLMs), MLLM-based Image Quality Assessment (IQA) methods have shown promising performance in linguistic quality description. However, current methods still fall short in accurately scoring image quality. In this work, we aim to leverage MLLMs to regress accurate quality scores. A key challenge is that the quality score is inherently continuous, typically modeled as a Gaussian distribution, whereas MLLMs generate discrete token outputs. This mismatch necessitates score discretization. Previous approaches discretize the mean score into a one-hot label, resulting in information loss and failing to capture inter-image relationships. We propose a distribution-based approach that discretizes the score distribution into a soft label. This method preserves the characteristics of the score distribution, achieving high accuracy and maintaining inter-image relationships. Moreover, to address dataset variation, where different IQA datasets exhibit various distributions, we introduce a fidelity loss based on Thurstone’s model. This loss captures intra-dataset relationships, facilitating co-training across multiple IQA datasets. With these designs, we develop the **Di**stribution-based **m**ulti-modal **i**mage **Q**uality **A**ssessment model (DimiQA). Experiments across multiple benchmarks show that DimiQA stably outperforms baselines in score regression. Also, DimiQA can predict the score distribution that closely aligns with human annotations. Codes and model …
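As an illustration of the distribution-based discretization described above, the sketch below converts a Gaussian score distribution into a soft label over discrete score levels; the five-level grid and bin boundaries are our assumptions rather than the paper's exact setup.

```python
# Minimal sketch of distribution-based score discretization (not the authors' code):
# a continuous quality score modeled as N(mu, sigma^2) becomes a soft label over
# discrete score tokens instead of a one-hot label at the mean.
import numpy as np
from scipy.stats import norm

def gaussian_to_soft_label(mu, sigma, levels=np.arange(1, 6)):
    """Give each discrete level the Gaussian probability mass of the bin around it
    (bin edges at midpoints between consecutive levels)."""
    edges = np.concatenate(([-np.inf], (levels[:-1] + levels[1:]) / 2, [np.inf]))
    cdf = norm.cdf(edges, loc=mu, scale=sigma)
    probs = np.diff(cdf)
    return probs / probs.sum()

# Example: a mean opinion score of 3.4 with std 0.6 spreads mass over levels 3 and 4,
# preserving inter-image relationships that a one-hot label would destroy.
print(gaussian_to_soft_label(3.4, 0.6))
```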
Poster
Mothilal Asokan · Kebin wu · Fatima Albreiki

[ ExHall D ]

Abstract
As a pioneering vision-language model, CLIP (Contrastive Language-Image Pre-training) has achieved significant success across various domains and a wide range of downstream vision-language tasks. However, the text encoders in popular CLIP models are limited to processing only 77 text tokens, which constrains their ability to effectively handle longer, detail-rich captions. Additionally, CLIP models often struggle to effectively capture detailed visual and textual information, which hampers their performance on tasks that require fine-grained analysis. To address these limitations, we present a novel approach, FineLIP, that extends the capabilities of CLIP. FineLIP enhances cross-modal text-image mapping by incorporating Fine-grained alignment with Longer text input within the CLIP-style framework. FineLIP first extends the positional embeddings to handle longer text, followed by the dynamic aggregation of local image and text tokens. The aggregated results are then used to enforce fine-grained token-to-token cross-modal alignment. We validate our model on datasets with long, detailed captions across two tasks: zero-shot cross-modal retrieval and text-to-image generation. Quantitative and qualitative experimental results demonstrate the effectiveness of FineLIP, outperforming existing state-of-the-art approaches. Furthermore, comprehensive ablation studies validate the benefits of key design elements within FineLIP.
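One ingredient above, extending CLIP's 77-token positional embeddings to longer captions, can be sketched as a simple interpolation over the position axis; FineLIP's actual extension scheme may differ, so treat this only as an illustrative baseline.

```python
# Hedged sketch: stretch a (77, dim) positional-embedding table to a longer text
# length by 1D linear interpolation along the position axis.
import torch
import torch.nn.functional as F

def extend_positional_embeddings(pos_emb: torch.Tensor, new_len: int) -> torch.Tensor:
    # pos_emb: (77, dim) -> (new_len, dim)
    pe = pos_emb.T.unsqueeze(0)                                    # (1, dim, 77)
    pe = F.interpolate(pe, size=new_len, mode="linear", align_corners=False)
    return pe.squeeze(0).T                                         # (new_len, dim)

longer = extend_positional_embeddings(torch.randn(77, 512), 248)
print(longer.shape)  # torch.Size([248, 512])
```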
Poster
Lucas Morin · Valery Weber · Ahmed Nassar · Gerhard Ingmar Meijer · Luc Van Gool · Yawei Li · Peter W. J. Staar

[ ExHall D ]

Abstract
The automated analysis of chemical literature holds promise to accelerate discovery in fields such as material science and drug development. In particular, search capabilities for chemical structures and Markush structures (chemical structure templates) within patent documents are valuable, e.g., for prior-art search. Advancements have been made in the automatic extraction of chemical structures from text and images, yet the Markush structures remain largely unexplored due to their complex multi-modal nature. In this work we present MarkushGrapher, a multi-modal approach for recognizing Markush structures in documents. Our method jointly encodes text, image, and layout information through a Vision-Text-Layout encoder and an Optical Chemical Structure Recognition vision encoder. These representations are merged and used to auto-regressively generate a sequential graph representation of the Markush structure along with a table defining its variable groups. To overcome the lack of real-world training data, we propose a synthetic data generation pipeline that produces a wide range of realistic Markush structures. Additionally, we present M2S, the first annotated benchmark of real-world Markush structures, to advance research on this challenging task. Extensive experiments demonstrate that our approach outperforms state-of-the-art chemistry-specific and general-purpose vision-language models in most evaluation settings. Code, models, and datasets will be available upon acceptance.
Poster
Andrea Maracani · Savas Ozkan · Sijun Cho · Hyo-Won Kim · Eunchung Noh · Jeongwon Min · Cho Jung Min · Dookun Park · Mete Ozay

[ ExHall D ]

Abstract
Scaling architectures have been proven effective for improving Scene Text Recognition (STR), but the individual contributions of vision encoder and text decoder scaling remain under-explored. In this work, we present an in-depth empirical analysis and demonstrate that, contrary to previous observations, scaling the decoder yields significant performance gains, always exceeding those achieved by encoder scaling alone. We also identify label noise as a key challenge in STR, particularly in real-world data, which can limit the effectiveness of STR models. To address this, we propose Cloze Self-Distillation (CSD), a method that mitigates label noise by distilling a student model from context-aware soft predictions and pseudolabels generated by a teacher model. Additionally, we enhance the decoder architecture by introducing differential cross-attention for STR. Our methodology achieves state-of-the-art performance on 10 out of 11 benchmarks using only real data, while significantly reducing the parameter size and computational costs.
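The core of distilling from a teacher's context-aware soft predictions is a standard soft-target loss; the temperature-scaled KL term below is a generic sketch of that ingredient and omits CSD's cloze-style masking and pseudolabel filtering.

```python
# Generic soft-target distillation term (a sketch of the distillation ingredient only).
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # student_logits, teacher_logits: (batch, vocab)
    t = temperature
    soft_targets = F.softmax(teacher_logits / t, dim=-1)
    log_student = F.log_softmax(student_logits / t, dim=-1)
    # temperature^2 keeps gradient magnitudes comparable to the hard-label loss
    return F.kl_div(log_student, soft_targets, reduction="batchmean") * (t * t)
```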
Poster
Xin Zhang · Robby T. Tan

[ ExHall D ]

Abstract
Vision Foundation Models (VFMs) and Vision-Language Models (VLMs) have gained traction in Domain Generalized Semantic Segmentation (DGSS) due to their strong generalization capabilities. However, existing DGSS methods often rely exclusively on either VFMs or VLMs, overlooking their complementary strengths. VFMs (e.g., DINOv2) excel at capturing fine-grained features, while VLMs (e.g., CLIP) provide robust text alignment but struggle with coarse granularity. Despite their complementary strengths, effectively integrating VFMs and VLMs with attention mechanisms is challenging, as the increased patch tokens complicate long-sequence modeling. To address this, we propose MFuser, a novel Mamba-based fusion framework that efficiently combines the strengths of VFMs and VLMs while maintaining linear scalability in token length. MFuser consists of two key components: MVFuser, which acts as a co-adapter to jointly fine-tune the two models by capturing both sequential and spatial dynamics; and MTEnhancer, a hybrid attention-Mamba module that refines text embeddings by incorporating image priors. Our approach achieves precise feature locality and strong text alignment without incurring significant computational overhead. Extensive experiments demonstrate that MFuser significantly outperforms state-of-the-art DGSS methods, achieving 68.19 mIoU on synthetic-to-real and 71.87 mIoU on real-to-real benchmarks. The code will be released upon acceptance.
Poster
Haoran Hao · Jiaming Han · Changsheng Li · Yu-Feng Li · Xiangyu Yue

[ ExHall D ]

Abstract
The development of large language models (LLMs) has significantly enhanced the capabilities of multimodal LLMs (MLLMs) as general assistants. However, the lack of user-specific knowledge still restricts their application in users' daily lives. In this paper, we introduce the **R**etrieval **A**ugmented **P**ersonalization (RAP) framework for MLLMs' personalization. Starting from a general MLLM, we turn it into a personalized assistant in three steps. (a) Remember: We design a key-value database to store user-related information, *e.g.*, the user's name, avatar and other attributes. (b) Retrieve: When the user initiates a conversation, RAP retrieves relevant information from the database using a multimodal retriever. (c) Generate: The input query and retrieved concepts' information are fed into MLLMs to generate personalized, knowledge-augmented responses. Unlike previous methods, RAP allows real-time concept editing via updating the external database. To further improve generation quality and alignment with user-specific information, we design a pipeline for data collection and create a specialized dataset for personalized training of MLLMs. Based on the dataset, we train a series of MLLMs as personalized multimodal assistants. By pretraining on a large-scale dataset, RAP-MLLMs can generalize to infinite visual concepts without additional finetuning. Our models demonstrate outstanding flexibility and generation quality across a variety of tasks, such …
Poster
Omri Kaduri · Shai Bagon · Tali Dekel

[ ExHall D ]

Abstract
Vision-Language Models (VLMs) have recently demonstrated remarkable capabilities in comprehending complex visual content. However, the mechanisms underlying how VLMs process visual information remain largely unexplored. In this paper, we conduct a thorough empirical analysis, focusing on the attention modules across layers, by which we reveal several key insights about how these models process visual data: (i) the internal representation of the query tokens (e.g., representations of "describe the image"), is utilized by the model to store global image information; we demonstrate that the model generates surprisingly descriptive responses solely from these tokens, without direct access to image tokens. (ii) Cross-modal information flow is predominantly influenced by the middle layers (approximately 25% of all layers), while early and late layers contribute only marginally. (iii) Fine-grained visual attributes and object details are directly extracted from image tokens in a spatially localized manner, i.e., the generated tokens associated with a specific object or attribute attend strongly to their corresponding regions in the image. We propose novel quantitative evaluation to validate our observations, leveraging real-world complex visual scenes. Finally, we demonstrate the potential of our findings in facilitating efficient visual processing in state-of-the-art VLMs.
Poster
Zhihe Yang · Xufang Luo · Dongqi Han · Yunjian Xu · Dongsheng Li

[ ExHall D ]

Abstract
Hallucination remains a major challenge for Large Vision-Language Models (LVLMs). Direct Preference Optimization (DPO) has gained increasing attention as a simple solution to hallucination issues. It directly learns from constructed preference pairs that reflect the severity of hallucinations in responses to the same prompt and image. Nonetheless, different data construction methods in existing works bring notable performance variations. We identify a crucial factor here: outcomes are largely contingent on whether the constructed data aligns on-policy w.r.t. the initial (reference) policy of DPO. Theoretical analysis suggests that learning from off-policy data is impeded by the presence of KL-divergence between the updated policy and the reference policy. From the perspective of dataset distribution, we systematically summarize the inherent flaws in existing algorithms that employ DPO to address hallucination issues. To alleviate the problems, we propose the On-Policy Alignment (OPA)-DPO framework, which uniquely leverages expert feedback to correct hallucinated responses and aligns both the original and expert-revised responses in an on-policy manner. Notably, with only 4.8k data, OPA-DPO achieves an additional reduction in the hallucination rate of LLaVA-1.5-7B: 13.26% on the AMBER benchmark and 5.39% on the Object-Hal benchmark, compared to the previous SOTA algorithm trained with 16k samples.
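For readers less familiar with the preference objective the abstract builds on, the snippet below is a minimal sketch of the standard DPO loss, not the OPA-specific data pipeline: log-probabilities of the preferred (less hallucinated) and rejected responses are compared under the current policy and a frozen reference policy.

```python
# Standard DPO preference loss (sketch of the objective the abstract refers to).
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # implicit reward margins measured against the frozen reference policy
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    # maximize the gap between preferred and rejected responses
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```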
Poster
Chenxin Tao · Shiqian Su · Xizhou Zhu · Chenyu Zhang · Zhe Chen · Jiawen Liu · Wenhai Wang · Lewei Lu · Gao Huang · Yu Qiao · Jifeng Dai

[ ExHall D ]

Abstract
The rapid advance of Large Language Models (LLMs) has catalyzed the development of Vision-Language Models (VLMs). Monolithic VLMs, which avoid modality-specific encoders, offer a promising alternative to the compositional ones but face the challenge of inferior performance. Most existing monolithic VLMs require tuning pre-trained LLMs to acquire vision abilities, which may degrade their language capabilities. To address this dilemma, this paper presents a novel high-performance monolithic VLM named HoVLE. We note that LLMs have been shown capable of interpreting images, when image embeddings are aligned with text embeddings. The challenge for current monolithic VLMs actually lies in the lack of a holistic embedding module for both vision and language inputs. Therefore, HoVLE introduces a holistic embedding module that converts visual and textual inputs into a shared space, allowing LLMs to process images in the same way as texts. Furthermore, a multi-stage training strategy is carefully designed to empower the holistic embedding module. It is first trained to distill visual features from a pre-trained vision encoder and text embeddings from the LLM, enabling large-scale training with unpaired random images and text tokens. The whole model further undergoes next-token prediction on multi-modal data to align the embeddings. Finally, an instruction-tuning stage is …
Poster
Bo Tong · Bokai Lai · Yiyi Zhou · Luo · Yunhang Shen · Ke Li · Xiaoshuai Sun · Rongrong Ji

[ ExHall D ]

Abstract
Despite a big leap forward in capability, multimodal large language models (MLLMs) tend to behave like a sloth in practical use, i.e., slow response and large latency. Recent efforts are devoted to building tiny MLLMs for better efficiency, but the plethora of visual tokens they still use limits their actual speedup. In this paper, we propose a powerful and fast tiny MLLM called FlashSloth. Different from previous efforts, FlashSloth focuses on improving the descriptive power of visual tokens while compressing their redundant semantics. In particular, FlashSloth introduces embedded visual compression designs to capture both visually salient and instruction-related image information, so as to achieve superior multimodal performance with fewer visual tokens. Extensive experiments are conducted to validate the proposed FlashSloth, and a number of tiny but strong MLLMs are also comprehensively compared, e.g., InternVL-2, MiniCPM-V2 and Qwen2-VL. The experimental results show that, compared with these advanced tiny MLLMs, our FlashSloth can greatly reduce the number of visual tokens, training memory and computation complexity while retaining high performance on various VL tasks. Our code is anonymously released at: https://anonymous.4open.science/r/FlashSloth/.
Poster
Xinyu Tian · Shu Zou · Zhaoyuan Yang · Jing Zhang

[ ExHall D ]

Abstract
The evolution of Large Vision-Language Models (LVLMs) has progressed from single-image understanding to multi-image reasoning. Despite this advancement, our findings indicate that LVLMs struggle to robustly utilize information across multiple images, with predictions significantly affected by the alteration of image positions. To further explore this issue, we introduce Position-wise Question Answering (PQA), a meticulously designed task to quantify reasoning capabilities at each position. Our analysis reveals a pronounced position bias in LVLMs: open-source models excel in reasoning with images positioned later but underperform with those in the middle or at the beginning, while proprietary models like GPT-4o show improved comprehension for images at the beginning and end but struggle with those in the middle. Motivated by these insights, we propose SoFt Attention (SoFA), a simple, training-free approach that mitigates this bias by employing linear interpolation between inter-image causal attention and bidirectional counterparts. Experimental results demonstrate that SoFA effectively reduces position bias and significantly enhances the reasoning performance of existing LVLMs.
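A minimal sketch of the interpolation idea as we read it: for image-token pairs, the causal attention mask is blended with a bidirectional one so that later images can also inform earlier ones; the blending weight `alpha` and the mask construction are illustrative assumptions, not the released SoFA code.

```python
# Hedged sketch: blend causal and bidirectional masks for inter-image attention only.
import torch

def soft_attention_bias(seq_len, image_positions, alpha=0.5):
    """Return an additive attention bias (log of a soft mask in [0, 1])."""
    causal = torch.tril(torch.ones(seq_len, seq_len))   # 1 = attend, 0 = blocked
    bidir = torch.ones(seq_len, seq_len)
    img = torch.zeros(seq_len, dtype=torch.bool)
    img[image_positions] = True
    inter_image = img[:, None] & img[None, :]           # image-token pairs only
    blend = causal.clone()
    blend[inter_image] = (1 - alpha) * causal[inter_image] + alpha * bidir[inter_image]
    # convert the soft mask to an additive bias for attention logits
    return torch.log(blend.clamp_min(1e-9))

bias = soft_attention_bias(seq_len=10, image_positions=[1, 2, 6, 7], alpha=0.5)
```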
Poster
Dhouib Mohamed · Davide Buscaldi · Vanier Sonia · Aymen Shabou

[ ExHall D ]

Abstract
Visual Language Models require substantial computational resources for inference due to the additional input tokens needed to represent visual information. However, these visual tokens often contain redundant and unimportant information, resulting in an unnecessarily high number of tokens. To address this, we introduce PACT, a method that reduces inference time and memory usage by pruning irrelevant tokens and merging visually redundant ones at an early layer of the language model. Our approach uses a novel importance metric to identify unimportant tokens without relying on attention scores, making it compatible with FlashAttention. We also propose a novel clustering algorithm, called Distance Bounded Density Peak Clustering, which efficiently clusters visual tokens while constraining the distances between elements within a cluster by a predefined threshold. We demonstrate the effectiveness of PACT through extensive experiments.
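The sketch below illustrates early-layer pruning plus merging of visual tokens in the spirit described above, with deliberate simplifications: importance is approximated by hidden-state norm, which is not the paper's attention-free metric, and merging uses plain cosine-similarity grouping instead of Distance Bounded Density Peak Clustering.

```python
# Rough sketch of visual-token pruning and merging (simplified stand-ins throughout).
import torch

def prune_and_merge(visual_tokens, keep_ratio=0.5, merge_threshold=0.9):
    # visual_tokens: (num_tokens, dim)
    importance = visual_tokens.norm(dim=-1)              # stand-in importance metric
    k = max(1, int(keep_ratio * visual_tokens.shape[0]))
    kept = visual_tokens[importance.topk(k).indices]
    # merge highly similar kept tokens by averaging each similarity group
    sim = torch.nn.functional.cosine_similarity(kept[:, None], kept[None, :], dim=-1)
    merged, used = [], torch.zeros(k, dtype=torch.bool)
    for i in range(k):
        if used[i]:
            continue
        group = (sim[i] > merge_threshold) & ~used
        group[i] = True
        used |= group
        merged.append(kept[group].mean(dim=0))
    return torch.stack(merged)

reduced = prune_and_merge(torch.randn(576, 1024))
```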
Poster
Long Xing · Qidong Huang · Xiaoyi Dong · Jiajie Lu · Pan Zhang · Yuhang Zang · Yuhang Cao · Conghui He · Jiaqi Wang · Feng Wu · Dahua Lin

[ ExHall D ]

Abstract
In large vision-language models (LVLMs), images serve as inputs that carry a wealth of information. As the idiom "A picture is worth a thousand words" implies, representing a single image in current LVLMs can require hundreds or even thousands of tokens. This results in significant computational costs, which grow quadratically as input image resolution increases, thereby severely impacting efficiency. Previous approaches have attempted to reduce the number of image tokens either before or within the early layers of LVLMs. However, these strategies inevitably result in the loss of crucial image information. To address this challenge, we conduct an empirical study revealing that all visual tokens are necessary for LVLMs in the shallow layers, and token redundancy progressively increases in the deeper layers. To this end, we propose ViCo, a conical-style visual concentration strategy for LVLMs to boost their efficiency in both training and inference with negligible performance loss. Specifically, we partition the LVLM into several stages and drop part of the image tokens at the end of each stage with a pre-defined ratio. The dropping is based on a lightweight similarity calculation with a negligible time overhead. Extensive experiments demonstrate that ViCo can achieve over 40% training time reduction and …
Poster
Le Zhang · Qian Yang · Aishwarya Agrawal

[ ExHall D ]

Abstract
How well are unimodal vision and language models aligned? Although prior work has approached answering this question, their assessment methods do not directly translate to how these models are used in practical vision-language tasks. In this paper, we propose a direct assessment method, inspired by linear probing, to assess vision-language alignment. We identify that the degree of alignment of SSL vision models depends on their SSL training objective, and we find that the clustering quality of SSL representations has a stronger impact on alignment performance than their linear separability. Next, we introduce Swift Alignment of Image and Language (SAIL), an efficient transfer learning framework that aligns pretrained unimodal vision and language models for downstream vision-language tasks. Since SAIL leverages the strengths of pretrained unimodal models, it requires significantly fewer (6%) paired image-text data for multimodal alignment compared to models like CLIP which are trained from scratch. SAIL training only requires a single A100 GPU, 5 hours of training and can accommodate a batch size up to 32,768. SAIL achieves 73.4% zero-shot accuracy on ImageNet (vs. CLIP's 72.7%) and excels in zero-shot retrieval, complex reasoning, and semantic segmentation. Additionally, SAIL improves the language-compatibility of vision encoders that in turn …
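To make the transfer-learning recipe above concrete, here is a compact sketch of aligning frozen unimodal encoders with small trainable projection heads and a CLIP-style contrastive loss; the layer sizes, temperature initialization, and symmetric loss are generic placeholders rather than SAIL's exact design.

```python
# Hedged sketch: only the projection heads are trained; both backbones stay frozen.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignmentHead(nn.Module):
    def __init__(self, vis_dim=1024, txt_dim=768, shared_dim=512):
        super().__init__()
        self.v_proj = nn.Linear(vis_dim, shared_dim)
        self.t_proj = nn.Linear(txt_dim, shared_dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~log(1/0.07)

    def forward(self, frozen_vis_feats, frozen_txt_feats):
        v = F.normalize(self.v_proj(frozen_vis_feats), dim=-1)
        t = F.normalize(self.t_proj(frozen_txt_feats), dim=-1)
        logits = self.logit_scale.exp() * v @ t.T
        labels = torch.arange(v.shape[0])
        # symmetric image-to-text and text-to-image contrastive loss
        return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

head = AlignmentHead()
loss = head(torch.randn(32, 1024), torch.randn(32, 768))
```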
Poster
Ke Zhu · Yu Wang · Yanpeng Sun · Qiang Chen · Jiang-Jiang Liu · Gang Zhang · Jingdong Wang

[ ExHall D ]

Abstract
Multimodal RLHF usually happens after the supervised finetuning (SFT) stage to continually improve vision-language models' (VLMs) comprehension. Conventional wisdom holds its superiority over continual SFT during this preference alignment stage. In this paper, we observe that the inherent value of multimodal RLHF lies in its negative supervision, i.e., the logits of the rejected responses. We thus propose a novel negative supervised finetuning (nSFT) approach that fully exploits the information residing there. Our nSFT disentangles this negative supervision from the RLHF paradigm and continually aligns VLMs with a simple SFT loss. This is more memory efficient than multimodal RLHF, where 2 (e.g., DPO) or 4 (e.g., PPO) large VLMs are strictly required. The effectiveness of nSFT is rigorously proved by comparing it with various multimodal RLHF approaches, across different dataset sources, base VLMs and evaluation metrics. In addition, extensive ablations are provided to support our hypothesis. We hope this paper will stimulate further research to properly align large vision language models.
Poster
Hao Yin · Guangzong Si · Zilei Wang

[ ExHall D ]

Abstract
Contrastive decoding strategies are widely used to mitigate object hallucinations in multimodal large language models (MLLMs). By reducing over-reliance on language priors, these strategies ensure that generated content remains closely grounded in visual inputs, producing contextually accurate outputs. Since contrastive decoding requires no additional training or external tools, it offers both computational efficiency and versatility, making it highly attractive. However, these methods present two main limitations: (1) bluntly suppressing language priors can compromise coherence and accuracy of generated content, and (2) processing contrastive inputs adds computational load, significantly slowing inference speed. To address these challenges, we propose Visual Amplification Fusion (VAF), a plug-and-play technique that enhances attention to visual signals within the model’s middle layers, where modality fusion predominantly occurs. This approach enables more effective capture of visual features, reducing the model’s bias toward language modality. Experimental results demonstrate that VAF significantly reduces hallucinations across various MLLMs without affecting inference speed, while maintaining coherence and accuracy in generated outputs.
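The fusion rule is not specified in the abstract, so the snippet below is only a hedged sketch of one way to amplify attention toward visual tokens in selected middle layers: a positive bias (the hypothetical `gamma`) is added to the attention logits at visual key positions before the softmax.

```python
# Hedged sketch of visual-signal amplification at a fusion layer (gamma is illustrative).
import torch

def amplify_visual_attention(attn_logits, visual_mask, gamma=1.0):
    # attn_logits: (..., seq_len) scores over key positions
    # visual_mask: bool (seq_len,) marking which key positions are visual tokens
    return torch.softmax(attn_logits + gamma * visual_mask.float(), dim=-1)

logits = torch.randn(8, 16, 64)                 # (heads, queries, keys)
vis = torch.zeros(64, dtype=torch.bool); vis[:32] = True
weights = amplify_visual_attention(logits, vis)
```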
Poster
Le Yang · Ziwei Zheng · Boxu Chen · Zhengyu Zhao · Chenhao Lin · Chao Shen

[ ExHall D ]

Abstract
Recent studies have shown that large vision-language models (LVLMs) often suffer from the issue of object hallucinations (OH). To mitigate this issue, we introduce an efficient method that edits the model weights based on an unsafe subspace, which we call HalluSpace in this paper. With truthful and hallucinated text prompts accompanying the visual content as inputs, the HalluSpace can be identified by extracting the hallucinated embedding features and removing the truthful representations in LVLMs. By orthogonalizing the model weights, input features will be projected into the null space of the HalluSpace to reduce OH, based on which we name our method Nullu. We reveal that HalluSpaces generally contain statistical bias and unimodal priors of the large language models (LLMs) applied to build LVLMs, which have been shown to be essential causes of OH in previous studies. Therefore, null space projection suppresses the LLMs' priors to filter out the hallucinated features, resulting in contextually accurate outputs. Experiments show that our method can effectively mitigate OH across different LVLM families without extra inference costs and also show strong performance in general LVLM benchmarks. Codes will be released.
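A minimal sketch of null-space weight editing in the spirit described above, under our own simplifications: the HalluSpace is approximated by the top singular directions of the difference between hallucinated and truthful embedding features, and only a single weight matrix is edited.

```python
# Hedged sketch: project weight rows onto the null space of an estimated hallucination subspace.
import torch

def edit_weights_nullspace(W, hallu_feats, truth_feats, rank=8):
    """W: (out_dim, in_dim); hallu_feats, truth_feats: (n, in_dim) embedding features."""
    diff = hallu_feats - truth_feats
    # top-`rank` right singular vectors span the (assumed) hallucination subspace
    _, _, Vh = torch.linalg.svd(diff, full_matrices=False)
    V = Vh[:rank].T                                  # (in_dim, rank), orthonormal columns
    P = torch.eye(W.shape[1]) - V @ V.T              # projector onto the null space
    return W @ P

W = torch.randn(4096, 1024)
hallu = torch.randn(256, 1024)
truth = torch.randn(256, 1024)
W_edited = edit_weights_nullspace(W, hallu, truth)
```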
Poster
Yuanchen Wu · Lu Zhang · Hang Yao · Junlong Du · Ke Yan · Shouhong Ding · Yunsheng Wu · Xiaoqiang Li

[ ExHall D ]

Abstract
Large Vision-Language Models (LVLMs) have achieved impressive results across various multi-modal tasks. However, hallucinations, i.e., the models generating counterfactual responses, remain a challenge. Though recent studies have attempted to alleviate object perception hallucinations, they focus on the models' response generation, overlooking the task question itself. This paper discusses the vulnerability of LVLMs in solving counterfactual presupposition questions (CPQs), where the models are prone to accept the presuppositions of counterfactual objects and produce severe hallucinatory responses. To this end, we introduce "Antidote," a unified, synthetic data-driven post-training framework for mitigating both types of hallucination above. It leverages synthetic data to incorporate factual priors into questions to achieve self-correction, and to decouple the mitigation process into a preference optimization problem. Furthermore, we construct "CP-Bench," a novel benchmark to evaluate LVLMs' ability to correctly handle CPQs and produce factual responses. Applied to the LLaVA series, Antidote can simultaneously enhance performance on CP-Bench by over 50%, POPE by 1.8-3.3%, and CHAIR & SHR by 30-50%, all without relying on external supervision from stronger LVLMs or human feedback and without introducing noticeable catastrophic forgetting issues.
Poster
Zhenting Wang · Shuming Hu · Shiyu Zhao · Xiaowen Lin · Felix Juefei-Xu · Zhuowei Li · Ligong Han · Harihar Subramanyam · Li Chen · Jianfa Chen · nan jiang · Lingjuan Lyu · Shiqing Ma · Dimitris N. Metaxas · Ankit Jain

[ ExHall D ]

Abstract
Image content safety has become a significant challenge with the rise of visual media on online platforms. Meanwhile, in the age of AI-generated content (AIGC), many image generation models are capable of producing harmful content, such as images containing sexual or violent material. Thus, it becomes crucial to identify such unsafe images based on established safety rules. Pre-trained Multimodal Large Language Models (MLLMs) offer potential in this regard, given their strong pattern recognition abilities. Existing approaches typically fine-tune MLLMs with human-labeled datasets, which however brings a series of drawbacks. First, relying on human annotators to label data following intricate and detailed guidelines is both expensive and labor-intensive. Furthermore, users of safety judgment systems may need to frequently update safety rules, making fine-tuning on human-based annotation more challenging. This raises the research question: Can we detect unsafe images by querying MLLMs in a zero-shot setting using a predefined safety constitution (a set of safety rules)? Our research showed that simply querying pre-trained MLLMs does not yield satisfactory results. This lack of effectiveness stems from factors such as the subjectivity of safety rules, the complexity of lengthy constitutions, and the inherent biases in the models. To address these challenges, we propose a …
Poster
Yuan-Hong Liao · Rafid Mahmood · Sanja Fidler · David Acuna

[ ExHall D ]

Abstract
Improving semantic grounding in Vision-Language Models (VLMs) often involves collecting domain-specific training data, refining the network architectures, or modifying the training recipes. In this work, we venture into an orthogonal direction and explore self-correction in VLMs, focusing on semantic grounding. We find that VLMs can correct their own semantic grounding mistakes when properly prompted and framed for the task, without any fine-tuning or even access to oracle feedback. We also introduce an iterative self-correction framework and show that it consistently improves VLM performance in semantic grounding by up to 8.4 accuracy points across all models investigated, without requiring fine-tuning, additional architectural changes, or external data. Our exploration of self-correction also reveals that, even after several rounds of feedback, strong models like GPT-4V and GPT-4o retain limited capability in leveraging oracle feedback, suggesting promising directions for further research.
Poster
Peng Xie · Yequan Bie · Jianda Mao · Yangqiu Song · Yang Wang · Hao Chen · Kani Chen

[ ExHall D ]

Abstract
Pre-trained vision-language models (VLMs) have showcased remarkable performance in image and natural language understanding, such as image captioning and response generation. As the practical applications of vision-language models become increasingly widespread, their potential safety and robustness issues raise concerns that adversaries may evade the system and cause these models to generate toxic content through malicious attacks. Therefore, evaluating the robustness of open-source VLMs against adversarial attacks has garnered growing attention, with transfer-based attacks as a representative black-box attacking strategy. However, most existing transfer-based attacks neglect the importance of the semantic correlations between vision and text modalities, leading to sub-optimal adversarial example generation and attack performance. To address this issue, we present Chain of Attack (CoA), which iteratively enhances the generation of adversarial examples based on multi-modal semantic updates through a series of intermediate attacking steps, achieving superior adversarial transferability and efficiency. A unified attack success rate computing method is further proposed for automatic evasion evaluation. Extensive experiments conducted under the most realistic and high-stakes scenario demonstrate that our attacking strategy is able to effectively mislead models into generating targeted responses using only black-box attacks, without any knowledge of the victim models. The comprehensive robustness evaluation in our paper provides …
Poster
Sanghwan Kim · Rui Xiao · Iuliana Georgescu · Stephan Alaniz · Zeynep Akata

[ ExHall D ]

Abstract
Vision-Language Models (VLMs) trained with contrastive loss have achieved significant advancements in various vision and language tasks. However, the global nature of contrastive loss makes VLMs focus predominantly on foreground objects, neglecting other crucial information in the image, which limits their effectiveness in downstream tasks. To address these challenges, we propose COSMOS: CrOSs-MOdality Self-distillation for vision-language pre-training that integrates novel text-cropping strategy and cross-attention module into self-supervised learning framework. We create global and local views of images and texts (i.e., multi-modal augmentations), which are essential for self-distillation in VLMs. We further introduce a cross-attention module, enabling COSMOS to learn comprehensive cross-modal representations optimized via a cross-modality self-distillation loss. COSMOS consistently outperforms previous strong baselines on various zero-shot downstream tasks including retrieval, classification, and semantic segmentation. Additionally, it surpasses CLIP-based models trained on larger datasets in visual perception and contextual understanding tasks.
Poster
Ziliang Chen · Xin Huang · Xiaoxuan Fan · Keze Wang · Yuyu Zhou · Quanlong Guan · Liang Lin

[ ExHall D ]

Abstract
Contrastive Language-Image Pre-training (CLIP) models are a milestone of modern multimodal intelligence, and their generalization mechanism has attracted massive research interest in the community. Existing studies, however, are limited to the scope of pre-training knowledge and hardly explain generalization to the countless open-world concepts absent from the pre-training regime. This paper dives into this Out-of-Pre-training (OOP) generalization problem from a holistic perspective. We propose the LAION-Beyond benchmark to isolate the evaluation of OOP concepts from pre-training knowledge, with regard to OpenCLIP and its reproducible variants derived from LAION datasets. Empirical analysis shows that although image features of OOP concepts exhibit significant category margins, their zero-shot transfer fails due to poor image-text alignment. To this end, we elaborate the "name-tuning" methodology and its theoretical merits in terms of OOP generalization, then propose few-shot name learning (FSNL) and zero-shot name learning (ZSNL) algorithms to achieve OOP generalization in a data-efficient manner. Their superiority has been further verified in our comprehensive experiments.
Poster
Chong Yu · Tao Chen · Zhongxue Gan

[ ExHall D ]

Abstract
Vision-language models (VLMs) are among the most important models for multi-modal tasks. Real industrial applications often meet the challenge of adapting VLMs to different scenarios, such as varying hardware platforms or performance requirements. Traditional methods involve training or fine-tuning to adapt multiple unique VLMs, or using model compression techniques to create multiple compact models. These approaches are complex and resource-intensive. This paper introduces a novel paradigm called Once-Tuning-Multiple-Variants (OTMV). OTMV requires only a single tuning process to inject dynamic weight expansion capacity into the VLM. This tuned VLM can then be expanded into multiple variants tailored for different scenarios at inference time. The tuning mechanism of OTMV is inspired by the mathematical series expansion theorem, which helps to reduce the parameter size and memory requirements while maintaining accuracy for the VLM. Experiment results show that OTMV-tuned models achieve accuracy comparable to baseline VLMs across various visual-language tasks. The experiments also demonstrate the dynamic expansion capability of OTMV-tuned VLMs, outperforming traditional model compression and adaptation techniques in terms of accuracy and efficiency.
Poster
Shihan Wu · Ji Zhang · Pengpeng Zeng · Lianli Gao · Jingkuan Song · Heng Tao Shen

[ ExHall D ]

Abstract
Prompt tuning (PT) has long been recognized as an effective and efficient paradigm for transferring large pre-trained vision-language models (VLMs) to downstream tasks by learning a tiny set of context vectors. Nevertheless, in this work, we reveal that freezing the parameters of VLMs during learning the context vectors neither facilitates the transferability of pre-trained knowledge nor improves the memory and time efficiency significantly. Upon further investigation, we find that reducing both the length and width of the feature-gradient propagation flows of the full fine-tuning (FT) baseline is key to achieving effective and efficient knowledge transfer. Motivated by this, we propose Skip Tuning, a novel paradigm for adapting VLMs to downstream tasks. Unlike existing PT or adapter-based methods, Skip Tuning applies Layer-wise Skipping (LSkip) and Class-wise Skipping (CSkip) upon the FT baseline without introducing extra context vectors or adapter modules. Extensive experiments across a wide spectrum of benchmarks demonstrate the superior effectiveness and efficiency of our Skip Tuning over both PT and adapter-based methods. Code: https://github.com/anonymity-007/SkipT.
Poster
Qi Zhu · Jiangwei Lao · Deyi Ji · Junwei Luo · Kang Wu · Yingying Zhang · Lixiang Ru · Jian Wang · Jingdong Chen · Ming Yang · Dong Liu · Feng Zhao

[ ExHall D ]

Abstract
Open-world interpretation aims to accurately localize and recognize all objects within images by vision-language models (VLMs). While substantial progress has been made in this task for natural images, the advancements for remote sensing (RS) images still remain limited, primarily due to these two challenges. 1) Existing RS semantic categories are limited, particularly for pixel-level interpretation datasets. 2) Distinguishing among diverse RS spatial regions solely by language space is challenging due to the dense and intricate spatial distribution in open-world RS imagery. To address the first issue, we develop a fine-grained RS interpretation dataset, Sky-SA, which contains 183,375 high-quality local image-text pairs with full-pixel manual annotations, covering 1,763 category labels, exhibiting richer semantics and higher density than previous datasets. Afterwards, to solve the second issue, we introduce the vision-centric principle for vision-language modeling. Specifically, in the pre-training stage, the visual self-supervised paradigm is incorporated into image-text alignment, reducing the degradation of general visual representation capabilities of existing paradigms. Then, we construct a visual-relevance knowledge graph across open-category texts and further develop a novel vision-centric image-text contrastive loss for fine-tuning with text prompts. This new model, denoted as SkySense-O, demonstrates impressive zero-shot capabilities on a thorough evaluation encompassing 14 datasets over 4 …
Poster
Fusheng Hao · Fengxiang He · Fuxiang Wu · Tichao Wang · Chengqun Song · Jun Cheng

[ ExHall D ]

Abstract
Prompt learning has attracted widespread attention in adapting vision-language models to downstream tasks. Existing methods largely rely on optimization strategies to ensure the task-awareness of learnable prompts. Due to the scarcity of task-specific data, overfitting is prone to occur. The resulting prompts often do not generalize well or exhibit limited task-awareness. To address this issue, we propose a novel Task-Aware Clustering (TAC) framework for prompting vision-language models, which increases the task-awareness of learnable prompts by introducing task-aware pre-context. The key ingredients are as follows: (a) generating task-aware pre-context based on task-aware clustering, which can preserve the backbone structure of a downstream task with only a few clustering centers, (b) enhancing the task-awareness of learnable prompts by enabling them to interact with the task-aware pre-context via the well-pretrained encoders, and (c) preventing the visual task-aware pre-context from interfering with the interaction between patch embeddings via a masked attention mechanism. Extensive experiments are conducted on benchmark datasets, covering the base-to-novel, domain generalization, and cross-dataset transfer settings. Ablation studies validate the effectiveness of key ingredients. Comparative results show the superiority of our TAC over competitive counterparts. The code will be made publicly available.
Poster
Yuxin Fan · Junbiao Cui · Jiye Liang

[ ExHall D ]

Abstract
Traditional semi-supervised learning achieves significant success in closed-world scenarios. To better align with the openness of the real world, researchers propose open-world semi-supervised learning (OWSSL), which enables models to effectively recognize known and unknown classes even without labels for unknown classes. Recently, researchers have attempted to enhance the model performance in recognizing visually similar classes by integrating textual information. However, these attempts do not effectively align images with text, resulting in limited improvements in model performance. In response to this challenge, we propose a novel OWSSL method. By adopting a global-and-local textual prompt learning strategy to enhance image-text alignment effectiveness, and implementing a forward-and-backward strategy to reduce noise in image-text matching for unlabeled samples, we ultimately enhance the model’s ability to extract and recognize discriminative features across different classes. Experimental results on multiple fine-grained datasets demonstrate that our method achieves significant performance improvements compared to state-of-the-art methods.
Poster
Taha Koleilat · Hojat Asgariandehkordi · Hassan Rivaz · Yiming Xiao

[ ExHall D ]

Abstract
Recent advancements in vision-language models (VLMs), such as CLIP, have demonstrated substantial success in self-supervised representation learning for vision tasks. However, effectively adapting VLMs to downstream applications remains challenging, as their accuracy often depends on time-intensive and expertise-demanding prompt engineering, while full model fine-tuning is costly. This is particularly true for biomedical images, which, unlike natural images, typically suffer from limited annotated datasets, unintuitive image contrasts, and nuanced visual features. Recent prompt learning techniques, such as Context Optimization (CoOp) intend to tackle these issues, but still fall short in generalizability. Meanwhile, explorations in prompt learning for biomedical image analysis are still highly limited. In this work, we propose BiomedCoOp, a novel prompt learning framework that enables efficient adaptation of BiomedCLIP for accurate and highly generalizable few-shot biomedical image classification. Our approach achieves effective prompt context learning by leveraging semantic consistency with average prompt ensembles from Large Language Models (LLMs) and knowledge distillation with a statistics-based prompt selection strategy. We conducted comprehensive validation of our proposed framework on 11 medical datasets across 9 modalities and 10 organs against existing state-of-the-art methods, demonstrating significant improvements in both accuracy and generalizability. The code will be publicly available upon acceptance of the submission.
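One ingredient described above, averaging prompt ensembles generated by an LLM, can be sketched as follows; the stand-in `text_encoder`, the prompt lists, and the plain mean are illustrative assumptions rather than the BiomedCoOp implementation.

```python
# Hedged sketch: class embeddings as the normalized mean of several LLM-generated prompts.
import torch
import torch.nn.functional as F

def ensemble_class_embeddings(text_encoder, class_prompts):
    # class_prompts: dict {class_name: [prompt_1, prompt_2, ...]}
    embs = {}
    for name, prompts in class_prompts.items():
        e = torch.stack([text_encoder(p) for p in prompts])   # (n_prompts, dim)
        embs[name] = F.normalize(e.mean(dim=0), dim=-1)
    return embs

# usage with a stand-in encoder (a real CLIP-style text encoder would go here)
text_encoder = lambda prompt: torch.randn(512)
class_embs = ensemble_class_embeddings(
    text_encoder,
    {"melanoma": ["a dermoscopic image of melanoma", "a skin lesion showing melanoma"]},
)
```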
Poster
Giorgos Kordopatis-Zilos · Vladan Stojnić · Anna Manko · Pavel Suma · Nikolaos-Antonios Ypsilantis · Nikos Efthymiadis · Zakaria Laskar · Jiri Matas · Ondrej Chum · Giorgos Tolias

[ ExHall D ]

Abstract
This work introduces ILIAS, a new test dataset for Instance-Level Image retrieval At Scale. It is designed to evaluate the ability of current and future foundation models and retrieval techniques to recognize particular objects. The key benefits over existing datasets include large scale, domain diversity, accurate ground truth, and a performance that is far from saturated. ILIAS includes query and positive images for 1,000 object instances, manually collected to capture challenging conditions and diverse domains. Large-scale retrieval is conducted against 100 million distractor images from YFCC100M. To avoid false negatives without extra annotation effort, we include only query objects confirmed to have emerged after 2014, i.e. the compilation date of YFCC100M. An extensive benchmarking is performed with the following observations: i) models fine-tuned on specific domains, such as landmarks or products, excel in that domain but fail on ILIAS, ii) learning a linear adaptation layer using multi-domain class supervision results in performance improvements, especially for vision-and-language models, iii) local descriptors in retrieval re-ranking are still a key ingredient, especially in the presence of severe background clutter, iv) the text-to-image performance of the vision-language foundation models is surprisingly close to the corresponding image-to-image case.
Poster
Vishwesh Nath · Wenqi Li · Dong Yang · Andriy Myronenko · Yao Lu · Zhijian Liu · Danny Yin · Yucheng Tang · Pengfei Guo · Ziyue Xu · Can Zhao · Yufan He · Greg Heinrich · Mingxin Zheng · Benjamin D. Simon · Stephanie Anne Harmon · Michael Zephyr · Marc Edgar · Stephen R. Aylward · Pavlo Molchanov · Yan Mee LAW · Baris Turkbey · Holger R. Roth · Daguang Xu

[ ExHall D ]

Abstract
Generalist vision language models (VLMs) have made significant strides in computer vision, but they fall short in specialized fields like healthcare, where expert knowledge is essential. Current large multimodal models like Gemini and GPT-4o are insufficient for medical tasks due to their reliance on memorized internet knowledge rather than the nuanced expertise required in healthcare. Meanwhile, existing medical VLMs (e.g. Med-Gemini) often lack expert consultation as part of their design, and many rely on outdated, static datasets that were not created with modern, large deep learning models in mind. VLMs are usually trained in three stages: vision pre-training, vision-language pre-training, and instruction fine-tuning (IFT). IFT has typically been applied using a mixture of generic and healthcare data. In contrast, we propose that for medical VLMs, a fourth stage of specialized IFT is necessary, which focuses on medical data and includes information from domain expert models. Domain expert models developed for medical use are crucial because they are specifically trained for certain clinical tasks, e.g. to detect tumors and classify abnormalities through segmentation and classification, which learn fine-grained features of medical data, features that are often too intricate for a VLM to capture effectively. This paper introduces a new framework, VILA-M3, for …
Poster
Tahira Kazimi · Ritika Allada · Pinar Yanardag

[ ExHall D ]

Abstract
Classifiers are crucial to computer vision, yet their "black box" nature obscures the decision-making process, limiting the ability to trace the influence of individual features. Traditional interpretability methods, including GAN-based attribute editing, are constrained by domain and resource demands, often requiring extensive labeling and model-specific training. Text-to-image diffusion models, while promising for broader applications, lack precise semantics for classifier interpretation without extensive user input. We introduce DiffEx, a training-free framework that combines large language models (LLMs) and pre-trained diffusion models to improve classifier explainability. DiffEx leverages Vision-Language Models (VLMs) to build a comprehensive, hierarchical semantic corpus and applies a novel algorithm to rank impactful features, offering broad and fine-grained attributes that influence classifier scores. Our experiments show that DiffEx provides nuanced, interpretable insights across diverse domains, including medical diagnostics, making it versatile, scalable, and well-suited for understanding complex classifiers in critical applications.
Poster
Bo Wang · Dingwei Tan · Yen-Ling Kuo · Zhaowei Sun · Jeremy M Wolfe · Tat-Jen Cham · Mengmi Zhang

[ ExHall D ]

Abstract
Imagine searching a collection of coins for quarters (0.25), dimes (0.10), nickels (0.05), and pennies (0.01)—a hybrid foraging task where observers search for multiple instances of multiple target types. In such tasks, how do target values and their prevalence influence foraging and eye movement behaviors (e.g., should you prioritize rare quarters or common nickels)? To explore this, we conducted human psychophysics experiments, revealing that humans are proficient reward foragers. Their eye fixations are drawn to regions with higher average rewards, fixation durations are longer on more valuable targets, and their cumulative rewards exceed chance, approaching the upper bound of optimal foragers. To probe this decision-making process in humans, we developed a transformer-based Visual Forager (VF) model trained via reinforcement learning. Our VF model takes a series of targets, their corresponding values, and the search image as inputs, processes the images using foveated vision, and produces a sequence of eye movements along with decisions on whether to click on each fixated item. Our model outperforms all baselines, achieves cumulative rewards comparable to those of humans, and closely mirrors human foraging behavior in eye movements and click biases. Furthermore, stress tests on out-of-distribution tasks with novel targets, unseen values, and varying …
Poster
Junjie Wang · BIN CHEN · Yulin Li · Bin Kang · Yichi Chen · Zhuotao Tian

[ ExHall D ]

Abstract
Dense visual prediction tasks have been constrained by their reliance on predefined categories, limiting their applicability in real-world scenarios where visual concepts are unbounded. While Vision-Language Models (VLMs) like CLIP have shown promise in open-vocabulary tasks, their direct application to dense prediction often leads to suboptimal performance due to limitations in local feature representation. In this work, we present our observation that CLIP's image tokens struggle to effectively aggregate information from spatially or semantically related regions, resulting in features that lack local discriminability and spatial consistency. To address this issue, we propose DeCLIP, a novel framework that enhances CLIP by decoupling the self-attention module to obtain "content'' and "context'' features respectively. The "content'' features are aligned with image crop representations to improve local discriminability, while "context'' features learn to retain the spatial correlations under the guidance of vision foundation models, such as DINO. Extensive experiments demonstrate that DeCLIP significantly outperforms existing methods across multiple open-vocabulary dense prediction tasks, including object detection and semantic segmentation. Code and models will be made publicly available.
Poster
Pedro Hermosilla · Christian Stippel · Leon Sick

[ ExHall D ]

Abstract
Self-supervised learning has transformed 2D computer vision by enabling models trained on large, unannotated datasets to provide versatile off-the-shelf features that perform similarly to models trained with labels. However, in 3D scene understanding, self-supervised methods are typically only used as a weight initialization step for task-specific fine-tuning, limiting their utility for general-purpose feature extraction. This paper aims to address this shortcoming by proposing a robust evaluation protocol specifically designed to assess the quality of self-supervised features for 3D scene understanding. Our protocol uses multi-resolution feature sampling of hierarchical models to create rich point-level representations that capture the semantic capabilities of the model and, hence, are suitable for evaluation with linear probing and nearest-neighbor methods. Furthermore, we introduce the first self-supervised model that performs similarly to supervised models when only off-the-shelf features are used in a linear probing setup. In particular, our model is trained natively in 3D with a novel self-supervised approach based on a Masked Scene Modeling objective, which reconstructs deep features of masked patches in a bottom-up manner and is specifically tailored to hierarchical 3D models. Our experiments not only demonstrate that our method achieves competitive performance to supervised models, but also surpasses existing self-supervised approaches by a …
Poster
Zheda Mai · Ping Zhang · Cheng-Hao Tu · Hong-You Chen · Quang-Huy Nguyen · Li Zhang · Wei-Lun Chao

[ ExHall D ]

Abstract
Parameter-efficient fine-tuning (PEFT) has attracted significant attention due to the growth of pre-trained model sizes and the need to fine-tune (FT) them for superior downstream performance. Despite a surge in new PEFT methods, a systematic study to understand their performance and suitable application scenarios is lacking, leaving questions like "when to apply PEFT" and "which method to use" largely unanswered, especially in visual recognition. In this paper, we conduct a unifying empirical study of representative PEFT methods with Vision Transformers. We systematically tune their hyper-parameters to fairly compare their accuracy on downstream tasks. Our study offers a practical user guide and unveils several new insights. First, if tuned carefully, different PEFT methods achieve similar accuracy in the low-shot benchmark VTAB-1K. This includes simple approaches such as fine-tuning only the bias terms, which were previously reported to be inferior. Second, despite similar accuracy, we find that PEFT methods make different mistakes and high-confidence predictions, likely due to their different inductive biases. Such an inconsistency (or complementariness) opens up the opportunity for ensemble methods, and we make preliminary attempts at this. Third, going beyond the commonly used low-shot tasks, we find that PEFT is also useful in many-shot regimes, achieving comparable or better accuracy than full FT …
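The "fine-tuning only the bias terms" baseline mentioned above is simple enough to sketch. A minimal PyTorch version might look like the following; the choice of backbone (torchvision's ViT-B/16) and keeping the new task head trainable are illustrative assumptions, not details from the paper.

```python
import torch.nn as nn
from torchvision.models import vit_b_16

def freeze_all_but_biases(model: nn.Module) -> None:
    """Bias-only fine-tuning: keep only bias terms (and, here, the task head) trainable."""
    for name, param in model.named_parameters():
        param.requires_grad = name.endswith("bias") or name.startswith("heads")

model = vit_b_16(weights=None)                    # pre-trained weights would be loaded in practice
model.heads = nn.Sequential(nn.Linear(768, 100))  # hypothetical 100-class downstream head
freeze_all_but_biases(model)
print(sum(p.numel() for p in model.parameters() if p.requires_grad), "trainable parameters")
```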
Poster
Seungmin Baek · Soyul Lee · Hayeon Jo · Hyesong Choi · Dongbo Min

[ ExHall D ]

Abstract
The transfer learning paradigm has driven substantial advancements in various vision tasks. However, as state-of-the-art models continue to grow, classical full fine-tuning often becomes computationally impractical, particularly in the multi-task learning (MTL) setup where training complexity increases in proportion to the number of tasks. Consequently, recent studies have explored Parameter-Efficient Fine-Tuning (PEFT) for MTL architectures. Despite some progress, these approaches still exhibit limitations in capturing fine-grained, task-specific features that are crucial to MTL. In this paper, we introduce the Task-Adaptive Dynamic transFormer, termed TADFormer, a novel PEFT framework that performs task-aware feature adaptation in a fine-grained manner by dynamically considering task-specific input contexts. TADFormer introduces parameter-efficient prompting for task adaptation and a Dynamic Task Filter (DTF) to capture task information conditioned on input contexts. Experiments on the PASCAL-Context benchmark demonstrate that the proposed method achieves higher accuracy in dense scene understanding tasks, while reducing the number of trainable parameters by up to 8.4 times when compared to full fine-tuning of MTL models. TADFormer also demonstrates superior parameter efficiency and accuracy compared to recent PEFT methods. Our code is available in the supplementary material.
Poster
Xuan Cai · Renjie Pan · Hua Yang

[ ExHall D ]

Abstract
'Pre-training + fine-tuning' has been widely used in various downstream tasks. Parameter-efficient fine-tuning (PEFT) has demonstrated higher efficiency and promising performance compared to traditional full-tuning. The widely used adapter-based and prompt-based methods in PEFT can be uniformly represented as adding an MLP structure to the pre-trained model. These methods are prone to over-fitting in downstream tasks due to differences in data scale and distribution. To address this issue, we propose a new adapter-based PEFT module, i.e., LoKi, which consists of an encoder, a learnable activation layer, and a decoder. To maintain the simplicity of LoKi, we use single-layer linear networks for the encoder and decoder, and for the learnable activation layer, we use a Kolmogorov-Arnold Network (KAN) with a minimal number of layers (only two KAN linear layers). With a bottleneck rate much lower than that of Adapter, LoKi has fewer parameters (only half of Adapter) and avoids the slow training speed and high memory usage of KAN. We conduct extensive experiments on LoKi under image classification and video action recognition across 9 datasets. LoKi demonstrates highly competitive generalization performance compared to other PEFT methods with fewer tunable parameters, ensuring both effectiveness and efficiency. Code will be available.
Poster
Ondrej Tybl · Lukas Neumann

[ ExHall D ]

Abstract
Deep learning has revolutionized computer vision, but it achieved its tremendous success using deep network architectures which are mostly hand-crafted and therefore likely suboptimal. Neural Architecture Search (NAS) aims to bridge this gap by following a well-defined optimization paradigm which systematically looks for the best architecture, given an objective criterion such as maximal classification accuracy. The main limitation of NAS, however, is its astronomical computational cost, as it typically requires training each candidate network architecture from scratch. In this paper, we aim to alleviate this limitation by proposing a novel training-free proxy for image classification accuracy based on Fisher Information. The proposed proxy has a strong theoretical background in statistics and it allows estimating the expected image classification accuracy of a given deep network without training the network, thus significantly reducing the computational cost of standard NAS algorithms. Our training-free proxy achieves state-of-the-art results on three public datasets and in two search spaces, both when evaluated using previously proposed metrics, as well as using a new metric that we propose, which we demonstrate is more informative for practical NAS applications. The source code is publicly available.
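The abstract does not spell out the estimator, but a generic Fisher-Information-style zero-cost proxy can be sketched as follows: score an untrained candidate network by the trace of its empirical Fisher information on a single mini-batch. This is a hedged illustration of the general idea, not the paper's exact proxy or normalization.

```python
import torch
import torch.nn.functional as F

def empirical_fisher_score(model: torch.nn.Module,
                           images: torch.Tensor,
                           labels: torch.Tensor) -> float:
    """Trace of the empirical Fisher information: the sum of squared loss gradients
    over all parameters of an untrained candidate, computed on one mini-batch."""
    model.zero_grad()
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    return sum((p.grad ** 2).sum().item()
               for p in model.parameters() if p.grad is not None)

# Candidate architectures can then be ranked by this score instead of being trained.
```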
Poster
Zhuguanyu Wu · Shihe Wang · Jiayi Zhang · Jiaxin Chen · Yunhong Wang

[ ExHall D ]

Abstract
Network quantization, a prevalent technique for network compression, significantly reduces computational demands and memory usage, thereby facilitating the deployment of large-parameter models onto hardware with constrained resources. Post-training quantization (PTQ) stands out as a cost-effective and promising approach due to its avoidance of the need for retraining. Unfortunately, many current PTQ methods for Vision Transformers (ViTs) exhibit a notable decrease in accuracy, especially in low-bit cases. To tackle these challenges, we analyze the extensively utilized Hessian-guided quantization loss, and uncover certain limitations within the approximated pre-activation Hessian. By deducing the relationship between KL divergence and the Fisher information matrix (FIM), we develop a more refined approximation for the FIM. Building on this, we introduce the Diagonal Plus Low-Rank FIM (DPLR) to achieve a more nuanced quantization loss. Our extensive experiments, conducted across various ViT-based architectures on public benchmark datasets, demonstrate that our quantization loss calculation surpasses the performance of the prevalent mean squared error (MSE) and approximated pre-activation Hessian, and outperforms previous work in low-bit cases. Code will be released upon acceptance.
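The quadratic quantization loss implied by a diagonal-plus-low-rank Fisher approximation can be evaluated cheaply without forming the full matrix. The sketch below shows only that algebra; how the diagonal and low-rank factors are actually estimated from the network is the paper's contribution and is not reproduced here.

```python
import torch

def dplr_quant_loss(delta: torch.Tensor,      # (n,) perturbation w_q - w caused by quantization
                    diag: torch.Tensor,       # (n,) diagonal entries d of the FIM approximation
                    low_rank: torch.Tensor    # (n, r) low-rank factor L, so F ~ diag(d) + L L^T
                    ) -> torch.Tensor:
    """delta^T (diag(d) + L L^T) delta = sum_i d_i * delta_i^2 + ||L^T delta||^2."""
    return (diag * delta.pow(2)).sum() + (low_rank.t() @ delta).pow(2).sum()

loss = dplr_quant_loss(torch.randn(1024), torch.rand(1024), 0.1 * torch.randn(1024, 8))
print(loss.item())
```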
Poster
Jiachen Zhu · Xinlei Chen · Kaiming He · Yann LeCun · Zhuang Liu

[ ExHall D ]

Abstract
Normalization layers are ubiquitous in modern neural networks and have long been considered essential. In this work, we demonstrate that strong performance can be achieved on Transformers without normalization layers, by using a remarkably simple technique. We introduce Dynamic Tanh (DyT), an element-wise operation DyT(x)=tanh(αx), as a drop-in replacement for normalization layers in Transformers. DyT is inspired by the observation that layer normalization layers often produce tanh-like, S-shaped input-output mappings. By incorporating DyT, Transformers without any normalization layers can match or exceed the performance of their normalized counterparts, mostly without tuning training hyperparameters. We validate the efficacy of Transformers with DyT across diverse settings, ranging from recognition to generation, supervised to self-supervised learning, and computer vision to language models. These findings challenge the conventional understanding that normalization layers are indispensable in modern neural networks, and offer new insights into their role in deep neural networks.
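The DyT operation is given explicitly above, so it can be sketched directly. In the version below, the learnable per-channel affine after the tanh and the initial value of α are assumptions mirroring what a normalization layer usually provides; only tanh(αx) itself comes from the abstract.

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Dynamic Tanh: element-wise tanh(alpha * x) as a drop-in replacement
    for a normalization layer in a Transformer block."""
    def __init__(self, dim: int, init_alpha: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(init_alpha))  # learnable scale inside the tanh
        self.weight = nn.Parameter(torch.ones(dim))           # assumed affine, as in LayerNorm
        self.bias = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.alpha * x) * self.weight + self.bias

# usage: replace nn.LayerNorm(d_model) with DyT(d_model) inside the Transformer
x = torch.randn(2, 16, 768)
print(DyT(768)(x).shape)  # torch.Size([2, 16, 768])
```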
Poster
Abdelrahman Shaker · Syed Talal Wasim · Salman Khan · Jürgen Gall · Fahad Shahbaz Khan

[ ExHall D ]

Abstract
State-space models (SSMs) have recently shown promise in capturing long-range dependencies with subquadratic computational complexity, making them attractive for various applications. However, purely SSM-based models face critical challenges related to stability and achieving state-of-the-art performance in computer vision tasks. Our paper addresses the challenges of scaling SSM-based models for computer vision, particularly the instability and inefficiency of large model sizes. We introduce a parameter-efficient modulated group mamba layer that divides the input channels into four groups and applies our proposed SSM-based efficient Visual Single Selective Scanning (VSSS) block independently to each group, with each VSSS block scanning in one of the four spatial directions. The Modulated Group Mamba layer also wraps the four VSSS blocks into a channel modulation operator to improve cross-channel communication. Furthermore, we introduce a distillation-based training objective to stabilize the training of large models, leading to consistent performance gains. Our comprehensive experiments demonstrate the merits of the proposed contributions, leading to superior performance over existing methods for image classification on ImageNet-1K, object detection, instance segmentation on MS-COCO, and semantic segmentation on ADE20K. Our tiny variant with 23M parameters achieves state-of-the-art performance with a classification top-1 accuracy of 83.3% on ImageNet-1K, while being 26% efficient in terms of …
Poster
Sanghyeok Lee · Joonmyung Choi · Hyunwoo J. Kim

[ ExHall D ]

Abstract
For the deployment of neural networks in resource-constrained environments, prior works have built lightweight architectures with convolution and attention for capturing local and global dependencies, respectively. Recently, the state space model has emerged as an effective global token interaction with its favorable linear computational cost in the number of tokens. Yet, efficient vision backbones built with SSM have been explored less. In this paper, we introduce Efficient Vision Mamba (EfficientViM), a novel architecture built on hidden state mixer-based state space duality (HSM-SSD) that efficiently captures global dependencies with further reduced computational cost. In the HSM-SSD layer, we redesign the previous SSD layer to enable the channel mixing operation within hidden states. Additionally, we propose multi-stage hidden state fusion to further reinforce the representation power of hidden states, and provide the design alleviating the bottleneck caused by the memory-bound operations. As a result, the EfficientViM family achieves a new state-of-the-art speed-accuracy trade-off on ImageNet-1k, offering up to a 0.7% performance improvement over the second-best model SHViT with faster speed. Further, we observe significant improvements in throughput and accuracy compared to prior works, when scaling images or employing distillation training.
Poster
Xiaoyong Lu · Songlin Du

[ ExHall D ]

Abstract
Existing state-of-the-art feature matchers capture long-range dependencies with Transformers but are hindered by high spatial complexity, leading to demanding training and high-latency inference. Striking a better balance between performance and efficiency remains a critical challenge in feature matching. Inspired by the linear complexity O(N) of Mamba, we propose an ultra-lightweight Mamba-based matcher, named JamMa, which converges on a single GPU and achieves an impressive performance-efficiency balance in inference. To unlock the potential of Mamba for feature matching, we propose Joint Mamba with a scan-merge strategy named JEGO, which enables: (1) Joint scan of two images to achieve high-frequency mutual interaction, (2) Efficient scan with skip steps to reduce sequence length, (3) Global receptive field, and (4) Omnidirectional feature representation. With the above properties, the JEGO strategy significantly outperforms the scan-merge strategies proposed in VMamba and EVMamba in the feature matching task. Compared to attention-based sparse and semi-dense matchers, JamMa demonstrates a notably superior balance between performance and efficiency, delivering better performance with less than 50% of the parameters and FLOPs.
Poster
Jiaxin Cai · Jingze Su · Qi Li · Wenjie Yang · Shu Wang · Tiesong Zhao · Shengfeng He · Wenxi Liu

[ ExHall D ]

Abstract
Multimodal semantic segmentation is a critical challenge in computer vision, with early methods suffering from high computational costs and limited transferability due to full fine-tuning of RGB-based pre-trained parameters. Recent studies, while leveraging additional modalities as supplementary prompts to RGB, still predominantly rely on RGB, which restricts the full potential of other modalities. To address these issues, we propose a novel symmetric parameter-efficient fine-tuning framework for multimodal segmentation, featuring a modality-aware prompting and adaptation scheme, to simultaneously adapt the capabilities of a powerful pre-trained model to both RGB and X modalities. Furthermore, prevalent approaches use the global cross-modality correlations of the attention mechanism for modality fusion, which inadvertently introduces noise across modalities. To mitigate this noise, we propose a dynamic sparse cross-modality fusion module to facilitate effective and efficient cross-modality fusion. To further strengthen the above two modules, we propose a training strategy that leverages accurately predicted dual-modality results to self-teach the single-modality outcomes. In comprehensive experiments, we demonstrate that our method outperforms previous state-of-the-art approaches across six multimodal segmentation scenarios with minimal computation cost.
Poster
Feng Wang · Jiahao Wang · Sucheng Ren · Guoyizhe Wei · Jieru Mei · Wei Shao · Yuyin Zhou · Alan L. Yuille · Cihang Xie

[ ExHall D ]

Abstract
This paper identifies artifacts within the feature maps of Vision Mamba, similar to those previously observed in Vision Transformers. These artifacts, corresponding to high-norm tokens emerging in low-information background areas of images, appear much more severe in Vision Mamba---they exist prevalently even with the tiny-sized model and activate extensively across background regions. To mitigate this issue, we follow the prior solution of introducing register tokens into Vision Mamba. To better cope with Mamba blocks' uni-directional inference paradigm, two key modifications are introduced: 1) evenly inserting registers throughout the input token sequence, and 2) recycling registers for final decision predictions. We term this new architecture MambaReg. Qualitative observations suggest that, compared to vanilla Vision Mamba, MambaReg's feature maps appear cleaner and more focused on semantically meaningful regions. Quantitatively, MambaReg attains stronger performance and scales better. For example, on the ImageNet benchmark, our MambaReg-B attains 83.0% accuracy, significantly outperforming Vim-B's 81.8%; furthermore, we provide the first successful scaling to the large model size (i.e., with 340M parameters), attaining a competitive accuracy of 83.6% (84.5% if finetuned with 384x384 inputs). Additional validation on the downstream semantic segmentation task also supports MambaReg's efficacy.
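The two modifications named above (evenly inserted registers, recycled for the prediction) can be sketched as plain tensor manipulations. The uniform spacing and the mean-pooling of register outputs mentioned in the comment below are one plausible reading of the abstract, not the paper's exact design.

```python
import torch

def insert_registers(tokens: torch.Tensor, registers: torch.Tensor) -> torch.Tensor:
    """Evenly interleave R register tokens into a (B, N, D) patch-token sequence."""
    B, N, D = tokens.shape
    R = registers.shape[0]
    positions = torch.linspace(0, N, R).long().tolist()   # assumed uniform insertion points
    chunks, prev = [], 0
    for i, pos in enumerate(positions):
        chunks.append(tokens[:, prev:pos])
        chunks.append(registers[i].expand(B, 1, D))
        prev = pos
    chunks.append(tokens[:, prev:])
    return torch.cat(chunks, dim=1)                        # (B, N + R, D)

# After the Mamba backbone, the outputs at the register positions can be gathered and
# pooled (e.g. averaged) to feed the prediction head: one reading of "recycling registers".
seq = insert_registers(torch.randn(2, 196, 384), torch.randn(4, 384))
print(seq.shape)  # torch.Size([2, 200, 384])
```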
Poster
Cheng Lei · Ao Li · Hu Yao · Ce Zhu · Le Zhang

[ ExHall D ]

Abstract
Parameter-efficient fine-tuning (PEFT) adapts pre-trained models to new tasks by updating only a small subset of parameters, achieving efficiency but still facing significant inference costs driven by input token length. This challenge is even more pronounced in pixel-level tasks, which require longer input sequences compared to image-level tasks. Although token reduction (TR) techniques can help reduce computational demands, they often lead to homogeneous attention patterns that compromise performance in pixel-level scenarios. This study underscores the importance of maintaining attention diversity for these tasks and proposes to enhance attention diversity while ensuring the completeness of token sequences. Our approach effectively reduces the number of tokens processed within transformer blocks, improving computational efficiency without sacrificing performance on several pixel-level tasks. We also demonstrate the superior generalization capability of our proposed method compared to challenging baseline models.
Poster
Rong Qin · Xin Liu · Xingyu Liu · Jiaxuan Liu · Jinglei Shi · Liang Lin · Jufeng Yang

[ ExHall D ]

Abstract
Over the last decade, many notable methods have emerged to tackle the computational resource challenge of high-resolution image recognition (HRIR). They typically focus on identifying and aggregating a few salient regions for classification, discarding sub-salient areas to keep training consumption low. Nevertheless, many HRIR tasks necessitate the exploration of wider regions to model objects and contexts, which limits their performance in such scenarios. To address this issue, we present a DBPS strategy to enable training with more patches at low consumption. Specifically, in addition to a fundamental buffer that stores the embeddings of the most salient patches, DBPS further employs an auxiliary buffer to recycle those sub-salient ones. To reduce the computational cost associated with gradients of sub-salient patches, these patches are primarily used in the forward pass to provide sufficient information for classification. Meanwhile, only the gradients of the salient patches are back-propagated to update the entire network. Moreover, we design a Multiple Instance Learning (MIL) architecture that leverages aggregated information from salient patches to filter out uninformative background within sub-salient patches for better accuracy. Besides, we introduce the random patch drop to accelerate the training process and uncover informative regions. Experiment results demonstrate the superiority of our method in …
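The core trick described above, where sub-salient patches contribute to the forward pass only while gradients flow solely through the salient patches, can be sketched with a no_grad block. The encoder and classifier modules and the patch layouts are placeholders, not details from the paper.

```python
import torch

def forward_with_dual_buffers(encoder: torch.nn.Module,
                              classifier: torch.nn.Module,
                              salient: torch.Tensor,      # (P1, C, H, W) salient patches
                              sub_salient: torch.Tensor   # (P2, C, H, W) sub-salient patches
                              ) -> torch.Tensor:
    """Salient patches keep gradients; sub-salient ones are embedded forward-only,
    so they add context for classification without the gradient memory cost."""
    salient_emb = encoder(salient)                 # gradients retained
    with torch.no_grad():
        sub_emb = encoder(sub_salient)             # forward pass only, no backprop
    all_emb = torch.cat([salient_emb, sub_emb], dim=0)
    return classifier(all_emb)                     # e.g. a MIL-style aggregation head
```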
Poster
Lu Yu · HaoYu Han · Zhe Tao · Hantao Yao · Changsheng Xu

[ ExHall D ]

Abstract
Continual learning (CL) aims to enable learning systems to acquire new knowledge constantly without forgetting previously learned information. CL faces the challenge of mitigating catastrophic forgetting while maintaining interpretability across tasks. Most existing CL methods focus primarily on preserving learned knowledge to improve model performance. However, as new information is introduced, the interpretability of the learning process becomes crucial for understanding the evolving decision-making process, yet it is rarely explored. In this paper, we introduce a novel framework that integrates language-guided Concept Bottleneck Models (CBMs) to address both challenges. Our approach leverages the Concept Bottleneck Layer, aligning semantic consistency with CLIP models to learn human-understandable concepts that can generalize across tasks. By focusing on interpretable concepts, our method not only enhances the model's ability to retain knowledge over time but also provides transparent decision-making insights. We demonstrate the effectiveness of our approach by achieving superior performance on several datasets, outperforming state-of-the-art methods with an improvement of up to 3.06% in final average accuracy on ImageNet-subset. Additionally, we offer concept visualizations for model predictions, further advancing the understanding of interpretable continual learning. Code will be released upon acceptance.
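A generic language-guided concept bottleneck, in the spirit of the Concept Bottleneck Layer described above, can be sketched as follows: image embeddings are scored against frozen CLIP text embeddings of concept names, and classes are predicted linearly from the resulting concept activations. The cosine-similarity scoring and the linear head are standard CBM assumptions, not the paper's exact layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptBottleneck(nn.Module):
    """Generic language-guided concept bottleneck: concept scores from CLIP text
    embeddings of concept names, then a linear (interpretable) classification head."""
    def __init__(self, concept_text_emb: torch.Tensor, num_classes: int):
        super().__init__()
        self.register_buffer("concepts", F.normalize(concept_text_emb, dim=-1))  # (C, D), frozen
        self.head = nn.Linear(concept_text_emb.shape[0], num_classes)

    def forward(self, image_emb: torch.Tensor) -> torch.Tensor:        # image_emb: (B, D)
        scores = F.normalize(image_emb, dim=-1) @ self.concepts.t()    # (B, C) concept activations
        return self.head(scores)                                       # class logits

logits = ConceptBottleneck(torch.randn(64, 512), num_classes=10)(torch.randn(4, 512))
print(logits.shape)  # torch.Size([4, 10])
```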
Poster
Shenghao Fu · Qize Yang · Qijie Mo · Junkai Yan · Xihan Wei · Jingke Meng · Xiaohua Xie · Wei-Shi Zheng

[ ExHall D ]

Abstract
Recent open-vocabulary detectors achieve promising performance with abundant region-level annotated data. In this work, we show that an open-vocabulary detector co-training with a large language model by generating image-level detailed captions for each image can further improve performance. To achieve the goal, we first collect a dataset, GroundingCap-1M, wherein each image is accompanied by associated grounding labels and an image-level detailed caption. With this dataset, we finetune an open-vocabulary detector with training objectives including a standard grounding loss and a caption generation loss. We take advantage of a large language model to generate both region-level short captions for each region of interest and image-level long captions for the whole image. Under the supervision of the large language model, the resulting detector, LLMDet, outperforms the baseline by a clear margin, enjoying superior open-vocabulary ability. Further, we show that the improved LLMDet can in turn build a stronger large multi-modal model, achieving mutual benefits. The code, model, and dataset will be available.
Poster
Yongkang Li · Tianheng Cheng · Bin Feng · Wenyu Liu · Xinggang Wang

[ ExHall D ]

Abstract
Recent open-vocabulary segmentation methods adopt mask generators to predict segmentation masks and leverage pre-trained vision-language models, *e.g.*, CLIP, to classify these masks via mask pooling. Although these approaches show promising results, it is counterintuitive that accurate masks often fail to yield accurate classification results through pooling CLIP image embeddings within the mask regions. In this paper, we reveal the performance limitations of mask pooling and introduce **Mask-Adapter**, a simple yet effective method to address these challenges in open-vocabulary segmentation. Compared to directly using proposal masks, our proposed Mask-Adapter extracts *semantic activation maps* from proposal masks, providing richer contextual information and ensuring alignment between masks and CLIP. Additionally, we propose a *mask consistency loss* that encourages proposal masks with similar IoUs to obtain similar CLIP embeddings to enhance models' robustness to varying predicted masks. Mask-Adapter integrates seamlessly into open-vocabulary segmentation methods based on mask pooling in a plug-and-play manner, delivering more accurate classification results. Extensive experiments across several zero-shot benchmarks demonstrate significant performance gains for the proposed Mask-Adapter on several well-established methods. Notably, Mask-Adapter also extends effectively to SAM and achieves impressive results on several open-vocabulary segmentation datasets. Code and models will be made publicly available.
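For context, the mask-pooling baseline that Mask-Adapter improves upon can be sketched as below: average dense CLIP image features inside each proposal mask and classify by cosine similarity to class-name text embeddings. This is the baseline the abstract describes, not the Mask-Adapter module itself, and the tensor layouts are assumptions.

```python
import torch
import torch.nn.functional as F

def mask_pool_classify(clip_feats: torch.Tensor,   # (D, H, W) dense CLIP image features
                       masks: torch.Tensor,        # (M, H, W) binary proposal masks
                       text_emb: torch.Tensor      # (K, D) CLIP text embeddings of class names
                       ) -> torch.Tensor:
    """Mask pooling: average features inside each mask, then score against class texts."""
    D = clip_feats.shape[0]
    flat = clip_feats.reshape(D, -1)                                      # (D, HW)
    m = masks.reshape(masks.shape[0], -1).float()                         # (M, HW)
    pooled = (m @ flat.t()) / m.sum(dim=1, keepdim=True).clamp(min=1.0)   # (M, D)
    return F.normalize(pooled, dim=-1) @ F.normalize(text_emb, dim=-1).t()  # (M, K) logits

scores = mask_pool_classify(torch.randn(512, 32, 32), torch.rand(5, 32, 32) > 0.5, torch.randn(20, 512))
print(scores.shape)  # torch.Size([5, 20])
```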
Poster
Zelin Peng · Zhengqin Xu · Zhilin Zeng · Yu Huang · Yaoming Wang · Wei Shen

[ ExHall D ]

Abstract
Open-vocabulary semantic segmentation seeks to label each pixel in an image with arbitrary text descriptions. Vision-language foundation models, especially CLIP, have recently emerged as powerful tools for acquiring open-vocabulary capabilities. However, fine-tuning CLIP to equip it with pixel-level prediction ability often suffers three issues: 1) high computational cost, 2) misalignment between the two inherent modalities of CLIP, and 3) degraded generalization ability on unseen categories. To address these issues, we propose H-CLIP, a symmetrical parameter-efficient fine-tuning (PEFT) strategy conducted in hyperspherical space for both of the two CLIP modalities. Specifically, the PEFT strategy is achieved by a series of efficient block-diagonal learnable transformation matrices and a dual cross-relation communication module among all learnable matrices. Since the PEFT strategy is conducted symmetrically to the two CLIP modalities, the misalignment between them is mitigated. Furthermore, we apply an additional constraint to PEFT on the CLIP text encoder according to the hyperspherical energy principle, i.e., minimizing hyperspherical energy during fine-tuning preserves the intrinsic structure of the original parameter space, to prevent the destruction of the generalization ability offered by the CLIP text encoder. Extensive evaluations across various benchmarks show that H-CLIP achieves new SOTA open-vocabulary semantic segmentation results while only requiring updating approximately …
Poster
Xiao-Hui Li · Fei Yin · Cheng-Lin Liu

[ ExHall D ]

Abstract
Document image segmentation is crucial in document analysis and recognition but remains challenging due to the heterogeneity of document formats and diverse segmentation tasks. Existing methods often treat these tasks separately, leading to limited generalization and resource wastage. This paper introduces DocSAM, a transformer-based unified framework for various document image segmentation tasks, including document layout analysis, multi-granularity text segmentation, and table structure recognition by modelling these tasks as a combination of instance and semantic segmentation. Specifically, DocSAM uses a Sentence BERT to map category names from each dataset into semantic queries of the same dimension as instance queries. These queries interact through attention mechanisms and are cross-attended with image features to predict instance and semantic segmentation masks. To predict instance categories, instance queries are dot-producted with semantic queries, and scores are normalized using softmax. As a result, DocSAM can be jointly trained on heterogeneous datasets, enhancing robustness and generalization while reducing computing and storage resources. Comprehensive evaluations show that DocSAM outperforms existing methods in accuracy, efficiency, and adaptability, highlighting its potential for advancing document image understanding and segmentation in various applications.
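The instance-classification step is stated concretely above (dot product with semantic queries, softmax over the scores), so it can be sketched directly; the shapes and the absence of a temperature are assumptions.

```python
import torch

def classify_instances(instance_queries: torch.Tensor,   # (num_instances, dim)
                       semantic_queries: torch.Tensor    # (num_categories, dim), from Sentence BERT
                       ) -> torch.Tensor:
    """Dot-product each instance query with every semantic query, then softmax over categories."""
    logits = instance_queries @ semantic_queries.t()      # (num_instances, num_categories)
    return torch.softmax(logits, dim=-1)

probs = classify_instances(torch.randn(100, 256), torch.randn(21, 256))
print(probs.shape)  # torch.Size([100, 21])
```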
Poster
Kunyu Wang · Xueyang Fu · Xin Lu · Chengjie Ge · Chengzhi Cao · Wei Zhai · Zheng-Jun Zha

[ ExHall D ]

Abstract
Continual test-time adaptive object detection (CTTA-OD) aims to online adapt a source pre-trained detector to ever-changing environments during inference under continuous domain shifts. Most existing CTTA-OD methods prioritize effectiveness while overlooking computational efficiency, which is crucial for resource-constrained scenarios. In this paper, we propose an efficient CTTA-OD method via pruning. Our motivation stems from the observation that not all learned source features are beneficial; certain domain-sensitive feature channels can adversely affect target domain performance. Inspired by this, we introduce a sensitivity-guided channel pruning strategy that quantifies each channel based on its sensitivity to domain discrepancies at both image and instance levels. We apply weighted sparsity regularization to selectively suppress and prune these sensitive channels, focusing adaptation efforts on invariant ones. Additionally, we introduce a stochastic channel reactivation mechanism to restore pruned channels, enabling recovery of potentially useful features and mitigating the risks of early pruning. Extensive experiments on three benchmarks show that our method achieves superior adaptation performance while reducing computational overhead by 12% in FLOPs compared to the recent SOTA method.
Poster
Chanyoung Kim · Dayun Ju · Woojung Han · Ming-Hsuan Yang · Seong Jae Hwang

[ ExHall D ]

Abstract
Open-Vocabulary Semantic Segmentation (OVSS) has advanced with recent vision-language models (VLMs), enabling segmentation beyond predefined categories through various learning schemes. Notably, training-free methods offer scalable, easily deployable solutions for handling unseen data, a key goal of OVSS. Yet a critical issue persists: the lack of object-level context when segmenting complex objects in the challenging OVSS setting based on arbitrary query prompts. This oversight limits models' ability to group semantically consistent elements within an object and map them precisely to user-defined arbitrary classes. In this work, we introduce a novel approach that overcomes this limitation by incorporating object-level contextual knowledge within images. Specifically, our model enhances intra-object consistency by distilling spectral-driven features from vision foundation models into the attention mechanism of the visual encoder, enabling semantically coherent components to form a single object mask. Additionally, we refine the text embeddings with zero-shot object presence likelihood to ensure accurate alignment with the specific objects represented in the images. By leveraging object-level contextual knowledge, our proposed approach achieves state-of-the-art performance with strong generalizability across diverse datasets. All the attached source code will be made available to the public.
Poster
Dong Zhao · Jinlong Li · Shuang Wang · Mengyao Wu · Qi Zang · Nicu Sebe · Zhun Zhong

[ ExHall D ]

Abstract
Vision Foundation Models (VFMs) excel in generalization due to large-scale pretraining, but fine-tuning them for Domain Generalized Semantic Segmentation (DGSS) while maintaining this ability remains challenging. Existing approaches either selectively fine-tune parameters or freeze the VFMs and update only the adapters, both of which may underutilize the VFMs' full potential in DGSS tasks. We observe that domain-sensitive parameters in VFMs, arising from task and distribution differences, can hinder generalization. To address this, we propose **FisherTune**, a robust fine-tuning method guided by the Domain-Related Fisher Information Matrix (DR-FIM). DR-FIM measures parameter sensitivity across tasks and domains, enabling selective updates that preserve generalization and enhance DGSS adaptability. FisherTune incorporates variational inference to stabilize DR-FIM estimation, treating parameters as Gaussian-distributed variables and leveraging pre-trained priors. Extensive experiments show that FisherTune achieves superior cross-domain segmentation while maintaining generalization, outperforming selective-parameter and adapter-based methods.
Poster
Jian Wang · Tianhong Dai · Bingfeng Zhang · Siyue Yu · ENG GEE LIM · Jimin Xiao

[ ExHall D ]

Abstract
Weakly Supervised Semantic Segmentation (WSSS) leverages Class Activation Maps (CAMs) to extract spatial information from image-level labels. However, CAMs primarily highlight the most discriminative foreground regions, leading to incomplete results. Prototype-based methods attempt to address this limitation by employing prototype CAMs instead of classifier CAMs. Nevertheless, existing prototype-based methods typically use a single prototype for each class, which is insufficient to capture all attributes of the foreground features due to the significant intra-class variations across different images. Consequently, these methods still struggle with incomplete CAM predictions. In this paper, we propose a novel framework called Prototypical Optimal Transport (POT) for WSSS. POT enhances CAM predictions by dividing features into multiple clusters and activating them separately using multiple cluster prototypes. In this process, a similarity-aware optimal transport is employed to assign features to the most probable clusters. This similarity-aware strategy ensures the prioritization of significant cluster prototypes, thereby improving the accuracy of feature assignment. Additionally, we introduce an adaptive OT-based consistency loss to refine feature representations. This framework effectively overcomes the limitations of single-prototype methods, providing more complete and accurate CAM predictions. Extensive experimental results on standard WSSS benchmarks (PASCAL VOC and MS COCO) demonstrate that our method significantly improves the …
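The assignment step above builds on entropic optimal transport; a generic Sinkhorn-style assignment of features to cluster prototypes under uniform marginals is sketched below. The similarity-aware weighting that the paper adds on top of this is not reproduced here.

```python
import torch

def sinkhorn_assign(features: torch.Tensor,     # (N, D), assumed L2-normalized
                    prototypes: torch.Tensor,   # (K, D), assumed L2-normalized
                    eps: float = 0.05, iters: int = 3) -> torch.Tensor:
    """Entropic OT assignment: exponentiate similarities, then alternately normalize
    columns and rows (Sinkhorn-Knopp) toward uniform cluster/sample marginals."""
    Q = torch.exp(features @ prototypes.t() / eps)   # (N, K)
    for _ in range(iters):
        Q = Q / Q.sum(dim=0, keepdim=True)           # balance cluster usage
        Q = Q / Q.sum(dim=1, keepdim=True)           # each feature's assignment sums to 1
    return Q                                         # soft assignment of features to prototypes
```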
Poster
Thanh-Dat Truong · Utsav Prabhu · Bhiksha Raj · Jackson Cothren · Khoa Luu

[ ExHall D ]

Abstract
Continual Learning in semantic scene segmentation aims to continually learn new unseen classes in dynamic environments while maintaining previously learned knowledge. Prior studies focused on modeling the catastrophic forgetting and background shift challenges in continual learning. However, fairness, another major challenge that causes unfair predictions leading to low performance among major and minor classes, still needs to be well addressed. In addition, prior methods have yet to model the unknown classes well, thus resulting in producing non-discriminative features among unknown classes. This work presents a novel Fairness Learning via Contrastive Attention Approach to continual learning in semantic scene understanding. In particular, we first introduce a new Fairness Contrastive Clustering loss to address the problems of catastrophic forgetting and fairness. Then, we propose an attention-based visual grammar approach to effectively model the background shift problem and unknown classes, producing better feature representations for different unknown classes. Through our experiments, our proposed approach achieves State-of-the-Art (SoTA) performance on different continual learning benchmarks, i.e., ADE20K, Cityscapes, and Pascal VOC. It promotes the fairness of the continual semantic segmentation model.
Poster
Shifan Zhang · Hongzi Zhu · Yinan He · Minyi Guo · Ziyang Lou · Shan Chang

[ ExHall D ]

Abstract
Computer-vision-based assessment of waste sorting is desired to replace manual supervision in Shanghai. Due to the difficulty of labeling a multitude of waste images, it is infeasible to directly train a semantic segmentation model for this purpose. In this work, we construct a new dataset consisting of 12,208 waste images, upon which seed regions (i.e., patches) are annotated and classified into 21 categories in a crowdsourcing fashion. To obtain pixel-level labels to train an effective segmentation model, we propose a weakly-supervised waste image pseudo label generation scheme, called WISNet. Specifically, we train a cohesive feature extractor with contrastive prototype learning, incorporating an unsupervised classification pretext task to help the extractor focus on more discriminative regions even within the same category. Furthermore, we propose an effective iterative patch expansion method to generate accurate pixel-level pseudo labels. Given these generated pseudo labels, a few-shot segmentation model can be trained to segment waste images. We implement and deploy WISNet in two real-world scenarios and conduct intensive experiments. Results show that WISNet can achieve a state-of-the-art 40.2% final segmentation mIoU on our waste benchmark, outperforming all other baselines and demonstrating the efficacy of WISNet.
Poster
Tian Liu · Huixin Zhang · Shubham Parashar · Shu Kong

[ ExHall D ]

Abstract
Few-shot recognition (FSR) aims to train a classification model with only a few labeled examples of each concept concerned by a downstream task, where data annotation cost can be prohibitively high. We develop methods to solve FSR by leveraging a pretrained Vision-Language Model (VLM). We particularly explore retrieval-augmented learning (RAL), which retrieves data from the VLM's pretraining set to learn better models for serving downstream tasks. RAL has been widely studied in zero-shot recognition but remains under-explored in FSR. Although applying RAL to FSR may seem straightforward, we observe interesting and novel challenges and opportunities. First, somewhat surprisingly, finetuning a VLM on a large amount of retrieved data underperforms state-of-the-art zero-shot methods. This is due to the imbalanced distribution of retrieved data and its domain gaps with the few-shot examples in the downstream task. Second, more surprisingly, we find that simply finetuning a VLM solely on few-shot examples significantly outperforms previous FSR methods, and finetuning on the mix of retrieved and few-shot data yields even better results. Third, to mitigate the imbalanced distribution and domain gap issues, we propose Stage-Wise retrieval-Augmented fineTuning (SWAT), which involves end-to-end finetuning on mixed data in the first stage and retraining the classifier on the …
Poster
Marco Garosi · Alessandro Conti · Gaowen Liu · Elisa Ricci · Massimiliano Mancini

[ ExHall D ]

Abstract
Attribute detection is crucial for many computer vision tasks, as it enables systems to describe properties such as color, texture, and material. Current approaches often rely on labor-intensive annotation processes which are inherently limited: objects can be described at an arbitrary level of detail (e.g., color vs. color shades), leading to ambiguities when the annotators are not instructed carefully. Furthermore, they operate within a predefined set of attributes, reducing scalability and adaptability to unforeseen downstream applications. We present Compositional Caching (ComCa), a training-free method for open-vocabulary attribute detection that overcomes these constraints. ComCa requires only the list of target attributes and objects as input, using them to populate an auxiliary cache of images by leveraging web-scale databases and Large Language Models to determine attribute-object compatibility. To account for the compositional nature of attributes, cache images receive soft attribute labels. Those are aggregated at inference time based on the similarity between the input and cache images, refining the predictions of underlying Vision-Language Models (VLMs). Importantly, our approach is model-agnostic, compatible with various VLMs. Experiments on public datasets demonstrate that ComCa significantly outperforms zero-shot and cache-based baselines, competing with recent training-based methods, proving that a carefully designed training-free approach can successfully address …
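The inference-time aggregation described above (similarity-weighted cached soft labels refining the VLM's zero-shot scores) might look roughly like the following; the softmax weighting, temperature, and blending coefficient are assumptions rather than the paper's exact rule.

```python
import torch
import torch.nn.functional as F

def cache_refined_scores(img_emb: torch.Tensor,            # (D,) query image embedding
                         cache_emb: torch.Tensor,          # (N, D) cached image embeddings
                         cache_soft_labels: torch.Tensor,  # (N, A) soft attribute labels of the cache
                         zero_shot_scores: torch.Tensor,   # (A,) VLM attribute scores for the query
                         alpha: float = 0.5, temperature: float = 0.07) -> torch.Tensor:
    """Aggregate cached soft labels with similarity weights, then blend with zero-shot scores."""
    sims = F.normalize(cache_emb, dim=-1) @ F.normalize(img_emb, dim=-1)  # (N,)
    weights = torch.softmax(sims / temperature, dim=0)
    cache_scores = weights @ cache_soft_labels                            # (A,)
    return alpha * cache_scores + (1 - alpha) * zero_shot_scores
```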
Poster
Zilin Wang · Sangwoo Mo · Stella X. Yu · Sima Behpour · Liu Ren

[ ExHall D ]

Abstract
Unlike common categories for plants and animals, ad-hoc categories such as things to sell at a garage sale are created to help people achieve a certain task. Likewise, AI agents need to adaptively categorize visual scenes in response to changing tasks. We thus study open ad-hoc categorization, where we learn to infer novel concepts and name images according to a varying categorization purpose, a few labeled exemplars, and many unlabeled images. We develop a simple method that combines top-down text guidance (CLIP) with bottom-up image clustering (GCD) to learn contextualized visual features and align visual clusters with CLIP semantics, enabling predictions for both known and novel classes. Benchmarked on the multi-label datasets Stanford and Clevr-4, our method, called OAK, significantly outperforms baselines in providing accurate predictions across contexts and identifying novel concepts, e.g., it achieves 87.4% novel accuracy on Stanford Mood, surpassing CLIP and GCD by over 50%. OAK offers interpretable saliency maps, focusing on hands, faces, and backgrounds for Action, Mood, and Location contexts, respectively.
Poster
Zhengyuan Peng · Jinpeng Ma · Zhimin Sun · Ran Yi · Haichuan Song · Xin Tan · Lizhuang Ma

[ ExHall D ]

Abstract
Generalized Category Discovery (GCD) is a classification task that aims to classify both base and novel classes in unlabeled images, using knowledge from a labeled dataset. In GCD, previous research typically treats scene information as noise and minimizes its influence during model training. However, in this paper, we argue that scene information should not be treated as noise, but rather recognized as a strong prior for inferring novel classes. We attribute the misinterpretation of scene information to a key factor: the Ambiguity Challenge inherent in GCD. Specifically, novel objects in base scenes might be wrongly classified into base categories, while base objects in novel scenes might be mistakenly recognized as novel categories. Once the ambiguity challenge is addressed, scene information can reach its full potential, significantly enhancing the performance of GCD models. To more effectively leverage scene information, we propose the Modeling Object-Scene Associations (MOS) framework, which utilizes a simple MLP-based scene-awareness module to enhance GCD performance. It achieves an average accuracy improvement of 4% on challenging fine-grained datasets compared to state-of-the-art methods, emphasizing its superior performance in GCD tasks.
Poster
Mankeerat Sidhu · Hetarth Chopra · Ansel Blume · Jeonghwan Kim · Revanth Gangi Reddy · Heng Ji

[ ExHall D ]

Abstract
In this paper, we introduce SearchDet, a training-free long-tail object detection framework that significantly enhances open-vocabulary object detection performance. SearchDet retrieves a set of positive and negative images of an object to ground, embeds these images, and computes an input-image-weighted query which is used to detect the desired concept in the image. Our proposed method is simple and training-free, yet achieves over 16.81% mAP improvement on ODinW and 59.85% mAP improvement on LVIS compared to state-of-the-art models such as GroundingDINO. We further show that our approach of basing object detection on a set of Web-retrieved exemplars is stable with respect to variations in the exemplars, suggesting a path towards eliminating costly data annotation and training procedures.
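One plausible reading of the input-image-weighted query above is sketched below: each retrieved exemplar is weighted by its similarity to the input image, and the aggregated negatives are subtracted from the aggregated positives. This is an illustration of the idea, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def exemplar_weighted_query(img_emb: torch.Tensor,   # (D,) input image embedding
                            pos_emb: torch.Tensor,   # (P, D) retrieved positive exemplars
                            neg_emb: torch.Tensor    # (N, D) retrieved negative exemplars
                            ) -> torch.Tensor:
    """Weight exemplars by similarity to the input, then combine positives minus negatives."""
    img = F.normalize(img_emb, dim=-1)
    w_pos = torch.softmax(F.normalize(pos_emb, dim=-1) @ img, dim=0)   # (P,)
    w_neg = torch.softmax(F.normalize(neg_emb, dim=-1) @ img, dim=0)   # (N,)
    return w_pos @ pos_emb - w_neg @ neg_emb    # query vector used to score image regions
```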
Poster
Konstantinos Alexandridis · Ismail Elezi · Jiankang Deng · Anh Nguyen · Shan Luo

[ ExHall D ]

Abstract
Real-world datasets follow an imbalanced distribution, which poses significant challenges in rare-category object detection. Recent studies tackle this problem by developing re-weighting and re-sampling methods that utilise the class frequencies of the dataset. However, these techniques focus solely on the frequency statistics and ignore the distribution of the classes in image space, missing important information. In contrast, we propose Fractal CALibration (FRACAL): a novel post-calibration method for long-tailed object detection. FRACAL devises a logit adjustment method that utilises the fractal dimension to estimate how uniformly classes are distributed in image space. During inference, it uses the fractal dimension to inversely downweight the probabilities of uniformly spaced class predictions, achieving balance in two axes: between frequent and rare categories, and between uniformly spaced and sparsely spaced classes. FRACAL is a post-processing method that does not require any training, and it can be combined with many off-the-shelf models such as one-stage sigmoid detectors and two-stage instance segmentation models. FRACAL boosts the rare class performance by up to 8.6% and surpasses all previous methods on the LVIS dataset, while showing good generalisation to other datasets such as COCO, V3Det and OpenImages. We provide the code in the Appendix.
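The calibration described above is a post-hoc logit adjustment; a generic version in that spirit is sketched below, downweighting classes that are both frequent and uniformly spread in image space. The exact way FRACAL combines frequency and fractal dimension is not reproduced here.

```python
import torch

def fractal_calibrate(logits: torch.Tensor,        # (N, C) per-detection class logits
                      class_freq: torch.Tensor,    # (C,) training-set class frequencies
                      fractal_dim: torch.Tensor,   # (C,) per-class fractal dimension (spatial uniformity)
                      tau: float = 1.0, gamma: float = 1.0) -> torch.Tensor:
    """Generic post-hoc adjustment: subtract log-frequency and log-fractal-dimension terms."""
    return logits - tau * class_freq.log() - gamma * fractal_dim.log()
```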
Poster
Quentin Guimard · Moreno D'Incà · Massimiliano Mancini · Elisa Ricci

[ ExHall D ]

Abstract
A person downloading a pre-trained model from the web should be aware of its biases. Existing approaches for bias identification rely on datasets containing labels for the task of interest, something that a non-expert may not have access to, or may not have the necessary resources to collect, which greatly limits the number of tasks where model biases can be identified. In this work, we develop Classifier-to-Bias (C2B), the first bias discovery framework that works without access to any labeled data: it only relies on a textual description of the classification task to identify biases in the target classification model. This description is fed to a large language model to generate bias proposals and corresponding captions depicting those together with task-specific target labels. A text-to-image retrieval model collects images for those captions, which are then used to assess the accuracy of the model w.r.t. the given biases. C2B is training-free, does not require any annotations, has no constraints on the list of biases, and can be applied to detect biases for any pre-trained model on any classification task. Experiments on two publicly available datasets show that C2B discovers biases beyond those of the original datasets and outperforms a recent state-of-the-art …
Poster
Shihua Huang · Zhichao Lu · Xiaodong Cun · Yongjun YU · Xiao Zhou · Xi Shen

[ ExHall D ]

Abstract
We introduce DEIM, an innovative and efficient training framework designed to accelerate convergence in real-time object detection with Transformer-based architectures (DETR). To mitigate the sparse supervision inherent in one-to-one (O2O) matching in DETR models, DEIM employs a Dense O2O matching strategy. This approach increases the number of positive samples per image by incorporating additional targets, using standard data augmentation techniques. While Dense O2O matching speeds up convergence, it also introduces numerous low-quality matches that could affect performance. To address this, we propose the Matchability-Aware Loss (MAL), a novel loss function that optimizes matches across various quality levels, enhancing the effectiveness of Dense O2O. Extensive experiments on the COCO dataset validate the efficacy of DEIM. When integrated with RT-DETR and D-FINE, it consistently boosts performance while reducing training time by 50%. Notably, paired with RT-DETRv2, DEIM achieves 53.2% AP in a single day of training on an NVIDIA 4090 GPU. Additionally, DEIM-trained real-time models outperform leading real-time object detectors, with DEIM-D-FINE-L and DEIM-D-FINE-X achieving 54.7% and 56.4% AP at 124 and 78 FPS on an NVIDIA T4 GPU, respectively, without the need for additional data. We believe DEIM sets a new baseline for advancements in real-time object detection. Our code will …
Poster
Songlong Xing · Zhengyu Zhao · Nicu Sebe

[ ExHall D ]

Abstract
Despite its prevalent use in image-text matching tasks in a zero-shot manner, CLIP has been shown to be highly vulnerable to adversarial perturbations added onto images. Recent studies propose to finetune the vision encoder of CLIP with adversarial samples generated on the fly, and show improved robustness against adversarial attacks on a spectrum of downstream datasets, a property termed as zero-shot robustness. In this paper, we show that malicious perturbations that seek to maximise the classification loss lead to "falsely stable" images, and propose to leverage the pre-trained vision encoder of CLIP to counterattack such adversarial images during inference to achieve robustness. Our paradigm is simple and training-free, providing the first method to defend CLIP from adversarial attacks at test time, which is orthogonal to existing methods aiming to boost zero-shot adversarial robustness of CLIP. We conduct experiments across 16 classification datasets, and demonstrate stable and consistent gains compared to test-time defence methods adapted from existing adversarial robustness studies that do not rely on external networks, without noticeably impairing performance on clean images. We also show that our paradigm can be employed on CLIP models that have been adversarially finetuned to further enhance their robustness at test time. Our code …
Poster
Zhonghang Liu · Kun Zhou · Changshuo Wang · Daniel Lin · Jiangbo Lu

[ ExHall D ]

Abstract
How many outliers are within an unlabeled and contaminated dataset? Although a series of unsupervised outlier detection (UOD) approaches have been proposed, they cannot correctly answer this critical question, resulting in unstable performance across real-world scenarios with varying contamination factors. To address this problem, we propose FlexUOD, with a novel contamination factor estimation perspective. FlexUOD not only achieves remarkable robustness but is also a general and plug-and-play framework, which can significantly improve the performance of existing UOD methods. Extensive experiments demonstrate that FlexUOD achieves state-of-the-art results as well as high efficacy on diverse evaluation benchmarks.
Poster
Zhaopeng Gu · Bingke Zhu · Guibo Zhu · Yingying Chen · Ming Tang · Jinqiao Wang

[ ExHall D ]

Abstract
Visual Anomaly Detection (VAD) aims to identify abnormal samples in images that deviate from normal patterns, covering multiple domains, including industrial, logical, and medical fields. Due to the domain gaps between these fields, existing VAD methods are typically tailored to each domain, with specialized detection techniques and model architectures that are difficult to generalize across different domains. Moreover, even within the same domain, current VAD approaches often follow a "one-category-one-model" paradigm, requiring large amounts of normal samples to train class-specific models, resulting in poor generalizability and hindering unified evaluation across domains. To address this issue, we propose a generalized few-shot VAD method, UniVAD, capable of detecting anomalies across various domains, such as industrial, logical, and medical anomalies, with a training-free unified model. UniVAD only needs a few normal samples as references during testing to detect anomalies in previously unseen objects without training on the specific domain. Specifically, UniVAD employs a Contextual Component Clustering (C3) module based on clustering and vision foundation models to segment components within the image accurately, and leverages Component-Aware Patch Matching (CAPM) and Graph-Enhanced Component Modeling (GECM) modules to detect anomalies at different semantic levels, which are aggregated to produce the final detection result. We conduct experiments …
Poster
Jinjin Zhang · Guodong Wang · yizhou jin · Di Huang

[ ExHall D ]

Abstract
Anomaly detection is valuable for real-world applications, such as industrial quality inspection. However, most approaches focus on detecting local structural anomalies while neglecting compositional anomalies incorporating logical constraints. In this paper, we introduce LogSAD, a novel multi-modal framework that requires no training for both Logical and Structural Anomaly Detection. First, we propose a match-of-thought architecture that employs advanced large multi-modal models (i.e., GPT-4V) to generate matching proposals, formulating interests and compositional rules of thought for anomaly detection. Second, we elaborate on multi-granularity anomaly detection, consisting of patch tokens, sets of interests, and composition matching with vision and language foundation models. Subsequently, we present a calibration module to align anomaly scores from different detectors, followed by integration strategies for the final decision. Consequently, our approach addresses both logical and structural anomaly detection within a unified framework and achieves state-of-the-art results without the need for training, even when compared to supervised approaches, highlighting its robustness and effectiveness. Code will be made publicly available soon.
Poster
wenbing zhu · Lidong Wang · Ziqing Zhou · Chengjie Wang · Yurui Pan · Ruoyi.Zhang · Zhuhao Chen · Linjie Cheng · Bin-Bin Gao · Jiangning Zhang · Zhenye Gan · Yuxie Wang · Yulong Chen · Bruce Qian · Mingmin Chi · Bo Peng · Lizhuang Ma

[ ExHall D ]

Abstract
The increasing complexity of industrial anomaly detection (IAD) has positioned multimodal detection methods as a focal area of machine vision research. However, dedicated multimodal datasets specifically tailored for IAD remain limited. Pioneering datasets like MVTec 3D have laid essential groundwork in multimodal IAD by incorporating RGB+3D data, but still face challenges in bridging the gap with real industrial environments due to limitations in scale and resolution. To address these challenges, we introduce Real-IAD D³, a high-precision multimodal dataset that uniquely incorporates an additional pseudo-3D modality generated through photometric stereo, alongside high-resolution RGB images and micrometer-level 3D point clouds. Real-IAD D³ comprises industrial components with smaller dimensions and finer defects than existing datasets, offering diverse anomalies across modalities and presenting a more challenging benchmark for multimodal IAD research. With 20 product categories, the dataset offers significantly greater scale and diversity compared to current alternatives. Additionally, we introduce an effective approach that integrates RGB, point cloud, and pseudo-3D depth information to leverage the complementary strengths of each modality, enhancing detection performance. Our experiments highlight the importance of these modalities in boosting detection robustness and overall IAD performance. The Real-IAD D³ dataset will be publicly available to advance research and innovation in multimodal IAD.
Poster
Wu Sheng · Yimi Wang · Xudong Liu · Yuguang Yang · Runqi Wang · Guodong Guo · David Doermann · Baochang Zhang

[ ExHall D ]

Abstract
Feature matching methods for unsupervised anomaly detection have demonstrated impressive performance. Existing methods primarily rely on self-supervised training and handcrafted matching schemes for task adaptation. However, they can only achieve an inferior feature representation for anomaly detection because the feature extraction and matching modules are separately trained. To address these issues, we propose a Differentiable Feature Matching (DFM) framework for joint optimization of the feature extractor and the matching head. DFM transforms nearest-neighbor matching into a pooling-based module and embeds it within a Feature Matching Network (FMN). This design enables end-to-end feature extraction and feature matching module training, thus providing better feature representation for anomaly detection tasks. DFM is generic and can be incorporated into existing feature-matching methods. We implement DFM with various backbones and conduct extensive experiments across various tasks and datasets, demonstrating its effectiveness. Notably, we achieve state-of-the-art results in the continual anomaly detection task with instance-AUROC improvement of up to 3.9% and pixel-AP improvement of up to 5.5%.
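As a rough illustration of turning nearest-neighbour matching into a differentiable, pooling-style operation (the actual FMN design is not given here), the sketch below replaces the hard minimum over memory-bank distances with a soft-min pooling, so gradients can reach the feature extractor end to end:

```python
# Illustrative sketch: a soft-min pooling over distances to a memory bank of normal
# features as a differentiable surrogate for nearest-neighbour matching (the names
# and the bank construction are assumptions).
import torch
import torch.nn.functional as F

def soft_nn_distance(query, bank, tau=0.1):
    d = torch.cdist(query, bank)          # B x M pairwise distances
    w = F.softmin(d / tau, dim=1)         # sharp softmin ~ one-hot on the nearest neighbour
    return (w * d).sum(dim=1)             # pooled (soft-minimum) distance per query feature

bank = torch.randn(128, 64)                        # memory of normal patch features
query = torch.randn(32, 64, requires_grad=True)    # features from the trainable extractor
score = soft_nn_distance(query, bank).mean()
score.backward()                                   # gradients flow back into the feature extractor
print(score.item())
```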
Poster
Xiaoyi Qu · David Aponte · Colby Banbury · Daniel Robinson · Tianyu Ding · Kazuhito Koishida · Ilya Zharkov · Tianyi Chen

[ ExHall D ]

Abstract
Structured pruning and quantization are fundamental techniques used to reduce the size of neural networks, and are typically applied independently. Applying these techniques jointly via co-optimization has the potential to produce smaller, high-quality models. However, existing joint schemas are not widely used because of (1) engineering difficulties (complicated multi-stage processes and hardware inefficiencies), (2) black-box optimization (extensive hyperparameter tuning to control the overall compression), and (3) insufficient architecture generalization. To address these limitations, we present the framework GETA, which automatically and efficiently performs joint structured pruning and quantization-aware training on any deep neural network. GETA introduces three key innovations: (i) a quantization-aware dependency graph analysis that constructs a pruning search space, (ii) a partially projected stochastic gradient method that guarantees a layerwise bit constraint is satisfied, and (iii) a new joint learning strategy that incorporates interpretable relationships between pruning and quantization. We present numerical experiments on both convolutional neural networks and transformer architectures that show that our approach achieves competitive (often superior) performance compared to state-of-the-art joint pruning and quantization methods.
Poster
Xiao Cui · Yulei Qin · Wengang Zhou · Hongsheng Li · Houqiang Li

[ ExHall D ]

Abstract
The demands for increasingly large-scale datasets pose substantial storage and computation challenges to building deep learning models. Dataset distillation methods, especially those based on sample generation techniques, have risen in response, condensing large original datasets into small synthetic ones while preserving critical information. Existing subset synthesis methods simply minimize a homogeneous distance in which uniform contributions from all real instances are allocated to shaping each synthetic sample. We demonstrate that such equal allocation fails to consider the instance-level relationship between each real-synthetic pair and gives rise to insufficient modeling of geometric structural nuances between the distilled and original sets. In this paper, we propose a novel framework named OPTICAL to reformulate the homogeneous distance minimization into a bi-level optimization problem via matching-and-approximating. In the matching step, we leverage an optimal transport matrix to dynamically allocate contributions from real instances. Subsequently, we polish the generated samples in accordance with the established allocation scheme to approximate the real ones. Such a strategy better measures intricate geometric characteristics and handles intra-class variations for high fidelity of data distillation. Extensive experiments across seven datasets and three model architectures demonstrate our method's versatility and effectiveness. Its plug-and-play characteristic makes it compatible with a wide range of distillation frameworks. Codes are available at https://anonymous.4open.science/r/CVPR2025_696.
Poster
Yushuai Sun · Zikun Zhou · Dongmei Jiang · Yaowei Wang · Jun Yu · Guangming Lu · Wenjie Pei

[ ExHall D ]

Abstract
Asymmetric retrieval is a typical scenario in real-world retrieval systems, where compatible models of varying capacities are deployed on platforms with different resource configurations. Existing methods generally train pre-defined networks or subnetworks with capacities specifically designed for pre-determined platforms, using compatible learning. Nevertheless, these methods suffer from limited flexibility for multi-platform deployment. For example, when introducing a new platform into the retrieval systems, developers have to train an additional model at an appropriate capacity that is compatible with existing models via backward-compatible learning. In this paper, we propose a Prunable Network with self-compatibility, which allows developers to generate compatible subnetworks at any desired capacity through post-training pruning. Thus it allows the creation of a sparse subnetwork matching the resources of the new platform without additional training. Specifically, we optimize both the architecture and weight of subnetworks at different capacities within a dense network in compatible learning. We also design a conflict-aware gradient integration scheme to handle the gradient conflicts between the dense network and subnetworks during compatible learning. Extensive experiments on diverse benchmarks and visual backbones demonstrate the effectiveness of our method. The code will be made publicly available.
Poster
Biqing Qi · Fangyuan Li · Zhen Wang · Junqi Gao · Dong Li · Peng Ye · Bowen Zhou

[ ExHall D ]

Abstract
As an effective approach to equip models with multi-task capabilities without additional training, model merging has garnered significant attention. However, existing merging methods face challenges of redundant parameter conflicts and the excessive storage burden of fine-tuned parameters. In this work, through controlled experiments, we reveal that for fine-tuned task vectors, only those parameters with magnitudes above a certain threshold contribute positively to the task, exhibiting a pulse-like characteristic. We then leverage this pulse-like characteristic to binarize the task vectors and reduce storage overhead. Further controlled experiments show that the binarized task vectors incur almost no decrease in fine-tuning and merging performance, and even exhibit stronger performance improvements as the proportion of redundant parameters increases. Based on these insights, we propose Task Switch (T-Switch), which decomposes task vectors into three components: 1) an activation switch instantiated by a binarized mask vector, 2) a polarity switch instantiated by a binarized sign vector, and 3) a scaling knob instantiated by a scalar coefficient. By storing task vectors in a binarized form, T-Switch alleviates parameter conflicts while ensuring efficient task parameter storage. To enable automated switch combination in T-Switch, we further introduce Auto-Switch, which enables training-free switch combination via retrieval from a …
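A minimal sketch of the described decomposition, under the assumption that the activation switch is obtained by a simple top-k magnitude threshold and the scaling knob is the mean magnitude of the kept entries (the paper's exact recipe may differ):

```python
# Illustrative sketch of decomposing a task vector into an activation switch (mask),
# a polarity switch (sign), and a scaling knob; the top-k thresholding and the mean
# magnitude as the scale are assumptions.
import torch

def to_switch(task_vector, keep_ratio=0.1):
    k = max(1, int(keep_ratio * task_vector.numel()))
    thresh = task_vector.abs().flatten().kthvalue(task_vector.numel() - k + 1).values
    mask = task_vector.abs() >= thresh         # activation switch (1 bit per weight)
    sign = torch.sign(task_vector)             # polarity switch (1 bit per weight)
    scale = task_vector[mask].abs().mean()     # scaling knob (a single scalar)
    return mask, sign, scale

def from_switch(mask, sign, scale):
    return scale * sign * mask.float()         # approximate task vector for merging

tv = torch.randn(10_000)                       # toy task vector (finetuned - pretrained weights)
mask, sign, scale = to_switch(tv)
merged_update = 0.3 * from_switch(mask, sign, scale)   # e.g. scaled before adding to the base model
print(mask.float().mean().item(), merged_update.shape)
```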
Poster
Carlos Garrido-Munoz · Jorge Calvo-Zaragoza

[ ExHall D ]

Abstract
Recent advances in Handwritten Text Recognition (HTR) have led to significant reductions in transcription errors on standard benchmarks under the i.i.d. assumption, thus focusing on minimizing in-distribution (ID) errors. However, this assumption does not hold in real-world applications, which has motivated HTR research to explore Transfer Learning and Domain Adaptation techniques. In this work, we investigate the unaddressed limitations of HTR models in generalizing to out-of-distribution (OOD) data. We adopt the challenging setting of Domain Generalization, where models are expected to generalize to OOD data without any prior access. To this end, we analyze 336 OOD cases from eight state-of-the-art HTR models across seven widely used datasets, spanning five languages. Additionally, we study how HTR models leverage synthetic data to generalize. We reveal that the most significant factor for generalization lies in the textual divergence between domains, followed by visual divergence. We demonstrate that the error of HTR models in OOD scenarios can be reliably estimated, with discrepancies falling below 10 points in 70% of cases. We identify the underlying limitations of HTR models, laying the foundation for future research to address this challenge.
Poster
Tao Sun · Yuhao Huang · Li Shen · Kele Xu · Bao Wang

[ ExHall D ]

Abstract
Weight decay is a widely used technique in training machine learning models, known to empirically enhance the generalization of Stochastic Gradient Descent (SGD). While intuitively weight decay allows SGD to train a regularized model rather than the original one, there is limited theoretical understanding of why SGD with weight decay (SGDW) yields results consistent with the unregularized model, or how weight decay improves generalization. This paper establishes a convergence theory for SGDW in the context of the unregularized model, under weaker assumptions than previous analyses of weight decay. Our theory demonstrates that weight decay does not accelerate the convergence of SGD. For generalization, we provide the first theoretical proof of weight decay's benefit in nonconvex optimization. Additionally, we extend our results to sign-based stochastic gradient algorithms, such as SignSGD. Numerical experiments on classical benchmarks validate our theoretical findings.
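For reference, a toy sketch of the update analysed here, SGD with (coupled) weight decay, where the gradient is taken on the unregularized objective; the paper's precise assumptions and step-size schedules are not reproduced:

```python
# Toy sketch of SGD with weight decay: w <- w - lr * (g + lam * w), where g is a
# stochastic gradient of the unregularized objective (step sizes here are illustrative).
import torch

def sgdw_step(w, grad, lr=0.1, lam=1e-2):
    return w - lr * (grad + lam * w)

w = torch.randn(5)
for _ in range(100):
    g = (w - 1.0) + 0.01 * torch.randn(5)   # noisy gradient of 0.5 * ||w - 1||^2
    w = sgdw_step(w, g)
print(w)                                     # shrinks toward, but not exactly onto, the optimum at 1
```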
Poster
Yusong Hu · Zichen Liang · Fei Yang · Qibin Hou · Xialei Liu · Ming-Ming Cheng

[ ExHall D ]

Abstract
Continual learning requires models to train continuously across consecutive tasks without forgetting. Most existing methods utilize linear classifiers, which struggle to maintain a stable classification space while learning new tasks. Inspired by the success of Kolmogorov-Arnold Networks (KAN) in preserving learning stability during simple continual regression tasks, we set out to explore their potential in more complex continual learning scenarios. In this paper, we introduce the Kolmogorov-Arnold Classifier (KAC), a novel classifier developed for continual learning based on the KAN structure. We delve into the impact of KAN's spline functions and introduce Radial Basis Functions (RBF) for improved compatibility with continual learning. We replace linear classifiers with KAC in several recent approaches and conduct experiments across various continual learning benchmarks, all of which demonstrate performance improvements, highlighting the effectiveness and robustness of KAC in continual learning.
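A minimal sketch of an RBF-based, KAN-style classifier head of the kind KAC builds on, replacing a linear classifier; the grid placement, input normalisation, and the exact KAC design are assumptions made for illustration:

```python
# Illustrative RBF-based classifier head (KAN-style edge functions summarised as a
# shared RBF expansion per input feature followed by a linear read-out); the grid and
# initialisation are assumptions.
import torch
import torch.nn as nn

class RBFClassifier(nn.Module):
    def __init__(self, in_dim, num_classes, num_centers=8, feat_range=3.0):
        super().__init__()
        centers = torch.linspace(-feat_range, feat_range, num_centers)
        self.register_buffer("centers", centers)                 # shared RBF grid per feature
        self.gamma = float(1.0 / (centers[1] - centers[0]) ** 2)
        self.weight = nn.Parameter(0.01 * torch.randn(num_classes, in_dim * num_centers))

    def forward(self, x):                                         # x: B x in_dim
        phi = torch.exp(-self.gamma * (x.unsqueeze(-1) - self.centers) ** 2)  # B x in_dim x centers
        return phi.flatten(1) @ self.weight.t()                   # B x num_classes

head = RBFClassifier(in_dim=512, num_classes=10)
print(head(torch.randn(4, 512)).shape)                            # torch.Size([4, 10])
```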
Poster
Xuan Liu · Xiaobin Chang

[ ExHall D ]

Abstract
In continual learning (CL), catastrophic forgetting often arises due to feature drift. This challenge is particularly prominent in the exemplar-free continual learning (EFCL) setting, where samples from previous tasks cannot be retained. Therefore, the model struggles to maintain prior knowledge, leading to a more significant performance drop on older tasks. To ensure consistent representations across tasks, it is vital to mitigate feature drift. Some EFCL methods aim to identify feature spaces that minimize the impact on previous tasks while accommodating new ones. However, they rely on static features or outdated statistics from old tasks, which prevents them from capturing the dynamic evolution of the feature space in CL, leading to performance degradation. In this paper, we introduce the Drift-Resistant Space (DRS), which effectively handles feature drift without requiring explicit feature modeling or the storage of previous tasks. A novel parameter-efficient fine-tuning method called Low-Rank Adaptation Subtraction (LoRA) is proposed to develop the DRS. This method subtracts the LoRA weights of old tasks from the initial pre-trained weight before processing new task data to establish the DRS for model training. Therefore, LoRA enhances stability, improves efficiency, and simplifies implementation. Furthermore, stabilizing feature drift allows for better plasticity by learning with a triplet loss. Extensive experiments across multiple datasets show that our …
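A minimal sketch of the LoRA-subtraction idea as described: the accumulated low-rank updates of old tasks are subtracted from the pretrained weight to form the starting point for the new task. The layer choice, ranks, and scaling below are illustrative assumptions:

```python
# Illustrative sketch: subtract the accumulated LoRA updates of old tasks from the
# pretrained weight to obtain a drift-resistant starting point, then train a fresh
# LoRA pair for the new task (dimensions, ranks, and scaling are assumptions).
import torch

d, r = 768, 8
W_pre = torch.randn(d, d)                                 # pretrained weight of one layer

old_loras = [(0.01 * torch.randn(d, r), 0.01 * torch.randn(r, d)) for _ in range(2)]
delta_old = sum(B @ A for B, A in old_loras)              # accumulated old-task update
W_drs = W_pre - delta_old                                 # drift-resistant space

B_new = torch.zeros(d, r, requires_grad=True)             # new-task LoRA, trained on top of W_drs
A_new = (0.01 * torch.randn(r, d)).requires_grad_()
W_effective = W_drs + B_new @ A_new
print(W_effective.shape)
```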
Poster
Chenggong Ni · Fan Lyu · Jiayao Tan · Fuyuan Hu · Rui Yao · Tao Zhou

[ ExHall D ]

Abstract
This paper introduces Topological Consistency Adaptation (TCA), a novel approach to Continual Test-time Adaptation (CTTA) that addresses the challenges of domain shifts and error accumulation in testing scenarios. TCA ensures the stability of inter-class relationships by enforcing a class topological consistency constraint, which minimizes the distortion of class centroids and preserves the topological structure during continuous adaptation. Additionally, we propose an intra-class compactness loss to maintain compactness within classes, indirectly supporting inter-class stability. To further enhance model adaptation, we introduce a batch imbalance topology weighting mechanism that accounts for class distribution imbalances within each batch, optimizing centroid distances and stabilizing the inter-class topology. Experiments show that our method demonstrates improvements in handling continuous domain shifts, ensuring stable feature distributions and boosting predictive performance.
Poster
Juntae Lee · Munawar Hayat · Sungrack Yun

[ ExHall D ]

Abstract
Few-shot class incremental learning (FSCIL) enables the continual learning of new concepts with only a few training examples. In FSCIL, the model undergoes substantial updates, making it prone to forgetting previous concepts and overfitting to the limited new examples. The most recent trend is to disentangle representation learning from the classification head of the model. A well-generalized feature extractor on the base classes (many examples and many classes) is learned, and then fixed during incremental learning. Arguing that the fixed feature extractor restricts the model's adaptability to new classes, we introduce a novel FSCIL method to effectively address catastrophic forgetting and overfitting issues. Our method enables seamless updates of the entire model with only a few examples. We mainly propose a tripartite weight-space ensemble (Tri-WE). Tri-WE interpolates the base, immediately previous, and current models in weight space, especially for the classification heads of the models. Then, it collaboratively maintains knowledge from the base and previous models. In addition, we recognize the challenge of distilling generalized representations from the previous model using scarce data. Hence, we suggest a regularization loss term using amplified data knowledge distillation. By simply intermixing the few-shot data, we can produce richer data, enabling the distillation of …
Poster
Seong-Hyeon Hwang · Minsu Kim · Steven Euijong Whang

[ ExHall D ]

Abstract
We study model confidence calibration in class-incremental learning, where models learn from sequential tasks with different class sets. While existing works primarily focus on accuracy, maintaining calibrated confidence has been largely overlooked. Unfortunately, most post-hoc calibration techniques are not designed to work with the limited memories of old-task data typical in class-incremental learning, as retaining a sufficient validation set would be impractical. Thus, we propose T-CIL, a novel temperature scaling approach for class-incremental learning without a validation set for old tasks, that leverages adversarially perturbed exemplars from memory. Directly using exemplars is inadequate for temperature optimization, since they are already used for training. The key idea of T-CIL is to perturb exemplars more strongly for old tasks than for the new task by adjusting the perturbation direction based on feature distance, with the single magnitude determined using the new-task validation set. This strategy makes the perturbation magnitude computed from the new task also applicable to old tasks, leveraging the tendency that the accuracy of old tasks is lower than that of the new task. We empirically show that T-CIL significantly outperforms various baselines in terms of calibration on real datasets and can be integrated with existing class-incremental learning techniques with …
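T-CIL's contribution lies in how the calibration set is built (adversarially perturbed exemplars with a magnitude tuned on the new task), which is not reproduced here. The sketch below only shows the underlying temperature-scaling step it adapts, fitting a single temperature on a set of logits and labels:

```python
# Illustrative temperature-scaling step: fit a single temperature T by minimising the
# NLL of softmax(logits / T) on a calibration set (here random toy logits; in T-CIL
# this set would come from perturbed exemplars, which is not reproduced).
import torch
import torch.nn.functional as F

def fit_temperature(logits, labels, iters=200, lr=0.05):
    log_t = torch.zeros(1, requires_grad=True)             # optimise log T so T stays positive
    opt = torch.optim.Adam([log_t], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        F.cross_entropy(logits / log_t.exp(), labels).backward()
        opt.step()
    return log_t.exp().item()

logits = 3.0 * torch.randn(256, 10)                        # toy over-confident logits
labels = torch.randint(0, 10, (256,))
print("fitted temperature:", fit_temperature(logits, labels))
```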
Poster
Aodi Li · Liansheng Zhuang · Xiao Long · MingHong Yao · Shafei Wang

[ ExHall D ]

Abstract
Domain generalization aims to learn a model from multiple training domains and generalize it to unseen test domains. Recent theory has shown that seeking deep models whose parameters lie in the flat minima of the loss landscape can significantly reduce the out-of-domain generalization error. However, existing methods often neglect the consistency of loss landscapes in different domains, resulting in models that are not simultaneously in the optimal flat minima in all domains, which limits their generalization ability. To address this issue, this paper proposes an iterative Self-Feedback Training (SFT) framework to seek consistent flat minima that are shared across different domains by progressively refining loss landscapes during training. It alternately generates a feedback signal by measuring the inconsistency of loss landscapes in different domains and refines these loss landscapes for greater consistency using this feedback signal. Benefiting from the consistency of the flat minima within these refined loss landscapes, our SFT helps achieve better out-of-domain generalization. Extensive experiments on DomainBed demonstrate the superior performance of SFT when compared to state-of-the-art sharpness-aware methods and other prevalent DG baselines. On average across five DG benchmarks, SFT surpasses sharpness-aware minimization by 2.6% with ResNet-50 and 1.5% with ViT-B/16, respectively. The code will …
Poster
Dongkyu Cho · Inwoo Hwang · Sanghack Lee

[ ExHall D ]

Abstract
Data augmentation is a popular tool for single source domain generalization, which expands the source domain by generating simulated ones, improving generalization on unseen target domains. In this work, we show that the performance of such augmentation-based methods in the target domains universally fluctuates during training, posing challenges in model selection under realistic scenarios. We argue that the fluctuation stems from the inability of the model to accumulate the knowledge learned from diverse augmentations, exacerbating feature distortion during training. Based on this observation, we propose a novel generalization method, coined Parameter-Space Ensemble with Entropy Regularization (PEER), that uses a proxy model to learn the augmented data on behalf of the main model. The main model is updated by averaging its parameters with the proxy model, progressively accumulating knowledge over the training steps. Maximizing the mutual information between the output representations of the two models guides the learning process of the proxy model, mitigating feature distortion during training. Experimental results demonstrate the effectiveness of PEER in reducing the OOD performance fluctuation and enhancing generalization across various datasets, including PACS, Digits, Office-Home, and VLCS. Notably, our method with simple random augmentation achieves state-of-the-art performance, surpassing prior approaches on sDG that utilize complex …
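A minimal sketch of the weight-space averaging at the core of PEER, with a proxy model trained on augmented batches and the main model updated as a running parameter average; the mutual-information regularisation and the exact averaging schedule are omitted because they are not specified here:

```python
# Illustrative sketch of the parameter-space averaging: the proxy learns from augmented
# batches and the main model accumulates a running average of the proxy's weights
# (the mutual-information term and the averaging schedule are omitted/assumed).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

main = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
proxy = copy.deepcopy(main)
opt = torch.optim.SGD(proxy.parameters(), lr=0.1)

@torch.no_grad()
def average_into_main(step):
    # running average: main <- (step * main + proxy) / (step + 1)
    for p_main, p_proxy in zip(main.parameters(), proxy.parameters()):
        p_main.mul_(step / (step + 1)).add_(p_proxy, alpha=1 / (step + 1))

for step in range(1, 6):
    x, y = torch.randn(16, 32), torch.randint(0, 10, (16,))   # stand-in augmented batch
    opt.zero_grad()
    F.cross_entropy(proxy(x), y).backward()                   # proxy learns the augmented data
    opt.step()
    average_into_main(step)                                   # main accumulates in weight space
print(sum(p.abs().sum() for p in main.parameters()).item())
```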
Poster
Marzi Heidari · Abdullah Alchihabi · Hao Yan · Yuhong Guo

[ ExHall D ]

Abstract
In this work, we introduce a novel problem setup termed as Heterogeneous Semi-Supervised Learning (HSSL), which presents unique challenges by bridging the semi-supervised learning (SSL) task and the unsupervised domain adaptation (UDA) task, and expanding standard semi-supervised learning to cope with heterogeneous training data. At its core, HSSL aims to learn a prediction model using a combination of labeled and unlabeled training data drawn separately from heterogeneous domains that share a common set of semantic categories; this model is intended to differentiate the semantic categories of test instances sampled from both the labeled and unlabeled domains. In particular, the labeled and unlabeled domains have dissimilar label distributions and class feature distributions. This heterogeneity, coupled with the assorted sources of the test data, introduces significant challenges to standard SSL and UDA methods. Therefore, we propose a novel method, Unified Framework for Heterogeneous Semi-supervised Learning (Uni-HSSL), to address HSSL by directly learning a fine-grained classifier from the heterogeneous data, which adaptively handles the inter-domain heterogeneity while leveraging both the unlabeled data and the inter-domain semantic class relationships for cross-domain knowledge transfer and adaptation. We conduct comprehensive experiments and the experimental results validate the efficacy and superior performance of the proposed Uni-HSSL over …
Poster
Bo Cheng · Jueqing Lu · Yuan Tian · Haifeng Zhao · Yi Chang · Lan Du

[ ExHall D ]

Abstract
Semi-supervised learning (SSL) has garnered significant attention due to its ability to leverage limited labeled data and a large amount of unlabeled data to improve model generalization performance. Recent approaches achieve impressive successes by combining ideas from both consistency regularization and pseudo-labeling. However, these methods tend to underperform in more realistic situations with relatively scarce labeled data. We argue that this issue arises because existing methods rely solely on the model's confidence, making it challenging to accurately assess the model's state and to identify unlabeled examples that contribute to training when supervision information is limited, especially during the early stages of model training. In this paper, we propose a novel SSL model called CGMatch, which, for the first time, incorporates a new metric known as Count-Gap (CG). We demonstrate that CG is effective in discovering unlabeled examples beneficial for model training. Along with confidence, a commonly used metric in SSL, we propose a fine-grained dynamic selection (FDS) strategy. This strategy dynamically divides the unlabeled dataset into three subsets with different characteristics: an easy-to-learn set, an ambiguous set, and a hard-to-learn set. By selectively filtering these subsets and applying corresponding regularization to the selected subsets, we mitigate the negative impact of incorrect pseudo-labels on model …
Poster
Yucong Dai · Shilin Gu · Ruidong Fan · Chao Xu · Chenping Hou

[ ExHall D ]

Abstract
Label shift, which investigates the adaptation of label distributions between the fixed source and target domains, has attracted significant research interest and broad applications in offline settings. In real-world scenarios, however, data often arrives as a continuous stream. Addressing label shift in online learning settings is paramount. Existing strategies, which tailor traditional offline label shift techniques to online settings, have degraded performance due to the inconsistent estimation of label distributions and violation of the convexity assumption required for theoretical guarantees. In this paper, we propose a novel method to ensure consistent adaptation to online label shift. We construct a new convex risk estimator that is pivotal for both online optimization and theoretical analysis. Furthermore, we enhance an optimistic online algorithm as the base learner and refine the classifier using an ensemble method. Theoretically, we derive a universal dynamic regret bound that is minimax optimal. Extensive experiments on both real-world datasets and a human motion task demonstrate the superiority of our method compared to existing methods.
Poster
Zhuo Xu · Xiang Xiang · Yifan Liang

[ ExHall D ]

Abstract
Vision-language models (VLMs), such as CLIP, have shown remarkable capabilities in downstream tasks. However, the coupling of semantic information between the foreground and the background in images leads to significant shortcut issues that adversely affect out-of-distribution (OOD) detection abilities. When confronted with a background OOD sample, VLMs are prone to misidentifying it as in-distribution (ID) data. In this paper, we analyze the OOD problem from the perspective of shortcuts in VLMs and propose OSPCoOp, which includes background decoupling and mask-guided region regularization. We first decouple images into ID-relevant and ID-irrelevant regions and utilize the latter to generate a large number of augmented OOD background samples as pseudo-OOD supervision. We then use the masks from background decoupling to adjust the model's attention, minimizing its focus on ID-irrelevant regions. To assess the model's robustness against background interference, we introduce a new OOD evaluation dataset, ImageNet-Bg, which solely consists of background images with all ID-relevant regions removed. Our method demonstrates exceptional performance in few-shot scenarios, achieving strong results even in the one-shot setting, and outperforms existing methods.
Poster
Yuhang Liu · Wenjie Zhao · Yunhui Guo

[ ExHall D ]

Abstract
Task incremental learning (TIL) is a specific form of continual learning (CL), wherein the model is trained on a set of distinguishable tasks. However, current TIL methodologies are predicated on the closed-world assumption, which posits that test data remains in-distribution (ID). When deployed in an open-world scenario, test samples can be from out-of-distribution (OOD) sources. Current OOD detection methods primarily rely on model outputs, leading to an over-dependence on model performance. Additionally, a threshold is required to distinguish between ID and OOD, limiting their practical application. Moreover, these methods can only achieve coarse-grained binary classification and cannot obtain task identity. To address this, we propose Hierarchical Two-sample Tests (H2ST), which is compatible with any existing replay-based TIL framework. H2ST eliminates the necessity for thresholds by employing hypothesis testing, while leveraging feature maps to harness the model's capabilities without excessive dependence on model performance. The proposed hierarchical architecture incorporates a task-level detection mechanism, simplifying classification for individual classifiers. Extensive experiments and analysis demonstrate the effectiveness of H2ST in open-world TIL scenarios and its superiority over existing methods.
Poster
Litian Liu · Yao Qin

[ ExHall D ]

Abstract
Out-of-Distribution (OOD) detection is critical for safe deployment; however, existing detectors often struggle to generalize across datasets of varying scales and model architectures, and some can incur high computational costs in real-world applications. Inspired by the phenomenon of Neural Collapse, we propose a versatile and efficient OOD detection method. Specifically, we re-characterize prior observations that in-distribution (ID) samples form clusters, demonstrating that, with appropriate centering, these clusters align closely with model weight vectors. Additionally, we reveal that ID features tend to expand into a simplex Equiangular Tight Frame, explaining the common observation that ID features are situated farther from the origin than OOD features. Incorporating both insights from Neural Collapse, our OOD detector leverages feature proximity to weight vectors and complements this approach by using feature norms to effectively filter out OOD samples. Extensive experiments on off-the-shelf models demonstrate the robustness of our OOD detector across diverse scenarios, mitigating generalization discrepancies and enhancing overall performance, with inference latency comparable to that of the basic softmax-confidence detector.
Poster
Chenhe Hao · Weiying Xie · Daixun Li · Haonan Qin · Hangyu Ye · Leyuan Fang · Yunsong Li

[ ExHall D ]

Abstract
Federated Learning (FL) is an emerging direction in distributed machine learning that enables jointly training a model without sharing the data. However, as the size of datasets grows exponentially, the computational costs of FL increase. In this paper, we propose the first Coreset Selection criterion for Federated Learning (FedCS) by exploring the Distance Contrast (DC) in feature space. Our FedCS is inspired by the discovery that DC can indicate the intrinsic properties inherent to samples regardless of the network. Based on this observation, we develop a method that is mathematically formulated to prune samples with high DC. The principle behind our pruning is that high-DC samples either contain less information or represent rare extreme cases, so removing them can enhance aggregation performance. Besides, we experimentally show that samples with low DC usually contain substantial information and reflect the common features of samples within their classes, such that they are suitable for constructing the coreset. With only two linear-logarithmic-complexity operations, FedCS leads to significant improvements over methods that use the whole dataset in terms of computational cost, with similar accuracy. For example, on the CIFAR-10 dataset with Dirichlet coefficient α=0.1, FedCS achieves 58.88% accuracy using only 44% of …
Poster
Hao Zheng · Zhigang Hu · Boyu Wang · Liu Yang · Meiguang Zheng · Aikun Xu

[ ExHall D ]

Abstract
Server aggregation conflict is a key challenge in personalized federated learning (PFL). While existing PFL methods have achieved significant progress with shallow base models (e.g., four-layer CNNs), they often overlook the negative impacts of deeper base models on personalization mechanisms. In this paper, we identify the phenomenon of deep model degradation in PFL, where, as base model depth increases, the model becomes more sensitive to local client data distributions, thereby exacerbating server aggregation conflicts and ultimately reducing overall model performance. Moreover, we show that these conflicts manifest in insufficient global average updates and mutual constraints between clients. Motivated by our analysis, we propose a two-stage conflict-aware layer-wise mitigation algorithm, which first constructs a conflict-free global update to alleviate negative conflicts, and then alleviates the conflicts between clients through a conflict-aware strategy. Notably, our method naturally leads to a selective mechanism that balances the tradeoff between the clients involved in aggregation and the tolerance for conflicts. Consequently, it can boost the positive contribution of even those clients with the greatest conflicts with the global update. Extensive experiments across multiple datasets and deeper base models demonstrate that FedCALM outperforms four state-of-the-art (SOTA) methods by up to 9.88% and seamlessly integrates into existing PFL methods with …
Poster
Yueqi Xie · Minghong Fang · Neil Zhenqiang Gong

[ ExHall D ]

Abstract
Model poisoning attacks are critical security threats to Federated Learning (FL). Existing model poisoning attacks suffer from two key limitations: 1) they achieve suboptimal effectiveness when defenses are deployed, and/or 2) they require knowledge of the model updates or local training data on genuine clients. In this work, we make a key observation that their suboptimal effectiveness arises from only leveraging model-update consistency among malicious clients within individual training rounds, making the attack effect self-cancel across training rounds. In light of this observation, we propose PoisonedFL, which enforces multi-round consistency among the malicious clients' model updates while not requiring any knowledge about the genuine clients. Our empirical evaluation on five benchmark datasets shows that PoisonedFL breaks eight state-of-the-art defenses and outperforms seven existing model poisoning attacks. Our study shows that FL systems are considerably less robust than previously thought, underlining the urgency for the development of new defense mechanisms.
Poster
Zihan Tan · Guancheng Wan · Wenke Huang · Guibin Zhang · He Li · Carl Yang · Mang Ye

[ ExHall D ]

Abstract
Federated Graph Learning (FGL) has emerged as a solution to address real-world privacy concerns and data silos in graph learning, which relies on Graph Neural Networks (GNNs). Nevertheless, the homophily level discrepancies within the local graph data of clients, termed homophily heterogeneity, significantly degrade the generalizability of a global GNN. Existing research ignores this issue and suffers from unpromising collaboration. In this paper, we propose FedSPA, an effective hyperparameter-free framework that addresses homophily heterogeneity from the perspectives of homophily conflict and homophily bias, concepts that have yet to be defined or explored. First, homophily conflict arises when training on inconsistent homophily levels across clients. Correspondingly, we propose Subgraph Feature Propagation Decoupling (SFPD), thereby achieving collaboration on unified homophily levels across clients. To further address homophily bias, we design Homophily Bias-Driven Aggregation (HBDA), which emphasizes clients with lower biases. It enables the adaptive adjustment of each client's contribution to the global GNN based on its homophily bias. The superiority of FedSPA is validated through extensive experiments.
Poster
Wang Yu-Hang · Junkang Guo · Aolei Liu · Kaihao Wang · Zaitong Wu · Zhenyu Liu · Wenfei Yin · Jian Liu

[ ExHall D ]

Abstract
Adversarial robustness remains a significant challenge in deploying deep neural networks for real-world applications. While adversarial training is widely acknowledged as a promising defense strategy, most existing studies primarily focus on balanced datasets, neglecting the fact that real-world data often exhibit a long-tailed distribution, which introduces substantial challenges to robustness. In this paper, we provide an in-depth analysis of adversarial training in the context of long-tailed distributions and identify the limitations of the current state-of-the-art method, AT-BSL, in achieving robust performance under such conditions. To address these challenges, we propose a novel training framework, TAET, which incorporates an initial stabilization phase followed by a stratified, equalization adversarial training phase. Furthermore, prior work on long-tailed robustness has largely overlooked a crucial evaluation metric: balanced accuracy. To fill this gap, we introduce the concept of balanced robustness, a comprehensive metric that measures robustness specifically under long-tailed distributions. Extensive experiments demonstrate that our method outperforms existing advanced defenses, yielding significant improvements in both memory and computational efficiency. We believe this work represents a substantial step forward in tackling robustness challenges in real-world applications. Supplementary material contains our code.
Poster
WEIWEI LI · Junzhuo Liu · Yuanyuan Ren · Yuchen Zheng · Yahao Liu · Wen Li

[ ExHall D ]

Abstract
Deep learning models are known to often learn features that spuriously correlate with the class label during training but are irrelevant to the prediction task. Existing methods typically address this issue by annotating potential spurious attributes, or by filtering spurious features based on some empirical assumptions (e.g., simplicity of bias). However, these methods may yield unsatisfying performance due to the intricate and elusive nature of spurious correlations in real-world data. In this paper, we propose a data-oriented approach to mitigate spurious correlation in deep learning models. We observe that samples that are influenced by spurious features tend to exhibit a dispersed distribution in the learned feature space. This allows us to identify the presence of spurious features. Subsequently, we obtain a bias-invariant representation by neutralizing the spurious features based on a simple grouping strategy. Then, we learn a feature transformation to eliminate the spurious features by aligning with this bias-invariant representation. Finally, we update the classifier by incorporating the learned feature transformation and obtain an unbiased model. By integrating the aforementioned identifying, neutralizing, eliminating and updating procedures, we build an effective pipeline for mitigating spurious correlation. Experiments on four image and NLP debiasing benchmarks and one medical dataset demonstrate the effectiveness of …
Poster
Jinxu Lin · Linwei Tao · Minjing Dong · Chang Xu

[ ExHall D ]

Abstract
Model calibration is essential for ensuring that the predictions of deep neural networks accurately reflect true probabilities in real-world classification tasks. However, deep networks often produce over-confident or under-confident predictions, leading to miscalibration. Various methods have been proposed to address this issue by designing effective loss functions for calibration, such as focal loss. In this paper, we analyze its effectiveness and provide a unified loss framework covering focal loss and its variants, where we mainly attribute their superiority in model calibration to the loss weighting factor that estimates sample-wise uncertainty. Based on our analysis, existing loss functions fail to achieve optimal calibration performance due to two main issues: misalignment in optimization and insufficient precision in uncertainty estimation. Specifically, focal loss cannot align sample uncertainty with gradient scaling, and a single logit cannot indicate the uncertainty. To address these issues, we reformulate the optimization from the perspective of gradients, which focuses on uncertain samples. Meanwhile, we propose to use the Brier Score as the loss weight factor, which provides a more accurate uncertainty estimation via all the logits. Extensive experiments on various models and datasets demonstrate that our method achieves state-of-the-art (SOTA) performance.
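A minimal sketch of the stated idea of weighting the loss by a per-sample Brier score computed from all the logits, in place of focal loss's confidence-based factor; the gradient-level reformulation and any tempering of the weight are not reproduced:

```python
# Illustrative Brier-score-weighted cross-entropy: the per-sample Brier score, computed
# from all logits, replaces focal loss's (1 - p_t) weighting factor (the detach and the
# lack of any tempering are assumptions).
import torch
import torch.nn.functional as F

def brier_weighted_ce(logits, targets):
    probs = F.softmax(logits, dim=1)
    onehot = F.one_hot(targets, num_classes=logits.shape[1]).float()
    brier = ((probs - onehot) ** 2).sum(dim=1)               # per-sample Brier score
    ce = F.cross_entropy(logits, targets, reduction="none")  # per-sample cross-entropy
    return (brier.detach() * ce).mean()                      # uncertain samples get larger weight

logits = torch.randn(8, 10, requires_grad=True)
targets = torch.randint(0, 10, (8,))
loss = brier_weighted_ce(logits, targets)
loss.backward()
print(loss.item())
```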
Poster
Wei Liu · Yufei Chen · Xiaodong Yue

[ ExHall D ]

Abstract
Trusted multi-view classification (TMVC) addresses variations in data quality by evaluating the reliability of each view based on prediction uncertainty at the evidence level, reducing the impact of low-quality views commonly encountered in real-world scenarios. However, existing TMVC methods often struggle to maintain robustness during testing, particularly when integrating noisy or corrupted views. This limitation arises because the evidence collected by TMVC may be unreliable, frequently providing incorrect information due to complex view distributions and optimization challenges, ultimately leading to classification performance degradation. To enhance the robustness of TMVC methods in real-world conditions, we propose a generalized evidence filtering mechanism that is compatible with various fusion strategies commonly used in TMVC, including Belief Constraint Fusion, Aleatory Cumulative Belief Fusion, and Averaging Belief Fusion. Specifically, we frame the identification of unreliable evidence as a multiple testing problem and introduce p-values to control the risk of false identification. By selectively down-weighting unreliable evidence during testing, our mechanism ensures robust fusion and mitigates performance degradation. Both theoretical guarantees and empirical results demonstrate significant improvements in the classification performance of TMVC methods, supporting their reliable application in challenging, real-world environments.
Poster
Zhibin Dong · Meng Liu · Siwei Wang · KE LIANG · Yi Zhang · Suyuan Liu · Jiaqi Jin · Xinwang Liu · En Zhu

[ ExHall D ]

Abstract
Multi-view clustering aims to improve clustering accuracy by effectively integrating complementary information from multiple perspectives. However, existing methods often encounter challenges such as feature conflicts between views and insufficient enhancement of individual view features, which hinder clustering performance. To address these challenges, we propose a novel framework, EPFMVC, which integrates feature enhancement with progressive fusion to more effectively align multi-view data. Specifically, we introduce two key innovations: (1) a Feature Channel Attention Encoder (FCAencoder), which adaptively enhances the most discriminative features in each view, and (2) a View Graph-based Progressive Fusion Mechanism, which constructs a view graph using optimal transport (OT) distance to progressively fuse similar views while minimizing inter-view conflicts. By leveraging multi-head attention, the fusion process gradually integrates complementary information, ensuring more consistent and robust shared representations. These innovations enable superior representation learning and effective fusion across views. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art techniques, achieving notable improvements in multi-view clustering tasks across various datasets and evaluation metrics.
Poster
Zheming Xu · He Liu · Congyan Lang · Tao Wang · Yidong Li · Michael C. Kampffmeyer

[ ExHall D ]

Abstract
Recent graph-based multi-view clustering (GMVC) methods typically encode view features into high-dimensional spaces and construct graphs based on distance similarity. However, the high dimensionality of the embeddings often leads to the hubness problem, where a few points repeatedly appear in the nearest neighbor lists of other points. We show that this negatively impacts the extracted graph structures and message passing, thus degrading clustering performance. To the best of our knowledge, we are the first to highlight the detrimental effect of hubness in GMVC methods and introduce the hubREP (hub-aware Representation Embedding and Pairing) framework. Specifically, we propose a simple yet effective encoder that reduces hubness while preserving neighborhood topology within each view. Additionally, we propose a hub-aware pairing module to maintain structure consistency across views, efficiently enhancing the view-specific representations. The proposed hubREP is lightweight compared to the conventional autoencoders used in state-of-the-art GMVC methods and can be integrated into existing GMVC methods that mostly focus on novel fusion mechanisms, further boosting their performance. Comprehensive experiments performed on eight benchmarks confirm the superiority of our method. Code is included in the supplementary material.
Poster
Dileepa Pitawela · Gustavo Carneiro · Hsiang-Ting Chen

[ ExHall D ]

Abstract
In ordinal classification, misclassifying neighboring ranks is common, yet the consequences of these errors are not the same. For example, misclassifying benign tumor categories is less consequential, compared to an error at the pre-cancerous to cancerous threshold, which could profoundly influence treatment choices. Despite this, existing ordinal classification methods do not account for the varying importance of these margins, treating all neighboring classes as equally significant. To address this limitation, we propose CLOC, a new margin-based contrastive learning method for ordinal classification that learns an ordered representation based on the optimization of multiple margins with a novel multi-margin n-pair loss (MMNP). CLOC enables flexible decision boundaries across key adjacent categories, facilitating smooth transitions between classes and reducing the risk of overfitting to biases present in the training data. We provide an empirical discussion regarding the properties of MMNP and show experimental results on five real-world image datasets (Adience, Historical Colour Image Dating, Knee Osteoarthritis, Indian Diabetic Retinopathy Image, and Breast Carcinoma Subtyping) and one synthetic dataset simulating clinical decision bias. Our results demonstrate that CLOC outperforms existing ordinal classification methods and show the interpretability and controllability of CLOC in learning meaningful, ordered representations that align with clinical and practical needs.
Poster
Siyi Du · Xinzhe Luo · Declan ORegan · Chen Qin

[ ExHall D ]

Abstract
Multimodal image-tabular learning is gaining attention, yet it faces challenges due to limited labeled data. While earlier work has applied self-supervised learning (SSL) to unlabeled data, its task-agnostic nature often results in learning suboptimal features for downstream tasks. Semi-supervised learning (SemiSL), which combines labeled and unlabeled data, offers a promising solution. However, existing multimodal SemiSL methods typically focus on unimodal or modality-shared features, ignoring valuable task-relevant modality-specific information, leading to a Modality Information Gap. In this paper, we propose STiL, a novel SemiSL tabular-image framework that addresses this gap by comprehensively exploring task-relevant information. STiL features a new disentangled contrastive consistency module to learn cross-modal invariant representations of shared information while retaining modality-specific information via disentanglement. We also propose a novel consensus-guided pseudo-labeling strategy to generate reliable pseudo-labels based on classifier consensus, along with a new prototype-guided label smoothing technique to refine pseudo-label quality with prototype embeddings, thereby enhancing task-relevant information learning in unlabeled data. Experiments on natural and medical image datasets show that STiL outperforms state-of-the-art supervised/SSL/SemiSL image/multimodal approaches. Our code will be available on GitHub.
Poster
Jie Liu · Tiexin Qin · Hui Liu · Yilei Shi · Lichao Mou · Xiao Xiang Zhu · Shiqi Wang · Haoliang Li

[ ExHall D ]

Abstract
In this work, we address the challenge of adaptive pediatric Left Ventricular Ejection Fraction (LVEF) assessment. While Test-time Training (TTT) approaches show promise for this task, they suffer from two significant limitations. Existing TTT works are primarily designed for classification tasks rather than continuous value regression, and they lack mechanisms to handle the quasi-periodic nature of cardiac signals. To tackle these issues, we propose a novel Quasi-Periodic Adaptive Regression with Test-time Training (Q-PART) framework. In the training stage, the proposed Quasi-Period Network decomposes the echocardiogram into periodic and aperiodic components within latent space by combining parameterized helix trajectories with Neural Controlled Differential Equations. During inference, our framework further employs a variance minimization strategy across image augmentations that simulate common quality issues in echocardiogram acquisition, along with differential adaptation rates for periodic and aperiodic components. Theoretical analysis is provided to demonstrate that our variance minimization objective effectively bounds the regression error under mild conditions. Furthermore, extensive experiments across three pediatric age groups demonstrate that Q-PART not only significantly outperforms existing approaches in pediatric LVEF prediction, but also exhibits strong clinical screening capability with high mAUROC scores (up to 0.9747) and maintains gender-fair performance across all metrics, validating its robustness and practical …
Poster
Bingzhi Chen · Sisi Fu · Xiaocheng Fang · Jieyi Cai · Boya Zhang · Minhua Lu · Yishu Liu

[ ExHall D ]

Abstract
In clinical practice, panoramic dental radiography is a widely employed imaging technique that can provide a detailed and comprehensive view of dental structures and surrounding tissues for identifying various oral anomalies. However, due to the complexity of oral anomalies and the scarcity of available data, existing research still suffers from substantial challenges in automated oral anomaly detection. To this end, this paper presents a new hospital-scale panoramic X-ray benchmark, namely “OralXrays-9”, which consists of 12,688 panoramic X-ray images with 84,113 meticulously annotated instances across nine common oral anomalies. Correspondingly, we propose a personalized Multi-Object Query-Aware Mining (MOQAM) paradigm, which jointly incorporates the Distribution-IoU Region Proposal Network (DI-RPN) and Class-Balanced Spherical Contrastive Regularization (CB-SCR) mechanisms to address the challenges posed by multi-scale variations and class-imbalanced distributions. To the best of our knowledge, this is the first attempt to develop AI-driven diagnostic systems specifically designed for multi-object oral anomaly detection, utilizing publicly available data resources. Extensive experiments on the newly-published OralXrays-9 dataset and real-world nature scenarios consistently demonstrate the superiority of our MOQAM in revolutionizing oral healthcare practices.
Poster
Sang-Jun Park · Keun-Soo Heo · Dong-Hee Shin · Young-Han Son · Ji-Hye Oh · Tae-Eui Kam

[ ExHall D ]

Abstract
The automatic generation of radiology reports has emerged as a promising solution to reduce a time-consuming task and accurately capture critical disease-relevant findings in X-ray images. Previous approaches for radiology report generation have shown impressive performance. However, there remains significant potential to improve accuracy by ensuring that retrieved reports contain disease-relevant findings similar to those in the X-ray images and by refining generated reports. In this study, we propose a Disease-aware image-text Alignment and self-correcting Re-alignment for Trustworthy radiology report generation (DART) framework. In the first stage, we generate initial reports based on image-to-text retrieval with disease-matching, embedding both images and texts in a shared embedding space through contrastive learning. This approach ensures the retrieval of reports with similar disease-relevant findings that closely align with the input X-ray images. In the second stage, we further enhance the initial reports by introducing a self-correction module that re-aligns them with the X-ray images. Our proposed framework achieves state-of-the-art results on the MIMIC-CXR and IU X-ray benchmarks, surpassing previous approaches in both report generation and disease classification, thereby enhancing the trustworthiness of radiology reports.
Poster
Zhengrui Guo · Conghao Xiong · Jiabo MA · Qichen Sun · Lishuang Feng · Jinzhuo Wang · Hao Chen

[ ExHall D ]

Abstract
Few-shot learning presents a critical solution for cancer diagnosis in computational pathology (CPath), addressing fundamental limitations in data availability, particularly the scarcity of expert annotations and patient privacy constraints. A key challenge in this paradigm stems from the inherent disparity between the limited training set of whole slide images (WSIs) and the enormous number of contained patches, where a significant portion of these patches lacks diagnostically relevant information, potentially diluting the model's ability to learn and focus on critical diagnostic features. While recent works attempt to address this by incorporating additional knowledge, several crucial gaps hinder further progress: (1) despite the emergence of powerful pathology foundation models (FMs), their potential remains largely untapped, with most approaches limiting their use to basic feature extraction; (2) current language guidance mechanisms attempt to align text prompts with vast numbers of WSI patches all at once, struggling to leverage rich pathological semantic information. To this end, we introduce the knowledge-enhanced adaptive visual compression framework, dubbed FOCUS, which uniquely combines pathology FMs with language prior knowledge to enable a focused analysis of diagnostically relevant regions by prioritizing discriminative WSI patches. Our approach implements a progressive three-stage compression strategy: we first leverage FMs for global visual …
Poster
Tingting Zheng · Kui Jiang · Yi Xiao · Sicheng Zhao · Hongxun Yao

[ ExHall D ]

Abstract
Multi-instance learning (MIL) has demonstrated impressive performance in whole slide image (WSI) analysis. However, existing approaches struggle with undesirable results and unbearable computational overhead due to the quadratic complexity of Transformers. Recently, Mamba has offered a feasible solution for modeling long-range dependencies with linear complexity. However, vanilla Mamba inherently suffers from contextual forgetting issues, making it ill-suited for capturing global dependencies across instances in large-scale WSIs. To address this, we propose a memory-driven Mamba network, dubbed M3amba, to fully explore the global latent relations among instances. Specifically, M3amba retains and iteratively updates historical information with a dynamic memory bank (DMB), thus overcoming the catastrophic forgetting defects of Mamba for long-term context representation. For better feature representation, M3amba involves an intra-group bidirectional Mamba (BiMamba) block to refine local interactions within groups. Meanwhile, we additionally perform cross-attention fusion to incorporate relevant historical information across groups, facilitating richer inter-group connections. The joint learning of inter- and intra-group representations with memory merits equips M3amba with a more powerful capability for accurate and comprehensive WSI representation. Extensive experiments on four datasets demonstrate that M3amba outperforms the state-of-the-art by 6.2% and 7.0% in accuracy on the TCGA BRCA and TCGA Lung datasets while maintaining low …
Poster
Aniruddha Ganguly · Debolina Chatterjee · Wentao Huang · Jie Zhang · Alisa Yurovsky · Travis Steele Johnson · Chao Chen

[ ExHall D ]

Abstract
Recent advances in Spatial Transcriptomics (ST) pair histology images with spatially resolved gene expression profiles, enabling predictions of gene expression across different tissue locations based on image patches. This opens up new possibilities for enhancing whole slide image (WSI) prediction tasks with localized gene expression. However, existing methods fail to fully leverage the interactions between different tissue locations, which are crucial for accurate joint prediction. To address this, we introduce MERGE (Multi-faceted hiErarchical gRaph for Gene Expressions), which combines a multi-faceted hierarchical graph construction strategy with graph neural networks (GNN) to improve gene expression predictions from WSIs. By clustering tissue image patches based on both spatial and morphological features, and incorporating intra- and inter-cluster edges, our approach fosters interactions between distant tissue locations during GNN learning. As an additional contribution, we evaluate different data smoothing techniques that are necessary to mitigate artifacts in ST data, often caused by technical imperfections. We advocate for adopting gene-aware smoothing methods that are more biologically justified. Experimental results on gene expression prediction show that our GNN method outperforms state-of-the-art techniques across multiple metrics.
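The intra-/inter-cluster edge construction can be illustrated with a rough sketch: patches are clustered separately on spatial coordinates and on morphological embeddings, k-nearest-neighbor edges are added within each cluster, and cluster representatives are linked across clusters. The clustering choices, neighbor counts, and representative selection below are assumptions, not the MERGE pipeline.

```python
# Rough sketch of a multi-faceted graph over WSI patches (not the MERGE code).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

def build_edges(coords, feats, n_clusters=8, k=4):
    """coords: (N, 2) patch locations; feats: (N, D) morphological embeddings."""
    edges = set()
    for facet in (coords, feats):                    # spatial facet, then morphological facet
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(facet)
        reps = []
        for c in range(n_clusters):
            idx = np.where(labels == c)[0]
            if len(idx) < 2:
                continue
            # intra-cluster edges: k nearest neighbours inside the cluster
            nn = NearestNeighbors(n_neighbors=min(k + 1, len(idx))).fit(facet[idx])
            _, nbrs = nn.kneighbors(facet[idx])
            for i, row in zip(idx, nbrs):
                edges.update((int(i), int(idx[j])) for j in row[1:])
            # cluster representative: patch closest to the cluster mean
            reps.append(int(idx[np.argmin(((facet[idx] - facet[idx].mean(0)) ** 2).sum(1))]))
        # inter-cluster edges: connect all cluster representatives (a simple proxy)
        edges.update((a, b) for a in reps for b in reps if a != b)
    return np.array(sorted(edges))
```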
Poster
Xingguo Lv · Xingbo Dong · Liwen Wang · Jiewen Yang · Lei Zhao · Bin Pu · Zhe Jin · Xuejun Li

[ ExHall D ]

Abstract
Although domain generalization (DG) has significantly alleviated the performance degradation of pre-trained models caused by domain shifts, it often falls short in real-world deployment. Test-time adaptation (TTA), which adjusts a learned model using unlabeled test data, presents a promising solution. However, most existing TTA methods struggle to deliver strong performance in medical image segmentation, primarily because they overlook the crucial prior knowledge inherent to medical images. To address this challenge, we incorporate morphological information and propose a framework based on multi-graph matching. Specifically, we introduce learnable universe embeddings that integrate morphological priors during multi-source training, along with novel unsupervised test-time paradigms for domain adaptation. This approach guarantees cycle-consistency in multi-matching while enabling the model to more effectively capture the invariant priors of unseen data, significantly mitigating the effects of domain shifts. Extensive experiments demonstrate that our method outperforms other state-of-the-art approaches on two medical image segmentation benchmarks for both multi-source and single-source domain generalization tasks. We will make all codes publicly available.
Poster
Zhenhui Ding · Guilian Chen · Qin Zhang · Huisi Wu · Jing Qin

[ ExHall D ]

Abstract
Accurate automatic breast ultrasound (BUS) image segmentation is essential for early screening and diagnosis of breast cancer. It is, however, a quite challenging task owing to (1) the large variation in the scale and shape of breast lesions, (2) the ambiguous boundaries caused by extensive speckle noise and artifacts in BUS images, and (3) the scarcity of high-quality pixel-level annotations. Most existing semi-supervised methods employ the mean-teacher architecture, which merely learns semantic information within a single image and heavily relies on the performance of the teacher model. Given the vulnerability of this framework, we present a novel cross-image semantic correlation semi-supervised framework, named CSC-PA, to improve the performance of BUS image segmentation. CSC-PA is trained on a single network, which integrates a foreground prototype attention (FPA) and an edge prototype attention (EPA). Specifically, channel prototypes and an attention mechanism are used in the FPA to transfer complementary foreground information between labeled and unlabeled images, achieving more stable and complete lesion segmentation. On the other hand, EPA is proposed to enhance edge features of lesions by using an edge prototype. To achieve this, we design a novel adaptive edge container to store global edge features and generate the edge prototype. Additionally, …
Poster
Yuan Guo · Jingyu Kong · Yu Wang · Yuping Duan

[ ExHall D ]

Abstract
Medical image segmentation is vital for clinical applications, with hard samples playing a key role in segmentation accuracy. We propose an effective image segmentation framework that includes mechanisms for identifying and segmenting hard samples. It yields a novel image segmentation paradigm: 1) Learning to identify hard samples: automatically selecting inherent hard samples from different datasets, and 2) Learning to segment hard samples: achieving the segmentation of hard samples through effective feature augmentation on dedicated networks. We name our method "Learning to Segment hard samples" (L2S). The hard sample identification module comprises a backbone model and a classifier, which dynamically uncovers inherent dataset patterns. The hard sample segmentation module utilizes the diffusion process for feature augmentation and incorporates a more sophisticated segmentation network to achieve precise segmentation. We justify our motivation through solid theoretical analysis and extensive experiments. Evaluations across various modalities show that our L2S outperforms other SOTA methods, particularly by substantially improving the segmentation accuracy of hard samples. On the ISIC dataset, our L2S improves the Dice score on hard samples and overall segmentation by 8.97% and 1.01%, respectively, compared to SOTA methods.
Poster
Jie Mei · Chenyu Lin · Yu Qiu · Yaonan Wang · Hui Zhang · Ziyang Wang · Dong Dai

[ ExHall D ]

Abstract
Lung cancer is a leading cause of cancer-related deaths globally. PET-CT is crucial for imaging lung tumors, providing essential metabolic and anatomical information, but it faces challenges such as poor image quality, motion artifacts, and complex tumor morphology. Deep learning-based segmentation models are expected to address these problems; however, most existing datasets are small-scale and private, which is insufficient to support significant performance improvements for these methods. Hence, we introduce a large-scale PET-CT lung tumor segmentation dataset, termed PCLT20K, which comprises 21,930 pairs of PET-CT images from 605 patients. All images are manually labeled with pixel-level tumor masks by experienced doctors. Furthermore, we propose a cross-modal interactive perception network with Mamba (CIPA) for lung tumor segmentation in PET-CT images. Specifically, we design a channel-wise rectification module (CRM) that implements a channel state space block across multi-modal features to learn correlated representations and helps filter out modality-specific noise. A dynamic cross-modality interaction module (DCIM) is designed to effectively integrate position and context information, which employs PET images to learn regional position information and serves as a bridge to assist in modeling the relationships between local features of CT images. Extensive experiments on a comprehensive benchmark demonstrate the effectiveness of our CIPA …
Poster
Tianyi Liu · Haochuan Jiang · Kaizhu Huang

[ ExHall D ]

Abstract
Magnetic resonance imaging (MRI), with modalities including T1, T2, T1ce, and Flair that provide complementary information critical for sub-region analysis, is widely used for brain tumor diagnosis. However, clinical practice often suffers from varying degrees of incompleteness of the necessary modalities, for reasons such as susceptibility to artifacts, which significantly impairs segmentation model performance. Given the limited available modalities at hand, existing approaches attempt to project them into a shared latent space. However, they ignore the decomposition of modality-shared and modality-specific information and fail to model the relationships among different modalities. Such deficiencies limit segmentation performance, particularly when the amount of data differs across modalities. In this paper, we propose the plug-and-play Koopman Multi-modality Decomposition (KMD) module, leveraging the Koopman Invariant Subspace to disentangle modality-common and modality-specific information. It is capable of constructing modality relationships that minimize bias toward particular modalities across various modality-incomplete scenarios. More importantly, it can be feasibly integrated into several existing backbones. Through theoretical deductions and extensive empirical experiments on the BraTS2018 and BraTS2020 datasets, we sufficiently demonstrate the effectiveness of the proposed KMD in promoting generalization performance.
Poster
Kunpeng Qiu · Zhiqiang Gao · Zhiying Zhou · MINGJIE SUN · Yongxin Guo

[ ExHall D ]

Abstract
Deep learning has revolutionized medical image segmentation, but its full potential is limited by the scarcity of annotated datasets. Diffusion models are used to generate synthetic image-mask pairs to expand these datasets, yet they also face the same data scarcity issues they aim to address. Traditional mask-only models often produce low-fidelity images due to insufficient generation of morphological characteristics, which can catastrophically undermine the reliability of segmentation models. To enhance morphological fidelity, we propose the Siamese-Diffusion model, which incorporates both image and mask prior controls during training and switches to mask-only guidance during sampling to preserve diversity and scalability. This model, comprising both Mask-Diffusion and Image-Diffusion, ensures high morphological fidelity by introducing a Noise Consistency Loss between the two diffusion processes, guiding the convergence trajectory of Mask-Diffusion toward higher-fidelity local minima in the parameter space. Extensive experiments validate the superiority of our method: with Siamese-Diffusion, SANet achieves mDice and mIoU improvements of 3.6% and 4.4% on the Polyps dataset, while UNet shows mDice and mIoU improvements of 1.52% and 1.64% on the ISIC2018 dataset. Code will be released.
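One simple way to couple two diffusion branches, as described above, is to penalize the discrepancy between their noise predictions at the same timestep, with the richer image-conditioned branch serving as a fixed target. The toy sketch below illustrates only that coupling; the conditioning interfaces, weighting, and schedule are assumptions rather than the paper's exact Noise Consistency Loss.

```python
# Toy sketch of a noise-consistency term between two diffusion branches
# (mask-conditioned vs. image+mask-conditioned); not the paper's exact loss.
import torch
import torch.nn.functional as F

def noise_consistency_loss(eps_mask_branch, eps_image_branch):
    """Pull the mask-only branch toward the image+mask branch (treated as fixed)."""
    return F.mse_loss(eps_mask_branch, eps_image_branch.detach())

def siamese_step(model_mask, model_image, x0, mask, image, alphas_bar, t, lam=0.1):
    noise = torch.randn_like(x0)
    a = alphas_bar[t].view(-1, 1, 1, 1)
    xt = a.sqrt() * x0 + (1 - a).sqrt() * noise        # standard DDPM forward process
    eps_m = model_mask(xt, t, mask)                    # mask-only guidance branch
    eps_i = model_image(xt, t, mask, image)            # mask + image prior branch
    denoise = F.mse_loss(eps_m, noise) + F.mse_loss(eps_i, noise)
    return denoise + lam * noise_consistency_loss(eps_m, eps_i)
```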
Poster
Chun-Hung Wu · Shih-Hong Chen · Chih Yao Hu · Hsin-Yu Wu · Kai-Hsin Chen · Yu-You Chen · Chih-Hai Su · Chih-Kuo Lee · Yu-Lun Liu

[ ExHall D ]

Abstract
This paper presents Deformable Neural Vessel Representations (DeNVeR), an unsupervised approach for vessel segmentation in X-ray angiography videos without annotated ground truth. DeNVeR utilizes optical flow and layer separation techniques, enhancing segmentation accuracy and adaptability through test-time training. Key contributions include a novel layer separation bootstrapping technique, a parallel vessel motion loss, and the integration of Eulerian motion fields for modeling complex vessel dynamics. A significant component of this research is the introduction of the XACV dataset, the first X-ray angiography coronary video dataset with high-quality, manually labeled segmentation ground truth. Extensive evaluations on both XACV and CADICA datasets demonstrate that DeNVeR outperforms current state-of-the-art methods in vessel segmentation accuracy and generalization capability while maintaining temporal coherency.
Poster
Zhifeng Wang · Renjiao Yi · Xin Wen · Chenyang Zhu · Kai Xu

[ ExHall D ]

Abstract
Angiography imaging is a medical imaging technique that enhances the visibility of blood vessels within the body by using contrast agents. Angiographic images can effectively assist in the diagnosis of vascular diseases. However, contrast agents may bring extra radiation exposure, which is harmful to patients and carries health risks. To mitigate these concerns, in this paper, we aim to automatically generate angiography from non-angiographic inputs by leveraging and enhancing the inherent physical properties of vascular structures. Previous methods relying on 2D slice-based angiography synthesis struggle to maintain continuity in 3D vascular structures and exhibit limited effectiveness across different imaging modalities. We propose VasTSD, a 3D vascular tree-state space diffusion model to synthesize angiography from 3D non-angiographic volumes, with a novel state space serialization approach that dynamically constructs vascular tree topologies, integrating these with a diffusion-based generative model to ensure the generation of anatomically continuous vasculature in 3D volumes. A pre-trained vision embedder is employed to construct vascular state space representations, enabling consistent modeling of vascular structures across multiple modalities. Extensive experiments on various angiographic datasets demonstrate the superiority of VasTSD over prior works, achieving enhanced continuity of blood vessels in synthesized angiography across multiple modalities and anatomical regions.

Oral Session 4A: Image and Video Synthesis Sat 14 Jun 01:00 p.m.  

Oral
Jingfeng Yao · Bin Yang · Xinggang Wang

[ Karl Dean Ballroom ]

Abstract
Latent diffusion models (LDM) with Transformer architectures excel at generating high-fidelity images. However, recent studies reveal an optimization dilemma in this two-stage design: increasing the per-token feature dimension in visual tokenizers improves reconstruction quality but requires substantially larger diffusion models and extended training time to maintain generation performance. This results in prohibitively high computational costs, making high-dimensional tokenizers impractical. In this paper, we argue that this limitation stems from the inherent difficulty of learning unconstrained high-dimensional latent spaces, and we address it by aligning the latent space with pre-trained vision foundation models. Our VA-VAE (Vision foundation model Aligned Variational AutoEncoder) expands the Pareto frontier of visual tokenizers, enabling 2.7 times faster Diffusion Transformer (DiT) convergence in high-dimensional latent space. To further validate our approach, we optimize a DiT baseline, referred to as LightningDiT, achieving superior performance on class-conditional generation with only 6% of the original training epochs. The integrated system demonstrates the effectiveness of VA-VAE, achieving 0.28 rFID and 1.73 gFID on ImageNet-256 generation in 400 epochs, outperforming the original DiT's 0.71 rFID and 2.27 gFID in 1400 epochs, without more complex designs. To our knowledge, this marks the first latent diffusion system to achieve both superior generation and reconstruction …
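A plausible form of the alignment described above is a lightweight projection of tokenizer latents into a frozen foundation model's feature space with a cosine-distance penalty; the projection head, the choice of foundation encoder, and the loss form below are illustrative assumptions, not the published VA-VAE recipe.

```python
# Hypothetical sketch of aligning VAE latents with a frozen vision foundation
# model (e.g., a DINO-style patch encoder); not the VA-VAE implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentAligner(nn.Module):
    def __init__(self, latent_dim, foundation_dim):
        super().__init__()
        self.proj = nn.Linear(latent_dim, foundation_dim)   # map tokenizer latents to FM space

    def forward(self, latents, foundation_feats):
        """latents: (B, N, latent_dim) tokenizer outputs;
        foundation_feats: (B, N, foundation_dim) frozen foundation-model patch features."""
        z = F.normalize(self.proj(latents), dim=-1)
        f = F.normalize(foundation_feats.detach(), dim=-1)  # foundation model stays frozen
        return (1 - (z * f).sum(-1)).mean()                 # cosine-distance alignment loss
```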
Oral
Kaiwen Zha · Lijun Yu · Alireza Fathi · David A. Ross · Cordelia Schmid · Dina Katabi · Xiuye Gu

[ Karl Dean Ballroom ]

Abstract
Image tokenization, the process of transforming raw image pixels into a compact low-dimensional latent representation, has proven crucial for scalable and efficient image generation. However, mainstream image tokenization methods generally have limited compression rates, making high-resolution image generation computationally expensive. To address this challenge, we propose to leverage language for efficient image tokenization, and we call our method Text-Conditioned Image Tokenization (TexTok). TexTok is a simple yet effective tokenization framework that leverages language to provide high-level semantics. By conditioning the tokenization process on descriptive text captions, TexTok allows the tokenization process to focus on encoding fine-grained visual details into latent tokens, leading to enhanced reconstruction quality and higher compression rates. Compared to the conventional tokenizer without text conditioning, TexTok achieves average reconstruction FID improvements of 29.2% and 48.1% on ImageNet 256×256 and 512×512 benchmarks respectively, across varying numbers of tokens. These tokenization improvements consistently translate to 16.3% and 34.3% average improvements in generation FID. By simply replacing the tokenizer in Diffusion Transformer (DiT) with TexTok, our system can achieve 93.5× inference speedup while still outperforming the original DiT using only 32 tokens on ImageNet-512. TexTok with a vanilla DiT generator achieves state-of-the-art FID scores of 1.46 and 1.62 on ImageNet-256 …
Oral
Qingyu Shi · Lu Qi · Jianzong Wu · Jinbin Bai · Jingbo Wang · Yunhai Tong · Xiangtai Li

[ Karl Dean Ballroom ]

Abstract
Customized image generation is essential for delivering personalized content based on user-provided prompts, enabling large-scale text-to-image diffusion models to better align with individual needs. However, existing models often neglect the relationships between customized objects in generated images. In contrast, this work addresses this gap by focusing on relation-aware customized image generation, which seeks to preserve the identities from image prompts while maintaining the predicate relations specified in text prompts. Specifically, we introduce DreamRelation, a framework that disentangles identity and relation learning using a carefully curated dataset. Our training data consists of relation-specific images, independent object images containing identity information, and text prompts to guide relation generation. Then, we propose two key modules to tackle the two main challenges—generating accurate and natural relations, especially when significant pose adjustments are required, and avoiding object confusion in cases of overlap. First, we introduce a keypoint matching loss that effectively guides the model in adjusting object poses closely tied to their relationships. Second, we incorporate local features from the image prompts to better distinguish between objects, preventing confusion in overlapping cases. Extensive results on our proposed benchmarks demonstrate the superiority of DreamRelation in generating precise relations while preserving object identities across a diverse set …
Oral
Jian Han · Jinlai Liu · Yi Jiang · Bin Yan · Yuqi Zhang · Zehuan Yuan · BINGYUE PENG · Xiaobing Liu

[ Karl Dean Ballroom ]

Abstract
We present Infinity, a bitwise visual autoregressive modeling framework capable of generating high-resolution, photorealistic images following language instructions. Infinity refactors the visual autoregressive model under a bitwise token prediction framework with an infinite-vocabulary classifier and a bitwise self-correction mechanism. By theoretically expanding the tokenizer vocabulary size to infinity in the Transformer, our method unleashes scaling capabilities significantly more powerful than those of vanilla VAR. Extensive experiments indicate that Infinity outperforms autoregressive text-to-image models by large margins and matches or surpasses leading diffusion models. Without extra optimization, Infinity generates a 1024×1024 image in 0.8s, 2.6× faster than SD3-Medium, making it the fastest text-to-image model. Models and codes will be released to promote further exploration of Infinity for visual generation.
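The abstract does not spell out how bitwise tokens are obtained; a generic way to get binary latent tokens is sign quantization with a straight-through estimator, sketched below purely for illustration. This is not the Infinity tokenizer, its infinite-vocabulary classifier, or its bitwise self-correction mechanism.

```python
# Generic sign (bitwise) quantization with a straight-through estimator.
# Illustrates the idea of binary latent tokens only; not the Infinity design.
import torch

def bitwise_quantize(z):
    """Map each latent channel to a bit in {-1, +1}; gradients pass straight through."""
    z_q = torch.sign(z)
    z_q = torch.where(z_q == 0, torch.ones_like(z_q), z_q)  # avoid sign(0) = 0
    return z + (z_q - z).detach()                            # straight-through estimator

def bits_to_index(z_q):
    """Pack d bits per token into an integer index in [0, 2^d) for prediction targets."""
    bits = (z_q > 0).long()                                  # (..., d) in {0, 1}
    weights = 2 ** torch.arange(bits.shape[-1], device=bits.device)
    return (bits * weights).sum(-1)
```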
Oral
Yeongmin Kim · Sotiris Anagnostidis · Yuming Du · Edgar Schoenfeld · Jonas Kohler · Markos Georgopoulos · Albert Pumarola · Ali Thabet · Artsiom Sanakoyeu

[ Karl Dean Ballroom ]

Abstract
Diffusion models with transformer architectures have demonstrated promising capabilities in generating high-fidelity images and scalability for high resolution. However, the iterative sampling process required for synthesis is highly resource-intensive. A line of work has focused on distilling solutions to probability flow ODEs into few-step student models. Nevertheless, existing methods have been limited by their reliance on the most recent denoised samples as input, rendering them susceptible to exposure bias. To address this limitation, we propose AutoRegressive Distillation (ARD), a novel approach that leverages the historical trajectory of the ODE to predict future steps. ARD offers two key benefits: 1) it mitigates exposure bias by utilizing a predicted historical trajectory that is less susceptible to accumulated errors, and 2) it leverages the previous history of the ODE trajectory as a more effective source of coarse-grained information. ARD modifies the teacher transformer architecture by adding token-wise time embeddings to mark each input from the trajectory history and employs a block-wise causal attention mask for training. Furthermore, incorporating historical inputs only in lower transformer layers enhances performance and efficiency. We validate the effectiveness of ARD on class-conditional generation on ImageNet and T2I synthesis. Our model achieves a 5× reduction in FID degradation compared …
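The block-wise causal mask mentioned above can be illustrated in a few lines: tokens attend within their own trajectory step and to all earlier steps. The tokens-per-step layout is an assumed convention for illustration only.

```python
# Sketch of a block-wise causal attention mask over a history of ODE steps.
import torch

def block_causal_mask(num_steps, tokens_per_step):
    n = num_steps * tokens_per_step
    step_id = torch.arange(n) // tokens_per_step           # which step each token belongs to
    allowed = step_id.unsqueeze(0) <= step_id.unsqueeze(1)  # attend to same or earlier steps
    mask = torch.zeros(n, n)
    mask[~allowed] = float("-inf")                          # additive mask for attention logits
    return mask

# Example: 3 history steps with 4 tokens each -> a 12x12 additive mask.
print(block_causal_mask(3, 4))
```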

Oral Session 4B: Embodied Computer Vision Sat 14 Jun 01:00 p.m.  

Oral
Jingyi Tian · Le Wang · Sanping Zhou · Sen Wang · lijiayi · Haowen Sun · Wei Tang

[ ExHall A2 ]

Abstract
Robotic manipulation based on visual observations and natural language instructions is a long-standing challenge in robotics. Yet prevailing approaches model the action distribution with explicit or implicit representations, which often struggle to trade off accuracy against efficiency. In response, we propose PDFactor, a novel framework that models the action distribution with a hybrid triplane representation. In particular, PDFactor decomposes the 3D point cloud into three orthogonal feature planes and leverages a tri-perspective view transformer to produce dense cubic features as a latent diffusion field aligned with the observation space, representing a 6-DoF action probability distribution at an arbitrary location. We employ a small denoising network conceptually as both a parameterized loss function measuring the quality of the learned latent features and an action gradient decoder to sample actions from the latent diffusion field during inference. This design enables our PDFactor to benefit from the spatial awareness of explicit representations and the arbitrary resolution of implicit representations, endowing it with manipulation accuracy, inference efficiency, and model scalability. Experiments demonstrate that PDFactor outperforms state-of-the-art approaches across a diverse range of manipulation tasks in RLBench simulation. Moreover, PDFactor can effectively learn multi-task policies from a limited number of human demonstrations, achieving promising accuracy in a variety of …
Oral
Chan Hee Song · Valts Blukis · Jonathan Tremblay · Stephen Tyree · Yu Su · Stan Birchfield

[ ExHall A2 ]

Abstract
Spatial understanding is a crucial capability for robots to make grounded decisions based on their environment. This foundational skill enables robots not only to perceive their surroundings but also to reason about and interact meaningfully within the world. In modern robotics, these capabilities are increasingly taken on by vision-language models, which face significant challenges when applied in spatial reasoning contexts due to their training data sources. These sources utilize general-purpose image datasets and often lack sophisticated spatial scene understanding capabilities. For example, the datasets do not address reference frame comprehension: spatial relationships require clear contextual understanding, whether from an ego-centric, object-centric, or world-centric perspective, to allow for effective real-world interaction. To address this issue, we introduce RoboSpatial, a large-scale spatial understanding dataset consisting of real indoor and tabletop scenes captured as 3D scans and ego-centric images, annotated with rich spatial information relevant to robotics. The dataset includes 1M images, 5K 3D scans, and 3M annotated spatial relationships, with paired 2D egocentric images and 3D scans to make it both 2D and 3D ready. Our experiments show that models trained with RoboSpatial outperform baselines on downstream tasks such as spatial affordance prediction, spatial relationship prediction, and robotics manipulation.
Oral
Jieming Cui · Tengyu Liu · Ziyu Meng · Jiale Yu · Ran Song · Wei Zhang · Yixin Zhu · Siyuan Huang

[ ExHall A2 ]

Abstract
Learning open-vocabulary physical skills for simulated agents remains challenging due to the limitations of reinforcement learning approaches: manually designed rewards lack scalability, while demonstration-based methods struggle to cover arbitrary tasks. We propose GROVE, a generalized reward framework for open-vocabulary physical skill learning without manual reward design or task-specific demonstrations. GROVE uniquely combines Large Language Models (LLMs) for generating precise constraints with Vision Language Models (VLMs) for semantic evaluation. Through an iterative reward design process, VLM-based feedback guides the refinement of LLM-generated constraints, significantly enhancing the reliability of our method. Central to our approach is Pose2CLIP, a lightweight pose-to-semantic feature mapper that significantly enhances the quality and efficiency of VLM evaluation. Extensive experiments demonstrate GROVE's versatility across diverse tasks and learning paradigms. Our approach achieves 22.2% higher naturalness and 25.7% better task completion score while training 8.4 times faster than previous open-vocabulary methods, establishing a new foundation for scalable physical skill acquisition.
Oral
Amir Bar · Gaoyue Zhou · Danny Tran · Trevor Darrell · Yann LeCun

[ ExHall A2 ]

Abstract
Navigation is a fundamental skill of agents with visual-motor capabilities. We propose a Navigation World Model (NWM), a controllable video generation model that predicts the future visual observation given the past observations and navigation actions. NWM is a Conditional Diffusion Transformer (CDiT) trained on the video footage of robots as well as unlabeled egocentric video data. We scale the model up to 1B parameters and train it on human and robot agent data from numerous environments and embodiments. Our model scales favorably on known and unknown environments and can leverage unlabeled egocentric video data. NWM exhibits improved navigation planning skills either by planning from scratch or by ranking proposals from an external navigation policy. Compared to existing supervised navigation models, which are "hard coded", NWM can incorporate new constraints when planning trajectories. NWM learns visual priors that enable it to imagine navigation trajectories based on just a single input image.
Oral
Mi Luo · Zihui Xue · Alex Dimakis · Kristen Grauman

[ ExHall A2 ]

Abstract
Egocentric and exocentric perspectives of human action differ significantly, yet overcoming this extreme viewpoint gap is critical for applications in augmented reality and robotics. We propose ViewpointRosetta, an approach that unlocks large-scale unpaired ego and exo video data to learn clip-level viewpoint-invariant video representations. Our framework introduces (1) a diffusion-based Rosetta Stone Translator (RST), which, leveraging a moderate amount of synchronized multi-view videos, serves as a translator in feature space to decipher the alignments between unpaired ego and exo data, and (2) a dual encoder that aligns unpaired data representations through contrastive learning with RST-based synthetic feature augmentation and soft alignment. To evaluate the learned features in a standardized setting, we construct a new cross-view benchmark using Ego-Exo4D, covering cross-view retrieval, action recognition, and skill assessment. Our framework demonstrates superior cross-view understanding compared to previous view-invariant learning and egocentric video representation learning approaches, and opens the door to bringing vast amounts of traditional third-person video to bear on the more nascent first-person setting.

Oral Session 4C: 3D Computer Vision Sat 14 Jun 01:00 p.m.  

Oral
Zhengxue Wang · Zhiqiang Yan · Jinshan Pan · Guangwei Gao · Kai Zhang · Jian Yang

[ Davidson Ballroom ]

Abstract
Recent RGB-guided depth super-resolution methods have achieved impressive performance under the assumption of fixed and known degradation (e.g., bicubic downsampling). However, in real-world scenarios, captured depth data often suffer from unconventional and unknown degradation due to sensor limitations and complex imaging environments (e.g., low reflective surfaces, varying illumination). Consequently, the performance of these methods significantly declines when real-world degradations deviate from their assumptions. In this paper, we propose the Degradation Oriented and Regularized Network (DORNet), a novel framework designed to adaptively address unknown degradation in real-world scenes through implicit degradation representations. Our approach begins with the development of a self-supervised degradation learning strategy, which models the degradation representations of low-resolution depth data using routing selection-based degradation regularization. To facilitate effective RGB-D fusion, we further introduce a degradation-oriented feature transformation module that selectively propagates RGB content into the depth data based on the learned degradation priors. Extensive experimental results on both real and synthetic datasets demonstrate the superiority of our DORNet in handling unknown degradations, outperforming existing methods.
Oral
Bangyan Liao · Zhenjun Zhao · Haoang Li · Yi Zhou · Yingping Zeng · Hao Li · Peidong Liu

[ Davidson Ballroom ]

Abstract
Determining the vanishing points (VPs) in a Manhattan world, as a fundamental task in many 3D vision applications, consists of jointly inferring the line-VP association and locating each VP. Existing methods, however, are either sub-optimal solvers or pursue global optimality at a significant cost in computing time. In contrast to prior works, we introduce convex relaxation techniques to solve this task for the first time. Specifically, we employ a “soft” association scheme, realized via a truncated multi-selection error, that allows for joint estimation of VPs’ locations and line-VP associations. This approach leads to a primal problem that can be reformulated into a quadratically constrained quadratic programming (QCQP) problem, which is then relaxed into a convex semidefinite programming (SDP) problem. To solve this SDP problem efficiently, we present a globally optimal outlier-robust iterative solver (called GlobustVP), which independently searches for one VP and its associated lines in each iteration, treating other lines as outliers. After each independent update of all VPs, the mutual orthogonality between the three VPs in a Manhattan world is reinforced via local refinement. Extensive experiments on both synthetic and real-world data demonstrate that GlobustVP achieves a favorable balance between efficiency, robustness, and global optimality compared to previous …
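For readers unfamiliar with the QCQP-to-SDP step, the textbook lifting is summarized below; this is a generic illustration of the relaxation, not the paper's specific cost, constraints, or truncated multi-selection error.

```latex
% Generic QCQP-to-SDP relaxation (textbook lifting; not the paper's exact model).
\min_{x \in \mathbb{R}^n} \ x^\top C x
\quad \text{s.t.} \quad x^\top A_i x = b_i, \quad i = 1, \dots, m.

% Lift X = x x^\top (so X \succeq 0 and \operatorname{rank}(X) = 1), then drop the rank constraint:
\min_{X \succeq 0} \ \operatorname{tr}(C X)
\quad \text{s.t.} \quad \operatorname{tr}(A_i X) = b_i, \quad i = 1, \dots, m.
```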
Oral
Yuhui Liu · Liangxun Ou · Qiang Fu · Hadi Amata · Wolfgang Heidrich · YIFAN PENG

[ Davidson Ballroom ]

Abstract
Extracting high-fidelity RGBD information from two-dimensional (2D) images is essential for various visual computing applications. Stereo imaging, as a reliable passive imaging technique for obtaining three-dimensional (3D) scene information, has benefited greatly from deep learning advancements. However, existing stereo depth estimation algorithms struggle to perceive high-frequency information and resolve high-resolution depth maps in realistic camera settings with large depth variations. These algorithms commonly neglect the hardware parameter configuration, limiting the potential for achieving optimal solutions solely through software-based design strategies. This work presents a hardware-software co-designed RGBD imaging framework that leverages both stereo and focus cues to reconstruct texture-rich color images along with detailed depth maps over a wide depth range. A pair of rank-2 parameterized diffractive optical elements (DOEs) is employed to optically encode perpendicular complementary information during stereo acquisitions. Additionally, we employ an IGEV-UNet-fused neural network tailored to the proposed rank-2 encoding for stereo matching and image reconstruction. Through prototyping a stereo camera with customized DOEs, our deep stereo imaging paradigm has demonstrated superior performance over existing monocular and stereo imaging systems, with a 2.96 dB gain in image PSNR and improved depth accuracy in high-frequency details across distances from 0.67 to 8 meters.
Oral
Juan Carlos Dibene Simental · Enrique Dunn

[ Davidson Ballroom ]

Abstract
We present a marker-based geometric estimation framework for the absolute pose of a camera by analyzing the 1D observations in a single radially distorted pixel scanline. We leverage a pair of known co-planar pencils of lines, along with lens distortion parameters, to propose an ensemble of solvers exploring the space of estimation strategies applicable to our setup. First, we present a minimal algebraic solver requiring only six measurements and yielding eight solutions, which relies on the intersection of two conics defined by one of the pencils of lines. Then, we present a unique closed-form geometric solver from seven measurements. Finally, we present a homography-based formulation amenable to linear least-squares from eight or more measurements. Our geometric framework constitutes a theoretical analysis of the minimum geometric context necessary to solve in closed form for the absolute pose of a single camera from a single radially distorted scanline.
Oral
Sotiris Nousias · Mian Wei · Howard Xiao · Maxx Wu · Shahmeer Athar · Kevin J Wang · Anagh Malik · David A. Barmherzig · David B. Lindell · Kiriakos Kutulakos

[ Davidson Ballroom ]

Abstract
Scattered light from pulsed lasers is increasingly part of our ambient illumination, as many devices rely on them for active 3D sensing. In this work, we ask: can these “ambient” light signals be detected and leveraged for passive 3D vision? We show that pulsed lasers, despite being weak and fluctuating at MHz to GHz frequencies, leave a distinctive sinc comb pattern in the temporal frequency domain of incident flux that is specific to each laser and invariant to the scene. This enables their passive detection and analysis with a free-running SPAD camera, even when they are unknown, asynchronous, out of sight, and emitting concurrently. We show how to synchronize with such lasers computationally, characterize their pulse emissions, separate their contributions, and—if many are present—localize them in 3D and recover a depth map of the camera’s field of view. We use our camera prototype to demonstrate (1) a first-of-its-kind visualization of asynchronously propagating light pulses from multiple lasers through the same scene, (2) passive estimation of a laser’s MHz-scale pulse repetition frequency with mHz precision, and (3) mm-scale 3D imaging over room-scale distances by passively harvesting photons from two or more out-of-view lasers.
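The "sinc comb" referred to above is the standard Fourier-analysis fact that a periodic pulse train has a line spectrum spaced at the pulse repetition frequency, with an envelope given by the single-pulse spectrum. The block below records that textbook result for an assumed rectangular pulse; the paper's actual analysis involves unknown, asynchronous lasers and SPAD photon statistics.

```latex
% Textbook result: a periodic train of rectangular pulses of width \tau and
% repetition period T has a line spectrum at multiples of 1/T with a sinc
% envelope (here \mathrm{sinc}(x) = \sin(x)/x).
x(t) = \sum_{n \in \mathbb{Z}} p(t - nT), \qquad p(t) = \mathbf{1}\{\lvert t \rvert \le \tau/2\},

\hat{x}(f) = \frac{1}{T} \sum_{k \in \mathbb{Z}} \hat{p}\!\left(\frac{k}{T}\right) \delta\!\left(f - \frac{k}{T}\right),
\qquad \hat{p}(f) = \tau \, \mathrm{sinc}(\pi f \tau).
```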

Invited Talk: Laurens van der Maaten

Keynote 2: Laurens van der Maaten

Poster Session 4 Sat 14 Jun 05:00 p.m.  

Poster
Gaoxiang Cong · Jiadong Pan · Liang Li · Yuankai Qi · Yuxin Peng · Anton van den Hengel · Jian Yang · Qingming Huang

[ ExHall D ]

Abstract
Given a piece of text, a video clip, and a reference audio, the movie dubbing task aims to generate speech that aligns with the video while cloning the desired voice. Existing methods have two primary deficiencies: (1) they struggle to simultaneously hold audio-visual sync and achieve clear pronunciation; (2) they lack the capacity to express user-defined emotions. To address these problems, we propose EmoDubber, an emotion-controllable dubbing architecture that allows users to specify emotion type and emotional intensity while satisfying high-quality lip sync and pronunciation. Specifically, we first design Lip-related Prosody Aligning (LPA), which focuses on learning the inherent consistency between lip motion and prosody variation via duration-level contrastive learning to achieve reasonable alignment. Then, we design a Pronunciation Enhancing (PE) strategy that fuses video-level phoneme sequences with an efficient conformer to improve speech intelligibility. Next, the speaker identity adapting module decodes the acoustic prior and injects the speaker style embedding. After that, the proposed Flow-based User Emotion Controlling (FUEC) module synthesizes the waveform with a flow-matching prediction network conditioned on the acoustic prior. In this process, the FUEC determines the gradient direction and guidance scale based on the user's emotion instructions by the positive and negative guidance …
Poster
Jihoon Kim · Jeongsoo Choi · Jaehun Kim · Chaeyoung Jung · Joon Chung

[ ExHall D ]

Abstract
The objective of this study is to generate high-quality speech from silent talking face videos, a task also known as video-to-speech synthesis. A significant challenge in video-to-speech synthesis lies in the substantial modality gap between silent video and multi-faceted speech. In this paper, we propose a novel video-to-speech system that effectively bridges this modality gap, significantly enhancing the quality of synthesized speech. This is achieved by learning hierarchical representations from video to speech. Specifically, we gradually transform silent video into acoustic feature spaces through three sequential stages: content, timbre, and prosody modeling. In each stage, we align visual factors (lip movements, face identity, and facial expressions) with their corresponding acoustic counterparts to ensure a seamless transformation. Additionally, to generate realistic and coherent speech from the visual representations, we employ a flow matching model that estimates direct trajectories from a simple prior distribution to the target speech distribution. Extensive experiments demonstrate that our method achieves exceptional generation quality comparable to real utterances, outperforming existing methods by a significant margin.
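The flow matching component mentioned above can be illustrated with a minimal conditional flow-matching objective over straight interpolation paths; the feature shapes, conditioning interface, and path choice are assumptions, not the paper's model.

```python
# Minimal conditional flow-matching objective (generic, not the paper's model):
# learn a velocity field that transports a Gaussian prior sample to the target
# speech features along a straight path, conditioned on visual representations.
import torch
import torch.nn.functional as F

def flow_matching_loss(velocity_net, x1, visual_cond):
    """x1: target speech features (B, T, D); visual_cond: visual conditioning."""
    x0 = torch.randn_like(x1)                        # sample from the simple prior
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)
    xt = (1 - t) * x0 + t * x1                       # linear interpolation path
    target_velocity = x1 - x0                        # constant velocity of the straight path
    pred = velocity_net(xt, t.view(-1), visual_cond)
    return F.mse_loss(pred, target_velocity)
```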
Poster
Yinuo Wang · Yanbo Fan · Xuan Wang · Yu Guo · Fei Wang

[ ExHall D ]

Abstract
Listening head generation aims to synthesize non-verbal responsive listening head videos that naturally react to a certain speaker, for which realistic head movements, expressive facial expressions, and high visual quality are all expected. Previous approaches typically follow a two-stage pipeline that first generates intermediate 3D motion signals such as 3DMM coefficients, and then synthesizes the videos by deterministic rendering, suffering from limited motion expressiveness and low visual quality (e.g., 256×256). In this work, we propose a novel listening head generation method that harnesses the generative capabilities of the diffusion model for both motion generation and high-quality rendering. Crucially, we propose an effective hybrid motion modeling module that addresses training difficulties caused by the scarcity of listening head data while preserving the intricate details that may be lost in explicit motion representations. We further develop tailored control guidance for head pose and facial expression by integrating their intrinsic motion characteristics. Our method enables high-fidelity video generation at 512×512 resolution and delivers vivid listener motion feedback. We conduct comprehensive experiments and obtain superior performance in terms of both visual quality and motion expressiveness compared to existing methods.
Poster
Enric Corona · Andrei Zanfir · Eduard Gabriel Bazavan · NIKOS KOLOTOUROS · Thiemo Alldieck · Cristian Sminchisescu

[ ExHall D ]

Abstract
We propose VLOGGER, a method for audio-driven human video generation from a single input image of a person, which builds on the success of recent generative diffusion models. Our method consists of 1) a stochastic human-to-3D-motion diffusion model, and 2) a novel diffusion-based architecture that augments text-to-image models with both spatial and temporal controls. This supports the generation of high-quality video of variable length, easily controllable through high-level representations of human faces and bodies. In contrast to previous work, our method does not require training for each person, does not rely on face detection and cropping, generates the complete image (not just the face or the lips), and considers a broad spectrum of scenarios (e.g. visible torso or diverse subject identities) that are critical to correctly synthesize humans who communicate. We also curate MENTOR, a new and diverse dataset with 3D pose and expression annotations, one order of magnitude larger than previous ones (800,000 identities) and with dynamic gestures, on which we train and ablate our main technical contributions. VLOGGER outperforms state-of-the-art methods in three public benchmarks, considering lip-syncing, image quality, identity preservation and temporal consistency, while also generating upper-body gestures. We analyze the performance of VLOGGER with respect to multiple diversity …
Poster
Zunnan Xu · Zhentao Yu · Zixiang Zhou · Jun Zhou · Xiaoyu Jin · Fa-Ting Hong · Xiaozhong Ji · Junwei Zhu · Chengfei Cai · Shiyu Tang · Qin Lin · Xiu Li · qinglin lu

[ ExHall D ]

Abstract
We introduce ImPortrait, a diffusion-based condition control method that employs implicit representations for highly controllable and lifelike portrait animation. Given a single portrait image as an appearance reference and video clips as driving templates, ImPortrait can animate the character in the reference image with the facial expression and head pose of the driving videos. In our framework, we utilize pre-trained encoders to decouple portrait motion information from identity in videos. To do so, an implicit representation is adopted to encode motion information and is employed as the control signal in the animation phase. By leveraging the power of stable video diffusion (SVD) as the main building block, we carefully design adapter layers to inject control signals into the denoising UNet through attention mechanisms. These designs bring rich spatial detail and temporal consistency. ImPortrait also exhibits strong generalization performance and can effectively disentangle appearance and motion under different image styles. Our framework outperforms existing methods, demonstrating superior temporal consistency and controllability.
Poster
Jianwen Jiang · Gaojie Lin · Zhengkun Rong · Chao Liang · Yongming Zhu · Jiaqi Yang · Tianyun Zhong

[ ExHall D ]

Abstract
Existing neural head avatar methods have achieved significant progress in the image quality and motion range of portrait animation. However, these methods neglect the computational overhead, and to the best of our knowledge, none is designed to run on mobile devices. This paper presents MobilePortrait, a lightweight one-shot neural head avatar method that reduces learning complexity by integrating external knowledge into both the motion modeling and image synthesis, enabling real-time inference on mobile devices. Specifically, we introduce a mixed representation of explicit and implicit keypoints for precise motion modeling and precomputed visual features for enhanced foreground and background synthesis. With these two key designs and by using simple U-Nets as backbones, our method achieves performance on par with state-of-the-art methods, while requiring less than one-tenth of the computational demand. It has been validated to reach speeds of over 100 FPS on mobile devices and support both video and audio-driven inputs.
Poster
Wojciech Zielonka · Timo Bolkart · Thabo Beeler · Justus Thies

[ ExHall D ]

Abstract
Current personalized neural head avatars face a trade-off: lightweight models lack detail and realism, while high-quality, animatable avatars require significant computational resources, making them unsuitable for commodity devices. To address this gap, we introduce Gaussian Eigen Models (GEM), which provide high-quality, lightweight, and easily controllable head avatars. GEM utilizes 3D Gaussian primitives for representing the appearance combined with Gaussian splatting for rendering. Building on the success of mesh-based 3D morphable face models (3DMM), we define GEM as an ensemble of linear eigenbases for representing the head appearance of a specific subject. In particular, we construct linear bases to represent the position, scale, rotation, and opacity of the 3D Gaussians. This allows us to efficiently generate Gaussian primitives of a specific head shape by a linear combination of the basis vectors, only requiring a low-dimensional parameter vector that contains the respective coefficients. We propose to construct these linear bases (GEM) by distilling high-quality, compute-intensive CNN-based Gaussian avatar models that can generate expression-dependent appearance changes like wrinkles. These high-quality models are trained on multi-view videos of a subject and are distilled using a series of principal component analyses. Once we have obtained the bases that represent the animatable appearance space of a …
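The linear-combination idea above maps directly to a few lines of code: flatten all Gaussian attributes into one vector and reconstruct it as mean plus basis times coefficients. The attribute layout, basis size, and class interface below are illustrative assumptions, not the released GEM code.

```python
# Sketch of the linear eigenbasis idea: Gaussian-splat attributes generated as
# mean + basis @ coefficients. Attribute names and sizes are assumptions only.
import numpy as np

class LinearGaussianBasis:
    def __init__(self, mean, basis):
        # mean:  (P,)   flattened Gaussian attributes (positions, scales, rotations, opacities)
        # basis: (P, K) leading principal components distilled from a heavier avatar model
        self.mean, self.basis = mean, basis

    def decode(self, coeffs):
        """coeffs: (K,) low-dimensional parameter vector -> full attribute vector."""
        return self.mean + self.basis @ coeffs

# Toy usage: 1,000 Gaussians with 11 attributes each, reconstructed from 32 coefficients.
P, K = 1000 * 11, 32
gem = LinearGaussianBasis(np.zeros(P), np.random.randn(P, K) * 0.01)
attrs = gem.decode(np.random.randn(K))
```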
Poster
Zhenglin Zhou · Fan Ma · Hehe Fan · Tat-seng Chua

[ ExHall D ]

Abstract
Animatable head avatar generation typically requires extensive data for training. To reduce the data requirements, a natural solution is to leverage existing data-free static avatar generation methods, such as pre-trained diffusion models with score distillation sampling (SDS), which align avatars with pseudo-ground-truth outputs from the diffusion model. However, directly distilling 4D avatars from video diffusion often leads to overly smooth results due to spatial and temporal inconsistencies in the video diffusion generation. To address this issue, we propose Symbiotic GENeration (SymGEN), a robust method that synthesizes spatially and temporally consistent datasets for 4D avatar reconstruction using the video diffusion model. Specifically, SymGEN iteratively constructs video datasets and optimizes animatable avatars in a progressive manner, ensuring that avatar quality increases smoothly and consistently throughout the learning process. This progressive learning involves two stages: (1) Spatial Consistency Learning fixes expressions and learns from front-to-side views, and (2) Temporal Consistency Learning fixes views and learns from relaxed to exaggerated expressions, generating 4D avatars in a simple-to-complex manner. Extensive experiments demonstrate that SymGEN improves fidelity, animation quality, and rendering speed compared to existing diffusion-based methods, providing a solution for lifelike avatar creation. The code will be publicly available.
Poster
Hyunsoo Cha · Inhee Lee · Hanbyul Joo

[ ExHall D ]

Abstract
We present PERSE, a method for building an animatable personalized generative avatar from a reference portrait. Our avatar model enables facial attribute editing in a continuous and disentangled latent space to control each facial attribute, while preserving the individual's identity. To achieve this, our method begins by synthesizing large-scale synthetic 2D video datasets, where each video contains consistent changes in facial expression and viewpoint, combined with a variation in a specific facial attribute from the original input. We propose a novel pipeline to produce high-quality, photorealistic 2D videos with facial attribute editing. Leveraging this synthetic attribute dataset, we present a personalized avatar creation method based on 3D Gaussian Splatting, learning a continuous and disentangled latent space for intuitive facial attribute manipulation. To enforce smooth transitions in this latent space, we introduce a latent space regularization technique that uses interpolated 2D faces as supervision. Compared to previous approaches, we demonstrate that PERSE generates high-quality avatars with interpolated attributes while preserving the identity of the reference person.
Poster
Zihao Huang · Shoukang Hu · Guangcong Wang · Tianqi Liu · Yuhang Zang · Zhiguo Cao · Wei Li · Ziwei Liu

[ ExHall D ]

Abstract
Existing research on avatar creation is typically limited to laboratory datasets, which require high costs against scalability and exhibit insufficient representation of the real world. On the other hand, the web abounds with off-the-shelf real-world human videos, but these videos vary in quality and require accurate annotations for avatar creation. To this end, we propose an automatic annotating pipeline with filtering protocols to curate these humans from the web. Our pipeline surpasses state-of-the-art methods on the EMDB benchmark, and the filtering protocols boost verification metrics on web videos. We then curate WildAvatar, a web-scale in-the-wild human avatar creation dataset extracted from YouTube, with 10,000+ different human subjects and scenes. WildAvatar is at least 10× richer than previous datasets for 3D human avatar creation and closer to the real world. To explore its potential, we demonstrate the quality and generalizability of avatar creation methods on WildAvatar. We will publicly release our code, data source links and annotations to push forward 3D human avatar creation and other related fields for real-world applications.
Poster
Hanxi Liu · Yifang Men · Zhouhui Lian

[ ExHall D ]

Abstract
Personalized 3D avatar editing holds significant promise due to its user-friendliness and availability to applications such as AR/VR and virtual try-ons. Previous studies have explored the feasibility of 3D editing, but often struggle to generate visually pleasing results, possibly due to the unstable representation learning under mixed optimization of geometry and texture in complicated reconstructed scenarios. In this paper, we aim to provide an accessible solution for ordinary users to create their editable 3D avatars with precise region localization, geometric adaptability, and photorealistic renderings. To tackle this challenge, we introduce a meticulously designed framework that decouples the editing process into local spatial adaptation and realistic appearance learning, utilizing a hybrid Tetrahedron-constrained Gaussian Splatting (TetGS) as the underlying representation. TetGS combines the controllable explicit structure of tetrahedral grids with the high-precision rendering capabilities of 3D Gaussian Splatting and is optimized in a progressive manner comprising three stages: 3D avatar instantiation from real-world monocular videos to provide accurate priors for TetGS initialization; localized spatial adaptation with explicitly partitioned tetrahedrons to guide the redistribution of Gaussian kernels; and geometry-based appearance generation with a coarse-to-fine activation strategy. Both qualitative and quantitative experiments demonstrate the effectiveness and superiority of our approach in generating photorealistic 3D …
Poster
Hang Ye · Xiaoxuan Ma · Hai Ci · Wentao Zhu · Yizhou Wang

[ ExHall D ]

Abstract
Achieving realistic animated human avatars requires accurate modeling of pose-dependent clothing deformations. Existing learning-based methods heavily rely on the Linear Blend Skinning (LBS) of minimally-clothed human models like SMPL to model deformation. However, these methods struggle to handle loose clothing, such as long dresses, where the canonicalization process becomes ill-defined when the clothing is far from the body, leading to disjointed and fragmented results. To overcome this limitation, we propose a novel hybrid framework to model challenging clothed humans. Our core idea is to use dedicated strategies to model different regions, depending on whether they are close to or distant from the body. This free-form generation paradigm brings enhanced flexibility and expressiveness to our hybrid framework, enabling it to capture the intricate geometric details of challenging loose clothing, such as skirts and dresses. Experimental results on the benchmark dataset featuring loose clothing demonstrate that our method achieves state-of-the-art performance with superior visual fidelity and realism, particularly in the most challenging cases.
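For reference, the standard linear blend skinning (LBS) formula that the above canonicalization builds on is reproduced below; the weights and bone transforms come from the underlying body model (e.g., SMPL), and this is only the textbook formulation, not the paper's hybrid deformation model.

```latex
% Standard linear blend skinning (LBS): a canonical vertex v is posed by a
% convex combination of bone transforms T_i(\theta) with skinning weights w_i.
v' = \sum_{i=1}^{B} w_i \, T_i(\theta) \begin{bmatrix} v \\ 1 \end{bmatrix},
\qquad \sum_{i=1}^{B} w_i = 1, \quad w_i \ge 0 .
```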
Poster
Chaoyue Song · Jianfeng Zhang · Xiu Li · Fan Yang · Yiwen Chen · Zhongcong Xu · Jun Hao Liew · Xiaoyang Guo · Fayao Liu · Jiashi Feng · Guosheng Lin

[ ExHall D ]

Abstract
With the explosive growth of 3D content creation, there is an increasing demand for automatically converting static 3D models into articulation-ready versions that support realistic animation. Traditional approaches rely heavily on manual annotation, which is both time-consuming and labor-intensive. Moreover, the lack of large-scale benchmarks has hindered the development of learning-based solutions. In this work, we present MagicArticulate, an effective framework that automatically transforms static 3D models into articulation-ready assets. Our key contributions are threefold. First, we introduce Articulation-XL, a large-scale benchmark containing over 33k 3D models with high-quality articulation annotations, carefully curated from Objaverse-XL. Second, we propose a novel skeleton generation method that formulates the task as a sequence modeling problem, leveraging an auto-regressive transformer to naturally handle varying numbers of bones or joints within skeletons and their inherent dependencies across different 3D models. Third, we predict skinning weights using a functional diffusion process that incorporates volumetric geodesic distance priors between vertices and joints. Extensive experiments demonstrate that MagicArticulate significantly outperforms existing methods across diverse object categories, achieving high-quality articulation that enables realistic animation. We will release our dataset and model to support further research.
Poster
Peng Li · Wangguandong Zheng · Yuan Liu · Tao Yu · Yangguang Li · Xingqun Qi · Xiaowei Chi · Siyu Xia · Yan-Pei Cao · Wei Xue · Wenhan Luo · Yike Guo

[ ExHall D ]

Abstract
Photorealistic 3D human modeling is essential for various applications and has seen tremendous progress. However, existing methods for monocular full-body reconstruction, typically relying on front and/or predicted back view, still struggle with satisfactory performance due to the ill-posed nature of the problem and sophisticated self-occlusions. In this paper, we propose PSHuman, a novel framework that explicitly reconstructs human meshes utilizing priors from the multiview diffusion model. It is found that directly applying multiview diffusion on single-view human images leads to severe geometric distortions, especially on generated faces. To address it, we propose a cross-scale diffusion that models the joint probability distribution of global full-body shape and local facial characteristics, enabling detailed and identity-preserved novel-view generation without any geometric distortion. Moreover, to enhance cross-view body shape consistency of varied human poses, we condition the generative model on parametric models like SMPL-X, which provide body priors and prevent unnatural views inconsistent with human anatomy. Leveraging the generated multiview normal and color images, we present SMPLX-initialized explicit human carving to recover realistic textured human meshes efficiently. Extensive experiments on CAPE and THuman2.1 datasets demonstrate PSHuman's superiority in geometry details, texture fidelity, and generalization capability.
Poster
Jiaqi Liu · Jichao Zhang · Paolo Rota · Nicu Sebe

[ ExHall D ]

Abstract
The Latent Diffusion Model (LDM) has demonstrated strong capabilities in high-resolution image generation and has been widely employed for Pose-Guided Person Image Synthesis (PGPIS), yielding promising results. However, the compression process of LDM often results in the deterioration of details, particularly in sensitive areas such as facial features and clothing textures. In this paper, we propose a Multi-focal Conditioned Latent Diffusion (MCLD) method to address these limitations by conditioning the model on disentangled, pose-invariant features from these sensitive regions. Our approach utilizes a multi-focal condition aggregation module, which effectively integrates facial identity and texture-specific information, enhancing the model’s ability to produce appearance-realistic and identity-consistent images. Our method demonstrates consistent identity and appearance generation on the DeepFashion dataset and enables flexible person image editing due to its generation consistency. The code will be released upon acceptance.
Poster
Nannan Zhang · Yijiang Li · Dong Du · Zheng Chong · Zhengwentai Sun · Jianhao Zeng · Yusheng Dai · Zhenyu Xie · Hairui Zhu · Xiaoguang Han

[ ExHall D ]

Abstract
This paper tackles the emerging challenge of multi-view virtual try-on, utilizing both front- and back-view clothing images as inputs. Extending frontal try-on methods to a multi-view context is not straightforward. Simply concatenating the two input views or encoding their features for a generative model, such as a diffusion model, often fails to produce satisfactory results. The main challenge lies in effectively extracting and fusing meaningful clothing features from these input views. Existing explicit warping-based methods, which establish direct correspondence between input and target views, tend to introduce artifacts, particularly when there is a significant disparity between the input and target views. Conversely, implicit encoding-based methods often lose spatial information about clothing, resulting in outputs that lack detail. To overcome these challenges, we propose Robust-MVTON, an end-to-end method for robust and high-quality multi-view try-ons. Our approach introduces a novel cross-pose feature alignment technique to guide the fusion of clothing features and incorporates a newly designed loss function for training. With the fused multi-scale clothing features, we employ a coarse-to-fine diffusion model to generate realistic and detailed results. Extensive experiments conducted on the Deepfashion and MPV datasets affirm the superiority of our method, achieving state-of-the-art performance.
Poster
Yang Zheng · Menglei Chai · Delio Vicini · Yuxiao Zhou · Yinghao Xu · Leonidas Guibas · Gordon Wetzstein · Thabo Beeler

[ ExHall D ]

Abstract
We present GroomLight, a novel method for relightable hair appearance modeling from multi-view images. Existing hair capture methods struggle to balance photorealistic rendering with relighting capabilities. Analytical material models, while physically grounded, often fail to fully capture appearance details. Conversely, neural rendering approaches excel at view synthesis but generalize poorly to novel lighting conditions. GroomLight addresses this challenge by combining the strengths of both paradigms. It employs an extended hair BSDF model to capture primary light transport and a light-aware residual model to reconstruct the remaining details. We further propose a hybrid inverse rendering pipeline to optimize both components, enabling high-fidelity relighting, view synthesis, and material editing. Extensive evaluations on real-world hair data demonstrate state-of-the-art performance.
Poster
Xingyu Ren · Jiankang Deng · Yuhao Cheng · Wenhan Zhu · Yichao Yan · Xiaokang Yang · Stefanos Zafeiriou · Chao Ma

[ ExHall D ]

Abstract
Recent 3D face reconstruction methods have made remarkable advancements, yet achieving high-quality facial reflectance from monocular input remains challenging. Existing methods rely on the light-stage captured data to learn facial reflectance models. However, limited subject diversity in these datasets poses challenges in achieving good generalization and broad applicability. This motivates us to explore whether the extensive priors captured in recent generative diffusion models (e.g., Stable Diffusion) can enable more generalizable facial reflectance estimation as these models have been pre-trained on large-scale internet image collections containing rich visual patterns. In this paper, we introduce the use of Stable Diffusion as a prior for facial reflectance estimation, achieving robust results with minimal captured data for fine-tuning. We present S3-Face, a comprehensive framework capable of producing SSS-compliant skin reflectance from in-the-wild images. Our method adopts a two-stage training approach: in the first stage, DSN-Net is trained to predict diffuse albedo, specular albedo, and normal maps from in-the-wild images using a novel joint reflectance attention module. In the second stage, HM-Net is trained to generate hemoglobin and melanin maps based on the diffuse albedo predicted in the first stage, yielding SSS-compliant and detailed reflectance maps. Extensive experiments demonstrate that our method achieves strong generalization …
Poster
Yizhilv · Xiao Lu · Hong Ding · Jingbo Hu · Zhi Jiang · Chunxia Xiao

[ ExHall D ]

Abstract
Eyeglass reflection removal can restore the texture information in the reflection-corrupted eye area, which is meaningful for various tasks on facial images. It remains challenging to correctly eliminate reflections, plausibly restore the lost content, and guarantee that the final result has color and illumination consistent with the input image. In this paper, we introduce a Degradation-guided Local-to-Global (DL2G) restoration method to address this problem. We first propose a multiplicative reflection degradation model, which is used to alleviate the reflection degradation and obtain a preliminary result. Then, in the local details restoration stage, we propose a local structure-aware diffusion model to learn the true distribution of texture details in the eye area. This helps recover the lost content in heavily degraded regions where the background is invisible. Finally, in the global consistency refinement stage, we utilize the input image as a reference image to generate a final result that is consistent with the input image in color and illumination. Extensive experiments demonstrate that our method improves the effect of reflection removal and generates results with more reasonable semantics, exquisite details, and harmonious illumination.
Poster
yuxuan Gu · Huaian Chen · Yi Jin · Haoxuan Wang · Pengyang Ling · ZHIXIANG WEI · Enhong Chen

[ ExHall D ]

Abstract
In this paper, we observe that the collaboration of various foundation models can perceive semantic and degradation information within images, thereby guiding the low-light enhancement process. Specifically, we propose a self-supervised low-light enhancement framework based on the collaboration of multiple foundation models (dubbed FoCo), aimed at improving both the visual quality of enhanced images and the performance in high-level applications. At the feature level, FoCo leverages the rich features from various foundation models to enhance the model's semantic perception during training, thereby reducing the gap between enhanced results and high-quality images from a high-level perspective. At the task level, we exploit the robustness gap between strong foundation models and weak models, applying high-level task guidance to the low-light enhancement training process. Through the collaboration of multiple foundation models, the proposed framework shows better enhancement performance and adapts better to high-level tasks. Extensive experiments across various enhancement and application benchmarks demonstrate the qualitative and quantitative superiority of the proposed method over numerous state-of-the-art techniques.
Poster
Shuangfan Zhou · Chu Zhou · Youwei Lyu · Heng Guo · Zhanyu Ma · Boxin Shi · Imari Sato

[ ExHall D ]

Abstract
Polarization cameras can capture multiple polarized images with different polarizer angles in a single shot, bringing convenience to polarization-based downstream tasks. However, their direct outputs are color-polarization filter array (CPFA) raw images, requiring demosaicing to reconstruct full-resolution, full-color polarized images; unfortunately, this necessary step introduces artifacts that make polarization-related parameters such as the degree of polarization (DoP) and angle of polarization (AoP) prone to error. In addition, limited by hardware design, the resolution of a polarization camera is often much lower than that of a conventional RGB camera. Existing polarized image demosaicing (PID) methods are limited in that they cannot enhance resolution, while polarized image super-resolution (PISR) methods, though designed to obtain high-resolution (HR) polarized images from the demosaicing results, tend to retain or even amplify errors in the DoP and AoP introduced by demosaicing artifacts. In this paper, we propose PIDSR, a joint framework that performs complementary Polarized Image Demosaicing and Super-Resolution, able to robustly obtain high-quality HR polarized images with more accurate DoP and AoP from a CPFA raw image in a direct manner. Experiments show that our PIDSR not only achieves state-of-the-art performance on both synthetic and real data, but also facilitates downstream tasks.
Poster
Yuhui Liu · Liangxun Ou · Qiang Fu · Hadi Amata · Wolfgang Heidrich · YIFAN PENG

[ ExHall D ]

Abstract
Extracting high-fidelity RGBD information from two-dimensional (2D) images is essential for various visual computing applications. Stereo imaging, as a reliable passive imaging technique for obtaining three-dimensional (3D) scene information, has benefited greatly from deep learning advancements. However, existing stereo depth estimation algorithms struggle to perceive high-frequency information and resolve high-resolution depth maps in realistic camera settings with large depth variations. These algorithms commonly neglect the hardware parameter configuration, limiting the potential for achieving optimal solutions solely through software-based design strategies. This work presents a hardware-software co-designed RGBD imaging framework that leverages both stereo and focus cues to reconstruct texture-rich color images along with detailed depth maps over a wide depth range. A pair of rank-2 parameterized diffractive optical elements (DOEs) is employed to optically encode perpendicular complementary information during stereo acquisition. Additionally, we employ an IGEV-UNet-fused neural network tailored to the proposed rank-2 encoding for stereo matching and image reconstruction. Through prototyping a stereo camera with customized DOEs, our deep stereo imaging paradigm demonstrates superior performance over existing monocular and stereo imaging systems, with a 2.96 dB gain in image PSNR and improved depth accuracy in high-frequency details across distances from 0.67 to 8 meters.
Poster
ZELIN LI · Chenwei Wang · Zhaoke Huang

[ ExHall D ]

Abstract
3D fluorescence microscopy is essential for understanding fundamental life processes through long-term live-cell imaging. However, due to inherent issues in the imaging principles, it faces significant challenges including spatially varying noise and anisotropic resolution, where the axial resolution lags behind the lateral resolution by up to 4.5 times. Meanwhile, laser power is kept low to maintain cell viability, making paired low-noise, high-resolution ground truth (GT) inaccessible. To tackle these limitations, we propose Volume Tells (VTCD), a dual cycle-consistent diffusion framework that mines intra-volume imaging priors within 3D cell volumes in an unsupervised manner, achieving denoising and super-resolution (SR) simultaneously. Specifically, a spatially iso-distributed denoiser is designed to exploit the noise distribution consistency between adjacent low-noise and high-noise regions within the 3D cell volume, suppressing the spatially varying noise. Then, in light of the structural consistency of the cell volume, a cross-plane global-propagation SR module propagates high-resolution details from the XY plane into adjacent regions in the XZ and YZ planes, progressively enhancing resolution across the entire 3D cell volume. Experimental results on 10 in vivo cellular datasets demonstrate substantial improvements in both denoising and super-resolution, with axial resolution enhanced from ~430 nm to ~90 nm.
Poster
Jungho Lee · Suhwan Cho · Taeoh Kim · Ho-Deok Jang · Minhyeok Lee · Geonho Cha · Dongyoon Wee · Dogyoon Lee · Sangyoun Lee

[ ExHall D ]

Abstract
3D Gaussian Splatting (3DGS) has attracted significant attention for its high-quality novel view rendering, inspiring research to address real-world challenges. While conventional methods depend on sharp images for accurate scene reconstruction, real-world scenarios are often affected by defocus blur due to finite depth of field, making it essential to account for realistic 3D scene representation. In this study, we propose CoCoGaussian, a Circle of Confusion-aware Gaussian Splatting that enables precise 3D scene representation using only defocused images. CoCoGaussian addresses the challenge of defocus blur by modeling the Circle of Confusion (CoC) through a physically grounded approach based on the principles of photographic defocus. Exploiting 3D Gaussians, we compute the CoC diameter from depth and learnable aperture information, generating multiple Gaussians to precisely capture the CoC shape. Furthermore, we introduce a learnable scaling factor to enhance robustness and provide more flexibility in handling unreliable depth in scenes with reflective or refractive surfaces. Experiments on both synthetic and real-world datasets demonstrate that CoCoGaussian achieves state-of-the-art performance across multiple benchmarks.
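As background for the depth-to-CoC computation mentioned above, the following is a minimal NumPy sketch of the classical thin-lens circle-of-confusion formula; the function name, the units, and the aperture parameterization are illustrative assumptions, not the paper's implementation, which additionally learns aperture information and spawns multiple Gaussians per CoC.

```python
import numpy as np

def coc_diameter(depth, focus_dist, focal_len, aperture):
    """Thin-lens circle-of-confusion diameter (same length unit as the inputs).

    depth:      per-point scene depth S2
    focus_dist: in-focus distance S1
    focal_len:  lens focal length f
    aperture:   entrance-pupil (aperture) diameter A
    """
    depth = np.asarray(depth, dtype=np.float64)
    return aperture * focal_len * np.abs(depth - focus_dist) / (depth * (focus_dist - focal_len))

# Example: a 50 mm lens with a 25 mm pupil (f/2), focused at 2 m (all values in mm).
print(coc_diameter([1.0e3, 2.0e3, 8.0e3], focus_dist=2.0e3, focal_len=50.0, aperture=25.0))
```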
Poster
Zixuan Chen · Yujin Wang · Xin Cai · Zhiyuan You · Zhe-Ming Lu · Fan Zhang · Shi Guo · Tianfan Xue

[ ExHall D ]

Abstract
Capturing high dynamic range (HDR) scenes is one of the most important issues in camera design. The majority of cameras use an exposure fusion technique, which fuses images captured at different exposure levels, to increase dynamic range. However, this approach can only handle images with limited exposure differences, normally 3-4 stops. When applied to very high dynamic range scenes where a large exposure difference is required, this approach often fails due to incorrect alignment, inconsistent lighting between inputs, or tone-mapping artifacts. In this work, we propose UltraFusion, the first exposure fusion technique that can merge inputs with a 9-stop exposure difference. The key idea is that we model exposure fusion as a guided inpainting problem, where the under-exposed image is used as guidance to fill in the missing highlight information in the over-exposed regions. Using the under-exposed image as soft guidance, instead of a hard constraint, our model is robust to potential alignment issues and lighting variations. Moreover, utilizing the image prior of the generative model, our model also produces natural tone mapping, even for very high dynamic range scenes. Our approach outperforms HDR-Transformer on the latest HDR benchmarks. Moreover, to test its performance in ultra high dynamic range scenes, we …
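To make the "soft guidance versus hard constraint" idea concrete, the toy sketch below builds a soft over-exposure mask and blends in an aligned under-exposed frame. This naive blend is only an illustration under assumed thresholds; the paper instead uses the under-exposed image to guide a generative inpainting model rather than blending pixels directly.

```python
import numpy as np

def soft_overexposure_mask(img, lo=0.85, hi=0.98):
    """Soft mask in [0, 1]: 1 where highlights are blown out, 0 where well exposed."""
    lum = img.max(axis=-1)                     # brightest channel per pixel
    return np.clip((lum - lo) / (hi - lo), 0.0, 1.0)

# Hypothetical usage with random stand-in images of shape (H, W, 3) in [0, 1].
over = np.random.rand(4, 4, 3)                 # over-exposed frame
under_aligned = np.random.rand(4, 4, 3)        # under-exposed frame, aligned and gain-matched
m = soft_overexposure_mask(over)[..., None]    # soft guidance weight, not a hard cutoff
fused = (1.0 - m) * over + m * under_aligned
print(fused.shape)                             # (4, 4, 3)
```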
Poster
Xiaoyu Zhang · Weihong Pan · Chong Bao · Xiyu Zhang · Xiaojun Xiang · Hanqing Jiang · Hujun Bao

[ ExHall D ]

Abstract
Humans perceive and comprehend their surroundings through information spanning multiple frequencies. In immersive scenes, people naturally scan their environment to grasp its overall structure while examining fine details of objects that capture their attention. However, current NeRF frameworks primarily focus on modeling either high-frequency local views or the broad, low-frequency structure of scenes, and struggle to balance both. We introduce FA-NeRF, a novel frequency-aware framework for view synthesis that simultaneously captures the overall scene structure and high-definition details within a single NeRF model. To achieve this, we propose a 3D frequency quantification method that analyzes the scene's frequency distribution, enabling frequency-aware rendering. Our framework incorporates a frequency grid for fast convergence and querying, and a frequency-aware feature re-weighting strategy to balance features across different frequency content. Extensive experiments show that our method significantly outperforms existing approaches in modeling entire scenes while preserving fine details.
Poster
Jiajun Tang · Fan Fei · Zhihao Li · Xiao Tang · Shiyong Liu · Youyu Chen · Binxiao Huang · Dave Zhenyu Chen · Xiaofei Wu · Boxin Shi

[ ExHall D ]

Abstract
3D Gaussian Splatting (3DGS), a recently emerged multi-view 3D reconstruction technique, has shown significant advantages in real-time rendering and explicit editing. However, 3DGS encounters challenges in the accurate modeling of both high-frequency view-dependent appearances and global illumination effects, including inter-reflection. This paper introduces SpecTRe-GS, which addresses these challenges and models highly Specular surfaces that reflect nearby objects through Tracing Rays in 3D Gaussian Splatting. SpecTRe-GS separately models reflections from highly specular and rough surfaces to leverage the distinctions between their reflective properties, integrating an efficient ray tracer within the 3DGS framework for querying secondary rays, thus achieving fast and accurate rendering. Also, it incorporates normal prior guidance and joint geometry optimization at various stages of the training process to enhance geometry reconstruction for undistorted reflections. The proposed SpecTRe-GS demonstrates superior performance compared to existing 3DGS-based methods in capturing highly specular inter-reflections, as confirmed by experiments conducted on both synthetic and real-world scenes. We also showcase the editing applications enabled by the scene decomposition capabilities of SpecTRe-GS.
Poster
Hanxiao Sun · Yupeng Gao · Jin Xie · Jian Yang · Beibei Wang

[ ExHall D ]

Abstract
Reconstructing 3D assets from images, known as inverse rendering (IR), remains a challenging task due to its ill-posed nature and the complexities of appearance and lighting. 3D Gaussian Splatting (3DGS) has demonstrated impressive capabilities for novel view synthesis (NVS) tasks. It has also been introduced into relighting by decoupling radiance into Bidirectional Reflectance Distribution Function (BRDF) parameters and environmental lighting. Unfortunately, these methods often produce inferior relighting quality, exhibiting visible artifacts and unnatural indirect illumination. The main reason is the limited capability of each Gaussian, which has constant material parameters and normal, alongside the absence of physical constraints for indirect lighting. In this paper, we present a novel framework called Curved Gaussian Inverse Rendering (CG-IR), aimed at enhancing both NVS and relighting quality. To this end, we propose a new representation—Curved Gaussian (CG)—that generalizes per-Gaussian constant material parameters to allow for spatially varying parameters, indicating that different regions of each Gaussian can have various normals and material properties. This enhanced representation is complemented by a CG splatting scheme akin to vertex/fragment shading in traditional graphics pipelines. Furthermore, we integrate a physically-based indirect lighting model, enabling more realistic relighting. The proposed CG-IR framework significantly improves rendering quality, outperforming state-of-the-art NeRF-based methods …
Poster
Qiyu Dai · Xingyu Ni · Qianfan Shen · Mengyu Chu · Wenzheng Chen · Baoquan Chen

[ ExHall D ]

Abstract
We consider the problem of adding dynamic rain effects to in-the-wild scenes in a physically correct manner. Recent advances in scene modeling have made significant progress, with NeRF and 3DGS techniques emerging as powerful tools for reconstructing complex scenes. However, while effective for novel view synthesis, these methods typically struggle with challenging scene editing tasks, such as physics-based rain simulation. In contrast, traditional physics-based simulations can generate realistic rain effects, such as raindrops and splashes, but they often rely on skilled artists to carefully set up high-fidelity scenes. This process lacks flexibility and scalability, limiting its applicability to broader, open-world environments. In this work, we introduce RainyGS, a novel approach that leverages the strengths of both physics-based modeling and 3DGS to generate photorealistic, dynamic rain effects in open-world scenes with physical accuracy. At the core of our method is the integration of physically-based raindrop and shallow water simulation techniques within the fast 3DGS rendering framework, enabling realistic and efficient simulations of raindrop behavior, splashes, and reflections. Our method supports synthesizing rain effects at over 10 fps, offering users flexible control over rain intensity—from light drizzles to heavy downpours. We demonstrate that RainyGS performs effectively for both real-world outdoor scenes and …
Poster
Ludwic Leonard · Nils Thuerey · rüdiger westermann

[ ExHall D ]

Abstract
We introduce a single-view reconstruction technique for volumetric fields in which multiple light scattering effects are omnipresent, such as in clouds. We model the unknown distribution of volumetric fields using an unconditional diffusion model trained on a novel benchmark dataset comprising 1,000 synthetically simulated volumetric density fields. The neural diffusion model is trained on the latent codes of a novel, diffusion-friendly, monoplanar representation. The generative model is used to incorporate a tailored parametric diffusion posterior sampling technique into different reconstruction tasks. A physically-based differentiable volume renderer is employed to provide gradients with respect to light transport in the latent space. This stands in contrast to classic NeRF approaches and makes the reconstructions better aligned with observed data. Through various experiments, we demonstrate single-view reconstruction of volumetric clouds at a previously unattainable quality.
Poster
Juan Rodriguez · Abhay Puri · Shubham Agarwal · Issam Laradji · Pau Rodriguez · Sai Rajeswar · David Vazquez · Christopher Pal · Marco Pedersoli

[ ExHall D ]

Abstract
Scalable Vector Graphics (SVGs) are vital for modern image rendering due to their scalability and versatility. Previous SVG generation methods have focused on curve-based vectorization, lacking semantic understanding, often producing artifacts, and struggling with SVG primitives beyond path curves. To address these issues, we introduce StarVector, a multimodal large language model for SVG generation. It performs image vectorization by understanding image semantics and using SVG primitives for compact, precise outputs. Unlike traditional methods, StarVector works directly in the SVG code space, leveraging visual understanding to apply accurate SVG primitives. To train StarVector, we create SVG-Stack, a diverse dataset of 2M samples that enables generalization across vectorization tasks and precise use of primitives like ellipses, polygons, and text. We address challenges in SVG evaluation, showing that pixel-based metrics like MSE fail to capture the unique qualities of vector graphics. We introduce SVG-Bench, a benchmark across 10 datasets and three tasks: image vectorization, text-driven SVG generation, and diagram generation. With these contributions, StarVector achieves state-of-the-art performance, producing more compact and semantically rich SVGs.
Poster
Cheng Sun · Jaesung Choe · Charles Loop · Wei-Chiu Ma · Yu-Chiang Frank Wang

[ ExHall D ]

Abstract
We propose an efficient radiance field rendering algorithm that incorporates a rasterization process on sparse voxels without neural networks or 3D Gaussians. There are two key contributions coupled with the proposed system. The first is to render sparse voxels in the correct depth order along pixel rays by using dynamic Morton ordering. This avoids the well-known popping artifact found in Gaussian splatting. Second, we adaptively fit sparse voxels to different levels of detail within scenes, faithfully reproducing scene details while achieving high rendering frame rates. Our method improves upon the previous neural-free voxel grid representation by over 4 dB PSNR and more than 10x rendering FPS, achieving novel-view synthesis results comparable to the state of the art. Additionally, our neural-free sparse voxels are seamlessly compatible with grid-based 3D processing algorithms. We achieve promising mesh reconstruction accuracy by integrating TSDF-Fusion and Marching Cubes into our sparse grid system.
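For readers unfamiliar with Morton (Z-order) keys, the snippet below shows a standard 10-bit-per-axis 3D Morton encoding that such depth-ordered voxel traversal builds on. The per-ray dynamic reordering described above (e.g., flipping axes by ray octant) is not reproduced, and the helper names are my own.

```python
def part1by2(n: int) -> int:
    """Spread the low 10 bits of n so two zero bits separate each original bit."""
    n &= 0x000003FF
    n = (n | (n << 16)) & 0xFF0000FF
    n = (n | (n << 8))  & 0x0300F00F
    n = (n | (n << 4))  & 0x030C30C3
    n = (n | (n << 2))  & 0x09249249
    return n

def morton3d(x: int, y: int, z: int) -> int:
    """Interleave coordinate bits (x in the lowest position) into a Z-order key."""
    return part1by2(x) | (part1by2(y) << 1) | (part1by2(z) << 2)

# Sorting voxel indices by this key yields a Z-order traversal; per-ray variants
# (e.g., flipping axis bits by ray octant) give front-to-back order along pixel rays.
voxels = [(3, 1, 0), (0, 0, 0), (1, 2, 3)]
print(sorted(voxels, key=lambda v: morton3d(*v)))   # [(0, 0, 0), (3, 1, 0), (1, 2, 3)]
```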
Poster
Minye Wu · Haizhao Dai · Kaixin Yao · Jingyi Yu · Tinne Tuytelaars

[ ExHall D ]

Abstract
Differentiable rendering enables efficient optimization by allowing gradients to be computed through the rendering process, facilitating 3D reconstruction, inverse rendering and neural scene representation learning. To ensure differentiability, existing solutions approximate or re-formulate traditional rendering operations using smooth, probabilistic proxies such as volumes or Gaussian primitives. Consequently, they struggle to preserve sharp edges due to the lack of explicit boundary definitions. We present a novel hybrid representation, Bézier Gaussian Triangle (BG-Triangle), that combines Bézier triangle-based vector graphics primitives with Gaussian-based probabilistic models, to maintain accurate shape modeling while conducting resolution-independent differentiable rendering. We present a robust and effective discontinuity-aware rendering technique to reduce uncertainties at object boundaries. We also employ an adaptive densification and pruning scheme for efficient training while reliably handling level-of-detail (LoD) variations. Experiments show that BG-Triangle achieves comparable rendering quality as 3DGS but with superior boundary preservation. More importantly, BG-Triangle uses a much smaller number of primitives than its alternatives, showcasing the benefits of vectorized graphics primitives and the potential to bridge the gap between classic and emerging representations.
Poster
Himangi Mittal · Peiye Zhuang · Hsin-Ying Lee · Shubham Tulsiani

[ ExHall D ]

Abstract
We propose UniPhy, a common latent-conditioned neural constitutive model that can encode the physical properties of diverse materials. At inference, UniPhy allows 'inverse simulation', i.e., inferring material properties by optimizing the scene-specific latent to match the available observations via differentiable simulation. In contrast to existing methods that treat such inference as system identification, UniPhy does not rely on user-specified material information. Compared to prior neural constitutive modeling approaches, which learn scene-specific networks, the shared training across materials improves both the robustness and the accuracy of the estimates. We train UniPhy using simulated trajectories across diverse geometries and materials: elastic, plasticine, sand, and fluids (Newtonian and non-Newtonian). At inference, given an object with unknown material properties, UniPhy can infer the material properties via latent optimization to match the motion observations, and can then allow re-simulating the object under diverse scenarios. We compare UniPhy against prior inverse simulation methods, and show that the inference from UniPhy enables more accurate replay and re-simulation under novel conditions.
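The "inverse simulation" pattern, optimizing a scene-specific latent through a differentiable simulator until the replay matches observations, can be sketched with a toy system. Everything below (the damped-oscillator "simulator", the 2-D latent, the decode function, and the optimizer settings) is an assumed stand-in that illustrates the optimization loop only, not UniPhy's constitutive model.

```python
import torch
import torch.nn.functional as F

def decode(latent):
    """Toy 'constitutive model': map a 2-D latent to positive stiffness/damping."""
    return F.softplus(latent[0]) + 0.1, F.softplus(latent[1]) * 0.1

def simulate(latent, steps=200, dt=0.02):
    """Differentiable damped-oscillator rollout (stand-in for a real simulator)."""
    k, c = decode(latent)
    x, v = torch.tensor(1.0), torch.tensor(0.0)
    traj = []
    for _ in range(steps):
        v = v + dt * (-k * x - c * v)      # semi-implicit Euler step
        x = x + dt * v
        traj.append(x)
    return torch.stack(traj)

# "Observed" motion produced by an unknown material latent.
with torch.no_grad():
    target = simulate(torch.tensor([1.5, 0.5]))

# Inverse simulation: optimize the latent so the replay matches the observations.
latent = torch.zeros(2, requires_grad=True)
opt = torch.optim.Adam([latent], lr=0.05)
for _ in range(300):
    loss = torch.mean((simulate(latent) - target) ** 2)
    opt.zero_grad(); loss.backward(); opt.step()
print("recovered latent:", latent.detach().numpy())   # should approach [1.5, 0.5]
```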
Poster
Kaiwei Zhang · Dandan Zhu · Xiongkuo Min · Guangtao Zhai

[ ExHall D ]

Abstract
Mesh saliency enhances the adaptability of 3D vision by identifying and emphasizing regions that naturally attract visual attention. To investigate the interaction between geometric structure and texture in shaping visual attention, we establish a comprehensive mesh saliency dataset, which is the first to systematically capture the differences in saliency distribution under both textured and non-textured visual conditions. Furthermore, we introduce mesh Mamba, a unified saliency prediction model based on a state space model (SSM), designed to adapt across various mesh types. Mesh Mamba effectively analyzes the geometric structure of the mesh while seamlessly incorporating texture features into the topological framework, ensuring coherence throughout appearance-enhanced modeling. More importantly, by subgraph embedding and a bidirectional SSM, the model enables global context modeling for both local geometry and texture, preserving the topological structure and improving the understanding of visual details and structural complexity. Through extensive theoretical and empirical validation, our model not only improves performance across various mesh types but also demonstrates high scalability and versatility, particularly through cross validations of various visual features.
Poster
Xiaoliang Ju · Hongsheng Li

[ ExHall D ]

Abstract
We present DirectTriGS, a novel framework designed for 3D object generation with Gaussian Splatting (GS). GS-based rendering for 3D content has gained considerable attention recently. However, there has been limited exploration in directly generating 3D Gaussians compared to traditional generative modeling approaches. The main challenge lies in the complex data structure of GS, represented by discrete point clouds with multiple channels. To overcome this challenge, we propose employing the triplane representation, which allows us to represent Gaussian Splatting as an image-like continuous field. This representation effectively encodes both the geometry and texture information, enabling smooth transformation back to Gaussian point clouds and rendering into images by a TriRenderer, with only 2D supervision. The proposed TriRenderer is fully differentiable, so that the rendering loss can supervise both texture and geometry encoding. Furthermore, the triplane representation can be compressed using a Variational Autoencoder (VAE), which can subsequently be utilized in latent diffusion to generate 3D objects. The experiments demonstrate that the proposed generation framework can produce high-quality 3D object geometry and rendering results.
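The triplane idea referenced above, storing a continuous 3D field as three orthogonal 2D feature planes that are bilinearly sampled and aggregated per query point, can be sketched as follows. The plane ordering, summation-based aggregation, and tensor shapes are assumptions for illustration, not DirectTriGS's actual encoder or TriRenderer.

```python
import torch
import torch.nn.functional as F

def sample_triplane(planes, xyz):
    """Query a triplane field at 3D points.

    planes: (3, C, H, W) feature planes for the XY, XZ, YZ projections
    xyz:    (N, 3) points in [-1, 1]^3
    returns (N, C) features (sum over the three planes)
    """
    coords = [xyz[:, [0, 1]], xyz[:, [0, 2]], xyz[:, [1, 2]]]   # projections per plane
    feats = 0.0
    for plane, uv in zip(planes, coords):
        grid = uv.view(1, -1, 1, 2)                              # (1, N, 1, 2)
        f = F.grid_sample(plane[None], grid,                     # (1, C, N, 1)
                          mode="bilinear", align_corners=True)
        feats = feats + f[0, :, :, 0].t()                        # accumulate (N, C)
    return feats

planes = torch.randn(3, 8, 64, 64, requires_grad=True)           # learnable triplane
pts = torch.rand(1024, 3) * 2 - 1
print(sample_triplane(planes, pts).shape)                         # torch.Size([1024, 8])
```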
Poster
Mark Boss · Zixuan Huang · Aaryaman Vasishta · Varun Jampani

[ ExHall D ]

Abstract
We present SF3D, a novel method for rapid and high-quality textured object mesh reconstruction from a single image in just 0.5 seconds. Unlike most existing approaches, SF3D is explicitly trained for mesh generation, incorporating a fast UV unwrapping technique that enables swift texture generation rather than relying on vertex colors. The method also learns to predict material parameters and normal maps to enhance the visual quality of the reconstructed 3D meshes. Furthermore, SF3D integrates a delighting step to effectively remove low-frequency illumination effects, ensuring that the reconstructed meshes can be easily used in novel illumination conditions. Experiments demonstrate the superior performance of SF3D over the existing techniques.
Poster
Rui Chen · Jianfeng Zhang · Yixun Liang · Guan Luo · Weiyu Li · Jiarui Liu · Xiu Li · Xiaoxiao Long · Jiashi Feng · Ping Tan

[ ExHall D ]

Abstract
Recent 3D content generation pipelines commonly employ Variational Autoencoders (VAEs) to encode shapes into compact latent representations for diffusion-based generation. However, the widely adopted uniform point sampling strategy in Shape VAE training often leads to a significant loss of geometric detail, limiting the quality of shape reconstruction and downstream generation tasks. We present Dora-VAE, a novel approach that enhances VAE reconstruction through our proposed sharp edge sampling strategy and a dual cross-attention mechanism. By identifying and prioritizing regions with high geometric complexity during training, our method significantly improves the preservation of fine-grained shape features. This sampling strategy and the dual attention mechanism enable the VAE to focus on crucial geometric details that are typically missed by uniform sampling approaches. To systematically evaluate VAE reconstruction quality, we additionally propose Dora-bench, a benchmark that quantifies shape complexity through the density of sharp edges, introducing a new metric focused on reconstruction accuracy at these salient geometric features. Extensive experiments on the Dora-bench demonstrate that Dora-VAE achieves reconstruction quality comparable to the state-of-the-art dense XCube-VAE while requiring a latent space at least 8× smaller (1,280 vs. 10,000 codes). We will release our code and benchmark dataset to facilitate future research in 3D shape modeling.
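A sharp-edge-biased surface sampler of the kind alluded to can be sketched as below: part of the point budget is drawn area-uniformly and the rest with probability boosted by a per-face sharpness score (e.g., derived from dihedral angles). The 50/50 split, the score definition, and the function signature are assumptions; the paper's exact sampling strategy and dual cross-attention are not reproduced.

```python
import numpy as np

def sample_surface(vertices, faces, sharpness, n_points, sharp_frac=0.5, rng=None):
    """Sample surface points, biasing a fraction of them toward 'sharp' faces.

    vertices:  (V, 3) floats; faces: (F, 3) ints
    sharpness: (F,) nonnegative per-face sharpness score (assumed given)
    """
    rng = np.random.default_rng(0) if rng is None else rng
    tris = vertices[faces]                                           # (F, 3, 3)
    area = 0.5 * np.linalg.norm(
        np.cross(tris[:, 1] - tris[:, 0], tris[:, 2] - tris[:, 0]), axis=1)

    def draw(weights, n):
        fid = rng.choice(len(faces), size=n, p=weights / weights.sum())
        u, v = rng.random((2, n))
        flip = u + v > 1.0
        u[flip], v[flip] = 1.0 - u[flip], 1.0 - v[flip]              # uniform barycentrics
        t = tris[fid]
        return t[:, 0] + u[:, None] * (t[:, 1] - t[:, 0]) + v[:, None] * (t[:, 2] - t[:, 0])

    n_sharp = int(n_points * sharp_frac)
    return np.concatenate([draw(area, n_points - n_sharp),           # area-uniform points
                           draw(area * sharpness + 1e-12, n_sharp)]) # sharpness-biased points

verts = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], float)
faces = np.array([[0, 1, 2], [0, 1, 3]])
print(sample_surface(verts, faces, np.array([0.1, 2.0]), n_points=8).shape)  # (8, 3)
```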
Poster
Suizhi Huang · Xingyi Yang · Hongtao Lu · Xinchao Wang

[ ExHall D ]

Abstract
Implicit Neural Representations (INRs) have emerged as a powerful framework for representing continuous signals. However, generating diverse INR weights remains challenging due to limited training data. We introduce Few-shot Implicit Function Generation, a new problem setup that aims to generate diverse yet functionally consistent INR weights from only a few examples. This is challenging because even for the same signal, the optimal INRs can vary significantly depending on their initializations. To tackle this, we propose EquiGen, a framework that can generate new INRs from limited data. The core idea is that functionally similar networks can be transformed into one another through weight permutations, forming an equivariance group. By projecting these weights into an equivariant latent space, we enable diverse generation within these groups, even with few examples. EquiGen implements this through an equivariant encoder trained via contrastive learning and smooth augmentation, an equivariance-guided diffusion process, and controlled perturbations in the equivariant subspace. Experiments on 2D image and 3D shape INR datasets demonstrate that our approach effectively generates diverse INR weights while preserving their functional properties in few-shot scenarios.
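The weight-permutation symmetry that this approach builds on is easy to verify directly: permuting the hidden units of an MLP (rows of the first layer and the matching columns of the second) leaves the function unchanged. The sketch below checks this on a random two-layer ReLU network; it illustrates only the equivariance group, not EquiGen's encoder or diffusion process.

```python
import torch

torch.manual_seed(0)
d_in, d_hidden, d_out = 3, 16, 2
W1, b1 = torch.randn(d_hidden, d_in), torch.randn(d_hidden)
W2, b2 = torch.randn(d_out, d_hidden), torch.randn(d_out)

def mlp(x, W1, b1, W2, b2):
    return W2 @ torch.relu(W1 @ x + b1) + b2

# Permute the hidden neurons: rows of (W1, b1) and the matching columns of W2.
perm = torch.randperm(d_hidden)
W1p, b1p, W2p = W1[perm], b1[perm], W2[:, perm]

x = torch.randn(d_in)
print(torch.allclose(mlp(x, W1, b1, W2, b2), mlp(x, W1p, b1p, W2p, b2)))  # True
```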
Poster
Amir Barda · Matheus Gadelha · Vladimir G. Kim · Noam Aigerman · Amit H. Bermano · Thibault Groueix

[ ExHall D ]

Abstract
We propose a generative technique to edit 3D shapes, represented as meshes, NeRFs, or Gaussian Splats, in 3 seconds, without the need to run an SDS-type optimization. Our key insight is to cast 3D editing as a multiview image inpainting problem, as this representation is generic and can be mapped back to any 3D representation using the bank of available Large Reconstruction Models. We explore different fine-tuning strategies to obtain both multiview generation and inpainting capabilities within the same diffusion model. In particular, the design of the inpainting mask is an important factor in training an inpainting model, and we propose several masking strategies to mimic the types of edits a user would perform on a 3D shape. Our approach takes 3D generative editing from hours to seconds and produces higher-quality results compared to previous works.
Poster
Sinisa Stekovic · Arslan Artykov · Stefan Ainetter · Mattia Durso · Friedrich Fraundorfer

[ ExHall D ]

Abstract
We propose PyTorchGeoNodes, a differentiable module for 3D object reconstruction from images using interpretable shape programs. Unlike traditional CAD model retrieval, shape programs allow semantic reasoning, editing, and a low memory footprint. Despite their potential, shape programs for 3D scene understanding have been largely overlooked. Our key contribution is enabling gradient-based optimization by translating shape programs, like those in Blender, into efficient PyTorch code. Additionally, we show that a combination of PyTorchGeoNodes with Genetic Algorithms is a method of choice to optimize both discrete and continuous shape program parameters for 3D reconstruction, and can be further integrated with other reconstruction algorithms such as Gaussian Splats. Our experiments on the ScanNet dataset show that our method achieves accurate reconstructions.
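The core mechanism, expressing a shape program as differentiable PyTorch code so its continuous parameters can be fitted by gradient descent, is illustrated by the deliberately tiny toy below, where a "program" emits box corner vertices from three parameters that are then recovered from a target. The paper translates full Blender geometry-node programs and handles discrete parameters with genetic algorithms, none of which is reproduced here.

```python
import torch

def box_program(width, height, depth):
    """A tiny differentiable 'shape program': axis-aligned box corner vertices."""
    signs = torch.tensor([[sx, sy, sz] for sx in (-1., 1.)
                          for sy in (-1., 1.) for sz in (-1., 1.)])
    return 0.5 * signs * torch.stack([width, height, depth])

# Target: corners of a 2 x 1 x 3 box; recover the parameters by gradient descent.
target = box_program(torch.tensor(2.0), torch.tensor(1.0), torch.tensor(3.0))
params = torch.ones(3, requires_grad=True)
opt = torch.optim.Adam([params], lr=0.05)
for _ in range(400):
    loss = ((box_program(*params) - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
print(params.detach())   # ≈ tensor([2., 1., 3.])
```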
Poster
Susung Hong · Johanna Suvi Karras · Ricardo Martin · Ira Kemelmacher-Shlizerman

[ ExHall D ]

Abstract
The fields of 3D reconstruction and text-based 3D editing have advanced significantly with the evolution of text-based diffusion models. While existing 3D editing methods excel at modifying color, texture, and style, they struggle with extensive geometric or appearance changes, thus limiting their applications. We propose **Perturb-and-Revise**, which makes possible a wide variety of NeRF edits. First, we **perturb** the NeRF parameters with random initializations to create a versatile initialization. We automatically determine the perturbation magnitude through analysis of the local loss landscape. Then, we **revise** the edited NeRF via generative trajectories. Combined with the generative process, we impose identity-preserving gradients to refine the edited NeRF. Extensive experiments demonstrate that Perturb-and-Revise facilitates flexible, effective, and consistent editing of color, appearance, and geometry in 3D without model retraining.
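The "perturb" step boils down to injecting scaled random noise into trained parameters; a minimal version is sketched below. The relative-std scaling is a placeholder assumption, whereas the paper derives the perturbation magnitude from an analysis of the local loss landscape, which is not modeled here.

```python
import torch

@torch.no_grad()
def perturb_(model: torch.nn.Module, rel_sigma: float = 0.05):
    """Add Gaussian noise to every parameter, scaled by that tensor's own std."""
    for p in model.parameters():
        p.add_(torch.randn_like(p) * rel_sigma * p.std())

# Stand-in for a trained NeRF MLP (positional-encoded input -> density/color head).
mlp = torch.nn.Sequential(torch.nn.Linear(63, 256), torch.nn.ReLU(), torch.nn.Linear(256, 4))
before = mlp[0].weight.clone()
perturb_(mlp, rel_sigma=0.1)
print(float((mlp[0].weight - before).abs().mean()))   # small, nonzero change
```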
Poster
Yufei Huang · Bangyan Liao · Yuqi Hu · Haitao Lin · Lirong Wu · Siyuan Li · Cheng Tan · Zicheng Liu · Yunfan Liu · Zelin Zang · Chang Yu · Zhen Lei

[ ExHall D ]

Abstract
Score Distillation Sampling (SDS) has been successfully extended to text-driven 3D scene editing with 2D pretrained diffusion models. However, SDS-based editing methods suffer from lengthy optimization processes with slow inference and low quality. We attribute the lengthy optimization to the stochastic optimization scheme used in SDS-based editing, where many steps may conflict with each other (e.g., the inherent trade-off between editing and preservation). To reduce this internal conflict and speed up the editing process, we propose to separate editing and preservation in time with a diffusion time schedule and frame the 3D editing optimization process as a diffusion bridge sampling process. Motivated by the analysis above, we introduce DaCapo, a fast diffusion-sampling-like 3D editing method that incorporates a novel stacked bridge framework, which estimates a direct diffusion bridge between the source and target distributions with only a pretrained 2D diffusion model. Specifically, it models the editing process as a combination of inversion and generation, where both processes happen simultaneously as a stack of diffusion bridges. DaCapo shows a 15× speed-up with results comparable to the state-of-the-art SDS-based method. It completes the process in just 2,500 steps on a single GPU and accommodates a variety of 3D representation methods.
Poster
Takuhiro Kaneko

[ ExHall D ]

Abstract
Recent advancements in neural 3D representations, such as neural radiance fields (NeRF) and 3D Gaussian splatting (3DGS), have made accurate estimation of the 3D structure from multiview images possible. However, this capability is limited to estimating the visible external structure, and it is still difficult to identify the invisible internal structure hidden behind the surface. To overcome this limitation, we address a new task called structure from collision (SfC), which aims to estimate the structure (including the invisible internal one) of an object from the appearance changes at collision. To solve this task, we propose a novel model called SfC-NeRF, which optimizes the invisible internal structure (i.e., internal volume density) of the object through a video sequence under physical, appearance (i.e., visible external structure)-preserving, and key-frame constraints. In particular, to avoid falling into undesirable local optima owing to its ill-posed nature, we propose volume annealing, i.e., searching for the global optima by repeatedly reducing and expanding the volume. Extensive experiments on 60 cases involving diverse structures (i.e., various cavity shapes, locations, and sizes) and various material properties reveal the properties of SfC and demonstrate the effectiveness of the proposed SfC-NeRF.
Poster
Zixuan Chen · Guangcong Wang · Jiahao Zhu · Jianhuang Lai · Xiaohua Xie

[ ExHall D ]

Abstract
3D Gaussian Splatting (3DGS) has recently created impressive assets for various applications. However, the copyright of these assets is not well protected, as existing watermarking methods are not suited for 3DGS considering security, capacity, and invisibility. Besides, these methods often require hours or even days for optimization, limiting their application scenarios. In this paper, we propose **GuardSplat**, an innovative and efficient framework that effectively protects the copyright of 3DGS assets. Specifically, **1)** we first propose a CLIP-guided Message Decoupling Optimization module for training the message decoder, leveraging CLIP's aligning capability and rich representations to achieve a high extraction accuracy with minimal optimization costs, presenting exceptional **capability** and **efficiency**. **2)** Then, we propose a Spherical-harmonic-aware (SH-aware) Message Embedding module tailored for 3DGS, which employs a set of SH offsets to seamlessly embed the message into the SH features of each 3D Gaussian while maintaining the original 3D structure. It enables the 3DGS assets to be watermarked with minimal fidelity trade-offs and prevents malicious users from removing the messages from the model files, meeting the demands for **invisibility** and **security**. **3)** We further propose an Anti-distortion Message Extraction module to improve **robustness** against various visual distortions. Extensive experiments demonstrate that **GuardSplat** outperforms the state-of-the-art methods and achieves fast optimization speed. The …
Poster
Zhengxue Wang · Zhiqiang Yan · Jinshan Pan · Guangwei Gao · Kai Zhang · Jian Yang

[ ExHall D ]

Abstract
Recent RGB-guided depth super-resolution methods have achieved impressive performance under the assumption of fixed and known degradation (e.g., bicubic downsampling). However, in real-world scenarios, captured depth data often suffer from unconventional and unknown degradation due to sensor limitations and complex imaging environments (e.g., low-reflectivity surfaces, varying illumination). Consequently, the performance of these methods declines significantly when real-world degradations deviate from these assumptions. In this paper, we propose the Degradation Oriented and Regularized Network (DORNet), a novel framework designed to adaptively address unknown degradation in real-world scenes through implicit degradation representations. Our approach begins with the development of a self-supervised degradation learning strategy, which models the degradation representations of low-resolution depth data using routing-selection-based degradation regularization. To facilitate effective RGB-D fusion, we further introduce a degradation-oriented feature transformation module that selectively propagates RGB content into the depth data based on the learned degradation priors. Extensive experimental results on both real and synthetic datasets demonstrate the superiority of our DORNet in handling unknown degradations, outperforming existing methods.
Poster
Hengyu Liu · Yuehao Wang · Chenxin Li · Ruisi Cai · Kevin Wang · Wuyang Li · Pavlo Molchanov · Peihao Wang · Zhangyang Wang

[ ExHall D ]

Abstract
3D Gaussian splatting (3DGS) has enabled various applications in 3D scene representation and novel view synthesis due to its efficient rendering capabilities. However, 3DGS demands significant GPU memory, limiting its use on devices with restricted computational resources. Previous approaches have focused on pruning less important Gaussians, effectively compressing 3DGS but often requiring a fine-tuning stage and lacking adaptability for the specific memory needs of different devices. In this work, we present an elastic inference method for 3DGS. Given an input for the desired model size, our method selects and transforms a subset of Gaussians, achieving substantial rendering performance without additional fine-tuning. We introduce a tiny learnable module that controls Gaussian selection based on the input percentage, along with a transformation module that adjusts the selected Gaussians to complement the performance of the reduced model. Comprehensive experiments on ZipNeRF, MipNeRF, and Tanks & Temples scenes demonstrate the effectiveness of our approach. Code will be publicly available.
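At its simplest, elastic inference amounts to ranking Gaussians and materializing only the requested fraction at render time; a sketch follows, using opacity as a stand-in importance score. The learnable selection module and the transformation module that adjusts the surviving Gaussians are the paper's contributions and are omitted here.

```python
import torch

def elastic_subset(gaussians: dict, scores: torch.Tensor, keep_ratio: float) -> dict:
    """Keep the top `keep_ratio` fraction of Gaussians ranked by a score.

    gaussians: dict of per-Gaussian tensors, e.g. {"xyz": (N, 3), "opacity": (N, 1), ...}
    scores:    (N,) importance scores (assumed given here; the paper learns them)
    """
    n_keep = max(1, int(scores.numel() * keep_ratio))
    idx = torch.topk(scores, n_keep).indices
    return {k: v[idx] for k, v in gaussians.items()}

N = 100_000
scene = {"xyz": torch.randn(N, 3), "opacity": torch.rand(N, 1), "sh": torch.randn(N, 48)}
small = elastic_subset(scene, scene["opacity"].squeeze(1), keep_ratio=0.3)
print({k: tuple(v.shape) for k, v in small.items()})   # 30% of the Gaussians kept
```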
Poster
You Shen · Zhipeng Zhang · Xinyang Li · Yansong Qu · Yu Lin · Shengchuan Zhang · Liujuan Cao

[ ExHall D ]

Abstract
Representing 3D scenes from multiview images is a core challenge in computer vision and graphics, which requires both precise rendering and accurate reconstruction. Recently, 3D Gaussian Splatting (3DGS) has garnered significant attention for its high-quality rendering and fast inference speed. Yet, due to the unstructured and irregular nature of Gaussian point clouds, ensuring accurate geometry reconstruction remains difficult. Existing methods primarily focus on geometry regularization, with common approaches including primitive-based and dual-model frameworks. However, the former suffers from inherent conflicts between rendering and reconstruction, while the latter is computationally and storage-intensive. To address these challenges, we propose CarGS, a unified model leveraging Contribution adaptive regularization to achieve simultaneous, high-quality rendering and surface reconstruction. The essence of our framework is learning adaptive contribution for Gaussian primitives by squeezing the knowledge from geometry regularization into a compact MLP. Additionally, we introduce a geometry-guided densification strategy with clues from both normals and Signed Distance Fields (SDF) to improve the capability of capturing high-frequency details. Our design improves the mutual learning of the two tasks, meanwhile its unified structure doesn't require separate models as in dual-model based approaches, guaranteeing efficiency. Extensive experiments demonstrate CarGS’s ability to achieve state-of-the-art (SOTA) results in both rendering fidelity …
Poster
Suyoung Lee · JAEYOUNG CHUNG · Kihoon Kim · Jaeyoo Huh · Gunhee Lee · Minsoo Lee · Kyoung Mu Lee

[ ExHall D ]

Abstract
Feed-forward 3D Gaussian Splatting (3DGS) models have gained significant popularity due to their ability to generate scenes immediately without needing per-scene optimization. Although omnidirectional images are becoming more popular, since they reduce the computation needed to stitch images into a holistic scene, existing feed-forward models are designed only for perspective images. The unique optical properties of omnidirectional images make it difficult for feature encoders to correctly understand the context of the image and make the Gaussians non-uniform in space, which hinders the quality of images synthesized from novel views. We propose OmniSplat, a pioneering work for fast feed-forward 3DGS generation from a few omnidirectional images. We introduce a Yin-Yang grid and decompose images based on it to reduce the domain gap between omnidirectional and perspective images. The Yin-Yang grid can use the existing CNN structure as is, while its quasi-uniform characteristic allows the decomposed images to resemble perspective images, so the network can exploit the strong prior knowledge of the learned feed-forward model. OmniSplat demonstrates higher reconstruction accuracy than existing feed-forward networks trained on perspective images. Furthermore, we enhance the segmentation consistency between omnidirectional images by leveraging attention from the encoder of OmniSplat, providing fast and clean 3DGS editing results.
Poster
Chung-Ho Wu · Yang-Jung Chen · Ying-Huan Chen · Jie-Ying Lee · Bo-Hsu Ke · Chun-Wei Tuan Mu · Yichuan Huang · Chin-Yang Lin · Min-Hung Chen · Yen-Yu Lin · Yu-Lun Liu

[ ExHall D ]

Abstract
Three-dimensional scene inpainting is crucial for applications from virtual reality to architectural visualization, yet existing methods struggle with view consistency and geometric accuracy in 360° unbounded scenes. We present AuraFusion360, a novel reference-based method leveraging 3D Gaussian Splatting for high-quality object removal and hole filling. Our approach introduces (1) depth-aware unseen mask generation for accurate occlusion identification, (2) Adaptive Guided Depth Diffusion for geometric consistency, and (3) SDEdit-based detail enhancement for multi-view coherence. We also introduce 360-USID, the first comprehensive dataset for unbounded scene inpainting with ground truth. Extensive experiments demonstrate that AuraFusion360 significantly outperforms existing methods, achieving superior perceptual quality while maintaining geometric accuracy across dramatic viewpoint changes.
Poster
Chong Bao · Xiyu Zhang · Zehao Yu · Jiale Shi · Guofeng Zhang · Songyou Peng · Zhaopeng Cui

[ ExHall D ]

Abstract
3D Gaussian Splatting (3DGS) has demonstrated remarkable success in high-quality 3D neural reconstruction and novel view rendering with dense input views and accurate poses. However, applying 3DGS to sparse, unposed views in unbounded 360-degree scenes remains a challenging problem. In this paper, we propose a novel neural rendering framework to accomplish the unposed and extremely sparse-view 3D reconstruction in unbounded 360-degree scenes. To resolve the spatial ambiguity inherent in unbounded scenes with sparse input views, we propose a layered Gaussian-based representation to effectively model the scene with distinct spatial layers. By employing a dense stereo reconstruction model to recover coarse geometry, we introduce a reconstruction bootstrap optimization that refines the noise and distortions in the coarse geometry. Furthermore, we propose an iterative fusion of reconstruction and generation, facilitating mutual conditioning and enhancement between these two processes. Comprehensive experiments show that our approach outperforms existing state-of-the-art methods in terms of rendering quality and surface reconstruction accuracy.
Poster
Nicole Meng · Caleb Manicke · Ronak Sahu · Caiwen Ding · Yingjie Lao

[ ExHall D ]

Abstract
Generalizable Neural Radiance Fields (GNeRF) are recognized as one of the most promising techniques for novel view synthesis and 3D model generation in real-world applications. However, like other generative models in computer vision, ensuring their adversarial robustness against various threat models is essential for practical use. The pioneering work in this area, NeRFool, introduced a state-of-the-art attack that targets GNeRFs by manipulating source views before feature extraction, successfully disrupting the color and density results of the constructed views. Building on this foundation, we propose IL2-NeRF (Iterative L2 NeRF Attack), a novel adversarial attack method that explores a new threat model (in the L2 domain) for attacking GNeRFs. We evaluated IL2-NeRF against two standard GNeRF models across three benchmark datasets, demonstrating performance comparable to NeRFool under the same evaluation metrics NeRFool proposed. Our results establish IL2-NeRF as the first adversarial method for GNeRFs under the L2 norm. We establish a foundational L2 threat model for future research, enabling direct performance comparisons while introducing a smoother, image-wide perturbation approach to adversarial 3D reconstruction.
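An iterative L2-bounded attack of the kind described, normalized-gradient ascent followed by projection onto an L2 ball around the clean source views, can be sketched generically as below. The objective here is a toy placeholder for a differentiable GNeRF rendering loss, and eps, alpha, and steps are illustrative values rather than the paper's settings.

```python
import torch

def iterative_l2_attack(views, loss_fn, eps=4.0, alpha=0.5, steps=20):
    """Iterative L2-bounded perturbation of source views (PGD-style, L2 norm).

    views:   (V, 3, H, W) clean images in [0, 1]
    loss_fn: maps perturbed views -> scalar objective to maximize
             (stand-in for a differentiable GNeRF rendering loss)
    eps:     L2 radius of the allowed perturbation per image
    """
    delta = torch.zeros_like(views, requires_grad=True)
    for _ in range(steps):
        loss = loss_fn((views + delta).clamp(0, 1))
        grad, = torch.autograd.grad(loss, delta)
        g = grad / (grad.flatten(1).norm(dim=1).view(-1, 1, 1, 1) + 1e-12)
        with torch.no_grad():
            delta += alpha * g                                    # normalized ascent step
            norms = delta.flatten(1).norm(dim=1).view(-1, 1, 1, 1)
            delta *= (eps / norms).clamp(max=1.0)                 # project onto the L2 ball
    return (views + delta).detach().clamp(0, 1)

# Toy usage with a placeholder objective (brighten the views as much as allowed).
clean = torch.rand(2, 3, 8, 8)
adv = iterative_l2_attack(clean, lambda v: v.sum())
print(float((adv - clean).flatten(1).norm(dim=1).max()))          # <= eps
```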
Poster
Jiahe Li · Feiyu Wang · Xiaochao Qu · WU CHENGJING · Luoqi Liu · Ting Liu

[ ExHall D ]

Abstract
Gaussian Splatting (GS)-based methods rely on sufficient training view coverage and perform synthesis on interpolated views. In this work, we tackle the more challenging and underexplored Extrapolated View Synthesis (EVS) task. Here we enable GS-based models trained with limited view coverage to generalize well to extrapolated views. To achieve our goal, we propose a view augmentation framework to guide training through a coarse-to-fine process. At the coarse stage, we reduce rendering artifacts due to insufficient view coverage by introducing a regularization strategy at both the appearance and geometry levels. At the fine stage, we generate reliable view priors to provide further training guidance. To this end, we incorporate occlusion awareness into the view prior generation process, and refine the view priors with the aid of the coarse-stage output. We call our framework Enhanced View Prior Guidance for Splatting (EVPGS). To comprehensively evaluate EVPGS on the EVS task, we collect a real-world dataset called Merchandise3D dedicated to the EVS scenario. Experiments on three datasets, including both real and synthetic data, demonstrate that EVPGS achieves state-of-the-art performance, while improving synthesis quality at extrapolated views for GS-based methods both qualitatively and quantitatively. We will make our code, dataset, and models public.
Poster
Xiaoding Yuan · Shitao Tang · Kejie Li · Peng Wang

[ ExHall D ]

Abstract
This paper introduces Camera-free Diffusion (CamFreeDiff) model for 360 image outpainting from a single camera-free image and text description. This method distinguishes itself from existing strategies, such as MVDiffusion, by eliminating the requirement for predefined camera poses. CamFreeDiff seamlessly incorporates a mechanism for predicting homography within the multi-view diffusion framework. The key component of our approach is to formulate camera estimation by directly predicting the homography transformation from the input view to the predefined canonical view. In contrast to the direct two-stage approach of image transformation and outpainting, CamFreeDiff utilizes predicted homography to establish point-level correspondences between the input view and the target panoramic view. This enables consistency through correspondence-aware attention, which is learned in a fully differentiable manner. Qualitative and quantitative experimental results demonstrate the strong robustness and performance of CamFreeDiff for 360 image outpainting in the challenging context of camera-free inputs.
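Once a homography to the canonical view is predicted, point-level correspondences follow from projective warping of pixel coordinates, as in the short sketch below; the example homography is invented for illustration, and the correspondence-aware attention itself is not shown.

```python
import numpy as np

def warp_points(H, pts):
    """Map (N, 2) pixel coordinates through a 3x3 homography H."""
    pts_h = np.concatenate([pts, np.ones((len(pts), 1))], axis=1)   # to homogeneous coords
    mapped = pts_h @ H.T
    return mapped[:, :2] / mapped[:, 2:3]                           # back to pixel coords

# A hypothetical predicted homography (here: rotation + scale + translation).
theta, s, tx, ty = np.deg2rad(10.0), 1.2, 30.0, -12.0
H = np.array([[s * np.cos(theta), -s * np.sin(theta), tx],
              [s * np.sin(theta),  s * np.cos(theta), ty],
              [0.0,                0.0,               1.0]])

src = np.array([[0, 0], [640, 0], [640, 480], [0, 480]], dtype=float)
print(warp_points(H, src))   # where the input-view corners land in the canonical view
```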
Poster
Yash Kant · Ethan Weber · Jin Kyu Kim · Rawal Khirodkar · Zhaoen Su · Julieta Martinez · Igor Gilitschenski · Shunsuke Saito · Timur Bagautdinov

[ ExHall D ]

Abstract
We present Pippo, a generative model capable of producing a dense set of high-resolution (1K) multi-view images of a person from a single photo. Our approach does not require any parametric model fitting or camera parameters of the input image, and generalizes to arbitrary identities with diverse clothing and hair styles. Pippo is a multi-view diffusion transformer trained in multiple stages. First, we pretrain the model on a billion-scale human-centric image dataset. Second, we train the model on studio data to generate many low-resolution consistent views conditioned on a coarse camera and an input image. Finally, we fine-tune the model on high-resolution data for multi-view generation with minimal placement controls, further improving consistency. This training strategy allows us to retain the generalizability from the large-scale pretraining while enabling high-resolution multi-view synthesis. We investigate several key architecture design choices for multi-view generation with diffusion transformers for precise view and identity control. Using a newly introduced 3D consistency metric, we demonstrate that Pippo outperforms existing approaches on multi-view human generation from a single image.
Poster
Yihang Luo · Shangchen Zhou · Yushi Lan · Xingang Pan · Chen Change Loy

[ ExHall D ]

Abstract
Despite advances in neural rendering, due to the scarcity of high-quality 3D datasets and the inherent limitations of multi-view diffusion models, view synthesis and 3D model generation are restricted to low resolutions with suboptimal multi-view consistency. In this study, we present a novel 3D enhancement pipeline, dubbed 3DEnhancer, which employs a multi-view latent diffusion model to enhance coarse 3D inputs while preserving multi-view consistency. Our method includes a pose-aware encoder and a diffusion-based denoiser to refine low-quality multi-view images, along with data augmentation and multi-view row attention and epipolar aggregation modules to ensure high-quality, consistent 3D outputs across views. Unlike existing video-based approaches, our model supports seamless multi-view enhancement with improved coherence under diverse viewing angles. Extensive evaluations demonstrate that 3DEnhancer significantly outperforms existing methods, improving both multi-view enhancement and per-instance 3D optimization tasks. Code and model will be publicly available.
Poster
Hanwen Jiang · Zexiang Xu · Desai Xie · Chen Ziwen · Haian Jin · Fujun Luan · ZHIXIN SHU · Kai Zhang · Sai Bi · Xin Sun · Jiuxiang Gu · Qixing Huang · Georgios Pavlakos · Hao Tan

[ ExHall D ]

Abstract
We propose scaling up 3D scene reconstruction by training with synthesized data. At the core of our work is MegaSynth, a procedurally generated 3D dataset comprising 700K scenes—over 50 times larger than the prior real dataset DL3DV—dramatically scaling the training data. To enable scalable data generation, our key idea is eliminating semantic information, removing the need to model complex semantic priors such as object affordances and scene composition. Instead, we model scenes with basic spatial structures and geometry primitives, offering scalability. Besides, we control data complexity to facilitate training while loosely aligning it with real-world data distribution to benefit real-world generalization. We explore training LRMs with both MegaSynth and available real data. Experiment results show that joint training or pre-training with MegaSynth improves reconstruction quality by 1.2 to 1.8 dB PSNR across diverse image domains. Moreover, models trained solely on MegaSynth perform comparably to those trained on real data, underscoring the low-level nature of 3D reconstruction. Additionally, we provide an in-depth analysis of MegaSynth's properties for enhancing model capability, training stability, and generalization. We will make our data and the data generation code public upon publication.
Poster
Haofei Xu · Songyou Peng · Fangjinhua Wang · Hermann Blum · Daniel Barath · Andreas Geiger · Marc Pollefeys

[ ExHall D ]

Abstract
Gaussian splatting and single/multi-view depth estimation are typically studied in isolation. In this paper, we present DepthSplat to connect Gaussian splatting and depth estimation and study their interactions. More specifically, we first contribute a robust multi-view depth model by leveraging pre-trained monocular depth features, leading to high-quality feed-forward 3D Gaussian splatting reconstructions. We also show that Gaussian splatting can serve as an unsupervised pre-training objective for learning powerful depth models from large-scale unlabeled datasets. We validate the synergy between Gaussian splatting and depth estimation through extensive ablation and cross-task transfer experiments. Our DepthSplat achieves state-of-the-art performance on ScanNet, RealEstate10K and DL3DV datasets in terms of both depth estimation and novel view synthesis, demonstrating the mutual benefits of connecting both tasks. We invite the readers to view our supplementary video for feed-forward reconstruction results of large-scale or 360 scenes from up to 12 input views at 512×960 resolutions. Our code and models will be publicly available.
Poster
Fan Yang · Jianfeng Zhang · Jun Hao Liew · Chaoyue Song · Zhongcong Xu · Xiu Li · Jiashi Feng · Guosheng Lin

[ ExHall D ]

Abstract
Multi-view image synthesis models are limited by a lack of training data. Fine-tuning well-trained video generative models to generate 360-degree videos of objects offers a promising solution, as they inherit the strong generative priors from the pretrained knowledge. However, these methods often face computational bottlenecks due to the large number of viewpoints, with temporal attention mechanisms often used to mitigate this. Unfortunately, such techniques can introduce artifacts like 3D inconsistency and over-smoothing. To overcome this, we propose a novel sparsification approach that reduces the video diffusion model to a sparse view synthesis model. We first extract rich geometric priors from pretrained video diffusion models and then conduct high-fidelity sparse multi-view synthesis to improve the 3D consistency. Extensive experiments show that our approach achieves superior efficiency, generalization, and consistency, outperforming state-of-the-art multi-view synthesis methods.
Poster
Alex Trevithick · Roni Paiss · Philipp Henzler · Dor Verbin · Rundi Wu · Hadi Alzayer · Ruiqi Gao · Ben Poole · Jonathan T. Barron · Aleksander Holynski · Ravi Ramamoorthi · Pratul P. Srinivasan

[ ExHall D ]

Abstract
Novel-view synthesis techniques achieve impressive results for static scenes but struggle when faced with the inconsistencies inherent to casual capture settings: varying illumination, scene motion, and other unintended effects that are difficult to model explicitly. We present an approach for leveraging generative video models to simulate the inconsistencies in the world that can occur during capture. We use this process, along with existing multi-view datasets, to create synthetic data for training a multi-view harmonization network that is able to reconcile inconsistent observations into a consistent 3D scene. We demonstrate that our world-simulation strategy significantly outperforms traditional augmentation methods in handling real-world scene variations, thereby enabling highly accurate static 3D reconstructions in the presence of a variety of challenging inconsistencies.
Poster
Hanyang Wang · Fangfu Liu · Jiawei Chi · Yueqi Duan

[ ExHall D ]

Abstract
Recovering 3D scenes from sparse views is a challenging task due to its inherently ill-posed nature. Conventional methods have developed specialized solutions (e.g., geometry regularization or feed-forward deterministic models) to mitigate the issue. However, they still suffer from performance degradation when input views have minimal overlap and insufficient visual information. Fortunately, recent video generative models show promise in addressing this challenge as they are capable of generating video clips with plausible 3D structures. Powered by large pretrained video diffusion models, some pioneering works have started to explore the potential of video generative priors and create 3D scenes from sparse views. Despite impressive improvements, they are limited by slow inference time and the lack of 3D constraints, leading to inefficiencies and reconstruction artifacts that do not align with real-world geometric structure. In this paper, we propose VideoScene to distill the video diffusion model to generate 3D scenes in one step, aiming to build an efficient and effective tool to bridge the gap from video to 3D. Specifically, we design a 3D-aware leap flow distillation strategy to leap over time-consuming redundant information and train a dynamic denoising policy network to adaptively determine the optimal leap timestep during inference. Extensive experiments demonstrate that our …
Poster
Liyan Chen · Huangying Zhan · Kevin Chen · Xiangyu Xu · Qingan Yan · Changjiang Cai · Yi Xu

[ ExHall D ]

Abstract
We introduce ActiveGAMER, an active mapping system that utilizes 3D Gaussian Splatting (3DGS) to achieve high-quality, real-time scene mapping and exploration. Unlike traditional NeRF-based methods, which are computationally demanding and restrict active mapping performance, our approach leverages the efficient rendering capabilities of 3DGS, allowing effective and efficient exploration in complex environments. The core of our system is a rendering-based information gain module that dynamically identifies the most informative viewpoints for next-best-view planning, enhancing both geometric and photometric reconstruction accuracy. ActiveGAMER also integrates a carefully balanced framework, combining coarse-to-fine exploration, post-refinement, and a global-local keyframe selection strategy to maximize reconstruction completeness and fidelity. Our system autonomously explores and reconstructs environments with state-of-the-art geometric and photometric accuracy and completeness, significantly surpassing existing approaches in both aspects. Extensive evaluations on benchmark datasets such as Replica and MP3D highlight ActiveGAMER's effectiveness in active mapping tasks.
Poster
Dongrui Dai · Yuxiang Xing

[ ExHall D ]

Abstract
3D Gaussian splatting (3DGS) has shown impressive performance in 3D scene reconstruction. However, it suffers from severe degradation when the number of training views is limited, resulting in blur and floaters. Many works have been devoted to standardizing the optimization process of 3DGS through regularization techniques. However, we identify that inadequate initialization is a critical issue overlooked by current studies. To address this, we propose EAP-GS, a method to enhance initialization for fast, accurate, and stable few-shot scene reconstruction. Specifically, we introduce an Attentional Pointcloud Augmentation (APA) technique, which retains two-view tracks as an option for pointcloud generation. Additionally, the scene complexity is used to determine the required density distribution, thereby constructing a better pointcloud. We implemented APA by extending Structure-From-Motion (SFM) to focus on pointcloud generation in regions with complex structure but sparse pointcloud distribution, dramatically increasing the number of valuable points and effectively harmonizing the density distribution. A better pointcloud leads to more accurate scene geometry and mitigates local overfitting during the reconstruction stage. Experimental results on forward-facing datasets from various indoor and outdoor scenes demonstrate that the proposed EAP-GS achieves outstanding scene reconstruction performance and surpasses state-of-the-art methods.
Poster
Guoyu Lu

[ ExHall D ]

Abstract
Scene reconstruction has a wide range of applications in computer vision and robotics. To build practical constraints and feature correspondences, rich textures and distinguished gradient variations are particularly required in classic and learning-based SfM. When reconstructing low-texture regions with repeated patterns, especially mostly white indoor rooms, there is a significant drop in performance. In this work, we propose Shading-SfM-Net (Shading & structure-from-motion network), a novel framework for simultaneously learning a shape-from-shading network based on the inverse rendering constraint and a structure-from-motion framework based on warped keypoint and geometric consistency, to improve structure-from-motion and surface reconstruction for low-texture indoor scenes. Shading-SfM-Net tightly incorporates the surface shape consistency and 3D geometric registration loss in order to exploit their mutual information and further overcome the instability on flat regions. We evaluate the proposed framework on texture-less indoor scenes (NYU v2 and ScanNet), and show that for each individual network without simultaneous training, our method is able to achieve comparable results to state-of-the-art methods. By simultaneously learning shape and motion from the two networks, our pipeline is able to achieve state-of-the-art performance with superior generalization capability for unseen texture-less datasets.
Poster
Jinbo Yan · Rui Peng · Zhiyan Wang · Luyang Tang · Jiayu Yang · Jie Liang · Jiahao Wu · Ronggang Wang

[ ExHall D ]

Abstract
Building Free-Viewpoint Videos in a streaming manner offers the advantage of rapid responsiveness compared to offline training methods, greatly enhancing user experience. However, current streaming approaches face challenges of high per-frame reconstruction time (10s+) and error accumulation, limiting their broader application. In this paper, we propose Instant Gaussian Stream (IGS), a fast and generalizable streaming framework, to address these issues. First, we introduce a generalized Anchor-driven Gaussian Motion Network, which projects multi-view 2D motion features into 3D space, using anchor points to drive the motion of all Gaussians. This generalized network generates the motion of Gaussians for each target frame in the time required for a single inference. Second, we propose a Key-frame-guided Streaming Strategy that refines each key frame, enabling accurate reconstruction of temporally complex scenes while mitigating error accumulation. We conducted extensive in-domain and cross-domain evaluations, demonstrating that our approach can achieve streaming with an average per-frame reconstruction time of 2s+, alongside an enhancement in view synthesis quality.
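The anchor-driven motion idea above is easy to picture in code. The following is a minimal sketch, not the authors' network: it assumes per-anchor displacements have already been predicted and simply blends them onto all Gaussian centers with k-nearest-neighbor RBF weights (the function names, k, and sigma are hypothetical).

```python
import numpy as np

def propagate_anchor_motion(gaussian_xyz, anchor_xyz, anchor_motion, k=4, sigma=0.1):
    """Drive per-Gaussian motion from a sparse set of anchors.

    gaussian_xyz : (N, 3) Gaussian centers at the current frame
    anchor_xyz   : (M, 3) anchor positions
    anchor_motion: (M, 3) predicted anchor displacements for the next frame
    Returns (N, 3) displacements, one per Gaussian.
    """
    # Pairwise squared distances between Gaussians and anchors.
    d2 = ((gaussian_xyz[:, None, :] - anchor_xyz[None, :, :]) ** 2).sum(-1)  # (N, M)

    # Keep the k nearest anchors per Gaussian.
    idx = np.argsort(d2, axis=1)[:, :k]                       # (N, k)
    d2_k = np.take_along_axis(d2, idx, axis=1)                # (N, k)

    # Gaussian RBF weights, normalized per point.
    w = np.exp(-d2_k / (2.0 * sigma ** 2))
    w = w / (w.sum(axis=1, keepdims=True) + 1e-8)             # (N, k)

    # Blend the motions of the selected anchors.
    motion_k = anchor_motion[idx]                              # (N, k, 3)
    return (w[..., None] * motion_k).sum(axis=1)               # (N, 3)

# Toy usage: 1000 Gaussians driven by 32 anchors.
rng = np.random.default_rng(0)
gauss = rng.uniform(-1, 1, size=(1000, 3))
anchors = rng.uniform(-1, 1, size=(32, 3))
anchor_motion = rng.normal(scale=0.01, size=(32, 3))
next_gauss = gauss + propagate_anchor_motion(gauss, anchors, anchor_motion)
```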
Poster
Yiren Lu · Yunlai Zhou · Disheng Liu · tuo liang · Yu Yin

[ ExHall D ]

Abstract
3D Gaussian Splatting (3DGS) has shown remarkable potential for static scene reconstruction, and recent advancements have extended its application to dynamic scenes. However, the quality of reconstructions depends heavily on high-quality input images and precise camera poses, requirements that are not trivial to fulfill in real-world scenarios. Capturing dynamic scenes with handheld monocular cameras, for instance, typically involves simultaneous movement of both the camera and objects within a single exposure. This combined motion frequently results in image blur that existing methods cannot adequately handle. To address these challenges, we introduce BARD-GS, a novel approach for robust dynamic scene reconstruction that effectively handles blurry inputs and imprecise camera poses. Our method comprises two main components: 1) camera motion deblurring and 2) object motion deblurring. By explicitly decomposing motion blur into camera motion blur and object motion blur and modeling them separately, we achieve significantly improved rendering results in dynamic regions. In addition, we collect a real-world motion blur dataset of dynamic scenes to evaluate our approach. Extensive experiments demonstrate that BARD-GS effectively reconstructs high-quality dynamic scenes under realistic conditions, significantly outperforming existing methods.
Poster
Chengwei Zheng · Lixin Xue · Juan Jose Zarate · Jie Song

[ ExHall D ]

Abstract
3D Gaussian Splatting techniques have enabled efficient photo-realistic rendering of static scenes. Recent works have extended these approaches to support surface reconstruction and tracking. However, tracking dynamic surfaces with 3D Gaussians remains challenging due to complex topology changes, such as surfaces appearing, disappearing, or splitting. To address these challenges, we propose GSTAR, a novel method that achieves photo-realistic rendering, accurate surface reconstruction, and reliable 3D tracking for general dynamic scenes with changing topology. Given multi-view captures as input, GSTAR binds Gaussians to mesh faces to represent dynamic objects. For surfaces with consistent topology, GSTAR maintains the mesh topology and tracks the meshes using Gaussians. In regions where topology changes, GSTAR adaptively unbinds Gaussians from the mesh, enabling accurate registration and the generation of new surfaces based on these optimized Gaussians. Additionally, we introduce a surface-based scene flow method that provides robust initialization for tracking between frames. Experiments demonstrate that our method effectively tracks and reconstructs dynamic surfaces, enabling a range of applications. We will release our implementation to facilitate future research.
Poster
Sotiris Nousias · Mian Wei · Howard Xiao · Maxx Wu · Shahmeer Athar · Kevin J Wang · Anagh Malik · David A. Barmherzig · David B. Lindell · Kiriakos Kutulakos

[ ExHall D ]

Abstract
Scattered light from pulsed lasers is increasingly part of our ambient illumination, as many devices rely on them for active 3D sensing. In this work, we ask: can these “ambient” light signals be detected and leveraged for passive 3D vision? We show that pulsed lasers, despite being weak and fluctuating at MHz to GHz frequencies, leave a distinctive sinc comb pattern in the temporal frequency domain of incident flux that is specific to each laser and invariant to the scene. This enables their passive detection and analysis with a free-running SPAD camera, even when they are unknown, asynchronous, out of sight, and emitting concurrently. We show how to synchronize with such lasers computationally, characterize their pulse emissions, separate their contributions, and—if many are present—localize them in 3D and recover a depth map of the camera’s field of view. We use our camera prototype to demonstrate (1) a first-of-its-kind visualization of asynchronously propagating light pulses from multiple lasers through the same scene, (2) passive estimation of a laser’s MHz-scale pulse repetition frequency with mHz precision, and (3) mm-scale 3D imaging over room-scale distances by passively harvesting photons from two or more out-of-view lasers.
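To illustrate the frequency-domain signature this abstract relies on, here is a minimal sketch, not the authors' pipeline, of estimating a pulse repetition frequency from photon timestamps by binning them and inspecting the power spectrum; the acquisition parameters and photon counts below are made up for the toy example.

```python
import numpy as np

def estimate_prf(timestamps, t_total, bin_width=1e-9, f_min=1e6):
    """Estimate a pulse repetition frequency (PRF) from photon timestamps.

    timestamps : 1D array of photon arrival times in seconds
    t_total    : total acquisition time in seconds
    bin_width  : histogram bin width (sets the maximum detectable frequency)
    f_min      : ignore frequencies below this (DC and slow ambient drift)
    Returns the strongest comb line, i.e. the PRF or one of its harmonics.
    """
    n_bins = int(round(t_total / bin_width))
    flux, _ = np.histogram(timestamps, bins=n_bins, range=(0.0, t_total))

    # A pulsed laser shows up as a comb of peaks at multiples of its PRF
    # in the power spectrum of the binned photon counts.
    spectrum = np.abs(np.fft.rfft(flux - flux.mean())) ** 2
    freqs = np.fft.rfftfreq(n_bins, d=bin_width)
    valid = freqs > f_min
    return freqs[valid][np.argmax(spectrum[valid])]

# Toy example: a 10 MHz pulsed source buried in uniform ambient photons.
rng = np.random.default_rng(0)
t_total, prf = 1e-3, 10e6
pulse_times = np.arange(0.0, t_total, 1.0 / prf)
photons_per_pulse = rng.poisson(0.02, size=pulse_times.size)
signal = (np.repeat(pulse_times, photons_per_pulse)
          + rng.normal(0.0, 50e-12, photons_per_pulse.sum()))   # 50 ps jitter
ambient = rng.uniform(0.0, t_total, size=2000)
f_hat = estimate_prf(np.concatenate([signal, ambient]), t_total)
print(f_hat)   # a multiple of 1e7 Hz (the comb's strongest line)
```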
Poster
Zhengxian Yang · Shi Pan · Shengqi Wang · Haoxiang Wang · Li Lin · Guanjun Li · Zhengqi Wen · Borong Lin · Jianhua Tao · Tao Yu

[ ExHall D ]

Abstract
User engagement is greatly enhanced by fully immersive multimodal experiences that combine visual and auditory stimuli. Consequently, the next frontier in VR/AR technologies lies in immersive volumetric videos with complete scene capture, large 6-DoF interactive space, multi-modal feedback, and high-resolution, high-frame-rate content. To stimulate the reconstruction of immersive volumetric videos, we introduce ImViD, a multi-view, multi-modal dataset featuring complete space-oriented data capture and various indoor/outdoor scenarios. Our capture rig supports multi-view video-audio capture while on the move, a capability absent in existing datasets, which significantly enhances the completeness, flexibility, and efficiency of data capture. The captured multi-view videos (with synchronized audio) are in 5K resolution at 60 FPS, last from 1-5 minutes, and include rich foreground-background elements and complex dynamics. We benchmark existing methods using our dataset and establish a base pipeline for constructing immersive volumetric videos from multi-view audiovisual inputs for 6-DoF multimodal immersive VR experiences. The benchmark and the reconstruction and interaction results demonstrate the effectiveness of our dataset and baseline method, which we believe will stimulate future research on immersive volumetric video production.
Poster
Peter Kulits · Michael J. Black · Silvia Zuffi

[ ExHall D ]

Abstract
The idea of 3D reconstruction as scene understanding is foundational in computer vision. Reconstructing 3D scenes from 2D visual observations requires strong priors to disambiguate structure. Much work has been focused on the anthropocentric, which, characterized by smooth surfaces, coherent normals, and regular edges, allows for the integration of strong geometric inductive biases. Here we consider a more challenging problem where such assumptions do not hold: the reconstruction of natural scenes composed of trees, bushes, boulders, and animals. While numerous works have attempted to tackle the problem of reconstructing animals in the wild, they have focused solely on the animal, neglecting important environmental context. This limits their usefulness for analysis tasks, as animals inherently exist within the 3D world, and information is lost when environmental factors are disregarded. We propose a method to reconstruct a natural scene from a single image. We base our approach on recent advances leveraging the strong world priors ingrained in Large Language Models, and train an autoregressive model to decode a CLIP embedding into a structured compositional scene representation, encompassing both animals and the wild (RAW). To enable this, we propose a synthetic dataset comprising one-million images and thousands of assets. Our approach, trained exclusively …
Poster
Muhammad Hamza Mughal · Rishabh Dabral · Merel CJ Scholman · Vera Demberg · Christian Theobalt

[ ExHall D ]

Abstract
Non-verbal communication often comprises semantically rich gestures that help convey the meaning of an utterance. Producing such semantic co-speech gestures has been a major challenge for existing neural systems, which can generate rhythmic beat gestures but struggle to produce semantically meaningful gestures. Therefore, we present RAG-Gesture, a diffusion-based gesture generation approach that leverages Retrieval Augmented Generation (RAG) to produce natural-looking and semantically rich gestures. Our neuro-explicit gesture generation approach is designed to produce semantic gestures grounded in interpretable linguistic knowledge. We achieve this by using explicit domain knowledge to retrieve exemplar motions from a database of co-speech gestures. Once retrieved, we then inject these semantic exemplar gestures into our diffusion-based gesture generation pipeline using DDIM inversion and retrieval guidance at inference time without any need for training. Further, we propose a control paradigm for guidance that allows the users to modulate the amount of influence each retrieval insertion has over the generated sequence. Our comparative evaluations demonstrate the validity of our approach against recent gesture generation approaches. The reader is urged to watch the supplementary video.
Poster
Suhyun Shin · Seungwoo Yoon · Ryota Maeda · Seung-Hwan Baek

[ ExHall D ]

Abstract
Hyperspectral 3D imaging captures both depth maps and hyperspectral images, enabling comprehensive geometric and material analysis. Recent methods achieve high spectral and depth accuracy; however, they require long acquisition times—often over several minutes—or rely on large, expensive systems, restricting their use to static scenes. We present Dense Dispersed Structured Light (DDSL), an accurate hyperspectral 3D imaging method for dynamic scenes that utilizes stereo RGB cameras and an RGB projector equipped with an affordable diffraction grating film. We design spectrally multiplexed DDSL patterns that significantly reduce the number of required projector patterns, thereby accelerating acquisition speed. Additionally, we formulate an image formation model and a reconstruction method to estimate a hyperspectral image and depth map from captured stereo images. As the first practical and accurate hyperspectral 3D imaging method for dynamic scenes, we experimentally demonstrate that DDSL achieves a spectral resolution of 15.5 nm full width at half maximum (FWHM), a depth error of 4 mm, and a frame rate of 6.6 fps.
Poster
Jongsung Lee · HARIN PARK · Byeong-Uk Lee · Kyungdon Joo

[ ExHall D ]

Abstract
Motivated by the efficiency of spherical harmonics (SH) in representing various physical phenomena, we propose a Holistic panoramic 3D scene Understanding framework using Spherical Harmonics, dubbed as HUSH. Our approach focuses on a unified framework adaptable to various 3D scene understanding tasks via SH bases. To achieve this, we first estimate SH coefficients, allowing for the adaptive configuration of the SH bases specific to each scene. HUSH then employs a hierarchical attention module that uses SH bases as queries to generate comprehensive scene features by integrating these scene-adaptive SH bases with image features. Additionally, we introduce an SH basis index module that adaptively emphasizes relevant SH bases to produce task-relevant features, enhancing the versatility of HUSH across different scene understanding tasks. Finally, by combining the scene features with task-relevant features in the task-specific heads, we perform various scene understanding tasks, including depth, surface normal and room layout estimation. Experiments demonstrate that HUSH achieves state-of-the-art performance on depth estimation benchmarks, highlighting the robustness and scalability of using SH in panoramic 3D scene understanding.
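For readers unfamiliar with SH bases, the sketch below shows how a real spherical-harmonic basis can be evaluated on an equirectangular (panorama) grid, the kind of per-pixel basis maps a framework like HUSH could consume as queries; this is a generic illustration using scipy.special.sph_harm, not the paper's implementation.

```python
import numpy as np
from scipy.special import sph_harm

def real_sh_basis(l_max, height=64, width=128):
    """Evaluate real spherical-harmonic bases on an equirectangular grid.

    Returns an array of shape ((l_max + 1) ** 2, height, width), one channel
    per (l, m) basis function, in the same pixel layout as a panorama.
    """
    # Equirectangular pixel centers: azimuth in [0, 2*pi), polar angle in (0, pi).
    phi = (np.arange(width) + 0.5) / width * 2.0 * np.pi           # azimuth
    theta = (np.arange(height) + 0.5) / height * np.pi             # polar angle
    phi, theta = np.meshgrid(phi, theta)                           # (H, W)

    basis = []
    for l in range(l_max + 1):
        for m in range(-l, l + 1):
            # scipy's sph_harm(m, l, azimuth, polar) is the complex SH;
            # convert to the real basis commonly used in graphics/vision.
            y = sph_harm(abs(m), l, phi, theta)
            if m > 0:
                basis.append(np.sqrt(2.0) * (-1) ** m * y.real)
            elif m < 0:
                basis.append(np.sqrt(2.0) * (-1) ** m * y.imag)
            else:
                basis.append(y.real)
    return np.stack(basis, axis=0)

sh = real_sh_basis(l_max=3)          # 16 basis maps for degree <= 3
print(sh.shape)                      # (16, 64, 128)
```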
Poster
Kang Chen · Jiyuan Zhang · Zecheng Hao · Yajing Zheng · Tiejun Huang · Zhaofei Yu

[ ExHall D ]

Abstract
Spike cameras, innovative neuromorphic cameras that capture scenes as a 0-1 bit stream at 40 kHz, are increasingly employed for the 3D reconstruction task via Neural Radiance Fields (NeRF) or 3D Gaussian Splatting (3DGS). Previous spike-based 3D reconstruction approaches often employ a cascaded pipeline: starting with high-quality image reconstruction from spike streams based on established spike-to-image reconstruction algorithms, then progressing to camera pose estimation and 3D reconstruction. However, this cascaded approach suffers from substantial cumulative errors, where quality limitations of initial image reconstructions negatively impact pose estimation, ultimately degrading the fidelity of the 3D reconstruction. To address these issues, we propose a synergistic optimization framework, USP-Gaussian, that unifies spike-based image reconstruction, pose correction, and Gaussian splatting into an end-to-end framework. Leveraging the multi-view consistency afforded by 3DGS and the motion capture capability of the spike camera, our framework enables a joint iterative optimization that seamlessly integrates information between the spike-to-image network and 3DGS. Experiments on synthetic datasets with accurate poses demonstrate that our method surpasses previous approaches by effectively eliminating cascading errors. Moreover, we integrate pose optimization to achieve robust 3D reconstruction in real-world scenarios with inaccurate initial poses, outperforming alternative methods by effectively reducing noise and preserving …
Poster
Xuan Zhu · Jijun Xiang · Xianqi Wang · Longliang Liu · Yu Wang · Hong Zhang · Fei Guo · Xin Yang

[ ExHall D ]

Abstract
Lightweight direct Time-of-Flight (dToF) sensors are ideal for 3D sensing on mobile devices. However, due to the manufacturing constraints of compact devices and the inherent physical principles of imaging, dToF depth maps are sparse and noisy. In this paper, we propose a novel video depth completion method, called SVDC, by fusing the sparse dToF data with the corresponding RGB guidance. Our method employs a multi-frame fusion scheme to mitigate the spatial ambiguity resulting from the sparse dToF imaging. Misalignment between consecutive frames during multi-frame fusion could cause blending between object edges and the background, which results in a loss of detail. To address this, we introduce an adaptive frequency selective fusion (AFSF) module, which automatically selects convolution kernel sizes to fuse multi-frame features. Our AFSF utilizes a channel-spatial enhancement attention (CSEA) module to enhance features and generates an attention map as fusion weights. The AFSF ensures edge detail recovery while suppressing high-frequency noise in smooth regions. To further enhance temporal consistency, we propose a cross-window consistency loss to ensure consistent predictions across different windows, effectively reducing flickering. Our proposed SVDC achieves optimal accuracy and consistency on the TartanAir and Dynamic Replica datasets.
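The cross-window consistency idea can be sketched directly. The snippet below is a minimal, hypothetical version of such a loss: it simply penalizes L1 disagreement on the frames shared by two overlapping prediction windows, which is the flicker-reduction mechanism the abstract describes at a high level.

```python
import torch

def cross_window_consistency_loss(depth_a, depth_b, overlap):
    """Penalize disagreement between two sliding-window depth predictions.

    depth_a, depth_b : (T, H, W) depth predictions from two windows that
                       share `overlap` frames (the last frames of window A
                       are the first frames of window B).
    """
    shared_a = depth_a[-overlap:]          # frames predicted by window A
    shared_b = depth_b[:overlap]           # the same frames from window B
    return torch.nn.functional.l1_loss(shared_a, shared_b)

# Toy usage with two 8-frame windows overlapping by 4 frames.
a = torch.rand(8, 240, 320, requires_grad=True)
b = torch.rand(8, 240, 320, requires_grad=True)
loss = cross_window_consistency_loss(a, b, overlap=4)
loss.backward()
```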
Poster
Nisha Varghese · A. N. Rajagopalan

[ ExHall D ]

Abstract
Underwater (UW) robotics applications require depth and restored images simultaneously in real-time, irrespective of whether the UW images are captured in good lighting conditions or not. Most of the UW image restoration and depth estimation methods have been devised for images under normal lighting. Consequently, they struggle to perform on poorly lit images. Even though artificial illumination can be used when there is insufficient ambient light, it can introduce non-uniform lighting artifacts in the restored images. Hence, the recovery of depth and restored images directly from Low-Light UW (LLUW) images is a critical requirement in marine applications. While a few works have attempted LLUW image restoration, there are no reported works on joint recovery of depth and clean image from LLUW images. We propose a Self-supervised Low-light Underwater Image and Depth recovery network (SelfLUID-Net) for joint estimation of depth and restored image in real-time from a single LLUW image. We have collected an Underwater Low-light Stereo Video (ULVStereo) dataset which is the first-ever UW dataset with stereo pairs of low-light and normally-lit UW images. For the dual tasks of image and depth recovery from a LLUW image, we effectively utilize the stereo data from ULVStereo that provides cues for both …
Poster
Jingyi Zhou · Peng Ye · Haoyu Zhang · Jiakang Yuan · Rao Qiang · Liu YangChenXu · Wu Cailin · Feng Xu · Tao Chen

[ ExHall D ]

Abstract
Iterative-based methods have become mainstream in stereo matching due to their high performance. However, these methods heavily rely on labeled data and face challenges with unlabeled real-world data. To this end, we propose a consistency-aware self-training framework for iterative-based stereo matching for the first time, leveraging real-world unlabeled data in a teacher-student manner. We first observe that regions with larger errors tend to exhibit more pronounced oscillation characteristics during model prediction. Based on this, we introduce a novel consistency-aware soft filtering module to evaluate the reliability of teacher-predicted pseudo-labels, which consists of a multi-resolution prediction consistency filter and an iterative prediction consistency filter to assess the prediction fluctuations of multiple resolutions and iterative optimization, respectively. Further, we introduce a consistency-aware soft-weighted loss to adjust the weight of pseudo-labels accordingly, relieving the error accumulation and performance degradation problem due to incorrect pseudo-labels. Extensive experiments demonstrate that our method can improve the performance of various iterative-based stereo matching approaches in various scenarios. In particular, our method can achieve further enhancements over the current SOTA methods on several benchmark datasets.
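A minimal sketch of the filtering idea follows; it is not the paper's module, but it shows how per-pixel oscillation across a teacher's refinement iterations can be turned into soft pseudo-label weights (the exponential weighting and the temperature tau are assumptions).

```python
import torch

def pseudo_label_weights(iter_preds, tau=1.0):
    """Soft reliability weights for teacher pseudo-labels.

    iter_preds : (K, H, W) disparity maps from the teacher's last K refinement
                 iterations (or from K resolutions, resampled to one size).
    Pixels whose predictions oscillate strongly get weights near 0;
    stable pixels get weights near 1.
    """
    fluctuation = iter_preds.std(dim=0)            # (H, W) per-pixel spread
    return torch.exp(-fluctuation / tau)

def soft_weighted_loss(student_pred, pseudo_label, weights):
    """L1 loss on pseudo-labels, down-weighted where the teacher is unstable."""
    return (weights * (student_pred - pseudo_label).abs()).mean()

# Toy usage.
teacher_iters = torch.rand(6, 64, 128) * 10       # 6 refinement iterations
pseudo = teacher_iters[-1]                        # final teacher prediction
w = pseudo_label_weights(teacher_iters)
student = (torch.rand(64, 128) * 10).requires_grad_()
soft_weighted_loss(student, pseudo, w).backward()
```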
Poster
Yuzheng Liu · Siyan Dong · Shuzhe Wang · Yingda Yin · Yanchao Yang · Qingnan Fan · Baoquan Chen

[ ExHall D ]

Abstract
In this paper, we introduce SLAM3R, a novel and effective monocular RGB SLAM system for real-time and high-quality dense 3D reconstruction. SLAM3R provides an end-to-end solution by seamlessly integrating local 3D reconstruction and global coordinate registration through feed-forward neural networks. Given a video input, the system first converts it into overlapping clips using a sliding window mechanism. Unlike traditional pose optimization-based methods, SLAM3R directly regresses 3D pointmaps from RGB images and then progressively aligns and deforms these local pointmaps to create a globally consistent scene reconstruction - all without explicitly solving any camera parameters. Experiments across datasets consistently show that SLAM3R achieves state-of-the-art reconstruction accuracy and completeness while maintaining real-time performance at 20+ FPS. Upon acceptance, we will release our code to support further research.
Poster
Diankun Wu · Fangfu Liu · Yi-Hsin Hung · Yue Qian · Xiaohang Zhan · Yueqi Duan

[ ExHall D ]

Abstract
4D reconstruction from a single monocular video is an important but challenging task due to its inherent under-constrained nature. While most existing 4D reconstruction methods focus on multi-camera settings, they always suffer from limited multi-view information in monocular videos. Recent studies have attempted to mitigate the ill-posed problem by incorporating data-driven priors as additional supervision. However, they require hours of optimization to align the splatted 2D feature maps of explicit Gaussians with various priors, which limits the range of applications. To address the time-consuming issue, we propose 4D-Fly, an efficient and effective framework for reconstructing the 4D scene from a monocular video (hundreds of frames within 6 minutes), more than 20× faster and even achieving higher quality than previous optimization methods. Our key insight is to unleash the explicit property of Gaussian primitives and directly apply data priors to them. Specifically, we build a streaming 4D reconstruction paradigm that includes: propagating existing Gaussians to the next timestep with an anchor-based strategy, expanding the 4D scene map with the canonical Gaussian map, and an efficient 4D scene optimization process to further improve visual quality and motion accuracy. Extensive experiments demonstrate the superiority of our 4D-Fly over state-of-the-art methods in terms …
Poster
Juan Carlos Dibene Simental · Enrique Dunn

[ ExHall D ]

Abstract
We present a marker-based geometric estimation framework for the absolute pose of a camera by analyzing the 1D observations in a single radially distorted pixel scanline. We leverage a pair of known co-planar pencils of lines, along with lens distortion parameters, to propose an ensemble of solvers exploring the space of estimation strategies applicable to our setup. First, we present a minimal algebraic solver requiring only six measurements and yielding eight solutions, which relies on the intersection of two conics defined by one of the pencils of lines. Then, we present a unique closed-form geometric solver from seven measurements. Finally, we present a homography-based formulation amenable to linear least-squares from eight or more measurements. Our geometric framework constitutes a theoretical analysis of the minimum geometric context necessary to solve in closed form for the absolute pose of a single camera from a single radially distorted scanline.
Poster
Andrea Porfiri Dal Cin · Georgi Dikov · Jihong Ju · Mohsen Ghafoorian

[ ExHall D ]

Abstract
Current learning-based Structure-from-Motion (SfM) methods struggle with videos of dynamic scenes from wide-angle cameras. We present AnyMap, a differentiable SfM framework that jointly addresses image distortion and motion estimation. By learning a general implicit camera model without predefined parameters, AnyMap effectively handles lens distortion, estimating multi-view consistent 3D geometry, camera poses, and (un)projection functions. To resolve the ambiguity where motion estimation can compensate for undistortion errors and vice versa, we introduce a low-dimensional motion representation consisting of a set of learnable basis trajectories, interpolated to produce regularized motion estimates. Experimental results show that our method produces accurate camera poses, excels in camera calibration and image rectification, and enables high-quality novel view synthesis. Our low-dimensional motion representation effectively disentangles undistortion from motion estimation, outperforming existing methods.
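The low-dimensional motion representation can be sketched in a few lines: each point's trajectory is a learned linear combination of a small set of shared basis trajectories. The module below is a hypothetical illustration of that parameterization, not AnyMap's actual motion head; the numbers of basis trajectories and frames are arbitrary.

```python
import torch

class BasisTrajectoryMotion(torch.nn.Module):
    """Low-dimensional motion: each 3D point's trajectory is a weighted sum
    of K shared basis trajectories over T time steps."""

    def __init__(self, num_points, num_basis=12, num_frames=48):
        super().__init__()
        # Shared basis trajectories (K, T, 3) and per-point coefficients (N, K).
        self.basis = torch.nn.Parameter(0.01 * torch.randn(num_basis, num_frames, 3))
        self.coeffs = torch.nn.Parameter(torch.zeros(num_points, num_basis))

    def forward(self, points_t0):
        """points_t0: (N, 3) positions at the first frame.
        Returns (T, N, 3) positions over time."""
        # (N, K) x (K, T, 3) -> (N, T, 3): per-point displacement over time.
        disp = torch.einsum('nk,ktc->ntc', self.coeffs, self.basis)
        return (points_t0[:, None, :] + disp).permute(1, 0, 2)

motion = BasisTrajectoryMotion(num_points=5000)
traj = motion(torch.rand(5000, 3))      # (48, 5000, 3)
```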
Poster
Junchen Yu · Si-Yuan Cao · Runmin Zhang · Chenghao Zhang · Zhu Yu · Shujie Chen · Bailin Yang · Hui-Liang Shen

[ ExHall D ]

Abstract
We propose a novel unsupervised cross-modal homography estimation learning framework, named Split Supervised Homography estimation Network (SSHNet). SSHNet redefines the unsupervised cross-modal homography estimation into two supervised sub-problems, each addressed by its specialized network: a homography estimation network and a modality transfer network. To realize stable training, we introduce an effective split optimization strategy to train each network separately within its respective sub-problem. We also formulate an extra homography feature space supervision to enhance feature consistency, further boosting the estimation accuracy. Moreover, we employ a simple yet effective distillation training technique to reduce model parameters and improve cross-domain generalization ability while maintaining comparable performance. The training stability of SSHNet enables its cooperation with various homography estimation architectures. Experiments reveal that the SSHNet using IHN as homography estimation network, namely SSHNet-IHN, outperforms previous unsupervised approaches by a significant margin. Even compared to supervised approaches MHN and LocalTrans, SSHNet-IHN achieves 47.4% and 85.8% mean average corner errors (MACEs) reduction on the challenging OPT-SAR dataset. The source code is provided in the supplementary material.
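To make the split-optimization idea concrete, here is a hypothetical sketch of one alternating update: the modality-transfer network is trained on a reconstruction objective, and the homography network is trained with exact synthetic supervision obtained by warping the (detached) translated image with a known homography. The tiny networks, placeholder warp, and loss choices are illustrative assumptions, not SSHNet's architecture.

```python
import torch

def split_optimization_step(homo_net, transfer_net, opt_h, opt_t,
                            img_a, img_b, warp_fn, sample_homography):
    """One alternating update over the two supervised sub-problems.

    homo_net          : predicts homography parameters from an image pair
    transfer_net      : translates modality-A images into modality-B appearance
    warp_fn           : warps an image with given homography parameters
    sample_homography : draws a random ground-truth homography for supervision
    """
    # Sub-problem 1: supervise the modality-transfer network with a simple
    # reconstruction objective on (roughly aligned) cross-modal pairs.
    opt_t.zero_grad()
    fake_b = transfer_net(img_a)
    loss_t = torch.nn.functional.l1_loss(fake_b, img_b)
    loss_t.backward()
    opt_t.step()

    # Sub-problem 2: supervise the homography network with exact synthetic
    # labels, obtained by warping the (detached) translated image.
    opt_h.zero_grad()
    h_gt = sample_homography()
    warped = warp_fn(fake_b.detach(), h_gt)
    h_pred = homo_net(fake_b.detach(), warped)
    loss_h = torch.nn.functional.mse_loss(h_pred, h_gt)
    loss_h.backward()
    opt_h.step()
    return loss_t.item(), loss_h.item()

# Tiny stand-in components so the step above can be exercised end to end.
class TinyHomoNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(2 * 3 * 32 * 32, 8)   # 8-dof homography params
    def forward(self, a, b):
        return self.fc(torch.cat([a, b], dim=1).flatten(1))

transfer_net = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)
homo_net = TinyHomoNet()
opt_t = torch.optim.Adam(transfer_net.parameters(), lr=1e-4)
opt_h = torch.optim.Adam(homo_net.parameters(), lr=1e-4)
img_a, img_b = torch.rand(4, 3, 32, 32), torch.rand(4, 3, 32, 32)
warp_fn = lambda img, h: img                   # placeholder warp for the sketch
sample_homography = lambda: torch.zeros(4, 8)  # identity homography params
print(split_optimization_step(homo_net, transfer_net, opt_h, opt_t,
                              img_a, img_b, warp_fn, sample_homography))
```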
Poster
Riku Murai · Eric Dexheimer · Andrew J. Davison

[ ExHall D ]

Abstract
We present a real-time monocular dense SLAM system designed bottom-up from MASt3R, a two-view 3D reconstruction and matching prior. Equipped with this strong prior, our system is robust on in-the-wild video sequences despite making no assumption on a fixed or parametric camera model beyond a unique camera centre. We introduce efficient methods for pointmap matching, camera tracking and local fusion, graph construction and loop closure, and second-order global optimisation. With known calibration, a simple modification to the system achieves state-of-the-art performance across various benchmarks. Altogether, we propose a plug-and-play monocular SLAM system capable of producing globally-consistent poses and dense geometry while operating at 15 FPS.
Poster
Yifan Yu · Shaohui Liu · Rémi Pautrat · Marc Pollefeys · Viktor Larsson

[ ExHall D ]

Abstract
Monocular depth estimation (MDE) models have undergone significant advancements over recent years. Many MDE models aim to predict affine-invariant relative depth from monocular images, while recent developments in large-scale training and vision foundation models enable reasonable estimation of metric (absolute) depth. However, effectively leveraging these predictions for geometric vision tasks, in particular relative pose estimation, remains relatively underexplored. While depths provide rich constraints for cross-view image alignment, the intrinsic noise and ambiguity from the monocular depth priors present practical challenges to improving upon classic keypoint-based solutions. In this paper, we develop three solvers for relative pose estimation that explicitly account for independent affine (scale and shift) ambiguities, covering both calibrated and uncalibrated conditions. We further propose a hybrid estimation pipeline that combines our proposed solvers with classic point-based solvers and epipolar constraints. We find that the affine correction modeling is beneficial to not only the relative depth priors but also, surprisingly, the "metric" ones. Results across multiple datasets demonstrate large improvements of our approach over classic keypoint-based baselines and PnP-based solutions, under both calibrated and uncalibrated setups. We also show that our method improves consistently with different feature matchers and MDE models, and can further benefit from very recent …
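The affine (scale and shift) ambiguity the paper models is the standard one for relative monocular depth. As background, the sketch below shows the basic least-squares scale/shift alignment of an affine-invariant depth map against sparse metric anchors; the paper's contribution is to fold this ambiguity into pose solvers rather than fit it separately, so this is context, not their method.

```python
import numpy as np

def fit_scale_shift(pred_depth, metric_depth, mask):
    """Least-squares scale s and shift t such that s * pred + t ~= metric.

    pred_depth, metric_depth : (H, W) depth maps
    mask : boolean (H, W) selecting pixels with trusted metric values
           (e.g. sparse SfM points projected into the image)
    """
    d = pred_depth[mask].ravel()
    z = metric_depth[mask].ravel()
    A = np.stack([d, np.ones_like(d)], axis=1)      # [depth, 1] design matrix
    (s, t), *_ = np.linalg.lstsq(A, z, rcond=None)
    return s, t

# Toy example: recover a known affine transform under mild noise.
rng = np.random.default_rng(0)
true_depth = rng.uniform(1.0, 10.0, size=(48, 64))
pred = (true_depth - 0.5) / 2.0                     # affine-ambiguous prediction
mask = rng.random((48, 64)) < 0.05                  # ~5% sparse metric anchors
noisy_metric = true_depth + rng.normal(0.0, 0.01, size=(48, 64))
print(fit_scale_shift(pred, noisy_metric, mask))    # approximately (2.0, 0.5)
```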
Poster
Felix Wimbauer · Weirong Chen · Dominik Muhle · Christian Rupprecht · Daniel Cremers

[ ExHall D ]

Abstract
Estimating camera motion and intrinsics from casual videos is a core challenge in computer vision. Traditional bundle-adjustment based methods, such as SfM and SLAM, struggle to perform reliably on arbitrary data. Although specialized SfM approaches have been developed for handling dynamic scenes, they either require intrinsics or computationally expensive test-time optimization and often fall short in performance. Recently, methods like Dust3r have reformulated the SfM problem in a more data-driven way. While such techniques show promising results, they are still 1) not robust to dynamic objects and 2) require labeled data for supervised training. As an alternative, we propose AnyCam, a fast transformer model that directly estimates camera poses and intrinsics from a dynamic video sequence in a feed-forward fashion. Our intuition is that such a network can learn strong priors over realistic camera motions. To scale up our training, we rely on an uncertainty-based loss formulation and pre-trained depth and flow networks instead of motion or trajectory supervision. This allows us to use diverse, unlabelled video datasets obtained mostly from YouTube. Additionally, we ensure that the predicted trajectory does not accumulate drift over time through a lightweight trajectory refinement step. We test AnyCam on established datasets, where it delivers accurate camera …
Poster
Yunxuan Li · Lei Fan · Xiaoying Xing · Jianxiong Zhou · Ying Wu

[ ExHall D ]

Abstract
Visual localization, the task of determining the position and orientation of a camera, typically involves three core components: offline construction of a keyframe database, efficient online keyframe retrieval, and robust local feature matching. However, significant challenges arise when there are large viewpoint disparities between the query view and the database, such as attempting localization in a corridor previously built from an opposing direction. Intuitively, this issue can be addressed by synthesizing a set of virtual keyframes that cover all viewpoints. However, existing methods for synthesizing novel views to assist localization often fail to ensure geometric accuracy under large viewpoint changes. In this paper, we introduce a confidence-aware geometric prior into 2D Gaussian splatting to ensure the geometric accuracy of the scene. Then we can render novel views through the mesh with clear structures and accurate geometry, even under significant viewpoint changes, enabling the synthesis of a comprehensive set of virtual keyframes. Incorporating this geometry-preserving virtual keyframe database into the localization pipeline significantly enhances the robustness of visual localization.
Poster
Siyan Dong · Shuzhe Wang · Shaohui Liu · Lulu Cai · Qingnan Fan · Juho Kannala · Yanchao Yang

[ ExHall D ]

Abstract
Visual localization aims to determine the camera pose of a query image relative to a database of posed images. In recent years, deep neural networks that directly regress camera poses have gained popularity due to their fast inference capabilities. However, existing methods struggle to either generalize well to new scenes or provide accurate camera pose estimates. To address these issues, we present Reloc3r, a simple yet effective visual localization framework. It consists of an elegantly designed relative pose regression network, and a minimalist motion averaging module for absolute pose estimation. Trained on approximately 8 million posed image pairs, Reloc3r achieves surprisingly good performance and generalization ability. We conduct extensive experiments on 6 public datasets, consistently demonstrating the effectiveness and efficiency of the proposed method. It provides high-quality camera pose estimates in real time and generalizes to novel scenes. Upon acceptance, we will make our code and training data publicly available.
Poster
Mi Luo · Zihui Xue · Alex Dimakis · Kristen Grauman

[ ExHall D ]

Abstract
Egocentric and exocentric perspectives of human action differ significantly, yet overcoming this extreme viewpoint gap is critical for applications in augmented reality and robotics. We propose ViewpointRosetta, an approach that unlocks large-scale unpaired ego and exo video data to learn clip-level viewpoint-invariant video representations. Our framework introduces (1) a diffusion-based Rosetta Stone Translator (RST), which, leveraging a moderate amount of synchronized multi-view videos, serves as a translator in feature space to decipher the alignments between unpaired ego and exo data, and (2) a dual encoder that aligns unpaired data representations through contrastive learning with RST-based synthetic feature augmentation and soft alignment. To evaluate the learned features in a standardized setting, we construct a new cross-view benchmark using Ego-Exo4D, covering cross-view retrieval, action recognition, and skill assessment. Our framework demonstrates superior cross-view understanding compared to previous view-invariant learning and egocentric video representation learning approaches, and opens the door to bringing vast amounts of traditional third-person video to bear on the more nascent first-person setting.
Poster
Alan Baade · Changan Chen

[ ExHall D ]

Abstract
Learning self-supervised visual correspondence is a long-studied task fundamental to visual understanding and human perception. However, existing correspondence methods largely focus on small image transformations, such as object tracking in high-framerate videos or learning pixel-to-pixel mappings between images with high view overlap. This severely limits their application in dynamic multi-view settings such as robot imitation learning or augmented reality. In this work, we introduce Predictive Cycle Consistency for learning object correspondence between extremely disjoint views of a scene without paired segmentation data. Our technique bootstraps object correspondence pseudolabels from raw image segmentations using conditional grayscale colorization and a cycle-consistency refinement prior. We then train deep ViTs on these pseudolabels, which we use to generate higher-quality pseudolabels and iteratively train better correspondence models. We demonstrate the performance of our method under both extreme in-the-wild camera view changes and across large temporal gaps in video. Our approach beats all prior supervised and prior SoTA self-supervised correspondence models on the EgoExo4D correspondence benchmark (+6.7 IoU Exo Query) and the prior SoTA self-supervised methods SiamMAE and DINO V1&V2 on the DAVIS-2017 and LVOS datasets across large frame gaps.
Poster
Ruojin Cai · Jason Y. Zhang · Philipp Henzler · Zhengqi Li · Noah Snavely · Ricardo Martin

[ ExHall D ]

Abstract
Pairwise pose estimation from images with little or no overlap is an open challenge in computer vision. Existing methods, even those trained on large-scale datasets, struggle in these scenarios due to the lack of identifiable correspondences or visual overlap. Inspired by the human ability to infer spatial relationships from diverse scenes, we propose a novel approach that leverages the rich priors encoded within pre-trained generative video models. We propose to use a video model to hallucinate intermediate frames between two input images, effectively creating a dense, visual transition, which significantly simplifies the problem of pose estimation. Since current video models can still produce implausible motion or inconsistent geometry, we introduce a self-consistency score that evaluates the consistency of pose predictions from sampled videos. We demonstrate that our approach generalizes among three state-of-the-art video models and show consistent improvements over the state-of-the-art DUSt3R baseline on four diverse datasets encompassing indoor, outdoor, and object-centric scenes. Our findings suggest a promising avenue for improving pose estimation models by leveraging large generative models trained on vast amounts of video data, which is more readily available than 3D data.
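A self-consistency score of the kind described can be sketched as the mean pairwise disagreement among pose hypotheses obtained from different sampled videos. The snippet below is a hypothetical version using rotation geodesic and translation-direction angles; the exact scoring in the paper may differ.

```python
import numpy as np

def rotation_geodesic_deg(Ra, Rb):
    """Angle (degrees) of the relative rotation between two rotation matrices."""
    cos = (np.trace(Ra.T @ Rb) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def self_consistency_score(rotations, translations):
    """Lower is more self-consistent.

    rotations    : list of (3, 3) relative rotations, one per sampled video
    translations : list of (3,) unit relative translation directions
    Returns the mean pairwise disagreement in degrees, mixing rotation and
    translation-direction angles equally.
    """
    n, errs = len(rotations), []
    for i in range(n):
        for j in range(i + 1, n):
            r_err = rotation_geodesic_deg(rotations[i], rotations[j])
            cos_t = np.clip(np.dot(translations[i], translations[j]), -1.0, 1.0)
            t_err = np.degrees(np.arccos(cos_t))
            errs.append(0.5 * (r_err + t_err))
    return float(np.mean(errs))

# Toy usage: three identical hypotheses vs. one with an outlier.
R = np.eye(3)
t = np.array([1.0, 0.0, 0.0])
R_out = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])  # 90-degree yaw
print(self_consistency_score([R, R, R], [t, t, t]))          # 0.0
print(self_consistency_score([R, R, R_out], [t, t, -t]))     # much larger
```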
Poster
Sven Elflein · Qunjie Zhou · Laura Leal-Taixe

[ ExHall D ]

Abstract
We present Light3R-SfM, a feed-forward, end-to-end learnable framework for efficient large-scale Structure-from-Motion (SfM) from unconstrained image collections. Unlike existing SfM solutions that rely on costly matching and global optimization to achieve accurate 3D reconstructions, Light3R-SfM addresses this limitation through a novel latent global alignment module. This module replaces traditional global optimization with a learnable attention mechanism, effectively capturing multi-view constraints across images for robust and precise camera pose estimation. Light3R-SfM constructs a sparse scene graph via retrieval-score-guided shortest path tree to dramatically reduce memory usage and computational overhead compared to the naive approach. Extensive experiments demonstrate that Light3R-SfM achieves competitive accuracy while significantly reducing runtime, making it ideal for 3D reconstruction tasks in real-world applications with a runtime constraint. This work pioneers a data-driven, feed-forward SfM approach, paving the way toward scalable, accurate, and efficient 3D reconstruction in the wild.
Poster
Yuguang Li · Ivaylo Boyadzhiev · Zixuan Liu · Linda Shapiro · Alex Colburn

[ ExHall D ]

Abstract
Reconstructing precise camera poses and floor plan layouts from a set of wide-baseline RGB panoramas is a difficult and unsolved problem. We present BADGR, a novel diffusion model which performs both reconstruction and bundle adjustment (BA) optimization tasks, to refine camera poses and layouts from a given coarse state using 1D floor boundary information from dozens of images of varying input densities. Unlike a guided diffusion model, BADGR is conditioned on dense per-feature outputs from a single-step Levenberg-Marquardt (LM) optimizer and is trained to predict camera and wall positions while minimizing reprojection errors for view-consistency. The objective of layout generation from the denoising diffusion process complements BA optimization by providing additional learned layout-structural constraints on top of the co-visible features across images. These constraints help BADGR make plausible guesses about spatial relations that constrain the pose graph, such as wall adjacency and collinearity, and learn to mitigate errors from dense boundary observations with global contexts. BADGR trains exclusively on 2D floor plans, simplifying data acquisition, enabling robust augmentation, and supporting a variety of input densities. Our experiments and analysis validate our method, which significantly outperforms the state-of-the-art pose and floor plan layout reconstruction with different input densities.
Poster
Chi Su · Xiaoxuan Ma · Jiajun Su · Yizhou Wang

[ ExHall D ]

Abstract
We propose a one-stage framework for real-time multi-person 3D human mesh estimation from a single RGB image. While current one-stage methods, which follow a DETR-style pipeline, achieve state-of-the-art (SOTA) performance with high-resolution inputs, we observe that this particularly benefits the estimation of individuals in smaller scales of the image (e.g., those far from the camera), but at the cost of significantly increased computation overhead. To address this, we introduce scale-adaptive tokens that are dynamically adjusted based on the relative scale of each individual in the image within the DETR framework. Specifically, individuals in smaller scales are processed at higher resolutions, larger ones at lower resolutions, and background regions are further distilled. These scale-adaptive tokens more efficiently encode the image features, facilitating subsequent decoding to regress the human mesh, while allowing the model to allocate computational resources more effectively and focus on more challenging cases. Experiments show that our method preserves the accuracy benefits of high-resolution processing while substantially reducing computational cost, achieving real-time inference with performance comparable to SOTA methods. Code and models will be publicly released.
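A minimal sketch of the scale-adaptive allocation follows: each detected person is assigned a token resolution from its relative box height, so small (far-away) people get finer tokens and large ones coarser tokens. The thresholds and the resolution set {16, 32, 64} are made-up values for illustration, not the paper's settings.

```python
import numpy as np

def assign_token_resolution(boxes, image_hw, small_thresh=0.05, large_thresh=0.25):
    """Pick a per-person patch sampling rate from the person's relative scale.

    boxes    : (N, 4) [x1, y1, x2, y2] person boxes in pixels
    image_hw : (H, W) image size
    Returns one of {16, 32, 64} patches per box side: small (far-away) people
    get more tokens per region, large ones fewer, roughly equalizing compute.
    """
    H, _ = image_hw
    heights = (boxes[:, 3] - boxes[:, 1]) / H       # relative person height
    res = np.full(len(boxes), 32, dtype=int)        # default resolution
    res[heights < small_thresh] = 64                # small person -> finer tokens
    res[heights > large_thresh] = 16                # large person -> coarser tokens
    return res

boxes = np.array([[10, 10, 40, 40],       # small person
                  [100, 50, 400, 700],    # large person
                  [300, 200, 380, 360]])  # medium person
print(assign_token_resolution(boxes, image_hw=(720, 1280)))   # [64 16 32]
```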
Poster
Hongwei Zheng · Han Li · Wenrui Dai · Ziyang Zheng · Chenglin Li · Junni Zou · Hongkai Xiong

[ ExHall D ]

Abstract
Existing 2D-to-3D human pose estimation (HPE) methods struggle with the occlusion issue by enriching information like temporal and visual cues in the lifting stage. In this paper, we argue that these methods ignore the limitation of the sparse skeleton 2D input representation, which fundamentally restricts the 2D-to-3D lifting and worsens the occlusion issue. To address these, we propose a novel two-stage generative densification method, named Hierarchical Pose AutoRegressive Transformer (HiPART), to generate hierarchical 2D dense poses from the original sparse 2D pose. Specifically, we first develop a multi-scale skeleton tokenization module to quantize the highly dense 2D pose into hierarchical tokens and propose a skeleton-aware alignment to strengthen token connections. We then develop a hierarchical autoregressive modeling scheme for hierarchical 2D pose generation. With generated hierarchical poses as inputs for 2D-to-3D lifting, the proposed method shows strong robustness in occluded scenarios and achieves state-of-the-art performance on the single-frame-based 3D HPE. Moreover, it outperforms numerous multi-frame methods while reducing parameter and computational complexity and can also complement them to further enhance performance and robustness.
Poster
Weijian Deng · Dylan Campbell · Chunyi Sun · Jiahao Zhang · Shubham Kanitkar · Matthew Shaffer · Stephen Gould

[ ExHall D ]

Abstract
Foundation models have significantly reduced the need for task-specific training, while also enhancing generalizability. However, state-of-the-art 6D pose estimators either require further training with pose supervision or neglect advances obtainable from 3D foundation models. The latter is a missed opportunity, since these models are better equipped to predict 3D-consistent features, which are of significant utility for the pose estimation task. To address this gap, we propose Pos3R, a method for estimating the 6D pose of any object from a single RGB image, making extensive use of a 3D reconstruction foundation model and requiring no additional training. We identify template selection as a particular bottleneck for existing methods that is significantly alleviated by the use of a 3D model, which can more easily distinguish between template poses than a 2D model. Despite its simplicity, Pos3R achieves competitive performance on the BOP benchmark across seven diverse datasets, matching or surpassing existing refinement-free methods. Additionally, Pos3R integrates seamlessly with render-and-compare refinement techniques, demonstrating adaptability for high-precision applications.
Poster
Tao Tan · Qiulei Dong

[ ExHall D ]

Abstract
Self-supervised 6D object pose estimation has received increasing attention in computer vision recently. Some typical works in the literature attempt to translate the synthetic images with object pose labels generated by object CAD models into the real domain, and then use the translated data for training. However, their performance is generally limited, since (i) there still exists a domain gap between the translated images and the real images and (ii) the translated images cannot sufficiently reflect occlusions that exist in many real images. To address these problems, we propose an Occlusion-Aware Neural Domain Adaptation method for self-supervised 6D object Pose estimation, called ONDA-Pose. The proposed method comprises three main steps. First, by utilizing both the real training images without pose labels and a CAD model, we explore a CAD-like radiance field for rendering corresponding synthetic images that have similar textures to those generated by the CAD model. Then, a backbone pose estimator trained on the synthetic data is employed to provide initial pose estimations for the synthetic images rendered from the CAD-like radiance field, and the initial object poses are refined by a global object pose refiner to generate pseudo object pose labels. Finally, the backbone pose estimator is further …
Poster
Junning Qiu · Minglei Lu · Fei Wang · Yu Guo · Yonggen Ling

[ ExHall D ]

Abstract
Stereo-based category-level shape and 6D pose estimation methods have the potential to generalize to a wider range of materials than RGB-D methods, which often suffer from depth measurement errors. However, without explicit depth from two views, parameters to be estimated can become inherently entangled, negatively impacting performance. To address this, we propose a method that leverages global stereo consistency to constrain optimization directions and mitigate parameter entanglement. We first estimate an intra-category occupancy field to represent a unified shape across views, ensuring consistency and preventing shape ambiguity. Through a divide-and-conquer approach within global shape fitting, we fit this shape to stereo images to obtain the pose, iteratively rendering normalized depth maps and exchanging information across views. This approach improves convergence toward the correct pose and scale. We validated our method on both depth-friendly and depth-challenging materials using our S-RGBD dataset and the TOD benchmark. Our method surpasses RGBD methods on challenging objects and performs comparably on depth-friendly ones. Ablation studies confirm the effectiveness of each component.
Poster
Li Jin · Yujie Wang · Wenzheng Chen · Qiyu Dai · Qingzhe Gao · Xueying Qin · Baoquan Chen

[ ExHall D ]

Abstract
3D object canonicalization is a fundamental task, essential for a variety of downstream tasks. Existing methods rely on either cumbersome manual processes or priors learned from extensive, per-category training samples. Real-world datasets, however, often exhibit long-tail distributions, challenging existing learning-based methods, especially in categories with limited samples. We address this by introducing the first one-shot category-level object canonicalization framework, requiring only a single canonical model as a reference (the "prior model") for each category. To canonicalize any object, our framework first extracts semantic cues with large language models (LLMs) and vision-language models (VLMs) to establish correspondences with the prior model. We introduce a novel loss function to enforce geometric and semantic consistency, aligning object orientations precisely despite significant shape variations. Moreover, we adopt a support-plane strategy to reduce search space for initial poses and utilize a semantic relationship map to select the canonical pose from multiple hypotheses. Extensive experiments on multiple datasets demonstrate that our framework achieves state-of-the-art performance and validate key design choices. Using our framework, we create the Canonical Objaverse Dataset (COD), canonicalizing 33K samples in the Objaverse-LVIS dataset, underscoring the effectiveness of our framework on handling large-scale datasets.
Poster
Zixuan Huang · Mark Boss · Aaryaman Vasishta · James Rehg · Varun Jampani

[ ExHall D ]

Abstract
We study the problem of single-image 3D object reconstruction. Recent works have diverged into two directions: regression-based modeling and generative modeling. Regression methods efficiently infer visible surfaces, but struggle with occluded regions. Generative methods handle uncertain regions better by modeling distributions, but are computationally expensive and the generation is often misaligned with visible surfaces. In this paper, we present SPAR3D, a novel two-stage approach aiming to take the best of both directions. The first stage of SPAR3D generates sparse 3D point clouds using a lightweight point diffusion model, which has a fast sampling speed. The second stage uses both the sampled point cloud and the input image to create highly detailed meshes. Our two-stage design enables a probabilistic modeling of the ill-posed single-image 3D task, while maintaining high computational efficiency and great output fidelity. Using point clouds as an intermediate representation further allows for interactive user edits. Evaluated on diverse datasets, SPAR3D demonstrates superior performance over previous state-of-the-art methods, at an inference speed of 0.7 seconds.
Poster
Wenrui Cai · Qingjie Liu · Yunhong Wang

[ ExHall D ]

Abstract
Most current state-of-the-art trackers adopt one-stream paradigm, using a single Vision Transformer backbone for joint feature extraction and relation modeling of template and search region images. However, relation modeling between different image patches exhibits significant variations. For instance, background patches require restricted modeling participation, while foreground, particularly boundary areas, need to be emphasized. A single model may not effectively handle all kinds of relation modeling simultaneously. In this paper, we propose a novel tracker called SPMTrack based on mixture-of-experts tailored for visual tracking task (TMoE), combining the capability of multiple experts to handle diverse relation modeling more flexibly. Benefiting from TMoE, we extend relation modeling from image pairs to spatio-temporal context, further improving tracking accuracy with minimal increase in model parameters. Moreover, we employ TMoE as a parameter-efficient fine-tuning method, substantially reducing trainable parameters, which enables us to train SPMTrack of varying scales efficiently and preserve the generalization ability of pretrained models to achieve superior performance. We conduct experiments on seven datasets, and experimental results demonstrate that our method significantly outperforms current state-of-the-art trackers. The source code will be released for further research.
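For context, the snippet below sketches the generic mixture-of-experts feed-forward layer with a top-k router that designs like TMoE build on; it is a standard illustration, not SPMTrack's tracking-specific expert design, and the dimensions are arbitrary.

```python
import torch

class MoEFeedForward(torch.nn.Module):
    """Token-wise mixture-of-experts MLP with a top-k softmax router."""

    def __init__(self, dim=256, hidden=512, num_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = torch.nn.Linear(dim, num_experts)
        self.experts = torch.nn.ModuleList([
            torch.nn.Sequential(torch.nn.Linear(dim, hidden),
                                torch.nn.GELU(),
                                torch.nn.Linear(hidden, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                    # x: (B, N, dim)
        logits = self.router(x)                              # (B, N, E)
        weights, idx = logits.topk(self.top_k, dim=-1)       # route each token
        weights = weights.softmax(dim=-1)                    # (B, N, k)

        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            sel = (idx == e)                                 # token chose expert e
            mask = sel.any(dim=-1)                           # (B, N)
            if not mask.any():
                continue
            w = (weights * sel).sum(dim=-1)[mask]            # gate weight per token
            out[mask] += w.unsqueeze(-1) * expert(x[mask])
        return out

layer = MoEFeedForward()
tokens = torch.randn(2, 100, 256)    # e.g. template + search-region tokens
print(layer(tokens).shape)           # torch.Size([2, 100, 256])
```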
Poster
Haolin Qin · Tingfa Xu · Tianhao Li · Zhenxiang Chen · Tao Feng · Jianan Li

[ ExHall D ]

Abstract
UAV tracking faces significant challenges in real-world scenarios, such as small-size targets, complex backgrounds, and occlusions, which limit the performance of RGB-based trackers. Multispectral images (MSI), which capture additional spectral information, offer a promising solution to these challenges. However, progress in this area has been hindered by the lack of relevant datasets. To address this gap, we introduce the first large-scale dataset for Multispectral UAV Single Object Tracking (MUST), which includes 250 video sequences spanning diverse environments and challenging scenarios, providing a comprehensive data foundation for multispectral UAV tracking. We also propose a novel tracking framework, UNTrack, which integrates spectral, spatial, and temporal features using spectrum prompts, initial templates, and sequential searches. UNTrack employs an asymmetric transformer with a spectral background elimination mechanism for optimal relationship modeling and an encoder that continuously updates the spectrum prompt to refine tracking, improving both accuracy and efficiency. Extensive experiments show that UNTrack outperforms state-of-the-art UAV trackers. We believe our dataset and framework will drive future research in this area.
Poster
Bangyan Liao · Zhenjun Zhao · Haoang Li · Yi Zhou · Yingping Zeng · Hao Li · Peidong Liu

[ ExHall D ]

Abstract
Determining the vanishing points (VPs) in a Manhattan world, as a fundamental task in many 3D vision applications, consists of jointly inferring the line-VP association and locating each VP. Existing methods, however, are either sub-optimal solvers or pursue global optimality at a significant cost in computing time. In contrast to prior works, we introduce convex relaxation techniques to solve this task for the first time. Specifically, we employ a “soft” association scheme, realized via a truncated multi-selection error, that allows for joint estimation of VPs’ locations and line-VP associations. This approach leads to a primal problem that can be reformulated into a quadratically constrained quadratic programming (QCQP) problem, which is then relaxed into a convex semidefinite programming (SDP) problem. To solve this SDP problem efficiently, we present a globally optimal outlier-robust iterative solver (called GlobustVP), which independently searches for one VP and its associated lines in each iteration, treating other lines as outliers. After each independent update of all VPs, the mutual orthogonality between the three VPs in a Manhattan world is reinforced via local refinement. Extensive experiments on both synthetic and real-world data demonstrate that GlobustVP achieves a favorable balance between efficiency, robustness, and global optimality compared to previous …
Poster
Huijie Fan · Yu Qiao · Yihao Zhen · Tinghui Zhao · Baojie Fan · Qiang Wang

[ ExHall D ]

Abstract
The capability of tracking objects in low-light environments like nighttime is crucial for numerous real-world applications such as crowd behavior analysis and traffic scene understanding. However, previous Multi-Camera Multi-Target (MCMT) tracking methods are primarily focused on tracking during daytime with favorable lighting, shying away from low-light environments. The main difficulty of tracking under low-light conditions is the lack of detailed visible appearance features. To address this issue, we incorporate the infrared modality into the MCMT tracking framework to provide more useful information. We constructed the first Multi-modality (RGBT) Multi-Camera Multi-Target tracking dataset, named M3Track, which contains sequences captured in low-light environments, laying a solid foundation for all-day multi-camera tracking. Based on the proposed dataset, we propose an All-Day Multi-Camera Multi-Target tracking network, termed ADMCMT. Specifically, we propose an All-Day Mamba Fusion (ADMF) model to adaptively fuse information from different modalities. Within ADMF, the Lighting Guidance Model (IGM) extracts lighting-relevant information to guide the fusion process. Furthermore, the Nearby Target Collection (NTC) strategy is designed to enhance tracking accuracy by leveraging information derived from objects surrounding the target. Experiments conducted on M3Track demonstrate that ADMCMT exhibits strong generalization across different lighting conditions. The code will be released soon.
Poster
Sunkyung Park · Jeongmin Lee · Dongjun Lee

[ ExHall D ]

Abstract
Shape abstraction, simplifying shape representation into a set of primitives, is a fundamental topic in computer vision. The choice of primitives shapes the structure of world understanding, yet achieving both high abstraction accuracy and versatility remains challenging. In this paper, we introduce a novel framework for shape abstraction utilizing a differentiable support function (DSF), which offers unique advantages in representing a wide range of convex shapes with fewer parameters, providing smooth surface approximation and enabling differentiable contact features (gap, point, normal) essential for downstream applications involving contact-related problems. To tackle the associated optimization and combinatorial challenges, we introduce two techniques: differentiable shape parameterization and hyperplane-based marching to enhance accuracy and reduce DSF requirements. We validate our method through experiments demonstrating superior accuracy and efficiency, and showcase its applicability in tasks requiring differentiable contact information.
Poster
Shaoming Li · Qing Cai · Songqi KONG · Runqing Tan · Heng Tong · Shiji Qiu · Yongguo Jiang · Zhi Liu

[ ExHall D ]

Abstract
Reconstructing 3D shapes from a single image plays an important role in computer vision. Many methods have been proposed and achieve impressive performance. However, existing methods mainly focus on extracting semantic information from images and then simply concatenating it with 3D point clouds without further exploring the concatenated semantics. As a result, these entangled semantic features significantly hinder the reconstruction performance. In this paper, we propose a novel single-image 3D reconstruction method called Mining Effective Semantic Cues for 3D Reconstruction from a Single Image (MESC-3D), which can actively mine effective semantic cues from entangled features. Specifically, we design an Effective Semantic Mining Module to establish connections between point clouds and image semantic attributes, enabling the point clouds to autonomously select the necessary information. Furthermore, to address the potential insufficiencies in semantic information from a single image, such as occlusions, we introduce a 3DSPL module, inspired by the human ability to represent 3D objects using prior knowledge drawn from daily experiences. This module incorporates semantic understanding of spatial structures, enabling the model to interpret and reconstruct 3D objects with greater accuracy and realism, closely mirroring human perception of complex 3D environments. Extensive evaluations show that our method achieves significant improvements in reconstruction …
Poster
Xinjun Li · Wenfei Yang · Jiacheng Deng · Zhixin Cheng · Xu Zhou · Tianzhu Zhang

[ ExHall D ]

Abstract
Image-to-point cloud registration aims to estimate the camera pose of a given image within a 3D scene point cloud. In this area, matching-based methods have achieved leading performance by first detecting the overlapping region, then matching point and pixel features learned by neural networks, and finally using the PnP-RANSAC algorithm to estimate the camera pose. However, achieving accurate image-to-point cloud registration remains challenging because overlapping region detection that relies merely on point-wise classification is unreliable, direct alignment of cross-modal data is difficult, and the indirect optimization objective leads to unstable registration results. To address these challenges, we propose a novel implicit correspondence learning method, including a Geometric Prior-guided overlapping region Detection Module (GPDM), an Implicit Correspondence Learning Module (ICLM), and a Pose Regression Module (PRM). The proposed method enjoys several merits. First, the proposed GPDM can precisely detect the overlapping region. Second, the ICLM can generate robust cross-modality correspondences. Third, the PRM can enable end-to-end optimization. Extensive experimental results on the KITTI and nuScenes datasets demonstrate that the proposed model sets new state-of-the-art performance in registration accuracy.
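For readers unfamiliar with the final step of the matching-based pipeline mentioned above (which this paper replaces with direct pose regression), the sketch below runs PnP-RANSAC on synthetic 2D-3D correspondences with OpenCV. The correspondences, intrinsics, and thresholds are made up purely for illustration.

```python
import numpy as np
import cv2

# Hypothetical correspondences: 100 matched 3D scene points and image pixels.
rng = np.random.default_rng(0)
pts_3d = rng.uniform(-5, 5, size=(100, 3)).astype(np.float64)
K = np.array([[700.0, 0.0, 320.0],
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])
dist = np.zeros(4)

# Synthesize pixels by projecting with a known pose, then add noise and bad matches.
rvec_gt = np.array([[0.1], [0.2], [0.05]])
tvec_gt = np.array([[0.3], [-0.1], [8.0]])
pts_2d, _ = cv2.projectPoints(pts_3d, rvec_gt, tvec_gt, K, dist)
pts_2d = pts_2d.reshape(-1, 2) + rng.normal(0, 0.5, size=(100, 2))
pts_2d[:10] += rng.uniform(50, 100, size=(10, 2))        # simulate outlier matches

ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    pts_3d, pts_2d, K, dist,
    reprojectionError=3.0, iterationsCount=200, flags=cv2.SOLVEPNP_ITERATIVE)
print(ok, len(inliers), rvec.ravel(), tvec.ravel())       # pose close to the ground truth
```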
Poster
Rao Fu · Jianmin Zheng · Liang Yu

[ ExHall D ]

Abstract
The orientation of surface normals in a 3D point cloud is a fundamental problem in computer vision and graphics. Determining a globally consistent orientation solely from the point cloud is, however, challenging due to the global scope of the problem and the discrete nature of point clouds, particularly in the presence of noise, outliers, holes, thin structures, and complex topologies. This paper presents an efficient, robust, and global algorithm for generating a consistent normal orientation of a dense 3D point cloud. The basic idea is to transform the original binary normal orientation problem into finding a relaxed sign field on a Delaunay graph, which can be achieved by solving a sparse linear system. The Delaunay graph is constructed by triangulating a level set of an implicit function defined from the input point cloud. The shape diameter function is estimated to serve as a prior for determining an appropriate level value such that the level set implicitly defines the inner and outer shells enclosing the input point cloud. As such, our algorithm leverages the strengths of the shape diameter function, Delaunay triangulation, and least-squares techniques, making the underlying processes take both geometry and topology into consideration, and thus provides an efficient and robust …
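A rough sketch of the "relaxed sign field solved as a sparse linear system" idea, under our own simplified assumptions (a generic proximity graph instead of the paper's Delaunay graph, and a least-squares sign-consistency energy with a soft anchor); it is meant only to make the relaxation concrete, not to reproduce the paper's formulation.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

# Toy setup: unoriented unit normals on a chain-like proximity graph.
rng = np.random.default_rng(1)
n = 200
normals = rng.normal(size=(n, 3))
normals /= np.linalg.norm(normals, axis=1, keepdims=True)
edges = [(i, j) for i in range(n) for j in range(i + 1, min(i + 4, n))]

rows, cols, vals = [], [], []
for k, (i, j) in enumerate(edges):
    dot = float(normals[i] @ normals[j])
    s = 1.0 if dot >= 0 else -1.0            # preferred relative sign of the two normals
    w = np.sqrt(abs(dot) + 1e-6)             # confidence weight
    rows += [k, k]; cols += [i, j]; vals += [w, -s * w]
A = sp.csr_matrix((vals, (rows, cols)), shape=(len(edges), n))

# Least-squares sign field with a soft anchor x[0] ~ +1, solved as one sparse system.
lam = 10.0
L = (A.T @ A).tolil()
L[0, 0] += lam
b = np.zeros(n); b[0] = lam
x = spsolve(L.tocsc(), b)
signs = np.where(x >= 0, 1.0, -1.0)          # relaxed field -> binary orientation flips
oriented = normals * signs[:, None]
print(signs[:10])
```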
Poster
Haobo Jiang · Jin Xie · Jian Yang · Liang Yu · Jianmin Zheng

[ ExHall D ]

Abstract
This paper introduces a novel task: zero-shot RGB-D point cloud registration, aimed at achieving robust 3D matching on in-the-wild data without any task-specific training. This task is both challenging and of high practical value. We present a powerful zero-shot RGB-D matching framework, ZeroMatch, which innovatively leverages the pre-trained large-scale vision model, Stable Diffusion, to address this challenge. Our core idea is to utilize the powerful zero-shot image representation of Stable Diffusion, achieved through extensive pre-training on large-scale data, to enhance point-cloud geometric descriptors for robust matching. Specifically, we combine the handcrafted geometric descriptor FPFH with Stable-Diffusion features to create point descriptors that are both locally and contextually aware, enabling reliable RGB-D registration with zero-shot capability. This approach is based on our observation that Stable-Diffusion features effectively encode discriminative global contextual cues, naturally alleviating the feature ambiguity that FPFH often encounters in scenes with repetitive patterns or low overlap. To further enhance cross-view consistency of Stable-Diffusion features for improved matching, we propose a coupled-image input mode that concatenates the source and target images into a single input, replacing the original single-image mode. This design achieves both inter-image and prompt-to-image consistency attentions, facilitating robust cross-view feature interaction and alignment. Finally, we …
Poster
Yi Du · Zhipeng Zhao · Shaoshu Su · Sharath Golluri · Haoze Zheng · Runmao Yao · Chen Wang

[ ExHall D ]

Abstract
Point cloud (PC) processing tasks—such as completion, upsampling, denoising, and colorization—are crucial in applications like autonomous driving and 3D reconstruction. Despite substantial advancements, prior approaches often address each of these tasks independently, with separate models focused on individual issues. However, this isolated approach fails to account for the fact that defects like incompleteness, low resolution, noise, and lack of color frequently coexist, with each defect influencing and correlating with the others. Simply applying these models sequentially can lead to error accumulation from each model, along with increased computational costs. To address these challenges, we introduce SuperPC, the first unified diffusion model capable of concurrently handling all four tasks. Our approach employs a three-level-conditioned diffusion framework, enhanced by a novel spatial-mix-fusion strategy, to leverage the correlations among these four defects for simultaneous, efficient processing. We show that SuperPC outperforms the state-of-the-art specialized models as well as their combination on all four individual tasks.
Poster
Khanh Nguyen · Ghulam Mubashar Hassan · Ajmal Mian

[ ExHall D ]

Abstract
Recent open-world representation learning approaches have leveraged CLIP to enable zero-shot 3D object recognition. However, performance on real point clouds with occlusions still falls short due to the unrealistic pretraining settings. Additionally, these methods incur high inference costs because they rely on Transformer's attention modules. In this paper, we make two contributions to address these limitations. First, we propose occlusion-aware text-image-point cloud pretraining to reduce the training-testing domain gap. From 52K synthetic 3D objects, our framework generates nearly 630K partial point clouds for pretraining, consistently improving real-world recognition performances of existing popular 3D networks. Second, to reduce computational requirements, we introduce DuoMamba, a two-stream linear state space model tailored for point clouds. By integrating two space-filling curves with 1D convolutions, DuoMamba effectively models spatial dependencies between point tokens, offering a powerful alternative to Transformer. When pretrained with our framework, DuoMamba surpasses current state-of-the-art methods while reducing latency and FLOPs, highlighting the potential of our approach for real-world applications. We will release our data and code to facilitate future research.
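The serialization idea behind DuoMamba, ordering point tokens along space-filling curves and mixing them with 1D operators, can be sketched as follows. We use a Morton (Z-order) curve as a stand-in for whichever curves the paper actually uses, and a plain Conv1d in place of the state space model; shapes and sizes are illustrative.

```python
import torch
import torch.nn as nn

def part1by2(v):                       # spread 10 bits of v so they occupy every 3rd bit
    v = v & 0x3FF
    v = (v | (v << 16)) & 0x30000FF
    v = (v | (v << 8)) & 0x300F00F
    v = (v | (v << 4)) & 0x30C30C3
    v = (v | (v << 2)) & 0x9249249
    return v

def morton_order(xyz, bits=10):
    """Permutation that sorts points along a 3D Z-order (Morton) curve."""
    mins = xyz.min(dim=0).values
    maxs = xyz.max(dim=0).values
    q = ((xyz - mins) / (maxs - mins + 1e-9) * (2 ** bits - 1)).long()
    code = (part1by2(q[:, 0]) << 2) | (part1by2(q[:, 1]) << 1) | part1by2(q[:, 2])
    return code.argsort()

points = torch.rand(2048, 3)                 # one toy point cloud
feats = torch.randn(2048, 64)                # per-point token features
order = morton_order(points)
seq = feats[order].T.unsqueeze(0)            # (1, channels, sequence) for Conv1d
conv = nn.Conv1d(64, 64, kernel_size=5, padding=2)
out = conv(seq)                              # local mixing along the curve order
print(out.shape)                             # torch.Size([1, 64, 2048])
```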
Poster
Yaohua Zha · Yanzi Wang · Hang Guo · Jinpeng Wang · Tao Dai · Bin Chen · Zhihao Ouyang · Xue Yuerong · Ke Chen · Shu-Tao Xia

[ ExHall D ]

Abstract
Recently, applying pre-trained models to assist in 3D point cloud analysis has become a mainstream paradigm in 3D perception. However, existing application strategies are straightforward, utilizing only the final output of the pre-trained model for various task heads. It neglects the rich complementary information present in the intermediate layer, thereby failing to fully unlock the potential of pre-trained models. To overcome this limitation, we propose an orthogonal solution: Point Mamba Adapter (PMA), which constructs an ordered feature sequence from all layers of the pre-trained model and leverages Mamba to fuse all complementary semantics, thereby promoting comprehensive point cloud understanding. Constructing this ordered sequence is non-trivial due to the inherent isotropy of 3D space. Therefore, we further propose a geometry-constrained gate prompt generator (G2PG) shared across different layers, which applies shared geometric constraints to the output gates of the Mamba and dynamically optimizes the spatial order, thus enabling more effective integration of multi-layer information. Extensive experiments conducted on challenging point cloud datasets across various tasks demonstrate that our PMA elevates the capability for point cloud understanding to a new level by fusing diverse complementary intermediate features. The code will be released.
Poster
Boqian Zhang · shen yang · Hao Chen · Chao Yang · Jing Jia · Guang Jiang

[ ExHall D ]

Abstract
Point cloud upsampling can improve the quality of the initial point cloud, significantly enhancing the performance of downstream tasks such as classification and segmentation. Existing methods mostly focus on generating the geometric details of point clouds, neglecting noise suppression. To address this, we propose a novel network based on a conditional diffusion model, incorporating the Adaptive Noise Suppression (ANS) module, which we refer to as PDANS. The ANS module assigns weights to each point and determines the removal strategy based on these weights, reducing the impact of noisy points on the sampling process. The module first selects the neighborhood set for each point in the point cloud and performs a weighted sum between the point and its neighbors. It then adjusts the removal points based on the weighted sum, effectively mitigating the bias caused by outliers. We introduce the TreeTrans (TT) module to capture more correlated feature information. This module learns the interaction between high-level and low-level features, resulting in a more comprehensive and refined feature representation. Our results on several widely used benchmark datasets demonstrate that PDANS exhibits exceptional robustness in noisy point cloud processing and outperforms current state-of-the-art (SOTA) methods in terms of performance. Code is available at https://github.com/Baty2023/PDANS.
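The weighting idea described for the ANS module can be illustrated with a small numpy sketch (our own toy version, not the authors' module): each point is compared against a distance-weighted average of its k nearest neighbors, and the points that deviate most from that average are treated as likely outliers.

```python
import numpy as np

def neighbor_deviation(points, k=8):
    """Per-point outlier score from a distance-weighted neighborhood average."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)   # (N, N)
    np.fill_diagonal(d2, np.inf)
    nn_idx = np.argsort(d2, axis=1)[:, :k]                          # k nearest neighbors
    nn_d2 = np.take_along_axis(d2, nn_idx, axis=1)
    w = np.exp(-nn_d2 / (nn_d2.mean(axis=1, keepdims=True) + 1e-9)) # closer -> heavier
    w /= w.sum(axis=1, keepdims=True)
    weighted_mean = (w[..., None] * points[nn_idx]).sum(axis=1)     # (N, 3)
    return np.linalg.norm(points - weighted_mean, axis=1)           # deviation score

rng = np.random.default_rng(0)
clean = rng.uniform(-1, 1, size=(500, 3))
noisy = np.concatenate([clean, clean[:20] + rng.normal(0, 0.5, (20, 3))])
score = neighbor_deviation(noisy)
keep = score < np.percentile(score, 95)       # drop the most deviating points
print(noisy.shape, keep.sum())
```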
Poster
Zhaochong An · Guolei Sun · Yun Liu · Runjia Li · Junlin Han · Ender Konukoglu · Serge Belongie

[ ExHall D ]

Abstract
Generalized few-shot 3D point cloud segmentation (GFS-PCS) enables model adaptation to new classes with a few support samples while retaining base class segmentation. Existing GFS-PCS approaches focus on enhancing prototypes via interacting with support or query features but remain limited by the sparse knowledge from few-shot samples. Meanwhile, 3D vision-language models (3D VLMs), designed to generalize across open-world novel classes by aligning with language models, contain rich but noisy novel class knowledge. In this work, we introduce a GFS-PCS framework that synergizes dense but noisy pseudo-labels from 3D VLMs with precise yet sparse few-shot samples to maximize the strengths of both, named GFS-VL. Specifically, we present a prototype-guided pseudo-label selection to filter low-quality regions, followed by an adaptive infilling strategy that combines knowledge from pseudo-label contexts and few-shot samples to adaptively label the filtered, unlabeled areas. Additionally, to further utilize few-shot samples, we design a novel-base mix strategy to embed few-shot samples into training scenes, preserving essential context for improved novel class learning. Moreover, recognizing the limited diversity in current GFS-PCS benchmarks, we introduce two challenging benchmarks with diverse novel classes for comprehensive generalization evaluation. Experiments validate the effectiveness of our framework across models and datasets. Our approach and benchmarks …
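A minimal sketch of prototype-guided pseudo-label selection as we read it from the abstract (hypothetical shapes and threshold; not the GFS-VL code): class prototypes are averaged from the few-shot support features, and pseudo-labeled points whose features disagree with the corresponding prototype are reset to unlabeled.

```python
import torch
import torch.nn.functional as F

def select_pseudo_labels(point_feats, pseudo_labels, support_feats, support_labels,
                         num_classes, thresh=0.6):
    """Keep pseudo-labeled points whose features agree with their class prototype."""
    prototypes = torch.zeros(num_classes, point_feats.shape[1])
    for c in range(num_classes):
        mask = support_labels == c
        if mask.any():
            prototypes[c] = support_feats[mask].mean(dim=0)
    sim = F.cosine_similarity(point_feats, prototypes[pseudo_labels], dim=1)
    keep = sim > thresh                          # low-quality regions become unlabeled (-1)
    return torch.where(keep, pseudo_labels, torch.full_like(pseudo_labels, -1))

# Toy data with made-up dimensions.
feats = torch.randn(1000, 32)
pseudo = torch.randint(0, 5, (1000,))
support_feats = torch.randn(50, 32)
support_labels = torch.randint(0, 5, (50,))
refined = select_pseudo_labels(feats, pseudo, support_feats, support_labels, num_classes=5)
print((refined == -1).float().mean())            # fraction of filtered pseudo-labels
```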
Poster
Yujun Liu · Ruisheng Wang · Shangfeng Huang · GuoRong Cai

[ ExHall D ]

Abstract
Building reconstruction is a challenging problem at the intersection of computer vision, photogrammetry and computer graphics. 3D wireframe presents a compelling representation for building modeling through its compact structure. Existing wireframe reconstruction methods employing vertex detection and edge regression have achieved promising results. In this paper, we develop an Edge-aware Diffusion network, dubbed EdgeDiff. As a novel paradigm for wireframe reconstruction, EdgeDiff generates wireframe models from noise using a conditional diffusion model. During the training process, the ground truth wireframes are first formulated as a set of parameterized edges and then diffused into a random noise distribution. EdgeDiff learns both the noise reversal process and the network structure simultaneously. During inference, EdgeDiff iteratively refines the generated edge distribution using the denoising diffusion implicit model, enabling flexible single- or multi-step denoising and dynamic adaptation to buildings of varying complexity. Additionally, given the unique structure of wireframes, we introduce an edge attention module to extract point-wise attention from point features, using it as auxiliary information to facilitate learning of edge cues and guide the network toward improved edge awareness. To the best of our knowledge, EdgeDiff is the first to pioneer the use of a diffusion model in building wireframe reconstruction. …
Poster
Yang Wu · Yun Zhu · Kaihua Zhang · Jianjun Qian · Jin Xie · Jian Yang

[ ExHall D ]

Abstract
3D scene perception demands a large amount of adverse-weather LiDAR data, yet the cost of LiDAR data collection presents a significant scaling-up challenge. To this end, a series of LiDAR simulators have been proposed. Yet, they can only simulate a single type of adverse weather with a single physical model, and the fidelity is quite limited. This paper presents WeatherGen, the first unified diverse-weather LiDAR data diffusion generation framework, significantly improving fidelity. Specifically, we first design a map-based data producer, which is capable of providing a vast amount of high-quality diverse-weather data for training purposes. Then, we utilize the diffusion-denoising paradigm to construct a diffusion model. Within this model, we propose a spider mamba generator with a spider mamba scan to gradually restore the disturbed diverse-weather data. The spider mamba models the feature interactions by scanning the LiDAR beam circle and central ray, faithfully maintaining the physical structure of the LiDAR point cloud. Subsequently, we design a latent domain aligner following the generator to transfer real-world knowledge. Afterward, we devise a contrastive learning-based controller, which equips weather control signals with compact semantic knowledge through language supervision from CLIP, guiding the diffusion model in generating more discriminative data. Finally, we fine-tune WeatherGen with …
Poster
Chenxu Dang · Pei An · Xinmin Zhang · ZaiPeng Duan · Xuzhong Hu · Jie Ma

[ ExHall D ]

Abstract
Recent top-performing temporal 3D detectors based on LiDAR have increasingly adopted region-based paradigms. They first generate coarse proposals, followed by encoding and fusing regional features. However, indiscriminate sampling and fusion often overlook the varying contributions of individual points and lead to exponentially increased complexity as the number of input frames grows. Moreover, simple result-level concatenation limits global information extraction. In this paper, we propose a Focal Token Acquiring-and-Scaling Transformer (FASTer), which dynamically selects focal tokens and condenses token sequences in a lightweight manner. Emphasizing the contribution of individual tokens, we propose a simple but effective Adaptive Scaling mechanism to capture geometric contexts while sifting out focal points. Adaptively storing and processing only focal points in historical frames dramatically reduces the overall complexity, resulting in more compact and information-dense temporal sequences. Furthermore, an innovative grouped hierarchical fusion strategy is proposed, progressively performing sequence scaling and intra-group fusion operations to facilitate the exchange of global spatial and temporal information. Experiments on the Waymo Open Dataset demonstrate that our FASTer significantly outperforms other state-of-the-art detectors in both performance and efficiency while also exhibiting improved flexibility and robustness. The code is available at https://github.com/.
Poster
Dušan Malić · Christian Fruhwirth-Reisinger · Samuel Schulter · Horst Possegger

[ ExHall D ]

Abstract
While surface normals are widely used to analyse 3D scene geometry, surface normal estimation from LiDAR point clouds remains severely underexplored. This is caused by the lack of large-scale annotated datasets on the one hand, and the lack of methods that can robustly handle sparse and often noisy LiDAR data in a reasonable time on the other. We address these limitations using a traffic simulation engine and present LiSu, the first large-scale, synthetic LiDAR point cloud dataset with ground truth surface normal annotations, eliminating the need for tedious manual labeling. Additionally, we propose a novel method that exploits the spatiotemporal characteristics of autonomous driving data to enhance surface normal estimation accuracy. By incorporating two regularization terms, we enforce spatial consistency among neighboring points and temporal smoothness across consecutive LiDAR frames. These regularizers are particularly effective in self-training settings, where they mitigate the impact of noisy pseudo-labels, enabling robust real-world deployment. We demonstrate the effectiveness of our method on LiSu, achieving state-of-the-art performance in LiDAR surface normal estimation. Moreover, we showcase its full potential in addressing the challenging task of synthetic-to-real domain adaptation, leading to improved neural surface reconstruction on real-world data.
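The two regularizers can be written down as simple losses; the sketch below is our reading of the abstract with hypothetical tensor layouts (precomputed neighbor indices and frame-to-frame correspondences), not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def spatial_consistency_loss(normals, neighbor_idx):
    """Encourage each predicted normal to agree with its spatial neighbors.
    normals: (N, 3) predicted normals; neighbor_idx: (N, k) indices of nearby points."""
    nb = normals[neighbor_idx]                          # (N, k, 3)
    cos = (F.normalize(normals, dim=-1).unsqueeze(1) * F.normalize(nb, dim=-1)).sum(-1)
    return (1.0 - cos).mean()

def temporal_smoothness_loss(normals_t, normals_t1, corr_idx):
    """Penalize normal change across consecutive frames for corresponding points.
    corr_idx: (M,) index into frame t for each point of frame t+1 (e.g. from ego-motion)."""
    cos = F.cosine_similarity(normals_t[corr_idx], normals_t1, dim=-1)
    return (1.0 - cos).mean()

# Toy example with made-up shapes.
n_t = F.normalize(torch.randn(4096, 3), dim=-1)
n_t1 = F.normalize(torch.randn(2048, 3), dim=-1)
nbrs = torch.randint(0, 4096, (4096, 8))
corr = torch.randint(0, 4096, (2048,))
loss = spatial_consistency_loss(n_t, nbrs) + 0.5 * temporal_smoothness_loss(n_t, n_t1, corr)
print(loss.item())
```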
Poster
huang yongshu · Chen Liu · Minghang Zhu · Sheng Ao · Chenglu Wen · Cheng Wang

[ ExHall D ]

Abstract
LiDAR odometry is a critical module in autonomous driving systems, responsible for accurate localization by estimating the relative pose transformation between consecutive point cloud frames. However, existing studies frequently encounter challenges with unreliable pose estimation, due to the lack of an in-depth understanding of the scenario and the presence of noise interference. To address this challenge, we propose DiffLO, a semantic-aware LiDAR odometry network with diffusion-based refinement. To mitigate the impact of challenging cases such as dynamic objects, repetitive patterns, and low textures, we introduce a semantic distillation method that integrates semantic information into the odometry task. This allows the network to gain a semantic understanding of the scene, enabling it to focus more on the objects that are beneficial for pose estimation. Additionally, to enhance the robustness, we propose a diffusion-based refinement method. This method uses pose-related features as conditional constraints for generative diversity, iteratively refining the pose estimation to achieve greater accuracy. Comparative experiments on the KITTI odometry dataset demonstrate that the proposed method achieves state-of-the-art performance. In particular, the proposed DiffLO is not only more robust than the classic method A-LOAM, but also has better generalization ability than existing learning-based methods. The code will be released.
Poster
Duc-Hai Pham · Tung Do · Phong Nguyen · Binh-Son Hua · Khoi Nguyen · Rang Nguyen

[ ExHall D ]

Abstract
We propose SharpDepth, a novel approach to monocular metric depth estimation that combines the metric accuracy of discriminative depth estimation methods (e.g., Metric3D, UniDepth) with the fine-grained boundary sharpness typically achieved by generative methods (e.g., Marigold, Lotus). Traditional discriminative models trained on real-world data with sparse ground-truth depth can accurately predict metric depth but often produce over-smoothed or low-detail depth maps. Generative models, in contrast, are trained on synthetic data with dense ground truth, generating depth maps with sharp boundaries yet only providing relative depth with low accuracy. Our approach bridges these limitations by integrating metric accuracy with detailed boundary preservation, resulting in depth predictions that are both metrically precise and visually sharp. Our extensive zero-shot evaluations on standard depth estimation benchmarks confirm SharpDepth’s effectiveness, showing its ability to achieve both high depth accuracy and detailed representation, making it well-suited for applications requiring high-quality depth perception across diverse, real-world environments.
Poster
Haotong Lin · Sida Peng · Jingxiao Chen · Songyou Peng · Jiaming Sun · Minghuan Liu · Hujun Bao · Jiashi Feng · Xiaowei Zhou · Bingyi Kang

[ ExHall D ]

Abstract
Prompts play a critical role in unleashing the power of language and vision foundation models for specific tasks. For the first time, we introduce prompting into depth foundation models, creating a new paradigm for metric depth estimation termed Prompt Depth Anything. Specifically, we use a low-cost LiDAR as the prompt to guide the Depth Anything model for accurate metric depth output, achieving up to 4K resolution. Our approach centers on a concise prompt fusion design that integrates the LiDAR at multiple scales within the depth decoder. To address training challenges posed by limited datasets containing both LiDAR depth and precise GT depth, we propose a scalable data pipeline that includes synthetic data LiDAR simulation and real data pseudo GT depth generation. Our approach sets new state-of-the-arts on the ARKitScenes and ScanNet++ datasets. Furthermore, we demonstrate that it benefits several downstream applications, including 3D reconstruction and generalized robotic grasping. Code will be released.
Poster
Xiaomeng Chu · Jiajun Deng · Guoliang You · Yifan Duan · Houqiang Li · Yanyong Zhang

[ ExHall D ]

Abstract
We propose the Radar-Camera fusion transformer (RaCFormer) to boost the accuracy of 3D object detection based on the following insight: Radar-Camera fusion in outdoor 3D scene perception is capped by the image-to-BEV transformation; if the depth of pixels is not accurately estimated, the naive combination of BEV features actually integrates unaligned visual content. To avoid this problem, we propose a query-based framework that adaptively samples instance-relevant features from both the BEV and the original image view. Furthermore, we enhance system performance with two key designs: optimizing query initialization and strengthening the representational capacity of the BEV. For the former, we introduce an adaptive circular distribution in polar coordinates to refine the initialization of object queries, allowing for a distance-based adjustment of query density. For the latter, we first incorporate a radar-guided depth head to refine the transformation from image view to BEV. Subsequently, we focus on leveraging the Doppler effect of radar and introduce an implicit dynamic catcher to capture the temporal elements within the BEV. Extensive experiments on the nuScenes and View-of-Delft (VoD) datasets validate the merits of our design. Remarkably, our method achieves superior results of 64.9% mAP and 70.2% NDS on nuScenes, even outperforming several LiDAR-based detectors. RaCFormer also secures …
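One possible realization of the "adaptive circular distribution in polar coordinates" for query initialization is sketched below: reference points are placed on concentric rings whose radial spacing grows with distance, so query density is higher near the ego vehicle. Ring count, radius, and the density exponent are hypothetical choices, not values from the paper.

```python
import torch

def circular_query_reference_points(num_rings=15, queries_per_ring=60,
                                    max_radius=55.0, power=1.5):
    """BEV reference points arranged on concentric rings in polar coordinates.
    A power > 1 places rings closer together near the ego vehicle."""
    ring = torch.arange(1, num_rings + 1, dtype=torch.float32) / num_rings
    radii = max_radius * ring ** power                    # non-uniform radial spacing
    angles = torch.arange(queries_per_ring, dtype=torch.float32) \
             * (2 * torch.pi / queries_per_ring)
    r, a = torch.meshgrid(radii, angles, indexing="ij")
    xy = torch.stack([r * torch.cos(a), r * torch.sin(a)], dim=-1)
    return xy.reshape(-1, 2)                              # (num_rings * queries_per_ring, 2)

pts = circular_query_reference_points()
print(pts.shape)                                          # torch.Size([900, 2])
```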
Poster
Lei Lai · Zekai Yin · Eshed Ohn-Bar

[ ExHall D ]

Abstract
We present a novel visual odometry (VO) algorithm that achieves zero-shot generalization across diverse cameras and environments, addressing traditional limitations in VO algorithms associated with specific sensors and predefined settings. Our approach incorporates three main innovations. First, we introduce a language-based prior that infuses semantic information, enhancing robust feature extraction and enabling effective generalization to previously unseen domains. Second, we design a calibration-free, geometry-aware network structure capable of handling noise in estimated depth and camera parameters. Third, we demonstrate that our flexible architecture can leverage an unconstrained, semi-supervised training process that iteratively adapts to new scenes using unlabeled data, further boosting its ability to generalize across diverse scenarios. We focus on autonomous driving contexts and validate our approach extensively on three standard benchmarks—KITTI, nuScenes, and Argoverse 2—as well as a newly generated, high-fidelity synthetic dataset from Grand Theft Auto (GTA). Our work advances the boundaries of VO applicability, offering a versatile solution for real-world deployment at scale.
Poster
You Wu · Xucheng Wang · Xiangyang Yang · Mengyuan Liu · Dan Zeng · Hengzhou Ye · Shuiwang Li

[ ExHall D ]

Abstract
Recently, there has been a significant rise in the use of single-stream architectures in visual tracking. These architectures effectively integrate feature extraction and fusion by leveraging pre-trained Vision Transformer (ViT) backbones. However, this framework is susceptible to target occlusion, a frequent challenge in Unmanned Aerial Vehicle (UAV) tracking due to the prevalence of buildings, mountains, trees, and other obstructions in aerial views. To our knowledge, learning occlusion-robust representations for UAV tracking within this framework has not yet been explored. In this work, we propose to learn Occlusion-Robust Representations (ORR) based on ViTs for UAV tracking by enforcing invariance of a target's feature representation with respect to random masking operations modeled by a spatial Cox process. The intent is that this random masking approximately simulates target occlusions, thereby enabling us to learn ViTs that are robust to target occlusion for UAV tracking. This framework is termed ORTrack. Additionally, to facilitate real-time applications, we propose an Adaptive Feature-Based Knowledge Distillation (AFKD) method to create a more compact tracker, which adaptively mimics the behavior of the teacher model ORTrack according to the task's difficulty. This student model, dubbed ORTrack-D, retains much of ORTrack's performance while offering higher efficiency. Extensive experiments on multiple benchmarks …
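A toy reading of the masking scheme: the number of mask centres on the patch grid is drawn from a Poisson distribution whose rate is itself random (a simple doubly stochastic, i.e. Cox, process), and a small block around each centre is occluded. Grid size, rate range, and block radius are made-up parameters, not the paper's.

```python
import numpy as np

def cox_process_mask(grid_h=16, grid_w=16, rate_range=(2.0, 8.0), radius=1, rng=None):
    """Random patch mask from a simple doubly stochastic Poisson (Cox) process."""
    rng = rng if rng is not None else np.random.default_rng()
    lam = rng.uniform(*rate_range)              # random intensity -> Cox process
    n_centres = rng.poisson(lam)
    mask = np.zeros((grid_h, grid_w), dtype=bool)
    ys = rng.integers(0, grid_h, n_centres)
    xs = rng.integers(0, grid_w, n_centres)
    for y, x in zip(ys, xs):                    # occlude a small block around each centre
        mask[max(0, y - radius):y + radius + 1, max(0, x - radius):x + radius + 1] = True
    return mask

m = cox_process_mask(rng=np.random.default_rng(0))
print(m.mean())                                 # fraction of masked patches
# During training, token features at masked positions would be dropped or replaced,
# and the tracker is asked to keep the target representation invariant to the masking.
```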
Poster
Jesse Hagenaars · Yilun Wu · Federico Paredes Valles · Stein Stroobants · Guido De Croon

[ ExHall D ]

Abstract
Event cameras provide low-latency perception for only milliwatts of power. This makes them highly suitable for resource-restricted, agile robots such as small flying drones. Self-supervised learning based on contrast maximization holds great potential for event-based robot vision, as it foregoes the need for high-frequency ground truth and allows for online learning in the robot's operational environment. However, online, onboard learning raises the major challenge of achieving sufficient computational efficiency for real-time learning, while maintaining competitive visual perception performance. In this work, we improve the time and memory efficiency of the contrast maximization learning pipeline. Benchmarking experiments show that the proposed pipeline achieves competitive results with the state of the art on the task of depth estimation from events. Furthermore, we demonstrate the usability of the learned depth for obstacle avoidance through real-world flight experiments. Finally, we compare the performance of different combinations of pre-training and fine-tuning of the depth estimation networks, showing that on-board domain adaptation is feasible given a few minutes of flight.
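Contrast maximization, which the pipeline above makes more time- and memory-efficient, can be stated compactly: warp events along a candidate motion, accumulate them into an image, and score the motion by the contrast (variance) of that image. The numpy sketch below evaluates a single global 2D flow on synthetic events; the paper itself learns dense predictions with a network, so this is only the underlying objective.

```python
import numpy as np

def contrast(events, flow, resolution=(96, 96)):
    """Variance (contrast) of the image of warped events for a candidate flow in px/s."""
    x, y, t = events[:, 0], events[:, 1], events[:, 2]
    xw = np.clip(np.round(x - flow[0] * t), 0, resolution[1] - 1).astype(int)
    yw = np.clip(np.round(y - flow[1] * t), 0, resolution[0] - 1).astype(int)
    img = np.zeros(resolution)
    np.add.at(img, (yw, xw), 1.0)                 # accumulate warped events
    return img.var()

# Toy events: ten point sources translating at (20, -10) px/s for 0.5 s.
rng = np.random.default_rng(0)
srcs = rng.uniform(20, 60, size=(10, 2))
pick = rng.integers(0, 10, 5000)
t = rng.uniform(0.0, 0.5, 5000)
events = np.stack([srcs[pick, 0] + 20 * t, srcs[pick, 1] - 10 * t, t], axis=1)

candidates = [(0.0, 0.0), (20.0, -10.0), (-20.0, 10.0)]
best = max(candidates, key=lambda c: contrast(events, c))
print(best)                                       # (20.0, -10.0): the true motion sharpens the image
```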
Poster
Shu-Wei Lu · Yi-Hsuan Tsai · Yi-Ting Chen

[ ExHall D ]

Abstract
Bird's-eye view (BEV) perception has gained significant attention because it provides a unified representation to fuse multiple view images and enables a wide range of downstream autonomous driving tasks, such as forecasting and planning. However, the task is challenging because it is inherently an ill-posed problem due to the lack of depth information. Moreover, fusing multi-view images into a unified representation without depth cues becomes more challenging. Recent grid-based methods formulate BEV perception as query learning to bypass explicit depth estimation. While we observe promising advancements in this paradigm, they still fall short of real-world applications because they lack uncertainty modeling and are computationally expensive. In this work, we revisit depth-based methods and endow them with uncertainty awareness. Specifically, we calculate the variance of the depth distribution to represent how objects are spatially dispersed around their mean depth. In addition, the proposed model learns a soft depth mean and implicitly captures the spatial extent of objects. We transform the depth distribution into 3D Gaussians and utilize the rasterization technique to form uncertainty-aware BEV features. We evaluate our method on the nuScenes dataset, achieving state-of-the-art performance compared to depth-based methods. Notably, our model provides significant advantages in speed—running 2x faster—and in memory efficiency, using …
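The uncertainty computation described above can be sketched directly: treat the per-pixel depth output as a categorical distribution over depth bins and take its mean and variance as the centre and spread of the 3D Gaussian placed along each ray. Shapes are hypothetical and the splatting/rasterization step is omitted.

```python
import torch

def depth_mean_and_variance(depth_logits, depth_bins):
    """Soft depth mean and variance from per-pixel logits over discrete depth bins.
    depth_logits: (B, D, H, W); depth_bins: (D,) bin centres in metres."""
    prob = depth_logits.softmax(dim=1)                       # (B, D, H, W)
    bins = depth_bins.view(1, -1, 1, 1)
    mean = (prob * bins).sum(dim=1, keepdim=True)            # (B, 1, H, W)
    var = (prob * (bins - mean) ** 2).sum(dim=1, keepdim=True)
    return mean, var

logits = torch.randn(2, 64, 32, 88)                          # toy per-camera depth logits
bins = torch.linspace(1.0, 60.0, 64)
mu, var = depth_mean_and_variance(logits, bins)
print(mu.shape, var.shape)                                   # both torch.Size([2, 1, 32, 88])
# mu places a 3D Gaussian along each camera ray; var controls its spread before
# the image features are splatted into the BEV grid.
```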
Poster
Gyeongrok Oh · Sungjune Kim · Heeju Ko · Hyunggun Chi · Jinkyu Kim · Dongwook Lee · Daehyun Ji · Sungjoon Choi · Sujin Jang · Sangpil Kim

[ ExHall D ]

Abstract
The resolution of voxel queries significantly influences the quality of view transformation in camera-based 3D occupancy prediction. However, computational constraints and the practical necessity for real-time deployment require smaller query resolutions, which inevitably leads to an information loss. Therefore, it is essential to encode and preserve rich visual details within limited query sizes while ensuring a comprehensive representation of 3D occupancy. To this end, we introduce ProtoOcc, a novel occupancy network that leverages prototypes of clustered image segments in view transformation to enhance low-resolution context. In particular, the mapping of 2D prototypes onto 3D voxel queries encodes high-level visual geometries and complements the loss of spatial information from reduced query resolutions. Additionally, we design a multi-perspective decoding strategy to efficiently disentangle the densely compressed visual cues into a high-dimensional 3D occupancy scene. Experimental results on both Occ3D and SemanticKITTI benchmarks demonstrate the effectiveness of the proposed method, showing clear improvements over the baselines. More importantly, ProtoOcc achieves competitive performance against the baselines even with 75% reduced voxel resolution.
Poster
Hyo-Jun Lee · Yeong Jun Koh · Hanul Kim · Hyunseop Kim · Yonguk Lee · Jinu Lee

[ ExHall D ]

Abstract
Vision-centric 3D Semantic Scene Completion (SSC) aims to reconstruct a fine-grained 3D scene from an input RGB image. Since the vision-centric 3D SSC is an ill-posed problem, it requires a precise 2D-3D view transformation. However, existing view transformations inevitably experience erroneous feature duplication in the reconstructed voxel space due to occlusions, leading to a dilution of informative contexts. Furthermore, semantic classes exhibit high variability in their appearance in real-world driving scenarios. To address these issues, we introduce a novel 3D SSC method, called SOAP, including two key components: an occluded region-aware view projection and a scene-adaptive decoder. The occluded region-aware view projection effectively converts 2D image features into voxel space, refining the duplicated features of occluded regions using information gathered from previous observations. The scene-adaptive decoder guides query embeddings to learn diverse driving environments based on a comprehensive semantic repository. Extensive experiments validate that the proposed SOAP significantly outperforms existing methods for the vision-centric 3D SSC on automated driving datasets, SemanticKITTI and SSCBench.
Poster
Yancong Lin · Shiming Wang · Liangliang Nan · Julian F. P. Kooij · Holger Caesar

[ ExHall D ]

Abstract
Scene flow estimation aims to recover per-point motion from two adjacent LiDAR scans. However, in real-world applications such as autonomous driving, points rarely move independently of others, especially for nearby points belonging to the same object, which often share the same motion. Incorporating this locally rigid motion constraint has been a key challenge in self-supervised scene flow estimation, which is often addressed by post-processing or appending extra regularization. While these approaches are able to improve the rigidity of predicted flows, they lack an architectural inductive bias for local rigidity within the model structure, leading to suboptimal learning efficiency and inferior performance. In contrast, we enforce local rigidity with a lightweight add-on module in neural network design, enabling end-to-end learning. We design a discretized voting space that accommodates all possible translations and then identify the one shared by nearby points by differentiable voting. Additionally, to ensure computational efficiency, we operate on pillars rather than points and learn representative features for voting per pillar. We plug the Voting Module into popular model designs and evaluate its benefit on Argoverse 2 and Waymo datasets. We outperform baseline works with only marginal compute overhead. Code will be released upon acceptance.
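The differentiable voting idea can be sketched as a soft histogram over a discretized translation grid: each point contributes a soft vote for nearby translation candidates, and the shared translation is read out as a softmax-weighted expectation. This is a 2D toy with made-up grid bounds and temperatures, not the paper's pillar-based module.

```python
import torch

def soft_vote_translation(flows, grid_range=3.0, bins=31, temperature=0.1):
    """Differentiable voting for one translation shared by a group of points.
    flows: (N, 2) noisy per-point flow predictions for points on the same object."""
    axis = torch.linspace(-grid_range, grid_range, bins)
    gy, gx = torch.meshgrid(axis, axis, indexing="ij")
    candidates = torch.stack([gx.reshape(-1), gy.reshape(-1)], dim=1)     # (bins*bins, 2)
    d2 = ((flows.unsqueeze(1) - candidates.unsqueeze(0)) ** 2).sum(-1)    # (N, K)
    votes = torch.softmax(-d2 / temperature, dim=1).sum(dim=0)            # accumulate soft votes
    weights = torch.softmax(votes, dim=0)
    return (weights.unsqueeze(1) * candidates).sum(dim=0)                 # expected translation

# Toy example: most points agree on (1.0, -0.5); a few are outliers.
flows = torch.cat([torch.tensor([[1.0, -0.5]]).repeat(80, 1) + 0.05 * torch.randn(80, 2),
                   3.0 * torch.randn(5, 2)])
print(soft_vote_translation(flows))          # approximately tensor([1.0, -0.5])
```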
Poster
Haiming Zhang · Wending Zhou · et al. (The Chinese University of Hong Kong, Shenzhen · Hong Kong University of Science and Technology · Huawei Technologies Ltd.)

[ ExHall D ]

Abstract
This paper introduces VisionPAD, a novel self-supervised pre-training paradigm designed for vision-centric algorithms in autonomous driving. In contrast to previous approaches that employ neural rendering with explicit depth supervision, VisionPAD utilizes more efficient 3D Gaussian Splatting to reconstruct multi-view representations using only images as supervision. Specifically, we introduce a self-supervised method for voxel velocity estimation. By warping voxels to adjacent frames and supervising the rendered outputs, the model effectively learns motion cues in the sequential data. Furthermore, we adopt a multi-frame photometric consistency approach to enhance geometric perception. It projects adjacent frames to the current frame based on rendered depths and relative poses, boosting the 3D geometric representation through pure image supervision. Extensive experiments on autonomous driving datasets demonstrate that VisionPAD significantly improves performance in 3D object detection, occupancy prediction and map segmentation, surpassing state-of-the-art pre-training strategies by a considerable margin.
Poster
Kuang Wu · Chuan Yang · Zhanbin Li

[ ExHall D ]

Abstract
Vectorized high-definition (HD) maps are essential for an autonomous driving system. Recently, state-of-the-art map vectorization methods have mainly been based on a DETR-like framework to generate HD maps in an end-to-end manner. In this paper, we propose InteractionMap, which improves previous map vectorization methods by fully leveraging local-to-global information interaction in both time and space. Firstly, we explore enhancing DETR-like detectors with an explicit position relation prior from the point level to the instance level, since map elements contain strong shape priors. Secondly, we propose a key-frame-based hierarchical temporal fusion module, which exchanges temporal information from local to global. Lastly, the separate classification branch and regression branch lead to the problem of misalignment in the output distribution. We couple semantic information with geometric information by introducing a novel geometric-aware classification loss in optimization and a geometric-aware matching cost in label assignment. InteractionMap achieves state-of-the-art performance on both the nuScenes and Argoverse2 benchmarks.
Poster
Wei Wu · Xi Guo · Weixuan TANG · Tingxuan Huang · Chiyu Wang · Chenjing Ding

[ ExHall D ]

Abstract
Recent advancements in generative models offer promising solutions for synthesizing realistic driving videos, aiding in training autonomous driving perception models. However, existing methods often struggle with high-resolution multi-view generation, mainly due to the significant memory and computational overhead caused by simultaneously inputting multi-view videos into denoising diffusion models. In this paper, we propose a driving video generation framework based on multi-view feature fusion named DriveScape for multi-view 3D condition-guided video generation. We introduce a Bi-Directional Modulated Transformer (BiMoT) module to encode, fuse and inject multi-view features along with various 3D road structures and objects, which enables high-resolution multi-view generation. Consequently, our approach allows precise control over video generation, greatly enhancing realism and providing a robust solution for creating high-quality, multi-view driving videos. Our framework achieves state-of-the-art results on the nuScenes dataset, demonstrating impressive generative quality metrics with an FID score of 8.34 and an FVD score of 76.39, as well as superior performance across various perception tasks. This lays the foundation for more accurate environment simulation in autonomous driving. We plan to make our code and pre-trained model publicly available. Please refer to the index.html webpage in the supplementary materials for more visualization results.
Poster
Changsheng Lv · Mengshi Qi · Liang Liu · Huadong Ma

[ ExHall D ]

Abstract
Understanding traffic scenes and then generating high-definition (HD) maps present significant challenges in autonomous driving. In this paper, we define a novel Traffic Topology Scene Graph (T2SG), a unified scene graph explicitly modeling the lanes, controlled and guided by different road signals (e.g., right turn), and the topology relationships among them, which are always ignored by previous HD mapping methods. For the generation of T2SG, we propose TopoFormer, a novel one-stage Topology Scene Graph TransFormer with two newly designed layers. Specifically, TopoFormer incorporates a Lane Aggregation Layer (LAL) that leverages the geometric distance among the centerlines of lanes to guide the aggregation of global information. Furthermore, we propose a Counterfactual Intervention Layer (CIL) to model the reasonable road structure (e.g., intersection, straight) among lanes under counterfactual intervention. Then the generated T2SG can provide a more accurate and explainable description of the topological structure in traffic scenes. Experimental results demonstrate that TopoFormer outperforms existing methods on the T2SG generation task, and the generated T2SG significantly enhances traffic topology reasoning in downstream tasks, achieving a state-of-the-art performance of 46.3 OLS on the OpenLane-V2 benchmark. We will release our source code and model.
Poster
Luke Rowe · Roger Girgis · Anthony Gosselin · Liam Paull · Christopher Pal · Felix Heide

[ ExHall D ]

Abstract
We introduce Scenario Dreamer, a fully data-driven generative simulator for autonomous vehicle planning that generates both the initial traffic scene (comprising the lane graph and agent bounding boxes) and closed-loop agent behaviours. Existing methods for generating driving simulation environments encode the initial traffic scene as a rasterized image and, as such, require parameter-heavy networks that perform unnecessary computation due to many empty pixels in the rasterized scene. Moreover, we find that existing methods that employ rule-based agent behaviours lack diversity and realism. Scenario Dreamer instead comprises a novel vectorized latent diffusion model for initial traffic scene generation and a return-conditioned autoregressive Transformer for agent behaviour simulation. Scenario Dreamer supports scene extrapolation via diffusion inpainting, enabling the generation of unbounded simulation environments. We validate that Scenario Dreamer environments are more realistic than those generated from existing methods while requiring fewer parameters and lower generation latency. We confirm its practical utility by showing that RL-based planning agents are more challenged in Scenario Dreamer environments than traditional non-generative simulation environments, especially on long and adversarial driving environments. All code will be open-sourced.
Poster
Zhiwei Dong · Ran Ding · Wei Li · Zhang Peng · Guobin Tang · Jia Guo

[ ExHall D ]

Abstract
The latest trajectory prediction models in real-world autonomous driving systems often rely on online High-Definition (HD) maps to understand the road environment. However, online HD maps suffer from perception errors and feature redundancy, which hinder the performance of HD map-based trajectory prediction models. To address these issues, we introduce a framework, termed SD map-Augmented Trajectory Prediction (SATP), which leverages Standard-Definition (SD) maps to enhance HD map-based trajectory prediction models. First, we propose an SD-HD fusion approach to leverage SD maps across the diverse range of HD map-based trajectory prediction models. Second, we design a novel AlignNet to align the SD map with the HD map, further improving the effectiveness of SD maps. Experiments on real-world autonomous driving benchmarks demonstrate that SATP not only improves the performance of HD map-based trajectory prediction by up to 25% in real-world scenarios using online HD maps but also brings benefits in ideal scenarios with ground-truth HD maps.
Poster
Yi Yu · Weizhen Han · Libing Wu · Bingyi Liu · Enshu Wang · Zhuangzhuang Zhang

[ ExHall D ]

Abstract
Trajectory prediction plays a crucial role in autonomous driving systems, and exploring its vulnerability has garnered widespread attention. However, existing trajectory prediction attack methods often rely on single-point attacks to make efficient perturbations. This limits their applications in real-world scenarios due to the transient nature of single-point attacks, their susceptibility to filtration, and the uncertainty regarding the deployment environment. To address these challenges, this paper proposes a novel LiDAR-induced attack framework to impose multi-frame attacks by optimization-driven adversarial location search, achieving endurance, efficiency, and robustness. This framework strategically places objects near the adversarial vehicle to implement an attack and introduces three key innovations. First, successive state perturbations are generated using a multi-frame single-point attack strategy, effectively misleading trajectory predictions over extended time horizons. Second, we efficiently optimize adversarial objects' locations through three specialized loss functions to achieve desired perturbations. Lastly, we improve robustness by treating the adversarial object as a point without size constraints during the location search phase and reduce dependence on both the specific attack point and the adversarial object's properties. Extensive experiments confirm the superior performance and robustness of our framework.
Poster
Dongkun Zhang · Jiaming Liang · Ke Guo · Sha Lu · Qi Wang · Rong Xiong · Zhenwei Miao · Yue Wang

[ ExHall D ]

Abstract
Trajectory planning is vital for autonomous driving, ensuring safe and efficient navigation in complex environments. While recent learning-based methods, particularly reinforcement learning (RL), have shown promise in specific scenarios, RL planners struggle with training inefficiencies and managing large-scale, real-world driving scenarios. In this paper, we introduce CarPlanner, a Consistent auto-regressive Planner that uses RL to generate multi-modal trajectories. The auto-regressive structure enables efficient large-scale RL training, while the incorporation of consistency ensures stable policy learning by maintaining coherent temporal consistency across time steps. Moreover, CarPlanner employs a generation-selection framework with an expert-guided reward function and an invariant-view module, simplifying RL training and enhancing policy performance. Extensive analysis demonstrates that our proposed RL framework effectively addresses the challenges of training efficiency and performance enhancement, positioning CarPlanner as a promising solution for trajectory planning in autonomous driving. To the best of our knowledge, we are the first to demonstrate that the RL-based planner can surpass both IL- and rule-based state-of-the-arts (SOTAs) on the challenging large-scale real-world dataset nuPlan. Our proposed CarPlanner surpasses RL-, IL-, and rule-based SOTA approaches within this demanding dataset.
Poster
Wufei Ma · Luoxin Ye · Nessa McWeeney · Celso M. de Melo · Alan L. Yuille · Jieneng Chen

[ ExHall D ]

Abstract
Humans naturally understand 3D spatial relationships, enabling complex reasoning like predicting collisions of vehicles from different directions. Current large multimodal models (LMMs), however, lack this capability for 3D reasoning. This limitation stems from the scarcity of 3D training data and the bias in current model designs toward 2D data. In this paper, we systematically study the impact of 3D-informed data, architecture, and training setups, introducing 3DI-LMM, an LMM with advanced 3D spatial reasoning abilities. To address data limitations, we develop two types of 3D-informed training datasets: (1) 3D-informed probing data focused on an object's 3D location and orientation, and (2) 3D-informed conversation data for complex spatial relationships. Notably, we are the first to curate VQA data that incorporates 3D orientation relationships. Furthermore, we systematically integrate these two types of training data with the architectural and training designs of LMMs, providing a roadmap for optimal design aimed at achieving superior 3D reasoning capabilities. Our 3DI-LMM advances machines toward highly capable 3D-informed reasoning, surpassing GPT-4o by 8.7%. Our systematic empirical design and resulting findings offer valuable insights for future research in this direction.
Poster
Zhenhua Xu · Yan Bai · Yujia Zhang · Zhuoling Li · Fei Xia · Kwan-Yee K. Wong · Jianqiang Wang · Hengshuang Zhao

[ ExHall D ]

Abstract
Multimodal large language models (MLLMs) possess the ability to comprehend visual images or videos, and show impressive reasoning ability thanks to vast amounts of pretrained knowledge, making them highly suitable for autonomous driving applications. Unlike the previous work, DriveGPT4-V1, which focused on open-loop tasks, this study explores the capabilities of LLMs in enhancing closed-loop autonomous driving. DriveGPT4-V2 processes camera images and vehicle states as input to generate low-level control signals for end-to-end vehicle operation. A high-resolution visual tokenizer (HR-VT) is employed, enabling DriveGPT4-V2 to perceive the environment over an extensive range while maintaining critical details. The model architecture has been refined to improve decision prediction and inference speed. To further enhance performance, an additional expert LLM is trained for online imitation learning. The expert LLM, sharing a similar structure with DriveGPT4-V2, can access privileged information about surrounding objects for more robust and reliable predictions. Experimental results show that DriveGPT4-V2 significantly outperforms all baselines on the challenging CARLA Longest6 benchmark. The code and data of DriveGPT4-V2 will be publicly available.
Poster
Ahmad Rahimi · Po-Chien Luan · Yuejiang Liu · Frano Rajič · Alex Alahi

[ ExHall D ]

Abstract
Modeling spatial-temporal interactions among neighboring agents is at the heart of multi-agent problems such as motion forecasting and crowd navigation. Despite notable progress, it remains unclear to which extent modern representations can capture the causal relationships behind agent interactions. In this work, we take an in-depth look at the causal awareness of these representations, from computational formalism to real-world practice. First, we revisit the notion of non-causal robustness studied in the recent CausalAgents benchmark. We show that existing representations are already partially resilient to perturbations of non-causal agents, and yet modeling indirect causal effects involving mediator agents remains challenging. To address this challenge, we introduce a metric learning approach that regularizes latent representations with causal annotations. Our controlled experiments show that this approach not only leads to higher degrees of causal awareness but also yields stronger out-of-distribution robustness. To further operationalize it in practice, we propose a sim-to-real causal transfer method via cross-domain multi-task learning. Experiments on trajectory prediction datasets show that our method can significantly boost generalization, even in the absence of real-world causal annotations, where we acquire higher prediction accuracy by only using 25% of real-world data. We hope our work provides a new perspective on the challenges …
Poster
Yuxiang Fu · Qi Yan · Ke Li · Lele Wang · Renjie Liao

[ ExHall D ]

Abstract
In this paper, we address the problem of human trajectory forecasting, which aims to predict the inherently multi-modal future movements of humans based on their past trajectories and other contextual cues. We propose a novel conditional flow matching model, termed MoFlow, to predict K-shot future trajectories for all agents in a given scene. We design a novel flow matching loss function that not only ensures at least one of the K sets of future trajectories is accurate but also encourages all K sets of future trajectories to be diverse and plausible. Furthermore, leveraging the implicit maximum likelihood estimation (IMLE), we propose a novel distillation method for flow models that only requires samples from the teacher model. Extensive experiments on the real-world datasets, including SportVU NBA game, ETH-UCY, and SDD, demonstrate that both our teacher flow model and the IMLE-distilled student model achieve state-of-the-art performance, generating diverse trajectories that are physically and socially plausible. Moreover, the one-step student model is significantly faster than the teacher flow model in sampling.
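A generic best-of-K conditional flow matching loss consistent with the description above can be sketched as follows; the model interface, shapes, and the small averaged term standing in for the diversity encouragement are all our own assumptions, not MoFlow's actual loss.

```python
import torch

def best_of_k_flow_matching_loss(model, future, context, lam=0.1):
    """Best-of-K conditional flow matching loss (sketch with hypothetical shapes).
    future: (B, T, 2) ground-truth trajectories; context: (B, C) history/scene encoding.
    model(x_t, t, context) is assumed to return K velocity hypotheses of shape (B, K, T, 2)."""
    noise = torch.randn_like(future)                      # x0 ~ N(0, I)
    t = torch.rand(future.shape[0], 1, 1)
    x_t = (1.0 - t) * noise + t * future                  # point on the straight path
    v_target = future - noise                             # constant target velocity
    v_pred = model(x_t, t.view(-1), context)              # (B, K, T, 2)
    err = ((v_pred - v_target.unsqueeze(1)) ** 2).mean(dim=(2, 3))   # (B, K)
    wta = err.min(dim=1).values.mean()                    # at least one hypothesis must fit
    return wta + lam * err.mean()                         # keep the other hypotheses plausible

class ToyFlow(torch.nn.Module):                           # stand-in model with K = 6 heads
    def __init__(self, T=12, K=6, ctx=16):
        super().__init__()
        self.K, self.T = K, T
        self.net = torch.nn.Linear(T * 2 + 1 + ctx, K * T * 2)
    def forward(self, x_t, t, context):
        h = torch.cat([x_t.flatten(1), t.unsqueeze(1), context], dim=1)
        return self.net(h).view(-1, self.K, self.T, 2)

loss = best_of_k_flow_matching_loss(ToyFlow(), torch.randn(8, 12, 2), torch.randn(8, 16))
loss.backward()
print(loss.item())
```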
Poster
Yuncong Yang · Han Yang · Jiachen Zhou · Peihao Chen · Hongxin Zhang · Yilun Du · Chuang Gan

[ ExHall D ]

Abstract
Constructing a compact and informative scene representation for 3D scenes is essential for effective embodied exploration and reasoning, especially in complex environments over long periods. Existing scene representations, such as object-centric 3D scene graphs, have significant limitations. They oversimplify spatial relationships by modeling scenes as individual objects, with inter-object relationships described by restrictive texts, making it difficult to answer queries that require nuanced spatial understanding. Furthermore, these representations lack natural mechanisms for active exploration and memory management, which hampers their application to lifelong autonomy. In this work, we propose SnapMem, a novel snapshot-based scene representation serving as 3D scene memory for embodied agents. SnapMem employs informative images, termed Memory Snapshots, to capture rich visual information of explored regions. It also integrates frontier-based exploration by introducing Frontier Snapshots—glimpses of unexplored areas—that enable agents to make informed exploration decisions by considering both known and potential new information. Meanwhile, to support lifelong memory in active exploration settings, we further present an incremental construction pipeline for SnapMem, as well as an effective memory retrieval technique for memory management. Experimental results on three benchmarks demonstrate that SnapMem significantly enhances agents' exploration and reasoning capabilities in 3D environments over extended periods, highlighting its potential for advancing …
Poster
Xingyu Chen · Zhuheng Song · Xiaoke Jiang · Yaoqing Hu · Junzhi Yu · Lei Zhang

[ ExHall D ]

Abstract
Existing approaches to hand reconstruction predominantly adhere to a multi-stage framework, encompassing detection, left-right classification, and pose estimation. This paradigm induces redundant computation and cumulative errors. In this work, we propose HandOS, an end-to-end framework for 3D hand reconstruction. Our central motivation lies in leveraging a frozen detector as the foundation while incorporating auxiliary modules for 2D and 3D keypoint estimation. In this manner, we integrate the pose estimation capacity into the detection framework, while at the same time obviating the necessity of using the left-right category as a prerequisite. Specifically, we propose an interactive 2D-3D decoder, where 2D joint semantics is derived from detection cues while the 3D representation is lifted from that of the 2D joints. Furthermore, hierarchical attention is designed to enable the concurrent modeling of 2D joints, 3D vertices, and camera translation. Consequently, we achieve an end-to-end integration of hand detection, 2D pose estimation, and 3D mesh reconstruction within a one-stage framework, so that the above multi-stage drawbacks are overcome. Meanwhile, HandOS reaches state-of-the-art performance on public benchmarks, e.g., 5.0 PA-MPJPE on FreiHand and 64.6% PCK@0.05 on HInt-Ego4D.
Poster
Zifan Wang · Ziqing Chen · Junyu Chen · Jilong Wang · Yuxin Yang · Yunze Liu · Xueyi Liu · He Wang · Li Yi

[ ExHall D ]

Abstract
This paper introduces MobileH2R, a framework for learning generalizable vision-based human-to-mobile-robot (H2MR) handover skills. Unlike traditional fixed-base handovers, this task requires a mobile robot to reliably receive objects in a large workspace enabled by its mobility. Our key insight is that generalizable handover skills can be developed in simulators using high-quality synthetic data, without the need for real-world demonstrations. To achieve this, we propose a scalable pipeline for generating diverse synthetic full-body human motion data, an automated method for creating safe and imitation-friendly demonstrations, and an efficient 4D imitation learning method for distilling large-scale demonstrations into closed-loop policies with base-arm coordination. Experimental evaluations in both simulators and the real world show significant improvements (at least +15% success rate) over baseline methods in all cases. Experiments also validate that large-scale and diverse synthetic data greatly enhances robot learning, highlighting our scalable framework.
Poster
Jingyi Tian · Le Wang · Sanping Zhou · Sen Wang · lijiayi · Haowen Sun · Wei Tang

[ ExHall D ]

Abstract
Robotic manipulation based on visual observations and natural language instructions is a long-standing challenge in robotics. Prevailing approaches model the action distribution with explicit or implicit representations, which often struggle to achieve a good trade-off between accuracy and efficiency. In response, we propose PDFactor, a novel framework that models the action distribution with a hybrid triplane representation. In particular, PDFactor decomposes the 3D point cloud into three orthogonal feature planes and leverages a tri-perspective view transformer to produce dense cubic features as a latent diffusion field aligned with the observation space, representing the 6-DoF action probability distribution at an arbitrary location. We employ a small denoising network conceptually as both a parameterized loss function measuring the quality of the learned latent features and an action gradient decoder to sample actions from the latent diffusion field during inference. This design enables PDFactor to benefit from the spatial awareness of explicit representations and the arbitrary resolution of implicit representations, endowing it with manipulation accuracy, inference efficiency, and model scalability. Experiments demonstrate that PDFactor outperforms state-of-the-art approaches across a diverse range of manipulation tasks in RLBench simulation. Moreover, PDFactor can effectively learn multi-task policies from a limited number of human demonstrations, achieving promising accuracy in a variety of …
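Triplane representations in general rest on a simple query mechanism: a 3D point is projected onto three orthogonal feature planes, each plane is sampled bilinearly, and the per-plane features are fused. A minimal query routine is sketched below; the plane layout, resolution, and sum-fusion are generic assumptions rather than PDFactor's exact design.

import torch
import torch.nn.functional as F

def query_triplane(planes, pts):
    # planes: dict of (B, C, R, R) feature maps for the "xy", "xz", "yz" planes
    # pts: (B, N, 3) query points with coordinates already normalized to [-1, 1]
    def sample(plane, uv):
        grid = uv.unsqueeze(1)                                  # (B, 1, N, 2)
        feat = F.grid_sample(plane, grid, align_corners=True)   # (B, C, 1, N)
        return feat.squeeze(2).permute(0, 2, 1)                 # (B, N, C)
    x, y, z = pts[..., 0:1], pts[..., 1:2], pts[..., 2:3]
    return (sample(planes["xy"], torch.cat((x, y), dim=-1))
            + sample(planes["xz"], torch.cat((x, z), dim=-1))
            + sample(planes["yz"], torch.cat((y, z), dim=-1)))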
Poster
Jieming Cui · Tengyu Liu · Ziyu Meng · Jiale Yu · Ran Song · Wei Zhang · Yixin Zhu · Siyuan Huang

[ ExHall D ]

Abstract
Learning open-vocabulary physical skills for simulated agents remains challenging due to the limitations of reinforcement learning approaches: manually designed rewards lack scalability, while demonstration-based methods struggle to cover arbitrary tasks. We propose GROVE, a generalized reward framework for open-vocabulary physical skill learning without manual reward design or task-specific demonstrations. GROVE uniquely combines Large Language Models (LLMs) for generating precise constraints with Vision Language Models (VLMs) for semantic evaluation. Through an iterative reward design process, VLM-based feedback guides the refinement of LLM-generated constraints, significantly enhancing the reliability of our method. Central to our approach is Pose2CLIP, a lightweight pose-to-semantic feature mapper that significantly enhances the quality and efficiency of VLM evaluation. Extensive experiments demonstrate GROVE's versatility across diverse tasks and learning paradigms. Our approach achieves 22.2% higher naturalness and 25.7% better task completion score while training 8.4 times faster than previous open-vocabulary methods, establishing a new foundation for scalable physical skill acquisition.
Poster
Chan Hee Song · Valts Blukis · Jonathan Tremblay · Stephen Tyree · Yu Su · Stan Birchfield

[ ExHall D ]

Abstract
Spatial understanding is a crucial capability for robots to make grounded decisions based on their environment. This foundational skill enables robots not only to perceive their surroundings but also to reason about and interact meaningfully with the world. In modern robotics, these capabilities are taken on by vision-language models, which face significant challenges when applied to spatial reasoning due to their training data sources. These sources are general-purpose image datasets that often lack sophisticated spatial scene understanding. For example, the datasets do not address reference frame comprehension: spatial relationships require clear contextual understanding, whether from an ego-centric, object-centric, or world-centric perspective, to allow effective real-world interaction. To address this issue, we introduce RoboSpatial, a large-scale spatial understanding dataset consisting of real indoor and tabletop scenes captured as 3D scans and ego-centric images, annotated with rich spatial information relevant to robotics. The dataset includes 1M images, 5K 3D scans, and 3M annotated spatial relationships, with paired 2D egocentric images and 3D scans to make it both 2D and 3D ready. Our experiments show that models trained with RoboSpatial outperform baselines on downstream tasks such as spatial affordance prediction, spatial relationship prediction, and robotics manipulation.
Poster
Yawen Shao · Wei Zhai · Yuhang Yang · Hongchen Luo · Yang Cao · Zheng-Jun Zha

[ ExHall D ]

Abstract
Open-Vocabulary 3D object affordance grounding aims to anticipate "action possibility" regions on 3D objects given arbitrary instructions, which is crucial for robots to generically perceive real scenarios and respond to operational changes. Existing methods focus on combining images or languages that depict interactions with 3D geometries to introduce external interaction priors. However, they are still vulnerable to a limited semantic space by failing to leverage implied invariant geometries and potential interaction intentions. Normally, humans address complex tasks through multi-step reasoning and respond to diverse situations by leveraging associative and analogical thinking. In light of this, we propose GREAT (Geometry-Intention Collaborative Inference) for Open-Vocabulary 3D Object Affordance Grounding, a novel framework that mines invariant geometric attributes of objects and performs analogical reasoning over potential interaction scenarios to form affordance knowledge, fully combining this knowledge with both geometries and visual contents to ground 3D object affordance. Besides, we introduce the Point Image Affordance Dataset v2 (PIADv2), the largest 3D object affordance dataset at present, to support the task. Extensive experiments demonstrate the effectiveness and superiority of GREAT. The code and dataset will be released.
Poster
He Zhu · Quyu Kong · Kechun Xu · Xunlong Xia · Bing Deng · Jieping Ye · Rong Xiong · Yue Wang

[ ExHall D ]

Abstract
Grounding 3D object affordance is a task that locates objects in 3D space where they can be manipulated, which links perception and action for embodied intelligence. For example, for an intelligent robot, it is necessary to accurately ground the affordance of an object and grasp it according to human instructions. In this paper, we introduce a novel task that grounds 3D object affordance based on language instructions, visual observations and interactions, which is inspired by cognitive science. We collect an Affordance Grounding dataset with Points, Images and Language instructions (AGPIL) to support the proposed task. In the 3D physical world, due to observation orientation, object rotation, or spatial occlusion, we can only get a partial observation of the object. So this dataset includes affordance estimations of objects from full-view, partial-view, and rotation-view perspectives. To accomplish this task, we propose LMAffordance3D, the first multi-modal, language-guided 3D affordance grounding network, which applies a vision-language model to fuse 2D and 3D spatial features with semantic features. Comprehensive experiments on AGPIL demonstrate the effectiveness and superiority of our method on this task, even in unseen experimental settings.
Poster
Yueru Jia · Jiaming Liu · Sixiang Chen · Chenyang Gu · Zhilve Wang · Xiaoqi Li · Longzan Luo · Pengwei Wang · Renrui Zhang · Zhongyuan Wang · Shanghang Zhang

[ ExHall D ]

Abstract
3D geometric information is essential for manipulation tasks, as robots need to perceive the 3D environment, reason about spatial relationships, and interact with intricate spatial configurations. Recent research has increasingly focused on the explicit extraction of 3D features, while still facing challenges such as the lack of large-scale robotic 3D data and the potential loss of spatial geometry. To address these limitations, we propose the Lift3D framework, which progressively enhances 2D foundation models with implicit and explicit 3D robotic representations to construct a robust 3D manipulation policy. Specifically, we first design a task-aware masked autoencoder that masks task-relevant affordance patches and reconstructs depth information, enhancing the 2D foundation model’s implicit 3D robotic representation. After self-supervised fine-tuning, we introduce a 2D model-lifting strategy that establishes a positional mapping between the input 3D points and the positional embeddings of the 2D model. Based on the mapping, Lift3D utilizes the 2D foundation model to directly encode point cloud data, leveraging large-scale pretrained knowledge to construct explicit 3D robotic representations while minimizing spatial information loss. In experiments, Lift3D consistently outperforms previous state-of-the-art methods across several simulation benchmarks and real-world scenarios.
Poster
Mingjie Pan · Jiyao Zhang · Tianshu Wu · Yinghao Zhao · Wenlong Gao · Hao Dong

[ ExHall D ]

Abstract
The development of general robotic systems capable of manipulating in unstructured environments is a significant challenge. While Vision-Language Models (VLM) excel in high-level commonsense reasoning, they lack the fine-grained 3D spatial understanding required for precise manipulation tasks. Fine-tuning VLM on robotic datasets to create Vision-Language-Action Models (VLA) is a potential solution, but it is hindered by high data collection costs and generalization issues. To address these challenges, we propose a novel object-centric representation that bridges the gap between VLM's high-level reasoning and the low-level precision required for manipulation. Our key insight is that an object's canonical space, defined by its functional affordances, provides a structured and semantically meaningful way to describe interaction primitives, such as points and directions. These primitives act as a bridge, translating VLM's commonsense reasoning into actionable 3D spatial constraints. In this context, we introduce a dual closed-loop, open-vocabulary robotic manipulation system: one loop for high-level planning through primitive resampling, interaction rendering and VLM checking, and another for low-level execution via 6D pose tracking. This design ensures robust, real-time control without requiring VLM fine-tuning. Extensive experiments demonstrate strong zero-shot generalization across diverse robotic manipulation tasks, highlighting the potential of this approach for automating large-scale simulation data generation.
Poster
Tomoya Yoshida · Shuhei Kurita · Taichi Nishimura · Shinsuke Mori

[ ExHall D ]

Abstract
Learning to use tools or objects in common scenes, particularly handling them in various ways as instructed, is a key challenge for developing interactive robots. Training models to generate such manipulation trajectories requires a large and diverse collection of detailed manipulation demonstrations for various objects, which is nearly infeasible to gather at scale. In this paper, we propose a framework that leverages the large-scale ego- and exo-centric video dataset Ego-Exo4D, constructed globally with substantial effort, to extract diverse manipulation trajectories at scale. From these extracted trajectories and the associated textual action descriptions, we develop trajectory generation models based on visual and point cloud-based language models. On the recently proposed egocentric vision-based trajectory dataset HOT3D, we confirm that our models successfully generate valid object trajectories, establishing a training dataset and baseline models for the novel task of generating 6DoF manipulation trajectories from action descriptions in egocentric vision. Our dataset and code will be available upon acceptance.
Poster
Yu Qi · Yuanchen Ju · Tianming Wei · Chi Chu · Lawson L.S. Wong · Huazhe Xu

[ ExHall D ]

Abstract
3D assembly tasks, such as furniture assembly and component fitting, play a crucial role in daily life and represent essential capabilities for future home robots. Existing benchmarks and datasets predominantly focus on assembling geometric fragments or factory parts, which fall short in addressing the complexities of everyday object interactions and assemblies. To bridge this gap, we present 2BY2, a large-scale annotated dataset for daily pairwise objects assembly, covering 18 fine-grained tasks that reflect real-life scenarios, such as plugging into sockets, arranging flowers in vases, and inserting bread into toasters. The 2BY2 dataset contains 1,034 different instances and 517 pairwise objects with pose and symmetry annotations, requiring approaches that align geometric shapes while accounting for functional and spatial relationships between objects. Leveraging the 2BY2 dataset, we introduce a multi-step paired SE(3) pose estimation method that utilizes equivariant geometric features to enforce assembly constraints. Compared to previous shape assembly methods, our approach achieves state-of-the-art performance across all 18 tasks in the 2BY2 dataset, reducing translation RMSE by an average of 0.046 and rotation RMSE by 11.44 across both inter-category and intra-category tasks. Additionally, robot experiments further validate the reliability and generalization ability of our method for complex 3D assembly tasks.
Poster
Qi Lv · Hao Li · Xiang Deng · Rui Shao · Yinchuan Li · Jianye Hao · Longxiang Gao · MICHAEL YU WANG · Liqiang Nie

[ ExHall D ]

Abstract
Despite the significant success of imitation learning in robotic manipulation, its application to bimanual tasks remains highly challenging. Existing approaches mainly learn a policy to predict a distant next-best end-effector pose (NBP) and then compute the corresponding joint rotation angles for motion using inverse kinematics. However, they suffer from two important issues: (1) they rarely consider the physical robotic structure, which may cause self-collisions or interferences, and (2) they overlook kinematics constraints, which may result in predicted poses that do not conform to the actual limitations of the robot joints. In this paper, we propose the Kinematics enhanced Spatial-TemporAl gRaph Diffuser (KStar Diffuser). Specifically, (1) to incorporate physical robot structure information into action prediction, KStar Diffuser maintains a dynamic spatial-temporal graph according to the physical bimanual joint motions at continuous timesteps; this dynamic graph serves as the robot-structure condition for denoising the actions; (2) to make the NBP learning objective consistent with kinematics, we introduce differentiable kinematics to provide the reference for optimizing KStar Diffuser. This module regularizes the policy to predict more reliable and kinematics-aware next end-effector poses. Experimental results show that our method effectively leverages the physical structural information and generates kinematics-aware actions in both simulation and the real world.
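The differentiable-kinematics idea can be illustrated on a toy planar arm: forward kinematics maps joint angles to an end-effector position, and a loss ties the predicted next-best pose to the pose implied by the joints. The planar chain below is purely illustrative; the paper's robots and kinematic parameterization are not specified here.

import torch

def planar_forward_kinematics(joint_angles, link_lengths):
    # joint_angles, link_lengths: (B, J); returns the end-effector (x, y) per sample
    theta = torch.cumsum(joint_angles, dim=1)          # absolute orientation of each link
    x = (link_lengths * torch.cos(theta)).sum(dim=1)
    y = (link_lengths * torch.sin(theta)).sum(dim=1)
    return torch.stack((x, y), dim=1)

def kinematics_consistency_loss(pred_nbp_xy, joint_angles, link_lengths):
    # tie the predicted next-best pose to the pose implied by the predicted joints
    return (pred_nbp_xy - planar_forward_kinematics(joint_angles, link_lengths)).pow(2).mean()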
Poster
Shun Iwase · Zubair Irshad · Katherine Liu · Vitor Guizilini · Robert Lee · Takuya Ikeda · Ayako Amma · Koichi Nishiwaki · Kris Kitani · Rares Andrei Ambrus · Sergey Zakharov

[ ExHall D ]

Abstract
Robotic grasping is a cornerstone capability of embodied systems. Many methods directly output grasps from partial information without modeling the geometry of the scene, leading to suboptimal motion and even collisions. To address these issues, we introduce ZeroGrasp, a novel framework that simultaneously performs 3D reconstruction and grasp pose prediction in near real-time. A key insight of our method is that occlusion reasoning and modeling the spatial relationships between objects is beneficial for both accurate reconstruction and grasping. We couple our method with a novel large-scale synthetic dataset, which is an order of magnitude larger than existing datasets and comprises 1M photo-realistic images, high-resolution 3D reconstructions and 8.9B physically-valid grasp pose annotations for 12K objects from the Objaverse-LVIS dataset. We evaluate ZeroGrasp on the GraspNet-1B benchmark as well as through real-world robot experiments. ZeroGrasp achieves state-of-the-art performance and generalizes to novel real-world objects even when trained only on synthetic data.
Poster
Muchen Li · Sammy Christen · Chengde Wan · Yujun Cai · Renjie Liao · Leonid Sigal · Shugao Ma

[ ExHall D ]

Abstract
Current research on generating 3D hand-object interaction motion primarily focuses on in-domain objects. Generalization to unseen objects is essential for practical applications, yet it remains both challenging and largely unexplored. In this paper, we propose LatentHOI, a novel approach designed to tackle the challenges of generalizing hand-object interaction to unseen objects. Our main insight lies in decoupling high-level temporal motion from fine-grained spatial hand-object interactions with a latent diffusion model coupled with a Grasping Variational Autoencoder (GraspVAE). This configuration not only enhances the conditional dependency between spatial grasp and temporal motion but also improves data utilization and reduces overfitting through regularization in the latent space. We conducted extensive experiments in an unseen-object setting on both single-hand grasping and bi-manual motion datasets, including GRAB, DexYCB, and OakInk. Quantitative and qualitative evaluations demonstrate that our method significantly enhances the realism and physical plausibility of generated motions for unseen objects, both in single and bimanual manipulations, compared to the state-of-the-art.
Poster
Boran Wen · Dingbang Huang · Zichen Zhang · Jiahong Zhou · Jianbin Deng · Jingyu Gong · Yulong Chen · Lizhuang Ma · Yonglu Li

[ ExHall D ]

Abstract
Reconstructing human-object interactions (HOI) from single images is fundamental in computer vision. Existing methods are primarily trained and tested on indoor scenes due to the lack of 3D data, particularly constrained by the object variety, making it challenging to generalize to real-world scenes with a wide range of objects. The limitations of previous 3D HOI datasets were primarily due to the difficulty in acquiring 3D object assets. However, with the development of 3D reconstruction from single images, recently it has become possible to reconstruct various objects from 2D HOI images. We therefore propose a pipeline for annotating fine-grained 3D humans, objects, and their interactions from single images. We annotated 2.5k+ 3D HOI assets from existing 2D HOI datasets and built the first open-vocabulary in-the-wild 3D HOI dataset Open3DHOI, to serve as a future test set. Moreover, we design a novel Gaussian-HOI optimizer, which efficiently reconstructs the spatial interactions between humans and objects while learning the contact regions. Besides the 3D HOI reconstruction, we also propose several new tasks for 3D HOI understanding to pave the way for future work.
Poster
Jeongwan On · Kyeonghwan Gwak · Gunyoung Kang · Junuk Cha · Soohyun Hwang · Hyein Hwang · Seungryul Baek

[ ExHall D ]

Abstract
Reconstructing 3D hand-object interaction (HOI) is a fundamental problem with numerous applications. Despite recent advances, there is no comprehensive pipeline yet for bimanual class-agnostic interaction reconstruction from a monocular RGB video, where two hands and an unknown object interact with each other. Previous works tackled limited hand-object interaction cases, where object templates are known in advance or only one hand is involved in the interaction. Bimanual interaction reconstruction exhibits severe occlusions introduced by complex interactions between two hands and an object. To solve this, we introduce BIGS (Bimanual Interaction 3D Gaussian Splatting), a method that reconstructs 3D Gaussians of hands and an unknown object from a monocular video. To robustly obtain object Gaussians despite severe occlusions, we leverage the prior knowledge of a pre-trained diffusion model with a score distillation sampling (SDS) loss to reconstruct unseen object parts. For hand Gaussians, we exploit the 3D priors of a hand model (i.e., MANO) and share a single Gaussian for the two hands to effectively accumulate hand 3D information given limited views. To further account for the 3D alignment between hands and objects, we include an interacting-subjects optimization step during Gaussian optimization. Our method achieves state-of-the-art accuracy on two challenging datasets, in …
Poster
Kefan Chen · Chaerin Min · Linguang Zhang · Shreyas Hampali · Cem Keskin · Srinath Sridhar

[ ExHall D ]

Abstract
Despite remarkable progress in image generation models, generating realistic hands remains a persistent challenge due to their complex articulation, varying viewpoints, and frequent occlusions. We present FoundHand, a large-scale domain-specific diffusion model for synthesizing single and dual hand images. To train our model, we introduce FoundHand-10M, a large-scale hand dataset with 2D keypoints and segmentation mask annotations. Our insight is to use 2D hand keypoints as a universal representation that encodes both hand articulation and camera viewpoint. FoundHand learns from image pairs to capture physically plausible hand articulations, natively enables precise control through 2D keypoints, and supports appearance control. Our model exhibits core capabilities that include the ability to repose hands, transfer hand appearance, and even synthesize novel views. This leads to zero-shot capabilities for fixing malformed hands in previously generated images, or synthesizing hand video sequences. We present extensive experiments and evaluations that demonstrate state-of-the-art performance of our method.
Poster
Rao Fu · Dingxi Zhang · Alex Jiang · Wanjia Fu · Austin Funk · Daniel Ritchie · Srinath Sridhar

[ ExHall D ]

Abstract
Understanding bimanual human hand activities is a critical problem in AI and robotics. We cannot build large models of bimanual activities because existing datasets lack the scale, coverage of diverse hand activities, and detailed annotations. We introduce GigaHands, a massive annotated dataset capturing 34 hours of bimanual hand activities from 56 subjects and 417 objects, totaling 14k motion clips derived from 183 million frames paired with 84k text annotations. Our markerless capture setup and data acquisition protocol enable fully automatic 3D hand and object estimation while minimizing the effort required for text annotation. The scale and diversity of GigaHands enable broad applications, including text-driven action synthesis, hand motion captioning, and dynamic radiance field reconstruction.
Poster
Buzhen Huang · Chen Li · Chongyang Xu · Dongyue Lu · Jinnan Chen · Yangang Wang · Gim Hee Lee

[ ExHall D ]

Abstract
Due to visual ambiguities and inter-person occlusions, existing human pose estimation methods cannot recover plausible close interactions from in-the-wild videos. Even state-of-the-art large foundation models (e.g., SAM) cannot accurately distinguish human semantics in such challenging scenarios. In this work, we find that human appearance can provide a straightforward cue to address these obstacles. Based on this observation, we propose a dual-branch optimization framework to reconstruct accurate interactive motions with plausible body contacts constrained by human appearances, social proxemics, and physical laws. Specifically, we first train a diffusion model to learn the human proxemic behavior and pose prior knowledge. The trained network and two optimizable tensors are then incorporated into a dual-branch optimization framework to reconstruct human motions and appearances. Several constraints based on 3D Gaussians, 2D keypoints, and mesh penetrations are also designed to assist the optimization. With the proxemics prior and diverse constraints, our method is capable of estimating accurate interactions from in-the-wild videos captured in complex environments. We further build a dataset with pseudo ground-truth interaction annotations, which may promote future research on pose estimation and human behavior understanding. Experimental results on several benchmarks demonstrate that our method outperforms existing approaches. The code and data will be publicly available …
Poster
Jin Lyu · Tianyi Zhu · Yi Gu · Li Lin · Pujin Cheng · Yebin Liu · Xiaoying Tang · Liang An

[ ExHall D ]

Abstract
Quantitative analysis of animal behavior and biomechanics requires accurate animal pose and shape estimation across species, and is important for animal welfare and biological research. However, the small network capacity of previous methods and limited multi-species dataset leave this problem underexplored. To this end, this paper presents AniMer to estimate animal pose and shape using family aware Transformer, enhancing the reconstruction accuracy of diverse quadrupedal families. A key insight of AniMer is its integration of a high-capacity Transformer-based backbone and an animal family supervised contrastive learning scheme, unifying the discriminative understanding of various quadrupedal shapes within a single framework. For effective training, we aggregate most available open-sourced quadrupedal datasets, either with 3D or 2D labels. To improve the diversity of 3D labeled data, we introduce CtrlAni3D, a novel large-scale synthetic dataset created through a new diffusion-based conditional image generation pipeline. CtrlAni3D consists of about 10k images with pixel-aligned SMAL labels. In total, we obtain 41.3k annotated images for training and validation. Consequently, the combination of a family aware Transformer network and an expansive dataset enables AniMer to outperform existing methods not only on 3D datasets like Animal3D and CtrlAni3D, but also on out-of-distribution Animal Kingdom dataset. Ablation studies further demonstrate …
Poster
Andrea Boscolo Camiletto · Jian Wang · Eduardo Alvarado · Rishabh Dabral · Thabo Beeler · Marc Habermann · Christian Theobalt

[ ExHall D ]

Abstract
Egocentric motion capture with a head-mounted body-facing stereo camera is crucial for VR and AR applications but presents significant challenges such as heavy occlusions and limited annotated real-world data. Existing methods heavily rely on synthetic pretraining and struggle to generate smooth and accurate predictions in real-world settings, particularly for lower limbs. Our work addresses these limitations by introducing a lightweight VR-based data collection setup with on-board, real-time 6D pose tracking. Using this setup, we collected the most extensive real-world dataset for ego-facing ego-mounted cameras to date in size and motion variability. Effectively integrating this multimodal input -- device pose and camera feeds -- is challenging due to the differing characteristics of each data source. To address this, we propose FRAME, a simple yet effective architecture that combines device pose and camera feeds for state-of-the-art body pose prediction through geometrically sound multimodal integration and can run at 300 FPS on modern hardware. Lastly, we showcase a novel training strategy to enhance the model's generalization capabilities. Our approach exploits the problem's geometric properties, yielding high-quality motion capture free from common artifacts in prior work. Qualitative and quantitative evaluations, along with extensive comparisons, demonstrate the effectiveness of our method. We will release data, code, and CAD designs for the …
Poster
Hyunjun Lee · Hyunsoo Lee · Sookwan Han

[ ExHall D ]

Abstract
There have been many attempts to leverage multiple diffusion models for collaborative generation, extending beyond the original domain. One prominent approach is synchronizing multiple diffusion trajectories by mixing the estimated scores to artificially correlate the generation processes. However, existing approaches rely on heuristics such as averaging for synchronizing trajectories. Such approaches do not clarify WHY these methods work and create many failure cases when the heuristic used on one task is naively applied to other tasks. In this paper, we present a probabilistic framework for analyzing why diffusion synchronization works, and prove that the heuristics model the correlations between multiple trajectories and hence must be adapted to each task in which synchronization takes place. We then attempt to find the best correlation model for each task, obtaining the best results compared to previous approaches that naively apply a single heuristic to every task without reasoning.
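The score-mixing heuristic analyzed above can be written down in a few lines: at every denoising step, the per-trajectory score estimates are combined with a mixing matrix (uniform averaging being the naive heuristic), and each trajectory takes a step with the mixed score. The denoiser interface and the schematic Euler update below are assumptions, not any particular paper's sampler.

def synchronized_denoise_step(denoisers, samples, t, step_size=1e-3, mix=None):
    # samples: one tensor per trajectory/view; denoisers: score estimators d(x, t) -> score
    n = len(samples)
    scores = [d(x, t) for d, x in zip(denoisers, samples)]
    if mix is None:
        mix = [[1.0 / n] * n for _ in range(n)]        # the naive averaging heuristic
    mixed = [sum(mix[i][j] * scores[j] for j in range(n)) for i in range(n)]
    # schematic Euler update with the mixed (artificially correlated) scores
    return [x + step_size * s for x, s in zip(samples, mixed)]

A task-adapted correlation model would replace the uniform mix with weights chosen for that task, which is the direction the analysis above argues for.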
Poster
Jiaman Li · Karen Liu · Jiajun Wu

[ ExHall D ]

Abstract
Estimating 3D motion from 2D observations is a long-standing research challenge. Prior work typically requires training on datasets containing ground truth 3D motions, limiting their applicability to activities well-represented in existing motion capture data. This dependency particularly hinders generalization to out-of-distribution scenarios or subjects where collecting 3D ground truth is challenging, such as complex athletic movements or animal motion. We introduce MVLift, a novel approach to predict global 3D motion---including both joint rotations and root trajectories in the world coordinate system---using only 2D pose sequences for training. Our multi-stage framework leverages 2D motion diffusion models to progressively generate consistent 2D pose sequences across multiple views, a key step in recovering accurate global 3D motion. MVLift generalizes across various domains, including human poses, human-object interactions, and animal poses. Despite not requiring 3D supervision, it outperforms prior work on five datasets, including those methods that require 3D supervision.
Poster
Kenkun Liu · Yurong Fu · Weihao Yuan · Jing Lin · Peihao Li · Xiaodong Gu · Lingteng Qiu · Haoqian Wang · Zilong Dong · Xiaoguang Han

[ ExHall D ]

Abstract
Existing methods for capturing multi-person holistic human motions from a monocular video usually involve integrating the detector, the tracker, and the human pose & shape estimator into a cascaded system. Differently, we develop a one-stage multi-person holistic human motion capture system, which 1) employs only one network, enabling significant benefits from the end-to-end training on a large-scale dataset; 2) enables performance improving of the tracking module during training, avoiding being limited by a pre-trained tracker; 3) captures the motions of all individuals within a single shot, rather than tracking and estimating each person sequentially. In this system, each query within a temporal cross-attention module is responsible for the long motion of a specific individual, implicitly aggregating individual-specific information throughout the entire video. To further boost the proposed system from end-to-end training, we also construct a synthetic human video dataset, with multi-person and whole-body annotations. Extensive experiments across different datasets demonstrate both the efficacy and the efficiency of both the proposed method and the dataset. The code of our method will be made publicly available.
Poster
Yinhuai Wang · Qihan Zhao · Runyi Yu · Hok Wai Tsui · Ailing Zeng · Jing Lin · Zhengyi Luo · Jiwen Yu · Xiu Li · Qifeng Chen · Jian Zhang · Lei Zhang · Ping Tan

[ ExHall D ]

Abstract
Traditional reinforcement learning methods for interaction skills rely on labor-intensive, manually designed rewards that do not generalize well across different skills. Inspired by how humans learn from demonstrations, we propose ISMimic, the first data-driven approach that Mimics both human and ball motions to learn diverse Interaction Skills, e.g., a wide variety of challenging basketball skills. ISMimic employs a unified configuration to learn diverse interaction skills from human-ball motion datasets, with skill diversity and generalization improving as the dataset grows. This approach allows training a single policy to learn multiple interaction skills and allows smooth skill switching. The interaction skills acquired by ISMimic can be easily reused by a high-level controller to accomplish high-level tasks. To evaluate our approach, we introduce two basketball datasets that collectively contain about 35 minutes of diverse basketball skills. Experiments show that our method can effectively acquire various reusable basketball skills including diverse styles of dribbling, layup, and shooting. Video results and 3D visualization are available at https://ismimic.github.io
Poster
Yingying Fan · Quanwei Yang · Kaisiyuan Wang · Hang Zhou · Yingying Li · Haocheng Feng · Errui Ding · Yu Wu · Jingdong Wang

[ ExHall D ]

Abstract
Current digital human studies focusing on lip-syncing and body movement are no longer sufficient to meet the growing industrial demand, while human video generation techniques that support interacting with real-world environments (e.g., objects) have not been well investigated. Despite human hand synthesis already being an intricate problem, generating objects in contact with hands and their interactions presents an even more challenging task, especially when the objects exhibit obvious variations in size and shape. To cope with these issues, we present a novel video Reenactment framework focusing on Human-Object Interaction (HOI) via an adaptive Layout-instructed Diffusion model (Re-HOLD). Our key insight is to employ specialized layout representation for hands and objects, respectively. Such representations enable effective disentanglement of hand modeling and object adaptation to diverse motion sequences. To further improve the generation quality of HOI, we newly design an interactive textural enhancement module for both hands and objects by introducing two independent memory banks. We also propose a layout-adjusting strategy for the cross-object reenactment scenario to adaptively adjust unreasonable layouts caused by diverse object sizes during inference. Comprehensive qualitative and quantitative evaluations demonstrate that our proposed Re-HOLD framework significantly outperforms existing methods.
Poster
Peishan Cong · Ziyi Wang · Yuexin Ma · Xiangyu Yue

[ ExHall D ]

Abstract
Generating reasonable and high-quality human interactive motions in a given dynamic environment is crucial for understanding, modeling, transferring, and applying human behaviors to both virtual and physical robots. In this paper, we introduce an effective method, SemGeoMo, for dynamic contextual human motion generation, which fully leverages the text-affordance-joint multi-level semantic and geometric guidance in the generation process, improving the semantic rationality and geometric correctness of generative motions. Our method achieves state-of-the-art performance on three datasets and demonstrates superior generalization capability for diverse interaction scenarios.
Poster
Xuan Li · Qianli Ma · Tsung-Yi Lin · Yongxin Chen · Chenfanfu Jiang · Ming-Yu Liu · Donglai Xiang

[ ExHall D ]

Abstract
We present Articulated Motion Distillation (AMD), a framework for generating high-fidelity character animations by merging the strengths of skeleton-based animation and modern generative models. AMD uses a skeleton-based representation for rigged 3D assets, drastically reducing the Degrees of Freedom (DoFs) by focusing on joint-level control, which allows for efficient, consistent motion synthesis. Through Score Distillation Sampling (SDS) with pre-trained video diffusion models, AMD distills complex, articulated motions while maintaining structural integrity, overcoming challenges faced by 4D neural deformation fields in preserving shape consistency. This approach is naturally compatible with physics-based simulation, ensuring physically plausible interactions. Experiments show that AMD achieves superior 3D consistency and expressive motion quality compared with existing works on text-to-4D generation.
Poster
Lei Li · Sen Jia · Jianhao Wang · Zhongyu Jiang · Feng Zhou · Ju Dai · Tianfang Zhang · Zongkai Wu · Jenq-Neng Hwang

[ ExHall D ]

Abstract
This paper presents LLaMo (Large Language and Human Motion Assistant), a multimodal framework for human motion instruction tuning. In contrast to conventional instruction-tuning approaches that convert non-linguistic inputs, such as video or motion sequences, into language tokens, LLaMo retains motion in its native form for instruction tuning. This method preserves motion-specific details that are often diminished in tokenization, thereby improving the model’s ability to interpret complex human behaviors. By processing both video and motion data alongside textual inputs, LLaMo enables a flexible, human-centric analysis. Experimental evaluations across high-complexity domains, including human behaviors and professional activities, indicate that LLaMo effectively captures domain-specific knowledge, enhancing comprehension and prediction in motion-intensive scenarios. We hope LLaMo offers a foundation for future multimodal AI systems with broad applications, from sports analytics to behavioral prediction.
Poster
Jianrong Zhang · Hehe Fan · Yi Yang

[ ExHall D ]

Abstract
Diffusion models, particularly latent diffusion models, have demonstrated remarkable success in text-driven human motion generation. However, it remains challenging for latent diffusion models to effectively compose multiple semantic concepts into a single, coherent motion sequence. To address this issue, we propose EnergyMoGen, which includes two spectrums of Energy-Based Models: (1) We interpret the diffusion model as a latent-aware energy-based model that generates motions by composing a set of diffusion models in latent space; (2) We introduce a semantic-aware energy model based on cross-attention, which enables semantic composition and adaptive gradient descent for text embeddings. To overcome the challenges of semantic inconsistency and motion distortion across these two spectrums, we introduce Synergistic Energy Fusion. This design allows the motion latent diffusion model to synthesize high-quality, complex motions by combining multiple energy terms corresponding to textual descriptions. Experiments show that our approach outperforms existing state-of-the-art models on various motion generation tasks, including text-to-motion generation, compositional motion generation, and multi-concept motion generation. Additionally, we demonstrate that our method can be used to extend motion datasets and improve the text-to-motion task. Our implementation will be released.
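One concrete form of composing multiple text conditions in a latent diffusion model is the standard guidance-summation recipe sketched below, where per-concept guidance directions are weighted and added to the unconditional prediction. This is a generic composable-diffusion baseline for illustration, not EnergyMoGen's Synergistic Energy Fusion; the model interface and weights are assumptions.

def composed_eps(model, z_t, t, cond_embs, weights, null_emb):
    # model(z_t, t, emb) -> predicted noise for one condition; interfaces are assumed
    eps_uncond = model(z_t, t, null_emb)
    guided = eps_uncond
    for w, emb in zip(weights, cond_embs):
        guided = guided + w * (model(z_t, t, emb) - eps_uncond)   # add one concept's direction
    return guided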
Poster
Chi-Pin Huang · Yen-Siang Wu · Hung-Kai Chung · Kai-Po Chang · Fu-En Yang · Yu-Chiang Frank Wang

[ ExHall D ]

Abstract
Customized text-to-video generation aims to produce high-quality videos that incorporate user-specified subject identities or motion patterns. However, existing methods mainly focus on personalizing a single concept, either subject identity or motion pattern, limiting their effectiveness for multiple subjects with the desired motion patterns. To tackle this challenge, we propose a unified framework VideoMage for video customization over both multiple subjects and their interactive motions. VideoMage employs subject and motion LoRAs to capture personalized content from user-provided images and videos, along with an appearance-agnostic motion learning approach to disentangle motion patterns from visual appearance. Furthermore, we develop a spatial-temporal composition scheme to guide interactions among subjects within the desired motion patterns. Extensive experiments demonstrate that VideoMage outperforms existing methods, generating coherent, user-controlled videos with consistent subject identities and interactions. Our code will be available upon acceptance.
Poster
Kumar Ashutosh · Georgios Pavlakos · Kristen Grauman

[ ExHall D ]

Abstract
Anticipating how a person will interact with objects in an environment is essential for activity understanding, but existing methods are limited to the 2D space of video frames—capturing physically ungrounded predictions of “what” and ignoring the “where” and “how”. We introduce 4D future interaction prediction from videos. Given an input video of a human activity, the goal is to predict what objects at what 3D locations the person will interact with in the next time period (e.g., cabinet, fridge), and how they will execute that interaction (e.g., poses for bending, reaching, pulling). We propose a novel model FICTION that fuses the past video observation of the person’s actions and their environment to predict both the “where” and “how” of future interactions. Through comprehensive experiments on a variety of activities and real-world environments in Ego-Exo4D, we show that our proposed approach outperforms prior autoregressive and (lifted) 2D video models substantially, with more than 30% relative gains.
Poster
Jiuming Liu · Jinru Han · Lihao Liu · Angelica I Aviles-Rivero · Chaokang Jiang · Zhe Liu · Hesheng Wang

[ ExHall D ]

Abstract
Point cloud videos can faithfully capture real-world spatial geometries and temporal dynamics, which are essential for enabling intelligent agents to understand the dynamically changing world. However, designing an effective 4D backbone remains challenging, mainly due to the irregular and unordered distribution of points and temporal inconsistencies across frames. Moreover, recent transformer-based 4D backbones commonly suffer from large computational costs due to their quadratic complexity, particularly for long video sequences. To address these challenges, we propose a novel point cloud video understanding backbone purely based on State Space Models (SSMs). Specifically, we first disentangle space and time in 4D video sequences and then establish the spatio-temporal correlation with our designed Mamba blocks. The Intra-frame Spatial Mamba module is developed to encode locally similar geometric structures within a certain temporal stride. Subsequently, locally correlated tokens are delivered to the Inter-frame Temporal Mamba module, which integrates long-term point features across the entire video with linear complexity. Our proposed Mamba4d achieves competitive performance on the MSR-Action3D action recognition (+10.4% accuracy), HOI4D action segmentation (+0.7 F1 score), and Synthia4D semantic segmentation (+0.19 mIoU) datasets. Especially for long video sequences, our method achieves a significant efficiency improvement, with an 87.5% GPU memory reduction and a 5.36× speed-up. …
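The space/time factorization described above can be sketched generically: tokens are first processed within each frame, then each point track is processed along the time axis with a linear-complexity sequence model. The block below uses placeholder sequence modules in place of the paper's Intra-frame Spatial and Inter-frame Temporal Mamba blocks; the tensor layout is an assumption.

import torch.nn as nn

class Factorized4DBlock(nn.Module):
    def __init__(self, spatial_model, temporal_model):
        super().__init__()
        self.spatial = spatial_model      # placeholder: any (batch, seq, C) -> (batch, seq, C) module
        self.temporal = temporal_model    # placeholder for a linear-complexity temporal model

    def forward(self, tokens):
        # tokens: (B, T, N, C) -- T frames, N point tokens per frame
        B, T, N, C = tokens.shape
        x = self.spatial(tokens.reshape(B * T, N, C)).reshape(B, T, N, C)   # within-frame pass
        x = x.permute(0, 2, 1, 3).reshape(B * N, T, C)                      # one sequence per point track
        x = self.temporal(x).reshape(B, N, T, C).permute(0, 2, 1, 3)        # across-frame pass
        return x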
Poster
Vadim Tschernezki · Diane Larlus · Andrea Vedaldi · Iro Laina

[ ExHall D ]

Abstract
Computer vision is largely based on 2D techniques, with 3D vision still relegated to a relatively narrow subset of applications. However, by building on recent advances in 3D models such as neural radiance fields, some authors have shown that 3D techniques can at last improve outputs extracted from independent 2D views, by fusing them into 3D and denoising them. This is particularly helpful in egocentric videos, where the camera motion is significant, but only under the assumption that the scene itself is static. In fact, as shown in the recent analysis conducted by EPIC Fields, 3D techniques are ineffective when it comes to studying dynamic phenomena, and segmenting moving objects in particular. In this paper, we look into this issue in more detail. First, we propose to improve dynamic segmentation in 3D by fusing motion segmentation predictions from a 2D-based model into layered radiance fields (layered motion fusion). Additionally, we observe that the fusion is limited due to the high complexity of very long dynamic videos, resulting in a lack of captured geometry and in turn a constrained fusion of motion into the (missing) dynamic scene geometry. We address this issue through test-time refinement, which helps the model to focus …
Poster
Haoyue Liu · Jinghan Xu · Yi Chang · Hanyu Zhou · Haozhi Zhao · Lin Wang · Luxin Yan

[ ExHall D ]

Abstract
Video frame interpolation (VFI) that leverages bio-inspired event cameras as guidance has recently shown better performance and memory efficiency than frame-based methods, thanks to the event cameras' advantages, such as high temporal resolution. A hurdle for event-based VFI is how to effectively deal with non-linear motion caused by dynamic changes in motion direction and speed within the scene. Existing methods either use events to estimate sparse motion fields or fuse events with image features to estimate dense motion fields. Unfortunately, motion errors often degrade the VFI quality, as the continuous motion cues from events do not align with the dense spatial information of images in the temporal dimension. In this paper, we find that object motion is continuous in space; tracking local regions over continuous time enables more accurate identification of spatiotemporal feature correlations. In light of this, we propose a novel continuous point tracking-based VFI framework, named TimeTracker. Specifically, we first design a Scene-Aware Region Segmentation (SARS) module to divide the scene into similar patches. Then, a Continuous Trajectory guided Motion Estimation (CTME) module is proposed to track the continuous motion trajectory of each patch through events. Finally, intermediate frames at any given time are generated …
Poster
Zhengfei Kuang · Tianyuan Zhang · Kai Zhang · Hao Tan · Sai Bi · Yiwei Hu · Zexiang Xu · Milos Hasan · Gordon Wetzstein · Fujun Luan

[ ExHall D ]

Abstract
We present Buffer Anytime, a framework for estimation of depth and normal maps (which we call geometric buffers) from video that eliminates the need for paired video--depth and video--normal training data. Instead of relying on large-scale annotated video datasets, we demonstrate high-quality video buffer estimation by leveraging single-image priors with temporal consistency constraints. Our zero-shot training strategy combines state-of-the-art image estimation models based on optical flow smoothness through a hybrid loss function, implemented via a lightweight temporal attention architecture. Applied to leading image models like Depth Anything V2 and Marigold-E2E-FT, our approach significantly improves temporal consistency while maintaining accuracy. Experiments show that our method not only outperforms image-based approaches but also achieves results comparable to state-of-the-art video models trained on large-scale paired video datasets, despite using no such paired video data.
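A typical way to impose optical-flow-based temporal consistency of the kind mentioned above is to backward-warp the prediction for frame t+1 into frame t and penalize the disagreement. The sketch below shows such a warping loss for depth buffers; the warping details and loss weighting are assumptions, not the paper's exact hybrid loss.

import torch
import torch.nn.functional as F

def backward_warp(buf, flow):
    # buf: (B, C, H, W) prediction for frame t+1; flow: (B, 2, H, W) flow from frame t to t+1
    B, _, H, W = buf.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(buf.device)      # pixel grid, x first
    coords = base.unsqueeze(0) + flow                               # where each pixel maps to
    cx = 2.0 * coords[:, 0] / (W - 1) - 1.0                         # normalize to [-1, 1]
    cy = 2.0 * coords[:, 1] / (H - 1) - 1.0
    grid = torch.stack((cx, cy), dim=-1)                            # (B, H, W, 2)
    return F.grid_sample(buf, grid, align_corners=True)

def temporal_consistency_loss(pred_t, pred_t1, flow_t_to_t1):
    # penalize disagreement between frame t and the flow-warped frame t+1
    return (pred_t - backward_warp(pred_t1, flow_t_to_t1)).abs().mean()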
Poster
Min Wu Jeong · Chae Eun Rhee

[ ExHall D ]

Abstract
In this paper, we propose LC-Mamba, a Mamba-based model that captures fine-grained spatiotemporal information in video frames, addressing limitations in current interpolation methods and enhancing performance. The main contributions are as follows: First, we apply a shifted local window technique to reduce historical decay and enhance local spatial features, allowing multi-scale capture of detailed motion between frames. Second, we introduce a Hilbert curve-based selective state scan to maintain continuity across window boundaries, preserving spatial correlations both within and between windows. Third, we extend the Hilbert curve to enable voxel-level scanning to effectively capture spatiotemporal characteristics between frames. The proposed LC-Mamba achieves competitive results, with a PSNR of 36.53 dB on Vimeo-90k, outperforming prior models by +0.03 dB. The code and models are publicly available at https://anonymous.4open.science/r/LC-Mamba-FE7C
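A Hilbert-curve scan order for a square token window can be generated with the classic distance-to-coordinate conversion below; tokens reordered this way keep spatial neighbours close in the 1D sequence, which is the locality property a curve-based selective scan relies on. The window size and the way the order is applied to tokens are illustrative assumptions, not the paper's implementation.

def hilbert_d2xy(order, d):
    # classic conversion from distance d along a Hilbert curve to (x, y) on a 2**order grid
    x = y = 0
    t = d
    s = 1
    n = 1 << order
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                         # rotate/flip the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def hilbert_scan_order(window=8):
    # indices of a (window x window) token grid in Hilbert order; window must be a power of two
    order = window.bit_length() - 1
    return [y * window + x for x, y in (hilbert_d2xy(order, d) for d in range(window * window))]

# usage (illustrative): tokens of shape (B, window*window, C) reordered for a 1D scan
# scanned = tokens[:, hilbert_scan_order(8)]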
Poster
Xin Yu · Tianyu Wang · Soo Ye Kim · Paul Guerrero · Xi Chen · Qing Liu · Zhe Lin · Xiaojuan Qi

[ ExHall D ]

Abstract
Simple as it seems, moving an object to another location within an image is, in fact, a challenging image-editing task that requires re-harmonizing the lighting, adjusting the pose based on perspective, accurately filling occluded regions, and ensuring coherent synchronization of shadows and reflections while maintaining the object identity. In this paper, we present ObjectMover, a generative model that can perform object movement in highly challenging scenes. Our key insight is that we model this task as a sequence-to-sequence problem and fine-tune a video generation model to leverage their knowledge of consistent object generation across video frames. We show that with this approach, our model is able to adjust to complex real-world scenarios, handling extreme lighting harmonization and object effect movement. As large-scale data for object movement are unavailable, we construct a data generation pipeline using a modern game engine to synthesize high-quality data pairs. We further propose a multi-task learning strategy that enables training on real-world video data to improve the model generalization. Through extensive experiments, we demonstrate that ObjectMover achieves outstanding results and adapts well to real-world scenarios.
Poster
Juil Koo · Paul Guerrero · Chun-Hao P. Huang · Duygu Ceylan · Minhyuk Sung

[ ExHall D ]

Abstract
Generative methods for image and video editing use generative models as priors to perform edits despite incomplete information, such as changing the composition of 3D objects shown in a single image. Recent methods have shown promising composition editing results in the image setting, but in the video setting, editing methods have focused on editing object appearance, object motion, or camera motion, and as a result, methods to edit object composition in videos are still missing. We propose VideoHandles as a method for editing 3D object compositions in videos of static scenes with camera motion. Our approach allows editing the 3D position of a 3D object across all frames of a video in a temporally consistent manner. This is achieved by lifting intermediate features of a generative model to a 3D reconstruction that is shared between all frames, editing the reconstruction, and projecting the features on the edited reconstruction back to each frame. To the best of our knowledge, this is the first generative approach to edit object compositions in videos. Our approach is simple and training-free, while outperforming state-of-the-art image editing baselines.
Poster
Jiarui Xu · Shihao Han · Karan Dalal · Daniel Koceja · Yue Zhao · Ka Chun Cheung · Yejin Choi · Jan Kautz · Yu Sun · Xiaolong Wang

[ ExHall D ]

Abstract
We present a novel framework for generating long-form cartoon videos, specifically focusing on recreating the classic "Tom and Jerry" series. While recent advances in video generation have shown promising results for short clips, generating long videos with coherent storylines and dynamic motions remains challenging with high computation costs. We propose a hybrid framework that combines local self-attention with a Test-Time Training (TTT) based global attention mechanism, enabling our model to process and maintain consistency across significantly longer temporal context windows. We develop a new dataset curation pipeline specifically designed for long-form cartoon videos, combining human annotations for complex motion dynamics with Vision-Language Models for detailed descriptions. Our pipeline captures the exaggerated movements and dynamic camera work characteristic of "Tom and Jerry". Experiments show that our approach outperforms existing methods in generating long-form animated content with plausible motion and consistent storylines.
Poster
Shaoteng Liu · Tianyu Wang · Jui-Hsien Wang · Qing Liu · Zhifei Zhang · Joon-Young Lee · Yijun Li · Bei Yu · Zhe Lin · Soo Ye Kim · Jiaya Jia

[ ExHall D ]

Abstract
Large-scale video generation models have the inherent ability to realistically model natural scenes. In this paper, we demonstrate that through a careful design of a generative video propagation framework, various video tasks can be addressed in a unified way by leveraging the generative power of such models. Specifically, our framework, GenProp, encodes the original video with a selective content encoder and propagates the changes made to the first frame using an image-to-video generation model. We propose a data generation scheme to cover multiple video tasks based on instance-level video segmentation datasets. Our model is trained by incorporating a mask prediction decoder head and optimizing a region-aware loss to aid the encoder to preserve the original content while the generation model propagates the modified region. This novel design opens up new possibilities: In editing scenarios, GenProp allows substantial changes to an object's shape; for insertion, the inserted objects can exhibit independent motion; for removal, GenProp effectively removes effects like shadows and reflections from the whole video; for tracking, GenProp is capable of tracking objects and their associated effects together. Experiment results demonstrate the leading performance of our model in various video tasks, and we further provide in-depth analyses of the proposed …
Poster
Chaoyang Wang · Peiye Zhuang · Tuan Duc Ngo · Willi Menapace · Aliaksandr Siarohin · Michael Vasilkovsky · Ivan Skorokhodov · Sergey Tulyakov · Peter Wonka · Hsin-Ying Lee

[ ExHall D ]

Abstract
We propose 4Real-Video, a novel framework for generating 4D videos, organized as a grid of video frames with both time and viewpoint axes. In this grid, each row contains frames sharing the same timestep, while each column contains frames from the same viewpoint. Our two-stream architecture processes this grid: one stream performs viewpoint updates on columns, while the other performs temporal updates on rows. After each diffusion transformer layer, a newly designed synchronization layer exchanges information between the two token streams. We propose two implementations of the synchronization layer, using either hard or soft synchronization. This feedforward architecture improves upon previous work in three ways: higher inference speed, enhanced visual quality (measured by FVD, CLIP, and VideoScore), and improved temporal and viewpoint consistency (measured by VideoScore, GIM-Confidence, and Dust3R-Confidence).
Poster
Guodong Ding · Rongyu Chen · Angela Yao

[ ExHall D ]

Abstract
This work presents the first condensation approach for procedural video datasets used in temporal action segmentation (TAS). We propose a condensation framework that leverages a generative prior learned from the dataset and network inversion to condense data into compact latent codes, significantly reducing storage along both the temporal and channel dimensions. Orthogonally, we propose sampling diverse and representative action sequences to minimize video-wise redundancy. Our evaluation on standard benchmarks demonstrates consistent effectiveness in condensing TAS datasets while achieving competitive performance. Specifically, on the Breakfast dataset, our approach reduces storage by over 500× while retaining 83% of the performance of training with the full dataset. Furthermore, when applied to a downstream incremental learning task, it yields superior performance compared to the state-of-the-art.
Poster
Muhammad Umar Karim Khan · Aaron Chadha · Mohammad Ashraful Anam · Yiannis Andreopoulos

[ ExHall D ]

Abstract
Standard video codecs are rate-distortion optimization machines, where distortion is typically quantified using PSNR versus the source. However, it is now widely accepted that increasing PSNR does not necessarily translate to better visual quality. In this paper, a better balance between perception and fidelity is targeted in order to provide significant rate savings over state-of-the-art standards-based video codecs. Specifically, pre- and postprocessing neural networks are proposed that enhance the coding efficiency of standard video codecs when benchmarked with an array of well-established perceptual quality scores. These "neural wrapper" elements are end-to-end trained with a neural codec module serving as a differentiable proxy for standard video codecs. The codec proxy is jointly optimized with the pre- and post-processing components via a novel two-phase pretraining strategy and end-to-end iterative refinement with stop-gradient. This allows the neural pre- and postprocessor to learn to embed, remove and recover information in a codec-aware manner, thus improving rate-quality performance. A single neural-wrapper model is thereby established and used for the entire rate-quality curve without needing any downscaling or upscaling. The trained model is tested with the AV1 and VVC standard codecs via an array of well-established objective quality scores (SSIM, MS-SSIM, VMAF, AVQT), as …
Poster
Shuoyan Wei · Feng Li · Shengeng Tang · Yao Zhao · Huihui Bai

[ ExHall D ]

Abstract
Continuous space-time video super-resolution (C-STVSR) endeavors to upscale videos simultaneously at arbitrary spatial and temporal scales, which has recently garnered increasing interest. However, prevailing methods struggle to yield satisfactory videos at out-of-distribution spatial and temporal scales. On the other hand, event streams, characterized by high temporal resolution and high dynamic range, show compelling promise for vision tasks. This paper presents EvEnhancer, an innovative approach that marries the unique advantages of event streams to elevate effectiveness, efficiency, and generalizability for C-STVSR. Our approach hinges on two pivotal components: 1) Event-adapted synthesis capitalizes on the spatiotemporal correlations between frames and events to discern and learn long-term motion trajectories, enabling the adaptive interpolation and fusion of informative spatiotemporal features; 2) Local implicit video transformer integrates a local implicit video neural function with cross-scale spatiotemporal attention to learn continuous video representations used to generate plausible videos at arbitrary resolutions and frame rates. Experiments show that EvEnhancer achieves superior results on synthetic and real-world datasets and better generalizability to out-of-distribution scales than state-of-the-art methods.
Poster
Huimin Zeng · Jiacheng Li · Zhiwei Xiong

[ ExHall D ]

Abstract
As a widely adopted technique in data transmission, video compression effectively reduces the size of files, making real-time cloud computing possible. However, it comes at the cost of visual quality, posing challenges to the robustness of downstream vision models. In this work, we present a versatile codec-aware enhancement framework that reuses codec information to adaptively enhance videos under different compression settings, assisting various downstream vision tasks without introducing a computational bottleneck. Specifically, the proposed codec-aware framework consists of a compression-aware adaptation (CAA) network that employs a hierarchical adaptation mechanism to estimate the parameters of the frame-wise enhancement network, namely the bitstream-aware enhancement (BAE) network. The BAE network further leverages temporal and spatial priors embedded in the bitstream to effectively improve the quality of compressed input frames. Extensive experimental results demonstrate the superior quality enhancement performance of our framework over existing enhancement methods, as well as its versatility in assisting multiple downstream tasks on compressed videos as a plug-and-play module.
Poster
Zongjian Li · Bin Lin · Yang Ye · Liuhan Chen · Xinhua Cheng · Shenghai Yuan · Li Yuan

[ ExHall D ]

Abstract
Video Variational Autoencoders (VAEs) encode videos into a low-dimensional latent space, becoming a key component of most Latent Video Diffusion Models (LVDMs) to reduce model training costs. However, as the resolution and duration of generated videos increase, the encoding cost of video VAEs becomes a limiting bottleneck in training LVDMs. Moreover, the block-wise inference method adopted by most LVDMs can lead to discontinuities in the latent space when processing long-duration videos. The key to addressing the computational bottleneck lies in decomposing videos into distinct components and efficiently encoding the critical information. Since the wavelet transform can decompose videos into multiple frequency-domain components and significantly improve efficiency, we propose Wavelet Flow VAE (WF-VAE), an autoencoder that leverages multi-level wavelet transform to facilitate low-frequency energy flow into the latent representation. Furthermore, we introduce a method called Causal Cache, which maintains the integrity of the latent space during block-wise inference. Compared to state-of-the-art video VAEs, WF-VAE demonstrates superior performance in both PSNR and LPIPS metrics, achieving 2× higher throughput and 4× lower memory consumption while maintaining competitive reconstruction quality. Our code and models will be released to inspire further research.
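The Causal Cache idea can be illustrated with a toy cached causal temporal convolution: because padding only ever uses past frames, carrying the trailing frames of one block into the next makes block-wise inference reproduce full-sequence inference exactly. The sketch below is an assumption-laden illustration (the module name, kernel size, and replicate-padding at the start are ours), not WF-VAE's actual implementation.

```python
# Inference-time sketch of a cache-based causal temporal convolution.
# Only past frames are used for padding, and the last (kernel_t - 1) input
# frames are carried between blocks, so block-wise and full-sequence
# inference give the same result.
import torch
import torch.nn as nn


class CachedCausalConv3d(nn.Module):
    def __init__(self, channels: int, kernel_t: int = 3):
        super().__init__()
        self.kernel_t = kernel_t
        self.conv = nn.Conv3d(channels, channels,
                              kernel_size=(kernel_t, 3, 3),
                              padding=(0, 1, 1))  # temporal padding handled manually
        self.cache = None  # trailing frames of the previous block

    def reset(self):
        self.cache = None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W)
        pad_t = self.kernel_t - 1
        if self.cache is None:
            # causal start: replicate the first frame into the past
            past = x[:, :, :1].repeat(1, 1, pad_t, 1, 1)
        else:
            past = self.cache
        self.cache = x[:, :, -pad_t:].detach()  # keep for the next block
        return self.conv(torch.cat([past, x], dim=2))


if __name__ == "__main__":
    torch.manual_seed(0)
    layer = CachedCausalConv3d(channels=4)
    video = torch.randn(1, 4, 8, 16, 16)

    layer.reset()
    full = layer(video)                       # full-sequence inference

    layer.reset()
    blocks = [layer(video[:, :, :4]),         # block-wise inference with cache
              layer(video[:, :, 4:])]
    assert torch.allclose(full, torch.cat(blocks, dim=2), atol=1e-6)
    print("block-wise inference matches full-sequence inference")
```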
Poster
Zhuoling Li · Hossein Rahmani · Qiuhong Ke · Jun Liu

[ ExHall D ]

Abstract
Video diffusion models have recently achieved remarkable results in video generation. Despite their encouraging performance, most of these models are mainly designed and trained for short video generation, leading to challenges in maintaining temporal consistency and visual details in long video generation. In this paper, through theoretical analysis of the mechanisms behind video generation, we identify two key challenges that hinder short-to-long generalization, namely, temporal position ambiguity and information dilution. To address these challenges, we propose LongDiff, a novel training-free method that unlocks the potential of the off-the-shelf video diffusion models to achieve high-quality long video generation in one go. Extensive experiments demonstrate the efficacy of our method.
Poster
Shian Du · Menghan Xia · Chang Liu · Xintao Wang · Jing Wang · Pengfei Wan · Di ZHANG · Xiangyang Ji

[ ExHall D ]

Abstract
Pre-trained video generation models hold great potential for generative video super-resolution (VSR). However, adapting them for full-size VSR, as most existing methods do, suffers from unnecessarily intensive full-attention computation and fixed output resolution. To overcome these limitations, we make the first exploration into utilizing video diffusion priors for patch-wise VSR. This is non-trivial because pre-trained video diffusion models are not native for patch-level detail generation. To mitigate this challenge, we propose an innovative approach, called PatchVSR, which integrates a dual-stream adapter for conditional guidance. The patch branch extracts features from input patches to maintain content fidelity, while the global branch extracts context features from the resized full video to bridge the generation gap caused by incomplete semantics of patches. Particularly, we also inject the patch's location information into the model to better contextualize patch synthesis within the global video frame. Experiments demonstrate that our method can synthesize high-fidelity, high-resolution details at the patch level. A tailor-made multi-patch joint modulation is proposed to ensure visual consistency across individually enhanced patches. Due to the flexibility of our patch-based paradigm, we can achieve highly competitive 4K VSR based on a 512×512 resolution base model, with extremely high efficiency.
Poster
Henrique Morimitsu · Xiaobin Zhu · Roberto M. Cesar Jr · Xiangyang Ji · Xu-Cheng Yin

[ ExHall D ]

Abstract
Optical flow estimation is essential for video processing tasks, such as restoration and action recognition. The quality of videos is constantly increasing, with current standards reaching 8K (7680 x 4320) resolution. However, optical flow methods are usually designed for low resolution and do not generalize to large inputs due to their rigid architectures. They adopt downscaling or input tiling to reduce the input size, causing a loss of details and global information. There is also a lack of optical flow benchmarks to judge the actual performance of existing methods on high-resolution samples. Previous works only conducted qualitative high-resolution evaluations on hand-picked samples. This paper fills this gap in optical flow estimation in two ways. We propose DPFlow, an adaptive optical flow architecture capable of generalizing up to 8K resolution inputs while trained with only low-resolution samples. We also introduce Kubric-NK, a new benchmark for evaluating optical flow methods with input resolutions ranging from 1K to 8K. Our high-resolution evaluation pushes the boundaries of existing methods and reveals new insights about their generalization capabilities. Extensive experimental results show that DPFlow achieves state-of-the-art results on the MPI-Sintel, KITTI 2015, Spring, and other high-resolution benchmarks. The code and dataset have been submitted as …
Poster
Lianxin Xie · csbingbing zheng · Si Wu · Hau San Wong

[ ExHall D ]

Abstract
Blind Face Video Restoration (BFVR) focuses on reconstructing high-quality facial image sequences from degraded video inputs. The main challenge is to address unknown degradations while maintaining temporal consistency across frames. Current blind face restoration methods are primarily designed for images, and directly applying these approaches to BFVR leads to a significant drop in restoration performance. In this work, we propose Dynamic Content Prediction with Motion-aware Priors, referred to as DCP-MP. We develop a motion-aware semantic dictionary by encoding the semantic information of high-quality videos into discrete elements and capturing motion information in terms of element relationships, which are derived from the dynamic temporal changes within videos. To represent the degraded video with this dictionary, we train a temporal-aware element predictor, conditioned on the degraded content, to predict the discrete elements of the dictionary. The predicted elements are then refined, conditioned on the motion information captured by the motion-aware semantic dictionary, to enhance temporal coherence. To alleviate deviation from the original structural information, we propose a conditional structure feature correction module that corrects the features flowing from the encoder to the generator. Through extensive experiments, we validate the effectiveness of our design components and demonstrate the superior performance of …
Poster
Haoyan Gong · Zhenrong Zhang · Yuzheng Feng · Anh Nguyen · Hongbin Liu

[ ExHall D ]

Abstract
License plate (LP) recognition is crucial in intelligent traffic management systems. However, factors such as long distances and poor camera quality often lead to severe degradation of captured LP images, posing challenges to accurate recognition. The design of License Plate Image Restoration (LPIR) methods frequently relies on synthetic degraded data, which limits their effectiveness on real-world severely degraded LP images. To address this issue, we introduce the first paired LPIR dataset collected in real-world scenarios, named MDLP, including 10,245 pairs of multi-frame severely degraded LP images and their corresponding clear images. To better restore severely degraded LP, we propose a novel Diffusion-based network, called LP-Diff, to tackle real-world LPIR tasks. Our approach incorporates (1) an Inter-frame Cross Attention Module to fuse temporal information across multiple frames, (2) a Texture Enhancement Module to restore texture information in degraded images, and (3) a Dual-Pathway Fusion Module to select effective features from both channel and spatial dimensions. Extensive experiments demonstrate the reliability of our dataset for model training and evaluation. Our proposed LP-Diff consistently outperforms other state-of-the-art image restoration methods on real-world LPIR tasks. Our dataset and code will be released after the paper is accepted to facilitate reproducibility and future research.
Poster
Kenghong Lin · Baoquan Zhang · Demin Yu · Wenzhi Feng · Shidong Chen · Feifan Gao · Xutao Li · Yunming Ye

[ ExHall D ]

Abstract
Precipitation nowcasting involves using current radar observation sequences to predict future radar sequences and determine future precipitation distribution, which is crucial for disaster warning, traffic planning, and agricultural production. Despite numerous advancements, challenges persist in accurately predicting both the location and intensity of precipitation, as these factors are often interdependent, with complex atmospheric dynamics and moisture distribution causing position and intensity changes to be intricately coupled. Inspired by the fact that, in the frequency domain, phase variations correspond to changes in the position of precipitation while amplitude variations are linked to intensity changes, we propose an amplitude-phase disentanglement model called AlphaPre, which separately learns the position and intensity changes of precipitation. AlphaPre comprises three key components: a phase network, an amplitude network, and an AlphaMixer. The phase network captures positional changes by learning phase variations, while the amplitude network models intensity changes by alternating between the frequency and spatial domains. The AlphaMixer then integrates these components to produce a refined precipitation forecast. Extensive experiments on four datasets demonstrate the effectiveness and superiority of our method over state-of-the-art approaches.
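The frequency-domain fact the abstract builds on is easy to verify numerically: translating a 2D field changes only its phase spectrum, while rescaling its intensity changes only its amplitude spectrum. The toy check below uses plain numpy on a random stand-in field and is illustrative only; it does not reproduce AlphaPre's networks.

```python
# Toy check of the intuition behind AlphaPre: shifting a 2D field (position
# change) alters only the phase spectrum, while rescaling its intensity
# alters only the amplitude spectrum.
import numpy as np

rng = np.random.default_rng(0)
field = rng.random((64, 64))                            # stand-in for one radar frame

shifted = np.roll(field, shift=(5, -3), axis=(0, 1))    # position change
scaled = 1.7 * field                                    # intensity change

def amp_phase(x):
    spec = np.fft.fft2(x)
    return np.abs(spec), np.angle(spec)

amp0, ph0 = amp_phase(field)
amp_s, ph_s = amp_phase(shifted)
amp_i, ph_i = amp_phase(scaled)

# Translation: amplitude spectrum unchanged, phase spectrum changes.
assert np.allclose(amp0, amp_s, atol=1e-8)
assert not np.allclose(ph0, ph_s, atol=1e-3)

# Intensity scaling: phase spectrum unchanged, amplitude scales linearly.
assert np.allclose(ph0, ph_i, atol=1e-8)
assert np.allclose(amp_i, 1.7 * amp0, atol=1e-8)
print("phase <-> position, amplitude <-> intensity: confirmed on a toy field")
```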
Poster
Yi Liu · Wengen Li · Jihong Guan · Shuigeng Zhou · Yichao Zhang

[ ExHall D ]

Abstract
Cloud removal (CR) remains a challenging task in remote sensing image processing. Though diffusion models (DMs) have achieved promising progress in image generation, their applications to CR are suboptimal, as they employ the vanilla DMs that generate cloudless images from pure noise, ignoring the valuable information in cloudy images. To overcome this drawback, we develop a new CR method EMRDM based on mean-reverting diffusion models (MRDMs) to establish a direct diffusion process between cloudy and cloudless images. Compared to current MRDMs, EMRDM offers a well-elucidated design space with a reformulated forward process and a new ordinary differential equation (ODE)-based backward process. We redesign key MRDM modules to boost CR performance, focusing on restructuring the denoiser and redesigning the training process via preconditioning techniques. We also introduce novel deterministic and stochastic samplers. Additionally, to support the multi-temporal CR task, we develop a denoising network for simultaneously denoising sequential images. We evaluate EMRDM on both mono-temporal and multi-temporal CR tasks. Extensive experiments on various datasets show that EMRDM achieves the state-of-the-art (SOTA) performance.
Poster
Jian Zhu · He Wang · Yang Xu · Zebin Wu · Zhihui Wei

[ ExHall D ]

Abstract
Hyperspectral and multispectral image (HSI-MSI) fusion involves combining a low-resolution hyperspectral image (LR-HSI) with a high-resolution multispectral image (HR-MSI) to generate a high-resolution hyperspectral image (HR-HSI). Most deep learning-based methods for HSI-MSI fusion rely on large amounts of hyperspectral data for supervised training, which is often scarce in practical applications. In this paper, we propose a self-learning Adaptive Residual Guided Subspace Diffusion Model (ARGS-Diff), which only utilizes the observed images without any extra training data. Specifically, as the LR-HSI contains spectral information and the HR-MSI contains spatial information, we design two lightweight spectral and spatial diffusion models to separately learn the spectral and spatial distributions from them. Then, we use these two models to reconstruct the HR-HSI from two low-dimensional components, i.e., the spectral basis and the reduced coefficient, during the reverse diffusion process. Furthermore, we introduce an Adaptive Residual Guided Module (ARGM), which refines the two components through a residual guided function at each sampling step, thereby stabilizing the sampling process. Extensive experimental results demonstrate that ARGS-Diff outperforms existing state-of-the-art methods in terms of both performance and computational efficiency in the field of HSI-MSI fusion.
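For intuition on the two low-dimensional components mentioned above, the sketch below approximates a hyperspectral cube as reduced coefficients times a spectral basis, here extracted with a plain truncated SVD on a synthetic cube; ARGS-Diff instead estimates the two factors with its spectral and spatial diffusion models, which this toy example does not reproduce.

```python
# Toy illustration of the subspace factorization used in HSI-MSI fusion:
# a cube X (H*W x B bands) is approximated as X ~ C @ S, with S (k x B)
# a spectral basis and C (H*W x k) the reduced coefficients.
import numpy as np

rng = np.random.default_rng(1)
H, W, B, k = 32, 32, 100, 6                 # illustrative sizes

# Synthesize a cube that truly lives in a k-dimensional spectral subspace.
true_basis = rng.normal(size=(k, B))
true_coeff = rng.random(size=(H * W, k))
cube = true_coeff @ true_basis + 0.01 * rng.normal(size=(H * W, B))

# Truncated SVD yields the spectral basis S and reduced coefficients C.
U, sing, Vt = np.linalg.svd(cube, full_matrices=False)
S = Vt[:k]                                  # (k, B)   spectral basis
C = U[:, :k] * sing[:k]                     # (H*W, k) reduced coefficients
recon = C @ S

rel_err = np.linalg.norm(recon - cube) / np.linalg.norm(cube)
print(f"rank-{k} reconstruction, relative error = {rel_err:.4f}")  # ~ noise level
```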
Poster
Xueyang Wang · Zhixin Zheng · Jiandong Shao · Yule Duan · Liang-Jian Deng

[ ExHall D ]

Abstract
Recent advancements in convolutional neural network (CNN)-based techniques for remote sensing pansharpening have markedly enhanced image quality. However, conventional convolutional modules in these methods have two critical drawbacks. First, the sampling positions in convolution operations are confined to a fixed square window. Second, the number of sampling points is preset and remains unchanged. Given the diverse object sizes in remote sensing images, these rigid parameters lead to suboptimal feature extraction. To overcome these limitations, we introduce an innovative convolutional module, Adaptive Rectangular Convolution (ARConv). ARConv adaptively learns both the height and width of the convolutional kernel and dynamically adjusts the number of sampling points based on the learned scale. This approach enables ARConv to effectively capture scale-specific features of various objects within an image, optimizing kernel sizes and sampling locations. Additionally, we propose ARNet, a network architecture in which ARConv is the primary convolutional module. Extensive evaluations across multiple datasets reveal the superiority of our method in enhancing pansharpening performance over previous techniques. Ablation studies and visualization further confirm the efficacy of ARConv. The source code will be available at github.
Poster
Guanyao Wu · Haoyu Liu · Hongming Fu · Yichuan Peng · Jinyuan Liu · Xin Fan · Risheng Liu

[ ExHall D ]

Abstract
Multi-modality image fusion, particularly infrared and visible image fusion, plays a crucial role in integrating diverse modalities to enhance scene understanding. Early research primarily focused on visual quality, yet challenges remain in preserving fine details, making it difficult to adapt to subsequent tasks. Recent approaches have shifted towards task-specific design, but struggle to achieve "the best of both worlds" due to inconsistent optimization goals. To address these issues, we propose a novel method that leverages the semantic knowledge from the Segment Anything Model (SAM) to Grow the quality of fusion results and Establish downstream task adaptability, namely SAGE. Specifically, we design a Semantic Persistent Attention (SPA) Module that efficiently maintains source information via a persistent repository while extracting high-level semantic priors from SAM. More importantly, to eliminate the impractical dependence on SAM during inference, we introduce a bi-level optimization-driven distillation mechanism with triplet losses, which allows the student network to effectively extract knowledge at the feature, pixel, and contrastive semantic levels, thereby removing reliance on the cumbersome SAM model. Extensive experiments show that our method achieves a balance between high-quality visual results and downstream task adaptability while maintaining practical deployment efficiency.
Poster
Donggoo Jung · DAEHYUN KIM · Guanghui Wang · Tae Hyun Kim

[ ExHall D ]

Abstract
Image exposure correction enhances images captured under diverse real-world conditions by addressing issues of under- and over-exposure, which can result in the loss of critical details and hinder content recognition. While significant advancements have been made, current methods often fail to achieve optimal feature learning for effective correction. To overcome these challenges, we propose Exposure-slot, a novel framework that integrates a prompt-based slot-in-slot attention mechanism to cluster exposed feature regions and learn exposure-centric features for each cluster. By extending the Slot Attention algorithm with a hierarchical structure, our approach progressively clusters features, enabling precise and region-aware correction. In particular, learnable prompts tailored to the exposure characteristics of each slot further enhance feature quality, adapting dynamically to varying conditions. Our method delivers superior performance on benchmark datasets, surpassing the current state-of-the-art with a PSNR improvement of over 1.85 dB on the SICE dataset and 0.4 dB on the LCDP dataset, thereby establishing a new benchmark for multi-exposure correction. The source code will be available upon publication.
Poster
Xin Liu · Jie Liu · Jie Tang · Gangshan Wu

[ ExHall D ]

Abstract
Transformer-based methods have demonstrated impressive performance in low-level visual tasks such as Image Super-Resolution (SR). However, their computational complexity grows quadratically with the spatial resolution. A series of works attempt to alleviate this problem by dividing Low-Resolution images into local windows, axial stripes, or dilated windows. SR typically leverages the redundancy of images for reconstruction, and this redundancy appears not only in local regions but also in long-range regions. However, these methods limit attention computation to content-agnostic local regions, directly limiting the ability of attention to capture long-range dependencies. To address these issues, we propose a lightweight Content-Aware Token Aggregation Network (CATANet). Specifically, we propose an efficient Content-Aware Token Aggregation module for aggregating long-range content-similar tokens, which shares token centers across all image tokens and updates them only during the training phase. Then we utilize intra-group self-attention to enable long-range information interaction. Moreover, we design an inter-group cross-attention to further enhance global information interaction. The experimental results show that, compared with the state-of-the-art cluster-based method SPIN, our method achieves superior performance, with a maximum PSNR improvement of 0.33dB and nearly double the inference speed.
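A minimal sketch of the grouping-plus-intra-group-attention idea: tokens are assigned to their most similar shared token center, and self-attention is masked so that it only operates within each group. The center update rule, the inter-group cross-attention, and CATANet's efficiency mechanisms are omitted; the module below and its sizes are illustrative assumptions, not the authors' implementation.

```python
# Sketch of content-aware token grouping followed by intra-group self-attention.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GroupedSelfAttention(nn.Module):
    def __init__(self, dim: int, num_centers: int = 8, num_heads: int = 4):
        super().__init__()
        # Token centers shared across all image tokens (learned during training).
        self.centers = nn.Parameter(torch.randn(num_centers, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, C)
        sim = F.normalize(tokens, dim=-1) @ F.normalize(self.centers, dim=-1).T
        group = sim.argmax(dim=-1)                        # (B, N) group id per token
        # Block attention between tokens that belong to different groups.
        mask = group.unsqueeze(2) != group.unsqueeze(1)   # (B, N, N), True = blocked
        mask = mask.repeat_interleave(self.attn.num_heads, dim=0)
        out, _ = self.attn(tokens, tokens, tokens, attn_mask=mask)
        return tokens + out


if __name__ == "__main__":
    layer = GroupedSelfAttention(dim=64)
    x = torch.randn(2, 256, 64)                # e.g. a 16x16 feature map as tokens
    print(layer(x).shape)                      # torch.Size([2, 256, 64])
```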
Poster
Yubin Gu · Yuan Meng · Jiayi Ji · Xiaoshuai Sun

[ ExHall D ]

Abstract
Image restoration (IR), a cornerstone of computer vision, has embarked on a new epoch with the advent of deep learning technologies. Recently, numerous CNN and Transformer-based methods have been developed, yet they frequently encounter limitations in global receptive fields and computational efficiency. To mitigate these challenges, recent studies have employed the Selective State Space Model (Mamba), which embodies both attributes. However, due to Mamba's inherent one-dimensional scanning limitations, some approaches have introduced multi-directional scanning to bolster inter-sequence correlations. Despite these enhancements, these methods still struggle with managing local pixel correlations across various directions. Moreover, the recursive computation in Mamba's SSM leads to reduced efficiency. To resolve these issues, we exploit the mathematical congruences between linear attention and the SSM within Mamba to propose a novel model based on a new design structure, ACL. This model integrates linear attention blocks in place of the SSM within Mamba, serving as the core component of the encoders/decoders, and aims to preserve a global perspective while boosting computational efficiency. Furthermore, we have designed a simple yet robust local enhancement module with multi-scale dilated convolutions to extract coarse and fine features to improve local detail recovery. Experimental results confirm that our ACL model excels in classical IR …
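The linear-attention formulation that ACL builds on can be written in a few lines: with a positive feature map, the (N x N) attention matrix is never materialized and the cost becomes linear in the number of tokens. The sketch below is the standard kernelized form with an elu+1 feature map, not the authors' exact ACL block.

```python
# Generic O(N) linear attention:
#   out_i = phi(q_i) @ (sum_j phi(k_j) v_j^T) / (phi(q_i) @ sum_j phi(k_j))
import torch
import torch.nn.functional as F


def linear_attention(q, k, v, eps: float = 1e-6):
    # q, k, v: (B, N, C)
    phi_q = F.elu(q) + 1.0                              # positive feature map
    phi_k = F.elu(k) + 1.0
    kv = torch.einsum("bnc,bnd->bcd", phi_k, v)         # (B, C, C): sum_j phi(k_j) v_j^T
    z = phi_k.sum(dim=1)                                # (B, C):    sum_j phi(k_j)
    num = torch.einsum("bnc,bcd->bnd", phi_q, kv)       # (B, N, C)
    den = torch.einsum("bnc,bc->bn", phi_q, z)          # (B, N)
    return num / (den.unsqueeze(-1) + eps)


if __name__ == "__main__":
    q = torch.randn(2, 1024, 64)
    print(linear_attention(q, q, q).shape)              # torch.Size([2, 1024, 64])
```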
Poster
Tong Li · Lizhi Wang · Zhiyuan Xu · Lin Zhu · Wanxuan Lu · Hua Huang

[ ExHall D ]

Abstract
Image denoising enhances image quality, serving as a foundational technique across various computational photography applications. The obstacle to clean image acquisition in real scenarios necessitates the development of self-supervised image denoising methods that depend only on noisy images, especially a single noisy image. Existing self-supervised image denoising paradigms (Noise2Noise and Noise2Void) rely heavily on information-lossy operations, such as downsampling and masking, culminating in low-quality denoising performance. In this paper, we propose a novel self-supervised single image denoising paradigm, Positive2Negative, to break the information-lossy barrier. Our paradigm involves two key steps: Renoised Data Construction (RDC) and Denoised Consistency Supervision (DCS). RDC renoises the predicted denoised image with the predicted noise to construct multiple noisy images, preserving all the information of the original image. DCS ensures consistency across the multiple denoised images, supervising the network to learn robust denoising. Our Positive2Negative paradigm achieves state-of-the-art performance in self-supervised single image denoising with significant speed improvements. The code is released to the public at https://anonymous.4open.science/r/P2N-4C8E.
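A schematic sketch of the two steps under a simple additive-noise assumption: RDC re-adds a re-randomized copy of the predicted noise to the predicted clean image, and DCS penalizes disagreement between the resulting denoised outputs. The toy denoiser, the pixel-permutation renoising, and the exact consistency loss below are illustrative assumptions; the paper's operators may differ.

```python
# Schematic sketch of the Positive2Negative idea (illustrative only).
import torch
import torch.nn as nn


class ToyDenoiser(nn.Module):                  # stand-in for the real network
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 1, 3, padding=1))

    def forward(self, y):
        return y - self.net(y)                 # predict the noise residually


def positive2negative_loss(model, noisy, num_views: int = 2):
    denoised = model(noisy)
    pred_noise = noisy - denoised              # predicted noise realization
    cons = 0.0
    for _ in range(num_views):
        # Renoised Data Construction: re-randomize the predicted noise by a
        # spatial permutation (an assumption, not the paper's exact operator).
        perm = torch.randperm(pred_noise[0].numel(), device=noisy.device)
        renoise = pred_noise.flatten(1)[:, perm].view_as(pred_noise)
        renoised = denoised.detach() + renoise.detach()
        # Denoised Consistency Supervision: denoised outputs should agree.
        cons = cons + (model(renoised) - denoised).pow(2).mean()
    return cons / num_views


if __name__ == "__main__":
    model = ToyDenoiser()
    y = torch.rand(4, 1, 32, 32) + 0.1 * torch.randn(4, 1, 32, 32)
    loss = positive2negative_loss(model, y)
    loss.backward()
    print(float(loss))
```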
Poster
Chen Zhao · Zhizhou Chen · Yunzhe Xu · Enxuan Gu · Jian Li · Zili Yi · qian Wang · Jian Yang · Ying Tai

[ ExHall D ]

Abstract
Ultra-high-definition (UHD) image restoration faces significant challenges due to its high resolution, complex content, and intricate details. To cope with these challenges, we analyze the restoration process in depth through a progressive spectral perspective, and deconstruct the complex UHD restoration problem into three progressive stages: zero-frequency enhancement, low-frequency restoration, and high-frequency refinement. Building on this insight, we propose a novel framework, ERR, which comprises three collaborative sub-networks: the zero-frequency enhancer (ZFE), the low-frequency restorer (LFR), and the high-frequency refiner (HFR). Specifically, the ZFE integrates global priors to learn global mapping, while the LFR restores low-frequency information, emphasizing reconstruction of coarse-grained content. Finally, the HFR employs our designed frequency-windowed Kolmogorov-Arnold Networks (FW-KAN) to refine textures and details, producing high-quality image restoration. Our approach significantly outperforms previous UHD methods across various tasks, with extensive ablation studies validating the effectiveness of each component.
Poster
Muhammad Jamal Jamal · Omid Mohareri

[ ExHall D ]

Abstract
In this paper, we propose a new progressive pre-training method for image understanding tasks that leverages RGB-D datasets. The method utilizes Multi-Modal Contrastive Masked Autoencoder and Denoising techniques. Our proposed approach consists of two stages. In the first stage, we pre-train the model using contrastive learning to learn cross-modal representations. In the second stage, we further pre-train the model using masked autoencoding and the denoising/noise prediction used in diffusion models. Masked autoencoding focuses on reconstructing the missing patches in the input modality using local spatial correlations, while denoising learns high-frequency components of the input data. Moreover, we incorporate global distillation in the second stage by leveraging the knowledge acquired in stage one. Our approach is scalable, robust and suitable for pre-training on RGB-D datasets. Extensive experiments on multiple datasets such as ScanNet, NYUv2 and SUN RGB-D show the efficacy and superior performance of our approach. Specifically, we show an improvement of +1.3% mIoU over Mask3D on ScanNet semantic segmentation. We further demonstrate the effectiveness of our approach in the low-data regime by evaluating it on the semantic segmentation task against the state-of-the-art methods.
Poster
MinKyu Lee · Sangeek Hyun · Woojin Jun · Jae-Pil Heo

[ ExHall D ]

Abstract
This work tackles the fidelity objective in perceptual super-resolution (SR). Specifically, we address the shortcomings of the pixel-level L_p loss (L_pix) in the GAN-based SR framework. Since L_pix is known to trade off against perceptual quality, prior methods often multiply it by a small scale factor or apply low-pass filters. However, this work shows that these circumventions fail to address the fundamental factor that induces blurring. Accordingly, we focus on two points: 1) precisely discriminating the subcomponent of L_pix that contributes to blurring, and 2) guiding only with the factor that is free from this trade-off. We show that both can be achieved in a surprisingly simple manner, with an Auto-Encoder (AE) pretrained with L_pix. Accordingly, we propose the Auto-Encoded Supervision for Optimal Penalization loss (L_AESOP), a novel loss function that measures distance in the AE space instead of the raw pixel space. (AE space here denotes the space after the decoder, not the bottleneck.) By simply substituting the conventional L_pix with L_AESOP, we can provide effective reconstruction guidance without compromising perceptual quality. Designed for simplicity, our method enables easy integration into existing SR frameworks. Experimental results verify the significance of our method in enhancing both fidelity and perceptual quality.
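A minimal sketch of measuring the fidelity term in the AE output space (after the decoder) rather than in raw pixels follows. The tiny autoencoder stands in for an AE pretrained with L_pix; its architecture and the L1 distance are illustrative choices, not the paper's exact configuration.

```python
# Sketch of an AE-space fidelity loss: compare SR and HR images after passing
# both through a frozen pretrained autoencoder (encoder + decoder).
import torch
import torch.nn as nn


class TinyAE(nn.Module):                      # stand-in for the pretrained AE
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                                 nn.Conv2d(32, 64, 3, stride=2, padding=1))
        self.dec = nn.Sequential(nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
                                 nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1))

    def forward(self, x):
        return self.dec(self.enc(x))


def aesop_style_loss(ae: nn.Module, sr: torch.Tensor, hr: torch.Tensor) -> torch.Tensor:
    # Distance after the (frozen) decoder, not at the bottleneck and not in
    # raw pixels; gradients flow to the SR output only.
    with torch.no_grad():
        target = ae(hr)
    return nn.functional.l1_loss(ae(sr), target)


if __name__ == "__main__":
    ae = TinyAE().eval()
    for p in ae.parameters():
        p.requires_grad_(False)
    sr = torch.rand(2, 3, 64, 64, requires_grad=True)
    hr = torch.rand(2, 3, 64, 64)
    loss = aesop_style_loss(ae, sr, hr)
    loss.backward()
    print(float(loss), sr.grad.shape)
```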
Poster
I-Hsiang (Aaron) Chen · Wei-Ting Chen · Yu-Wei Liu · Yuan-Chun Chiang · Sy-Yen Kuo · Ming-Hsuan Yang

[ ExHall D ]

Abstract
Image restoration aims to recover content from inputs degraded by various factors, such as adverse weather, blur, and noise. Perceptual Image Restoration (PIR) methods improve visual quality but often do not support downstream tasks effectively. On the other hand, Task-oriented Image Restoration (TIR) methods focus on enhancing image utility for high-level vision tasks, sometimes compromising visual quality. This paper introduces UniRestore, a unified image restoration model that bridges the gap between PIR and TIR by using a diffusion prior. The diffusion prior is designed to generate images that align with human visual quality preferences, but these images are often unsuitable for TIR scenarios. To address this limitation, UniRestore utilizes encoder features from an autoencoder to adapt the diffusion prior to specific tasks. We propose a Complementary Feature Restoration Module (CFRM) to reconstruct degraded encoder features and a Task Feature Adapter (TFA) module to facilitate adaptive feature fusion in the decoder. This design allows UniRestore to optimize images for both human perception and downstream task requirements, addressing discrepancies between visual quality and functional needs. Integrating these modules also enhances UniRestore’s adaptability and efficiency across diverse tasks. Extensive experiments demonstrate the superior performance of UniRestore in both PIR and TIR scenarios.
Poster
Leheng Zhang · Weiyi You · Kexuan Shi · Shuhang Gu

[ ExHall D ]

Abstract
Diffusion-based image super-resolution methods have demonstrated significant advantages over GAN-based approaches, particularly in terms of perceptual quality. Building upon a lengthy Markov chain, diffusion-based methods possess remarkable modeling capacity, enabling them to achieve outstanding performance in real-world scenarios. Unlike previous methods that focus on modifying the noise schedule or sampling process to enhance performance, our approach emphasizes the improved utilization of LR information. We find that different regions of the LR image can be viewed as corresponding to different timesteps in a diffusion process, where flat areas are closer to the target HR distribution but edge and texture regions are farther away. In these flat areas, applying only slight noise is more advantageous for reconstruction. We associate this characteristic with uncertainty and propose to apply uncertainty estimation to guide region-specific noise level control, a technique we refer to as Uncertainty-guided Noise Weighting. Pixels with lower uncertainty (i.e., flat regions) receive reduced noise to preserve more LR information, therefore improving performance. Furthermore, we modify the network architecture of previous methods to develop our Uncertainty-guided Perturbation Super-Resolution (UPSR) model. Extensive experimental results demonstrate that, despite reduced model size and training overhead, the proposed UPSR method outperforms current state-of-the-art methods across various …
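A schematic sketch of region-wise noise weighting: flat regions receive less added noise than edge and texture regions. The windowed standard deviation used as the "uncertainty" proxy and the linear mapping to a per-pixel noise scale are assumptions for illustration, not UPSR's learned estimator.

```python
# Sketch of uncertainty-guided, per-pixel noise weighting on an LR image.
import torch
import torch.nn.functional as F


def local_std(x: torch.Tensor, k: int = 7) -> torch.Tensor:
    # x: (B, 1, H, W); windowed standard deviation as a crude uncertainty map.
    mean = F.avg_pool2d(x, k, stride=1, padding=k // 2)
    sq_mean = F.avg_pool2d(x * x, k, stride=1, padding=k // 2)
    return (sq_mean - mean * mean).clamp_min(0).sqrt()


def uncertainty_weighted_noise(lr: torch.Tensor, base_sigma: float = 0.5,
                               min_frac: float = 0.3) -> torch.Tensor:
    u = local_std(lr)
    u = u / (u.amax(dim=(-2, -1), keepdim=True) + 1e-8)      # normalize to [0, 1]
    sigma = base_sigma * (min_frac + (1.0 - min_frac) * u)   # flat areas -> less noise
    return lr + sigma * torch.randn_like(lr)


if __name__ == "__main__":
    lr = torch.rand(1, 1, 64, 64)
    print(uncertainty_weighted_noise(lr).shape)              # torch.Size([1, 1, 64, 64])
```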
Poster
Wenhao Shen · Mingliang Zhou · Yu Chen · Xuekai WEI · Yong Feng · Huayan Pu · Weijia Jia

[ ExHall D ]

Abstract
Existing full-reference image quality assessment (FR-IQA) methods often fail to capture the complex causal mechanisms that underlie human perceptual responses to image distortions, limiting their ability to generalize across diverse scenarios. In this paper, we propose an FR-IQA method based on abductive counterfactual inference to investigate the causal relationships between deep network features and perceptual distortions. First, we explore the causal effects of deep features on perception and integrate causal reasoning with feature comparison, constructing a model that effectively handles complex distortion types across different IQA scenarios. Second, the analysis of the perceptual causal correlations of our proposed method is independent of the backbone architecture and thus can be applied to a variety of deep networks. Through abductive counterfactual experiments, we validate the proposed causal relationships, confirming the model's superior perceptual relevance and interpretability of quality scores. The experimental results demonstrate the robustness and effectiveness of the method, providing competitive quality predictions across multiple benchmarks. The source code is available at https://anonymous.4open.science/r/DeepCausalQuality-25BC.
Poster
Chen Liao · Yan Shen · Dan Li · Zhongli Wang

[ ExHall D ]

Abstract
Recently, Deep Unfolding Networks (DUNs) have achieved impressive reconstruction quality in the field of image Compressive Sensing (CS) by unfolding iterative optimization algorithms into neural networks. The reconstruction quality of DUNs depends on the learned prior knowledge, so introducing stronger prior knowledge can further improve reconstruction quality. On the other hand, pre-trained diffusion models contain powerful prior knowledge and have a solid theoretical foundation and strong scalability, but they require a large number of iterative steps to achieve reconstruction. In this paper, we propose to use the powerful prior knowledge of pre-trained diffusion models in DUNs to achieve high-quality reconstruction with fewer steps for image CS. Specifically, we first design an iterative optimization algorithm named Diffusion Message Passing (DMP), which embeds a pre-trained diffusion model into each iteration of DMP. Then, we deeply unfold the DMP algorithm into a neural network named DMP-DUN. The proposed DMP-DUN can use lightweight neural networks to achieve the mapping from measurement data to the intermediate steps of the reverse diffusion process and directly approximate the divergence of the diffusion model, thereby further improving reconstruction efficiency. Extensive experiments show that our proposed DMP-DUN achieves state-of-the-art performance and requires as few as 2 steps to reconstruct …
Poster
Zhiyuan Chen · Keyi Li · Yifan Jia · Le Ye · Yufei Ma

[ ExHall D ]

Abstract
Diffusion transformer (DiT) models have achieved remarkable success in image generation, thanks to their exceptional generative capabilities and scalability. Nonetheless, the iterative nature of diffusion models (DMs) results in high computational complexity, posing challenges for deployment. Although existing cache-based acceleration methods try to exploit the inherent temporal similarity to skip redundant computations in DiT, the lack of correction may induce potential quality degradation. In this paper, we propose increment-calibrated caching, a training-free method for DiT acceleration, where the calibration parameters are generated from the pre-trained model itself with low-rank approximation. To deal with the possible correction failure arising from outlier activations, we introduce channel-aware Singular Value Decomposition (SVD), which further strengthens the calibration effect. Experimental results show that our method consistently achieves better performance than existing naive caching methods with a similar computation resource budget. For 35-step DDIM, our method eliminates more than 45% of the computation and improves IS by 12 at the cost of a FID increase of less than 0.06.
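A toy sketch of the general pattern for one linear sub-block: reuse the cached output from the previous timestep and correct it with a cheap rank-r increment whose factors come from an SVD of the pretrained weight itself. The synthetic weight, the drift model for activations, and the plain (not channel-aware) SVD are illustrative assumptions, not the paper's calibration procedure.

```python
# Toy illustration: naive cache reuse vs. cache + low-rank increment calibration.
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 32

# Stand-in "pretrained" weight with a decaying spectrum (typical of trained layers).
A = rng.normal(size=(d, d))
U0, _, Vt0 = np.linalg.svd(A)
W = (U0 * (1.0 / np.arange(1, d + 1))) @ Vt0

# Rank-r calibration factors derived from the weight itself via truncated SVD.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
W_r = (U[:, :r] * s[:r]) @ Vt[:r]

x_prev = rng.normal(size=d)                    # activation at the previous timestep
x_curr = x_prev + 0.05 * rng.normal(size=d)    # activations drift slowly across timesteps

y_prev = W @ x_prev                            # cached output from the previous timestep
y_exact = W @ x_curr                           # what full recomputation would give
y_naive = y_prev                               # naive cache reuse (skip, no correction)
y_calib = y_prev + W_r @ (x_curr - x_prev)     # skip + cheap low-rank increment

rel = lambda y: np.linalg.norm(y - y_exact) / np.linalg.norm(y_exact)
print(f"naive cache error: {rel(y_naive):.3f}   calibrated: {rel(y_calib):.3f}")
```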
Poster
Ping Chen · Xingpeng Zhang · Zhaoxiang Liu · Huan Hu · Xiang Liu · Kai Wang · Min Wang · Yanlin Qian · Shiguo Lian

[ ExHall D ]

Abstract
In this research, we propose a novel denoising diffusion model based on shortest-path modeling that optimizes residual propagation to enhance both denoising efficiency and quality. Drawing on Denoising Diffusion Implicit Models (DDIM) and insights from graph theory, our model, termed the Shortest Path Diffusion Model (ShortDF), treats the denoising process as a shortest-path problem aimed at minimizing reconstruction error. By optimizing the initial residuals, we improve the efficiency of the reverse diffusion process and the quality of the generated samples. Extensive experiments on multiple standard benchmarks demonstrate that ShortDF significantly reduces diffusion time (or steps) while enhancing the visual fidelity of generated samples compared to prior methods. We believe this work paves the way for interactive diffusion-based applications and establishes a foundation for rapid data generation.
Poster
Kendong Liu · Zhiyu Zhu · Hui LIU · Junhui Hou

[ ExHall D ]

Abstract
We present Acc3D to tackle the challenge of accelerating the diffusion process for generating 3D models from single images. To derive accurate reconstructions through few-step inference, we identify the modeling of the score function at the endpoints (i.e., states of pure random noise) as the critical issue. To address it, we propose edge consistency, i.e., consistent predictions across the low signal-to-noise-ratio region, to enhance a pre-trained diffusion model, enabling a distillation-based refinement of the endpoint score function. Building on these distilled diffusion models, we introduce an adversarial augmentation strategy to further enrich generation detail. The two modules complement each other, mutually reinforcing to elevate generative performance. Extensive experiments show that our Acc3D not only achieves over a 20× increase in computational efficiency but also yields notable quality improvements compared with state-of-the-art methods. Project webpage: https://acc3d-object.github.io/
Poster
Fanhu Zeng · Hao Tang · Yihua Shao · Siyu Chen · Ling Shao · Yan Wang

[ ExHall D ]

Abstract
A high-performance image compression algorithm is crucial for real-time information transmission across numerous fields. Despite rapid progress in image compression, computational inefficiency and poor redundancy modeling still pose significant bottlenecks, limiting practical applications. Inspired by the effectiveness of state space models (SSMs) in capturing long-range dependencies, we leverage SSMs to address computational inefficiency in existing methods and improve image compression from multiple perspectives. In this paper, we systematically analyze the advantages of SSMs for better integration and propose an enhanced image compression approach through refined context modeling, which we term MambaIC. Specifically, we explore context modeling to adaptively refine the representation of hidden states. Additionally, we introduce window-based local attention into channel-spatial entropy modeling to reduce potential spatial redundancy during compression, thereby increasing efficiency. Comprehensive qualitative and quantitative results validate the effectiveness and efficiency of our approach, particularly for high-resolution image compression. Code will be made publicly available.
Poster
Jinchang Xu · Shaokang Wang · Jintao Chen · Zhe Li · Peidong Jia · Fei Zhao · Guoqing Xiang · Zhijian Hao · Shanghang Zhang · Xiaodong Xie

[ ExHall D ]

Abstract
Leveraging the generative power of diffusion models, generative image compression has achieved impressive perceptual fidelity even at extremely low bitrates. However, current methods often neglect the non-uniform complexity of images, limiting their ability to balance global perceptual quality with local texture consistency and to allocate coding resources efficiently. To address this, we introduce the Map-guided Masking Realism Image Diffusion Codec (MRIDC), designed to optimize the trade-off between local distortion and global perceptual quality in extremely low bitrate compression. MRIDC integrates a vector-quantized image encoder with a diffusion-based decoder. On the encoding side, we propose a Map-guided Latent Masking (MLM) module, which selectively masks elements in the latent space based on prior information, allowing adaptive resource allocation aligned with image complexity. On the decoding side, masked latents are completed using the Bidirectional Prediction Controllable Generation (BPCG) module, which guides the constrained generation process within the diffusion model to reconstruct the image. Experimental results show that MRIDC achieves state-of-the-art perceptual compression quality at extremely low bitrates, effectively preserving feature consistency in key regions and advancing the rate-distortion-perception performance curve, establishing new benchmarks in balancing compression efficiency with visual fidelity.
Poster
Emiel Hoogeboom · Thomas Mensink · Jonathan Heek · Kay Lamerigts · Ruiqi Gao · Tim Salimans

[ ExHall D ]

Abstract
Latent diffusion models have become the popular choice for scaling up diffusion models for high resolution image synthesis. Compared to pixel-space models that are trained end-to-end, latent models are perceived to be more efficient and to produce higher image quality at high resolution. Here we challenge these notions, and show that pixel-space models can in fact be very competitive to latent approaches both in quality and efficiency, achieving 1.5 FID on ImageNet512 and new SOTA results on ImageNet128, ImageNet256 and Kinetics600. We present a simple recipe for scaling end-to-end pixel-space diffusion models to high resolutions. 1: Use the sigmoid loss (Kingma and Gao, 2023) with our prescribed hyper-parameters. 2: Use our simplified memory-efficient architecture with fewer skip-connections. 3: Scale the model to favor processing the image at high resolution with fewer parameters, rather than using more parameters but at a lower resolution. When combining these three steps with recently proposed tricks like guidance intervals, we obtain a family of pixel-space diffusion models we call Simpler Diffusion (SiD2).
Poster
Haoran You · Connelly Barnes · Yuqian Zhou · Yan Kang · Zhenbang Du · Wei Zhou · Lingzhi Zhang · Yotam Nitzan · Xiaoyang Liu · Zhe Lin · Eli Shechtman · Sohrab Amirghodsi · Yingyan (Celine) Lin

[ ExHall D ]

Abstract
Diffusion Transformers (DiTs) have achieved state-of-the-art (SOTA) image generation quality but suffer from high latency and memory inefficiency, making them difficult to deploy on resource-constrained devices. One key efficiency bottleneck is that existing DiTs apply equal computation across all regions of an image. However, not all image tokens are equally important, and certain localized areas require more computation, such as objects. To address this, we propose DiffRatio-MoD, a dynamic DiT inference framework with differentiable compression ratios, which automatically learns to dynamically route computation across layers and timesteps for each image token, resulting in Mixture-of-Depths (MoD) efficient DiT models. Specifically, DiffRatio-MoD integrates three features: (1) a token-level routing scheme where each DiT layer includes a router that is jointly fine-tuned with model weights to predict token importance scores. In this way, unimportant tokens bypass the entire layer's computation; (2) a layer-wise differentiable ratio mechanism where different DiT layers automatically learn varying compression ratios from a zero initialization, resulting in large compression ratios in redundant layers while others remain less compressed or even uncompressed; (3) a timestep-wise differentiable ratio mechanism where each denoising timestep learns its own compression ratio. The resulting pattern shows higher ratios for noisier timesteps and lower ratios as the …
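A minimal sketch of token-level Mixture-of-Depths routing: a per-layer router scores tokens, only the top fraction passes through the expensive block, and the rest bypass it on the identity path; multiplying the block output by a sigmoid of the router score keeps the router trainable despite the hard top-k selection. The layer-wise and timestep-wise differentiable ratios are omitted; the fixed keep ratio and block design below are illustrative assumptions.

```python
# Sketch of a Mixture-of-Depths layer with a token-level router.
import torch
import torch.nn as nn


class MoDLayer(nn.Module):
    def __init__(self, dim: int, keep_ratio: float = 0.5):
        super().__init__()
        self.router = nn.Linear(dim, 1)
        self.block = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                   nn.GELU(), nn.Linear(4 * dim, dim))
        self.keep_ratio = keep_ratio

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C)
        B, N, C = x.shape
        scores = self.router(x).squeeze(-1)                     # (B, N) token importance
        k = max(1, int(self.keep_ratio * N))
        top = scores.topk(k, dim=1).indices                     # tokens routed into the block
        idx = top.unsqueeze(-1).expand(-1, -1, C)
        picked = torch.gather(x, 1, idx)
        gate = torch.sigmoid(torch.gather(scores, 1, top)).unsqueeze(-1)
        update = gate * self.block(picked)                      # gate keeps the router trainable
        return x + torch.zeros_like(x).scatter(1, idx, update)  # bypassed tokens stay identity


if __name__ == "__main__":
    layer = MoDLayer(dim=64)
    x = torch.randn(2, 256, 64)
    y = layer(x)
    y.sum().backward()                     # router receives gradient through the gate
    print(y.shape)                         # torch.Size([2, 256, 64])
```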
Poster
Haipeng Fang · Sheng Tang · Juan Cao · Enshuo Zhang · Fan Tang · Tong-yee Lee

[ ExHall D ]

Abstract
Diffusion transformers have shown exceptional performance in visual generation but are accompanied by high computational costs. Token reduction techniques that compress models by sharing the denoising process among similar tokens have been introduced. However, existing approaches neglect the denoising priors of the diffusion models, leading to suboptimal acceleration and diminished image quality. This study proposes a novel concept: attend to prune feature redundancies in areas not attended by the diffusion process. We analyze the location and degree of feature redundancies based on the structure-then-detail denoising priors. Subsequently, we introduce SDTM, a structure-then-detail token merging approach that dynamically compresses feature redundancies. Specifically, we design dynamic visual token merging, compression ratio adjusting, and prompt reweighting for different stages. Served in a post-training way, the proposed method can be integrated seamlessly into DiT architecture. Extensive experiments across various backbones, schedulers, and datasets showcase the superiority of our method, which achieves 1.55 times acceleration with negligible impact on image quality.
Poster
Longquan Dai · He Wang · Jinhui Tang

[ ExHall D ]

Abstract
In training-free conditional generative tasks, diffusion models utilize differentiable loss functions to steer the generative reverse process, necessitating modifications to sampling algorithms like DDPM and DDIM. However, such adjustments likely reduce flexibility and reliability. In this paper, we propose NoiseCtrl, a sampling-algorithm-agnostic technique for controlled image generation. Essentially, diffusion models generate the denoised result z_{t-1} by adding random noise ε_t to a predicted mean μ_t. NoiseCtrl specifically adjusts the random noise while leaving the underlying sampling algorithms unchanged. At each step t, NoiseCtrl converts the unconditional Gaussian noise into conditional noise ε_t by substituting the isotropic Gaussian distribution with the von Mises–Fisher distribution. This substitution introduces a directional focus while preserving the randomness required for conditional image generation. Thanks to this non-intrusive design, NoiseCtrl is straightforward to integrate and has been extensively validated through experiments, demonstrating its adaptability to different diffusion algorithms and superior performance across various conditional generation tasks.
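The substitution can be sketched as follows: draw the direction of ε_t from a von Mises-Fisher distribution centered on a chosen mean direction (via Wood's 1994 rejection sampler) instead of drawing an isotropic Gaussian, then plug it into the usual z_{t-1} = μ_t + σ_t ε_t update. The choice of mean direction (a guidance gradient here), the concentration κ, and the matching of the noise magnitude to a Gaussian draw are illustrative assumptions, not NoiseCtrl's exact procedure.

```python
# Sketch: replace isotropic Gaussian noise with vMF-directed noise at one step.
import numpy as np

rng = np.random.default_rng(0)


def sample_vmf(mean_dir: np.ndarray, kappa: float) -> np.ndarray:
    """Draw one unit vector from vMF(mean_dir, kappa) in d dimensions (Wood, 1994)."""
    d = mean_dir.size
    b = (d - 1) / (2 * kappa + np.sqrt(4 * kappa**2 + (d - 1) ** 2))
    x0 = (1 - b) / (1 + b)
    c = kappa * x0 + (d - 1) * np.log(1 - x0**2)
    while True:                                     # rejection-sample the cosine w
        z = rng.beta((d - 1) / 2, (d - 1) / 2)
        w = (1 - (1 + b) * z) / (1 - (1 - b) * z)
        if kappa * w + (d - 1) * np.log(1 - x0 * w) - c >= np.log(rng.uniform()):
            break
    v = rng.normal(size=d)                          # random direction orthogonal to mean_dir
    v -= (v @ mean_dir) * mean_dir
    v /= np.linalg.norm(v)
    return w * mean_dir + np.sqrt(1 - w**2) * v


d = 3 * 64 * 64                                     # flattened image dimension
mu_t = rng.normal(size=d)                           # predicted mean (stand-in)
sigma_t = 0.1
guidance = rng.normal(size=d)                       # e.g. gradient of a guidance loss
mean_dir = guidance / np.linalg.norm(guidance)
kappa = 2.0 * d                                     # larger kappa -> stronger directional focus

eps_gauss = rng.normal(size=d)                      # unconditional DDPM noise
eps_vmf = np.linalg.norm(eps_gauss) * sample_vmf(mean_dir, kappa)

z_prev = mu_t + sigma_t * eps_vmf                   # conditional denoising step
print("cosine(eps_vmf, guidance) =", float(eps_vmf @ mean_dir / np.linalg.norm(eps_vmf)))
```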
Poster
Yunpeng Liu · Boxiao Liu · Yi Zhang · Xingzhong Hou · Guanglu Song · Yu Liu · Haihang You

[ ExHall D ]

Abstract
Significant advances have been made in the sampling efficiency of diffusion and flow matching models, driven by Consistency Distillation (CD), which trains a student model to mimic the output of a teacher model at a later timestep. However, we found that the knowledge discrepancy between student and teacher varies significantly across different timesteps, leading to suboptimal performance in CD. To address this issue, we propose the Curriculum Consistency Model (CCM), which stabilizes and balances the knowledge discrepancy across timesteps. Specifically, we regard the distillation process at each timestep as a curriculum and introduce a metric based on the Peak Signal-to-Noise Ratio (PSNR) to quantify the knowledge discrepancy of this curriculum, then ensure that the curriculum maintains consistent knowledge discrepancy across different timesteps by having the teacher model iterate more steps when the noise intensity is low. Our method achieves competitive single-step sampling Fréchet Inception Distance (FID) scores of 1.64 on CIFAR-10 and 2.18 on ImageNet 64x64. Moreover, we have extended our method to large-scale text-to-image models and confirmed that it generalizes well to both diffusion models (Stable Diffusion XL) and flow matching models (Stable Diffusion 3). The generated samples demonstrate improved image-text alignment and semantic structure since CCM enlarges the distillation step at …
Poster
Huiyang Shao · Xin Xia · Yuhong Yang · Ren Yuxi · XING WANG · Xuefeng Xiao

[ ExHall D ]

Abstract
Diffusion models have achieved remarkable success across various domains. However, their slow generation speed remains a critical challenge. Existing acceleration methods, while aiming to reduce steps, often compromise sample quality, controllability, or introduce training complexities. Therefore, we propose RayFlow, a novel diffusion framework that addresses these limitations. Unlike previous methods, RayFlow guides each sample along a unique path towards an instance-specific target distribution. This method maximizes the reduction of sampling steps while preserving generation diversity and stability. Furthermore, we introduce Time Sampler, an importance sampling technique to enhance training efficiency by focusing on crucial timesteps. Extensive experiments demonstrate RayFlow's superiority in generating high-quality images with improved speed, control, and training efficiency compared to existing acceleration techniques.
Poster
Pingyu Wu · Kai Zhu · Yu Liu · Liming Zhao · Wei Zhai · Yang Cao · Zheng-Jun Zha

[ ExHall D ]

Abstract
Variational Autoencoder (VAE) aims to compress pixel data into low-dimensional latent space, playing an important role in OpenAI's Sora and other latent video diffusion generation models. While most existing video VAEs inflate a pre-trained image VAE into the 3D causal structure for temporal-spatial compression, this paper presents two astonishing findings: (1) The initialization from a well-trained image VAE with the same latent dimensions suppresses the improvement of subsequent temporal compression capabilities. (2) The adoption of causal reasoning leads to unequal information interactions and unbalanced performance between frames. To alleviate these problems, we propose a keyframe-based temporal compression (KTC) architecture and a group causal convolution (GCConv) module to further improve video VAE (IV-VAE). Specifically, the KTC architecture divides the latent space into two branches, in which one half completely inherits the compression prior of keyframes from a lower-dimension image VAE while the other half involves temporal compression through 3D group causal convolution, reducing temporal-spatial conflicts and accelerating the convergence speed of video VAE. The GCConv in the above 3D half uses standard convolution within each frame group to ensure inter-frame equivalence, and employs causal logical padding between groups to maintain flexibility in processing variable frame video. Extensive experiments on five benchmarks …
Poster
Maosen Zhao · Pengtao Chen · Chong Yu · Yan Wen · Xudong Tan · Tao Chen

[ ExHall D ]

Abstract
Model quantization reduces the bit-width of weights and activations, improving memory efficiency and inference speed in diffusion models. However, achieving 4-bit quantization remains challenging. Existing methods, primarily based on integer quantization and post-training quantization fine-tuning, struggle with inconsistent performance. Inspired by the success of floating-point (FP) quantization in large language models, we explore low-bit FP quantization for diffusion models and identify key challenges: the failure of signed FP quantization to handle asymmetric activation distributions, the insufficient consideration of temporal complexity in the denoising process during fine-tuning, and the misalignment between fine-tuning loss and quantization error. To address these challenges, we propose the mixup-sign floating-point quantization (MSFP) framework, first introducing unsigned FP quantization in model quantization, along with timestep-aware LoRA (TALoRA) and denoising-factor loss alignment (DFA), which ensure precise and stable fine-tuning. Extensive experiments show that we are the first to achieve superior performance in 4-bit FP quantization for diffusion models, outperforming existing PTQ fine-tuning methods in 4-bit INT quantization. Our code will be publicly available soon.
Poster
Gongfan Fang · Kunjun Li · Xinyin Ma · Xinchao Wang

[ ExHall D ]

Abstract
Diffusion Transformers have demonstrated remarkable capabilities in image generation but often come with excessive parameterization, resulting in considerable inference overhead in real-world applications. In this work, we present TinyFusion, a depth pruning method designed to remove redundant layers from diffusion transformers via end-to-end learning. The core principle of our approach is to create a pruned model with high recoverability, allowing it to regain strong performance after fine-tuning. To accomplish this, we introduce a differentiable sampling technique to make pruning learnable, paired with a co-optimized parameter to simulate future fine-tuning. While prior works focus on minimizing loss or error after pruning, our method explicitly models and optimizes the post-fine-tuning performance of pruned models. Experimental results indicate that this learnable paradigm offers substantial benefits for layer pruning of diffusion transformers, surpassing existing importance-based and error-based methods. Additionally, TinyFusion exhibits strong generalization across diverse architectures, such as DiTs, MARs, and SiTs. Experiments with DiT-XL show that TinyFusion can craft a shallow diffusion transformer at less than 7% of the pre-training cost, achieving a 2× speedup with an FID score of 2.86, outperforming competitors with comparable efficiency.
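A minimal sketch of making layer removal learnable: each block gets a keep/drop gate drawn with hard Gumbel-Softmax (straight-through gradients), so the set of retained layers is optimized end-to-end, optionally under a sparsity penalty on the expected number of kept layers. The co-optimized parameters that simulate future fine-tuning in TinyFusion are omitted; the gating scheme below is a generic illustration, not the paper's exact sampler.

```python
# Sketch of differentiable depth pruning with Gumbel-Softmax keep/drop gates.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PrunableStack(nn.Module):
    def __init__(self, dim: int, depth: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU())
            for _ in range(depth))
        # Two logits per block: [drop, keep]; learned jointly with the weights.
        self.gate_logits = nn.Parameter(torch.zeros(depth, 2))

    def forward(self, x: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
        for i, block in enumerate(self.blocks):
            # Hard one-hot sample with straight-through gradients.
            gate = F.gumbel_softmax(self.gate_logits[i], tau=tau, hard=True)
            x = x + gate[1] * block(x)         # gate[1] == 1 keeps the block, 0 skips it
        return x

    def kept_layers(self):
        return [i for i, g in enumerate(self.gate_logits) if g[1] > g[0]]


if __name__ == "__main__":
    model = PrunableStack(dim=64)
    x = torch.randn(4, 16, 64)
    out = model(x)
    # Sparsity pressure: penalize the expected number of kept layers.
    keep_prob = F.softmax(model.gate_logits, dim=-1)[:, 1]
    loss = out.pow(2).mean() + 0.1 * keep_prob.sum()
    loss.backward()
    print("layers currently favored:", model.kept_layers())
```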
Poster
Yuanyang Yin · Yaqi Zhao · Mingwu Zheng · Ke Lin · Jiarong Ou · Rui Chen · Victor Shea-Jay Huang · Jiahao Wang · Xin Tao · Pengfei Wan · Di ZHANG · Baoqun Yin · Wentao Zhang · Kun Gai

[ ExHall D ]

Abstract
Achieving optimal performance of video diffusion transformers within given data and compute budgets is crucial due to their high training costs. This necessitates precisely determining the optimal model size and training hyperparameters before large-scale training. While scaling laws are employed in language models to predict performance, their existence and accurate derivation in visual generation models remain underexplored. In this paper, we systematically analyze scaling laws for video diffusion transformers and confirm their presence. Moreover, we discover that, unlike language models, video diffusion models are more sensitive to learning rate and batch size—two hyperparameters often not precisely modeled. To address this, we propose a new scaling law that predicts optimal hyperparameters for any model size and compute budget. Under these optimal settings, we achieve comparable performance and reduce inference costs by 40.1% compared to conventional scaling methods, within a compute budget of 1e10 TFlops. Furthermore, we establish a more generalized and precise relationship among test loss, any model size, and training budget. This enables performance prediction for non-optimal model sizes, which may also be appealing under practical inference cost constraints, achieving a better trade-off.
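As a loose illustration of fitting such a scaling relationship, one can regress a tuned hyperparameter against compute in log-log space. All numbers below are invented, and the simple power-law form is an assumption; the paper's actual functional form and constants are not reproduced here.

```python
import numpy as np

def fit_power_law(C, y):
    """Fit y = a * C^b by least squares in log space."""
    b, log_a = np.polyfit(np.log(C), np.log(y), deg=1)
    return np.exp(log_a), b

compute = np.array([1e6, 1e7, 1e8, 1e9])        # made-up budgets (TFlops)
best_lr = np.array([3e-3, 1.5e-3, 8e-4, 4e-4])  # made-up tuned learning rates
a, b = fit_power_law(compute, best_lr)
print(f"lr_opt(C) ~ {a:.2e} * C^{b:.2f}")
```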
Poster
Kaibo Zhao · Liang Bao · Yufei Li · Xu Su · Ke Zhang · Xiaotian Qiao

[ ExHall D ]

Abstract
Image vectorization aims to convert raster images to vector ones, allowing for easy scaling and editing. Existing works mainly rely on preset parameters (i.e., a fixed number of paths and control points), ignoring the complexity of the image and posing significant challenges to practical applications. We demonstrate that such an assumption is often incorrect, as the preset paths or control points may be neither essential nor sufficient to achieve accurate and editable vectorization results. Based on this key insight, in this paper, we propose an efficient image vectorization method with adaptive parametrization, where the paths and control points can be adjusted dynamically based on the complexity of the input raster image. In particular, we first decompose the input raster image into a set of pure-colored layers that are aligned with human perception. For each layer with varying shape complexity, we propose a novel allocation mechanism to adaptively adjust the control point distribution. We further adopt a differentiable rendering process to compose and optimize the shape and color parameters of each layer iteratively. Extensive experiments demonstrate that our method outperforms the baselines qualitatively and quantitatively, in terms of computational efficiency, vectorization accuracy, and editing flexibility.
Poster
Mohd Hozaifa Khan · Ravi Kiran Sarvadevabhatla

[ ExHall D ]

Abstract
We introduce **Sketchtopia, a large-scale dataset and AI framework designed to explore goal-driven, multimodal communication through asynchronous interactions** in a Pictionary-inspired setup. Sketchtopia captures natural human interactions, including freehand sketches, open-ended guesses, and iconic feedback gestures, showcasing the complex dynamics of cooperative communication under constraints. It features over **20K gameplay sessions from 916 players, capturing 263K sketches, 10K erases, 56K guesses and 19.4K iconic feedbacks**. We introduce **multimodal foundational agents** with capabilities for generative sketching, guess generation and asynchronous communication. Our dataset also includes **800 human-agent sessions** for benchmarking the agents. We introduce **novel metrics** to characterize collaborative success, responsiveness to feedback and inter-agent asynchronous communication. Sketchtopia pushes the boundaries of multimodal AI, establishing **a new benchmark for studying asynchronous, goal-oriented interactions between humans and AI agents**.
Poster
Yihao Meng · Hao Ouyang · Hanlin Wang · Qiuyu Wang · Wen Wang · Ka Leong Cheng · Zhiheng Liu · Yujun Shen · Huamin Qu

[ ExHall D ]

Abstract
The production of 2D animation follows an industry-standard workflow, encompassing four essential stages: character design, keyframe animation, in-betweening, and coloring. Our research focuses on reducing the labor costs in the above process by harnessing the potential of increasingly powerful generative AI. Using video diffusion models as the foundation, Anidoc emerges as a video line art colorization tool, which automatically converts sketch sequences into colored animations following the reference character specification. Our model exploits correspondence matching as explicit guidance, yielding strong robustness to the variations (e.g., posture) between the reference character and each line art frame. In addition, our model could even automate the in-betweening process, such that users can easily create a temporally consistent animation by simply providing a character image as well as the start and end sketches. We will make the model public to facilitate the community.
Poster
Guy Yariv · Yuval Kirstain · Amit Zohar · Shelly Sheynin · Yaniv Taigman · Yossi Adi · Sagie Benaim · Adam Polyak

[ ExHall D ]

Abstract
We consider the task of Image-to-Video (I2V) generation, which involves transforming static images into realistic video sequences based on a textual description. While recent advancements produce photorealistic outputs, they frequently struggle to create videos with accurate and consistent object motion, especially in multi-object scenarios. To address these limitations, we propose a two-stage compositional framework that decomposes I2V generation into: (i) An explicit intermediate representation generation stage, followed by (ii) A video generation stage that is conditioned on this representation. Our key innovation is the introduction of a mask-based motion trajectory as an intermediate representation, that captures both semantic object information and motion, enabling an expressive but compact representation of motion and semantics. To incorporate the learned representation in the second stage, we utilize object-level attention objectives. Specifically, we consider a spatial, per-object, masked-cross attention objective, integrating object-specific prompts into corresponding latent space regions and a masked spatio-temporal self-attention objective, ensuring frame-to-frame consistency for each object. We evaluate our method on challenging benchmarks with multi-object and high-motion scenarios and empirically demonstrate that the proposed method achieves state-of-the-art results in temporal coherence, motion realism, and text-prompt faithfulness. Additionally, we introduce SA-V-128, a new challenging benchmark for single-object and multi-object I2V generation, and demonstrate …
Poster
Tongtong Su · Chengyu Wang · Bingyan Liu · Jun Huang · Dongming Lu

[ ExHall D ]

Abstract
In recent years, large text-to-video (T2V) synthesis models have garnered considerable attention for their abilities to generate videos from textual descriptions. However, achieving both high imaging quality and effective motion representation remains a significant challenge for these T2V models. Existing approaches often adapt pre-trained text-to-image (T2I) models to refine video frames, leading to issues such as flickering and artifacts due to inconsistencies across frames. In this paper, we introduce EVS, a training-free Encapsulated Video Synthesizer that composes T2I and T2V models to enhance both visual fidelity and motion smoothness of generated videos. Our approach utilizes a well-trained diffusion-based T2I model to refine low-quality video frames by treating them as out-of-distribution samples, effectively optimizing them with noising and denoising steps. Meanwhile, we employ T2V backbones to ensure consistent motion dynamics. By encapsulating the T2V temporal-only prior into the T2I generation process, EVS successfully leverages the strengths of both types of models, resulting in videos of improved imaging and motion quality. Experimental results validate the effectiveness of our approach compared to previous approaches. Our composition process also leads to a significant 1.6x-4.5x speedup in inference time. (Source codes will be released upon paper acceptance.)
Poster
Yeongmin Kim · Sotiris Anagnostidis · Yuming Du · Edgar Schoenfeld · Jonas Kohler · Markos Georgopoulos · Albert Pumarola · Ali Thabet · Artsiom Sanakoyeu

[ ExHall D ]

Abstract
Diffusion models with transformer architectures have demonstrated promising capabilities in generating high-fidelity images and scalability for high resolution. However, the iterative sampling process required for synthesis is very resource-intensive. A line of work has focused on distilling solutions to probability flow ODEs into few-step student models. Nevertheless, existing methods have been limited by their reliance on the most recent denoised samples as input, rendering them susceptible to exposure bias. To address this limitation, we propose AutoRegressive Distillation (ARD), a novel approach that leverages the historical trajectory of the ODE to predict future steps. ARD offers two key benefits: 1) it mitigates exposure bias by utilizing a predicted historical trajectory that is less susceptible to accumulated errors, and 2) it leverages the previous history of the ODE trajectory as a more effective source of coarse-grained information. ARD modifies the teacher transformer architecture by adding token-wise time embeddings to mark each input from the trajectory history and employs a block-wise causal attention mask for training. Furthermore, incorporating historical inputs only in lower transformer layers enhances performance and efficiency. We validate the effectiveness of ARD on class-conditioned generation on ImageNet and on T2I synthesis. Our model achieves a 5× reduction in FID degradation compared …
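A small sketch of the block-wise causal masking idea (token counts and shapes are assumed for illustration): tokens belonging to ODE step t may attend to tokens from all steps up to and including t, but never to later steps.

```python
import torch

def blockwise_causal_mask(num_steps: int, tokens_per_step: int) -> torch.Tensor:
    """Boolean (N, N) mask where True means attention is allowed."""
    step_id = torch.arange(num_steps).repeat_interleave(tokens_per_step)
    return step_id[:, None] >= step_id[None, :]

mask = blockwise_causal_mask(num_steps=4, tokens_per_step=3)
print(mask.int())   # lower block-triangular pattern over 12 x 12 tokens
```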
Poster
Diljeet Jagpal · Xi Chen · Vinay P. Namboodiri

[ ExHall D ]

Abstract
Zero-shot, training-free, image-based text-to-video generation is an emerging area that aims to generate videos using existing image-based diffusion models. Current methods in this space require specific architectural changes to image-generation models, which limit their adaptability and scalability. In contrast to such methods, we provide a model-agnostic approach. We use intersections in diffusion trajectories, working only with the latent values. We could not obtain localized frame-wise coherence and diversity using only the intersection of trajectories. Thus, we instead use a grid-based approach. An in-context trained LLM is used to generate coherent frame-wise prompts; another is used to identify differences between frames. Based on these, we obtain a CLIP-based attention mask that controls the timing of switching the prompts for each grid cell. Earlier switching results in higher variance, while later switching results in more coherence. Therefore, our approach can ensure appropriate control between coherence and variance for the frames. Our approach results in state-of-the-art performance while being more flexible when working with diverse image-generation models. The empirical analysis using quantitative metrics and user studies confirms our model’s superior temporal consistency, visual fidelity and user satisfaction, thus providing a novel way to obtain training-free, image-based text-to-video generation.
Poster
Luozhou Wang · Yijun Li · ZhiFei Chen · Jui-Hsien Wang · Zhifei Zhang · He Zhang · Zhe Lin · Ying-Cong Chen

[ ExHall D ]

Abstract
Text-to-video generative models have made significant strides, enabling diverse applications in entertainment, advertising, and education. However, generating RGBA video, which includes alpha channels for transparency, remains a challenge due to limited datasets and the difficulty of adapting existing models. Alpha channels are crucial for visual effects (VFX), allowing transparent elements like smoke and reflections to blend seamlessly into scenes. We introduce TransPixar, a method to extend pretrained video models for RGBA generation while retaining the original RGB capabilities. TransPixar leverages a diffusion transformer (DiT) architecture, incorporating alpha-specific tokens and using LoRA-based fine-tuning to jointly generate RGB and alpha channels with high consistency. By optimizing attention mechanisms, TransPixar preserves the strengths of the original RGB model and achieves strong alignment between RGB and alpha channels despite limited training data. Our approach effectively generates diverse and consistent RGBA videos, advancing the possibilities for VFX and interactive content creation.
Poster
Xiang Gao · Shuai Yang · Jiaying Liu

[ ExHall D ]

Abstract
Optical illusion hidden picture is an interesting visual perceptual phenomenon where an image is cleverly integrated into another picture in a way that is not immediately obvious to the viewer. Established on an off-the-shelf text-to-image diffusion model, we propose a novel Phase-Transferred Diffusion model (PTDiffusion) for hidden art synthesis. PTDiffusion embeds an input reference image into arbitrary scenes that are faithful to text prompts, while exhibiting hidden visual cues of the reference image. At the heart of our method is a plug-and-play phase transfer mechanism that dynamically and progressively transplants the phase spectrum of diffusion features from the denoising process that reconstructs the reference image into the process that samples the illusion picture, realizing harmonious fusion of the reference structural information and the target semantic information. Furthermore, we propose an asynchronous phase transfer mechanism to flexibly control the degree of hidden image discernibility. Our method is training-free, all while substantially outperforming related methods in image quality, text fidelity, visual discernibility, and contextual naturalness for illusion picture synthesis, as fully demonstrated by extensive qualitative and quantitative experiments.
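The core operation can be pictured as a Fourier phase swap. The toy function below (a simplification of the idea, operating on a standalone feature map rather than inside a denoising network) keeps the magnitude spectrum of the sampled features while transplanting the reference's phase; the `alpha` knob loosely mimics controlling how discernible the hidden image is.

```python
import torch

def phase_transfer(sample_feat: torch.Tensor, ref_feat: torch.Tensor, alpha: float = 1.0):
    """Blend the reference phase spectrum into the sample's spectrum.
    Uses a naive linear blend of angles, which is adequate for illustration."""
    S = torch.fft.fft2(sample_feat)
    R = torch.fft.fft2(ref_feat)
    phase = alpha * R.angle() + (1.0 - alpha) * S.angle()
    return torch.fft.ifft2(S.abs() * torch.exp(1j * phase)).real

out = phase_transfer(torch.randn(1, 4, 64, 64), torch.randn(1, 4, 64, 64), alpha=0.7)
```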
Poster
Hyunsoo Kim · Donghyun Kim · Suhyun Kim

[ ExHall D ]

Abstract
How can we generate an image B' that satisfies A:A'::B:B', given the input images A, A' and B? Recent works have tackled this challenge through approaches like visual in-context learning or visual instruction. However, these methods are typically limited to specific models (e.g., InstructPix2Pix, inpainting models) rather than general diffusion models (e.g., Stable Diffusion, SDXL). This dependency may lead to inherited biases or lower editing capabilities. In this paper, we propose Difference Inversion, a method that isolates only the difference between A and A' and applies it to B to generate a plausible B'. To address model dependency, it is crucial to structure prompts in the form of a "Full Prompt" suitable for input to stable diffusion models, rather than using an "Instruction Prompt". To this end, we accurately extract the Difference between A and A' and combine it with the prompt of B, enabling a plug-and-play application of the difference. To extract a precise difference, we first identify it through 1) Delta Interpolation. Additionally, to ensure accurate training, we propose the 2) Token Consistency Loss and 3) Zero Initialization of Token Embeddings. Our extensive experiments demonstrate that Difference Inversion outperforms existing baselines both quantitatively and qualitatively, indicating its ability to generate …
Poster
ruojun xu · Weijie Xi · Xiaodi Wang · Yongbo Mao · Zach Cheng

[ ExHall D ]

Abstract
Training-free diffusion-based methods have achieved remarkable success in style transfer, eliminating the need for extensive training or fine-tuning. However, due to the lack of targeted training for style information extraction and constraints on the content image layout, training-free methods often suffer from layout changes of the original content and content leakage from style images. Through a series of experiments, we discovered that an effective startpoint in the sampling stage significantly enhances the style transfer process. Based on this discovery, we propose StyleSSP, which focuses on obtaining a better startpoint to address layout changes of the original content and content leakage from the style image. StyleSSP comprises two key components: (1) Frequency Manipulation: To improve content preservation, we reduce the low-frequency components of the DDIM latent, allowing the sampling stage to pay more attention to the layout of content images; and (2) Negative Guidance via Inversion: To mitigate content leakage from the style image, we employ negative guidance in the inversion stage to ensure that the startpoint of the sampling stage is distanced from the content of the style image. Experiments show that StyleSSP surpasses previous training-free style transfer baselines, particularly in preserving original content and minimizing content leakage from the style image.
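A rough sketch of the frequency-manipulation step: damp the low-frequency band of the inverted DDIM latent before sampling begins. The cutoff radius, gain, and the hard circular mask are arbitrary choices of this illustration, not StyleSSP's exact filtering scheme.

```python
import torch

def suppress_low_freq(latent: torch.Tensor, radius: int = 4, gain: float = 0.5):
    """Attenuate spatial frequencies within `radius` of the spectrum centre."""
    spec = torch.fft.fftshift(torch.fft.fft2(latent), dim=(-2, -1))
    h, w = latent.shape[-2:]
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    dist = ((yy - h // 2) ** 2 + (xx - w // 2) ** 2).float().sqrt()
    mask = torch.where(dist <= radius, torch.full_like(dist, gain), torch.ones_like(dist))
    return torch.fft.ifft2(torch.fft.ifftshift(spec * mask, dim=(-2, -1))).real

start_latent = suppress_low_freq(torch.randn(1, 4, 64, 64))
```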
Poster
Yang Zhou · Xu Gao · Zichong Chen · Hui Huang

[ ExHall D ]

Abstract
Recent advances in generative diffusion models have shown a notable inherent understanding of image style and semantics. In this paper, we leverage the self-attention features from pretrained diffusion networks to transfer the visual characteristics from a reference to generated images. Unlike previous work that uses these features as plug-and-play attributes, we propose a novel attention distillation loss calculated between the ideal and current stylization results, based on which we optimize the synthesized image via backpropagation in latent space. Next, we propose an improved Classifier Guidance that integrates attention distillation loss into the denoising sampling process, further accelerating the synthesis and enabling a broad range of image generation applications. Extensive experiments have demonstrated the extraordinary performance of our approach in transferring the examples' style, appearance, and texture to new images in synthesis.
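In spirit, the distillation target can be written as the gap between the current self-attention output and the output obtained when the synthesized image's queries attend to the reference's keys and values. The function below is a loose sketch with assumed tensor names and shapes, not the authors' exact loss; it requires PyTorch 2.x for `scaled_dot_product_attention`.

```python
import torch
import torch.nn.functional as F

def attention_distillation_loss(q_gen, k_gen, v_gen, k_ref, v_ref):
    """q/k/v tensors are (batch, heads, tokens, dim) self-attention features."""
    current = F.scaled_dot_product_attention(q_gen, k_gen, v_gen)   # current stylization
    ideal = F.scaled_dot_product_attention(q_gen, k_ref, v_ref)     # reference-guided target
    return F.mse_loss(current, ideal.detach())
```

The resulting scalar can then be backpropagated to the latent being synthesized, or folded into the denoising sampling process as a guidance term.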
Poster
Jihun Park · Jongmin Gim · Kyoungmin Lee · Seunghun Lee · Sunghoon Im

[ ExHall D ]

Abstract
We present a text-driven object-centric style editing model named Style-Editor, a novel method that guides style editing at an object-centric level using textual inputs. The core of Style-Editor is our Patch-wise Co-Directional (PCD) loss, meticulously designed for precise object-centric editing that is closely aligned with the input text. This loss combines a patch directional loss for text-guided style direction and a patch distribution consistency loss for an even CLIP embedding distribution across object regions. It ensures seamless and harmonious style editing across object regions. Key to our method are the Text-Matched Patch Selection (TMPS) and Pre-fixed Region Selection (PRS) modules for identifying object locations via text, eliminating the need for segmentation masks. Lastly, we introduce an Adaptive Background Preservation (ABP) loss to maintain the original style and structural essence of the image’s background. This loss is applied to dynamically identified background areas. Extensive experiments underline the effectiveness of our approach in creating visually coherent and textually aligned style editing.
Poster
Suho Ryu · Kihyun Kim · Eugene Baek · Dongsoo Shin · Joonseok Lee

[ ExHall D ]

Abstract
A variety of text-guided image editing models have been proposed recently. However, there is no widely-accepted standard evaluation method mainly due to the subjective nature of the task, letting researchers rely on manual user study. To address this, we introduce a novel Human-Aligned benchmark for Text-guided Image Editing (HATIE). Providing a large-scale benchmark set covering a wide range of editing tasks, it allows reliable evaluation, not limited to specific easy-to-evaluate cases. Also, HATIE provides a fully-automated and omnidirectional evaluation pipeline. Particularly, we combine multiple scores measuring various aspects of editing so as to align with human perception. We empirically verify that the evaluation of HATIE is indeed human-aligned in various aspects, and provide benchmark results on several state-of-the-art models to provide deeper insights on their performance.
Poster
Weicheng Wang · Guoli Jia · Zhongqi Zhang · Liang Lin · Jufeng Yang

[ ExHall D ]

Abstract
Diffusion models pre-trained on large-scale paired image-text data achieve significant success in image editing. To convey more fine-grained visual details, subject-driven editing integrates subjects in user-provided reference images into existing scenes. However, it is challenging to obtain photorealistic results, which simulate contextual interactions, such as reflections, illumination, and shadows, induced by merging the target object into the source image. To address this issue, we propose PS-Diffusion, which ensures realistic and consistent object-scene blending while maintaining invariance of subject appearance during editing. Specifically, we first divide the contextual interactions into those occurring in the foreground and the background areas. The effect of the former is estimated through intrinsic image decomposition, and the region of the latter is predicted in an additional background effect control branch. Moreover, we propose an effect attention module to disentangle the learning processes of interaction and subject, alleviating confusion between them. Additionally, we introduce a synthesized dataset, Replace-5K, consisting of 5,000 image pairs with invariant subject and contextual interactions via 3D rendering. Extensive quantitative and qualitative experiments on our dataset and two real-world datasets demonstrate that our method achieves state-of-the-art performance. The source code is provided in the supplementary materials and will be publicly available.
Poster
Navve Wasserman · Noam Rotstein · Roy Ganz · Ron Kimmel

[ ExHall D ]

Abstract
Image editing has advanced significantly with the introduction of text-conditioned diffusion models. Despite this progress, seamlessly adding objects to images based on textual instructions without requiring user-provided input masks remains a challenge. We address this by leveraging the insight that removing objects (Inpaint) is significantly simpler than its inverse process of adding them (Paint), attributed to the utilization of segmentation mask datasets alongside inpainting models that inpaint within these masks. Capitalizing on this realization, by implementing an automated and extensive pipeline, we curate a filtered large-scale image dataset containing pairs of images and their corresponding object-removed versions. Using these pairs, we train a diffusion model to inverse the inpainting process, effectively adding objects into images. Unlike other editing datasets, ours features natural target images instead of synthetic ones; moreover, it maintains consistency between source and target by construction. Additionally, we utilize a large Vision-Language Model to provide detailed descriptions of the removed objects and a Large Language Model to convert these descriptions into diverse, natural-language instructions. Our quantitative and qualitative results show that the trained model surpasses existing models in both object addition and general editing tasks. To propel future research, we will release the dataset alongside the trained models.
Poster
jun huang · Ting Liu · Yihang Wu · Xiaochao Qu · Luoqi Liu · Xiaolin Hu

[ ExHall D ]

Abstract
Advancements in generative models have enabled image inpainting models to generate content within specific regions of an image based on provided prompts and masks. However, existing inpainting methods often suffer from problems such as semantic misalignment, structural distortion, and style inconsistency. In this work, we present MTADiffusion, a Mask-Text Alignment diffusion model designed for object inpainting. To enhance the semantic capabilities of the inpainting model, we introduce the MTAPipeline, an automatic solution for annotating masks with detailed descriptions. Based on the MTAPipeline, we construct a new MTADataset comprising 5 million images and 25 million mask-text pairs. Furthermore, we propose a multi-task training strategy that integrates both inpainting and edge prediction tasks to improve structural stability. To promote style consistency, we present the combination of self-attention mechanisms and a novel inpainting style-consistency loss using a pre-trained VGG network and the Gram matrix. Comprehensive evaluations on BrushBench and EditBench demonstrate that MTADiffusion achieves state-of-the-art performance compared to other methods.
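The style-consistency term builds on the familiar Gram-matrix formulation. The sketch below assumes VGG feature maps have already been extracted for the inpainted region and its surrounding context (the feature extraction itself is omitted), so it only shows the Gram comparison.

```python
import torch
import torch.nn.functional as F

def gram(feat):                        # feat: (B, C, H, W) VGG feature map
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def style_consistency_loss(feat_inpaint, feat_context):
    """Match second-order feature statistics between two regions."""
    return F.mse_loss(gram(feat_inpaint), gram(feat_context))
```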
Poster
Yizhe Tang · Zhimin Sun · Yuzhen Du · Ran Yi · Guangben Lu · Teng Hu · LUYING LI · Lizhuang Ma · FangYuan Zou

[ ExHall D ]

Abstract
Image inpainting aims to fill the missing region of an image. Recently, there has been a surge of interest in foreground-conditioned background inpainting, a sub-task that fills the background of an image while the foreground subject and associated text prompt are provided. Existing background inpainting methods typically strictly preserve the subject's original position from the source image, resulting in inconsistencies between the subject and the generated background. To address this challenge, we propose a new task, "Text-Guided Subject-Position Variable Background Inpainting", which aims to dynamically adjust the subject position to achieve a harmonious relationship between the subject and the inpainted background, and propose the Adaptive Transformation Agent (ATA) for this task. Firstly, we design a PosAgent Block that adaptively predicts an appropriate displacement based on given features to achieve variable subject-position. Secondly, we design the Reverse Displacement Transform (RDT) module, which arranges multiple PosAgent blocks in a reverse structure, to transform hierarchical feature maps from deep to shallow based on semantic information. Thirdly, we equip ATA with a Position Switch Embedding to control whether the subject's position in the generated image is adaptively predicted or fixed. Extensive comparative experiments validate the effectiveness of our ATA approach, which not only demonstrates superior inpainting capabilities in subject-position variable inpainting, but …
Poster
Bolin Lai · Felix Juefei-Xu · Miao Liu · Xiaoliang Dai · Nikhil Mehta · Chenguang Zhu · Zeyi Huang · James Rehg · Sangmin Lee · Ning Zhang · Tong Xiao

[ ExHall D ]

Abstract
Text-guided image manipulation has experienced notable advancement in recent years. In order to mitigate linguistic ambiguity, few-shot learning with visual examples has been applied for instructions that are underrepresented in the training set, or difficult to describe purely in language. However, learning from visual prompts requires strong reasoning capability, which diffusion models are struggling with. To address this issue, we introduce a novel multi-modal autoregressive model, dubbed InstaManip, that can instantly learn a new image manipulation operation from textual and visual guidance via in-context learning, and apply it to new query images. Specifically, we propose an innovative group self-attention mechanism to break down the in-context learning process into two separate stages -- learning and applying, which simplifies the complex problem into two easier tasks. We also introduce a relation regularization method to further disentangle image transformation features from irrelevant contents in exemplar images. Extensive experiments suggest that our method surpasses previous few-shot image manipulation models by a notable margin (>=19% in human evaluation). We also find our model can be further boosted by increasing the number or diversity of exemplar images.
Poster
Pu Cao · Feng Zhou · Lu Yang · TianruiHuang · Qing Song

[ ExHall D ]

Abstract
In-domain generation aims to perform a variety of tasks within a specific domain, such as unconditional generation, text-to-image, image editing, 3D generation, and more. Early research typically required training specialized generators for each unique task and domain, often relying on fully-labeled data. Motivated by the powerful generative capabilities and broad applications of diffusion models, we are driven to explore leveraging label-free data to empower these models for in-domain generation. Fine-tuning a pre-trained generative model on domain data is an intuitive but challenging way and often requires complex manual hyper-parameter adjustments since the limited diversity of the training data can easily disrupt the model's original generative capabilities. To address this challenge, we propose a guidance-decoupled prior preservation mechanism to achieve high generative quality and controllability by image-only data, inspired by preserving the pre-trained model from a denoising guidance perspective. We decouple domain-related guidance from the conditional guidance used in classifier-free guidance mechanisms to preserve open-world control guidance and unconditional guidance from the pre-trained model. We further propose an efficient domain knowledge learning technique to train an additional text-free UNet copy to predict domain guidance. Besides, we theoretically illustrate a multi-guidance in-domain generation pipeline for a variety of generative tasks, leveraging multiple guidances from distinct diffusion …
Poster
Qihan Huang · Weilong Dai · Jinlong Liu · Wanggui He · Hao Jiang · Mingli Song · Jie Song

[ ExHall D ]

Abstract
Finetuning-free personalized image generation can synthesize customized images without test-time finetuning, attracting wide research interest owing to its high efficiency. Current finetuning-free methods simply adopt a single training stage with a simple image reconstruction task, and they typically generate low-quality images inconsistent with the reference images during test-time. To mitigate this problem, inspired by the recent DPO (i.e., direct preference optimization) technique, this work proposes an additional training stage to improve the pre-trained personalized generation models. However, traditional DPO only determines the overall superiority or inferiority of two samples, which is not suitable for personalized image generation because the generated images are commonly inconsistent with the reference images only in some local image patches. To tackle this problem, this work proposes PatchDPO that estimates the quality of image patches within each generated image and accordingly trains the model. To this end, PatchDPO first leverages the pre-trained vision models with a proposed self-supervised training method to estimate the patch quality. Next, PatchDPO adopts a weighted training approach to train the model with the estimated patch quality, which rewards the image patches with high quality while penalizing the image patches with low quality. Experiment results demonstrate that PatchDPO significantly improves the performance …
Poster
Dong Liang · Jinyuan Jia · Yuhao Liu · Zhanghan Ke · Hongbo Fu · Rynson W.H. Lau

[ ExHall D ]

Abstract
Recent advancements in diffusion models have significantly enhanced the performance of text-to-image models in image synthesis. To enable control over the spatial locations of the generated objects, diffusion-based methods typically utilize object layout as an auxiliary input. However, we observe that this approach treats all objects as being on the same layer and neglects their visibility order, leading to the synthesis of overlapping objects with incorrect occlusions. To address this limitation, we introduce in this paper a new training-free framework that considers object visibility order explicitly and allows users to place overlapping objects in a stack of layers. Our framework consists of two visibility-based designs. First, we propose a novel Sequential Denoising Process (SDP) to divide the whole image generation into multiple stages for different objects, where each stage primarily focuses on one object. Second, we propose a novel Visibility-Order-Aware (VOA) Loss to transform the layout and occlusion constraints into an attention map optimization process to improve the accuracy of synthesizing object occlusions in complex scenes. By merging these two novel components, our framework, dubbed VODiff, enables the generation of photorealistic images that satisfy user-specified spatial constraints and object occlusion relationships. In addition, we introduce VOBench, a diverse benchmark dataset containing 200 curated …
Poster
Yingying Deng · Xiangyu He · Fan Tang · Weiming Dong

[ ExHall D ]

Abstract
The customization of multiple attributes has gained increasing popularity with the rising demand for personalized content creation. Despite promising empirical results, the contextual coherence between different attributes has been largely overlooked. In this paper, we argue that subsequent attributes should follow the multivariable conditional distribution introduced by former attributes creation. In light of this, we reformulate multi-attribute creation from a conditional probability theory perspective and tackle the challenging zero-shot setting. By explicitly modeling the dependencies between attributes, we further enhance the coherence of generated images across diverse attribute combinations. Furthermore, we identify connections between multi-attribute customization and multi-task learning, effectively addressing the high computing cost encountered in multi-attribute synthesis. Extensive experiments demonstrate that Z-Magic outperforms existing models in zero-shot image generation, with broad implications for AI-driven design and creative applications.
Poster
Jian Han · Jinlai Liu · Yi Jiang · Bin Yan · Yuqi Zhang · Zehuan Yuan · BINGYUE PENG · Xiaobing Liu

[ ExHall D ]

Abstract
We present Infinity, a Bitwise Visual AutoRegressive Model capable of generating high-resolution, photorealistic images following language instructions. Infinity refactors the visual autoregressive model under a bitwise token prediction framework with an infinite-vocabulary classifier and a bitwise self-correction mechanism. By theoretically expanding the tokenizer vocabulary size to infinity in the Transformer, our method significantly unleashes powerful scaling capabilities compared to vanilla VAR. Extensive experiments indicate Infinity outperforms autoregressive text-to-image models by large margins, and matches or surpasses leading diffusion models. Without extra optimization, Infinity generates a 1024×1024 image in 0.8s, 2.6× faster than SD3-Medium, making it the fastest text-to-image model. Models and codes will be released to promote the further exploration of Infinity for visual generation.
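A skeletal view of the bitwise prediction idea, under simplifying assumptions of this sketch: instead of a softmax over a 2^d-entry codebook, the head emits d independent bit logits per visual token, so the effective vocabulary can grow exponentially without an enormous classifier matrix.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitwiseHead(nn.Module):
    """Predict d independent bits per token instead of one 2^d-way index."""
    def __init__(self, hidden: int, d_bits: int = 32):
        super().__init__()
        self.proj = nn.Linear(hidden, d_bits)          # one logit per bit

    def forward(self, h, target_bits=None):
        logits = self.proj(h)                          # (B, N, d_bits)
        if target_bits is None:
            return (logits > 0).float()                # hard bits at inference
        return F.binary_cross_entropy_with_logits(logits, target_bits.float())
```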
Poster
Woojung Han · Yeonkyung Lee · Chanyoung Kim · Kwanghyun Park · Seong Jae Hwang

[ ExHall D ]

Abstract
Diffusion-based text-to-image (T2I) models have recently excelled in high-quality image generation, particularly in a training-free manner, enabling cost-effective adaptability and generalization across diverse tasks. However, while the existing methods have been continuously focusing on several challenges such as "missing objects" and "mismatched attributes," another critical issue of "mislocated objects" remains, where generated spatial positions fail to align with text prompts. Surprisingly, ensuring such seemingly basic functionality remains challenging in popular T2I models due to the inherent difficulty of imposing explicit spatial guidance via text forms. To address this, we propose STORM (Spatial Transport Optimization by Repositioning Attention Map), a novel training-free approach for spatially coherent T2I synthesis. STORM employs Spatial Transport Optimization (STO), rooted in optimal transport theory, to dynamically adjust object attention maps for precise spatial adherence, supported by a custom Spatial Transport (ST) Cost function that enhances spatial understanding. Our analysis shows that integrating spatial awareness is most effective in the early denoising stages, while later phases refine details. Extensive experiments demonstrate that STORM surpasses existing methods, effectively mitigating mislocated objects while improving missing and mismatched attributes, setting a new benchmark for spatial alignment in T2I synthesis. The source code will be publicly released.
Poster
Jiapeng Zhu · Ceyuan Yang · Kecheng Zheng · Yinghao Xu · Zifan Shi · Yifei Zhang · Qifeng Chen · Yujun Shen

[ ExHall D ]

Abstract
Due to the difficulty in scaling up, generative adversarial networks (GANs) seem to be falling out of grace with the task of text-conditioned image synthesis. Sparsely activated mixture-of-experts (MoE) has recently been demonstrated as a valid solution to training large-scale models with limited resources. Inspired by this, we present Aurora, a GAN-based text-to-image generator that employs a collection of experts to learn feature processing, together with a sparse router to adaptively select the most suitable expert for each feature point. We adopt a two-stage training strategy, which first learns a base model at 64×64 resolution followed by an upsampler to produce 512×512 images. Trained with only public data, our approach encouragingly closes the performance gap between GANs and industry-level diffusion models, maintaining a fast inference speed. We will release the code and checkpoints to facilitate the community for more comprehensive studies of GANs.
Poster
Qingyu Shi · Lu Qi · Jianzong Wu · Jinbin Bai · Jingbo Wang · Yunhai Tong · Xiangtai Li

[ ExHall D ]

Abstract
Customized image generation is essential for delivering personalized content based on user-provided prompts, enabling large-scale text-to-image diffusion models to better align with individual needs. However, existing models often neglect the relationships between customized objects in generated images. In contrast, this work addresses this gap by focusing on relation-aware customized image generation, which seeks to preserve the identities from image prompts while maintaining the predicate relations specified in text prompts. Specifically, we introduce DreamRelation, a framework that disentangles identity and relation learning using a carefully curated dataset. Our training data consists of relation-specific images, independent object images containing identity information, and text prompts to guide relation generation. Then, we propose two key modules to tackle the two main challenges—generating accurate and natural relations, especially when significant pose adjustments are required, and avoiding object confusion in cases of overlap. First, we introduce a keypoint matching loss that effectively guides the model in adjusting object poses closely tied to their relationships. Second, we incorporate local features from the image prompts to better distinguish between objects, preventing confusion in overlapping cases. Extensive results on our proposed benchmarks demonstrate the superiority of DreamRelation in generating precise relations while preserving object identities across a diverse set …
Poster
Kaiwen Zha · Lijun Yu · Alireza Fathi · David A. Ross · Cordelia Schmid · Dina Katabi · Xiuye Gu

[ ExHall D ]

Abstract
Image tokenization, the process of transforming raw image pixels into a compact low-dimensional latent representation, has proven crucial for scalable and efficient image generation. However, mainstream image tokenization methods generally have limited compression rates, making high-resolution image generation computationally expensive. To address this challenge, we propose to leverage language for efficient image tokenization, and we call our method Text-Conditioned Image Tokenization (TexTok). TexTok is a simple yet effective tokenization framework that leverages language to provide high-level semantics. By conditioning the tokenization process on descriptive text captions, TexTok allows the tokenization process to focus on encoding fine-grained visual details into latent tokens, leading to enhanced reconstruction quality and higher compression rates. Compared to the conventional tokenizer without text conditioning, TexTok achieves average reconstruction FID improvements of 29.2% and 48.1% on ImageNet 256×256 and 512×512 benchmarks respectively, across varying numbers of tokens. These tokenization improvements consistently translate to 16.3% and 34.3% average improvements in generation FID. By simply replacing the tokenizer in Diffusion Transformer (DiT) with TexTok, our system can achieve a 93.5× inference speedup while still outperforming the original DiT using only 32 tokens on ImageNet-512. TexTok with a vanilla DiT generator achieves state-of-the-art FID scores of 1.46 and 1.62 on ImageNet-256 …
Poster
Lifu Wang · Daqing Liu · Xinchen Liu · Xiaodong He

[ ExHall D ]

Abstract
Text encoders in diffusion models have rapidly evolved, transitioning from CLIP to T5-XXL. Although this evolution has significantly enhanced the models' ability to understand complex prompts and generate text, it also leads to a substantial increase in the number of parameters. Despite T5 series encoders being trained on the C4 natural language corpus, which includes a significant amount of non-visual data, diffusion models with T5 encoder do not respond to those non-visual prompts, indicating redundancy in representational power. Therefore, it raises an important question: "Do we really need such a large text encoder?" In pursuit of an answer, we employ vision-based knowledge distillation to train a series of T5 encoder models. To fully inherit its capabilities, we constructed our dataset based on three criteria: image quality, semantic understanding, and text-rendering. Our results demonstrate the scaling down pattern that the distilled T5-base model can generate images of comparable quality to those produced by T5-XXL, while being 50 times smaller in size. This reduction in model size significantly lowers the GPU requirements for running state-of-the-art models such as FLUX and SD3, making high-quality text-to-image generation more accessible.
Poster
Shengqu Cai · Eric Ryan Chan · Yunzhi Zhang · Leonidas Guibas · Jiajun Wu · Gordon Wetzstein

[ ExHall D ]

Abstract
Text-to-image diffusion models produce impressive results but are frustrating tools for artists who desire fine-grained control. For example, a common use case is to create images of a specific instance in novel contexts, i.e., "identity-preserving generation". This setting, along with many other tasks (e.g., relighting), is a natural fit for image+text-conditional generative models. However, there is insufficient high-quality paired data to train such a model directly. We propose Diffusion Self-Distillation, a method for using a pre-trained text-to-image model to generate its own dataset for text-conditioned image-to-image tasks. We first leverage a text-to-image diffusion model's in-context generation ability to create grids of images and curate a large paired dataset with the help of a Visual-Language Model. We then fine-tune the text-to-image model into a text+image-to-image model using the curated paired dataset. We demonstrate that Diffusion Self-Distillation outperforms existing zero-shot methods and is competitive with per-instance tuning techniques on a wide range of identity-preservation generation tasks, without requiring test-time optimization.
Poster
Fu Feng · Yucheng Xie · Xu Yang · Jing Wang · Xin Geng

[ ExHall D ]

Abstract
"Creative" remains an inherently abstract concept for both humans and diffusion models. While text-to-image (T2I) diffusion models can easily generate out-of-domain concepts like "a blue banana", they struggle with generating combinatorial objects such as "a creative mixture that resembles a lettuce and a mantis", due to difficulties in understanding the semantic depth of "creative". Current methods rely heavily on synthesizing reference prompts or images to achieve a creative effect, typically requiring retraining for each unique creative output---a process that is computationally intensive and limits practical applications. To address this, we introduce CreTok, which brings meta-creativity to diffusion models by redefining "creative" as a new token, <CreTok>, thus enhancing models' semantic understanding for combinatorial creativity. CreTok achieves such redefinition by iteratively sampling diverse text pairs from our proposed CangJie dataset to form adaptive prompts and restrictive prompts, and then optimizing the similarity between their respective text embeddings. Extensive experiments demonstrate that <CreTok> enables the universal and direct generation of combinatorial creativity across diverse concepts without additional training (4s vs. BASS's 2400s per image), achieving state-of-the-art performance with improved text-image alignment (0.03 in VQAScore) and higher human preference ratings (0.009 in PickScore and 0.169 in ImageReward). Further evaluations with GPT-4o …
Poster
Shulei Wang · w l · Hai Huang · Hanting Wang · Sihang Cai · WenKang Han · Tao Jin · Jingyuan Chen · Jiacheng Sun · Jieming Zhu · Zhou Zhao

[ ExHall D ]

Abstract
We introduce a novel, training-free approach for enhancing alignment in Transformer-based Text-Guided Diffusion Models (TGDMs). Existing TGDMs often struggle to generate semantically aligned images, particularly when dealing with complex text prompts or multi-concept attribute binding challenges. Previous U-Net-based methods primarily optimized the latent space, but their direct application to Transformer-based architectures has shown limited effectiveness. Our method addresses these challenges by directly optimizing cross-attention maps during the generation process. Specifically, we introduce Self-Coherence Guidance, a method that dynamically refines attention maps using masks derived from previous denoising steps, ensuring precise alignment without additional training. To validate our approach, we constructed more challenging benchmarks for evaluating coarse-grained attribute binding, fine-grained attribute binding, and style binding. Experimental results demonstrate the superior performance of our method, significantly surpassing other state-of-the-art methods across all evaluated tasks. Our code is available at https://scg-diffusion.github.io/scg-diffusion.
Poster
Kyungmin Lee · Xiaohang Li · Qifei Wang · Junfeng He · Junjie Ke · Ming-Hsuan Yang · Irfan Essa · Jinwoo Shin · Feng Yang · Yinxiao Li

[ ExHall D ]

Abstract
Aligning text-to-image (T2I) diffusion models with preference optimization typically relies on human-annotated datasets, but the heavy cost of manual data collection limits scalability. Using reward models offers an alternative; however, current preference optimization methods fall short in exploiting the rich information, as they only consider pairwise preference distributions. Furthermore, they lack generalization to multi-preference scenarios and struggle to handle inconsistencies between rewards. To address this, we present Calibrated Preference Optimization (CaPO), a novel method to align T2I diffusion models by incorporating the general preference from multiple reward models without human-annotated data. The core of our approach involves a reward calibration method to approximate the general preference by computing the expected win-rate against the samples generated by the pretrained models. Additionally, we propose a frontier-based pair selection method that effectively manages the multi-preference distribution by selecting pairs from Pareto frontiers. Finally, we use regression loss to fine-tune diffusion models to match the difference between calibrated rewards of a selected pair. Experimental results show that CaPO consistently outperforms prior methods, such as Direct Preference Optimization (DPO), in both single and multi-reward settings, as validated by evaluation on T2I benchmarks, including GenEval and T2I-Compbench.
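The calibration step can be pictured as an expected win rate: a candidate's calibrated reward is the fraction of reference samples (drawn from the pretrained model) whose raw reward it beats. The numbers below are invented purely to show the computation; the paper's exact estimator may differ.

```python
import numpy as np

def calibrated_reward(r_candidate: float, r_reference_samples) -> float:
    """Expected win rate of a candidate against pretrained-model samples."""
    r_ref = np.asarray(r_reference_samples, dtype=float)
    return float((r_candidate > r_ref).mean())

print(calibrated_reward(0.72, [0.40, 0.65, 0.80, 0.55]))  # -> 0.75
```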
Poster
Keyu Tu · Mengqi Huang · Zhuowei Chen · Zhendong Mao

[ ExHall D ]

Abstract
Large-scale text-to-image models evolve rapidly in size and architecture. The existing adapters struggle to keep pace with these models, requiring extensive retraining. This paper proposes a novel adapter transfer framework, A4A (Adapter for Adapter), which uses an all-for-all mapping approach to seamlessly transfer attention-based adapters across different model architectures (e.g., U-Net to transformer). The framework consists of Coupling Space Projection and Upgraded Space Mapping. During Coupling Space Projection, all attention features of the pre-trained adapter are collected to capture the complete coupling relationship with the base model and then projected into the unified space. Randomly initialized learnable features in the upgraded model are introduced to connect the unified space and upgraded space. By integrating the reference features through the attention mechanism and aligning them with the upgraded architecture, the learnable features bridge the discrepancies between the models. Experimental results on personalized image generation tasks demonstrate that A4A outperforms previous methods in transferring adapters while being the first to achieve adapter transfer across model architectures.
Poster
Xiaoying Xing · Avinab Saha · Junfeng He · Susan Hao · Paul Vicol · Moonkyung Ryu · Gang Li · Sahil Singla · Sarah Young · Yinxiao Li · Feng Yang · Deepak Ramachandran

[ ExHall D ]

Abstract
Text-to-image (T2I) generation has made significant advances in recent years, but challenges still remain in the generation of perceptual artifacts, misalignment with complex prompts, and safety. The prevailing approach to address these issues involves collecting human feedback on generated images, training reward models to estimate human feedback, and then fine-tuning T2I models based on the reward models to align them with human preferences. However, while existing reward fine-tuning methods can produce images with higher rewards, they may change model behavior in unexpected ways. For example, fine-tuning for one quality aspect (e.g., safety) may degrade other aspects (e.g., prompt alignment), or may lead to reward hacking (e.g., finding a way to increase rewards without having the intended effect). In this paper, we propose Focus-N-Fix, a region-aware fine-tuning method that trains models to correct only previously problematic image regions. The resulting fine-tuned model generates images with the same high-level structure as the original model but shows significant improvements in regions where the original model was deficient in safety (over-sexualization and violence), plausibility, or other criteria. Our experiments demonstrate that Focus-N-Fix improves these localized quality aspects with little or no degradation to others and typically imperceptible changes in the rest of the image.
Poster
Leigang Qu · Haochuan Li · Wenjie Wang · Xiang Liu · Juncheng Li · Liqiang Nie · Tat-seng Chua

[ ExHall D ]

Abstract
Large Multimodal Models (LMMs) have demonstrated impressive capabilities in multimodal understanding and generation, pushing forward advancements in text-to-image generation. However, achieving accurate text-image alignment for LMMs, particularly in compositional scenarios, remains challenging. Existing approaches, such as layout planning for multi-step generation and learning from human feedback or AI feedback, depend heavily on prompt engineering, costly human annotations, and continual upgrading, limiting flexibility and scalability. In this work, we introduce a model-agnostic iterative self-improvement framework (**SILMM**) that can enable LMMs to provide helpful and scalable self-feedback and optimize text-image alignment via Direct Preference Optimization (DPO). DPO can be readily applied to LMMs that use discrete visual tokens as intermediate image representations, while it is less suitable for LMMs with continuous visual features, as obtaining generation probabilities is challenging. To adapt SILMM to LMMs with continuous features, we propose a diversity mechanism to obtain diverse representations and a kernel-based continuous DPO for alignment. Extensive experiments on three compositional text-to-image generation benchmarks validate the effectiveness and superiority of SILMM, showing improvements exceeding 30% on T2I-CompBench++ and around 20% on DPG-Bench.
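For context, the standard discrete-token DPO objective that such alignment starts from looks like the sketch below (log-probabilities of the preferred and dispreferred generations under the policy and a frozen reference model). The kernel-based continuous variant proposed for LMMs with continuous visual features is not shown.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta: float = 0.1):
    """Standard DPO: push the policy's preference margin above the reference's."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()
```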
Poster
Chongjian GE · Chenfeng Xu · Yuanfeng Ji · Chensheng Peng · Masayoshi Tomizuka · Ping Luo · Mingyu Ding · Varun Jampani · Wei Zhan

[ ExHall D ]

Abstract
Recent breakthroughs in text-guided image generation have significantly advanced the field of 3D generation. While generating a single high-quality 3D object is now feasible, generating multiple objects with reasonable interactions within a 3D space, a.k.a. compositional 3D generation, presents substantial challenges. This paper introduces CompGS, a novel generative framework that employs 3D Gaussian Splatting (GS) for efficient, compositional text-to-3D content generation. To achieve this goal, two core designs are proposed: (1) *3D Gaussians Initialization with 2D compositionality*: We transfer the well-established 2D compositionality to initialize the Gaussian parameters on an entity-by-entity basis, ensuring both consistent 3D priors for each entity and reasonable interactions among multiple entities; (2) *Dynamic Optimization*: We propose a dynamic strategy to optimize 3D Gaussians using Score Distillation Sampling (SDS) loss. CompGS first automatically decomposes 3D Gaussians into distinct entity parts, enabling optimization at both the entity and composition levels. Additionally, CompGS optimizes across objects of varying scales by dynamically adjusting the spatial parameters of each entity, enhancing the generation of fine-grained details, particularly in smaller entities. Qualitative comparisons and quantitative evaluations on T3Bench demonstrate the effectiveness of CompGS in generating compositional 3D objects with superior image quality and semantic alignment over existing methods. CompGS can also …
Poster
Yiming Qin · Zhu Xu · Yang Liu

[ ExHall D ]

Abstract
In recent years, text-to-3D generation has made great progress and can produce many exquisite 3D objects. However, due to the weakness of text encoders in processing long text, text-to-3D generation with complex attributes remains difficult, and the results degrade further when the attributes involve heavy occlusion. Therefore, we propose a new method called Hierarchical-Chain-of-Generation (HCoG), which needs no manual effort: a large language model decomposes the complex target object into a hierarchical generation chain so that each part can be generated well. Furthermore, for each split text, SAM automatically finds the corresponding region, and the 3D Gaussian kernels in this region are optimized in a controllable way. In addition, when generating new parts along the hierarchical chain, previous parts must be preserved while new parts are optimized; we therefore propose Label Elimination to ensure that new parts neither attach to the surface of previous parts nor change them. Experiments demonstrate that HCoG is an end-to-end automatic framework for text-to-3D generation with complex attributes, effectively handling heavy occlusion between attributes while ensuring high-quality results.
Poster
Yidi Li · Jun Xiao · Zhengda Lu · Yiqun Wang · Haiyong Jiang

[ ExHall D ]

Abstract
This work presents a novel text-to-vector graphics generation approach, Dream3DVG, allowing for arbitrary viewpoint viewing, progressive detail optimization, and view-dependent occlusion awareness. Our approach is a dual-branch optimization framework, consisting of an auxiliary 3D Gaussian Splatting optimization branch and a 3D vector graphics optimization branch. The introduced 3DGS branch can bridge the domain gaps between text prompts and vector graphics with more consistent guidance. Moreover, 3DGS allows for progressive detail control by scheduling classifier-free guidance, facilitating guiding vector graphics with coarse shapes at the initial stages and finer details at later stages. We also improve the view-dependent occlusions by devising a visibility-awareness rendering module. Extensive results on 3D sketches and 3D iconographies demonstrate the superiority of the method at different abstraction levels of detail, cross-view consistency, and occlusion-aware stroke culling.
Poster
Chen Liang · Lianghua Huang · Jingwu Fang · Huanzhang Dou · Wei Wang · Zhi-Fan Wu · Yupeng Shi · Junge Zhang · Xin Zhao · Yu Liu

[ ExHall D ]

Abstract
Recent advancements in image generation models enable the creation of high-quality images and targeted modifications based on textual instructions. Some models even support multimodal complex guidance and demonstrate robust task generalization capabilities. However, they still fall short of meeting the nuanced, professional demands of designers. To bridge this gap, we introduce IDEA-Bench, a comprehensive benchmark designed to advance image generation models toward applications with robust task generalization. IDEA-Bench comprises 97 professional image generation tasks and 266 specific cases, categorized into five major types based on the current capabilities of existing models. Furthermore, we provide a representative subset of 18 tasks with enhanced evaluation criteria to facilitate more nuanced and reliable evaluations using Multimodal Large Language Models (MLLMs). By assessing models' ability to comprehend and execute novel, complex tasks, IDEA-Bench paves the way toward the development of generative models with autonomous and versatile visual generation capabilities.
Poster
Shalini Maiti · Lourdes Agapito · Filippos Kokkinos

[ ExHall D ]

Abstract
The rapid advancements in text-to-3D generation necessitate robust and scalable evaluation metrics that align closely with human judgment—a need unmet by current metrics such as PSNR and CLIP, which require ground-truth data or focus only on prompt fidelity. To address this, we introduce Gen3DEval, a novel evaluation framework that leverages vision large language models (vLLMs) specifically fine-tuned for 3D object quality assessment. Gen3DEval evaluates text fidelity, appearance, and surface quality—by analyzing 3D surface normals—without requiring ground-truth comparisons, bridging the gap between automated metrics and user preferences. Compared to state-of-the-art task-agnostic models, Gen3DEval demonstrates superior performance in user-aligned evaluations, establishing itself as a comprehensive and accessible benchmark for future research in text-to-3D generation. To support and encourage further research in this field, we will release both our code and benchmark.
Poster
Jiahao Li · Weijian Ma · Xueyang Li · Yunzhong Lou · Guichun Zhou · Xiangdong Zhou

[ ExHall D ]

Abstract
Large Language Models (LLMs) have achieved significant success recently, leading to growing enthusiasm to expand their generative capabilities from general text to specialized domains. This paper discusses generating parametric sequences of computer-aided design (CAD) models via LLMs. This serves as an entry point for 3D shape generation by LLMs, as CAD model parameters directly map to shapes in 3D space. This task is non-trivial despite LLMs' strong generative abilities, as they have neither encountered parametric sequences during pretraining nor can they directly perceive 3D shapes. Therefore, we propose CAD-Llama, a paradigm empowering pretrained LLMs for parametric 3D CAD model generation. Specifically, we introduce a code-like format to unify parametric 3D CAD command sequences and a hierarchical annotation pipeline to convert intricate parametric 3D CAD shapes into Python-like pseudo-code, an LLM-friendly text format. Additionally, we propose adaptive pre-training on Structured Parametric CAD Code (SPCC) and fine-tuning with CAD-related instructions to imbue LLMs with spatial information conveyed in parametric sequences. Experimental results show that our framework significantly outperforms previous autoregressive methods and prevailing LLM baselines.
Poster
Yunqi Gu · Ian Huang · Jihyeon Je · Guandao Yang · Leonidas Guibas

[ ExHall D ]

Abstract
3D graphics editing is a crucial component in applications like movie production and game design, yet it remains a time-consuming process that demands highly specialized domain expertise. Automating the process is challenging because graphical editing requires performing different tasks, each requiring distinct skill sets. Recently, multi-modal foundation models have emerged as a powerful framework for automating the editing process, but their development and evaluation are bottlenecked by the lack of a comprehensive benchmark that requires human-level perception and real-world editing complexity. In this work, we present BlenderGym, a benchmark designed to systematically evaluate foundation model systems for 3D graphics editing, with tasks capturing the various aspects of 3D editing and fixed ground truth for evaluation. We evaluate closed- and open-source VLMs with BlenderGym and observe that even state-of-the-art VLMs struggle with tasks that are relatively easy for a novice Blender user. Enabled by BlenderGym, we study how inference scaling techniques impact graphics editing tasks. Notably, our findings reveal that the verifier used to guide the scaling of generation can itself be improved through scaling, complementing recent insights on scaling of LLM generation in coding and math tasks. We further show that inference compute is not uniformly effective and can be optimized by …
Poster
Zhipeng Xu · De Cheng · XINYANG JIANG · Nannan Wang · Dongsheng Li · Xinbo Gao

[ ExHall D ]

Abstract
Single domain generalization (SDG) aims to learn a robust model that performs well on many unseen domains when only a single domain is available for training. One of the promising directions for achieving single-domain generalization is to generate out-of-domain (OOD) training data through data augmentation or image generation. Given the rapid advancements in AI-generated content (AIGC), this paper is the first to propose leveraging powerful pre-trained text-to-image (T2I) foundation models to create the training data. However, manually designing textual prompts to generate images for all possible domains is often impractical, and some domain characteristics may be too abstract to describe with words. To address these challenges, we propose a novel Progressive Adversarial Prompt Tuning (PAPT) framework for pre-trained diffusion models. Instead of relying on static textual domains, our approach learns two sets of abstract prompts as conditions for the diffusion model: one that captures domain-invariant category information and another that models domain-specific styles. This adversarial learning mechanism enables the T2I model to generate images in various domain styles while preserving key categorical features. Extensive experiments demonstrate the effectiveness of the proposed method, achieving performance superior to state-of-the-art single-domain generalization approaches. Code is available in the supplementary materials.
Poster
Byung Hyun Lee · Sungjin Lim · Se Young Chun

[ ExHall D ]

Abstract
Fine-tuning based concept erasing has demonstrated promising results in preventing generation of harmful contents from text-to-image diffusion models by removing target concepts while preserving remaining concepts. To maintain the generation capability of diffusion models after concept erasure, it is necessary to remove only the image region containing the target concept when it appears locally in an image, leaving other regions intact. However, prior work often compromises the fidelity of other image regions in order to erase the localized target concept appearing in a specific area, thereby reducing the overall performance of image generation. To address these limitations, we first introduce a framework called localized concept erasure, which allows for the deletion of only the specific area containing the target concept in the image while preserving the other regions. As a solution for localized concept erasure, we propose a training-free approach, dubbed Gated Low-rank adaptation for Concept Erasure (GLoCE), that injects a lightweight module into the diffusion model. GLoCE consists of low-rank matrices and a simple gate, determined only by several generation steps for concepts without training. By directly applying GLoCE to image embeddings and designing the gate to activate only for target concepts, GLoCE can selectively remove only the …
Poster
Dohyun Kim · Sehwan Park · GeonHee Han · Seung Wook Kim · Paul Hongsuck Seo

[ ExHall D ]

Abstract
Diffusion models have emerged as a cornerstone of generative modeling, capable of producing high-quality images through a progressive denoising process. However, their remarkable performance comes with substantial computational costs, driven by large model sizes and the need for multiple sampling steps. Knowledge distillation, a popular approach for model compression, transfers knowledge from a complex teacher model to a simpler student model. While extensively studied for recognition tasks, its application to diffusion models—especially for generating unseen concepts absent from training images—remains relatively unexplored. In this work, we propose a novel approach called random conditioning, which pairs noised images with randomly chosen text conditions to enable efficient, image-free knowledge distillation. By leveraging random conditioning, we show that it is possible to generate unseen concepts not included in the training data. When applied to conditional diffusion model distillation, this method enables the student model to effectively explore the condition space, leading to notable performance gains. Our approach facilitates the resource-efficient deployment of generative diffusion models, broadening their accessibility for both research and practical applications.
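As a rough illustration of the random-conditioning idea described above, the hedged sketch below pairs noised inputs with randomly drawn condition embeddings and distills a toy teacher into a smaller student; all module names, sizes, and the condition pool are placeholders, not the paper's models or data.

```python
# Hedged sketch of "random conditioning" distillation: noised inputs are paired
# with randomly chosen conditions, and the student matches the teacher's noise
# prediction under that condition. Teacher/student here are toy modules.
import torch

class ToyDiffusion(torch.nn.Module):
    def __init__(self, dim=32, width=64):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim * 2, width), torch.nn.ReLU(),
            torch.nn.Linear(width, dim))

    def forward(self, x_t, cond):
        return self.net(torch.cat([x_t, cond], dim=-1))

teacher, student = ToyDiffusion(width=128), ToyDiffusion(width=32)
teacher.requires_grad_(False)                      # frozen teacher
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

condition_pool = torch.randn(100, 32)              # stand-in text-condition embeddings

for step in range(10):
    x_t = torch.randn(8, 32)                       # noised inputs (toy)
    idx = torch.randint(0, len(condition_pool), (8,))
    cond = condition_pool[idx]                     # randomly chosen conditions
    with torch.no_grad():
        target = teacher(x_t, cond)                # teacher's noise prediction
    loss = torch.nn.functional.mse_loss(student(x_t, cond), target)
    opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```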
Poster
Reza Shirkavand · Peiran Yu · Shangqian Gao · Gowthami Somepalli · Tom Goldstein · Heng Huang

[ ExHall D ]

Abstract
Recent advances in diffusion generative models have yielded remarkable progress. While the quality of generated content continues to improve, these models have grown considerably in size and complexity. This increasing computational burden poses significant challenges, particularly in resource-constrained deployment scenarios such as mobile devices. The combination of model pruning and knowledge distillation has emerged as a promising solution to reduce computational demands while preserving generation quality. However, this technique inadvertently propagates undesirable behaviors, including the generation of copyrighted content and unsafe concepts, even when such instances are absent from the fine-tuning dataset. In this paper, we propose a novel bilevel optimization framework for pruned diffusion models that consolidates the fine-tuning and unlearning processes into a unified phase. Our approach maintains the principal advantages of distillation—namely, efficient convergence and style transfer capabilities—while selectively suppressing the generation of unwanted content. This plug-in framework is compatible with various pruning and concept unlearning methods, facilitating efficient, safe deployment of diffusion models in controlled environments.
Poster
Jisu Nam · Soowon Son · Zhan Xu · Jing Shi · Difan Liu · Feng Liu · Seungryong Kim · Yang Zhou

[ ExHall D ]

Abstract
We introduce Visual Persona, a foundation model for text-to-image full-body human customization that, given a single in-the-wild human image, generates diverse images of the individual guided by text descriptions. Unlike prior methods that focus solely on preserving facial identity, our approach captures detailed full-body appearance, aligning with text descriptions for body structure and scene variations. Training this model requires large-scale paired human data, consisting of multiple images per individual with consistent full-body identities, which is notoriously difficult to obtain. To address this, we propose a data curation pipeline leveraging vision language models to evaluate full-body appearance consistency, resulting in Visual Persona-500K—a dataset of 580k paired human images across 100k unique identities. For precise appearance transfer, we introduce a transformer encoder-decoder architecture adapted to a pre-trained text-to-image diffusion model, which augments the input image into distinct body regions, encodes these regions as local appearance features, and projects them into dense identity embeddings independently to condition the diffusion model for synthesizing customized images. Visual Persona consistently surpasses existing approaches, generating high-quality, customized images from in-the-wild inputs. Extensive ablation studies validate design choices, and we demonstrate the versatility of Visual Persona across various downstream tasks. The code and pre-trained weights will be publicly …
Poster
Alexandra Gomez-Villa · Kai Wang · C.Alejandro Parraga · Bartłomiej Twardowski · Jesus Malo · Javier Vazquez-Corral · Joost van de Weijer

[ ExHall D ]

Abstract
Visual illusions in humans arise when interpreting out-of-distribution stimuli: if the observer is adapted to certain statistics, perception of outliers deviates from reality. Recent studies have shown that artificial neural networks (ANNs) can also be deceived by visual illusions. This revelation raises profound questions about the nature of visual information. Why are two independent systems, both human brains and ANNs, susceptible to the same illusions? Should any ANN be capable of perceiving visual illusions? Are these perceptions a feature or a flaw? In this work, we study how visual illusions are encoded in diffusion models. Remarkably, we show that they present human-like brightness/color shifts in their latent space. We use this fact to demonstrate that diffusion models can predict visual illusions. Furthermore, we also show how to generate new unseen visual illusions in realistic images using text-to-image diffusion models. We validate this ability through psychophysical experiments that show how our model-generated illusions also fool humans.
Poster
Zhenguang Liu · Chao Shuai · Shaojing Fan · Ziping Dong · Jinwu Hu · Zhongjie Ba · Kui Ren

[ ExHall D ]

Abstract
Diffusion models have achieved remarkable success in novel view synthesis, but their reliance on large, diverse, and often untraceable Web datasets has raised pressing concerns about image copyright protection. Current methods fall short in reliably identifying unauthorized image use, as they struggle to generalize across varied generation tasks and fail when the training dataset includes images from multiple sources with few identifiable (watermarked or poisoned) samples. In this paper, we present novel evidence that diffusion-generated images faithfully preserve the statistical properties of their training data, particularly reflected in their spectral features. Leveraging this insight, we introduce *CoprGuard*, a robust frequency domain watermarking framework to safeguard against unauthorized image usage in diffusion model training and fine-tuning. CoprGuard demonstrates remarkable effectiveness against a wide range of models, from naive diffusion models to sophisticated text-to-image models, and is robust even when watermarked images comprise a mere 1% of the training dataset. This robust and versatile approach empowers content owners to protect their intellectual property in the era of AI-driven image generation.
Poster
Haoyu Chen · Yunqiao Yang · Nan Zhong · Kede Ma

[ ExHall D ]

Abstract
Hiding data in deep neural networks (DNNs) has achieved remarkable successes, including both discriminative and generative models. Yet, the potential for hiding images in diffusion models remains underdeveloped. Existing approaches fall short in extraction fidelity, secrecy, and efficiency. In particular, the intensive computational demands of the hiding process, coupled with the slow extraction due to multiple denoising stages, make these methods impractical for resource-limited environments. To address these challenges, we propose hiding images at a specific denoising stage in diffusion models by modifying the learned score functions. We also introduce a parameter-efficient fine-tuning (PEFT) approach that combines parameter selection with a variant of low-rank adaptation (LoRA) to boost secrecy and hiding efficiency. Comprehensive experiments demonstrate the effectiveness of our proposed method.
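Since the method builds on low-rank adaptation, the minimal LoRA layer below may help readers unfamiliar with it. It shows only the generic frozen-weight-plus-low-rank-update structure, not the paper's parameter-selection variant; all names and sizes are illustrative.

```python
# Minimal LoRA layer sketch: the frozen base weight is augmented with a
# trainable low-rank update B @ A, scaled by alpha / rank.
import torch

class LoRALinear(torch.nn.Module):
    def __init__(self, in_dim, out_dim, rank=4, alpha=8.0):
        super().__init__()
        self.base = torch.nn.Linear(in_dim, out_dim)
        self.base.requires_grad_(False)              # frozen pretrained layer
        self.A = torch.nn.Parameter(torch.randn(rank, in_dim) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(out_dim, rank))  # starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(16, 16)
print(layer(torch.randn(2, 16)).shape)   # torch.Size([2, 16])
```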
Poster
Jan Dubiński · Antoni Kowalczuk · Franziska Boenisch · Adam Dziedzic

[ ExHall D ]

Abstract
Diffusion Models (DMs) benefit from large and diverse datasets for their training. Since this data is often scraped from the Internet without permission from the data owners, this raises concerns about copyright and intellectual property protections. While (illicit) use of data is easily detected for training samples perfectly re-created by a DM at inference time, it is much harder for data owners to verify if their data was used for training when the outputs from the suspect DM are not close replicas. Conceptually, membership inference attacks (MIAs), which detect if a given data point was used during training, present themselves as a suitable tool to address this challenge. However, we demonstrate that existing MIAs are not strong enough to reliably determine the membership of individual images in large, state-of-the-art DMs. To overcome this limitation, we propose CDI, a framework for data owners to identify whether their dataset was used to train a given DM. CDI relies on dataset inference techniques, i.e., instead of using the membership signal from a single data point, CDI leverages the fact that most data owners, such as providers of stock photography, visual media companies, or even individual artists, own datasets with multiple publicly exposed data …
Poster
Fabrizio Guillaro · Giada Zingarini · Ben Usman · Avneesh Sud · Davide Cozzolino · Luisa Verdoliva

[ ExHall D ]

Abstract
Successful forensic detectors can produce excellent results in supervised learning benchmarks but struggle to transfer to real-world applications. We believe this limitation is largely due to inadequate training data quality. While most research focuses on developing new algorithms, less attention is given to training data selection, despite evidence that performance can be strongly impacted by spurious correlations such as content, format, or resolution. A well-designed forensic detector should detect generator-specific artifacts rather than reflect data biases. To this end, we propose a bias-free training paradigm, where fake images are generated from real ones using the conditioning procedure of stable diffusion models. This ensures semantic alignment between real and fake images, allowing any differences to stem solely from the subtle artifacts introduced by AI generation. Through content-based augmentation, we show significant improvements in both generalization and robustness over state-of-the-art detectors and more calibrated results across 27 different generative models, including recent releases like FLUX and Stable Diffusion 3.5. Our findings emphasize the importance of careful dataset curation, highlighting the need for further research in dataset design. Code and data will be publicly available.
Poster
Antonio Andrea Gargiulo · Donato Crisostomi · Maria Sofia Bucarelli · Simone Scardapane · Fabrizio Silvestri · Emanuele Rodolà

[ ExHall D ]

Abstract
Task Arithmetic has emerged as a simple yet effective method to merge models without additional training. However, by treating entire networks as flat parameter vectors, it overlooks key structural information and is susceptible to task interference. In this paper, we study task vectors at the layer level, focusing on task layer matrices and their singular value decomposition. In particular, we concentrate on the resulting singular vectors, which we refer to as Task Singular Vectors (TSV). Recognizing that layer task matrices are often low-rank, we propose TSV-Compress, a simple procedure that compresses them to 10% of their original size while retaining 99% of accuracy. We further leverage this low-rank space to define a new measure of task interference based on the interaction of singular vectors from different tasks. Building on these findings, we introduce TSV-Merge, a novel model merging approach that combines compression with interference reduction, significantly outperforming existing methods.
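A minimal sketch of the underlying operation, assuming NumPy and an illustrative 10% keep ratio: the per-layer task matrix (fine-tuned minus pretrained weights) is decomposed with an SVD and truncated to its top singular directions. The function name is hypothetical, and the paper's interference measure and merging procedure are not shown.

```python
# Hedged sketch of low-rank compression of a layer's task matrix via SVD.
import numpy as np

def compress_task_matrix(delta_w, keep_ratio=0.1):
    """Keep only the top singular directions of a layer's task matrix."""
    u, s, vt = np.linalg.svd(delta_w, full_matrices=False)
    k = max(1, int(np.ceil(keep_ratio * len(s))))
    return u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]

rng = np.random.default_rng(0)
pretrained = rng.standard_normal((128, 128))
finetuned = pretrained + rng.standard_normal((128, 128)) * 0.05
task_matrix = finetuned - pretrained                 # the layer's "task matrix"
compressed = compress_task_matrix(task_matrix, keep_ratio=0.1)
err = np.linalg.norm(task_matrix - compressed) / np.linalg.norm(task_matrix)
print(f"relative reconstruction error: {err:.3f}")
```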
Poster
Dimitrios Karageorgiou · Symeon Papadopoulos · Ioannis Kompatsiaris · Efstratios Gavves

[ ExHall D ]

Abstract
Recent works have established that AI models introduce spectral artifacts into generated images and propose approaches for learning to capture them using labeled data. However, the significant differences in such artifacts among different generative models hinder these approaches from generalizing to generators not seen during training. In this work, we build upon the key idea that the spectral distribution of real images constitutes both an invariant and highly discriminative pattern for AI-generated image detection. To model this under a self-supervised setup, we employ masked spectral learning using the pretext task of frequency reconstruction. Since generated images constitute out-of-distribution samples for this model, we propose spectral reconstruction similarity to capture this divergence. Moreover, we introduce spectral context attention, which enables our approach to efficiently capture subtle spectral inconsistencies in images of any resolution. Our spectral AI-generated image detection approach (SPAI) achieves a 5.5% absolute improvement in AUC over the previous state-of-the-art across 13 recent generative approaches, while exhibiting robustness against common online perturbations. We will publicly share our code and data upon acceptance.
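The sketch below is only a rough, hedged illustration of scoring an image by comparing its log-magnitude spectrum with that of a reconstruction; here a trivial low-pass filter stands in for the paper's masked spectral learning model, so the scores it produces demonstrate the mechanics rather than the method.

```python
# Hedged illustration of a spectral reconstruction similarity score, with a
# trivial low-pass "reconstructor" standing in for a learned model.
import numpy as np

def log_magnitude_spectrum(img):
    return np.log1p(np.abs(np.fft.fftshift(np.fft.fft2(img))))

def toy_reconstruct(img, keep=0.25):
    """Stand-in 'reconstruction': keep only the central (low) frequencies."""
    f = np.fft.fftshift(np.fft.fft2(img))
    h, w = f.shape
    mask = np.zeros_like(f)
    ch, cw = int(h * keep) // 2, int(w * keep) // 2
    mask[h // 2 - ch:h // 2 + ch, w // 2 - cw:w // 2 + cw] = 1
    return np.real(np.fft.ifft2(np.fft.ifftshift(f * mask)))

def spectral_similarity(img):
    a = log_magnitude_spectrum(img).ravel()
    b = log_magnitude_spectrum(toy_reconstruct(img)).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

rng = np.random.default_rng(0)
print(spectral_similarity(rng.standard_normal((64, 64))))
```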
Poster
Jaewoo Song · Daemin Park · Kanghyun Baek · Sangyub Lee · Jooyoung Choi · Eunji Kim · Sungroh Yoon

[ ExHall D ]

Abstract
Developing effective visual inspection models remains challenging due to the scarcity of defect data, especially in new or low-defect-rate manufacturing processes. While recent approaches have attempted to generate defect images using image generation models, producing highly realistic defects remains difficult. In this paper, we propose DefectFill, a novel method for realistic defect generation that requires only a few reference defect images. DefectFill leverages a fine-tuned inpainting diffusion model, optimized with our custom loss functions that incorporate defect, object, and cross-attention terms. This approach enables the inpainting diffusion model to precisely capture detailed, localized defect features and seamlessly blend them into defect-free objects. Additionally, we introduce the Low-Fidelity Selection method to further enhance the quality of the generated defect samples. Experiments demonstrate that DefectFill can generate high-quality defect images, and visual inspection models trained on these images achieve state-of-the-art performance on the MVTec AD dataset.
Poster
Alexander Gielisse · Jan van Gemert

[ ExHall D ]

Abstract
Implicit neural representations (INRs) such as NeRF and SIREN encode a signal in neural network parameters and show excellent results for signal reconstruction. Using INRs for downstream tasks, such as classification, is however not straightforward. Inherent symmetries in the parameters pose classification challenges and current works primarily focus on designing architectures that are equivariant to these symmetries. However, INR-based classification still significantly underperforms compared to pixel-based methods like CNNs. This work presents an end-to-end strategy for initializing SIRENs together with a learned learning-rate scheme to yield representations that improve classification accuracy. We show that a simple, straightforward Transformer model applied to a meta-learned SIREN, without incorporating explicit symmetry equivariances, outperforms the current state-of-the-art. On the CIFAR-10 SIREN classification task, we improve the state-of-the-art from 38.8% to 60.1%. Moreover, we demonstrate scalability on the high-resolution Imagenette dataset, achieving reasonable reconstruction quality with a classification accuracy of 60.94% while, to our knowledge, no other SIREN classification approach has managed to set a baseline for high-resolution images. We will make all code and results available.
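For context, a SIREN is an MLP with sine activations fit to a signal; the sketch below shows such a network mapping 2D coordinates to RGB values. The w0 value and initialization follow common SIREN practice and are assumptions here, not the paper's meta-learned initialization.

```python
# Minimal SIREN-style network: sine activations with the usual frequency
# scaling w0 and uniform weight initialization.
import math
import torch

class SineLayer(torch.nn.Module):
    def __init__(self, in_dim, out_dim, w0=30.0, is_first=False):
        super().__init__()
        self.linear = torch.nn.Linear(in_dim, out_dim)
        self.w0 = w0
        bound = 1.0 / in_dim if is_first else math.sqrt(6.0 / in_dim) / w0
        torch.nn.init.uniform_(self.linear.weight, -bound, bound)

    def forward(self, x):
        return torch.sin(self.w0 * self.linear(x))

siren = torch.nn.Sequential(
    SineLayer(2, 64, is_first=True), SineLayer(64, 64), torch.nn.Linear(64, 3))
coords = torch.rand(1024, 2) * 2 - 1       # pixel coordinates in [-1, 1]
print(siren(coords).shape)                 # predicted RGB per coordinate
```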
Poster
Nathan Mankovich · Ignacio Santamaria · Gustau Camps-Valls · Tolga Birdal

[ ExHall D ]

Abstract
Flag manifolds encode hierarchical nested sequences of subspaces and serve as powerful structures for various computer vision and machine learning applications. Despite their utility in tasks such as dimensionality reduction, motion averaging, and subspace clustering, current applications are often restricted to extracting flags using common matrix decomposition methods like the singular value decomposition. Here, we address the need for a general algorithm to factorize and work with hierarchical datasets. In particular, we propose a novel, flag-based method that decomposes arbitrary hierarchical real-valued data into a hierarchy-preserving flag representation in Stiefel coordinates. Our work harnesses the potential of flag manifolds in applications including denoising, clustering, and few-shot learning.
Poster
Yiwei Bao · Zhiming Wang · Feng Lu

[ ExHall D ]

Abstract
Thanks to the introduction of large-scale datasets, deep-learning has become the mainstream approach for appearance-based gaze estimation problems. However, current large-scale datasets contain annotation errors and provide only a single vector for gaze annotation, lacking key information such as 3D eyeball structures. Limitations in annotation accuracy and variety have constrained the progress in research and development of deep-learning methods for appearance-based gaze-related tasks. In this paper, we present GazeGene, a new large-scale synthetic gaze dataset with photo-realistic samples. More importantly, GazeGene not only provides accurate gaze annotations, but also offers 3D annotations of vital eye structures such as the pupil, iris, eyeball, optic and visual axes for the first time. Experiments show that GazeGene achieves quality and generalization ability comparable to real-world datasets, and even outperforms most existing datasets on high-resolution images. Furthermore, its 3D eyeball annotations expand the application of deep-learning methods on various gaze-related tasks, offering new insights into this field.
Poster
Daosong Hu · Mingyue Cui · Kai Huang

[ ExHall D ]

Abstract
Gaze direction serves as a pivotal indicator for assessing the level of driver attention. While image-based gaze estimation has been extensively researched, there has been a recent shift towards capturing gaze direction from video sequences. This approach encounters notable challenges, including the comprehension of the dynamic pupil evolution across frames and the extraction of head pose information from a relatively static background. To surmount these challenges, we introduce a dual-stream deep learning framework that explicitly models the displacement changes of the pupil through a fine-grained inter-frame attention mechanism and generates weights to adjust gaze embeddings. This technique transforms the face into a set of distinct patches and employs cross-attention to ascertain the correlation between pixel displacements in various patches and adjacent frames, thereby tracking spatial dynamics within the sequence. Our method is validated using two publicly available driver gaze datasets, and the results indicate that it achieves state-of-the-art performance or is on par with the best outcomes while reducing the number of parameters.
Poster
Ziyang Chen · Prem Seetharaman · Bryan Russell · Oriol Nieto · David Bourgin · Andrew Owens · Justin Salamon

[ ExHall D ]

Abstract
Generating sound effects for videos often requires creating artistic sound effects that diverge significantly from real-life sources and flexible control in the sound design. To address this problem, we introduce *MultiFoley*, a model designed for video-guided sound generation that supports multimodal conditioning through text, audio, and video. Given a silent video and a text prompt, MultiFoley allows users to create clean sounds (e.g., skateboard wheels spinning without wind noise) or more whimsical sounds (e.g., making a lion's roar sound like a cat's meow). MultiFoley also allows users to choose reference audio from sound effects (SFX) libraries or partial videos for conditioning. A key novelty of our model lies in its joint training on both internet video datasets with low-quality audio and professional SFX recordings, enabling high-quality, full-bandwidth (48kHz) audio generation. Through automated evaluations and human studies, we demonstrate that *MultiFoley* successfully generates synchronized high-quality sounds across varied conditional inputs and outperforms existing methods.
Poster
Zeyue Tian · Zhaoyang Liu · Ruibin Yuan · Jiahao Pan · Qifeng Liu · Xu Tan · Qifeng Chen · Wei Xue · Yike Guo

[ ExHall D ]

Abstract
We present VidMuse, a simple framework for generating music aligned with video inputs. VidMuse stands out by producing high-fidelity music that is both acoustically and semantically aligned with the video. By incorporating local and global visual cues, VidMuse enables the creation of musically coherent audio tracks that seamlessly match the video content through Long-Short-Term modeling. Furthermore, we present a large-scale dataset comprising 360K video-music pairs encompassing various genres such as movie trailers, advertisements, and documentaries. Through extensive experiments, VidMuse outperforms existing models in terms of audio quality, diversity, and audio-visual alignment.
Poster
Edson Araujo · Andrew Rouditchenko · Yuan Gong · Saurabhchand Bhati · Samuel Thomas · Brian Kingsbury · Leonid Karlinsky · Rogerio Feris · James Glass · Hilde Kuehne

[ ExHall D ]

Abstract
Recent advances in audio-visual learning have shown promising results in learning representations across modalities. However, most approaches rely on global audio representations that fail to capture fine-grained temporal correspondences with visual frames. Additionally, existing methods often struggle with conflicting optimization objectives when trying to jointly learn reconstruction and cross-modal alignment. In this work, we propose CAV-MAE Sync as a simple yet effective extension of the original CAV-MAE framework for self-supervised audio-visual learning. We address three key challenges: First, we tackle the granularity mismatch between modalities by treating audio as a temporal sequence aligned with video frames, rather than using global representations. Second, we resolve conflicting optimization goals by separating contrastive and reconstruction objectives through dedicated global tokens. Third, we improve spatial localization by introducing learnable register tokens that reduce semantic load on patch tokens. We evaluate the proposed approach on AudioSet, VGG Sound, and the ADE20K Sound dataset on zero-shot retrieval, classification and localization tasks demonstrating state-of-the-art performance and outperforming more complex architectures.
Poster
Henghui Du · Guangyao Li · Chang Zhou · Chunjie Zhang · Alan Zhao · Di Hu

[ ExHall D ]

Abstract
In recent years, numerous tasks have been proposed to encourage models to develop specific capabilities for understanding audio-visual scenes, primarily categorized into temporal localization, spatial localization, spatio-temporal reasoning, and pixel-level understanding. In contrast, humans possess a unified understanding ability across diverse tasks. Therefore, designing an audio-visual model with the general capability to unify these tasks is of great value. However, simply training jointly on all tasks can lead to interference due to the heterogeneity of audio-visual data and the complex relationships among tasks. We argue that this problem can be solved through explicit cooperation among tasks. To achieve this goal, we propose a unified learning method that achieves explicit inter-task cooperation from both the data and model perspectives. Specifically, considering that the labels of existing datasets are simple words, we carefully refine these datasets and construct an Audio-Visual Unified Instruction-tuning dataset with Explicit reasoning process (AV-UIE), which clarifies the cooperative relationship among tasks. Subsequently, to facilitate concrete cooperation in the learning stage, an interaction-aware LoRA structure with multiple LoRA heads is designed to learn different aspects of audio-visual data interaction. By unifying explicit cooperation across both the data and model aspects, our method not only surpasses existing unified audio-visual models on multiple tasks, …
Poster
Stefan Smeu · Dragos-Alexandru Boldisor · Dan Oneata · Elisabeta Oneata

[ ExHall D ]

Abstract
Good datasets are essential for developing and benchmarking any machine learning system. Their importance is even more extreme for safety-critical applications such as deepfake detection—the focus of this paper. Here we reveal that two of the most widely used audio-video deepfake datasets suffer from a previously unidentified spurious feature: the leading silence. Fake videos start with a very brief moment of silence, and based on this feature alone, we can separate the real and fake samples almost perfectly. As such, previous audio-only and audio-video models exploit the presence of silence in the fake videos and consequently perform worse when the leading silence is removed. To avoid latching onto this unwanted artifact, and possibly other unrevealed ones, we propose a shift from supervised to unsupervised learning by training models exclusively on real data. We show that by aligning self-supervised audio-video representations we remove the risk of relying on dataset-specific biases and improve robustness in deepfake detection.
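The spurious cue described above can be illustrated in a few lines of NumPy: a clip is flagged from the energy of its opening window alone. The window length and threshold below are illustrative values, not numbers from the paper.

```python
# Hedged sketch of the "leading silence" shortcut: classify a clip as fake if
# the RMS energy of its first few hundred milliseconds is near zero.
import numpy as np

def has_leading_silence(waveform, sr=16000, window_s=0.3, thresh=1e-3):
    head = waveform[: int(sr * window_s)]
    rms = np.sqrt(np.mean(head ** 2) + 1e-12)
    return rms < thresh

rng = np.random.default_rng(0)
real_like = rng.standard_normal(16000) * 0.1              # speech-like energy
fake_like = np.concatenate([np.zeros(8000), real_like])   # brief leading silence
print(has_leading_silence(real_like), has_leading_silence(fake_like))  # False True
```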
Poster
Qiyao Xue · Xiangyu Yin · Boyuan Yang · Wei Gao

[ ExHall D ]

Abstract
Text-to-video (T2V) generation has been recently enabled by transformer-based diffusion models, but current T2V models lack the capability to adhere to real-world common knowledge and physical rules, due to their limited understanding of physical realism and deficiency in temporal modeling. Existing solutions are either data-driven or require extra model inputs, but cannot generalize to out-of-distribution domains. In this paper, we present PhyT2V, a new data-independent T2V technique that expands the current T2V model's capability of video generation to out-of-distribution domains, by enabling chain-of-thought and step-back reasoning in T2V prompting. Our experiments show that PhyT2V improves existing T2V models' adherence to real-world physical rules by 2.3x, and achieves 35% improvement compared to T2V prompt enhancers.
Poster
Tianhao Qi · Jianlong Yuan · Wanquan Feng · Shancheng Fang · Jiawei Liu · SiYu Zhou · Qian HE · Hongtao Xie · Yongdong Zhang

[ ExHall D ]

Abstract
Sora has unveiled the immense potential of the Diffusion Transformer (DiT) architecture in single-scene video generation. However, the more challenging task of multi-scene video generation, which offers broader applications, remains relatively underexplored. To address this gap, we introduce Mask2DiT, a novel approach that ensures fine-grained, one-to-one alignment between video segments and their corresponding text annotations. Specifically, we introduce a symmetric binary mask at each attention layer within the DiT architecture. This mask ensures that each text annotation is applied exclusively to its corresponding video segment, while preserving temporal coherence across all visual tokens. With this attention mask facilitating fine-grained, segment-level textual-to-visual alignment, we adapt the DiT architecture for video generation tasks involving a fixed number of scenes. To further equip the DiT architecture with the capability for generating videos with additional scenes, we incorporate a segment-level conditional mask that treats preceding video segments as context for the final segment, thereby enabling auto-regressive scene extension. Both qualitative and quantitative experiments confirm that Mask2DiT excels in maintaining visual consistency across segments while ensuring semantic alignment between each segment and its corresponding text description.
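A hedged sketch of the kind of symmetric binary attention mask described above: visual tokens attend to all visual tokens, while each text annotation interacts only with its own segment. The token layout (all visual tokens first, then per-segment text tokens) and the token counts are assumptions for illustration, not Mask2DiT's exact implementation.

```python
# Hedged sketch: build a boolean attention mask where visual tokens attend
# globally and each segment's text tokens attend only to that segment.
import numpy as np

def build_segment_mask(tokens_per_segment, text_tokens_per_segment, n_segments):
    v, t = tokens_per_segment, text_tokens_per_segment
    n = n_segments * (v + t)
    mask = np.zeros((n, n), dtype=bool)
    vis_total = n_segments * v
    mask[:vis_total, :vis_total] = True             # visual tokens attend globally
    for s in range(n_segments):
        vis = slice(s * v, (s + 1) * v)
        txt = slice(vis_total + s * t, vis_total + (s + 1) * t)
        mask[vis, txt] = True                        # segment sees its own text
        mask[txt, vis] = True                        # and vice versa (symmetric)
        mask[txt, txt] = True                        # text attends to itself
    return mask

m = build_segment_mask(tokens_per_segment=4, text_tokens_per_segment=2, n_segments=3)
print(m.shape, int(m.sum()))
```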
Poster
Yijie Xu · Bolun Zheng · Wei Zhu · Hangjia Pan · Yuchen Yao · Ning Xu · An-An Liu · Quan Zhang · Chenggang Yan

[ ExHall D ]

Abstract
The social media popularity prediction task aims to predict the popularity of posts on social media platforms, which has a positive driving effect on application scenarios such as content optimization, digital marketing, and online advertising. Though many studies have made significant progress, few pay much attention to integrating popularity prediction with temporal alignment. In this paper, by exploring YouTube’s multilingual and multi-modal content, we construct a new social media temporal popularity prediction benchmark, namely SMTPD, and suggest a baseline framework for temporal popularity prediction. Through data analysis and experiments, we verify that temporal alignment and early popularity play crucial roles in social media popularity prediction, not only deepening the understanding of the temporal dynamics of popularity in social media but also offering suggestions for developing more effective prediction models in this field.
Poster
Hui Han · Siyuan Li · Jiaqi Chen · Yiwen Yuan · Yuling Wu · Yufan Deng · Chak Tou Leong · Hanwen Du · Junchen Fu · Youhua Li · Jie Zhang · Chi Zhang · Li-jia Li · Yongxin Ni

[ ExHall D ]

Abstract
Video generation assessment is essential for ensuring that generative models produce visually realistic, high-quality videos while aligning with human expectations. Current video generation benchmarks fall into two main categories: traditional benchmarks, which use metrics and embeddings to evaluate generated video quality across multiple dimensions but often lack alignment with human judgments; and large language model (LLM)-based benchmarks, which, though capable of human-like reasoning, are constrained by a limited understanding of video quality metrics and cross-modal consistency. To address these challenges and establish a benchmark that better aligns with human preferences, this paper introduces HA-Video-Bench, a comprehensive benchmark featuring a rich prompt suite and extensive evaluation dimensions. This benchmark represents the first attempt to systematically leverage MLLMs across all dimensions relevant to video generation assessment in generative models. By incorporating few-shot scoring and chain-of-query techniques, HA-Video-Bench provides a structured, scalable approach to generated video evaluation. Experimental results demonstrate that MLLMs achieve superior alignment with human preferences across all dimensions. Moreover, in instances where our framework’s assessments diverge from human evaluations, it consistently offers more objective and accurate insights, suggesting an even greater potential advantage over traditional human judgment.
Poster
Wang Jiarui · Huiyu Duan · Guangtao Zhai · Juntong Wang · Xiongkuo Min

[ ExHall D ]

Abstract
The rapid advancement of large multimodal models (LMMs) has led to the rapid expansion of artificial intelligence generated videos (AIGVs), which highlights the pressing need for effective video quality assessment (VQA) models designed specifically for AIGVs. Current VQA models generally fall short in accurately assessing the perceptual quality of AIGVs due to the presence of unique distortions, such as unrealistic objects, unnatural movements, or inconsistent visual elements. To address this challenge, we first present AIGVQA-DB, a large-scale dataset comprising 36,576 AIGVs generated by 15 advanced text-to-video models using 1,048 diverse prompts. With these AIGVs, a systematic annotation pipeline including scoring and ranking processes is devised, which collects 370k expert ratings to date. Based on AIGVQA-DB, we further introduce AIGV-Assessor, a novel VQA model that leverages spatiotemporal features and LMM frameworks to capture the intricate quality attributes of AIGVs, thereby accurately predicting precise video quality scores and video pair preferences. Through comprehensive experiments on both AIGVQA-DB and existing AIGV databases, AIGV-Assessor demonstrates state-of-the-art performance, significantly surpassing existing scoring or evaluation methods in terms of multiple perceptual quality dimensions. The dataset and code will be released upon publication.
Poster
Niu Lian · Jun Li · Jinpeng Wang · Ruisheng Luo · Yaowei Wang · Shu-Tao Xia · Bin Chen

[ ExHall D ]

Abstract
Self-Supervised Video Hashing compresses videos into hash codes for efficient indexing and retrieval by learning meaningful video representations without the need for labeled data. State-of-the-art video hashing methods typically rely on random frame sampling, treating all frames equally. This approach leads to suboptimal hash codes by ignoring frame-specific information density and reconstruction difficulty. To address this limitation, we propose AutoSSVH, a method combining adversarial hard frame mining with hash contrastive learning based on Component Voting. Our adversarial sampling strategy automatically identifies and selects frames with higher reconstruction difficulty, discarding easily reconstructable frames to enhance training rigor and encoding capability. Additionally, we leverage Component Voting in hash contrastive learning, using class-specific anchors and the P2Set paradigm to effectively capture neighborhood information and complex inter-video relationships. Extensive experiments demonstrate that AutoSSVH surpasses existing methods in both accuracy and efficiency. Code and configurations will be released publicly.
Poster
Orr Zohar · Xiaohan Wang · Yann Dubois · Nikhil Mehta · Tong Xiao · Philippe Hansen-Estruch · Licheng Yu · Xiaofang Wang · Felix Juefei-Xu · Ning Zhang · Serena Yeung · Xide Xia

[ ExHall D ]

Abstract
Despite the rapid integration of video perception capabilities into Large Multi-modal Models (LMMs), what drives their video perception remains poorly understood. Consequently, many design decisions in this domain are made without proper justification or analysis. The high computational cost of training and evaluating such models, coupled with limited open research, hinders the development of video-LMMs. To address this, we present a comprehensive study that uncovers what effectively drives video understanding in LMMs. We begin by critically examining the primary contributors to the high computational requirements associated with video-LMM research and discover *Scaling Consistency*, wherein design and training decisions made on smaller models and datasets (up to a critical size) effectively transfer to larger models. Leveraging these insights, we explored many video-specific aspects of video-LMMs, including video sampling, architectures, data composition, training schedules, and more. Guided by these findings, we introduce **Apollo**, a state-of-the-art family of LMMs that achieve superior performance across different model sizes. Our models process over 1-hour videos efficiently, with the 3B parameter variant outperforming most existing 7B models. **Apollo**-7B is state-of-the-art compared to 7B LMMs with a 70.9 on MLVU, and 63.3 on Video-MME. Our code and models will be made available at publication.
Poster
Junbo Niu · Yifei Li · Ziyang Miao · Chunjiang Ge · Zhou Yuanhang · Qihao He · Xiaoyi Dong · Haodong Duan · Shuangrui Ding · Rui Qian · Pan Zhang · Yuhang Zang · Yuhang Cao · Conghui He · Jiaqi Wang

[ ExHall D ]

Abstract
Integrating past information and adapting to continuous video input are pivotal for human-level video understanding. Current benchmarks, however, focus on coarse-grained, video-level question-answering in offline settings, limiting real-time processing and adaptability for practical applications. To this end, we introduce **OVBench** (Online-Video-Benchmark), which assesses online video understanding through three modes: (1) **Backward Tracing**, (2) **Real-Time Visual Perception**, and (3) **Forward Active Responding**. OVBench consists of 12 tasks, comprising about 2,800 meta-annotations with fine-grained, event-level timestamps paired with 858 videos across 10 domains, encompassing egocentric activities, virtual gaming worlds, and cinematic scenes. To minimize bias, we employ automated generation pipelines and human annotation for meticulous curation. We design an effective problem generation and evaluation pipeline based on these high-quality samples and densely query Video-LLMs across the video streaming timeline. Extensive evaluations of nine Video-LLMs reveal that despite rapid advancements and improving performance on traditional benchmarks, existing models struggle with online video understanding. Our comprehensive evaluation reveals that the best-performing models still have a significant gap compared to human agents in online video understanding. We anticipate that OVBench will guide the development of Video-LLMs towards practical real-world applications and inspire future research in online video understanding.
Poster
Darshana Saravanan · Varun Gupta · Darshan Singh S · Zeeshan Khan · Vineet Gandhi · Makarand Tapaswi

[ ExHall D ]

Abstract
A fundamental aspect of compositional reasoning in a video is associating people and their actions across time. Recent years have seen great progress in general-purpose vision/video models and a move towards long-video understanding. While exciting, we take a step back and ask: are today’s models good at compositional reasoning on short videos? To this end, we introduce VELOCITI, a benchmark to study Video-LLMs by disentangling and assessing the comprehension of agents, actions, and their associations across multiple events. We adopt the Video-Language Entailment setup and propose StrictVLE that requires correct classification (rather than ranking) of the positive and negative caption. We evaluate several models and observe that even the best, LLaVA-OneVision (42.5%) and GPT-4o (44.3%), are far from human accuracy at 89.6%. Results show that action understanding lags behind agents, and negative captions created using entities appearing in the video perform worse than those obtained from pure text manipulation. We also present challenges with ClassicVLE and multiple-choice (MC) evaluation, strengthening our preference for StrictVLE. Finally, we validate that our benchmark requires visual inputs of multiple frames making it ideal to study video-language compositional reasoning.
Poster
Yuxuan Wang · Yueqian Wang · Bo Chen · Tong Wu · Dongyan Zhao · Zilong Zheng

[ ExHall D ]

Abstract
The rapid advancement of multi-modal language models (MLLMs) like GPT-4o has propelled the development of Omni language models, designed to process and proactively respond to continuous streams of multi-modal data. Despite their potential, evaluating their real-world interactive capabilities in streaming video contexts remains a formidable challenge. In this work, we introduce OmniMMI, a comprehensive multi-modal interaction benchmark tailored for OmniLLMs in streaming video contexts. OmniMMI encompasses over 1,121 real-world interactive videos and 2,290 questions, addressing two critical yet underexplored challenges in existing video benchmarks: streaming video understanding and proactive reasoning, across six distinct subtasks. Moreover, we propose a novel framework, Multi-modal Multiplexing Modeling (M4), designed to enhance real-time interactive reasoning with minimum finetuning on pre-trained MLLMs. Extensive experimental results reveal that the existing MLLMs fall short in interactive streaming understanding, particularly struggling with proactive tasks and multi-turn queries. Our proposed M4, though lightweight, demonstrates a significant improvement in handling proactive tasks and real-time interactions.
Poster
Ziyu Ma · Chenhui Gou · Hengcan Shi · Bin Sun · Shutao Li · Hamid Rezatofighi · Jianfei Cai

[ ExHall D ]

Abstract
Most of the existing methods for video understanding primarily focus on videos only lasting tens of seconds, with limited exploration of techniques for handling long videos. The increased number of frames in long videos poses two main challenges: difficulty in locating key information and performing long-range reasoning. Thus, we propose DrVideo, a document-retrieval-based system designed for long video understanding. Our key idea is to convert the long-video understanding problem into a long-document understanding task so as to effectively leverage the power of large language models. Specifically, DrVideo first transforms a long video into a coarse text-based long document to initially retrieve key frames and then updates the documents with the augmented key frame information. It then employs an agent-based iterative loop to continuously search for missing information and augment the document until sufficient question-related information is gathered for making the final predictions in a chain-of-thought manner. Extensive experiments on long video benchmarks confirm the effectiveness of our method. DrVideo significantly outperforms existing LLM-based state-of-the-art methods on EgoSchema benchmark (3 minutes), MovieChat-1K benchmark (10 minutes), and the long split of Video-MME benchmark (average of 44 minutes).
Poster
Lucas Ventura · Antoine Yang · Cordelia Schmid · Gul Varol

[ ExHall D ]

Abstract
We address the task of video chaptering, i.e., partitioning a long video timeline into semantic units and generating corresponding chapter titles. While relatively underexplored, automatic chaptering has the potential to enable efficient navigation and content retrieval in long-form videos. In this paper, we achieve strong chaptering performance on hour-long videos by efficiently addressing the problem in the text domain with our "Chapter-Llama" framework. Specifically, we leverage a pre-trained large language model (LLM) with large context window, and feed as input (i) speech transcripts and (ii) captions describing video frames, along with their respective timestamps. Given the inefficiency of exhaustively captioning all frames, we propose a lightweight speech-guided frame selection strategy based on speech transcripts and experimentally demonstrate remarkable advantages. We train the LLM to output timestamps for the chapter boundaries, as well as free-form chapter titles. This simple yet powerful approach scales to processing one-hour long videos in a single forward pass. Our results demonstrate substantial improvements (e.g., 18.7% F1 score) over the state of the art on the recent VidChapters-7M benchmark. To promote further research, we release our code and models.
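As a rough illustration of the text-domain formulation described above, the sketch below assembles timestamped speech-transcript lines and frame captions into a single chaptering prompt. The tags, wording, and requested output format are assumptions for illustration, not the prompt actually used by Chapter-Llama.

```python
# Hedged sketch: build a timestamped text prompt (transcript + frame captions)
# that asks an LLM for chapter boundaries and titles.
def build_chaptering_prompt(transcript, captions):
    """transcript/captions: lists of (timestamp_seconds, text) tuples."""
    lines = sorted(
        [(t, f"[ASR {t:07.1f}s] {txt}") for t, txt in transcript]
        + [(t, f"[CAP {t:07.1f}s] {txt}") for t, txt in captions])
    body = "\n".join(line for _, line in lines)
    return (
        "Partition the video into chapters. For each chapter output "
        "'<start timestamp> <title>' on its own line.\n\n" + body)

transcript = [(3.2, "Welcome back to the channel."), (95.0, "Now the recipe.")]
captions = [(0.0, "A person waves at the camera."), (120.5, "Dough is kneaded.")]
print(build_chaptering_prompt(transcript, captions))
```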
Poster
Tiantian Geng · Jinrui Zhang · Qingni Wang · Teng Wang · Jinming Duan · Feng Zheng

[ ExHall D ]

Abstract
Despite impressive advancements in video understanding, most efforts remain limited to coarse-grained or visual-only video tasks. However, real-world videos encompass omni-modal information (vision, audio, and speech) with a series of events forming a cohesive storyline. The lack of multi-modal video data with fine-grained event annotations and the high cost of manual labeling are major obstacles to comprehensive omni-modality video perception. To address this gap, we propose an automatic pipeline consisting of high-quality multi-modal video filtering, semantically coherent omni-modal event boundary detection, and cross-modal correlation-aware event captioning. In this way, we present LongVALE, the first-ever Vision-Audio-Language Event understanding benchmark comprising 105K omni-modal events with precise temporal boundaries and detailed relation-aware captions within 8.4K high-quality long videos. Further, we build a baseline that leverages LongVALE to enable video large language models (LLMs) for omni-modality fine-grained temporal video understanding for the first time. Extensive experiments demonstrate the effectiveness and great potential of LongVALE in advancing comprehensive multi-modal video understanding.
Poster
Yuqian Yuan · Hang Zhang · Wentong Li · Zesen Cheng · Boqiang Zhang · Long Li · Xin Li · Deli Zhao · Wenqiao Zhang · Yueting Zhuang · Jianke Zhu · Lidong Bing

[ ExHall D ]

Abstract
Video Large Language Models (Video LLMs) have recently exhibited remarkable capabilities in general video understanding. However, they mainly focus on holistic comprehension and struggle with capturing fine-grained spatial and temporal details. Besides, the lack of high-quality object-level video instruction data and a comprehensive benchmark further hinders their advancements. To tackle these challenges, we introduce the VideoRefer Suite to empower Video LLMs for finer-level spatial-temporal video understanding, i.e., enabling perception and reasoning on any objects throughout the video. Specifically, we thoroughly develop VideoRefer Suite across three essential aspects: dataset, model, and benchmark. Firstly, we introduce a multi-agent data engine to meticulously curate a large-scale, high-quality object-level video instruction dataset, termed VideoRefer-700K. Next, we present the VideoRefer model, which equips a versatile spatial-temporal object encoder to capture precise regional and sequential representations. Finally, we meticulously create a VideoRefer-Bench to comprehensively assess the spatial-temporal understanding capability of a Video LLM, evaluating it across various aspects. Extensive experiments and analyses demonstrate that our VideoRefer model not only achieves promising performance on video referring benchmarks but also facilitates general video understanding capabilities. Our codes, models, and dataset will be made publicly available.
Poster
Min Jung Lee · Dayoung Gong · Minsu Cho

[ ExHall D ]

Abstract
The exponential increase in video content poses significant challenges in terms of efficient navigation, search, and retrieval, thus requiring advanced video summarization techniques. Existing video summarization methods, which heavily rely on visual features and temporal dynamics, often fail to capture the semantics of video content, resulting in incomplete or incoherent summaries. To tackle the challenge, we propose a new video summarization framework that leverages the capabilities of recent Large Language Models (LLMs), expecting that the knowledge learned from massive data enables LLMs to evaluate video frames in a manner that better aligns with diverse semantics and human judgments, effectively addressing the inherent subjectivity in defining keyframes. Our method, dubbed LLM-based Video Summarization (LLMVS), translates video frames into a sequence of captions using an image caption model and then assesses the importance of each frame using an LLM, based on the captions in its local context. These local importance scores are refined through a global attention mechanism in the entire context of video captions, ensuring that our summaries effectively reflect both the details and the overarching narrative. Our experimental results demonstrate the superiority of the proposed method over existing ones in standard benchmarks, highlighting the potential of LLMs in the processing …
Poster
Keda Tao · Can Qin · Haoxuan You · Yang Sui · Huan Wang

[ ExHall D ]

Abstract
Video large language models (VLLMs) have significantly advanced recently in processing complex video content, yet their inference efficiency remains constrained because of the high computational cost stemming from the thousands of visual tokens generated from the video inputs. We empirically observe that, unlike single image inputs, VLLMs typically attend to visual tokens from different frames at different decoding iterations, making a one-shot pruning strategy prone to removing important tokens by mistake. Motivated by this, we present DyCoke, a training-free token compression method to optimize token representation and accelerate VLLMs. DyCoke incorporates a plug-and-play temporal compression module to minimize temporal redundancy by merging redundant tokens across frames, and applies dynamic KV cache reduction to prune spatially redundant tokens selectively. It ensures high-quality inference by dynamically retaining the critical tokens at each decoding step. Extensive experimental results demonstrate that DyCoke can outperform the prior SoTA counterparts, achieving a 1.5× inference speedup and a 1.4× memory reduction against the baseline VLLM, while still improving performance, with no training.
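A hedged sketch of the temporal-redundancy idea behind the compression module described above: tokens that are nearly identical to the corresponding token in the previous frame are dropped. The cosine-similarity criterion and the threshold are illustrative stand-ins, not DyCoke's exact merging rule.

```python
# Hedged sketch: drop visual tokens that are highly similar to the same spatial
# token in the previous frame, keeping only sufficiently novel tokens per frame.
import torch

def merge_redundant_tokens(frames, thresh=0.95):
    """frames: (T, N, D) visual tokens; returns the kept tokens per frame."""
    kept = [frames[0]]
    for t in range(1, frames.shape[0]):
        prev, cur = frames[t - 1], frames[t]
        sim = torch.nn.functional.cosine_similarity(cur, prev, dim=-1)  # (N,)
        kept.append(cur[sim < thresh])     # keep only sufficiently novel tokens
    return kept

frames = torch.randn(4, 16, 32)
frames[2] = frames[1]                      # make one frame fully redundant
kept = merge_redundant_tokens(frames)
print([k.shape[0] for k in kept])          # typically [16, 16, 0, 16]
```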
Poster
Chirag Parikh · Deepti Rawat · Rakshitha R. T. · Tathagata Ghosh · Ravi Kiran Sarvadevabhatla

[ ExHall D ]

Abstract
We introduce **RoadSocial, a large-scale, diverse VideoQA dataset tailored for generic road event understanding from social media narratives**. Unlike existing datasets limited by regional bias, viewpoint bias and expert-driven annotations, RoadSocial captures the global complexity of road events with varied geographies, camera viewpoints (CCTV, handheld, drones) and rich social discourse. Our scalable semi-automatic annotation framework leverages Text LLMs and Video LLMs to generate comprehensive question-answer pairs across 12 challenging QA tasks, pushing the boundaries of road event understanding. RoadSocial is derived from social media videos spanning **14M frames** and **414K social comments**, resulting in **a dataset with 13.2K videos, 674 tags and 260K high-quality QA pairs**. We **evaluate 18 Video LLMs (open-source and proprietary, driving-specific and general-purpose)** on our road event understanding benchmark. We also demonstrate RoadSocial's utility in improving road event understanding capabilities of general-purpose Video LLMs.
Poster
Tanveer Hannan · Md Mohaiminul Islam · Jindong Gu · Thomas Seidl · Gedas Bertasius

[ ExHall D ]

Abstract
Large language models (LLMs) excel at retrieving information from lengthy text, but their vision-language counterparts (VLMs) face difficulties with hour-long videos, especially for temporal grounding. Specifically, these VLMs are constrained by frame limitations, often losing essential temporal details needed for accurate event localization in extended video content. We propose ReVisionLLM, a recursive vision-language model designed to locate events in hour-long videos. Inspired by human search strategies, our model initially targets broad segments of interest, progressively revising its focus to pinpoint exact temporal boundaries. Our model can seamlessly handle videos of vastly different lengths—from minutes to hours. We also introduce a hierarchical training strategy that starts with short clips to capture distinct events and progressively extends to longer videos. To our knowledge, ReVisionLLM is the first VLM capable of temporal grounding in hour-long videos, outperforming previous state-of-the-art methods across multiple datasets by a significant margin (e.g., +2.6\% R1@0.1 on MAD). The code is available in the supplementary and will be released.
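The recursive coarse-to-fine search could look roughly like the sketch below, where `segment_score_fn` is a hypothetical stand-in for the model that rates how likely a segment is to contain the queried event; ReVisionLLM's actual procedure is more involved.

```python
from typing import Callable, Tuple

def recursive_ground(start: float, end: float,
                     segment_score_fn: Callable[[float, float], float],
                     num_splits: int = 4,
                     min_len: float = 60.0) -> Tuple[float, float]:
    """Return an approximate (start, end) window containing the queried event."""
    if end - start <= min_len:
        return start, end
    step = (end - start) / num_splits
    pieces = [(start + i * step, start + (i + 1) * step) for i in range(num_splits)]
    best = max(pieces, key=lambda p: segment_score_fn(*p))   # broad first pass
    return recursive_ground(best[0], best[1], segment_score_fn,
                            num_splits, min_len)             # progressively refine focus
```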
Poster
Ali Athar · Xueqing Deng · Liang-Chieh Chen

[ ExHall D ]

Abstract
Recent advances in multimodal large language models (MLLMs) have expanded research in video understanding, primarily focusing on high-level tasks such as video captioning and question-answering. Meanwhile, a smaller body of work addresses dense, pixel-precise segmentation tasks, which typically involve category-guided or referral-based object segmentation. Although both research directions are essential for developing models with human-level video comprehension, they have largely evolved separately, with distinct benchmarks and architectures. This paper aims to unify these efforts by introducing ViCaS, a new dataset containing thousands of challenging videos, each annotated with detailed, human-written captions and temporally consistent, pixel-accurate masks for multiple objects with phrase grounding. Our benchmark evaluates models on both holistic/high-level understanding and language-guided, pixel-precise segmentation. We also present carefully validated evaluation measures and propose an effective model architecture that can tackle our benchmark. All annotations, as well as the code and model weights will be made public.
Poster
Shehan Munasinghe · Hanan Gani · Wenqi Zhu · Jiale Cao · Eric P. Xing · Fahad Shahbaz Khan · Salman Khan

[ ExHall D ]

Abstract
Fine-grained alignment between videos and text is challenging due to complex spatial and temporal dynamics in videos. Existing video-based Large Multimodal Models (LMMs) handle basic conversations but struggle with precise pixel-level grounding in videos. To address this, we introduce VideoGLaMM, an LMM designed for fine-grained pixel-level grounding in videos based on user-provided textual inputs. Our design seamlessly connects three key components: a Large Language Model, a dual vision encoder that emphasizes both spatial and temporal details, and a spatio-temporal decoder for accurate mask generation. This connection is facilitated via tunable V→L and L→V adapters that enable close Vision-Language (VL) alignment. The architecture is trained to synchronize both spatial and temporal elements of video content with textual instructions. To enable fine-grained grounding, we curate a multimodal dataset featuring detailed visually-grounded conversations using a semi-automatic annotation pipeline, resulting in a diverse set of 38k video-QA triplets along with 83k objects and 671k masks. We evaluate VideoGLaMM on three challenging tasks: Grounded Conversation Generation, Visual Grounding, and Referring Video Segmentation. Experimental results show that our model consistently outperforms existing approaches across all three tasks.
Poster
Yanjun Li · Zhaoyang Li · Honghui Chen · li'Zhi Xu

[ ExHall D ]

Abstract
Video Scene Graph Generation (VidSGG) aims to capture dynamic relationships among entities by sequentially analyzing video frames and integrating visual and semantic information. However, VidSGG is challenged by significant biases that skew predictions. To mitigate these biases, we propose a \textbf{VI}sual and \textbf{S}emantic \textbf{A}wareness (VISA) framework for unbiased VidSGG. VISA addresses visual bias through an innovative memory update mechanism that enhances object representations and concurrently reduces semantic bias by iteratively integrating object features with comprehensive semantic information derived from triplet relationships. This visual-semantics dual debiasing approach results in more unbiased representations of complex scene dynamics. Extensive experiments demonstrate the effectiveness of our method, where VISA outperforms existing unbiased VidSGG approaches by a substantial margin (e.g., +13.1\% improvement in mR@20 and mR@50 for the SGCLS task under Semi Constraint).
Poster
Hang Yin · Xiuwei Xu · Linqing Zhao · Ziwei Wang · Jie Zhou · Jiwen Lu

[ ExHall D ]

Abstract
In this paper, we propose a general framework for universal zero-shot goal-oriented navigation. Existing zero-shot methods build inference frameworks upon large language models (LLMs) for specific tasks, which differ substantially in their overall pipelines and fail to generalize across different types of goals. Towards the aim of universal zero-shot navigation, we propose a uniform graph representation to unify different goals, including object category, instance image and text description. We also convert the observation of the agent into an online-maintained scene graph. With this consistent scene and goal representation, we preserve most structural information compared with pure text and are able to leverage LLMs for explicit graph-based reasoning. Specifically, we conduct graph matching between the scene graph and goal graph at each time instant and propose different strategies to generate the long-term goal of exploration according to different matching states. The agent first iteratively searches for a subgraph of the goal when zero-matched. With partial matching, the agent then utilizes coordinate projection and anchor pair alignment to infer the goal location. Finally, scene graph correction and goal verification are applied for perfect matching. We also present a blacklist mechanism to enable robust switching between stages. Extensive experiments on several benchmarks show that our UniGoal achieves state-of-the-art zero-shot …
Poster
Feiyu Pan · Hao Fang · Fangkai Li · Yanyu Xu · Yawei Li · Luca Benini · Xiankai Lu

[ ExHall D ]

Abstract
Referring video object segmentation (RVOS) aims to segment the objects within a video referred to by linguistic expressions. Existing RVOS solutions follow a "fuse then select" paradigm: establishing semantic correlation between visual and linguistic features, and performing frame-level query interaction to select the instance mask per frame with an instance segmentation module. This paradigm overlooks the challenge of the semantic gap between the linguistic descriptor and the video object, as well as the underlying clutter in the video. This paper proposes a novel Semantic and Sequential Alignment (SSA) paradigm to handle these challenges. We first insert a light adapter after the vision language model (VLM) to perform semantic alignment. Then, prior to selecting the mask per frame, we exploit trajectory-to-instance enhancement for each frame via sequential alignment. This paradigm reuses the visual-language alignment of the VLM during adaptation and tries to capture global information by ensembling trajectories. This helps understand videos and the corresponding descriptors by bridging the gap with complex activity semantics, particularly when facing occlusion or similar interference. SSA demonstrates competitive performance while maintaining remarkably low computational costs. Code is available at https://github.com/anonymous61888/SSA.
Poster
Yunxiang Fu · Meng Lou · Yizhou Yu

[ ExHall D ]

Abstract
High-quality semantic segmentation relies on three key capabilities: global context modeling, local detail encoding, and multi-scale feature extraction. However, recent methods struggle to possess all these capabilities simultaneously. Hence, we aim to empower segmentation networks to simultaneously carry out efficient global context modeling, high-quality local detail encoding, and rich multi-scale feature representation for varying input resolutions. In this paper, we introduce LAMSeg, a novel linear-time model comprising a hybrid feature encoder dubbed LAMNet, and a decoder based on state space models. Specifically, LAMNet synergistically integrates sliding local attention with dynamic state space models, enabling highly efficient global context modeling while preserving fine-grained local details. Meanwhile, the MMSCopE module in our decoder enhances multi-scale context feature extraction and adaptively scales with the input resolution. We comprehensively evaluate LAMSeg on three challenging datasets: ADE20K, Cityscapes, and COCO-Stuff. For instance, LAMSeg-B achieves 52.1\% mIoU on ADE20K, outperforming SegNeXt-L by 1.1\% mIoU while reducing computational complexity by over 20 GFLOPs. On Cityscapes, LAMSeg-B attains 83.8\% mIoU, surpassing SegFormer-B3 by 2.1\% mIoU with approximately half the GFLOPs. Similarly, LAMSeg-B improves upon VWFormer-B3 by 0.9\% mIoU with lower GFLOPs on the COCO-Stuff dataset.
Poster
Farzad Beizaee · Gregory A. Lodygensky · Christian Desrosiers · Jose Dolz

[ ExHall D ]

Abstract
The advanced image generation capabilities of recent diffusion models have spurred research into their application for reconstruction-based unsupervised anomaly detection. However, these methods may struggle with maintaining pixel-level structural integrity and recovering the anomaly-free content of abnormal regions, especially in multi-class scenarios. Furthermore, diffusion models are inherently designed to generate images from pure noise and struggle to selectively alter anomalous regions in an image while preserving normal ones. This leads to potential degradation of normal regions during the reconstruction process, hampering the effectiveness of anomaly detection. This paper introduces a reformulation of the standard diffusion model geared toward selective region alteration, allowing the accurate identification of anomalies. Our proposed Deviation correction diffusion (DeCo-Diff) model preserves the normal regions and encourages transformations exclusively on anomalous areas. By modeling anomalies as noise in the latent space, our method leverages the learned distribution of normal images to accurately reconstruct normal regions while altering only the anomalous areas. This selective approach enhances the reconstruction quality, facilitating effective unsupervised detection and localization of anomaly regions. Comprehensive evaluations demonstrate the superiority of our method in accurately identifying and localizing anomalies in complex images, with pixel-level AUPRC improvements of 11-14% over state-of-the-art models on well-known anomaly detection datasets. …
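A minimal sketch of the detection step implied by the abstract: the deviation between an input and its anomaly-free reconstruction serves as the anomaly map. The reconstruction model itself is abstracted away here as a hypothetical `correct_fn`.

```python
import numpy as np

def anomaly_map(image: np.ndarray, correct_fn) -> np.ndarray:
    """image: (H, W, C) array in [0, 1]; correct_fn returns the anomaly-free estimate."""
    restored = correct_fn(image)
    err = np.abs(image - restored).mean(axis=-1)          # per-pixel deviation
    return (err - err.min()) / (err.max() - err.min() + 1e-8)  # normalised score map
```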
Poster
Zhenghao Xing · Hao Chen · Binzhu Xie · Jiaqi Xu · Ziyu Guo · Xuemiao Xu · Jianye Hao · Chi-Wing Fu · Xiaowei Hu · Pheng-Ann Heng

[ ExHall D ]

Abstract
Traffic Anomaly Understanding (TAU) is essential for improving public safety and transportation efficiency by enabling timely detection and response to incidents. Beyond existing methods, which rely largely on visual data, we propose to consider audio cues, a valuable source that offers strong hints to anomaly scenarios such as crashes and honking. Our contributions are twofold. First, we compile AV-TAU, the first large-scale audio-visual dataset for TAU, providing 29,865 traffic anomaly videos and 149,325 Q&A pairs, while supporting five essential TAU tasks. Second, we develop EchoTraffic, a multimodal LLM that integrates audio and visual data for TAU, through our audio-insight frame selector and dynamic connector to effectively extract crucial audio cues for anomaly understanding with a two-phase training framework. Experimental results on AV-TAU show that EchoTraffic sets new SOTA performance in TAU, outperforming existing multimodal LLMs. Our contributions, including AV-TAU and EchoTraffic, pave a new direction for multimodal TAU. Our dataset and code will be publicly available upon publication of this work.
Poster
Han Hu · Wenli Du · Peng Liao · Bing Wang · Siyuan Fan

[ ExHall D ]

Abstract
Due to the interference of background noise, existing video anomaly detection methods are prone to detect some normal events in complex scenes as anomalies. Meanwhile, we note that the diversity of normal patterns has not been adequately considered, i.e., the normal events that are worthy of reference in the test data have not been properly utilized, which raises the risk of missed and false detections. In this work, we combine the tasks of next-frame prediction and predicted-frame reconstruction to propose a noise-resistant video anomaly detection method. For the prediction task, we develop an RGB Error-Guided Multiscale Predictive Coding (EG-MPC) framework to overcome the interference of background noise on the learning of appearance and motion features of objects at various scales, thus achieving high-quality frame prediction. For the reconstruction task, we introduce the Dynamic Memory Modules (DMMs) into the reconstruction network, and design sparse aggregation and selective update strategies for the memory items in the DMMs to effectively represent diverse normal patterns while increasing the difficulty of reconstructing anomalies, thus making it easier to distinguish between normal and abnormal frames. Extensive experiments on four benchmark datasets demonstrate that our proposed method outperforms state-of-the-art approaches, especially in complex scenarios.
Poster
Yuhan Shen · Ehsan Elhamifar

[ ExHall D ]

Abstract
We introduce and develop a framework for Multi-Task Temporal Action Segmentation (MT-TAS), a novel paradigm that addresses the challenges of interleaved actions when performing multiple tasks simultaneously. Traditional action segmentation models, trained on single-task videos, struggle to handle task switches and complex scenes inherent in multi-task scenarios. To overcome these challenges, our MT-TAS approach synthesizes multi-task video data from single-task sources using our Multi-task Sequence Blending and Segment Boundary Learning modules. Additionally, we propose to dynamically isolate foreground and background elements within video frames, addressing the intricacies of object layouts in multi-task scenarios and enabling a new two-stage temporal action segmentation framework with Foreground-Aware Action Refinement. Also, we introduce the Multi-task Egocentric Kitchen Activities (MEKA) dataset, containing 12 hours of egocentric multi-task videos, to rigorously benchmark MT-TAS models. Extensive experiments demonstrate that our framework effectively bridges the gap between single-task training and multi-task testing, advancing temporal action segmentation with state-of-the-art performance in complex environments.
Poster
Mengmeng Wang · Zeyi Huang · Xiangjie Kong · Guojiang Shen · Guang Dai · Jingdong Wang · Yong Liu

[ ExHall D ]

Abstract
Video action recognition involves interpreting both global context and specific details to accurately identify actions. While previous models are effective at capturing spatiotemporal features, they often lack a focused representation of key action details. To address this, we introduce FocusVideo, a framework designed for refining video action recognition through integrated global and local feature learning. Inspired by human visual cognition theory, our approach balances the focus on both broad contextual changes and action-specific details, minimizing the influence of irrelevant background noise. We first employ learnable action queries to selectively emphasize action-relevant regions without requiring region-specific labels. Next, these queries are learned by a local action streaming branch that enables progressive query propagation. Moreover, we introduce a parameter-free feature interaction mechanism for effective multi-scale interaction between global and local features with minimal additional overhead. Extensive experiments demonstrate that FocusVideo achieves state-of-the-art performance across multiple action recognition datasets, validating its effectiveness and robustness in handling action-relevant details.
Poster
Ziyu Yao · Xuxin Cheng · Zhiqi Huang · Lei Li

[ ExHall D ]

Abstract
Repetitive action counting, which aims to count periodic movements in a video, is valuable for video analysis applications such as fitness monitoring. However, existing methods largely rely on handcrafted models with limited representational capacity, which hampers their ability to accurately capture variable periodic patterns. Additionally, their supervised learning on narrow, limited training sets leads to overfitting and restricts their ability to generalize across diverse scenarios. To address these challenges, we propose CountLLM, the first large language model (LLM)-based framework that takes video data and periodic text prompts as inputs and outputs the desired counting value. CountLLM leverages the rich clues from explicit textual instructions and the powerful representational capabilities of pre-trained LLMs for repetitive action counting. To effectively guide CountLLM, we develop a periodicity-based structured template for instructions that describes the properties of periodicity and implements a standardized answer format to ensure consistency. Additionally, we propose a progressive multimodal training paradigm to enhance the periodicity-awareness of the LLM. Empirical evaluations on widely recognized benchmarks demonstrate CountLLM's superior performance and generalization, particularly in handling novel and out-of-domain actions that deviate significantly from the training data, offering a promising avenue for repetitive action counting.
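For illustration only, a periodicity-oriented prompt with a standardized answer format might look like the following; the exact wording of CountLLM's template and answer format is not given in the abstract.

```python
# Illustrative prompt skeleton and answer parser; not CountLLM's released template.
PERIODICITY_PROMPT = (
    "You are given features of a video. A repetitive action is a motion "
    "pattern that recurs with a roughly constant period.\n"
    "Question: how many times is the action '{action}' repeated?\n"
    "Answer strictly in the format: 'Count: <integer>'."
)

def parse_count(llm_output: str) -> int:
    """Parse the standardized answer format back into an integer."""
    return int(llm_output.split("Count:")[-1].strip().split()[0])

print(PERIODICITY_PROMPT.format(action="jumping jack"))
print(parse_count("Count: 12"))
```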
Poster
Xiaoyan Ma · jidong kuang · Hongsong Wang · Jie Gui

[ ExHall D ]

Abstract
Skeleton-based human action recognition has received widespread attention in recent years due to its diverse range of application scenarios. Because human skeletons come from different sources, skeleton data naturally exhibit heterogeneity. Previous works, however, overlook the heterogeneity of human skeletons and solely construct models tailored for homogeneous skeletons. This work addresses the challenge of heterogeneous skeleton-based action representation learning, specifically focusing on processing skeleton data that varies in joint dimensions and topological structures. The proposed framework comprises two primary components: heterogeneous skeleton processing and unified representation learning. The former first converts two-dimensional skeleton data into three-dimensional skeletons via an auxiliary network, and then constructs a prompted unified skeleton using skeleton-specific prompts. We also design an additional modality named semantic motion encoding to harness the semantic information within skeletons. The latter module learns a unified action representation using a shared backbone network that processes the different heterogeneous skeletons produced by the former module. Extensive experiments on the NTU-60, NTU-120, and PKU-MMD II datasets demonstrate the effectiveness of our method in various tasks, such as action recognition, action retrieval and semi-supervised action recognition.
Poster
Xiaohai Li · Bineng Zhong · Qihua Liang · Zhiyi Mo · Jian Nong · Shuxiang Song

[ ExHall D ]

Abstract
The consistency between the semantic information provided by the multi-modal reference and the tracked object is crucial for visual-language (VL) tracking. However, existing VL tracking frameworks rely on static multi-modal references to locate dynamic objects, which can lead to semantic discrepancies and reduce the robustness of the tracker. To address this issue, we propose a novel vision-language tracking framework, named DUTrack, which captures the latest state of the target by dynamically updating multi-modal references to maintain consistency. Specifically, we introduce a Dynamic Language Update Module, which leverages a large language model to generate dynamic language descriptions for the object based on visual features and object category information. Then, we design a Dynamic Template Capture Module, which captures the regions in the image that highly match the dynamic language descriptions. Furthermore, to ensure the efficiency of description generation, we design an update strategy that assesses changes in target displacement, scale, and other factors to decide on updates. Finally, the dynamic template and language descriptions that record the latest state of the target are used to update the multi-modal references, providing more accurate reference information for subsequent inference and enhancing the robustness of the tracker. DUTrack achieves new state-of-the-art performance on four mainstream vision-language …
Poster
Yu Guo · Weiquan Liu · Qingshan Xu · Shijun Zheng · Shujun Huang · Yu Zang · Siqi Shen · Chenglu Wen · Cheng Wang

[ ExHall D ]

Abstract
Adversarial Examples can mislead deep neural networks with subtle perturbations, causing them to make incorrect predictions. Notably, adversarial examples crafted for one model can also deceive other models, a phenomenon known as the transferability of adversarial examples. To improve transferability, existing research has designed various mechanisms centered on the complex interactions between data and models. However, their improvements are relatively limited. Moreover, since these methods are often designed for a specific data modality, this greatly restricts their scalability on other data modalities. In this work, we observe a mirroring relationship between model generalization and adversarial example transferability. Motivated by this observation, we propose an augmentation-based attack, called OPS (Operator-Perturbation-based Stochastic optimization), which constructs a stochastic optimization problem by input transformation operators and random perturbations, and solves this problem to generate adversarial examples with better transferability. Extensive experiments on both images and 3D point clouds demonstrate that OPS significantly outperforms existing SOTA methods in terms of both performance and cost, showcasing the universality and superiority of our approach.
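A minimal PGD-style sketch of the idea of optimizing over randomly sampled input-transformation operators and perturbations; the operator set, loss, and step sizes below are assumptions rather than the paper's exact OPS formulation.

```python
import random
import torch

def ops_attack(x, y, model, operators, eps=8/255, alpha=2/255, steps=10, noise_std=0.05):
    """x: input batch in [0,1]; operators: list of callables, e.g.
    [lambda t: t, lambda t: torch.flip(t, dims=[-1])]."""
    for p in model.parameters():
        p.requires_grad_(False)                          # only the perturbation is optimized
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        op = random.choice(operators)                    # sample a transformation operator
        noised = x + delta + noise_std * torch.randn_like(x)  # sample a random perturbation
        loss = torch.nn.functional.cross_entropy(model(op(noised)), y)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()           # ascend the expected loss
            delta.clamp_(-eps, eps)                      # stay within the perturbation budget
        delta.grad.zero_()
    return (x + delta).detach()
```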
Poster
Yuning Han · Bingyin Zhao · Rui Chu · Feng Luo · Biplab Sikdar · Yingjie Lao

[ ExHall D ]

Abstract
Recent studies show that diffusion models (DMs) are vulnerable to backdoor attacks. Existing backdoor attacks impose unconcealed triggers (e.g., a gray box and eyeglasses) that contain evident patterns, rendering remarkable attack effects yet easy detection upon human inspection and defensive algorithms. While it is possible to improve stealthiness by reducing the strength of the backdoor, doing so can significantly compromise its generality and effectiveness. In this paper, we propose UIBDiffusion, the universal imperceptible backdoor attack for diffusion models, which allows us to achieve superior attack and generation performance while evading state-of-the-art defenses. We propose a novel trigger generation approach based on universal adversarial perturbations (UAPs) and reveal that such perturbations, which are initially devised for fooling pre-trained discriminative models, can be adapted as potent imperceptible backdoor triggers for DMs. We evaluate UIBDiffusion on multiple types of DMs with different kinds of samplers across various datasets and targets. Experimental results demonstrate that UIBDiffusion brings three advantages: 1) Universality, the imperceptible trigger is universal (i.e., image and model agnostic) where a single trigger is effective to any images and all diffusion models with different samplers; 2) Utility, it achieves comparable generation quality (e.g., FID) and even better attack success rate (i.e., ASR) …
Poster
Wei Ao · Vishnu Naresh Boddeti

[ ExHall D ]

Abstract
Face recognition is central to many authentication, security, and personalized applications. Yet, it suffers from significant privacy risks, particularly concerning unauthorized access to sensitive biometric data. This paper presents CryptoFace, the first end-to-end encrypted face recognition system that ensures secure processing of facial data from acquisition through storage and matching without exposing raw facial images or features at any stage. CryptoFace leverages fully homomorphic encryption (FHE) for encrypted feature extraction, feature matching, and comparison while maintaining high face recognition performance. It employs a mixture of shallow patch convolutional networks (PCNNs), which can be evaluated in parallel with FHE and lead to much faster inference. It is scalable to high-resolution face data without sacrificing inference speed and optimizes homomorphic neural architecture by minimizing the multiplicative depth. We evaluate the performance and computational efficiency of CryptoFace on multiple encrypted benchmark face datasets. CryptoFace exhibits a significant acceleration of 7.2× (9,845s to 1,364s per image on CPU) compared to state-of-the-art FHE-based neural networks adapted for face recognition. CryptoFace will facilitate the deployment of secure biometric authentication systems in applications requiring strict privacy and security guarantees.
Poster
Xinjie Cui · Yuezun Li · Ao Luo · Jiaran Zhou · Junyu Dong

[ ExHall D ]

Abstract
We describe the Forensics Adapter, an adapter network designed to transform CLIP into an effective and generalizable face forgery detector. Although CLIP is highly versatile, adapting it for face forgery detection is non-trivial as forgery-related knowledge is entangled with a wide range of unrelated knowledge. Existing methods treat CLIP merely as a feature extractor, lacking task-specific adaptation, which limits their effectiveness. To address this, we introduce an adapter to learn face forgery traces -- the blending boundaries unique to forged faces, guided by task-specific objectives. Then we enhance the CLIP visual tokens with a dedicated interaction strategy that communicates knowledge across CLIP and the adapter. Since the adapter is alongside CLIP, its versatility is highly retained, naturally ensuring strong generalizability in face forgery detection. With only 5.7M trainable parameters, our method achieves a significant performance boost, improving by approximately 7\% on average across five standard datasets. We believe the proposed method can serve as a baseline for future CLIP-based face forgery detection methods.
Poster
Haoran Wang · Xinji Mai · Zeng Tao · Xuan Tong · Junxiong Lin · Yan Wang · Jiawen Yu · Shaoqi Yan · Ziheng Zhou · Wenqiang Zhang

[ ExHall D ]

Abstract
The current advancements in Dynamic Facial Expression Recognition (DFER) methods mainly focus on better capturing the spatial and temporal features of facial expressions. However, DFER datasets contain a substantial amount of noisy samples, and few have addressed the issue of handling this noise. We identified two types of noise: one is caused by low-quality data resulting from factors such as occlusion, dim lighting, and blurriness; the other arises from mislabeled data due to annotation bias by annotators. Addressing the two types of noise, we have meticulously crafted a \textbf{D}ynamic \textbf{D}ual-\textbf{S}tage \textbf{P}urification (D2SP) Framework. This initiative aims to dynamically purify the DFER datasets of these two types of noise, ensuring that only high-quality and correctly labeled data is used in the training process. To mitigate low-quality samples, we introduce the Coarse-Grained Pruning (CGP) stage, which computes sample weights and prunes those low-weight samples. After CGP, the Fine-Grained Correction (FGC) stage evaluates prediction stability to correct mislabeled data. Moreover, D2SP is conceived as a general and plug-and-play framework, tailored to integrate seamlessly with prevailing DFER methods. Extensive experiments covering prevalent DFER datasets and deploying multiple benchmark methods have substantiated D2SP’s ability to significantly enhance performance metrics.
Poster
Tianyi Wang · Zichen Wang · Cong Wang · Yuanchao Shu · Ruilong Deng · Peng Cheng · Jiming Chen

[ ExHall D ]

Abstract
Object detection is a fundamental enabler for many real-time downstream applications such as autonomous driving, augmented reality and supply chain management. However, the algorithmic backbone of neural networks is brittle to imperceptible perturbations in the system inputs, which are generally known as misclassification attacks. A new class of latency attacks that target this real-time processing capability has been reported recently. They exploit new attack surfaces in object detectors by creating a computational bottleneck in the post-processing module, which leads to cascading failures and puts real-time downstream tasks at risk. In this work, we make an initial attempt to defend against this attack via background-attentive adversarial training that is also cognizant of the underlying hardware capabilities. We first draw system-level connections between the latency attack and hardware capacity across heterogeneous devices. Based on the particular adversarial behaviors, we utilize objectness loss as a proxy and build background attention into the adversarial training pipeline, and achieve a reasonable balance between clean and robust accuracy. Extensive experiments demonstrate the effectiveness of the defense, restoring real-time processing capability from 13 FPS to 43 FPS on a Jetson Orin NX with a better trade-off between clean and robust accuracy.
Poster
Wei Huang · Qinying Gu · Nanyang Ye

[ ExHall D ]

Abstract
Offline reinforcement learning (RL) enables policy training solely on pre-collected data, avoiding direct environment interaction—a crucial benefit for energy-constrained embodied AI applications. Although Artificial Neural Networks (ANN)-based methods perform well in offline RL, their high computational and energy demands motivate exploration of more efficient alternatives. Spiking Neural Networks (SNNs) show promise for such tasks, given their low power consumption. In this work, we introduce DSFormer, the first spike-driven transformer model designed to tackle offline RL via sequence modeling. Unlike existing SNN transformers focused on spatial dimensions for vision tasks, we develop Temporal Spiking Self-Attention (TSSA) and Positional Spiking Self-Attention (PSSA) in DSFormer to capture the temporal and positional dependencies essential for sequence modeling in RL. Additionally, we propose Progressive Threshold-dependent Batch Normalization (PTBN), which combines the benefits of LayerNorm and BatchNorm to preserve temporal dependencies while maintaining the spiking nature of SNNs. Comprehensive results in the D4RL benchmark show DSFormer’s superiority over both SNN and ANN counterparts, achieving 78.4\% energy savings, highlighting DSFormer's advantages not only in energy efficiency but also in competitive performance.
Poster
Zhiqi Pang · Junjie Wang · Lingling Zhao · Chunyu Wang

[ ExHall D ]

Abstract
Clothing change person re-identification (CC-ReID) aims to match different images of the same person, even when the clothing varies across images. To reduce manual labeling costs, existing unsupervised CC-ReID methods employ clustering algorithms to generate pseudo-labels. However, they often fail to assign the same pseudo-label to two images with the same identity but different clothing—referred to as a clothing change positive pair—thus hindering clothing-invariant feature learning. To address this issue, we propose the identity-clothing similarity modeling (ICSM) framework. To effectively connect clothing change positive pairs, ICSM first performs clothing-aware learning to leverage all discriminative information, including clothing, to obtain compact clusters. It then extracts cluster-level identity and clothing features and performs inter-cluster similarity estimation to identify clothing change positive clusters, reliable negative clusters, and hard negative clusters for each compact cluster. During optimization, we design an adaptive version of existing optimization methods to enhance similarities of clothing change positive pairs, while also introducing text semantics as a supervisory signal to further promote clothing invariance. Extensive experimental results across multiple datasets validate the effectiveness of the proposed framework, demonstrating its superiority over existing unsupervised methods and its competitiveness with some supervised approaches.
Poster
Jinxi Yang · He Li · Bo Du · Mang Ye

[ ExHall D ]

Abstract
Person re-identification (ReID) is the task of matching individuals across different camera views. Existing approaches typically employ neural networks to extract discriminative features, ranking gallery images based on their similarities to probe images. While effective, these methods are often further enhanced through re-ranking, which refines the initial retrieval results without additional training. However, current re-ranking methods mostly rely on k-nearest neighbor search to extract similar images that might have the same identity as the query, which is time-consuming and computationally heavy, limiting their applications in practice. We rethink the effect of the k-nearest neighbor search and introduce the Chebyshev's Theorem-guided Graph Re-ranking (Cheb-GR) method, which replaces the k-nearest neighbor search with an adaptive neighbor search guided by Chebyshev's Theorem for efficient neighbor selection. Our method leverages graph convolution operations to refine image features and achieve robust re-ranking, leading to enhanced retrieval performance. Furthermore, we provide a theoretical analysis based on Chebyshev's Inequality to elucidate the factors contributing to the strong performance of the proposed method. Our method significantly reduces computation costs while maintaining relatively strong performance. Through extensive experiments in both general and cross-domain settings, we demonstrate the effectiveness of Cheb-GR and its potential for real-world applications. Code …
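One way to read the Chebyshev-guided adaptive neighbor rule, sketched under the assumption that neighbors are gallery items whose similarity deviates from the mean by more than k standard deviations (Chebyshev's inequality bounds how rarely that happens); the rule actually used in Cheb-GR may differ.

```python
import numpy as np

def adaptive_neighbors(sim_row: np.ndarray, k_sigma: float = 2.0) -> np.ndarray:
    """sim_row: similarities of one query to all gallery images.
    Chebyshev: P(|X - mu| >= k*sigma) <= 1/k**2, so values above mu + k*sigma are rare."""
    mu, sigma = sim_row.mean(), sim_row.std()
    return np.where(sim_row > mu + k_sigma * sigma)[0]   # keep only outlying-high similarities

sims = np.random.rand(1000).astype(np.float32)
sims[:5] += 1.0            # a few genuinely matching identities stand out
print(adaptive_neighbors(sims))
```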
Poster
Ji Du · Fangwei Hao · Mingyang Yu · Desheng Kong · Jiesheng Wu · Bin Wang · Jing XU · Ping Li

[ ExHall D ]

Abstract
Camouflaged Object Detection (COD) seeks to distinguish objects from their highly similar backgrounds. Existing work has essentially focused on isolating camouflaged objects from the environment, demonstrating ever-improving performance but at the cost of extensive annotations and complex optimizations. In this paper, we diverge from this paradigm and shift the lens to isolating the salient environment from the camouflaged object. We introduce EASE, an Environment-Aware unSupErvised COD framework that identifies the environment by referencing an environment prototype library and detects camouflaged objects by inverting the retrieved environmental features. Specifically, our approach (DiffPro) uses large multimodal models, diffusion models, and vision-foundation models to construct the environment prototype library. To retrieve environments from the library and refrain from confusing foreground and background, we incorporate three retrieval schemes: Kernel Density Estimation-based Adaptive Threshold (KDE-AT), Global-to-Local pixel-level retrieval (G2L), and Self-Retrieval (SR). Our experiments demonstrate significant improvements over current unsupervised methods, with EASE achieving an average gain of over 10\% on the COD10K dataset. When integrated with SAM, EASE surpasses prompt-based segmentation approaches and performs competitively with state-of-the-art fully-supervised methods.
Poster
Yi Yu · Botao Ren · Peiyuan Zhang · Mingxin Liu · Junwei Luo · Shaofeng Zhang · Feipeng Da · Junchi Yan · Xue Yang

[ ExHall D ]

Abstract
With the rapidly increasing demand for oriented object detection (OOD), recent research involving weakly-supervised detectors for learning OOD from point annotations has gained great attention. In this paper, we rethink this challenging task setting with the layout among instances and present Point2RBox-v2. At the core are three principles: 1) Gaussian overlap loss. It learns an upper bound for each instance by treating objects as 2D Gaussian distributions and minimizing their overlap. 2) Voronoi watershed loss. It learns a lower bound for each instance through watershed on Voronoi tessellation. 3) Consistency loss. It learns the size/rotation variation between two output sets with respect to an input image and its augmented view. Supplemented by a few devised techniques, e.g. edge loss and copy-paste, the detector is further enhanced. To the best of our knowledge, Point2RBox-v2 is the first approach to explore the spatial layout among instances for learning point-supervised OOD. Our solution is elegant and lightweight, yet it is expected to give competitive performance, especially in densely packed scenes: 62.61%/86.15%/34.71% on DOTA/HRSC/FAIR1M. The source code will be made publicly available.
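The abstract does not state which overlap measure the Gaussian overlap loss minimizes; one plausible, purely illustrative choice is the Bhattacharyya coefficient between two 2D Gaussians, sketched below.

```python
import numpy as np

def bhattacharyya_overlap(mu1, cov1, mu2, cov2) -> float:
    """Bhattacharyya coefficient between two Gaussians: 1.0 for identical, -> 0 when disjoint."""
    cov = 0.5 * (cov1 + cov2)
    diff = (mu1 - mu2).reshape(-1, 1)
    d = 0.125 * float(diff.T @ np.linalg.inv(cov) @ diff) \
        + 0.5 * np.log(np.linalg.det(cov) /
                       np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return float(np.exp(-d))

# Well-separated instance Gaussians give a small coefficient, hence a small penalty.
print(bhattacharyya_overlap(np.array([0., 0.]), np.eye(2), np.array([5., 0.]), np.eye(2)))
```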
Poster
Hang Zhou · Xinxin Zuo · Rui Ma · Li Cheng

[ ExHall D ]

Abstract
In this paper, we tackle the copy-paste image-to-image composition problem with a focus on object placement learning. Prior methods have leveraged generative models to minimize the need for dense supervision, which unfortunately may limit their ability to model complex data distributions. Alternatively, transformer networks with a sparse contrastive loss have been employed; yet their over-relaxed regularization often leads to imprecise placement. We propose BOOTPLACE, a novel paradigm that formulates object placement as a placement-by-detection problem. Our method first identifies regions of interest suitable for object placement by training a dedicated detection transformer on object-subtracted backgrounds with multi-object supervision. It then associates each target compositing object with detected regions based on semantic complementarity. Using a bootstrapped training approach on randomly object-subtracted images, our model regularizes meaningful placements through richly paired data augmentation. Experimental results on standard benchmarks demonstrate BOOTPLACE's superior performance in object repositioning, significantly outperforming state-of-the-art baselines on the Cityscapes and OPA datasets with notable improvements in IoU scores. Additional ablation studies further showcase the compositionality and generalizability of our approach, supported by user study evaluations.
Poster
Fangyun Wei · Jinjing Zhao · Kun Yan · Chang Xu

[ ExHall D ]

Abstract
Traditional video instance segmentation (VIS) models rely on extensive per-frame video annotations, which are both time-consuming and costly. In this paper, we present MinMaxVIS, a novel VIS framework that reduces the dependency on fully labeled video datasets by utilizing a small set of labeled images from the target domain along with a large volume of general-domain, unlabeled images. MinMaxVIS operates in three stages: first, a preliminary segmentation model is trained on the small labeled set from the target domain; this model then retrieves relevant instances from the unlabeled dataset to build a high-quality pseudo-labeled set, ensuring a rich content alignment with the target domain while avoiding the inefficiencies of large-scale semi-supervised learning across the entire unlabeled set. Finally, we train MinMaxVIS on a combination of labeled and pseudo-labeled data, addressing challenges such as noise in pseudo-labels and instance association across frames. To simulate object continuity, we augment static images to create paired frames, allowing MinMaxVIS to capture instance associations effectively. MinMaxVIS outperforms the prior image-driven approach, MinVIS, achieving superior mAP scores with significantly reduced labeled data. For instance, MinMaxVIS with a Swin-L backbone attains 62.2 mAP on YouTube-VIS 2019 using only 2% labeled data and additional unlabeled images from SA-1B. …
Poster
Jiacheng Sun · Xinghong Zhou · Yiqiang Wu · Bin Zhu · Jiaxuan Lu · Yu Qin · Xiaomao Li

[ ExHall D ]

Abstract
One of the roadblocks for instance segmentation today is its heavy computational overhead and large number of model parameters. Previous methods based on Polar Representation made an initial attempt to address this challenge by formulating instance segmentation as polygon detection, but failed to match mainstream methods in performance. In this paper, we highlight that Representation Errors, arising from the limited capacity of polygons to capture boundary details, have long been overlooked, which results in severe performance degradation. Observing that optimal starting point selection effectively alleviates this issue, we propose an Adaptive Polygonal Sample Decision strategy to dynamically capture the positional variation of representation errors across samples. Additionally, we design a Union-aligned Rasterization Module to incorporate these errors into polygonal assessment, further advancing the proposed strategy. With these components, our framework, called PolarNeXt, achieves a remarkable performance boost of over 4.8% AP compared to other polar-based methods. PolarNeXt is markedly more lightweight and efficient than state-of-the-art instance segmentation methods, while achieving comparable segmentation accuracy. We expect this work will open up a new direction for instance segmentation in high-resolution images and resource-limited scenarios.
Poster
Jihuai Zhao · Junbao Zhuo · Jiansheng Chen · Huimin Ma

[ ExHall D ]

Abstract
In the field of zero-shot 3D instance segmentation, existing 2D-to-3D lifting methods typically obtain 2D segmentation across multiple RGB frames using vision foundation models, which are then projected and merged into 3D space. However, since the inference of vision foundation models on a single frame is not integrated with adjacent frames, the masks of the same object may vary across different frames, leading to a lack of view consistency in the 2D segmentation. Furthermore, current lifting methods average the 2D segmentation from multiple views during the projection into 3D space, causing low-quality masks and high-quality masks to share the same weight. These factors can lead to fragmented 3D segmentation. In this paper, we present SAM2Object, a novel zero-shot 3D instance segmentation method that effectively utilizes the Segment Anything Model 2 to segment and track objects, consolidating view consistency across frames. Our approach combines these consistent 2D masks with 3D geometric priors, improving the robustness of 3D segmentation. Additionally, we introduce mask consolidation module to filter out low-quality masks across frames, which enables more precise 3D-to-2D matching. Comprehensive evaluations on ScanNetV2, ScanNet++ and ScanNet200 demonstrate the robustness and effectiveness of SAM2Object, showcasing its ability to outperform previous methods.
Poster
Jiaxin Zhang · Junjun Jiang · Youyu Chen · Kui Jiang · Xianming Liu

[ ExHall D ]

Abstract
Accurate object segmentation is crucial for high-quality scene understanding in the 3D vision domain. However, 3D segmentation based on 3D Gaussian Splatting (3DGS) struggles with accurately delineating object boundaries, as Gaussian primitives often span across object edges due to their inherent volume and the lack of semantic guidance during training. In order to tackle these challenges, we introduce Clear Object Boundaries for 3DGS Segmentation (COB-GS), which aims to improve segmentation accuracy by clearly delineating blurry boundaries of interwoven Gaussian primitives within the scene. Unlike existing approaches that remove ambiguous Gaussians and sacrifice visual quality, COB-GS, as a 3DGS refinement method, jointly optimizes semantic and visual information, allowing the two different levels to cooperate with each other effectively. Specifically, for the semantic guidance, we introduce a boundary-adaptive Gaussian splitting technique that leverages semantic gradient statistics to identify and split ambiguous Gaussians, aligning them closely with object boundaries. For the visual optimization, we rectify the degraded suboptimal texture of the 3DGS scene, particularly along the refined boundary structures. Experimental results demonstrate that COB-GS substantially improves segmentation accuracy and robustness against inaccurate masks from the pre-trained model, yielding clear boundaries while preserving high visual quality.
Poster
Bo-Wen Yin · Jiao-Long Cao · Ming-Ming Cheng · Qibin Hou

[ ExHall D ]

Abstract
Recent advances in scene understanding benefit a lot from depth maps because of the 3D geometry information, especially in complex conditions (e.g., low light and overexposed). Existing approaches encode depth maps along with RGB images and perform feature fusion between them to enable more robust predictions. Taking into account that depth can be regarded as a geometry supplement for RGB images, a straightforward question arises: Do we really need to explicitly encode depth information with neural networks as done for RGB images? Based on this insight, in this paper, we investigate a new way to learn RGBD feature representations and present DFormerv2, a strong RGBD encoder that explicitly uses depth maps as geometry priors rather than encoding depth information with neural networks. Our goal is to leverage a memory token as the query to extract the geometry clues from the depth and spatial distances among all the image patch tokens, which will then be used as geometry priors to allocate attention weights in self-attention. Extensive experiments demonstrate that DFormerv2 exhibits exceptional performance in various RGBD semantic segmentation benchmarks.
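A hedged sketch of using depth as a geometry prior: attention logits between patch tokens are biased by their spatial distance (computed from back-projected depth) instead of encoding depth with a separate network. The bias form and temperature below are illustrative, not DFormerv2's exact design.

```python
import numpy as np

def geometry_biased_attention(q, k, v, patch_xyz, tau=1.0):
    """q, k, v: (N, d) patch tokens; patch_xyz: (N, 3) back-projected patch centres."""
    logits = q @ k.T / np.sqrt(q.shape[-1])
    dist = np.linalg.norm(patch_xyz[:, None, :] - patch_xyz[None, :, :], axis=-1)
    logits = logits - dist / tau                     # spatially nearer patches attend more
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)               # row-wise softmax
    return w @ v
```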
Poster
Chongkai Yu · Ting Liu · Li Anqi · Xiaochao Qu · WU CHENGJING · Luoqi Liu · Xiaolin Hu

[ ExHall D ]

Abstract
Interactive segmentation aims to segment the mask of the target object according to the user's interactive prompts. There are two mainstream strategies: early fusion and late fusion. Current specialist models utilize the early fusion strategy, which encodes the combination of images and prompts to target the prompted objects, yet repetitive complex computations on the images result in high latency. Late fusion models extract image embeddings once and merge them with the prompts in later interactions. This strategy avoids redundant image feature extraction and improves efficiency significantly. A recent milestone is the Segment Anything Model (SAM). However, this strategy limits the model's ability to extract detailed information from the prompted target zone. To address this issue, we propose SAM-REF, a two-stage refinement framework that fully integrates images and prompts by introducing a lightweight refiner into the late-fusion interaction, combining the accuracy of early fusion with the efficiency of late fusion. Through extensive experiments, we show that our SAM-REF model outperforms the current state-of-the-art method on most segmentation-quality metrics without compromising efficiency.
Poster
Subhransu S. Bhattacharjee · Dylan Campbell · Rahul Shome

[ ExHall D ]

Abstract
Can objects that are not visible in an image---but are in the vicinity of the camera---be detected? This study introduces the novel tasks of 2D, 2.5D and 3D unobserved object detection for predicting the location of nearby objects that are occluded or lie outside the image frame. We adapt several state-of-the-art pre-trained generative models to address this task, including 2D and 3D diffusion models and vision-language models, and show that they can be used to infer the presence of objects that are not directly observed. To benchmark this task, we propose a suite of metrics that capture different aspects of performance. Our empirical evaluation on indoor scenes from the RealEstate10k and NYU Depth v2 datasets demonstrates results that motivate the use of generative models for the unobserved object detection task.
Poster
Ege Özsoy · Chantal Pellegrini · Tobias Czempiel · Felix Tristram · Kun yuan · David Bani-Harouni · Ulrich Eck · Benjamin Busam · Matthias Keicher · Nassir Navab

[ ExHall D ]

Abstract
Operating rooms (ORs) are complex, high-stakes environments requiring precise understanding of interactions among medical staff, tools, and equipment for enhancing surgical assistance, situational awareness, and patient safety. Current datasets fall short in scale and realism and do not capture the multimodal nature of OR scenes, limiting progress in OR modeling. To this end, we introduce MM-OR, a realistic and large-scale multimodal spatiotemporal OR dataset, and the first dataset to enable multimodal scene graph generation. MM-OR captures comprehensive OR scenes containing RGB-D data, detail views, audio, speech transcripts, robotic logs, and tracking data, and is annotated with panoptic segmentations, semantic scene graphs, and downstream task labels. Further, we propose MM2SG, the first multimodal large vision-language model for scene graph generation, and through extensive experiments, demonstrate its ability to effectively leverage multimodal inputs. Together, MM-OR and MM2SG establish a new benchmark for holistic OR understanding, and open the path towards multimodal scene analysis in complex, high-stakes environments. We will publish all our code and dataset upon acceptance.
Poster
Jinlong Li · Cristiano Saltori · Fabio Poiesi · Nicu Sebe

[ ExHall D ]

Abstract
The lack of a large-scale 3D-text corpus has led recent works to distill open-vocabulary knowledge from vision-language models (VLMs). However, these methods typically rely on a single VLM to align the feature spaces of 3D models within a common language space, which limits the potential of 3D models to leverage the diverse spatial and semantic capabilities encapsulated in various foundation models. In this paper, we propose Cross-modal and Uncertainty-aware Agglomeration for Open-vocabulary 3D Scene Understanding dubbed CUA-O3D, the first model to integrate multiple foundation models—such as CLIP, Dinov2, and Stable Diffusion—into 3D scene understanding. We further introduce a deterministic uncertainty estimation to adaptively distill and harmonize the heterogeneous 2D feature embeddings from these models. Our method addresses two key challenges: (1) incorporating semantic priors from VLMs alongside the geometric knowledge of spatially-aware vision foundation models, and (2) using a novel deterministic uncertainty estimation to capture model-specific uncertainties across diverse semantic and geometric sensitivities, helping to reconcile heterogeneous representations during training. Extensive experiments on ScanNetV2 and Matterport3D demonstrate that our method not only advances open-vocabulary segmentation but also achieves robust cross-domain alignment and competitive spatial perception capabilities.
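A sketch of distilling several 2D foundation-model feature spaces into one 3D backbone with per-teacher uncertainty weights, in the spirit of a heteroscedastic loss; the precise deterministic uncertainty estimator in CUA-O3D may differ.

```python
import torch

def uncertainty_distill_loss(student_feats, teacher_feats, log_vars):
    """student_feats: dict name -> (N, d) prediction in each teacher's feature space;
    teacher_feats: dict name -> (N, d) projected 2D features (e.g. CLIP, DINOv2);
    log_vars: dict name -> scalar tensor, a learned log-uncertainty per teacher."""
    loss = torch.zeros(())
    for name, target in teacher_feats.items():
        err = (student_feats[name] - target).pow(2).mean()
        # Higher-uncertainty teachers are down-weighted; the +log_var term
        # prevents the trivial solution of inflating every uncertainty.
        loss = loss + torch.exp(-log_vars[name]) * err + log_vars[name]
    return loss
```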
Poster
Chenyangguang Zhang · Alexandros Delitzas · Fangjinhua Wang · Ruida Zhang · Xiangyang Ji · Marc Pollefeys · Francis Engelmann

[ ExHall D ]

Abstract
We introduce the task of predicting functional 3D scene graphs for real-world indoor environments from posed RGB-D images. Unlike traditional 3D scene graphs that focus on spatial relationships of objects, functional 3D scene graphs capture objects, interactive elements, and their functional relationships. Due to the lack of training data, we leverage foundation models, including visual language models (VLMs) and large language models (LLMs), to encode functional knowledge. We evaluate our approach on an extended SceneFun3D dataset and a newly collected dataset, FunGraph3D, both annotated with functional 3D scene graphs. Our method significantly outperforms adapted baselines, including Open3DSG and ConceptGraph, demonstrating its effectiveness in modeling complex scene functionalities. We also demonstrate downstream applications such as 3D question answering and robotic manipulation using functional 3D scene graphs.
Poster
Junsheng Wang · Nieqing Cao · Yan Ding · Mengying Xie · Fuqiang Gu · Chao Chen

[ ExHall D ]

Abstract
Generating layouts from textual descriptions by large language models (LLMs) plays a crucial role in precise spatial reasoning-induced domains such as robotic object rearrangement and text-to-image generation. However, current methods face challenges in limited real-world examples, handling diverse layout descriptions and varying levels of granularity. To address these issues, a novel framework named Spatial Knowledge Enhanced Layout (SKE-Layout), is introduced. SKE-Layout integrates mixed spatial knowledge sources, leveraging both real and synthetic data to enhance spatial contexts. It utilizes diverse representations tailored to specific tasks and employs contrastive learning and multitask learning techniques for accurate spatial knowledge retrieval. This framework generates more accurate and fine-grained visual layouts for object rearrangement and text-to-image generation tasks, achieving improvements of 5\%-30\% compared to existing methods.
Poster
Hsiang-Wei Huang · Fu-Chen Chen · Wenhao Chai · Che-Chun Su · Lu Xia · Sanghun Jung · Cheng-Yen Yang · Jenq-Neng Hwang · Min Sun · Cheng-Hao Kuo

[ ExHall D ]

Abstract
Recent advancements in 3D Large Multi-modal Models (3D-LMMs) have driven significant progress in 3D question answering. However, recent multi-frame Vision-Language Models (VLMs) demonstrate superior performance compared to 3D-LMMs on 3D question answering tasks, largely due to the greater scale and diversity of available 2D image data in contrast to the more limited 3D data. Multi-frame VLMs, although achieving superior performance, suffer from the difficulty of retaining all the detailed visual information in the 3D scene while limiting the number of visual tokens. Common methods such as token pooling, reduce visual token usage but often lead to information loss, impairing the model’s ability to preserve visual details essential for 3D question answering tasks. To address this, we propose voxel-based Dynamic Token Compression (DTC), which combines 3D spatial priors and visual semantics to achieve over 90% reduction in visual tokens usage for current multi-frame VLMs. Our method maintains performance comparable to state-of-the-art models on 3D question answering benchmarks including OpenEQA and ScanQA, demonstrating its effectiveness.
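The voxel-pooling idea could be sketched as below: tokens whose back-projected 3D points fall in the same voxel are averaged into one token. The voxel size and the semantic weighting that DTC combines with this spatial prior are assumptions here.

```python
import numpy as np

def voxel_compress(tokens: np.ndarray, xyz: np.ndarray, voxel: float = 0.5) -> np.ndarray:
    """tokens: (N, d) visual tokens; xyz: (N, 3) their back-projected 3D locations."""
    keys = np.floor(xyz / voxel).astype(np.int64)
    _, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.ravel()
    out = np.zeros((inverse.max() + 1, tokens.shape[1]), dtype=tokens.dtype)
    counts = np.bincount(inverse).astype(tokens.dtype)
    np.add.at(out, inverse, tokens)                  # sum tokens per occupied voxel
    return out / counts[:, None]                     # one pooled token per voxel

toks = np.random.randn(10000, 1024).astype(np.float32)
pts = np.random.rand(10000, 3).astype(np.float32) * 10
print(voxel_compress(toks, pts, voxel=1.0).shape)    # far fewer than 10000 tokens
```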
Poster
Zhihao Yuan · Yibo Peng · Jinke Ren · Yinghong Liao · Yatong Han · Chun-Mei Feng · Hengshuang Zhao · Guanbin Li · Shuguang Cui · Zhen Li

[ ExHall D ]

Abstract
Driven by the great success of Large Language Models (LLMs) in the 2D image domain, their applications in 3D scene understanding has emerged as a new trend. A key difference between 3D and 2D is that the situation of an egocentric observer in 3D scenes can change, resulting in different descriptions (e.g., ''left" or ''right"). However, current LLM-based methods overlook the egocentric perspective and simply use datasets from a global viewpoint. To address this issue, we propose a novel approach to automatically generate a situation-aware dataset by leveraging the scanning trajectory during data collection and utilizing Vision-Language Models (VLMs) to produce high-quality captions and question-answer pairs. Furthermore, we introduce a situation grounding module to explicitly predict the position and orientation of observer's viewpoint, thereby enabling LLMs to ground situation description in 3D scenes. We evaluate our approach on several benchmarks, demonstrating that our method effectively enhances the 3D situational awareness of LLMs while significantly expanding existing datasets and reducing manual effort.
Poster
Damiano Marsili · Rohun Agrawal · Yisong Yue · Georgia Gkioxari

[ ExHall D ]

Abstract
Visual reasoning -- the ability to interpret the visual world -- is crucial for embodied agents that operate within three-dimensional scenes. Recent progress in AI has led to vision and language models capable of answering questions from images. However, their performance declines when tasked with 3D spatial reasoning. To tackle the complexity of such reasoning problems, we introduce an agentic program synthesis approach where LLM agents collaboratively generate a Pythonic API with new functions to solve common subproblems. Our method overcomes limitations of prior approaches that rely on a static, human-defined API, allowing it to handle a wider range of queries. To better assess AI capabilities for 3D understanding, we introduce a new benchmark of queries involving multiple steps of grounding and inference. We show that our method outperforms prior zero-shot models for visual reasoning in 3D and empirically validate the effectiveness of our agentic framework for 3D spatial reasoning tasks.
Poster
Ziyi Bai · Hanxuan Li · Bin Fu · Chuyan Xiong · Ruiping Wang · Xilin Chen

[ ExHall D ]

Abstract
This paper explores leveraging large language models (LLMs) as low-level action planners for general embodied instruction following tasks. LLMs excel at serving as the “brain” of robots by handling high-level task planning but lack the ability to directly generate precise low-level actions to guide the “body”. This limitation arises from a disconnect between high-level conceptual understanding and low-level spatial perception. We address this challenge by bridging the gap, enabling LLMs to not only understand complex instructions but also produce precise, actionable plans. To achieve this, we introduce Room to Chessboard (R2C), a semantic representation that maps environments onto a grid-based chessboard, enabling LLMs to generate specific low-level coordinates and effectively guide robots as if playing a game of chess. We further propose a Chain-of-Thought Decision (CoT-D) paradigm to enhance the LLMs’ decision-making ability by improving interpretability and context-awareness. By jointly training LLMs for high-level task decomposition and low-level action generation, we create a unified “brain-body” system capable of handling complex, free-form instructions while producing precise low-level actions, allowing robots to adapt to dynamic environments in real time. We validate R2C using both fine-tuned open-source LLMs and GPT-4, demonstrating effectiveness on the challenging ALFRED benchmark. Results show that with our R2C …
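A minimal sketch of the chessboard-style discretization R2C builds on, under our own assumptions (rectangular room, 8×8 grid, chess-like cell labels); the function names and label format are illustrative, not the paper's code:

```python
# Hedged sketch of a Room-to-Chessboard style discretization (illustrative only):
# continuous room coordinates are mapped to grid cells so an LLM can emit low-level
# targets as chess-like coordinates, and cells are mapped back to metric points.
def room_to_cell(x: float, y: float, room_w: float, room_h: float, cols: int = 8, rows: int = 8) -> str:
    """Map a point in a room (metres) to a chessboard-style cell label such as 'e3'."""
    col = min(int(x / room_w * cols), cols - 1)
    row = min(int(y / room_h * rows), rows - 1)
    return f"{chr(ord('a') + col)}{row + 1}"

def cell_to_room(cell: str, room_w: float, room_h: float, cols: int = 8, rows: int = 8):
    """Inverse mapping: cell label -> centre of that cell in room coordinates."""
    col = ord(cell[0]) - ord('a')
    row = int(cell[1:]) - 1
    return ((col + 0.5) * room_w / cols, (row + 0.5) * room_h / rows)

if __name__ == "__main__":
    print(room_to_cell(3.1, 1.2, room_w=6.0, room_h=4.0))   # e.g. 'e3'
    print(cell_to_room("e3", room_w=6.0, room_h=4.0))       # metric target the robot can navigate to
```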
Poster
Yi Fang · Bowen Jin · Jiacheng Shen · Sirui Ding · Qiaoyu Tan · Jiawei Han

[ ExHall D ]

Abstract
The rapid development of Multimodal Large Language Models (MLLMs) has enabled the integration of multiple modalities, including texts and images, within the large language model (LLM) framework. However, texts and images are usually interconnected, forming a multimodal attributed graph (MMAG). It is underexplored how MLLMs can incorporate the relational information (i.e., graph structure) and semantic information (i.e., texts and images) on such graphs for multimodal comprehension and generation. In this paper, we propose GraphGPT-o, which supports omni-multimodal understanding and creation on MMAGs. We first comprehensively study linearization variants to transform semantic and structural information as input for MLLMs. Then, we propose a hierarchical aligner that enables deep graph encoding, bridging the gap between MMAGs and MLLMs. Finally, we explore the inference choices, adapting MLLM to interleaved text and image generation in graph scenarios. Extensive experiments on three datasets from different domains demonstrate the effectiveness of our proposed method.
Poster
Yuchen Sun · Shanhui Zhao · Tao Yu · Hao Wen · Samith Va · Mengwei Xu · Yuanchun Li · Chongyang Zhang

[ ExHall D ]

Abstract
GUI agents hold significant potential to enhance the experience and efficiency of human-device interaction. However, current methods face challenges in generalizing across applications (apps) and tasks, primarily due to two fundamental limitations in existing datasets. First, these datasets overlook developer-induced structural variations among apps, limiting the transferability of knowledge across diverse software environments. Second, many of them focus solely on navigation tasks, which restricts their capacity to represent comprehensive software architectures and complex user interactions. To address these challenges, we introduce GUI-Xplore, a dataset meticulously designed to enhance cross-application and cross-task generalization via an exploration-and-reasoning framework. GUI-Xplore integrates pre-recorded exploration videos providing contextual insights, alongside five hierarchically structured downstream tasks designed to comprehensively evaluate GUI agent capabilities. To fully exploit GUI-Xplore's unique features, we propose Xplore-Agent, a GUI agent framework that combines Action-aware GUI Modeling with Graph-Guided Environment Reasoning. Further experiments indicate that Xplore-Agent achieves a 10% improvement over existing methods in unfamiliar environments, yet there remains significant potential for further enhancement towards truly generalizable GUI agents.
Poster
XiMing Xing · Juncheng Hu · Guotao Liang · Jing Zhang · Dong Xu · Qian Yu

[ ExHall D ]

Abstract
The unprecedented advancements in Large Language Models (LLMs) have shown a profound impact on natural language processing but are yet to fully embrace the realm of scalable vector graphics (SVG) generation. While LLMs encode partial knowledge of SVG data from web pages during training, recent findings suggest that semantically ambiguous and tokenized representations within LLMs may result in hallucinations in vector primitive predictions. Furthermore, LLM training lacks modeling and understanding of the rendering sequence of vector paths, resulting in occlusion between output vector primitives. In this paper, we present LLM4SVG, an initial yet substantial step toward bridging this gap by enabling LLMs to better understand and generate vector graphics. LLM4SVG facilitates a deeper understanding of SVG components through learnable semantic tokens, precisely encoding these tokens and their corresponding properties to generate semantically aligned SVG output. Using a series of learnable semantic tokens, a structured dataset for instruction following is developed to support comprehension and generation across two primary tasks. Our method introduces a modular architecture to existing large language models (LLMs), integrating semantic tags, vector instruction encoders, fine-tuned commands, and powerful LLMs to tightly combine geometric, appearance, and language information. To overcome the scarcity of SVG-text instruction data, we developed …
Poster
Kevin Qinghong Lin · Linjie Li · Difei Gao · Zhengyuan Yang · Shiwei Wu · Zechen Bai · Stan Weixian Lei · Lijuan Wang · Mike Zheng Shou

[ ExHall D ]

Abstract
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity. While most agents are language-based, relying on closed-source APIs with text-rich meta-information (e.g., HTML or accessibility trees), they show limitations in perceiving UI visuals as humans do, highlighting the need for GUI visual agents. In this work, we develop a vision-language-action model in the digital world, namely "Our Model," which features the following innovations: 1. **UI-Guided Visual Token Selection** to reduce computational costs by formulating screenshots as a UI-connected graph, adaptively identifying their redundant relationships and serving as the criteria for token selection during self-attention blocks. 2. **Interleaved Vision-Language-Action Streaming** that flexibly unifies diverse needs within GUI tasks, enabling effective management of visual-action history in navigation or pairing multi-turn query-action sequences per screenshot to enhance training efficiency. 3. **Small-Scale High-Quality GUI Instruction-Following Datasets** built through careful data curation and a resampling strategy to address significant data type imbalances. With the above components, our model, a lightweight 2B model using 256K data, achieves a strong 75.1% accuracy in zero-shot screenshot grounding. Its UI-guided token selection further removes 33% of redundant visual tokens during training and yields a 1.4× speedup. Navigation experiments across web, mobile, and online environments further …
Poster
Xu Cao · Pranav Virupaksha · Wenqi Jia · Bolin Lai · Fiona Ryan · Sangmin Lee · James Rehg

[ ExHall D ]

Abstract
Previous research in human gesture recognition has largely overlooked multi-person interactions, which are crucial for understanding the social context of naturally occurring gestures. This limitation in existing datasets presents a significant challenge in aligning human gestures with other modalities like language and speech. To address this issue, we introduce SocialGesture, the first large-scale dataset specifically designed for multi-person gesture analysis. SocialGesture features a diverse range of natural scenarios and supports multiple gesture analysis tasks, including video-based recognition and temporal localization, providing a valuable resource for advancing the study of gesture during complex social interactions. Furthermore, we propose a novel visual question answering (VQA) task to benchmark vision language models' (VLMs) performance on social gesture understanding. Our findings highlight several limitations of current gesture recognition models, offering insights into future directions for improvement in this field.
Poster
Jun Gao · Yongqi Li · Ziqiang Cao · Wenjie Li

[ ExHall D ]

Abstract
Chain-of-Thought (CoT) prompting elicits large language models (LLMs) to produce a series of intermediate reasoning steps before arriving at the final answer. However, when transitioning to vision-language models (VLMs), their text-only rationales struggle to express the fine-grained associations with the original image. In this paper, we propose an image-incorporated multimodal Chain-of-Thought, named Interleaved-modal Chain-of-Thought (ICoT), which generates sequential reasoning steps consisting of paired visual and textual rationales to infer the final answer. Intuitively, the novel ICoT requires VLMs to enable the generation of fine-grained interleaved-modal content, which is hard for current VLMs to fulfill. Considering that the required visual information is usually part of the input image, we propose Attention-driven Selection (ADS) to realize ICoT over existing VLMs. ADS intelligently inserts regions of the input image to generate the interleaved-modal reasoning steps with negligible additional latency. ADS relies solely on the attention map of VLMs without the need for parameterization, and therefore it is a plug-and-play strategy that can be generalized to a spectrum of VLMs. We apply ADS to realize ICoT on two popular VLMs of different architectures. Extensive evaluations on three benchmarks show that ICoT prompting achieves substantial performance (up to 14%) and interpretability improvements compared to existing multimodal CoT prompting methods.
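The following is a hedged sketch of attention-driven region selection in the spirit of ADS, under our own assumptions (a patch-level attention map and a fixed-size window); it is illustrative only and not the authors' implementation:

```python
# Hedged sketch of attention-driven region selection (ADS-style, illustrative only):
# given a patch-level attention map from a VLM, pick the highest-attention window and
# crop that region from the image so it can be interleaved into the next reasoning step.
import numpy as np

def select_region(attn: np.ndarray, image: np.ndarray, patch: int = 14, win: int = 4):
    """attn: (H_p, W_p) attention over patches; image: (H, W, C) pixels.
    Returns the crop under the win x win patch window with the largest summed attention."""
    hp, wp = attn.shape
    best, best_ij = -np.inf, (0, 0)
    for i in range(hp - win + 1):
        for j in range(wp - win + 1):
            s = attn[i:i + win, j:j + win].sum()
            if s > best:
                best, best_ij = s, (i, j)
    i, j = best_ij
    return image[i * patch:(i + win) * patch, j * patch:(j + win) * patch]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    attn = rng.random((16, 16))                   # 16x16 patch grid (224 px / 14 px patches)
    img = rng.integers(0, 255, size=(224, 224, 3), dtype=np.uint8)
    crop = select_region(attn, img)
    print(crop.shape)                             # (56, 56, 3): region to interleave as a visual rationale
```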
Poster
Guillaume Astruc · Nicolas Gonthier · Clement Mallet · Loic Landrieu

[ ExHall D ]

Abstract
Geospatial models must adapt to the diversity of Earth observation data in terms of resolutions, scales, and modalities. However, existing approaches expect fixed input configurations, which limits their practical applicability. We propose AnySat, a multimodal model based on joint embedding predictive architecture (JEPA) and resolution-adaptive spatial encoders, allowing us to train a single model on highly heterogeneous data in a self-supervised manner. To demonstrate the advantages of this unified approach, we compile GeoPlex, a collection of 5 multimodal datasets with varying characteristics and 11 distinct sensors. We then train a single powerful model on these diverse datasets simultaneously. Once fine-tuned, we achieve better or near state-of-the-art results on the datasets of GeoPlex and 3 additional ones for 4 environment monitoring tasks: land cover mapping, crop type classification, change detection, and forest analysis. We will release all codes, models, and data.
Poster
Peijie Wang · Zhong-Zhi Li · Fei Yin · Dekang Ran · Cheng-Lin Liu

[ ExHall D ]

Abstract
Multimodal Large Language Models (MLLMs) have shown promising capabilities in mathematical reasoning within visual contexts across various datasets. However, most existing multimodal math benchmarks are limited to single-visual contexts, which diverges from the multi-visual scenarios commonly encountered in real-world mathematical applications. To address this gap, we introduce MV-MATH: a meticulously curated dataset of 2,009 high-quality mathematical problems. Each problem integrates multiple images interleaved with text, derived from authentic K-12 scenarios and enriched with detailed annotations. MV-MATH includes multiple-choice, free-form, and multi-step questions, covering 11 subject areas across 3 difficulty levels, and serves as a comprehensive and rigorous benchmark for assessing MLLMs’ mathematical reasoning in multi-visual contexts. Through extensive experimentation, we observe that MLLMs encounter substantial challenges in multi-visual math tasks, with a considerable performance gap relative to human capabilities on MV-MATH. Furthermore, we analyze the performance and error patterns of various models, providing insights into MLLMs' mathematical reasoning capabilities within multi-visual settings.
Poster
James Burgess · Jeffrey J Nirschl · Laura Bravo-Sánchez · Alejandro Lozano · Sanket Rajan Gupte · Jesus G. Galaz-Montoya · Yuhui Zhang · Yuchang Su · Disha Bhowmik · Zachary Coman · Sarina M. Hasan · Alexandra Johannesson · William D. Leineweber · Malvika G Nair · Ridhi Yarlagadda · Connor Zuraski · Wah Chiu · Sarah Cohen · Jan N. Hansen · Manuel D Leonetti · Chad Liu · Emma Lundberg · Serena Yeung

[ ExHall D ]

Abstract
Scientific research demands sophisticated reasoning over multimodal data, a challenge especially prevalent in biology. Despite recent advances in multimodal large language models (MLLMs) for AI-assisted research, existing multimodal reasoning benchmarks target up to college-level difficulty, while research-level benchmarks emphasize lower-level perception, falling short of the complex multimodal reasoning needed for scientific discovery. To bridge this gap, we introduce MicroVQA, a visual-question answering (VQA) benchmark designed to assess three reasoning capabilities vital in research workflows: expert image understanding, hypothesis generation, and experiment proposal. MicroVQA consists of 1,061 multiple-choice questions (MCQs) curated by biological experts across diverse microscopy modalities, ensuring VQA samples represent real scientific practice. We find that standard MCQ creation methods do not properly test our targeted reasoning capabilities, motivating a new two-stage pipeline: an optimized LLM prompt structures question-answer pairs into MCQs; then, an agent-based 'RefineBot' generates more challenging distractors. Benchmarking state-of-the-art MLLMs reveals a peak performance of 43%; models with smaller LLMs only slightly underperform top models, suggesting that language-based reasoning is less challenging than multimodal reasoning; and tuning with scientific articles enhances performance. Expert analysis of chain-of-thought reasoning failures indicates that multimodal reasoning errors are frequent, followed by knowledge errors and overgeneralization. These insights …
Poster
Ashmal Vayani · Dinura Dissanayake · Hasindri Watawana · Noor Ahsan · Nevasini Sasikumar · Omkar Thawakar · Henok Biadglign Ademtew · Yahya Hmaiti · Amandeep Kumar · Kartik Kuckreja · Mykola Maslych · Wafa Al Ghallabi · Mihail Minkov Mihaylov · Chao Qin · Abdelrahman Shaker · Mike Zhang · Mahardika Krisna Ihsani · Amiel Gian Esplana · Monil Gokani · Shachar Mirkin · Harsh Singh · Ashay Srivastava · Endre Hamerlik · Fathinah Asma Izzati · Fadillah Adamsyah Maani · Sebastian Cavada · Jenny Chim · Rohit Gupta · Sanjay Manjunath · Kamila Zhumakhanova · Feno Heriniaina Rabevohitra · Azril Hafizi Amirudin · Muhammad Ridzuan · Daniya Najiha Abdul Kareem · Ketan Pravin More · Kunyang Li · Pramesh Shakya · Muhammad Saad · Amirpouya Ghasemaghaei · Amirbek Djanibekov · Dilshod Azizov · Branislava Jankovic · Naman Bhatia · Alvaro Cabrera Berobide · Johan Obando-Ceron · Olympiah Otieno · Fabian Farestam · Muztoba Rabbani · Sanoojan Baliah · Santosh Sanjeev · Abduragim Shtanchaev · Maheen Fatima · Thao Nguyen · Amrin Kareem · Toluwani Aremu · Nathan Augusto Zacarias Xavier · Amit Bhatkal · Hawau Olamide Toyin · Aman Chadha · Hisham Cholakkal · Rao Anwer · Michael Felsberg · Jorma Laaksonen · Thamar Solorio · Monojit Choudhury · Ivan Laptev · Mubarak Shah · Salman Khan · Fahad Shahbaz Khan

[ ExHall D ]

Abstract
Existing Large Multimodal Models (LMMs) generally focus on only a few regions and languages. As LMMs continue to improve, it is increasingly important to ensure they understand cultural contexts, respect local sensitivities, and support low-resource languages, all while effectively integrating corresponding visual cues. In pursuit of culturally diverse global multimodal models, our proposed All Languages Matter Benchmark (ALM-bench) represents the largest and most comprehensive effort to date for evaluating LMMs across 100 languages. ALM-bench challenges existing models by testing their ability to understand and reason about culturally diverse images paired with text in various languages, including many low-resource languages traditionally underrepresented in multimodal research. The benchmark offers a robust and nuanced evaluation framework featuring various question formats, including True/False, multiple choice, and open-ended questions, which are further divided into short and long-answer categories. ALM-bench design ensures a comprehensive assessment of a model’s ability to handle varied levels of difficulty in visual and linguistic reasoning. To capture the rich tapestry of global cultures, ALM-bench carefully curates content from 13 distinct cultural aspects, ranging from traditions and rituals to famous personalities and celebrations. Through this, ALM-bench not only provides a rigorous testing ground for state-of-the-art open and closed-source LMMs but also highlights …
Poster
Ke Sun · Shen Chen · Taiping Yao · Ziyin Zhou · Jiayi Ji · Xiaoshuai Sun · Chia-Wen Lin · Rongrong Ji

[ ExHall D ]

Abstract
Face manipulation techniques have achieved significant advances, presenting serious challenges to security and social trust. Recent works demonstrate that leveraging multimodal models can enhance the generalization and interpretability of face forgery detection. However, existing annotation approaches, whether through human labeling or direct Multimodal Large Language Model (MLLM) generation, often suffer from hallucination issues, leading to inaccurate text descriptions, especially for high-quality forgeries. To address this, we propose Face Forgery Text Generator (FFTG), a novel annotation pipeline that generates accurate text descriptions by leveraging forgery masks for initial region and type identification, followed by a comprehensive prompting strategy to guide MLLMs in reducing hallucination. We validate our approach through fine-tuning both CLIP with a three-branch training framework combining unimodal and multimodal objectives, and MLLMs with our structured annotations. Experimental results demonstrate that our method not only achieves more accurate annotations with higher region identification accuracy, but also leads to improvements in model performance across various forgery detection benchmarks.
Poster
Zhicheng Wang · Zhiyu Pan · Zhan Peng · Jian Cheng · Liwen Xiao · Wei Jiang · Zhiguo Cao

[ ExHall D ]

Abstract
Referring expression counting (REC) algorithms aim to provide more flexible and interactive counting across varied fine-grained text expressions. However, the requirement for fine-grained attribute understanding poses challenges for prior arts, as they struggle to accurately align attribute information with the correct visual patterns. Given the proven importance of "visual density", we presume that the limitations of current REC approaches stem from an under-exploration of "contextual attribute density" (CAD). In the scope of REC, we define CAD as a measure of the information intensity of a certain fine-grained attribute in visual regions. To model CAD, we propose a U-shaped CAD estimator in which the referring expression and multi-scale visual features from GroundingDINO interact with each other. With additional density supervision, we can effectively encode CAD, which is subsequently decoded via a novel attention procedure with CAD-refined queries. Integrating all these contributions, our framework significantly outperforms state-of-the-art REC methods, achieving a 30% error reduction in counting metrics and a 10% improvement in localization accuracy. These results shed light on the significance of contextual attribute density for REC. Code will be available.
Poster
Wenlong Fang · Qiaofeng Wu · Jing Chen · Yun Xue

[ ExHall D ]

Abstract
The knowledge-based visual question answering (KB-VQA) task involves using external knowledge about the image to assist reasoning. Building on the impressive performance of multimodal large language models (MLLMs), recent methods have commenced leveraging the MLLM as an implicit knowledge base for reasoning. However, the direct employment of the MLLM with raw external knowledge might result in reasoning errors due to misdirected knowledge information. Additionally, the MLLM may lack fine-grained perception of visual features, which can result in hallucinations during reasoning. To address these challenges, we propose Notes-guided MLLM Reasoning (NoteMR), a novel framework that guides the MLLM in better reasoning by utilizing knowledge notes and visual notes. Specifically, we initially obtain explicit knowledge from an external knowledge base. Then, this explicit knowledge, combined with images, is used to assist the MLLM in generating knowledge notes. These notes are designed to filter explicit knowledge and identify relevant internal implicit knowledge within the MLLM. We then identify highly correlated regions between the images and knowledge notes, retaining them as image notes to enhance the model's fine-grained perception, thereby mitigating MLLM-induced hallucinations. Finally, both notes are fed into the MLLM, enabling a more comprehensive understanding of the image-question pair and enhancing the model's reasoning capabilities. Our …
Poster
Tianyu Huai · Jie Zhou · Xingjiao Wu · Qin Chen · Qingchun Bai · Zezhou · Liang He

[ ExHall D ]

Abstract
Multimodal large language models (MLLMs) have garnered widespread attention from researchers due to their remarkable understanding and generation capabilities in visual language tasks (e.g., visual question answering). However, the rapid pace of knowledge updates in the real world makes offline training of MLLMs costly, and when faced with non-stationary data streams, MLLMs suffer from catastrophic forgetting during learning. In this paper, we propose an MLLM-based dual momentum Mixture-of-Experts (CL-MoE) framework for continual visual question answering. We integrate MLLMs with continual learning to utilize the rich commonsense knowledge in LLMs. We introduce a Dual-Router MoE (RMoE) to select the global and local experts using task-level and instance-level routers, to robustly assign weights to the experts most appropriate for the task. Then, we design a dynamic Momentum MoE (MMoE) to update the parameters of experts dynamically based on the relationships between the experts and tasks, so that the model can absorb new knowledge while maintaining existing knowledge. The extensive experimental results indicate that our method achieves state-of-the-art performance on 10 VQA tasks, proving the effectiveness of our approach. The code and weights will be released on GitHub.
Poster
Fan Lu · Wei Wu · Kecheng Zheng · Shuailei Ma · Biao Gong · Jiawei Liu · Wei Zhai · Yang Cao · Yujun Shen · Zheng-Jun Zha

[ ExHall D ]

Abstract
Generating detailed captions comprehending text-rich visual content in images has received growing attention for Large Vision-Language Models (LVLMs). However, few studies have developed benchmarks specifically tailored for detailed captions to measure their accuracy and comprehensiveness. In this paper, we introduce a detailed caption benchmark, termed as CompreCap, to evaluate the visual context from a directed scene graph view. Concretely, we first manually segment the image into semantically meaningful regions (i.e., semantic segmentation mask) according to common-object vocabulary, while also distinguishing attributes of objects within all those regions. Then directional relation labels of these objects are annotated to compose a directed scene graph that can well encode rich compositional information of the image. Based on our directed scene graph, we develop a pipeline to assess the generated detailed captions from LVLMs on multiple levels, including the object-level coverage, the accuracy of attribute descriptions, the score of key relationships, etc. Experimental results on the CompreCap dataset confirm that our evaluation method aligns closely with human evaluation scores across LVLMs. We will release the code and the dataset to support the community.
Poster
Shuxian Li · Changhao He · XitingLiu · Joey Tianyi Zhou · Xi Peng · Peng Hu

[ ExHall D ]

Abstract
Composed Image Retrieval (CIR) enables editable image search by integrating a query pair—a reference image ref and a textual modification mod—to retrieve a target image tar that reflects the intended change. While existing CIR methods have shown promising performance using well-annotated triplets (ref, mod, tar), almost all of them implicitly assume these triplets are accurately associated with each other. In practice, however, this assumption is often violated due to the limited knowledge of annotators, inevitably leading to incorrect textual modifications and resulting in a practical yet less-touched problem: noisy triplet correspondence (NTC). To tackle this challenge, we propose a Task-oriented Modification Enhancement framework (TME) to learn robustly from noisy triplets, which comprises three key modules: Robust Fusion Query (RFQ), Pseudo Text Enhancement (PTE), and Task-Oriented Prompt (TOP). Specifically, to mitigate the adverse impact of noise, RFQ employs a sample selection strategy to divide the training triplets into clean and noisy sets, thus enhancing the reliability of the training data for robust learning. To further leverage the noisy data instead of discarding it, PTE unifies the triplet noise as an adapter mismatch problem, thereby adjusting mod to align with ref and tar in the mismatched triplet. Finally, TOP replaces …
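As a hedged sketch of the kind of sample-selection step RFQ could rely on (a common small-loss recipe for noisy-label learning, not necessarily the paper's rule; function names are illustrative), one can split triplets by fitting a two-component mixture to per-triplet losses:

```python
# Hedged sketch of clean/noisy triplet splitting via a small-loss criterion (a common
# recipe for learning with noisy labels, not the paper's exact RFQ module): fit a
# two-component Gaussian mixture to per-triplet losses and treat the low-loss component
# as the clean set.
import numpy as np
from sklearn.mixture import GaussianMixture

def split_clean_noisy(losses: np.ndarray, clean_prob: float = 0.5):
    gmm = GaussianMixture(n_components=2, random_state=0).fit(losses.reshape(-1, 1))
    clean_comp = int(np.argmin(gmm.means_))                 # component with the smaller mean loss
    p_clean = gmm.predict_proba(losses.reshape(-1, 1))[:, clean_comp]
    return p_clean >= clean_prob                            # boolean mask over the training triplets

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    losses = np.concatenate([rng.normal(0.3, 0.1, 800),     # mostly clean triplets
                             rng.normal(1.5, 0.3, 200)])    # mismatched (noisy) triplets
    mask = split_clean_noisy(losses)
    print(mask.sum(), "triplets kept as clean out of", len(losses))
```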
Poster
Eric Xing · Pranavi Kolouju · Robert Pless · Abby Stylianou · Nathan Jacobs

[ ExHall D ]

Abstract
Composed image retrieval (CIR) is the task of retrieving a target image specified by a query image and a relative text that describes a semantic modification to the query image. Existing methods in CIR struggle to accurately represent the image and the text modification, resulting in subpar performance. To address this limitation, we introduce a CIR framework, ConText-CIR, trained with a Text Concept-Consistency loss that encourages the representations of noun phrases in the text modification to better attend to the relevant parts of the query image. To support training with this loss function, we also propose a synthetic data generation pipeline that creates training data from existing CIR datasets or unlabeled images. We show that these components together enable stronger performance on CIR tasks, setting a new state-of-the-art in composed image retrieval in both the supervised and zero-shot settings on the CIRR and CIRCO datasets. Source code, model checkpoints, and our new datasets will be made available upon publication.
Poster
Qiang Zou · Shuli Cheng · Jiayi Chen

[ ExHall D ]

Abstract
Cross-modal hashing is a promising approach for efficient data retrieval and storage optimization. However, contemporary methods exhibit significant limitations in semantic preservation, contextual integrity, and information redundancy, which constrains retrieval efficacy. We present PromptHash, an innovative framework leveraging affinity prompt-aware collaborative learning for adaptive cross-modal hashing. We propose an end-to-end framework for affinity-prompted collaborative hashing, with the following fundamental technical contributions: (i) a text affinity prompt learning mechanism that preserves contextual information while maintaining parameter efficiency, (ii) an adaptive gated selection fusion architecture that synthesizes State Space Model with Transformer network for precise cross-modal feature integration, and (iii) a prompt affinity alignment strategy that bridges modal heterogeneity through hierarchical contrastive learning. To the best of our knowledge, this study presents the first investigation into affinity prompt awareness within collaborative cross-modal adaptive hash learning, establishing a paradigm for enhanced semantic consistency across modalities. Through comprehensive evaluation on three benchmark multi-label datasets, PromptHash demonstrates substantial performance improvements over existing approaches. Notably, on the NUS-WIDE dataset, our method achieves significant gains of 18.22% and 18.65% in image-to-text and text-to-image retrieval tasks, respectively. The code is publicly available at https://anonymous.4open.science/r/PromptHash-8ED3.
Poster
Sungyeon Kim · Xinliang Zhu · Xiaofan Lin · Muhammet Bastan · Douglas Gray · Suha Kwak

[ ExHall D ]

Abstract
Generative retrieval is an emerging approach in information retrieval that generates identifiers (IDs) of target data based on a query, providing an efficient alternative to traditional embedding-based retrieval methods. However, existing models are task-specific and fall short of embedding-based retrieval in performance. This paper proposes GENIUS, a universal generative retrieval framework supporting diverse tasks across multiple modalities and domains. At its core, GENIUS introduces modality-decoupled semantic quantization, transforming multimodal data into discrete IDs encoding both modality and semantics. Moreover, to enhance generalization, we propose a query augmentation that interpolates between a query and its target, allowing GENIUS to adapt to varied query forms. Evaluated on the M-BEIR benchmark, it surpasses prior generative methods by a clear margin. Unlike embedding-based retrieval, GENIUS consistently maintains high retrieval speed across database size, with competitive performance across multiple benchmarks. With additional re-ranking, GENIUS often achieves results comparable to embedding-based methods while preserving efficiency.
Poster
Yingxin Lai · Cuijie Xu · Haitian Shi · Guoqing Yang · Xiaoning Li · Zhiming Luo · Shaozi Li

[ ExHall D ]

Abstract
The rapid development of generative models has significantly advanced font generation. However, limited exploration has been devoted to the evaluation and interpretability of graphical fonts. In particular, existing quality assessment models can only provide basic visual capabilities, such as recognizing clarity and brightness, and lack in-depth explanation. To address these limitations, we first constructed a large-scale multimodal dataset comprising 135,000 font-text pairs, named the Diversity Font Dataset (DFD). This dataset includes a wide range of generated font types and annotations, including language descriptions and quality assessments, providing a strong basis for training and evaluating font analysis models. Based on the dataset, we developed a Vision Language Model (VLM)-based Font-Agent with the aim of improving font quality assessment and offering interpretive question-answering capabilities. Alongside the original visual encoder in the VLM, we integrated an Edge Aware Traces (EAT) module to capture detailed edge information of font strokes and components. Furthermore, we introduce a Dynamic Direct Preference Optimization (D-DPO) strategy to facilitate efficient model fine-tuning. Experimental outcomes show that Font-Agent achieves state-of-the-art performance on the established dataset. To further assess the generalization of our algorithm, we conducted evaluations on several public datasets. The results highlight the notable advantage of Font-Agent in both assessing the quality of …
Poster
Bin Wang · Fan Wu · Linke Ouyang · Zhuangcheng Gu · Rui Zhang · Renqiu Xia · Botian Shi · Bo Zhang · Conghui He

[ ExHall D ]

Abstract
Formula recognition presents significant challenges due to the complicated structure and varied notation of mathematical expressions. Despite continuous advancements in formula recognition models, the evaluation metrics employed by these models, such as BLEU and Edit Distance, still exhibit notable limitations. They overlook the fact that the same formula has diverse representations and are highly sensitive to the distribution of training data, thereby causing unfairness in formula recognition evaluation. To this end, we propose a Character Detection Matching (CDM) metric, ensuring evaluation objectivity by designing an image-level rather than LaTeX-level metric score. Specifically, CDM renders both the model-predicted LaTeX and the ground-truth LaTeX formulas into image-formatted formulas, then employs visual feature extraction and localization techniques for precise character-level matching, incorporating spatial position information. Such a spatially-aware and character-matching method offers a more accurate and equitable evaluation compared with previous BLEU and Edit Distance metrics that rely solely on text-based character matching. Experimentally, we evaluated various formula recognition models using the CDM, BLEU, and ExpRate metrics. The results demonstrate that CDM aligns more closely with human evaluation standards and provides a fairer comparison across different models by eliminating discrepancies caused by diverse formula representations.
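A toy illustration of why an image-level comparison is invariant to the LaTeX representation (this is not the authors' character detection matching; it uses matplotlib's mathtext renderer and a simple pixel-overlap score purely as a stand-in):

```python
# Hedged toy illustration (not the CDM implementation): two different LaTeX strings for
# the same formula render to the same image, so an image-level comparison is invariant
# to representation, unlike text metrics such as edit distance.
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np

def render(latex: str) -> np.ndarray:
    fig = plt.figure(figsize=(2, 1), dpi=100)
    fig.text(0.1, 0.4, f"${latex}$", fontsize=20)     # mathtext rendering, no TeX install needed
    fig.canvas.draw()
    img = np.asarray(fig.canvas.buffer_rgba())[..., :3].mean(axis=-1) < 128  # binarize ink pixels
    plt.close(fig)
    return img

def pixel_f1(a: np.ndarray, b: np.ndarray) -> float:
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / max(a.sum() + b.sum(), 1)

if __name__ == "__main__":
    pred, gt = r"x^2+\frac{1}{2}", r"x^{2}+\frac{1}{2}"   # same formula, different LaTeX strings
    print(pixel_f1(render(pred), render(gt)))             # ~1.0 despite the textual difference
```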
Poster
Arun Reddy · Alexander Martin · Eugene Yang · Andrew Yates · Kate Sanders · Kenton Murray · Reno Kriz · Celso M. de Melo · Benjamin Van Durme · Rama Chellappa

[ ExHall D ]

Abstract
In this work, we tackle the problem of text-to-video retrieval (T2VR). Inspired by the success of late interaction techniques in text-document, text-image, and text-video retrieval, our approach, Video-ColBERT, introduces a simple and efficient mechanism for fine-grained similarity assessment between queries and videos. Video-ColBERT is built upon 3 main components: a fine-grained spatial and temporal token-wise interaction, query and visual expansions, and a dual sigmoid loss during training. We find that this interaction and training paradigm leads to strong individual, yet compatible representations for encoding video content. These representations lead to increases in performance on common text-to-video retrieval benchmarks compared to other bi-encoder methods.
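For reference, a minimal sketch of ColBERT-style late-interaction (MaxSim) scoring, which is the general mechanism such methods build on; the exact Video-ColBERT formulation, expansions, and dual sigmoid loss are not reproduced here:

```python
# Hedged sketch of ColBERT-style late interaction (MaxSim) scoring: every query token
# is matched to its best visual token, and the per-token maxima are summed into a
# relevance score. Illustrative only, not the Video-ColBERT implementation.
import torch
import torch.nn.functional as F

def maxsim_score(q: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """q: (Lq, D) query token embeddings; v: (Lv, D) video token embeddings."""
    q = F.normalize(q, dim=-1)
    v = F.normalize(v, dim=-1)
    sim = q @ v.T                          # (Lq, Lv) token-to-token cosine similarities
    return sim.max(dim=-1).values.sum()    # sum over query tokens of their best visual match

if __name__ == "__main__":
    torch.manual_seed(0)
    query = torch.randn(12, 128)           # 12 query tokens
    video = torch.randn(8 * 49, 128)       # 8 frames x 49 spatial tokens, flattened
    print(float(maxsim_score(query, video)))
```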
Poster
Jingfeng Yao · Bin Yang · Xinggang Wang

[ ExHall D ]

Abstract
Latent diffusion models (LDM) with Transformer architectures excel at generating high-fidelity images. However, recent studies reveal an optimization dilemma in this two-stage design: increasing the per-token feature dimension in visual tokenizers improves reconstruction quality but requires substantially larger diffusion models and extended training time to maintain generation performance. This results in prohibitively high computational costs, making high-dimensional tokenizers impractical. In this paper, we argue that this limitation stems from the inherent difficulty of learning unconstrained high-dimensional latent spaces and address this limitation by aligning the latent space with pre-trained vision foundation models. Our VA-VAE (Vision foundation model Aligned Variational AutoEncoder) expands the Pareto frontier of visual tokenizers, enabling 2.7 times faster Diffusion Transformers (DiT) convergence in high-dimensional latent space. To further validate our approach, we optimize a DiT baseline, referred to as LightningDiT, achieving superior performance on class conditional generation with only 6% of the original training epochs. The integrated system demonstrates the effectiveness of VA-VAE, achieving 0.28 rFID and 1.73 gFID on ImageNet-256 generation in 400 epochs—outperforming the original DiT's 0.71 rFID and 2.27 gFID in 1400 epochs, without more complex designs. To our knowledge, this marks the first latent diffusion system to achieve both superior generation and reconstruction …
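A hedged sketch of the general idea of aligning a tokenizer's latent space with a frozen vision foundation model (a simple cosine alignment term under our own assumptions; VA-VAE's actual objectives are defined in the paper):

```python
# Hedged sketch of a latent-to-foundation-feature alignment loss: project the tokenizer's
# latents into the frozen foundation model's feature space and maximize cosine similarity,
# added on top of the usual reconstruction objective. Illustrative only.
import torch
import torch.nn.functional as F

def alignment_loss(vae_latents: torch.Tensor, vfm_feats: torch.Tensor, proj: torch.nn.Linear) -> torch.Tensor:
    """vae_latents: (B, N, Dz) tokenizer latents; vfm_feats: (B, N, Df) frozen foundation features."""
    z = proj(vae_latents)                                     # map latents into the foundation feature space
    return 1.0 - F.cosine_similarity(z, vfm_feats.detach(), dim=-1).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    proj = torch.nn.Linear(16, 768)
    latents = torch.randn(2, 256, 16, requires_grad=True)     # high-dimensional-per-token latent grid
    frozen_feats = torch.randn(2, 256, 768)                   # e.g. patch features from a frozen encoder
    loss = alignment_loss(latents, frozen_feats, proj)
    loss.backward()
    print(float(loss))
```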
Poster
Leqi Shen · Guoqiang Gong · Tianxiang Hao · Tao He · Yifeng Zhang · Pengzhang Liu · Sicheng Zhao · Jungong Han · Guiguang Ding

[ ExHall D ]

Abstract
The parameter-efficient adaptation of the image-text pretraining model CLIP for video-text retrieval is a prominent area of research. While CLIP is focused on image-level vision-language matching, video-text retrieval demands comprehensive understanding at the video level. Three key discrepancies emerge in the transfer from image-level to video-level: vision, language, and alignment. However, existing methods mainly focus on vision while neglecting language and alignment. In this paper, we propose Discrepancy Reduction in Vision, Language, and Alignment (DiscoVLA), which simultaneously mitigates all three discrepancies. Specifically, we introduce Image-Video Features Fusion to integrate image-level and video-level features, effectively tackling both vision and language discrepancies. Additionally, we generate pseudo image captions to learn fine-grained image-level alignment. To mitigate alignment discrepancies, we propose Image-to-Video Alignment Distillation, which leverages image-level alignment knowledge to enhance video-level alignment. Extensive experiments demonstrate the superiority of our DiscoVLA. In particular, on MSRVTT with CLIP (ViT-B/16), DiscoVLA outperforms previous methods by 2.2% R@1 and 7.5% R@sum. Our code will be made available.
Poster
Siyuan Li · Luyuan Zhang · Zedong Wang · Juanxi Tian · Cheng Tan · Zicheng Liu · Chang Yu · Qingsong Xie · Haonan Lu · Haoqian Wang · Zhen Lei

[ ExHall D ]

Abstract
Masked Image Modeling (MIM) with Vector Quantization (VQ) has achieved great success in self-supervised pre-training and image generation. However, most methods cannot solve two technical challenges simultaneously: (1) the gradient approximation in classical VQ (e.g., VQGAN) sets an optimization bottleneck for the codebook and encoder; (2) the trade-off in the number of kept tokens between generation quality and representation learning efficiency. To unveil the full power of this learning paradigm, this paper introduces MergeVQ, a unified framework that achieves both visual representation learning and generation in an auto-encoding way with token merging and vector quantization. For pre-training, we merge the full set of spatial tokens into top-k semantic ones with a token merge module after the self-attention blocks in the encoder and recover the positions of the merged tokens with cross-attention blocks in the decoder. For generation, we quantize the selected top-k tokens with Look-up Free Quantization (LFQ), which has no training bottleneck, and trade off computational overhead against generation quality. Extensive experiments on ImageNet verify that MergeVQ achieves both performance gains and speed-ups across image generation and self-supervised pre-training scenarios.
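For context, a minimal sketch of Look-up Free Quantization as commonly described (per-dimension sign quantization with a straight-through estimator); MergeVQ's exact variant may differ:

```python
# Hedged sketch of Look-up Free Quantization (LFQ): each latent dimension is quantized
# to {-1, +1} by its sign, and the resulting bit pattern is the token index, so no
# codebook lookup is needed. Illustrative only.
import torch

def lfq(z: torch.Tensor):
    """z: (N, D) continuous token features. Returns quantized features and integer codes."""
    q = torch.where(z >= 0, 1.0, -1.0)                        # per-dimension sign quantization
    bits = (q > 0).long()                                     # (N, D) bit pattern
    powers = 2 ** torch.arange(z.shape[1], device=z.device)   # binary -> integer code
    codes = (bits * powers).sum(dim=-1)
    q_st = z + (q - z).detach()                               # straight-through estimator for training
    return q_st, codes

if __name__ == "__main__":
    torch.manual_seed(0)
    z = torch.randn(4, 10)                # 10 bits -> an implicit 1024-entry codebook
    q, codes = lfq(z)
    print(codes.tolist())
```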
Poster
Alejandro Lozano · Min Woo Sun · James Burgess · Liangyu Chen · Jeffrey J Nirschl · Jeffrey Gu · Ivan Lopez · Josiah Aklilu · Austin Wolfgang Katzer · Collin Chiu · Anita Rau · Xiaohan Wang · Yuhui Zhang · Alfred Seunghoon Song · Robert Tibshirani · Serena Yeung

[ ExHall D ]

Abstract
The development of vision-language models (VLMs) is driven by large-scale and diverse multi-modal datasets. However, progress toward generalist biomedical VLMs is limited by the lack of annotated, publicly accessible datasets across biology and medicine. Existing efforts are limited to narrow domains, missing the opportunity to leverage the full diversity of biomedical knowledge encoded in scientific literature. To address this gap, we introduce BIOMEDICA: a scalable, open-source framework to extract, annotate, and serialize the entirety of the PubMed Central Open Access subset into an easy-to-use, publicly accessible dataset. Our framework produces a comprehensive archive with over 24 million unique image-text pairs from over 6 million articles. Metadata and expert-guided annotations are additionally provided. We demonstrate the utility and accessibility of our resource by releasing BMCA-LIP, a suite of CLIP-style models continuously pre-trained on BIOMEDICA dataset via streaming (eliminating the need to download 27 TB of data locally). On average, our models achieve state-of-the-art performance across 40 tasks — spanning pathology, radiology, ophthalmology, dermatology, surgery, molecular biology, parasitology, and cell biology — excelling in zero-shot classification with 5.57% average improvement (as high as 26.93% and 17.63% gains in surgery and ophthalmology, respectively) and stronger image-text retrieval while using 10x less compute.
Poster
XuDong Wang · Xingyi Zhou · Alireza Fathi · Trevor Darrell · Cordelia Schmid

[ ExHall D ]

Abstract
We present Visual Lexicon, an image representation that encodes visual information in the text space while retaining intricate visual details that are often challenging to convey in natural language. Unlike traditional methods that prioritize either high-level semantics (e.g., CLIP) or pixel-level reconstruction (e.g., VAE), ViLex captures both rich semantic content and fine visual details, facilitating high-quality image generation and visual scene understanding. Using a self-supervised learning pipeline, ViLex generates embeddings optimized for reconstructing input images through a frozen text-to-image (T2I) diffusion model, preserving the detailed information necessary for high-fidelity semantic level reconstruction. As visual embeddings in the text space, ViLex embeddings can be used independently as text tokens or combined with natural language tokens for zero-shot multimodal image generation. ViLex is also compatible with downstream vision-language tasks like visual question answering and referring expression segmentation, significantly enhancing performance. Experiments demonstrate that ViLex achieves higher fidelity in image reconstruction compared to text-based embeddings—even with a single token. ViLex also performs various DreamBooth tasks in a zero-shot manner without the need for fine-tuning T2I models, and serves as a powerful vision encoder, consistently enhancing vision-language model performance across 15 benchmarks compared to a strong SigLIP baseline.
Poster
Fiona Ryan · Josef Sivic · Fabian Caba Heilbron · Judy Hoffman · James Rehg · Bryan Russell

[ ExHall D ]

Abstract
Personalized vision-language retrieval seeks to recognize new concepts (e.g., "my dog Fido") from only a few examples. This task is challenging because it requires not only learning a new concept from a few images, but also integrating the personal and general knowledge together to recognize the concept in different contexts. In this paper, we show how to effectively adapt the internal representation of a vision-language dual encoder model for personalized vision-language retrieval. We find that regularized low-rank adaptation of a small set of parameters in the language encoder's final layer serves as a highly effective alternative to textual inversion for recognizing the personal concept while preserving general knowledge. Additionally, we explore strategies for combining parameters of multiple learned personal concepts, finding that parameter addition is effective. To evaluate how well general knowledge is preserved in a finetuned representation, we introduce a metric that measures image retrieval accuracy based on captions generated by a vision language model (VLM). Our approach achieves state-of-the-art accuracy on two benchmarks for personalized image retrieval with natural language queries -- DeepFashion2 and ConConChi -- outperforming the prior art by 4%-22% on personal retrievals.
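A minimal sketch of combining personal concepts by parameter addition, assuming each concept is stored as a low-rank delta on the final language-encoder layer (shapes and names are illustrative, not the authors' code):

```python
# Hedged sketch of merging multiple personal concepts by parameter addition: each concept
# is a low-rank (LoRA-style) delta on a weight matrix, and deltas are summed onto the base
# weight. Illustrative only.
import torch

def merge_concept_deltas(base_weight: torch.Tensor, deltas):
    """base_weight: (out, in) final-layer weight; deltas: list of LoRA factors (A: (out, r), B: (r, in))."""
    merged = base_weight.clone()
    for A, B in deltas:
        merged += A @ B                   # add each concept's low-rank update
    return merged

if __name__ == "__main__":
    torch.manual_seed(0)
    W = torch.randn(512, 512)
    fido = (torch.randn(512, 4) * 0.01, torch.randn(4, 512) * 0.01)   # concept 1
    mug = (torch.randn(512, 4) * 0.01, torch.randn(4, 512) * 0.01)    # concept 2
    W_personal = merge_concept_deltas(W, [fido, mug])
    print(torch.norm(W_personal - W).item())  # small perturbation, preserving general knowledge
```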
Poster
Jingyi Xie · Jintao Yang · Zhunchen Luo · Yunbo Cao · Qiang Gao · Mengyuan Zhang · Wenpeng Hu

[ ExHall D ]

Abstract
Adapting Multi-modal Large Language Models (MLLMs) to target tasks often suffers from catastrophic forgetting, where acquiring new task-specific knowledge compromises performance on pre-trained tasks. In this paper, we introduce AdaDARE-γ, an efficient approach that alleviates catastrophic forgetting by controllably injecting new task-specific knowledge through adaptive parameter selection from fine-tuned models, without requiring retraining procedures. This approach consists of two key innovations: (1) an adaptive parameter selection mechanism that identifies and retains the most task-relevant parameters from fine-tuned models, and (2) a controlled task-specific information injection strategy that precisely balances the preservation of pre-trained knowledge with the acquisition of new capabilities. Theoretical analysis proves the optimality of our parameter selection strategy and establishes bounds for the task-specific information injection factor. Extensive experiments on InstructBLIP and LLaVA-1.5 across image captioning and visual question answering tasks demonstrate that AdaDARE-γ establishes new state-of-the-art results in balancing model performance. Specifically, it maintains 98.2% of pre-training effectiveness on original tasks while achieving 98.7% of standard fine-tuning performance on target tasks.
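As a hedged sketch of the general selective-injection recipe (in the spirit of DARE-style merging; the exact AdaDARE-γ selection rule and theory are not reproduced), one can keep only the largest-magnitude deltas and inject them scaled by a factor γ:

```python
# Hedged sketch of adaptive delta selection and controlled injection: keep only the
# largest-magnitude parameter changes from the fine-tuned model and add them to the
# pre-trained weights scaled by gamma. Illustrative only; not the paper's exact rule.
import torch

def inject_task_delta(pretrained: torch.Tensor, finetuned: torch.Tensor,
                      keep_ratio: float = 0.2, gamma: float = 0.8) -> torch.Tensor:
    delta = finetuned - pretrained
    k = max(1, int(keep_ratio * delta.numel()))
    thresh = delta.abs().flatten().topk(k).values.min()   # magnitude threshold for the top-k entries
    mask = (delta.abs() >= thresh).float()                # keep only the most task-relevant changes
    return pretrained + gamma * mask * delta

if __name__ == "__main__":
    torch.manual_seed(0)
    w_pre = torch.randn(256, 256)
    w_ft = w_pre + 0.05 * torch.randn(256, 256)
    w_merged = inject_task_delta(w_pre, w_ft, keep_ratio=0.2, gamma=0.8)
    print(float((w_merged - w_pre).abs().mean()))         # most pre-trained weights are untouched
```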
Poster
Pavan Vasu Vasu · Fartash Faghri · Chun-Liang Li · Cem Koc · Nate True · Gokula Krishnan Santhanam · Albert Antony · James Gabriel · Peter Grasch · Oncel Tuzel · Hadi Pouransari

[ ExHall D ]

Abstract
Vision Language Models (VLMs) like LLaVA encode images into tokens aligned to the word embedding space of the LLM decoder. Scaling input image resolution is essential for improving performance, especially in text-rich image understanding tasks. However, popular visual encoders such as CLIP-pretrained ViTs become inefficient at high resolutions due to the large number of tokens and high encoding latency caused by stacked self-attention layers. At different operational resolutions, the vision encoder of a VLM can be optimized along two axes: reducing encoding latency and minimizing the number of visual tokens passed to the LLM, thereby lowering overall latency. In this work, we introduce FastVLM, which achieves an optimized trade-off between resolution, latency, and accuracy by incorporating FastViTHD—a new hybrid vision encoder that outputs fewer tokens and significantly reduces encoding time while processing high-resolution images. We provide a comprehensive efficiency analysis of the interplay between image resolution, vision latency, number of visual tokens, and LLM size. In the LLaVA-1.5 setup, we achieve 3.2× improvement in overall time-to-first-token (TTFT) while maintaining similar performance on VLM benchmarks compared to prior works. On text-rich evaluations like TextVQA and DocVQA, FastVLM obtains +8.4% and +12.5% better accuracy than ConvLLaVA at a similar operating point of …
Poster
Zhi Zhang · Srishti Yadav · Fengze Han · Ekaterina Shutova

[ ExHall D ]

Abstract
The recent advancements in auto-regressive multi-modal large language models (MLLMs) have demonstrated promising progress for vision-language tasks. While there exists a variety of studies investigating the processing of linguistic information within large language models, little is currently known about the inner working mechanism of MLLMs and how linguistic and visual information interact within these models. In this study, we aim to fill this gap by examining the information flow between different modalities---language and vision---in MLLMs, focusing on visual question answering. Specifically, given an image-question pair as input, we investigate where in the model and how the visual and linguistic information are combined to generate the final prediction. Conducting experiments with a series of models from the LLaVA series, we find that there are two distinct stages in the process of integration of the two modalities. In the lower layers, the model first transfers the more general visual features of the whole image into the representations of (linguistic) question tokens. In the middle layers, it once again transfers visual information about specific objects relevant to the question to the respective token positions of the question. Finally, in the higher layers, the resulting multimodal representation is propagated to the last position of the …
Poster
Senqiao Yang · Yukang Chen · Zhuotao Tian · Chengyao Wang · Jingyao Li · Bei Yu · Jiaya Jia

[ ExHall D ]

Abstract
Recent advancements in vision-language models have enhanced performance by increasing the length of visual tokens, making them much longer than text tokens and significantly raising computational costs. However, we observe that the visual tokens generated by popular vision encoders, such as CLIP and SigLIP, contain significant redundancy. To address this, we introduce VisionZip, a simple yet effective method that selects a set of informative tokens for input to the language model, reducing visual token redundancy and improving efficiency while maintaining model performance. The proposed VisionZip can be widely applied to image and video understanding tasks and is well-suited for multi-turn dialogues in real-world scenarios, where previous methods tend to underperform. Experimental results show that VisionZip outperforms the previous state-of-the-art method by at least 5% performance gains across nearly all settings. Moreover, our method significantly enhances model inference speed, improving the prefilling time by 8× and enabling the LLaVA-Next 13B model to infer faster than the LLaVA-Next 7B model while achieving better results. Furthermore, we analyze the causes of this redundancy and encourage the community to focus on extracting better visual features rather than merely increasing token length. All code and models will be publicly available.
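A minimal sketch of attention-based visual token selection of the kind VisionZip performs, under our own assumptions (ranking patch tokens by [CLS] attention and keeping the top-k); the paper's dominant/contextual token design is not reproduced:

```python
# Hedged sketch of attention-based visual token selection: rank patch tokens by the
# attention they receive from the [CLS] token and keep the top-k, so far fewer visual
# tokens are passed to the LLM. Illustrative only.
import torch

def select_informative_tokens(patch_tokens: torch.Tensor, cls_attn: torch.Tensor, keep: int = 64):
    """patch_tokens: (N, D) encoder outputs; cls_attn: (N,) attention from [CLS] to each patch."""
    idx = cls_attn.topk(keep).indices.sort().values   # keep spatial order of the retained tokens
    return patch_tokens[idx], idx

if __name__ == "__main__":
    torch.manual_seed(0)
    tokens = torch.randn(576, 1024)       # e.g. 24x24 patches from a CLIP-style encoder
    attn = torch.rand(576)
    kept, idx = select_informative_tokens(tokens, attn, keep=64)
    print(kept.shape)                     # (64, 1024): roughly 9x fewer tokens passed to the LLM
```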
Poster
Cheng Yang · Yang Sui · Jinqi Xiao · Lingyi Huang · Yu Gong · Chendi Li · Jinghua Yan · Yu Bai · Ponnuswamy Sadayappan · Xia Hu · Bo Yuan

[ ExHall D ]

Abstract
Vision-Language Models (VLMs) demand substantial computational resources during inference, largely due to the extensive visual input tokens required to represent visual information. Previous studies have observed that visual tokens often receive less attention than other tokens, such as system and instruction tokens, highlighting their lower relative importance during VLM inference and then pruning redundant visual tokens. However, previous approaches to token pruning encounter several challenges: reliance on heuristic criteria for token importance and incompatibility with FlashAttention and KV cache. To address these issues, we introduce TopV, a compatible Token Pruning with inference Time Optimization for fast and low-memory VLM, achieving efficient pruning without additional training or fine-tuning. Instead of relying on attention scores as the importance metric in the previous works, we formulate token pruning as an optimization problem, allowing us to accurately identify important visual tokens. By avoiding the need for attention scores, our approach maintains compatibility with FlashAttention. Additionally, since we only perform this pruning once during the prefilling stage, it effectively reduces KV cache size. Our optimization framework incorporates several critical components. First, given the to-be-pruned source tokens, we investigate the appropriate positions of target tokens within the VLM layer. Then, we define a visual-aware cost function …
Poster
Wangbo Zhao · Yizeng Han · Jiasheng Tang · Zhikai Li · Yibing Song · Kai Wang · Zhangyang Wang · Yang You

[ ExHall D ]

Abstract
Vision-language models (VLMs) have shown remarkable success across various multi-modal tasks, yet large VLMs encounter significant efficiency challenges due to processing numerous visual tokens. A promising approach to accelerating large VLM inference is using partial information, such as attention maps from specific layers, to assess token importance and prune less essential tokens. However, our study reveals three key insights: (i) Partial attention information is insufficient for accurately identifying critical visual tokens, resulting in suboptimal performance, especially at low token retention ratios; (ii) Global attention information, such as the attention map aggregated across all layers, more effectively preserves essential tokens and maintains performance under aggressive pruning. However, it requires a full inference pass, which increases computational load and is therefore impractical in existing methods; and (iii) The global attention map aggregated from a small VLM closely resembles that of a large VLM, suggesting an efficient alternative. Based on these findings, we introduce Small VLM Guidance for Large VLMs (SGL). Specifically, we employ the aggregated attention map from a small VLM to guide the pruning of visual tokens in a large VLM. Additionally, we develop a small VLM early exiting mechanism to make full use of the small VLM's predictions, dynamically invoking the …
Poster
Souhail Hadgi · Luca Moschella · Andrea Santilli · Diego Gomez · Qixing Huang · Emanuele Rodolà · Simone Melzi · Maks Ovsjanikov

[ ExHall D ]

Abstract
Recent works have shown that, when trained at scale, uni-modal 2D vision and text encoders converge to learned features that share remarkable structural properties, despite arising from different representations. However, the role of 3D encoders with respect to other modalities remains unexplored. Furthermore, existing 3D foundation models that leverage large datasets are typically trained with explicit alignment objectives with respect to frozen encoders from other representations. In this work, we investigate the possibility of a posteriori alignment of representations obtained from uni-modal 3D encoders compared to text-based feature spaces. We show that naive post-training feature alignment of uni-modal text and 3D encoders results in limited performance. We then focus on extracting subspaces of the corresponding feature spaces and discover that by projecting learned representations onto well-chosen lower-dimensional subspaces the quality of alignment becomes significantly higher, leading to improved accuracy on matching and retrieval tasks. Our analysis further sheds light on the nature of these shared subspaces, which roughly separate between semantic and geometric data representations. Overall, ours is the first work that helps to establish a baseline for post-training alignment of 3D uni-modal and text feature spaces, and helps to highlight both the shared and unique properties of 3D data …
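A hedged sketch of post-hoc subspace alignment between two uni-modal feature spaces, under our own assumptions (PCA subspaces plus a least-squares map fit on paired data); the paper's subspace selection is more refined:

```python
# Hedged sketch of post-hoc subspace alignment: project each modality onto its top
# principal components, fit a least-squares linear map on paired examples, and retrieve
# by cosine similarity. Illustrative only.
import numpy as np

def pca_project(x: np.ndarray, dim: int) -> np.ndarray:
    x = x - x.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return x @ vt[:dim].T

def align_and_retrieve(feat_3d: np.ndarray, feat_txt: np.ndarray, dim: int = 32) -> float:
    a, b = pca_project(feat_3d, dim), pca_project(feat_txt, dim)
    w, *_ = np.linalg.lstsq(a, b, rcond=None)            # linear map from the 3D subspace to the text subspace
    pred = a @ w
    sim = pred @ b.T / (np.linalg.norm(pred, axis=1, keepdims=True) * np.linalg.norm(b, axis=1) + 1e-8)
    return float((sim.argmax(axis=1) == np.arange(len(a))).mean())   # top-1 matching accuracy

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    shared = rng.normal(size=(200, 32))                  # a shared latent both encoders partly capture
    f3d = shared @ rng.normal(size=(32, 256)) + 0.1 * rng.normal(size=(200, 256))
    ftx = shared @ rng.normal(size=(32, 512)) + 0.1 * rng.normal(size=(200, 512))
    print(align_and_retrieve(f3d, ftx))                  # high when a shared subspace exists
```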
Poster
Yahan Tu · Rui Hu · Jitao Sang

[ ExHall D ]

Abstract
Hallucination poses a persistent challenge for multimodal large language models (MLLMs). However, existing benchmarks for evaluating hallucinations are generally static, which may overlook the potential risk of data contamination. To address this issue, we propose ODE, an open-set, dynamic protocol designed to evaluate object hallucinations in MLLMs at both the existence and attribute levels. ODE employs a graph-based structure to represent real-world object concepts, their attributes, and the distributional associations between them. This structure facilitates the extraction of concept combinations based on diverse distributional criteria, generating varied samples for structured queries that evaluate hallucinations in both generative and discriminative tasks. Through the generation of new samples, dynamic concept combinations, and varied distribution frequencies, ODE mitigates the risk of data contamination and broadens the scope of evaluation. This protocol is applicable to both general and specialized scenarios, including those with limited data. Experimental results demonstrate the effectiveness of our protocol, revealing that MLLMs exhibit higher hallucination rates when evaluated with ODE-generated samples, which indicates potential data contamination. Furthermore, these generated samples aid in analyzing hallucination patterns and fine-tuning models, offering an effective approach to mitigating hallucinations in MLLMs.
Poster
jiajun cao · Yuan Zhang · Tao Huang · Ming Lu · Qizhe Zhang · Ruichuan An · Ningning Ma · Shanghang Zhang

[ ExHall D ]

Abstract
Visual encoders are fundamental components in vision-language models (VLMs), each showcasing unique strengths derived from various pre-trained visual foundation models. To leverage the various capabilities of these encoders, recent studies incorporate multiple encoders within a single VLM, leading to a considerable increase in computational cost. In this paper, we present Mixture-of-Visual-Encoder Knowledge Distillation (MoVE-KD), a novel framework that distills the unique proficiencies of multiple vision encoders into a single, efficient encoder model. Specifically, to mitigate conflicts and retain the unique characteristics of each teacher encoder, we employ low-rank adaptation (LoRA) and mixture-of-experts (MoEs) to selectively activate specialized knowledge based on input features, enhancing both adaptability and efficiency. To regularize the KD process and enhance performance, we propose an attention-based distillation strategy that adaptively weighs the different visual encoders and emphasizes valuable visual tokens, reducing the burden of replicating comprehensive but distinct features from multiple teachers. Comprehensive experiments on popular VLMs, such as LLaVA and LLaVA-NeXT, validate the effectiveness of our method. The code will be released.
Poster
Jiazhen Liu · Yuhan Fu · Ruobing Xie · Runquan Xie · Xingwu Sun · Fengzong Lian · Zhanhui Kang · Xirong Li

[ ExHall D ]

Abstract
Multimodal Large Language Models (MLLMs) hallucinate, resulting in an emerging topic of visual hallucination evaluation (VHE). This paper contributes a ChatGPT-Prompted visual hallucination evaluation Dataset (PhD) for objective VHE at a large scale. The essence of VHE is to ask an MLLM questions about specific images to assess its susceptibility to hallucination. Depending on what to ask (objects, attributes, sentiment, etc.) and how the questions are asked, we structure PhD along two dimensions, i.e., task and mode. Five visual recognition tasks, ranging from low-level (object / attribute recognition) to middle-level (sentiment / position recognition and counting), are considered. Besides a normal visual QA mode, which we term PhD-base, PhD also asks questions with inaccurate context (PhD-iac) or with incorrect context (PhD-icc), or with AI-generated counter common sense images (PhD-ccs). We construct PhD by a ChatGPT-assisted semi-automated pipeline, encompassing four pivotal modules: task-specific hallucinatory item (hitem) selection, hitem-embedded question generation, inaccurate / incorrect context generation, and counter-common-sense (CCS) image generation. With over 14k daily images, 750 CCS images and 102k VQA triplets in total, PhD reveals considerable variability in MLLMs' performance across various modes and tasks, offering valuable insights into the nature of hallucination. As such, PhD stands as a potent …
Poster
Yongting Zhang · Lu Chen · Guodong Zheng · Yifeng Gao · Rui Zheng · Jinlan Fu · Zhenfei Yin · Senjie Jin · Yu Qiao · Xuanjing Huang · Feng Zhao · Tao Gui · Jing Shao

[ ExHall D ]

Abstract
The emergence of Vision Language Models (VLMs) has brought unprecedented advances in understanding multimodal information. The combination of textual and visual semantics in VLMs is highly complex and diverse, making the safety alignment of these models challenging. Furthermore, due to the limited study on the safety alignment of VLMs, there is a lack of large-scale, high-quality datasets. To address these limitations, we propose a Safety Preference Alignment dataset for Vision Language Models named SPA-VL. In terms of breadth, SPA-VL covers 6 harmfulness domains, 13 categories, and 53 subcategories, and contains 100,788 samples of the quadruple (question, image, chosen response, rejected response). In terms of depth, the responses are collected from 12 open-source (e.g., QwenVL) and closed-source (e.g., Gemini) VLMs to ensure diversity. The construction of preference data is fully automated, and the experimental results indicate that models trained with alignment techniques on the SPA-VL dataset exhibit substantial improvements in harmlessness and helpfulness while maintaining core capabilities. SPA-VL, as a large-scale, high-quality, and diverse dataset, represents a significant milestone in ensuring that VLMs achieve both harmlessness and helpfulness.
Poster
Yanbo Wang · Jiyang Guan · Jian Liang · Ran He

[ ExHall D ]

Abstract
Multi-modal large language models (MLLMs) have made significant progress, yet their safety alignment remains limited. Typically, current open-source MLLMs rely on the alignment inherited from their language module to avoid harmful generations. However, the lack of safety measures specifically designed for multi-modal inputs creates an alignment gap, leaving MLLMs vulnerable to vision-domain attacks such as typographic manipulation. Current methods utilize a carefully designed safety dataset to enhance model defense capability. However, it is unknown what is actually learned in the high-quality dataset. Through comparison experiments, we find that the alignment gap primarily arises from data distribution biases, while image content, response quality, or the contrastive behavior of the dataset makes little contribution to boosting multi-modal safety. To further investigate this and identify the key factors in improving MLLM safety, we propose finetuning MLLMs on a small set of benign instruction-following data with responses replaced by simple, clear rejection sentences. Experiments show that, without the need for labor-intensive collection of high-quality malicious data, model safety can still be significantly improved, as long as a specific fraction of rejection data exists in the finetuning set, indicating that the safety alignment is not lost but rather obscured during multi-modal pretraining or instruction fine-tuning. Simply correcting the underlying …
Poster
Shuyang Hao · Bryan Hooi · Jun Liu · Kai-Wei Chang · Zi Huang · Yujun Cai

[ ExHall D ]

Abstract
Despite inheriting security measures from underlying language models, Vision-Language Models (VLMs) may still be vulnerable to safety alignment issues. Through empirical analysis, we uncover two critical findings: scenario-matched images can significantly amplify harmful outputs, and contrary to common assumptions in gradient-based attacks, minimal loss values do not guarantee optimal attack effectiveness. Building on these insights, we introduce MLAI (Multi-Loss Adversarial Images), a novel jailbreak framework that leverages scenario-aware image generation for semantic alignment, exploits flat minima theory for robust adversarial image selection, and employs multi-image collaborative attacks for enhanced effectiveness. Extensive experiments demonstrate MLAI's significant impact, achieving attack success rates of 77.75\% on MiniGPT-4 and 82.80\% on LLaVA-2, substantially outperforming existing methods by margins of 34.37\% and 12.77\% respectively. Furthermore, MLAI shows considerable transferability to commercial black-box VLMs, achieving up to 60.11\% success rate. Our work reveals fundamental visual vulnerabilities in current VLMs safety mechanisms and underscores the need for stronger defenses. Warning: This paper contains potentially harmful example text.
Poster
Jiaming Zhang · Junhong Ye · Xingjun Ma · Yige Li · Yunfan Yang · Yunhao Chen · Jitao Sang · Dit-Yan Yeung

[ ExHall D ]

Abstract
Due to their multimodal capabilities, Vision-Language Models (VLMs) have found numerous impactful applications in real-world scenarios. However, recent studies have revealed that VLMs are vulnerable to image-based adversarial attacks, particularly targeted adversarial images that manipulate the model to generate harmful content specified by the adversary. Current attack methods rely on predefined target labels to create targeted adversarial attacks, which limits their scalability and applicability for large-scale robustness evaluations. In this paper, we propose **AnyAttack**, a self-supervised framework that generates targeted adversarial images for VLMs without label supervision, allowing **any** image to serve as a target for the **attack**. Our framework employs the "pre-training and fine-tuning" paradigm, with the adversarial noise generator pre-trained on the large-scale LAION-400M dataset. This large-scale pre-training endows our method with powerful transferability across a wide range of VLMs. Extensive experiments on five mainstream open-source VLMs (CLIP, BLIP, BLIP2, InstructBLIP, and MiniGPT-4) across three multimodal tasks (image-text retrieval, multimodal classification, and image captioning) demonstrate the effectiveness of our attack. Additionally, we successfully transfer AnyAttack to multiple commercial VLMs, including Google Gemini, Claude Sonnet, Microsoft Copilot and OpenAI GPT. These results reveal an unprecedented risk to VLMs, highlighting the need for effective countermeasures.
Poster
Xin Wang · Kai Chen · Jiaming Zhang · Jingjing Chen · Xingjun Ma

[ ExHall D ]

Abstract
Large pre-trained Vision-Language Models (VLMs) such as CLIP have demonstrated excellent zero-shot generalizability across various downstream tasks. However, recent studies have shown that the inference performance of CLIP can be greatly degraded by small adversarial perturbations, especially its visual modality, posing significant safety threats. To mitigate this vulnerability, in this paper, we propose a novel defense method called Test-Time Adversarial Prompt Tuning (TAPT) to enhance the inference robustness of CLIP against visual adversarial attacks. TAPT is a test-time defense method that learns defensive bimodal (textual and visual) prompts to robustify the inference process of CLIP. Specifically, it is an unsupervised method that optimizes the defensive prompts for each test sample by minimizing a multi-view entropy and aligning adversarial-clean distributions. We evaluate the effectiveness of TAPT on 11 benchmark datasets, including ImageNet and 10 other zero-shot datasets, demonstrating that it enhances the zero-shot adversarial robustness of the original CLIP by at least 48.9\% against AutoAttack (AA), while largely maintaining performance on clean examples. Moreover, TAPT outperforms existing adversarial prompt tuning methods across various backbones, achieving an average robustness improvement of at least 36.6\%.
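A minimal sketch of test-time tuning by multi-view entropy minimization, the mechanism the abstract attributes to TAPT; the tiny prompted classifier and all shapes below are hypothetical stand-ins rather than the paper's CLIP-based bimodal prompts.

```python
import torch
import torch.nn.functional as F

# Stand-in for a CLIP-like classifier whose prompt tokens are tunable at test time.
class TinyPromptedClassifier(torch.nn.Module):
    def __init__(self, dim=64, num_classes=10, prompt_len=4):
        super().__init__()
        self.prompt = torch.nn.Parameter(torch.zeros(prompt_len, dim))  # defensive prompt
        self.proj = torch.nn.Linear(dim, num_classes)

    def forward(self, x):
        # Add the pooled prompt as a bias-like shift before classification (illustrative only).
        return self.proj(x + self.prompt.mean(dim=0))

def multi_view_entropy(model, views):
    """Average the softmax over augmented views of the test batch, then take its entropy."""
    probs = torch.stack([F.softmax(model(v), dim=-1) for v in views]).mean(dim=0)
    return -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1).mean()

model = TinyPromptedClassifier()
views = [torch.randn(8, 64) for _ in range(4)]   # hypothetical augmented views of a test batch
opt = torch.optim.Adam([model.prompt], lr=1e-2)  # only the defensive prompt is updated
for _ in range(3):                               # a few unsupervised test-time steps
    opt.zero_grad()
    loss = multi_view_entropy(model, views)
    loss.backward()
    opt.step()
```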
Poster
Baoshun Tong · Hanjiang Lai · Yan Pan · Jian Yin

[ ExHall D ]

Abstract
Pre-trained Vision-Language Models (VLMs) like CLIP have demonstrated strong zero-shot generalization capabilities. Despite their effectiveness on various downstream tasks, they remain vulnerable to adversarial samples. Existing methods fine-tune VLMs to improve their performance by performing adversarial training on a certain dataset. However, this can lead to model overfitting and is not a true zero-shot scenario. In this paper, we propose a truly zero-shot and training-free approach that can significantly improve the VLM's zero-shot adversarial robustness. Specifically, we first discover that simply adding Gaussian noise greatly enhances the VLM's zero-shot performance. Then, we treat the adversarial examples with added Gaussian noise as anchors and strive to find a path in the embedding space that leads from the adversarial examples to the clean samples. We improve the VLMs' generalization abilities in a truly zero-shot and training-free manner compared to previous methods. Extensive experiments on 16 datasets demonstrate that our method can achieve state-of-the-art zero-shot robust performance, improving the top-1 robust accuracy by an average of 9.77%. The code will be publicly available.
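A minimal sketch of the first observation above (averaging zero-shot predictions over Gaussian-perturbed copies of the input); it does not implement the anchor-path search, and the encoder, text features, and noise level below are hypothetical stand-ins.

```python
import torch
import torch.nn.functional as F

def noisy_zero_shot_logits(encode_image, text_feats, images, sigma=0.05, n_samples=8):
    """Average zero-shot logits over Gaussian-perturbed copies of the input.

    encode_image: callable returning image features (e.g., a frozen CLIP visual encoder).
    text_feats:   (C, D) L2-normalized class text features.
    """
    logits = 0.0
    for _ in range(n_samples):
        noisy = images + sigma * torch.randn_like(images)
        feats = F.normalize(encode_image(noisy), dim=-1)
        logits = logits + feats @ text_feats.T
    return logits / n_samples

# Hypothetical stand-ins for a frozen encoder and class text features.
encode_image = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 128))
text_feats = F.normalize(torch.randn(10, 128), dim=-1)
images = torch.randn(4, 3, 32, 32)   # would be adversarial images in the paper's setting
print(noisy_zero_shot_logits(encode_image, text_feats, images).shape)  # torch.Size([4, 10])
```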
Poster
Julio Silva-Rodríguez · Ismail Ben Ayed · Jose Dolz

[ ExHall D ]

Abstract
Vision-language models pre-trained at large scale have shown unprecedented adaptability and generalization to downstream tasks. Although their discriminative potential has been widely explored, their reliability and uncertainty are still overlooked. In this work, we investigate the capabilities of CLIP models under the split conformal prediction paradigm, which provides theoretical guarantees to black-box models based on a small, labeled calibration set. In contrast to the main body of literature on conformal predictors in vision classifiers, foundation models exhibit a particular characteristic: they are pre-trained on a one-time basis on an inaccessible source domain, different from the transferred task. This domain drift negatively affects the efficiency of the conformal sets and poses additional challenges. To alleviate this issue, we propose Conf-OT, a transfer learning setting that operates transductively over the combined calibration and query sets. Solving an optimal transport problem, the proposed method bridges the domain gap between pre-training and adaptation without requiring additional data splits while still maintaining coverage guarantees. We comprehensively explore this conformal prediction strategy on a broad span of 15 datasets and three popular non-conformity scores. Conf-OT provides consistent relative improvements of up to 20% on set efficiency while being 15× faster than popular transductive approaches.
Poster
Ashshak Sharifdeen · Muhammad Akhtar Munir · Sanoojan Baliah · Salman Khan · Muhammad Haris Khan

[ ExHall D ]

Abstract
Test-time prompt tuning for vision-language models (VLMs) is attracting attention due to its ability to learn with unlabeled data without fine-tuning. Although test-time prompt tuning methods for VLMs can boost accuracy, the resulting models tend to demonstrate poor calibration, which casts doubts on the reliability and trustworthiness of these models. Notably, more attention needs to be devoted to calibrating test-time prompt tuning in vision-language models. To this end, we propose a new approach, called O-TPT, that introduces orthogonality constraints on the textual features corresponding to the learnable prompts for calibrating test-time prompt tuning in VLMs. Towards introducing orthogonality constraints, we make the following contributions. First, we uncover new insights behind the suboptimal calibration performance of existing methods relying on textual feature dispersion. Second, we show that imposing a simple orthogonalization of textual features is a more effective approach towards obtaining textual dispersion. We conduct extensive experiments on various datasets with different backbones and baselines. Results indicate that our method consistently outperforms the state-of-the-art in significantly reducing the overall average calibration error. Also, our method surpasses the zero-shot calibration performance on fine-grained classification tasks. Our code will be made public upon acceptance.
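A minimal sketch of an orthogonality penalty on class text features, the kind of constraint the abstract describes; the feature shapes and how the penalty is weighted against the base tuning objective are assumptions.

```python
import torch
import torch.nn.functional as F

def orthogonality_penalty(text_feats):
    """Penalize off-diagonal cosine similarity among class text features.

    text_feats: (C, D) text features produced from the learnable prompts.
    Drives the Gram matrix of normalized features toward the identity.
    """
    feats = F.normalize(text_feats, dim=-1)
    gram = feats @ feats.T
    eye = torch.eye(gram.size(0), device=gram.device)
    return ((gram - eye) ** 2).sum() / (gram.size(0) * (gram.size(0) - 1))

# Hypothetical usage inside a test-time prompt-tuning step:
text_feats = torch.randn(10, 512, requires_grad=True)  # would come from the text encoder + prompts
loss = orthogonality_penalty(text_feats)                # added to the base tuning objective
loss.backward()
```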
Poster
Yicheng Chen · Xiangtai Li · Yining Li · Yanhong Zeng · Jianzong Wu · Xiangyu Zhao · Kai Chen

[ ExHall D ]

Abstract
Diffusion models can generate realistic and diverse images, potentially facilitating data availability for data-intensive perception tasks. However, leveraging these models to boost performance on downstream tasks with synthetic data poses several challenges, including aligning with the real data distribution, scaling synthetic sample volumes, and ensuring their quality. To bridge these gaps, we present Auto Cherry-Picker (ACP), a novel framework that generates high-quality cross-modality training samples at scale to augment perception and multi-modal training. ACP first uses LLMs to sample descriptions and layouts based on object combinations from real data priors, eliminating the need for ground truth image captions or annotations. Next, we use an off-the-shelf controllable diffusion model to generate multiple images. Then, the generated data are refined using a comprehensively designed metric, Composite Layout and Image Score (CLIS), to ensure quality. Our customized synthetic high-quality samples boost performance in various scenarios, especially in addressing challenges associated with long-tailed distribution and imbalanced datasets. Experiment results on downstream tasks demonstrate that ACP can significantly improve the performance of existing models. In addition, we find a positive correlation between CLIS and performance gains in downstream tasks. This finding shows the potential of evaluation metrics to serve a broader role across various visual perception and MLLM tasks. …
Poster
Amir Bar · Gaoyue Zhou · Danny Tran · Trevor Darrell · Yann LeCun

[ ExHall D ]

Abstract
Navigation is a fundamental skill of agents with visual-motor capabilities. We propose a Navigation World Model (NWM), a controllable video generation model that predicts the future visual observation given the past observations and navigation actions. NWM is a Conditional Diffusion Transformer (CDiT) trained on the video footage of robots as well as unlabeled egocentric video data. We scale the model up to 1B parameters and train it over human and robot agent data from numerous environments and embodiments. Our model scales favorably on known and unknown environments and can leverage unlabeled egocentric video data. NWM exhibits improved navigation planning skills either by planning from scratch or by ranking proposals from an external navigation policy. Compared to existing supervised navigation models, which are "hard-coded", NWM can incorporate new constraints when planning trajectories. NWM learns visual priors that enable it to imagine navigation trajectories based on just a single input image.
Poster
Bikang Pan · Qun Li · Xiaoying Tang · Wei Huang · Zhen Fang · Feng Liu · Jingya Wang · Jingyi Yu · Ye Shi

[ ExHall D ]

Abstract
The emergence of vision-language foundation models, such as CLIP, has revolutionized image-text representation, enabling a broad range of applications via prompt learning. Despite its promise, real-world datasets often contain noisy labels that can degrade prompt learning performance. In this paper, we demonstrate that using mean absolute error (MAE) loss in prompt learning, named PromptMAE, significantly enhances robustness against noisy labels while maintaining high accuracy. Though MAE is straightforward and recognized for its robustness, it is rarely used in noisy-label learning due to its slow convergence and poor performance outside prompt learning scenarios. To elucidate the robustness of PromptMAE, we leverage feature learning theory to show that MAE can suppress the influence of noisy samples, thereby improving the signal-to-noise ratio and enhancing overall robustness. Additionally, we introduce PromptOT, a prompt-based optimal transport data purification method to enhance the robustness further. PromptOT employs text encoder representations in vision-language models as prototypes to construct an optimal transportation matrix. This matrix effectively partitions datasets into clean and noisy subsets, allowing for the application of cross-entropy loss to the clean subset and MAE loss to the noisy subset. Our Noise-Label Prompt Learning method, named NLPrompt, offers a simple and efficient approach that leverages the expressive …
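A minimal sketch of the MAE loss on softmax probabilities discussed above; the shapes are hypothetical, and in the full method this loss would be applied to the noisy subset identified by PromptOT while cross-entropy handles the clean subset.

```python
import torch
import torch.nn.functional as F

def mae_loss(logits, targets):
    """Mean absolute error between predicted probabilities and one-hot labels.

    Bounded per sample, so mislabeled examples cannot dominate the gradient
    the way they can under cross-entropy.
    """
    probs = F.softmax(logits, dim=-1)
    one_hot = F.one_hot(targets, num_classes=logits.size(-1)).float()
    return (probs - one_hot).abs().sum(dim=-1).mean()

logits = torch.randn(16, 10, requires_grad=True)   # hypothetical prompt-learning logits
targets = torch.randint(0, 10, (16,))              # possibly noisy labels
mae_loss(logits, targets).backward()
```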
Poster
Long Tung Vuong · Hoang Phan · Vy Vo · Anh Tuan Bui · Thanh-Toan Do · Trung Le · Dinh Phung

[ ExHall D ]

Abstract
Recent approaches leveraging multi-modal pre-trained models like CLIP for Unsupervised Domain Adaptation (UDA) have shown significant promise in bridging domain gaps and improving generalization by utilizing rich semantic knowledge and robust visual representations learned through extensive pre-training on diverse image-text datasets. While these methods achieve state-of-the-art performance across benchmarks, much of the improvement stems from base pseudo-labels (CLIP zero-shot predictions) and self-training mechanisms. Thus, the training mechanism exhibits a key limitation wherein the visual embedding distribution in target domains deviates from the visual embedding distribution of the pre-trained model, leading to misguided signals from class descriptions. This work introduces a fresh solution that reinforces these pseudo-labels and facilitates target-prompt learning by exploiting the geometry of visual and text embeddings - an aspect that is overlooked by existing methods. We first propose to directly leverage the reference predictions (from source prompts) based on the relationship between source and target visual embeddings. We later show that there is a strong clustering behavior observed between visual and text embeddings in pre-trained multi-modal models. Building on optimal transport theory, we then transform this insight into a novel strategy to enforce the clustering property in text embeddings. Our experiments and ablation studies validate the effectiveness of the proposed …
Poster
Tianyu Yu · Haoye Zhang · Qiming Li · Qixin Xu · Yuan Yao · Da Chen · Xiaoman Lu · Ganqu Cui · Yunkai Dang · Taiwen He · Xiaocheng Feng · Jun Song · Bo Zheng · Zhiyuan Liu · Tat-seng Chua · Maosong Sun

[ ExHall D ]

Abstract
Traditional feedback learning for hallucination reduction relies on labor-intensive manual labeling or expensive proprietary models. This leaves the community without foundational knowledge about how to build high-quality feedback with open-source MLLMs. In this work, we introduce RLAIF-V, a novel framework that aligns MLLMs in a fully open-source paradigm. RLAIF-V maximally explores open-source MLLMs from two perspectives, including high-quality feedback data generation for preference learning and self-feedback guidance for inference-time scaling. Extensive experiments on seven benchmarks in both automatic and human evaluation show that RLAIF-V substantially enhances the trustworthiness of models at both preference learning and inference time. RLAIF-V 7B reduces object hallucination by 80.7\% and overall hallucination by 33.7\%. Remarkably, RLAIF-V 12B further reveals the self-alignment potential of open-source MLLMs, where the model can learn from its own feedback to achieve super-GPT-4V trustworthiness.
Poster
Jiahao Xie · Alessio Tonioni · Nathalie Rauschmayr · Federico Tombari · Bernt Schiele

[ ExHall D ]

Abstract
Visual in-context learning (ICL), as a new paradigm in computer vision, allows the model to rapidly adapt to various tasks with only a handful of prompts and examples. While effective, the existing visual ICL paradigm exhibits poor generalizability under distribution shifts. In this work, we propose test-time Visual In-Context Tuning (VICT), a method that can learn adaptive visual ICL models on the fly with a single test sample. Specifically, we flip the roles between task prompts and the test sample and use a cycle consistency loss to reconstruct the original task prompt output. Our key insight is that a model should be aware of a new test distribution if it can successfully recover the original task prompts. Extensive experiments on six representative vision tasks with 15 corruptions demonstrate that our VICT can improve the generalizability of visual ICL to unseen new domains. Code will be released to facilitate future research.
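A minimal sketch of the role-flipping cycle-consistency idea, using a toy vector-to-vector in-context model as a stand-in for an image-to-image visual ICL model; the architecture, loss, and number of test-time steps are assumptions.

```python
import torch
import torch.nn.functional as F

# Stand-in for a visual in-context model: given (prompt input, prompt output, query input),
# it predicts the query output. Real VICL models are image-to-image; vectors suffice for a sketch.
class TinyICLModel(torch.nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.net = torch.nn.Sequential(torch.nn.Linear(3 * dim, dim), torch.nn.ReLU(),
                                       torch.nn.Linear(dim, dim))

    def forward(self, prompt_in, prompt_out, query_in):
        return self.net(torch.cat([prompt_in, prompt_out, query_in], dim=-1))

def cycle_consistency_step(model, opt, prompt_in, prompt_out, test_in):
    """One test-time update: predict the test output, then flip roles and try to
    reconstruct the original prompt output from that prediction."""
    test_out_pred = model(prompt_in, prompt_out, test_in)
    prompt_out_rec = model(test_in, test_out_pred, prompt_in)   # roles flipped
    loss = F.mse_loss(prompt_out_rec, prompt_out)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

model = TinyICLModel()
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
prompt_in, prompt_out, test_in = torch.randn(1, 64), torch.randn(1, 64), torch.randn(1, 64)
for _ in range(5):
    cycle_consistency_step(model, opt, prompt_in, prompt_out, test_in)
```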
Poster
Pramit Saha · Felix Wagner · Divyanshu Mishra · Can Peng · Anshul Thakur · David A. Clifton · Konstantinos Kamnitsas · Alison Noble

[ ExHall D ]

Abstract
Effective training of large Vision-Language Models (VLMs) on resource-constrained client devices in Federated Learning (FL) requires the usage of parameter-efficient fine-tuning (PEFT) strategies. To this end, we demonstrate the impact of two factors, viz., a client-specific layer importance score that selects the most important VLM layers for fine-tuning and an inter-client layer diversity score that encourages diverse layer selection across clients for optimal VLM layer selection. We first theoretically motivate and leverage the principal eigenvalue magnitude of layerwise Neural Tangent Kernels and show its effectiveness as a client-specific layer importance score. Next, we propose a novel layer updating strategy dubbed **F3OCUS** that jointly optimizes the layer importance and diversity factors by employing a data-free, multi-objective, meta-heuristic optimization on the server. We explore 5 different meta-heuristic algorithms and compare their effectiveness for selecting model layers and adapter layers towards PEFT-FL. Furthermore, we release a new MedVQA-FL dataset involving overall 707,962 VQA triplets and 9 modality-specific clients and utilize it to train and evaluate our method. Overall, we conduct more than 10,000 client-level experiments on 6 Vision-Language FL task settings involving 58 medical image datasets and 4 different VLM architectures of varying sizes to demonstrate the effectiveness of the proposed method.
Poster
Arne Grobrügge · Niklas Kühl · Gerhard Satzger · Philipp Spitzer

[ ExHall D ]

Abstract
Concept-based eXplainable AI (C-XAI) aims to overcome the limitations of traditional saliency maps by converting pixels into human-understandable concepts that are consistent across an entire dataset. A crucial aspect of C-XAI is completeness, which measures how well a set of concepts explains a model's decisions. Among C-XAI methods, Multi-Dimensional Concept Discovery (MCD) effectively improves completeness by breaking down the CNN latent space into distinct and interpretable concept subspaces. However, MCD's explanations can be difficult for humans to understand, raising concerns about their practical utility. To address this, we propose Human-Understandable Multi-dimensional Concept Discovery (HU-MCD). HU-MCD uses the Segment Anything Model for concept identification and implements a CNN-specific input masking technique to reduce noise introduced by traditional masking methods. These changes to MCD paired with the completeness relation enable HU-MCD to enhance concept understandability while maintaining explanation faithfulness. Our experiments, including human subject studies, show that HU-MCD provides more precise and reliable explanations than existing C-XAI methods. Code will be available for research purposes upon acceptance.
Poster
Jinhong Lin · Cheng-En Wu · Huanran Li · Jifan Zhang · Yu Hen Hu · Pedro Morgado

[ ExHall D ]

Abstract
Masked Image Modeling (MIM) has emerged as a powerful self-supervised learning paradigm for visual representation learning, enabling models to acquire rich visual representations by predicting masked portions of images from their visible regions. While this approach has shown promising results, we hypothesize that its effectiveness may be limited by optimization challenges during early training stages, where models are expected to learn complex image distributions from partial observations before developing basic visual processing capabilities. To address this limitation, we propose a prototype-driven curriculum learning framework that structures the learning process to progress from prototypical examples to more complex variations in the dataset. Our approach introduces a temperature-based annealing scheme that gradually expands the training distribution, enabling more stable and efficient learning trajectories. Through extensive experiments on ImageNet-1K, we demonstrate that our curriculum learning strategy significantly improves both training efficiency and representation quality while requiring substantially fewer training epochs compared to standard Masked Auto-Encoding. Our findings suggest that carefully controlling the order of training examples plays a crucial role in self-supervised visual learning, providing a practical solution to the early-stage optimization challenges in MIM.
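A minimal sketch of one way a temperature-annealed, prototype-driven sampling curriculum could look; the prototype definition, distance measure, and schedule below are assumptions, not the paper's exact design.

```python
import numpy as np

def curriculum_weights(feats, prototypes, temperature):
    """Sampling weights that favor prototype-like examples at low temperature.
    Annealing `temperature` upward flattens the weights toward uniform, gradually
    admitting harder, less prototypical examples into the MIM training distribution."""
    dists = np.linalg.norm(feats[:, None, :] - prototypes[None, :, :], axis=-1).min(axis=1)
    w = np.exp(-dists / max(temperature, 1e-8))
    return w / w.sum()

rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 32))        # hypothetical image features
prototypes = rng.normal(size=(8, 32))      # hypothetical prototype centers
for t in (0.1, 1.0, 10.0):                 # an assumed annealing schedule
    batch = rng.choice(len(feats), size=64, p=curriculum_weights(feats, prototypes, t))
```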
Poster
Yancheng Cai · Fei Yin · Dounia Hammou · Rafal Mantiuk

[ ExHall D ]

Abstract
Computer vision foundation models, such as DINO or OpenCLIP, are trained in a self-supervised manner on large image datasets. Analogously, substantial evidence suggests that the human visual system (HVS) is influenced by the statistical distribution of colors and patterns in the natural world, characteristics also present in the training data of foundation models. The question we address in this paper is whether foundation models trained on natural images mimic some of the low-level characteristics of the human visual system, such as contrast detection, contrast masking, and contrast constancy. Specifically, we designed a protocol comprising nine test types to evaluate the image encoders of 45 foundation and generative models. Our results indicate that some foundation models (e.g., DINO, DINOv2, and OpenCLIP), share some of the characteristics of human vision, but other models show little resemblance. Foundation models tend to show smaller sensitivity to low contrast and rather irregular responses to contrast across frequencies. The foundation models show the best agreement with human data in terms of contrast masking. Our findings suggest that human vision and computer vision may take both similar and different paths when learning to interpret images of the real world. Overall, while differences remain, foundation models trained on …
Poster
Duolikun Danier · Mehmet Aygun · Changjian Li · Hakan Bilen · Oisin Mac Aodha

[ ExHall D ]

Abstract
Large-scale pre-trained vision models are becoming increasingly prevalent, offering expressive and generalizable visual representations that benefit various downstream tasks. Recent studies on the emergent properties of these models have revealed their high-level geometric understanding, in particular in the context of depth perception. However, it remains unclear how depth perception arises in these models without explicit depth supervision provided during pre-training. To investigate this, we examine whether the monocular depth cues, similar to those used by the human visual system, emerge in these models. We introduce a new benchmark, DepthCues, designed to evaluate depth cue understanding, and present findings across 20 diverse and representative pre-trained vision models. Our analysis shows that human-like depth cues emerge in more recent larger models. We also explore enhancing depth perception in large vision models by fine-tuning on DepthCues, and find that even without dense depth supervision, this improves depth estimation. To support further research, our benchmark and evaluation code will be made publicly available for studying depth perception in vision models.
Poster
Zhaoqing Wang · Xiaobo Xia · Runnan Chen · Dongdong Yu · Changhu Wang · Mingming Gong · Tongliang Liu

[ ExHall D ]

Abstract
This paper presents the Large Vision Diffusion Transformer (LaVin-DiT), a scalable and unified foundation model designed to tackle over 20 computer vision tasks in a generative framework. Unlike existing large vision models directly adapted from natural language processing architectures, which rely on less efficient autoregressive techniques and disrupt spatial relationships essential for vision data, LaVin-DiT introduces key innovations to optimize generative performance for vision tasks. First, to address the high dimensionality of visual data, we incorporate a spatial-temporal variational autoencoder that encodes data into a continuous latent space. Second, for generative modeling, we develop a joint diffusion transformer that progressively produces vision outputs. Third, for unified multi-task training, in-context learning is implemented. Input-target pairs serve as task context, which guides the diffusion transformer to align outputs with specific tasks within the latent space. During inference, a task-specific context set and test data as queries allow LaVin-DiT to generalize across tasks without fine-tuning. Trained on extensive vision datasets, the model is scaled from 0.1B to 3.4B parameters, demonstrating substantial scalability and state-of-the-art performance across diverse vision tasks. This work introduces a novel pathway for large vision foundation models, underscoring the promising potential of diffusion transformers. The code and models will be …
Poster
Dongshuo Yin · Leiyi Hu · Bin Li · Youqun Zhang · Xue Yang

[ ExHall D ]

Abstract
Pre-training & fine-tuning can enhance transfer efficiency and performance on visual tasks. Recent delta-tuning methods provide more options for visual classification tasks. Despite their success, existing visual delta-tuning methods fail to exceed the upper limit of full fine-tuning on challenging tasks like object detection and segmentation. To find a competitive alternative to full fine-tuning, we propose the Multi-cognitive Visual Adapter (Mona) tuning, a novel adapter-based tuning method. First, we introduce multiple vision-friendly filters into the adapter to enhance its ability for processing visual signals, while previous methods mainly rely on language-friendly linear filters. Second, we add the scaled layernorm in the adapter to regulate the distribution of input features for visual filters. To fully demonstrate the practicality and generality of Mona, we conduct experiments on multiple representative visual tasks, including instance segmentation on COCO, semantic segmentation on ADE20K, object detection on Pascal VOC, oriented object detection on DOTA/STAR, and image classification on three common datasets. Exciting results illustrate that Mona surpasses full fine-tuning on all these tasks by tuning less than 5% of the backbone parameters, and is the only delta-tuning method outperforming full fine-tuning on all tasks. For example, Mona achieves 1% performance gain on the COCO dataset …
Poster
Uranik Berisha · Jens Mehnert · Alexandru Paul Condurache

[ ExHall D ]

Abstract
Vision Transformers (ViTs) have emerged as the state-of-the-art models in various Computer Vision (CV) tasks, but their high computational and resource demands pose significant challenges. While Mixture of Experts (MoE) can make these models more efficient, they often require costly retraining or even training from scratch. Recent developments aim to reduce these computational costs by leveraging pretrained networks. These have been shown to produce sparse activation patterns in the Multi-Layer Perceptrons (MLPs) of the encoder blocks, allowing for conditional activation of only relevant subnetworks for each sample. Building on this idea, we propose a new method to construct MoE variants from pretrained models. Our approach extracts expert subnetworks from the model’s MLP layers post-training in two phases. First, we cluster output activations to identify distinct activation patterns. In the second phase, we use these clusters to extract the corresponding subnetworks responsible for producing them. On ImageNet-1k recognition tasks, we demonstrate that these extracted experts can perform surprisingly well out of the box and require only minimal fine-tuning to regain 98% of the original performance, all while reducing FLOPs and model size by up to 36% and 32%, respectively.
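A minimal sketch of the two-phase extraction described above, assuming k-means clustering of hidden activations and a simple top-activation rule for carving out each expert's neurons; both choices are illustrative, not the paper's exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def extract_expert_masks(mlp_activations, num_experts, keep_ratio=0.5):
    """Cluster samples by their hidden MLP activation pattern, then keep the neurons
    most active within each cluster as that cluster's expert subnetwork.

    mlp_activations: (N, H) hidden activations of one encoder block, collected post-training.
    Returns a boolean (num_experts, H) mask selecting the neurons of each expert."""
    labels = KMeans(n_clusters=num_experts, n_init=10, random_state=0).fit_predict(mlp_activations)
    hidden = mlp_activations.shape[1]
    masks = np.zeros((num_experts, hidden), dtype=bool)
    for e in range(num_experts):
        mean_act = np.abs(mlp_activations[labels == e]).mean(axis=0)
        masks[e, np.argsort(mean_act)[-int(keep_ratio * hidden):]] = True
    return masks

acts = np.random.default_rng(0).normal(size=(2048, 768))      # hypothetical activations
print(extract_expert_masks(acts, num_experts=4).sum(axis=1))  # neurons kept per expert
```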
Poster
Lixu Wang · Bingqi Shang · Yi Li · Payal Mohapatra · Wei Dong · Xiao Wang · Qi Zhu

[ ExHall D ]

Abstract
Vision Transformers (ViTs), extensively pre-trained on large-scale datasets, have become essential to foundation models, allowing excellent performance on diverse downstream tasks with minimal adaptation. Consequently, there is growing interest in adapting pre-trained ViTs across various fields, including privacy-sensitive domains where clients are often reluctant to share their data. Existing adaptation methods typically require direct data access, rendering them infeasible under these constraints. A straightforward solution may be sending the pre-trained ViT to clients for local adaptation, which poses issues of model intellectual property protection and incurs heavy client computation overhead. To address these issues, we propose a novel split adaptation (SA) method that enables effective downstream adaptation while protecting data and models. SA, inspired by split learning (SL), segments the pre-trained ViT into a frontend and a backend, with only the frontend shared with the client for data representation extraction. But unlike regular SL, SA replaces frontend parameters with low-bit quantized values, preventing direct exposure of the model. SA allows the client to add bi-level noise to the frontend and the extracted data representations, ensuring data protection. Accordingly, SA incorporates data-level and model-level out-of-distribution enhancements to mitigate noise injection's impact on adaptation performance. Our SA focuses on the challenging few-shot …
Poster
Jialai Wang · Yuxiao Wu · Weiye Xu · Yating Huang · Chao Zhang · Zongpeng Li · Mingwei Xu · Zhenkai Liang

[ ExHall D ]

Abstract
Vision Transformers (ViTs) have experienced significant progress and are quantized for deployment in resource-constrained applications. Quantized models are vulnerable to targeted bit-flip attacks (BFAs). A targeted BFA prepares a trigger and a corresponding Trojan/backdoor, inserting the latter (with RowHammer bit flipping) into a victim model, to mislead its classification on samples containing the trigger. Existing targeted BFAs on quantized ViTs are limited in that: (1) they require numerous bit-flips, and (2) the separation between flipped bits is below 4 KB, making attacks infeasible with RowHammer in real-world scenarios. We propose a new and practical targeted attack Flip-S against quantized ViTs. The core insight is that in quantized models, a scale factor change ripples through a batch of model weights. Consequently, flipping bits in scale factors, rather than solely in model weights, enables more cost-effective attacks. We design a Scale-Factor-Search (SFS) algorithm to identify critical bits in scale factors for flipping, and adopt a mutual exclusion strategy to guarantee a 4 KB separation between flips. We evaluate Flip-S on CIFAR-10 and ImageNet datasets across five ViT architectures and two quantization levels. Results show that Flip-S achieves attack success rate (ASR) exceeding 90.0\% on all models with 50 bits flipped, outperforming baselines …
Poster
Xinglong Sun · Barath Lakshmanan · Maying Shen · Shiyi Lan · Jingde Chen · Jose M. Alvarez

[ ExHall D ]

Abstract
Current structural pruning methods face two significant limitations: (i) they often limit pruning to finer-grained levels like channels, making aggressive parameter reduction challenging, and (ii) they focus heavily on parameter and FLOP reduction, with existing latency-aware methods frequently relying on simplistic, suboptimal linear models that fail to generalize well to transformers, where multiple interacting dimensions impact latency. In this paper, we address both limitations by introducing Multi-Dimensional Pruning(MDP), a novel paradigm that jointly optimizes across a variety of pruning granularities—including channels, query/key, heads, embeddings, and blocks. MDP employs an advanced latency modeling technique to accurately capture latency variations across all prunable dimensions, achieving an optimal balance between latency and accuracy. By reformulating pruning as a Mixed-Integer Nonlinear Program (MINLP), MDP efficiently identifies the optimal pruned structure across all prunable dimensions while respecting latency constraints. This versatile framework supports both CNNs and transformers. Extensive experiments demonstrate that MDP significantly outperforms previous methods, especially at high pruning ratios. On ImageNet, MDP achieves a 28\% speed increase with a +1.4 Top-1 accuracy improvement over prior work like HALP for ResNet50 pruning. Against the latest transformer pruning method, Isomorphic, MDP delivers an additional 37\% acceleration with a +0.7 Top-1 accuracy improvement.
Poster
Fei Xie · Jiahao Nie · Yujin Tang · Wenkang Zhang · Hongshen Zhao

[ ExHall D ]

Abstract
Recent State Space Models (SSM), especially Mamba, have demonstrated impressive performance in visual modeling and possess superior model efficiency. However, the application of Mamba to visual tasks suffers from inferior performance due to three main constraints existing in the sequential model: 1) Causal computing is incapable of accessing global context; 2) Long-range forgetting when computing the current hidden states; 3) Weak spatial structural modeling due to the transformed sequential input. To address these issues, we investigate a simple yet powerful vision task adapter for Mamba models, which consists of two functional modules: Adaptor-T and Adaptor-S. When solving the hidden states for SSM, we apply a causal prediction module, Adaptor-T, to select a set of learnable locations as memory augmentation feature states to ease long-range forgetting issues. Moreover, we leverage Adaptor-S, composed of multi-scale dilated convolutional kernels, to enhance the spatial modeling and introduce the image inductive bias into the feature output. Both modules can enlarge the context modeling in causal computing, as the output is enhanced by the inaccessible features. We explore three usages of Mamba-Adaptor: A general visual backbone for various vision tasks; A booster module to raise the performance of pretrained backbones; A highly efficient fine-tuning module that …
Poster
Yuan Zhou · Qingshan Xu · Jiequan Cui · Junbao Zhou · Jing Zhang · Richang Hong · Hanwang Zhang

[ ExHall D ]

Abstract
Recently, large efforts have been made to design efficient linear-complexity visual Transformers. However, current linear attention models are generally unsuitable for deployment on resource-constrained mobile devices, as they suffer from either few efficiency gains or significant accuracy drops. In this paper, we propose a new deCoupled duAl-interactive lineaR attEntion (CARE) mechanism, revealing that features' decoupling and interaction can fully unleash the power of linear attention. We first propose an asymmetrical feature decoupling strategy that asymmetrically decouples the learning process for local inductive bias and long-range dependencies, thereby preserving sufficient local and global information while effectively enhancing the efficiency of models. Then, a dynamic memory unit is employed to maintain critical information along the network pipeline. Moreover, we design a dual interaction module to effectively facilitate interaction between local inductive bias and long-range information as well as among features at different layers. By adopting a decoupled learning scheme and fully exploiting complementarity across features, our method can achieve both high efficiency and accuracy. Extensive experiments on ImageNet-1K, COCO, and ADE20K datasets demonstrate the effectiveness of our approach, e.g., achieving 78.4/82.1% top-1 accuracy on ImageNet-1K at the cost of only 0.7/1.9 GMACs. Codes will be released on github.
Poster
Zichen Miao · WEI CHEN · Qiang Qiu

[ ExHall D ]

Abstract
Transformer-based large pre-trained models have shown remarkable generalization ability, and various parameter-efficient fine-tuning (PEFT) methods have been proposed to customize these models on downstream tasks with minimal computational and memory budgets. Previous PEFT methods are primarily designed from a tensor-decomposition perspective that tries to effectively tune the linear transformation by finding the smallest subset of parameters to train. Our study adopts an orthogonal view by representing the attention operation as a graph convolution and formulating the multi-head attention maps as a convolutional filter subspace, with each attention map as a subspace element. In this paper, we propose to tune the large pre-trained transformers by learning a small set of combination coefficients that construct a more expressive filter subspace from the original multi-head attention maps. We show analytically and experimentally that the tuned filter subspace can effectively expand the feature space of the multi-head attention and further enhance the capacity of transformers. We further stabilize the fine-tuning with a residual parameterization of the tunable subspace coefficients, and enhance the generalization with a regularization design by directly applying dropout on the tunable coefficient during training. The tunable coefficients take a tiny number of parameters and can be combined with previous PEFT methods …
Poster
Caoshuo Li · Tanzhe Li · Xiaobin Hu · Donghao Luo · Taisong Jin

[ ExHall D ]

Abstract
Recently, Vision Graph Neural Network (ViG) has gained considerable attention in computer vision. Despite its groundbreaking innovation, Vision Graph Neural Network encounters key issues including the quadratic computational complexity caused by its K-Nearest Neighbor (KNN) graph construction and the limitation of pairwise relations of normal graphs. To address the aforementioned challenges, we propose a novel vision architecture, termed **D**ilated **V**ision **H**yper**G**raph **N**eural **N**etwork (DVHGNN), which is designed to leverage multi-scale hypergraph to *efficiently* capture high-order correlations among objects. Specifically, the proposed method tailors Clustering and **D**ilated **H**yper**G**raph **C**onstruction (DHGC) to adaptively capture multi-scale dependencies among the data samples. Furthermore, a dynamic hypergraph convolution mechanism is proposed to facilitate adaptive feature exchange and fusion at the hypergraph level. Extensive qualitative and quantitative evaluations of the benchmark image datasets demonstrate that the proposed DVHGNN significantly outperforms the state-of-the-art vision backbones. For instance, our DVHGNN-S achieves an impressive top-1 accuracy of **83.1\%** on ImageNet-1K, surpassing ViG-S by **+1.0** and ViHGNN-S by **+0.6**.
Poster
Ruiheng Liu · Haozhe Chen · Boyao Zhao · Kejiang Chen · Weiming Zhang

[ ExHall D ]

Abstract
The advancement of AI technology has significantly influenced production activities, increasing the focus on copyright protection for AI models. Model perceptual hashing offers an efficient solution for retrieving pirated models. Existing methods, such as handcrafted feature-based and dual-branch network-based perceptual hashing, have proven effective in detecting pirated models. However, these approaches often struggle to differentiate non-pirated models, leading to frequent false positives in model authentication and protection. To address this challenge, this paper proposes a structurally-aware perceptual model hashing technique that achieves reduced false positives while maintaining high true positive rates. Specifically, we introduce a method for converting diverse neural network structures into graph structures suitable for DNN processing, then utilize a graph neural network to learn their structural feature representations. Our approach integrates perceptual parameter-based model hashing, achieving robust performance with higher detection accuracy and fewer false positives. Experimental results show that the proposed method has only a 3\% false positive rate when detecting non-pirated models, and the detection accuracy for pirated models reaches more than 98\%.
Poster
Yang Liu · Tianwei Zhang · Shi Gu

[ ExHall D ]

Abstract
Concept Bottleneck Models (CBMs) provide an interpretable framework for neural networks by mapping visual features to predefined, human-understandable concepts. However, the application of CBMs is often constrained by insufficient concept annotations. Recently, multi-modal pre-trained models have shown promise in reducing annotation costs by aligning visual representations with textual concept embeddings. Nevertheless, the quality and completeness of the predefined concepts significantly affect the performance of CBMs. In this work, we propose Hybrid Concept Bottleneck Model (HybridCBM), a novel CBM framework to address the challenge of incomplete predefined concepts. Our method consists of two main components: a Static Concept Bank and a Dynamic Concept Bank. The Static Concept Bank directly leverages large language models (LLMs) for concept construction, while the Dynamic Concept Bank employs learnable vectors to capture complementary and valuable concepts continuously during training. After training, a pre-trained translator converts these vectors into human-understandable concepts, further enhancing model interpretability. Notably, HybridCBM is highly flexible and can be easily applied to any CBM to improve performance. Experimental results across multiple datasets demonstrate that HybridCBM outperforms current state-of-the-art methods and achieves comparable results to black-box models. Additionally, we propose novel metrics to evaluate the quality of the learned concepts, showing that they perform comparably …
Poster
Sanghyun Kim · Deunsol Jung · Minsu Cho

[ ExHall D ]

Abstract
Recent methods for zero-shot Human-Object Interaction (HOI) detection typically leverage the generalization ability of a large Vision-Language Model (VLM), i.e., CLIP, on unseen categories, showing impressive results on various zero-shot settings. However, existing methods struggle to adapt CLIP representations for human-object pairs, as CLIP tends to overlook fine-grained information necessary for distinguishing interactions. To address this issue, we devise LAIN, a novel zero-shot HOI detection framework that enhances the locality and interaction awareness of CLIP representations. The locality awareness, which involves capturing fine-grained details and the spatial structure of individual objects, is achieved by aggregating the information and spatial priors of adjacent neighborhood patches. The interaction awareness, which involves identifying whether and how a human is interacting with an object, is achieved by capturing the interaction pattern between the human and the object. By infusing locality and interaction awareness into CLIP representations, LAIN captures detailed information about the human-object pairs. Our extensive experiments on existing benchmarks show that LAIN outperforms previous methods on various zero-shot settings, demonstrating the importance of locality and interaction awareness for effective zero-shot HOI detection.
Poster
Dianmo Sheng · Dongdong Chen · Zhentao Tan · Qiankun Liu · Qi Chu · Tao Gong · Bin Liu · Jing Han · Wenbin Tu · Shengwei Xu · Nenghai Yu

[ ExHall D ]

Abstract
Recent advancements in in-context segmentation generalists have demonstrated significant success in performing various image segmentation tasks using a limited number of labeled example images. However, real-world applications present challenges due to the variability of support examples, which often exhibit quality issues resulting from various sources and inaccurate labeling. How to extract more robust representations from these examples has always been one of the goals of in-context visual learning. In response, we propose UNICL-SAM, to better model the example distribution and extract robust representations to help in-context segmentation. We incorporate an uncertainty probabilistic module to quantify each example's reliability during both the training and testing phases. Utilizing this uncertainty estimation, we introduce an uncertainty-guided graph augmentation and feature refinement strategy, aimed at mitigating the impact of high-uncertainty regions to enhance the learning of robust representations. Subsequently, we construct prototypes for each example by aggregating part information, thereby creating reliable in-context instruction that effectively represents fine-grained local semantics. This approach serves as a valuable complement to traditional global pooling features. Experimental results demonstrate the effectiveness of the proposed framework, underscoring its potential for real-world applications.
Poster
ZhengYang Wang · Tingliang Feng · Fan Lyu · Fanhua Shang · Wei Feng · Liang Wan

[ ExHall D ]

Abstract
Open-vocabulary semantic segmentation aims to enable models to segment arbitrary categories. Currently, though pre-trained Vision-Language Models (VLMs) like CLIP have established a robust foundation for this task by learning to match text and image representations from large-scale data, their lack of pixel-level recognition necessitates further fine-tuning. Most existing methods leverage text as a guide to achieve pixel-level recognition. However, the inherent biases in text semantic descriptions and the lack of pixel-level supervisory information make it challenging to fine-tune CLIP-based models effectively. This paper considers leveraging image-text data to simultaneously capture the semantic information contained in both image and text, thereby constructing Dual Semantic Guidance and corresponding pixel-level pseudo annotations. Particularly, the visual semantic guidance is enhanced via explicitly exploring foreground regions and minimizing the influence of background. The dual semantic guidance is then jointly utilized to fine-tune CLIP-based segmentation models, achieving decent fine-grained recognition capabilities. As the comprehensive evaluation shows, our method outperforms state-of-the-art results by large margins on eight commonly used datasets with/without background.
Poster
Zhiwei Yang · Yucong Meng · Kexue Fu · feilong tang · Shuo Wang · Zhijian Song

[ ExHall D ]

Abstract
Weakly Supervised Semantic Segmentation (WSSS) with image-level labels aims to achieve pixel-level predictions using Class Activation Maps (CAMs). Recently, Contrastive Language-Image Pre-training (CLIP) has been introduced in WSSS. However, recent methods primarily focus on image-text alignment for CAM generation, while CLIP’s potential in patch-text alignment remains under-explored. In this work, we propose ExCEL to explore CLIP's dense knowledge via a novel patch-text alignment paradigm for WSSS. Specifically, we propose Text Semantic Enrichment (TSE) and Visual Calibration (VC) modules to improve the dense alignment across both text and vision modalities. To make text embeddings semantically informative, our TSE module applies Large Language Models (LLMs) to build a dataset-wide knowledge base and enriches the text representations with an implicit attribute-hunting process. To mine fine-grained knowledge from visual features, our VC module first proposes Static Visual Calibration (SVC) to propagate fine-grained knowledge in a non-parametric manner. Then Learnable Visual Calibration (LVC) is further proposed to dynamically shift the frozen features towards distributions with diverse semantics. With these enhancements, ExCEL not only retains CLIP’s training-free advantages but also significantly outperforms other state-of-the-art methods with much less training cost on PASCAL VOC and MS COCO. Codes will be available.
Poster
Chen Yi Lu · Kasra Derakhshandeh · Somali Chaterji

[ ExHall D ]

Abstract
Semi-supervised semantic segmentation with consistency regularization capitalizes on unlabeled images to enhance the accuracy of pixel-level segmentation. Current consistency learning methods primarily rely on the consistency loss between pseudo-labels and unlabeled images, neglecting the information within the feature representations of the backbone encoder. Preserving maximum information in feature embeddings requires achieving the alignment and uniformity objectives, as widely studied. To address this, we present SWSEG, a semi-supervised semantic segmentation algorithm that optimizes alignment and uniformity using the Sliced-Wasserstein Distance (SWD), and rigorously and empirically proves this connection. We further resolve the computational issues associated with conventional Monte Carlo-based SWD by implementing a Gaussian-approximated variant, which not only maintains the alignment and uniformity objectives but also improves training efficiency. We evaluate SWSEG on the PASCAL VOC 2012, Cityscapes, and ADE20K datasets, outshining supervised baselines in mIoU by up to 11.8%, 8.9%, and 8.2%, respectively, given an equivalent number of labeled samples. Further, SWSEG surpasses state-of-the-art methods in multiple settings across these three datasets. Our extensive ablation studies confirm the optimization of the uniformity and alignment objectives of the feature representations.
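A minimal sketch contrasting a Monte Carlo sliced Wasserstein distance with a Gaussian-approximated variant; the per-dimension Gaussian fit below is an illustrative stand-in for the paper's exact approximation.

```python
import torch
import torch.nn.functional as F

def sliced_wasserstein(x, y, n_proj=128):
    """Monte Carlo sliced Wasserstein-2 distance between two equally sized feature sets (N, D)."""
    theta = F.normalize(torch.randn(n_proj, x.size(1)), dim=1)   # random projection directions
    px = (x @ theta.T).sort(dim=0).values
    py = (y @ theta.T).sort(dim=0).values
    return ((px - py) ** 2).mean().sqrt()

def gaussian_sliced_wasserstein(x, y):
    """Closed-form W2 between per-dimension Gaussian fits, avoiding random projections."""
    mx, my, sx, sy = x.mean(dim=0), y.mean(dim=0), x.std(dim=0), y.std(dim=0)
    return ((mx - my) ** 2 + (sx - sy) ** 2).mean().sqrt()

x = torch.randn(256, 128)          # e.g., features of strongly augmented views
y = torch.randn(256, 128) + 0.5    # e.g., features of weakly augmented views
print(sliced_wasserstein(x, y), gaussian_sliced_wasserstein(x, y))
```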
Poster
Zhongwen Zhang · Yuri Boykov

[ ExHall D ]

Abstract
We consider weakly supervised segmentation where only a fraction of pixels have ground truth labels (scribbles) and focus on a self-labeling approach optimizing relaxations of the standard unsupervised CRF/Potts loss on unlabeled pixels. While WSSS methods can directly optimize such losses via gradient descent, prior work suggests that higher-order optimization can improve network training by introducing hidden pseudo-labels and powerful CRF sub-problem solvers, e.g. graph cut. However, previously used hard pseudo-labels can not represent class uncertainty or errors, which motivates soft self-labeling. We derive a principled auxiliary loss and systematically evaluate standard and new CRF relaxations (convex and non-convex), neighborhood systems, and terms connecting network predictions with soft pseudo-labels. We also propose a general continuous sub-problem solver. Using only standard architectures, soft self-labeling consistently improves scribble-based training and outperforms significantly more complex specialized WSSS systems. It can outperform full pixel-precise supervision. Our general ideas apply to other weakly-supervised problems/systems.
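A minimal sketch of a soft self-labeling objective combining scribble supervision, cross-entropy to soft pseudo-labels, and a quadratic Potts relaxation; the neighborhood system, edge weights, and loss weighting below are assumptions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def soft_potts_relaxation(probs):
    """Quadratic relaxation of the Potts term on a 4-connected grid: neighboring pixels are
    penalized when their class distributions disagree (1 - <p_i, p_j>); uniform edge weights."""
    right = 1.0 - (probs[:, :, :, :-1] * probs[:, :, :, 1:]).sum(dim=1)
    down = 1.0 - (probs[:, :, :-1, :] * probs[:, :, 1:, :]).sum(dim=1)
    return right.mean() + down.mean()

def soft_self_labeling_loss(logits, soft_pseudo, scribble_mask, scribble_labels, lam=0.5):
    probs = F.softmax(logits, dim=1)
    # Cross-entropy against soft pseudo-labels on unlabeled pixels.
    ce_soft = -(soft_pseudo * probs.clamp_min(1e-8).log()).sum(dim=1)[~scribble_mask].mean()
    # Standard cross-entropy on the sparse scribbled pixels.
    ce_scribble = F.cross_entropy(logits.permute(0, 2, 3, 1)[scribble_mask], scribble_labels)
    return ce_scribble + ce_soft + lam * soft_potts_relaxation(probs)

B, K, H, W = 2, 5, 32, 32
logits = torch.randn(B, K, H, W, requires_grad=True)        # network predictions
soft_pseudo = F.softmax(torch.randn(B, K, H, W), dim=1)      # hypothetical soft pseudo-labels
scribble_mask = torch.rand(B, H, W) < 0.05                   # sparse labeled pixels
scribble_labels = torch.randint(0, K, (int(scribble_mask.sum()),))
soft_self_labeling_loss(logits, soft_pseudo, scribble_mask, scribble_labels).backward()
```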
Poster
Zhaochen Liu · Limeng Qiao · Xiangxiang Chu · Lin Ma · Tingting Jiang

[ ExHall D ]

Abstract
Aiming to predict the complete shape of partially occluded objects, amodal segmentation is an important capacity towards visual intelligence. In order to promote the practicability, zero-shot foundation model competent for the open world gains growing attention in this field. Nevertheless, prior models exhibit deficiencies in efficiency and stability. To address this problem, utilizing the implicit prior knowledge, we propose the first SAM-based amodal segmentation foundation model, SAMBA. Methodologically, a novel framework with multilevel facilitation is designed to better adapt the task characteristics and unleash the potential capabilities of SAM. In the modality level, a separation-to-fusion structure is employed that jointly learns modal and amodal segmentation to enhance mutual coordination. In the instance level, to ease the complexity of amodal feature extraction, we introduce a principal focusing mechanism to indicate objects of interest. In the pixel level, mixture-of-experts is incorporated with a specialized distribution loss, by which distinct occlusion rates correspond to different experts to improve the accuracy. Experiments are conducted on several eminent datasets, and the results show that the performance of SAMBA is superior to existing zero-shot and even supervised approaches. Furthermore, our proposed model has notable advantages in terms of speed and size. The model and code will …
Poster
Dongkai Wang · Jiang Duan · Liangjian Wen · Shiyu Xuan · Hao CHEN · Shiliang Zhang

[ ExHall D ]

Abstract
Generalizable object keypoint localization is a fundamental computer vision task in understanding object structure. It is challenging for existing keypoint localization methods because their limited training data cannot provide generalizable shape and semantic cues, leading to inferior performance and generalization capability. Instead of relying on large-scale training data, this work tackles this challenge by exploiting the rich priors from large generative models. We propose a data-efficient generalizable localization method named GenLoc. GenLoc extracts generative priors from a pre-trained image generation model by calculating the correlation map between the image latent feature and the condition embedding. These priors are then optimized with our proposed heatmap expectation loss to perform object keypoint localization. Benefiting from the rich knowledge of generative priors in understanding object semantics and structures, GenLoc achieves superior performance on various object keypoint localization benchmarks. It shows more substantial performance enhancements in cross-domain, few-shot and zero-shot evaluation settings, e.g., achieving a 20\%+ AP improvement over CLAMP in various zero-shot settings.
Poster
Huajie Jiang · Zhengxian Li · Xiaohan Yu · Yongli Hu · Baocai Yin · Jian Yang · Yuankai Qi

[ ExHall D ]

Abstract
Generalized zero-shot learning aims to recognize both seen and unseen classes with the help of semantic information that is shared among different classes. It inevitably requires consistent visual-semantic alignment. Existing approaches fine-tune the visual backbone with seen-class data to obtain semantic-related visual features, which may cause overfitting on seen classes with a limited number of training images. This paper proposes a novel visual and semantic prompt collaboration framework, which utilizes prompt tuning techniques for efficient feature adaptation. Specifically, we design a visual prompt to integrate the visual information for discriminative feature learning and a semantic prompt to integrate the semantic information for visual-semantic alignment. To achieve effective prompt information integration, we further design a weak prompt fusion mechanism for the shallow layers and a strong prompt fusion mechanism for the deep layers in the network. Through the collaboration of visual and semantic prompts, we can obtain discriminative semantic-related features for generalized zero-shot image recognition. Extensive experiments demonstrate that our framework consistently achieves favorable performance in both conventional zero-shot learning and generalized zero-shot learning benchmarks compared to other state-of-the-art methods.
Poster
Libiao Chen · Dong Nie · Junjun Pan · Jing Yan · Zhenyu Tang

[ ExHall D ]

Abstract
Generalized Zero-Shot Learning (GZSL) addresses the challenge of classifying unseen classes in the presence of seen classes by leveraging semantic attributes to bridge the gap for unseen classes. However, in image-based disease classification, such as glioma sub-typing, distinguishing between classes using image semantic attributes can be challenging. To address this challenge, we introduce a novel GZSL method that eliminates the dependency on semantic information. Specifically, we observe that the primary goal of most classification tasks in the clinic is risk stratification, and that classes are inherently ordered rather than purely categorical. Based on this insight, we present an inter-class feature augmentation (IFA) module, where distributions of different classes are ordered by their risk levels in a learned feature space using a pre-defined conditional Gaussian distribution model. This ordering enables the generation of unseen class features through feature mixing of adjacent seen classes, effectively transforming the zero-shot learning problem into a supervised learning task. Our method eliminates the need for explicit semantic information, avoiding the potential domain shift between visual and semantic features. Moreover, the IFA module provides a simple yet effective solution for zero-shot classification, requiring no structural modifications to the existing classification models. In the experiment, both in-house and public datasets are used …
Poster
Enguang Wang · Zhimao Peng · Zhengyuan Xie · Fei Yang · Xialei Liu · Ming-Ming Cheng

[ ExHall D ]

Abstract
Given unlabelled datasets containing both old and new categories, generalized category discovery (GCD) aims to accurately discover new classes while correctly classifying old classes. Current GCD methods only use a single visual modality of information, resulting in poor classification of visually similar classes. As a different modality, text information can provide complementary discriminative information, which motivates us to introduce it into the GCD task. However, the lack of class names for unlabelled data makes it impractical to utilize text information. To tackle this challenging problem, in this paper, we propose a Text Embedding Synthesizer (TES) to generate pseudo text embeddings for unlabelled samples. Specifically, our TES leverages the property that CLIP can generate aligned vision-language features, converting visual embeddings into tokens of the CLIP’s text encoder to generate pseudo text embeddings. Besides, we employ a dual-branch framework: through the joint learning and instance consistency of different modality branches, visual and semantic information mutually enhance each other, promoting the interaction and fusion of visual and text knowledge. Our method unlocks the multi-modal potentials of CLIP and outperforms the baseline methods by a large margin on all GCD benchmarks, achieving new state-of-the-art.
Poster
Chang-Bin Zhang · Jinhong Ni · Yujie Zhong · Kai Han

[ ExHall D ]

Abstract
In this paper, we address the challenging problem of open-world instance segmentation. Existing works have shown that vanilla visual networks are biased toward learning appearance information, e.g., texture, to recognize objects. This implicit bias causes the model to fail in detecting novel objects with unseen textures in the open-world setting. To address this challenge, we propose a learning framework, called View-Consistent leaRning (VCR), which aims to enforce the model to learn appearance-invariant representations for robust instance segmentation. In VCR, we first introduce additional views for each image, where the texture undergoes significant alterations while preserving the image's underlying structure. We then encourage the model to learn the appearance-invariant representation by enforcing the consistency between object features across different views, for which we obtain class-agnostic object proposals using off-the-shelf unsupervised models that possess strong object-awareness. These proposals enable cross-view object feature matching, greatly reducing the appearance dependency while enhancing the object-awareness. We thoroughly evaluate our VCR on public benchmarks under both cross-class and cross-dataset settings, achieving state-of-the-art performance.
Poster
Muli Yang · Gabriel James Goenawan · Huaiyuan Qin · Kai Han · Xi Peng · Yanhua Yang · Hongyuan Zhu

[ ExHall D ]

Abstract
Despite being trained on massive data, today's vision foundation models still fall short in detecting open-world objects. Apart from recognizing known objects that appeared in training, a successful Open World Object Detection (OWOD) system must also be able to detect unknown objects that were never seen before, without confusing them with the background. Unlike prevailing prior works that learn "objectness" using probability models, we focus on learning fine-grained class-agnostic attributes that can be used to detect both known and unknown object classes in an explainable manner. In this paper, we propose Partial Attribute Assignment (PASS), aiming to automatically select and optimize a small, relevant subset of attributes from a larger attribute pool. Specifically, we model attribute selection as a Partial Optimal Transport (POT) problem between known visual objects and the attribute pool, in which more relevant attributes signify more transported mass. PASS follows a curriculum schedule that progressively selects and optimizes a targeted subset of attributes during training, promoting stability and accuracy. Our method enjoys end-to-end optimization by minimizing the POT distance and the classification loss on known visual objects, demonstrating high training efficiency and superior OWOD performance across extensive experimental evaluations. Our code will be made public.
Poster
Jiangyi Wang · Na Zhao

[ ExHall D ]

Abstract
Active learning has emerged as a promising approach to reduce the substantial annotation burden in 3D object detection tasks, spurring several initiatives in outdoor environments. However, its application in indoor environments remains unexplored. Compared to outdoor 3D datasets, indoor datasets face significant challenges, including fewer training samples per class, a greater number of classes, more severe class imbalance, and more diverse scene types and intra-class variances. This paper presents the first study on active learning for indoor 3D object detection, where we propose a novel framework tailored for this task. Our method incorporates two key criteria - uncertainty and diversity - to actively select the most ambiguous and informative unlabeled samples for annotation. The uncertainty criterion accounts for both inaccurate detections and undetected objects, ensuring that the most ambiguous samples are prioritized. Meanwhile, the diversity criterion is formulated as a joint optimization problem that maximizes the diversity of both object class distributions and scene types, using a new Class-aware Adaptive Prototype (CAP) bank. The CAP bank dynamically allocates representative prototypes to each class, helping to capture varying intra-class diversity across different categories. We evaluate our method on SUN RGB-D and ScanNetV2, where it outperforms baselines by a significant margin, achieving over 85\% …
Poster
Shizhou Zhang · Xueqiang Lv · Yinghui Xing · Qirui Wu · Di Xu · Yanning Zhang

[ ExHall D ]

Abstract
Generative replay has gained significant attention in class-incremental learning; however, its application to Class Incremental Object Detection (CIOD) remains limited due to the challenges in generating complex images with precise spatial arrangements. In this study, motivated by the observation that the forgetting of prior knowledge is predominantly present in the classification sub-task as opposed to the localization sub-task, we revisit the generative replay method for class incremental object detection. Our method utilizes a standard Stable Diffusion model to generate image-level replay data for all old and new tasks. Accordingly, the old detector and a stage-wise detector are applied to the synthetic images, respectively, to determine the bounding box positions through pseudo-labeling. Furthermore, we propose to use a Similarity-based Cross Sampling mechanism to select more valuable confusing data between old and new tasks to more effectively mitigate catastrophic forgetting and reduce the false alarm rate for the new task. Finally, all synthetic and real data are integrated for current-stage detector training, where the images generated for previous tasks are highly beneficial in minimizing the forgetting of existing knowledge, while those synthesized for the new task can help bridge the domain gap between real and synthetic images. We conducted extensive experiments on …
Poster
Yu Zhou · Dian Zheng · Qijie Mo · Ren-Jie Lu · Kun-Yu Lin · Wei-Shi Zheng

[ ExHall D ]

Abstract
In this work, we present DEcoupLEd Distillation To Erase (DELETE), a general and strong unlearning method for any class-centric tasks. To derive this, we first propose a theoretical framework to analyze the general form of unlearning loss and decompose it into forgetting and retention terms. Through the theoretical framework, we point out that a class of previous methods could be mainly formulated as a loss that implicitly optimizes the forgetting term while lacking supervision for the retention term, disturbing the distribution of the pre-trained model and struggling to adequately preserve knowledge of the remaining classes. To address it, we refine the retention term using "dark knowledge" and propose a mask distillation unlearning method. By applying a mask to separate forgetting logits from retention logits, our approach optimizes both the forgetting and refined retention components simultaneously, retaining knowledge of the remaining classes while ensuring thorough forgetting of the target class. Without access to the remaining data or intervention (\ie, used in some works), we achieve state-of-the-art performance across various benchmarks. What's more, DELETE is a general solution that can be applied to various downstream tasks, including face recognition, backdoor defense, and semantic segmentation with great performance.
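The mask-distillation idea can be pictured with a short, hedged PyTorch sketch: the retention term distills the frozen pre-trained (teacher) logits over all classes except the one being forgotten, while a forgetting term suppresses the forgotten class. The loss form, temperature, and equal weighting below are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch of a mask-distillation unlearning loss (illustrative only).
import torch
import torch.nn.functional as F

def delete_style_loss(student_logits, teacher_logits, forget_class, tau=2.0):
    n_classes = student_logits.shape[1]
    keep = torch.ones(n_classes, dtype=torch.bool)
    keep[forget_class] = False
    # Retention: distill the pre-trained teacher on the remaining classes only.
    t = F.log_softmax(teacher_logits[:, keep] / tau, dim=1)
    s = F.log_softmax(student_logits[:, keep] / tau, dim=1)
    retain = F.kl_div(s, t, reduction="batchmean", log_target=True) * tau * tau
    # Forgetting: drive the probability of the forgotten class toward zero.
    forget = F.softmax(student_logits, dim=1)[:, forget_class].mean()
    return retain + forget

teacher = torch.randn(8, 10)                 # logits of the frozen pre-trained model
student = teacher.clone().requires_grad_()   # logits of the model being unlearned
loss = delete_style_loss(student, teacher, forget_class=3)
loss.backward()
```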
Poster
Mauricio Byrd Victorica · György Dán · Henrik Sandberg

[ ExHall D ]

Abstract
Adversarial patches are capable of misleading computer vision systems based on convolutional neural networks. Existing recovery methods suffer from at least one of three fundamental shortcomings: no information about the presence of patches in the scene, inability to efficiently handle noncontiguous patch attacks, and a strong reliance on fixed saliency thresholds. We propose Saliuitl, a recovery method independent of the number of patches and their shape, which, unlike prior works, explicitly detects patch attacks before attempting recovery. In our approach, detection is based on the attributes of a binarized feature map ensemble, which is generated by using an ensemble of saliency thresholds. If an attack is detected, Saliuitl recovers clean predictions by locating patches, guided by an ensemble of binarized feature maps, and inpainting them. We evaluate Saliuitl on widely used object detection and image classification benchmarks from the adversarial patch literature, and our results show that compared to recent state-of-the-art defenses, Saliuitl achieves a recovery rate up to 97.81 and 42.63 percentage points higher at the same rate of lost predictions for image classification and object detection, respectively. By design, Saliuitl has low computational complexity and is robust to adaptive white-box attacks. Our code is available at https://github.com/Saliuitl/Saliuitl/tree/main.
Poster
Jiacong Xu · Shao-Yuan Lo · Bardia Safaei · Vishal M. Patel · Isht Dwivedi

[ ExHall D ]

Abstract
Zero-Shot Anomaly Detection (ZSAD) is an emerging AD paradigm. Unlike the traditional unsupervised AD setting that requires a large number of normal samples to train a model, ZSAD is more practical for handling data-restricted real-world scenarios. Recently, Multimodal Large Language Models (MLLMs) have shown revolutionary reasoning capabilities in various vision tasks. However, the reasoning of image abnormalities remains underexplored due to the lack of corresponding datasets and benchmarks. To facilitate research in anomaly detection and reasoning, we establish the first visual instruction tuning dataset, Anomaly-Instruct-125k, and the evaluation benchmark, VisA-D&R. Through investigation with our benchmark, we reveal that current MLLMs like GPT-4o cannot accurately detect and describe fine-grained anomalous details in images. To address this, we propose Anomaly-OneVision (Anomaly-OV), the first specialist visual assistant for ZSAD and reasoning, based on LLaVA-OneVision. Inspired by human behavior in visual inspection, Anomaly-OV leverages a Look-Twice Feature Matching (LTFM) mechanism to adaptively select and emphasize abnormal visual tokens for its LLM. Extensive experiments demonstrate that Anomaly-OV achieves significant improvements over advanced generalist models in both detection and reasoning. Furthermore, extensions to medical and 3D anomaly reasoning are provided for future study.
Poster
Mojtaba Nafez · Amirhossein Koochakian · Arad Maleki · Jafar Habibi · Mohammad Rohban

[ ExHall D ]

Abstract
Anomaly Detection (AD) and Anomaly Localization (AL) are crucial in fields that demand high reliability, such as medical imaging and industrial monitoring. However, current AD and AL approaches are often susceptible to adversarial attacks due to limitations in training data, which typically include only normal, unlabeled samples. This study introduces PatchGuard, an adversarially robust AD and AL method that incorporates pseudo anomalies with localization masks within a Vision Transformer (ViT)-based architecture to address these vulnerabilities. We begin by examining the essential properties of pseudo anomalies, and follow it by providing theoretical insights into the attention mechanisms required to enhance the adversarial robustness of AD and AL systems. We then present our approach, which leverages Foreground-Aware Pseudo-Anomalies to overcome the deficiencies of previous anomaly-aware methods. Our method incorporates these crafted pseudo-anomaly samples into a ViT-based framework, with adversarial training guided by a novel loss function designed to improve model robustness, as supported by our theoretical analysis. Experimental results on well-established industrial and medical datasets demonstrate that PatchGuard significantly outperforms previous methods in adversarial settings, achieving performance gains of 53.2% in AD and 68.5% in AL, while also maintaining competitive accuracy in non-adversarial settings.
Poster
Ankan Kumar Bhunia · Changjian Li · Hakan Bilen

[ ExHall D ]

Abstract
This paper introduces a novel anomaly detection (AD) problem aimed at identifying `odd-looking' objects within a scene by comparing them to other objects present. Unlike traditional AD benchmarks with fixed anomaly criteria, our task detects anomalies specific to each scene by inferring a reference group of regular objects. To address occlusions, we use multiple views of each scene as input, construct 3D object-centric models for each instance from 2D views, enhancing these models with geometrically consistent part-aware representations. Anomalous objects are then detected through cross-instance comparison. We also introduce two new benchmarks, ToysAD-8K and PartsAD-15K as testbeds for future research in this task. We provide a comprehensive analysis of our method quantitatively and qualitatively on these benchmarks. The datasets, source code, and models will be made publicly available upon publication.
Poster
Jia Guo · Shuai Lu · Weihang Zhang · Fang Chen · Hongen Liao · Huiqi Li

[ ExHall D ]

Abstract
Recent studies highlighted a practical setting of unsupervised anomaly detection (UAD) that builds a unified model for multi-class images. Despite various advancements addressing this challenging task, the detection performance under the multi-class setting still lags far behind state-of-the-art class-separated models. Our research aims to bridge this substantial performance gap. In this paper, we present Dinomaly, a minimalist reconstruction-based anomaly detection framework that harnesses pure Transformer architectures without relying on complex designs, additional modules, or specialized tricks. Given this powerful framework consisting of only Attentions and MLPs, we found four simple components that are essential to multi-class anomaly detection: (1) Scalable foundation Transformers that extract universal and discriminative features, (2) Noisy Bottleneck where pre-existing Dropouts do all the noise injection tricks, (3) Linear Attention that naturally cannot focus, and (4) Loose Reconstruction that does not force layer-to-layer and point-by-point reconstruction. Extensive experiments are conducted across popular anomaly detection benchmarks including MVTec-AD, VisA, and Real-IAD. Our proposed Dinomaly achieves impressive image-level AUROC of __99.6__%, __98.7__%, and __89.3__% on the three datasets respectively, which is not only superior to state-of-the-art multi-class UAD methods, but also achieves the most advanced class-separated UAD records.
Poster
Fuyun Wang · Tong Zhang · Yuanzhi Wang · Yide Qiu · Xin Liu · Xu Guo · Zhen Cui

[ ExHall D ]

Abstract
In Open-set Supervised Anomaly Detection (OSAD), the existing methods typically generate pseudo anomalies to compensate for the scarcity of observed anomaly samples, while overlooking critical priors of normal samples, leading to less effective discriminative boundaries. To address this issue, we propose a Distribution Prototype Diffusion Learning (DPDL) method aimed at enclosing normal samples within a compact and discriminative distribution space. Specifically, we construct multiple learnable Gaussian prototypes to create a latent representation space for abundant and diverse normal samples and learn a Schrödinger bridge to facilitate a diffusive transition toward these prototypes for normal samples while steering anomaly samples away. Moreover, to enhance inter-sample separation, we design a dispersion feature learning strategy in hyperspherical space, which benefits the identification of out-of-distribution anomalies. Experimental results demonstrate the effectiveness and superiority of our proposed DPDL, achieving state-of-the-art performance on 9 public datasets.
Poster
Mohamed Afane · Gabrielle Ebbrecht · Ying Wang · Juntao Chen · Junaid Farooq

[ ExHall D ]

Abstract
Quantum Neural Networks (QNNs) offer promising capabilities for complex data tasks, but are often constrained by limited qubit resources and high entanglement, which can hinder scalability and efficiency. In this paper, we introduce Adaptive Threshold Pruning (ATP), an encoding method that reduces entanglement and optimizes data complexity for efficient computations in QNNs. ATP dynamically prunes non-essential features in the data based on adaptive thresholds, effectively reducing quantum circuit requirements while preserving high performance. Extensive experiments across multiple datasets demonstrate that ATP reduces entanglement entropy and improves adversarial robustness when combined with adversarial training methods like FGSM. Our results highlight ATP’s ability to balance computational efficiency and model resilience, achieving significant performance improvements with fewer resources, which will help make QNNs more feasible in practical, resource-constrained settings.
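As a rough illustration of adaptive threshold pruning on the classical side, before any quantum encoding, the sketch below zeroes out the smallest-magnitude features of each sample using a per-sample percentile threshold. The percentile rule and keep ratio are assumptions; the paper's exact thresholding and circuit construction may differ.

```python
# Illustrative adaptive-threshold pruning of classical feature vectors.
import numpy as np

def adaptive_threshold_prune(x, keep_ratio=0.25):
    """Zero out the smallest-magnitude features of each sample in x (n, d)."""
    thresh = np.quantile(np.abs(x), 1.0 - keep_ratio, axis=1, keepdims=True)
    return np.where(np.abs(x) >= thresh, x, 0.0)

features = np.random.randn(8, 16)
pruned = adaptive_threshold_prune(features, keep_ratio=0.25)
print((pruned != 0).sum(axis=1))  # roughly 4 surviving features per sample
```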
Poster
Yanda Chen · Gongwei Chen · Miao Zhang · Weili Guan · Liqiang Nie

[ ExHall D ]

Abstract
Dataset distillation (DD) excels in synthesizing a small number of images per class (IPC) but struggles to maintain its effectiveness in high-IPC settings. Recent works on dataset distillation demonstrate that combining distilled and real data can mitigate the effectiveness decay. However, our analysis of the combination paradigm reveals that the current one-shot and independent selection mechanism induces an incompatibility issue between distilled and real images. To address this issue, we introduce a novel curriculum coarse-to-fine selection (CCFS) method for efficient high-IPC dataset distillation. CCFS employs a curriculum selection framework for real data selection, where we leverage a coarse-to-fine strategy to select appropriate real data based on the current synthetic dataset in each curriculum. Extensive experiments demonstrate the effectiveness of CCFS, achieving significant improvements over the state-of-the-art: +6.6\% on CIFAR-10, +5.8\% on CIFAR-100, and +3.4\% on Tiny-ImageNet in high-IPC settings. Notably, we achieve 60.2\% test accuracy on ResNet-18 with a 20\% compression ratio of Tiny-ImageNet, yielding similar performance as full dataset training with only 0.3\% performance degradation.
Poster
Byeongho Heo · Taekyung Kim · Sangdoo Yun · Dongyoon Han

[ ExHall D ]

Abstract
Pre-training with random masked inputs has emerged as a novel trend in self-supervised training. However, supervised learning still faces a challenge in adopting masking augmentations, primarily due to unstable training. In this paper, we propose a novel way to involve masking augmentations dubbed Masked Sub-model (MaskSub). MaskSub consists of the main-model and sub-model, the latter being a part of the former. The main-model undergoes conventional training recipes, while the sub-model receives intensive masking augmentations during training. MaskSub tackles the challenge by mitigating adverse effects through a relaxed loss function similar to a self-distillation loss. Our analysis shows that MaskSub significantly improves performance, with the training loss converging faster than in standard training, which suggests our method stabilizes the training process. We further validate MaskSub across diverse training scenarios and models, including DeiT-III training, MAE finetuning, CLIP finetuning, BERT training, and hierarchical architectures (ResNet and Swin Transformer). Our results show that MaskSub consistently achieves significant performance gains across all the cases. MaskSub provides a practical and effective solution for introducing additional regularization under various training recipes. Our code will be publicly available.
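A hedged sketch of the main-model/sub-model idea: the same network is run once on the clean input and once on a heavily masked copy, and the masked pass is trained toward the clean pass with a self-distillation-style relaxed loss alongside the usual cross-entropy. Pixel-level random masking, the KL form, and the toy classifier below are illustrative assumptions rather than the authors' recipe.

```python
# Illustrative main-model / masked sub-model training step.
import torch
import torch.nn.functional as F

def masksub_step(model, images, labels, mask_ratio=0.5):
    logits_main = model(images)                       # main-model: clean input
    ce = F.cross_entropy(logits_main, labels)
    mask = (torch.rand_like(images) > mask_ratio).float()
    logits_sub = model(images * mask)                 # sub-model: masked input
    # Relaxed, self-distillation-like loss toward the main-model distribution.
    distill = F.kl_div(
        F.log_softmax(logits_sub, dim=1),
        F.softmax(logits_main.detach(), dim=1),
        reduction="batchmean",
    )
    return ce + distill

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x, y = torch.randn(4, 3, 32, 32), torch.randint(0, 10, (4,))
loss = masksub_step(model, x, y)
loss.backward()
```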
Poster
Qing Zhou · Junyu Gao · Qi Wang

[ ExHall D ]

Abstract
The rapid growth of dataset scales has been a key driver in advancing deep learning research. However, as dataset scale increases, the training process becomes increasingly inefficient due to the presence of low-value samples, including excessive redundant samples, overly challenging samples, and inefficient easy samples that contribute little to model improvement. To address this challenge, we propose Scale Efficient Training (SeTa) for large datasets, a dynamic sample pruning approach that losslessly reduces training time. To remove low-value samples, SeTa first performs random pruning to eliminate redundant samples, then clusters the remaining samples according to their learning difficulty measured by loss. Building upon this clustering, a sliding window strategy is employed to progressively remove both overly challenging and inefficient easy clusters following an easy-to-hard curriculum. We conduct extensive experiments on large-scale synthetic datasets, including ToCa, SS1M, and ST+MJ, each containing over 3 million samples. SeTa reduces training costs by up to 50% while maintaining or improving performance, with minimal degradation even at 70% cost reduction. Furthermore, experiments on real datasets of various scales, across diverse backbones (including CNNs, Transformers, and Mambas) and diverse tasks (instruction tuning, multi-view stereo, geo-localization, composed image retrieval, referring image segmentation), demonstrate the powerful effectiveness and universality of …
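The selection loop described above can be sketched roughly as follows: randomly drop a fraction of samples, group the survivors by current training loss, and keep a sliding window of groups that moves from easy to hard over training. Grouping by sorted loss stands in for the paper's clustering step, and all names and ratios are illustrative.

```python
# Schematic sliding-window sample selection over loss-ordered groups.
import numpy as np

def seta_style_select(losses, epoch, total_epochs, random_keep=0.8,
                      n_groups=10, window=4, seed=0):
    rng = np.random.default_rng(seed)
    idx = np.arange(len(losses))
    idx = rng.choice(idx, size=int(random_keep * len(idx)), replace=False)
    order = idx[np.argsort(losses[idx])]          # surviving samples, easy -> hard
    groups = np.array_split(order, n_groups)
    # Slide the window from the easy end toward the hard end as training proceeds.
    start = int((epoch / max(total_epochs - 1, 1)) * (n_groups - window))
    return np.concatenate(groups[start:start + window])

losses = np.random.rand(1000)                     # current per-sample training loss
kept = seta_style_select(losses, epoch=3, total_epochs=10)
print(len(kept))
```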
Poster
Eliahu Horwitz · Bar Cavia · Jonathan Kahana · Yedid Hoshen

[ ExHall D ]

Abstract
The increasing availability of public models begs the question: can we train neural networks that use other networks as input? Such models allow us to study different aspects of a given neural network, for example, determining the categories in a model's training dataset. However, machine learning on model weights is challenging as they often exhibit significant variation unrelated to the models' semantic properties (nuisance variation). Here, we identify a key property of real-world models: most public models belong to a small set of Model Trees, where all models within a tree are fine-tuned from a common ancestor (e.g., a foundation model). Importantly, we find that within each tree there is less nuisance variation between models. Concretely, while learning across Model Trees requires complex architectures, even a linear classifier trained on a single model layer often works within trees. While effective, these linear classifiers are computationally expensive, especially when dealing with larger models that have many parameters. To address this, we introduce Probing Experts (ProbeX), a theoretically motivated and lightweight method. Notably, ProbeX is the first probing method specifically designed to learn from the weights of a single hidden model layer. We demonstrate the effectiveness of ProbeX by predicting the categories …
Poster
Sebastian Dziadzio · Vishaal Udandarao · Karsten Roth · Ameya Prabhu · Zeynep Akata · Samuel Albanie · Matthias Bethge

[ ExHall D ]

Abstract
Model merging combines "expert" models---each finetuned from a shared foundation model on diverse tasks and domains---into a single, more capable base model. However, existing model merging approaches assume all experts to be available simultaneously. In reality, new tasks and domains emerge continuously, prompting the need for a dynamic process of integrating these experts over time, which we call \textit{temporal model merging}. The temporal dimension introduces unique challenges not addressed in prior work: At each task, should expert training start from merged previous experts or the original base model? Should all models be merged at every time step? Which merging techniques are best suited for temporal merging? Should different strategies be used for the training initialization and deployment phases? To tackle these questions, we propose a unified framework called \textsc{TIME}---\underline{T}emporal \underline{I}ntegration of \underline{M}odel \underline{E}xpertise---that defines temporal model merging across three axes: (1) Initialization Phase, (2) Deployment Phase, and (3) Merging Technique. Utilizing \textsc{TIME}, we study temporal model merging across model sizes, tasks, and compute budgets on the large-scale FoMo-in-Flux benchmark for continual multimodal pretraining. Systematic experiments across \textsc{TIME} and FoMo-in-Flux allow us to arrive at several crucial key insights for temporal model merging to better understand current limits and best practices for successful …
Poster
Xiaohan Qin · Xiaoxing Wang · Junchi Yan

[ ExHall D ]

Abstract
Multi-task learning (MTL) can leverage shared knowledge across tasks to improve data efficiency and generalization performance, and has been applied in various scenarios. However, task imbalance remains a major challenge for existing MTL methods. While the prior works have attempted to mitigate inter-task unfairness through loss-based and gradient-based strategies, they still exhibit imbalanced performance across tasks on common benchmarks. This key observation motivates us to consider performance-level information as an explicit fairness indicator, which can precisely reflect the current optimization status of each task, and accordingly help to adjust the gradient aggregation process. Specifically, we utilize the performance variance among tasks as the fairness indicator and introduce a dynamic weighting strategy to gradually reduce the performance variance. Based on this, we propose PIVRG, a novel performance-informed variance reduction gradient aggregation approach. Extensive experiments show that PIVRG achieves SOTA performance across various benchmarks, spanning both supervised learning and reinforcement learning tasks with task numbers ranging from 2 to 40. Results from the ablation study also show that our approach can be integrated into existing methods, significantly enhancing their performance while reducing the performance variance among tasks, thus achieving fairer optimization.
Poster
Sihao Liu · Yibo Yang · Xiaojie Li · David A. Clifton · Bernard Ghanem

[ ExHall D ]

Abstract
Online continual learning (OCL) seeks to learn new tasks from data streams that appear only once, while retaining knowledge of previously learned tasks. Most existing methods rely on replay, focusing on enhancing memory retention through regularization or distillation. However, they often overlook the adaptability of the model, limiting the ability to learn generalizable and discriminative features incrementally from online training data. To address this, we introduce a plug-and-play module, S6MOD, which can be integrated into most existing methods and directly improve adaptability. Specifically, S6MOD introduces an extra branch after the backbone, where a mixture of discretization selectively adjusts parameters in a selective state space model, enriching selective scan patterns such that the model can adaptively select the most sensitive discretization method for current dynamics. We further design a class-conditional routing algorithm for dynamic, uncertainty-based adjustment and implement a contrastive discretization loss to optimize it. Extensive experiments combining our module with various models demonstrate that S6MOD significantly enhances model adaptability, leading to substantial performance gains and achieving the state-of-the-art results.
Poster
Fei Ye · Adrian Bors

[ ExHall D ]

Abstract
Recent continuous learning (CL) research primarily addresses catastrophic forgetting within a straightforward learning framework where class and task information are predefined. However, in more realistic and challenging CL scenarios, such supervised information is typically absent. In this paper, we address this challenging CL scenario by introducing an innovative memory management approach that incorporates a dynamic memory system for storing selected representatives from evolving data, while a dynamically expandable memory system enables retaining essential long-term knowledge. Specifically, the dynamic expandable memory system manages a series of memory distributions, each designed to represent the information from a distinct data category. We propose a new memory expansion mechanism that assesses the proximity between incoming samples and existing memory distributions, utilizing this evaluation to incrementally add new memory distributions into the system. Additionally, a novel memory distribution augmentation technique is proposed for selectively gathering suitable samples for each memory distribution, enhancing the statistical robustness over time. To prevent memory saturation before the training phase, we introduce a memory distribution reduction strategy that automatically eliminates overlapping memory distributions, ensuring adequate capacity for accommodating new information in subsequent learning episodes. We conduct a series of experiments demonstrating that our proposed approach attains state-of-the-art performance in both …
Poster
Zijian Gao · Wangwang Jia · Xingxing Zhang · Dulan Zhou · Kele Xu · Feng Dawei · Yong Dou · Xinjun Mao · Huaimin Wang

[ ExHall D ]

Abstract
Class-Incremental Learning (CIL) enables models to continuously learn new classes while mitigating catastrophic forgetting. Recently, Pre-Trained Models (PTMs) have greatly enhanced CIL performance, even when fine-tuning is limited to the first task. This advantage is particularly beneficial for CIL methods that freeze the feature extractor after first-task fine-tuning, such as analytic learning-based approaches using a least squares solution-based classification head to acquire knowledge recursively. In this work, we revisit the analytical learning approach combined with PTMs and identify its limitations in adapting to new classes, leading to sub-optimal performance. To address this, we propose the **Mo**mentum-based **A**nalytical **L**earning (**MoAL**) approach. MoAL achieves robust knowledge memorization via an analytical classification head and improves adaptivity to new classes through momentum-based adapter weight interpolation, also leading to forgetting outdated knowledge. Importantly, we introduce a knowledge rumination mechanism that leverages refined adaptivity, allowing the model to revisit and reinforce old knowledge, thereby improving performance on old classes. MoAL facilitates the acquisition of new knowledge and consolidates old knowledge, achieving a win-win outcome between plasticity and stability. Extensive experiments on multiple datasets and incremental settings demonstrate that MoAL significantly outperforms current state-of-the-art methods.
Poster
Arnav Mohanty Das · Gantavya Bhatt · Lilly Kumari · Sahil Verma · Jeff Bilmes

[ ExHall D ]

Abstract
Retrieval augmentation, the practice of retrieving additional data from large auxiliary pools, has emerged as an effective technique for enhancing model performance in the low-data regime, e.g., few-shot learning. Prior approaches have employed only \emph{nearest-neighbor} based strategies for data selection, which retrieve auxiliary samples with high similarity to instances in the training set. However, these approaches are prone to selecting highly redundant samples, since they fail to incorporate any notion of diversity. In our work, we first demonstrate that data selection strategies used in prior retrieval-augmented few-shot learning settings can be generalized using a class of functions known as Combinatorial Mutual Information (CMI) measures. We then propose COBRA (COmBinatorial Retrieval Augmentation), which employs an alternative CMI measure that considers both diversity and similarity to a target dataset. COBRA consistently outperforms previous retrieval approaches across datasets and few-shot learning techniques when used to retrieve samples from LAION-2B. Using COBRA introduces negligible computational overhead to the cost of retrieval, while providing significant gains in downstream model performance.
Poster
Da-Wei Zhou · Zi-Wen Cai · Han-Jia Ye · Lijun Zhang · De-Chuan Zhan

[ ExHall D ]

Abstract
Domain-Incremental Learning (DIL) involves the progressive adaptation of a model to new concepts across different domains. While recent advances in pre-trained models provide a solid foundation for DIL, learning new concepts often results in the catastrophic forgetting of pre-trained knowledge. Specifically, sequential model updates can overwrite both the representation and the classifier with knowledge from the latest domain. Thus, it is crucial to develop a representation and corresponding classifier that accommodate all seen domains throughout the learning process. To this end, we propose DUal ConsolidaTion (Duct) to unify and consolidate historical knowledge at both the representation and classifier levels. By merging the backbone of different stages, we create a representation space suitable for multiple domains incrementally. The merged representation serves as a balanced intermediary that captures task-specific features from all seen domains. Additionally, to address the mismatch between consolidated embeddings and the classifier, we introduce an extra classifier consolidation process. Leveraging class-wise semantic information, we estimate the classifier weights of old domains within the latest embedding space. By merging historical and estimated classifiers, we align them with the consolidated embedding space, facilitating incremental classification. Extensive experimental results on four benchmark datasets demonstrate Duct's state-of-the-art performance.
Poster
Aristotelis Ballas · Christos Diou

[ ExHall D ]

Abstract
Domain Generalization (DG) research has gained considerable traction as of late, since the ability to generalize to unseen data distributions is a requirement that eludes even state-of-the-art training algorithms. In this paper we observe that the initial iterations of model training play a key role in domain generalization effectiveness, since the loss landscape may be significantly different across the training and test distributions, contrary to the case of i.i.d. data. Conflicts between gradients of the loss components of each domain lead the optimization procedure to undesirable local minima that do not capture the domain-invariant features of the target classes. We propose alleviating domain conflicts in model optimization, by iteratively annealing the parameters of a model in the early stages of training and searching for points where gradients align between domains. By discovering a set of parameter values where gradients are updated towards the same direction for each data distribution present in the training set, the proposed Gradient-Guided Annealing (GGA) algorithm encourages models to seek out minima that exhibit improved robustness against domain shifts. The efficacy of GGA is evaluated on four widely accepted and challenging image classification domain generalization benchmarks, where its use alone is able to establish highly competitive or even state-of-the-art performance. Moreover, when combined with previously proposed domain-generalization algorithms it is able to consistently improve their …
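One way to picture the gradient-alignment check that GGA searches for is to compute one loss per training domain and measure the minimum pairwise cosine similarity between the per-domain gradients, accepting a candidate parameter point only when that value is high. The toy model and the acceptance rule in the sketch below are assumptions standing in for the paper's annealing procedure.

```python
# Minimum pairwise cosine similarity between per-domain gradients (illustrative).
import torch
import torch.nn.functional as F

def min_pairwise_gradient_alignment(model, domain_losses):
    grads = []
    for loss in domain_losses:
        g = torch.autograd.grad(loss, tuple(model.parameters()), retain_graph=True)
        grads.append(torch.cat([p.reshape(-1) for p in g]))
    sims = [F.cosine_similarity(grads[i], grads[j], dim=0)
            for i in range(len(grads)) for j in range(i + 1, len(grads))]
    return torch.stack(sims).min()

model = torch.nn.Linear(8, 3)
domains = [(torch.randn(16, 8), torch.randint(0, 3, (16,))) for _ in range(3)]
losses = [F.cross_entropy(model(x), y) for x, y in domains]
print(float(min_pairwise_gradient_alignment(model, losses)))
# A GGA-style rule would accept the current parameters only if this value
# exceeds an alignment threshold, and otherwise keep annealing/searching.
```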
Poster
Xiangyu Chang · Fahim Faisal Niloy · Sk Miraj Ahmed · Srikanth Krishnamurthy · Basak Guler · Ananthram Swami · Samet Oymak · Amit K. Roy-Chowdhury

[ ExHall D ]

Abstract
Incorporating transformer models into edge devices poses a significant challenge due to the computational demands of adapting these large models across diverse applications. Parameter-efficient tuning (PET) methods like LoRA, Adapter, and Visual Prompt Tuning (VPT) allow for targeted adaptation by modifying only small parts of the transformer model. However, adapting to dynamic, unlabeled target distributions at test time remains complex. To address this, we introduce AdMiT: Adaptive Multi-Source Tuning in Dynamic Environments. AdMiT innovates by pre-training a set of PET modules, each optimized for different source distributions or tasks, and dynamically selecting and integrating a sparse subset of relevant modules when encountering a new, few-shot, unlabeled target distribution. This integration leverages Kernel Mean Embedding (KME)-based matching to align the target distribution with relevant source knowledge efficiently, without requiring additional routing networks or hyperparameter tuning. AdMiT achieves adaptation with a single inference step, making it particularly suitable for resource-constrained edge deployments. Furthermore, AdMiT preserves privacy by performing adaptation locally on each edge device, with no data exchange required. Our theoretical analysis establishes guarantees for AdMiT's generalization, while extensive benchmarks demonstrate that AdMiT consistently outperforms other PET methods across a range of tasks, achieving robust and efficient adaptation.
Poster
Hassan Mahmood · Ehsan Elhamifar

[ ExHall D ]

Abstract
Generating targeted universal perturbations for multi-label recognition is a combinatorially hard problem that requires exponential time and space complexity. To address the problem, we propose a compositional framework. We show that a simple independence assumption on label-wise universal perturbations naturally leads to an efficient optimization that requires learning affine convex cones spanned by label-wise universal perturbations, significantly reducing the problem complexity to linear time and space. During inference, the framework allows generating universal perturbations for novel combinations of classes in constant time. We demonstrate the scalability of our method on large datasets and target sizes, evaluating its performance on NUS-WIDE, MS-COCO, and OpenImages using state-of-the-art multi-label recognition models. Our results show that our approach outperforms baselines and achieves results comparable to methods with exponential complexity.
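A hedged sketch of the inference-time composition described above: given one universal perturbation learned per target label, a perturbation for a novel label combination is formed as a non-negative combination of the label-wise perturbations, clipped to the perturbation budget. The uniform coefficients and the epsilon budget are illustrative assumptions.

```python
# Composing label-wise universal perturbations for a novel label combination.
import numpy as np

def compose_perturbation(label_perturbations, target_labels, coeffs=None, eps=8 / 255):
    """label_perturbations: dict mapping label -> perturbation array (image shape)."""
    if coeffs is None:
        coeffs = {c: 1.0 / len(target_labels) for c in target_labels}
    delta = sum(coeffs[c] * label_perturbations[c] for c in target_labels)
    return np.clip(delta, -eps, eps)   # constant-time composition at inference

perts = {c: np.random.uniform(-8 / 255, 8 / 255, size=(3, 224, 224)) for c in range(5)}
delta = compose_perturbation(perts, target_labels=[1, 3, 4])
print(delta.shape, np.abs(delta).max() <= 8 / 255)
```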
Poster
Tianhao Ma · Han Chen · Juncheng Hu · Yungang Zhu · Ximing Li

[ ExHall D ]

Abstract
Learning from label proportions (LLP), a challenging weakly-supervised learning task, aims to train a classifier by using bags of instances and the proportions of classes within bags, rather than annotated labels for each instance. Beyond the traditional bag-level loss, the mainstream methodology of LLP is to incorporate an auxiliary instance-level loss with pseudo-labels formed by predictions. Unfortunately, we empirically observed that the pseudo-labels are often inaccurate and even meaningless, especially for the scenarios with large bag sizes, hurting the classifier induction. To alleviate this problem, we suggest a novel LLP method, namely Learning Label Proportions with Auxiliary High-confident Instance-level Loss (L^2P-AHIL). Specifically, we propose a dual entropy-based weight (DEW) method to adaptively measure the confidences of pseudo-labels. It simultaneously emphasizes accurate predictions at the bag level and avoids smoothing predictions, which tend to be meaningless. We then form a high-confident instance-level loss with DEW, and jointly optimize it with the bag-level loss in a self-training manner. The experimental results on benchmark datasets show that L^2P-AHIL can surpass the existing baseline methods, and the performance gain can be more significant as the bag size increases.
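Very loosely, a dual entropy-based weight can be illustrated as the product of an instance-level confidence (peaked predictions score higher) and a bag-level confidence (agreement between the averaged prediction and the given label proportions). The exact functional form used in the paper is not reproduced here; everything below is an illustrative stand-in.

```python
# Illustrative entropy-based confidence weights for pseudo-labels in one bag.
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    return -np.sum(p * np.log(p + eps), axis=axis)

def dual_entropy_weights(bag_probs, bag_proportions, eps=1e-12):
    """bag_probs: (bag_size, n_classes) softmax outputs for one bag.
    bag_proportions: (n_classes,) given class proportions of the bag."""
    max_h = np.log(bag_probs.shape[1])
    inst_conf = 1.0 - entropy(bag_probs) / max_h            # peaked -> high weight
    mean_pred = bag_probs.mean(axis=0)
    bag_ce = -np.sum(bag_proportions * np.log(mean_pred + eps))
    bag_conf = np.exp(-bag_ce)                              # proportion agreement
    return inst_conf * bag_conf

probs = np.random.dirichlet(np.ones(5), size=32)            # one bag of 32 instances
proportions = np.array([0.4, 0.3, 0.1, 0.1, 0.1])
print(dual_entropy_weights(probs, proportions).shape)       # (32,)
```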
Poster
Jae Hyeon Park · Joo Hyeon Jeon · Jae Yun Lee · Sangyeon Ahn · MinHee Cha · Min Geol Kim · Hyeok Nam · Sung In Cho

[ ExHall D ]

Abstract
This study addresses the limitations of existing dynamic pseudo-labeling techniques, which often utilize static or dynamic thresholds for confident sample selection. Traditional methods fail to capture the non-linear relationship between task accuracy and model confidence, particularly in the context of overconfidence, thus limiting learning opportunities for sensitive samples that significantly influence a model's generalization ability. To solve this, we propose a novel gradient pass-based dynamic pseudo-labeling (DPL) technique that incorporates high-entropy samples, which are typically overlooked. Our approach introduces two classifiers—low gradient pass (LGP) and high gradient pass (HGP)—to derive sensitive dynamic thresholds (SDT) and underconfident dynamic thresholds (UDT), respectively. By effectively combining these thresholds with those from converged and overconfident states, we aim to create a more adaptive and effective learning strategy. Our main contributions highlight the importance of considering both low and high-confidence samples in enhancing model robustness and generalization for improved PL performance.
Poster
Erik Wallin · Fredrik Kahl · Lars Hammarstrand

[ ExHall D ]

Abstract
Out-of-distribution (OOD) detection in deep learning has traditionally been framed as a binary task, where samples are either classified as belonging to the known classes or marked as OOD, with little attention given to the semantic relationships between OOD samples and the in-distribution (ID) classes. We propose a framework for detecting and classifying OOD samples in a given label hierarchy. Specifically, we aim to predict OOD data to their correct internal nodes of the label hierarchy, whereas the known ID classes should be predicted as their corresponding leaf nodes. Our approach leverages the label hierarchy to create a probabilistic model and we implement this model by using networks trained for ID classification at multiple hierarchy depths. We conduct experiments on three datasets with predefined label hierarchies and show the effectiveness of our method. Our code is provided as supplementary material.
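The prediction rule described above, where ID samples resolve to leaf nodes while OOD samples stop at the deepest internal node the classifiers are still confident about, can be sketched as a simple top-down walk over a label tree. The tiny tree, the fixed confidence threshold, and the per-depth probabilities are illustrative assumptions rather than the paper's probabilistic model.

```python
# Top-down hierarchical classification with an OOD stop rule (illustrative).
tree = {"root": ["animal", "vehicle"],
        "animal": ["cat", "dog"],
        "vehicle": ["car", "truck"]}

def classify_in_hierarchy(probs_per_depth, threshold=0.7):
    """probs_per_depth: list of dicts mapping child name -> probability,
    one dict per depth along the chosen path."""
    node = "root"
    for probs in probs_per_depth:
        best = max(probs, key=probs.get)
        if probs[best] < threshold:
            return node          # OOD w.r.t. finer classes: stop at internal node
        node = best
        if node not in tree:     # reached a leaf: in-distribution prediction
            break
    return node

print(classify_in_hierarchy([{"animal": 0.9, "vehicle": 0.1},
                             {"cat": 0.55, "dog": 0.45}]))  # -> "animal"
```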
Poster
Divya M Shanmugam · Helen Lu · Swami Sankaranarayanan · John Guttag

[ ExHall D ]

Abstract
A conformal classifier produces a set of predicted classes and provides a probabilistic guarantee that the set includes the true class. Unfortunately, it is often the case that conformal classifiers produce uninformatively large sets. In this work, we show that test-time augmentation (TTA)--a technique that introduces inductive biases during inference--reduces the size of the sets produced by conformal classifiers. Our approach is flexible, computationally efficient, and effective. It can be combined with any conformal score, requires no model retraining, and reduces prediction set sizes by 10%-14% on average. We conduct an evaluation of the approach spanning three datasets, three models, two established conformal scoring methods, different guarantee strengths, and several distribution shifts to show when and why test-time augmentation is a useful addition to the conformal pipeline.
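A minimal sketch of plugging test-time augmentation into a split-conformal pipeline: softmax scores are averaged over a few augmented views before the usual calibration quantile and prediction-set construction. The flip-and-noise augmentations, the 1 - p(true class) score, and the toy linear model are simplifying assumptions; the paper evaluates several conformal scores and augmentation policies.

```python
# TTA-averaged softmax scores fed into split conformal prediction (illustrative).
import numpy as np
import torch
import torch.nn.functional as F

def tta_probs(model, x, n_noise=3, noise=0.02):
    views = [x, torch.flip(x, dims=[-1])]
    views += [x + noise * torch.randn_like(x) for _ in range(n_noise)]
    with torch.no_grad():
        return torch.stack([F.softmax(model(v), dim=1) for v in views]).mean(0)

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Conservative split-conformal quantile of the 1 - p(true class) score."""
    scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]
    k = int(np.ceil((len(scores) + 1) * (1 - alpha))) - 1
    return np.sort(scores)[min(k, len(scores) - 1)]

def prediction_sets(test_probs, qhat):
    return [np.where(1.0 - p <= qhat)[0] for p in test_probs]

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x_cal, y_cal = torch.randn(100, 3, 32, 32), np.random.randint(0, 10, 100)
x_test = torch.randn(5, 3, 32, 32)
qhat = conformal_threshold(tta_probs(model, x_cal).numpy(), y_cal)
print([len(s) for s in prediction_sets(tta_probs(model, x_test).numpy(), qhat)])
```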
Poster
Xiangtao Zhang · Sheng Li · Ao Li · Yipeng Liu · Fan Zhang · Ce Zhu · Le Zhang

[ ExHall D ]

Abstract
Heterogeneous Federated Learning (HFL) has received widespread attention due to its adaptability to different models and data. The HFL approach of utilizing auxiliary models for knowledge transfer enhances flexibility. However, existing frameworks face the challenges of aggregation bias and local overfitting. To address these issues, we propose FedSCE. It reduces the degrees of freedom of updates and improves generalization performance by restricting the updates of specific layers of the local model to a local subspace. The subspace is dynamically updated to ensure coverage of the latest model update trajectory. Additionally, FedSCE evaluates client contributions based on the update distance of the auxiliary model in feature space and parameter space, achieving adaptive weighted aggregation. We validate our approach in both feature-skewed and label-skewed scenarios, demonstrating that on Office10, our method exceeds the best baseline by 3.87. Our source code will be released.
Poster
Gongxi Zhu · Donghao Li · Hanlin Gu · Yuan Yao · Lixin Fan · Yuxing Han

[ ExHall D ]

Abstract
Federated Learning (FL) is a promising approach for training machine learning models on decentralized data while preserving privacy. However, privacy risks, particularly Membership Inference Attacks (MIAs), which aim to determine whether a specific data point belongs to a target client’s training set, remain a significant concern. Existing methods for implementing MIAs in FL primarily analyze updates from the target client, focusing on metrics such as loss, gradient norm, and gradient difference. However, these methods fail to leverage updates from non-target clients, potentially underutilizing available information. In this paper, we first formulate a one-tailed likelihood-ratio hypothesis test based on the likelihood of updates from non-target clients. Building upon this formulation, we introduce a three-stage Membership Inference Attack (MIA) method, called FedMIA, which follows the "all for one"—leveraging updates from all clients across multiple communication rounds to enhance MIA effectiveness. Both theoretical analysis and extensive experimental results demonstrate that FedMIA outperforms existing MIAs in both classification and generative tasks. Additionally, it can be integrated as an extension to existing methods and is robust against various defense strategies, Non-IID data, and different federated structures.
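The "all for one" calibration can be pictured as a one-tailed test: a membership score for a candidate sample (here, its loss under the target client's update) is compared against the empirical distribution of the same score under the non-target clients' updates. The Gaussian fit and the loss-based score below are illustrative assumptions, not the paper's full three-stage attack.

```python
# One-tailed Gaussian calibration of a membership score against non-target clients.
import math
import numpy as np

def one_tailed_pvalue(target_score, nontarget_scores):
    mu = np.mean(nontarget_scores)
    sigma = np.std(nontarget_scores) + 1e-12
    z = (target_score - mu) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))   # P(score <= target)

# Toy example: loss of one candidate sample evaluated under each client's update.
nontarget_losses = np.random.normal(2.3, 0.2, size=9)
target_loss = 1.6
print(one_tailed_pvalue(target_loss, nontarget_losses))  # small value -> likely a member
```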
Poster
Jiahao Xu · Zikai Zhang · Rui Hu

[ ExHall D ]

Abstract
The distributed nature of training makes Federated Learning (FL) vulnerable to backdoor attacks, where malicious model updates aim to compromise the global model’s performance on specific tasks. Existing defense methods show limited efficacy as they overlook the inconsistency between benign and malicious model updates regarding both general and fine-grained directions. To fill this gap, we introduce AlignIns, a novel defense method designed to safeguard FL systems against backdoor attacks. AlignIns looks into the direction of each model update through a direction alignment inspection process. Specifically, it examines the alignment of model updates with the overall update direction and analyzes the distribution of the signs of their significant parameters, comparing them with the principle sign across all model updates. Model updates that exhibit an unusual degree of alignment are considered malicious and are thus filtered out. We provide the theoretical analysis of the robustness of AlignIns and its propagation error in FL. Our empirical results on both independent and identically distributed (IID) and non-IID datasets demonstrate that AlignIns achieves higher robustness compared to the state-of-the-art defense methods. Code is available at \url{https://anonymous.4open.science/r/AlignIns}.
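A hedged sketch of direction-alignment inspection: each client update is scored by its cosine similarity to the mean update direction and by the sign agreement of its largest-magnitude coordinates with the coordinate-wise majority sign, and updates whose scores are extreme outliers are dropped before aggregation. The median-absolute-deviation rule, thresholds, and toy updates are assumptions for illustration.

```python
# Scoring and filtering client updates by direction alignment (illustrative).
import numpy as np

def alignment_scores(updates, topk_frac=0.1):
    u = np.stack(updates)                          # (n_clients, dim)
    mean_dir = u.mean(axis=0)
    cos = u @ mean_dir / (np.linalg.norm(u, axis=1) * np.linalg.norm(mean_dir) + 1e-12)
    majority_sign = np.sign(u.sum(axis=0))
    k = max(1, int(topk_frac * u.shape[1]))
    agree = []
    for row in u:
        top = np.argsort(-np.abs(row))[:k]         # most significant parameters
        agree.append(np.mean(np.sign(row[top]) == majority_sign[top]))
    return cos, np.array(agree)

def mad_inliers(scores, z=2.5):
    med = np.median(scores)
    mad = np.median(np.abs(scores - med)) + 1e-12
    return np.abs(scores - med) / (1.4826 * mad) < z

base = np.random.randn(1000)
benign = [base + 0.3 * np.random.randn(1000) for _ in range(9)]
flipped = [-base + 0.3 * np.random.randn(1000)]    # a crude malicious update
cos, agree = alignment_scores(benign + flipped)
print(mad_inliers(cos) & mad_inliers(agree))       # the flipped update is filtered out
```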
Poster
Fan Xing · Zhuo Tian · Xuefeng Fan · Xiaoyi Zhou

[ ExHall D ]

Abstract
Reversible Adversarial Examples (RAE) are designed to protect the intellectual property of datasets. Such examples can function as imperceptible adversarial examples to erode the model performance of unauthorized users while allowing authorized users to remove the adversarial perturbations and recover the original samples for normal model training. With the rise of Self-Supervised Learning (SSL), an increasing number of unlabeled datasets and pre-trained encoders are available in the community. However, existing RAE methods not only rely on well-labeled datasets for training Supervised Learning (SL) models but also exhibit poor adversarial transferability when attacking SSL pre-trained encoders. To address these challenges, we propose RAEncoder, the first framework for RAEs without the need for labeled samples. RAEncoder aims to generate universal adversarial perturbations by targeting SSL pre-trained encoders. Unlike traditional RAE approaches, the pre-trained encoder outputs the feature distribution of the protected dataset rather than classification labels, enhancing both the attack success rate and transferability of RAEs. Extensive experiments are conducted on six pre-trained encoders and four SL models, covering aspects such as imperceptibility and transferability. Our results demonstrate that RAEncoder effectively protects unlabeled datasets from malicious infringements. Additional robustness experiments further confirm the security of RAEncoder in practical application scenarios.
Poster
Sizai Hou · Songze Li · Duanyi Yao

[ ExHall D ]

Abstract
Self-supervised learning (SSL) is pervasively exploited in training high-quality upstream encoders with a large amount of unlabeled data. However, it is found to be susceptible to backdoor attacks merely via polluting a small portion of training data. The victim encoders mismatch triggered inputs with target embeddings, e.g., match a triggered cat input to an airplane embedding, such that the downstream tasks are affected to misbehave when the trigger is activated. Emerging backdoor attacks have shown great threats in different SSL paradigms such as contrastive learning and CLIP, while little research has been devoted to defending against such attacks. Besides, the existing ones fall short in detecting advanced stealthy backdoors. To address the limitations, we propose a novel detection mechanism, DEDE, which detects the activation of the backdoor mapping given the co-occurrence of the victim encoder and trigger inputs. Specifically, DEDE trains a decoder for the SSL encoder on an auxiliary dataset (which can be out-of-distribution or even slightly poisoned), such that for any triggered input that misleads to the target embedding, the decoder outputs an image significantly different from the input. We empirically evaluate DEDE on both contrastive learning and CLIP models against various types of backdoor attacks, and demonstrate its superior performance …
Poster
Shixin Li · Chaoxiang He · Xiaojing Ma · Bin Benjamin Zhu · Shuo Wang · Hongsheng Hu · Dongmei Zhang · Linchen Yu

[ ExHall D ]

Abstract
Adversarial attacks threaten the integrity of deep neural networks (DNNs), particularly in high-stakes applications. This paper explores an innovative black-box adversarial attack strategy leveraging checkpoints from a single model’s training trajectory. Unlike traditional ensemble attacks that require multiple surrogate models of different architectures, our approach utilizes a single model’s diverse training checkpoints to craft adversarial examples. By categorizing the knowledge learned during training into task-intrinsic and task-irrelevant knowledge, we identify checkpoints that predominantly capture task-intrinsic knowledge, which generalizes across different models. We introduce an accuracy gap-based selection strategy to enhance the transferability of adversarial examples to models with different architectures. Extensive experiments on benchmark datasets, including ImageNet and CIFAR-10, demonstrate that our method consistently outperforms traditional model ensemble attacks in terms of transferability. Furthermore, our approach remains highly effective even with significantly reduced training data, offering a practical and resource-efficient solution for highly transferable adversarial attacks.
Poster
Yuan Xiao · Yuchen Chen · Shiqing Ma · Chunrong Fang · Tongtong Bai · Mingzheng Gu · Yuxin Cheng · Yanwei Chen · Zhenyu Chen

[ ExHall D ]

Abstract
The robustness of neural network classifiers is important in the safety-critical domain and can be quantified by robustness verification. At present, efficient and scalable verification techniques are always sound but incomplete, and thus, the improvement of verified robustness results is the key criterion to evaluate the performance of incomplete verification approaches. The multi-variate function MaxPool is widely adopted yet challenging to verify. In this paper, we present \textbf{Ti-Lin}, a robustness verifier for MaxPool-based CNNs with \textbf{Ti}ght \textbf{Lin}ear Approximation. Following the sequel of minimizing the over-approximation zone of the non-linear function of CNNs, we are the first to propose the provably neuron-wise tightest linear bounds for the MaxPool function. By our proposed linear bounds, we can certify larger robustness results for CNNs. We evaluate the effectiveness of Ti-Lin on different verification frameworks with open-sourced benchmarks, including LeNet, PointNet, and networks trained on the MNIST, CIFAR-10, Tiny ImageNet and ModelNet40 datasets. Experimental results show that Ti-Lin significantly outperforms the state-of-the-art methods across all networks with up to 78.6\% improvement in terms of the certified accuracy with almost the same time consumption as the fastest tool. Our code is available at \url{https://anonymous.4open.science/r/Ti-Lin-cvpr-72EE}.
Poster
Quanjiang Li · Tingjin Luo · Jiahui Liao

[ ExHall D ]

Abstract
Incomplete features and label noise in multi-view multi-label data significantly undermine reliability and performance, motivating researchers to explore mechanisms for representation and information recovery. However, learning under such dual deficiencies is crucial yet rarely studied. In this paper, we propose a theory-inspired Deep Multi-View Multi-Label Learning method with Incomplete Views and Noisy Labels, named DMMIvNL, to address these problems. Specifically, to promote the synthesis of task-relevant shared information and preserve the distinctiveness of individual features from limited views, we develop a feature extraction module based on the information bottleneck theory and formulate its theoretical upper bound into the objective. Meanwhile, we theoretically prove that minimizing the volume of the transition matrix ensures statistical consistency of classifier training. Besides, a cycle-consistent estimation principle is proposed in the volume minimization network to improve the recognition stability under multi-label noise. Moreover, inherent semantic information and label correlations are leveraged as model regularization to reduce the risk of overfitting to noise. Finally, extensive experimental results validate the effectiveness and robustness of our DMMIvNL.
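The volume-minimization idea can be made concrete with a small sketch in the spirit of existing transition-matrix methods; the parameterization, the log-determinant volume proxy, and the loss weight below are my assumptions rather than the DMMIvNL objective.

```python
# Minimal sketch: a learnable row-stochastic transition matrix T with a
# log-determinant "volume" penalty, trained against noisy labels.
import torch
import torch.nn as nn

class TransitionMatrix(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        # Initialize near identity so T starts close to "no label noise".
        self.logits = nn.Parameter(torch.eye(num_classes) * 4.0)

    def forward(self):
        return torch.softmax(self.logits, dim=1)     # rows sum to 1

def noisy_posterior_loss(clean_probs, noisy_labels, T, vol_weight=1e-2):
    """clean_probs: (batch, C) classifier outputs over clean classes.
    Noisy posterior = clean_probs @ T; the volume term log|det T| is minimized."""
    noisy_probs = clean_probs @ T
    nll = nn.functional.nll_loss(torch.log(noisy_probs + 1e-8), noisy_labels)
    volume = torch.logdet(T + 1e-6 * torch.eye(T.shape[0], device=T.device))
    return nll + vol_weight * volume
```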
Poster
Baili Xiao · Zhibin Dong · KE LIANG · Suyuan Liu · Siwei Wang · Tianrui Liu · Xingchen Hu · En Zhu · Xinwang Liu

[ ExHall D ]

Abstract
Multi-view clustering represents one of the most established paradigms within the field of unsupervised learning and has witnessed a surge in popularity in recent years. Contrastive learning over view pairs allows multiple views to be represented consistently by maximizing the mutual information between every two views, enabling multi-view clustering to discern consistent latent representations across views. However, two significant issues emerge with this approach: i) it is challenging to ascertain which two views are most appropriate for contrastive learning when there are more than three views, particularly without prior knowledge; ii) when all views are included in contrastive learning, multi-view clustering performance is compromised by poor-quality views. To tackle these issues, we present a novel Efficient Dual Selection Mechanism for deep Multi-View Clustering framework, termed EASEMVC. Specifically, EASEMVC first constructs a view graph based on the OT distance between the bipartite graphs of each view. It then designs a view selection module to realize an efficient view-level selection process through the view topology relations in the view graph structure. Additionally, a cross-view sample graph structure is constructed at the sample level, with the sample topological relations in the cross-view sample graph structure being employed to generate reliable sample …
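Since the view graph is built from OT distances, a minimal entropic-OT (Sinkhorn) sketch may help; the uniform marginals, regularization strength, and the way a cost matrix is derived from two views' anchors are assumptions, and EASEMVC's actual graph construction is not reproduced here.

```python
# Minimal sketch: entropic optimal transport distance between two anchor sets.
import torch

def sinkhorn_distance(cost, reg=0.05, n_iter=200):
    """cost: (n, m) transport cost between the anchors of two views."""
    n, m = cost.shape
    a = torch.full((n,), 1.0 / n)          # uniform source marginal (assumption)
    b = torch.full((m,), 1.0 / m)          # uniform target marginal (assumption)
    K = torch.exp(-cost / reg)
    u = torch.ones(n)
    for _ in range(n_iter):
        v = b / (K.t() @ u + 1e-12)
        u = a / (K @ v + 1e-12)
    plan = torch.diag(u) @ K @ torch.diag(v)
    return (plan * cost).sum()             # entropic OT cost between the views
```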
Poster
Jiyuan Liu · Xinwang Liu · chuankun Li · Xinhang Wan · Hao Tan · Yi Zhang · Weixuan Liang · Qian Qu · Yu Feng · Renxiang Guan · KE LIANG

[ ExHall D ]

Abstract
Multi-view clustering is a long-standing hot topic in the machine learning community, due to its capability of integrating data information from multiple sources and modalities. By utilizing the tensor Singular Value Decomposition (t-SVD) technique with the tensor rotation trick, recent advances have achieved remarkable improvements in clustering performance. However, we find this is attributed to the inadvertent use of sequential information of sorted data samples, i.e., inadvertent label use, which violates the unsupervised learning setting. On the other hand, existing large-scale approaches are mostly developed on the basis of matrix factorization or anchor techniques, and thereby fail to consider the similarities among all data samples, preventing further performance improvement. To address the above issues, we first analyze the tensor rotation trick and recommend removing it from tensor clustering. On this basis, a novel large-scale multi-view tensor clustering method is developed by incorporating the pairwise similarities with an implicit linear kernel function. To solve the resultant optimization problem, we design an efficient algorithm of linear complexity. Moreover, extensive experiments are conducted, and the corresponding results well support the aforementioned finding and validate the effectiveness and efficiency of the proposed method.
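The "implicit linear kernel" point can be illustrated with a short sketch: with S = X Xᵀ, products with S are computed without ever materializing the n-by-n similarity matrix, which is what keeps the complexity linear in the number of samples. The power-iteration usage below is my own example, not the paper's algorithm.

```python
# Minimal sketch: implicit pairwise similarities under a linear kernel.
import torch

def similarity_matvec(X, v):
    """X: (n_samples, n_features) view features; v: (n_samples,).
    Returns S @ v with S = X X^T, without materializing S."""
    return X @ (X.t() @ v)

def leading_eigvec(X, n_iter=50):
    """Example use: leading eigenvector of S via power iteration, O(n d) per step."""
    v = torch.randn(X.shape[0])
    for _ in range(n_iter):
        v = similarity_matvec(X, v)
        v = v / (v.norm() + 1e-12)
    return v
```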
Poster
JungKyoo Shin · Bumsoo Kim · Eunwoo Kim

[ ExHall D ]

Abstract
Multi-modal understanding plays a crucial role in artificial intelligence by enabling models to jointly interpret inputs from different modalities. However, conventional approaches such as contrastive learning often struggle with modality discrepancies, leading to potential misalignments. In this paper, we propose a novel class anchor alignment approach that leverages class probability distributions for multi-modal representation learning. Our method, Class-anchor-ALigned generative Modeling (CALM), encodes class anchors as prompts to generate and align class probability distributions for each modality, enabling more flexible alignment. Furthermore, we introduce a cross-modal probabilistic variational autoencoder to model uncertainty in the alignment, enhancing the ability to capture deeper relationships between modalities and data variations. Extensive experiments on four benchmark datasets demonstrate that our approach significantly outperforms state-of-the-art methods, especially in out-of-domain evaluations. This highlights its superior generalization capabilities in multi-modal representation learning.
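A minimal sketch of class-anchor alignment as described (not the authors' CALM code): each modality's features are turned into a class-probability distribution over shared anchors, and the two distributions are pulled together with a symmetric KL; the temperature and tensor shapes are assumptions, and the paper's probabilistic variational autoencoder component is omitted.

```python
# Minimal sketch: align per-modality class distributions against shared anchors.
import torch
import torch.nn.functional as F

def class_distribution(features, class_anchors, temperature=0.07):
    """Cosine similarity to each class anchor, softmaxed into a distribution."""
    f = F.normalize(features, dim=-1)
    a = F.normalize(class_anchors, dim=-1)
    return F.softmax(f @ a.t() / temperature, dim=-1)   # (batch, num_classes)

def alignment_loss(img_feats, txt_feats, class_anchors):
    """Symmetric KL between the two modalities' class distributions."""
    p = class_distribution(img_feats, class_anchors)
    q = class_distribution(txt_feats, class_anchors)
    kl_pq = F.kl_div(q.log(), p, reduction="batchmean")  # KL(p || q)
    kl_qp = F.kl_div(p.log(), q, reduction="batchmean")  # KL(q || p)
    return 0.5 * (kl_pq + kl_qp)
```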
Poster
Siyuan Duan · Yuan Sun · Dezhong Peng · Zheng Liu · Xiaomin Song · Peng Hu

[ ExHall D ]

Abstract
Cross-modal retrieval aims to match related samples across distinct modalities, facilitating the retrieval and discovery of heterogeneous information. Although existing methods show promising performance, most are deterministic models and are unable to capture the uncertainty inherent in the retrieval outputs, leading to potentially unreliable results. To address this issue, we propose a novel framework called FUzzy Multimodal lEarning (FUME), which is able to self-estimate epistemic uncertainty, thereby embracing trusted cross-modal retrieval. Specifically, our FUME leverages the Fuzzy Set Theory to view the outputs of the classification network as a set of membership degrees and quantify category credibility by incorporating both possibility and necessity measures. However, directly optimizing the category credibility could mislead the model by over-optimizing the necessity for unmatched categories. To overcome this challenge, we present a novel fuzzy multimodal learning strategy, which utilizes label information to guide necessity optimization in the right direction, thereby indirectly optimizing category credibility and achieving accurate decision uncertainty quantification. Furthermore, we design an uncertainty merging scheme that accounts for decision uncertainties, thus further refining uncertainty estimates and boosting the trustworthiness of retrieval results. Extensive experiments on five benchmark datasets demonstrate that FUME remarkably improves both retrieval performance and reliability, offering a prospective solution …
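The possibility and necessity measures from fuzzy set theory can be sketched directly on class membership degrees; the sigmoid-style memberships and the convex combination used for "category credibility" below are my assumptions, not FUME's exact formulation.

```python
# Minimal sketch: possibility/necessity measures over class membership degrees.
import torch

def possibility_necessity(membership):
    """membership: (batch, num_classes) degrees in [0, 1] (e.g., sigmoid outputs).
    Possibility of class c is its own membership; necessity is one minus the
    largest membership among the other classes."""
    possibility = membership
    top2 = membership.topk(2, dim=-1).values             # two largest per sample
    largest, second = top2[:, :1], top2[:, 1:2]
    # For the argmax class, the "other" maximum is the second-largest value.
    other_max = torch.where(membership == largest, second, largest)
    necessity = 1.0 - other_max
    return possibility, necessity

def category_credibility(membership, alpha=0.5):
    """A simple convex combination of possibility and necessity (assumption)."""
    pos, nec = possibility_necessity(membership)
    return alpha * pos + (1 - alpha) * nec
```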
Poster
Tal Zeevi · Ravid Shwartz-Ziv · Yann LeCun · Lawrence Staib · John A Onofrey

[ ExHall D ]

Abstract
Accurate uncertainty estimation is crucial for deploying neural networks in risk-sensitive applications such as medical diagnosis. Monte Carlo Dropout is a widely used technique for approximating predictive uncertainty by performing stochastic forward passes with dropout during inference. However, using static dropout rates across all layers and inputs can lead to suboptimal uncertainty estimates, as it fails to adapt to the varying characteristics of individual inputs and network layers. Existing approaches optimize dropout rates during training using labeled data, resulting in fixed inference-time parameters that cannot adjust to new data distributions, compromising uncertainty estimates in Monte Carlo simulations. In this paper, we propose Rate-In, an algorithm that dynamically adjusts dropout rates during inference by quantifying the information loss induced by dropout in each layer's feature maps. By treating dropout as controlled noise injection and leveraging information-theoretic principles, Rate-In adapts dropout rates per layer and per input instance without requiring ground truth labels. By quantifying the functional information loss in feature maps, we adaptively tune dropout rates to maintain perceptual quality across diverse medical imaging tasks and architectural configurations. Our empirical results on synthetic data and real-world medical imaging tasks demonstrate that Rate-In improves calibration and sharpens uncertainty estimates compared to fixed or …
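A rough proxy for the idea (not the Rate-In algorithm itself): per layer and per input, choose the largest inference-time dropout rate whose induced feature-map perturbation stays below a tolerance; the relative-MSE criterion, candidate grid, and tolerance are assumptions.

```python
# Minimal sketch: pick an input-adaptive dropout rate from the layer's features.
import torch
import torch.nn.functional as F

def pick_dropout_rate(feature_map, candidates=(0.05, 0.1, 0.2, 0.3, 0.5),
                      tol=0.05, n_trials=8):
    """Return the largest rate whose average relative MSE vs. the clean features
    stays below `tol` (0.0 if none qualifies)."""
    clean_energy = feature_map.pow(2).mean()
    best = 0.0
    for p in candidates:
        mses = []
        for _ in range(n_trials):
            noisy = F.dropout(feature_map, p=p, training=True)
            mses.append((noisy - feature_map).pow(2).mean())
        rel_mse = torch.stack(mses).mean() / (clean_energy + 1e-8)
        if rel_mse.item() <= tol:
            best = p
    return best
```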
Poster
Divya Velayudhan · Abdelfatah Ahmed · Mohamad Alansari · Neha Gour · Abderaouf Behouch · Taimur Hassan · Syed Talal Wasim · Nabil Maalej · Muzammal Naseer · Jürgen Gall · Mohammed Bennamoun · Ernesto Damiani · Naoufel Werghi

[ ExHall D ]

Abstract
Advancements in Computer-Aided Screening (CAS) systems are essential for improving the detection of security threats in X-ray baggage scans. However, current datasets are limited in representing real-world, sophisticated threats and concealment tactics, and existing approaches are constrained by a closed-set paradigm with predefined labels. To address these challenges, we introduce STCray, the first multimodal X-ray baggage security dataset, comprising 46,642 image-caption paired scans across 21 threat categories, generated using an X-ray scanner for airport security. STCray is meticulously developed with our specialized protocol that ensures domain-aware, coherent captions, leading to multi-modal instruction-following data in X-ray baggage security. This allows us to train a domain-aware visual AI assistant named STING-BEE that supports a range of vision-language tasks, including scene comprehension, referring threat localization, visual grounding, and visual question answering (VQA), establishing novel baselines for multi-modal learning in X-ray baggage security. Further, STING-BEE shows state-of-the-art generalization in cross-domain settings. Our code, data, and pre-trained models will be made publicly available.
Poster
Yang Yue · Yulin Wang · Chenxin Tao · Pan Liu · Shiji Song · Gao Huang

[ ExHall D ]

Abstract
Humans can develop internal world models that encode common sense knowledge, telling them how the world works and predicting the consequences of their actions. This concept has emerged as a promising direction for establishing general-purpose machine-learning models in recent preliminary works, e.g., for visual representation learning. In this paper, we present CheXWorld, the first effort towards a self-supervised world model for radiographic images. Specifically, our work develops a unified framework that simultaneously models three aspects of medical knowledge essential for qualified radiologists, including 1) local anatomical structures describing the fine-grained characteristics of local tissues (e.g., architectures, shapes, and textures); 2) global anatomical layouts describing the global organization of the human body (e.g., layouts of organs and skeletons); and 3) domain variations that encourage CheXWorld to model the transitions across different appearance domains of radiographs (e.g., varying clarity, contrast, and exposure caused by collecting radiographs from different hospitals, devices, or patients). Empirically, we design tailored qualitative and quantitative analyses, revealing that CheXWorld successfully captures these three dimensions of medical knowledge. Furthermore, transfer learning experiments across eight medical image classification and segmentation benchmarks showcase that CheXWorld significantly outperforms existing SSL methods and large-scale medical foundation models. Code & pre-trained models will be …
Poster
Jianwei Zhao · XIN LI · Fan Yang · Qiang Zhai · Ao Luo · Yang Zhao · Hong Cheng · Huazhu Fu

[ ExHall D ]

Abstract
Whole Slide Image (WSI) classification poses unique challenges due to the vast image size and numerous non-informative regions, which introduce noise and cause data imbalance during feature aggregation. To address these issues, we propose MExD, an Expert-Infused Diffusion Model that combines the strengths of a Mixture-of-Experts (MoE) mechanism with a diffusion model for enhanced classification. MExD balances patch feature distribution through a novel MoE-based aggregator that selectively emphasizes relevant information, effectively filtering noise, addressing data imbalance, and extracting essential features. These features are then integrated via a diffusion-based generative process to directly yield the class distribution for the WSI. Moving beyond conventional discriminative approaches, MExD represents the first generative strategy in WSI classification, capturing fine-grained details for robust and precise results. Our MExD is validated on three widely-used benchmarks—Camelyon16, TCGA-NSCLC, and BRACS—consistently achieving state-of-the-art performance in both binary and multi-class tasks. The model and code will be made publicly available upon acceptance.
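As a sketch of the aggregation idea only (MExD's actual aggregator and its diffusion-based classification head are not reproduced), a generic top-k Mixture-of-Experts gate over patch features might look like this; the dimensions and expert design are assumptions.

```python
# Minimal sketch: top-k MoE gating over WSI patch features, mean-pooled to a bag.
import torch
import torch.nn as nn

class TopKMoEAggregator(nn.Module):
    def __init__(self, feat_dim=768, num_experts=4, k=2):
        super().__init__()
        self.gate = nn.Linear(feat_dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.GELU())
            for _ in range(num_experts)])
        self.k = k

    def forward(self, patches):                 # patches: (num_patches, feat_dim)
        scores = self.gate(patches)                         # (P, E)
        topv, topi = scores.topk(self.k, dim=-1)
        weights = torch.softmax(topv, dim=-1)               # renormalize over top-k
        out = torch.zeros_like(patches)
        for slot in range(self.k):
            for e in range(len(self.experts)):
                mask = topi[:, slot] == e
                if mask.any():
                    w = weights[mask, slot].unsqueeze(1)
                    out[mask] += w * self.experts[e](patches[mask])
        # Bag-level embedding by mean pooling the expert-mixed patch features.
        return out.mean(dim=0)
```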
Poster
Xianrui Li · Yufei Cui · Jun Li · Antoni B. Chan

[ ExHall D ]

Abstract
Advances in medical imaging and deep learning have propelled progress in whole slide image (WSI) analysis, with multiple instance learning (MIL) showing promise for efficient and accurate diagnostics. However, conventional MIL models often lack adaptability to evolving datasets, as they rely on static training that cannot incorporate new information without extensive retraining. Applying continual learning (CL) to MIL models is a possible solution, but often sees limited improvements. In this paper, we analyze CL in the context of attention MIL models and find that the model forgetting is mainly concentrated in the attention layers of the MIL model. Using the results of this analysis, we propose two components for improving CL on MIL: Attention Knowledge Distillation (AKD) and the Pseudo-Bag Memory Pool (PMP). AKD mitigates catastrophic forgetting by focusing on retaining attention layer knowledge between learning sessions, while PMP reduces the memory footprint by selectively storing only the most informative patches, or "pseudo-bags", from WSIs. Experimental evaluations demonstrate that our method significantly improves both accuracy and memory efficiency on diverse WSI datasets, outperforming current state-of-the-art CL methods. This work provides a foundation for CL in large-scale, weakly annotated clinical datasets, paving the way for more adaptable and resilient diagnostic models.
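A minimal sketch of what attention distillation between sessions could look like (illustrative, not the paper's code); the `attention_scores` interface and the temperature are assumptions.

```python
# Minimal sketch: distill the previous session's MIL attention over a bag of
# patch features into the current model.
import torch
import torch.nn.functional as F

def attention_distillation_loss(old_model, new_model, bag_feats, temperature=2.0):
    """bag_feats: (num_patches, feat_dim) for one WSI 'bag'.
    Both models are assumed to expose `attention_scores(bag) -> (num_patches,)`."""
    with torch.no_grad():
        old_attn = old_model.attention_scores(bag_feats) / temperature
    new_attn = new_model.attention_scores(bag_feats) / temperature
    return F.kl_div(F.log_softmax(new_attn, dim=0),
                    F.softmax(old_attn, dim=0),
                    reduction="sum") * (temperature ** 2)
```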
Poster
Hang Shi · Chi Changxi · Peng Wan · Daoqiang Zhang · WEI SHAO

[ ExHall D ]

Abstract
The rapid development of spatial transcriptomics (ST) allows researchers to measure spatial-level gene expression in tissues. Although powerful, collecting ST data is expensive, and thus several studies aim to predict gene expression in ST from the corresponding H&E-stained pathology images. Existing ST-based gene expression prediction models either adopt pre-trained networks or rely on handcrafted features to describe the pathology images, and still lack a systematic way to combine the two into a spot-level representation that reflects the topological profiles of different spots. On the other hand, all ST-based gene prediction models treat the prediction task for each gene independently, overlooking the fact that exploring the potential interrelationships among genes can help improve the prediction performance for individual genes. To address the above issues, we propose a multi-modal topology-embedded graph learning algorithm guided by prior Gene Ontology similarity information (i.e., M2TGLGO) to predict spatially resolved genes from pathology images. Specifically, M2TGLGO co-learns the image representation of different spots from both deep and handcrafted features by considering within-modal and inter-modal interactions. Next, to keep the topological structure among different spots, a spatial-oriented ranking module …
Poster
Marcus Nordström · Atsuto Maki · Henrik Hult

[ ExHall D ]

Abstract
In image segmentation, and specifically in medical image segmentation, the soft-Dice loss is often chosen instead of the more traditional cross-entropy loss to improve performance with respect to the Dice metric. Experimental work supporting this claim exists, but how and why the two loss functions lead to different predictions is not well understood. This paper explains the observed discrepancy as a consequence of how those loss functions are affected by label noise and of what threshold is used to convert the predicted soft segmentation into predicted labels. In particular, it is shown (i) how the optimal solutions to the two loss functions diverge as the noise is increased, and (ii) how the optimal solutions to soft-Dice can be recovered by thresholding the solutions to cross-entropy with an a priori unknown but efficiently computable threshold. The theoretical results are supported by numerical experiments, and it is concluded that cross-entropy with the alternative threshold yields the stability and informative label probability maps associated with cross-entropy without sacrificing the performance of soft-Dice.
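For reference, a small sketch of the two ingredients discussed above: the soft-Dice loss on predicted probabilities, and label recovery from a probability map with a threshold other than 0.5; tensor shapes are assumptions, and the paper's procedure for computing the alternative threshold is not reproduced.

```python
# Minimal sketch: binary soft-Dice loss and thresholded label recovery.
import torch

def soft_dice_loss(probs, targets, eps=1e-6):
    """probs, targets: (batch, H, W) foreground probabilities / binary masks."""
    inter = (probs * targets).sum(dim=(1, 2))
    denom = probs.sum(dim=(1, 2)) + targets.sum(dim=(1, 2))
    return 1.0 - ((2 * inter + eps) / (denom + eps)).mean()

def labels_from_probs(probs, threshold=0.5):
    """The paper argues that, under label noise, applying a tuned (non-0.5)
    threshold to cross-entropy probabilities can recover soft-Dice-like labels."""
    return (probs > threshold).float()
```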
Poster
Yunhe Gao · Di Liu · Zhuowei Li · Yunsheng Li · Dongdong Chen · Mu Zhou · Dimitris N. Metaxas

[ ExHall D ]

Abstract
Medical image segmentation remains challenging due to the vast diversity of anatomical structures, imaging modalities, and segmentation tasks. While deep learning has made significant advances, current approaches struggle to generalize as they require task-specific training or fine-tuning on unseen classes. We present Iris, a novel In-context Reference Image guided Segmentation framework that enables flexible adaptation to novel tasks through the use of reference examples without fine-tuning. At its core, Iris features a lightweight context task encoding module that distills task-specific information from reference context image-label pairs. This rich context embedding is used to guide the segmentation of target objects. Given its decoupled architecture for 3D data processing, Iris supports diverse inference strategies including one-shot inference, context example ensemble, object-level context example retrieval, and in-context tuning. Through comprehensive evaluation across twelve datasets, we demonstrate that Iris performs strongly compared to specialized supervised models on in-distribution tasks. On seven held-out datasets, Iris shows superior generalization to out-of-distribution data and unseen classes. Further, Iris's task encoding module can automatically discover anatomical relationships across datasets and modalities, offering insights into cross-modality medical objects without explicit anatomical supervision.
Poster
Junlong Cheng · Bin Fu · Jin Ye · Guoan Wang · Tianbin Li · Haoyu Wang · Ruoyu Li · He Yao · Chen Junren · Jingwen Li · Yanzhou Su · Min Zhu · Junjun He

[ ExHall D ]

Abstract
Interactive Medical Image Segmentation (IMIS) has long been constrained by the limited availability of large-scale, diverse, and densely annotated datasets, which hinders model generalization and consistent evaluation across different models. In this paper, we introduce the IMed-361M benchmark dataset, a significant advancement in general IMIS research. First, we collect and standardize over 6.4 million medical images and their corresponding ground truth masks from multiple data sources. Then, leveraging the strong object recognition capabilities of a vision foundational model, we automatically generated dense interactive masks for each image and ensured their quality through rigorous quality control and granularity management. Unlike previous datasets, which are limited by specific modalities or sparse annotations, IMed-361M spans 14 modalities and 204 segmentation targets, totaling 361 million masks—an average of 56 masks per image. Finally, we developed an IMIS baseline network on this dataset that supports high-quality mask generation through interactive inputs, including clicks, bounding boxes, text prompts, and their combinations. We evaluate its performance on medical image segmentation tasks from multiple perspectives, demonstrating superior accuracy and scalability compared to existing interactive segmentation models. To facilitate research on foundational models in medical computer vision, we release the IMed-361M and model at https://anonymous.4open.science/r/IMIS-Bench-FF8B.
Poster
Yanfeng Zhou · Lingrui Li · Le Lu · Minfeng Xu

[ ExHall D ]

Abstract
Semantic segmentation is a crucial prerequisite in clinical applications and computer-aided diagnosis. With the development of deep neural networks, biomedical image segmentation has achieved remarkable success. Encoder-decoder architectures that integrate convolutions and transformers are gaining attention for their potential to capture both global and local features. However, current designs face the contradiction that these two features cannot be continuously transmitted. In addition, some models lack a unified and standardized evaluation benchmark, leading to significant discrepancies in the experimental setup. In this study, we review and summarize these architectures and analyze their contradictions in design. We modify UNet and propose WNet to combine transformers and convolutions, addressing the transmission issue effectively. WNet captures long-range dependencies and local details simultaneously while ensuring their continuous transmission and multi-scale fusion. We integrate WNet into the nnUNet framework for unified benchmarking. Our model achieves state-of-the-art performance in biomedical image segmentation. Extensive experiments demonstrate its effectiveness on four 2D datasets (DRIVE, ISIC-2017, Kvasir-SEG, and CREMI) and four 3D datasets (Parse2022, AMOS22, BTCV, and ImageCAS). The code is available at https://github.com/XXX.
Poster
Yufan He · Pengfei Guo · Yucheng Tang · Andriy Myronenko · Vishwesh Nath · Ziyue Xu · Dong Yang · Can Zhao · Benjamin D. Simon · Mason Belue · Stephanie Anne Harmon · Baris Turkbey · Daguang Xu · Wenqi Li

[ ExHall D ]

Abstract
Foundation models for interactive segmentation in 2D natural images and videos have sparked significant interest in building 3D foundation models for medical imaging. However, the domain gaps and clinical use cases for 3D medical imaging require a dedicated model that diverges from existing 2D solutions. Specifically, such foundation models should support a full workflow that can actually reduce human effort. Treating 3D medical images as sequences of 2D slices and reusing interactive 2D foundation models seems straightforward, but 2D annotation is too time-consuming for 3D tasks. Moreover, for large cohort analysis, it is the highly accurate automatic segmentation models that reduce the most human effort. However, these models lack support for interactive corrections and lack zero-shot ability for novel structures, which is a key feature of "foundation". While reusing pre-trained 2D backbones in 3D enhances zero-shot potential, their performance on complex 3D structures still lags behind leading 3D models. To address these issues, we present VISTA3D, a Versatile Imaging SegmenTation and Annotation model that aims to solve all these challenges and requirements with one unified foundation model. VISTA3D is built on top of the well-established 3D segmentation pipeline, and it is the first model to achieve state-of-the-art performance in both 3D automatic (supporting …
Poster
Bastian Wittmann · Yannick Wattenberg · Tamaz Amiranashvili · Suprosanna Shit · Bjoern Menze

[ ExHall D ]

Abstract
Segmenting 3D blood vessels is a critical yet challenging task in medical image analysis. This is due to significant imaging modality-specific variations in artifacts, vascular patterns and scales, signal-to-noise ratios, and background tissues. These variations, along with domain gaps arising from varying imaging protocols, limit the generalization of existing supervised learning-based methods, requiring tedious voxel-level annotations for each dataset separately. While foundation models promise to alleviate this limitation, they typically fail to generalize to the task of blood vessel segmentation, posing a unique, complex problem. In this work, we present vesselFM, a foundation model designed specifically for the broad task of 3D blood vessel segmentation. Unlike previous models, vesselFM can effortlessly generalize to unseen domains. To achieve zero-shot generalization, we train vesselFM on three heterogeneous data sources: a large, curated annotated dataset, data generated by a domain randomization scheme, and data sampled from a flow matching-based generative model. Extensive evaluations show that vesselFM outperforms state-of-the-art medical image segmentation foundation models across four (pre-)clinically relevant imaging modalities in zero-, one-, and few-shot scenarios, therefore providing a universal solution for 3D blood vessel segmentation.
Poster
zhuangzhuang chen · hualiang wang · Chubin Ou · Xiaomeng Li

[ ExHall D ]

Abstract
Optical coherence tomography angiography (OCTA) is of great importance for imaging microvascular networks, as it provides accurate 3D imaging of blood vessels, but it relies on specialized sensors and expensive devices. For this reason, previous works have shown the potential of translating readily available 3D optical coherence tomography (OCT) images into 3D OCTA images. However, existing OCTA translation methods directly learn the mapping from the OCT domain to the OCTA domain in a continuous and infinite space with guidance from only a single view, i.e., the OCTA projection map, resulting in suboptimal reconstruction results. To this end, we propose a multi-view Tri-alignment framework for OCT to OCTA 3D image translation in a discrete and finite space, named MuTri. In the first stage, we pre-train two vector-quantized variational auto-encoders (VQVAEs) via the reconstruction of 3D OCT and 3D OCTA data, providing semantic priors for the subsequent multi-view guidance. In the second stage, our multi-view tri-alignment facilitates another VQVAE model to learn the mapping from the OCT domain to the OCTA domain in a discrete and finite space. Specifically, a contrastive-inspired semantic alignment is proposed to maximize the mutual information with the pre-trained models from the OCT and OCTA views, to facilitate codebook learning. Meanwhile, a vessel structure …
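Since the first stage rests on vector-quantized autoencoders, a standard VQ-VAE quantization step (nearest-code lookup, straight-through estimator, and codebook/commitment losses) is sketched below for reference; this is generic VQ-VAE machinery, not MuTri's implementation.

```python
# Minimal sketch: a standard VQ-VAE vector-quantization layer.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=512, code_dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1 / num_codes, 1 / num_codes)
        self.beta = beta

    def forward(self, z):                       # z: (batch, code_dim)
        d = torch.cdist(z, self.codebook.weight)            # distances to codes
        idx = d.argmin(dim=1)
        z_q = self.codebook(idx)
        # Commitment + codebook losses, straight-through estimator for gradients.
        loss = ((z_q.detach() - z) ** 2).mean() * self.beta \
             + ((z_q - z.detach()) ** 2).mean()
        z_q = z + (z_q - z).detach()
        return z_q, idx, loss
```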