Poster
Gaoxiang Cong · Jiadong Pan · Liang Li · Yuankai Qi · Yuxin Peng · Anton van den Hengel · Jian Yang · Qingming Huang
[ ExHall D ]
Abstract
Given a piece of text, a video clip, and a reference audio, the movie dubbing task aims to generate speech that aligns with the video while cloning the desired voice. The existing methods have two primary deficiencies: (1) They struggle to simultaneously hold audio-visual sync and achieve clear pronunciation; (2) They lack the capacity to express user-defined emotions. To address these problems, we propose EmoDubber, an emotion-controllable dubbing architecture that allows users to specify emotion type and emotional intensity while satisfying high-quality lip sync and pronunciation. Specifically, we first design Lip-related Prosody Aligning (LPA), which focuses on learning the inherent consistency between lip motion and prosody variation by duration-level contrastive learning to incorporate reasonable alignment. Then, we design a Pronunciation Enhancing (PE) strategy that fuses the video-level phoneme sequences with an efficient conformer to improve speech intelligibility. Next, the speaker identity adapting module decodes the acoustic prior and injects the speaker style embedding. After that, the proposed Flow-based User Emotion Controlling (FUEC) synthesizes the waveform with a flow matching prediction network conditioned on the acoustic prior. In this process, FUEC determines the gradient direction and guidance scale based on the user's emotion instructions by the positive and negative guidance …
Poster
Jihoon Kim · Jeongsoo Choi · Jaehun Kim · Chaeyoung Jung · Joon Chung
[ ExHall D ]
Abstract
The objective of this study is to generate high-quality speech from silent talking face videos, a task also known as video-to-speech synthesis. A significant challenge in video-to-speech synthesis lies in the substantial modality gap between silent video and multi-faceted speech. In this paper, we propose a novel video-to-speech system that effectively bridges this modality gap, significantly enhancing the quality of synthesized speech. This is achieved by learning hierarchical representations from video to speech. Specifically, we gradually transform silent video into acoustic feature spaces through three sequential stages: content, timbre, and prosody modeling. In each stage, we align visual factors (lip movements, face identity, and facial expressions) with their corresponding acoustic counterparts to ensure a seamless transformation. Additionally, to generate realistic and coherent speech from the visual representations, we employ a flow matching model that estimates direct trajectories from a simple prior distribution to the target speech distribution. Extensive experiments demonstrate that our method achieves exceptional generation quality comparable to real utterances, outperforming existing methods by a significant margin.
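For context, flow matching models of the kind referenced here are commonly trained by regressing a velocity field along straight interpolation paths between a prior sample and the target; the sketch below shows that generic objective under those assumptions, where `velocity_net` and `cond` are hypothetical placeholders rather than this paper's actual architecture or conditioning.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(velocity_net, x1, cond):
    """Conditional flow matching loss on straight interpolation paths (generic sketch).

    x1   : target speech features, shape (B, T, D)
    cond : visual conditioning features (hypothetical), shape (B, T, C)
    """
    x0 = torch.randn_like(x1)              # sample from the simple prior
    t = torch.rand(x1.shape[0], 1, 1)      # random time in [0, 1] per sample
    xt = (1.0 - t) * x0 + t * x1           # point on the straight path at time t
    target_velocity = x1 - x0              # constant velocity of that path
    pred = velocity_net(xt, t.flatten(), cond)
    return F.mse_loss(pred, target_velocity)
```

At inference, sampling then amounts to integrating the learned velocity field from a prior sample toward the speech distribution.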
Poster
Yinuo Wang · Yanbo Fan · Xuan Wang · Yu Guo · Fei Wang
[ ExHall D ]
Abstract
Listening head generation aims to synthesize non-verbal, responsive listening head videos that naturally react to a given speaker, for which realistic head movements, expressive facial expressions, and high visual quality are all expected. Previous approaches typically follow a two-stage pipeline that first generates intermediate 3D motion signals such as 3DMM coefficients, and then synthesizes the videos by deterministic rendering, suffering from limited motion expressiveness and low visual quality (e.g., 256×256). In this work, we propose a novel listening head generation method that harnesses the generative capabilities of the diffusion model for both motion generation and high-quality rendering. Crucially, we propose an effective hybrid motion modeling module that addresses training difficulties caused by the scarcity of listening head data while preserving the intricate details that may be lost in explicit motion representations. We further develop a tailored control guidance for head pose and facial expression by integrating their intrinsic motion characteristics. Our method enables high-fidelity video generation at 512×512 resolution and delivers vivid listener motion feedback. We conduct comprehensive experiments and obtain superior performance in terms of both visual quality and motion expressiveness compared to existing methods.
Poster
Enric Corona · Andrei Zanfir · Eduard Gabriel Bazavan · NIKOS KOLOTOUROS · Thiemo Alldieck · Cristian Sminchisescu
[ ExHall D ]
Abstract
We propose VLOGGER, a method for audio-driven human video generation from a single input image of a person, which builds on the success of recent generative diffusion models. Our method consists of 1) a stochastic human-to-3D-motion diffusion model, and 2) a novel diffusion-based architecture that augments text-to-image models with both spatial and temporal controls. This supports the generation of high-quality video of variable length, easily controllable through high-level representations of human faces and bodies. In contrast to previous work, our method does not require training for each person, does not rely on face detection and cropping, generates the complete image (not just the face or the lips), and considers a broad spectrum of scenarios (e.g., visible torso or diverse subject identities) that are critical to correctly synthesize humans who communicate. We also curate MENTOR, a new and diverse dataset with 3D pose and expression annotations, one order of magnitude larger than previous ones (800,000 identities) and with dynamic gestures, on which we train and ablate our main technical contributions. VLOGGER outperforms state-of-the-art methods in three public benchmarks, considering lip-syncing, image quality, identity preservation, and temporal consistency, while also generating upper-body gestures. We analyze the performance of VLOGGER with respect to multiple diversity …
Poster
Zunnan Xu · Zhentao Yu · Zixiang Zhou · Jun Zhou · Xiaoyu Jin · Fa-Ting Hong · Xiaozhong Ji · Junwei Zhu · Chengfei Cai · Shiyu Tang · Qin Lin · Xiu Li · qinglin lu
[ ExHall D ]
Abstract
We introduce ImPortrait, a diffusion-based condition control method that employs implicit representations for highly controllable and lifelike portrait animation. Given a single portrait image as an appearance reference and video clips as driving templates, ImPortrait can animate the character in the reference image according to the facial expressions and head poses of the driving videos. In our framework, we utilize pre-trained encoders to decouple portrait motion information and identity in videos. To do so, an implicit representation is adopted to encode motion information and is employed as the control signal in the animation phase. By leveraging the power of Stable Video Diffusion (SVD) as the main building block, we carefully design adapter layers to inject control signals into the denoising UNet through attention mechanisms, which brings spatial richness of detail and temporal consistency. ImPortrait also exhibits strong generalization performance and can effectively disentangle appearance and motion across different image styles. Our framework outperforms existing methods, demonstrating superior temporal consistency and controllability.
Poster
Jianwen Jiang · Gaojie Lin · Zhengkun Rong · Chao Liang · Yongming Zhu · Jiaqi Yang · Tianyun Zhong
[ ExHall D ]
Abstract
Existing neural head avatar methods have achieved significant progress in the image quality and motion range of portrait animation. However, these methods neglect the computational overhead, and to the best of our knowledge, none is designed to run on mobile devices. This paper presents MobilePortrait, a lightweight one-shot neural head avatar method that reduces learning complexity by integrating external knowledge into both motion modeling and image synthesis, enabling real-time inference on mobile devices. Specifically, we introduce a mixed representation of explicit and implicit keypoints for precise motion modeling and precomputed visual features for enhanced foreground and background synthesis. With these two key designs and by using simple U-Nets as backbones, our method achieves performance on par with state-of-the-art methods while requiring less than one-tenth of the computation. It has been validated to reach speeds of over 100 FPS on mobile devices and to support both video- and audio-driven inputs.
Poster
Wojciech Zielonka · Timo Bolkart · Thabo Beeler · Justus Thies
[ ExHall D ]
Abstract
Current personalized neural head avatars face a trade-off: lightweight models lack detail and realism, while high-quality, animatable avatars require significant computational resources, making them unsuitable for commodity devices. To address this gap, we introduce Gaussian Eigen Models (GEM), which provide high-quality, lightweight, and easily controllable head avatars. GEM utilizes 3D Gaussian primitives for representing the appearance, combined with Gaussian splatting for rendering. Building on the success of mesh-based 3D morphable face models (3DMM), we define GEM as an ensemble of linear eigenbases for representing the head appearance of a specific subject. In particular, we construct linear bases to represent the position, scale, rotation, and opacity of the 3D Gaussians. This allows us to efficiently generate Gaussian primitives of a specific head shape by a linear combination of the basis vectors, requiring only a low-dimensional parameter vector that contains the respective coefficients. We propose to construct these linear bases (GEM) by distilling high-quality, compute-intense CNN-based Gaussian avatar models that can generate expression-dependent appearance changes like wrinkles. These high-quality models are trained on multi-view videos of a subject and are distilled using a series of principal component analyses. Once we have obtained the bases that represent the animatable appearance space of a …
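To make the "linear eigenbasis" idea concrete, the sketch below shows how a PCA-style basis could map a low-dimensional coefficient vector to per-Gaussian attributes; the dimensions, placeholder data, and variable names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Distillation data: N avatar states sampled from a CNN teacher, each flattened
# into one vector of Gaussian attributes (positions, scales, rotations, opacities).
samples = np.random.randn(200, 5_000)          # placeholder (N, D) matrix

mean = samples.mean(axis=0)
U, S, Vt = np.linalg.svd(samples - mean, full_matrices=False)
k = 32
basis = Vt[:k]                                 # (k, D) linear eigenbasis

def decode(coeffs):
    """Reconstruct all Gaussian attributes from a k-dimensional coefficient vector."""
    return mean + coeffs @ basis

# Driving the avatar then reduces to supplying (or regressing) `coeffs` per frame.
frame_attributes = decode(np.zeros(k))
```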
Poster
Zhenglin Zhou · Fan Ma · Hehe Fan · Tat-seng Chua
[ ExHall D ]
Abstract
Animatable head avatar generation typically requires extensive data for training. To reduce the data requirements, a natural solution is to leverage existing data-free static avatar generation methods, such as pre-trained diffusion models with score distillation sampling (SDS), which align avatars with pseudo-ground-truth outputs from the diffusion model. However, directly distilling 4D avatars from video diffusion often leads to over-smooth results due to spatial and temporal inconsistencies in the video diffusion generation. To address this issue, we propose Symbiotic GENeration (SymGEN), a robust method that uses the video diffusion model to synthesize spatially and temporally consistent datasets for 4D avatar reconstruction. Specifically, SymGEN iteratively constructs video datasets and optimizes animatable avatars in a progressive manner, ensuring that avatar quality increases smoothly and consistently throughout the learning process. This progressive learning involves two stages: (1) Spatial Consistency Learning fixes expressions and learns from front-to-side views, and (2) Temporal Consistency Learning fixes views and learns from relaxed to exaggerated expressions, generating 4D avatars in a simple-to-complex manner. Extensive experiments demonstrate that SymGEN improves fidelity, animation quality, and rendering speed compared to existing diffusion-based methods, providing a solution for lifelike avatar creation. The code will be publicly available.
Poster
Hyunsoo Cha · Inhee Lee · Hanbyul Joo
[ ExHall D ]
Abstract
We present PERSE, a method for building an animatable personalized generative avatar from a reference portrait. Our avatar model enables facial attribute editing in a continuous and disentangled latent space to control each facial attribute while preserving the individual's identity. To achieve this, our method begins by synthesizing large-scale synthetic 2D video datasets, where each video contains consistent changes in facial expression and viewpoint, combined with a variation of a specific facial attribute from the original input. We propose a novel pipeline to produce high-quality, photorealistic 2D videos with facial attribute editing. Leveraging this synthetic attribute dataset, we present a personalized avatar creation method based on 3D Gaussian Splatting, learning a continuous and disentangled latent space for intuitive facial attribute manipulation. To enforce smooth transitions in this latent space, we introduce a latent space regularization technique that uses interpolated 2D faces as supervision. Compared to previous approaches, we demonstrate that PERSE generates high-quality avatars with interpolated attributes while preserving the identity of the reference person.
Poster
Zihao Huang · Shoukang Hu · Guangcong Wang · Tianqi Liu · Yuhang Zang · Zhiguo Cao · Wei Li · Ziwei Liu
[ ExHall D ]
Abstract
Existing research on avatar creation is typically limited to laboratory datasets, which are costly to scale and insufficiently representative of the real world. On the other hand, the web abounds with off-the-shelf real-world human videos, but these videos vary in quality and require accurate annotations for avatar creation. To this end, we propose an automatic annotation pipeline with filtering protocols to curate these human videos from the web. Our pipeline surpasses state-of-the-art methods on the EMDB benchmark, and the filtering protocols boost verification metrics on web videos. We then curate WildAvatar, a web-scale in-the-wild human avatar creation dataset extracted from YouTube, with 10,000+ different human subjects and scenes. WildAvatar is at least 10× richer than previous datasets for 3D human avatar creation and closer to the real world. To explore its potential, we demonstrate the quality and generalizability of avatar creation methods on WildAvatar. We will publicly release our code, data source links, and annotations to push forward 3D human avatar creation and other related fields for real-world applications.
Poster
Hanxi Liu · Yifang Men · Zhouhui Lian
[ ExHall D ]
Abstract
Personalized 3D avatar editing holds significant promise due to its user-friendliness and applicability in areas such as AR/VR and virtual try-on. Previous studies have explored the feasibility of 3D editing, but they often struggle to generate visually pleasing results, possibly due to unstable representation learning under the mixed optimization of geometry and texture in complicated reconstructed scenarios. In this paper, we aim to provide an accessible solution for ordinary users to create their own editable 3D avatars with precise region localization, geometric adaptability, and photorealistic renderings. To tackle this challenge, we introduce a meticulously designed framework that decouples the editing process into local spatial adaptation and realistic appearance learning, utilizing a hybrid Tetrahedron-constrained Gaussian Splatting (TetGS) as the underlying representation. TetGS combines the controllable explicit structure of tetrahedral grids with the high-precision rendering capabilities of 3D Gaussian Splatting and is optimized in a progressive manner comprising three stages: 3D avatar instantiation from real-world monocular videos to provide accurate priors for TetGS initialization; localized spatial adaptation with explicitly partitioned tetrahedrons to guide the redistribution of Gaussian kernels; and geometry-based appearance generation with a coarse-to-fine activation strategy. Both qualitative and quantitative experiments demonstrate the effectiveness and superiority of our approach in generating photorealistic 3D …
Poster
Hang Ye · Xiaoxuan Ma · Hai Ci · Wentao Zhu · Yizhou Wang
[ ExHall D ]
Abstract
Achieving realistic animated human avatars requires accurate modeling of pose-dependent clothing deformations. Existing learning-based methods heavily rely on the Linear Blend Skinning (LBS) of minimally-clothed human models like SMPL to model deformation. However, these methods struggle to handle loose clothing, such as long dresses, where the canonicalization process becomes ill-defined when the clothing is far from the body, leading to disjointed and fragmented results. To overcome this limitation, we propose a novel hybrid framework to model challenging clothed humans. Our core idea is to use dedicated strategies to model different regions, depending on whether they are close to or distant from the body. This free-form generation paradigm brings enhanced flexibility and expressiveness to our hybrid framework, enabling it to capture the intricate geometric details of challenging loose clothing, such as skirts and dresses. Experimental results on the benchmark dataset featuring loose clothing demonstrate that our method achieves state-of-the-art performance with superior visual fidelity and realism, particularly in the most challenging cases.
Poster
Chaoyue Song · Jianfeng Zhang · Xiu Li · Fan Yang · Yiwen Chen · Zhongcong Xu · Jun Hao Liew · Xiaoyang Guo · Fayao Liu · Jiashi Feng · Guosheng Lin
[ ExHall D ]
Abstract
With the explosive growth of 3D content creation, there is an increasing demand for automatically converting static 3D models into articulation-ready versions that support realistic animation. Traditional approaches rely heavily on manual annotation, which is both time-consuming and labor-intensive. Moreover, the lack of large-scale benchmarks has hindered the development of learning-based solutions. In this work, we present MagicArticulate, an effective framework that automatically transforms static 3D models into articulation-ready assets. Our key contributions are threefold. First, we introduce Articulation-XL, a large-scale benchmark containing over 33k 3D models with high-quality articulation annotations, carefully curated from Objaverse-XL. Second, we propose a novel skeleton generation method that formulates the task as a sequence modeling problem, leveraging an auto-regressive transformer to naturally handle varying numbers of bones or joints within skeletons and their inherent dependencies across different 3D models. Third, we predict skinning weights using a functional diffusion process that incorporates volumetric geodesic distance priors between vertices and joints. Extensive experiments demonstrate that MagicArticulate significantly outperforms existing methods across diverse object categories, achieving high-quality articulation that enables realistic animation. We will release our dataset and model to support further research.
Poster
Peng Li · Wangguandong Zheng · Yuan Liu · Tao Yu · Yangguang Li · Xingqun Qi · Xiaowei Chi · Siyu Xia · Yan-Pei Cao · Wei Xue · Wenhan Luo · Yike Guo
[ ExHall D ]
Abstract
Photorealistic 3D human modeling is essential for various applications and has seen tremendous progress. However, existing methods for monocular full-body reconstruction, typically relying on front and/or predicted back view, still struggle with satisfactory performance due to the ill-posed nature of the problem and sophisticated self-occlusions. In this paper, we propose PSHuman, a novel framework that explicitly reconstructs human meshes utilizing priors from the multiview diffusion model. It is found that directly applying multiview diffusion on single-view human images leads to severe geometric distortions, especially on generated faces. To address it, we propose a cross-scale diffusion that models the joint probability distribution of global full-body shape and local facial characteristics, enabling detailed and identity-preserved novel-view generation without any geometric distortion. Moreover, to enhance cross-view body shape consistency of varied human poses, we condition the generative model on parametric models like SMPL-X, which provide body priors and prevent unnatural views inconsistent with human anatomy. Leveraging the generated multiview normal and color images, we present SMPLX-initialized explicit human carving to recover realistic textured human meshes efficiently. Extensive experiments on CAPE and THuman2.1 datasets demonstrate PSHuman's superiority in geometry details, texture fidelity, and generalization capability.
Poster
Jiaqi Liu · Jichao Zhang · Paolo Rota · Nicu Sebe
[ ExHall D ]
Abstract
The Latent Diffusion Model (LDM) has demonstrated strong capabilities in high-resolution image generation and has been widely employed for Pose-Guided Person Image Synthesis (PGPIS), yielding promising results. However, the compression process of LDM often results in the deterioration of details, particularly in sensitive areas such as facial features and clothing textures. In this paper, we propose a Multi-focal Conditioned Latent Diffusion (MCLD) method to address these limitations by conditioning the model on disentangled, pose-invariant features from these sensitive regions. Our approach utilizes a multi-focal condition aggregation module, which effectively integrates facial identity and texture-specific information, enhancing the model's ability to produce appearance-realistic and identity-consistent images. Our method demonstrates consistent identity and appearance generation on the DeepFashion dataset and enables flexible person image editing thanks to its generation consistency. The code will be released upon acceptance.
Poster
Nannan Zhang · Yijiang Li · Dong Du · Zheng Chong · Zhengwentai Sun · Jianhao Zeng · Yusheng Dai · Zhenyu Xie · Hairui Zhu · Xiaoguang Han
[ ExHall D ]
Abstract
This paper tackles the emerging challenge of multi-view virtual try-on, utilizing both front- and back-view clothing images as inputs. Extending frontal try-on methods to a multi-view context is not straightforward. Simply concatenating the two input views or encoding their features for a generative model, such as a diffusion model, often fails to produce satisfactory results. The main challenge lies in effectively extracting and fusing meaningful clothing features from these input views. Existing explicit warping-based methods, which establish direct correspondence between input and target views, tend to introduce artifacts, particularly when there is a significant disparity between the input and target views. Conversely, implicit encoding-based methods often lose spatial information about clothing, resulting in outputs that lack detail. To overcome these challenges, we propose Robust-MVTON, an end-to-end method for robust and high-quality multi-view try-ons. Our approach introduces a novel cross-pose feature alignment technique to guide the fusion of clothing features and incorporates a newly designed loss function for training. With the fused multi-scale clothing features, we employ a coarse-to-fine diffusion model to generate realistic and detailed results. Extensive experiments conducted on the Deepfashion and MPV datasets affirm the superiority of our method, achieving state-of-the-art performance.
Poster
Yang Zheng · Menglei Chai · Delio Vicini · Yuxiao Zhou · Yinghao Xu · Leonidas Guibas · Gordon Wetzstein · Thabo Beeler
[ ExHall D ]
Abstract
We present GroomLight, a novel method for relightable hair appearance modeling from multi-view images. Existing hair capture methods struggle to balance photorealistic rendering with relighting capabilities. Analytical material models, while physically grounded, often fail to fully capture appearance details. Conversely, neural rendering approaches excel at view synthesis but generalize poorly to novel lighting conditions. GroomLight addresses this challenge by combining the strengths of both paradigms. It employs an extended hair BSDF model to capture primary light transport and a light-aware residual model to reconstruct the remaining details. We further propose a hybrid inverse rendering pipeline to optimize both components, enabling high-fidelity relighting, view synthesis, and material editing. Extensive evaluations on real-world hair data demonstrate state-of-the-art performance.
Poster
Xingyu Ren · Jiankang Deng · Yuhao Cheng · Wenhan Zhu · Yichao Yan · Xiaokang Yang · Stefanos Zafeiriou · Chao Ma
[ ExHall D ]
Abstract
Recent 3D face reconstruction methods have made remarkable advancements, yet achieving high-quality facial reflectance from monocular input remains challenging. Existing methods rely on the light-stage captured data to learn facial reflectance models. However, limited subject diversity in these datasets poses challenges in achieving good generalization and broad applicability. This motivates us to explore whether the extensive priors captured in recent generative diffusion models (e.g., Stable Diffusion) can enable more generalizable facial reflectance estimation as these models have been pre-trained on large-scale internet image collections containing rich visual patterns. In this paper, we introduce the use of Stable Diffusion as a prior for facial reflectance estimation, achieving robust results with minimal captured data for fine-tuning. We present S3-Face, a comprehensive framework capable of producing SSS-compliant skin reflectance from in-the-wild images. Our method adopts a two-stage training approach: in the first stage, DSN-Net is trained to predict diffuse albedo, specular albedo, and normal maps from in-the-wild images using a novel joint reflectance attention module. In the second stage, HM-Net is trained to generate hemoglobin and melanin maps based on the diffuse albedo predicted in the first stage, yielding SSS-compliant and detailed reflectance maps. Extensive experiments demonstrate that our method achieves strong generalization …
Poster
Yizhilv · Xiao Lu · Hong Ding · Jingbo Hu · Zhi Jiang · Chunxia Xiao
[ ExHall D ]
Abstract
Eyeglass reflection removal can restore the texture information in the reflection-corrupted eye area, which is meaningful for various tasks on facial images. It remains challenging to correctly eliminate reflections, reasonably restore the lost content, and guarantee that the final result has color and illumination consistent with the input image. In this paper, we introduce a Degradation-guided Local-to-Global (DL2G) restoration method to address this problem. We first propose a multiplicative reflection degradation model, which is used to alleviate the reflection degradation and obtain a preliminary result. Then, in the local details restoration stage, we propose a local structure-aware diffusion model to learn the true distribution of texture details in the eye area. This helps recover the lost content in heavily degraded regions where the background is invisible. Finally, in the global consistency refinement stage, we utilize the input image as a reference to generate the final result, which is consistent with the input image in color and illumination. Extensive experiments demonstrate that our method improves the effect of reflection removal and generates results with more reasonable semantics, exquisite details, and harmonious illumination.
Poster
yuxuan Gu · Huaian Chen · Yi Jin · Haoxuan Wang · Pengyang Ling · ZHIXIANG WEI · Enhong Chen
[ ExHall D ]
Abstract
In this paper, we observe that the collaboration of various foundation models can perceive semantic and degradation information within images, thereby guiding the low-light enhancement process. Specifically, we propose a self-supervised low-light enhancement framework based on the collaboration of multiple foundation models (dubbed FoCo), aimed at improving both the visual quality of enhanced images and the performance in high-level applications. At the feature level, FoCo leverages the rich features from various foundation models to enhance the model's semantic perception during training, thereby reducing the gap between enhanced results and high-quality images from a high-level perspective. At the task level, we exploit the robustness gap between strong foundation models and weak models, applying high-level task guidance to the low-light enhancement training process. Through the collaboration of multiple foundation models, the proposed framework shows better enhancement performance and adapts better to high-level tasks. Extensive experiments across various enhancement and application benchmarks demonstrate the qualitative and quantitative superiority of the proposed method over numerous state-of-the-art techniques.
Poster
Shuangfan Zhou · Chu Zhou · Youwei Lyu · Heng Guo · Zhanyu Ma · Boxin Shi · Imari Sato
[ ExHall D ]
Abstract
Polarization cameras can capture multiple polarized images with different polarizer angles in a single shot, bringing convenience to polarization-based downstream tasks. However, their direct outputs are color-polarization filter array (CPFA) raw images, requiring demosaicing to reconstruct full-resolution, full-color polarized images; unfortunately, this necessary step introduces artifacts that make polarization-related parameters such as the degree of polarization (DoP) and angle of polarization (AoP) prone to error. Besides, limited by the hardware design, the resolution of a polarization camera is often much lower than that of a conventional RGB camera. Existing polarized image demosaicing (PID) methods are limited in that they cannot enhance resolution, while polarized image super-resolution (PISR) methods, though designed to obtain high-resolution (HR) polarized images from the demosaicing results, tend to retain or even amplify errors in the DoP and AoP introduced by demosaicing artifacts. In this paper, we propose PIDSR, a joint framework that performs complementary Polarized Image Demosaicing and Super-Resolution, showing the ability to robustly obtain high-quality HR polarized images with more accurate DoP and AoP from a CPFA raw image in a direct manner. Experiments show our PIDSR not only achieves state-of-the-art performance on both synthetic and real data, but also facilitates downstream tasks.
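For reference, the DoP and AoP on which the paper measures error are standard functions of the linear Stokes components recovered from the four polarizer angles (0°, 45°, 90°, 135°); a minimal sketch of that computation, independent of the proposed PIDSR network:

```python
import numpy as np

def dop_aop(i0, i45, i90, i135, eps=1e-8):
    """Degree and angle of linear polarization from four polarizer-angle images."""
    s0 = 0.5 * (i0 + i45 + i90 + i135)   # total intensity
    s1 = i0 - i90                        # horizontal/vertical difference
    s2 = i45 - i135                      # diagonal difference
    dop = np.sqrt(s1**2 + s2**2) / (s0 + eps)
    aop = 0.5 * np.arctan2(s2, s1)       # radians, in (-pi/2, pi/2]
    return dop, aop
```

Because DoP and AoP are ratios and angles of differences, even small demosaicing errors in the per-angle intensities can be strongly amplified, which is why the paper treats demosaicing and super-resolution jointly.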
Poster
Yuhui Liu · Liangxun Ou · Qiang Fu · Hadi Amata · Wolfgang Heidrich · YIFAN PENG
[ ExHall D ]
Abstract
Extracting high-fidelity RGBD information from two-dimensional (2D) images is essential for various visual computing applications. Stereo imaging, as a reliable passive imaging technique for obtaining three-dimensional (3D) scene information, has benefited greatly from deep learning advancements. However, existing stereo depth estimation algorithms struggle to perceive high-frequency information and to resolve high-resolution depth maps in realistic camera settings with large depth variations. These algorithms commonly neglect the hardware parameter configuration, limiting the potential for achieving optimal solutions solely through software-based design strategies. This work presents a hardware-software co-designed RGBD imaging framework that leverages both stereo and focus cues to reconstruct texture-rich color images along with detailed depth maps over a wide depth range. A pair of rank-2 parameterized diffractive optical elements (DOEs) is employed to optically encode perpendicular complementary information during stereo acquisitions. Additionally, we employ an IGEV-UNet-fused neural network tailored to the proposed rank-2 encoding for stereo matching and image reconstruction. Through prototyping a stereo camera with customized DOEs, our deep stereo imaging paradigm demonstrates superior performance over existing monocular and stereo imaging systems, with a 2.96 dB gain in image PSNR and improved depth accuracy in high-frequency details across distances from 0.67 to 8 meters.
Poster
ZELIN LI · Chenwei Wang · Zhaoke Huang · Centre for Intelligent Multidimensional Data Analysis · Hong Kong Baptist University
[ ExHall D ]
Abstract
3D fluorescence microscopy is essential for understanding fundamental life processes through long-term live-cell imaging. However, due to inherent issues in imaging principles, it faces significant challenges including spatially varying noise and anisotropic resolution, where the axial resolution lags behind the lateral resolution by up to 4.5 times. Meanwhile, laser power is kept low to maintain cell viability, making low-noise, high-resolution paired ground truth (GT) inaccessible. To tackle these limitations, we propose Volume Tells (VTCD), a dual cycle-consistent diffusion approach that effectively mines intra-volume imaging priors within 3D cell volumes in an unsupervised manner, achieving denoising and super-resolution (SR) simultaneously. Specifically, a spatially iso-distributed denoiser is designed to exploit the noise distribution consistency between adjacent low-noise and high-noise regions within the 3D cell volume, suppressing the spatially varying noise. Then, in light of the structural consistency of the cell volume, a cross-plane global-propagation SR module propagates high-resolution details from the XY plane into adjacent regions in the XZ and YZ planes, progressively enhancing resolution across the entire 3D cell volume. Experimental results on 10 in vivo cellular datasets demonstrate substantial improvements in both denoising and super-resolution, with axial resolution enhanced from ~430 nm to ~90 nm.
Poster
Jungho Lee · Suhwan Cho · Taeoh Kim · Ho-Deok Jang · Minhyeok Lee · Geonho Cha · Dongyoon Wee · Dogyoon Lee · Sangyoun Lee
[ ExHall D ]
Abstract
3D Gaussian Splatting (3DGS) has attracted significant attention for its high-quality novel view rendering, inspiring research to address real-world challenges. While conventional methods depend on sharp images for accurate scene reconstruction, real-world scenarios are often affected by defocus blur due to finite depth of field, making it essential to account for realistic 3D scene representation. In this study, we propose CoCoGaussian, a Circle-of-Confusion-aware Gaussian Splatting method that enables precise 3D scene representation using only defocused images. CoCoGaussian addresses the challenge of defocus blur by modeling the Circle of Confusion (CoC) through a physically grounded approach based on the principles of photographic defocus. Exploiting 3D Gaussians, we compute the CoC diameter from depth and learnable aperture information, generating multiple Gaussians to precisely capture the CoC shape. Furthermore, we introduce a learnable scaling factor to enhance robustness and provide more flexibility in handling unreliable depth in scenes with reflective or refractive surfaces. Experiments on both synthetic and real-world datasets demonstrate that CoCoGaussian achieves state-of-the-art performance across multiple benchmarks.
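As background on the CoC computation referenced above, the classic thin-lens model already gives the circle-of-confusion diameter in closed form from depth, focus distance, focal length, and aperture; a small sketch of that textbook formula (the paper's learnable-aperture formulation is not reproduced here):

```python
def coc_diameter(depth, focus_dist, focal_len, f_number):
    """Thin-lens circle-of-confusion diameter, in the same units as the inputs."""
    aperture = focal_len / f_number                       # aperture diameter
    return abs(aperture * focal_len * (depth - focus_dist)
               / (depth * (focus_dist - focal_len)))

# Example: 50 mm lens at f/1.8, focused at 2 m, object at 4 m (all in mm).
print(coc_diameter(depth=4000, focus_dist=2000, focal_len=50, f_number=1.8))
```

The diameter grows as the scene point moves away from the focus plane, which is the quantity the method uses to spread multiple Gaussians over the blur footprint.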
Poster
Zixuan Chen · Yujin Wang · Xin Cai · Zhiyuan You · Zhe-Ming Lu · Fan Zhang · Shi Guo · Tianfan Xue
[ ExHall D ]
Abstract
Capturing high dynamic range (HDR) scenes is one of the most important issues in camera design. The majority of cameras use an exposure fusion technique, which fuses images captured at different exposure levels, to increase dynamic range. However, this approach can only handle images with a limited exposure difference, normally 3-4 stops. When applied to very high dynamic range scenes where a large exposure difference is required, this approach often fails due to incorrect alignment or inconsistent lighting between inputs, or tone mapping artifacts. In this work, we propose UltraFusion, the first exposure fusion technique that can merge inputs with a 9-stop difference. The key idea is that we model exposure fusion as a guided inpainting problem, where the under-exposed image is used as guidance to fill in the missing highlight information in the over-exposed regions. By using the under-exposed image as soft guidance, instead of a hard constraint, our model is robust to potential alignment issues or lighting variations. Moreover, by utilizing the image prior of the generative model, our model also generates natural tone mapping, even for very high dynamic range scenes. Our approach outperforms HDR-Transformer on the latest HDR benchmarks. Moreover, to test its performance in ultra high dynamic range scenes, we …
Poster
Xiaoyu Zhang · Weihong Pan · Chong Bao · Xiyu Zhang · Xiaojun Xiang · Hanqing Jiang · Hujun Bao
[ ExHall D ]
Abstract
Humans perceive and comprehend their surroundings through information spanning multiple frequencies. In immersive scenes, people naturally scan their environment to grasp its overall structure while examining fine details of objects that capture their attention. However, current NeRF frameworks primarily focus on modeling either high-frequency local views or the broad structure of scenes with low-frequency information, and struggle to balance the two. We introduce FA-NeRF, a novel frequency-aware framework for view synthesis that simultaneously captures the overall scene structure and high-definition details within a single NeRF model. To achieve this, we propose a 3D frequency quantification method that analyzes the scene's frequency distribution, enabling frequency-aware rendering. Our framework incorporates a frequency grid for fast convergence and querying, and a frequency-aware feature re-weighting strategy to balance features across different frequency contents. Extensive experiments show that our method significantly outperforms existing approaches in modeling entire scenes while preserving fine details.
Poster
Jiajun Tang · Fan Fei · Zhihao Li · Xiao Tang · Shiyong Liu · Youyu Chen · Binxiao Huang · Dave Zhenyu Chen · Xiaofei Wu · Boxin Shi
[ ExHall D ]
Abstract
3D Gaussian Splatting (3DGS), a recently emerged multi-view 3D reconstruction technique, has shown significant advantages in real-time rendering and explicit editing. However, 3DGS encounters challenges in the accurate modeling of both high-frequency view-dependent appearances and global illumination effects, including inter-reflection. This paper introduces SpecTRe-GS, which addresses these challenges and models highly Specular surfaces that reflect nearby objects through Tracing Rays in 3D Gaussian Splatting. SpecTRe-GS separately models reflections from highly specular and rough surfaces to leverage the distinctions between their reflective properties, integrating an efficient ray tracer within the 3DGS framework for querying secondary rays, thus achieving fast and accurate rendering. Also, it incorporates normal prior guidance and joint geometry optimization at various stages of the training process to enhance geometry reconstruction for undistorted reflections. The proposed SpecTRe-GS demonstrates superior performance compared to existing 3DGS-based methods in capturing highly specular inter-reflections, as confirmed by experiments conducted on both synthetic and real-world scenes. We also showcase the editing applications enabled by the scene decomposition capabilities of SpecTRe-GS.
Poster
Hanxiao Sun · Yupeng Gao · Jin Xie · Jian Yang · Beibei Wang
[ ExHall D ]
Abstract
Reconstructing 3D assets from images, known as inverse rendering (IR), remains a challenging task due to its ill-posed nature and the complexities of appearance and lighting. 3D Gaussian Splatting (3DGS) has demonstrated impressive capabilities for novel view synthesis (NVS) tasks. It has also been introduced into relighting by decoupling radiance into Bidirectional Reflectance Distribution Function (BRDF) parameters and environmental lighting. Unfortunately, these methods often produce inferior relighting quality, exhibiting visible artifacts and unnatural indirect illumination. The main reason is the limited capability of each Gaussian, which has constant material parameters and normal, alongside the absence of physical constraints for indirect lighting. In this paper, we present a novel framework called Curved Gaussian Inverse Rendering (CG-IR), aimed at enhancing both NVS and relighting quality. To this end, we propose a new representation—Curved Gaussian (CG)—that generalizes per-Gaussian constant material parameters to allow for spatially varying parameters, indicating that different regions of each Gaussian can have various normals and material properties. This enhanced representation is complemented by a CG splatting scheme akin to vertex/fragment shading in traditional graphics pipelines. Furthermore, we integrate a physically-based indirect lighting model, enabling more realistic relighting. The proposed CG-IR framework significantly improves rendering quality, outperforming state-of-the-art NeRF-based methods …
Poster
Qiyu Dai · Xingyu Ni · Qianfan Shen · Mengyu Chu · Wenzheng Chen · Baoquan Chen
[ ExHall D ]
Abstract
We consider the problem of adding dynamic rain effects to in-the-wild scenes in a physically correct manner. Recent advances in scene modeling have made significant progress, with NeRF and 3DGS techniques emerging as powerful tools for reconstructing complex scenes. However, while effective for novel view synthesis, these methods typically struggle with challenging scene editing tasks, such as physics-based rain simulation. In contrast, traditional physics-based simulations can generate realistic rain effects, such as raindrops and splashes, but they often rely on skilled artists to carefully set up high-fidelity scenes. This process lacks flexibility and scalability, limiting its applicability to broader, open-world environments. In this work, we introduce RainyGS, a novel approach that leverages the strengths of both physics-based modeling and 3DGS to generate photorealistic, dynamic rain effects in open-world scenes with physical accuracy. At the core of our method is the integration of physically-based raindrop and shallow water simulation techniques within the fast 3DGS rendering framework, enabling realistic and efficient simulations of raindrop behavior, splashes, and reflections. Our method supports synthesizing rain effects at over 10 fps, offering users flexible control over rain intensity—from light drizzles to heavy downpours. We demonstrate that RainyGS performs effectively for both real-world outdoor scenes and …
Poster
Ludwic Leonard · Nils Thuerey · rüdiger westermann
[ ExHall D ]
Abstract
We introduce a single-view reconstruction technique of volumetric fields in which multiple light scattering effects are omnipresent, such as in clouds. We model the unknown distribution of volumetric fields using an unconditional diffusion model trained on a novel benchmark dataset comprising 1,000 synthetically simulated volumetric density fields. The neural diffusion model is trained on the latent codes of a novel, diffusion-friendly, monoplanar representation. The generative model is used to incorporate a tailored parametric diffusion posterior sampling technique into different reconstruction tasks. A physically-based differentiable volume renderer is employed to provide gradients with respect to light transport in the latent space. This stands in contrast to classic NeRF approaches and makes the reconstructions better aligned with observed data. Through various experiments, we demonstrate single-view reconstruction of volumetric clouds at a previously unattainable quality.
Poster
Juan Rodriguez · Abhay Puri · Shubham Agarwal · Issam Laradji · Pau Rodriguez · Sai Rajeswar · David Vazquez · Christopher Pal · Marco Pedersoli
[ ExHall D ]
Abstract
Scalable Vector Graphics (SVGs) are vital for modern image rendering due to their scalability and versatility. Previous SVG generation methods have focused on curve-based vectorization, lacking semantic understanding, often producing artifacts, and struggling with SVG primitives beyond path curves. To address these issues, we introduce StarVector, a multimodal large language model for SVG generation. It performs image vectorization by understanding image semantics and using SVG primitives for compact, precise outputs. Unlike traditional methods, StarVector works directly in the SVG code space, leveraging visual understanding to apply accurate SVG primitives. To train StarVector, we create SVG-Stack, a diverse dataset of 2M samples that enables generalization across vectorization tasks and precise use of primitives like ellipses, polygons, and text. We address challenges in SVG evaluation, showing that pixel-based metrics like MSE fail to capture the unique qualities of vector graphics. We introduce SVG-Bench, a benchmark across 10 datasets and three tasks: image vectorization, text-driven SVG generation, and diagram generation. Using this contribution, StarVector achieves state-of-the-art performance, producing more compact and semantically rich SVGs.
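To illustrate what "primitives beyond path curves" means in practice, the snippet below writes a tiny SVG that mixes a path with an ellipse, a polygon, and text; it is purely illustrative and not an output of StarVector.

```python
# Hand-written SVG using several primitive types; a curve-only vectorizer would
# typically approximate all of these shapes with <path> elements alone.
svg = """<svg xmlns="http://www.w3.org/2000/svg" width="120" height="80">
  <path d="M10 70 C 30 10, 60 10, 80 70" fill="none" stroke="black"/>
  <ellipse cx="95" cy="30" rx="18" ry="10" fill="lightgray"/>
  <polygon points="10,10 30,10 20,30" fill="gray"/>
  <text x="40" y="45" font-size="10">logo</text>
</svg>"""

with open("example.svg", "w") as f:
    f.write(svg)
```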
Poster
Cheng Sun · Jaesung Choe · Charles Loop · Wei-Chiu Ma · Yu-Chiang Frank Wang
[ ExHall D ]
Abstract
We propose an efficient radiance field rendering algorithm that incorporates a rasterization process on sparse voxels without neural networks or 3D Gaussians. There are two key contributions coupled with the proposed system. The first is to render sparse voxels in the correct depth order along pixel rays by using dynamic Morton ordering. This avoids the well-known popping artifact found in Gaussian splatting. Second, we adaptively fit sparse voxels to different levels of detail within scenes, faithfully reproducing scene details while achieving high rendering frame rates. Our method improves the previous neural-free voxel grid representation by over 4 dB in PSNR and more than 10× in rendering FPS, achieving novel-view synthesis results comparable to the state of the art. Additionally, our neural-free sparse voxels are seamlessly compatible with grid-based 3D processing algorithms. We achieve promising mesh reconstruction accuracy by integrating TSDF-Fusion and Marching Cubes into our sparse grid system.
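For background, Morton (Z-order) codes interleave the bits of integer voxel coordinates so that sorting by the code groups spatially nearby voxels; a minimal sketch of the standard encoding (the paper's dynamic, per-ray ordering builds on this idea but is not reproduced here):

```python
def part1by2(n: int) -> int:
    """Spread the lower 10 bits of n so two zero bits sit between each bit."""
    n &= 0x000003FF
    n = (n ^ (n << 16)) & 0xFF0000FF
    n = (n ^ (n << 8)) & 0x0300F00F
    n = (n ^ (n << 4)) & 0x030C30C3
    n = (n ^ (n << 2)) & 0x09249249
    return n

def morton3d(x: int, y: int, z: int) -> int:
    """Interleave three 10-bit coordinates into one 30-bit Morton code."""
    return part1by2(x) | (part1by2(y) << 1) | (part1by2(z) << 2)

# Sorting sparse voxels by Morton code yields a traversal-friendly spatial order.
voxels = [(5, 3, 1), (0, 0, 0), (7, 7, 7)]
print(sorted(voxels, key=lambda v: morton3d(*v)))
```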
Poster
Minye Wu · Haizhao Dai · Kaixin Yao · Jingyi Yu · Tinne Tuytelaars
[ ExHall D ]
Abstract
Differentiable rendering enables efficient optimization by allowing gradients to be computed through the rendering process, facilitating 3D reconstruction, inverse rendering and neural scene representation learning. To ensure differentiability, existing solutions approximate or re-formulate traditional rendering operations using smooth, probabilistic proxies such as volumes or Gaussian primitives. Consequently, they struggle to preserve sharp edges due to the lack of explicit boundary definitions. We present a novel hybrid representation, Bézier Gaussian Triangle (BG-Triangle), that combines Bézier triangle-based vector graphics primitives with Gaussian-based probabilistic models, to maintain accurate shape modeling while conducting resolution-independent differentiable rendering. We present a robust and effective discontinuity-aware rendering technique to reduce uncertainties at object boundaries. We also employ an adaptive densification and pruning scheme for efficient training while reliably handling level-of-detail (LoD) variations. Experiments show that BG-Triangle achieves comparable rendering quality as 3DGS but with superior boundary preservation. More importantly, BG-Triangle uses a much smaller number of primitives than its alternatives, showcasing the benefits of vectorized graphics primitives and the potential to bridge the gap between classic and emerging representations.
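For context, a Bézier triangle of the kind used as a primitive here is evaluated with barycentric Bernstein weights over its control points; a small sketch of that standard evaluation (independent of BG-Triangle's Gaussian component):

```python
import numpy as np
from math import factorial

def bezier_triangle(ctrl, u, v):
    """Evaluate a degree-n Bézier triangle at barycentric (u, v, w = 1 - u - v).

    ctrl: dict mapping multi-index (i, j, k) with i + j + k = n to a 3D control point.
    """
    n = sum(next(iter(ctrl)))                  # degree from any multi-index
    w = 1.0 - u - v
    point = np.zeros(3)
    for (i, j, k), p in ctrl.items():
        bernstein = (factorial(n) / (factorial(i) * factorial(j) * factorial(k))
                     * (u ** i) * (v ** j) * (w ** k))
        point += bernstein * np.asarray(p, dtype=float)
    return point

# Degree-1 (flat) example: the three corners of a triangle.
corners = {(1, 0, 0): [0, 0, 0], (0, 1, 0): [1, 0, 0], (0, 0, 1): [0, 1, 0]}
print(bezier_triangle(corners, u=0.3, v=0.3))  # a point inside the triangle
```

Raising the degree adds interior control points, which is what lets such a primitive bound curved regions with sharp, explicitly defined edges.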
Poster
Himangi Mittal · Peiye Zhuang · Hsin-Ying Lee · Shubham Tulsiani
[ ExHall D ]
Abstract
We propose UniPhy, a common latent-conditioned neural constitutive model that can encode the physical properties of diverse materials. At inference, UniPhy allows 'inverse simulation', i.e., inferring material properties by optimizing a scene-specific latent to match the available observations via differentiable simulation. In contrast to existing methods that treat such inference as system identification, UniPhy does not rely on user-specified material information. Compared to prior neural constitutive modeling approaches, which learn scene-specific networks, the shared training across materials improves both the robustness and the accuracy of the estimates. We train UniPhy using simulated trajectories across diverse geometries and materials: elastic, plasticine, sand, and fluids (Newtonian and non-Newtonian). At inference, given an object with unknown material properties, UniPhy can infer the material properties via latent optimization to match the motion observations, and can then re-simulate the object under diverse scenarios. We compare UniPhy against prior inverse simulation methods and show that the inference from UniPhy enables more accurate replay and re-simulation under novel conditions.
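The "inverse simulation" described above amounts to gradient descent on a material latent through a differentiable simulator; below is a heavily simplified sketch of that loop, where `simulate` is a toy stand-in rather than UniPhy's actual solver or interface.

```python
import torch

def simulate(latent, init_state, steps=50):
    """Toy differentiable 'simulator': rolls out a trajectory whose dynamics
    depend on the material latent (stand-in for a real constitutive model)."""
    state = init_state
    trajectory = []
    for _ in range(steps):
        state = state + 0.01 * torch.tanh(state @ latent.reshape(8, 8))
        trajectory.append(state)
    return torch.stack(trajectory)

init_state = torch.randn(128, 8)                                  # placeholder particles
trajectory_obs = simulate(torch.randn(64), init_state).detach()   # "observations"

latent = torch.zeros(64, requires_grad=True)                      # unknown material latent
optimizer = torch.optim.Adam([latent], lr=1e-2)
for _ in range(200):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(simulate(latent, init_state), trajectory_obs)
    loss.backward()                       # gradients flow through the rollout
    optimizer.step()
# The recovered `latent` can then condition re-simulation under new scenarios.
```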
Poster
Kaiwei Zhang · Dandan Zhu · Xiongkuo Min · Guangtao Zhai
[ ExHall D ]
Abstract
Mesh saliency enhances the adaptability of 3D vision by identifying and emphasizing regions that naturally attract visual attention. To investigate the interaction between geometric structure and texture in shaping visual attention, we establish a comprehensive mesh saliency dataset, which is the first to systematically capture the differences in saliency distribution under both textured and non-textured visual conditions. Furthermore, we introduce Mesh Mamba, a unified saliency prediction model based on a state space model (SSM), designed to adapt across various mesh types. Mesh Mamba effectively analyzes the geometric structure of the mesh while seamlessly incorporating texture features into the topological framework, ensuring coherence throughout appearance-enhanced modeling. More importantly, by subgraph embedding and a bidirectional SSM, the model enables global context modeling for both local geometry and texture, preserving the topological structure and improving the understanding of visual details and structural complexity. Through extensive theoretical and empirical validation, our model not only improves performance across various mesh types but also demonstrates high scalability and versatility, particularly through cross-validation of various visual features.
Poster
Xiaoliang Ju · Hongsheng Li
[ ExHall D ]
Abstract
We present DirectTriGS, a novel framework designed for 3D object generation with Gaussian Splatting (GS). GS-based rendering for 3D content has gained considerable attention recently. However, there has been limited exploration in directly generating 3D Gaussians compared to traditional generative modeling approaches. The main challenge lies in the complex data structure of GS, represented by discrete point clouds with multiple channels. To overcome this challenge, we propose employing the triplane representation, which allows us to represent Gaussian Splatting as an image-like continuous field. This representation effectively encodes both the geometry and texture information, enabling smooth transformation back to Gaussian point clouds and rendering into images by a TriRenderer, with only 2D supervision. The proposed TriRenderer is fully differentiable, so the rendering loss can supervise both texture and geometry encoding. Furthermore, the triplane representation can be compressed using a Variational Autoencoder (VAE), which can subsequently be utilized in latent diffusion to generate 3D objects. The experiments demonstrate that the proposed generation framework can produce high-quality 3D object geometry and rendering results.
Poster
Mark Boss · Zixuan Huang · Aaryaman Vasishta · Varun Jampani
[ ExHall D ]
Abstract
We present SF3D, a novel method for rapid and high-quality textured object mesh reconstruction from a single image in just 0.5 seconds. Unlike most existing approaches, SF3D is explicitly trained for mesh generation, incorporating a fast UV unwrapping technique that enables swift texture generation rather than relying on vertex colors. The method also learns to predict material parameters and normal maps to enhance the visual quality of the reconstructed 3D meshes. Furthermore, SF3D integrates a delighting step to effectively remove low-frequency illumination effects, ensuring that the reconstructed meshes can be easily used in novel illumination conditions. Experiments demonstrate the superior performance of SF3D over the existing techniques.
Poster
Rui Chen · Jianfeng Zhang · Yixun Liang · Guan Luo · Weiyu Li · Jiarui Liu · Xiu Li · Xiaoxiao Long · Jiashi Feng · Ping Tan
[ ExHall D ]
Abstract
Recent 3D content generation pipelines commonly employ Variational Autoencoders (VAEs) to encode shapes into compact latent representations for diffusion-based generation. However, the widely adopted uniform point sampling strategy in Shape VAE training often leads to significant loss of geometric details, limiting the quality of shape reconstruction and downstream generation tasks. We present Dora-VAE, a novel approach that enhances VAE reconstruction through our proposed sharp edge sampling strategy and a dual cross attention mechanism. By identifying and prioritizing regions with high geometric complexity during training, our method significantly improves the preservation of fine-grained shape features. Such a sampling strategy and the dual attention mechanism enable the VAE to focus on crucial geometric details that are typically missed by uniform sampling approaches. To systematically evaluate VAE reconstruction quality, we additionally propose Dora-bench, a benchmark that quantifies shape complexity through the density of sharp edges, introducing a new metric focused on reconstruction accuracy at these salient geometric features. Extensive experiments on Dora-bench demonstrate that Dora-VAE achieves reconstruction quality comparable to the state-of-the-art dense XCube-VAE while requiring a latent space at least 8× smaller (1,280 vs. ≥ 10,000 codes). We will release our code and benchmark dataset to facilitate future research in 3D shape modeling.
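As an illustration of how "regions with high geometric complexity" could be identified, the sketch below flags sharp mesh edges by the dihedral angle between their adjacent faces, so that point sampling can be biased toward them; the threshold and data layout are assumptions, not Dora-VAE's exact procedure.

```python
import numpy as np

def sharp_edges(vertices, faces, angle_thresh_deg=30.0):
    """Return edges whose adjacent face normals deviate by more than the threshold."""
    v = vertices[faces]                                   # (F, 3, 3) triangle corners
    normals = np.cross(v[:, 1] - v[:, 0], v[:, 2] - v[:, 0])
    normals /= np.linalg.norm(normals, axis=1, keepdims=True) + 1e-12

    edge_faces = {}                                       # undirected edge -> face ids
    for fi, (a, b, c) in enumerate(faces):
        for e in ((a, b), (b, c), (c, a)):
            edge_faces.setdefault(tuple(sorted(e)), []).append(fi)

    cos_thresh = np.cos(np.deg2rad(angle_thresh_deg))
    return [edge for edge, fids in edge_faces.items()
            if len(fids) == 2 and np.dot(normals[fids[0]], normals[fids[1]]) < cos_thresh]

# Two triangles sharing an edge and folded 90 degrees: the shared edge is sharp.
verts = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
tris = np.array([[0, 1, 2], [0, 1, 3]])
print(sharp_edges(verts, tris))
# Extra surface points can then be sampled preferentially near the returned edges.
```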
Poster
Suizhi Huang · Xingyi Yang · Hongtao Lu · Xinchao Wang
[ ExHall D ]
Abstract
Implicit Neural Representations (INRs) have emerged as a powerful framework for representing continuous signals. However, generating diverse INR weights remains challenging due to limited training data. We introduce Few-shot Implicit Function Generation, a new problem setup that aims to generate diverse yet functionally consistent INR weights from only a few examples. This is challenging because even for the same signal, the optimal INRs can vary significantly depending on their initializations. To tackle this, we propose EquiGen, a framework that can generate new INRs from limited data. The core idea is that functionally similar networks can be transformed into one another through weight permutations, forming an equivariance group. By projecting these weights into an equivariant latent space, we enable diverse generation within these groups, even with few examples. EquiGen implements this through an equivariant encoder trained via contrastive learning and smooth augmentation, an equivariance-guided diffusion process, and controlled perturbations in the equivariant subspace. Experiments on 2D image and 3D shape INR datasets demonstrate that our approach effectively generates diverse INR weights while preserving their functional properties in few-shot scenarios.
Poster
Amir Barda · Matheus Gadelha · Vladimir G. Kim · Noam Aigerman · Amit H. Bermano · Thibault Groueix
[ ExHall D ]
Abstract
We propose a generative technique to edit 3D shapes, represented as meshes, NeRFs, or Gaussian Splats, in ≈3 seconds, without the need for running an SDS-type optimization. Our key insight is to cast 3D editing as a multiview image inpainting problem, as this representation is generic and can be mapped back to any 3D representation using the bank of available Large Reconstruction Models. We explore different fine-tuning strategies to obtain both multiview generation and inpainting capabilities within the same diffusion model. In particular, the design of the inpainting mask is an important factor in training an inpainting model, and we propose several masking strategies to mimic the types of edits a user would perform on a 3D shape. Our approach takes 3D generative editing from hours to seconds and produces higher-quality results compared to previous works.
Poster
Sinisa Stekovic · Arslan Artykov · Stefan Ainetter · Mattia Durso · Friedrich Fraundorfer
[ ExHall D ]
Abstract
We propose PyTorchGeoNodes, a differentiable module for 3D object reconstruction from images using interpretable shape programs. Unlike traditional CAD model retrieval, shape programs allow semantic reasoning, editing, and a low memory footprint. Despite their potential, shape programs for 3D scene understanding have been largely overlooked. Our key contribution is enabling gradient-based optimization by translating shape programs, like those in Blender, into efficient PyTorch code. Additionally, we show that a combination of PyTorchGeoNodes with Genetic Algorithms is a method of choice to optimize both discrete and continuous shape program parameters for 3D reconstruction, and can be further integrated with other reconstruction algorithms such as Gaussian Splats. Our experiments on the ScanNet dataset show that our method achieves accurate reconstructions.
Poster
Susung Hong · Johanna Suvi Karras · Ricardo Martin · Ira Kemelmacher-Shlizerman
[ ExHall D ]
Abstract
The fields of 3D reconstruction and text-based 3D editing have advanced significantly with the evolution of text-based diffusion models. While existing 3D editing methods excel at modifying color, texture, and style, they struggle with extensive geometric or appearance changes, thus limiting their applications. We propose **Perturb-and-Revise**, which enables a wide variety of NeRF edits. First, we **perturb** the NeRF parameters with random initializations to create a versatile initialization. We automatically determine the perturbation magnitude through analysis of the local loss landscape. Then, we **revise** the edited NeRF via generative trajectories. Combined with the generative process, we impose identity-preserving gradients to refine the edited NeRF. Extensive experiments demonstrate that Perturb-and-Revise facilitates flexible, effective, and consistent editing of color, appearance, and geometry in 3D without model retraining.
Poster
Yufei Huang · Bangyan Liao · Yuqi Hu · Haitao Lin · Lirong Wu · Siyuan Li · Cheng Tan · Zicheng Liu · Yunfan Liu · Zelin Zang · Chang Yu · Zhen Lei
[ ExHall D ]
Abstract
Score Distillation Sampling (SDS) has been successfully extended to text-driven 3D scene editing with 2D pretrained diffusion models. However, SDS-based editing methods suffer from lengthy optimization processes with slow inference and low quality. We attribute the issue of lengthy optimization to the stochastic optimization scheme used in SDS-based editing, where many steps may conflict with each other (e.g., the inherent trade-off between editing and preservation). To reduce this internal conflict and speed up the editing process, we propose to separate editing and preservation in time with a diffusion time schedule and frame the 3D editing optimization process as a diffusion bridge sampling process. Motivated by the analysis above, we introduce DaCapo, a fast diffusion sampling-like 3D editing method that incorporates a novel stacked bridge framework, which estimates a direct diffusion bridge between the source and target distributions with only a pretrained 2D diffusion model. Specifically, it models the editing process as a combination of inversion and generation, where both processes happen simultaneously as a stack of diffusion bridges. DaCapo shows a 15× speed-up with comparable results to the state-of-the-art SDS-based method. It completes the process in just 2,500 steps on a single GPU and accommodates a variety of 3D representation methods.
Poster
Takuhiro Kaneko
[ ExHall D ]
Abstract
Recent advancements in neural 3D representations, such as neural radiance fields (NeRF) and 3D Gaussian splatting (3DGS), have made accurate estimation of the 3D structure from multiview images possible. However, this capability is limited to estimating the visible external structure, and it is still difficult to identify the invisible internal structure hidden behind the surface. To overcome this limitation, we address a new task called structure from collision (SfC), which aims to estimate the structure (including the invisible internal one) of an object from the appearance changes at collision. To solve this task, we propose a novel model called SfC-NeRF, which optimizes the invisible internal structure (i.e., internal volume density) of the object through a video sequence under physical, appearance (i.e., visible external structure)-preserving, and key-frame constraints. In particular, to avoid falling into undesirable local optima owing to its ill-posed nature, we propose volume annealing, i.e., searching for the global optima by repeatedly reducing and expanding the volume. Extensive experiments on 60 cases involving diverse structures (i.e., various cavity shapes, locations, and sizes) and various material properties reveal the properties of SfC and demonstrate the effectiveness of the proposed SfC-NeRF.
Poster
Zixuan Chen · Guangcong Wang · Jiahao Zhu · Jianhuang Lai · Xiaohua Xie
[ ExHall D ]
Abstract
3D Gaussian Splatting (3DGS) has recently created impressive assets for various applications. However, the copyright of these assets is not well protected as existing watermarking methods are not suited for 3DGS considering security, capacity, and invisibility. Besides, these methods often require hours or even days for optimization, limiting the application scenarios. In this paper, we propose **GuardSplat**, an innovative and efficient framework that effectively protects the copyright of 3DGS assets. Specifically, **1)** We first propose a CLIP-guided Message Decoupling Optimization module for training the message decoder, leveraging CLIP's aligning capability and rich representations to achieve a high extraction accuracy with minimal optimization costs, presenting exceptional **capability** and **efficiency**. **2)** Then, we propose a Spherical-harmonic-aware (SH-aware) Message Embedding module tailored for 3DGS, which employs a set of SH offsets to seamlessly embed the message into the SH features of each 3D Gaussian while maintaining the original 3D structure. It enables the 3DGS assets to be watermarked with minimal fidelity trade-offs and prevents malicious users from removing the messages from the model files, meeting the demands for **invisibility** and **security**. **3)** We further propose an Anti-distortion Message Extraction module to improve **robustness** against various visual distortions. Extensive experiments demonstrate that **GuardSplat** outperforms the state-of-the-art methods and achieves fast optimization speed. The …
Poster
Zhengxue Wang · Zhiqiang Yan · Jinshan Pan · Guangwei Gao · Kai Zhang · Jian Yang
[ ExHall D ]
Abstract
Recent RGB-guided depth super-resolution methods have achieved impressive performance under the assumption of fixed and known degradation (e.g., bicubic downsampling). However, in real-world scenarios, captured depth data often suffer from unconventional and unknown degradation due to sensor limitations and complex imaging environments (e.g., low reflective surfaces, varying illumination). Consequently, the performance of these methods significantly declines when real-world degradations deviate from their assumptions. In this paper, we propose the Degradation Oriented and Regularized Network (DORNet), a novel framework designed to adaptively address unknown degradation in real-world scenes through implicit degradation representations. Our approach begins with the development of a self-supervised degradation learning strategy, which models the degradation representations of low-resolution depth data using routing selection-based degradation regularization. To facilitate effective RGB-D fusion, we further introduce a degradation-oriented feature transformation module that selectively propagates RGB content into the depth data based on the learned degradation priors. Extensive experimental results on both real and synthetic datasets demonstrate the superiority of our DORNet in handling unknown degradations, outperforming existing methods.
Poster
Hengyu Liu · Yuehao Wang · Chenxin Li · Ruisi Cai · Kevin Wang · Wuyang Li · Pavlo Molchanov · Peihao Wang · Zhangyang Wang
[ ExHall D ]
Abstract
3D Gaussian splatting (3DGS) has enabled various applications in 3D scene representation and novel view synthesis due to its efficient rendering capabilities. However, 3DGS demands significant GPU memory, limiting its use on devices with restricted computational resources. Previous approaches have focused on pruning less important Gaussians, effectively compressing 3DGS but often requiring a fine-tuning stage and lacking adaptability for the specific memory needs of different devices. In this work, we present an elastic inference method for 3DGS. Given an input for the desired model size, our method selects and transforms a subset of Gaussians, achieving substantial rendering performance without additional fine-tuning. We introduce a tiny learnable module that controls Gaussian selection based on the input percentage, along with a transformation module that adjusts the selected Gaussians to complement the performance of the reduced model. Comprehensive experiments on ZipNeRF, MipNeRF and Tanks & Temples scenes demonstrate the effectiveness of our approach. Code will be publicly available.
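A minimal sketch of the budgeted selection step described above, assuming per-Gaussian importance scores are already available; the opacity compensation used here is a hypothetical stand-in for the paper's learned transformation module.

```python
import torch

def elastic_select(means, opacities, scales, importance, keep_fraction):
    """Keep the top-`keep_fraction` Gaussians by importance and lightly boost
    the survivors' opacity to compensate for the removed mass (illustrative)."""
    k = max(1, int(keep_fraction * importance.numel()))
    idx = torch.topk(importance, k).indices
    comp = (opacities.sum() / opacities[idx].sum()).clamp(max=2.0)
    return means[idx], (opacities[idx] * comp).clamp(max=1.0), scales[idx]

# Toy usage: request 30% of a 10,000-Gaussian model.
N = 10_000
means, opac, scales = elastic_select(torch.randn(N, 3), torch.rand(N),
                                     torch.rand(N, 3), torch.rand(N), 0.3)
print(means.shape, opac.shape, scales.shape)
```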
Poster
You Shen · Zhipeng Zhang · Xinyang Li · Yansong Qu · Yu Lin · Shengchuan Zhang · Liujuan Cao
[ ExHall D ]
Abstract
Representing 3D scenes from multiview images is a core challenge in computer vision and graphics, which requires both precise rendering and accurate reconstruction. Recently, 3D Gaussian Splatting (3DGS) has garnered significant attention for its high-quality rendering and fast inference speed. Yet, due to the unstructured and irregular nature of Gaussian point clouds, ensuring accurate geometry reconstruction remains difficult. Existing methods primarily focus on geometry regularization, with common approaches including primitive-based and dual-model frameworks. However, the former suffers from inherent conflicts between rendering and reconstruction, while the latter is computationally and storage-intensive. To address these challenges, we propose CarGS, a unified model leveraging Contribution adaptive regularization to achieve simultaneous, high-quality rendering and surface reconstruction. The essence of our framework is learning adaptive contribution for Gaussian primitives by squeezing the knowledge from geometry regularization into a compact MLP. Additionally, we introduce a geometry-guided densification strategy with clues from both normals and Signed Distance Fields (SDF) to improve the capability of capturing high-frequency details. Our design improves the mutual learning of the two tasks, meanwhile its unified structure doesn't require separate models as in dual-model based approaches, guaranteeing efficiency. Extensive experiments demonstrate CarGS’s ability to achieve state-of-the-art (SOTA) results in both rendering fidelity …
Poster
Suyoung Lee · JAEYOUNG CHUNG · Kihoon Kim · Jaeyoo Huh · Gunhee Lee · Minsoo Lee · Kyoung Mu Lee
[ ExHall D ]
Abstract
Feed-forward 3D Gaussian Splatting (3DGS) models have gained significant popularity due to their ability to generate scenes immediately without needing per-scene optimization. Although omnidirectional images are getting more popular since they reduce the computation for image stitching to composite a holistic scene, existing feed-forward models are only designed for perspective images. The unique optical properties of omnidirectional images make it difficult for feature encoders to correctly understand the context of the image and make the Gaussians non-uniform in space, which hinders the image quality synthesized from novel views. We propose OmniSplat, a pioneering work for fast feed-forward 3DGS generation from a few omnidirectional images. We introduce a Yin-Yang grid and decompose images based on it to reduce the domain gap between omnidirectional and perspective images. The Yin-Yang grid can use the existing CNN structure as it is, but its quasi-uniform characteristic allows the decomposed image to be similar to a perspective image, so it can exploit the strong prior knowledge of the learned feed-forward network. OmniSplat demonstrates higher reconstruction accuracy than existing feed-forward networks trained on perspective images. Furthermore, we enhance the segmentation consistency between omnidirectional images by leveraging attention from the encoder of OmniSplat, providing fast and clean 3DGS editing results.
Poster
Chung-Ho Wu · Yang-Jung Chen · Ying-Huan Chen · Jie-Ying Lee · Bo-Hsu Ke · Chun-Wei Tuan Mu · Yichuan Huang · Chin-Yang Lin · Min-Hung Chen · Yen-Yu Lin · Yu-Lun Liu
[ ExHall D ]
Abstract
Three-dimensional scene inpainting is crucial for applications from virtual reality to architectural visualization, yet existing methods struggle with view consistency and geometric accuracy in 360° unbounded scenes. We present AuraFusion360, a novel reference-based method leveraging 3D Gaussian Splatting for high-quality object removal and hole filling. Our approach introduces (1) depth-aware unseen mask generation for accurate occlusion identification, (2) Adaptive Guided Depth Diffusion for geometric consistency, and (3) SDEdit-based detail enhancement for multi-view coherence. We also introduce 360-USID, the first comprehensive dataset for unbounded scene inpainting with ground truth. Extensive experiments demonstrate that AuraFusion360 significantly outperforms existing methods, achieving superior perceptual quality while maintaining geometric accuracy across dramatic viewpoint changes.
Poster
Chong Bao · Xiyu Zhang · Zehao Yu · Jiale Shi · Guofeng Zhang · Songyou Peng · Zhaopeng Cui
[ ExHall D ]
Abstract
3D Gaussian Splatting (3DGS) has demonstrated remarkable success in high-quality 3D neural reconstruction and novel view rendering with dense input views and accurate poses. However, applying 3DGS to sparse, unposed views in unbounded 360-degree scenes remains a challenging problem. In this paper, we propose a novel neural rendering framework to accomplish the unposed and extremely sparse-view 3D reconstruction in unbounded 360-degree scenes. To resolve the spatial ambiguity inherent in unbounded scenes with sparse input views, we propose a layered Gaussian-based representation to effectively model the scene with distinct spatial layers. By employing a dense stereo reconstruction model to recover coarse geometry, we introduce a reconstruction bootstrap optimization that refines the noise and distortions in the coarse geometry. Furthermore, we propose an iterative fusion of reconstruction and generation, facilitating mutual conditioning and enhancement between these two processes. Comprehensive experiments show that our approach outperforms existing state-of-the-art methods in terms of rendering quality and surface reconstruction accuracy.
Poster
Nicole Meng · Caleb Manicke · Ronak Sahu · Caiwen Ding · Yingjie Lao
[ ExHall D ]
Abstract
Generalizable Neural Radiance Fields (GNeRF) are recognized as one of the most promising techniques for novel view synthesis and 3D model generation in real-world applications. However, like other generative models in computer vision, ensuring their adversarial robustness against various threat models is essential for practical use. The pioneering work in this area, NeRFool, introduced a state-of-the-art attack that targets GNeRFs by manipulating source views before feature extraction, successfully disrupting the color and density results of the constructed views. Building on this foundation, we propose IL2-NeRF (Iterative L2 NeRF Attack), a novel adversarial attack method that explores a new threat model (in the L2 domain) for attacking GNeRFs. We evaluated IL2-NeRF against two standard GNeRF models across three benchmark datasets, demonstrating similar performance compared to NeRFool, based on the same evaluation metrics proposed by NeRFool. Our results establish IL2-NeRF as the first adversarial method for GNeRFs under the L2 norm. We establish a foundational L2 threat model for future research, enabling direct performance comparisons while introducing a smoother, image-wide perturbation approach in Adversarial 3D Reconstruction.
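For readers unfamiliar with L2-bounded attacks, here is a generic iterative L2 projected-gradient template (a standard formulation, not the specific IL2-NeRF algorithm); `loss_fn` is a placeholder for any differentiable objective on the perturbed source views, such as the error of a rendered novel view.

```python
import torch

def l2_pgd(views, loss_fn, eps=2.0, step=0.5, iters=20):
    """Iteratively perturb `views` to maximize `loss_fn`, keeping each view's
    perturbation inside an L2 ball of radius `eps`."""
    expand = (-1,) + (1,) * (views.dim() - 1)
    delta = torch.zeros_like(views, requires_grad=True)
    for _ in range(iters):
        grad, = torch.autograd.grad(loss_fn(views + delta), delta)
        g_norm = grad.flatten(1).norm(dim=1).clamp_min(1e-12).view(expand)
        delta = delta + step * grad / g_norm                    # normalized ascent step
        d_norm = delta.flatten(1).norm(dim=1).clamp_min(1e-12)
        scale = (eps / d_norm).clamp(max=1.0).view(expand)
        delta = (delta * scale).detach().requires_grad_(True)   # project onto L2 ball
    return (views + delta).detach()

# Toy usage with a stand-in objective.
views = torch.rand(2, 3, 32, 32)
adv = l2_pgd(views, lambda v: (v ** 2).sum())
print((adv - views).flatten(1).norm(dim=1))  # each entry <= eps
```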
Poster
Jiahe Li · Feiyu Wang · Xiaochao Qu · WU CHENGJING · Luoqi Liu · Ting Liu
[ ExHall D ]
Abstract
Gaussian Splatting (GS)-based methods rely on sufficient training view coverage and perform synthesis on interpolated views. In this work, we tackle the more challenging and underexplored Extrapolated View Synthesis (EVS) task. Here we enable GS-based models trained with limited view coverage to generalize well to extrapolated views. To achieve our goal, we propose a view augmentation framework to guide training through a coarse-to-fine process. At the coarse stage, we reduce rendering artifacts due to insufficient view coverage by introducing a regularization strategy at both appearance and geometry levels. At the fine stage, we generate reliable view priors to provide further training guidance. To this end, we incorporate an occlusion awareness into the view prior generation process, and refine the view priors with the aid of coarse stage output. We call our framework Enhanced View Prior Guidance for Splatting (EVPGS). To comprehensively evaluate EVPGS on the EVS task, we collect a real-world dataset called Merchandise3D dedicated to the EVS scenario. Experiments on three datasets including both real and synthetic demonstrate EVPGS achieves state-of-the-art performance, while improving synthesis quality at extrapolated views for GS-based methods both qualitatively and quantitatively. We will make our code, dataset, and models public.
Poster
Xiaoding Yuan · Shitao Tang · Kejie Li · Peng Wang
[ ExHall D ]
Abstract
This paper introduces the Camera-free Diffusion (CamFreeDiff) model for 360° image outpainting from a single camera-free image and text description. This method distinguishes itself from existing strategies, such as MVDiffusion, by eliminating the requirement for predefined camera poses. CamFreeDiff seamlessly incorporates a mechanism for predicting homography within the multi-view diffusion framework. The key component of our approach is to formulate camera estimation by directly predicting the homography transformation from the input view to the predefined canonical view. In contrast to the direct two-stage approach of image transformation and outpainting, CamFreeDiff utilizes predicted homography to establish point-level correspondences between the input view and the target panoramic view. This enables consistency through correspondence-aware attention, which is learned in a fully differentiable manner. Qualitative and quantitative experimental results demonstrate the strong robustness and performance of CamFreeDiff for 360° image outpainting in the challenging context of camera-free inputs.
Poster
Yash Kant · Ethan Weber · Jin Kyu Kim · Rawal Khirodkar · Zhaoen Su · Julieta Martinez · Igor Gilitschenski · Shunsuke Saito · Timur Bagautdinov
[ ExHall D ]
Abstract
We present Pippo, a generative model capable of producing a dense set of high-resolution (1K) multi-view images of a person from a single photo. Our approach does not require any parametric model fitting or camera parameters of the input image, and generalizes to arbitrary identities with diverse clothing and hair styles. Pippo is a multi-view diffusion transformer trained in multiple stages. First, we pretrain the model on a billion-scale human-centric image dataset. Second, we train the model on studio data to generate many low-resolution consistent views conditioned on a coarse camera and an input image. Finally, we fine-tune the model on high-resolution data for multi-view generation with minimal placement controls, further improving consistency. This training strategy allows us to retain the generalizability from the large-scale pretraining while enabling high-resolution multi-view synthesis. We investigate several key architecture design choices for multi-view generation with diffusion transformers for precise view and identity control. Using a newly introduced 3D consistency metric, we demonstrate that Pippo outperforms existing approaches on multi-view human generation from a single image.
Poster
Yihang Luo · Shangchen Zhou · Yushi Lan · Xingang Pan · Chen Change Loy
[ ExHall D ]
Abstract
Despite advances in neural rendering, due to the scarcity of high-quality 3D datasets and the inherent limitations of multi-view diffusion models, view synthesis and 3D model generation are restricted to low resolutions with suboptimal multi-view consistency. In this study, we present a novel 3D enhancement pipeline, dubbed 3DEnhancer, which employs a multi-view latent diffusion model to enhance coarse 3D inputs while preserving multi-view consistency. Our method includes a pose-aware encoder and a diffusion-based denoiser to refine low-quality multi-view images, along with data augmentation and multi-view row attention and epipolar aggregation modules to ensure high-quality, consistent 3D outputs across views. Unlike existing video-based approaches, our model supports seamless multi-view enhancement with improved coherence under diverse viewing angles. Extensive evaluations demonstrate that 3DEnhancer significantly outperforms existing methods, improving both multi-view enhancement and per-instance 3D optimization tasks. Code and model will be publicly available.
Poster
Hanwen Jiang · Zexiang Xu · Desai Xie · Chen Ziwen · Haian Jin · Fujun Luan · ZHIXIN SHU · Kai Zhang · Sai Bi · Xin Sun · Jiuxiang Gu · Qixing Huang · Georgios Pavlakos · Hao Tan
[ ExHall D ]
Abstract
We propose scaling up 3D scene reconstruction by training with synthesized data. At the core of our work is MegaSynth, a procedurally generated 3D dataset comprising 700K scenes—over 50 times larger than the prior real dataset DL3DV—dramatically scaling the training data. To enable scalable data generation, our key idea is eliminating semantic information, removing the need to model complex semantic priors such as object affordances and scene composition. Instead, we model scenes with basic spatial structures and geometry primitives, offering scalability. Besides, we control data complexity to facilitate training while loosely aligning it with real-world data distribution to benefit real-world generalization. We explore training LRMs with both MegaSynth and available real data. Experiment results show that joint training or pre-training with MegaSynth improves reconstruction quality by 1.2 to 1.8 dB PSNR across diverse image domains. Moreover, models trained solely on MegaSynth perform comparably to those trained on real data, underscoring the low-level nature of 3D reconstruction. Additionally, we provide an in-depth analysis of MegaSynth's properties for enhancing model capability, training stability, and generalization. We will make our data and the data generation code public upon publication.
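The semantics-free idea lends itself to a very small sketch: sample random primitives with random poses, sizes, and albedos, with no notion of object category or layout. Field names and ranges below are illustrative assumptions, not the MegaSynth generation code.

```python
import numpy as np

def sample_scene(n_objects=20, rng=None):
    """Procedurally sample a semantics-free scene description from basic
    geometry primitives; every attribute is drawn independently at random."""
    rng = np.random.default_rng() if rng is None else rng
    primitives = ["box", "sphere", "cylinder", "cone"]
    return [{
        "type": rng.choice(primitives),
        "position": rng.uniform(-5.0, 5.0, size=3),
        "rotation_euler": rng.uniform(0.0, 2.0 * np.pi, size=3),
        "scale": rng.uniform(0.2, 2.0, size=3),
        "albedo": rng.uniform(0.0, 1.0, size=3),   # stand-in for a random texture
    } for _ in range(n_objects)]

print(sample_scene(3)[0]["type"])
```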
Poster
Haofei Xu · Songyou Peng · Fangjinhua Wang · Hermann Blum · Daniel Barath · Andreas Geiger · Marc Pollefeys
[ ExHall D ]
Abstract
Gaussian splatting and single/multi-view depth estimation are typically studied in isolation. In this paper, we present DepthSplat to connect Gaussian splatting and depth estimation and study their interactions. More specifically, we first contribute a robust multi-view depth model by leveraging pre-trained monocular depth features, leading to high-quality feed-forward 3D Gaussian splatting reconstructions. We also show that Gaussian splatting can serve as an unsupervised pre-training objective for learning powerful depth models from large-scale unlabeled datasets. We validate the synergy between Gaussian splatting and depth estimation through extensive ablation and cross-task transfer experiments. Our DepthSplat achieves state-of-the-art performance on ScanNet, RealEstate10K and DL3DV datasets in terms of both depth estimation and novel view synthesis, demonstrating the mutual benefits of connecting both tasks. We invite the readers to view our supplementary video for feed-forward reconstruction results of large-scale or 360 scenes from up to 12 input views at 512×960 resolutions. Our code and models will be publicly available.
Poster
Fan Yang · Jianfeng Zhang · Jun Hao Liew · Chaoyue Song · Zhongcong Xu · Xiu Li · Jiashi Feng · Guosheng Lin
[ ExHall D ]
Abstract
Multi-view image synthesis models are limited by a lack of training data. Fine-tuning well-trained video generative models to generate 360-degree videos of objects offers a promising solution, as they inherit the strong generative priors from the pretrained knowledge. However, these methods often face computational bottlenecks due to the large number of viewpoints, with temporal attention mechanisms often used to mitigate this. Unfortunately, such techniques can introduce artifacts like 3D inconsistency and over-smoothing. To overcome this, we propose a novel sparsification approach that reduces the video diffusion model into sparse view synthesis. We first extract rich geometric priors from pretrained video diffusion models and then conduct high-fidelity sparse multi-view synthesis to improve the 3D consistency. Extensive experiments show that our approach achieves superior efficiency, generalization, and consistency, outperforming state-of-the-art multi-view synthesis methods.
Poster
Alex Trevithick · Roni Paiss · Philipp Henzler · Dor Verbin · Rundi Wu · Hadi Alzayer · Ruiqi Gao · Ben Poole · Jonathan T. Barron · Aleksander Holynski · Ravi Ramamoorthi · Pratul P. Srinivasan
[ ExHall D ]
Abstract
Novel-view synthesis techniques achieve impressive results for static scenes but struggle when faced with the inconsistencies inherent to casual capture settings: varying illumination, scene motion, and other unintended effects that are difficult to model explicitly. We present an approach for leveraging generative video models to simulate the inconsistencies in the world that can occur during capture. We use this process, along with existing multi-view datasets, to create synthetic data for training a multi-view harmonization network that is able to reconcile inconsistent observations into a consistent 3D scene. We demonstrate that our world-simulation strategy significantly outperforms traditional augmentation methods in handling real-world scene variations, thereby enabling highly accurate static 3D reconstructions in the presence of a variety of challenging inconsistencies.
Poster
Hanyang Wang · Fangfu Liu · Jiawei Chi · Yueqi Duan
[ ExHall D ]
Abstract
Recovering 3D scenes from sparse views is a challenging task due to its inherent ill-posed nature. Conventional methods have developed specialized solutions (e.g., geometry regularization or feed-forward deterministic models) to mitigate the issue. However, they still suffer from performance degradation by minimal overlap across input views with insufficient visual information. Fortunately, recent video generative models show promise in addressing this challenge as they are capable of generating video clips with plausible 3D structures. Powered by large pretrained video diffusion models, some pioneering works have started to explore the potential of video generative priors and create 3D scenes from sparse views. Despite impressive improvements, they are limited by slow inference time and the lack of 3D constraints, leading to inefficiencies and reconstruction artifacts that do not align with real-world geometry structure. In this paper, we propose VideoScene to distill the video diffusion model to generate 3D scenes in one step, aiming to build an efficient and effective tool to bridge the gap from video to 3D. Specifically, we design a 3D-aware leap flow distillation strategy to leap over time-consuming redundant information and train a dynamic denoising policy network to adaptively determine the optimal leap timestep during inference. Extensive experiments demonstrate that our …
Poster
Liyan Chen · Huangying Zhan · Kevin Chen · Xiangyu Xu · Qingan Yan · Changjiang Cai · Yi Xu
[ ExHall D ]
Abstract
We introduce ActiveGAMER, an active mapping system that utilizes 3D Gaussian Splatting (3DGS) to achieve high-quality, real-time scene mapping and exploration. Unlike traditional NeRF-based methods, which are computationally demanding and restrict active mapping performance, our approach leverages the efficient rendering capabilities of 3DGS, allowing effective and efficient exploration in complex environments. The core of our system is a rendering-based information gain module that dynamically identifies the most informative viewpoints for next-best-view planning, enhancing both geometric and photometric reconstruction accuracy. ActiveGAMER also integrates a carefully balanced framework, combining coarse-to-fine exploration, post-refinement, and a global-local keyframe selection strategy to maximize reconstruction completeness and fidelity. Our system autonomously explores and reconstructs environments with state-of-the-art geometric and photometric accuracy and completeness, significantly surpassing existing approaches in both aspects. Extensive evaluations on benchmark datasets such as Replica and MP3D highlight ActiveGAMER's effectiveness in active mapping tasks.
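A stripped-down sketch of rendering-based next-best-view selection, assuming a `render_coverage` callback that returns per-pixel opacity (or completeness) of the current map from a candidate pose; treating the fraction of under-covered pixels as the gain is a simplification of the paper's information-gain module.

```python
import numpy as np

def next_best_view(candidate_poses, render_coverage, threshold=0.5):
    """Score each candidate pose by how much of its rendering is still
    unexplained by the current map; return the best index and all gains."""
    gains = []
    for pose in candidate_poses:
        coverage = render_coverage(pose)                 # (H, W) values in [0, 1]
        gains.append(float((coverage < threshold).mean()))
    return int(np.argmax(gains)), gains

# Toy usage with a random stand-in renderer.
best, gains = next_best_view([0, 1, 2], lambda pose: np.random.rand(64, 64))
print(best, [round(g, 2) for g in gains])
```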
Poster
Dongrui Dai · Yuxiang Xing
[ ExHall D ]
Abstract
3D Gaussian splatting (3DGS) has shown impressive performance in 3D scene reconstruction. However, it suffers from severe degradation when the number of training views is limited, resulting in blur and floaters. Many works have been devoted to standardizing the optimization process of 3DGS through regularization techniques. However, we identify that inadequate initialization is a critical issue overlooked by current studies. To address this, we propose EAP-GS, a method to enhance initialization for fast, accurate, and stable few-shot scene reconstruction. Specifically, we introduce an Attentional Pointcloud Augmentation (APA) technique, which retains two-view tracks as an option for pointcloud generation. Additionally, the scene complexity is used to determine the required density distribution, thereby constructing a better pointcloud. We implemented APA by extending Structure-From-Motion (SFM) to focus on pointcloud generation in regions with complex structure but sparse pointcloud distribution, dramatically increasing the number of valuable points and effectively harmonizing the density distribution. A better pointcloud leads to more accurate scene geometry and mitigates local overfitting during the reconstruction stage. Experimental results on forward-facing datasets from various indoor and outdoor scenes demonstrate that the proposed EAP-GS achieves outstanding scene reconstruction performance and surpasses state-of-the-art methods.
Poster
Guoyu Lu
[ ExHall D ]
Abstract
Scene reconstruction has a wide range of applications in computer vision and robotics. To build practical constraints and feature correspondences, rich textures and distinguished gradient variations are particularly required in classic and learning-based SfM. When reconstructing low-texture regions with repeated patterns, especially mostly white indoor rooms, there is a significant drop in performance. In this work, we propose Shading-SfM-Net (Shading and Structure-from-Motion network), a novel framework for simultaneously learning a shape-from-shading network based on the inverse rendering constraint and a structure-from-motion framework based on warped keypoint and geometric consistency, to improve structure-from-motion and surface reconstruction for low-texture indoor scenes. Shading-SfM-Net tightly incorporates the surface shape consistency and 3D geometric registration loss in order to dig into their mutual information and further overcome the instability on flat regions. We evaluate the proposed framework on texture-less indoor scenes (NYU v2 and ScanNet), and show that for each individual network without simultaneous training, our method is able to achieve comparable results to state-of-the-art methods. By simultaneously learning shape and motion from the two networks, our pipeline is able to achieve state-of-the-art performance with superior generalization capability for unseen texture-less datasets.
Poster
Jinbo Yan · Rui Peng · Zhiyan Wang · Luyang Tang · Jiayu Yang · Jie Liang · Jiahao Wu · Ronggang Wang
[ ExHall D ]
Abstract
Building free-viewpoint videos in a streaming manner offers the advantage of rapid responsiveness compared to offline training methods, greatly enhancing user experience. However, current streaming approaches face challenges of high per-frame reconstruction time (10s+) and error accumulation, limiting their broader application. In this paper, we propose Instant Gaussian Stream (IGS), a fast and generalizable streaming framework, to address these issues. First, we introduce a generalized Anchor-driven Gaussian Motion Network, which projects multi-view 2D motion features into 3D space, using anchor points to drive the motion of all Gaussians. This generalized network generates the motion of Gaussians for each target frame in the time required for a single inference. Second, we propose a Key-frame-guided Streaming Strategy that refines each key frame, enabling accurate reconstruction of temporally complex scenes while mitigating error accumulation. We conducted extensive in-domain and cross-domain evaluations, demonstrating that our approach achieves streaming with an average per-frame reconstruction time of 2s+, alongside an enhancement in view synthesis quality.
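To illustrate the anchor-driven idea, the sketch below moves every Gaussian by an inverse-distance-weighted blend of the motions of its nearest anchors; this generic scattered-data interpolation stands in for the paper's learned motion network, and the value of k and the weighting scheme are assumptions.

```python
import numpy as np

def propagate_motion(gaussian_xyz, anchor_xyz, anchor_motion, k=4):
    """Interpolate per-Gaussian displacements from anchor displacements using
    inverse-distance weights over the k nearest anchors."""
    d = np.linalg.norm(gaussian_xyz[:, None, :] - anchor_xyz[None, :, :], axis=-1)
    nn = np.argsort(d, axis=1)[:, :k]                       # (N, k) anchor indices
    w = 1.0 / (np.take_along_axis(d, nn, axis=1) + 1e-6)    # (N, k) weights
    w = w / w.sum(axis=1, keepdims=True)
    return gaussian_xyz + (w[..., None] * anchor_motion[nn]).sum(axis=1)

# Toy usage: 1,000 Gaussians driven by 32 anchors.
xyz, anchors = np.random.rand(1000, 3), np.random.rand(32, 3)
moved = propagate_motion(xyz, anchors, np.random.randn(32, 3) * 0.01)
print(moved.shape)  # (1000, 3)
```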
Poster
Yiren Lu · Yunlai Zhou · Disheng Liu · tuo liang · Yu Yin
[ ExHall D ]
Abstract
3D Gaussian Splatting (3DGS) has shown remarkable potential for static scene reconstruction, and recent advancements have extended its application to dynamic scenes. However, the quality of reconstructions depends heavily on high-quality input images and precise camera poses, which are not trivial to obtain in real-world scenarios. Capturing dynamic scenes with handheld monocular cameras, for instance, typically involves simultaneous movement of both the camera and objects within a single exposure. This combined motion frequently results in image blur that existing methods cannot adequately handle. To address these challenges, we introduce BARD-GS, a novel approach for robust dynamic scene reconstruction that effectively handles blurry inputs and imprecise camera poses. Our method comprises two main components: 1) camera motion deblurring and 2) object motion deblurring. By explicitly decomposing motion blur into camera motion blur and object motion blur and modeling them separately, we achieve significantly improved rendering results in dynamic regions. In addition, we collect a real-world motion blur dataset of dynamic scenes to evaluate our approach. Extensive experiments demonstrate that BARD-GS effectively reconstructs high-quality dynamic scenes under realistic conditions, significantly outperforming existing methods.
Poster
Chengwei Zheng · Lixin Xue · Juan Jose Zarate · Jie Song
[ ExHall D ]
Abstract
3D Gaussian Splatting techniques have enabled efficient photo-realistic rendering of static scenes. Recent works have extended these approaches to support surface reconstruction and tracking. However, tracking dynamic surfaces with 3D Gaussians remains challenging due to complex topology changes, such as surfaces appearing, disappearing, or splitting. To address these challenges, we propose GSTAR, a novel method that achieves photo-realistic rendering, accurate surface reconstruction, and reliable 3D tracking for general dynamic scenes with changing topology. Given multi-view captures as input, GSTAR binds Gaussians to mesh faces to represent dynamic objects. For surfaces with consistent topology, GSTAR maintains the mesh topology and tracks the meshes using Gaussians. In regions where topology changes, GSTAR adaptively unbinds Gaussians from the mesh, enabling accurate registration and the generation of new surfaces based on these optimized Gaussians. Additionally, we introduce a surface-based scene flow method that provides robust initialization for tracking between frames. Experiments demonstrate that our method effectively tracks and reconstructs dynamic surfaces, enabling a range of applications. We will release our implementation to facilitate future research.
Poster
Sotiris Nousias · Mian Wei · Howard Xiao · Maxx Wu · Shahmeer Athar · Kevin J Wang · Anagh Malik · David A. Barmherzig · David B. Lindell · Kiriakos Kutulakos
[ ExHall D ]
Abstract
Scattered light from pulsed lasers is increasingly part of our ambient illumination, as many devices rely on them for active 3D sensing. In this work, we ask: can these “ambient” light signals be detected and leveraged for passive 3D vision? We show that pulsed lasers, despite being weak and fluctuating at MHz to GHz frequencies, leave a distinctive sinc comb pattern in the temporal frequency domain of incident flux that is specific to each laser and invariant to the scene. This enables their passive detection and analysis with a free-running SPAD camera, even when they are unknown, asynchronous, out of sight, and emitting concurrently. We show how to synchronize with such lasers computationally, characterize their pulse emissions, separate their contributions, and—if many are present—localize them in 3D and recover a depth map of the camera’s field of view. We use our camera prototype to demonstrate (1) a first-of-its-kind visualization of asynchronously propagating light pulses from multiple lasers through the same scene, (2) passive estimation of a laser’s MHz-scale pulse repetition frequency with mHz precision, and (3) mm-scale 3D imaging over room-scale distances by passively harvesting photons from two or more out-of-view lasers.
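The frequency-domain signature described above can be illustrated with a simple simulation: histogram photon timestamps into fine bins, take an FFT, and read off the lowest strong comb line as the pulse repetition frequency. This single-laser toy (with assumed bin width and thresholds) glosses over the detector model, multi-laser separation, and the mHz-precision estimation reported in the paper.

```python
import numpy as np

def estimate_prf(timestamps, t_total, bin_s=1e-9, f_min=1e6):
    """Estimate a pulse repetition frequency from photon arrival times by
    locating the fundamental of the comb in the flux spectrum."""
    n_bins = int(round(t_total / bin_s))
    flux, _ = np.histogram(timestamps, bins=n_bins, range=(0.0, t_total))
    spectrum = np.abs(np.fft.rfft(flux - flux.mean()))
    freqs = np.fft.rfftfreq(n_bins, d=bin_s)
    valid = freqs >= f_min
    strong = spectrum[valid] > 0.5 * spectrum[valid].max()
    return freqs[valid][np.argmax(strong)]       # first strong line = fundamental

# Toy usage: a 10 MHz pulse train with 20 ps timing jitter, observed for 1 ms.
prf_true = 10e6
t = np.arange(0.0, 1e-3, 1.0 / prf_true)
t = t + np.random.normal(0.0, 2e-11, size=t.size)
print(estimate_prf(t, t_total=1e-3))  # ~1.0e7 Hz
```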
Poster
Zhengxian Yang · Shi Pan · Shengqi Wang · Haoxiang Wang · Li Lin · Guanjun Li · Zhengqi Wen · Borong Lin · Jianhua Tao · Tao Yu
[ ExHall D ]
Abstract
User engagement is greatly enhanced by fully immersive multimodal experiences that combine visual and auditory stimuli. Consequently, the next frontier in VR/AR technologies lies in immersive volumetric videos with complete scene capture, a large 6-DoF interactive space, multi-modal feedback, and high-resolution, high-frame-rate content. To stimulate the reconstruction of immersive volumetric videos, we introduce **ImViD**, a multi-view, multi-modal dataset featuring complete space-oriented data capture and various indoor/outdoor scenarios. Our capture rig supports multi-view video-audio capture while on the move, a capability absent in existing datasets, which significantly enhances the completeness, flexibility, and efficiency of data capture. The captured multi-view videos (with synchronized audio) are in 5K resolution at 60 FPS, last from 1-5 minutes, and include rich foreground-background elements and complex dynamics. We benchmark existing methods using our dataset and establish a base pipeline for constructing immersive volumetric videos from multi-view audiovisual inputs for 6-DoF multimodal immersive VR experiences. The benchmark and the reconstruction and interaction results demonstrate the effectiveness of our dataset and baseline method, which we believe will stimulate future research on immersive volumetric video production.
Poster
Peter Kulits · Michael J. Black · Silvia Zuffi
[ ExHall D ]
Abstract
The idea of 3D reconstruction as scene understanding is foundational in computer vision. Reconstructing 3D scenes from 2D visual observations requires strong priors to disambiguate structure. Much work has been focused on the anthropocentric, which, characterized by smooth surfaces, coherent normals, and regular edges, allows for the integration of strong geometric inductive biases. Here we consider a more challenging problem where such assumptions do not hold: the reconstruction of natural scenes composed of trees, bushes, boulders, and animals. While numerous works have attempted to tackle the problem of reconstructing animals in the wild, they have focused solely on the animal, neglecting important environmental context. This limits their usefulness for analysis tasks, as animals inherently exist within the 3D world, and information is lost when environmental factors are disregarded. We propose a method to reconstruct a natural scene from a single image. We base our approach on recent advances leveraging the strong world priors ingrained in Large Language Models, and train an autoregressive model to decode a CLIP embedding into a structured compositional scene representation, encompassing both animals and the wild (RAW). To enable this, we propose a synthetic dataset comprising one-million images and thousands of assets. Our approach, trained exclusively …
Poster
Muhammad Hamza Mughal · Rishabh Dabral · Merel CJ Scholman · Vera Demberg · Christian Theobalt
[ ExHall D ]
Abstract
Non-verbal communication often comprises semantically rich gestures that help convey the meaning of an utterance. Producing such semantic co-speech gestures has been a major challenge for existing neural systems, which can generate rhythmic beat gestures but struggle to produce semantically meaningful ones. Therefore, we present RAG-Gesture, a diffusion-based gesture generation approach that leverages Retrieval Augmented Generation (RAG) to produce natural-looking and semantically rich gestures. Our neuro-explicit gesture generation approach is designed to produce semantic gestures grounded in interpretable linguistic knowledge. We achieve this by using explicit domain knowledge to retrieve exemplar motions from a database of co-speech gestures. Once retrieved, we then inject these semantic exemplar gestures into our diffusion-based gesture generation pipeline using DDIM inversion and retrieval guidance at inference time, without any need for training. Further, we propose a control paradigm for guidance, which allows users to modulate the amount of influence each retrieval insertion has over the generated sequence. Our comparative evaluations demonstrate the validity of our approach against recent gesture generation approaches. The reader is urged to watch the supplementary video.
Poster
Suhyun Shin · Seungwoo Yoon · Ryota Maeda · Seung-Hwan Baek
[ ExHall D ]
Abstract
Hyperspectral 3D imaging captures both depth maps and hyperspectral images, enabling comprehensive geometric and material analysis. Recent methods achieve high spectral and depth accuracy; however, they require long acquisition times—often over several minutes—or rely on large, expensive systems, restricting their use to static scenes. We present Dense Dispersed Structured Light (DDSL), an accurate hyperspectral 3D imaging method for dynamic scenes that utilizes stereo RGB cameras and an RGB projector equipped with an affordable diffraction grating film. We design spectrally multiplexed DDSL patterns that significantly reduce the number of required projector patterns, thereby accelerating acquisition speed. Additionally, we formulate an image formation model and a reconstruction method to estimate a hyperspectral image and depth map from captured stereo images. As the first practical and accurate hyperspectral 3D imaging method for dynamic scenes, we experimentally demonstrate that DDSL achieves a spectral resolution of 15.5 nm full width at half maximum (FWHM), a depth error of 4 mm, and a frame rate of 6.6 fps.
Poster
Jongsung Lee · HARIN PARK · Byeong-Uk Lee · Kyungdon Joo
[ ExHall D ]
Abstract
Motivated by the efficiency of spherical harmonics (SH) in representing various physical phenomena, we propose a Holistic panoramic 3D scene Understanding framework using Spherical Harmonics, dubbed as HUSH. Our approach focuses on a unified framework adaptable to various 3D scene understanding tasks via SH bases. To achieve this, we first estimate SH coefficients, allowing for the adaptive configuration of the SH bases specific to each scene. HUSH then employs a hierarchical attention module that uses SH bases as queries to generate comprehensive scene features by integrating these scene-adaptive SH bases with image features. Additionally, we introduce an SH basis index module that adaptively emphasizes relevant SH bases to produce task-relevant features, enhancing the versatility of HUSH across different scene understanding tasks. Finally, by combining the scene features with task-relevant features in the task-specific heads, we perform various scene understanding tasks, including depth, surface normal and room layout estimation. Experiments demonstrate that HUSH achieves state-of-the-art performance on depth estimation benchmarks, highlighting the robustness and scalability of using SH in panoramic 3D scene understanding.
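For readers new to SH-based representations, the snippet below evaluates the first two bands of the real spherical-harmonic basis on unit directions; the scene-adaptive coefficient estimation, hierarchical attention, and basis-index modules of HUSH are of course not reproduced here.

```python
import numpy as np

def real_sh_basis_l1(dirs):
    """Real SH basis for bands l = 0 and l = 1, evaluated on unit vectors of
    shape (N, 3); returns an (N, 4) matrix of basis values."""
    x, y, z = dirs[:, 0], dirs[:, 1], dirs[:, 2]
    return np.stack([
        0.28209479 * np.ones_like(x),   # Y_0^0
        0.48860251 * y,                 # Y_1^{-1}
        0.48860251 * z,                 # Y_1^{0}
        0.48860251 * x,                 # Y_1^{1}
    ], axis=1)

# Toy usage on random unit directions.
d = np.random.randn(8, 3)
d = d / np.linalg.norm(d, axis=1, keepdims=True)
print(real_sh_basis_l1(d).shape)  # (8, 4)
```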
Poster
Kang Chen · Jiyuan Zhang · Zecheng Hao · Yajing Zheng · Tiejun Huang · Zhaofei Yu
[ ExHall D ]
Abstract
Spike cameras, an innovative type of neuromorphic camera that captures scenes as a 0-1 bit stream at 40 kHz, are increasingly employed for the 3D reconstruction task via Neural Radiance Fields (NeRF) or 3D Gaussian Splatting (3DGS). Previous spike-based 3D reconstruction approaches often employ a cascaded pipeline: starting with high-quality image reconstruction from spike streams based on established spike-to-image reconstruction algorithms, then progressing to camera pose estimation and 3D reconstruction. However, this cascaded approach suffers from substantial cumulative errors, where quality limitations of initial image reconstructions negatively impact pose estimation, ultimately degrading the fidelity of the 3D reconstruction. To address these issues, we propose a synergistic optimization framework, USP-Gaussian, that unifies spike-based image reconstruction, pose correction, and Gaussian splatting into an end-to-end framework. Leveraging the multi-view consistency afforded by 3DGS and the motion capture capability of the spike camera, our framework enables a joint iterative optimization that seamlessly integrates information between the spike-to-image network and 3DGS. Experiments on synthetic datasets with accurate poses demonstrate that our method surpasses previous approaches by effectively eliminating cascading errors. Moreover, we integrate pose optimization to achieve robust 3D reconstruction in real-world scenarios with inaccurate initial poses, outperforming alternative methods by effectively reducing noise and preserving …
Poster
Xuan Zhu · Jijun Xiang · Xianqi Wang · Longliang Liu · Yu Wang · Hong Zhang · Fei Guo · Xin Yang
[ ExHall D ]
Abstract
Lightweight direct Time-of-Flight (dToF) sensors are ideal for 3D sensing on mobile devices. However, due to the manufacturing constraints of compact devices and the inherent physical principles of imaging, dToF depth maps are sparse and noisy. In this paper, we propose a novel video depth completion method, called SVDC, that fuses the sparse dToF data with the corresponding RGB guidance. Our method employs a multi-frame fusion scheme to mitigate the spatial ambiguity resulting from the sparse dToF imaging. Misalignment between consecutive frames during multi-frame fusion could cause blending between object edges and the background, which results in a loss of detail. To address this, we introduce an adaptive frequency selective fusion (AFSF) module, which automatically selects convolution kernel sizes to fuse multi-frame features. Our AFSF utilizes a channel-spatial enhancement attention (CSEA) module to enhance features and generates an attention map as fusion weights. The AFSF ensures edge detail recovery while suppressing high-frequency noise in smooth regions. To further enhance temporal consistency, we propose a cross-window consistency loss to ensure consistent predictions across different windows, effectively reducing flickering. Our proposed SVDC achieves optimal accuracy and consistency on the TartanAir and Dynamic Replica datasets.
Poster
Nisha Varghese · A. N. Rajagopalan
[ ExHall D ]
Abstract
Underwater (UW) robotics applications require depth and restored images simultaneously in real-time, irrespective of whether the UW images are captured in good lighting conditions or not. Most of the UW image restoration and depth estimation methods have been devised for images under normal lighting. Consequently, they struggle to perform on poorly lit images. Even though artificial illumination can be used when there is insufficient ambient light, it can introduce non-uniform lighting artifacts in the restored images. Hence, the recovery of depth and restored images directly from Low-Light UW (LLUW) images is a critical requirement in marine applications. While a few works have attempted LLUW image restoration, there are no reported works on joint recovery of depth and clean image from LLUW images. We propose a Self-supervised Low-light Underwater Image and Depth recovery network (SelfLUID-Net) for joint estimation of depth and restored image in real-time from a single LLUW image. We have collected an Underwater Low-light Stereo Video (ULVStereo) dataset which is the first-ever UW dataset with stereo pairs of low-light and normally-lit UW images. For the dual tasks of image and depth recovery from a LLUW image, we effectively utilize the stereo data from ULVStereo that provides cues for both …
Poster
Jingyi Zhou · Peng Ye · Haoyu Zhang · Jiakang Yuan · Rao Qiang · Liu YangChenXu · Wu Cailin · Feng Xu · Tao Chen
[ ExHall D ]
Abstract
Iterative-based methods have become mainstream in stereo matching due to their high performance. However, these methods heavily rely on labeled data and face challenges with unlabeled real-world data. To this end, we propose a consistency-aware self-training framework for iterative-based stereo matching for the first time, leveraging real-world unlabeled data in a teacher-student manner. We first observe that regions with larger errors tend to exhibit more pronounced oscillation characteristics during model prediction. Based on this, we introduce a novel consistency-aware soft filtering module to evaluate the reliability of teacher-predicted pseudo-labels, which consists of a multi-resolution prediction consistency filter and an iterative prediction consistency filter to assess the prediction fluctuations of multiple resolutions and iterative optimization, respectively. Further, we introduce a consistency-aware soft-weighted loss to adjust the weight of pseudo-labels accordingly, relieving the error accumulation and performance degradation caused by incorrect pseudo-labels. Extensive experiments demonstrate that our method can improve the performance of various iterative-based stereo matching approaches in various scenarios. In particular, our method achieves further enhancements over current SOTA methods on several benchmark datasets.
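As a rough sketch of turning prediction fluctuation into soft pseudo-label weights, the snippet below measures per-pixel oscillation across refinement iterations and converts it into weights for an L1 loss; the paper's two dedicated filters (multi-resolution and iterative) and its exact weighting are not reproduced, and `sigma` is an assumed temperature.

```python
import torch

def consistency_weights(disp_iterations, sigma=1.0):
    """Map the per-pixel standard deviation of teacher disparities across
    iterations (list of (B, H, W) tensors) to soft weights in (0, 1]:
    stable pixels -> ~1, oscillating pixels -> ~0."""
    preds = torch.stack(disp_iterations, dim=0)     # (T, B, H, W)
    fluctuation = preds.std(dim=0)
    return torch.exp(-fluctuation / sigma)

def soft_weighted_l1(student_disp, pseudo_disp, weights):
    return (weights * (student_disp - pseudo_disp).abs()).mean()

# Toy usage with random tensors standing in for teacher iterations.
iters = [torch.rand(2, 32, 64) for _ in range(5)]
w = consistency_weights(iters)
loss = soft_weighted_l1(torch.rand(2, 32, 64), iters[-1], w)
print(loss.item())
```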
Poster
Yuzheng Liu · Siyan Dong · Shuzhe Wang · Yingda Yin · Yanchao Yang · Qingnan Fan · Baoquan Chen
[ ExHall D ]
Abstract
In this paper, we introduce SLAM3R, a novel and effective monocular RGB SLAM system for real-time and high-quality dense 3D reconstruction. SLAM3R provides an end-to-end solution by seamlessly integrating local 3D reconstruction and global coordinate registration through feed-forward neural networks. Given a video input, the system first converts it into overlapping clips using a sliding window mechanism. Unlike traditional pose optimization-based methods, SLAM3R directly regresses 3D pointmaps from RGB images and then progressively aligns and deforms these local pointmaps to create a globally consistent scene reconstruction - all without explicitly solving any camera parameters. Experiments across datasets consistently show that SLAM3R achieves state-of-the-art reconstruction accuracy and completeness while maintaining real-time performance at 20+ FPS. Upon acceptance, we will release our code to support further research.
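The front-end windowing mentioned above is simple enough to sketch directly; the window length and stride below are arbitrary assumptions rather than SLAM3R's settings.

```python
def sliding_window_clips(n_frames, window=11, stride=5):
    """Split a video of n_frames into overlapping clips of frame indices,
    appending one extra clip if needed so the tail is covered."""
    assert n_frames >= window
    starts = range(0, n_frames - window + 1, stride)
    clips = [list(range(s, s + window)) for s in starts]
    if clips[-1][-1] != n_frames - 1:
        clips.append(list(range(n_frames - window, n_frames)))
    return clips

print(sliding_window_clips(30, window=11, stride=5)[:2])
# first clip covers frames 0-10, second covers frames 5-15, and so on
```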
Poster
Diankun Wu · Fangfu Liu · Yi-Hsin Hung · Yue Qian · Xiaohang Zhan · Yueqi Duan
[ ExHall D ]
Abstract
4D reconstruction from a single monocular video is an important but challenging task due to its inherent under-constrained nature. While most existing 4D reconstruction methods focus on multi-camera settings, they always suffer from limited multi-view information in monocular videos. Recent studies have attempted to mitigate the ill-posed problem by incorporating data-driven priors as additional supervision. However, they require hours of optimization to align the splatted 2D feature maps of explicit Gaussians with various priors, which limits the range of applications. To address the time-consuming issue, we propose **4D-Fly**, an efficient and effective framework for reconstructing the 4D scene from a monocular video (hundreds of frames within 6 minutes), more than **20× faster** and even achieving higher quality than previous optimization methods. Our key insight is to unleash the explicit property of Gaussian primitives and directly apply data priors to them. Specifically, we build a streaming 4D reconstruction paradigm that includes: propagating existing Gaussians to the next timestep with an anchor-based strategy, expanding the 4D scene map with the canonical Gaussian map, and an efficient 4D scene optimization process to further improve visual quality and motion accuracy. Extensive experiments demonstrate the superiority of our 4D-Fly over state-of-the-art methods in terms …
Poster
Juan Carlos Dibene Simental · Enrique Dunn
[ ExHall D ]
Abstract
We present a marker-based geometric estimation framework for the absolute pose of a camera by analyzing the 1D observations in a single radially distorted pixel scanline. We leverage a pair of known co-planar pencils of lines, along with lens distortion parameters, to propose an ensemble of solvers exploring the space of estimation strategies applicable to our setup. First, we present a minimal algebraic solver requiring only six measurements and yielding eight solutions, which relies on the intersection of two conics defined by one of the pencils of lines. Then, we present a unique closed-form geometric solver from seven measurements. Finally, we present a homography-based formulation amenable to linear least-squares from eight or more measurements. Our geometric framework constitutes a theoretical analysis of the minimum geometric context necessary to solve in closed form for the absolute pose of a single camera from a single radially distorted scanline.
Poster
Andrea Porfiri Dal Cin · Georgi Dikov · Jihong Ju · Mohsen Ghafoorian
[ ExHall D ]
Abstract
Current learning-based Structure-from-Motion (SfM) methods struggle with videos of dynamic scenes from wide-angle cameras. We present AnyMap, a differentiable SfM framework that jointly addresses image distortion and motion estimation. By learning a general implicit camera model without predefined parameters, AnyMap effectively handles lens distortion, estimating multi-view consistent 3D geometry, camera poses, and (un)projection functions. To resolve the ambiguity where motion estimation can compensate for undistortion errors and vice versa, we introduce a low-dimensional motion representation consisting of a set of learnable basis trajectories, interpolated to produce regularized motion estimates. Experimental results show that our method produces accurate camera poses, excels in camera calibration and image rectification, and enables high-quality novel view synthesis. Our low-dimensional motion representation effectively disentangles undistortion from motion estimation, outperforming existing methods.
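The low-dimensional motion representation can be illustrated with a fixed DCT basis standing in for the learnable basis trajectories: each trajectory becomes a handful of coefficients rather than a free value per frame. The basis type and sizes here are assumptions, not AnyMap's learned basis.

```python
import numpy as np

def dct_trajectory_basis(n_frames, n_basis):
    """Columns are smooth basis trajectories; any point trajectory is modeled
    as basis @ coefficients."""
    t = (np.arange(n_frames) + 0.5) / n_frames
    k = np.arange(n_basis)
    return np.cos(np.pi * np.outer(t, k))          # (n_frames, n_basis)

B = dct_trajectory_basis(n_frames=60, n_basis=8)
coeffs = np.random.randn(8, 3) * 0.05              # low-dimensional 3D motion
trajectory = B @ coeffs                            # (60, 3) regularized trajectory
print(trajectory.shape)
```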
Poster
Junchen Yu · Si-Yuan Cao · Runmin Zhang · Chenghao Zhang · Zhu Yu · Shujie Chen · Bailin Yang · Hui-Liang Shen
[ ExHall D ]
Abstract
We propose a novel unsupervised cross-modal homography estimation learning framework, named Split Supervised Homography estimation Network (SSHNet). SSHNet redefines unsupervised cross-modal homography estimation as two supervised sub-problems, each addressed by its specialized network: a homography estimation network and a modality transfer network. To realize stable training, we introduce an effective split optimization strategy to train each network separately within its respective sub-problem. We also formulate an extra homography feature space supervision to enhance feature consistency, further boosting the estimation accuracy. Moreover, we employ a simple yet effective distillation training technique to reduce model parameters and improve cross-domain generalization ability while maintaining comparable performance. The training stability of SSHNet enables its cooperation with various homography estimation architectures. Experiments reveal that SSHNet using IHN as the homography estimation network, namely SSHNet-IHN, outperforms previous unsupervised approaches by a significant margin. Even compared to the supervised approaches MHN and LocalTrans, SSHNet-IHN achieves 47.4% and 85.8% reductions in mean average corner error (MACE) on the challenging OPT-SAR dataset. The source code is provided in the supplementary material.
Poster
Riku Murai · Eric Dexheimer · Andrew J. Davison
[ ExHall D ]
Abstract
We present a real-time monocular dense SLAM system designed bottom-up from MASt3R, a two-view 3D reconstruction and matching prior. Equipped with this strong prior, our system is robust on in-the-wild video sequences despite making no assumption on a fixed or parametric camera model beyond a unique camera centre. We introduce efficient methods for pointmap matching, camera tracking and local fusion, graph construction and loop closure, and second-order global optimisation. With known calibration, a simple modification to the system achieves state-of-the-art performance across various benchmarks. Altogether, we propose a plug-and-play monocular SLAM system capable of producing globally-consistent poses and dense geometry while operating at 15 FPS.
Poster
Yifan Yu · Shaohui Liu · Rémi Pautrat · Marc Pollefeys · Viktor Larsson
[ ExHall D ]
Abstract
Monocular depth estimation (MDE) models have undergone significant advancements over recent years. Many MDE models aim to predict affine-invariant relative depth from monocular images, while recent developments in large-scale training and vision foundation models enable reasonable estimation of metric (absolute) depth. However, effectively leveraging these predictions for geometric vision tasks, in particular relative pose estimation, remains relatively underexplored. While depths provide rich constraints for cross-view image alignment, the intrinsic noise and ambiguity from the monocular depth priors present practical challenges to improving upon classic keypoint-based solutions. In this paper, we develop three solvers for relative pose estimation that explicitly account for independent affine (scale and shift) ambiguities, covering both calibrated and uncalibrated conditions. We further propose a hybrid estimation pipeline that combines our proposed solvers with classic point-based solvers and epipolar constraints. We find that the affine correction modeling is beneficial to not only the relative depth priors but also, surprisingly, the "metric" ones. Results across multiple datasets demonstrate large improvements of our approach over classic keypoint-based baselines and PnP-based solutions, under both calibrated and uncalibrated setups. We also show that our method improves consistently with different feature matchers and MDE models, and can further benefit from very recent …
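The affine ambiguity being modeled is that a monocular depth prediction relates to true depth through an unknown per-image scale and shift. The sketch below fits such an affine correction by ordinary least squares on synthetic data; the paper's solvers instead fold these unknowns into minimal relative-pose formulations, so this is illustrative only.

```python
import numpy as np

# True depth d relates to a relative prediction d_pred as d ~= a * d_pred + b,
# with a, b unknown per image. Given a few reliable reference depths, a and b
# can be recovered by least squares (names and the fit itself are illustrative).
def fit_affine_depth(d_pred, d_true):
    A = np.stack([d_pred, np.ones_like(d_pred)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, d_true, rcond=None)
    return a, b

d_true = np.random.uniform(1.0, 10.0, size=50)
d_pred = (d_true - 0.7) / 2.5 + np.random.normal(0, 0.01, size=50)  # scaled/shifted + noise
a, b = fit_affine_depth(d_pred, d_true)
print(a, b)  # approximately 2.5 and 0.7
```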
Poster
Felix Wimbauer · Weirong Chen · Dominik Muhle · Christian Rupprecht · Daniel Cremers
[ ExHall D ]
Abstract
Estimating camera motion and intrinsics from casual videos is a core challenge in computer vision. Traditional bundle-adjustment based methods, such as SfM and SLAM, struggle to perform reliably on arbitrary data. Although specialized SfM approaches have been developed for handling dynamic scenes, they either require intrinsics or computationally expensive test-time optimization and often fall short in performance. Recently, methods like Dust3r have reformulated the SfM problem in a more data-driven way. While such techniques show promising results, they are still 1) not robust to dynamic objects and 2) require labeled data for supervised training. As an alternative, we propose AnyCam, a fast transformer model that directly estimates camera poses and intrinsics from a dynamic video sequence in a feed-forward fashion. Our intuition is that such a network can learn strong priors over realistic camera motions. To scale up our training, we rely on an uncertainty-based loss formulation and pre-trained depth and flow networks instead of motion or trajectory supervision. This allows us to use diverse, unlabeled video datasets obtained mostly from YouTube. Additionally, we ensure that the predicted trajectory does not accumulate drift over time through a lightweight trajectory refinement step. We test AnyCam on established datasets, where it delivers accurate camera …
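One common way to realize an uncertainty-based loss over noisy supervision from pre-trained depth and flow networks is to weight residuals by a predicted uncertainty and penalize that uncertainty logarithmically. The snippet below shows this generic pattern; it is a hedged sketch and not necessarily AnyCam's exact formulation.

```python
import torch

# Generic uncertainty-weighted loss: residuals are down-weighted where the
# network predicts high uncertainty, while the log term keeps it from simply
# inflating uncertainty everywhere.
def uncertainty_loss(residual, log_sigma):
    return (residual.abs() * torch.exp(-log_sigma) + log_sigma).mean()

residual = torch.randn(4, 1, 64, 64)                     # e.g. flow reprojection error
log_sigma = torch.zeros(4, 1, 64, 64, requires_grad=True)  # predicted log-uncertainty
loss = uncertainty_loss(residual, log_sigma)
loss.backward()
```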
Poster
Yunxuan Li · Lei Fan · Xiaoying Xing · Jianxiong Zhou · Ying Wu
[ ExHall D ]
Abstract
Visual localization, the task of determining the position and orientation of a camera, typically involves three core components: offline construction of a keyframe database, efficient online keyframe retrieval, and robust local feature matching. However, significant challenges arise when there are large viewpoint disparities between the query view and the database, such as attempting localization in a corridor previously built from an opposing direction. Intuitively, this issue can be addressed by synthesizing a set of virtual keyframes that cover all viewpoints. However, existing methods for synthesizing novel views to assist localization often fail to ensure geometric accuracy under large viewpoint changes. In this paper, we introduce a confidence-aware geometric prior into 2D Gaussian splatting to ensure the geometric accuracy of the scene. Then we can render novel views through the mesh with clear structures and accurate geometry, even under significant viewpoint changes, enabling the synthesis of a comprehensive set of virtual keyframes. Incorporating this geometry-preserving virtual keyframe database into the localization pipeline significantly enhances the robustness of visual localization.
Poster
Siyan Dong · Shuzhe Wang · Shaohui Liu · Lulu Cai · Qingnan Fan · Juho Kannala · Yanchao Yang
[ ExHall D ]
Abstract
Visual localization aims to determine the camera pose of a query image relative to a database of posed images. In recent years, deep neural networks that directly regress camera poses have gained popularity due to their fast inference capabilities. However, existing methods struggle to either generalize well to new scenes or provide accurate camera pose estimates. To address these issues, we present Reloc3r, a simple yet effective visual localization framework. It consists of an elegantly designed relative pose regression network, and a minimalist motion averaging module for absolute pose estimation. Trained on approximately 8 million posed image pairs, Reloc3r achieves surprisingly good performance and generalization ability. We conduct extensive experiments on 6 public datasets, consistently demonstrating the effectiveness and efficiency of the proposed method. It provides high-quality camera pose estimates in real time and generalizes to novel scenes. Upon acceptance, we will make our code and training data publicly available.
Poster
Mi Luo · Zihui Xue · Alex Dimakis · Kristen Grauman
[ ExHall D ]
Abstract
Egocentric and exocentric perspectives of human action differ significantly, yet overcoming this extreme viewpoint gap is critical for applications in augmented reality and robotics. We propose ViewpointRosetta, an approach that unlocks large-scale unpaired ego and exo video data to learn clip-level viewpoint-invariant video representations. Our framework introduces (1) a diffusion-based Rosetta Stone Translator (RST), which, leveraging a moderate amount of synchronized multi-view videos, serves as a translator in feature space to decipher the alignments between unpaired ego and exo data, and (2) a dual encoder that aligns unpaired data representations through contrastive learning with RST-based synthetic feature augmentation and soft alignment. To evaluate the learned features in a standardized setting, we construct a new cross-view benchmark using Ego-Exo4D, covering cross-view retrieval, action recognition, and skill assessment. Our framework demonstrates superior cross-view understanding compared to previous view-invariant learning and egocentric video representation learning approaches, and opens the door to bringing vast amounts of traditional third-person video to bear on the more nascent first-person setting.
Poster
Alan Baade · Changan Chen
[ ExHall D ]
Abstract
Learning self-supervised visual correspondence is a long-studied task fundamental to visual understanding and human perception. However, existing correspondence methods largely focus on small image transformations, such as object tracking in high-framerate videos or learning pixel-to-pixel mappings between images with high view overlap. This severely limits their application in dynamic multi-view settings such as robot imitation learning or augmented reality. In this work, we introduce Predictive Cycle Consistency for learning object correspondence between extremely disjoint views of a scene without paired segmentation data. Our technique bootstraps object correspondence pseudolabels from raw image segmentations using conditional grayscale colorization and a cycle-consistency refinement prior. We then train deep ViTs on these pseudolabels, which we use to generate higher-quality pseudolabels and iteratively train better correspondence models. We demonstrate the performance of our method under both extreme in-the-wild camera view changes and across large temporal gaps in video. Our approach beats all prior supervised and prior SoTA self-supervised correspondence models on the EgoExo4D correspondence benchmark (+6.7 IoU Exo Query) and the prior SoTA self-supervised methods SiamMAE and DINO V1&V2 on the DAVIS-2017 and LVOS datasets across large frame gaps.
Poster
Ruojin Cai · Jason Y. Zhang · Philipp Henzler · Zhengqi Li · Noah Snavely · Ricardo Martin
[ ExHall D ]
Abstract
Pairwise pose estimation from images with little or no overlap is an open challenge in computer vision. Existing methods, even those trained on large-scale datasets, struggle in these scenarios due to the lack of identifiable correspondences or visual overlap. Inspired by the human ability to infer spatial relationships from diverse scenes, we propose a novel approach that leverages the rich priors encoded within pre-trained generative video models. We propose to use a video model to hallucinate intermediate frames between two input images, effectively creating a dense visual transition, which significantly simplifies the problem of pose estimation. Since current video models can still produce implausible motion or inconsistent geometry, we introduce a self-consistency score that evaluates the consistency of pose predictions from sampled videos. We demonstrate that our approach generalizes among three state-of-the-art video models and show consistent improvements over the state-of-the-art DUSt3R baseline on four diverse datasets encompassing indoor, outdoor, and object-centric scenes. Our findings suggest a promising avenue for improving pose estimation models by leveraging large generative models trained on vast amounts of video data, which is more readily available than 3D data.
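A self-consistency score of this kind can be as simple as measuring how much the poses estimated from independently sampled videos agree with one another. The sketch below scores a set of rotation hypotheses by their mean pairwise geodesic distance; the metric and aggregation are our own illustrative choices, not the paper's.

```python
import numpy as np

def rotation_geodesic(R1, R2):
    # Angle of the relative rotation between two 3x3 rotation matrices.
    cos = (np.trace(R1.T @ R2) - 1.0) / 2.0
    return np.arccos(np.clip(cos, -1.0, 1.0))

def self_consistency(rotations):
    # rotations: list of 3x3 matrices, one per sampled video.
    dists = [rotation_geodesic(Ri, Rj)
             for i, Ri in enumerate(rotations)
             for Rj in rotations[i + 1:]]
    return -np.mean(dists)  # higher score = more consistent predictions

Rs = [np.eye(3) for _ in range(4)]
print(self_consistency(Rs))  # 0.0: perfectly consistent hypotheses
```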
Poster
Sven Elflein · Qunjie Zhou · Laura Leal-Taixe
[ ExHall D ]
Abstract
We present Light3R-SfM, a feed-forward, end-to-end learnable framework for efficient large-scale Structure-from-Motion (SfM) from unconstrained image collections. Unlike existing SfM solutions that rely on costly matching and global optimization to achieve accurate 3D reconstructions, Light3R-SfM addresses this limitation through a novel latent global alignment module. This module replaces traditional global optimization with a learnable attention mechanism, effectively capturing multi-view constraints across images for robust and precise camera pose estimation. Light3R-SfM constructs a sparse scene graph via a retrieval-score-guided shortest-path tree to dramatically reduce memory usage and computational overhead compared to the naive approach. Extensive experiments demonstrate that Light3R-SfM achieves competitive accuracy while significantly reducing runtime, making it ideal for 3D reconstruction tasks in real-world applications with a runtime constraint. This work pioneers a data-driven, feed-forward SfM approach, paving the way toward scalable, accurate, and efficient 3D reconstruction in the wild.
Poster
Yuguang Li · Ivaylo Boyadzhiev · Zixuan Liu · Linda Shapiro · Alex Colburn
[ ExHall D ]
Abstract
Reconstructing precise camera poses and floor plan layouts from a set of wide-baseline RGB panoramas is a difficult and unsolved problem. We present BADGR, a novel diffusion model that performs both reconstruction and bundle adjustment (BA) optimization, refining camera poses and layouts from a given coarse state using 1D floor boundary information from dozens of images of varying input densities. Unlike a guided diffusion model, BADGR is conditioned on dense per-feature outputs from a single-step Levenberg-Marquardt (LM) optimizer and is trained to predict camera and wall positions while minimizing reprojection errors for view consistency. The layout-generation objective of the denoising diffusion process complements BA optimization by providing additional learned layout-structural constraints on top of the co-visible features across images. These constraints help BADGR make plausible guesses about spatial relations that constrain the pose graph, such as wall adjacency and collinearity, and to mitigate errors in dense boundary observations using global context. BADGR trains exclusively on 2D floor plans, simplifying data acquisition, enabling robust augmentation, and supporting a variety of input densities. Our experiments and analysis validate our method, which significantly outperforms the state of the art in pose and floor plan layout reconstruction across different input densities.
Poster
Chi Su · Xiaoxuan Ma · Jiajun Su · Yizhou Wang
[ ExHall D ]
Abstract
We propose a one-stage framework for real-time multi-person 3D human mesh estimation from a single RGB image. While current one-stage methods, which follow a DETR-style pipeline, achieve state-of-the-art (SOTA) performance with high-resolution inputs, we observe that this particularly benefits the estimation of individuals at smaller scales in the image (e.g., those far from the camera), but at the cost of significantly increased computation overhead. To address this, we introduce scale-adaptive tokens that are dynamically adjusted based on the relative scale of each individual in the image within the DETR framework. Specifically, individuals at smaller scales are processed at higher resolutions, larger ones at lower resolutions, and background regions are further distilled. These scale-adaptive tokens encode the image features more efficiently, facilitating subsequent decoding to regress the human mesh, while allowing the model to allocate computational resources more effectively and focus on more challenging cases. Experiments show that our method preserves the accuracy benefits of high-resolution processing while substantially reducing computational cost, achieving real-time inference with performance comparable to SOTA methods. Code and models will be publicly released.
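The scale-adaptive idea can be illustrated with a rule that assigns a finer token grid to small (far) people and a coarser one to large (near) people. The toy function below uses made-up thresholds and grid sizes purely for illustration; it is not the paper's tokenization.

```python
# Toy scale-adaptive tokenization: smaller people get a finer patch grid,
# larger people a coarser one. Thresholds and grid sizes are illustrative.
def tokens_for_person(bbox, image_area, grids=(32, 16, 8)):
    x0, y0, x1, y1 = bbox
    rel_scale = (x1 - x0) * (y1 - y0) / image_area
    if rel_scale < 0.01:   # small / far person: high-resolution tokens
        return grids[0]
    if rel_scale < 0.10:   # medium scale
        return grids[1]
    return grids[2]        # large / near person: low-resolution tokens suffice

print(tokens_for_person((10, 10, 40, 90), 1280 * 720))  # -> 32
```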
Poster
Hongwei Zheng · Han Li · Wenrui Dai · Ziyang Zheng · Chenglin Li · Junni Zou · Hongkai Xiong
[ ExHall D ]
Abstract
Existing 2D-to-3D human pose estimation (HPE) methods struggle with the occlusion issue by enriching information like temporal and visual cues in the lifting stage. In this paper, we argue that these methods ignore the limitation of the sparse skeleton 2D input representation, which fundamentally restricts the 2D-to-3D lifting and worsens the occlusion issue. To address these, we propose a novel two-stage generative densification method, named Hierarchical Pose AutoRegressive Transformer (HiPART), to generate hierarchical 2D dense poses from the original sparse 2D pose. Specifically, we first develop a multi-scale skeleton tokenization module to quantize the highly dense 2D pose into hierarchical tokens and propose a skeleton-aware alignment to strengthen token connections. We then develop a hierarchical autoregressive modeling scheme for hierarchical 2D pose generation. With generated hierarchical poses as inputs for 2D-to-3D lifting, the proposed method shows strong robustness in occluded scenarios and achieves state-of-the-art performance on the single-frame-based 3D HPE. Moreover, it outperforms numerous multi-frame methods while reducing parameter and computational complexity and can also complement them to further enhance performance and robustness.
Poster
Weijian Deng · Dylan Campbell · Chunyi Sun · Jiahao Zhang · Shubham Kanitkar · Matthew Shaffer · Stephen Gould
[ ExHall D ]
Abstract
Foundation models have significantly reduced the need for task-specific training, while also enhancing generalizability. However, state-of-the-art 6D pose estimators either require further training with pose supervision or neglect advances obtainable from 3D foundation models. The latter is a missed opportunity, since these models are better equipped to predict 3D-consistent features, which are of significant utility for the pose estimation task. To address this gap, we propose Pos3R, a method for estimating the 6D pose of any object from a single RGB image, making extensive use of a 3D reconstruction foundation model and requiring no additional training. We identify template selection as a particular bottleneck for existing methods that is significantly alleviated by the use of a 3D model, which can more easily distinguish between template poses than a 2D model. Despite its simplicity, Pos3R achieves competitive performance on the BOP benchmark across seven diverse datasets, matching or surpassing existing refinement-free methods. Additionally, Pos3R integrates seamlessly with render-and-compare refinement techniques, demonstrating adaptability for high-precision applications.
Poster
Tao Tan · Qiulei Dong
[ ExHall D ]
Abstract
Self-supervised 6D object pose estimation has received increasing attention in computer vision recently. Some typical works in the literature attempt to translate the synthetic images with object pose labels generated by object CAD models into the real domain, and then use the translated data for training. However, their performance is generally limited, since (i) there still exists a domain gap between the translated images and the real images and (ii) the translated images cannot sufficiently reflect occlusions that exist in many real images. To address these problems, we propose an Occlusion-Aware Neural Domain Adaptation method for self-supervised 6D object Pose estimation, called ONDA-Pose. The proposed method comprises three main steps. Firstly, by utilizing both the training real images without pose labels and a CAD model, we explore a CAD-like radiance field for rendering corresponding synthetic images that have similar textures to those generated by the CAD model. Then, a backbone pose estimator trained on the synthetic data is employed to provide initial pose estimations for the synthetic images rendered from the CAD-like radiance field, and the initial object poses are refined by a global object pose refiner to generate pseudo object pose labels. Finally, the backbone pose estimator is further …
Poster
Junning Qiu · Minglei Lu · Fei Wang · Yu Guo · Yonggen Ling
[ ExHall D ]
Abstract
Stereo-based category-level shape and 6D pose estimation methods have the potential to generalize to a wider range of materials than RGB-D methods, which often suffer from depth measurement errors. However, without explicit depth from two views, parameters to be estimated can become inherently entangled, negatively impacting performance. To address this, we propose a method that leverages global stereo consistency to constrain optimization directions and mitigate parameter entanglement. We first estimate an intra-category occupancy field to represent a unified shape across views, ensuring consistency and preventing shape ambiguity. Through a divide-and-conquer approach within global shape fitting, we fit this shape to stereo images to obtain the pose, iteratively rendering normalized depth maps and exchanging information across views. This approach improves convergence toward the correct pose and scale. We validated our method on both depth-friendly and depth-challenging materials using our S-RGBD dataset and the TOD benchmark. Our method surpasses RGBD methods on challenging objects and performs comparably on depth-friendly ones. Ablation studies confirm the effectiveness of each component.
Poster
Li Jin · Yujie Wang · Wenzheng Chen · Qiyu Dai · Qingzhe Gao · Xueying Qin · Baoquan Chen
[ ExHall D ]
Abstract
3D object canonicalization is a fundamental task, essential for a variety of downstream tasks. Existing methods rely on either cumbersome manual processes or priors learned from extensive, per-category training samples. Real-world datasets, however, often exhibit long-tail distributions, challenging existing learning-based methods, especially in categories with limited samples. We address this by introducing the first one-shot category-level object canonicalization framework, requiring only a single canonical model as a reference (the "prior model") for each category. To canonicalize any object, our framework first extracts semantic cues with large language models (LLMs) and vision-language models (VLMs) to establish correspondences with the prior model. We introduce a novel loss function to enforce geometric and semantic consistency, aligning object orientations precisely despite significant shape variations. Moreover, we adopt a support-plane strategy to reduce search space for initial poses and utilize a semantic relationship map to select the canonical pose from multiple hypotheses. Extensive experiments on multiple datasets demonstrate that our framework achieves state-of-the-art performance and validate key design choices. Using our framework, we create the Canonical Objaverse Dataset (COD), canonicalizing 33K samples in the Objaverse-LVIS dataset, underscoring the effectiveness of our framework on handling large-scale datasets.
Poster
Zixuan Huang · Mark Boss · Aaryaman Vasishta · James Rehg · Varun Jampani
[ ExHall D ]
Abstract
We study the problem of single-image 3D object reconstruction. Recent works have diverged into two directions: regression-based modeling and generative modeling. Regression methods efficiently infer visible surfaces, but struggle with occluded regions. Generative methods handle uncertain regions better by modeling distributions, but are computationally expensive and the generation is often misaligned with visible surfaces. In this paper, we present SPAR3D, a novel two-stage approach aiming to take the best of both directions. The first stage of SPAR3D generates sparse 3D point clouds using a lightweight point diffusion model, which has a fast sampling speed. The second stage uses both the sampled point cloud and the input image to create highly detailed meshes. Our two-stage design enables a probabilistic modeling of the ill-posed single-image 3D task, while maintaining high computational efficiency and great output fidelity. Using point clouds as an intermediate representation further allows for interactive user edits. Evaluated on diverse datasets, SPAR3D demonstrates superior performance over previous state-of-the-art methods, at an inference speed of 0.7 seconds.
Poster
Wenrui Cai · Qingjie Liu · Yunhong Wang
[ ExHall D ]
Abstract
Most current state-of-the-art trackers adopt a one-stream paradigm, using a single Vision Transformer backbone for joint feature extraction and relation modeling of template and search region images. However, relation modeling between different image patches exhibits significant variations. For instance, background patches require restricted modeling participation, while foreground patches, particularly boundary areas, need to be emphasized. A single model may not effectively handle all kinds of relation modeling simultaneously. In this paper, we propose a novel tracker called SPMTrack based on a mixture-of-experts tailored for the visual tracking task (TMoE), combining the capability of multiple experts to handle diverse relation modeling more flexibly. Benefiting from TMoE, we extend relation modeling from image pairs to spatio-temporal context, further improving tracking accuracy with minimal increase in model parameters. Moreover, we employ TMoE as a parameter-efficient fine-tuning method, substantially reducing trainable parameters, which enables us to train SPMTrack of varying scales efficiently and preserve the generalization ability of pretrained models to achieve superior performance. We conduct experiments on seven datasets, and experimental results demonstrate that our method significantly outperforms current state-of-the-art trackers. The source code will be released for further research.
Poster
Haolin Qin · Tingfa Xu · Tianhao Li · Zhenxiang Chen · Tao Feng · Jianan Li
[ ExHall D ]
Abstract
UAV tracking faces significant challenges in real-world scenarios, such as small-size targets, complex backgrounds, and occlusions, which limit the performance of RGB-based trackers. Multispectral images (MSI), which capture additional spectral information, offer a promising solution to these challenges. However, progress in this area has been hindered by the lack of relevant datasets. To address this gap, we introduce the first large-scale dataset for Multispectral UAV Single Object Tracking (MUST), which includes 250 video sequences spanning diverse environments and challenging scenarios, providing a comprehensive data foundation for multispectral UAV tracking. We also propose a novel tracking framework, UNTrack, which integrates spectral, spatial, and temporal features using spectrum prompts, initial templates, and sequential searches. UNTrack employs an asymmetric transformer with a spectral background elimination mechanism for optimal relationship modeling and an encoder that continuously updates the spectrum prompt to refine tracking, improving both accuracy and efficiency. Extensive experiments show that UNTrack outperforms state-of-the-art UAV trackers. We believe our dataset and framework will drive future research in this area.
Poster
Bangyan Liao · Zhenjun Zhao · Haoang Li · Yi Zhou · Yingping Zeng · Hao Li · Peidong Liu
[ ExHall D ]
Abstract
Determining the vanishing points (VPs) in a Manhattan world, as a fundamental task in many 3D vision applications, consists of jointly inferring the line-VP association and locating each VP. Existing methods are, however, either sub-optimal solvers or pursuing global optimality at a significant cost of computing time. In contrast to prior works, we introduce convex relaxation techniques to solve this task for the first time. Specifically, we employ a “soft” association scheme, realized via a truncated multi-selection error, that allows for joint estimation of VPs’ locations and line-VP associations. This approach leads to a primal problem that can be reformulated into a quadratically constrained quadratic programming (QCQP) problem, which is then relaxed into a convex semidefinite programming (SDP) problem. To solve this SDP problem efficiently, we present a globally optimal outlier-robust iterative solver (called GlobustVP), which independently searches for one VP and its associated lines in each iteration, treating other lines as outliers. After each independent update of all VPs, the mutual orthogonality between the three VPs in a Manhattan world is reinforced via local refinement. Extensive experiments on both synthetic and real-world data demonstrate that GlobustVP achieves a favorable balance between efficiency, robustness, and global optimality compared to previous …
Poster
Huijie Fan · Yu Qiao · Yihao Zhen · Tinghui Zhao · Baojie Fan · Qiang Wang
[ ExHall D ]
Abstract
The capability of tracking objects in low-light environments such as nighttime is crucial for numerous real-world applications such as crowd behavior analysis and traffic scene understanding. However, previous Multi-Camera Multi-Target (MCMT) tracking methods are primarily focused on tracking during daytime with favorable lighting, shying away from low-light environments. The main difficulty of tracking under low-light conditions is the lack of detailed visible appearance features. To address this issue, we incorporate the infrared modality into the MCMT tracking framework to provide more useful information. We constructed the first Multi-modality (RGBT) Multi-camera Multi-target tracking dataset, named M3Track, which contains sequences captured in low-light environments, laying a solid foundation for all-day multi-camera tracking. Based on the proposed dataset, we propose an All-Day Multi-Camera Multi-Target tracking network, termed ADMCMT. Specifically, we propose an All-Day Mamba Fusion (ADMF) model to adaptively fuse information from different modalities. Within ADMF, the Lighting Guidance Model (IGM) extracts lighting-relevant information to guide the fusion process. Furthermore, the Nearby Target Collection (NTC) strategy is designed to enhance tracking accuracy by leveraging information from objects surrounding the target. Experiments conducted on M3Track demonstrate that ADMCMT exhibits strong generalization across different lighting conditions. The code will be released soon.
Poster
Sunkyung Park · Jeongmin Lee · Dongjun Lee
[ ExHall D ]
Abstract
Shape abstraction, simplifying shape representation into a set of primitives, is a fundamental topic in computer vision. The choice of primitives shapes the structure of world understanding, yet achieving both high abstraction accuracy and versatility remains challenging. In this paper, we introduce a novel framework for shape abstraction utilizing a differentiable support function (DSF), which offers unique advantages in representing a wide range of convex shapes with fewer parameters, providing smooth surface approximation and enabling differentiable contact features (gap, point, normal) essential for downstream applications involving contact-related problems. To tackle the associated optimization and combinatorial challenges, we introduce two techniques: differentiable shape parameterization and hyperplane-based marching to enhance accuracy and reduce DSF requirements. We validate our method through experiments demonstrating superior accuracy and efficiency, and showcase its applicability in tasks requiring differentiable contact information.
Poster
Shaoming Li · Qing Cai · Songqi KONG · Runqing Tan · Heng Tong · Shiji Qiu · Yongguo Jiang · Zhi Liu
[ ExHall D ]
Abstract
Reconstructing 3D shapes from a single image plays an important role in computer vision. Many methods have been proposed and achieve impressive performance. However, existing methods mainly focus on extracting semantic information from images and then simply concatenating it with 3D point clouds without further exploring the concatenated semantics. As a result, these entangled semantic features significantly hinder the reconstruction performance. In this paper, we propose a novel single-image 3D reconstruction method called Mining Effective Semantic Cues for 3D Reconstruction from a Single Image (MESC-3D), which can actively mine effective semantic cues from entangled features. Specifically, we design an Effective Semantic Mining Module to establish connections between point clouds and image semantic attributes, enabling the point clouds to autonomously select the necessary information. Furthermore, to address the potential insufficiencies in semantic information from a single image, such as occlusions, inspired by the human ability to represent 3D objects using prior knowledge drawn from daily experiences, we introduce a 3DSPL. This module incorporates semantic understanding of spatial structures, enabling the model to interpret and reconstruct 3D objects with greater accuracy and realism, closely mirroring human perception of complex 3D environments. Extensive evaluations show that our method achieves significant improvements in reconstruction …
Poster
Xinjun Li · Wenfei Yang · Jiacheng Deng · Zhixin Cheng · Xu Zhou · Tianzhu Zhang
[ ExHall D ]
Abstract
Image-to-point cloud registration aims to estimate the camera pose of a given image within a 3D scene point cloud. In this area, matching-based methods have achieved leading performance by first detecting the overlapping region, then matching point and pixel features learned by neural networks and finally using the PnP-RANSAC algorithm to estimate camera pose. However, achieving accurate image-to-point cloud registration remains challenging because the overlapping region detection is unreliable merely relying on point-wise classification, direct alignment of cross-modal data is difficult and indirect optimization objective leads to unstable registration results. To address these challenges, we propose a novel implicit correspondence learning method, including a Geometric Prior-guided overlapping region Detection Module (GPDM), an Implicit Correspondence Learning Module (ICLM), and a Pose Regression Module (PRM). The proposed method enjoys several merits. First, the proposed GPDM can precisely detect the overlapping region. Second, the ICLM can generate robust cross-modality correspondences. Third, the PRM can enable end-to-end optimization. Extensive experimental results on KITTI and nuScenes datasets demonstrate that the proposed model sets a new state-of-the-art performance in registration accuracy.
Poster
Rao Fu · Jianmin Zheng · Liang Yu
[ ExHall D ]
Abstract
The orientation of surface normals in a 3D point cloud is a fundamental problem in computer vision and graphics. Determining a globally consistent orientation solely from the point cloud is, however, challenging due to the global scope of the problem and the discrete nature of the point cloud, particularly in the presence of noise, outliers, holes, thin structures, and complex topologies. This paper presents an efficient, robust, and global algorithm for generating consistent normal orientation of a dense 3D point cloud. The basic idea is to transform the original binary normal orientation problem into finding a relaxed sign field on a Delaunay graph, which can be achieved by solving a sparse linear system. The Delaunay graph is constructed by triangulating a level set of an implicit function defined from the input point cloud. The shape diameter function is estimated to serve as a prior for determining an appropriate level value such that the level set implicitly defines the inner and outer shells enclosing the input point cloud. As such, our algorithm leverages the strengths of the shape diameter function, Delaunay triangulation, and least-squares techniques, making the underlying processes take both geometry and topology into consideration, and thus provides an efficient and robust …
Poster
Haobo Jiang · Jin Xie · Jian Yang · Liang Yu · Jianmin Zheng
[ ExHall D ]
Abstract
This paper introduces a novel task: zero-shot RGB-D point cloud registration, aimed at achieving robust 3D matching on in-the-wild data without any task-specific training. This task is both challenging and of high practical value. We present a powerful zero-shot RGB-D matching framework, ZeroMatch, which innovatively leverages the pre-trained large-scale vision model, Stable Diffusion, to address this challenge. Our core idea is to utilize the powerful zero-shot image representation of Stable Diffusion, achieved through extensive pre-training on large-scale data, to enhance point-cloud geometric descriptors for robust matching. Specifically, we combine the handcrafted geometric descriptor FPFH with Stable-Diffusion features to create point descriptors that are both locally and contextually aware, enabling reliable RGB-D registration with zero-shot capability. This approach is based on our observation that Stable-Diffusion features effectively encode discriminative global contextual cues, naturally alleviating the feature ambiguity that FPFH often encounters in scenes with repetitive patterns or low overlap. To further enhance cross-view consistency of Stable-Diffusion features for improved matching, we propose a coupled-image input mode that concatenates the source and target images into a single input, replacing the original single-image mode. This design achieves both inter-image and prompt-to-image consistency attentions, facilitating robust cross-view feature interaction and alignment. Finally, we …
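The descriptor-fusion step can be pictured as L2-normalizing the handcrafted geometric descriptor and the image-derived feature per point, concatenating them, and matching by mutual nearest neighbours. The sketch below assumes the FPFH and diffusion features have already been extracted (shapes are illustrative) and is not the paper's implementation.

```python
import numpy as np

# Fuse a geometric descriptor (FPFH, 33-D) with an image feature per point and
# match by mutual nearest neighbours. Feature extraction itself is omitted.
def fuse_and_match(fpfh_src, img_src, fpfh_tgt, img_tgt):
    def norm(x):
        return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)
    d_src = np.concatenate([norm(fpfh_src), norm(img_src)], axis=1)
    d_tgt = np.concatenate([norm(fpfh_tgt), norm(img_tgt)], axis=1)
    sim = d_src @ d_tgt.T
    nn12, nn21 = sim.argmax(1), sim.argmax(0)
    return [(i, j) for i, j in enumerate(nn12) if nn21[j] == i]  # mutual NN pairs

src_f, src_i = np.random.rand(100, 33), np.random.rand(100, 1280)
tgt_f, tgt_i = src_f.copy(), src_i.copy()
print(len(fuse_and_match(src_f, src_i, tgt_f, tgt_i)))  # 100 on identical clouds
```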
Poster
Yi Du · Zhipeng Zhao · Shaoshu Su · Sharath Golluri · Haoze Zheng · Runmao Yao · Chen Wang
[ ExHall D ]
Abstract
Point cloud (PC) processing tasks—such as completion, upsampling, denoising, and colorization—are crucial in applications like autonomous driving and 3D reconstruction. Despite substantial advancements, prior approaches often address each of these tasks independently, with separate models focused on individual issues. However, this isolated approach fails to account for the fact that defects like incompleteness, low resolution, noise, and lack of color frequently coexist, with each defect influencing and correlating with the others. Simply applying these models sequentially can lead to error accumulation from each model, along with increased computational costs. To address these challenges, we introduce SuperPC, the first unified diffusion model capable of concurrently handling all four tasks. Our approach employs a three-level-conditioned diffusion framework, enhanced by a novel spatial-mix-fusion strategy, to leverage the correlations among these four defects for simultaneous, efficient processing. We show that SuperPC outperforms the state-of-the-art specialized models as well as their combination on all four individual tasks.
Poster
Khanh Nguyen · Ghulam Mubashar Hassan · Ajmal Mian
[ ExHall D ]
Abstract
Recent open-world representation learning approaches have leveraged CLIP to enable zero-shot 3D object recognition. However, performance on real point clouds with occlusions still falls short due to the unrealistic pretraining settings. Additionally, these methods incur high inference costs because they rely on Transformer's attention modules. In this paper, we make two contributions to address these limitations. First, we propose occlusion-aware text-image-point cloud pretraining to reduce the training-testing domain gap. From 52K synthetic 3D objects, our framework generates nearly 630K partial point clouds for pretraining, consistently improving real-world recognition performances of existing popular 3D networks. Second, to reduce computational requirements, we introduce DuoMamba, a two-stream linear state space model tailored for point clouds. By integrating two space-filling curves with 1D convolutions, DuoMamba effectively models spatial dependencies between point tokens, offering a powerful alternative to Transformer. When pretrained with our framework, DuoMamba surpasses current state-of-the-art methods while reducing latency and FLOPs, highlighting the potential of our approach for real-world applications. We will release our data and code to facilitate future research.
Poster
Yaohua Zha · Yanzi Wang · Hang Guo · Jinpeng Wang · Tao Dai · Bin Chen · Zhihao Ouyang · Xue Yuerong · Ke Chen · Shu-Tao Xia
[ ExHall D ]
Abstract
Recently, applying pre-trained models to assist in 3D point cloud analysis has become a mainstream paradigm in 3D perception. However, existing application strategies are straightforward, utilizing only the final output of the pre-trained model for various task heads. This neglects the rich complementary information present in the intermediate layers, thereby failing to fully unlock the potential of pre-trained models. To overcome this limitation, we propose an orthogonal solution: Point Mamba Adapter (PMA), which constructs an ordered feature sequence from all layers of the pre-trained model and leverages Mamba to fuse all complementary semantics, thereby promoting comprehensive point cloud understanding. Constructing this ordered sequence is non-trivial due to the inherent isotropy of 3D space. Therefore, we further propose a geometry-constrained gate prompt generator (G2PG) shared across different layers, which applies shared geometric constraints to the output gates of the Mamba and dynamically optimizes the spatial order, thus enabling more effective integration of multi-layer information. Extensive experiments conducted on challenging point cloud datasets across various tasks demonstrate that our PMA elevates the capability for point cloud understanding to a new level by fusing diverse complementary intermediate features. The code will be released.
Poster
Boqian Zhang · shen yang · Hao Chen · Chao Yang · Jing Jia · Guang Jiang
[ ExHall D ]
Abstract
Point cloud upsampling can improve the quality of the initial point cloud, significantly enhancing the performance of downstream tasks such as classification and segmentation. Existing methods mostly focus on generating the geometric details of point clouds, neglecting noise suppression. To address this, we propose a novel network based on a conditional diffusion model, incorporating the Adaptive Noise Suppression (ANS) module, which we refer to as PDANS. The ANS module assigns weights to each point and determines the removal strategy based on these weights, reducing the impact of noisy points on the sampling process. The module first selects the neighborhood set for each point in the point cloud and performs a weighted sum between the point and its neighbors. It then adjusts which points are removed based on the weighted sum, effectively mitigating the bias caused by outliers. We introduce the TreeTrans (TT) module to capture more correlated feature information. This module learns the interaction between high-level and low-level features, resulting in a more comprehensive and refined feature representation. Our results on several widely used benchmark datasets demonstrate that PDANS exhibits exceptional robustness in noisy point cloud processing and outperforms current state-of-the-art (SOTA) methods in terms of performance. Code is available at https://github.com/Baty2023/PDANS.
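The weighting-before-removal idea can be illustrated by scoring each point by its distance to a softly weighted average of its nearest neighbours, so that isolated (likely noisy) points receive high scores. The sketch below is a generic outlier-scoring routine under assumed parameters, not the ANS module itself.

```python
import numpy as np

# Score each point by how far it lies from a distance-weighted mean of its
# k nearest neighbours; large scores indicate likely outliers.
def outlier_scores(points, k=8):
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    idx = np.argsort(d, axis=1)[:, 1:k + 1]                 # k nearest neighbours
    w = np.exp(-d[np.arange(len(points))[:, None], idx])    # closer -> larger weight
    w /= w.sum(axis=1, keepdims=True)
    centroid = (w[..., None] * points[idx]).sum(axis=1)     # weighted neighbour mean
    return np.linalg.norm(points - centroid, axis=1)

pts = np.random.rand(256, 3)
pts[0] += 5.0                                               # inject one outlier
print(outlier_scores(pts).argmax())                         # -> 0
```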
Poster
Zhaochong An · Guolei Sun · Yun Liu · Runjia Li · Junlin Han · Ender Konukoglu · Serge Belongie
[ ExHall D ]
Abstract
Generalized few-shot 3D point cloud segmentation (GFS-PCS) enables model adaptation to new classes with a few support samples while retaining base class segmentation. Existing GFS-PCS approaches focus on enhancing prototypes via interacting with support or query features but remain limited by the sparse knowledge from few-shot samples. Meanwhile, 3D vision-language models (3D VLMs), designed to generalize across open-world novel classes by aligning with language models, contain rich but noisy novel class knowledge. In this work, we introduce a GFS-PCS framework that synergizes dense but noisy pseudo-labels from 3D VLMs with precise yet sparse few-shot samples to maximize the strengths of both, named GFS-VL. Specifically, we present a prototype-guided pseudo-label selection to filter low-quality regions, followed by an adaptive infilling strategy that combines knowledge from pseudo-label contexts and few-shot samples to adaptively label the filtered, unlabeled areas. Additionally, to further utilize few-shot samples, we design a novel-base mix strategy to embed few-shot samples into training scenes, preserving essential context for improved novel class learning. Moreover, recognizing the limited diversity in current GFS-PCS benchmarks, we introduce two challenging benchmarks with diverse novel classes for comprehensive generalization evaluation. Experiments validate the effectiveness of our framework across models and datasets. Our approach and benchmarks …
Poster
Yujun Liu · Ruisheng Wang · Shangfeng Huang · GuoRong Cai
[ ExHall D ]
Abstract
Building reconstruction is a challenging problem at the intersection of computer vision, photogrammetry and computer graphics. 3D wireframe presents a compelling representation for building modeling through its compact structure. Existing wireframe reconstruction methods employing vertex detection and edge regression have achieved promising results. In this paper, we develop an Edge-aware Diffusion network, dubbed EdgeDiff. As a novel paradigm for wireframe reconstruction, EdgeDiff generates wireframe models from noise using a conditional diffusion model. During the training process, the ground-truth wireframes are first formulated as a set of parameterized edges and then diffused into a random noise distribution. EdgeDiff learns both the noise reversal process and the network structure simultaneously. During inference, EdgeDiff iteratively refines the generated edge distribution using the denoising diffusion implicit model, enabling flexible single- or multi-step denoising and dynamic adaptation to buildings of varying complexity. Additionally, given the unique structure of wireframes, we introduce an edge attention module to extract point-wise attention from point features, using it as auxiliary information to facilitate learning of edge cues and guide the network toward improved edge awareness. To the best of our knowledge, EdgeDiff is the first to pioneer the use of a diffusion model in building wireframe reconstruction. …
Poster
Yang Wu · Yun Zhu · Kaihua Zhang · Jianjun Qian · Jin Xie · Jian Yang
[ ExHall D ]
Abstract
3D scene perception demands a large amount of adverse-weather LiDAR data, yet the cost of LiDAR data collection presents a significant scaling-up challenge. To this end, a series of LiDAR simulators have been proposed. Yet, they can only simulate a single adverse weather condition with a single physical model, and the fidelity is quite limited. This paper presents WeatherGen, the first unified diverse-weather LiDAR data diffusion generation framework, significantly improving fidelity. Specifically, we first design a map-based data producer, which is capable of providing a vast amount of high-quality diverse-weather data for training purposes. Then, we utilize the diffusion-denoising paradigm to construct a diffusion model. Within it, we propose a spider mamba generator with the spider mamba scan to restore the disturbed diverse weather data gradually. The spider mamba models the feature interactions by scanning the LiDAR beam circle and central ray, excellently maintaining the physical structure of the LiDAR point cloud. Subsequently, we design a latent domain aligner following the generator to transfer real-world knowledge. Afterward, we devise a contrastive learning-based controller, which equips weather control signals with compact semantic knowledge through language supervision from CLIP, guiding the diffusion model in generating more discriminative data. Finally, we fine-tune WeatherGen with …
Poster
Chenxu Dang · Pei An · Xinmin Zhang · ZaiPeng Duan · Xuzhong Hu · Jie Ma
[ ExHall D ]
Abstract
Recent top-performing temporal 3D detectors based on LiDAR have increasingly adopted region-based paradigms. They first generate coarse proposals, followed by encoding and fusing regional features. However, indiscriminate sampling and fusion often overlook the varying contributions of individual points and lead to exponentially increased complexity as the number of input frames grows. Moreover, simple result-level concatenation limits the global information extraction. In this paper, we propose a Focal Token Acquiring-and-Scaling Transformer (FASTer), which dynamically selects focal tokens and condenses token sequences in a lightweight manner. Emphasizing the contribution of individual tokens, we propose a simple but effective Adaptive Scaling mechanism to capture geometric contexts while sifting out focal points. Adaptively storing and processing only focal points in historical frames dramatically reduces the overall complexity, resulting in more compact and information-dense temporal sequences. Furthermore, an innovative grouped hierarchical fusion strategy is proposed, progressively performing sequence scaling and intra-group fusion operations to facilitate the exchange of global spatial and temporal information. Experiments on the Waymo Open Dataset demonstrate that our FASTer significantly outperforms other state-of-the-art detectors in both performance and efficiency while also exhibiting improved flexibility and robustness. The code is available at https://github.com/.
Poster
Dušan Malić · Christian Fruhwirth-Reisinger · Samuel Schulter · Horst Possegger
[ ExHall D ]
Abstract
While surface normals are widely used to analyse 3D scene geometry, surface normal estimation from LiDAR point clouds remains severely underexplored. This is caused by the lack of large-scale annotated datasets on the one hand, and lack of methods that can robustly handle the sparse and often noisy LiDAR data in a reasonable time on the other hand. We address these limitations using a traffic simulation engine and present LiSu, the first large-scale, synthetic LiDAR point cloud dataset with ground truth surface normal annotations, eliminating the need for tedious manual labeling. Additionally, we propose a novel method that exploits the spatiotemporal characteristics of autonomous driving data to enhance surface normal estimation accuracy. By incorporating two regularization terms, we enforce spatial consistency among neighboring points and temporal smoothness across consecutive LiDAR frames. These regularizers are particularly effective in self-training settings, where they mitigate the impact of noisy pseudo-labels, enabling robust real-world deployment. We demonstrate the effectiveness of our method on LiSu, achieving state-of-the-art performance in LiDAR surface normal estimation. Moreover, we showcase its full potential in addressing the challenging task of synthetic-to-real domain adaptation, leading to improved neural surface reconstruction on real-world data.
Poster
Huang Yongshu · Chen Liu · Minghang Zhu · Sheng Ao · Chenglu Wen · Cheng Wang
[ ExHall D ]
Abstract
LiDAR odometry is a critical module in autonomous driving systems, responsible for accurate localization by estimating the relative pose transformation between consecutive point cloud frames. However, existing studies frequently encounter challenges with unreliable pose estimation, due to the lack of an in-depth understanding of the scenario and the presence of noise interference. To address this challenge, we propose DiffLO, a semantic-aware LiDAR odometry network with diffusion-based refinement. To mitigate the impact of challenging cases such as dynamic objects, repetitive patterns, and low textures, we introduce a semantic distillation method that integrates semantic information into the odometry task. This allows the network to gain a semantic understanding of the scene, enabling it to focus more on the objects that are beneficial for pose estimation. Additionally, to enhance the robustness, we propose a diffusion-based refinement method. This method uses pose-related features as conditional constraints for generative diversity, iteratively refining the pose estimation to achieve greater accuracy. Comparative experiments on the KITTI odometry dataset demonstrate that the proposed method achieves the state-of-the-art performance. In particular, the proposed DiffLO is not only more robust than the classic method A-LOAM, but also has better generalization ability than existing learning-based methods. The code will be released.
Poster
Duc-Hai Pham · Tung Do · Phong Nguyen · Binh-Son Hua · Khoi Nguyen · Rang Nguyen
[ ExHall D ]
Abstract
We propose SharpDepth, a novel approach to monocular metric depth estimation that combines the metric accuracy of discriminative depth estimation methods (e.g., Metric3D, UniDepth) with the fine-grained boundary sharpness typically achieved by generative methods (e.g., Marigold, Lotus). Traditional discriminative models trained on real-world data with sparse ground-truth depth can accurately predict metric depth but often produce over-smoothed or low-detail depth maps. Generative models, in contrast, are trained on synthetic data with dense ground truth, generating depth maps with sharp boundaries yet only providing relative depth with low accuracy. Our approach bridges these limitations by integrating metric accuracy with detailed boundary preservation, resulting in depth predictions that are both metrically precise and visually sharp. Our extensive zero-shot evaluations on standard depth estimation benchmarks confirm SharpDepth’s effectiveness, showing its ability to achieve both high depth accuracy and detailed representation, making it well-suited for applications requiring high-quality depth perception across diverse, real-world environments.
Poster
Haotong Lin · Sida Peng · Jingxiao Chen · Songyou Peng · Jiaming Sun · Minghuan Liu · Hujun Bao · Jiashi Feng · Xiaowei Zhou · Bingyi Kang
[ ExHall D ]
Abstract
Prompts play a critical role in unleashing the power of language and vision foundation models for specific tasks. For the first time, we introduce prompting into depth foundation models, creating a new paradigm for metric depth estimation termed Prompt Depth Anything. Specifically, we use a low-cost LiDAR as the prompt to guide the Depth Anything model for accurate metric depth output, achieving up to 4K resolution. Our approach centers on a concise prompt fusion design that integrates the LiDAR at multiple scales within the depth decoder. To address training challenges posed by limited datasets containing both LiDAR depth and precise GT depth, we propose a scalable data pipeline that includes synthetic data LiDAR simulation and real data pseudo GT depth generation. Our approach sets new state-of-the-arts on the ARKitScenes and ScanNet++ datasets. Furthermore, we demonstrate that it benefits several downstream applications, including 3D reconstruction and generalized robotic grasping. Code will be released.
Poster
Xiaomeng Chu · Jiajun Deng · Guoliang You · Yifan Duan · Houqiang Li · Yanyong Zhang
[ ExHall D ]
Abstract
We propose the Radar-Camera fusion transformer (RaCFormer) to boost the accuracy of 3D object detection based on the following insight: Radar-Camera fusion in outdoor 3D scene perception is capped by the image-to-BEV transformation; if the depth of pixels is not accurately estimated, the naive combination of BEV features actually integrates unaligned visual content. To avoid this problem, we propose a query-based framework that adaptively samples instance-relevant features from both the BEV and the original image view. Furthermore, we enhance system performance through two key designs: optimizing query initialization and strengthening the representational capacity of the BEV. For the former, we introduce an adaptive circular distribution in polar coordinates to refine the initialization of object queries, allowing for a distance-based adjustment of query density. For the latter, we initially incorporate a radar-guided depth head to refine the transformation from image view to BEV. Subsequently, we focus on leveraging the Doppler effect of radar and introduce an implicit dynamic catcher to capture the temporal elements within the BEV. Extensive experiments on the nuScenes and View-of-Delft (VoD) datasets validate the merits of our design. Remarkably, our method achieves superior results of 64.9% mAP and 70.2% NDS on nuScenes, even outperforming several LiDAR-based detectors. RaCFormer also secures …
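An adaptive circular query initialization in polar coordinates can be pictured as placing BEV queries on concentric rings whose spacing and per-ring counts vary with distance. The sketch below uses made-up schedules purely for illustration; it is not RaCFormer's initialization.

```python
import numpy as np

# Queries on concentric rings: the exponent packs rings more tightly near the
# ego vehicle, while outer rings receive more queries. Schedules are illustrative.
def polar_queries(max_range=50.0, n_rings=8, base_pts=12):
    queries = []
    for r_idx in range(1, n_rings + 1):
        radius = max_range * (r_idx / n_rings) ** 1.5   # tighter ring spacing close by
        n_pts = base_pts + 4 * r_idx                    # more queries on outer rings
        theta = np.linspace(0, 2 * np.pi, n_pts, endpoint=False)
        queries.append(np.stack([radius * np.cos(theta), radius * np.sin(theta)], 1))
    return np.concatenate(queries, axis=0)              # (N, 2) BEV positions

print(polar_queries().shape)
```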
Poster
Lei Lai · Zekai Yin · Eshed Ohn-Bar
[ ExHall D ]
Abstract
We present a novel visual odometry (VO) algorithm that achieves zero-shot generalization across diverse cameras and environments, addressing traditional limitations in VO algorithms associated with specific sensors and predefined settings. Our approach incorporates three main innovations. First, we introduce a language-based prior that infuses semantic information, enhancing robust feature extraction and enabling effective generalization to previously unseen domains. Second, we design a calibration-free, geometry-aware network structure capable of handling noise in estimated depth and camera parameters. Third, we demonstrate that our flexible architecture can leverage an unconstrained, semi-supervised training process that iteratively adapts to new scenes using unlabeled data, further boosting its ability to generalize across diverse scenarios. We focus on autonomous driving contexts and validate our approach extensively on three standard benchmarks—KITTI, nuScenes, and Argoverse 2—as well as a newly generated, high-fidelity synthetic dataset from Grand Theft Auto (GTA). Our work advances the boundaries of VO applicability, offering a versatile solution for real-world deployment at scale.
Poster
You Wu · Xucheng Wang · Xiangyang Yang · Mengyuan Liu · Dan Zeng · Hengzhou Ye · Shuiwang Li
[ ExHall D ]
Abstract
Recently, there has been a significant rise in the use of single-stream architectures in visual tracking. These architectures effectively integrate feature extraction and fusion by leveraging pre-trained Vision Transformer (ViT) backbones. However, this framework is susceptible to target occlusion, a frequent challenge in Unmanned Aerial Vehicle (UAV) tracking due to the prevalence of buildings, mountains, trees, and other obstructions in aerial views. To our knowledge, learning occlusion-robust representations for UAV tracking within this framework has not yet been explored. In this work, we propose to learn Occlusion-Robust Representations (ORR) based on ViTs for UAV tracking by enforcing an invariance of the feature representation of a target with respect to random masking operations modeled by a spatial Cox process. Ideally, this random masking approximately simulates target occlusions, thereby enabling us to learn ViTs that are robust to target occlusion for UAV tracking. This framework is termed ORTrack. Additionally, to facilitate real-time applications, we propose an Adaptive Feature-Based Knowledge Distillation (AFKD) method to create a more compact tracker, which adaptively mimics the behavior of the teacher model ORTrack according to the task's difficulty. This student model, dubbed ORTrack-D, retains much of ORTrack's performance while offering higher efficiency. Extensive experiments on multiple benchmarks …
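Masking driven by a spatial point process can be illustrated as drawing occluder centres from a Poisson process (the constant-intensity special case of a Cox process) and zeroing the token patches around them. The sketch below uses an assumed grid size, intensity, and patch extent, and is not ORTrack's exact masking scheme.

```python
import numpy as np

# Draw a Poisson-distributed number of occluder centres on the token grid and
# zero a square neighbourhood around each one.
def random_occlusion_mask(grid=14, intensity=3.0, half=1, rng=np.random):
    mask = np.ones((grid, grid), dtype=np.float32)
    n_centres = rng.poisson(intensity)
    for _ in range(n_centres):
        cy, cx = rng.randint(0, grid, size=2)
        mask[max(cy - half, 0):cy + half + 1, max(cx - half, 0):cx + half + 1] = 0.0
    return mask  # multiply element-wise with the ViT token grid

print(random_occlusion_mask().mean())  # fraction of visible tokens
```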
Poster
Jesse Hagenaars · Yilun Wu · Federico Paredes Valles · Stein Stroobants · Guido De Croon
[ ExHall D ]
Abstract
Event cameras provide low-latency perception for only milliwatts of power. This makes them highly suitable for resource-restricted, agile robots such as small flying drones. Self-supervised learning based on contrast maximization holds great potential for event-based robot vision, as it foregoes the need for high-frequency ground truth and allows for online learning in the robot's operational environment. However, online, onboard learning raises the major challenge of achieving sufficient computational efficiency for real-time learning while maintaining competitive visual perception performance. In this work, we improve the time and memory efficiency of the contrast maximization learning pipeline. Benchmarking experiments show that the proposed pipeline achieves competitive results with the state of the art on the task of depth estimation from events. Furthermore, we demonstrate the usability of the learned depth for obstacle avoidance through real-world flight experiments. Finally, we compare the performance of different combinations of pre-training and fine-tuning of the depth estimation networks, showing that on-board domain adaptation is feasible given a few minutes of flight.
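For readers unfamiliar with contrast maximization, here is a minimal sketch of the classic objective for a single global 2D velocity: warp events to a reference time, splat them into an image of warped events (IWE), and maximize the IWE's variance. The global-flow assumption, bilinear splatting, and negative-variance loss are generic choices for illustration, not the paper's efficiency-optimized pipeline.

```python
import torch

def contrast_maximization_loss(events, flow, img_size):
    """Contrast-maximization objective: warp events to a reference time,
    splat them into an image of warped events (IWE) with bilinear weights,
    and return the negative variance so minimization sharpens the IWE.

    events: (N, 4) tensor of (x, y, t, polarity); polarity is ignored here.
    flow: (2,) tensor (vx, vy) in pixels/second; img_size: (H, W).
    """
    H, W = img_size
    x, y, t = events[:, 0], events[:, 1], events[:, 2]
    dt = t.max() - t
    xw, yw = x + dt * flow[0], y + dt * flow[1]        # warped event coordinates
    iwe = torch.zeros(H, W)
    x0, y0 = xw.floor(), yw.floor()
    for dx in (0, 1):                                  # bilinear splatting keeps the
        for dy in (0, 1):                              # loss differentiable w.r.t. flow
            xi = (x0 + dx).long().clamp(0, W - 1)
            yi = (y0 + dy).long().clamp(0, H - 1)
            wgt = ((1 - (xw - (x0 + dx)).abs()) *
                   (1 - (yw - (y0 + dy)).abs())).clamp(min=0)
            iwe = iwe.index_put((yi, xi), wgt, accumulate=True)
    return -iwe.var()

# usage sketch: optimize a global velocity on random synthetic events
events = torch.rand(1000, 4) * torch.tensor([239.0, 179.0, 0.05, 1.0])
flow = torch.zeros(2, requires_grad=True)
loss = contrast_maximization_loss(events, flow, (180, 240))
loss.backward()  # gradients w.r.t. the motion parameters
```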
Poster
Shu-Wei Lu · Yi-Hsuan Tsai · Yi-Ting Chen
[ ExHall D ]
Abstract
Bird's-eye view (BEV) perception has gained significant attention because it provides a unified representation to fuse multiple view images and enables a wide range of downstream autonomous driving tasks, such as forecasting and planning. However, the task is challenging because it is inherently an ill-posed problem due to the lack of depth information. Moreover, fusing multi-view images into a unified representation without depth cues becomes more challenging. Recent grid-based methods formulate BEV perception as query learning to bypass explicit depth estimation. While we observe promising advancements in this paradigm, they still fall short of real-world applications because they lack uncertainty modeling and are computationally expensive. In this work, we revisit depth-based methods and endow them with uncertainty awareness. Specifically, we calculate the variance of the depth distribution to represent how objects are spatially dispersed around their mean depth. In addition, the proposed model learns a soft depth mean and implicitly captures the spatial extent of objects. We transform the depth distribution into 3D Gaussians and utilize the rasterization technique to form uncertainty-aware BEV features. We evaluate our method on the nuScenes dataset, achieving state-of-the-art performance compared to depth-based methods. Notably, our model provides significant advantages in speed—running 2x faster—and in memory efficiency, using …
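A minimal sketch of the soft depth mean and variance described above, assuming a standard categorical depth head with softmax over discrete bins; the subsequent conversion to 3D Gaussians and rasterization is omitted.

```python
import torch

def depth_mean_and_variance(depth_logits, depth_bins):
    """Convert per-pixel categorical depth logits into a soft depth mean and a
    variance describing how dispersed the distribution is around that mean.

    depth_logits: (B, D, H, W) logits over D depth bins.
    depth_bins:   (D,) bin centres in metres.
    """
    prob = depth_logits.softmax(dim=1)                      # (B, D, H, W)
    bins = depth_bins.view(1, -1, 1, 1)
    mean = (prob * bins).sum(dim=1, keepdim=True)           # soft depth mean
    var = (prob * (bins - mean) ** 2).sum(dim=1, keepdim=True)
    return mean.squeeze(1), var.squeeze(1)                  # (B, H, W) each
```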
Poster
Gyeongrok Oh · Sungjune Kim · Heeju Ko · Hyunggun Chi · Jinkyu Kim · Dongwook Lee · Daehyun Ji · Sungjoon Choi · Sujin Jang · Sangpil Kim
[ ExHall D ]
Abstract
The resolution of voxel queries significantly influences the quality of view transformation in camera-based 3D occupancy prediction. However, computational constraints and the practical necessity for real-time deployment require smaller query resolutions, which inevitably leads to an information loss. Therefore, it is essential to encode and preserve rich visual details within limited query sizes while ensuring a comprehensive representation of 3D occupancy. To this end, we introduce ProtoOcc, a novel occupancy network that leverages prototypes of clustered image segments in view transformation to enhance low-resolution context. In particular, the mapping of 2D prototypes onto 3D voxel queries encodes high-level visual geometries and complements the loss of spatial information from reduced query resolutions. Additionally, we design a multi-perspective decoding strategy to efficiently disentangle the densely compressed visual cues into a high-dimensional 3D occupancy scene. Experimental results on both Occ3D and SemanticKITTI benchmarks demonstrate the effectiveness of the proposed method, showing clear improvements over the baselines. More importantly, ProtoOcc achieves competitive performance against the baselines even with 75% reduced voxel resolution.
Poster
Hyo-Jun Lee · Yeong Jun Koh · Hanul Kim · Hyunseop Kim · Yonguk Lee · Jinu Lee
[ ExHall D ]
Abstract
Vision-centric 3D Semantic Scene Completion (SSC) aims to reconstruct a fine-grained 3D scene from an input RGB image. Since the vision-centric 3D SSC is an ill-posed problem, it requires a precise 2D-3D view transformation. However, existing view transformations inevitably experience erroneous feature duplication in the reconstructed voxel space due to occlusions, leading to a dilution of informative contexts. Furthermore, semantic classes exhibit high variability in their appearance in real-world driving scenarios. To address these issues, we introduce a novel 3D SSC method, called SOAP, including two key components: an occluded region-aware view projection and a scene-adaptive decoder. The occluded region-aware view projection effectively converts 2D image features into voxel space, refining the duplicated features of occluded regions using information gathered from previous observations. The scene-adaptive decoder guides query embeddings to learn diverse driving environments based on a comprehensive semantic repository. Extensive experiments validate that the proposed SOAP significantly outperforms existing methods for the vision-centric 3D SSC on automated driving datasets, SemanticKITTI and SSCBench.
Poster
Yancong Lin · Shiming Wang · Liangliang Nan · Julian F. P. Kooij · Holger Caesar
[ ExHall D ]
Abstract
Scene flow estimation aims to recover per-point motion from two adjacent LiDAR scans. However, in real-world applications such as autonomous driving, points rarely move independently of others, especially for nearby points belonging to the same object, which often share the same motion. Incorporating this locally rigid motion constraint has been a key challenge in self-supervised scene flow estimation, which is often addressed by post-processing or appending extra regularization. While these approaches are able to improve the rigidity of predicted flows, they lack an architectural inductive bias for local rigidity within the model structure, leading to suboptimal learning efficiency and inferior performance. In contrast, we enforce local rigidity with a lightweight add-on module in neural network design, enabling end-to-end learning. We design a discretized voting space that accommodates all possible translations and then identify the one shared by nearby points by differentiable voting. Additionally, to ensure computational efficiency, we operate on pillars rather than points and learn representative features for voting per pillar. We plug the Voting Module into popular model designs and evaluate its benefit on Argoverse 2 and Waymo datasets. We outperform baseline works with only marginal compute overhead. Code will be released upon acceptance.
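To illustrate the differentiable voting idea in the abstract above, here is a minimal sketch: points in a local neighbourhood cast soft (Gaussian-kernel) votes over a discretized translation space, and the consensus translation is read out with a soft-argmax. The kernel, bandwidth, and the use of raw per-point translations (instead of the paper's learned per-pillar features) are simplifying assumptions.

```python
import torch

def differentiable_translation_vote(point_translations, vote_grid, sigma=0.2):
    """Soft voting over a discretized translation space: every point in a local
    neighbourhood (e.g., one pillar) casts a Gaussian-kernel vote for each
    candidate translation, votes are accumulated, and the consensus translation
    is read out with a soft-argmax so the whole operation stays differentiable.

    point_translations: (N, 3) per-point translation candidates.
    vote_grid:          (M, 3) candidate translations covering the voting space.
    """
    d2 = ((point_translations[:, None, :] - vote_grid[None, :, :]) ** 2).sum(-1)  # (N, M)
    votes = torch.exp(-d2 / (2 * sigma ** 2)).sum(dim=0)                          # (M,)
    weights = torch.softmax(votes, dim=0)
    return (weights[:, None] * vote_grid).sum(dim=0)   # shared rigid translation, (3,)
```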
Poster
Haiming Zhang · Wending Zhou
[ ExHall D ]
Abstract
This paper introduces VisionPAD, a novel self-supervised pre-training paradigm designed for vision-centric algorithms in autonomous driving. In contrast to previous approaches that employ neural rendering with explicit depth supervision, VisionPAD utilizes more efficient 3D Gaussian Splatting to reconstruct multi-view representations using only images as supervision. Specifically, we introduce a self-supervised method for voxel velocity estimation. By warping voxels to adjacent frames and supervising the rendered outputs, the model effectively learns motion cues in the sequential data. Furthermore, we adopt a multi-frame photometric consistency approach to enhance geometric perception. It projects adjacent frames to the current frame based on rendered depths and relative poses, boosting the 3D geometric representation through pure image supervision. Extensive experiments on autonomous driving datasets demonstrate that VisionPAD significantly improves performance in 3D object detection, occupancy prediction and map segmentation, surpassing state-of-the-art pre-training strategies by a considerable margin.
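The multi-frame photometric consistency term can be illustrated with a standard monodepth-style warping sketch: project the current frame into an adjacent view using rendered depth and the relative pose, sample the adjacent image, and compare. The L1 loss and the exact tensor layout are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def photometric_consistency_loss(img_cur, img_adj, depth_cur, K, T_cur_to_adj):
    """Warp the adjacent frame into the current frame using rendered depth and
    the relative pose, then compare it with the current image (self-supervision
    through pure image reconstruction).

    img_*: (B, 3, H, W); depth_cur: (B, 1, H, W); K: (B, 3, 3) intrinsics;
    T_cur_to_adj: (B, 4, 4) relative pose from the current to the adjacent camera.
    """
    B, _, H, W = img_cur.shape
    dev = img_cur.device
    ys, xs = torch.meshgrid(torch.arange(H, device=dev),
                            torch.arange(W, device=dev), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float()   # (3, H, W)
    pix = pix.view(1, 3, -1).expand(B, -1, -1)                        # (B, 3, HW)

    cam = torch.linalg.inv(K) @ pix * depth_cur.view(B, 1, -1)        # back-project
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W, device=dev)], 1)  # homogeneous
    cam_adj = (T_cur_to_adj @ cam_h)[:, :3]                           # into adjacent frame
    proj = K @ cam_adj
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)

    u = 2 * uv[:, 0] / (W - 1) - 1                                    # normalize for grid_sample
    v = 2 * uv[:, 1] / (H - 1) - 1
    grid = torch.stack([u, v], dim=-1).view(B, H, W, 2)
    warped = F.grid_sample(img_adj, grid, align_corners=True)
    return (warped - img_cur).abs().mean()                            # L1 photometric loss
```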
Poster
Kuang Wu · Chuan Yang · Zhanbin Li
[ ExHall D ]
Abstract
Vectorized high-definition (HD) maps are essential for an autonomous driving system. Recently, state-of-the-art map vectorization methods are mainly based on DETR-like frameworks to generate HD maps in an end-to-end manner. In this paper, we propose InteractionMap, which improves previous map vectorization methods by fully leveraging local-to-global information interaction in both time and space. Firstly, we explore enhancing DETR-like detectors with explicit position-relation priors from the point level to the instance level, since map elements contain strong shape priors. Secondly, we propose a key-frame-based hierarchical temporal fusion module, which interacts temporal information from local to global. Lastly, the separate classification and regression branches lead to misalignment in the output distribution; we couple semantic and geometric information by introducing a novel geometric-aware classification loss in optimization and a geometric-aware matching cost in label assignment. InteractionMap achieves state-of-the-art performance on both the nuScenes and Argoverse2 benchmarks.
Poster
Wei Wu · Xi Guo · Weixuan TANG · Tingxuan Huang · Chiyu Wang · Chenjing Ding
[ ExHall D ]
Abstract
Recent advancements in generative models offer promising solutions for synthesizing realistic driving videos, aiding in training autonomous driving perception models. However, existing methods often struggle with high-resolution multi-view generation, mainly due to the significant memory and computational overhead caused by simultaneously inputting multi-view videos into denoising diffusion models. In this paper, we propose a driving video generation framework based on multi-view feature fusion named DriveScape for multi-view 3D condition-guided video generation. We introduce a Bi-Directional Modulated Transformer (BiMoT) module to encode, fuse and inject multi-view features along with various 3D road structures and objects, which enables high-resolution multi-view generation. Consequently, our approach allows precise control over video generation, greatly enhancing realism and providing a robust solution for creating high-quality, multi-view driving videos. Our framework achieves state-of-the-art results on the nuScenes dataset, demonstrating impressive generative quality metrics with an FID score of 8.34 and an FVD score of 76.39, as well as superior performance across various perception tasks. This lays the foundation for more accurate environment simulation in autonomous driving. We plan to make our code and pre-trained model publicly available. Please refer to index.html webpage in the supplementary materials for more visualization results.
Poster
Changsheng Lv · Mengshi Qi · Liang Liu · Huadong Ma
[ ExHall D ]
Abstract
Understanding traffic scenes and generating high-definition (HD) maps present significant challenges in autonomous driving. In this paper, we define a novel Traffic Topology Scene Graph (T2SG), a unified scene graph that explicitly models lanes, controlled and guided by different road signals (e.g., right turn), and the topology relationships among them, which previous HD mapping methods have largely ignored. For the generation of T2SG, we propose TopoFormer, a novel one-stage Topology Scene Graph TransFormer with two newly designed layers. Specifically, TopoFormer incorporates a Lane Aggregation Layer (LAL) that leverages the geometric distance among the centerlines of lanes to guide the aggregation of global information. Furthermore, we propose a Counterfactual Intervention Layer (CIL) to model the reasonable road structure (e.g., intersection, straight) among lanes under counterfactual intervention. The generated T2SG can then provide a more accurate and explainable description of the topological structure in traffic scenes. Experimental results demonstrate that TopoFormer outperforms existing methods on the T2SG generation task, and the generated T2SG significantly enhances traffic topology reasoning in downstream tasks, achieving a state-of-the-art performance of 46.3 OLS on the OpenLane-V2 benchmark. We will release our source code and model.
Poster
Luke Rowe · Roger Girgis · Anthony Gosselin · Liam Paull · Christopher Pal · Felix Heide
[ ExHall D ]
Abstract
We introduce Scenario Dreamer, a fully data-driven generative simulator for autonomous vehicle planning that generates both the initial traffic scene (comprising the lane graph and agent bounding boxes) and closed-loop agent behaviours. Existing methods for generating driving simulation environments encode the initial traffic scene as a rasterized image and, as such, require parameter-heavy networks that perform unnecessary computation due to many empty pixels in the rasterized scene. Moreover, we find that existing methods that employ rule-based agent behaviours lack diversity and realism. Scenario Dreamer instead comprises a novel vectorized latent diffusion model for initial traffic scene generation and a return-conditioned autoregressive Transformer for agent behaviour simulation. Scenario Dreamer supports scene extrapolation via diffusion inpainting, enabling the generation of unbounded simulation environments. We validate that Scenario Dreamer environments are more realistic than those generated by existing methods while requiring fewer parameters and lower generation latency. We confirm its practical utility by showing that RL-based planning agents are more challenged in Scenario Dreamer environments than in traditional non-generative simulation environments, especially in long and adversarial driving scenarios. All code will be open-sourced.
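Scene extrapolation via diffusion inpainting can be sketched with a RePaint-style loop; this is an assumption about how such extrapolation is commonly realised, not necessarily the authors' exact procedure, and the `denoiser` call plus the diffusers-style `scheduler` interface are assumed.

```python
import torch

@torch.no_grad()
def inpaint_extrapolate(denoiser, scheduler, known_latents, known_mask):
    """RePaint-style diffusion inpainting: at every reverse step, the known part
    of the (vectorized) scene latent is re-noised to the current timestep, the
    unknown part comes from the model's reverse step, and a mask stitches them
    together so newly generated regions stay consistent with the known scene.

    `denoiser(x, t)` returning predicted noise and a diffusers-style `scheduler`
    (with .timesteps, .add_noise, .step) are assumed interfaces.
    known_mask: 1 where the scene is already known, 0 where it is extrapolated.
    """
    x = torch.randn_like(known_latents)
    for t in scheduler.timesteps:
        noise = torch.randn_like(known_latents)
        known_t = scheduler.add_noise(known_latents, noise, t)  # re-noise known region
        eps = denoiser(x, t)                                    # predict noise on full latent
        x = scheduler.step(eps, t, x).prev_sample               # reverse step
        x = known_mask * known_t + (1 - known_mask) * x         # keep known, extend unknown
    return x
```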
Poster
Zhiwei Dong · Ran Ding · Wei Li · Zhang Peng · Guobin Tang · Jia Guo
[ ExHall D ]
Abstract
Latest trajectory prediction models in real-world autonomous driving systems often rely on online High-Definition (HD) maps to understand the road environment. However, online HD maps suffer from perception errors and feature redundancy, which hinder the performance of HD map-based trajectory prediction models. To address these issues, we introduce a framework, termed SD map-Augmented Trajectory Prediction (SATP), which leverages Standard-Definition (SD) maps to enhance HD map-based trajectory prediction models. First, we propose an SD-HD fusion approach to leverage SD maps across the diverse range of HD map-based trajectory prediction models. Second, we design a novel AlignNet to align the SD map with the HD map, further improving the effectiveness of SD maps. Experiments on real-world autonomous driving benchmarks demonstrate that SATP not only improves the performance of HD map-based trajectory prediction up to 25% in real-world scenarios using online HD maps but also brings benefits in ideal scenarios with ground-truth HD maps.
Poster
Yi Yu · Weizhen Han · Libing Wu · Bingyi Liu · Enshu Wang · Zhuangzhuang Zhang
[ ExHall D ]
Abstract
Trajectory prediction plays a crucial role in autonomous driving systems, and exploring its vulnerability has garnered widespread attention. However, existing trajectory prediction attack methods often rely on single-point attacks to make efficient perturbations. This limits their applications in real-world scenarios due to the transient nature of single-point attacks, their susceptibility to filtration, and the uncertainty regarding the deployment environment. To address these challenges, this paper proposes a novel LiDAR-induced attack framework to impose multi-frame attacks by optimization-driven adversarial location search, achieving endurance, efficiency, and robustness. This framework strategically places objects near the adversarial vehicle to implement an attack and introduces three key innovations. First, successive state perturbations are generated using a multi-frame single-point attack strategy, effectively misleading trajectory predictions over extended time horizons. Second, we efficiently optimize adversarial objects' locations through three specialized loss functions to achieve desired perturbations. Lastly, we improve robustness by treating the adversarial object as a point without size constraints during the location search phase and reduce dependence on both the specific attack point and the adversarial object's properties. Extensive experiments confirm the superior performance and robustness of our framework.
Poster
Dongkun Zhang · Jiaming Liang · Ke Guo · Sha Lu · Qi Wang · Rong Xiong · Zhenwei Miao · Yue Wang
[ ExHall D ]
Abstract
Trajectory planning is vital for autonomous driving, ensuring safe and efficient navigation in complex environments. While recent learning-based methods, particularly reinforcement learning (RL), have shown promise in specific scenarios, RL planners struggle with training inefficiencies and managing large-scale, real-world driving scenarios. In this paper, we introduce CarPlanner, a Consistent auto-regressive Planner that uses RL to generate multi-modal trajectories. The auto-regressive structure enables efficient large-scale RL training, while the incorporation of consistency ensures stable policy learning by maintaining temporal coherence across time steps. Moreover, CarPlanner employs a generation-selection framework with an expert-guided reward function and an invariant-view module, simplifying RL training and enhancing policy performance. Extensive analysis demonstrates that our proposed RL framework effectively addresses the challenges of training efficiency and performance enhancement, positioning CarPlanner as a promising solution for trajectory planning in autonomous driving. To the best of our knowledge, we are the first to demonstrate that an RL-based planner can surpass both IL- and rule-based state-of-the-art (SOTA) methods on the challenging large-scale real-world dataset nuPlan, where our proposed CarPlanner surpasses RL-, IL-, and rule-based SOTA approaches.
Poster
Wufei Ma · Luoxin Ye · Nessa McWeeney · Celso M. de Melo · Alan L. Yuille · Jieneng Chen
[ ExHall D ]
Abstract
Humans naturally understand 3D spatial relationships, enabling complex reasoning like predicting collisions of vehicles approaching from different directions. Current large multimodal models (LMMs), however, lack this capability for 3D reasoning. This limitation stems from the scarcity of 3D training data and the bias in current model designs toward 2D data. In this paper, we systematically study the impact of 3D-informed data, architecture, and training setups, introducing 3DI-LMM, an LMM with advanced 3D spatial reasoning abilities. To address data limitations, we develop two types of 3D-informed training datasets: (1) 3D-informed probing data focused on objects' 3D locations and orientations, and (2) 3D-informed conversation data for complex spatial relationships. Notably, we are the first to curate VQA data that incorporates 3D orientation relationships. Furthermore, we systematically integrate these two types of training data with the architectural and training designs of LMMs, providing a roadmap for optimal design aimed at achieving superior 3D reasoning capabilities. Our 3DI-LMM advances machines toward highly capable 3D-informed reasoning, surpassing GPT-4o by 8.7%. Our systematic empirical design and resulting findings offer valuable insights for future research in this direction.
Poster
Zhenhua Xu · Yan Bai · Yujia Zhang · Zhuoling Li · Fei Xia · Kwan-Yee K. Wong · Jianqiang Wang · Hengshuang Zhao
[ ExHall D ]
Abstract
Multimodal large language models (MLLMs) possess the ability to comprehend visual images or videos, and show impressive reasoning ability thanks to vast amounts of pretrained knowledge, making them highly suitable for autonomous driving applications. Unlike the previous work, DriveGPT4-V1, which focused on open-loop tasks, this study explores the capabilities of LLMs in enhancing closed-loop autonomous driving. DriveGPT4-V2 processes camera images and vehicle states as input to generate low-level control signals for end-to-end vehicle operation. A high-resolution visual tokenizer (HR-VT) is employed, enabling DriveGPT4-V2 to perceive the environment over an extensive range while preserving critical details. The model architecture has been refined to improve decision prediction and inference speed. To further enhance performance, an additional expert LLM is trained for online imitation learning. The expert LLM, sharing a similar structure with DriveGPT4-V2, can access privileged information about surrounding objects for more robust and reliable predictions. Experimental results show that DriveGPT4-V2 significantly outperforms all baselines on the challenging CARLA Longest6 benchmark. The code and data of DriveGPT4-V2 will be publicly available.
Poster
Ahmad Rahimi · Po-Chien Luan · Yuejiang Liu · Frano Rajič · Alex Alahi
[ ExHall D ]
Abstract
Modeling spatial-temporal interactions among neighboring agents is at the heart of multi-agent problems such as motion forecasting and crowd navigation. Despite notable progress, it remains unclear to which extent modern representations can capture the causal relationships behind agent interactions. In this work, we take an in-depth look at the causal awareness of these representations, from computational formalism to real-world practice. First, we revisit the notion of non-causal robustness studied in the recent CausalAgents benchmark. We show that existing representations are already partially resilient to perturbations of non-causal agents, and yet modeling indirect causal effects involving mediator agents remains challenging. To address this challenge, we introduce a metric learning approach that regularizes latent representations with causal annotations. Our controlled experiments show that this approach not only leads to higher degrees of causal awareness but also yields stronger out-of-distribution robustness. To further operationalize it in practice, we propose a sim-to-real causal transfer method via cross-domain multi-task learning. Experiments on trajectory prediction datasets show that our method can significantly boost generalization, even in the absence of real-world causal annotations, where we acquire higher prediction accuracy by only using 25% of real-world data. We hope our work provides a new perspective on the challenges …
Poster
Yuxiang Fu · Qi Yan · Ke Li · Lele Wang · Renjie Liao
[ ExHall D ]
Abstract
In this paper, we address the problem of human trajectory forecasting, which aims to predict the inherently multi-modal future movements of humans based on their past trajectories and other contextual cues. We propose a novel conditional flow matching model, termed MoFlow, to predict K-shot future trajectories for all agents in a given scene. We design a novel flow matching loss function that not only ensures at least one of the K sets of future trajectories is accurate but also encourages all K sets of future trajectories to be diverse and plausible. Furthermore, leveraging the implicit maximum likelihood estimation (IMLE), we propose a novel distillation method for flow models that only requires samples from the teacher model. Extensive experiments on the real-world datasets, including SportVU NBA game, ETH-UCY, and SDD, demonstrate that both our teacher flow model and the IMLE-distilled student model achieve state-of-the-art performance, generating diverse trajectories that are physically and socially plausible. Moreover, the one-step student model is significantly faster than the teacher flow model in sampling.
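A K-hypothesis conditional flow matching loss of the kind described above could look like the sketch below: a winner-takes-all term keeps the best hypothesis accurate, and an averaged term keeps all hypotheses plausible. The `model` interface, the trajectory tensor layout, and the weight `beta` are assumptions, not the paper's exact loss.

```python
import torch

def k_hypothesis_flow_matching_loss(model, context, x1, K=20, beta=0.1):
    """Conditional flow matching with K hypotheses: the model predicts K velocity
    fields; a winner-takes-all term ensures at least one hypothesis matches the
    ground-truth trajectory, while an averaged term keeps all hypotheses plausible.

    `model(xt, t, context)` returning (B, K, N, T, 2) velocities is an assumed
    interface; x1: (B, N, T, 2) ground-truth future trajectories.
    """
    B = x1.shape[0]
    x0 = torch.randn_like(x1)                  # sample from the simple prior
    t = torch.rand(B, 1, 1, 1)
    xt = (1 - t) * x0 + t * x1                 # point on the straight-line path
    v_target = x1 - x0                         # target velocity along that path
    v_pred = model(xt, t, context)             # (B, K, N, T, 2)
    err = ((v_pred - v_target.unsqueeze(1)) ** 2).flatten(2).mean(-1)   # (B, K)
    wta = err.min(dim=1).values.mean()         # at least one hypothesis is accurate
    avg = err.mean()                           # all hypotheses stay plausible
    return wta + beta * avg
```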
Poster
Yuncong Yang · Han Yang · Jiachen Zhou · Peihao Chen · Hongxin Zhang · Yilun Du · Chuang Gan
[ ExHall D ]
Abstract
Constructing a compact and informative scene representation for 3D scenes is essential for effective embodied exploration and reasoning, especially in complex environments over long periods. Existing scene representations, such as object-centric 3D scene graphs, have significant limitations. They oversimplify spatial relationships by modeling scenes as individual objects, with inter-object relationships described by restrictive texts, making it difficult to answer queries that require nuanced spatial understanding. Furthermore, these representations lack natural mechanisms for active exploration and memory management, which hampers their application to lifelong autonomy. In this work, we propose SnapMem, a novel snapshot-based scene representation serving as 3D scene memory for embodied agents. SnapMem employs informative images, termed Memory Snapshots, to capture rich visual information of explored regions. It also integrates frontier-based exploration by introducing Frontier Snapshots—glimpses of unexplored areas—that enable agents to make informed exploration decisions by considering both known and potential new information. Meanwhile, to support lifelong memory in active exploration settings, we further present an incremental construction pipeline for SnapMem, as well as an effective memory retrieval technique for memory management. Experimental results on three benchmarks demonstrate that SnapMem significantly enhances agents' exploration and reasoning capabilities in 3D environments over extended periods, highlighting its potential for advancing …
Poster
Xingyu Chen · Zhuheng Song · Xiaoke Jiang · Yaoqing Hu · Junzhi Yu · Lei Zhang
[ ExHall D ]
Abstract
Existing approaches to hand reconstruction predominantly adhere to a multi-stage framework, encompassing detection, left-right classification, and pose estimation. This paradigm induces redundant computation and cumulative errors. In this work, we propose HandOS, an end-to-end framework for 3D hand reconstruction. Our central motivation lies in leveraging a frozen detector as the foundation while incorporating auxiliary modules for 2D and 3D keypoint estimation. In this manner, we integrate the pose estimation capacity into the detection framework, while at the same time obviating the necessity of using the left-right category as a prerequisite. Specifically, we propose an interactive 2D-3D decoder, where 2D joint semantics is derived from detection cues while the 3D representation is lifted from that of the 2D joints. Furthermore, hierarchical attention is designed to enable the concurrent modeling of 2D joints, 3D vertices, and camera translation. Consequently, we achieve an end-to-end integration of hand detection, 2D pose estimation, and 3D mesh reconstruction within a one-stage framework, so that the above multi-stage drawbacks are overcome. Meanwhile, HandOS reaches state-of-the-art performance on public benchmarks, e.g., 5.0 PA-MPJPE on FreiHand and 64.6% PCK@0.05 on HInt-Ego4D.
Poster
Zifan Wang · Ziqing Chen · Junyu Chen · Jilong Wang · Yuxin Yang · Yunze Liu · Xueyi Liu · He Wang · Li Yi
[ ExHall D ]
Abstract
This paper introduces MobileH2R, a framework for learning generalizable vision-based human-to-mobile-robot (H2MR) handover skills. Unlike traditional fixed-base handovers, this task requires a mobile robot to reliably receive objects in a large workspace enabled by its mobility. Our key insight is that generalizable handover skills can be developed in simulators using high-quality synthetic data, without the need for real-world demonstrations. To achieve this, we propose a scalable pipeline for generating diverse synthetic full-body human motion data, an automated method for creating safe and imitation-friendly demonstrations, and an efficient 4D imitation learning method for distilling large-scale demonstrations into closed-loop policies with base-arm coordination. Experimental evaluations in both simulators and the real world show significant improvements (at least +15% success rate) over baseline methods in all cases. Experiments also validate that large-scale and diverse synthetic data greatly enhances robot learning, highlighting our scalable framework.
Poster
Jingyi Tian · Le Wang · Sanping Zhou · Sen Wang · lijiayi · Haowen Sun · Wei Tang
[ ExHall D ]
Abstract
Robotic manipulation based on visual observations and natural language instructions is a long-standing challenge in robotics. Yet prevailing approaches model the action distribution with explicit or implicit representations, which often struggle to achieve a trade-off between accuracy and efficiency. In response, we propose PDFactor, a novel framework that models the action distribution with a hybrid triplane representation. In particular, PDFactor decomposes the 3D point cloud into three orthogonal feature planes and leverages a tri-perspective view transformer to produce dense cubic features as a latent diffusion field aligned with the observation space, representing the 6-DoF action probability distribution at an arbitrary location. We employ a small denoising network conceptually as both a parameterized loss function measuring the quality of the learned latent features and an action gradient decoder to sample actions from the latent diffusion field during inference. This design enables PDFactor to benefit from the spatial awareness of explicit representations and the arbitrary resolution of implicit representations, endowing it with manipulation accuracy, inference efficiency, and model scalability. Experiments demonstrate that PDFactor outperforms state-of-the-art approaches across a diverse range of manipulation tasks in RLBench simulation. Moreover, PDFactor can effectively learn multi-task policies from a limited number of human demonstrations, achieving promising accuracy in a variety of …
Poster
Jieming Cui · Tengyu Liu · Ziyu Meng · Jiale Yu · Ran Song · Wei Zhang · Yixin Zhu · Siyuan Huang
[ ExHall D ]
Abstract
Learning open-vocabulary physical skills for simulated agents remains challenging due to the limitations of reinforcement learning approaches: manually designed rewards lack scalability, while demonstration-based methods struggle to cover arbitrary tasks. We propose GROVE, a generalized reward framework for open-vocabulary physical skill learning without manual reward design or task-specific demonstrations. GROVE uniquely combines Large Language Models (LLMs) for generating precise constraints with Vision Language Models (VLMs) for semantic evaluation. Through an iterative reward design process, VLM-based feedback guides the refinement of LLM-generated constraints, significantly enhancing the reliability of our method. Central to our approach is Pose2CLIP, a lightweight pose-to-semantic feature mapper that significantly enhances the quality and efficiency of VLM evaluation. Extensive experiments demonstrate GROVE's versatility across diverse tasks and learning paradigms. Our approach achieves 22.2% higher naturalness and 25.7% better task completion score while training 8.4 times faster than previous open-vocabulary methods, establishing a new foundation for scalable physical skill acquisition.
Poster
Chan Hee Song · Valts Blukis · Jonathan Tremblay · Stephen Tyree · Yu Su · Stan Birchfield
[ ExHall D ]
Abstract
Spatial understanding is a crucial capability for robots to make grounded decisions based on their environment. This foundational skill enables robots not only to perceive their surroundings but also to reason about and interact meaningfully within the world. In modern robotics, these capabilities are taken on by visual language models, which face significant challenges when applied to spatial reasoning contexts due to their training data sources. These sources utilize general-purpose image datasets and often lack sophisticated spatial scene understanding capabilities. For example, the datasets do not address reference frame comprehension: spatial relationships require clear contextual understanding, whether from an ego-centric, object-centric, or world-centric perspective, to allow for effective real-world interaction. To address this issue, we introduce RoboSpatial, a large-scale spatial understanding dataset consisting of real indoor and tabletop scenes captured as 3D scans and ego-centric images, annotated with rich spatial information relevant to robotics. The dataset includes 1M images, 5K 3D scans, and 3M annotated spatial relationships, with paired 2D egocentric images and 3D scans to make it both 2D and 3D ready. Our experiments show that models trained with RoboSpatial outperform baselines on downstream tasks such as spatial affordance prediction, spatial relationship prediction, and robotics manipulation.
Poster
Yawen Shao · Wei Zhai · Yuhang Yang · Hongchen Luo · Yang Cao · Zheng-Jun Zha
[ ExHall D ]
Abstract
Open-Vocabulary 3D object affordance grounding aims to anticipate "action possibilities" regions on 3D objects with arbitrary instructions, which is crucial for robots to generically perceive real scenarios and respond to operational changes. Existing methods focus on combining images or languages that depict interactions with 3D geometries to introduce external interaction priors. However, they are still vulnerable to a limited semantic space by failing to leverage implied invariant geometries and potential interaction intentions. Normally, humans address complex tasks through multi-step reasoning and respond to diverse situations by leveraging associative and analogical thinking. In light of this, we propose GREAT (Geometry-Intention Collaborative Inference) for Open-Vocabulary 3D Object Affordance Grounding, a novel framework that mines invariant geometry attributes of objects and performs analogical reasoning in potential interaction scenarios to form affordance knowledge, fully combining this knowledge with both geometries and visual contents to ground 3D object affordance. Besides, we introduce the Point Image Affordance Dataset v2 (PIADv2), the largest 3D object affordance dataset at present to support the task. Extensive experiments demonstrate the effectiveness and superiority of GREAT. The code and dataset will be released.
Poster
He Zhu · Quyu Kong · Kechun Xu · Xunlong Xia · Bing Deng · Jieping Ye · Rong Xiong · Yue Wang
[ ExHall D ]
Abstract
Grounding 3D object affordance is a task that locates objects in 3D space where they can be manipulated, which links perception and action for embodied intelligence. For example, for an intelligent robot, it is necessary to accurately ground the affordance of an object and grasp it according to human instructions. In this paper, we introduce a novel task that grounds 3D object affordance based on language instructions, visual observations and interactions, which is inspired by cognitive science. We collect an Affordance Grounding dataset with Points, Images and Language instructions (AGPIL) to support the proposed task. In the 3D physical world, due to observation orientation, object rotation, or spatial occlusion, we can only get a partial observation of the object. So this dataset includes affordance estimations of objects from full-view, partial-view, and rotation-view perspectives. To accomplish this task, we propose LMAffordance3D, the first multi-modal, language-guided 3D affordance grounding network, which applies a vision-language model to fuse 2D and 3D spatial features with semantic features. Comprehensive experiments on AGPIL demonstrate the effectiveness and superiority of our method on this task, even in unseen experimental settings.
Poster
Yueru Jia · Jiaming Liu · Sixiang Chen · Chenyang Gu · Zhilve Wang · Xiaoqi Li · Longzan Luo · Pengwei Wang · Renrui Zhang · Zhongyuan Wang · Shanghang Zhang
[ ExHall D ]
Abstract
3D geometric information is essential for manipulation tasks, as robots need to perceive the 3D environment, reason about spatial relationships, and interact with intricate spatial configurations. Recent research has increasingly focused on the explicit extraction of 3D features, while still facing challenges such as the lack of large-scale robotic 3D data and the potential loss of spatial geometry. To address these limitations, we propose the Lift3D framework, which progressively enhances 2D foundation models with implicit and explicit 3D robotic representations to construct a robust 3D manipulation policy. Specifically, we first design a task-aware masked autoencoder that masks task-relevant affordance patches and reconstructs depth information, enhancing the 2D foundation model’s implicit 3D robotic representation. After self-supervised fine-tuning, we introduce a 2D model-lifting strategy that establishes a positional mapping between the input 3D points and the positional embeddings of the 2D model. Based on the mapping, Lift3D utilizes the 2D foundation model to directly encode point cloud data, leveraging large-scale pretrained knowledge to construct explicit 3D robotic representations while minimizing spatial information loss. In experiments, Lift3D consistently outperforms previous state-of-the-art methods across several simulation benchmarks and real-world scenarios.
Poster
Mingjie Pan · Jiyao Zhang · Tianshu Wu · Yinghao Zhao · Wenlong Gao · Hao Dong
[ ExHall D ]
Abstract
The development of general robotic systems capable of manipulating in unstructured environments is a significant challenge. While Vision-Language Models (VLM) excel in high-level commonsense reasoning, they lack the fine-grained 3D spatial understanding required for precise manipulation tasks. Fine-tuning VLM on robotic datasets to create Vision-Language-Action Models (VLA) is a potential solution, but it is hindered by high data collection costs and generalization issues. To address these challenges, we propose a novel object-centric representation that bridges the gap between VLM's high-level reasoning and the low-level precision required for manipulation. Our key insight is that an object's canonical space, defined by its functional affordances, provides a structured and semantically meaningful way to describe interaction primitives, such as points and directions. These primitives act as a bridge, translating VLM's commonsense reasoning into actionable 3D spatial constraints. In this context, we introduce a dual closed-loop, open-vocabulary robotic manipulation system: one loop for high-level planning through primitive resampling, interaction rendering and VLM checking, and another for low-level execution via 6D pose tracking. This design ensures robust, real-time control without requiring VLM fine-tuning. Extensive experiments demonstrate strong zero-shot generalization across diverse robotic manipulation tasks, highlighting the potential of this approach for automating large-scale simulation data generation.
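The canonical-space primitive idea can be illustrated with a minimal sketch: a point and a direction defined in the object's canonical frame are mapped into the world frame using a tracked 6D object pose, yielding actionable 3D constraints. The function name and interfaces are illustrative, not the paper's API.

```python
import numpy as np

def primitive_to_world(T_obj_in_world, point_canonical, direction_canonical):
    """Map an interaction primitive (a point and a direction expressed in the
    object's canonical, affordance-aligned frame) into the world frame using a
    tracked 6D object pose, turning high-level primitives into actionable 3D
    constraints for the low-level controller.

    T_obj_in_world: (4, 4) homogeneous object pose; point/direction: (3,).
    """
    R, t = T_obj_in_world[:3, :3], T_obj_in_world[:3, 3]
    point_world = R @ point_canonical + t   # points transform with rotation + translation
    dir_world = R @ direction_canonical     # directions only rotate
    return point_world, dir_world / np.linalg.norm(dir_world)
```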
Poster
Tomoya Yoshida · Shuhei Kurita · Taichi Nishimura · Shinsuke Mori
[ ExHall D ]
Abstract
Learning to use tools or objects in common scenes, particularly handling them in various ways as instructed, is a key challenge for developing interactive robots. Training models to generate such manipulation trajectories requires a large and diverse collection of detailed manipulation demonstrations for various objects, which is nearly infeasible to gather at scale. In this paper, we propose a framework that leverages the large-scale ego- and exo-centric video dataset Exo-Ego4D --- constructed globally with substantial effort --- to extract diverse manipulation trajectories at scale. From these extracted trajectories and the associated textual action descriptions, we develop trajectory generation models based on visual and point cloud-based language models. On the recently proposed high-quality egocentric trajectory dataset HOT3D, we confirm that our models successfully generate valid object trajectories, establishing a training dataset and baseline models for the novel task of generating 6DoF manipulation trajectories from action descriptions in egocentric vision. Our dataset and code are available upon acceptance.
Poster
Yu Qi · Yuanchen Ju · Tianming Wei · Chi Chu · Lawson L.S. Wong · Huazhe Xu
[ ExHall D ]
Abstract
3D assembly tasks, such as furniture assembly and component fitting, play a crucial role in daily life and represent essential capabilities for future home robots. Existing benchmarks and datasets predominantly focus on assembling geometric fragments or factory parts, which fall short in addressing the complexities of everyday object interactions and assemblies. To bridge this gap, we present 2BY2, a large-scale annotated dataset for daily pairwise objects assembly, covering 18 fine-grained tasks that reflect real-life scenarios, such as plugging into sockets, arranging flowers in vases, and inserting bread into toasters. The 2BY2 dataset contains 1,034 different instances and 517 pairwise objects with pose and symmetry annotations, requiring approaches that align geometric shapes while accounting for functional and spatial relationships between objects. Leveraging the 2BY2 dataset, we introduce a multi-step paired SE(3) pose estimation method that utilizes equivariant geometric features to enforce assembly constraints. Compared to previous shape assembly methods, our approach achieves state-of-the-art performance across all 18 tasks in the 2BY2 dataset, reducing translation RMSE by an average of 0.046 and rotation RMSE by 11.44 across both inter-category and intra-category tasks. Additionally, robot experiments further validate the reliability and generalization ability of our method for complex 3D assembly tasks.
Poster
Qi Lv · Hao Li · Xiang Deng · Rui Shao · Yinchuan Li · Jianye Hao · Longxiang Gao · MICHAEL YU WANG · Liqiang Nie
[ ExHall D ]
Abstract
Despite the significant success of imitation learning in robotic manipulation, its application to bimanual tasks remains highly challenging. Existing approaches mainly learn a policy to predict a distant next-best end-effector pose (NBP) and then compute the corresponding joint rotation angles for motion using inverse kinematics. However, they suffer from two important issues: (1) rarely considering the physical robotic structure, which may cause self-collisions or interferences, and (2) overlooking the kinematics constraint, which may result in the predicted poses not conforming to the actual limitations of the robot joints. In this paper, we propose the Kinematics enhanced Spatial-TemporAl gRaph Diffuser (KStar Diffuser). Specifically, (1) to incorporate the physical robot structure information into action prediction, KStar Diffuser maintains a dynamic spatial-temporal graph according to the physical bimanual joint motions at continuous timesteps. This dynamic graph serves as the robot-structure condition for denoising the actions; (2) to make the NBP learning objective consistent with kinematics, we introduce differentiable kinematics to provide the reference for optimizing KStar Diffuser. This module regularizes the policy to predict more reliable and kinematics-aware next end-effector poses. Experimental results show that our method effectively leverages the physical structural information and generates kinematics-aware actions in both simulation and the real world.
Poster
Shun Iwase · Zubair Irshad · Katherine Liu · Vitor Guizilini · Robert Lee · Takuya Ikeda · Ayako Amma · Koichi Nishiwaki · Kris Kitani · Rares Andrei Ambrus · Sergey Zakharov
[ ExHall D ]
Abstract
Robotic grasping is a cornerstone capability of embodied systems. Many methods directly output grasps from partial information without modeling the geometry of the scene, leading to suboptimal motion and even collisions. To address these issues, we introduce ZeroGrasp, a novel framework that simultaneously performs 3D reconstruction and grasp pose prediction in near real-time. A key insight of our method is that occlusion reasoning and modeling the spatial relationships between objects is beneficial for both accurate reconstruction and grasping. We couple our method with a novel large-scale synthetic dataset, which is an order of magnitude larger than existing datasets and comprises 1M photo-realistic images, high-resolution 3D reconstructions and 8.9B physically-valid grasp pose annotations for 12K objects from the Objaverse-LVIS dataset. We evaluate ZeroGrasp on the GraspNet-1B benchmark as well as through real-world robot experiments. ZeroGrasp achieves state-of-the-art performance and generalizes to novel real-world objects even when trained only on synthetic data.
Poster
Muchen Li · Sammy Christen · Chengde Wan · Yujun Cai · Renjie Liao · Leonid Sigal · Shugao Ma
[ ExHall D ]
Abstract
Current research on generating 3D hand-object interaction motion primarily focuses on in-domain objects. Generalization to unseen objects is essential for practical applications, yet it remains both challenging and largely unexplored. In this paper, we propose LatentHOI, a novel approach designed to tackle the challenges of generalizing hand-object interaction to unseen objects. Our main insight lies in decoupling high-level temporal motion from fine-grained spatial hand-object interactions with a latent diffusion model coupled with a Grasping Variational Autoencoder (GraspVAE). This configuration not only enhances the conditional dependency between spatial grasp and temporal motion but also improves data utilization and reduces overfitting through regularization in the latent space. We conducted extensive experiments in an unseen-object setting on both single-hand grasping and bi-manual motion datasets, including GRAB, DexYCB, and OakInk. Quantitative and qualitative evaluations demonstrate that our method significantly enhances the realism and physical plausibility of generated motions for unseen objects, both in single and bimanual manipulations, compared to the state-of-the-art.
Poster
Boran Wen · Dingbang Huang · Zichen Zhang · Jiahong Zhou · Jianbin Deng · Jingyu Gong · Yulong Chen · Lizhuang Ma · Yonglu Li
[ ExHall D ]
Abstract
Reconstructing human-object interactions (HOI) from single images is fundamental in computer vision. Existing methods are primarily trained and tested on indoor scenes due to the lack of 3D data, particularly constrained by the object variety, making it challenging to generalize to real-world scenes with a wide range of objects. The limitations of previous 3D HOI datasets were primarily due to the difficulty in acquiring 3D object assets. However, with the development of 3D reconstruction from single images, recently it has become possible to reconstruct various objects from 2D HOI images. We therefore propose a pipeline for annotating fine-grained 3D humans, objects, and their interactions from single images. We annotated 2.5k+ 3D HOI assets from existing 2D HOI datasets and built the first open-vocabulary in-the-wild 3D HOI dataset Open3DHOI, to serve as a future test set. Moreover, we design a novel Gaussian-HOI optimizer, which efficiently reconstructs the spatial interactions between humans and objects while learning the contact regions. Besides the 3D HOI reconstruction, we also propose several new tasks for 3D HOI understanding to pave the way for future work.
Poster
Jeongwan On · Kyeonghwan Gwak · Gunyoung Kang · Junuk Cha · Soohyun Hwang · Hyein Hwang · Seungryul Baek
[ ExHall D ]
Abstract
Reconstructing 3D hand-object interaction (HOI) is a fundamental problem that can find numerous applications. Despite recent advances, there is no comprehensive pipeline yet for bimanual class-agnostic interaction reconstruction from a monocular RGB video, where two hands and an unknown object are interacting with each other. Previous works tackled the limited hand-object interaction case, where object templates are pre-known or only one hand is involved in the interaction. Bimanual interaction reconstruction exhibits severe occlusions introduced by complex interactions between two hands and an object. To solve this, we introduce BIGS (Bimanual Interaction 3D Gaussian Splatting), a method that reconstructs 3D Gaussians of hands and an unknown object from a monocular video. To robustly obtain object Gaussians while avoiding severe occlusions, we leverage the prior knowledge of a pre-trained diffusion model with a score distillation sampling (SDS) loss to reconstruct unseen object parts. For hand Gaussians, we exploit the 3D priors of a hand model (i.e., MANO) and share a single Gaussian for the two hands to effectively accumulate hand 3D information, given limited views. To further consider the 3D alignment between hands and objects, we include an interacting-subjects optimization step during Gaussian optimization. Our method achieves state-of-the-art accuracy on two challenging datasets, in …
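For context, a minimal score distillation sampling (SDS) surrogate loss is sketched below. A diffusers-style `unet` and `scheduler` are assumed interfaces, and `latents` stand in for an encoded rendering of the object Gaussians; this is not the paper's exact implementation.

```python
import torch

def sds_surrogate_loss(unet, scheduler, latents, cond_emb):
    """Score distillation sampling (SDS): noise the rendered latents at a random
    timestep, query a frozen diffusion model for the noise it believes was added,
    and backpropagate (eps_pred - eps) into the rendering so occluded object
    parts are pulled toward the diffusion prior. `latents` should require
    gradients; timestep weighting is omitted for brevity.
    """
    t = torch.randint(20, 980, (latents.shape[0],), device=latents.device)
    eps = torch.randn_like(latents)
    noisy = scheduler.add_noise(latents, eps, t)
    with torch.no_grad():
        eps_pred = unet(noisy, t, encoder_hidden_states=cond_emb).sample
    grad = eps_pred - eps
    # surrogate whose gradient w.r.t. `latents` equals `grad`
    return (grad.detach() * latents).sum()
```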
Poster
Kefan Chen · Chaerin Min · Linguang Zhang · Shreyas Hampali · Cem Keskin · Srinath Sridhar
[ ExHall D ]
Abstract
Despite remarkable progress in image generation models, generating realistic hands remains a persistent challenge due to their complex articulation, varying viewpoints, and frequent occlusions. We present FoundHand, a large-scale domain-specific diffusion model for synthesizing single and dual hand images. To train our model, we introduce FoundHand-10M, a large-scale hand dataset with 2D keypoints and segmentation mask annotations. Our insight is to use 2D hand keypoints as a universal representation that encodes both hand articulation and camera viewpoint. FoundHand learns from image pairs to capture physically plausible hand articulations, natively enables precise control through 2D keypoints, and supports appearance control. Our model exhibits core capabilities that include the ability to repose hands, transfer hand appearance, and even synthesize novel views. This leads to zero-shot capabilities for fixing malformed hands in previously generated images, or synthesizing hand video sequences. We present extensive experiments and evaluations that demonstrate state-of-the-art performance of our method.
Poster
Rao Fu · Dingxi Zhang · Alex Jiang · Wanjia Fu · Austin Funk · Daniel Ritchie · Srinath Sridhar
[ ExHall D ]
Abstract
Understanding bimanual human hand activities is a critical problem in AI and robotics. We cannot build large models of bimanual activities because existing datasets lack the scale, coverage of diverse hand activities, and detailed annotations. We introduce GigaHands, a massive annotated dataset capturing 34 hours of bimanual hand activities from 56 subjects and 417 objects, totaling 14k motion clips derived from 183 million frames paired with 84k text annotations. Our markerless capture setup and data acquisition protocol enable fully automatic 3D hand and object estimation while minimizing the effort required for text annotation. The scale and diversity of GigaHands enable broad applications, including text-driven action synthesis, hand motion captioning, and dynamic radiance field reconstruction.
Poster
Buzhen Huang · Chen Li · Chongyang Xu · Dongyue Lu · Jinnan Chen · Yangang Wang · Gim Hee Lee
[ ExHall D ]
Abstract
Due to visual ambiguities and inter-person occlusions, existing human pose estimation methods cannot recover plausible close interactions from in-the-wild videos. Even state-of-the-art large foundation models (e.g., SAM) cannot accurately distinguish human semantics in such challenging scenarios. In this work, we find that human appearance can provide a straightforward cue to address these obstacles. Based on this observation, we propose a dual-branch optimization framework to reconstruct accurate interactive motions with plausible body contacts constrained by human appearances, social proxemics, and physical laws. Specifically, we first train a diffusion model to learn the human proxemic behavior and pose prior knowledge. The trained network and two optimizable tensors are then incorporated into a dual-branch optimization framework to reconstruct human motions and appearances. Several constraints based on 3D Gaussians, 2D keypoints, and mesh penetrations are also designed to assist the optimization. With the proxemics prior and diverse constraints, our method is capable of estimating accurate interactions from in-the-wild videos captured in complex environments. We further build a dataset with pseudo ground-truth interaction annotations, which may promote future research on pose estimation and human behavior understanding. Experimental results on several benchmarks demonstrate that our method outperforms existing approaches. The code and data will be publicly available …
Poster
Jin Lyu · Tianyi Zhu · Yi Gu · Li Lin · Pujin Cheng · Yebin Liu · Xiaoying Tang · Liang An
[ ExHall D ]
Abstract
Quantitative analysis of animal behavior and biomechanics requires accurate animal pose and shape estimation across species, and is important for animal welfare and biological research. However, the small network capacity of previous methods and limited multi-species datasets leave this problem underexplored. To this end, this paper presents AniMer, which estimates animal pose and shape using a family-aware Transformer, enhancing the reconstruction accuracy of diverse quadrupedal families. A key insight of AniMer is its integration of a high-capacity Transformer-based backbone and an animal-family supervised contrastive learning scheme, unifying the discriminative understanding of various quadrupedal shapes within a single framework. For effective training, we aggregate most available open-sourced quadrupedal datasets, either with 3D or 2D labels. To improve the diversity of 3D labeled data, we introduce CtrlAni3D, a novel large-scale synthetic dataset created through a new diffusion-based conditional image generation pipeline. CtrlAni3D consists of about 10k images with pixel-aligned SMAL labels. In total, we obtain 41.3k annotated images for training and validation. Consequently, the combination of a family-aware Transformer network and an expansive dataset enables AniMer to outperform existing methods not only on 3D datasets like Animal3D and CtrlAni3D, but also on the out-of-distribution Animal Kingdom dataset. Ablation studies further demonstrate …
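The family-supervised contrastive signal can be sketched with a standard SupCon-style loss over family labels, as below. The temperature, self-pair masking, and per-anchor averaging are generic choices, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def family_supcon_loss(features, family_labels, temperature=0.1):
    """Supervised contrastive loss over animal-family labels: embeddings of
    images from the same quadrupedal family are pulled together and different
    families are pushed apart.

    features: (B, C) backbone embeddings; family_labels: (B,) integer family ids.
    """
    z = F.normalize(features, dim=1)
    sim = z @ z.t() / temperature                                  # (B, B) similarities
    B = z.shape[0]
    not_self = ~torch.eye(B, dtype=torch.bool, device=z.device)
    pos_mask = (family_labels[:, None] == family_labels[None, :]) & not_self

    sim = sim.masked_fill(~not_self, -1e9)                         # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    pos_count = pos_mask.sum(1)
    valid = pos_count > 0                                          # anchors with >= 1 positive
    loss = -(pos_mask * log_prob).sum(1)[valid] / pos_count[valid]
    return loss.mean()
```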
Poster
Andrea Boscolo Camiletto · Jian Wang · Eduardo Alvarado · Rishabh Dabral · Thabo Beeler · Marc Habermann · Christian Theobalt
[ ExHall D ]
Abstract
Egocentric motion capture with a head-mounted body-facing stereo camera is crucial for VR and AR applications but presents significant challenges such as heavy occlusions and limited annotated real-world data. Existing methods heavily rely on synthetic pretraining and struggle to generate smooth and accurate predictions in real-world settings, particularly for lower limbs. Our work addresses these limitations by introducing a lightweight VR-based data collection setup with on-board, real-time 6D pose tracking. Using this setup, we collected the most extensive real-world dataset for ego-facing ego-mounted cameras to date in size and motion variability. Effectively integrating this multimodal input -- device pose and camera feeds -- is challenging due to the differing characteristics of each data source. To address this, we propose FRAME, a simple yet effective architecture that combines device pose and camera feeds for state-of-the-art body pose prediction through geometrically sound multimodal integration and can run at 300 FPS on modern hardware. Lastly, we showcase a novel training strategy to enhance the model's generalization capabilities. Our approach exploits the problem's geometric properties, yielding high-quality motion capture free from common artifacts in prior work. Qualitative and quantitative evaluations, along with extensive comparisons, demonstrate the effectiveness of our method. We will release data, code, and CAD designs for the …
Poster
Hyunjun Lee · Hyunsoo Lee · Sookwan Han
[ ExHall D ]
Abstract
There have been many attempts to leverage multiple diffusion models for collaborative generation, extending beyond the original domain. One prominent approach is synchronizing multiple diffusion trajectories by mixing the estimated scores to artificially correlate the generation processes. However, existing approaches rely on heuristics such as averaging for synchronizing trajectories. Such approaches do not clarify WHY these methods work, and they create many failure cases when the heuristic used on one task is naively applied to other tasks. In this paper, we present a probabilistic framework for analyzing why diffusion synchronization works, and prove that the heuristics model the correlations between multiple trajectories and hence must be adapted to each task in which synchronization takes place. We identify the best correlation model for each task, yielding better results than previous approaches that naively apply a single heuristic to every task without reasoning.
Poster
Jiaman Li · Karen Liu · Jiajun Wu
[ ExHall D ]
Abstract
Estimating 3D motion from 2D observations is a long-standing research challenge. Prior work typically requires training on datasets containing ground truth 3D motions, limiting applicability to activities well-represented in existing motion capture data. This dependency particularly hinders generalization to out-of-distribution scenarios or subjects where collecting 3D ground truth is challenging, such as complex athletic movements or animal motion. We introduce MVLift, a novel approach to predict global 3D motion, including both joint rotations and root trajectories in the world coordinate system, using only 2D pose sequences for training. Our multi-stage framework leverages 2D motion diffusion models to progressively generate consistent 2D pose sequences across multiple views, a key step in recovering accurate global 3D motion. MVLift generalizes across various domains, including human poses, human-object interactions, and animal poses. Despite not requiring 3D supervision, it outperforms prior work on five datasets, including methods that require 3D supervision.
Poster
Kenkun Liu · Yurong Fu · Weihao Yuan · Jing Lin · Peihao Li · Xiaodong Gu · Lingteng Qiu · Haoqian Wang · Zilong Dong · Xiaoguang Han
[ ExHall D ]
Abstract
Existing methods for capturing multi-person holistic human motions from a monocular video usually involve integrating the detector, the tracker, and the human pose & shape estimator into a cascaded system. In contrast, we develop a one-stage multi-person holistic human motion capture system, which 1) employs only one network, enabling significant benefits from end-to-end training on a large-scale dataset; 2) allows the tracking module to improve during training, rather than being limited by a pre-trained tracker; 3) captures the motions of all individuals within a single shot, rather than tracking and estimating each person sequentially. In this system, each query within a temporal cross-attention module is responsible for the long motion of a specific individual, implicitly aggregating individual-specific information throughout the entire video. To further benefit the proposed system through end-to-end training, we also construct a synthetic human video dataset with multi-person and whole-body annotations. Extensive experiments across different datasets demonstrate the efficacy and efficiency of both the proposed method and the dataset. The code of our method will be made publicly available.
Poster
Yinhuai Wang · Qihan Zhao · Runyi Yu · Hok Wai Tsui · Ailing Zeng · Jing Lin · Zhengyi Luo · Jiwen Yu · Xiu Li · Qifeng Chen · Jian Zhang · Lei Zhang · Ping Tan
[ ExHall D ]
Abstract
Traditional reinforcement learning methods for interaction skills rely on labor-intensive, manually designed rewards that do not generalize well across different skills. Inspired by how humans learn from demonstrations, we propose ISMimic, the first data-driven approach that Mimics both human and ball motions to learn diverse Interaction Skills, e.g., a wide variety of challenging basketball skills. ISMimic employs a unified configuration to learn diverse interaction skills from human-ball motion datasets, with skill diversity and generalization improving as the dataset grows. This approach allows a single policy to learn multiple interaction skills and to switch between them smoothly. The interaction skills acquired by ISMimic can be easily reused by a high-level controller to accomplish high-level tasks. To evaluate our approach, we introduce two basketball datasets that collectively contain about 35 minutes of diverse basketball skills. Experiments show that our method can effectively acquire various reusable basketball skills, including diverse styles of dribbling, layups, and shooting. Video results and 3D visualizations are available at https://ismimic.github.io
Poster
Yingying Fan · Quanwei Yang · Kaisiyuan Wang · Hang Zhou · Yingying Li · Haocheng Feng · Errui Ding · Yu Wu · Jingdong Wang
[ ExHall D ]
Abstract
Current digital human studies focusing on lip-syncing and body movement are no longer sufficient to meet the growing industrial demand, while human video generation techniques that support interacting with real-world environments (e.g., objects) have not been well investigated. Despite human hand synthesis already being an intricate problem, generating objects in contact with hands and their interactions presents an even more challenging task, especially when the objects exhibit obvious variations in size and shape. To cope with these issues, we present a novel video Reenactment framework focusing on Human-Object Interaction (HOI) via an adaptive Layout-instructed Diffusion model (Re-HOLD). Our key insight is to employ specialized layout representation for hands and objects, respectively. Such representations enable effective disentanglement of hand modeling and object adaptation to diverse motion sequences. To further improve the generation quality of HOI, we newly design an interactive textural enhancement module for both hands and objects by introducing two independent memory banks. We also propose a layout-adjusting strategy for the cross-object reenactment scenario to adaptively adjust unreasonable layouts caused by diverse object sizes during inference. Comprehensive qualitative and quantitative evaluations demonstrate that our proposed Re-HOLD framework significantly outperforms existing methods.
Poster
Peishan Cong · Ziyi Wang · Yuexin Ma · Xiangyu Yue
[ ExHall D ]
Abstract
Generating reasonable and high-quality human interactive motions in a given dynamic environment is crucial for understanding, modeling, transferring, and applying human behaviors to both virtual and physical robots. In this paper, we introduce an effective method, SemGeoMo, for dynamic contextual human motion generation, which fully leverages the text-affordance-joint multi-level semantic and geometric guidance in the generation process, improving the semantic rationality and geometric correctness of generative motions. Our method achieves state-of-the-art performance on three datasets and demonstrates superior generalization capability for diverse interaction scenarios.
Poster
Xuan Li · Qianli Ma · Tsung-Yi Lin · Yongxin Chen · Chenfanfu Jiang · Ming-Yu Liu · Donglai Xiang
[ ExHall D ]
Abstract
We present Articulated Motion Distillation (AMD), a framework for generating high-fidelity character animations by merging the strengths of skeleton-based animation and modern generative models. AMD uses a skeleton-based representation for rigged 3D assets, drastically reducing the Degrees of Freedom (DoFs) by focusing on joint-level control, which allows for efficient, consistent motion synthesis. Through Score Distillation Sampling (SDS) with pre-trained video diffusion models, AMD distills complex, articulated motions while maintaining structural integrity, overcoming challenges faced by 4D neural deformation fields in preserving shape consistency. This approach is naturally compatible with physics-based simulation, ensuring physically plausible interactions. Experiments show that AMD achieves superior 3D consistency and expressive motion quality compared with existing works on text-to-4D generation.
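As a rough illustration of how Score Distillation Sampling can drive low-DoF joint parameters through a frozen video diffusion model, the sketch below assumes a differentiable renderer `render_fn`, a frozen noise predictor `pretrained_eps`, and a DDPM-style `alphas_cumprod` schedule; all names, shapes, and the single-timestep update are illustrative assumptions, not AMD's actual implementation.

```python
import torch

def sds_update(render_fn, joint_params, pretrained_eps, alphas_cumprod, cond, weight=1.0):
    """One SDS step on skeleton parameters (hedged sketch).

    render_fn      : differentiable renderer, joint parameters -> (B, C, T, H, W) frames
    pretrained_eps : frozen video diffusion model's noise predictor eps(x_t, t, cond)
    alphas_cumprod : 1-D tensor holding the model's cumulative noise schedule
    """
    frames = render_fn(joint_params)                        # differentiable rendering
    t = torch.randint(0, alphas_cumprod.numel(), (1,))
    a_bar = alphas_cumprod.to(frames.device)[t].view(1, 1, 1, 1, 1)
    noise = torch.randn_like(frames)
    frames_t = a_bar.sqrt() * frames + (1.0 - a_bar).sqrt() * noise
    with torch.no_grad():                                   # the diffusion prior stays frozen
        eps_pred = pretrained_eps(frames_t, t, cond)
    # SDS back-propagates the weighted residual directly, skipping the U-Net Jacobian,
    # so the gradient lands on the joint-level parameters upstream of the renderer.
    frames.backward(gradient=weight * (eps_pred - noise))
```

An optimizer step on `joint_params` would follow each call, under the usual SDS training loop.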
Poster
Lei Li · Sen Jia · Jianhao Wang · Zhongyu Jiang · Feng Zhou · Ju Dai · Tianfang Zhang · Zongkai Wu · Jenq-Neng Hwang
[ ExHall D ]
Abstract
This paper presents LLaMo (Large Language and Human Motion Assistant), a multimodal framework for human motion instruction tuning. In contrast to conventional instruction-tuning approaches that convert non-linguistic inputs, such as video or motion sequences, into language tokens, LLaMo retains motion in its native form for instruction tuning. This method preserves motion-specific details that are often diminished in tokenization, thereby improving the model’s ability to interpret complex human behaviors. By processing both video and motion data alongside textual inputs, LLaMo enables a flexible, human-centric analysis. Experimental evaluations across high-complexity domains, including human behaviors and professional activities, indicate that LLaMo effectively captures domain-specific knowledge, enhancing comprehension and prediction in motion-intensive scenarios. We hope LLaMo offers a foundation for future multimodal AI systems with broad applications, from sports analytics to behavioral prediction.
Poster
Jianrong Zhang · Hehe Fan · Yi Yang
[ ExHall D ]
Abstract
Diffusion models, particularly latent diffusion models, have demonstrated remarkable success in text-driven human motion generation. However, it remains challenging for latent diffusion models to effectively compose multiple semantic concepts into a single, coherent motion sequence. To address this issue, we propose EnergyMoGen, which includes two spectra of Energy-Based Models: (1) We interpret the diffusion model as a latent-aware energy-based model that generates motions by composing a set of diffusion models in latent space; (2) We introduce a semantic-aware energy model based on cross-attention, which enables semantic composition and adaptive gradient descent for text embeddings. To overcome the challenges of semantic inconsistency and motion distortion across these two spectra, we introduce Synergistic Energy Fusion. This design allows the motion latent diffusion model to synthesize high-quality, complex motions by combining multiple energy terms corresponding to textual descriptions. Experiments show that our approach outperforms existing state-of-the-art models on various motion generation tasks, including text-to-motion generation, compositional motion generation, and multi-concept motion generation. Additionally, we demonstrate that our method can be used to extend motion datasets and improve the text-to-motion task. Our implementation will be released.
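At its simplest, the compositional view amounts to summing the noise (score) predictions of several diffusion models in a shared latent space; the sketch below is a minimal, hedged illustration of that weighted score composition, not the paper's Synergistic Energy Fusion, and `eps_models` and `weights` are hypothetical.

```python
import torch

def composed_score(eps_models, x_t, t, weights):
    """Weighted composition of noise predictions (a scaled negative score each).

    Summing the scores corresponds to sampling from a product of the underlying
    energy-based models; the weights are illustrative knobs, not paper values.
    """
    eps = torch.zeros_like(x_t)
    for w, model in zip(weights, eps_models):
        eps = eps + w * model(x_t, t)
    return eps
```

In practice the composed prediction would simply replace the single-model prediction inside an ordinary reverse-diffusion update.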
Poster
Chi-Pin Huang · Yen-Siang Wu · Hung-Kai Chung · Kai-Po Chang · Fu-En Yang · Yu-Chiang Frank Wang
[ ExHall D ]
Abstract
Customized text-to-video generation aims to produce high-quality videos that incorporate user-specified subject identities or motion patterns. However, existing methods mainly focus on personalizing a single concept, either subject identity or motion pattern, limiting their effectiveness for multiple subjects with the desired motion patterns. To tackle this challenge, we propose a unified framework VideoMage for video customization over both multiple subjects and their interactive motions. VideoMage employs subject and motion LoRAs to capture personalized content from user-provided images and videos, along with an appearance-agnostic motion learning approach to disentangle motion patterns from visual appearance. Furthermore, we develop a spatial-temporal composition scheme to guide interactions among subjects within the desired motion patterns. Extensive experiments demonstrate that VideoMage outperforms existing methods, generating coherent, user-controlled videos with consistent subject identities and interactions. Our code will be available upon acceptance.
Poster
Kumar Ashutosh · Georgios Pavlakos · Kristen Grauman
[ ExHall D ]
Abstract
Anticipating how a person will interact with objects in an environment is essential for activity understanding, but existing methods are limited to the 2D space of video frames—capturing physically ungrounded predictions of “what” and ignoring the “where” and “how”. We introduce 4D future interaction prediction from videos. Given an input video of a human activity, the goal is to predict what objects at what 3D locations the person will interact with in the next time period (e.g., cabinet, fridge), and how they will execute that interaction (e.g., poses for bending, reaching, pulling). We propose a novel model FICTION that fuses the past video observation of the person’s actions and their environment to predict both the “where” and “how” of future interactions. Through comprehensive experiments on a variety of activities and real-world environments in Ego-Exo4D, we show that our proposed approach outperforms prior autoregressive and (lifted) 2D video models substantially, with more than 30% relative gains.
Poster
Jiuming Liu · Jinru Han · Lihao Liu · Angelica I Aviles-Rivero · Chaokang Jiang · Zhe Liu · Hesheng Wang
[ ExHall D ]
Abstract
Point cloud videos can faithfully capture real-world spatial geometries and temporal dynamics, which are essential for enabling intelligent agents to understand the dynamically changing world. However, designing an effective 4D backbone remains challenging, mainly due to the irregular and unordered distribution of points and temporal inconsistencies across frames. Also, recent transformer-based 4D backbones commonly suffer from large computational costs due to their quadratic complexity, particularly for long video sequences. To address these challenges, we propose a novel point cloud video understanding backbone purely based on the State Space Models (SSMs). Specifically, we first disentangle space and time in 4D video sequences and then establish the spatio-temporal correlation with our designed Mamba blocks. The Intra-frame Spatial Mamba module is developed to encode locally similar geometric structures within a certain temporal stride. Subsequently, locally correlated tokens are delivered to the Inter-frame Temporal Mamba module, which integrates long-term point features across the entire video with linear complexity. Our proposed Mamba4d achieves competitive performance on the MSR-Action3D action recognition (+10.4% accuracy), HOI4D action segmentation (+0.7 F1 Score), and Synthia4D semantic segmentation (+0.19 mIoU) datasets. Especially, for long video sequences, our method has a significant efficiency improvement with 87.5% GPU memory reduction and ×5.36 speed-up. …
Poster
Vadim Tschernezki · Diane Larlus · Andrea Vedaldi · Iro Laina
[ ExHall D ]
Abstract
Computer vision is largely based on 2D techniques, with 3D vision still relegated to a relatively narrow subset of applications. However, by building on recent advances in 3D models such as neural radiance fields, some authors have shown that 3D techniques can at last improve outputs extracted from independent 2D views, by fusing them into 3D and denoising them. This is particularly helpful in egocentric videos, where the camera motion is significant, but only under the assumption that the scene itself is static. In fact, as shown in the recent analysis conducted by EPIC Fields, 3D techniques are ineffective when it comes to studying dynamic phenomena, and segmenting moving objects in particular. In this paper, we look into this issue in more detail. First, we propose to improve dynamic segmentation in 3D by fusing motion segmentation predictions from a 2D-based model into layered radiance fields (layered motion fusion). Additionally, we observe that the fusion is limited due to the high complexity of very long dynamic videos, resulting in a lack of captured geometry and in turn a constrained fusion of motion into the (missing) dynamic scene geometry. We address this issue through test-time refinement, which helps the model to focus …
Poster
Haoyue Liu · Jinghan Xu · Yi Chang · Hanyu Zhou · Haozhi Zhao · Lin Wang · Luxin Yan
[ ExHall D ]
Abstract
Video frame interpolation (VFI) that leverages bio-inspired event cameras as guidance has recently shown better performance and memory efficiency than frame-based methods, thanks to the event cameras' advantages, such as high temporal resolution. A hurdle for event-based VFI is how to effectively deal with non-linear motion, caused by dynamic changes in motion direction and speed within the scene. Existing methods either use events to estimate sparse motion fields or fuse events with image features to estimate dense motion fields. Unfortunately, motion errors often degrade VFI quality, as the continuous motion cues from events do not align with the dense spatial information of images in the temporal dimension. In this paper, we find that object motion is continuous in space; tracking local regions over continuous time therefore enables more accurate identification of spatiotemporal feature correlations. In light of this, we propose a novel continuous point tracking-based VFI framework, named TimeTracker. Specifically, we first design a Scene-Aware Region Segmentation (SARS) module to divide the scene into similar patches. Then, a Continuous Trajectory guided Motion Estimation (CTME) module is proposed to track the continuous motion trajectory of each patch through events. Finally, intermediate frames at any given time are generated …
Poster
Zhengfei Kuang · Tianyuan Zhang · Kai Zhang · Hao Tan · Sai Bi · Yiwei Hu · Zexiang Xu · Milos Hasan · Gordon Wetzstein · Fujun Luan
[ ExHall D ]
Abstract
We present Buffer Anytime, a framework for estimating depth and normal maps (which we call geometric buffers) from video that eliminates the need for paired video-depth and video-normal training data. Instead of relying on large-scale annotated video datasets, we demonstrate high-quality video buffer estimation by leveraging single-image priors with temporal consistency constraints. Our zero-shot training strategy builds on state-of-the-art image estimation models and enforces optical-flow-based smoothness through a hybrid loss function, implemented via a lightweight temporal attention architecture. Applied to leading image models such as Depth Anything V2 and Marigold-E2E-FT, our approach significantly improves temporal consistency while maintaining accuracy. Experiments show that our method not only outperforms image-based approaches but also achieves results comparable to state-of-the-art video models trained on large-scale paired video datasets, despite using no such paired video data.
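One plausible building block of such a temporal-consistency objective is to penalize differences between a frame's predicted depth and the next frame's depth warped back along optical flow. The hybrid loss in the paper also involves smoothness weighting and image priors, so treat the following as an assumption-laden sketch only; all tensor names are illustrative.

```python
import torch
import torch.nn.functional as F

def temporal_consistency_loss(depth_t, depth_t1, flow_t_to_t1):
    """Penalise depth changes along optical-flow correspondences (sketch).

    depth_t, depth_t1 : (B, 1, H, W) depth predictions for frames t and t+1
    flow_t_to_t1      : (B, 2, H, W) forward optical flow in pixels
    """
    B, _, H, W = depth_t.shape
    # Base pixel grid (x, y) in pixel coordinates
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=torch.float32),
        torch.arange(W, dtype=torch.float32),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=0).unsqueeze(0).to(depth_t)  # (1, 2, H, W)
    coords = base + flow_t_to_t1                                  # where each pixel lands in t+1
    # Normalise to [-1, 1] for grid_sample (x first, then y)
    coords_x = 2.0 * coords[:, 0] / (W - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (H - 1) - 1.0
    grid = torch.stack((coords_x, coords_y), dim=-1)              # (B, H, W, 2)
    warped_t1 = F.grid_sample(depth_t1, grid, align_corners=True) # depth of t+1 pulled back to t
    return (warped_t1 - depth_t).abs().mean()
```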
Poster
Min Wu Jeong · Chae Eun Rhee
[ ExHall D ]
Abstract
In this paper, we propose LC-Mamba, a Mamba-based model that captures fine-grained spatiotemporal information in video frames, addressing limitations in current interpolation methods and enhancing performance. The main contributions are as follows: First, we apply a shifted local window technique to reduce historical decay and enhance local spatial features, allowing multi-scale capture of detailed motion between frames. Second, we introduce a Hilbert curve-based selective state scan to maintain continuity across window boundaries, preserving spatial correlations both within and between windows. Third, we extend the Hilbert curve to enable voxel-level scanning to effectively capture spatiotemporal characteristics between frames. The proposed LC-Mamba achieves competitive results, with a PSNR of 36.53 dB on Vimeo-90k, outperforming prior models by +0.03 dB. The code and models are publicly available at https://anonymous.4open.science/r/LC-Mamba-FE7C
Poster
Xin Yu · Tianyu Wang · Soo Ye Kim · Paul Guerrero · Xi Chen · Qing Liu · Zhe Lin · Xiaojuan Qi
[ ExHall D ]
Abstract
Simple as it seems, moving an object to another location within an image is, in fact, a challenging image-editing task that requires re-harmonizing the lighting, adjusting the pose based on perspective, accurately filling occluded regions, and ensuring coherent synchronization of shadows and reflections while maintaining the object identity. In this paper, we present ObjectMover, a generative model that can perform object movement in highly challenging scenes. Our key insight is to model this task as a sequence-to-sequence problem and fine-tune a video generation model to leverage its knowledge of consistent object generation across video frames. We show that with this approach, our model is able to adjust to complex real-world scenarios, handling extreme lighting harmonization and object effect movement. As large-scale data for object movement are unavailable, we construct a data generation pipeline using a modern game engine to synthesize high-quality data pairs. We further propose a multi-task learning strategy that enables training on real-world video data to improve model generalization. Through extensive experiments, we demonstrate that ObjectMover achieves outstanding results and adapts well to real-world scenarios.
Poster
Juil Koo · Paul Guerrero · Chun-Hao P. Huang · Duygu Ceylan · Minhyuk Sung
[ ExHall D ]
Abstract
Generative methods for image and video editing use generative models as priors to perform edits despite incomplete information, such as changing the composition of 3D objects shown in a single image. Recent methods have shown promising composition editing results in the image setting, but in the video setting, editing methods have focused on editing object appearance, object motion, or camera motion, and as a result, methods to edit object composition in videos are still missing. We propose VideoHandles as a method for editing 3D object compositions in videos of static scenes with camera motion. Our approach allows editing the 3D position of a 3D object across all frames of a video in a temporally consistent manner. This is achieved by lifting intermediate features of a generative model to a 3D reconstruction that is shared between all frames, editing the reconstruction, and projecting the features on the edited reconstruction back to each frame. To the best of our knowledge, this is the first generative approach to edit object compositions in videos. Our approach is simple and training-free, while outperforming state-of-the-art image editing baselines.
Poster
Jiarui Xu · Shihao Han · Karan Dalal · Daniel Koceja · Yue Zhao · Ka Chun Cheung · Yejin Choi · Jan Kautz · Yu Sun · Xiaolong Wang
[ ExHall D ]
Abstract
We present a novel framework for generating long-form cartoon videos, specifically focusing on recreating the classic "Tom and Jerry" series. While recent advances in video generation have shown promising results for short clips, generating long videos with coherent storylines and dynamic motions remains challenging with high computation costs. We propose a hybrid framework that combines local self-attention with a Test-Time Training (TTT) based global attention mechanism, enabling our model to process and maintain consistency across significantly longer temporal context windows. We develop a new dataset curation pipeline specifically designed for long-form cartoon videos, combining human annotations for complex motion dynamics with Vision-Language Models for detailed descriptions. Our pipeline captures the exaggerated movements and dynamic camera work characteristic of "Tom and Jerry". Experiments show that our approach outperforms existing methods in generating long-form animated content with plausible motion and consistent storylines.
Poster
Shaoteng Liu · Tianyu Wang · Jui-Hsien Wang · Qing Liu · Zhifei Zhang · Joon-Young Lee · Yijun Li · Bei Yu · Zhe Lin · Soo Ye Kim · Jiaya Jia
[ ExHall D ]
Abstract
Large-scale video generation models have the inherent ability to realistically model natural scenes. In this paper, we demonstrate that through a careful design of a generative video propagation framework, various video tasks can be addressed in a unified way by leveraging the generative power of such models. Specifically, our framework, GenProp, encodes the original video with a selective content encoder and propagates the changes made to the first frame using an image-to-video generation model. We propose a data generation scheme to cover multiple video tasks based on instance-level video segmentation datasets. Our model is trained by incorporating a mask prediction decoder head and optimizing a region-aware loss to aid the encoder to preserve the original content while the generation model propagates the modified region. This novel design opens up new possibilities: In editing scenarios, GenProp allows substantial changes to an object's shape; for insertion, the inserted objects can exhibit independent motion; for removal, GenProp effectively removes effects like shadows and reflections from the whole video; for tracking, GenProp is capable of tracking objects and their associated effects together. Experiment results demonstrate the leading performance of our model in various video tasks, and we further provide in-depth analyses of the proposed …
Poster
Chaoyang Wang · Peiye Zhuang · Tuan Duc Ngo · Willi Menapace · Aliaksandr Siarohin · Michael Vasilkovsky · Ivan Skorokhodov · Sergey Tulyakov · Peter Wonka · Hsin-Ying Lee
[ ExHall D ]
Abstract
We propose 4Real-Video, a novel framework for generating 4D videos, organized as a grid of video frames with both time and viewpoint axes. In this grid, each row contains frames sharing the same timestep, while each column contains frames from the same viewpoint. One stream performs viewpoint updates on columns, and the other stream performs temporal updates on rows. After each diffusion transformer layer, a newly designed synchronization layer exchanges information between the two token streams. We propose two implementations of the synchronization layer, using either hard or soft synchronization. This feedforward architecture improves upon previous work in three ways: higher inference speed, enhanced visual quality (measured by FVD, CLIP, and VideoScore), and improved temporal and viewpoint consistency (measured by VideoScore, GIM-Confidence, and Dust3R-Confidence).
Poster
Guodong Ding · Rongyu Chen · Angela Yao
[ ExHall D ]
Abstract
This work presents the first condensation approach for procedural video datasets used in temporal action segmentation. We propose a condensation framework that leverages a generative prior learned from the dataset and network inversion to condense data into compact latent codes, significantly reducing storage along both temporal and channel dimensions. Orthogonally, we propose sampling diverse and representative action sequences to minimize video-wise redundancy. Our evaluation on standard benchmarks demonstrates consistent effectiveness in condensing TAS datasets and achieving competitive performance. Specifically, on the Breakfast dataset, our approach reduces storage by over 500× while retaining 83% of the performance compared to training with the full dataset. Furthermore, when applied to a downstream incremental learning task, it yields superior performance compared to the state-of-the-art.
Poster
Muhammad Umar Karim Khan · Aaron Chadha · Mohammad Ashraful Anam · Yiannis Andreopoulos
[ ExHall D ]
Abstract
Standard video codecs are rate-distortion optimization machines, where distortion is typically quantified using PSNR versus the source. However, it is now widely accepted that increasing PSNR does not necessarily translate to better visual quality. In this paper, a better balance between perception and fidelity is targeted, in order to provide significant rate savings over state-of-the-art standards-based video codecs. Specifically, pre- and post-processing neural networks are proposed that enhance the coding efficiency of standard video codecs when benchmarked with an array of well-established perceptual quality scores. These "neural wrapper" elements are end-to-end trained with a neural codec module serving as a differentiable proxy for standard video codecs. The codec proxy is jointly optimized with the pre- and post-processing components, via a novel two-phase pretraining strategy and end-to-end iterative refinement with stop-gradient. This allows the neural pre- and postprocessor to learn to embed, remove and recover information in a codec-aware manner, thus improving its rate-quality performance. A single neural-wrapper model is thereby established and used for the entire rate-quality curve without needing any downscaling or upscaling. The trained model is tested with the AV1 and VVC standard codecs via an array of well-established objective quality scores (SSIM, MS-SSIM, VMAF, AVQT), as …
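A minimal sketch of the wrapper layout is given below, assuming `pre`, `codec_proxy`, and `post` are arbitrary nn.Module networks; the freezing logic is only an approximation of the two-phase, stop-gradient training schedule described in the abstract.

```python
import torch.nn as nn

def wrapper_forward(pre: nn.Module, codec_proxy: nn.Module, post: nn.Module,
                    frames, freeze_proxy: bool = True):
    """Run frames through a neural wrapper around a differentiable codec proxy.

    When freeze_proxy is True, the proxy's parameters receive no gradient
    (a stand-in for the stop-gradient phase), yet gradients still flow through
    its activations so the preprocessor can adapt to the codec's behaviour.
    """
    if freeze_proxy:
        for p in codec_proxy.parameters():
            p.requires_grad_(False)
    x = pre(frames)        # codec-aware embedding of information before coding
    y = codec_proxy(x)     # differentiable stand-in for an AV1/VVC encode-decode cycle
    return post(y)         # recover and enhance information after decoding
```

The training loss would combine a perceptual quality term on the output with a rate estimate from the proxy, which is what makes the whole chain end-to-end trainable.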
Poster
Shuoyan Wei · Feng Li · Shengeng Tang · Yao Zhao · Huihui Bai
[ ExHall D ]
Abstract
Continuous space-time video super-resolution (C-STVSR) endeavors to upscale videos simultaneously at arbitrary spatial and temporal scales, and has recently garnered increasing interest. However, prevailing methods struggle to yield satisfactory videos at out-of-distribution spatial and temporal scales. On the other hand, event streams, characterized by high temporal resolution and high dynamic range, exhibit compelling promise in vision tasks. This paper presents EvEnhancer, an innovative approach that marries the unique advantages of event streams to elevate effectiveness, efficiency, and generalizability for C-STVSR. Our approach hinges on two pivotal components: 1) Event-adapted synthesis capitalizes on the spatiotemporal correlations between frames and events to discern and learn long-term motion trajectories, enabling the adaptive interpolation and fusion of informative spatiotemporal features; 2) Local implicit video transformer integrates a local implicit video neural function with cross-scale spatiotemporal attention to learn continuous video representations, which are used to generate plausible videos at arbitrary resolutions and frame rates. Experiments show that EvEnhancer achieves superior performance on synthetic and real-world datasets and better generalizability to out-of-distribution scales than state-of-the-art methods.
Poster
Huimin Zeng · Jiacheng Li · Zhiwei Xiong
[ ExHall D ]
Abstract
As a widely adopted technique in data transmission, video compression effectively reduces file sizes, making real-time cloud computing possible. However, it comes at the cost of visual quality, posing challenges to the robustness of downstream vision models. In this work, we present a versatile codec-aware enhancement framework that reuses codec information to adaptively enhance videos under different compression settings, assisting various downstream vision tasks without introducing a computational bottleneck. Specifically, the proposed codec-aware framework consists of a compression-aware adaptation (CAA) network that employs a hierarchical adaptation mechanism to estimate the parameters of the frame-wise enhancement network, namely the bitstream-aware enhancement (BAE) network. The BAE network further leverages temporal and spatial priors embedded in the bitstream to effectively improve the quality of compressed input frames. Extensive experimental results demonstrate the superior quality enhancement performance of our framework over existing enhancement methods, as well as its versatility in assisting multiple downstream tasks on compressed videos as a plug-and-play module.
Poster
Zongjian Li · Bin Lin · Yang Ye · Liuhan Chen · Xinhua Cheng · Shenghai Yuan · Li Yuan
[ ExHall D ]
Abstract
Video Variational Autoencoders (VAEs) encode videos into a low-dimensional latent space, becoming a key component of most Latent Video Diffusion Models (LVDMs) to reduce model training costs. However, as the resolution and duration of generated videos increase, the encoding cost of video VAEs becomes a limiting bottleneck in training LVDMs. Moreover, the block-wise inference method adopted by most LVDMs can lead to discontinuities in the latent space when processing long-duration videos. The key to addressing the computational bottleneck lies in decomposing videos into distinct components and efficiently encoding the critical information. The wavelet transform can decompose videos into multiple frequency-domain components and improve efficiency significantly; we thus propose Wavelet Flow VAE (WF-VAE), an autoencoder that leverages multi-level wavelet transforms to facilitate the flow of low-frequency energy into the latent representation. Furthermore, we introduce a method called Causal Cache, which maintains the integrity of the latent space during block-wise inference. Compared to state-of-the-art video VAEs, WF-VAE demonstrates superior performance in both PSNR and LPIPS metrics, achieving 2× higher throughput and 4× lower memory consumption while maintaining competitive reconstruction quality. Our code and models will be released to inspire further research.
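For intuition, even a single-level 2D Haar transform separates a frame into one low-frequency band (where most of the energy concentrates) and three high-frequency bands; WF-VAE applies multi-level wavelet transforms over the whole video, so the snippet below is only a simplified sketch of that decomposition.

```python
import torch

def haar_dwt2d(x):
    """Single-level 2D Haar wavelet transform on a batch of frames.

    x : (B, C, H, W) tensor with even H and W.
    Returns the low-frequency band and the three high-frequency bands,
    each of spatial size H/2 x W/2.
    """
    a = x[..., 0::2, 0::2]
    b = x[..., 0::2, 1::2]
    c = x[..., 1::2, 0::2]
    d = x[..., 1::2, 1::2]
    low = (a + b + c + d) / 2.0   # LL: low-frequency content, most of the energy
    lh  = (a - b + c - d) / 2.0   # horizontal detail
    hl  = (a + b - c - d) / 2.0   # vertical detail
    hh  = (a - b - c + d) / 2.0   # diagonal detail
    return low, (lh, hl, hh)

if __name__ == "__main__":
    low, (lh, hl, hh) = haar_dwt2d(torch.randn(1, 3, 64, 64))
    print(low.shape)  # torch.Size([1, 3, 32, 32])
```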
Poster
Zhuoling Li · Hossein Rahmani · Qiuhong Ke · Jun Liu
[ ExHall D ]
Abstract
Video diffusion models have recently achieved remarkable results in video generation. Despite their encouraging performance, most of these models are mainly designed and trained for short video generation, leading to challenges in maintaining temporal consistency and visual details in long video generation. In this paper, through theoretical analysis of the mechanisms behind video generation, we identify two key challenges that hinder short-to-long generalization, namely, temporal position ambiguity and information dilution. To address these challenges, we propose LongDiff, a novel training-free method that unlocks the potential of the off-the-shelf video diffusion models to achieve high-quality long video generation in one go. Extensive experiments demonstrate the efficacy of our method.
Poster
Shian Du · Menghan Xia · Chang Liu · Xintao Wang · Jing Wang · Pengfei Wan · Di ZHANG · Xiangyang Ji
[ ExHall D ]
Abstract
Pre-trained video generation models hold great potential for generative video super-resolution (VSR). However, adapting them for full-size VSR, as most existing methods do, suffers from unnecessarily intensive full-attention computation and a fixed output resolution. To overcome these limitations, we make the first exploration into utilizing video diffusion priors for patch-wise VSR. This is non-trivial because pre-trained video diffusion models are not natively suited to patch-level detail generation. To mitigate this challenge, we propose an innovative approach, called PatchVSR, which integrates a dual-stream adapter for conditional guidance. The patch branch extracts features from input patches to maintain content fidelity, while the global branch extracts context features from the resized full video to bridge the generation gap caused by the incomplete semantics of patches. Particularly, we also inject the patch's location information into the model to better contextualize patch synthesis within the global video frame. Experiments demonstrate that our method can synthesize high-fidelity, high-resolution details at the patch level. A tailor-made multi-patch joint modulation is proposed to ensure visual consistency across individually enhanced patches. Due to the flexibility of our patch-based paradigm, we can achieve highly competitive 4K VSR based on a 512×512 resolution base model, with extremely high efficiency.
Poster
Henrique Morimitsu · Xiaobin Zhu · Roberto M. Cesar Jr · Xiangyang Ji · Xu-Cheng Yin
[ ExHall D ]
Abstract
Optical flow estimation is essential for video processing tasks, such as restoration and action recognition. The quality of videos is constantly increasing, with current standards reaching 8K (7680 x 4320) resolution. However, optical flow methods are usually designed for low resolution and do not generalize to large inputs due to their rigid architectures. They adopt downscaling or input tiling to reduce the input size, causing a loss of details and global information. There is also a lack of optical flow benchmarks to judge the actual performance of existing methods on high-resolution samples. Previous works only conducted qualitative high-resolution evaluations on hand-picked samples. This paper fills this gap in optical flow estimation in two ways. We propose DPFlow, an adaptive optical flow architecture capable of generalizing up to 8K resolution inputs while trained with only low-resolution samples. We also introduce Kubric-NK, a new benchmark for evaluating optical flow methods with input resolutions ranging from 1K to 8K. Our high-resolution evaluation pushes the boundaries of existing methods and reveals new insights about their generalization capabilities. Extensive experimental results show that DPFlow achieves state-of-the-art results on the MPI-Sintel, KITTI 2015, Spring, and other high-resolution benchmarks. The code and dataset have been submitted as …
Poster
Lianxin Xie · csbingbing zheng · Si Wu · Hau San Wong
[ ExHall D ]
Abstract
Blind Face Video Restoration (BFVR) focuses on reconstructing high-quality facial image sequences from degraded video inputs. The main challenge is to address unknown degradations while maintaining temporal consistency across frames. Current blind face restoration methods are primarily designed for images, and directly applying these approaches to BFVR leads to a significant drop in restoration performance. In this work, we propose Dynamic Content Prediction with Motion-aware Priors, referred to as DCP-MP. We develop a motion-aware semantic dictionary by encoding the semantic information of high-quality videos into discrete elements and capturing motion information in terms of element relationships, which are derived from the dynamic temporal changes within videos. To utilize the dictionary to represent the degraded video, we train a temporal-aware element predictor, conditioned on degraded content, to learn to predict the discrete elements in the dictionary. The predicted elements are then refined, conditioned on the motion information captured by the motion-aware semantic dictionary, to enhance temporal coherence. To alleviate deviation from the original structure information, we propose a conditional structure feature correction module that corrects the features flowing from the encoder to the generator. Through extensive experiments, we validate the effectiveness of our design components and demonstrate the superior performance of …
Poster
Haoyan Gong · Zhenrong Zhang · Yuzheng Feng · Anh Nguyen · Hongbin Liu
[ ExHall D ]
Abstract
License plate (LP) recognition is crucial in intelligent traffic management systems. However, factors such as long distances and poor camera quality often lead to severe degradation of captured LP images, posing challenges to accurate recognition. The design of License Plate Image Restoration (LPIR) methods frequently relies on synthetic degraded data, which limits their effectiveness on real-world severely degraded LP images. To address this issue, we introduce the first paired LPIR dataset collected in real-world scenarios, named MDLP, including 10,245 pairs of multi-frame severely degraded LP images and their corresponding clear images. To better restore severely degraded LP, we propose a novel Diffusion-based network, called LP-Diff, to tackle real-world LPIR tasks. Our approach incorporates (1) an Inter-frame Cross Attention Module to fuse temporal information across multiple frames, (2) a Texture Enhancement Module to restore texture information in degraded images, and (3) a Dual-Pathway Fusion Module to select effective features from both channel and spatial dimensions. Extensive experiments demonstrate the reliability of our dataset for model training and evaluation. Our proposed LP-Diff consistently outperforms other state-of-the-art image restoration methods on real-world LPIR tasks. Our dataset and code will be released after the paper is accepted to facilitate reproducibility and future research.
Poster
Kenghong Lin · Baoquan Zhang · Demin Yu · Wenzhi Feng · Shidong Chen · Feifan Gao · Xutao Li · Yunming Ye
[ ExHall D ]
Abstract
Precipitation nowcasting involves using current radar observation sequences to predict future radar sequences and determine the future precipitation distribution, which is crucial for disaster warning, traffic planning, and agricultural production. Despite numerous advancements, challenges persist in accurately predicting both the location and intensity of precipitation, as these factors are often interdependent, with complex atmospheric dynamics and moisture distribution causing position and intensity changes to be intricately coupled. Inspired by the fact that, in the frequency domain, phase variations correspond to changes in the position of precipitation while amplitude variations are linked to intensity changes, we propose an amplitude-phase disentanglement model called AlphaPre, which separately learns the position and intensity changes of precipitation. AlphaPre comprises three key components: a phase network, an amplitude network, and an AlphaMixer. The phase network captures positional changes by learning phase variations, while the amplitude network models intensity changes by alternating between the frequency and spatial domains. The AlphaMixer then integrates these components to produce a refined precipitation forecast. Extensive experiments on four datasets demonstrate the effectiveness and superiority of our method over state-of-the-art approaches.
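The underlying frequency-domain observation can be made concrete with a few lines of FFT code: splitting a radar field into amplitude and phase lets position (phase) and intensity (amplitude) be handled by separate branches before recombination. The tensor shape and the per-frame 2D FFT are assumptions made for illustration.

```python
import torch

def amplitude_phase_split(radar_frames):
    """Split radar observations into amplitude and phase spectra.

    radar_frames : (B, T, H, W) real-valued radar sequences.
    """
    spec = torch.fft.fft2(radar_frames)
    return spec.abs(), spec.angle()

def recombine(amplitude, phase):
    """Recombine (possibly predicted) amplitude and phase into a radar field."""
    spec = torch.polar(amplitude, phase)   # amplitude * exp(i * phase)
    return torch.fft.ifft2(spec).real
```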
Poster
Yi Liu · Wengen Li · Jihong Guan · Shuigeng Zhou · Yichao Zhang
[ ExHall D ]
Abstract
Cloud removal (CR) remains a challenging task in remote sensing image processing. Though diffusion models (DMs) have achieved promising progress in image generation, their applications to CR are suboptimal, as they employ the vanilla DMs that generate cloudless images from pure noise, ignoring the valuable information in cloudy images. To overcome this drawback, we develop a new CR method EMRDM based on mean-reverting diffusion models (MRDMs) to establish a direct diffusion process between cloudy and cloudless images. Compared to current MRDMs, EMRDM offers a well-elucidated design space with a reformulated forward process and a new ordinary differential equation (ODE)-based backward process. We redesign key MRDM modules to boost CR performance, focusing on restructuring the denoiser and redesigning the training process via preconditioning techniques. We also introduce novel deterministic and stochastic samplers. Additionally, to support the multi-temporal CR task, we develop a denoising network for simultaneously denoising sequential images. We evaluate EMRDM on both mono-temporal and multi-temporal CR tasks. Extensive experiments on various datasets show that EMRDM achieves the state-of-the-art (SOTA) performance.
Poster
Jian Zhu · He Wang · Yang Xu · Zebin Wu · Zhihui Wei
[ ExHall D ]
Abstract
Hyperspectral and multispectral image (HSI-MSI) fusion involves combining a low-resolution hyperspectral image (LR-HSI) with a high-resolution multispectral image (HR-MSI) to generate a high-resolution hyperspectral image (HR-HSI). Most deep learning-based methods for HSI-MSI fusion rely on large amounts of hyperspectral data for supervised training, which is often scarce in practical applications. In this paper, we propose a self-learning Adaptive Residual Guided Subspace Diffusion Model (ARGS-Diff), which only utilizes the observed images without any extra training data. Specifically, as the LR-HSI contains spectral information and the HR-MSI contains spatial information, we design two lightweight spectral and spatial diffusion models to separately learn the spectral and spatial distributions from them. Then, we use these two models to reconstruct HR-HSI from two low-dimensional components, i.e., the spectral basis and the reduced coefficient, during the reverse diffusion process. Furthermore, we introduce an Adaptive Residual Guided Module (ARGM), which refines the two components through a residual guided function at each sampling step, thereby stabilizing the sampling process. Extensive experimental results demonstrate that ARGS-Diff outperforms existing state-of-the-art methods in terms of both performance and computational efficiency in the field of HSI-MSI fusion.
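The subspace idea can be summarized as factorizing the target HR-HSI into a spectral basis (informed by the LR-HSI) and a reduced coefficient image (informed by the HR-MSI); the toy shapes below are purely illustrative, not the paper's configuration.

```python
import numpy as np

# Illustrative subspace factorisation: an HR-HSI with `bands` channels is
# represented as a low-dimensional spectral basis times reduced coefficients.
bands, k, H, W = 100, 8, 64, 64
spectral_basis = np.random.randn(bands, k)       # would be estimated from the LR-HSI
coefficients   = np.random.randn(k, H * W)       # would be estimated from the HR-MSI
hr_hsi = (spectral_basis @ coefficients).reshape(bands, H, W)
```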
Poster
Xueyang Wang · Zhixin Zheng · Jiandong Shao · Yule Duan · Liang-Jian Deng
[ ExHall D ]
Abstract
Recent advancements in convolutional neural network (CNN)-based techniques for remote sensing pansharpening have markedly enhanced image quality. However, conventional convolutional modules in these methods have two critical drawbacks. First, the sampling positions in convolution operations are confined to a fixed square window. Second, the number of sampling points is preset and remains unchanged. Given the diverse object sizes in remote sensing images, these rigid parameters lead to suboptimal feature extraction. To overcome these limitations, we introduce an innovative convolutional module, Adaptive Rectangular Convolution (ARConv). ARConv adaptively learns both the height and width of the convolutional kernel and dynamically adjusts the number of sampling points based on the learned scale. This approach enables ARConv to effectively capture scale-specific features of various objects within an image, optimizing kernel sizes and sampling locations. Additionally, we propose ARNet, a network architecture in which ARConv is the primary convolutional module. Extensive evaluations across multiple datasets reveal the superiority of our method in enhancing pansharpening performance over previous techniques. Ablation studies and visualization further confirm the efficacy of ARConv. The source code will be available at github.
Poster
Guanyao Wu · Haoyu Liu · Hongming Fu · Yichuan Peng · Jinyuan Liu · Xin Fan · Risheng Liu
[ ExHall D ]
Abstract
Multi-modality image fusion, particularly infrared and visible image fusion, plays a crucial role in integrating diverse modalities to enhance scene understanding. Early research primarily focused on visual quality, yet challenges remain in preserving fine details, making it difficult to adapt to subsequent tasks. Recent approaches have shifted towards task-specific design, but struggle to achieve "The Best of Both Worlds" due to inconsistent optimization goals. To address these issues, we propose a novel method that leverages the semantic knowledge from the Segment Anything Model (SAM) to Grow the quality of fusion results and Establish downstream task adaptability, namely SAGE. Specifically, we design a Semantic Persistent Attention (SPA) Module that efficiently maintains source information via the persistent repository while extracting high-level semantic priors from SAM. More importantly, to eliminate the impractical dependence on SAM during inference, we introduce a bi-level optimization-driven distillation mechanism with triplet losses, which allow the student network to effectively extract knowledge at the feature, pixel, and contrastive semantic levels, thereby removing reliance on the cumbersome SAM model. Extensive experiments show that our method achieves a balance between high-quality visual results and downstream task adaptability while maintaining practical deployment efficiency.
Poster
Donggoo Jung · DAEHYUN KIM · Guanghui Wang · Tae Hyun Kim
[ ExHall D ]
Abstract
Image exposure correction enhances images captured under diverse real-world conditions by addressing issues of under- and over-exposure, which can result in the loss of critical details and hinder content recognition. While significant advancements have been made, current methods often fail to achieve optimal feature learning for effective correction. To overcome these challenges, we propose Exposure-slot, a novel framework that integrates a prompt-based slot-in-slot attention mechanism to cluster exposed feature regions and learn exposure-centric features for each cluster. By extending the Slot Attention algorithm with a hierarchical structure, our approach progressively clusters features, enabling precise and region-aware correction. In particular, learnable prompts tailored to the exposure characteristics of slots further enhance feature quality, adapting dynamically to varying conditions. Our method delivers superior performance on benchmark datasets, surpassing the current state-of-the-art with a PSNR improvement of over 1.85 dB on the SICE dataset and 0.4 dB on the LCDP dataset, thereby establishing a new benchmark for multi-exposure correction. The source code will be available upon publication.
Poster
Xin Liu · Jie Liu · Jie Tang · Gangshan Wu
[ ExHall D ]
Abstract
Transformer-based methods have demonstrated impressive performance in low-level visual tasks such as Image Super-Resolution (SR). However, their computational complexity grows quadratically with the spatial resolution. A series of works attempt to alleviate this problem by dividing low-resolution images into local windows, axial stripes, or dilated windows. SR typically leverages the redundancy of images for reconstruction, and this redundancy appears not only in local regions but also in long-range regions. However, these methods limit attention computation to content-agnostic local regions, directly limiting the ability of attention to capture long-range dependencies. To address these issues, we propose a lightweight Content-Aware Token Aggregation Network (CATANet). Specifically, we propose an efficient Content-Aware Token Aggregation module for aggregating long-range content-similar tokens, which shares token centers across all image tokens and updates them only during the training phase. Then we utilize intra-group self-attention to enable long-range information interaction. Moreover, we design an inter-group cross-attention to further enhance global information interaction. The experimental results show that, compared with the state-of-the-art cluster-based method SPIN, our method achieves superior performance, with a maximum PSNR improvement of 0.33 dB and nearly double the inference speed.
Poster
Yubin Gu · Yuan Meng · Jiayi Ji · Xiaoshuai Sun
[ ExHall D ]
Abstract
Image restoration (IR), a cornerstone of computer vision, has embarked on a new epoch with the advent of deep learning technologies. Recently, numerous CNN- and Transformer-based methods have been developed, yet they frequently encounter limitations in global receptive fields and computational efficiency. To mitigate these challenges, recent studies have employed the Selective State Space Model (Mamba), which embodies both attributes. However, due to Mamba's inherent one-dimensional scanning limitations, some approaches have introduced multi-directional scanning to bolster inter-sequence correlations. Despite these enhancements, these methods still struggle with managing local pixel correlations across various directions. Moreover, the recursive computation in Mamba's SSM leads to reduced efficiency. To resolve these issues, we exploit the mathematical congruence between linear attention and the SSM within Mamba to propose a novel model, ACL, based on a new design structure. This model replaces the SSM in Mamba with linear attention blocks, which serve as the core component of the encoders/decoders, aiming to preserve a global perspective while boosting computational efficiency. Furthermore, we have designed a simple yet robust local enhancement module with multi-scale dilated convolutions to extract coarse and fine features and improve local detail recovery. Experimental results confirm that our ACL model excels in classical IR …
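The congruence being exploited is that linear attention, like an SSM, can be evaluated with cost linear in sequence length. A minimal kernelized linear-attention sketch is shown below; the elu+1 feature map and the head-wise shapes are assumptions for illustration, not the ACL design.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Kernelised linear attention with O(N) complexity.

    q, k, v : (B, heads, N, d) tensors.  Computing K^T V once, instead of the
    full N x N attention map, is what removes the quadratic cost.
    """
    q = F.elu(q) + 1.0
    k = F.elu(k) + 1.0
    kv = torch.einsum("bhnd,bhne->bhde", k, v)                    # summary of keys and values
    z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + eps)
    return torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)
```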
Poster
Tong Li · Lizhi Wang · Zhiyuan Xu · Lin Zhu · Wanxuan Lu · Hua Huang
[ ExHall D ]
Abstract
Image denoising enhances image quality, serving as a foundational technique across various computational photography applications. The difficulty of acquiring clean images in real scenarios necessitates the development of self-supervised image denoising methods that depend only on noisy images, especially a single noisy image. Existing self-supervised image denoising paradigms (Noise2Noise and Noise2Void) rely heavily on information-lossy operations, such as downsampling and masking, culminating in low-quality denoising performance. In this paper, we propose a novel self-supervised single image denoising paradigm, Positive2Negative, to break the information-lossy barrier. Our paradigm involves two key steps: Renoised Data Construction (RDC) and Denoised Consistency Supervision (DCS). RDC renoises the predicted denoised image with the predicted noise to construct multiple noisy images, preserving all the information of the original image. DCS ensures consistency across the multiple denoised images, supervising the network to learn robust denoising. Our Positive2Negative paradigm achieves state-of-the-art performance in self-supervised single image denoising with significant speed improvements. The code is released to the public at https://anonymous.4open.science/r/P2N-4C8E.
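A hedged sketch of the two steps is given below, assuming purely additive noise and a random sign flip as the renoising transform; the paper's exact construction of the renoised images may differ.

```python
import torch

def positive2negative_loss(denoiser, noisy, num_renoise=2):
    """Renoised Data Construction + Denoised Consistency Supervision (sketch).

    The predicted clean image and predicted noise are recombined into new noisy
    images (here by randomly flipping the noise sign, an assumption), and every
    renoised copy is required to denoise to the same result.
    """
    denoised = denoiser(noisy)
    pred_noise = noisy - denoised            # additive-noise assumption
    loss = 0.0
    for _ in range(num_renoise):
        sign = 1.0 if torch.rand(()) < 0.5 else -1.0
        renoised = (denoised + sign * pred_noise).detach()   # keeps all image content
        loss = loss + (denoiser(renoised) - denoised).pow(2).mean()
    return loss / num_renoise
```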
Poster
Chen Zhao · Zhizhou Chen · Yunzhe Xu · Enxuan Gu · Jian Li · Zili Yi · qian Wang · Jian Yang · Ying Tai
[ ExHall D ]
Abstract
Ultra-high-definition (UHD) image restoration faces significant challenges due to its high resolution, complex content, and intricate details. To cope with these challenges, we analyze the restoration process in depth through a progressive spectral perspective, and deconstruct the complex UHD restoration problem into three progressive stages: zero-frequency enhancement, low-frequency restoration, and high-frequency refinement. Building on this insight, we propose a novel framework, ERR, which comprises three collaborative sub-networks: the zero-frequency enhancer (ZFE), the low-frequency restorer (LFR), and the high-frequency refiner (HFR). Specifically, the ZFE integrates global priors to learn global mapping, while the LFR restores low-frequency information, emphasizing reconstruction of coarse-grained content. Finally, the HFR employs our designed frequency-windowed Kolmogorov-Arnold Networks (FW-KAN) to refine textures and details, producing high-quality image restoration. Our approach significantly outperforms previous UHD methods across various tasks, with extensive ablation studies validating the effectiveness of each component.
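To make the progressive spectral perspective concrete, the sketch below splits an image into its zero-frequency (DC) term, a low-frequency band, and a high-frequency residual using an FFT mask; the radius threshold is an arbitrary illustrative choice and not the decomposition ERR actually learns.

```python
import torch

def split_frequency_bands(img, low_radius=0.1):
    """Decompose an image into DC, low-frequency, and high-frequency parts.

    img : (B, C, H, W).  The three returned tensors sum back to `img`.
    """
    B, C, H, W = img.shape
    spec = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    yy, xx = torch.meshgrid(
        torch.linspace(-0.5, 0.5, H), torch.linspace(-0.5, 0.5, W), indexing="ij"
    )
    radius = torch.sqrt(xx ** 2 + yy ** 2).to(img.device)
    low_mask = (radius <= low_radius).float()
    zero_freq = img.mean(dim=(-2, -1), keepdim=True)          # DC component per channel
    low = torch.fft.ifft2(torch.fft.ifftshift(spec * low_mask, dim=(-2, -1))).real
    high = img - low                                           # high-frequency residual
    return zero_freq, low - zero_freq, high
```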
Poster
Muhammad Jamal Jamal · Omid Mohareri
[ ExHall D ]
Abstract
In this paper, we propose a new progressive pre-training method for image understanding tasks which leverages RGB-D datasets. The method utilizes Multi-Modal Contrastive Masked Autoencoder and Denoising techniques. Our proposed approach consists of two stages. In the first stage, we pre-train the model using contrastive learning to learn cross-modal representations. In the second stage, we further pre-train the model using masked autoencoding and denoising/noise prediction as used in diffusion models. Masked autoencoding focuses on reconstructing the missing patches in the input modality using local spatial correlations, while denoising learns high-frequency components of the input data. Moreover, it incorporates global distillation in the second stage by leveraging the knowledge acquired in stage one. Our approach is scalable, robust, and suitable for pre-training RGB-D datasets. Extensive experiments on multiple datasets such as ScanNet, NYUv2 and SUN RGB-D show the efficacy and superior performance of our approach. Specifically, we show an improvement of +1.3% mIoU against Mask3D on ScanNet semantic segmentation. We further demonstrate the effectiveness of our approach in the low-data regime by evaluating it for the semantic segmentation task against the state-of-the-art methods.
Poster
MinKyu Lee · Sangeek Hyun · Woojin Jun · Jae-Pil Heo
[ ExHall D ]
Abstract
This work tackles the fidelity objective in perceptual super-resolution (SR). Specifically, we address the shortcomings of the pixel-level Lp loss (Lpix) in the GAN-based SR framework. Since Lpix is known to have a trade-off relationship against perceptual quality, prior methods often multiply it by a small scale factor or utilize low-pass filters. However, this work shows that these circumventions fail to address the fundamental factor that induces blurring. Accordingly, we focus on two points: 1) precisely discriminating the subcomponent of Lpix that contributes to blurring, and 2) only guiding based on the factor that is free from this trade-off relationship. We show that both can be achieved in a surprisingly simple manner, with an Auto-Encoder (AE) pretrained with Lpix. Accordingly, we propose the Auto-Encoded Supervision for Optimal Penalization loss (LAESOP), a novel loss function that measures distance in the AE space, instead of the raw pixel space. (AE space indicates the space after the decoder, not the bottleneck.) By simply substituting the conventional Lpix with LAESOP, we can provide effective reconstruction guidance without compromising perceptual quality. Designed for simplicity, our method enables easy integration into existing SR frameworks. Experimental results verify the significance of our method in enhancing both fidelity and perceptual quality.
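A minimal sketch of the substitution is shown below, assuming `autoencoder` is the frozen, Lpix-pretrained auto-encoder described above (its decoder output, not its bottleneck); the L1 distance is an illustrative choice.

```python
import torch
import torch.nn.functional as F

def aesop_style_loss(autoencoder, sr, hr):
    """Measure fidelity in the auto-encoder's output space instead of raw pixels.

    `autoencoder` maps an image to its reconstruction and is assumed frozen
    (parameters with requires_grad=False); gradients still reach the SR network
    through autoencoder(sr).
    """
    with torch.no_grad():
        target = autoencoder(hr)             # reference reconstruction of the HR image
    return F.l1_loss(autoencoder(sr), target)
```

The point of the design is that the AE, trained only with the pixel loss, retains the blur-free component of that supervision, so guiding in its output space avoids the fidelity/perception trade-off described above.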
Poster
I-Hsiang (Aaron) Chen · Wei-Ting Chen · Yu-Wei Liu · Yuan-Chun Chiang · Sy-Yen Kuo · Ming-Hsuan Yang
[ ExHall D ]
Abstract
Image restoration aims to recover content from inputs degraded by various factors, such as adverse weather, blur, and noise. Perceptual Image Restoration (PIR) methods improve visual quality but often do not support downstream tasks effectively. On the other hand, Task-oriented Image Restoration (TIR) methods focus on enhancing image utility for high-level vision tasks, sometimes compromising visual quality. This paper introduces UniRestore, a unified image restoration model that bridges the gap between PIR and TIR by using a diffusion prior. The diffusion prior is designed to generate images that align with human visual quality preferences, but these images are often unsuitable for TIR scenarios. To address this limitation, UniRestore utilizes encoder features from an autoencoder to adapt the diffusion prior to specific tasks. We propose a Complementary Feature Restoration Module (CFRM) to reconstruct degraded encoder features and a Task Feature Adapter (TFA) module to facilitate adaptive feature fusion in the decoder. This design allows UniRestore to optimize images for both human perception and downstream task requirements, addressing discrepancies between visual quality and functional needs. Integrating these modules also enhances UniRestore's adaptability and efficiency across diverse tasks. Extensive experiments demonstrate the superior performance of UniRestore in both PIR and TIR scenarios.
Poster
Leheng Zhang · Weiyi You · Kexuan Shi · Shuhang Gu
[ ExHall D ]
Abstract
Diffusion-based image super-resolution methods have demonstrated significant advantages over GAN-based approaches, particularly in terms of perceptual quality. Building upon a lengthy Markov chain, diffusion-based methods possess remarkable modeling capacity, enabling them to achieve outstanding performance in real-world scenarios. Unlike previous methods that focus on modifying the noise schedule or sampling process to enhance performance, our approach emphasizes the improved utilization of LR information. We find that different regions of the LR image can be viewed as corresponding to different timesteps in a diffusion process, where flat areas are closer to the target HR distribution but edge and texture regions are farther away. In these flat areas, applying slight noise is more advantageous for the reconstruction. We associate this characteristic with uncertainty and propose to apply uncertainty estimates to guide region-specific noise level control, a technique we refer to as Uncertainty-guided Noise Weighting. Pixels with lower uncertainty (i.e., flat regions) receive reduced noise to preserve more LR information, therefore improving performance. Furthermore, we modify the network architecture of previous methods to develop our Uncertainty-guided Perturbation Super-Resolution (UPSR) model. Extensive experimental results demonstrate that, despite reduced model size and training overhead, the proposed UPSR method outperforms current state-of-the-art methods across various …
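A hedged sketch of what such region-specific noise control might look like: the per-pixel noise scale is modulated by a normalised uncertainty map so that flat, low-uncertainty regions are perturbed less. The uncertainty source, the scheduler interface, and the linear modulation with a floor `alpha` are assumptions, not the paper's exact formulation.

```python
import torch

def uncertainty_weighted_noising(lr_up, uncertainty, t, scheduler_sigma, alpha=0.5):
    """Hypothetical region-wise noise control: flat regions (low uncertainty)
    receive weaker perturbation so more LR information is preserved."""
    # Normalise the per-pixel uncertainty map to [0, 1].
    u = (uncertainty - uncertainty.min()) / (uncertainty.max() - uncertainty.min() + 1e-8)
    # Per-pixel noise scale: the schedule's base sigma modulated by uncertainty.
    sigma = scheduler_sigma(t) * (alpha + (1.0 - alpha) * u)
    noise = torch.randn_like(lr_up)
    return lr_up + sigma * noise
```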
Poster
Wenhao Shen · Mingliang Zhou · Yu Chen · Xuekai WEI · Yong Feng · Huayan Pu · Weijia Jia
[ ExHall D ]
Abstract
Existing full-reference image quality assessment (FR-IQA) methods often fail to capture the complex causal mechanisms that underlie human perceptual responses to image distortions, limiting their ability to generalize across diverse scenarios. In this paper, we propose an FR-IQA method based on abductive counterfactual inference to investigate the causal relationships between deep network features and perceptual distortions. First, we explore the causal effects of deep features on perception and integrate causal reasoning with feature comparison, constructing a model that effectively handles complex distortion types across different IQA scenarios. Second, the analysis of the perceptual causal correlations of our proposed method is independent of the backbone architecture and thus can be applied to a variety of deep networks. Through abductive counterfactual experiments, we validate the proposed causal relationships, confirming the model's superior perceptual relevance and interpretability of quality scores. The experimental results demonstrate the robustness and effectiveness of the method, providing competitive quality predictions across multiple benchmarks. The source code is available at https://anonymous.4open.science/r/DeepCausalQuality-25BC.
Poster
Chen Liao · Yan Shen · Dan Li · Zhongli Wang
[ ExHall D ]
Abstract
Recently, Deep Unfolding Networks (DUNs) have achieved impressive reconstruction quality in the field of image Compressive Sensing (CS) by unfolding iterative optimization algorithms into neural networks. The reconstruction quality of DUNs depends on the learned prior knowledge, so introducing stronger prior knowledge can further improve reconstruction quality. On the other hand, pre-trained diffusion models contain powerful prior knowledge and have a solid theoretical foundation and strong scalability, but they require a large number of iterative steps to achieve reconstruction. In this paper, we propose to use the powerful prior knowledge of a pre-trained diffusion model in DUNs to achieve high-quality reconstruction with fewer steps for image CS. Specifically, we first design an iterative optimization algorithm named Diffusion Message Passing (DMP), which embeds a pre-trained diffusion model into each iteration of DMP. Then, we deeply unfold the DMP algorithm into a neural network named DMP-DUN. The proposed DMP-DUN can use lightweight neural networks to achieve mapping from measurement data to the intermediate steps of the reverse diffusion process and directly approximate the divergence of the diffusion model, thereby further improving reconstruction efficiency. Extensive experiments show that our proposed DMP-DUN achieves state-of-the-art performance and requires as few as 2 steps to reconstruct …
Poster
Zhiyuan Chen · Keyi Li · Yifan Jia · Le Ye · Yufei Ma
[ ExHall D ]
Abstract
Diffusion transformer (DiT) models have achieved remarkable success in image generation, thanks to their exceptional generative capabilities and scalability. Nonetheless, the iterative nature of diffusion models (DMs) results in high computation complexity, posing challenges for deployment. Although existing cache-based acceleration methods try to utilize the inherent temporal similarity to skip redundant computations of DiT, the lack of correction may induce potential quality degradation. In this paper, we propose increment-calibrated caching, a training-free method for DiT acceleration, where the calibration parameters are generated from the pre-trained model itself with low-rank approximation. To deal with the possible correction failure arising from outlier activations, we introduce channel-aware Singular Value Decomposition (SVD), which further strengthens the calibration effect. Experimental results show that our method always achieves better performance than existing naive caching methods with a similar computation resource budget. For 35-step DDIM, our method eliminates more than 45% computation and improves IS by 12 at the cost of less than a 0.06 FID increase.
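The sketch below illustrates one way increment-calibrated caching of a linear layer could work: the cached output is reused and corrected with a low-rank term derived from an SVD of the layer weight, optionally scaled per input channel ("channel-aware") before the decomposition. The interfaces, the rank, and the use of activation-magnitude scaling are illustrative assumptions.

```python
import torch

def low_rank_calibrator(weight, rank=8, channel_scale=None):
    """Low-rank approximation of a linear layer's weight via SVD. With a
    per-input-channel scale (e.g. typical activation magnitudes), the SVD
    becomes 'channel-aware': scale the columns, decompose, then undo the scale."""
    W = weight if channel_scale is None else weight * channel_scale[None, :]
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    Ur = U[:, :rank] * S[:rank]           # (out, r)
    Vr = Vh[:rank, :]                     # (r, in)
    if channel_scale is not None:
        Vr = Vr / channel_scale[None, :]  # undo the scaling on the input side
    return Ur, Vr

def calibrated_cached_linear(x, x_cached, y_cached, Ur, Vr):
    """Skip the full linear layer: reuse the cached output and correct it with a
    low-rank increment computed from how much the input has changed."""
    return y_cached + (x - x_cached) @ Vr.T @ Ur.T
```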
Poster
Ping Chen · Xingpeng Zhang · Zhaoxiang Liu · Huan Hu · Xiang Liu · Kai Wang · Min Wang · Yanlin Qian · Shiguo Lian
[ ExHall D ]
Abstract
In this research, we propose a novel denoising diffusion model based on shortest-path modeling that optimizes residual propagation to enhance both denoising efficiency and quality. Drawing on Denoising Diffusion Implicit Models (DDIM) and insights from graph theory, our model, termed the Shortest Path Diffusion Model (ShortDF), treats the denoising process as a shortest-path problem aimed at minimizing reconstruction error. By optimizing the initial residuals, we improve the efficiency of the reverse diffusion process and the quality of the generated samples. Extensive experiments on multiple standard benchmarks demonstrate that ShortDF significantly reduces diffusion time (or steps) while enhancing the visual fidelity of generated samples compared to prior methods. We believe this work paves the way for interactive diffusion-based applications and establishes a foundation for rapid data generation.
Poster
Kendong Liu · Zhiyu Zhu · Hui LIU · Junhui Hou
[ ExHall D ]
Abstract
We present Acc3D to tackle the challenge of accelerating the diffusion process for generating 3D models from single images. To derive accurate reconstruction through few-step inference, we identify the modeling of the score function at the endpoints (states of random noise) as the critical issue. To tackle this issue, we propose edge consistency, i.e., consistent predictions across the low signal-to-noise ratio region, to enhance a pre-trained diffusion model, enabling a distillation-based refinement of the endpoint score function. Building on these distilled diffusion models, we introduce an adversarial augmentation strategy to further enrich generation detail. The two modules complement and mutually reinforce each other, elevating generative performance. Extensive experiments show that our Acc3D not only achieves over a 20× increase in computational efficiency but also yields notable quality improvements, compared with state-of-the-art methods. Project webpage: https://acc3d-object.github.io/
Poster
Fanhu Zeng · Hao Tang · Yihua Shao · Siyu Chen · Ling Shao · Yan Wang
[ ExHall D ]
Abstract
A high-performance image compression algorithm is crucial for real-time information transmission across numerous fields. Despite rapid progress in image compression, computational inefficiency and poor redundancy modeling still pose significant bottlenecks, limiting practical applications. Inspired by the effectiveness of state space models (SSMs) in capturing long-range dependencies, we leverage SSMs to address computational inefficiency in existing methods and improve image compression from multiple perspectives. In this paper, we systematically analyze the advantages of SSMs for better integration and propose an enhanced image compression approach through refined context modeling, which we term MambaIC. Specifically, we explore context modeling to adaptively refine the representation of hidden states. Additionally, we introduce window-based local attention into channel-spatial entropy modeling to reduce potential spatial redundancy during compression, thereby increasing efficiency. Comprehensive qualitative and quantitative results validate the effectiveness and efficiency of our approach, particularly for high-resolution image compression. Code will be made publicly available.
Poster
Jinchang Xu · Shaokang Wang · Jintao Chen · Zhe Li · Peidong Jia · Fei Zhao · Guoqing Xiang · Zhijian Hao · Shanghang Zhang · Xiaodong Xie
[ ExHall D ]
Abstract
Leveraging the generative power of diffusion models, generative image compression has achieved impressive perceptual fidelity even at extremely low bitrates. However, current methods often neglect the non-uniform complexity of images, limiting their ability to balance global perceptual quality with local texture consistency and to allocate coding resources efficiently. To address this, we introduce the Map-guided Masking Realism Image Diffusion Codec (MRIDC), designed to optimize the trade-off between local distortion and global perceptual quality in extremely low bitrate compression. MRIDC integrates a vector-quantized image encoder with a diffusion-based decoder. On the encoding side, we propose a Map-guided Latent Masking (MLM) module, which selectively masks elements in the latent space based on prior information, allowing adaptive resource allocation aligned with image complexity. On the decoding side, masked latents are completed using the Bidirectional Prediction Controllable Generation (BPCG) module, which guides the constrained generation process within the diffusion model to reconstruct the image. Experimental results show that MRIDC achieves state-of-the-art perceptual compression quality at extremely low bitrates, effectively preserving feature consistency in key regions and advancing the rate-distortion-perception performance curve, establishing new benchmarks in balancing compression efficiency with visual fidelity.
Poster
Emiel Hoogeboom · Thomas Mensink · Jonathan Heek · Kay Lamerigts · Ruiqi Gao · Tim Salimans
[ ExHall D ]
Abstract
Latent diffusion models have become the popular choice for scaling up diffusion models for high resolution image synthesis. Compared to pixel-space models that are trained end-to-end, latent models are perceived to be more efficient and to produce higher image quality at high resolution. Here we challenge these notions, and show that pixel-space models can in fact be very competitive to latent approaches both in quality and efficiency, achieving 1.5 FID on ImageNet512 and new SOTA results on ImageNet128, ImageNet256 and Kinetics600. We present a simple recipe for scaling end-to-end pixel-space diffusion models to high resolutions. 1: Use the sigmoid loss (Kingma and Gao, 2023) with our prescribed hyper-parameters. 2: Use our simplified memory-efficient architecture with fewer skip-connections. 3: Scale the model to favor processing the image at high resolution with fewer parameters, rather than using more parameters but at a lower resolution. When combining these three steps with recently proposed tricks like guidance intervals, we obtain a family of pixel-space diffusion models we call Simpler Diffusion (SiD2).
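For readers unfamiliar with the sigmoid loss referenced in step 1, the sketch below shows a generic sigmoid-weighted noise-prediction objective in the spirit of Kingma and Gao (2023): per-sample losses are weighted by sigmoid(bias − logSNR), down-weighting very high-SNR timesteps. The epsilon-prediction parameterization, the model signature, and the bias value are placeholder assumptions, not the paper's prescribed hyper-parameters.

```python
import torch
import torch.nn.functional as F

def sigmoid_weighted_diffusion_loss(model, x0, logsnr_t, bias=-3.0):
    """Sigmoid-weighted noise-prediction loss: w(lambda) = sigmoid(bias - lambda)."""
    # Variance-preserving coefficients: alpha^2 = sigmoid(logsnr), sigma^2 = sigmoid(-logsnr).
    alpha = torch.sigmoid(logsnr_t).sqrt()
    sigma = torch.sigmoid(-logsnr_t).sqrt()
    noise = torch.randn_like(x0)
    x_t = alpha.view(-1, 1, 1, 1) * x0 + sigma.view(-1, 1, 1, 1) * noise  # assumes NCHW images
    pred = model(x_t, logsnr_t)                        # epsilon-prediction network (assumed)
    per_sample = F.mse_loss(pred, noise, reduction="none").mean(dim=(1, 2, 3))
    weight = torch.sigmoid(bias - logsnr_t)            # the sigmoid weighting
    return (weight * per_sample).mean()
```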
Poster
Haoran You · Connelly Barnes · Yuqian Zhou · Yan Kang · Zhenbang Du · Wei Zhou · Lingzhi Zhang · Yotam Nitzan · Xiaoyang Liu · Zhe Lin · Eli Shechtman · Sohrab Amirghodsi · Yingyan (Celine) Lin
[ ExHall D ]
Abstract
Diffusion Transformers (DiTs) have achieved state-of-the-art (SOTA) image generation quality but suffer from high latency and memory inefficiency, making them difficult to deploy on resource-constrained devices. One key efficiency bottleneck is that existing DiTs apply equal computation across all regions of an image. However, not all image tokens are equally important, and certain localized areas require more computation, such as objects. To address this, we propose DiffRatio-MoD, a dynamic DiT inference framework with differentiable compression ratios, which automatically learns to dynamically route computation across layers and timesteps for each image token, resulting in Mixture-of-Depths (MoD) efficient DiT models. Specifically, DiffRatio-MoD integrates three features: (1) A token-level routing scheme where each DiT layer includes a router that is jointly fine-tuned with model weights to predict token importance scores. In this way, unimportant tokens bypass the entire layer's computation; (2) A layer-wise differentiable ratio mechanism where different DiT layers automatically learn varying compression ratios from a zero initialization, resulting in large compression ratios in redundant layers while others remain less compressed or even uncompressed; (3) A timestep-wise differentiable ratio mechanism where each denoising timestep learns its own compression ratio. The resulting pattern shows higher ratios for noisier timesteps and lower ratios as the …
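A minimal sketch of the token-level routing idea from feature (1): a small learned scorer ranks tokens, the top fraction passes through the layer (scaled by its sigmoid score so the router stays differentiable), and the remaining tokens bypass the layer. The fixed keep ratio and all interfaces are illustrative; in the method described above the ratios are themselves learned per layer and per timestep.

```python
import torch
import torch.nn as nn

class TokenRouter(nn.Module):
    """Hypothetical per-layer router: only the top-k most important tokens are
    processed by the layer; the rest pass through unchanged."""
    def __init__(self, dim):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)

    def forward(self, tokens, layer, keep_ratio=0.5):
        B, N, D = tokens.shape
        scores = self.scorer(tokens).squeeze(-1)                 # (B, N) importance scores
        k = max(1, int(N * keep_ratio))
        topk = scores.topk(k, dim=1).indices                     # tokens to process
        idx = topk.unsqueeze(-1).expand(-1, -1, D)
        selected = tokens.gather(1, idx)
        # Scale by the sigmoid score so gradients reach the scorer.
        processed = layer(selected) * torch.sigmoid(scores.gather(1, topk)).unsqueeze(-1)
        out = tokens.clone()                                     # bypassed tokens are identity
        out.scatter_(1, idx, processed)
        return out
```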
Poster
Haipeng Fang · Sheng Tang · Juan Cao · Enshuo Zhang · Fan Tang · Tong-yee Lee
[ ExHall D ]
Abstract
Diffusion transformers have shown exceptional performance in visual generation but are accompanied by high computational costs. Token reduction techniques that compress models by sharing the denoising process among similar tokens have been introduced. However, existing approaches neglect the denoising priors of the diffusion models, leading to suboptimal acceleration and diminished image quality. This study proposes a novel concept: attend to prune feature redundancies in areas not attended to by the diffusion process. We analyze the location and degree of feature redundancies based on the structure-then-detail denoising priors. Subsequently, we introduce SDTM, a structure-then-detail token merging approach that dynamically compresses feature redundancies. Specifically, we design dynamic visual token merging, compression ratio adjusting, and prompt reweighting for different stages. Applied in a post-training manner, the proposed method can be integrated seamlessly into the DiT architecture. Extensive experiments across various backbones, schedulers, and datasets showcase the superiority of our method, which achieves 1.55× acceleration with negligible impact on image quality.
Poster
Longquan Dai · He Wang · Jinhui Tang
[ ExHall D ]
Abstract
In training-free conditional generative tasks, diffusion models utilize differentiable loss functions to steer the generative reverse process, necessitating modifications to sampling algorithms like DDPM and DDIM. However, such adjustments likely reduce flexibility and reliability. In this paper, we propose NoiseCtrl, a sampling-algorithm-agnostic technique for controlled image generation. Essentially, diffusion models generate the denoised result z_{t-1} by adding random noise ε_t to a predicted mean μ_t. NoiseCtrl specifically adjusts the random noise while leaving the underlying sampling algorithms unchanged. At each step t, NoiseCtrl converts the unconditional Gaussian noise into conditional noise ε′_t by substituting the isotropic Gaussian distribution with the von Mises–Fisher distribution. This substitution introduces a directional focus while preserving the randomness required for conditional image generation. Thanks to this non-intrusive design, NoiseCtrl is straightforward to integrate and has been extensively validated through experiments, demonstrating its adaptability for different diffusion algorithms and superior performance across various conditional generation tasks.
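To make the noise substitution concrete, the sketch below draws a noise sample whose direction follows a von Mises–Fisher distribution centred on a (condition-derived) unit direction, while its norm is taken from a standard Gaussian draw. It relies on scipy.stats.vonmises_fisher (available in SciPy ≥ 1.11); how the mean direction and the concentration κ are derived from the user's condition is not specified here and is an assumption.

```python
import numpy as np
from scipy.stats import vonmises_fisher   # requires SciPy >= 1.11

def directional_noise(mu_direction, kappa, shape):
    """Noise whose direction follows a von Mises-Fisher distribution around a
    condition-derived direction, with the norm of a standard Gaussian draw."""
    d = int(np.prod(shape))
    mu = np.asarray(mu_direction, dtype=np.float64).reshape(-1)
    mu = mu / (np.linalg.norm(mu) + 1e-12)                      # vMF mean direction (unit vector)
    direction = np.asarray(vonmises_fisher(mu, kappa).rvs(size=1)).reshape(-1)
    radius = np.linalg.norm(np.random.standard_normal(d))       # chi-distributed norm
    return (radius * direction).reshape(shape)
```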
Poster
Yunpeng Liu · Boxiao Liu · Yi Zhang · Xingzhong Hou · Guanglu Song · Yu Liu · Haihang You
[ ExHall D ]
Abstract
Significant advances have been made in the sampling efficiency of diffusion and flow matching models, driven by Consistency Distillation (CD), which trains a student model to mimic the output of a teacher model at a later timestep. However, we found that the knowledge discrepancy between student and teacher varies significantly across different timesteps, leading to suboptimal performance in CD. To address this issue, we propose the Curriculum Consistency Model (CCM), which stabilizes and balances the knowledge discrepancy across timesteps. Specifically, we regard the distillation process at each timestep as a curriculum and introduce a metric based on the Peak Signal-to-Noise Ratio (PSNR) to quantify the knowledge discrepancy of this curriculum, then ensure that the curriculum maintains consistent knowledge discrepancy across different timesteps by having the teacher model iterate more steps when the noise intensity is low. Our method achieves competitive single-step sampling Fréchet Inception Distance (FID) scores of 1.64 on CIFAR-10 and 2.18 on ImageNet 64x64. Moreover, we have extended our method to large-scale text-to-image models and confirmed that it generalizes well to both diffusion models (Stable Diffusion XL) and flow matching models (Stable Diffusion 3). The generated samples demonstrate improved image-text alignment and semantic structure since CCM enlarges the distillation step at …
Poster
Huiyang Shao · Xin Xia · Yuhong Yang · Ren Yuxi · XING WANG · Xuefeng Xiao
[ ExHall D ]
Abstract
Diffusion models have achieved remarkable success across various domains. However, their slow generation speed remains a critical challenge. Existing acceleration methods, while aiming to reduce steps, often compromise sample quality, controllability, or introduce training complexities. Therefore, we propose RayFlow, a novel diffusion framework that addresses these limitations. Unlike previous methods, RayFlow guides each sample along a unique path towards an instance-specific target distribution. This method maximizes the reduction of sampling steps while preserving generation diversity and stability. Furthermore, we introduce Time Sampler, an importance sampling technique to enhance training efficiency by focusing on crucial timesteps. Extensive experiments demonstrate RayFlow's superiority in generating high-quality images with improved speed, control, and training efficiency compared to existing acceleration techniques.
Poster
Pingyu Wu · Kai Zhu · Yu Liu · Liming Zhao · Wei Zhai · Yang Cao · Zheng-Jun Zha
[ ExHall D ]
Abstract
Variational Autoencoder (VAE) aims to compress pixel data into low-dimensional latent space, playing an important role in OpenAI's Sora and other latent video diffusion generation models. While most existing video VAEs inflate a pre-trained image VAE into the 3D causal structure for temporal-spatial compression, this paper presents two astonishing findings: (1) The initialization from a well-trained image VAE with the same latent dimensions suppresses the improvement of subsequent temporal compression capabilities. (2) The adoption of causal reasoning leads to unequal information interactions and unbalanced performance between frames. To alleviate these problems, we propose a keyframe-based temporal compression (KTC) architecture and a group causal convolution (GCConv) module to further improve video VAE (IV-VAE). Specifically, the KTC architecture divides the latent space into two branches, in which one half completely inherits the compression prior of keyframes from a lower-dimension image VAE while the other half involves temporal compression through 3D group causal convolution, reducing temporal-spatial conflicts and accelerating the convergence speed of video VAE. The GCConv in the above 3D half uses standard convolution within each frame group to ensure inter-frame equivalence, and employs causal logical padding between groups to maintain flexibility in processing variable frame video. Extensive experiments on five benchmarks …
Poster
Maosen Zhao · Pengtao Chen · Chong Yu · Yan Wen · Xudong Tan · Tao Chen
[ ExHall D ]
Abstract
Model quantization reduces the bit-width of weights and activations, improving memory efficiency and inference speed in diffusion models. However, achieving 4-bit quantization remains challenging. Existing methods, primarily based on integer quantization and post-training quantization fine-tuning, struggle with inconsistent performance. Inspired by the success of floating-point (FP) quantization in large language models, we explore low-bit FP quantization for diffusion models and identify key challenges: the failure of signed FP quantization to handle asymmetric activation distributions, the insufficient consideration of temporal complexity in the denoising process during fine-tuning, and the misalignment between fine-tuning loss and quantization error. To address these challenges, we propose the mixup-sign floating-point quantization (MSFP) framework, first introducing unsigned FP quantization in model quantization, along with timestep-aware LoRA (TALoRA) and denoising-factor loss alignment (DFA), which ensure precise and stable fine-tuning. Extensive experiments show that we are the first to achieve superior performance in 4-bit FP quantization for diffusion models, outperforming existing PTQ fine-tuning methods in 4-bit INT quantization. Our code will be publicly available soon.
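As a rough illustration of the unsigned floating-point format mentioned above, the sketch below simulates quantization of non-negative activations with a configurable exponent/mantissa split and no sign bit. The symmetric exponent range, the absence of subnormal/bias handling, and the default bit widths are simplifying assumptions, not the paper's exact format.

```python
import torch

def fake_quant_unsigned_fp(x, exp_bits=3, man_bits=1):
    """Simulated unsigned low-bit floating-point quantization (no sign bit),
    suited to non-negative, asymmetric activation distributions."""
    assert (x >= 0).all(), "unsigned format: inputs must be non-negative"
    x = x.clamp(min=1e-12)
    # Decompose into exponent and mantissa, then round the mantissa to man_bits.
    e = torch.floor(torch.log2(x))
    max_e, min_e = 2 ** (exp_bits - 1) - 1, -(2 ** (exp_bits - 1))
    e = e.clamp(min_e, max_e)
    m = x / 2 ** e                                    # mantissa, nominally in [1, 2)
    step = 2.0 ** (-man_bits)
    m_q = torch.round(m / step) * step
    return (m_q * 2 ** e).clamp(max=(2 - step) * 2 ** max_e)
```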
Poster
Gongfan Fang · Kunjun Li · Xinyin Ma · Xinchao Wang
[ ExHall D ]
Abstract
Diffusion Transformers have demonstrated remarkable capabilities in image generation but often come with excessive parameterization, resulting in considerable inference overhead in real-world applications. In this work, we present TinyFusion, a depth pruning method designed to remove redundant layers from diffusion transformers via end-to-end learning. The core principle of our approach is to create a pruned model with high recoverability, allowing it to regain strong performance after fine-tuning. To accomplish this, we introduce a differentiable sampling technique to make pruning learnable, paired with a co-optimized parameter to simulate future fine-tuning. While prior works focus on minimizing loss or error after pruning, our method explicitly models and optimizes the post-fine-tuning performance of pruned models. Experimental results indicate that this learnable paradigm offers substantial benefits for layer pruning of diffusion transformers, surpassing existing importance-based and error-based methods. Additionally, TinyFusion exhibits strong generalization across diverse architectures, such as DiTs, MARs, and SiTs. Experiments with DiT-XL show that TinyFusion can craft a shallow diffusion transformer at less than 7% of the pre-training cost, achieving a 2× speedup with an FID score of 2.86, outperforming competitors with comparable efficiency.
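The differentiable sampling idea can be pictured with a straight-through Gumbel-Softmax gate per block, as sketched below: each block's keep/drop decision is sampled from learnable logits so the pruning pattern itself receives gradients. This omits the co-optimized recoverability term and fine-tuning simulation described above; all names and the gating scheme are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableDepthMask(nn.Module):
    """Hypothetical differentiable layer selection: each transformer block gets a
    learnable keep/drop gate sampled with straight-through Gumbel-Softmax."""
    def __init__(self, num_layers, tau=1.0):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_layers, 2))  # [drop, keep] logits per block
        self.tau = tau

    def forward(self, x, blocks):
        # Hard one-hot samples in the forward pass, soft gradients in the backward pass.
        gates = F.gumbel_softmax(self.logits, tau=self.tau, hard=True)[:, 1]
        for gate, block in zip(gates, blocks):
            x = gate * block(x) + (1.0 - gate) * x   # a dropped block acts as identity
        return x
```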
Poster
Yuanyang Yin · Yaqi Zhao · Mingwu Zheng · Ke Lin · Jiarong Ou · Rui Chen · Victor Shea-Jay Huang · Jiahao Wang · Xin Tao · Pengfei Wan · Di ZHANG · Baoqun Yin · Wentao Zhang · Kun Gai
[ ExHall D ]
Abstract
Achieving optimal performance of video diffusion transformers within given data and compute budgets is crucial due to their high training costs. This necessitates precisely determining the optimal model size and training hyperparameters before large-scale training. While scaling laws are employed in language models to predict performance, their existence and accurate derivation in visual generation models remain underexplored. In this paper, we systematically analyze scaling laws for video diffusion transformers and confirm their presence. Moreover, we discover that, unlike language models, video diffusion models are more sensitive to learning rate and batch size, two hyperparameters often not precisely modeled. To address this, we propose a new scaling law that predicts optimal hyperparameters for any model size and compute budget. Under these optimal settings, we achieve comparable performance and reduce inference costs by 40.1% compared to conventional scaling methods, within a compute budget of 1e10 TFlops. Furthermore, we establish a more generalized and precise relationship among test loss, any model size, and training budget. This enables performance prediction for non-optimal model sizes, which may also be appealing under practical inference cost constraints, achieving a better trade-off.
Poster
Kaibo Zhao · Liang Bao · Yufei Li · Xu Su · Ke Zhang · Xiaotian Qiao
[ ExHall D ]
Abstract
Image vectorization aims to convert raster images to vector ones, allowing for easy scaling and editing. Existing works mainly rely on preset parameters (i.e., a fixed number of paths and control points), ignoring the complexity of the image and posing significant challenges to practical applications. We demonstrate that such an assumption is often incorrect, as the preset paths or control points may be neither essential nor enough to achieve accurate and editable vectorization results. Based on this key insight, in this paper, we propose an efficient image vectorization method with adaptive parametrization, where the paths and control points can be adjusted dynamically based on the complexity of the input raster image. In particular, we first decompose the input raster image into a set of pure-colored layers that are aligned with human perception. For each layer with varying shape complexity, we propose a novel allocation mechanism to adaptively adjust the control point distribution. We further adopt a differentiable rendering process to compose and optimize the shape and color parameters of each layer iteratively. Extensive experiments demonstrate that our method outperforms the baselines qualitatively and quantitatively, in terms of computational efficiency, vectorization accuracy, and editing flexibility.
Poster
Mohd Hozaifa Khan · Ravi Kiran Sarvadevabhatla
[ ExHall D ]
Abstract
We introduce **Sketchtopia, a large-scale dataset and AI framework designed to explore goal-driven, multimodal communication through asynchronous interactions** in a Pictionary-inspired setup. Sketchtopia captures natural human interactions, including freehand sketches, open-ended guesses, and iconic feedback gestures, showcasing the complex dynamics of cooperative communication under constraints. It features over **20K gameplay sessions from 916 players, capturing 263K sketches, 10K erases, 56K guesses and 19.4K iconic feedbacks**. We introduce **multimodal foundational agents** with capabilities for generative sketching, guess generation and asynchronous communication. Our dataset also includes **800 human-agent sessions** for benchmarking the agents. We introduce **novel metrics** to characterize collaborative success, responsiveness to feedback and inter-agent asynchronous communication. Sketchtopia pushes the boundaries of multimodal AI, establishing **a new benchmark for studying asynchronous, goal-oriented interactions between humans and AI agents**.
Poster
Yihao Meng · Hao Ouyang · Hanlin Wang · Qiuyu Wang · Wen Wang · Ka Leong Cheng · Zhiheng Liu · Yujun Shen · Huamin Qu
[ ExHall D ]
Abstract
The production of 2D animation follows an industry-standard workflow, encompassing four essential stages: character design, keyframe animation, in-betweening, and coloring. Our research focuses on reducing the labor costs in the above process by harnessing the potential of increasingly powerful generative AI. Using video diffusion models as the foundation, Anidoc emerges as a video line art colorization tool, which automatically converts sketch sequences into colored animations following the reference character specification. Our model exploits correspondence matching as an explicit guidance, yielding strong robustness to the variations (e.g., posture) between the reference character and each line art frame. In addition, our model could even automate the in-betweening process, such that users can easily create a temporally consistent animation by simply providing a character image as well as the start and end sketches. We will make the model public to facilitate the community.
Poster
Guy Yariv · Yuval Kirstain · Amit Zohar · Shelly Sheynin · Yaniv Taigman · Yossi Adi · Sagie Benaim · Adam Polyak
[ ExHall D ]
Abstract
We consider the task of Image-to-Video (I2V) generation, which involves transforming static images into realistic video sequences based on a textual description. While recent advancements produce photorealistic outputs, they frequently struggle to create videos with accurate and consistent object motion, especially in multi-object scenarios. To address these limitations, we propose a two-stage compositional framework that decomposes I2V generation into: (i) An explicit intermediate representation generation stage, followed by (ii) A video generation stage that is conditioned on this representation. Our key innovation is the introduction of a mask-based motion trajectory as an intermediate representation, that captures both semantic object information and motion, enabling an expressive but compact representation of motion and semantics. To incorporate the learned representation in the second stage, we utilize object-level attention objectives. Specifically, we consider a spatial, per-object, masked-cross attention objective, integrating object-specific prompts into corresponding latent space regions and a masked spatio-temporal self-attention objective, ensuring frame-to-frame consistency for each object. We evaluate our method on challenging benchmarks with multi-object and high-motion scenarios and empirically demonstrate that the proposed method achieves state-of-the-art results in temporal coherence, motion realism, and text-prompt faithfulness. Additionally, we introduce SA-V-128, a new challenging benchmark for single-object and multi-object I2V generation, and demonstrate …
Poster
Tongtong Su · Chengyu Wang · Bingyan Liu · Jun Huang · Dongming Lu
[ ExHall D ]
Abstract
In recent years, large text-to-video (T2V) synthesis models have garnered considerable attention for their abilities to generate videos from textual descriptions. However, achieving both high imaging quality and effective motion representation remains a significant challenge for these T2V models. Existing approaches often adapt pre-trained text-to-image (T2I) models to refine video frames, leading to issues such as flickering and artifacts due to inconsistencies across frames. In this paper, we introduce EVS, a training-free Encapsulated Video Synthesizer that composes T2I and T2V models to enhance both visual fidelity and motion smoothness of generated videos. Our approach utilizes a well-trained diffusion-based T2I model to refine low-quality video frames by treating them as out-of-distribution samples, effectively optimizing them with noising and denoising steps. Meanwhile, we employ T2V backbones to ensure consistent motion dynamics. By encapsulating the T2V temporal-only prior into the T2I generation process, EVS successfully leverages the strengths of both types of models, resulting in videos of improved imaging and motion quality. Experimental results validate the effectiveness of our approach compared to previous approaches. Our composition process also yields a significant 1.6x-4.5x speedup in inference time. (Source codes will be released upon paper acceptance.)
Poster
Yeongmin Kim · Sotiris Anagnostidis · Yuming Du · Edgar Schoenfeld · Jonas Kohler · Markos Georgopoulos · Albert Pumarola · Ali Thabet · Artsiom Sanakoyeu
[ ExHall D ]
Abstract
Diffusion models with transformer architectures have demonstrated promising capabilities in generating high-fidelity images and scalability for high resolution. However, the iterative sampling process required for synthesis is very resource-intensive. A line of work has focused on distilling solutions to probability flow ODEs into few-step student models. Nevertheless, existing methods have been limited by their reliance on the most recent denoised samples as input, rendering them susceptible to exposure bias. To address this limitation, we propose AutoRegressive Distillation (ARD), a novel approach that leverages the historical trajectory of the ODE to predict future steps. ARD offers two key benefits: 1) it mitigates exposure bias by utilizing a predicted historical trajectory that is less susceptible to accumulated errors, and 2) it leverages the previous history of the ODE trajectory as a more effective source of coarse-grained information. ARD modifies the teacher transformer architecture by adding token-wise time embedding to mark each input from the trajectory history and employs a block-wise causal attention mask for training. Furthermore, incorporating historical inputs only in lower transformer layers enhances performance and efficiency. We validate the effectiveness of ARD in class-conditioned generation on ImageNet and T2I synthesis. Our model achieves a 5× reduction in FID degradation compared …
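The block-wise causal attention mask mentioned above can be illustrated as follows: tokens belonging to trajectory step i attend to all tokens of steps up to i, with full attention inside each step. The additive-mask convention and the fixed tokens-per-step layout are assumptions for illustration.

```python
import torch

def blockwise_causal_mask(num_steps, tokens_per_step):
    """Attention mask that is fully connected within each trajectory step and
    causal across steps (step i may attend to steps <= i)."""
    step_id = torch.arange(num_steps).repeat_interleave(tokens_per_step)
    allowed = step_id[:, None] >= step_id[None, :]          # (N, N) boolean pattern
    mask = torch.zeros(allowed.shape)
    mask[~allowed] = float("-inf")                          # additive attention mask
    return mask
```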
Poster
Diljeet Jagpal · Xi Chen · Vinay P. Namboodiri
[ ExHall D ]
Abstract
Zero-shot, training-free, image-based text-to-video generation is an emerging area that aims to generate videos using existing image-based diffusion models. Current methods in this space require specific architectural changes to image-generation models, which limit their adaptability and scalability. In contrast to such methods, we provide a model-agnostic approach. We use intersections in diffusion trajectories, working only with the latent values. We could not obtain localized frame-wise coherence and diversity using only the intersection of trajectories. Thus, we instead use a grid-based approach. An in-context trained LLM is used to generate coherent frame-wise prompts; another is used to identify differences between frames. Based on these, we obtain a CLIP-based attention mask that controls the timing of switching the prompts for each grid cell. Earlier switching results in higher variance, while later switching results in more coherence. Therefore, our approach can ensure appropriate control between coherence and variance for the frames. Our approach results in state-of-the-art performance while being more flexible when working with diverse image-generation models. The empirical analysis using quantitative metrics and user studies confirms our model's superior temporal consistency, visual fidelity and user satisfaction, thus providing a novel way to obtain training-free, image-based text-to-video generation.
Poster
Luozhou Wang · Yijun Li · ZhiFei Chen · Jui-Hsien Wang · Zhifei Zhang · He Zhang · Zhe Lin · Ying-Cong Chen
[ ExHall D ]
Abstract
Text-to-video generative models have made significant strides, enabling diverse applications in entertainment, advertising, and education. However, generating RGBA video, which includes alpha channels for transparency, remains a challenge due to limited datasets and the difficulty of adapting existing models. Alpha channels are crucial for visual effects (VFX), allowing transparent elements like smoke and reflections to blend seamlessly into scenes.We introduce TransPixar, a method to extend pretrained video models for RGBA generation while retaining the original RGB capabilities. TransPixar leverages a diffusion transformer (DiT) architecture, incorporating alpha-specific tokens and using LoRA-based fine-tuning to jointly generate RGB and alpha channels with high consistency. By optimizing attention mechanisms, TransPixar preserves the strengths of the original RGB model and achieves strong alignment between RGB and alpha channels despite limited training data.Our approach effectively generates diverse and consistent RGBA videos, advancing the possibilities for VFX and interactive content creation.
Poster
Xiang Gao · Shuai Yang · Jiaying Liu
[ ExHall D ]
Abstract
Optical illusion hidden picture is an interesting visual perceptual phenomenon where an image is cleverly integrated into another picture in a way that is not immediately obvious to the viewer. Built on an off-the-shelf text-to-image diffusion model, we propose a novel Phase-Transferred Diffusion Model (PTDiffusion) for hidden art synthesis. PTDiffusion embeds an input reference image into arbitrary scenes that are faithful to text prompts, while exhibiting hidden visual cues of the reference image. At the heart of our method is a plug-and-play phase transfer mechanism that dynamically and progressively transplants the phase spectrum of diffusion features from the denoising process that reconstructs the reference image into the process that samples the illusion picture, realizing harmonious fusion of the reference's structural information and the target semantic information. Furthermore, we propose an asynchronous phase transfer mechanism to flexibly control the degree of hidden image discernibility. Our method is training-free, all while substantially outperforming related methods in image quality, text fidelity, visual discernibility, and contextual naturalness for illusion picture synthesis, as fully demonstrated by extensive qualitative and quantitative experiments.
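A generic phase-transfer operation of this kind can be sketched with a 2D FFT: the amplitude spectrum of the currently sampled features is combined with the phase spectrum of the reference features, with alpha blending the two phases. Note that naive linear blending of angles ignores phase wrap-around; the feature names and blending scheme are illustrative assumptions, not the paper's exact mechanism.

```python
import torch

def phase_transfer(ref_feat, gen_feat, alpha=1.0):
    """Combine the phase spectrum of reference features with the amplitude
    spectrum of the currently sampled features."""
    R = torch.fft.fft2(ref_feat)
    G = torch.fft.fft2(gen_feat)
    phase = alpha * torch.angle(R) + (1.0 - alpha) * torch.angle(G)  # blended phase
    fused = torch.abs(G) * torch.exp(1j * phase)                     # keep generated amplitude
    return torch.fft.ifft2(fused).real
```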
Poster
Hyunsoo Kim · Donghyun Kim · Suhyun Kim
[ ExHall D ]
Abstract
How can we generate an image B′ that satisfies A : A′ :: B : B′, given the input images A, A′ and B? Recent works have tackled this challenge through approaches like visual in-context learning or visual instruction. However, these methods are typically limited to specific models (e.g., InstructPix2Pix, inpainting models) rather than general diffusion models (e.g., Stable Diffusion, SDXL). This dependency may lead to inherited biases or lower editing capabilities. In this paper, we propose Difference Inversion, a method that isolates only the difference between A and A′ and applies it to B to generate a plausible B′. To address model dependency, it is crucial to structure prompts in the form of a "Full Prompt" suitable for input to Stable Diffusion models, rather than using an "Instruction Prompt". To this end, we accurately extract the Difference between A and A′ and combine it with the prompt of B, enabling a plug-and-play application of the difference. To extract a precise difference, we first identify it through 1) Delta Interpolation. Additionally, to ensure accurate training, we propose the 2) Token Consistency Loss and 3) Zero Initialization of Token Embeddings. Our extensive experiments demonstrate that Difference Inversion outperforms existing baselines both quantitatively and qualitatively, indicating its ability to generate …
Poster
ruojun xu · Weijie Xi · Xiaodi Wang · Yongbo Mao · Zach Cheng
[ ExHall D ]
Abstract
Training-free diffusion-based methods have achieved remarkable success in style transfer, eliminating the need for extensive training or fine-tuning. However, due to the lack of targeted training for style information extraction and constraints on the content image layout, training-free methods often suffer from layout changes of the original content and content leakage from style images. Through a series of experiments, we discovered that an effective startpoint in the sampling stage significantly enhances the style transfer process. Based on this discovery, we propose StyleSSP, which focuses on obtaining a better startpoint to address layout changes of the original content and content leakage from the style image. StyleSSP comprises two key components: (1) Frequency Manipulation: To improve content preservation, we reduce the low-frequency components of the DDIM latent, allowing the sampling stage to pay more attention to the layout of content images; and (2) Negative Guidance via Inversion: To mitigate content leakage from the style image, we employ negative guidance in the inversion stage to ensure that the startpoint of the sampling stage is distanced from the content of the style image. Experiments show that StyleSSP surpasses previous training-free style transfer baselines, particularly in preserving original content and minimizing content leakage from the style image.
Poster
Yang Zhou · Xu Gao · Zichong Chen · Hui Huang
[ ExHall D ]
Abstract
Recent advances in generative diffusion models have shown a notable inherent understanding of image style and semantics. In this paper, we leverage the self-attention features from pretrained diffusion networks to transfer the visual characteristics from a reference to generated images. Unlike previous work that uses these features as plug-and-play attributes, we propose a novel attention distillation loss calculated between the ideal and current stylization results, based on which we optimize the synthesized image via backpropagation in latent space. Next, we propose an improved Classifier Guidance that integrates attention distillation loss into the denoising sampling process, further accelerating the synthesis and enabling a broad range of image generation applications. Extensive experiments have demonstrated the extraordinary performance of our approach in transferring the examples' style, appearance, and texture to new images in synthesis.
Poster
Jihun Park · Jongmin Gim · Kyoungmin Lee · Seunghun Lee · Sunghoon Im
[ ExHall D ]
Abstract
We present a text-driven object-centric style editing model named Style-Editor, a novel method that guides style editing at an object-centric level using textual inputs. The core of Style-Editor is our Patch-wise Co-Directional (PCD) loss, meticulously designed for precise object-centric editing that is closely aligned with the input text. This loss combines a patch directional loss for text-guided style direction and a patch distribution consistency loss for an even CLIP embedding distribution across object regions. It ensures seamless and harmonious style editing across object regions. Key to our method are the Text-Matched Patch Selection (TMPS) and Pre-fixed Region Selection (PRS) modules for identifying object locations via text, eliminating the need for segmentation masks. Lastly, we introduce an Adaptive Background Preservation (ABP) loss to maintain the original style and structural essence of the image's background. This loss is applied to dynamically identified background areas. Extensive experiments underline the effectiveness of our approach in creating visually coherent and textually aligned style editing.
Poster
Suho Ryu · Kihyun Kim · Eugene Baek · Dongsoo Shin · Joonseok Lee
[ ExHall D ]
Abstract
A variety of text-guided image editing models have been proposed recently. However, there is no widely-accepted standard evaluation method mainly due to the subjective nature of the task, letting researchers rely on manual user study. To address this, we introduce a novel Human-Aligned benchmark for Text-guided Image Editing (HATIE). Providing a large-scale benchmark set covering a wide range of editing tasks, it allows reliable evaluation, not limited to specific easy-to-evaluate cases. Also, HATIE provides a fully-automated and omnidirectional evaluation pipeline. Particularly, we combine multiple scores measuring various aspects of editing so as to align with human perception. We empirically verify that the evaluation of HATIE is indeed human-aligned in various aspects, and provide benchmark results on several state-of-the-art models to provide deeper insights on their performance.
Poster
Weicheng Wang · Guoli Jia · Zhongqi Zhang · Liang Lin · Jufeng Yang
[ ExHall D ]
Abstract
Diffusion models pre-trained on large-scale paired image-text data achieve significant success in image editing. To convey more fine-grained visual details, subject-driven editing integrates subjects in user-provided reference images into existing scenes. However, it is challenging to obtain photorealistic results, which simulate contextual interactions, such as reflections, illumination, and shadows, induced by merging the target object into the source image. To address this issue, we propose PS-Diffusion, which ensures realistic and consistent object-scene blending while maintaining invariance of subject appearance during editing. Specifically, we first divide the contextual interactions into those occurring in the foreground and the background areas. The effect of the former is estimated through intrinsic image decomposition, and the region of the latter is predicted in an additional background effect control branch. Moreover, we propose an effect attention module to disentangle the learning processes of interaction and subject, alleviating confusion between them. Additionally, we introduce a synthesized dataset, Replace-5K, consisting of 5,000 image pairs with invariant subject and contextual interactions via 3D rendering. Extensive quantitative and qualitative experiments on our dataset and two real-world datasets demonstrate that our method achieves state-of-the-art performance. The source code is provided in the supplementary materials and will be publicly available.
Poster
Navve Wasserman · Noam Rotstein · Roy Ganz · Ron Kimmel
[ ExHall D ]
Abstract
Image editing has advanced significantly with the introduction of text-conditioned diffusion models. Despite this progress, seamlessly adding objects to images based on textual instructions without requiring user-provided input masks remains a challenge. We address this by leveraging the insight that removing objects (Inpaint) is significantly simpler than its inverse process of adding them (Paint), attributed to the utilization of segmentation mask datasets alongside inpainting models that inpaint within these masks. Capitalizing on this realization, by implementing an automated and extensive pipeline, we curate a filtered large-scale image dataset containing pairs of images and their corresponding object-removed versions. Using these pairs, we train a diffusion model to inverse the inpainting process, effectively adding objects into images. Unlike other editing datasets, ours features natural target images instead of synthetic ones; moreover, it maintains consistency between source and target by construction. Additionally, we utilize a large Vision-Language Model to provide detailed descriptions of the removed objects and a Large Language Model to convert these descriptions into diverse, natural-language instructions. Our quantitative and qualitative results show that the trained model surpasses existing models in both object addition and general editing tasks. To propel future research, we will release the dataset alongside the trained models.
Poster
jun huang · Ting Liu · Yihang Wu · Xiaochao Qu · Luoqi Liu · Xiaolin Hu
[ ExHall D ]
Abstract
Advancements in generative models have enabled image inpainting models to generate content within specific regions of an image based on provided prompts and masks. However, existing inpainting methods often suffer from problems such as semantic misalignment, structural distortion, and style inconsistency. In this work, we present MTADiffusion, a Mask-Text Alignment diffusion model designed for object inpainting. To enhance the semantic capabilities of the inpainting model, we introduce the MTAPipeline, an automatic solution for annotating masks with detailed descriptions. Based on the MTAPipeline, we construct a new MTADataset comprising 5 million images and 25 million mask-text pairs. Furthermore, we propose a multi-task training strategy that integrates both inpainting and edge prediction tasks to improve structural stability. To promote style consistency, we present the combination of self-attention mechanisms and a novel inpainting style-consistency loss using a pre-trained VGG network and the Gram matrix. Comprehensive evaluations on BrushBench and EditBench demonstrate that MTADiffusion achieves state-of-the-art performance compared to other methods.
Poster
Yizhe Tang · Zhimin Sun · Yuzhen Du · Ran Yi · Guangben Lu · Teng Hu · LUYING LI · Lizhuang Ma · FangYuan Zou
[ ExHall D ]
Abstract
Image inpainting aims to fill the missing region of an image. Recently, there has been a surge of interest in foreground-conditioned background inpainting, a sub-task that fills the background of an image while the foreground subject and associated text prompt are provided. Existing background inpainting methods typically strictly preserve the subject's original position from the source image, resulting in inconsistencies between the subject and the generated background. To address this challenge, we propose a new task, "Text-Guided Subject-Position Variable Background Inpainting", which aims to dynamically adjust the subject position to achieve a harmonious relationship between the subject and the inpainted background, and propose the Adaptive Transformation Agent (ATA) for this task. Firstly, we design a PosAgent Block that adaptively predicts an appropriate displacement based on given features to achieve variable subject position. Secondly, we design the Reverse Displacement Transform (RDT) module, which arranges multiple PosAgent blocks in a reverse structure, to transform hierarchical feature maps from deep to shallow based on semantic information. Thirdly, we equip ATA with a Position Switch Embedding to control whether the subject's position in the generated image is adaptively predicted or fixed. Extensive comparative experiments validate the effectiveness of our ATA approach, which not only demonstrates superior inpainting capabilities in subject-position variable inpainting, but …
Poster
Bolin Lai · Felix Juefei-Xu · Miao Liu · Xiaoliang Dai · Nikhil Mehta · Chenguang Zhu · Zeyi Huang · James Rehg · Sangmin Lee · Ning Zhang · Tong Xiao
[ ExHall D ]
Abstract
Text-guided image manipulation has experienced notable advancement in recent years. In order to mitigate linguistic ambiguity, few-shot learning with visual examples has been applied for instructions that are underrepresented in the training set, or difficult to describe purely in language. However, learning from visual prompts requires strong reasoning capability, which diffusion models are struggling with. To address this issue, we introduce a novel multi-modal autoregressive model, dubbed InstaManip, that can instantly learn a new image manipulation operation from textual and visual guidance via in-context learning, and apply it to new query images. Specifically, we propose an innovative group self-attention mechanism to break down the in-context learning process into two separate stages -- learning and applying, which simplifies the complex problem into two easier tasks. We also introduce a relation regularization method to further disentangle image transformation features from irrelevant contents in exemplar images. Extensive experiments suggest that our method surpasses previous few-shot image manipulation models by a notable margin (>=19% in human evaluation). We also find our model can be further boosted by increasing the number or diversity of exemplar images.
Poster
Pu Cao · Feng Zhou · Lu Yang · TianruiHuang · Qing Song
[ ExHall D ]
Abstract
In-domain generation aims to perform a variety of tasks within a specific domain, such as unconditional generation, text-to-image, image editing, 3D generation, and more. Early research typically required training specialized generators for each unique task and domain, often relying on fully-labeled data. Motivated by the powerful generative capabilities and broad applications of diffusion models, we are driven to explore leveraging label-free data to empower these models for in-domain generation. Fine-tuning a pre-trained generative model on domain data is an intuitive but challenging approach that often requires complex manual hyper-parameter adjustments, since the limited diversity of the training data can easily disrupt the model's original generative capabilities. To address this challenge, we propose a guidance-decoupled prior preservation mechanism to achieve high generative quality and controllability using image-only data, inspired by preserving the pre-trained model from a denoising guidance perspective. We decouple domain-related guidance from the conditional guidance used in classifier-free guidance mechanisms to preserve open-world control guidance and unconditional guidance from the pre-trained model. We further propose an efficient domain knowledge learning technique to train an additional text-free UNet copy to predict domain guidance. Besides, we theoretically illustrate a multi-guidance in-domain generation pipeline for a variety of generative tasks, leveraging multiple guidances from distinct diffusion …
Poster
Qihan Huang · Weilong Dai · Jinlong Liu · Wanggui He · Hao Jiang · Mingli Song · Jie Song
[ ExHall D ]
Abstract
Finetuning-free personalized image generation can synthesize customized images without test-time finetuning, attracting wide research interest owing to its high efficiency. Current finetuning-free methods simply adopt a single training stage with a simple image reconstruction task, and they typically generate low-quality images inconsistent with the reference images during test-time. To mitigate this problem, inspired by the recent DPO (i.e., direct preference optimization) technique, this work proposes an additional training stage to improve the pre-trained personalized generation models. However, traditional DPO only determines the overall superiority or inferiority of two samples, which is not suitable for personalized image generation because the generated images are commonly inconsistent with the reference images only in some local image patches. To tackle this problem, this work proposes PatchDPO that estimates the quality of image patches within each generated image and accordingly trains the model. To this end, PatchDPO first leverages the pre-trained vision models with a proposed self-supervised training method to estimate the patch quality. Next, PatchDPO adopts a weighted training approach to train the model with the estimated patch quality, which rewards the image patches with high quality while penalizing the image patches with low quality. Experiment results demonstrate that PatchDPO significantly improves the performance …
Poster
Dong Liang · Jinyuan Jia · Yuhao Liu · Zhanghan Ke · Hongbo Fu · Rynson W.H. Lau
[ ExHall D ]
Abstract
Recent advancements in diffusion models have significantly enhanced the performance of text-to-image models in image synthesis. To enable control over the spatial locations of the generated objects, diffusion-based methods typically utilize object layout as an auxiliary input. However, we observe that this approach treats all objects as being on the same layer and neglects their visibility order, leading to the synthesis of overlapping objects with incorrect occlusions. To address this limitation, we introduce in this paper a new training-free framework that considers object visibility order explicitly and allows users to place overlapping objects in a stack of layers. Our framework consists of two visibility-based designs. First, we propose a novel Sequential Denoising Process (SDP) to divide the whole image generation into multiple stages for different objects, where each stage primarily focuses on one object. Second, we propose a novel Visibility-Order-Aware (VOA) Loss to transform the layout and occlusion constraints into an attention map optimization process to improve the accuracy of synthesizing object occlusions in complex scenes. By merging these two novel components, our framework, dubbed VODiff, enables the generation of photorealistic images that satisfy user-specified spatial constraints and object occlusion relationships. In addition, we introduce VOBench, a diverse benchmark dataset containing 200 curated …
Poster
Yingying Deng · Xiangyu He · Fan Tang · Weiming Dong
[ ExHall D ]
Abstract
The customization of multiple attributes has gained increasing popularity with the rising demand for personalized content creation. Despite promising empirical results, the contextual coherence between different attributes has been largely overlooked. In this paper, we argue that subsequent attributes should follow the multivariable conditional distribution introduced by former attributes creation. In light of this, we reformulate multi-attribute creation from a conditional probability theory perspective and tackle the challenging zero-shot setting. By explicitly modeling the dependencies between attributes, we further enhance the coherence of generated images across diverse attribute combinations. Furthermore, we identify connections between multi-attribute customization and multi-task learning, effectively addressing the high computing cost encountered in multi-attribute synthesis. Extensive experiments demonstrate that Z-Magic outperforms existing models in zero-shot image generation, with broad implications for AI-driven design and creative applications.
Poster
Jian Han · Jinlai Liu · Yi Jiang · Bin Yan · Yuqi Zhang · Zehuan Yuan · BINGYUE PENG · Xiaobing Liu
[ ExHall D ]
Abstract
We present Infinity, a Bitwise Visual AutoRegressive Model capable of generating high-resolution, photorealistic images following language instructions. Infinity refactors the visual autoregressive model under a bitwise token prediction framework with an infinite-vocabulary classifier and a bitwise self-correction mechanism. By theoretically expanding the tokenizer vocabulary size to infinity in the Transformer, our method unlocks substantially stronger scaling capabilities than vanilla VAR. Extensive experiments indicate that Infinity outperforms autoregressive text-to-image models by large margins and matches or surpasses leading diffusion models. Without extra optimization, Infinity generates a 1024×1024 image in 0.8s, 2.6× faster than SD3-Medium, making it the fastest text-to-image model. Models and code will be released to promote further exploration of Infinity for visual generation.
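A minimal sketch of the bitwise-prediction idea: instead of a softmax over a 2^d code vocabulary, each of the d bits of a token is predicted with an independent binary logit. Dimensions and the loss form are assumptions, not the released Infinity architecture.

```python
import torch
import torch.nn as nn

class BitwiseHead(nn.Module):
    """Predict a d-bit visual token as d independent binary classifications
    instead of one softmax over a 2**d vocabulary (a sketch of the idea)."""

    def __init__(self, hidden_dim: int, num_bits: int = 16):
        super().__init__()
        self.to_bits = nn.Linear(hidden_dim, num_bits)  # one logit per bit

    def forward(self, h):                      # h: (B, L, hidden_dim)
        return self.to_bits(h)                 # (B, L, num_bits) bit logits

    def loss(self, h, target_bits):            # target_bits in {0, 1}
        logits = self(h)
        return nn.functional.binary_cross_entropy_with_logits(
            logits, target_bits.float())

head = BitwiseHead(hidden_dim=256, num_bits=16)
h = torch.randn(2, 10, 256)
bits = torch.randint(0, 2, (2, 10, 16))
print(head.loss(h, bits))
```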
Poster
Woojung Han · Yeonkyung Lee · Chanyoung Kim · Kwanghyun Park · Seong Jae Hwang
[ ExHall D ]
Abstract
Diffusion-based text-to-image (T2I) models have recently excelled in high-quality image generation, particularly in a training-free manner, enabling cost-effective adaptability and generalization across diverse tasks. However, while existing methods have continuously focused on challenges such as "missing objects" and "mismatched attributes," another critical issue, "mislocated objects," remains, where generated spatial positions fail to align with text prompts. Surprisingly, ensuring such seemingly basic functionality remains challenging in popular T2I models due to the inherent difficulty of imposing explicit spatial guidance via text forms. To address this, we propose STORM (Spatial Transport Optimization by Repositioning Attention Map), a novel training-free approach for spatially coherent T2I synthesis. STORM employs Spatial Transport Optimization (STO), rooted in optimal transport theory, to dynamically adjust object attention maps for precise spatial adherence, supported by a custom Spatial Transport (ST) Cost function that enhances spatial understanding. Our analysis shows that integrating spatial awareness is most effective in the early denoising stages, while later phases refine details. Extensive experiments demonstrate that STORM surpasses existing methods, effectively mitigating mislocated objects while improving missing and mismatched attributes, setting a new benchmark for spatial alignment in T2I synthesis. The source code will be publicly released.
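As a rough illustration of the transport framing, the sketch below scores how far an object's attention mass sits from its target region; it is a simplified stand-in, not the paper's Spatial Transport Cost.

```python
import torch

def spatial_transport_cost(attn, target_mask):
    """Simplified transport-style cost: the expected distance an object's
    attention mass must travel to reach its target region.
    attn: (H, W) non-negative attention map for one object token.
    target_mask: (H, W) binary mask of where the prompt places the object.
    """
    H, W = attn.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    coords = torch.stack([ys, xs], dim=-1).float()                    # (H, W, 2)

    p = attn / (attn.sum() + 1e-8)                                    # source distribution
    target = target_mask / (target_mask.sum() + 1e-8)
    target_center = (coords * target.unsqueeze(-1)).sum(dim=(0, 1))   # (2,)

    # Distance from every pixel to the target centroid, weighted by attention mass.
    dist = (coords - target_center).norm(dim=-1)                      # (H, W)
    return (p * dist).sum()

attn = torch.rand(16, 16)
mask = torch.zeros(16, 16)
mask[:, 8:] = 1.0
print(spatial_transport_cost(attn, mask))
```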
Poster
Jiapeng Zhu · Ceyuan Yang · Kecheng Zheng · Yinghao Xu · Zifan Shi · Yifei Zhang · Qifeng Chen · Yujun Shen
[ ExHall D ]
Abstract
Due to the difficulty in scaling up, generative adversarial networks (GANs) seem to be falling out of favor for the task of text-conditioned image synthesis. Sparsely activated mixture-of-experts (MoE) has recently been demonstrated as a valid solution to training large-scale models with limited resources. Inspired by this, we present Aurora, a GAN-based text-to-image generator that employs a collection of experts to learn feature processing, together with a sparse router to adaptively select the most suitable expert for each feature point. We adopt a two-stage training strategy, which first learns a base model at 64×64 resolution followed by an upsampler to produce 512×512 images. Trained with only public data, our approach encouragingly closes the performance gap between GANs and industry-level diffusion models, maintaining a fast inference speed. We will release the code and checkpoints to facilitate more comprehensive community studies of GANs.
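The router-plus-experts mechanism described above can be sketched generically as a top-1 sparse mixture-of-experts over feature points; the sizes and expert shapes below are illustrative, not Aurora's.

```python
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    """Sparse mixture-of-experts with a top-1 router per feature point
    (a generic sketch of the mechanism the abstract describes)."""

    def __init__(self, dim: int, num_experts: int = 8):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
             for _ in range(num_experts)])

    def forward(self, x):                         # x: (N, dim) feature points
        gate = self.router(x).softmax(dim=-1)     # (N, E)
        score, idx = gate.max(dim=-1)             # top-1 expert per feature point
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts): # only the selected experts run
            sel = idx == e
            if sel.any():
                out[sel] = score[sel, None] * expert(x[sel])
        return out

moe = SparseMoE(dim=64)
print(moe(torch.randn(32, 64)).shape)
```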
Poster
Qingyu Shi · Lu Qi · Jianzong Wu · Jinbin Bai · Jingbo Wang · Yunhai Tong · Xiangtai Li
[ ExHall D ]
Abstract
Customized image generation is essential for delivering personalized content based on user-provided prompts, enabling large-scale text-to-image diffusion models to better align with individual needs. However, existing models often neglect the relationships between customized objects in generated images. In contrast, this work addresses this gap by focusing on relation-aware customized image generation, which seeks to preserve the identities from image prompts while maintaining the predicate relations specified in text prompts. Specifically, we introduce DreamRelation, a framework that disentangles identity and relation learning using a carefully curated dataset. Our training data consists of relation-specific images, independent object images containing identity information, and text prompts to guide relation generation. Then, we propose two key modules to tackle the two main challenges—generating accurate and natural relations, especially when significant pose adjustments are required, and avoiding object confusion in cases of overlap. First, we introduce a keypoint matching loss that effectively guides the model in adjusting object poses closely tied to their relationships. Second, we incorporate local features from the image prompts to better distinguish between objects, preventing confusion in overlapping cases. Extensive results on our proposed benchmarks demonstrate the superiority of DreamRelation in generating precise relations while preserving object identities across a diverse set …
Poster
Kaiwen Zha · Lijun Yu · Alireza Fathi · David A. Ross · Cordelia Schmid · Dina Katabi · Xiuye Gu
[ ExHall D ]
Abstract
Image tokenization, the process of transforming raw image pixels into a compact low-dimensional latent representation, has proven crucial for scalable and efficient image generation. However, mainstream image tokenization methods generally have limited compression rates, making high-resolution image generation computationally expensive. To address this challenge, we propose to leverage language for efficient image tokenization, and we call our method Text-Conditioned Image Tokenization (TexTok). TexTok is a simple yet effective tokenization framework that leverages language to provide high-level semantics. By conditioning the tokenization process on descriptive text captions, TexTok allows the tokenizer to focus on encoding fine-grained visual details into latent tokens, leading to enhanced reconstruction quality and higher compression rates. Compared to the conventional tokenizer without text conditioning, TexTok achieves average reconstruction FID improvements of 29.2% and 48.1% on ImageNet 256×256 and 512×512 benchmarks respectively, across varying numbers of tokens. These tokenization improvements consistently translate to 16.3% and 34.3% average improvements in generation FID. By simply replacing the tokenizer in Diffusion Transformer (DiT) with TexTok, our system can achieve a 93.5× inference speedup while still outperforming the original DiT using only 32 tokens on ImageNet-512. TexTok with a vanilla DiT generator achieves state-of-the-art FID scores of 1.46 and 1.62 on ImageNet-256 …
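One way to picture text-conditioned tokenization is to append caption embeddings to the patch tokens before a learned set of latent tokens compresses the image; the sketch below follows that reading with illustrative module names and sizes, not TexTok's actual configuration.

```python
import torch
import torch.nn as nn

class TextConditionedTokenizer(nn.Module):
    """Sketch of text-conditioned image tokenization: caption embeddings are
    appended to patch tokens so the encoder can offload high-level semantics
    to the text and spend its latent budget on visual detail."""

    def __init__(self, patch_dim=768, text_dim=768, num_latents=32):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, patch_dim) * 0.02)
        self.text_proj = nn.Linear(text_dim, patch_dim)
        layer = nn.TransformerEncoderLayer(patch_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, patches, text_emb):
        # patches: (B, P, patch_dim); text_emb: (B, T, text_dim)
        B = patches.size(0)
        latents = self.latents.unsqueeze(0).expand(B, -1, -1)
        seq = torch.cat([latents, patches, self.text_proj(text_emb)], dim=1)
        out = self.encoder(seq)
        return out[:, : self.latents.size(0)]     # the compressed latent tokens

tok = TextConditionedTokenizer()
z = tok(torch.randn(2, 196, 768), torch.randn(2, 16, 768))
print(z.shape)  # (2, 32, 768)
```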
Poster
Lifu Wang · Daqing Liu · Xinchen Liu · Xiaodong He
[ ExHall D ]
Abstract
Text encoders in diffusion models have rapidly evolved, transitioning from CLIP to T5-XXL. Although this evolution has significantly enhanced the models' ability to understand complex prompts and generate text, it also leads to a substantial increase in the number of parameters. Despite T5 series encoders being trained on the C4 natural language corpus, which includes a significant amount of non-visual data, diffusion models with the T5 encoder do not respond to those non-visual prompts, indicating redundancy in representational power. Therefore, it raises an important question: "Do we really need such a large text encoder?" In pursuit of an answer, we employ vision-based knowledge distillation to train a series of T5 encoder models. To fully inherit its capabilities, we constructed our dataset based on three criteria: image quality, semantic understanding, and text-rendering. Our results demonstrate a clear scaling-down pattern: the distilled T5-base model can generate images of comparable quality to those produced by T5-XXL, while being 50 times smaller in size. This reduction in model size significantly lowers the GPU requirements for running state-of-the-art models such as FLUX and SD3, making high-quality text-to-image generation more accessible.
Poster
Shengqu Cai · Eric Ryan Chan · Yunzhi Zhang · Leonidas Guibas · Jiajun Wu · Gordon Wetzstein
[ ExHall D ]
Abstract
Text-to-image diffusion models produce impressive results but are frustrating tools for artists who desire fine-grained control. For example, a common use case is to create images of a specific instance in novel contexts, i.e., "identity-preserving generation". This setting, along with many other tasks (e.g., relighting), is a natural fit for image+text-conditional generative models. However, there is insufficient high-quality paired data to train such a model directly. We propose Diffusion Self-Distillation, a method for using a pre-trained text-to-image model to generate its own dataset for text-conditioned image-to-image tasks. We first leverage a text-to-image diffusion model's in-context generation ability to create grids of images and curate a large paired dataset with the help of a Visual-Language Model. We then fine-tune the text-to-image model into a text+image-to-image model using the curated paired dataset. We demonstrate that Diffusion Self-Distillation outperforms existing zero-shot methods and is competitive with per-instance tuning techniques on a wide range of identity-preservation generation tasks, without requiring test-time optimization.
Poster
Fu Feng · Yucheng Xie · Xu Yang · Jing Wang · Xin Geng
[ ExHall D ]
Abstract
"Creative" remains an inherently abstract concept for both humans and diffusion models. While text-to-image (T2I) diffusion models can easily generate out-of-domain concepts like "a blue banana", they struggle with generating combinatorial objects such as "a creative mixture that resembles a lettuce and a mantis", due to difficulties in understanding the semantic depth of "creative". Current methods rely heavily on synthesizing reference prompts or images to achieve a creative effect, typically requiring retraining for each unique creative output, a process that is computationally intensive and limits practical applications. To address this, we introduce CreTok, which brings meta-creativity to diffusion models by redefining "creative" as a new token, \texttt{<CreTok>}, thus enhancing models' semantic understanding for combinatorial creativity. CreTok achieves such redefinition by iteratively sampling diverse text pairs from our proposed CangJie dataset to form adaptive prompts and restrictive prompts, and then optimizing the similarity between their respective text embeddings. Extensive experiments demonstrate that \texttt{<CreTok>} enables the universal and direct generation of combinatorial creativity across diverse concepts without additional training (4s vs. BASS's 2400s per image), achieving state-of-the-art performance with improved text-image alignment (↑0.03 in VQAScore) and higher human preference ratings (↑0.009 in PickScore and ↑0.169 in ImageReward). Further evaluations with GPT-4o …
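A rough sketch of the token-redefinition loop: a learnable <CreTok> embedding is optimized so that adaptive prompts containing it move toward matching restrictive prompts. The mean-pooled "encoder" and the random stand-in embeddings below are placeholders for the T2I model's text encoder and the CangJie text pairs, not the method's actual components.

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-ins for text-encoder outputs of sampled prompt pairs.
torch.manual_seed(0)
dim, pool = 512, 8
adaptive_ctx = torch.randn(pool, 6, dim)                    # adaptive prompts with a slot for <CreTok>
restrictive = F.normalize(torch.randn(pool, dim), dim=-1)   # matching restrictive prompts

cre_tok = torch.zeros(dim, requires_grad=True)              # the new token's embedding to learn
opt = torch.optim.Adam([cre_tok], lr=1e-2)

for step in range(200):
    i = torch.randint(pool, (1,)).item()                    # sample a text pair
    tokens = torch.cat([adaptive_ctx[i], cre_tok.unsqueeze(0)], dim=0)
    adaptive = F.normalize(tokens.mean(dim=0), dim=-1)      # crude prompt embedding
    loss = 1.0 - torch.dot(adaptive, restrictive[i])        # maximize cosine similarity
    opt.zero_grad()
    loss.backward()
    opt.step()

print(float(loss))
```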
Poster
Shulei Wang · w l · Hai Huang · Hanting Wang · Sihang Cai · WenKang Han · Tao Jin · Jingyuan Chen · Jiacheng Sun · Jieming Zhu · Zhou Zhao
[ ExHall D ]
Abstract
We introduce a novel, training-free approach for enhancing alignment in Transformer-based Text-Guided Diffusion Models (TGDMs). Existing TGDMs often struggle to generate semantically aligned images, particularly when dealing with complex text prompts or multi-concept attribute binding challenges. Previous U-Net-based methods primarily optimized the latent space, but their direct application to Transformer-based architectures has shown limited effectiveness. Our method addresses these challenges by directly optimizing cross-attention maps during the generation process. Specifically, we introduce Self-Coherence Guidance, a method that dynamically refines attention maps using masks derived from previous denoising steps, ensuring precise alignment without additional training. To validate our approach, we constructed more challenging benchmarks for evaluating coarse-grained attribute binding, fine-grained attribute binding, and style binding. Experimental results demonstrate the superior performance of our method, significantly surpassing other state-of-the-art methods across all evaluated tasks. Our code is available at https://scg-diffusion.github.io/scg-diffusion.
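A schematic version of mask-guided attention refinement, assuming a mask is thresholded from the previous denoising step's attention map and used to boost or damp the current map; the exact guidance rule in the paper may differ.

```python
import torch

def self_coherence_refine(attn_prev, attn_curr, threshold=0.5, strength=0.3):
    """Refine the current cross-attention map using a mask derived from the
    previous denoising step (a schematic version of the guidance idea).
    attn_prev, attn_curr: (H, W) attention maps for one concept token."""
    mask = (attn_prev > threshold * attn_prev.max()).float()       # where the concept sat last step
    # Boost attention inside the mask, damp it outside, then renormalize mass.
    refined = attn_curr * (1.0 + strength * (2.0 * mask - 1.0))
    return refined / refined.sum() * attn_curr.sum()

prev = torch.rand(32, 32)
curr = torch.rand(32, 32)
print(self_coherence_refine(prev, curr).shape)
```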
Poster
Kyungmin Lee · Xiaohang Li · Qifei Wang · Junfeng He · Junjie Ke · Ming-Hsuan Yang · Irfan Essa · Jinwoo Shin · Feng Yang · Yinxiao Li
[ ExHall D ]
Abstract
Aligning text-to-image (T2I) diffusion models via preference optimization on human-annotated datasets is valuable, but the heavy cost of manual data collection limits scalability. Using reward models offers an alternative; however, current preference optimization methods fall short in exploiting the rich information they provide, as they only consider the pairwise preference distribution. Furthermore, they lack generalization to multi-preference scenarios and struggle to handle inconsistencies between rewards. To address this, we present Calibrated Preference Optimization (CaPO), a novel method to align T2I diffusion models by incorporating the general preference from multiple reward models without human-annotated data. The core of our approach involves a reward calibration method to approximate the general preference by computing the expected win-rate against the samples generated by the pretrained models. Additionally, we propose a frontier-based pair selection method that effectively manages the multi-preference distribution by selecting pairs from Pareto frontiers. Finally, we use a regression loss to fine-tune diffusion models to match the difference between the calibrated rewards of a selected pair. Experimental results show that CaPO consistently outperforms prior methods, such as Direct Preference Optimization (DPO), in both single and multi-reward settings, validated by evaluation on T2I benchmarks, including GenEval and T2I-CompBench.
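The two ingredients named above, win-rate calibration and Pareto-frontier pair selection, can be sketched as follows; the reward arrays and the strict-dominance rule are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def calibrated_reward(sample_rewards, ref_rewards):
    """Calibrate raw rewards into expected win-rates against samples from the
    pretrained (reference) model, per reward model.
    sample_rewards: (N, R) rewards of N candidates under R reward models.
    ref_rewards: (M, R) rewards of M reference samples for the same prompt."""
    wins = (sample_rewards[:, None, :] > ref_rewards[None, :, :]).mean(axis=1)
    return wins                                               # (N, R), values in [0, 1]

def pareto_frontier(scores):
    """Indices of candidates not strictly dominated in every reward dimension."""
    keep = []
    for i in range(scores.shape[0]):
        dominated = np.any(np.all(scores > scores[i], axis=1))
        if not dominated:
            keep.append(i)
    return keep

cand = np.random.rand(8, 3)
ref = np.random.rand(16, 3)
cal = calibrated_reward(cand, ref)
print(pareto_frontier(cal))
```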
Poster
Keyu Tu · Mengqi Huang · Zhuowei Chen · Zhendong Mao
[ ExHall D ]
Abstract
Large-scale text-to-image models evolve rapidly in size and architecture. The existing adapters struggle to keep pace with these models, requiring extensive retraining. This paper proposes a novel adapter transfer framework, A4A (Adapter for Adapter), which uses an all-for-all mapping approach to seamlessly transfer attention-based adapters across different model architectures (e.g., U-Net to transformer). The framework consists of Coupling Space Projection and Upgraded Space Mapping. During Coupling Space Projection, all attention features of the pre-trained adapter are collected to capture the complete coupling relationship with the base model and then projected into the unified space. Randomly initialized learnable features in the upgraded model are introduced to connect the unified space and upgraded space. By integrating the reference features through the attention mechanism and aligning them with the upgraded architecture, the learnable features bridge the discrepancies between the models. Experimental results on personalized image generation tasks demonstrate that A4A outperforms previous methods in transferring adapters while being the first to achieve adapter transfer across model architectures.
Poster
Xiaoying Xing · Avinab Saha · Junfeng He · Susan Hao · Paul Vicol · Moonkyung Ryu · Gang Li · Sahil Singla · Sarah Young · Yinxiao Li · Feng Yang · Deepak Ramachandran
[ ExHall D ]
Abstract
Text-to-image (T2I) generation has made significant advances in recent years, but challenges still remain in the generation of perceptual artifacts, misalignment with complex prompts, and safety. The prevailing approach to address these issues involves collecting human feedback on generated images, training reward models to estimate human feedback, and then fine-tuning T2I models based on the reward models to align them with human preferences. However, while existing reward fine-tuning methods can produce images with higher rewards, they may change model behavior in unexpected ways. For example, fine-tuning for one quality aspect (e.g., safety) may degrade other aspects (e.g., prompt alignment), or may lead to reward hacking (e.g., finding a way to increase rewards without having the intended effect). In this paper, we propose Focus-N-Fix, a region-aware fine-tuning method that trains models to correct only previously problematic image regions. The resulting fine-tuned model generates images with the same high-level structure as the original model but shows significant improvements in regions where the original model was deficient in safety (over-sexualization and violence), plausibility, or other criteria. Our experiments demonstrate that Focus-N-Fix improves these localized quality aspects with little or no degradation to others and typically imperceptible changes in the rest of the image.
Poster
Leigang Qu · Haochuan Li · Wenjie Wang · Xiang Liu · Juncheng Li · Liqiang Nie · Tat-seng Chua
[ ExHall D ]
Abstract
Large Multimodal Models (LMMs) have demonstrated impressive capabilities in multimodal understanding and generation, pushing forward advancements in text-to-image generation. However, achieving accurate text-image alignment for LMMs, particularly in compositional scenarios, remains challenging. Existing approaches, such as layout planning for multi-step generation and learning from human feedback or AI feedback, depend heavily on prompt engineering, costly human annotations, and continual upgrading, limiting flexibility and scalability. In this work, we introduce a model-agnostic iterative self-improvement framework (**SILMM**) that can enable LMMs to provide helpful and scalable self-feedback and optimize text-image alignment via Direct Preference Optimization (DPO). DPO can be readily applied to LMMs that use discrete visual tokens as intermediate image representations, while it is less suitable for LMMs with continuous visual features, as obtaining generation probabilities is challenging. To adapt SILMM to LMMs with continuous features, we propose a diversity mechanism to obtain diverse representations and a kernel-based continuous DPO for alignment. Extensive experiments on three compositional text-to-image generation benchmarks validate the effectiveness and superiority of SILMM, showing improvements exceeding 30% on T2I-CompBench++ and around 20% on DPG-Bench.
Poster
Chongjian GE · Chenfeng Xu · Yuanfeng Ji · Chensheng Peng · Masayoshi Tomizuka · Ping Luo · Mingyu Ding · Varun Jampani · Wei Zhan
[ ExHall D ]
Abstract
Recent breakthroughs in text-guided image generation have significantly advanced the field of 3D generation. While generating a single high-quality 3D object is now feasible, generating multiple objects with reasonable interactions within a 3D space, a.k.a. compositional 3D generation, presents substantial challenges. This paper introduces CompGS, a novel generative framework that employs 3D Gaussian Splatting (GS) for efficient, compositional text-to-3D content generation. To achieve this goal, two core designs are proposed: (1) *3D Gaussians Initialization with 2D compositionality*: We transfer the well-established 2D compositionality to initialize the Gaussian parameters on an entity-by-entity basis, ensuring both consistent 3D priors for each entity and reasonable interactions among multiple entities; (2) *Dynamic Optimization*: We propose a dynamic strategy to optimize 3D Gaussians using Score Distillation Sampling (SDS) loss. CompGS first automatically decomposes 3D Gaussians into distinct entity parts, enabling optimization at both the entity and composition levels. Additionally, CompGS optimizes across objects of varying scales by dynamically adjusting the spatial parameters of each entity, enhancing the generation of fine-grained details, particularly in smaller entities. Qualitative comparisons and quantitative evaluations on T3Bench demonstrate the effectiveness of CompGS in generating compositional 3D objects with superior image quality and semantic alignment over existing methods. CompGS can also …
Poster
Yiming Qin · Zhu Xu · Yang Liu
[ ExHall D ]
Abstract
In recent years, text-to-3D generation has made great progress and can generate many exquisite 3D objects. However, due to the weakness of text encoders in processing long text, text-to-3D generation with complex attributes has remained difficult, and results degrade further when these attributes involve severe occlusion relationships. We therefore propose Hierarchical Chain-of-Generation (HCoG), a method that requires no manual effort: it utilizes a large language model to decompose a complex target object into a hierarchical generation chain so that each part can be generated well. For each split text, SAM automatically finds the corresponding region and optimizes the 3D Gaussian kernels in that region in a controllable way. In addition, when generating new parts along the hierarchical chain, previous parts must be preserved while new parts are optimized. We therefore propose Label Elimination to ensure that new parts do not attach to the surface of previous parts or alter them. Experiments demonstrate that HCoG is an end-to-end automatic framework for text-to-3D generation with complex attributes that effectively handles heavy occlusions between attributes while ensuring high-quality results.
Poster
Yidi Li · Jun Xiao · Zhengda Lu · Yiqun Wang · Haiyong Jiang
[ ExHall D ]
Abstract
This work presents a novel text-to-vector graphics generation approach, Dream3DVG, allowing for arbitrary viewpoint viewing, progressive detail optimization, and view-dependent occlusion awareness. Our approach is a dual-branch optimization framework, consisting of an auxiliary 3D Gaussian Splatting optimization branch and a 3D vector graphics optimization branch. The introduced 3DGS branch can bridge the domain gaps between text prompts and vector graphics with more consistent guidance. Moreover, 3DGS allows for progressive detail control by scheduling classifier-free guidance, facilitating guiding vector graphics with coarse shapes at the initial stages and finer details at later stages. We also improve the view-dependent occlusions by devising a visibility-awareness rendering module. Extensive results on 3D sketches and 3D iconographies demonstrate the superiority of the method across different abstraction levels of detail, cross-view consistency, and occlusion-aware stroke culling.
Poster
Chen Liang · Lianghua Huang · Jingwu Fang · Huanzhang Dou · Wei Wang · Zhi-Fan Wu · Yupeng Shi · Junge Zhang · Xin Zhao · Yu Liu
[ ExHall D ]
Abstract
Recent advancements in image generation models enable the creation of high-quality images and targeted modifications based on textual instructions. Some models even support multimodal complex guidance and demonstrate robust task generalization capabilities. However, they still fall short of meeting the nuanced, professional demands of designers. To bridge this gap, we introduce IDEA-Bench, a comprehensive benchmark designed to advance image generation models toward applications with robust task generalization. IDEA-Bench comprises 97 professional image generation tasks and 266 specific cases, categorized into five major types based on the current capabilities of existing models. Furthermore, we provide a representative subset of 18 tasks with enhanced evaluation criteria to facilitate more nuanced and reliable evaluations using Multimodal Large Language Models (MLLMs). By assessing models' ability to comprehend and execute novel, complex tasks, IDEA-Bench paves the way toward the development of generative models with autonomous and versatile visual generation capabilities.
Poster
Shalini Maiti · Lourdes Agapito · Filippos Kokkinos
[ ExHall D ]
Abstract
The rapid advancements in text-to-3D generation necessitate robust and scalable evaluation metrics that align closely with human judgment—a need unmet by current metrics such as PSNR and CLIP, which require ground-truth data or focus only on prompt fidelity. To address this, we introduce Gen3DEval, a novel evaluation framework that leverages vision large language models (vLLMs) specifically fine-tuned for 3D object quality assessment. Gen3DEval evaluates text fidelity, appearance, and surface quality—by analyzing 3D surface normals—without requiring ground-truth comparisons, bridging the gap between automated metrics and user preferences. Compared to state-of-the-art task-agnostic models, Gen3DEval demonstrates superior performance in user-aligned evaluations. To support and encourage further research in this field, we will release both our code and benchmark, establishing Gen3DEval as a comprehensive and accessible tool for text-to-3D evaluation.
Poster
Jiahao Li · Weijian Ma · Xueyang Li · Yunzhong Lou · Guichun Zhou · Xiangdong Zhou
[ ExHall D ]
Abstract
Large Language Models (LLMs) have achieved significant success recently, leading to growing enthusiasm to expand their generative capabilities from general text to specialized domains. This paper discusses generating parametric sequences of computer-aided design (CAD) models via LLMs. This serves as an entry point for 3D shape generation by LLMs, as CAD model parameters directly map to shapes in 3D space. This task is non-trivial despite LLMs' strong generative abilities, as they have neither encountered parametric sequences during pretraining nor can they directly perceive 3D shapes. Therefore, we propose CAD-Llama, a paradigm empowering pretrained LLMs for parametric 3D CAD model generation. Specifically, we introduce a code-like format to unify parametric 3D CAD command sequences and a hierarchical annotation pipeline to convert intricate parametric 3D CAD shapes into Python-like pseudo-code, an LLM-friendly text format. Additionally, we propose adaptive pre-training on Structured Parametric CAD Code (SPCC) and fine-tuning with CAD-related instructions to imbue LLMs with spatial information conveyed in parametric sequences. Experimental results show that our framework significantly outperforms previous autoregressive methods and prevailing LLM baselines.
Poster
Yunqi Gu · Ian Huang · Jihyeon Je · Guandao Yang · Leonidas Guibas
[ ExHall D ]
Abstract
3D graphics editing is a crucial component in applications like movie production and game design, yet it remains a time-consuming process that demands highly specialized domain expertise. Automating the process is challenging because graphical editing requires performing different tasks, each requiring distinct skill sets. Recently, multi-modal foundation models have emerged as a powerful framework for automating the editing process, but their development and evaluation are bottlenecked by the lack of a comprehensive benchmark that requires human-level perception and real-world editing complexity. In this work, we present BlenderGym, a benchmark designed to systematically evaluate foundational model systems for 3D graphics editing, with tasks capturing the various aspects of 3D editing and fixed ground-truth for evaluation. We evaluate closed- and open-source VLMs with BlenderGym and observe that even state-of-the-art VLMs struggle with tasks that are relatively easy for a novice Blender user. Enabled by BlenderGym, we study how inference scaling techniques impact graphics editing tasks. Notably, our findings reveal that the verifier used to guide the scaling of generation can itself be improved through scaling, complementing recent insights on scaling of LLM generation in coding and math tasks. We further show that inference compute is not uniformly effective and can be optimized by …
Poster
Zhipeng Xu · De Cheng · XINYANG JIANG · Nannan Wang · Dongsheng Li · Xinbo Gao
[ ExHall D ]
Abstract
Single domain generalization (SDG) aims to learn a robust model, which could perform well on many unseen domains while there is only one single domain available for training. One of the promising directions for achieving single-domain generalization is to generate out-of-domain (OOD) training data through data augmentation or image generation. Given the rapid advancements in AI-generated content (AIGC), this paper is the first to propose leveraging powerful pre-trained text-to-image (T2I) foundation models to create the training data. However, manually designing textual prompts to generate images for all possible domains is often impractical, and some domain characteristics may be too abstract to describe with words. To address these challenges, we propose a novel Progressive Adversarial Prompt Tuning (PAPT) framework for pre-trained diffusion models. Instead of relying on static textual domains, our approach learns two sets of abstract prompts as conditions for the diffusion model: one that captures domain-invariant category information and another that models domain-specific styles. This adversarial learning mechanism enables the T2I model to generate images in various domain styles while preserving key categorical features. Extensive experiments demonstrate the effectiveness of the proposed method, achieving superior performances to state-of-the-art single-domain generalization approaches. Code is available in the supplementary materials.
Poster
Byung Hyun Lee · Sungjin Lim · Se Young Chun
[ ExHall D ]
Abstract
Fine-tuning based concept erasing has demonstrated promising results in preventing generation of harmful contents from text-to-image diffusion models by removing target concepts while preserving remaining concepts. To maintain the generation capability of diffusion models after concept erasure, it is necessary to remove only the image region containing the target concept when it locally appears in an image, leaving other regions intact. However, prior arts often compromise fidelity of the other image regions in order to erase the localized target concept appearing in a specific area, thereby reducing the overall performance of image generation. To address these limitations, we first introduce a framework called localized concept erasure, which allows for the deletion of only the specific area containing the target concept in the image while preserving the other regions. As a solution for localized concept erasure, we propose a training-free approach, dubbed Gated Low-rank adaptation for Concept Erasure (GLoCE), that injects a lightweight module into the diffusion model. GLoCE consists of low-rank matrices and a simple gate, determined using only a few generation steps per concept, without training. By directly applying GLoCE to image embeddings and designing the gate to activate only for target concepts, GLoCE can selectively remove only the …
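A generic sketch of a gated low-rank residual module in the spirit of the abstract: a LoRA-style update that only fires when a simple gate detects a target-concept direction. The gate parameterization below is an assumption, not GLoCE's actual construction.

```python
import torch
import torch.nn as nn

class GatedLowRank(nn.Module):
    """A gated low-rank residual module: a LoRA-style update applied to
    embeddings only when a simple gate fires (generic sketch)."""

    def __init__(self, dim: int, rank: int = 4):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        self.gate_dir = nn.Parameter(torch.randn(dim))   # direction of the target concept
        self.gate_bias = nn.Parameter(torch.tensor(-2.0))

    def forward(self, x):                                # x: (B, L, dim) embeddings
        # Gate approaches 1 only when the input aligns with the concept direction.
        gate = torch.sigmoid(x @ self.gate_dir + self.gate_bias)   # (B, L)
        return x + gate.unsqueeze(-1) * self.up(self.down(x))

m = GatedLowRank(dim=128)
print(m(torch.randn(2, 7, 128)).shape)
```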
Poster
Dohyun Kim · Sehwan Park · GeonHee Han · Seung Wook Kim · Paul Hongsuck Seo
[ ExHall D ]
Abstract
Diffusion models have emerged as a cornerstone of generative modeling, capable of producing high-quality images through a progressive denoising process. However, their remarkable performance comes with substantial computational costs, driven by large model sizes and the need for multiple sampling steps. Knowledge distillation, a popular approach for model compression, transfers knowledge from a complex teacher model to a simpler student model. While extensively studied for recognition tasks, its application to diffusion models—especially for generating unseen concepts absent from training images—remains relatively unexplored. In this work, we propose a novel approach called random conditioning, which pairs noised images with randomly chosen text conditions to enable efficient, image-free knowledge distillation. By leveraging random conditioning, we show that it is possible to generate unseen concepts not included in the training data. When applied to conditional diffusion model distillation, this method enables the student model to effectively explore the condition space, leading to notable performance gains. Our approach facilitates the resource-efficient deployment of generative diffusion models, broadening their accessibility for both research and practical applications.
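A schematic distillation step with random conditioning: with some probability a noised image is paired with another sample's caption, so the student is also trained on conditions for which it has no image. All interfaces here (`student`, `teacher`, `encode_text`, `add_noise`) are assumed stand-ins, not a specific library API.

```python
import torch

def random_conditioning_distill_step(student, teacher, images, captions,
                                     encode_text, add_noise, p_random=0.5):
    """One schematic image-free-style distillation step with random conditioning."""
    B = images.size(0)
    # With probability p_random, swap in a caption from another sample.
    swap = torch.rand(B) < p_random
    idx = torch.arange(B)
    idx[swap] = idx[swap][torch.randperm(int(swap.sum()))]
    cond = encode_text([captions[i] for i in idx.tolist()])

    noisy, t = add_noise(images)
    with torch.no_grad():
        target = teacher(noisy, t, cond)
    pred = student(noisy, t, cond)
    return torch.nn.functional.mse_loss(pred, target)

# Toy stand-ins so the sketch executes end to end.
toy = lambda noisy, t, cond: noisy + cond.mean() * 0.0
loss = random_conditioning_distill_step(
    student=toy, teacher=toy,
    images=torch.randn(4, 3, 8, 8),
    captions=["a", "b", "c", "d"],
    encode_text=lambda caps: torch.randn(len(caps), 16),
    add_noise=lambda x: (x + torch.randn_like(x), torch.rand(x.size(0))))
print(loss)
```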
Poster
Reza Shirkavand · Peiran Yu · Shangqian Gao · Gowthami Somepalli · Tom Goldstein · Heng Huang
[ ExHall D ]
Abstract
Recent advances in diffusion generative models have yielded remarkable progress. While the quality of generated content continues to improve, these models have grown considerably in size and complexity. This increasing computational burden poses significant challenges, particularly in resource-constrained deployment scenarios such as mobile devices. The combination of model pruning and knowledge distillation has emerged as a promising solution to reduce computational demands while preserving generation quality. However, this technique inadvertently propagates undesirable behaviors, including the generation of copyrighted content and unsafe concepts, even when such instances are absent from the fine-tuning dataset. In this paper, we propose a novel bilevel optimization framework for pruned diffusion models that consolidates the fine-tuning and unlearning processes into a unified phase. Our approach maintains the principal advantages of distillation—namely, efficient convergence and style transfer capabilities—while selectively suppressing the generation of unwanted content. This plug-in framework is compatible with various pruning and concept unlearning methods, facilitating efficient, safe deployment of diffusion models in controlled environments.
Poster
Jisu Nam · Soowon Son · Zhan Xu · Jing Shi · Difan Liu · Feng Liu · Seungryong Kim · Yang Zhou
[ ExHall D ]
Abstract
We introduce Visual Persona, a foundation model for text-to-image full-body human customization that, given a single in-the-wild human image, generates diverse images of the individual guided by text descriptions. Unlike prior methods that focus solely on preserving facial identity, our approach captures detailed full-body appearance, aligning with text descriptions for body structure and scene variations. Training this model requires large-scale paired human data, consisting of multiple images per individual with consistent full-body identities, which is notoriously difficult to obtain. To address this, we propose a data curation pipeline leveraging vision language models to evaluate full-body appearance consistency, resulting in Visual Persona-500K—a dataset of 580k paired human images across 100k unique identities. For precise appearance transfer, we introduce a transformer encoder-decoder architecture adapted to a pre-trained text-to-image diffusion model, which augments the input image into distinct body regions, encodes these regions as local appearance features, and projects them into dense identity embeddings independently to condition the diffusion model for synthesizing customized images. Visual Persona consistently surpasses existing approaches, generating high-quality, customized images from in-the-wild inputs. Extensive ablation studies validate design choices, and we demonstrate the versatility of Visual Persona across various downstream tasks. The code and pre-trained weights will be publicly …
Poster
Alexandra Gomez-Villa · Kai Wang · C.Alejandro Parraga · Bartłomiej Twardowski · Jesus Malo · Javier Vazquez-Corral · Joost van de Weijer
[ ExHall D ]
Abstract
Visual illusions in humans arise when interpreting out-of-distribution stimuli: if the observer is adapted to certain statistics, perception of outliers deviates from reality. Recent studies have shown that artificial neural networks (ANNs) can also be deceived by visual illusions. This revelation raises profound questions about the nature of visual information. Why are two independent systems, both human brains and ANNs, susceptible to the same illusions? Should any ANN be capable of perceiving visual illusions? Are these perceptions a feature or a flaw? In this work, we study how visual illusions are encoded in diffusion models. Remarkably, we show that they present human-like brightness/color shifts in their latent space. We use this fact to demonstrate that diffusion models can predict visual illusions. Furthermore, we also show how to generate new unseen visual illusions in realistic images using text-to-image diffusion models. We validate this ability through psychophysical experiments that show how our model-generated illusions also fool humans.
Poster
Zhenguang Liu · Chao Shuai · Shaojing Fan · Ziping Dong · Jinwu Hu · Zhongjie Ba · Kui Ren
[ ExHall D ]
Abstract
Diffusion models have achieved remarkable success in novel view synthesis, but their reliance on large, diverse, and often untraceable Web datasets has raised pressing concerns about image copyright protection. Current methods fall short in reliably identifying unauthorized image use, as they struggle to generalize across varied generation tasks and fail when the training dataset includes images from multiple sources with few identifiable (watermarked or poisoned) samples. In this paper, we present novel evidence that diffusion-generated images faithfully preserve the statistical properties of their training data, particularly reflected in their spectral features. Leveraging this insight, we introduce \emph{CoprGuard}, a robust frequency domain watermarking framework to safeguard against unauthorized image usage in diffusion model training and fine-tuning. CoprGuard demonstrates remarkable effectiveness against a wide range of models, from naive diffusion models to sophisticated text-to-image models, and is robust even when watermarked images comprise a mere 1\% of the training dataset. This robust and versatile approach empowers content owners to protect their intellectual property in the era of AI-driven image generation.
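As a toy illustration of frequency-domain watermarking (not CoprGuard itself), the sketch below perturbs an image's magnitude spectrum with a keyed pseudo-random pattern and checks for it by correlating the log-magnitude spectrum with that pattern; the embedding rule and detector are illustrative assumptions.

```python
import numpy as np

def embed_spectral_watermark(img, key=0, strength=0.05):
    """Embed a keyed pseudo-random watermark in the magnitude spectrum.
    img: (H, W) grayscale array in [0, 1]."""
    rng = np.random.default_rng(key)
    spec = np.fft.fft2(img)
    mark = rng.standard_normal(img.shape)
    marked = np.abs(spec) * (1.0 + strength * mark) * np.exp(1j * np.angle(spec))
    out = np.real(np.fft.ifft2(marked))
    return np.clip(out, 0.0, 1.0), mark

def detect_spectral_watermark(img, mark):
    """Correlate the image's log-magnitude spectrum with the watermark pattern."""
    logmag = np.log1p(np.abs(np.fft.fft2(img)))
    a = logmag - logmag.mean()
    b = mark - mark.mean()
    return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

img = np.random.rand(64, 64)
wm_img, mark = embed_spectral_watermark(img)
print(detect_spectral_watermark(wm_img, mark), detect_spectral_watermark(img, mark))
```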
Poster
Haoyu Chen · Yunqiao Yang · Nan Zhong · Kede Ma
[ ExHall D ]
Abstract
Hiding data in deep neural networks (DNNs) has achieved remarkable successes, including both discriminative and generative models. Yet, the potential for hiding images in diffusion models remains underdeveloped. Existing approaches fall short in extraction fidelity, secrecy, and efficiency. In particular, the intensive computational demands of the hiding process, coupled with the slow extraction due to multiple denoising stages, make these methods impractical for resource-limited environments. To address these challenges, we propose hiding images at a specific denoising stage in diffusion models by modifying the learned score functions. We also introduce a parameter-efficient fine-tuning (PEFT) approach that combines parameter selection with a variant of low-rank adaptation (LoRA) to boost secrecy and hiding efficiency. Comprehensive experiments demonstrate the effectiveness of our proposed method.
Poster
Jan Dubiński · Antoni Kowalczuk · Franziska Boenisch · Adam Dziedzic
[ ExHall D ]
Abstract
Diffusion Models (DMs) benefit from large and diverse datasets for their training. Since this data is often scraped from the Internet without permission from the data owners, this raises concerns about copyright and intellectual property protections. While (illicit) use of data is easily detected for training samples perfectly re-created by a DM at inference time, it is much harder for data owners to verify if their data was used for training when the outputs from the suspect DM are not close replicas. Conceptually, membership inference attacks (MIAs), which detect if a given data point was used during training, present themselves as a suitable tool to address this challenge. However, we demonstrate that existing MIAs are not strong enough to reliably determine the membership of individual images in large, state-of-the-art DMs. To overcome this limitation, we propose CDI, a framework for data owners to identify whether their dataset was used to train a given DM. CDI relies on dataset inference techniques, i.e., instead of using the membership signal from a single data point, CDI leverages the fact that most data owners, such as providers of stock photography, visual media companies, or even individual artists, own datasets with multiple publicly exposed data …
Poster
Fabrizio Guillaro · Giada Zingarini · Ben Usman · Avneesh Sud · Davide Cozzolino · Luisa Verdoliva
[ ExHall D ]
Abstract
Successful forensic detectors can produce excellent results in supervised learning benchmarks but struggle to transfer to real-world applications. We believe this limitation is largely due to inadequate training data quality. While most research focuses on developing new algorithms, less attention is given to training data selection, despite evidence that performance can be strongly impacted by spurious correlations such as content, format, or resolution. A well-designed forensic detector should detect generator specific artifacts rather than reflect data biases. To this end, we propose a bias-free training paradigm, where fake images are generated from real ones using the conditioning procedure of stable diffusion models. This ensures semantic alignment between real and fake images, allowing any differences to stem solely from the subtle artifacts introduced by AI generation. Through content-based augmentation, we show significant improvements in both generalization and robustness over state-of-the-art detectors and more calibrated results across 27 different generative models, including recent releases, like FLUX and Stable Diffusion 3.5. Our findings emphasize the importance of a careful dataset curation, highlighting the need for further research in dataset design. Code and data will be publicly available.
Poster
Antonio Andrea Gargiulo · Donato Crisostomi · Maria Sofia Bucarelli · Simone Scardapane · Fabrizio Silvestri · Emanuele Rodolà
[ ExHall D ]
Abstract
Task Arithmetic has emerged as a simple yet effective method to merge models without additional training. However, by treating entire networks as flat parameter vectors, it overlooks key structural information and is susceptible to task interference. In this paper, we study task vectors at the layer level, focusing on task layer matrices and their singular value decomposition. In particular, we concentrate on the resulting singular vectors, which we refer to as Task Singular Vectors (TSV). Recognizing that layer task matrices are often low-rank, we propose TSV-Compress, a simple procedure that compresses them to 10\% of their original size while retaining 99\% of accuracy. We further leverage this low-rank space to define a new measure of task interference based on the interaction of singular vectors from different tasks. Building on these findings, we introduce TSV-Merge, a novel model merging approach that combines compression with interference reduction, significantly outperforming existing methods.
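The compression and interference ideas can be sketched directly with an SVD over a layer's task matrix (finetuned minus pretrained weights); the 10% rank budget and the subspace-overlap interference measure below are illustrative choices, not the paper's exact definitions.

```python
import torch

def tsv_compress(task_matrix, keep_ratio=0.1):
    """Compress a layer's task matrix by keeping only its top singular
    components, exploiting the low-rank structure noted in the abstract."""
    U, S, Vh = torch.linalg.svd(task_matrix, full_matrices=False)
    k = max(1, int(keep_ratio * S.numel()))
    return U[:, :k] @ torch.diag(S[:k]) @ Vh[:k, :]

def task_interference(task_a, task_b, k=8):
    """One possible interference measure: overlap of the two tasks' leading
    left-singular subspaces (1 = identical subspaces, 0 = orthogonal)."""
    Ua = torch.linalg.svd(task_a, full_matrices=False).U[:, :k]
    Ub = torch.linalg.svd(task_b, full_matrices=False).U[:, :k]
    return torch.linalg.norm(Ua.T @ Ub) ** 2 / k

A = torch.randn(256, 256)
B = torch.randn(256, 256)
print(tsv_compress(A).shape, task_interference(A, B))
```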
Poster
Dimitrios Karageorgiou · Symeon Papadopoulos · Ioannis Kompatsiaris · Efstratios Gavves
[ ExHall D ]
Abstract
Recent works have established that AI models introduce spectral artifacts into generated images and propose approaches for learning to capture them using labeled data. However, the significant differences in such artifacts among different generative models hinder these approaches from generalizing to generators not seen during training. In this work, we build upon the key idea that the spectral distribution of real images constitutes both an invariant and highly discriminative pattern for AI-generated image detection. To model this under a self-supervised setup, we employ masked spectral learning using the pretext task of frequency reconstruction. Since generated images constitute out-of-distribution samples for this model, we propose spectral reconstruction similarity to capture this divergence. Moreover, we introduce spectral context attention, which enables our approach to efficiently capture subtle spectral inconsistencies in images of any resolution. Our spectral AI-generated image detection approach (SPAI) achieves a 5.5% absolute improvement in AUC over the previous state-of-the-art across 13 recent generative approaches, while exhibiting robustness against common online perturbations. We will publicly share our code and data upon acceptance.
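A bare-bones rendition of the spectral-reconstruction-similarity idea: mask part of the spectrum, let a reconstruction model (here a trivial stand-in) restore the image, and score the agreement on the masked frequencies. The masking scheme and score below are assumptions, not SPAI's actual components.

```python
import numpy as np

def spectral_reconstruction_similarity(img, reconstruct, mask_ratio=0.5, key=0):
    """Score how well a frequency-reconstruction model (trained only on real
    images) restores masked spectral content; generated images are expected
    to score lower. `reconstruct` is an assumed callable, not the SPAI model."""
    rng = np.random.default_rng(key)
    spec = np.fft.fft2(img)
    mask = rng.random(img.shape) < mask_ratio            # frequencies to hide
    masked_img = np.real(np.fft.ifft2(np.where(mask, 0.0, spec)))
    rec_spec = np.fft.fft2(reconstruct(masked_img))
    a, b = np.abs(spec)[mask], np.abs(rec_spec)[mask]
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

# Toy stand-in reconstruction (identity); in practice this would be learned
# self-supervised on real images only.
img = np.random.rand(64, 64)
print(spectral_reconstruction_similarity(img, reconstruct=lambda x: x))
```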
Poster
Jaewoo Song · Daemin Park · Kanghyun Baek · Sangyub Lee · Jooyoung Choi · Eunji Kim · Sungroh Yoon
[ ExHall D ]
Abstract
Developing effective visual inspection models remains challenging due to the scarcity of defect data, especially in new or low-defect-rate manufacturing processes. While recent approaches have attempted to generate defect images using image generation models, producing highly realistic defects remains difficult. In this paper, we propose DefectFill, a novel method for realistic defect generation that requires only a few reference defect images. DefectFill leverages a fine-tuned inpainting diffusion model, optimized with our custom loss functions that incorporate defect, object, and cross-attention terms. This approach enables the inpainting diffusion model to precisely capture detailed, localized defect features and seamlessly blend them into defect-free objects. Additionally, we introduce the Low-Fidelity Selection method to further enhance the quality of the generated defect samples. Experiments demonstrate that DefectFill can generate high-quality defect images, and visual inspection models trained on these images achieve state-of-the-art performance on the MVTec AD dataset.
Poster
Alexander Gielisse · Jan van Gemert
[ ExHall D ]
Abstract
Implicit neural representations (INRs) such as NeRF and SIREN encode a signal in neural network parameters and show excellent results for signal reconstruction. Using INRs for downstream tasks, such as classification, is however not straightforward. Inherent symmetries in the parameters pose classification challenges and current works primarily focus on designing architectures that are equivariant to these symmetries. However, INR-based classification still significantly underperforms compared to pixel-based methods like CNNs. This work presents an end-to-end strategy for initializing SIRENs together with a learned learning-rate scheme to yield representations that improve classification accuracy. We show that a simple, straightforward, Transformer model applied to a meta-learned SIREN, without incorporating explicit symmetry equivariances, outperforms the current state-of-the-art. On the CIFAR-10 SIREN classification task, we improve the state-of-the-art from 38.8% to 60.1%. Moreover, we demonstrate scalability on the high-resolution Imagenette dataset achieving reasonable reconstruction quality with a classification accuracy of 60.94% while, to our knowledge, no other SIREN classification approach has managed to set a baseline for high-resolution images. We will make all code and results available.
Poster
Nathan Mankovich · Ignacio Santamaria · Gustau Camps-Valls · Tolga Birdal
[ ExHall D ]
Abstract
Flag manifolds encode hierarchical nested sequences of subspaces and serve as powerful structures for various computer vision and machine learning applications. Despite their utility in tasks such as dimensionality reduction, motion averaging, and subspace clustering, current applications are often restricted to extracting flags using common matrix decomposition methods like the singular value decomposition. Here, we address the need for a general algorithm to factorize and work with hierarchical datasets. In particular, we propose a novel, flag-based method that decomposes arbitrary hierarchical real-valued data into a hierarchy-preserving flag representation in Stiefel coordinates. Our work harnesses the potential of flag manifolds in applications including denoising, clustering, and few-shot learning.
Poster
Yiwei Bao · Zhiming Wang · Feng Lu
[ ExHall D ]
Abstract
Thanks to the introduction of large-scale datasets, deep-learning has become the mainstream approach for appearance-based gaze estimation problems. However, current large-scale datasets contain annotation errors and provide only a single vector for gaze annotation, lacking key information such as 3D eyeball structures. Limitations in annotation accuracy and variety have constrained the progress in research and development of deep-learning methods for appearance-based gaze-related tasks. In this paper, we present GazeGene, a new large-scale synthetic gaze dataset with photo-realistic samples. More importantly, GazeGene not only provides accurate gaze annotations, but also offers 3D annotations of vital eye structures such as the pupil, iris, eyeball, optic and visual axes for the first time. Experiments show that GazeGene achieves comparable quality and generalization ability with real-world datasets, even outperforms most existing datasets on high-resolution images. Furthermore, its 3D eyeball annotations expand the application of deep-learning methods on various gaze-related tasks, offering new insights into this field.
Poster
Daosong Hu · Mingyue Cui · Kai Huang
[ ExHall D ]
Abstract
Gaze direction serves as a pivotal indicator for assessing the level of driver attention. While image-based gaze estimation has been extensively researched, there has been a recent shift towards capturing gaze direction from video sequences. This approach encounters notable challenges, including the comprehension of the dynamic pupil evolution across frames and the extraction of head pose information from a relatively static background. To surmount these challenges, we introduce a dual-stream deep learning framework that explicitly models the displacement changes of the pupil through a fine-grained inter-frame attention mechanism and generates weights to adjust gaze embeddings. This technique transforms the face into a set of distinct patches and employs cross-attention to ascertain the correlation between pixel displacements in various patches and adjacent frames, thereby tracking spatial dynamics within the sequence. Our method is validated using two publicly available driver gaze datasets, and the results indicate that it achieves state-of-the-art performance or is on par with the best outcomes while reducing the parameters.
Poster
Ziyang Chen · Prem Seetharaman · Bryan Russell · Oriol Nieto · David Bourgin · Andrew Owens · Justin Salamon
[ ExHall D ]
Abstract
Generating sound effects for videos often requires creating artistic sound effects that diverge significantly from real-life sources and flexible control in the sound design. To address this problem, we introduce *MultiFoley*, a model designed for video-guided sound generation that supports multimodal conditioning through text, audio, and video. Given a silent video and a text prompt, MultiFoley allows users to create clean sounds (e.g., skateboard wheels spinning without wind noise) or more whimsical sounds (e.g., making a lion's roar sound like a cat's meow). MultiFoley also allows users to choose reference audio from sound effects (SFX) libraries or partial videos for conditioning. A key novelty of our model lies in its joint training on both internet video datasets with low-quality audio and professional SFX recordings, enabling high-quality, full-bandwidth (48kHz) audio generation. Through automated evaluations and human studies, we demonstrate that *MultiFoley* successfully generates synchronized high-quality sounds across varied conditional inputs and outperforms existing methods.
Poster
Zeyue Tian · Zhaoyang Liu · Ruibin Yuan · Jiahao Pan · Qifeng Liu · Xu Tan · Qifeng Chen · Wei Xue · Yike Guo
[ ExHall D ]
Abstract
We present VidMuse, a simple framework for generating music aligned with video inputs. VidMuse stands out by producing high-fidelity music that is both acoustically and semantically aligned with the video. By incorporating local and global visual cues, VidMuse enables the creation of musically coherent audio tracks that seamlessly match the video content through Long-Short-Term modeling. Furthermore, we present a large-scale dataset comprising 360K video-music pairs encompassing various genres such as movie trailers, advertisements, and documentaries. Through extensive experiments, VidMuse outperforms existing models in terms of audio quality, diversity, and audio-visual alignment.
Poster
Edson Araujo · Andrew Rouditchenko · Yuan Gong · Saurabhchand Bhati · Samuel Thomas · Brian Kingsbury · Leonid Karlinsky · Rogerio Feris · James Glass · Hilde Kuehne
[ ExHall D ]
Abstract
Recent advances in audio-visual learning have shown promising results in learning representations across modalities. However, most approaches rely on global audio representations that fail to capture fine-grained temporal correspondences with visual frames. Additionally, existing methods often struggle with conflicting optimization objectives when trying to jointly learn reconstruction and cross-modal alignment. In this work, we propose CAV-MAE Sync as a simple yet effective extension of the original CAV-MAE framework for self-supervised audio-visual learning. We address three key challenges: First, we tackle the granularity mismatch between modalities by treating audio as a temporal sequence aligned with video frames, rather than using global representations. Second, we resolve conflicting optimization goals by separating contrastive and reconstruction objectives through dedicated global tokens. Third, we improve spatial localization by introducing learnable register tokens that reduce semantic load on patch tokens. We evaluate the proposed approach on AudioSet, VGG Sound, and the ADE20K Sound dataset on zero-shot retrieval, classification and localization tasks, demonstrating state-of-the-art performance and outperforming more complex architectures.
Poster
Henghui Du · Guangyao Li · Chang Zhou · Chunjie Zhang · Alan Zhao · Di Hu
[ ExHall D ]
Abstract
In recent years, numerous tasks have been proposed to encourage models to develop specific capabilities in understanding audio-visual scenes, primarily categorized into temporal localization, spatial localization, spatio-temporal reasoning, and pixel-level understanding. In contrast, humans possess a unified understanding ability across diverse tasks. Therefore, designing an audio-visual model with the general capability to unify these tasks is of great value. However, simply training jointly on all tasks can lead to interference due to the heterogeneity of audio-visual data and the complex relationships among tasks. We argue that this problem can be solved through explicit cooperation among tasks. To achieve this goal, we propose a unified learning method that achieves explicit inter-task cooperation from both the data and the model perspective. Specifically, considering that the labels of existing datasets are simple words, we carefully refine these datasets and construct an Audio-Visual Unified Instruction-tuning dataset with Explicit reasoning process (AV-UIE), which clarifies the cooperative relationships among tasks. Subsequently, to facilitate concrete cooperation in the learning stage, an interaction-aware LoRA structure with multiple LoRA heads is designed to learn different aspects of audio-visual data interaction. By unifying explicit cooperation across the data and model aspects, our method not only surpasses existing unified audio-visual models on multiple tasks, …
Poster
Stefan Smeu · Dragos-Alexandru Boldisor · Dan Oneata · Elisabeta Oneata
[ ExHall D ]
Abstract
Good datasets are essential for developing and benchmarking any machine learning system. Their importance is even more extreme for safety critical applications such as deepfake detection, the focus of this paper. Here we reveal that two of the most widely used audio-video deepfake datasets suffer from a previously unidentified spurious feature: the leading silence. Fake videos start with a very brief moment of silence and based on this feature alone, we can separate the real and fake samples almost perfectly. As such, previous audio-only and audio-video models exploit the presence of silence in the fake videos and consequently perform worse when the leading silence is removed. To circumvent latching on such unwanted artifact and possibly other unrevealed ones we propose a shift from supervised to unsupervised learning by training models exclusively on real data. We show that by aligning self-supervised audio-video representations we remove the risk of relying on dataset-specific biases and improve robustness in deepfake detection.
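The spurious cue itself is easy to measure; the snippet below computes the leading-silence duration of a waveform, the feature the abstract reports as nearly separating real from fake samples. The sample rate and amplitude threshold are illustrative.

```python
import numpy as np

def leading_silence_duration(wave, sr=16000, threshold=1e-3):
    """Length (in seconds) of the leading silence in a mono waveform."""
    active = np.abs(wave) > threshold
    first = np.argmax(active) if active.any() else len(wave)
    return first / sr

# Toy waveform: half a second of silence followed by signal.
silent_start = np.concatenate([np.zeros(8000), np.random.randn(8000) * 0.1])
print(leading_silence_duration(silent_start))   # ~0.5 s of leading silence
```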
Poster
Qiyao Xue · Xiangyu Yin · Boyuan Yang · Wei Gao
[ ExHall D ]
Abstract
Text-to-video (T2V) generation has recently been enabled by transformer-based diffusion models, but current T2V models struggle to adhere to real-world common knowledge and physical rules, due to their limited understanding of physical realism and deficiency in temporal modeling. Existing solutions are either data-driven or require extra model inputs, but cannot generalize to out-of-distribution domains. In this paper, we present PhyT2V, a new data-independent T2V technique that expands current T2V models' capability of video generation to out-of-distribution domains, by enabling chain-of-thought and step-back reasoning in T2V prompting. Our experiments show that PhyT2V improves existing T2V models' adherence to real-world physical rules by 2.3x, and achieves a 35% improvement over T2V prompt enhancers.
Poster
Tianhao Qi · Jianlong Yuan · Wanquan Feng · Shancheng Fang · Jiawei Liu · SiYu Zhou · Qian HE · Hongtao Xie · Yongdong Zhang
[ ExHall D ]
Abstract
Sora has unveiled the immense potential of the Diffusion Transformer (DiT) architecture in single-scene video generation. However, the more challenging task of multi-scene video generation, which offers broader applications, remains relatively underexplored. To address this gap, we introduce Mask2DiT, a novel approach that ensures fine-grained, one-to-one alignment between video segments and their corresponding text annotations. Specifically, we introduce a symmetric binary mask at each attention layer within the DiT architecture. This mask ensures that each text annotation is applied exclusively to its corresponding video segment, while preserving temporal coherence across all visual tokens. With this attention mask facilitating fine-grained, segment-level textual-to-visual alignment, we adapt the DiT architecture for video generation tasks involving a fixed number of scenes. To further equip the DiT architecture with the capability for generating videos with additional scenes, we incorporate a segment-level conditional mask that treats preceding video segments as context for the final segment, thereby enabling auto-regressive scene extension. Both qualitative and quantitative experiments confirm that Mask2DiT excels in maintaining visual consistency across segments while ensuring semantic alignment between each segment and its corresponding text description.
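A minimal sketch of a segment-level attention mask in the spirit described above, where each text annotation can attend only to (and be attended by) its own video segment while video tokens attend to all video tokens; the token counts and layout are assumptions, and this is not the authors' implementation.

```python
# Illustrative only: a symmetric block mask where each text annotation is tied
# to its own video segment, while video tokens keep full temporal attention.
import numpy as np

def segment_attention_mask(n_segments, text_len, video_len):
    n_txt, n_vid = n_segments * text_len, n_segments * video_len
    n = n_txt + n_vid
    mask = np.zeros((n, n), dtype=bool)
    mask[n_txt:, n_txt:] = True                      # video <-> video: all allowed
    for s in range(n_segments):
        t = slice(s * text_len, (s + 1) * text_len)
        v = slice(n_txt + s * video_len, n_txt + (s + 1) * video_len)
        mask[t, t] = True                            # text attends within its annotation
        mask[t, v] = True                            # text -> its video segment
        mask[v, t] = True                            # its video segment -> text (symmetric)
    return mask

m = segment_attention_mask(n_segments=3, text_len=4, video_len=6)
print(m.shape, m.mean())   # (30, 30); only a block-structured subset of pairs is allowed
```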
Poster
Yijie Xu · Bolun Zheng · Wei Zhu · Hangjia Pan · Yuchen Yao · Ning Xu · An-An Liu · Quan Zhang · Chenggang Yan
[ ExHall D ]
Abstract
The social media popularity prediction task aims to predict the popularity of posts on social media platforms, which has a positive driving effect on applications such as content optimization, digital marketing, and online advertising. Though many studies have made significant progress, few pay much attention to integrating popularity prediction with temporal alignment. In this paper, by exploring YouTube’s multilingual and multi-modal content, we construct a new social media temporal popularity prediction benchmark, namely SMTPD, and suggest a baseline framework for temporal popularity prediction. Through data analysis and experiments, we verify that temporal alignment and early popularity play crucial roles in social media popularity prediction, not only deepening the understanding of the temporal dynamics of popularity in social media but also offering guidance for developing more effective prediction models in this field.
Poster
Hui Han · Siyuan Li · Jiaqi Chen · Yiwen Yuan · Yuling Wu · Yufan Deng · Chak Tou Leong · Hanwen Du · Junchen Fu · Youhua Li · Jie Zhang · Chi Zhang · Li-jia Li · Yongxin Ni
[ ExHall D ]
Abstract
Video generation assessment is essential for ensuring that generative models produce visually realistic, high-quality videos while aligning with human expectations. Current video generation benchmarks fall into two main categories: traditional benchmarks, which use metrics and embeddings to evaluate generated video quality across multiple dimensions but often lack alignment with human judgments; and large language model (LLM)-based benchmarks, which, though capable of human-like reasoning, are constrained by a limited understanding of video quality metrics and cross-modal consistency. To address these challenges and establish a benchmark that better aligns with human preferences, this paper introduces HA-Video-Bench, a comprehensive benchmark featuring a rich prompt suite and extensive evaluation dimensions. This benchmark represents the first attempt to systematically leverage MLLMs across all dimensions relevant to video generation assessment in generative models. By incorporating few-shot scoring and chain-of-query techniques, HA-Video-Bench provides a structured, scalable approach to generated video evaluation. Experimental results demonstrate that MLLMs achieve superior alignment with human preferences across all dimensions. Moreover, in instances where our framework’s assessments diverge from human evaluations, it consistently offers more objective and accurate insights, suggesting an even greater potential advantage over traditional human judgment.
Poster
Wang Jiarui · Huiyu Duan · Guangtao Zhai · Juntong Wang · Xiongkuo Min
[ ExHall D ]
Abstract
The rapid advancement of large multimodal models (LMMs) has led to the rapid expansion of artificial intelligence generated videos (AIGVs), which highlights the pressing need for effective video quality assessment (VQA) models designed specifically for AIGVs. Current VQA models generally fall short in accurately assessing the perceptual quality of AIGVs due to the presence of unique distortions, such as unrealistic objects, unnatural movements, or inconsistent visual elements. To address this challenge, we first present AIGVQA-DB, a large-scale dataset comprising 36,576 AIGVs generated by 15 advanced text-to-video models using 1,048 diverse prompts. With these AIGVs, a systematic annotation pipeline including scoring and ranking processes is devised, which collects 370k expert ratings to date. Based on AIGVQA-DB, we further introduce AIGV-Assessor, a novel VQA model that leverages spatiotemporal features and LMM frameworks to capture the intricate quality attributes of AIGVs, thereby accurately predicting precise video quality scores and video pair preferences. Through comprehensive experiments on both AIGVQA-DB and existing AIGV databases, AIGV-Assessor demonstrates state-of-the-art performance, significantly surpassing existing scoring or evaluation methods in terms of multiple perceptual quality dimensions. The dataset and code will be released upon publication.
Poster
Niu Lian · Jun Li · Jinpeng Wang · Ruisheng Luo · Yaowei Wang · Shu-Tao Xia · Bin Chen
[ ExHall D ]
Abstract
Self-Supervised Video Hashing compresses videos into hash codes for efficient indexing and retrieval by learning meaningful video representations without the need for labeled data. State-of-the-art video hashing methods typically rely on random frame sampling, treating all frames equally. This approach leads to suboptimal hash codes by ignoring frame-specific information density and reconstruction difficulty. To address this limitation, we propose AutoSSVH, a method combining adversarial hard frame mining with hash contrastive learning based on Component Voting. Our adversarial sampling strategy automatically identifies and selects frames with higher reconstruction difficulty, discarding easily reconstructable frames to enhance training rigor and encoding capability. Additionally, we leverage Component Voting in hash contrastive learning, using class-specific anchors and the P2Set paradigm to effectively capture neighborhood information and complex inter-video relationships. Extensive experiments demonstrate that AutoSSVH surpasses existing methods in both accuracy and efficiency. Code and configurations will be released publicly.
Poster
Orr Zohar · Xiaohan Wang · Yann Dubois · Nikhil Mehta · Tong Xiao · Philippe Hansen-Estruch · Licheng Yu · Xiaofang Wang · Felix Juefei-Xu · Ning Zhang · Serena Yeung · Xide Xia
[ ExHall D ]
Abstract
Despite the rapid integration of video perception capabilities into Large Multi-modal Models (LMMs), what drives their video perception remains poorly understood. Consequently, many design decisions in this domain are made without proper justification or analysis. The high computational cost of training and evaluating such models, coupled with limited open research, hinders the development of video-LMMs. To address this, we present a comprehensive study that uncovers what effectively drives video understanding in LMMs. We begin by critically examining the primary contributors to the high computational requirements associated with video-LMM research and discover *Scaling Consistency*, wherein design and training decisions made on smaller models and datasets (up to a critical size) effectively transfer to larger models. Leveraging these insights, we explore many video-specific aspects of video-LMMs, including video sampling, architectures, data composition, training schedules, and more. Guided by these findings, we introduce **Apollo**, a state-of-the-art family of LMMs that achieve superior performance across different model sizes. Our models efficiently process videos over an hour long, with the 3B parameter variant outperforming most existing 7B models. **Apollo**-7B is state-of-the-art among 7B LMMs, with 70.9 on MLVU and 63.3 on Video-MME. Our code and models will be made available upon publication.
Poster
Junbo Niu · Yifei Li · Ziyang Miao · Chunjiang Ge · Zhou Yuanhang · Qihao He · Xiaoyi Dong · Haodong Duan · Shuangrui Ding · Rui Qian · Pan Zhang · Yuhang Zang · Yuhang Cao · Conghui He · Jiaqi Wang
[ ExHall D ]
Abstract
Integrating past information and adapting to continuous video input are pivotal for human-level video understanding. Current benchmarks, however, focus on coarse-grained, video-level question-answering in offline settings, limiting real-time processing and adaptability for practical applications. To this end, we introduce **OVBench** (**O**nline-**V**ideo-**Bench**mark), which assesses online video understanding through three modes: (1) **Backward Tracing**, (2) **Real-Time Visual Perception**, and (3) **Forward Active Responding**. OVBench consists of 12 tasks, comprising about 2,800 meta-annotations with fine-grained, event-level timestamps paired with 858 videos across 10 domains, encompassing egocentric activities, virtual gaming worlds, and cinematic scenes. To minimize bias, we employ automated generation pipelines and human annotation for meticulous curation. We design an effective problem generation and evaluation pipeline based on these high-quality samples and densely query Video-LLMs across the video streaming timeline. Extensive evaluations of nine Video-LLMs reveal that despite rapid advancements and improving performance on traditional benchmarks, existing models struggle with online video understanding. Our comprehensive evaluation reveals that the best-performing models still have a significant gap compared to human agents in online video understanding. We anticipate that OVBench will guide the development of Video-LLMs towards practical real-world applications and inspire future research in online video understanding.
Poster
Darshana Saravanan · Varun Gupta · Darshan Singh S · Zeeshan Khan · Vineet Gandhi · Makarand Tapaswi
[ ExHall D ]
Abstract
A fundamental aspect of compositional reasoning in a video is associating people and their actions across time. Recent years have seen great progress in general-purpose vision/video models and a move towards long-video understanding. While exciting, we take a step back and ask: are today’s models good at compositional reasoning on short videos? To this end, we introduce VELOCITI, a benchmark to study Video-LLMs by disentangling and assessing the comprehension of agents, actions, and their associations across multiple events. We adopt the Video-Language Entailment setup and propose StrictVLE that requires correct classification (rather than ranking) of the positive and negative caption. We evaluate several models and observe that even the best, LLaVA-OneVision (42.5%) and GPT-4o (44.3%), are far from human accuracy at 89.6%. Results show that action understanding lags behind agents, and negative captions created using entities appearing in the video perform worse than those obtained from pure text manipulation. We also present challenges with ClassicVLE and multiple-choice (MC) evaluation, strengthening our preference for StrictVLE. Finally, we validate that our benchmark requires visual inputs of multiple frames making it ideal to study video-language compositional reasoning.
Poster
Yuxuan Wang · Yueqian Wang · Bo Chen · Tong Wu · Dongyan Zhao · Zilong Zheng
[ ExHall D ]
Abstract
The rapid advancement of multi-modal language models (MLLMs) like GPT-4o has propelled the development of Omni language models, designed to process and proactively respond to continuous streams of multi-modal data. Despite their potential, evaluating their real-world interactive capabilities in streaming video contexts remains a formidable challenge. In this work, we introduce OmniMMI, a comprehensive multi-modal interaction benchmark tailored for OmniLLMs in streaming video contexts. OmniMMI encompasses over 1,121 real-world interactive videos and 2,290 questions, addressing two critical yet underexplored challenges in existing video benchmarks: streaming video understanding and proactive reasoning, across six distinct subtasks. Moreover, we propose a novel framework, Multi-modal Multiplexing Modeling (M4), designed to enhance real-time interactive reasoning with minimum finetuning on pre-trained MLLMs. Extensive experimental results reveal that the existing MLLMs fall short in interactive streaming understanding, particularly struggling with proactive tasks and multi-turn queries. Our proposed M4, though lightweight, demonstrates a significant improvement in handling proactive tasks and real-time interactions.
Poster
Ziyu Ma · Chenhui Gou · Hengcan Shi · Bin Sun · Shutao Li · Hamid Rezatofighi · Jianfei Cai
[ ExHall D ]
Abstract
Most of the existing methods for video understanding primarily focus on videos only lasting tens of seconds, with limited exploration of techniques for handling long videos. The increased number of frames in long videos poses two main challenges: difficulty in locating key information and performing long-range reasoning. Thus, we propose DrVideo, a document-retrieval-based system designed for long video understanding. Our key idea is to convert the long-video understanding problem into a long-document understanding task so as to effectively leverage the power of large language models. Specifically, DrVideo first transforms a long video into a coarse text-based long document to initially retrieve key frames and then updates the documents with the augmented key frame information. It then employs an agent-based iterative loop to continuously search for missing information and augment the document until sufficient question-related information is gathered for making the final predictions in a chain-of-thought manner. Extensive experiments on long video benchmarks confirm the effectiveness of our method. DrVideo significantly outperforms existing LLM-based state-of-the-art methods on EgoSchema benchmark (3 minutes), MovieChat-1K benchmark (10 minutes), and the long split of Video-MME benchmark (average of 44 minutes).
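An editorial skeleton of the document-retrieval loop described above; `caption_frame`, `llm_answer`, and `llm_find_missing` are hypothetical stubs standing in for the captioner and LLM calls, so only the control flow is meant to be representative.

```python
# Illustrative skeleton of a retrieve-augment-answer loop over a text "document"
# built from a long video. The three stubs below are hypothetical placeholders.
def caption_frame(frame_id):               # hypothetical captioner
    return f"[frame {frame_id}] coarse caption"

def llm_answer(document, question):        # hypothetical LLM call
    return {"confident": len(document) > 7, "answer": "..."}

def llm_find_missing(document, question):  # hypothetical LLM call
    return [len(document) * 10]            # ids of frames to caption in more detail

def dr_video(question, key_frames, max_rounds=3):
    document = [caption_frame(f) for f in key_frames]    # coarse long document
    for _ in range(max_rounds):
        result = llm_answer(document, question)
        if result["confident"]:
            break
        for f in llm_find_missing(document, question):   # augment with missing info
            document.append(caption_frame(f))
    return result["answer"]

print(dr_video("Who opens the door?", key_frames=range(0, 60, 10)))
```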
Poster
Lucas Ventura · Antoine Yang · Cordelia Schmid · Gul Varol
[ ExHall D ]
Abstract
We address the task of video chaptering, i.e., partitioning a long video timeline into semantic units and generating corresponding chapter titles. While relatively underexplored, automatic chaptering has the potential to enable efficient navigation and content retrieval in long-form videos. In this paper, we achieve strong chaptering performance on hour-long videos by efficiently addressing the problem in the text domain with our "Chapter-Llama" framework. Specifically, we leverage a pre-trained large language model (LLM) with a large context window, and feed as input (i) speech transcripts and (ii) captions describing video frames, along with their respective timestamps. Given the inefficiency of exhaustively captioning all frames, we propose a lightweight speech-guided frame selection strategy based on speech transcripts and experimentally demonstrate remarkable advantages. We train the LLM to output timestamps for the chapter boundaries, as well as free-form chapter titles. This simple yet powerful approach scales to processing one-hour long videos in a single forward pass. Our results demonstrate substantial improvements (e.g., 18.7% F1 score) over the state of the art on the recent VidChapters-7M benchmark. To promote further research, we release our code and models.
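As a rough illustration of feeding timestamped transcripts and frame captions to an LLM and reading back chapter boundaries, the sketch below assembles a plain-text prompt and parses "HH:MM:SS title" lines; the exact prompt format and output convention used by Chapter-Llama are not given in the abstract, so these are assumptions.

```python
# Illustrative only: interleave timestamped ASR and caption lines into one
# prompt, then parse "HH:MM:SS title" lines from a model's reply.
import re

def build_prompt(transcripts, captions):
    """transcripts/captions: lists of (seconds, text); merged in time order."""
    items = [(t, "ASR", x) for t, x in transcripts] + [(t, "Caption", x) for t, x in captions]
    lines = [f"{int(t)//3600:02d}:{int(t)%3600//60:02d}:{int(t)%60:02d} [{k}] {x}"
             for t, k, x in sorted(items)]
    return "\n".join(lines) + "\nList the chapters as 'HH:MM:SS title':"

def parse_chapters(llm_output):
    return re.findall(r"(\d{2}:\d{2}:\d{2})\s+(.+)", llm_output)

prompt = build_prompt([(3, "welcome to the show"), (65, "let's start cooking")],
                      [(10, "a kitchen with a chef"), (70, "chopping onions")])
print(prompt)
print(parse_chapters("00:00:00 Intro\n00:01:05 Cooking begins"))
```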
Poster
Tiantian Geng · Jinrui Zhang · Qingni Wang · Teng Wang · Jinming Duan · Feng Zheng
[ ExHall D ]
Abstract
Despite impressive advancements in video understanding, most efforts remain limited to coarse-grained or visual-only video tasks. However, real-world videos encompass omni-modal information (vision, audio, and speech) with a series of events forming a cohesive storyline. The lack of multi-modal video data with fine-grained event annotations and the high cost of manual labeling are major obstacles to comprehensive omni-modality video perception. To address this gap, we propose an automatic pipeline consisting of high-quality multi-modal video filtering, semantically coherent omni-modal event boundary detection, and cross-modal correlation-aware event captioning. In this way, we present LongVALE, the first-ever Vision-Audio-Language Event understanding benchmark comprising 105K omni-modal events with precise temporal boundaries and detailed relation-aware captions within 8.4K high-quality long videos. Further, we build a baseline that leverages LongVALE to enable video large language models (LLMs) for omni-modality fine-grained temporal video understanding for the first time. Extensive experiments demonstrate the effectiveness and great potential of LongVALE in advancing comprehensive multi-modal video understanding.
Poster
Yuqian Yuan · Hang Zhang · Wentong Li · Zesen Cheng · Boqiang Zhang · Long Li · Xin Li · Deli Zhao · Wenqiao Zhang · Yueting Zhuang · Jianke Zhu · Lidong Bing
[ ExHall D ]
Abstract
Video Large Language Models (Video LLMs) have recently exhibited remarkable capabilities in general video understanding. However, they mainly focus on holistic comprehension and struggle with capturing fine-grained spatial and temporal details. Besides, the lack of high-quality object-level video instruction data and a comprehensive benchmark further hinders their advancements. To tackle these challenges, we introduce the VideoRefer Suite to empower Video LLMs for finer-level spatial-temporal video understanding, i.e., enabling perception and reasoning on any objects throughout the video. Specifically, we thoroughly develop the VideoRefer Suite across three essential aspects: dataset, model, and benchmark. Firstly, we introduce a multi-agent data engine to meticulously curate a large-scale, high-quality object-level video instruction dataset, termed VideoRefer-700K. Next, we present the VideoRefer model, which is equipped with a versatile spatial-temporal object encoder to capture precise regional and sequential representations. Finally, we meticulously create VideoRefer-Bench to comprehensively assess the spatial-temporal understanding capability of a Video LLM, evaluating it across various aspects. Extensive experiments and analyses demonstrate that our VideoRefer model not only achieves promising performance on video referring benchmarks but also facilitates general video understanding capabilities. Our codes, models, and dataset will be made publicly available.
Poster
Min Jung Lee · Dayoung Gong · Minsu Cho
[ ExHall D ]
Abstract
The exponential increase in video content poses significant challenges in terms of efficient navigation, search, and retrieval, thus requiring advanced video summarization techniques. Existing video summarization methods, which heavily rely on visual features and temporal dynamics, often fail to capture the semantics of video content, resulting in incomplete or incoherent summaries. To tackle the challenge, we propose a new video summarization framework that leverages the capabilities of recent Large Language Models (LLMs), expecting that the knowledge learned from massive data enables LLMs to evaluate video frames in a manner that better aligns with diverse semantics and human judgments, effectively addressing the inherent subjectivity in defining keyframes. Our method, dubbed LLM-based Video Summarization (LLMVS), translates video frames into a sequence of captions using an image caption model and then assesses the importance of each frame using an LLM, based on the captions in its local context. These local importance scores are refined through a global attention mechanism in the entire context of video captions, ensuring that our summaries effectively reflect both the details and the overarching narrative. Our experimental results demonstrate the superiority of the proposed method over existing ones in standard benchmarks, highlighting the potential of LLMs in the processing …
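A small sketch of the two-step scoring idea described above, with a random stub in place of the LLM that rates each caption in its local window and a single self-attention pass over caption embeddings acting as the global refinement; the dimensions and the exact attention form are assumptions.

```python
# Illustrative only: local frame-importance scores (stubbed here) refined by a
# global attention pass over caption embeddings before picking summary frames.
import numpy as np

rng = np.random.default_rng(0)
captions_emb = rng.standard_normal((20, 64))        # 20 frame captions, d=64
local_scores = rng.uniform(size=20)                 # stub for LLM local importance

def refine_globally(emb, scores):
    attn = emb @ emb.T / np.sqrt(emb.shape[1])      # caption-to-caption affinities
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)         # row-wise softmax
    return attn @ scores                            # each score mixed with global context

refined = refine_globally(captions_emb, local_scores)
summary_frames = np.argsort(refined)[-5:]           # pick top-5 frames as the summary
print(summary_frames)
```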
Poster
Keda Tao · Can Qin · Haoxuan You · Yang Sui · Huan Wang
[ ExHall D ]
Abstract
Video large language models (VLLMs) have recently advanced significantly in processing complex video content, yet their inference efficiency remains constrained because of the high computational cost stemming from the thousands of visual tokens generated from the video inputs. We empirically observe that, unlike single-image inputs, VLLMs typically attend to visual tokens from different frames at different decoding iterations, making a one-shot pruning strategy prone to removing important tokens by mistake. Motivated by this, we present DyCoke, a training-free token compression method to optimize token representation and accelerate VLLMs. DyCoke incorporates a plug-and-play temporal compression module to minimize temporal redundancy by merging redundant tokens across frames, and applies dynamic KV cache reduction to prune spatially redundant tokens selectively. It ensures high-quality inference by dynamically retaining the critical tokens at each decoding step. Extensive experimental results demonstrate that DyCoke can outperform the prior SoTA counterparts, achieving a 1.5× inference speedup and a 1.4× memory reduction against the baseline VLLM, while still improving performance, with no training.
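One way to picture the temporal-redundancy pruning described above is sketched below: tokens that are nearly identical to the same spatial position in the previous frame are dropped. The cosine threshold and token layout are assumptions rather than DyCoke's exact procedure.

```python
# Illustrative only: drop visual tokens that barely change between consecutive
# frames, keeping frame 0 plus only the "changed" patches afterwards.
import numpy as np

def prune_temporal_tokens(tokens, thresh=0.9):
    """tokens: (frames, patches, dim). Keeps frame 0, then only changed patches."""
    f, p, d = tokens.shape
    kept = [(0, j) for j in range(p)]
    for i in range(1, f):
        a, b = tokens[i], tokens[i - 1]
        cos = (a * b).sum(-1) / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + 1e-8)
        kept += [(i, j) for j in np.where(cos < thresh)[0]]
    return kept

toks = np.random.randn(4, 16, 32)
toks[1] = toks[0]                        # frame 1 duplicates frame 0 exactly
print(len(prune_temporal_tokens(toks)))  # about 48: frame 0 kept fully, frame 1 fully pruned
```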
Poster
Chirag Parikh · Deepti Rawat · Rakshitha R. T. · Tathagata Ghosh · Ravi Kiran Sarvadevabhatla
[ ExHall D ]
Abstract
We introduce **RoadSocial, a large-scale, diverse VideoQA dataset tailored for generic road event understanding from social media narratives**. Unlike existing datasets limited by regional bias, viewpoint bias and expert-driven annotations, RoadSocial captures the global complexity of road events with varied geographies, camera viewpoints (CCTV, handheld, drones) and rich social discourse. Our scalable semi-automatic annotation framework leverages Text LLMs and Video LLMs to generate comprehensive question-answer pairs across 12 challenging QA tasks, pushing the boundaries of road event understanding. RoadSocial is derived from social media videos spanning **14M frames** and **414K social comments**, resulting in **a dataset with 13.2K videos, 674 tags and 260K high-quality QA pairs**. We **evaluate 18 Video LLMs (open-source and proprietary, driving-specific and general-purpose)** on our road event understanding benchmark. We also demonstrate RoadSocial's utility in improving road event understanding capabilities of general-purpose Video LLMs.
Poster
Tanveer Hannan · Md Mohaiminul Islam · Jindong Gu · Thomas Seidl · Gedas Bertasius
[ ExHall D ]
Abstract
Large language models (LLMs) excel at retrieving information from lengthy text, but their vision-language counterparts (VLMs) face difficulties with hour-long videos, especially for temporal grounding. Specifically, these VLMs are constrained by frame limitations, often losing essential temporal details needed for accurate event localization in extended video content. We propose ReVisionLLM, a recursive vision-language model designed to locate events in hour-long videos. Inspired by human search strategies, our model initially targets broad segments of interest, progressively revising its focus to pinpoint exact temporal boundaries. Our model can seamlessly handle videos of vastly different lengths—from minutes to hours. We also introduce a hierarchical training strategy that starts with short clips to capture distinct events and progressively extends to longer videos. To our knowledge, ReVisionLLM is the first VLM capable of temporal grounding in hour-long videos, outperforming previous state-of-the-art methods across multiple datasets by a significant margin (e.g., +2.6% R1@0.1 on MAD). The code is available in the supplementary and will be released.
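A toy version of the coarse-to-fine search described above, where a hypothetical `score_segment` stands in for the vision-language model: the hour-long range is split into a few broad windows, the most promising one is recursed into, and the recursion stops once the window is short enough.

```python
# Illustrative skeleton of recursive coarse-to-fine temporal grounding.
# `score_segment` is a hypothetical stand-in for a VLM relevance score.
def score_segment(video_id, start, end, query):
    target = 3723.0                                  # pretend the event is near 1h02m03s
    mid = (start + end) / 2
    return -abs(mid - target)

def recursive_ground(video_id, start, end, query, n_splits=4, min_len=10.0):
    if end - start <= min_len:
        return start, end                            # fine enough: return the window
    step = (end - start) / n_splits
    windows = [(start + i * step, start + (i + 1) * step) for i in range(n_splits)]
    best = max(windows, key=lambda w: score_segment(video_id, *w, query))
    return recursive_ground(video_id, *best, query, n_splits, min_len)

print(recursive_ground("movie_001", 0.0, 7200.0, "the hero opens the vault"))
```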
Poster
Ali Athar · Xueqing Deng · Liang-Chieh Chen
[ ExHall D ]
Abstract
Recent advances in multimodal large language models (MLLMs) have expanded research in video understanding, primarily focusing on high-level tasks such as video captioning and question-answering. Meanwhile, a smaller body of work addresses dense, pixel-precise segmentation tasks, which typically involve category-guided or referral-based object segmentation. Although both research directions are essential for developing models with human-level video comprehension, they have largely evolved separately, with distinct benchmarks and architectures. This paper aims to unify these efforts by introducing ViCaS, a new dataset containing thousands of challenging videos, each annotated with detailed, human-written captions and temporally consistent, pixel-accurate masks for multiple objects with phrase grounding. Our benchmark evaluates models on both holistic/high-level understanding and language-guided, pixel-precise segmentation. We also present carefully validated evaluation measures and propose an effective model architecture that can tackle our benchmark. All annotations, as well as the code and model weights will be made public.
Poster
Shehan Munasinghe · Hanan Gani · Wenqi Zhu · Jiale Cao · Eric P. Xing · Fahad Shahbaz Khan · Salman Khan
[ ExHall D ]
Abstract
Fine-grained alignment between videos and text is challenging due to complex spatial and temporal dynamics in videos. Existing video-based Large Multimodal Models (LMMs) handle basic conversations but struggle with precise pixel-level grounding in videos. To address this, we introduce VideoGLaMM, a LMM designed for fine-grained pixel-level grounding in videos based on user-provided textual inputs. Our design seamlessly connects three key components: a Large Language Model, a dual vision encoder that emphasizes both spatial and temporal details, and a spatio-temporal decoder for accurate mask generation. This connection is facilitated via tunable V→L and L→V adapters that enable close Vision-Language (VL) alignment. The architecture is trained to synchronize both spatial and temporal elements of video content with textual instructions. To enable fine-grained grounding, we curate a multimodal dataset featuring detailed visually-grounded conversations using a semiautomatic annotation pipeline, resulting in a diverse set of 38k video-QA triplets along with 83k objects and 671k masks. We evaluate VideoGLaMM on three challenging tasks: Grounded Conversation Generation, Visual Grounding, and Referring Video Segmentation. Experimental results show that our model consistently outperforms existing approaches across all three tasks.
Poster
Yanjun Li · Zhaoyang Li · Honghui Chen · li'Zhi Xu
[ ExHall D ]
Abstract
Video Scene Graph Generation (VidSGG) aims to capture dynamic relationships among entities by sequentially analyzing video frames and integrating visual and semantic information. However, VidSGG is challenged by significant biases that skew predictions. To mitigate these biases, we propose a **VI**sual and **S**emantic **A**wareness (VISA) framework for unbiased VidSGG. VISA addresses visual bias through an innovative memory update mechanism that enhances object representations and concurrently reduces semantic bias by iteratively integrating object features with comprehensive semantic information derived from triplet relationships. This visual-semantics dual debiasing approach results in more unbiased representations of complex scene dynamics. Extensive experiments demonstrate the effectiveness of our method, where VISA outperforms existing unbiased VidSGG approaches by a substantial margin (e.g., +13.1% improvement in mR@20 and mR@50 for the SGCLS task under Semi Constraint).
Poster
Hang Yin · Xiuwei Xu · Linqing Zhao · Ziwei Wang · Jie Zhou · Jiwen Lu
[ ExHall D ]
Abstract
In this paper, we propose a general framework for universal zero-shot goal-oriented navigation. Existing zero-shot methods build their inference frameworks upon large language models (LLMs) for specific tasks, which differ greatly in their overall pipelines and fail to generalize across different types of goals. Towards the aim of universal zero-shot navigation, we propose a uniform graph representation to unify different goals, including object category, instance image and text description. We also convert the observation of the agent into an online maintained scene graph. With this consistent scene and goal representation, we preserve most structural information compared with pure text and are able to leverage LLMs for explicit graph-based reasoning. Specifically, we conduct graph matching between the scene graph and goal graph at each time instant and propose different strategies to generate the long-term goal of exploration according to different matching states. When there is no match, the agent first iteratively searches for a subgraph of the goal. With partial matching, the agent then utilizes coordinate projection and anchor pair alignment to infer the goal location. Finally, scene graph correction and goal verification are applied for perfect matching. We also present a blacklist mechanism to enable robust switching between stages. Extensive experiments on several benchmarks show that our UniGoal achieves state-of-the-art zero-shot …
Poster
Feiyu Pan · Hao Fang · Fangkai Li · Yanyu Xu · Yawei Li · Luca Benini · Xiankai Lu
[ ExHall D ]
Abstract
Referring video object segmentation (RVOS) aims to segment the objects within a video referred by linguistic expressions. Existing RVOS solutions follow a "fuse then select" paradigm: establishing semantic correlation between visual and linguistic features, and performing frame-level query interaction to select the instance mask per frame with an instance segmentation module. This paradigm overlooks the challenge of the semantic gap between the linguistic descriptor and the video object, as well as the underlying clutter in the video. This paper proposes a novel Semantic and Sequential Alignment (SSA) paradigm to handle these challenges. We first insert a light adapter after the vision language model (VLM) to perform the semantic alignment. Then, prior to selecting the mask per frame, we exploit the trajectory-to-instance enhancement for each frame via sequential alignment. This paradigm reuses the visual-language alignment of the VLM during adaptation and tries to capture global information by ensembling trajectories. This helps in understanding videos and the corresponding descriptors by bridging the gap with complex activity semantics, particularly when facing occlusion or similar interference. SSA demonstrates competitive performance while maintaining remarkably low computational costs. Code is available at https://github.com/anonymous61888/SSA.
Poster
Yunxiang Fu · Meng Lou · Yizhou Yu
[ ExHall D ]
Abstract
High-quality semantic segmentation relies on three key capabilities: global context modeling, local detail encoding, and multi-scale feature extraction. However, recent methods struggle to possess all these capabilities simultaneously. Hence, we aim to empower segmentation networks to simultaneously carry out efficient global context modeling, high-quality local detail encoding, and rich multi-scale feature representation for varying input resolutions. In this paper, we introduce LAMSeg, a novel linear-time model comprising a hybrid feature encoder dubbed LAMNet, and a decoder based on state space models. Specifically, LAMNet synergistically integrates sliding local attention with dynamic state space models, enabling highly efficient global context modeling while preserving fine-grained local details. Meanwhile, the MMSCopE module in our decoder enhances multi-scale context feature extraction and adaptively scales with the input resolution. We comprehensively evaluate LAMSeg on three challenging datasets: ADE20K, Cityscapes, and COCO-Stuff. For instance, LAMSeg-B achieves 52.1% mIoU on ADE20K, outperforming SegNeXt-L by 1.1% mIoU while reducing computational complexity by over 20 GFLOPs. On Cityscapes, LAMSeg-B attains 83.8% mIoU, surpassing SegFormer-B3 by 2.1% mIoU with approximately half the GFLOPs. Similarly, LAMSeg-B improves upon VWFormer-B3 by 0.9% mIoU with lower GFLOPs on the COCO-Stuff dataset.
Poster
Farzad Beizaee · Gregory A. Lodygensky · Christian Desrosiers · Jose Dolz
[ ExHall D ]
Abstract
The advanced image generation capabilities of recent diffusion models have spurred research into their application for reconstruction-based unsupervised anomaly detection. However, these methods may struggle with maintaining pixel-level structural integrity and recovering the anomaly-free content of abnormal regions, especially in multi-class scenarios. Furthermore, diffusion models are inherently designed to generate images from pure noise and struggle to selectively alter anomalous regions in an image while preserving normal ones. This leads to potential degradation of normal regions during the reconstruction process, hampering the effectiveness of anomaly detection. This paper introduces a reformulation of the standard diffusion model geared toward selective region alteration, allowing the accurate identification of anomalies. Our proposed Deviation correction diffusion (DeCo-Diff) model preserves the normal regions and encourages transformations exclusively on anomalous areas. By modeling anomalies as noise in the latent space, our method leverages the learned distribution of normal images to accurately reconstruct normal regions while altering only the anomalous areas. This selective approach enhances the reconstruction quality, facilitating effective unsupervised detection and localization of anomaly regions. Comprehensive evaluations demonstrate the superiority of our method in accurately identifying and localizing anomalies in complex images, with pixel-level AUPRC improvements of 11-14% over state-of-the-art models on well-known anomaly detection datasets. …
Poster
Zhenghao Xing · Hao Chen · Binzhu Xie · Jiaqi Xu · Ziyu Guo · Xuemiao Xu · Jianye Hao · Chi-Wing Fu · Xiaowei Hu · Pheng-Ann Heng
[ ExHall D ]
Abstract
Traffic Anomaly Understanding (TAU) is essential for improving public safety and transportation efficiency by enabling timely detection and response to incidents. Beyond existing methods, which rely largely on visual data, we propose to consider audio cues, a valuable source that offers strong hints about anomaly scenarios such as crashes and honking. Our contributions are twofold. First, we compile AV-TAU, the first large-scale audio-visual dataset for TAU, providing 29,865 traffic anomaly videos and 149,325 Q&A pairs, while supporting five essential TAU tasks. Second, we develop EchoTraffic, a multimodal LLM that integrates audio and visual data for TAU, through our audio-insight frame selector and dynamic connector to effectively extract crucial audio cues for anomaly understanding with a two-phase training framework. Experimental results on AV-TAU show that EchoTraffic sets new SOTA performance in TAU, outperforming existing multimodal LLMs. Our contributions, including AV-TAU and EchoTraffic, pave the way for multimodal TAU. Our dataset and code will be publicly available upon publication of this work.
Poster
Han Hu · Wenli Du · Peng Liao · Bing Wang · Siyuan Fan
[ ExHall D ]
Abstract
Due to the interference of background noise, existing video anomaly detection methods are prone to detect some normal events in complex scenes as anomalies. Meanwhile, we note that the diversity of normal patterns has not been adequately considered, i.e., the normal events that are worthy of reference in the test data have not been properly utilized, which raises the risk of missed and false detections. In this work, we combine the tasks of next-frame prediction and predicted-frame reconstruction to propose a noise-resistant video anomaly detection method. For the prediction task, we develop an RGB Error-Guided Multiscale Predictive Coding (EG-MPC) framework to overcome the interference of background noise on the learning of appearance and motion features of objects at various scales, thus achieving high-quality frame prediction. For the reconstruction task, we introduce the Dynamic Memory Modules (DMMs) into the reconstruction network, and design sparse aggregation and selective update strategies for the memory items in the DMMs to effectively represent diverse normal patterns while increasing the difficulty of reconstructing anomalies, thus making it easier to distinguish between normal and abnormal frames. Extensive experiments on four benchmark datasets demonstrate that our proposed method outperforms state-of-the-art approaches, especially in complex scenarios.
Poster
Yuhan Shen · Ehsan Elhamifar
[ ExHall D ]
Abstract
We introduce and develop a framework for Multi-Task Temporal Action Segmentation (MT-TAS), a novel paradigm that addresses the challenges of interleaved actions when performing multiple tasks simultaneously. Traditional action segmentation models, trained on single-task videos, struggle to handle task switches and complex scenes inherent in multi-task scenarios. To overcome these challenges, our MT-TAS approach synthesizes multi-task video data from single-task sources using our Multi-task Sequence Blending and Segment Boundary Learning modules. Additionally, we propose to dynamically isolate foreground and background elements within video frames, addressing the intricacies of object layouts in multi-task scenarios and enabling a new two-stage temporal action segmentation framework with Foreground-Aware Action Refinement. Also, we introduce the Multi-task Egocentric Kitchen Activities (MEKA) dataset, containing 12 hours of egocentric multi-task videos, to rigorously benchmark MT-TAS models. Extensive experiments demonstrate that our framework effectively bridges the gap between single-task training and multi-task testing, advancing temporal action segmentation with state-of-the-art performance in complex environments.
Poster
Mengmeng Wang · Zeyi Huang · Xiangjie Kong · Guojiang Shen · Guang Dai · Jingdong Wang · Yong Liu
[ ExHall D ]
Abstract
Video action recognition involves interpreting both global context and specific details to accurately identify actions. While previous models are effective at capturing spatiotemporal features, they often lack a focused representation of key action details. To address this, we introduce FocusVideo, a framework designed for refining video action recognition through integrated global and local feature learning. Inspired by human visual cognition theory, our approach balances the focus on both broad contextual changes and action-specific details, minimizing the influence of irrelevant background noise. We first employ learnable action queries to selectively emphasize action-relevant regions without requiring region-specific labels. Next, these queries are learned by a local action streaming branch that enables progressive query propagation. Moreover, we introduce a parameter-free feature interaction mechanism for effective multi-scale interaction between global and local features with minimal additional overhead. Extensive experiments demonstrate that FocusVideo achieves state-of-the-art performance across multiple action recognition datasets, validating its effectiveness and robustness in handling action-relevant details.
Poster
Ziyu Yao · Xuxin Cheng · Zhiqi Huang · Lei Li
[ ExHall D ]
Abstract
Repetitive action counting, which aims to count periodic movements in a video, is valuable for video analysis applications such as fitness monitoring. However, existing methods largely rely on handcrafted models with limited representational capacity, which hampers their ability to accurately capture variable periodic patterns. Additionally, their supervised learning on narrow, limited training sets leads to overfitting and restricts their ability to generalize across diverse scenarios. To address these challenges, we propose CountLLM, the first large language model (LLM)-based framework that takes video data and periodic text prompts as inputs and outputs the desired counting value. CountLLM leverages the rich clues from explicit textual instructions and the powerful representational capabilities of pre-trained LLMs for repetitive action counting. To effectively guide CountLLM, we develop a periodicity-based structured template for instructions that describes the properties of periodicity and implements a standardized answer format to ensure consistency. Additionally, we propose a progressive multimodal training paradigm to enhance the periodicity-awareness of the LLM. Empirical evaluations on widely recognized benchmarks demonstrate CountLLM's superior performance and generalization, particularly in handling novel and out-of-domain actions that deviate significantly from the training data, offering a promising avenue for repetitive action counting.
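A small sketch of what a periodicity-aware instruction template and a standardized answer format could look like; the wording and JSON schema below are assumptions, since the abstract does not give CountLLM's exact template.

```python
# Illustrative only: a periodicity-aware prompt template plus a strict,
# machine-parseable answer format for repetition counting.
import json

def build_counting_prompt(action_hint=None):
    properties = (
        "A repetitive action is a movement that returns to a similar pose and "
        "repeats with a roughly regular period; count one repetition per full cycle."
    )
    target = f" The action of interest is: {action_hint}." if action_hint else ""
    return (
        f"<video>\n{properties}{target}\n"
        'Answer strictly as JSON: {"count": <integer>, "period_seconds": <float>}'
    )

def parse_count(reply):
    return int(json.loads(reply)["count"])

print(build_counting_prompt("jumping jacks"))
print(parse_count('{"count": 12, "period_seconds": 1.4}'))
```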
Poster
Xiaoyan Ma · jidong kuang · Hongsong Wang · Jie Gui
[ ExHall D ]
Abstract
Skeleton-based human action recognition has received widespread attention in recent years due to its diverse range of application scenarios. Because human skeletons come from different sources, skeleton data naturally exhibit heterogeneity. Previous works, however, overlook the heterogeneity of human skeletons and solely construct models tailored for homogeneous skeletons. This work addresses the challenge of heterogeneous skeleton-based action representation learning, specifically focusing on processing skeleton data that varies in joint dimensions and topological structures. The proposed framework comprises two primary components: heterogeneous skeleton processing and unified representation learning. The former first converts two-dimensional skeleton data into three-dimensional skeletons via an auxiliary network, and then constructs a prompted unified skeleton using skeleton-specific prompts. We also design an additional modality named semantic motion encoding to harness the semantic information within skeletons. The latter module learns a unified action representation using a shared backbone network that handles the different heterogeneous skeletons produced by the former module. Extensive experiments on the NTU-60, NTU-120, and PKU-MMD II datasets demonstrate the effectiveness of our method in various tasks, such as action recognition, action retrieval, and semi-supervised action recognition.
Poster
Xiaohai Li · Bineng Zhong · Qihua Liang · Zhiyi Mo · Jian Nong · Shuxiang Song
[ ExHall D ]
Abstract
The consistency between the semantic information provided by the multi-modal reference and the tracked object is crucial for visual-language (VL) tracking. However, existing VL tracking frameworks rely on static multi-modal references to locate dynamic objects, which can lead to semantic discrepancies and reduce the robustness of the tracker. To address this issue, we propose a novel vision-language tracking framework, named DUTrack, which captures the latest state of the target by dynamically updating multi-modal references to maintain consistency. Specifically, we introduce a Dynamic Language Update Module, which leverages a large language model to generate dynamic language descriptions for the object based on visual features and object category information. Then, we design a Dynamic Template Capture Module, which captures the regions in the image that highly match the dynamic language descriptions. Furthermore, to ensure the efficiency of description generation, we design an update strategy that assesses changes in target displacement, scale, and other factors to decide on updates. Finally, the dynamic template and language descriptions that record the latest state of the target are used to update the multi-modal references, providing more accurate reference information for subsequent inference and enhancing the robustness of the tracker. DUTrack achieves new state-of-the-art performance on four mainstream vision-language …
Poster
Yu Guo · Weiquan Liu · Qingshan Xu · Shijun Zheng · Shujun Huang · Yu Zang · Siqi Shen · Chenglu Wen · Cheng Wang
[ ExHall D ]
Abstract
Adversarial examples can mislead deep neural networks with subtle perturbations, causing them to make incorrect predictions. Notably, adversarial examples crafted for one model can also deceive other models, a phenomenon known as the transferability of adversarial examples. To improve transferability, existing research has designed various mechanisms centered on the complex interactions between data and models. However, their improvements are relatively limited. Moreover, since these methods are often designed for a specific data modality, their scalability to other data modalities is greatly restricted. In this work, we observe a mirroring relationship between model generalization and adversarial example transferability. Motivated by this observation, we propose an augmentation-based attack, called OPS (Operator-Perturbation-based Stochastic optimization), which constructs a stochastic optimization problem from input transformation operators and random perturbations, and solves this problem to generate adversarial examples with better transferability. Extensive experiments on both images and 3D point clouds demonstrate that OPS significantly outperforms existing SOTA methods in terms of both performance and cost, showcasing the universality and superiority of our approach.
Poster
Yuning Han · Bingyin Zhao · Rui Chu · Feng Luo · Biplab Sikdar · Yingjie Lao
[ ExHall D ]
Abstract
Recent studies show that diffusion models (DMs) are vulnerable to backdoor attacks. Existing backdoor attacks impose unconcealed triggers (e.g., a gray box and eyeglasses) that contain evident patterns, achieving remarkable attack effects yet being easily detected by human inspection and defensive algorithms. While it is possible to improve stealthiness by reducing the strength of the backdoor, doing so can significantly compromise its generality and effectiveness. In this paper, we propose UIBDiffusion, the universal imperceptible backdoor attack for diffusion models, which allows us to achieve superior attack and generation performance while evading state-of-the-art defenses. We propose a novel trigger generation approach based on universal adversarial perturbations (UAPs) and reveal that such perturbations, which are initially devised for fooling pre-trained discriminative models, can be adapted as potent imperceptible backdoor triggers for DMs. We evaluate UIBDiffusion on multiple types of DMs with different kinds of samplers across various datasets and targets. Experimental results demonstrate that UIBDiffusion brings three advantages: 1) Universality, the imperceptible trigger is universal (i.e., image and model agnostic), where a single trigger is effective for any image and all diffusion models with different samplers; 2) Utility, it achieves comparable generation quality (e.g., FID) and even better attack success rate (i.e., ASR) …
Poster
Wei Ao · Vishnu Naresh Boddeti
[ ExHall D ]
Abstract
Face recognition is central to many authentication, security, and personalized applications. Yet, it suffers from significant privacy risks, particularly concerning unauthorized access to sensitive biometric data. This paper presents CryptoFace, the first end-to-end encrypted face recognition system that ensures secure processing of facial data from acquisition through storage and matching without exposing raw facial images or features at any stage. CryptoFace leverages fully homomorphic encryption (FHE) for encrypted feature extraction, feature matching, and comparison while maintaining high face recognition performance. It employs a mixture of shallow patch convolutional networks (PCNNs), which can be evaluated in parallel with FHE and lead to much faster inference. It is scalable to high-resolution face data without sacrificing inference speed and optimizes homomorphic neural architecture by minimizing the multiplicative depth. We evaluate the performance and computational efficiency of CryptoFace on multiple encrypted benchmark face datasets. CryptoFace exhibits a significant acceleration of 7.2× (9,845s to 1,364s per image on CPU) compared to state-of-the-art FHE-based neural networks adapted for face recognition. CryptoFace will facilitate the deployment of secure biometric authentication systems in applications requiring strict privacy and security guarantees.
Poster
Xinjie Cui · Yuezun Li · Ao Luo · Jiaran Zhou · Junyu Dong
[ ExHall D ]
Abstract
We describe the Forensics Adapter, an adapter network designed to transform CLIP into an effective and generalizable face forgery detector. Although CLIP is highly versatile, adapting it for face forgery detection is non-trivial as forgery-related knowledge is entangled with a wide range of unrelated knowledge. Existing methods treat CLIP merely as a feature extractor, lacking task-specific adaptation, which limits their effectiveness. To address this, we introduce an adapter to learn face forgery traces, the blending boundaries unique to forged faces, guided by task-specific objectives. Then we enhance the CLIP visual tokens with a dedicated interaction strategy that communicates knowledge across CLIP and the adapter. Since the adapter is alongside CLIP, its versatility is highly retained, naturally ensuring strong generalizability in face forgery detection. With only 5.7M trainable parameters, our method achieves a significant performance boost, improving by approximately 7% on average across five standard datasets. We believe the proposed method can serve as a baseline for future CLIP-based face forgery detection methods.
Poster
Haoran Wang · Xinji Mai · Zeng Tao · Xuan Tong · Junxiong Lin · Yan Wang · Jiawen Yu · Shaoqi Yan · Ziheng Zhou · Wenqiang Zhang
[ ExHall D ]
Abstract
The current advancements in Dynamic Facial Expression Recognition (DFER) methods mainly focus on better capturing the spatial and temporal features of facial expressions. However, DFER datasets contain a substantial number of noisy samples, and few works have addressed the issue of handling this noise. We identify two types of noise: one is caused by low-quality data resulting from factors such as occlusion, dim lighting, and blurriness; the other arises from mislabeled data due to annotation bias by annotators. To address these two types of noise, we have meticulously crafted a **D**ynamic **D**ual-**S**tage **P**urification (D2SP) framework. This initiative aims to dynamically purify the DFER datasets of these two types of noise, ensuring that only high-quality and correctly labeled data is used in the training process. To mitigate low-quality samples, we introduce the Coarse-Grained Pruning (CGP) stage, which computes sample weights and prunes those low-weight samples. After CGP, the Fine-Grained Correction (FGC) stage evaluates prediction stability to correct mislabeled data. Moreover, D2SP is conceived as a general and plug-and-play framework, tailored to integrate seamlessly with prevailing DFER methods. Extensive experiments covering prevalent DFER datasets and deploying multiple benchmark methods have substantiated D2SP’s ability to significantly enhance performance metrics.
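A rough sketch of a coarse-grained pruning step of the kind described above, where a per-sample weight is taken to be the model's confidence on the annotated class and the lowest-weight samples are dropped; the weighting rule and keep ratio are assumptions, not the paper's exact design.

```python
# Illustrative only: weight each sample by its predicted probability on the
# annotated class and prune the lowest-weight (likely low-quality) samples.
import numpy as np

def coarse_grained_prune(probs, labels, keep_ratio=0.9):
    """probs: (n, classes) softmax outputs; labels: (n,) annotated classes."""
    weights = probs[np.arange(len(labels)), labels]      # confidence on the given label
    cutoff = np.quantile(weights, 1.0 - keep_ratio)
    keep = weights >= cutoff
    return np.where(keep)[0], weights

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(7), size=100)              # 100 samples, 7 expressions
labels = rng.integers(0, 7, size=100)
kept, w = coarse_grained_prune(probs, labels)
print(len(kept))                                         # roughly 90 samples survive pruning
```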
Poster
Tianyi Wang · Zichen Wang · Cong Wang · Yuanchao Shu · Ruilong Deng · Peng Cheng · Jiming Chen
[ ExHall D ]
Abstract
Object detection is a fundamental enabler for many real-time downstream applications such as autonomous driving, augmented reality and supply chain management. However, the algorithmic backbone of neural networks is brittle to imperceptible perturbations in the system inputs, which are generally known as misclassification attacks. Targeting this real-time processing capability, a new class of latency attacks has been reported recently. They exploit new attack surfaces in object detectors by creating a computational bottleneck in the post-processing module, which leads to cascading failures and puts real-time downstream tasks at risk. In this work, we make an initial attempt to defend against this attack via background-attentive adversarial training that is also cognizant of the underlying hardware capabilities. We first draw system-level connections between latency attacks and hardware capacity across heterogeneous devices. Based on the particular adversarial behaviors, we utilize objectness loss as a proxy and build background attention into the adversarial training pipeline, and achieve a reasonable balance between clean and robust accuracy. Extensive experiments demonstrate the effectiveness of the defense, restoring real-time processing capability from 13 FPS to 43 FPS on Jetson Orin NX with a better trade-off between clean and robust accuracy.
Poster
Wei Huang · Qinying Gu · Nanyang Ye
[ ExHall D ]
Abstract
Offline reinforcement learning (RL) enables policy training solely on pre-collected data, avoiding direct environment interaction—a crucial benefit for energy-constrained embodied AI applications. Although Artificial Neural Networks (ANN)-based methods perform well in offline RL, their high computational and energy demands motivate exploration of more efficient alternatives. Spiking Neural Networks (SNNs) show promise for such tasks, given their low power consumption. In this work, we introduce DSFormer, the first spike-driven transformer model designed to tackle offline RL via sequence modeling. Unlike existing SNN transformers focused on spatial dimensions for vision tasks, we develop Temporal Spiking Self-Attention (TSSA) and Positional Spiking Self-Attention (PSSA) in DSFormer to capture the temporal and positional dependencies essential for sequence modeling in RL. Additionally, we propose Progressive Threshold-dependent Batch Normalization (PTBN), which combines the benefits of LayerNorm and BatchNorm to preserve temporal dependencies while maintaining the spiking nature of SNNs. Comprehensive results on the D4RL benchmark show DSFormer’s superiority over both SNN and ANN counterparts, achieving 78.4% energy savings, highlighting DSFormer’s advantages not only in energy efficiency but also in competitive performance.
Poster
Zhiqi Pang · Junjie Wang · Lingling Zhao · Chunyu Wang
[ ExHall D ]
Abstract
Clothing change person re-identification (CC-ReID) aims to match different images of the same person, even when the clothing varies across images. To reduce manual labeling costs, existing unsupervised CC-ReID methods employ clustering algorithms to generate pseudo-labels. However, they often fail to assign the same pseudo-label to two images with the same identity but different clothing—referred to as a clothing change positive pair—thus hindering clothing-invariant feature learning. To address this issue, we propose the identity-clothing similarity modeling (ICSM) framework. To effectively connect clothing change positive pairs, ICSM first performs clothing-aware learning to leverage all discriminative information, including clothing, to obtain compact clusters. It then extracts cluster-level identity and clothing features and performs inter-cluster similarity estimation to identify clothing change positive clusters, reliable negative clusters, and hard negative clusters for each compact cluster. During optimization, we design an adaptive version of existing optimization methods to enhance similarities of clothing change positive pairs, while also introducing text semantics as a supervisory signal to further promote clothing invariance. Extensive experimental results across multiple datasets validate the effectiveness of the proposed framework, demonstrating its superiority over existing unsupervised methods and its competitiveness with some supervised approaches.
Poster
Jinxi Yang · He Li · Bo Du · Mang Ye
[ ExHall D ]
Abstract
Person re-identification (ReID) is the task of matching individuals across different camera views. Existing approaches typically employ neural networks to extract discriminative features, ranking gallery images based on their similarities to probe images. While effective, these methods are often further enhanced through re-ranking, which refines the initial retrieval results without additional training. However, current re-ranking methods mostly rely on k-nearest neighbor search to extract similar images that might have the same identity as the query, which is time-consuming with a high computation burden, limiting their real-world applicability. We rethink the effect of the k-nearest neighbor search and introduce the Chebyshev's Theorem-guided Graph Re-ranking (Cheb-GR) method, which replaces the k-nearest neighbor search with an adaptive neighbor search guided by Chebyshev's Theorem for efficient neighbor selection. Our method leverages graph convolution operations to refine image features and achieve robust re-ranking, leading to enhanced retrieval performance. Furthermore, we provide a theoretical analysis based on Chebyshev's Inequality to elucidate the factors contributing to the strong performance of the proposed method. Our method significantly reduces the computation costs while maintaining relatively strong performance. Through extensive experiments in both general and cross-domain settings, we demonstrate the effectiveness of Cheb-GR and its potential for real-world applications. Code …
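One plausible reading of Chebyshev-guided adaptive neighbor selection, sketched below with assumed inputs (a row of probe-to-gallery distances); the paper's exact rule may differ:

```python
# Minimal sketch (an assumption, not the paper's exact rule): adaptive neighbor
# selection guided by Chebyshev's inequality instead of a fixed k-NN cutoff.
# A gallery item is kept only if its distance is more than t standard deviations
# below the mean; Chebyshev's inequality bounds the fraction of such deviations
# by 1/t^2, so the neighbor count adapts to the distance distribution.
import numpy as np

def chebyshev_neighbors(dist_row, t=1.5):
    mu, sigma = dist_row.mean(), dist_row.std()
    return np.where(dist_row <= mu - t * sigma)[0]   # unusually close gallery items

# toy usage: 1 probe vs. 100 gallery features
rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 128))
probe = rng.normal(size=128)
dists = np.linalg.norm(feats - probe, axis=1)
print(chebyshev_neighbors(dists, t=1.5))
```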
Poster
Ji Du · Fangwei Hao · Mingyang Yu · Desheng Kong · Jiesheng Wu · Bin Wang · Jing XU · Ping Li
[ ExHall D ]
Abstract
Camouflaged Object Detection (COD) seeks to distinguish objects from their highly similar backgrounds. Existing work has essentially focused on isolating camouflaged objects from the environment, demonstrating ever-improving performance but at the cost of extensive annotations and complex optimizations. In this paper, we diverge from this paradigm and shift the lens to isolating the salient environment from the camouflaged object. We introduce EASE, an Environment-Aware unSupErvised COD framework that identifies the environment by referencing an environment prototype library and detects camouflaged objects by inverting the retrieved environmental features. Specifically, our approach (DiffPro) uses large multimodal models, diffusion models, and vision foundation models to construct the environment prototype library. To retrieve environments from the library and avoid confusing foreground with background, we incorporate three retrieval schemes: Kernel Density Estimation-based Adaptive Threshold (KDE-AT), Global-to-Local pixel-level retrieval (G2L), and Self-Retrieval (SR). Our experiments demonstrate significant improvements over current unsupervised methods, with EASE achieving an average gain of over 10\% on the COD10K dataset. When integrated with SAM, EASE surpasses prompt-based segmentation approaches and performs competitively with state-of-the-art fully-supervised methods.
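A minimal sketch of a KDE-based adaptive threshold in the spirit of KDE-AT, under the assumption that per-pixel similarities to a retrieved environment prototype are available; the prototype library and retrieval details are not reproduced:

```python
# Minimal sketch (illustrative assumptions): fit a Gaussian KDE to the per-pixel
# similarity scores and place the threshold at the deepest valley of the density,
# separating environment-like pixels from candidate camouflaged pixels.
import numpy as np
from scipy.stats import gaussian_kde

def kde_adaptive_threshold(similarities, grid_size=256):
    kde = gaussian_kde(similarities)
    grid = np.linspace(similarities.min(), similarities.max(), grid_size)
    density = kde(grid)
    # local minima strictly inside the range
    valleys = np.where((density[1:-1] < density[:-2]) & (density[1:-1] < density[2:]))[0] + 1
    if len(valleys) == 0:
        return np.median(similarities)           # fallback for unimodal scores
    return grid[valleys[np.argmin(density[valleys])]]

# toy bimodal scores: most pixels resemble the environment, a few do not
scores = np.concatenate([np.random.normal(0.8, 0.05, 900),
                         np.random.normal(0.3, 0.05, 100)])
tau = kde_adaptive_threshold(scores)
camouflaged_mask = scores < tau                  # pixels dissimilar to the environment
```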
Poster
Yi Yu · Botao Ren · Peiyuan Zhang · Mingxin Liu · Junwei Luo · Shaofeng Zhang · Feipeng Da · Junchi Yan · Xue Yang
[ ExHall D ]
Abstract
With the rapidly increasing demand for oriented object detection (OOD), recent research involving weakly-supervised detectors for learning OOD from point annotations has gained great attention. In this paper, we rethink this challenging task setting with the layout among instances and present Point2RBox-v2. At the core are three principles: 1) Gaussian overlap loss. It learns an upper bound for each instance by treating objects as 2D Gaussian distributions and minimizing their overlap. 2) Voronoi watershed loss. It learns a lower bound for each instance through watershed on Voronoi tessellation. 3) Consistency loss. It learns the size/rotation variation between two output sets with respect to an input image and its augmented view. Supplemented by a few devised techniques, e.g. edge loss and copy-paste, the detector is further enhanced. To the best of our knowledge, Point2RBox-v2 is the first approach to explore the spatial layout among instances for learning point-supervised OOD. Our solution is elegant and lightweight, yet achieves competitive performance, especially in densely packed scenes: 62.61%/86.15%/34.71% on DOTA/HRSC/FAIR1M. The source code will be made publicly available.
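A hedged sketch of a pairwise Gaussian overlap penalty, assuming rotated boxes parameterized as (cx, cy, w, h, theta) and overlap measured with the Bhattacharyya coefficient; the paper's exact formulation may differ:

```python
# Minimal sketch (an assumption about the formulation, not the released code):
# treat each predicted rotated box as a 2D Gaussian and penalize the
# Bhattacharyya coefficient of every pair, discouraging mutual overlap.
import torch

def box_to_gaussian(boxes):
    cx, cy, w, h, theta = boxes.unbind(-1)
    cos, sin = torch.cos(theta), torch.sin(theta)
    R = torch.stack([cos, -sin, sin, cos], -1).reshape(-1, 2, 2)
    S = torch.diag_embed(torch.stack([(w / 2) ** 2, (h / 2) ** 2], -1))
    return torch.stack([cx, cy], -1), R @ S @ R.transpose(-1, -2)

def gaussian_overlap_loss(boxes):
    mu, sigma = box_to_gaussian(boxes)
    n = mu.shape[0]
    i, j = torch.triu_indices(n, n, offset=1)
    d_mu = (mu[i] - mu[j]).unsqueeze(-1)
    sigma_m = 0.5 * (sigma[i] + sigma[j])
    mahal = 0.125 * (d_mu.transpose(-1, -2) @ torch.linalg.inv(sigma_m) @ d_mu).squeeze()
    log_det = 0.5 * torch.log(
        torch.linalg.det(sigma_m)
        / torch.sqrt(torch.linalg.det(sigma[i]) * torch.linalg.det(sigma[j])))
    bhattacharyya = torch.exp(-(mahal + log_det))   # in (0, 1]; 1 means identical Gaussians
    return bhattacharyya.mean()

boxes = torch.tensor([[50., 50., 20., 10., 0.3],
                      [55., 52., 18., 12., 1.0]], requires_grad=True)
gaussian_overlap_loss(boxes).backward()
```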
Poster
Hang Zhou · Xinxin Zuo · Rui Ma · Li Cheng
[ ExHall D ]
Abstract
In this paper, we tackle the copy-paste image-to-image composition problem with a focus on object placement learning. Prior methods have leveraged generative models to minimize the need for dense supervision, which unfortunately may limit their ability to model complex data distributions. Alternatively, transformer networks with a sparse contrastive loss have been employed; yet their over-relaxed regularization often leads to imprecise placement. We propose BOOTPLACE, a novel paradigm that formulates object placement as a placement-by-detection problem. Our method first identifies regions of interest suitable for object placement by training a dedicated detection transformer on object-subtracted backgrounds with multi-object supervision. It then associates each target compositing object with detected regions based on semantic complementarity. Using a bootstrapped training approach on randomly object-subtracted images, our model regularizes meaningful placements through richly paired data augmentation. Experimental results on standard benchmarks demonstrate BOOTPLACE's superior performance in object repositioning, significantly outperforming state-of-the-art baselines on the Cityscapes and OPA datasets with notable improvements in IoU scores. Additional ablation studies further showcase the compositionality and generalizability of our approach, supported by user study evaluations.
Poster
Fangyun Wei · Jinjing Zhao · Kun Yan · Chang Xu
[ ExHall D ]
Abstract
Traditional video instance segmentation (VIS) models rely on extensive per-frame video annotations, which are both time-consuming and costly. In this paper, we present MinMaxVIS, a novel VIS framework that reduces the dependency on fully labeled video datasets by utilizing a small set of labeled images from the target domain along with a large volume of general-domain, unlabeled images. MinMaxVIS operates in three stages: first, a preliminary segmentation model is trained on the small labeled set from the target domain; this model then retrieves relevant instances from the unlabeled dataset to build a high-quality pseudo-labeled set, ensuring a rich content alignment with the target domain while avoiding the inefficiencies of large-scale semi-supervised learning across the entire unlabeled set. Finally, we train MinMaxVIS on a combination of labeled and pseudo-labeled data, addressing challenges such as noise in pseudo-labels and instance association across frames. To simulate object continuity, we augment static images to create paired frames, allowing MinMaxVIS to capture instance associations effectively. MinMaxVIS outperforms the prior image-driven approach, MinVIS, achieving superior mAP scores with significantly reduced labeled data. For instance, MinMaxVIS with a Swin-L backbone attains 62.2 mAP on YouTube-VIS 2019 using only 2% labeled data and additional unlabeled images from SA-1B. …
Poster
Jiacheng Sun · Xinghong Zhou · Yiqiang Wu · Bin Zhu · Jiaxuan Lu · Yu Qin · Xiaomao Li
[ ExHall D ]
Abstract
One of the roadblocks for instance segmentation today is the heavy computational overhead and large number of model parameters. Previous methods based on Polar Representation made an initial attempt to address this challenge by formulating instance segmentation as polygon detection, but failed to match mainstream methods in performance. In this paper, we highlight that Representation Errors, arising from the limited capacity of polygons to capture boundary details, have long been overlooked, which results in severe performance degradation. Observing that optimal starting point selection effectively alleviates this issue, we propose an Adaptive Polygonal Sample Decision strategy to dynamically capture the positional variation of representation errors across samples. Additionally, we design a Union-aligned Rasterization Module to incorporate these errors into polygonal assessment, further advancing the proposed strategy. With these components, our framework, called PolarNeXt, achieves a remarkable performance boost of over 4.8% AP compared to other polar-based methods. PolarNeXt is markedly more lightweight and efficient than state-of-the-art instance segmentation methods, while achieving comparable segmentation accuracy. We expect this work will open up a new direction for instance segmentation in high-resolution images and resource-limited scenarios.
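For readers unfamiliar with polar instance representations, the sketch below encodes a contour as fixed-angle ray lengths and decodes it back, which makes the representation error visible; the adaptive sample decision and rasterization module themselves are not reproduced:

```python
# Minimal sketch (illustration only): the polar representation underlying
# polar-based instance segmentation. A contour is encoded as K ray lengths from
# the instance center at fixed angles; re-decoding the polygon exposes the
# representation error that boundary-detail-limited polygons incur.
import numpy as np

def contour_to_polar(contour, center, num_rays=36):
    # contour: (N, 2) boundary points; center: (2,) instance center
    angles = np.linspace(0, 2 * np.pi, num_rays, endpoint=False)
    rel = contour - center
    point_angles = np.arctan2(rel[:, 1], rel[:, 0]) % (2 * np.pi)
    dists = np.linalg.norm(rel, axis=1)
    rays = np.zeros(num_rays)
    for k, a in enumerate(angles):
        # pick the boundary point whose angle is closest to ray k
        idx = np.argmin(np.abs((point_angles - a + np.pi) % (2 * np.pi) - np.pi))
        rays[k] = dists[idx]
    return angles, rays

def polar_to_polygon(center, angles, rays):
    return center + np.stack([rays * np.cos(angles), rays * np.sin(angles)], axis=1)
```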
Poster
Jihuai Zhao · Junbao Zhuo · Jiansheng Chen · Huimin Ma
[ ExHall D ]
Abstract
In the field of zero-shot 3D instance segmentation, existing 2D-to-3D lifting methods typically obtain 2D segmentation across multiple RGB frames using vision foundation models, which are then projected and merged into 3D space. However, since the inference of vision foundation models on a single frame is not integrated with adjacent frames, the masks of the same object may vary across different frames, leading to a lack of view consistency in the 2D segmentation. Furthermore, current lifting methods average the 2D segmentation from multiple views during the projection into 3D space, causing low-quality masks and high-quality masks to share the same weight. These factors can lead to fragmented 3D segmentation. In this paper, we present SAM2Object, a novel zero-shot 3D instance segmentation method that effectively utilizes the Segment Anything Model 2 to segment and track objects, consolidating view consistency across frames. Our approach combines these consistent 2D masks with 3D geometric priors, improving the robustness of 3D segmentation. Additionally, we introduce a mask consolidation module to filter out low-quality masks across frames, which enables more precise 3D-to-2D matching. Comprehensive evaluations on ScanNetV2, ScanNet++ and ScanNet200 demonstrate the robustness and effectiveness of SAM2Object, showcasing its ability to outperform previous methods.
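The projection step that 2D-to-3D lifting pipelines rely on is standard back-projection; a minimal sketch with assumed inputs (per-frame depth, intrinsics, and camera-to-world pose):

```python
# Minimal sketch (standard back-projection, not the paper's pipeline): lift a 2D
# mask into 3D world coordinates using depth, intrinsics, and camera pose.
import numpy as np

def lift_mask(mask, depth, K, cam_to_world):
    # mask: (H, W) bool; depth: (H, W) metres; K: (3, 3); cam_to_world: (4, 4)
    v, u = np.nonzero(mask)
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)
    return (pts_cam @ cam_to_world.T)[:, :3]   # (P, 3) world-space points of the object
```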
Poster
Jiaxin Zhang · Junjun Jiang · Youyu Chen · Kui Jiang · Xianming Liu
[ ExHall D ]
Abstract
Accurate object segmentation is crucial for high-quality scene understanding in the 3D vision domain. However, 3D segmentation based on 3D Gaussian Splatting (3DGS) struggles with accurately delineating object boundaries, as Gaussian primitives often span across object edges due to their inherent volume and the lack of semantic guidance during training. In order to tackle these challenges, we introduce Clear Object Boundaries for 3DGS Segmentation (COB-GS), which aims to improve segmentation accuracy by clearly delineating blurry boundaries of interwoven Gaussian primitives within the scene. Unlike existing approaches that remove ambiguous Gaussians and sacrifice visual quality, COB-GS, as a 3DGS refinement method, jointly optimizes semantic and visual information, allowing the two different levels to cooperate with each other effectively. Specifically, for the semantic guidance, we introduce a boundary-adaptive Gaussian splitting technique that leverages semantic gradient statistics to identify and split ambiguous Gaussians, aligning them closely with object boundaries. For the visual optimization, we rectify the degraded suboptimal texture of the 3DGS scene, particularly along the refined boundary structures. Experimental results demonstrate that COB-GS substantially improves segmentation accuracy and robustness against inaccurate masks from pre-trained models, yielding clear boundaries while preserving high visual quality.
Poster
Bo-Wen Yin · Jiao-Long Cao · Ming-Ming Cheng · Qibin Hou
[ ExHall D ]
Abstract
Recent advances in scene understanding benefit greatly from depth maps because of their 3D geometry information, especially in complex conditions (e.g., low light and overexposure). Existing approaches encode depth maps along with RGB images and perform feature fusion between them to enable more robust predictions. Taking into account that depth can be regarded as a geometry supplement for RGB images, a straightforward question arises: do we really need to explicitly encode depth information with neural networks as done for RGB images? Based on this insight, in this paper, we investigate a new way to learn RGBD feature representations and present DFormerv2, a strong RGBD encoder that explicitly uses depth maps as geometry priors rather than encoding depth information with neural networks. Our goal is to leverage a memory token as the query to extract the geometry clues from the depth and spatial distances among all the image patch tokens, which are then used as geometry priors to allocate attention weights in self-attention. Extensive experiments demonstrate that DFormerv2 exhibits exceptional performance on various RGBD semantic segmentation benchmarks.
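A simplified sketch of the depth-as-geometry-prior idea, in which pairwise 3D distances between patch tokens bias the attention logits; the memory-token query design is omitted and all names are illustrative:

```python
# Minimal sketch (a simplification of the idea, not DFormerv2 itself): depth is
# not encoded by a network; instead, pairwise 3D distances between patch tokens
# act as a geometry prior that biases the self-attention weights.
import torch
import torch.nn.functional as F

def geometry_biased_attention(q, k, v, patch_xyz, tau=1.0):
    # q, k, v: (N, d) patch tokens; patch_xyz: (N, 3) patch centers lifted with depth
    attn_logits = q @ k.t() / q.shape[-1] ** 0.5
    geo_dist = torch.cdist(patch_xyz, patch_xyz)             # (N, N) Euclidean distances
    attn = F.softmax(attn_logits - geo_dist / tau, dim=-1)   # nearby patches get more weight
    return attn @ v

N, d = 196, 64
q = k = v = torch.randn(N, d)
patch_xyz = torch.rand(N, 3)
out = geometry_biased_attention(q, k, v, patch_xyz)
```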
Poster
Chongkai Yu · Ting Liu · Li Anqi · Xiaochao Qu · WU CHENGJING · Luoqi Liu · Xiaolin Hu
[ ExHall D ]
Abstract
Interactive segmentation aims to segment the mask of the target object according to the user’s interactive prompts. There are two mainstream strategies: early fusion and late fusion. Current specialist models utilize the early fusion strategy that encodes the combination of images and prompts to target the prompted objects, yet repetitive complex computations on the images result in high latency. Late fusion models extract image embeddings once and merge them with the prompts in later interactions. This strategy avoids redundant image feature extraction and improves efficiency significantly. A recent milestone is the Segment Anything Model (SAM). However, this strategy limits the model's ability to extract detailed information from the prompted target zone. To address this issue, we propose SAM-REF, a two-stage refinement framework that fully integrates images and prompts by introducing a lightweight refiner into the late-fusion interaction, combining the accuracy of early fusion with the efficiency of late fusion. Through extensive experiments, we show that our SAM-REF model outperforms the current state-of-the-art method in most metrics on segmentation quality without compromising efficiency.
Poster
Subhransu S. Bhattacharjee · Dylan Campbell · Rahul Shome
[ ExHall D ]
Abstract
Can objects that are not visible in an image---but are in the vicinity of the camera---be detected? This study introduces the novel tasks of 2D, 2.5D and 3D unobserved object detection for predicting the location of nearby objects that are occluded or lie outside the image frame. We adapt several state-of-the-art pre-trained generative models to address this task, including 2D and 3D diffusion models and vision-language models, and show that they can be used to infer the presence of objects that are not directly observed. To benchmark this task, we propose a suite of metrics that capture different aspects of performance. Our empirical evaluation on indoor scenes from the RealEstate10k and NYU Depth v2 datasets demonstrates results that motivate the use of generative models for the unobserved object detection task.
Poster
Ege Özsoy · Chantal Pellegrini · Tobias Czempiel · Felix Tristram · Kun yuan · David Bani-Harouni · Ulrich Eck · Benjamin Busam · Matthias Keicher · Nassir Navab
[ ExHall D ]
Abstract
Operating rooms (ORs) are complex, high-stakes environments requiring precise understanding of interactions among medical staff, tools, and equipment for enhancing surgical assistance, situational awareness, and patient safety. Current datasets fall short in scale and realism and do not capture the multimodal nature of OR scenes, limiting progress in OR modeling. To this end, we introduce MM-OR, a realistic and large-scale multimodal spatiotemporal OR dataset, and the first dataset to enable multimodal scene graph generation. MM-OR captures comprehensive OR scenes containing RGB-D data, detail views, audio, speech transcripts, robotic logs, and tracking data, and is annotated with panoptic segmentations, semantic scene graphs, and downstream task labels. Further, we propose MM2SG, the first multimodal large vision-language model for scene graph generation, and through extensive experiments, demonstrate its ability to effectively leverage multimodal inputs. Together, MM-OR and MM2SG establish a new benchmark for holistic OR understanding, and open the path towards multimodal scene analysis in complex, high-stakes environments. We will publish all our code and dataset upon acceptance.
Poster
Jinlong Li · Cristiano Saltori · Fabio Poiesi · Nicu Sebe
[ ExHall D ]
Abstract
The lack of a large-scale 3D-text corpus has led recent works to distill open-vocabulary knowledge from vision-language models (VLMs). However, these methods typically rely on a single VLM to align the feature spaces of 3D models within a common language space, which limits the potential of 3D models to leverage the diverse spatial and semantic capabilities encapsulated in various foundation models. In this paper, we propose Cross-modal and Uncertainty-aware Agglomeration for Open-vocabulary 3D Scene Understanding, dubbed CUA-O3D, the first model to integrate multiple foundation models—such as CLIP, DINOv2, and Stable Diffusion—into 3D scene understanding. We further introduce a deterministic uncertainty estimation to adaptively distill and harmonize the heterogeneous 2D feature embeddings from these models. Our method addresses two key challenges: (1) incorporating semantic priors from VLMs alongside the geometric knowledge of spatially-aware vision foundation models, and (2) using a novel deterministic uncertainty estimation to capture model-specific uncertainties across diverse semantic and geometric sensitivities, helping to reconcile heterogeneous representations during training. Extensive experiments on ScanNetV2 and Matterport3D demonstrate that our method not only advances open-vocabulary segmentation but also achieves robust cross-domain alignment and competitive spatial perception capabilities.
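One plausible form of uncertainty-aware multi-teacher distillation (heteroscedastic weighting per teacher), sketched with assumed shapes; it is not necessarily the paper's deterministic estimator:

```python
# Minimal sketch (one plausible form, not necessarily the paper's): distill 2D
# features from several foundation models into 3D point features, weighting each
# teacher by a predicted per-point log-variance (heteroscedastic regression).
import torch
import torch.nn as nn

class UncertaintyDistill(nn.Module):
    def __init__(self, dim_3d, teacher_dims):
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(dim_3d, d) for d in teacher_dims])
        self.log_vars = nn.ModuleList([nn.Linear(dim_3d, 1) for _ in teacher_dims])

    def forward(self, feat_3d, teacher_feats):
        # feat_3d: (P, dim_3d) 3D point features; teacher_feats: list of (P, d_t)
        # 2D features projected onto the same P points (e.g. CLIP, DINOv2, ...)
        loss = 0.0
        for head, lv, target in zip(self.heads, self.log_vars, teacher_feats):
            err = (head(feat_3d) - target).pow(2).mean(dim=-1)   # per-point error
            log_var = lv(feat_3d).squeeze(-1)
            loss = loss + (err * torch.exp(-log_var) + log_var).mean()
        return loss

model = UncertaintyDistill(dim_3d=96, teacher_dims=[512, 768, 1280])
loss = model(torch.randn(1024, 96), [torch.randn(1024, d) for d in (512, 768, 1280)])
```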
Poster
Chenyangguang Zhang · Alexandros Delitzas · Fangjinhua Wang · Ruida Zhang · Xiangyang Ji · Marc Pollefeys · Francis Engelmann
[ ExHall D ]
Abstract
We introduce the task of predicting functional 3D scene graphs for real-world indoor environments from posed RGB-D images. Unlike traditional 3D scene graphs that focus on spatial relationships of objects, functional 3D scene graphs capture objects, interactive elements, and their functional relationships. Due to the lack of training data, we leverage foundation models, including visual language models (VLMs) and large language models (LLMs), to encode functional knowledge. We evaluate our approach on an extended SceneFun3D dataset and a newly collected dataset, FunGraph3D, both annotated with functional 3D scene graphs. Our method significantly outperforms adapted baselines, including Open3DSG and ConceptGraph, demonstrating its effectiveness in modeling complex scene functionalities. We also demonstrate downstream applications such as 3D question answering and robotic manipulation using functional 3D scene graphs.
Poster
Junsheng Wang · Nieqing Cao · Yan Ding · Mengying Xie · Fuqiang Gu · Chao Chen
[ ExHall D ]
Abstract
Generating layouts from textual descriptions with large language models (LLMs) plays a crucial role in precise spatial reasoning-induced domains such as robotic object rearrangement and text-to-image generation. However, current methods face challenges from limited real-world examples, diverse layout descriptions, and varying levels of granularity. To address these issues, a novel framework named Spatial Knowledge Enhanced Layout (SKE-Layout) is introduced. SKE-Layout integrates mixed spatial knowledge sources, leveraging both real and synthetic data to enhance spatial contexts. It utilizes diverse representations tailored to specific tasks and employs contrastive learning and multitask learning techniques for accurate spatial knowledge retrieval. This framework generates more accurate and fine-grained visual layouts for object rearrangement and text-to-image generation tasks, achieving improvements of 5\%-30\% compared to existing methods.
Poster
Hsiang-Wei Huang · Fu-Chen Chen · Wenhao Chai · Che-Chun Su · Lu Xia · Sanghun Jung · Cheng-Yen Yang · Jenq-Neng Hwang · Min Sun · Cheng-Hao Kuo
[ ExHall D ]
Abstract
Recent advancements in 3D Large Multi-modal Models (3D-LMMs) have driven significant progress in 3D question answering. However, recent multi-frame Vision-Language Models (VLMs) demonstrate superior performance compared to 3D-LMMs on 3D question answering tasks, largely due to the greater scale and diversity of available 2D image data in contrast to the more limited 3D data. Multi-frame VLMs, although achieving superior performance, suffer from the difficulty of retaining all the detailed visual information in the 3D scene while limiting the number of visual tokens. Common methods, such as token pooling, reduce visual token usage but often lead to information loss, impairing the model’s ability to preserve visual details essential for 3D question answering tasks. To address this, we propose voxel-based Dynamic Token Compression (DTC), which combines 3D spatial priors and visual semantics to achieve over 90% reduction in visual token usage for current multi-frame VLMs. Our method maintains performance comparable to state-of-the-art models on 3D question answering benchmarks including OpenEQA and ScanQA, demonstrating its effectiveness.
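A minimal sketch of voxel-based token compression with assumed inputs: each visual token carries a 3D location (from depth and pose), and tokens falling into the same voxel are averaged:

```python
# Minimal sketch (illustrative, with assumed inputs): voxel-based token
# compression. Tokens from all frames are grouped by the voxel their 3D
# location falls into, so redundant views of the same region collapse.
import torch

def voxel_compress_tokens(tokens, xyz, voxel_size=0.2):
    # tokens: (N, d) visual tokens from all frames; xyz: (N, 3) world coordinates
    voxel_idx = torch.floor(xyz / voxel_size).long()
    _, inverse = torch.unique(voxel_idx, dim=0, return_inverse=True)
    num_voxels = int(inverse.max()) + 1
    pooled = torch.zeros(num_voxels, tokens.shape[1]).index_add_(0, inverse, tokens)
    counts = torch.zeros(num_voxels).index_add_(0, inverse, torch.ones(len(tokens)))
    return pooled / counts.unsqueeze(1)      # one averaged token per occupied voxel

tokens, xyz = torch.randn(4096, 768), torch.rand(4096, 3) * 5
compressed = voxel_compress_tokens(tokens, xyz)
print(compressed.shape)   # (number of occupied voxels, 768), far fewer than 4096
```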
Poster
Zhihao Yuan · Yibo Peng · Jinke Ren · Yinghong Liao · Yatong Han · Chun-Mei Feng · Hengshuang Zhao · Guanbin Li · Shuguang Cui · Zhen Li
[ ExHall D ]
Abstract
Driven by the great success of Large Language Models (LLMs) in the 2D image domain, their application to 3D scene understanding has emerged as a new trend. A key difference between 3D and 2D is that the situation of an egocentric observer in 3D scenes can change, resulting in different descriptions (e.g., "left" or "right"). However, current LLM-based methods overlook the egocentric perspective and simply use datasets from a global viewpoint. To address this issue, we propose a novel approach to automatically generate a situation-aware dataset by leveraging the scanning trajectory during data collection and utilizing Vision-Language Models (VLMs) to produce high-quality captions and question-answer pairs. Furthermore, we introduce a situation grounding module to explicitly predict the position and orientation of the observer's viewpoint, thereby enabling LLMs to ground situation descriptions in 3D scenes. We evaluate our approach on several benchmarks, demonstrating that our method effectively enhances the 3D situational awareness of LLMs while significantly expanding existing datasets and reducing manual effort.
Poster
Damiano Marsili · Rohun Agrawal · Yisong Yue · Georgia Gkioxari
[ ExHall D ]
Abstract
Visual reasoning -- the ability to interpret the visual world -- is crucial for embodied agents that operate within three-dimensional scenes. Recent progress in AI has led to vision and language models capable of answering questions from images. However, their performance declines when tasked with 3D spatial reasoning. To tackle the complexity of such reasoning problems, we introduce an agentic program synthesis approach where LLM agents collaboratively generate a Pythonic API with new functions to solve common subproblems. Our method overcomes limitations of prior approaches that rely on a static, human-defined API, allowing it to handle a wider range of queries. To better assess AI capabilities for 3D understanding, we introduce a new benchmark of queries involving multiple steps of grounding and inference. We show that our method outperforms prior zero-shot models for visual reasoning in 3D and empirically validate the effectiveness of our agentic framework for 3D spatial reasoning tasks.
Poster
Ziyi Bai · Hanxuan Li · Bin Fu · Chuyan Xiong · Ruiping Wang · Xilin Chen
[ ExHall D ]
Abstract
This paper explores leveraging large language models (LLMs) as low-level action planners for general embodied instruction following tasks. LLMs excel at serving as the “brain” of robots by handling high-level task planning but lack the ability to directly generate precise low-level actions to guide the “body”. This limitation arises from a disconnect between high-level conceptual understanding and low-level spatial perception. We address this challenge by bridging the gap, enabling LLMs to not only understand complex instructions but also produce precise, actionable plans. To achieve this, we introduce Room to Chessboard (R2C), a semantic representation that maps environments onto a grid-based chessboard, enabling LLMs to generate specific low-level coordinates and effectively guide robots as if playing a game of chess. We further propose a Chain-of-Thought Decision (CoT-D) paradigm to enhance the LLMs’ decision-making ability by improving interpretability and context-awareness. By jointly training LLMs for high-level task decomposition and low-level action generation, we create a unified “brain-body” system capable of handling complex, free-form instructions while producing precise low-level actions, enabling robots to adapt to dynamic environments in real time. We validate R2C using both fine-tuned open-source LLMs and GPT-4, demonstrating effectiveness on the challenging ALFRED benchmark. Results show that with our R2C …
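A small sketch of one way a room-to-chessboard mapping could look, discretizing continuous room coordinates into chess-style cell names; the paper's exact grid scheme is an assumption here:

```python
# Minimal sketch (an assumption about the mapping, not the paper's exact scheme):
# continuous room coordinates are discretized onto a chessboard-like grid so an
# LLM can emit low-level positions as cell names.
def to_chessboard(x, y, room_w, room_h, cols=8, rows=8):
    col = min(int(x / room_w * cols), cols - 1)
    row = min(int(y / room_h * rows), rows - 1)
    return f"{chr(ord('a') + col)}{row + 1}"     # e.g. "c5", like a chess square

print(to_chessboard(2.3, 4.1, room_w=6.0, room_h=6.0))   # -> "d6"
```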
Poster
Yi Fang · Bowen Jin · Jiacheng Shen · Sirui Ding · Qiaoyu Tan · Jiawei Han
[ ExHall D ]
Abstract
The rapid development of Multimodal Large Language Models (MLLMs) has enabled the integration of multiple modalities, including texts and images, within the large language model (LLM) framework. However, texts and images are usually interconnected, forming a multimodal attributed graph (MMAG). It is underexplored how MLLMs can incorporate the relational information (i.e., graph structure) and semantic information (i.e., texts and images) on such graphs for multimodal comprehension and generation. In this paper, we propose GraphGPT-o, which supports omni-multimodal understanding and creation on MMAGs. We first comprehensively study linearization variants to transform semantic and structural information as input for MLLMs. Then, we propose a hierarchical aligner that enables deep graph encoding, bridging the gap between MMAGs and MLLMs. Finally, we explore the inference choices, adapting MLLM to interleaved text and image generation in graph scenarios. Extensive experiments on three datasets from different domains demonstrate the effectiveness of our proposed method.
Poster
Yuchen Sun · Shanhui Zhao · Tao Yu · Hao Wen · Samith Va · Mengwei Xu · Yuanchun Li · Chongyang Zhang
[ ExHall D ]
Abstract
GUI agents hold significant potential to enhance the experience and efficiency of human-device interaction. However, current methods face challenges in generalizing across applications (apps) and tasks, primarily due to two fundamental limitations in existing datasets. First, these datasets overlook developer-induced structural variations among apps, limiting the transferability of knowledge across diverse software environments. Second, many of them focus solely on navigation tasks, which restricts their capacity to represent comprehensive software architectures and complex user interactions. To address these challenges, we introduce GUI-Xplore, a dataset meticulously designed to enhance cross-application and cross-task generalization via an exploration-and-reasoning framework. GUI-Xplore integrates pre-recorded exploration videos providing contextual insights, alongside five hierarchically structured downstream tasks designed to comprehensively evaluate GUI agent capabilities. To fully exploit GUI-Xplore's unique features, we propose Xplore-Agent, a GUI agent framework that combines Action-aware GUI Modeling with Graph-Guided Environment Reasoning. Further experiments indicate that Xplore-Agent achieves a 10\% improvement over existing methods in unfamiliar environments, yet there remains significant potential for further enhancement towards truly generalizable GUI agents.
Poster
XiMing Xing · Juncheng Hu · Guotao Liang · Jing Zhang · Dong Xu · Qian Yu
[ ExHall D ]
Abstract
The unprecedented advancements in Large Language Models (LLMs) have shown a profound impact on natural language processing but are yet to fully embrace the realm of scalable vector graphics (SVG) generation. While LLMs encode partial knowledge of SVG data from web pages during training, recent findings suggest that semantically ambiguous and tokenized representations within LLMs may result in hallucinations in vector primitive predictions. Furthermore, LLM training lacks modeling and understanding of the rendering sequence of vector paths, resulting in occlusion between output vector primitives. In this paper, we present LLM4SVG, an initial yet substantial step toward bridging this gap by enabling LLMs to better understand and generate vector graphics. LLM4SVG facilitates a deeper understanding of SVG components through learnable semantic tokens, precisely encoding these tokens and their corresponding properties to generate semantically aligned SVG output. Using a series of learnable semantic tokens, a structured dataset for instruction following is developed to support comprehension and generation across two primary tasks. Our method introduces a modular architecture to existing large language models (LLMs), integrating semantic tags, vector instruction encoders, fine-tuned commands, and powerful LLMs to tightly combine geometric, appearance, and language information. To overcome the scarcity of SVG-text instruction data, we developed …
Poster
Kevin Qinghong Lin · Linjie Li · Difei Gao · Zhengyuan Yang · Shiwei Wu · Zechen Bai · Stan Weixian Lei · Lijuan Wang · Mike Zheng Shou
[ ExHall D ]
Abstract
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity. While most agents are language-based, relying on closed-source APIs with text-rich meta-information (e.g., HTML or accessibility tree), they show limitations in perceiving UI visuals as humans do, highlighting the need for GUI visual agents. In this work, we develop a vision-language-action model in the digital world, namely "Our Model," which features the following innovations: 1. **UI-Guided Visual Token Selection** to reduce computational costs by formulating screenshots as a UI-connected graph, adaptively identifying their redundant relationships and serving as the criteria for token selection during self-attention blocks. 2. **Interleaved Vision-Language-Action Streaming** that flexibly unifies diverse needs within GUI tasks, enabling effective management of visual-action history in navigation or pairing multi-turn query-action sequences per screenshot to enhance training efficiency. 3. **Small-Scale High-Quality GUI Instruction-Following Datasets** built through careful data curation and a resampling strategy to address significant data type imbalances. With the above components, our model, a lightweight 2B model using 256K data, achieves a strong 75.1% accuracy in zero-shot screenshot grounding. Its UI-guided token selection further removes 33% of redundant visual tokens during training and delivers a 1.4× speedup. Navigation experiments across web, mobile, and online environments further …
Poster
Xu Cao · Pranav Virupaksha · Wenqi Jia · Bolin Lai · Fiona Ryan · Sangmin Lee · James Rehg
[ ExHall D ]
Abstract
Previous research in human gesture recognition has largely overlooked multi-person interactions, which are crucial for understanding the social context of naturally occurring gestures. This limitation in existing datasets presents a significant challenge in aligning human gestures with other modalities like language and speech. To address this issue, we introduce SocialGesture, the first large-scale dataset specifically designed for multi-person gesture analysis. SocialGesture features a diverse range of natural scenarios and supports multiple gesture analysis tasks, including video-based recognition and temporal localization, providing a valuable resource for advancing the study of gesture during complex social interactions. Furthermore, we propose a novel visual question answering (VQA) task to benchmark vision language models' (VLMs) performance on social gesture understanding. Our findings highlight several limitations of current gesture recognition models, offering insights into future directions for improvement in this field.
Poster
Jun Gao · Yongqi Li · Ziqiang Cao · Wenjie Li
[ ExHall D ]
Abstract
Chain-of-Thought (CoT) prompting elicits large language models (LLMs) to produce a series of intermediate reasoning steps before arriving at the final answer. However, when transitioning to vision-language models (VLMs), their text-only rationales struggle to express the fine-grained associations with the original image. In this paper, we propose an image-incorporated multimodal Chain-of-Thought, named \textbf{Interleaved-modal Chain-of-Thought (ICoT)}, which generates sequential reasoning steps consisting of paired visual and textual rationales to infer the final answer. Intuitively, the novel ICoT requires VLMs to enable the generation of fine-grained interleaved-modal content, which is hard for current VLMs to fulfill. Considering that the required visual information is usually part of the input image, we propose \textbf{Attention-driven Selection (ADS)} to realize ICoT over existing VLMs. ADS intelligently inserts regions of the input image to generate the interleaved-modal reasoning steps with negligible additional latency. ADS relies solely on the attention map of VLMs without the need for parameterization, and therefore it is a plug-and-play strategy that can be generalized to a spectrum of VLMs. We apply ADS to realize ICoT on two popular VLMs of different architectures. Extensive evaluations on three benchmarks show that ICoT prompting achieves substantial performance (up to 14\%) and interpretability improvements compared to existing multimodal CoT prompting methods.
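A sketch of attention-driven region selection under assumed interfaces: the attention a text token pays to image patch tokens picks a crop that can be inserted as a visual rationale:

```python
# Minimal sketch (assumed interfaces, not the paper's code): use the attention
# from the current text token to image patch tokens to crop a region of the
# input image, which is then inserted into the interleaved reasoning sequence.
import torch

def select_region(attn_to_patches, image, grid=24, topk=8):
    # attn_to_patches: (grid*grid,) attention weights of the current text token
    # image: (3, H, W) input image
    top = attn_to_patches.topk(topk).indices
    rows, cols = top // grid, top % grid
    ph, pw = image.shape[1] // grid, image.shape[2] // grid
    r0, r1 = int(rows.min()) * ph, int(rows.max() + 1) * ph
    c0, c1 = int(cols.min()) * pw, int(cols.max() + 1) * pw
    return image[:, r0:r1, c0:c1]    # crop covering the highly attended patches

image = torch.rand(3, 336, 336)
attn = torch.softmax(torch.randn(24 * 24), dim=0)
crop = select_region(attn, image)
```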
Poster
Guillaume Astruc · Nicolas Gonthier · Clement Mallet · Loic Landrieu
[ ExHall D ]
Abstract
Geospatial models must adapt to the diversity of Earth observation data in terms of resolutions, scales, and modalities. However, existing approaches expect fixed input configurations, which limits their practical applicability. We propose AnySat, a multimodal model based on joint embedding predictive architecture (JEPA) and resolution-adaptive spatial encoders, allowing us to train a single model on highly heterogeneous data in a self-supervised manner. To demonstrate the advantages of this unified approach, we compile GeoPlex, a collection of 5 multimodal datasets with varying characteristics and 11 distinct sensors. We then train a single powerful model on these diverse datasets simultaneously. Once fine-tuned, we achieve better or near state-of-the-art results on the datasets of GeoPlex and 3 additional ones for 4 environment monitoring tasks: land cover mapping, crop type classification, change detection, and forest analysis. We will release all codes, models, and data.
Poster
Peijie Wang · Zhong-Zhi Li · Fei Yin · Dekang Ran · Cheng-Lin Liu
[ ExHall D ]
Abstract
Multimodal Large Language Models (MLLMs) have shown promising capabilities in mathematical reasoning within visual contexts across various datasets. However, most existing multimodal math benchmarks are limited to single-visual contexts, which diverges from the multi-visual scenarios commonly encountered in real-world mathematical applications. To address this gap, we introduce MV-MATH: a meticulously curated dataset of 2,009 high-quality mathematical problems. Each problem integrates multiple images interleaved with text, derived from authentic K-12 scenarios and enriched with detailed annotations. MV-MATH includes multiple-choice, free-form, and multi-step questions, covering 11 subject areas across 3 difficulty levels, and serves as a comprehensive and rigorous benchmark for assessing MLLMs’ mathematical reasoning in multi-visual contexts. Through extensive experimentation, we observe that MLLMs encounter substantial challenges in multi-visual math tasks, with a considerable performance gap relative to human capabilities on MV-MATH. Furthermore, we analyze the performance and error patterns of various models, providing insights into MLLMs' mathematical reasoning capabilities within multi-visual settings.
Poster
James Burgess · Jeffrey J Nirschl · Laura Bravo-Sánchez · Alejandro Lozano · Sanket Rajan Gupte · Jesus G. Galaz-Montoya · Yuhui Zhang · Yuchang Su · Disha Bhowmik · Zachary Coman · Sarina M. Hasan · Alexandra Johannesson · William D. Leineweber · Malvika G Nair · Ridhi Yarlagadda · Connor Zuraski · Wah Chiu · Sarah Cohen · Jan N. Hansen · Manuel D Leonetti · Chad Liu · Emma Lundberg · Serena Yeung
[ ExHall D ]
Abstract
Scientific research demands sophisticated reasoning over multimodal data, a challenge especially prevalent in biology. Despite recent advances in multimodal large language models (MLLMs) for AI-assisted research, existing multimodal reasoning benchmarks target up to college-level difficulty, while research-level benchmarks emphasize lower-level perception, falling short of the complex multimodal reasoning needed for scientific discovery. To bridge this gap, we introduce MicroVQA, a visual-question answering (VQA) benchmark designed to assess three reasoning capabilities vital in research workflows: expert image understanding, hypothesis generation, and experiment proposal. MicroVQA consists of 1,061 multiple-choice questions (MCQs) curated by biological experts across diverse microscopy modalities, ensuring VQA samples represent real scientific practice. We find that standard MCQ creation methods do not properly test our targeted reasoning capabilities, motivating a new two-stage pipeline: an optimized LLM prompt structures question-answer pairs into MCQs; then, an agent-based `RefineBot' generates more challenging distractors. Benchmarking of state-of-the-art MLLMs reveals a peak performance of 43%; models with smaller LLMs only slightly underperform top models, suggesting that language-based reasoning is less challenging than multimodal reasoning; and tuning with scientific articles enhances performance. Expert analysis of chain-of-thought reasoning failures indicates that multimodal reasoning errors are frequent, followed by knowledge errors and overgeneralization. These insights …
Poster
Ashmal Vayani · Dinura Dissanayake · Hasindri Watawana · Noor Ahsan · Nevasini Sasikumar · Omkar Thawakar · Henok Biadglign Ademtew · Yahya Hmaiti · Amandeep Kumar · Kartik Kuckreja · Mykola Maslych · Wafa Al Ghallabi · Mihail Minkov Mihaylov · Chao Qin · Abdelrahman Shaker · Mike Zhang · Mahardika Krisna Ihsani · Amiel Gian Esplana · Monil Gokani · Shachar Mirkin · Harsh Singh · Ashay Srivastava · Endre Hamerlik · Fathinah Asma Izzati · Fadillah Adamsyah Maani · Sebastian Cavada · Jenny Chim · Rohit Gupta · Sanjay Manjunath · Kamila Zhumakhanova · Feno Heriniaina Rabevohitra · Azril Hafizi Amirudin · Muhammad Ridzuan · Daniya Najiha Abdul Kareem · Ketan Pravin More · Kunyang Li · Pramesh Shakya · Muhammad Saad · Amirpouya Ghasemaghaei · Amirbek Djanibekov · Dilshod Azizov · Branislava Jankovic · Naman Bhatia · Alvaro Cabrera Berobide · Johan Obando-Ceron · Olympiah Otieno · Fabian Farestam · Muztoba Rabbani · Sanoojan Baliah · Santosh Sanjeev · Abduragim Shtanchaev · Maheen Fatima · Thao Nguyen · Amrin Kareem · Toluwani Aremu · Nathan Augusto Zacarias Xavier · Amit Bhatkal · Hawau Olamide Toyin · Aman Chadha · Hisham Cholakkal · Rao Anwer · Michael Felsberg · Jorma Laaksonen · Thamar Solorio · Monojit Choudhury · Ivan Laptev · Mubarak Shah · Salman Khan · Fahad Shahbaz Khan
[ ExHall D ]
Abstract
Existing Large Multimodal Models (LMMs) generally focus on only a few regions and languages. As LMMs continue to improve, it is increasingly important to ensure they understand cultural contexts, respect local sensitivities, and support low-resource languages, all while effectively integrating corresponding visual cues. In pursuit of culturally diverse global multimodal models, our proposed All Languages Matter Benchmark (ALM-bench) represents the largest and most comprehensive effort to date for evaluating LMMs across 100 languages. ALM-bench challenges existing models by testing their ability to understand and reason about culturally diverse images paired with text in various languages, including many low-resource languages traditionally underrepresented in multimodal research. The benchmark offers a robust and nuanced evaluation framework featuring various question formats, including True/False, multiple choice, and open-ended questions, which are further divided into short and long-answer categories. ALM-bench design ensures a comprehensive assessment of a model’s ability to handle varied levels of difficulty in visual and linguistic reasoning. To capture the rich tapestry of global cultures, ALM-bench carefully curates content from 13 distinct cultural aspects, ranging from traditions and rituals to famous personalities and celebrations. Through this, ALM-bench not only provides a rigorous testing ground for state-of-the-art open and closed-source LMMs but also highlights …
Poster
Ke Sun · Shen Chen · Taiping Yao · Ziyin Zhou · Jiayi Ji · Xiaoshuai Sun · Chia-Wen Lin · Rongrong Ji
[ ExHall D ]
Abstract
Face manipulation techniques have achieved significant advances, presenting serious challenges to security and social trust. Recent works demonstrate that leveraging multimodal models can enhance the generalization and interpretability of face forgery detection. However, existing annotation approaches, whether through human labeling or direct Multimodal Large Language Model (MLLM) generation, often suffer from hallucination issues, leading to inaccurate text descriptions, especially for high-quality forgeries. To address this, we propose Face Forgery Text Generator (FFTG), a novel annotation pipeline that generates accurate text descriptions by leveraging forgery masks for initial region and type identification, followed by a comprehensive prompting strategy to guide MLLMs in reducing hallucination. We validate our approach through fine-tuning both CLIP with a three-branch training framework combining unimodal and multimodal objectives, and MLLMs with our structured annotations. Experimental results demonstrate that our method not only achieves more accurate annotations with higher region identification accuracy, but also leads to improvements in model performance across various forgery detection benchmarks.
Poster
Zhicheng Wang · Zhiyu Pan · Zhan Peng · Jian Cheng · Liwen Xiao · Wei Jiang · Zhiguo Cao
[ ExHall D ]
Abstract
Referring expression counting (REC) algorithms aim to provide more flexible and interactive counting across varied fine-grained text expressions. However, the requirement for fine-grained attribute understanding poses challenges for prior arts, as they struggle to accurately align attribute information with correct visual patterns. Given the proven importance of ''visual density'', it is presumed that the limitations of current REC approaches stem from an under-exploration of ''contextual attribute density'' (CAD). In the scope of REC, we define the CAD as the measure of the information intensity of one certain fine-grained attribute in visual regions. To model the CAD, we propose a U-shape CAD estimator in which referring expressions and multi-scale visual features from GroundingDINO can interact with each other. With additional density supervision, we can effectively encode CAD, which is subsequently decoded via a novel attention procedure with CAD-refined queries. Integrating all these contributions, our framework significantly outperforms state-of-the-art REC methods, achieving a 30% error reduction in counting metrics and a 10% improvement in localization accuracy. These surprising results shed light on the significance of contextual attribute density for REC. Code will be available.
Poster
Wenlong Fang · Qiaofeng Wu · Jing Chen · Yun Xue
[ ExHall D ]
Abstract
The knowledge-based visual question answering (KB-VQA) task involves using external knowledge about the image to assist reasoning. Building on the impressive performance of multimodal large language model (MLLM), recent methods have commenced leveraging MLLM as an implicit knowledge base for reasoning. However, the direct employment of MLLM with raw external knowledge might result in reasoning errors due to misdirected knowledge information. Additionally, MLLM may lack fine-grained perception of visual features, which can result in hallucinations during reasoning. To address these challenges, we propose Notes-guided MLLM Reasoning (NoteMR), a novel framework that guides MLLM in better reasoning by utilizing knowledge notes and visual notes. Specifically, we initially obtain explicit knowledge from an external knowledge base. Then, this explicit knowledge, combined with images, is used to assist the MLLM in generating knowledge notes. These notes are designed to filter explicit knowledge and identify relevant internal implicit knowledge within the MLLM. We then identify highly correlated regions between the images and knowledge notes, retaining them as image notes to enhance the model's fine-grained perception, thereby mitigating MLLM induced hallucinations. Finally, both notes are fed into the MLLM, enabling a more comprehensive understanding of the image-question pair and enhancing the model's reasoning capabilities. Our …
Poster
Tianyu Huai · Jie Zhou · Xingjiao Wu · Qin Chen · Qingchun Bai · Zezhou · Liang He
[ ExHall D ]
Abstract
Multimodal large language models (MLLMs) have garnered widespread attention from researchers due to their remarkable understanding and generation capabilities in visual language tasks (e.g., visual question answering). However, the rapid pace of knowledge updates in the real world makes offline training of MLLMs costly, and when faced with non-stationary data streams, MLLMs suffer from catastrophic forgetting during learning. In this paper, we propose an MLLMs-based dual momentum Mixture-of-Experts (CL-MoE) framework for continual visual question answering. We integrate MLLMs with continual learning to utilize the rich commonsense knowledge in LLMs. We introduce a Dual-Router MoE (RMoE) to select the global and local experts using task-level and instance-level routers, to robustly assign weights to the experts most appropriate for the task. Then, we design a dynamic Momentum MoE (MMoE) to update the parameters of experts dynamically based on the relationships between the experts and tasks, so that the model can absorb new knowledge while maintaining existing knowledge. The extensive experimental results indicate that our method achieves state-of-the-art performance on 10 VQA tasks, proving the effectiveness of our approach. The codes and weights will be released on GitHub.
Poster
Fan Lu · Wei Wu · Kecheng Zheng · Shuailei Ma · Biao Gong · Jiawei Liu · Wei Zhai · Yang Cao · Yujun Shen · Zheng-Jun Zha
[ ExHall D ]
Abstract
Generating detailed captions comprehending text-rich visual content in images has received growing attention for Large Vision-Language Models (LVLMs). However, few studies have developed benchmarks specifically tailored for detailed captions to measure their accuracy and comprehensiveness. In this paper, we introduce a detailed caption benchmark, termed as CompreCap, to evaluate the visual context from a directed scene graph view. Concretely, we first manually segment the image into semantically meaningful regions (i.e., semantic segmentation mask) according to common-object vocabulary, while also distinguishing attributes of objects within all those regions. Then directional relation labels of these objects are annotated to compose a directed scene graph that can well encode rich compositional information of the image. Based on our directed scene graph, we develop a pipeline to assess the generated detailed captions from LVLMs on multiple levels, including the object-level coverage, the accuracy of attribute descriptions, the score of key relationships, etc. Experimental results on the CompreCap dataset confirm that our evaluation method aligns closely with human evaluation scores across LVLMs. We will release the code and the dataset to support the community.
Poster
Shuxian Li · Changhao He · XitingLiu · Joey Tianyi Zhou · Xi Peng · Peng Hu
[ ExHall D ]
Abstract
Composed Image Retrieval (CIR) enables editable image search by integrating a query pair—a reference image ref and a textual modification mod—to retrieve a target image tar that reflects the intended change. While existing CIR methods have shown promising performance using well-annotated triplets ⟨ref,mod,tar⟩, almost all of them implicitly assume these triplets are accurately associated with each other. In practice, however, this assumption is often violated due to the limited knowledge of annotators, inevitably leading to incorrect textual modifications and resulting in a practical yet less-touched problem: noisy triplet correspondence (NTC). To tackle this challenge, we propose a Task-oriented Modification Enhancement framework (TME) to learn robustly from noisy triplets, which comprises three key modules: Robust Fusion Query (RFQ), Pseudo Text Enhancement (PTE), and Task-Oriented Prompt (TOP). Specifically, to mitigate the adverse impact of noise, RFQ employs a sample selection strategy to divide the training triplets into clean and noisy sets, thus enhancing the reliability of the training data for robust learning. To further leverage the noisy data instead of discarding it, PTE unifies the triplet noise as an adapter mismatch problem, thereby adjusting mod to align with ref and tar in the mismatched triplet. Finally, TOP replaces …
Poster
Eric Xing · Pranavi Kolouju · Robert Pless · Abby Stylianou · Nathan Jacobs
[ ExHall D ]
Abstract
Composed image retrieval (CIR) is the task of retrieving a target image specified by a query image and a relative text that describes a semantic modification to the query image. Existing methods in CIR struggle to accurately represent the image and the text modification, resulting in subpar performance. To address this limitation, we introduce a CIR framework, ConText-CIR, trained with a Text Concept-Consistency loss that encourages the representations of noun phrases in the text modification to better attend to the relevant parts of the query image. To support training with this loss function, we also propose a synthetic data generation pipeline that creates training data from existing CIR datasets or unlabeled images. We show that these components together enable stronger performance on CIR tasks, setting a new state-of-the-art in composed image retrieval in both the supervised and zero-shot settings on the CIRR and CIRCO datasets. Source code, model checkpoints, and our new datasets will be made available upon publication.
Poster
Qiang Zou · Shuli Cheng · Jiayi Chen
[ ExHall D ]
Abstract
Cross-modal hashing is a promising approach for efficient data retrieval and storage optimization. However, contemporary methods exhibit significant limitations in semantic preservation, contextual integrity, and information redundancy, which constrains retrieval efficacy. We present PromptHash, an innovative framework leveraging affinity prompt-aware collaborative learning for adaptive cross-modal hashing. We propose an end-to-end framework for affinity-prompted collaborative hashing, with the following fundamental technical contributions: (i) a text affinity prompt learning mechanism that preserves contextual information while maintaining parameter efficiency, (ii) an adaptive gated selection fusion architecture that synthesizes State Space Model with Transformer network for precise cross-modal feature integration, and (iii) a prompt affinity alignment strategy that bridges modal heterogeneity through hierarchical contrastive learning. To the best of our knowledge, this study presents the first investigation into affinity prompt awareness within collaborative cross-modal adaptive hash learning, establishing a paradigm for enhanced semantic consistency across modalities. Through comprehensive evaluation on three benchmark multi-label datasets, PromptHash demonstrates substantial performance improvements over existing approaches. Notably, on the NUS-WIDE dataset, our method achieves significant gains of 18.22% and 18.65% in image-to-text and text-to-image retrieval tasks, respectively. The code is publicly available at https://anonymous.4open.science/r/PromptHash-8ED3.
Poster
Sungyeon Kim · Xinliang Zhu · Xiaofan Lin · Muhammet Bastan · Douglas Gray · Suha Kwak
[ ExHall D ]
Abstract
Generative retrieval is an emerging approach in information retrieval that generates identifiers (IDs) of target data based on a query, providing an efficient alternative to traditional embedding-based retrieval methods. However, existing models are task-specific and fall short of embedding-based retrieval in performance. This paper proposes GENIUS, a universal generative retrieval framework supporting diverse tasks across multiple modalities and domains. At its core, GENIUS introduces modality-decoupled semantic quantization, transforming multimodal data into discrete IDs encoding both modality and semantics. Moreover, to enhance generalization, we propose a query augmentation that interpolates between a query and its target, allowing GENIUS to adapt to varied query forms. Evaluated on the M-BEIR benchmark, it surpasses prior generative methods by a clear margin. Unlike embedding-based retrieval, GENIUS consistently maintains high retrieval speed across database sizes, with competitive performance across multiple benchmarks. With additional re-ranking, GENIUS often achieves results comparable to embedding-based methods while preserving efficiency.
Poster
Yingxin Lai · Cuijie Xu · Haitian Shi · Guoqing Yang · Xiaoning Li · Zhiming Luo · Shaozi Li
[ ExHall D ]
Abstract
The rapid development of generative models has significantly advanced font generation. However, limited effort has been devoted to the evaluation and interpretability of graphical fonts. In particular, existing quality assessment models can only provide basic visual capabilities, such as recognizing clarity and brightness, and lack in-depth explanation. To address these limitations, we first constructed a large-scale multimodal dataset comprising 135,000 font-text pairs, named the Diversity Font Dataset (DFD). This dataset includes a wide range of generated font types and annotations, including language descriptions and quality assessments, providing a strong basis for training and evaluating font analysis models. Based on the dataset, we developed a Vision Language Model (VLM)-based Font-Agent with the aim of improving font quality assessment and offering interpretative question-answering capabilities. Alongside the original visual encoder in the VLM, we integrated an Edge Aware Traces (EAT) module to capture detailed edge information of font strokes and components. Furthermore, we introduce a Dynamic Direct Preference Optimization (D-DPO) strategy to facilitate efficient model fine-tuning. Experimental outcomes showcase that Font-Agent achieves state-of-the-art performance on the established dataset. To further assess the generalization of our algorithm, we conducted evaluations on several public datasets. The results highlight the notable advantage of Font-Agent in both assessing the quality of …
Poster
Bin Wang · Fan Wu · Linke Ouyang · Zhuangcheng Gu · Rui Zhang · Renqiu Xia · Botian Shi · Bo Zhang · Conghui He
[ ExHall D ]
Abstract
Formula recognition presents significant challenges due to the complicated structure and varied notation of mathematical expressions. Despite continuous advancements in formula recognition models, the evaluation metrics employed by these models, such as BLEU and Edit Distance, still exhibit notable limitations. They overlook the fact that the same formula has diverse representations and is highly sensitive to the distribution of training data, thereby causing unfairness in formula recognition evaluation. To this end, we propose a Character Detection Matching (CDM) metric, ensuring evaluation objectivity by designing an image-level rather than LaTeX-level metric. Specifically, CDM renders both the model-predicted LaTeX and the ground-truth LaTeX formulas into image-formatted formulas, then employs visual feature extraction and localization techniques for precise character-level matching, incorporating spatial position information. Such a spatially-aware and character-matching method offers a more accurate and equitable evaluation compared with previous BLEU and Edit Distance metrics that rely solely on text-based character matching. Experimentally, we evaluated various formula recognition models using CDM, BLEU, and ExpRate metrics. The results demonstrate that CDM aligns more closely with human evaluation standards and provides a fairer comparison across different models by eliminating discrepancies caused by diverse formula representations.
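A heavily simplified sketch of the image-level idea only: both LaTeX strings are rendered to images so that different spellings of the same formula are compared by appearance. The actual CDM performs character detection and spatial matching rather than the naive pixel agreement used here:

```python
# Heavily simplified sketch (not the CDM implementation): render predicted and
# ground-truth LaTeX to images and compare them at the image level, so that
# different LaTeX spellings of the same formula are judged by how they look.
import io
import numpy as np
import matplotlib.pyplot as plt

def render_formula(latex, dpi=200):
    fig = plt.figure(figsize=(4, 1))
    fig.text(0.05, 0.5, f"${latex}$", fontsize=20)   # matplotlib mathtext rendering
    buf = io.BytesIO()
    fig.savefig(buf, format="png", dpi=dpi)
    plt.close(fig)
    buf.seek(0)
    return plt.imread(buf)[..., 0]                   # grayscale channel

pred = render_formula(r"\frac{a}{b} + c^{2}")
gt = render_formula(r"c^2 + \frac{a}{b}")            # same formula, different LaTeX
overlap = np.mean((pred < 0.5) == (gt < 0.5))        # naive pixel agreement, illustration only
```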
Poster
Arun Reddy · Alexander Martin · Eugene Yang · Andrew Yates · Kate Sanders · Kenton Murray · Reno Kriz · Celso M. de Melo · Benjamin Van Durme · Rama Chellappa
[ ExHall D ]
Abstract
In this work, we tackle the problem of text-to-video retrieval (T2VR). Inspired by the success of late interaction techniques in text-document, text-image, and text-video retrieval, our approach, Video-ColBERT, introduces a simple and efficient mechanism for fine-grained similarity assessment between queries and videos. Video-ColBERT is built upon three main components: a fine-grained spatial and temporal token-wise interaction, query and visual expansions, and a dual sigmoid loss during training. We find that this interaction and training paradigm leads to strong individual, yet compatible, representations for encoding video content. These representations lead to increases in performance on common text-to-video retrieval benchmarks compared to other bi-encoder methods.
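The fine-grained token-wise interaction described above can be illustrated with a minimal late-interaction (MaxSim-style) scoring sketch. The shapes and the omission of query/visual expansions and the dual sigmoid loss are simplifying assumptions, not the paper's exact design.

```python
import torch

def maxsim_score(query_tokens: torch.Tensor, video_tokens: torch.Tensor) -> torch.Tensor:
    """query_tokens: (B, Lq, D); video_tokens: (B, Lv, D); returns (B,) relevance scores."""
    q = torch.nn.functional.normalize(query_tokens, dim=-1)
    v = torch.nn.functional.normalize(video_tokens, dim=-1)
    sim = torch.einsum("bqd,bvd->bqv", q, v)      # pairwise token-to-token similarities
    # For each query token, take its best-matching video token, then sum over query tokens.
    return sim.max(dim=-1).values.sum(dim=-1)

scores = maxsim_score(torch.randn(2, 12, 256), torch.randn(2, 196, 256))
print(scores.shape)  # torch.Size([2])
```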
Poster
Jingfeng Yao · Bin Yang · Xinggang Wang
[ ExHall D ]
Abstract
Latent diffusion models (LDMs) with Transformer architectures excel at generating high-fidelity images. However, recent studies reveal an optimization dilemma in this two-stage design: increasing the per-token feature dimension in visual tokenizers improves reconstruction quality but requires substantially larger diffusion models and extended training time to maintain generation performance. This results in prohibitively high computational costs, making high-dimensional tokenizers impractical. In this paper, we argue that this limitation stems from the inherent difficulty of learning unconstrained high-dimensional latent spaces and address this limitation by aligning the latent space with pre-trained vision foundation models. Our VA-VAE (Vision foundation model Aligned Variational AutoEncoder) expands the Pareto frontier of visual tokenizers, enabling 2.7 times faster Diffusion Transformers (DiT) convergence in high-dimensional latent space. To further validate our approach, we optimize a DiT baseline, referred to as LightningDiT, achieving superior performance on class conditional generation with only 6% of the original training epochs. The integrated system demonstrates the effectiveness of VA-VAE, achieving 0.28 rFID and 1.73 gFID on ImageNet-256 generation in 400 epochs—outperforming the original DiT's 0.71 rFID and 2.27 gFID in 1400 epochs, without more complex designs. To our knowledge, this marks the first latent diffusion system to achieve both superior generation and reconstruction …
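A minimal sketch of the latent-alignment idea motivated above: a cosine-similarity loss between (projected) tokenizer latents and frozen vision-foundation-model patch features. The projection head and the loss form are illustrative assumptions, not VA-VAE's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentAlignmentLoss(nn.Module):
    def __init__(self, latent_dim: int, foundation_dim: int):
        super().__init__()
        # Lightweight projection from the tokenizer latent space to the foundation feature space.
        self.proj = nn.Linear(latent_dim, foundation_dim)

    def forward(self, latents: torch.Tensor, foundation_feats: torch.Tensor) -> torch.Tensor:
        # latents: (B, N, latent_dim); foundation_feats: (B, N, foundation_dim), kept frozen.
        pred = F.normalize(self.proj(latents), dim=-1)
        target = F.normalize(foundation_feats.detach(), dim=-1)
        # 1 - cosine similarity, averaged over tokens and batch.
        return (1.0 - (pred * target).sum(dim=-1)).mean()

loss_fn = LatentAlignmentLoss(latent_dim=32, foundation_dim=768)
loss = loss_fn(torch.randn(4, 256, 32), torch.randn(4, 256, 768))
print(loss.item())
```

In practice this term would be added to the usual tokenizer reconstruction objective with some weight; that weighting is not specified here.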
Poster
Leqi Shen · Guoqiang Gong · Tianxiang Hao · Tao He · Yifeng Zhang · Pengzhang Liu · Sicheng Zhao · Jungong Han · Guiguang Ding
[ ExHall D ]
Abstract
The parameter-efficient adaptation of the image-text pretraining model CLIP for video-text retrieval is a prominent area of research. While CLIP is focused on image-level vision-language matching, video-text retrieval demands comprehensive understanding at the video level. Three key discrepancies emerge in the transfer from image-level to video-level: vision, language, and alignment. However, existing methods mainly focus on vision while neglecting language and alignment. In this paper, we propose Discrepancy Reduction in Vision, Language, and Alignment (DiscoVLA), which simultaneously mitigates all three discrepancies. Specifically, we introduce Image-Video Features Fusion to integrate image-level and video-level features, effectively tackling both vision and language discrepancies. Additionally, we generate pseudo image captions to learn fine-grained image-level alignment. To mitigate alignment discrepancies, we propose Image-to-Video Alignment Distillation, which leverages image-level alignment knowledge to enhance video-level alignment. Extensive experiments demonstrate the superiority of our DiscoVLA. In particular, on MSRVTT with CLIP (ViT-B/16), DiscoVLA outperforms previous methods by 2.2% R@1 and 7.5% R@sum. Our code will be made available.
Poster
Siyuan Li · Luyuan Zhang · Zedong Wang · Juanxi Tian · Cheng Tan · Zicheng Liu · Chang Yu · Qingsong Xie · Haonan Lu · Haoqian Wang · Zhen Lei
[ ExHall D ]
Abstract
Masked Image Modeling (MIM) with Vector Quantization (VQ) has achieved great success in self-supervised pre-training and image generation. However, most methods cannot solve two technical challenges simultaneously: (1) the gradient approximation in classical VQ (e.g., VQGAN) creates an optimization bottleneck for the codebook and encoder; (2) the number of kept tokens trades off generation quality against representation learning and efficiency. To unveil the full power of this learning paradigm, this paper introduces MergeVQ, a unified framework that achieves both visual representation learning and generation in an auto-encoding manner with token merging and vector quantization. For pre-training, we merge the full set of spatial tokens into top-k semantic tokens with a token merge module after the self-attention blocks in the encoder, and recover the positions of the merged tokens with cross-attention blocks in the decoder. For generation, we quantize the selected top-k tokens with Lookup-Free Quantization (LFQ), which avoids the training bottleneck, and balance computational overhead against generation quality. Extensive experiments on ImageNet verify that MergeVQ achieves both performance gains and speed-ups across image generation and self-supervised pre-training scenarios.
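As a generic illustration of the Lookup-Free Quantization step mentioned above (not MergeVQ's full pipeline), the sketch below quantizes each latent channel to {-1, +1} with a straight-through gradient and reads the sign pattern as an integer token ID.

```python
import torch

def lfq(latents: torch.Tensor):
    """latents: (B, N, D). Returns quantized latents and integer token IDs."""
    hard = torch.where(latents > 0, torch.ones_like(latents), -torch.ones_like(latents))
    # Straight-through estimator: the forward pass uses the hard codes,
    # while gradients flow back to the continuous latents.
    quantized = latents + (hard - latents).detach()
    # Interpret each token's sign pattern as a D-bit integer ID (no codebook lookup needed).
    bits = (hard > 0).long()
    weights = 2 ** torch.arange(latents.shape[-1], device=latents.device)
    codes = (bits * weights).sum(dim=-1)
    return quantized, codes

q, ids = lfq(torch.randn(2, 16, 10))
print(q.shape, ids.shape, int(ids.max()) < 2 ** 10)  # (2, 16, 10), (2, 16), True
```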
Poster
Alejandro Lozano · Min Woo Sun · James Burgess · Liangyu Chen · Jeffrey J Nirschl · Jeffrey Gu · Ivan Lopez · Josiah Aklilu · Austin Wolfgang Katzer · Collin Chiu · Anita Rau · Xiaohan Wang · Yuhui Zhang · Alfred Seunghoon Song · Robert Tibshirani · Serena Yeung
[ ExHall D ]
Abstract
The development of vision-language models (VLMs) is driven by large-scale and diverse multi-modal datasets. However, progress toward generalist biomedical VLMs is limited by the lack of annotated, publicly accessible datasets across biology and medicine. Existing efforts are limited to narrow domains, missing the opportunity to leverage the full diversity of biomedical knowledge encoded in scientific literature. To address this gap, we introduce BIOMEDICA: a scalable, open-source framework to extract, annotate, and serialize the entirety of the PubMed Central Open Access subset into an easy-to-use, publicly accessible dataset. Our framework produces a comprehensive archive with over 24 million unique image-text pairs from over 6 million articles. Metadata and expert-guided annotations are additionally provided. We demonstrate the utility and accessibility of our resource by releasing BMCA-LIP, a suite of CLIP-style models continuously pre-trained on the BIOMEDICA dataset via streaming (eliminating the need to download 27 TB of data locally). On average, our models achieve state-of-the-art performance across 40 tasks — spanning pathology, radiology, ophthalmology, dermatology, surgery, molecular biology, parasitology, and cell biology — excelling in zero-shot classification with 5.57% average improvement (as high as 26.93% and 17.63% gains in surgery and ophthalmology, respectively) and stronger image-text retrieval while using 10x less compute.
Poster
XuDong Wang · Xingyi Zhou · Alireza Fathi · Trevor Darrell · Cordelia Schmid
[ ExHall D ]
Abstract
We present Visual Lexicon, an image representation that encodes visual information in the text space while retaining intricate visual details that are often challenging to convey in natural language. Unlike traditional methods that prioritize either high-level semantics (e.g., CLIP) or pixel-level reconstruction (e.g., VAE), ViLex captures both rich semantic content and fine visual details, facilitating high-quality image generation and visual scene understanding. Using a self-supervised learning pipeline, ViLex generates embeddings optimized for reconstructing input images through a frozen text-to-image (T2I) diffusion model, preserving the detailed information necessary for high-fidelity semantic level reconstruction. As visual embeddings in the text space, ViLex embeddings can be used independently as text tokens or combined with natural language tokens for zero-shot multimodal image generation. ViLex is also compatible with downstream vision-language tasks like visual question answering and referring expression segmentation, significantly enhancing performance. Experiments demonstrate that ViLex achieves higher fidelity in image reconstruction compared to text-based embeddings—even with a single token. ViLex also performs various DreamBooth tasks in a zero-shot manner without the need for fine-tuning T2I models, and serves as a powerful vision encoder, consistently enhancing vision-language model performance across 15 benchmarks compared to a strong SigLIP baseline.
Poster
Fiona Ryan · Josef Sivic · Fabian Caba Heilbron · Judy Hoffman · James Rehg · Bryan Russell
[ ExHall D ]
Abstract
Personalized vision-language retrieval seeks to recognize new concepts (e.g. "my dog Fido'') from only a few examples. This task is challenging because it requires not only learning a new concept from a few images, but also integrating the personal and general knowledge together to recognize the concept in different contexts. In this paper, we show how to effectively adapt the internal representation of a vision-language dual encoder model for personalized vision-language retrieval. We find that regularized low-rank adaptation of a small set of parameters in the language encoder's final layer serves as a highly effective alternative to textual inversion for recognizing the personal concept while preserving general knowledge. Additionally, we explore strategies for combining parameters of multiple learned personal concepts, finding that parameter addition is effective. To evaluate how well general knowledge is preserved in a finetuned representation, we introduce a metric that measures image retrieval accuracy based on captions generated by a vision language model (VLM). Our approach achieves state-of-the-art accuracy on two benchmarks for personalized image retrieval with natural language queries -- DeepFashion2 and ConConChi -- outperforming the prior art by 4%-22% on personal retrievals.
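A minimal sketch of the two ideas described above: (i) a low-rank update on a single linear layer, standing in here for the language encoder's final layer, and (ii) combining several learned personal concepts by adding their low-rank deltas. The layer choice, rank, scaling, and absence of the regularization term are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # the general-knowledge weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def delta(self) -> torch.Tensor:
        return self.scale * (self.B @ self.A)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + x @ self.delta().T

def merge_concepts(base: nn.Linear, adapters) -> nn.Linear:
    """Combine several personal concepts by summing their low-rank deltas (parameter addition)."""
    merged = nn.Linear(base.in_features, base.out_features)
    with torch.no_grad():
        merged.weight.copy_(base.weight + sum(a.delta() for a in adapters))
        merged.bias.copy_(base.bias)
    return merged

# Usage sketch: one adapter per personal concept; each would be trained on that
# concept's few examples before merging.
base = nn.Linear(512, 512)
fido, rex = LoRALinear(base), LoRALinear(base)
combined = merge_concepts(base, [fido, rex])
```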
Poster
Jingyi Xie · Jintao Yang · Zhunchen Luo · Yunbo Cao · Qiang Gao · Mengyuan Zhang · Wenpeng Hu
[ ExHall D ]
Abstract
Adapting Multi-modal Large Language Models (MLLMs) to target tasks often suffers from catastrophic forgetting, where acquiring new task-specific knowledge compromises performance on pre-trained tasks. In this paper, we introduce AdaDARE-γ, an efficient approach that alleviates catastrophic forgetting by controllably injecting new task-specific knowledge through adaptive parameter selection from fine-tuned models without requiring retraining procedures. This approach consists of two key innovations: (1) an adaptive parameter selection mechanism that identifies and retains the most task-relevant parameters from fine-tuned models, and (2) a controlled task-specific information injection strategy that precisely balances the preservation of pre-trained knowledge with the acquisition of new capabilities. Theoretical analysis proves the optimality of our parameter selection strategy and establishes bounds for the task-specific information injection factor. Extensive experiments on InstructBLIP and LLaVA-1.5 across image captioning and visual question answering tasks demonstrate that AdaDARE-γ establishes new state-of-the-art results in balancing model performance. Specifically, it maintains 98.2\% of pre-training effectiveness on original tasks while achieving 98.7\% of standard fine-tuning performance on target tasks.
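A minimal sketch of the retraining-free recipe described above: take the parameter delta between a fine-tuned and a pre-trained model, keep only a selected subset of entries (here, the largest-magnitude ones, a simplifying assumption), and inject them back scaled by a factor gamma. The selection criterion and the value of gamma are illustrative, not the paper's exact procedure.

```python
import torch

def inject_task_delta(pretrained: dict, finetuned: dict,
                      keep_ratio: float = 0.1, gamma: float = 0.5) -> dict:
    """pretrained/finetuned: name -> weight tensor. Returns the merged state dict."""
    merged = {}
    for name, w_pre in pretrained.items():
        delta = finetuned[name] - w_pre
        k = max(1, int(keep_ratio * delta.numel()))
        # Keep the top-k largest-magnitude delta entries, zero out the rest.
        threshold = delta.abs().flatten().kthvalue(delta.numel() - k + 1).values
        mask = (delta.abs() >= threshold).float()
        merged[name] = w_pre + gamma * mask * delta
    return merged

pre = {"layer.weight": torch.zeros(4, 4)}
fin = {"layer.weight": torch.randn(4, 4)}
print(inject_task_delta(pre, fin)["layer.weight"])
```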
Poster
Pavan Vasu Vasu · Fartash Faghri · Chun-Liang Li · Cem Koc · Nate True · Gokula Krishnan Santhanam · Albert Antony · James Gabriel · Peter Grasch · Oncel Tuzel · Hadi Pouransari
[ ExHall D ]
Abstract
Vision Language Models (VLMs) like LLaVA encode images into tokens aligned to the word embedding space of the LLM decoder. Scaling input image resolution is essential for improving performance, especially in text-rich image understanding tasks. However, popular visual encoders such as CLIP-pretrained ViTs become inefficient at high resolutions due to the large number of tokens and high encoding latency caused by stacked self-attention layers. At different operational resolutions, the vision encoder of a VLM can be optimized along two axes: reducing encoding latency and minimizing the number of visual tokens passed to the LLM, thereby lowering overall latency. In this work, we introduce FastVLM, which achieves an optimized trade-off between resolution, latency, and accuracy by incorporating FastViTHD—a new hybrid vision encoder that outputs fewer tokens and significantly reduces encoding time while processing high-resolution images. We provide a comprehensive efficiency analysis of the interplay between image resolution, vision latency, number of visual tokens, and LLM size. In the LLaVA-1.5 setup, we achieve 3.2× improvement in overall time-to-first-token (TTFT) while maintaining similar performance on VLM benchmarks compared to prior works. On text-rich evaluations like TextVQA and DocVQA, FastVLM obtains +8.4\% and +12.5\% better accuracy than ConvLLaVA at a similar operating point of …
Poster
Zhi Zhang · Srishti Yadav · Fengze Han · Ekaterina Shutova
[ ExHall D ]
Abstract
The recent advancements in auto-regressive multi-modal large language models (MLLMs) have demonstrated promising progress for vision-language tasks. While there exists a variety of studies investigating the processing of linguistic information within large language models, little is currently known about the inner working mechanism of MLLMs and how linguistic and visual information interact within these models. In this study, we aim to fill this gap by examining the information flow between different modalities (language and vision) in MLLMs, focusing on visual question answering. Specifically, given an image-question pair as input, we investigate where in the model and how the visual and linguistic information are combined to generate the final prediction. Conducting experiments with a series of models from the LLaVA series, we find that there are two distinct stages in the process of integration of the two modalities. In the lower layers, the model first transfers the more general visual features of the whole image into the representations of (linguistic) question tokens. In the middle layers, it once again transfers visual information about specific objects relevant to the question to the respective token positions of the question. Finally, in the higher layers, the resulting multimodal representation is propagated to the last position of the …
Poster
Senqiao Yang · Yukang Chen · Zhuotao Tian · Chengyao Wang · Jingyao Li · Bei Yu · Jiaya Jia
[ ExHall D ]
Abstract
Recent advancements in vision-language models have enhanced performance by increasing the length of visual tokens, making them much longer than text tokens and significantly raising computational costs. However, we observe that the visual tokens generated by popular vision encoders, such as CLIP and SigLIP, contain significant redundancy. To address this, we introduce VisionZip, a simple yet effective method that selects a set of informative tokens for input to the language model, reducing visual token redundancy and improving efficiency while maintaining model performance. The proposed VisionZip can be widely applied to image and video understanding tasks and is well-suited for multi-turn dialogues in real-world scenarios, where previous methods tend to underperform. Experimental results show that VisionZip outperforms the previous state-of-the-art method by at least 5\% performance gains across nearly all settings. Moreover, our method significantly enhances model inference speed, improving the prefilling time by 8× and enabling the LLaVA-Next 13B model to infer faster than the LLaVA-Next 7B model while achieving better results. Furthermore, we analyze the causes of this redundancy and encourage the community to focus on extracting better visual features rather than merely increasing token length. All code and models will be publicly available.
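The token-selection idea above can be illustrated with a minimal attention-based sketch: rank patch tokens by the attention they receive from the [CLS] token and keep only the top-k before passing them to the language model. Using [CLS] attention as the ranking signal is a simplifying assumption, not necessarily VisionZip's exact criterion.

```python
import torch

def select_visual_tokens(patch_tokens: torch.Tensor,
                         cls_attention: torch.Tensor,
                         keep: int = 64) -> torch.Tensor:
    """patch_tokens: (B, N, D); cls_attention: (B, N) attention from [CLS] to each patch."""
    idx = cls_attention.topk(keep, dim=-1).indices                 # (B, keep)
    idx = idx.unsqueeze(-1).expand(-1, -1, patch_tokens.shape[-1])
    return patch_tokens.gather(dim=1, index=idx)                   # (B, keep, D)

tokens = torch.randn(2, 576, 1024)   # e.g., 24x24 patch tokens from a CLIP-style encoder
attn = torch.rand(2, 576)
print(select_visual_tokens(tokens, attn).shape)  # torch.Size([2, 64, 1024])
```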
Poster
Cheng Yang · Yang Sui · Jinqi Xiao · Lingyi Huang · Yu Gong · Chendi Li · Jinghua Yan · Yu Bai · Ponnuswamy Sadayappan · Xia Hu · Bo Yuan
[ ExHall D ]
Abstract
Vision-Language Models (VLMs) demand substantial computational resources during inference, largely due to the extensive visual input tokens required to represent visual information. Previous studies have observed that visual tokens often receive less attention than other tokens, such as system and instruction tokens, highlighting their lower relative importance during VLM inference and then pruning redundant visual tokens. However, previous approaches to token pruning encounter several challenges: reliance on heuristic criteria for token importance and incompatibility with FlashAttention and KV cache. To address these issues, we introduce TopV, a compatible Token Pruning with inference Time Optimization for fast and low-memory VLM, achieving efficient pruning without additional training or fine-tuning. Instead of relying on attention scores as the importance metric in the previous works, we formulate token pruning as an optimization problem, allowing us to accurately identify important visual tokens. By avoiding the need for attention scores, our approach maintains compatibility with FlashAttention. Additionally, since we only perform this pruning once during the prefilling stage, it effectively reduces KV cache size. Our optimization framework incorporates several critical components. First, given the to-be-pruned source tokens, we investigate the appropriate positions of target tokens within the VLM layer. Then, we define a visual-aware cost function …
Poster
Wangbo Zhao · Yizeng Han · Jiasheng Tang · Zhikai Li · Yibing Song · Kai Wang · Zhangyang Wang · Yang You
[ ExHall D ]
Abstract
Vision-language models (VLMs) have shown remarkable success across various multi-modal tasks, yet large VLMs encounter significant efficiency challenges due to processing numerous visual tokens. A promising approach to accelerating large VLM inference is using partial information, such as attention maps from specific layers, to assess token importance and prune less essential tokens. However, our study reveals three key insights: (i) Partial attention information is insufficient for accurately identifying critical visual tokens, resulting in suboptimal performance, especially at low token retention ratios; (ii) Global attention information, such as the attention map aggregated across all layers, more effectively preserves essential tokens and maintains performance under aggressive pruning. However, it requires a full inference pass, which increases computational load and is therefore impractical in existing methods; and (iii) The global attention map aggregated from a small VLM closely resembles that of a large VLM, suggesting an efficient alternative. Based on these findings, we introduce Small VLM Guidance for Large VLMs (SGL). Specifically, we employ the aggregated attention map from a small VLM to guide the pruning of visual tokens in a large VLM. Additionally, we develop a small VLM early exiting mechanism to make full use of the small VLM's predictions, dynamically invoking the …
Poster
Souhail Hadgi · Luca Moschella · Andrea Santilli · Diego Gomez · Qixing Huang · Emanuele Rodolà · Simone Melzi · Maks Ovsjanikov
[ ExHall D ]
Abstract
Recent works have shown that, when trained at scale, uni-modal 2D vision and text encoders converge to learned features that share remarkable structural properties, despite arising from different representations. However, the role of 3D encoders with respect to other modalities remains unexplored. Furthermore, existing 3D foundation models that leverage large datasets are typically trained with explicit alignment objectives with respect to frozen encoders from other representations. In this work, we investigate the possibility of a posteriori alignment of representations obtained from uni-modal 3D encoders compared to text-based feature spaces. We show that naive post-training feature alignment of uni-modal text and 3D encoders results in limited performance. We then focus on extracting subspaces of the corresponding feature spaces and discover that by projecting learned representations onto well-chosen lower-dimensional subspaces the quality of alignment becomes significantly higher, leading to improved accuracy on matching and retrieval tasks. Our analysis further sheds light on the nature of these shared subspaces, which roughly separate between semantic and geometric data representations. Overall, ours is the first work that helps to establish a baseline for post-training alignment of 3D uni-modal and text feature spaces, and helps to highlight both the shared and unique properties of 3D data …
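A minimal sketch of the subspace-projection idea described above: reduce each uni-modal feature space to a low-dimensional subspace (here via PCA per modality, a simplifying assumption) before matching, and measure pairwise retrieval accuracy. This illustrates the general recipe, not the paper's method for choosing the subspaces.

```python
import torch

def pca_project(feats: torch.Tensor, dim: int) -> torch.Tensor:
    """feats: (N, D). Center, then project onto the top-`dim` principal directions."""
    centered = feats - feats.mean(dim=0, keepdim=True)
    _, _, v = torch.pca_lowrank(centered, q=dim)
    return centered @ v[:, :dim]

def retrieval_accuracy(a: torch.Tensor, b: torch.Tensor) -> float:
    """Fraction of items in `a` whose nearest neighbour in `b` is the paired item."""
    a = torch.nn.functional.normalize(a, dim=-1)
    b = torch.nn.functional.normalize(b, dim=-1)
    nearest = (a @ b.T).argmax(dim=-1)
    return (nearest == torch.arange(a.shape[0])).float().mean().item()

# Toy usage with random stand-ins for paired 3D and text features.
feats_3d, feats_text = torch.randn(100, 512), torch.randn(100, 384)
proj_3d, proj_text = pca_project(feats_3d, 32), pca_project(feats_text, 32)
print(retrieval_accuracy(proj_3d, proj_text))
```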
Poster
Yahan Tu · Rui Hu · Jitao Sang
[ ExHall D ]
Abstract
Hallucination poses a persistent challenge for multimodal large language models (MLLMs). However, existing benchmarks for evaluating hallucinations are generally static, which may overlook the potential risk of data contamination. To address this issue, we propose ODE, an open-set, dynamic protocol designed to evaluate object hallucinations in MLLMs at both the existence and attribute levels. ODE employs a graph-based structure to represent real-world object concepts, their attributes, and the distributional associations between them. This structure facilitates the extraction of concept combinations based on diverse distributional criteria, generating varied samples for structured queries that evaluate hallucinations in both generative and discriminative tasks. Through the generation of new samples, dynamic concept combinations, and varied distribution frequencies, ODE mitigates the risk of data contamination and broadens the scope of evaluation. This protocol is applicable to both general and specialized scenarios, including those with limited data. Experimental results demonstrate the effectiveness of our protocol, revealing that MLLMs exhibit higher hallucination rates when evaluated with ODE-generated samples, which indicates potential data contamination. Furthermore, these generated samples aid in analyzing hallucination patterns and fine-tuning models, offering an effective approach to mitigating hallucinations in MLLMs.
Poster
jiajun cao · Yuan Zhang · Tao Huang · Ming Lu · Qizhe Zhang · Ruichuan An · Ningning Ma · Shanghang Zhang
[ ExHall D ]
Abstract
Visual encoders are fundamental components in vision-language models (VLMs), each showcasing unique strengths derived from various pre-trained visual foundation models. To leverage the various capabilities of these encoders, recent studies incorporate multiple encoders within a single VLM, leading to a considerable increase in computational cost. In this paper, we present Mixture-of-Visual-Encoder Knowledge Distillation (MoVE-KD), a novel framework that distills the unique proficiencies of multiple vision encoders into a single, efficient encoder model. Specifically, to mitigate conflicts and retain the unique characteristics of each teacher encoder, we employ low-rank adaptation (LoRA) and mixture-of-experts (MoEs) to selectively activate specialized knowledge based on input features, enhancing both adaptability and efficiency. To regularize the KD process and enhance performance, we propose an attention-based distillation strategy that adaptively weighs the different visual encoders and emphasizes valuable visual tokens, reducing the burden of replicating comprehensive but distinct features from multiple teachers. Comprehensive experiments on popular VLMs, such as LLaVA and LLaVA-NeXT, validate the effectiveness of our method. The code will be released.
Poster
Jiazhen Liu · Yuhan Fu · Ruobing Xie · Runquan Xie · Xingwu Sun · Fengzong Lian · Zhanhui Kang · Xirong Li
[ ExHall D ]
Abstract
Multimodal Large Language Models (MLLMs) hallucinate, resulting in an emerging topic of visual hallucination evaluation (VHE). This paper contributes a ChatGPT-Prompted visual hallucination evaluation Dataset (PhD) for objective VHE at a large scale. The essence of VHE is to ask an MLLM questions about specific images to assess its susceptibility to hallucination. Depending on what to ask (objects, attributes, sentiment, etc.) and how the questions are asked, we structure PhD along two dimensions, i.e., task and mode. Five visual recognition tasks, ranging from low-level (object / attribute recognition) to middle-level (sentiment / position recognition and counting), are considered. Besides a normal visual QA mode, which we term PhD-base, PhD also asks questions with inaccurate context (PhD-iac) or with incorrect context (PhD-icc), or with AI-generated counter common sense images (PhD-ccs). We construct PhD by a ChatGPT-assisted semi-automated pipeline, encompassing four pivotal modules: task-specific hallucinatory item (hitem) selection, hitem-embedded question generation, inaccurate / incorrect context generation, and counter-common-sense (CCS) image generation. With over 14k daily images, 750 CCS images and 102k VQA triplets in total, PhD reveals considerable variability in MLLMs' performance across various modes and tasks, offering valuable insights into the nature of hallucination. As such, PhD stands as a potent …
Poster
Yongting Zhang · Lu Chen · Guodong Zheng · Yifeng Gao · Rui Zheng · Jinlan Fu · Zhenfei Yin · Senjie Jin · Yu Qiao · Xuanjing Huang · Feng Zhao · Tao Gui · Jing Shao
[ ExHall D ]
Abstract
The emergence of Vision Language Models (VLMs) has brought unprecedented advances in understanding multimodal information. The combination of textual and visual semantics in VLMs is highly complex and diverse, making the safety alignment of these models challenging. Furthermore, due to the limited study on the safety alignment of VLMs, there is a lack of large-scale, high-quality datasets. To address these limitations, we propose a Safety Preference Alignment dataset for Vision Language Models named SPA-VL. In terms of breadth, SPA-VL covers 6 harmfulness domains, 13 categories, and 53 subcategories, and contains 100,788 samples of the quadruple (question, image, chosen response, rejected response). In terms of depth, the responses are collected from 12 open-source (e.g., QwenVL) and closed-source (e.g., Gemini) VLMs to ensure diversity. The construction of preference data is fully automated, and the experimental results indicate that models trained with alignment techniques on the SPA-VL dataset exhibit substantial improvements in harmlessness and helpfulness while maintaining core capabilities. SPA-VL, as a large-scale, high-quality, and diverse dataset, represents a significant milestone in ensuring that VLMs achieve both harmlessness and helpfulness.
Poster
Yanbo Wang · Jiyang Guan · Jian Liang · Ran He
[ ExHall D ]
Abstract
Multi-modal large language models (MLLMs) have made significant progress, yet their safety alignment remains limited. Typically, current open-source MLLMs rely on the alignment inherited from their language module to avoid harmful generations. However, the lack of safety measures specifically designed for multi-modal inputs creates an alignment gap, leaving MLLMs vulnerable to vision-domain attacks such as typographic manipulation. Current methods utilize a carefully designed safety dataset to enhance model defense capability. However, it is unknown what is actually learned in the high-quality dataset. Through comparison experiments, we find that the alignment gap primarily arises from data distribution biases, while image content, response quality, or the contrastive behavior of the dataset makes little contribution to boosting multi-modal safety. To further investigate this and identify the key factors in improving MLLM safety, we propose finetuning MLLMs on a small set of benign instruct-following data with responses replaced by simple, clear rejection sentences. Experiments show that, without the need for labor-intensive collection of high-quality malicious data, model safety can still be significantly improved, as long as a specific fraction of rejection data exists in the finetuning set, indicating the security alignment is not lost but rather obscured during multi-modal pretraining or instruction fine-tuning. Simply correcting the underlying …
Poster
Shuyang Hao · Bryan Hooi · Jun Liu · Kai-Wei Chang · Zi Huang · Yujun Cai
[ ExHall D ]
Abstract
Despite inheriting security measures from underlying language models, Vision-Language Models (VLMs) may still be vulnerable to safety alignment issues. Through empirical analysis, we uncover two critical findings: scenario-matched images can significantly amplify harmful outputs, and contrary to common assumptions in gradient-based attacks, minimal loss values do not guarantee optimal attack effectiveness. Building on these insights, we introduce MLAI (Multi-Loss Adversarial Images), a novel jailbreak framework that leverages scenario-aware image generation for semantic alignment, exploits flat minima theory for robust adversarial image selection, and employs multi-image collaborative attacks for enhanced effectiveness. Extensive experiments demonstrate MLAI's significant impact, achieving attack success rates of 77.75\% on MiniGPT-4 and 82.80\% on LLaVA-2, substantially outperforming existing methods by margins of 34.37\% and 12.77\% respectively. Furthermore, MLAI shows considerable transferability to commercial black-box VLMs, achieving up to 60.11\% success rate. Our work reveals fundamental visual vulnerabilities in current VLMs safety mechanisms and underscores the need for stronger defenses. Warning: This paper contains potentially harmful example text.
Poster
Jiaming Zhang · Junhong Ye · Xingjun Ma · Yige Li · Yunfan Yang · Yunhao Chen · Jitao Sang · Dit-Yan Yeung
[ ExHall D ]
Abstract
Due to their multimodal capabilities, Vision-Language Models (VLMs) have found numerous impactful applications in real-world scenarios. However, recent studies have revealed that VLMs are vulnerable to image-based adversarial attacks, particularly targeted adversarial images that manipulate the model to generate harmful content specified by the adversary. Current attack methods rely on predefined target labels to create targeted adversarial attacks, which limits their scalability and applicability for large-scale robustness evaluations. In this paper, we propose **AnyAttack**, a self-supervised framework that generates targeted adversarial images for VLMs without label supervision, allowing **any** image to serve as a target for the **attack**. Our framework employs the "pre-training and fine-tuning" paradigm, with the adversarial noise generator pre-trained on the large-scale LAION-400M dataset. This large-scale pre-training endows our method with powerful transferability across a wide range of VLMs. Extensive experiments on five mainstream open-source VLMs (CLIP, BLIP, BLIP2, InstructBLIP, and MiniGPT-4) across three multimodal tasks (image-text retrieval, multimodal classification, and image captioning) demonstrate the effectiveness of our attack. Additionally, we successfully transfer AnyAttack to multiple commercial VLMs, including Google Gemini, Claude Sonnet, Microsoft Copilot and OpenAI GPT. These results reveal an unprecedented risk to VLMs, highlighting the need for effective countermeasures.
Poster
Xin Wang · Kai Chen · Jiaming Zhang · Jingjing Chen · Xingjun Ma
[ ExHall D ]
Abstract
Large pre-trained Vision-Language Models (VLMs) such as CLIP have demonstrated excellent zero-shot generalizability across various downstream tasks. However, recent studies have shown that the inference performance of CLIP can be greatly degraded by small adversarial perturbations, especially its visual modality, posing significant safety threats. To mitigate this vulnerability, in this paper, we propose a novel defense method called Test-Time Adversarial Prompt Tuning (TAPT) to enhance the inference robustness of CLIP against visual adversarial attacks. TAPT is a test-time defense method that learns defensive bimodal (textual and visual) prompts to robustify the inference process of CLIP. Specifically, it is an unsupervised method that optimizes the defensive prompts for each test sample by minimizing a multi-view entropy and aligning adversarial-clean distributions. We evaluate the effectiveness of TAPT on 11 benchmark datasets, including ImageNet and 10 other zero-shot datasets, demonstrating that it enhances the zero-shot adversarial robustness of the original CLIP by at least 48.9\% against AutoAttack (AA), while largely maintaining performance on clean examples. Moreover, TAPT outperforms existing adversarial prompt tuning methods across various backbones, achieving an average robustness improvement of at least 36.6\%.
Poster
Baoshun Tong · Hanjiang Lai · Yan Pan · Jian Yin
[ ExHall D ]
Abstract
Pre-trained Vision-Language Models (VLMs) like CLIP, have demonstrated strong zero-shot generalization capabilities. Despite their effectiveness on various downstream tasks, they remain vulnerable to adversarial samples. Existing methods fine-tune VLMs to improve their performance via performing adversarial training on a certain dataset. However, this can lead to model overfitting and is not a true zero-shot scenario. In this paper, we propose a truly zero-shot and training-free approach that can significantly improve the VLM's zero-shot adversarial robustness. Specifically, we first discover that simply adding Gaussian noise greatly enhances the VLM's zero-shot performance. Then, we treat the adversarial examples with added Gaussian noise as anchors and strive to find a path in the embedding space that leads from the adversarial examples to the cleaner samples. We improve the VLMs' generalization abilities in a truly zero-shot and training-free manner compared to previous methods. Extensive experiments on 16 datasets demonstrate that our method can achieve state-of-the-art zero-shot robust performance, improving the top-1 robust accuracy by an average of 9.77%. The code will be publicly available.
Poster
Julio Silva-Rodríguez · Ismail Ben Ayed · Jose Dolz
[ ExHall D ]
Abstract
Vision-language models pre-trained at large scale have shown unprecedented adaptability and generalization to downstream tasks. Although their discriminative potential has been widely explored, their reliability and uncertainty are still overlooked. In this work, we investigate the capabilities of CLIP models under the split conformal prediction paradigm, which provides theoretical guarantees to black-box models based on a small, labeled calibration set. In contrast to the main body of literature on conformal predictors in vision classifiers, foundation models exhibit a particular characteristic: they are pre-trained on a one-time basis on an inaccessible source domain, different from the transferred task. This domain drift negatively affects the efficiency of the conformal sets and poses additional challenges. To alleviate this issue, we propose Conf-OT, a transfer learning setting that operates transductively over the combined calibration and query sets. Solving an optimal transport problem, the proposed method bridges the domain gap between pre-training and adaptation without requiring additional data splits while still maintaining coverage guarantees. We comprehensively explore this conformal prediction strategy on a broad span of 15 datasets and three popular non-conformity scores. Conf-OT provides consistent relative improvements of up to 20% on set efficiency while being 15× faster than popular transductive approaches.
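For readers unfamiliar with the split conformal prediction setting the paper builds on, the sketch below computes non-conformity scores on a labeled calibration set, takes the finite-sample-corrected quantile, and forms prediction sets on query samples. The optimal-transport transduction that defines Conf-OT is not shown; the score choice (one minus the softmax probability of the true class) is also an assumption.

```python
import numpy as np

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """cal_probs: (n, C) softmax outputs; cal_labels: (n,); test_probs: (m, C)."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]            # non-conformity scores
    q_level = np.ceil((n + 1) * (1 - alpha)) / n                  # finite-sample correction
    qhat = np.quantile(scores, min(q_level, 1.0), method="higher")
    # A class enters the prediction set if its score does not exceed the threshold.
    return [np.where(1.0 - p <= qhat)[0] for p in test_probs]

rng = np.random.default_rng(0)
cal = rng.dirichlet(np.ones(10), size=500)
labels = rng.integers(0, 10, size=500)
test = rng.dirichlet(np.ones(10), size=5)
print([len(s) for s in conformal_sets(cal, labels, test)])  # per-sample prediction set sizes
```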
Poster
Ashshak Sharifdeen · Muhammad Akhtar Munir · Sanoojan Baliah · Salman Khan · Muhammad Haris Khan
[ ExHall D ]
Abstract
Test-time prompt tuning for vision-language models (VLMs) is getting attention due to its ability to learn with unlabeled data without fine-tuning. Although test-time prompt tuning methods for VLMs can boost accuracy, the resulting models tend to demonstrate poor calibration, which casts doubts on the reliability and trustworthiness of these models. Notably, more attention needs to be devoted to calibrating the test-time prompt tuning in vision-language models. To this end, we propose a new approach, called O-TPT, that introduces orthogonality constraints on the textual features corresponding to the learnable prompts for calibrating test-time prompt tuning in VLMs. Towards introducing orthogonality constraints, we make the following contributions. First, we uncover new insights behind the suboptimal calibration performance of existing methods relying on textual feature dispersion. Second, we show that imposing a simple orthogonalization of textual features is a more effective approach towards obtaining textual dispersion. We conduct extensive experiments on various datasets with different backbones and baselines. Results indicate that our method consistently outperforms the state-of-the-art in significantly reducing the overall average calibration error. Also, our method surpasses the zero-shot calibration performance on fine-grained classification tasks. Our code will be made public upon acceptance.
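A minimal sketch of an orthogonality regularizer on textual features, in the spirit of the constraint described above: penalize off-diagonal entries of the Gram matrix of L2-normalized per-class text features. How this term is weighted and combined with the test-time tuning objective is not shown and is an assumption.

```python
import torch

def orthogonality_loss(text_features: torch.Tensor) -> torch.Tensor:
    """text_features: (C, D), one feature vector per class prompt."""
    feats = torch.nn.functional.normalize(text_features, dim=-1)
    gram = feats @ feats.T                                # (C, C) cosine similarities
    identity = torch.eye(feats.shape[0], device=feats.device)
    # Push class features toward mutual orthogonality (diagonal stays at 1).
    return ((gram - identity) ** 2).mean()

print(orthogonality_loss(torch.randn(10, 512)).item())
```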
Poster
Yicheng Chen · Xiangtai Li · Yining Li · Yanhong Zeng · Jianzong Wu · Xiangyu Zhao · Kai Chen
[ ExHall D ]
Abstract
Diffusion models can generate realistic and diverse images, potentially facilitating data availability for data-intensive perception tasks. However, leveraging these models to boost performance on downstream tasks with synthetic data poses several challenges, including aligning with real data distribution, scaling synthetic sample volumes, and ensuring their quality. To bridge these gaps, we present Auto Cherry-Picker (ACP), a novel framework that generates high-quality cross-modality training samples at scale to augment perception and multi-modal training. ACP first uses LLMs to sample descriptions and layouts based on object combinations from real data priors, eliminating the need for ground truth image captions or annotations. Next, we use an off-the-shelf controllable diffusion model to generate multiple images. Then, the generated data are refined using a comprehensively designed metric, Composite Layout and Image Score (CLIS), to ensure quality. Our customized synthetic high-quality samples boost performance in various scenarios, especially in addressing challenges associated with long-tailed distribution and imbalanced datasets. Experiment results on downstream tasks demonstrate that ACP can significantly improve the performance of existing models. In addition, we find a positive correlation between CLIS and performance gains in downstream tasks. This finding suggests that such evaluation metrics can play a broader role in various visual perception and MLLM tasks. …
Poster
Amir Bar · Gaoyue Zhou · Danny Tran · Trevor Darrell · Yann LeCun
[ ExHall D ]
Abstract
Navigation is a fundamental skill of agents with visual-motor capabilities. We propose a Navigation World Model (NWM), a controllable video generation model that predicts the future visual observation given the past observations and navigation actions. NWM is a Conditional Diffusion Transformer (CDiT) trained on the video footage of robots as well as unlabeled egocentric video data. We scale the model up to 1B parameters and train it over human and robot agents data from numerous environments and embodiments. Our model scales favorably on known and unknown environments and can leverage unlabeled egocentric video data. NWM exhibits improved navigation planning skills either by planning from scratch or by ranking proposals from an external navigation policy. Compared to existing supervised navigation models, which are "hard-coded", NWM can incorporate new constraints when planning trajectories. NWM learns visual priors that enable it to imagine navigation trajectories based on just a single input image.
Poster
Bikang Pan · Qun Li · Xiaoying Tang · Wei Huang · Zhen Fang · Feng Liu · Jingya Wang · Jingyi Yu · Ye Shi
[ ExHall D ]
Abstract
The emergence of vision-language foundation models, such as CLIP, has revolutionized image-text representation, enabling a broad range of applications via prompt learning. Despite its promise, real-world datasets often contain noisy labels that can degrade prompt learning performance. In this paper, we demonstrate that using mean absolute error (MAE) loss in prompt learning, named PromptMAE, significantly enhances robustness against noisy labels while maintaining high accuracy. Though MAE is straightforward and recognized for its robustness, it is rarely used in noisy-label learning due to its slow convergence and poor performance outside prompt learning scenarios. To elucidate the robustness of PromptMAE, we leverage feature learning theory to show that MAE can suppress the influence of noisy samples, thereby improving the signal-to-noise ratio and enhancing overall robustness. Additionally, we introduce PromptOT, a prompt-based optimal transport data purification method to enhance the robustness further. PromptOT employs text encoder representations in vision-language models as prototypes to construct an optimal transportation matrix. This matrix effectively partitions datasets into clean and noisy subsets, allowing for the application of cross-entropy loss to the clean subset and MAE loss to the noisy subset. Our Noise-Label Prompt Learning method, named NLPrompt, offers a simple and efficient approach that leverages the expressive …
Poster
Long Tung Vuong · Hoang Phan · Vy Vo · Anh Tuan Bui · Thanh-Toan Do · Trung Le · Dinh Phung
[ ExHall D ]
Abstract
Recent approaches leveraging multi-modal pre-trained models like CLIP for Unsupervised Domain Adaptation (UDA) have shown significant promise in bridging domain gaps and improving generalization by utilizing rich semantic knowledge and robust visual representations learned through extensive pre-training on diverse image-text datasets. While these methods achieve state-of-the-art performance across benchmarks, much of the improvement stems from base pseudo-labels (CLIP zero-shot predictions) and self-training mechanisms. Thus, the training mechanism exhibits a key limitation wherein the visual embedding distribution in target domains deviates from the visual embedding distribution of the pre-trained model, leading to misguided signals from class descriptions. This work introduces a fresh solution that reinforces these pseudo-labels and facilitates target-prompt learning by exploiting the geometry of visual and text embeddings, an aspect that is overlooked by existing methods. We first propose to directly leverage the reference predictions (from source prompts) based on the relationship between source and target visual embeddings. We later show that there is a strong clustering behavior observed between visual and text embeddings in pre-trained multi-modal models. Building on optimal transport theory, we then transform this insight into a novel strategy to enforce the clustering property in text embeddings. Our experiments and ablation studies validate the effectiveness of the proposed …
Poster
Tianyu Yu · Haoye Zhang · Qiming Li · Qixin Xu · Yuan Yao · Da Chen · Xiaoman Lu · Ganqu Cui · Yunkai Dang · Taiwen He · Xiaocheng Feng · Jun Song · Bo Zheng · Zhiyuan Liu · Tat-seng Chua · Maosong Sun
[ ExHall D ]
Abstract
Traditional feedback learning for hallucination reduction relies on labor-intensive manual labeling or expensive proprietary models. This leaves the community without foundational knowledge about how to build high-quality feedback with open-source MLLMs. In this work, we introduce RLAIF-V, a novel framework that aligns MLLMs in a fully open-source paradigm. RLAIF-V maximally explores open-source MLLMs from two perspectives, including high-quality feedback data generation for preference learning and self-feedback guidance for inference-time scaling. Extensive experiments on seven benchmarks in both automatic and human evaluation show that RLAIF-V substantially enhances the trustworthiness of models at both preference learning and inference time. RLAIF-V 7B reduces object hallucination by 80.7\% and overall hallucination by 33.7\%. Remarkably, RLAIF-V 12B further reveals the self-alignment potential of open-source MLLMs, where the model can learn from feedback of itself to achieve super GPT-4V trustworthiness.
Poster
Jiahao Xie · Alessio Tonioni · Nathalie Rauschmayr · Federico Tombari · Bernt Schiele
[ ExHall D ]
Abstract
Visual in-context learning (ICL), as a new paradigm in computer vision, allows the model to rapidly adapt to various tasks with only a handful of prompts and examples. While effective, the existing visual ICL paradigm exhibits poor generalizability under distribution shifts. In this work, we propose test-time Visual In-Context Tuning (VICT), a method that can learn adaptive visual ICL models on the fly with a single test sample. Specifically, we flip the role between task prompts and the test sample and use a cycle consistency loss to reconstruct the original task prompt output. Our key insight is that a model should be aware of a new test distribution if it can successfully recover the original task prompts. Extensive experiments on six representative vision tasks with 15 corruptions demonstrate that our VICT can improve the generalizability of VICL to unseen new domains. Code will be released to facilitate future research.
Poster
Pramit Saha · Felix Wagner · Divyanshu Mishra · Can Peng · Anshul Thakur · David A. Clifton · Konstantinos Kamnitsas · Alison Noble
[ ExHall D ]
Abstract
Effective training of large Vision-Language Models (VLMs) on resource-constrained client devices in Federated Learning (FL) requires the usage of parameter-efficient fine-tuning (PEFT) strategies. To this end, we demonstrate the impact of two factors, viz., a client-specific layer importance score that selects the most important VLM layers for fine-tuning and an inter-client layer diversity score that encourages diverse layer selection across clients for optimal VLM layer selection. We first theoretically motivate and leverage the principal eigenvalue magnitude of layerwise Neural Tangent Kernels and show its effectiveness as a client-specific layer importance score. Next, we propose a novel layer updating strategy dubbed F3OCUS that jointly optimizes the layer importance and diversity factors by employing a data-free, multi-objective, meta-heuristic optimization on the server. We explore 5 different meta-heuristic algorithms and compare their effectiveness for selecting model layers and adapter layers towards PEFT-FL. Furthermore, we release a new MedVQA-FL dataset involving overall 707,962 VQA triplets and 9 modality-specific clients and utilize it to train and evaluate our method. Overall, we conduct more than 10,000 client-level experiments on 6 Vision-Language FL task settings involving 58 medical image datasets and 4 different VLM architectures of varying sizes to demonstrate the effectiveness of the proposed method.
Poster
Arne Grobrügge · Niklas Kühl · Gerhard Satzger · Philipp Spitzer
[ ExHall D ]
Abstract
Concept-based eXplainable AI (C-XAI) aims to overcome the limitations of traditional saliency maps by converting pixels into human-understandable concepts that are consistent across an entire dataset. A crucial aspect of C-XAI is completeness, which measures how well a set of concepts explains a model's decisions. Among C-XAI methods, Multi-Dimensional Concept Discovery (MCD) effectively improves completeness by breaking down the CNN latent space into distinct and interpretable concept subspaces. However, MCD's explanations can be difficult for humans to understand, raising concerns about their practical utility. To address this, we propose Human-Understandable Multi-dimensional Concept Discovery (HU-MCD). HU-MCD uses the Segment Anything Model for concept identification and implements a CNN-specific input masking technique to reduce noise introduced by traditional masking methods. These changes to MCD paired with the completeness relation enable HU-MCD to enhance concept understandability while maintaining explanation faithfulness. Our experiments, including human subject studies, show that HU-MCD provides more precise and reliable explanations than existing C-XAI methods. Code will be available for research purposes upon acceptance.
Poster
Jinhong Lin · Cheng-En Wu · Huanran Li · Jifan Zhang · Yu Hen Hu · Pedro Morgado
[ ExHall D ]
Abstract
Masked Image Modeling (MIM) has emerged as a powerful self-supervised learning paradigm for visual representation learning, enabling models to acquire rich visual representations by predicting masked portions of images from their visible regions. While this approach has shown promising results, we hypothesize that its effectiveness may be limited by optimization challenges during early training stages, where models are expected to learn complex image distributions from partial observations before developing basic visual processing capabilities. To address this limitation, we propose a prototype-driven curriculum learning framework that structures the learning process to progress from prototypical examples to more complex variations in the dataset. Our approach introduces a temperature-based annealing scheme that gradually expands the training distribution, enabling more stable and efficient learning trajectories. Through extensive experiments on ImageNet-1K, we demonstrate that our curriculum learning strategy significantly improves both training efficiency and representation quality while requiring substantially fewer training epochs compared to standard Masked Auto-Encoding. Our findings suggest that carefully controlling the order of training examples plays a crucial role in self-supervised visual learning, providing a practical solution to the early-stage optimization challenges in MIM.
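A minimal sketch of a temperature-annealed, prototype-driven sampling curriculum as motivated above: examples close to their prototype are favoured early in training, and the sampling distribution gradually flattens as the temperature grows. The distance measure and the linear annealing schedule are illustrative assumptions.

```python
import torch

def curriculum_weights(features: torch.Tensor, prototype: torch.Tensor,
                       step: int, total_steps: int,
                       t_start: float = 0.05, t_end: float = 5.0) -> torch.Tensor:
    """Return per-example sampling probabilities for the current training step."""
    progress = step / max(1, total_steps)
    temperature = t_start + (t_end - t_start) * progress   # linear temperature annealing
    dists = (features - prototype).norm(dim=-1)            # distance of each example to the prototype
    return torch.softmax(-dists / temperature, dim=0)

feats = torch.randn(1000, 128)
proto = feats.mean(dim=0)
early = curriculum_weights(feats, proto, step=0, total_steps=100)
late = curriculum_weights(feats, proto, step=100, total_steps=100)
print(early.max().item(), late.max().item())  # early weights are far more concentrated
```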
Poster
Yancheng Cai · Fei Yin · Dounia Hammou · Rafal Mantiuk
[ ExHall D ]
Abstract
Computer vision foundation models, such as DINO or OpenCLIP, are trained in a self-supervised manner on large image datasets. Analogously, substantial evidence suggests that the human visual system (HVS) is influenced by the statistical distribution of colors and patterns in the natural world, characteristics also present in the training data of foundation models. The question we address in this paper is whether foundation models trained on natural images mimic some of the low-level characteristics of the human visual system, such as contrast detection, contrast masking, and contrast constancy. Specifically, we designed a protocol comprising nine test types to evaluate the image encoders of 45 foundation and generative models. Our results indicate that some foundation models (e.g., DINO, DINOv2, and OpenCLIP), share some of the characteristics of human vision, but other models show little resemblance. Foundation models tend to show smaller sensitivity to low contrast and rather irregular responses to contrast across frequencies. The foundation models show the best agreement with human data in terms of contrast masking. Our findings suggest that human vision and computer vision may take both similar and different paths when learning to interpret images of the real world. Overall, while differences remain, foundation models trained on …
Poster
Duolikun Danier · Mehmet Aygun · Changjian Li · Hakan Bilen · Oisin Mac Aodha
[ ExHall D ]
Abstract
Large-scale pre-trained vision models are becoming increasingly prevalent, offering expressive and generalizable visual representations that benefit various downstream tasks. Recent studies on the emergent properties of these models have revealed their high-level geometric understanding, in particular in the context of depth perception. However, it remains unclear how depth perception arises in these models without explicit depth supervision provided during pre-training. To investigate this, we examine whether the monocular depth cues, similar to those used by the human visual system, emerge in these models. We introduce a new benchmark, DepthCues, designed to evaluate depth cue understanding, and present findings across 20 diverse and representative pre-trained vision models. Our analysis shows that human-like depth cues emerge in more recent larger models. We also explore enhancing depth perception in large vision models by fine-tuning on DepthCues, and find that even without dense depth supervision, this improves depth estimation. To support further research, our benchmark and evaluation code will be made publicly available for studying depth perception in vision models.
Poster
Zhaoqing Wang · Xiaobo Xia · Runnan Chen · Dongdong Yu · Changhu Wang · Mingming Gong · Tongliang Liu
[ ExHall D ]
Abstract
This paper presents the Large Vision Diffusion Transformer (LaVin-DiT), a scalable and unified foundation model designed to tackle over 20 computer vision tasks in a generative framework. Unlike existing large vision models directly adapted from natural language processing architectures, which rely on less efficient autoregressive techniques and disrupt spatial relationships essential for vision data, LaVin-DiT introduces key innovations to optimize generative performance for vision tasks. First, to address the high dimensionality of visual data, we incorporate a spatial-temporal variational autoencoder that encodes data into a continuous latent space. Second, for generative modeling, we develop a joint diffusion transformer that progressively produces vision outputs. Third, for unified multi-task training, in-context learning is implemented. Input-target pairs serve as task context, which guides the diffusion transformer to align outputs with specific tasks within the latent space. During inference, a task-specific context set and test data as queries allow LaVin-DiT to generalize across tasks without fine-tuning. Trained on extensive vision datasets, the model is scaled from 0.1B to 3.4B parameters, demonstrating substantial scalability and state-of-the-art performance across diverse vision tasks. This work introduces a novel pathway for large vision foundation models, underscoring the promising potential of diffusion transformers. The code and models will be …
Poster
Dongshuo Yin · Leiyi Hu · Bin Li · Youqun Zhang · Xue Yang
[ ExHall D ]
Abstract
Pre-training & fine-tuning can enhance transfer efficiency and performance on visual tasks. Recent delta-tuning methods provide more options for visual classification tasks. Despite their success, existing visual delta-tuning methods fail to exceed the upper limit of full fine-tuning on challenging tasks like object detection and segmentation. To find a competitive alternative to full fine-tuning, we propose the Multi-cognitive Visual Adapter (Mona) tuning, a novel adapter-based tuning method. First, we introduce multiple vision-friendly filters into the adapter to enhance its ability to process visual signals, while previous methods mainly rely on language-friendly linear filters. Second, we add a scaled layernorm in the adapter to regulate the distribution of input features for visual filters. To fully demonstrate the practicality and generality of Mona, we conduct experiments on multiple representative visual tasks, including instance segmentation on COCO, semantic segmentation on ADE20K, object detection on Pascal VOC, oriented object detection on DOTA/STAR, and image classification on three common datasets. Exciting results illustrate that Mona surpasses full fine-tuning on all these tasks by tuning less than 5% of the backbone parameters, and is the only delta-tuning method outperforming full fine-tuning on all tasks. For example, Mona achieves 1% performance gain on the COCO dataset …
Poster
Uranik Berisha · Jens Mehnert · Alexandru Paul Condurache
[ ExHall D ]
Abstract
Vision Transformers (ViTs) have emerged as the state-of-the-art models in various Computer Vision (CV) tasks, but their high computational and resource demands pose significant challenges. While Mixture of Experts (MoE) can make these models more efficient, they often require costly retraining or even training from scratch. Recent developments aim to reduce these computational costs by leveraging pretrained networks. These have been shown to produce sparse activation patterns in the Multi-Layer Perceptrons (MLPs) of the encoder blocks, allowing for conditional activation of only relevant subnetworks for each sample. Building on this idea, we propose a new method to construct MoE variants from pretrained models. Our approach extracts expert subnetworks from the model’s MLP layers post-training in two phases. First, we cluster output activations to identify distinct activation patterns. In the second phase, we use these clusters to extract the corresponding subnetworks responsible for producing them. On ImageNet-1k recognition tasks, we demonstrate that these extracted experts can perform surprisingly well out of the box and require only minimal fine-tuning to regain 98% of the original performance, all while reducing FLOPs and model size, by up to 36% and 32% respectively.
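A minimal sketch of the two-phase expert extraction described above: cluster the MLP activations of a pretrained block, then derive one expert per cluster by keeping only the hidden units most active for that cluster. The unit-selection rule, cluster count, and keep ratio are illustrative assumptions, not the paper's exact procedure.

```python
import torch
from sklearn.cluster import KMeans

def extract_expert_masks(hidden_acts: torch.Tensor, num_experts: int = 4,
                         keep_ratio: float = 0.5) -> torch.Tensor:
    """hidden_acts: (num_samples, hidden_dim) post-activation MLP features.
    Returns a (num_experts, hidden_dim) boolean mask selecting each expert's units."""
    labels = torch.from_numpy(
        KMeans(n_clusters=num_experts, n_init=10).fit_predict(hidden_acts.numpy()))
    keep = int(keep_ratio * hidden_acts.shape[1])
    masks = []
    for e in range(num_experts):
        mean_act = hidden_acts[labels == e].abs().mean(dim=0)   # per-unit importance within the cluster
        mask = torch.zeros(hidden_acts.shape[1], dtype=torch.bool)
        mask[mean_act.topk(keep).indices] = True
        masks.append(mask)
    return torch.stack(masks)

masks = extract_expert_masks(torch.randn(512, 1024).relu())
print(masks.shape, masks.sum(dim=1))  # each expert keeps ~50% of the hidden units
```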
Poster
Lixu Wang · Bingqi Shang · Yi Li · Payal Mohapatra · Wei Dong · Xiao Wang · Qi Zhu
[ ExHall D ]
Abstract
Vision Transformers (ViTs), extensively pre-trained on large-scale datasets, have become essential to foundation models, allowing excellent performance on diverse downstream tasks with minimal adaptation. Consequently, there is growing interest in adapting pre-trained ViTs across various fields, including privacy-sensitive domains where clients are often reluctant to share their data. Existing adaptation methods typically require direct data access, rendering them infeasible under these constraints. A straightforward solution may be sending the pre-trained ViT to clients for local adaptation, which poses issues of model intellectual property protection and incurs heavy client computation overhead. To address these issues, we propose a novel split adaptation (SA) method that enables effective downstream adaptation while protecting data and models. SA, inspired by split learning (SL), segments the pre-trained ViT into a frontend and a backend, with only the frontend shared with the client for data representation extraction. But unlike regular SL, SA replaces frontend parameters with low-bit quantized values, preventing direct exposure of the model. SA allows the client to add bi-level noise to the frontend and the extracted data representations, ensuring data protection. Accordingly, SA incorporates data-level and model-level out-of-distribution enhancements to mitigate noise injection's impact on adaptation performance. Our SA focuses on the challenging few-shot …
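A rough sketch of the split-and-protect workflow described above, under several assumptions of ours: uniform 4-bit weight quantization, Gaussian noise with made-up magnitudes, and a plain `nn.TransformerEncoderLayer` stack standing in for the ViT.

```python
# Sketch of the split-and-protect idea above (not the authors' exact protocol):
# the server keeps the backend, ships a low-bit quantized frontend, and the
# client perturbs both the frontend parameters and the extracted representations.
import copy
import torch
import torch.nn as nn


def quantize_weights(module: nn.Module, bits: int = 4):
    """Uniform symmetric per-tensor quantization of all parameters (illustrative)."""
    qmax = 2 ** (bits - 1) - 1
    for p in module.parameters():
        scale = p.detach().abs().max().clamp_min(1e-8) / qmax
        p.data = torch.round(p.data / scale).clamp(-qmax, qmax) * scale
    return module


def split_vit(vit_blocks: nn.Sequential, split_at: int):
    frontend = copy.deepcopy(vit_blocks[:split_at])
    backend = vit_blocks[split_at:]
    return frontend, backend


# Server side: split and quantize before sharing the frontend.
blocks = nn.Sequential(*[nn.TransformerEncoderLayer(192, 3, 384, batch_first=True)
                         for _ in range(12)])
frontend, backend = split_vit(blocks, split_at=4)
frontend = quantize_weights(frontend, bits=4)

# Client side: bi-level noise, on frontend parameters and on representations.
with torch.no_grad():
    for p in frontend.parameters():
        p.add_(0.01 * torch.randn_like(p))          # model-level noise
tokens = torch.randn(2, 197, 192)                    # dummy patch tokens
reprs = frontend(tokens)
reprs = reprs + 0.01 * torch.randn_like(reprs)       # data-level noise sent to server
out = backend(reprs)
```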
Poster
Jialai Wang · Yuxiao Wu · Weiye Xu · Yating Huang · Chao Zhang · Zongpeng Li · Mingwei Xu · Zhenkai Liang
[ ExHall D ]
Abstract
Vision Transformers (ViTs) have experienced significant progress and are quantized for deployment in resource-constrained applications. Quantized models are vulnerable to targeted bit-flip attacks (BFAs). A targeted BFA prepares a trigger and a corresponding Trojan/backdoor, inserting the latter (with RowHammer bit flipping) into a victim model, to mislead its classification on samples containing the trigger. Existing targeted BFAs on quantized ViTs are limited in that: (1) they require numerous bit-flips, and (2) the separation between flipped bits is below 4 KB, making attacks infeasible with RowHammer in real-world scenarios. We propose a new and practical targeted attack Flip-S against quantized ViTs. The core insight is that in quantized models, a scale factor change ripples through a batch of model weights. Consequently, flipping bits in scale factors, rather than solely in model weights, enables more cost-effective attacks. We design a Scale-Factor-Search (SFS) algorithm to identify critical bits in scale factors for flipping, and adopt a mutual exclusion strategy to guarantee a 4 KB separation between flips. We evaluate Flip-S on CIFAR-10 and ImageNet datasets across five ViT architectures and two quantization levels. Results show that Flip-S achieves attack success rate (ASR) exceeding 90.0\% on all models with 50 bits flipped, outperforming baselines …
Poster
Xinglong Sun · Barath Lakshmanan · Maying Shen · Shiyi Lan · Jingde Chen · Jose M. Alvarez
[ ExHall D ]
Abstract
Current structural pruning methods face two significant limitations: (i) they often limit pruning to finer-grained levels like channels, making aggressive parameter reduction challenging, and (ii) they focus heavily on parameter and FLOP reduction, with existing latency-aware methods frequently relying on simplistic, suboptimal linear models that fail to generalize well to transformers, where multiple interacting dimensions impact latency. In this paper, we address both limitations by introducing Multi-Dimensional Pruning (MDP), a novel paradigm that jointly optimizes across a variety of pruning granularities—including channels, query/key, heads, embeddings, and blocks. MDP employs an advanced latency modeling technique to accurately capture latency variations across all prunable dimensions, achieving an optimal balance between latency and accuracy. By reformulating pruning as a Mixed-Integer Nonlinear Program (MINLP), MDP efficiently identifies the optimal pruned structure across all prunable dimensions while respecting latency constraints. This versatile framework supports both CNNs and transformers. Extensive experiments demonstrate that MDP significantly outperforms previous methods, especially at high pruning ratios. On ImageNet, MDP achieves a 28\% speed increase with a +1.4 Top-1 accuracy improvement over prior work like HALP for ResNet50 pruning. Against the latest transformer pruning method, Isomorphic, MDP delivers an additional 37\% acceleration with a +0.7 Top-1 accuracy improvement.
Poster
Fei Xie · Jiahao Nie · Yujin Tang · Wenkang Zhang · Hongshen Zhao
[ ExHall D ]
Abstract
Recent State Space Models (SSM), especially Mamba, have demonstrated impressive performance in visual modeling and possess superior model efficiency. However, the application of Mamba to visual tasks suffers from inferior performance due to three main constraints of the sequential model: 1) Causal computation is incapable of accessing global context; 2) Long-range forgetting when computing the current hidden states; 3) Weak spatial structural modeling due to the transformed sequential input. To address these issues, we investigate a simple yet powerful vision task adapter for Mamba models, which consists of two functional modules: Adaptor-T and Adaptor-S. When solving the hidden states for the SSM, we apply a causal prediction module, Adaptor-T, to select a set of learnable locations as memory-augmentation feature states to ease the long-range forgetting issue. Moreover, we leverage Adaptor-S, composed of multi-scale dilated convolutional kernels, to enhance spatial modeling and introduce an image inductive bias into the feature output. Both modules enlarge the context modeling in causal computation, as the output is enhanced by the otherwise inaccessible features. We explore three usages of Mamba-Adaptor: a general visual backbone for various vision tasks; a booster module to raise the performance of pretrained backbones; a highly efficient fine-tuning module that …
Poster
Yuan Zhou · Qingshan Xu · Jiequan Cui · Junbao Zhou · Jing Zhang · Richang Hong · Hanwang Zhang
[ ExHall D ]
Abstract
Recently, large efforts have been made to design efficient linear-complexity visual Transformers. However, current linear attention models are generally unsuitable for deployment on resource-constrained mobile devices, as they suffer either limited efficiency gains or significant accuracy drops. In this paper, we propose a new deCoupled duAl-interactive lineaR attEntion (CARE) mechanism, revealing that feature decoupling and interaction can fully unleash the power of linear attention. We first propose an asymmetrical feature decoupling strategy that asymmetrically decouples the learning process for local inductive bias and long-range dependencies, thereby preserving sufficient local and global information while effectively enhancing model efficiency. Then, a dynamic memory unit is employed to maintain critical information along the network pipeline. Moreover, we design a dual interaction module to effectively facilitate interaction between local inductive bias and long-range information as well as among features at different layers. By adopting a decoupled learning scheme and fully exploiting complementarity across features, our method can achieve both high efficiency and accuracy. Extensive experiments on ImageNet-1K, COCO, and ADE20K datasets demonstrate the effectiveness of our approach, e.g., achieving 78.4/82.1% top-1 accuracy on ImageNet-1K at the cost of only 0.7/1.9 GMACs. Code will be released on GitHub.
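The decoupling-plus-interaction idea reads roughly as follows in code. This is a generic sketch, not the CARE architecture itself: the channel split ratio, the depthwise-conv local branch, the efficient-attention-style linear attention, and the sigmoid cross-gating are our illustrative choices.

```python
# Illustrative decoupled dual-branch block: a local conv path and a
# linear-attention path, followed by a simple cross-gating interaction.
import torch
import torch.nn as nn


class DecoupledLinearAttention(nn.Module):
    def __init__(self, dim: int, local_ratio: float = 0.5):
        super().__init__()
        self.d_local = int(dim * local_ratio)
        self.d_global = dim - self.d_local
        # Local branch: depthwise conv captures inductive bias cheaply.
        self.local = nn.Conv2d(self.d_local, self.d_local, 3, padding=1,
                               groups=self.d_local)
        # Global branch: linear attention over flattened tokens.
        self.qkv = nn.Linear(self.d_global, 3 * self.d_global)
        # Dual interaction: each branch gates the other.
        self.gate_l = nn.Linear(self.d_global, self.d_local)
        self.gate_g = nn.Linear(self.d_local, self.d_global)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                       # x: (B, C, H, W)
        B, C, H, W = x.shape
        xl, xg = x[:, :self.d_local], x[:, self.d_local:]
        fl = self.local(xl)                                     # (B, Cl, H, W)
        t = xg.flatten(2).transpose(1, 2)                       # (B, N, Cg)
        q, k, v = self.qkv(t).chunk(3, dim=-1)
        q = q.softmax(dim=-1)
        k = k.softmax(dim=1)
        fg = q @ (k.transpose(1, 2) @ v)                        # linear attention, O(N d^2)
        fl_tok = fl.flatten(2).transpose(1, 2)                  # (B, N, Cl)
        fl_tok = fl_tok * torch.sigmoid(self.gate_l(fg))        # global gates local
        fg = fg * torch.sigmoid(self.gate_g(fl_tok))            # local gates global
        out = self.proj(torch.cat([fl_tok, fg], dim=-1))        # (B, N, C)
        return out.transpose(1, 2).reshape(B, C, H, W)


x = torch.randn(2, 64, 14, 14)
print(DecoupledLinearAttention(64)(x).shape)    # torch.Size([2, 64, 14, 14])
```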
Poster
Zichen Miao · WEI CHEN · Qiang Qiu
[ ExHall D ]
Abstract
Transformer-based large pre-trained models have shown remarkable generalization ability, and various parameter-efficient fine-tuning (PEFT) methods have been proposed to customize these models on downstream tasks with minimal computational and memory budgets. Previous PEFT methods are primarily designed from a tensor-decomposition perspective that tries to effectively tune the linear transformation by finding the smallest subset of parameters to train. Our study adopts an orthogonal view by representing the attention operation as a graph convolution and formulating the multi-head attention maps as a convolutional filter subspace, with each attention map as a subspace element. In this paper, we propose to tune the large pre-trained transformers by learning a small set of combination coefficients that construct a more expressive filter subspace from the original multi-head attention maps. We show analytically and experimentally that the tuned filter subspace can effectively expand the feature space of the multi-head attention and further enhance the capacity of transformers. We further stabilize the fine-tuning with a residual parameterization of the tunable subspace coefficients, and enhance the generalization with a regularization design by directly applying dropout on the tunable coefficient during training. The tunable coefficients take a tiny number of parameters and can be combined with previous PEFT methods …
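A minimal sketch of the coefficient-subspace idea, with residual parameterization and dropout on the coefficients as described; the module name and hyperparameters are ours.

```python
# Sketch: re-mix the H per-head attention maps with a small learnable
# coefficient matrix, residually parameterized and regularized with dropout.
import torch
import torch.nn as nn


class SubspaceCoefficients(nn.Module):
    """Builds new attention maps as linear combinations of the original heads."""

    def __init__(self, num_heads: int, p_drop: float = 0.1):
        super().__init__()
        # Residual parameterization: identity mixing + small learnable delta.
        self.delta = nn.Parameter(torch.zeros(num_heads, num_heads))
        self.drop = nn.Dropout(p_drop)

    def forward(self, attn):                     # attn: (B, H, N, N)
        coeff = torch.eye(attn.size(1), device=attn.device) + self.drop(self.delta)
        # New map g is a linear combination of all original heads h.
        return torch.einsum('gh,bhnm->bgnm', coeff, attn)


# The backbone stays frozen; only the tiny coefficient matrices are trained.
attn = torch.softmax(torch.randn(2, 12, 197, 197), dim=-1)
mixer = SubspaceCoefficients(num_heads=12)
print(mixer(attn).shape)                         # torch.Size([2, 12, 197, 197])
```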
Poster
Caoshuo Li · Tanzhe Li · Xiaobin Hu · Donghao Luo · Taisong Jin
[ ExHall D ]
Abstract
Recently, Vision Graph Neural Network (ViG) has gained considerable attention in computer vision. Despite its groundbreaking innovation, Vision Graph Neural Network encounters key issues including the quadratic computational complexity caused by its K-Nearest Neighbor (KNN) graph construction and the limitation of pairwise relations of normal graphs. To address the aforementioned challenges, we propose a novel vision architecture, termed **D**ilated **V**ision **H**yper**G**raph **N**eural **N**etwork (DVHGNN), which is designed to leverage multi-scale hypergraph to *efficiently* capture high-order correlations among objects. Specifically, the proposed method tailors Clustering and **D**ilated **H**yper**G**raph **C**onstruction (DHGC) to adaptively capture multi-scale dependencies among the data samples. Furthermore, a dynamic hypergraph convolution mechanism is proposed to facilitate adaptive feature exchange and fusion at the hypergraph level. Extensive qualitative and quantitative evaluations of the benchmark image datasets demonstrate that the proposed DVHGNN significantly outperforms the state-of-the-art vision backbones. For instance, our DVHGNN-S achieves an impressive top-1 accuracy of **83.1\%** on ImageNet-1K, surpassing ViG-S by **+1.0**↑ and ViHGNN-S by **+0.6**↑.
Poster
Ruiheng Liu · Haozhe Chen · Boyao Zhao · Kejiang Chen · Weiming Zhang
[ ExHall D ]
Abstract
The advancement of AI technology has significantly influenced production activities, increasing the focus on copyright protection for AI models. Model perceptual hashing offers an efficient solution for retrieving pirated models. Existing methods, such as handcrafted feature-based and dual-branch network-based perceptual hashing, have proven effective in detecting pirated models. However, these approaches often struggle to differentiate non-pirated models, leading to frequent false positives in model authentication and protection. To address this challenge, this paper proposes a structurally aware perceptual model hashing technique that achieves reduced false positives while maintaining high true positive rates. Specifically, we introduce a method for converting diverse neural network structures into graph structures suitable for DNN processing, then utilize a graph neural network to learn their structural feature representations. Our approach integrates perceptual parameter-based model hashing, achieving robust performance with higher detection accuracy and fewer false positives. Experimental results show that the proposed method has only a 3\% false positive rate when detecting non-pirated models, and the detection accuracy for pirated models exceeds 98\%.
Poster
Yang Liu · Tianwei Zhang · Shi Gu
[ ExHall D ]
Abstract
Concept Bottleneck Models (CBMs) provide an interpretable framework for neural networks by mapping visual features to predefined, human-understandable concepts. However, the application of CBMs is often constrained by insufficient concept annotations. Recently, multi-modal pre-trained models have shown promise in reducing annotation costs by aligning visual representations with textual concept embeddings. Nevertheless, the quality and completeness of the predefined concepts significantly affect the performance of CBMs. In this work, we propose Hybrid Concept Bottleneck Model (HybridCBM), a novel CBM framework to address the challenge of incomplete predefined concepts. Our method consists of two main components: a Static Concept Bank and a Dynamic Concept Bank. The Static Concept Bank directly leverages large language models (LLMs) for concept construction, while the Dynamic Concept Bank employs learnable vectors to capture complementary and valuable concepts continuously during training. After training, a pre-trained translator converts these vectors into human-understandable concepts, further enhancing model interpretability. Notably, HybridCBM is highly flexible and can be easily applied to any CBM to improve performance. Experimental results across multiple datasets demonstrate that HybridCBM outperforms current state-of-the-art methods and achieves comparable results to black-box models. Additionally, we propose novel metrics to evaluate the quality of the learned concepts, showing that they perform comparably …
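A minimal sketch of a static-plus-dynamic concept bank, assuming CLIP-style embeddings and cosine concept scores (both assumptions of ours); the translator that names the learned dynamic concepts is omitted.

```python
# Sketch: a frozen static concept bank (e.g. CLIP text embeddings of
# LLM-written concepts) concatenated with learnable dynamic concept vectors;
# the label head sees only the concept scores.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HybridConceptBottleneck(nn.Module):
    def __init__(self, static_bank: torch.Tensor, n_dynamic: int, n_classes: int):
        super().__init__()
        self.register_buffer('static', F.normalize(static_bank, dim=-1))  # (Cs, D)
        self.dynamic = nn.Parameter(torch.randn(n_dynamic, static_bank.size(1)) * 0.02)
        self.head = nn.Linear(static_bank.size(0) + n_dynamic, n_classes)

    def forward(self, img_emb):                          # (B, D) from a frozen encoder
        img_emb = F.normalize(img_emb, dim=-1)
        bank = torch.cat([self.static, F.normalize(self.dynamic, dim=-1)], dim=0)
        scores = img_emb @ bank.t()                      # (B, Cs + Cd) concept activations
        return self.head(scores), scores


static_bank = torch.randn(200, 512)                      # 200 LLM-derived concepts
model = HybridConceptBottleneck(static_bank, n_dynamic=32, n_classes=10)
logits, concept_scores = model(torch.randn(4, 512))
```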
Poster
Sanghyun Kim · Deunsol Jung · Minsu Cho
[ ExHall D ]
Abstract
Recent methods for zero-shot Human-Object Interaction (HOI) detection typically leverage the generalization ability of a large Vision-Language Model (VLM), i.e., CLIP, on unseen categories, showing impressive results in various zero-shot settings. However, existing methods struggle to adapt CLIP representations for human-object pairs, as CLIP tends to overlook fine-grained information necessary for distinguishing interactions. To address this issue, we devise LAIN, a novel zero-shot HOI detection framework enhancing the locality and interaction awareness of CLIP representations. The locality awareness, which involves capturing fine-grained details and the spatial structure of individual objects, is achieved by aggregating the information and spatial priors of adjacent neighborhood patches. The interaction awareness, which involves identifying whether and how a human is interacting with an object, is achieved by capturing the interaction pattern between the human and the object. By infusing locality and interaction awareness into CLIP representations, LAIN captures detailed information about human-object pairs. Our extensive experiments on existing benchmarks show that LAIN outperforms previous methods in various zero-shot settings, demonstrating the importance of locality and interaction awareness for effective zero-shot HOI detection.
Poster
Dianmo Sheng · Dongdong Chen · Zhentao Tan · Qiankun Liu · Qi Chu · Tao Gong · Bin Liu · Jing Han · Wenbin Tu · Shengwei Xu · Nenghai Yu
[ ExHall D ]
Abstract
Recent advancements in in-context segmentation generalists have demonstrated significant success in performing various image segmentation tasks using a limited number of labeled example images. However, real-world applications present challenges due to the variability of support examples, which often exhibit quality issues resulting from various sources and inaccurate labeling. How to extract more robust representations from these examples has always been one of the goals of in-context visual learning. In response, we propose UNICL-SAM, to better model the example distribution and extract robust representations to help in-context segmentation. We incorporate an uncertainty probabilistic module to quantify each example's reliability during both the training and testing phases. Utilizing this uncertainty estimation, we introduce an uncertainty-guided graph augmentation and feature refinement strategy, aimed at mitigating the impact of high-uncertainty regions to enhance the learning of robust representations. Subsequently, we construct prototypes for each example by aggregating part information, thereby creating reliable in-context instruction that effectively represents fine-grained local semantics. This approach serves as a valuable complement to traditional global pooling features. Experimental results demonstrate the effectiveness of the proposed framework, underscoring its potential for real-world applications.
Poster
ZhengYang Wang · Tingliang Feng · Fan Lyu · Fanhua Shang · Wei Feng · Liang Wan
[ ExHall D ]
Abstract
Open-vocabulary semantic segmentation aims to enable models to segment arbitrary categories. Although pre-trained Vision-Language Models (VLMs) like CLIP have established a robust foundation for this task by learning to match text and image representations from large-scale data, their lack of pixel-level recognition necessitates further fine-tuning. Most existing methods leverage text as a guide to achieve pixel-level recognition. However, the inherent biases in text semantic descriptions and the lack of pixel-level supervisory information make it challenging to fine-tune CLIP-based models effectively. This paper considers leveraging image-text data to simultaneously capture the semantic information contained in both image and text, thereby constructing Dual Semantic Guidance and corresponding pixel-level pseudo annotations. In particular, the visual semantic guidance is enhanced by explicitly exploring foreground regions and minimizing the influence of background. The dual semantic guidance is then jointly utilized to fine-tune CLIP-based segmentation models, achieving decent fine-grained recognition capabilities. As a comprehensive evaluation shows, our method outperforms state-of-the-art results by large margins on eight commonly used datasets, with and without background.
Poster
Zhiwei Yang · Yucong Meng · Kexue Fu · feilong tang · Shuo Wang · Zhijian Song
[ ExHall D ]
Abstract
Weakly Supervised Semantic Segmentation (WSSS) with image-level labels aims to achieve pixel-level predictions using Class Activation Maps (CAMs). Recently, Contrastive Language-Image Pre-training (CLIP) has been introduced in WSSS. However, recent methods primarily focus on image-text alignment for CAM generation, while CLIP’s potential in patch-text alignment remains under-explored. In this work, we propose ExCEL to explore CLIP's dense knowledge via a novel patch-text alignment paradigm for WSSS. Specifically, we propose Text Semantic Enrichment (TSE) and Visual Calibration (VC) modules to improve the dense alignment across both text and vision modalities. To make text embeddings semantically informative, our TSE module applies Large Language Models (LLMs) to build a dataset-wide knowledge base and enriches the text representations with an implicit attribute-hunting process. To mine fine-grained knowledge from visual features, our VC module first proposes Static Visual Calibration (SVC) to propagate fine-grained knowledge in a non-parametric manner. Then Learnable Visual Calibration (LVC) is further proposed to dynamically shift the frozen features towards distributions with diverse semantics. With these enhancements, ExCEL not only retains CLIP’s training-free advantages but also significantly outperforms other state-of-the-art methods with much less training cost on PASCAL VOC and MS COCO. Codes will be available.
Poster
Chen Yi Lu · Kasra Derakhshandeh · Somali Chaterji
[ ExHall D ]
Abstract
Semi-supervised semantic segmentation with consistency regularization capitalizes on unlabeled images to enhance the accuracy of pixel-level segmentation. Current consistency learning methods primarily rely on the consistency loss between pseudo-labels and unlabeled images, neglecting the information within the feature representations of the backbone encoder. Preserving maximum information in feature embeddings requires achieving the alignment and uniformity objectives, as widely studied. To address this, we present SWSEG, a semi-supervised semantic segmentation algorithm that optimizes alignment and uniformity using the Sliced-Wasserstein Distance (SWD), and rigorously and empirically proves this connection. We further resolve the computational issues associated with conventional Monte Carlo-based SWD by implementing a Gaussian-approximated variant, which not only maintains the alignment and uniformity objectives but also improves training efficiency. We evaluate SWSEG on the PASCAL VOC 2012, Cityscapes, and ADE20K datasets, outshining supervised baselines in mIoU by up to 11.8%, 8.9%, and 8.2%, respectively, given an equivalent number of labeled samples. Further, SWSEG surpasses state-of-the-art methods in multiple settings across these three datasets. Our extensive ablation studies confirm the optimization of the uniformity and alignment objectives of the feature representations.
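For reference, the quantity being optimized can be estimated with the standard Monte Carlo sliced-Wasserstein estimator below (random unit projections, 1-D sorting); the paper's Gaussian-approximated variant replaces this sampling step.

```python
# Generic Monte-Carlo sliced-Wasserstein distance between two feature batches.
# Assumes both batches have the same number of samples.
import torch


def sliced_wasserstein(x, y, n_proj: int = 128):
    """x, y: (N, D) feature batches; returns an SW-2 distance estimate."""
    d = x.size(1)
    theta = torch.randn(n_proj, d, device=x.device)
    theta = theta / theta.norm(dim=1, keepdim=True)       # random unit directions
    px = (x @ theta.t()).sort(dim=0).values               # project and sort: 1-D OT
    py = (y @ theta.t()).sort(dim=0).values
    return ((px - py) ** 2).mean().sqrt()


feat_a = torch.randn(256, 128)
feat_b = torch.randn(256, 128)
print(sliced_wasserstein(feat_a, feat_b))
```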
Poster
Zhongwen Zhang · Yuri Boykov
[ ExHall D ]
Abstract
We consider weakly supervised segmentation where only a fraction of pixels have ground truth labels (scribbles) and focus on a self-labeling approach optimizing relaxations of the standard unsupervised CRF/Potts loss on unlabeled pixels. While WSSS methods can directly optimize such losses via gradient descent, prior work suggests that higher-order optimization can improve network training by introducing hidden pseudo-labels and powerful CRF sub-problem solvers, e.g. graph cut. However, previously used hard pseudo-labels can not represent class uncertainty or errors, which motivates soft self-labeling. We derive a principled auxiliary loss and systematically evaluate standard and new CRF relaxations (convex and non-convex), neighborhood systems, and terms connecting network predictions with soft pseudo-labels. We also propose a general continuous sub-problem solver. Using only standard architectures, soft self-labeling consistently improves scribble-based training and outperforms significantly more complex specialized WSSS systems. It can outperform full pixel-precise supervision. Our general ideas apply to other weakly-supervised problems/systems.
Poster
Zhaochen Liu · Limeng Qiao · Xiangxiang Chu · Lin Ma · Tingting Jiang
[ ExHall D ]
Abstract
Aiming to predict the complete shape of partially occluded objects, amodal segmentation is an important capability toward visual intelligence. To promote practicality, zero-shot foundation models competent in the open world are gaining growing attention in this field. Nevertheless, prior models exhibit deficiencies in efficiency and stability. To address this problem, utilizing implicit prior knowledge, we propose the first SAM-based amodal segmentation foundation model, SAMBA. Methodologically, a novel framework with multilevel facilitation is designed to better adapt to the task characteristics and unleash the potential capabilities of SAM. At the modality level, a separation-to-fusion structure is employed that jointly learns modal and amodal segmentation to enhance mutual coordination. At the instance level, to ease the complexity of amodal feature extraction, we introduce a principal focusing mechanism to indicate objects of interest. At the pixel level, a mixture-of-experts is incorporated with a specialized distribution loss, by which distinct occlusion rates correspond to different experts to improve accuracy. Experiments are conducted on several eminent datasets, and the results show that the performance of SAMBA is superior to existing zero-shot and even supervised approaches. Furthermore, our proposed model has notable advantages in terms of speed and size. The model and code will …
Poster
Dongkai Wang · Jiang Duan · Liangjian Wen · Shiyu Xuan · Hao CHEN · Shiliang Zhang
[ ExHall D ]
Abstract
Generalizable object keypoint localization is a fundamental computer vision task in understanding object structure. It is challenging for existing keypoint localization methods because their limited training data cannot provide generalizable shape and semantic cues, leading to inferior performance and generalization capability. Instead of relying on large-scale training data, this work tackles this challenge by exploiting the rich priors of large generative models. We propose a data-efficient generalizable localization method named GenLoc. GenLoc extracts generative priors from a pre-trained image generation model by calculating the correlation map between the image latent feature and the condition embedding. These priors are then optimized with our proposed heatmap expectation loss to perform object keypoint localization. Benefiting from the rich knowledge of generative priors in understanding object semantics and structures, GenLoc achieves superior performance on various object keypoint localization benchmarks. It shows even more substantial performance enhancements in cross-domain, few-shot, and zero-shot evaluation settings, e.g., achieving a 20\%+ AP improvement over CLAMP in various zero-shot settings.
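A plausible form of a heatmap expectation loss is a soft-argmax over the correlation map followed by a coordinate regression penalty; the sketch below is our reading, not necessarily the paper's exact formulation.

```python
# Sketch: turn a correlation map into a spatial probability distribution and
# penalize the distance between its expected coordinate and the keypoint.
import torch
import torch.nn.functional as F


def heatmap_expectation_loss(corr_map, gt_xy):
    """corr_map: (B, K, H, W) correlation maps; gt_xy: (B, K, 2) in pixel coords."""
    B, K, H, W = corr_map.shape
    prob = F.softmax(corr_map.flatten(2), dim=-1).view(B, K, H, W)
    ys = torch.arange(H, device=corr_map.device, dtype=prob.dtype)
    xs = torch.arange(W, device=corr_map.device, dtype=prob.dtype)
    exp_y = (prob.sum(dim=3) * ys).sum(dim=-1)            # (B, K) expected row
    exp_x = (prob.sum(dim=2) * xs).sum(dim=-1)            # (B, K) expected column
    pred_xy = torch.stack([exp_x, exp_y], dim=-1)          # (B, K, 2)
    return F.smooth_l1_loss(pred_xy, gt_xy)


loss = heatmap_expectation_loss(torch.randn(2, 17, 64, 48), torch.rand(2, 17, 2) * 48)
```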
Poster
Huajie Jiang · Zhengxian Li · Xiaohan Yu · Yongli Hu · Baocai Yin · Jian Yang · Yuankai Qi
[ ExHall D ]
Abstract
Generalized zero-shot learning aims to recognize both seen and unseen classes with the help of semantic information that is shared among different classes. It inevitably requires consistent visual-semantic alignment. Existing approaches fine-tune the visual backbone with seen-class data to obtain semantic-related visual features, which may cause overfitting on seen classes with a limited number of training images. This paper proposes a novel visual and semantic prompt collaboration framework, which utilizes prompt tuning techniques for efficient feature adaptation. Specifically, we design a visual prompt to integrate the visual information for discriminative feature learning and a semantic prompt to integrate the semantic information for visual-semantic alignment. To achieve effective prompt information integration, we further design a weak prompt fusion mechanism for the shallow layers and a strong prompt fusion mechanism for the deep layers in the network. Through the collaboration of visual and semantic prompts, we can obtain discriminative semantic-related features for generalized zero-shot image recognition. Extensive experiments demonstrate that our framework consistently achieves favorable performance in both conventional zero-shot learning and generalized zero-shot learning benchmarks compared to other state-of-the-art methods.
Poster
Libiao Chen · Dong Nie · Junjun Pan · Jing Yan · Zhenyu Tang
[ ExHall D ]
Abstract
Generalized Zero-Shot Learning (GZSL) addresses the challenge of classifying unseen classes in the presence of seen classes by leveraging semantic attributes to bridge the gap for unseen classes. However, in image-based disease classification, such as glioma sub-typing, distinguishing between classes using image semantic attributes can be challenging. To address this challenge, we introduce a novel GZSL method that eliminates the dependency on semantic information. Specifically, we observe that the primary goal of most clinical classification is risk stratification, and that classes are inherently ordered rather than purely categorical. Based on this insight, we present an inter-class feature augmentation (IFA) module, where the distributions of different classes are ordered by their risk levels in a learned feature space using a pre-defined conditional Gaussian distribution model. This ordering enables the generation of unseen-class features through feature mixing of adjacent seen classes, effectively transforming the zero-shot learning problem into a supervised learning task. Our method eliminates the need for explicit semantic information, avoiding the potential domain shift between visual and semantic features. Moreover, the IFA module provides a simple yet effective solution for zero-shot classification, requiring no structural modifications to existing classification models. In the experiments, both in-house and public datasets are used …
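The core augmentation step, generating features for an unseen intermediate risk level by mixing adjacent seen classes, can be sketched as below; the uniform mixing coefficient and the toy risk ordering are our simplifications of the Gaussian-ordered feature space.

```python
# Sketch: synthesize features for an unseen class whose risk level lies
# between two adjacent seen classes, by convex mixing of their features.
import torch


def synthesize_between(feat_low, feat_high, n_samples: int = 64):
    """feat_low/feat_high: (N, D) features of the two adjacent seen classes."""
    idx_l = torch.randint(0, feat_low.size(0), (n_samples,))
    idx_h = torch.randint(0, feat_high.size(0), (n_samples,))
    lam = torch.rand(n_samples, 1)                       # mixing coefficient per sample
    return lam * feat_low[idx_l] + (1 - lam) * feat_high[idx_h]


low_grade = torch.randn(200, 256)      # seen class: lower risk level
high_grade = torch.randn(180, 256)     # seen class: higher risk level
unseen_feats = synthesize_between(low_grade, high_grade)   # pseudo unseen class
labels = torch.full((unseen_feats.size(0),), 1)            # intermediate risk label
```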
Poster
Enguang Wang · Zhimao Peng · Zhengyuan Xie · Fei Yang · Xialei Liu · Ming-Ming Cheng
[ ExHall D ]
Abstract
Given unlabelled datasets containing both old and new categories, generalized category discovery (GCD) aims to accurately discover new classes while correctly classifying old classes. Current GCD methods use only a single visual modality of information, resulting in poor classification of visually similar classes. As a different modality, text information can provide complementary discriminative information, which motivates us to introduce it into the GCD task. However, the lack of class names for unlabelled data makes it impractical to utilize text information. To tackle this challenging problem, in this paper we propose a Text Embedding Synthesizer (TES) to generate pseudo text embeddings for unlabelled samples. Specifically, our TES leverages the property that CLIP can generate aligned vision-language features, converting visual embeddings into tokens of CLIP’s text encoder to generate pseudo text embeddings. Besides, we employ a dual-branch framework; through the joint learning and instance consistency of the different modality branches, visual and semantic information mutually enhance each other, promoting the interaction and fusion of visual and text knowledge. Our method unlocks the multi-modal potential of CLIP and outperforms the baseline methods by a large margin on all GCD benchmarks, achieving a new state-of-the-art.
Poster
Chang-Bin Zhang · Jinhong Ni · Yujie Zhong · Kai Han
[ ExHall D ]
Abstract
In this paper, we address the challenging problem of open-world instance segmentation. Existing works have shown that vanilla visual networks are biased toward learning appearance information, e.g., texture, to recognize objects. This implicit bias causes the model to fail in detecting novel objects with unseen textures in the open-world setting. To address this challenge, we propose a learning framework, called View-Consistent leaRning (VCR), which aims to enforce the model to learn appearance-invariant representations for robust instance segmentation. In VCR, we first introduce additional views for each image, where the texture undergoes significant alterations while preserving the image's underlying structure. We then encourage the model to learn the appearance-invariant representation by enforcing the consistency between object features across different views, for which we obtain class-agnostic object proposals using off-the-shelf unsupervised models that possess strong object-awareness. These proposals enable cross-view object feature matching, greatly reducing the appearance dependency while enhancing the object-awareness. We thoroughly evaluate our VCR on public benchmarks under both cross-class and cross-dataset settings, achieving state-of-the-art performance.
Poster
Muli Yang · Gabriel James Goenawan · Huaiyuan Qin · Kai Han · Xi Peng · Yanhua Yang · Hongyuan Zhu
[ ExHall D ]
Abstract
Despite being trained on massive data, today's vision foundation models still fall short in detecting open-world objects. Apart from recognizing known objects that appeared in training, a successful Open World Object Detection (OWOD) system must also be able to detect unknown objects that were never seen before, without confusing them with the background. Unlike prevailing prior works that learn "objectness" using probability models, we focus on learning fine-grained class-agnostic attributes that can be used to detect both known and unknown object classes in an explainable manner. In this paper, we propose Partial Attribute Assignment (PASS), aiming to automatically select and optimize a small, relevant subset of attributes from a larger attribute pool. Specifically, we model attribute selection as a Partial Optimal Transport (POT) problem between known visual objects and the attribute pool, in which more relevant attributes receive more transported mass. PASS follows a curriculum schedule that progressively selects and optimizes a targeted subset of attributes during training, promoting stability and accuracy. Our method enjoys end-to-end optimization by minimizing the POT distance and the classification loss on known visual objects, demonstrating high training efficiency and superior OWOD performance in extensive experimental evaluations. Our code will be made public.
Poster
Jiangyi Wang · Na Zhao
[ ExHall D ]
Abstract
Active learning has emerged as a promising approach to reduce the substantial annotation burden in 3D object detection tasks, spurring several initiatives in outdoor environments. However, its application in indoor environments remains unexplored. Compared to outdoor 3D datasets, indoor datasets face significant challenges, including fewer training samples per class, a greater number of classes, more severe class imbalance, and more diverse scene types and intra-class variances. This paper presents the first study on active learning for indoor 3D object detection, where we propose a novel framework tailored for this task. Our method incorporates two key criteria - uncertainty and diversity - to actively select the most ambiguous and informative unlabeled samples for annotation. The uncertainty criterion accounts for both inaccurate detections and undetected objects, ensuring that the most ambiguous samples are prioritized. Meanwhile, the diversity criterion is formulated as a joint optimization problem that maximizes the diversity of both object class distributions and scene types, using a new Class-aware Adaptive Prototype (CAP) bank. The CAP bank dynamically allocates representative prototypes to each class, helping to capture varying intra-class diversity across different categories. We evaluate our method on SUN RGB-D and ScanNetV2, where it outperforms baselines by a significant margin, achieving over 85\% …
Poster
Shizhou Zhang · Xueqiang Lv · Yinghui Xing · Qirui Wu · Di Xu · Yanning Zhang
[ ExHall D ]
Abstract
Generative replay has gained significant attention in class-incremental learning; however, its application to Class Incremental Object Detection (CIOD) remains limited due to the challenges in generating complex images with precise spatial arrangements. In this study, motivated by the observation that the forgetting of prior knowledge is predominantly present in the classification sub-task as opposed to the localization sub-task, we revisit the generative replay method for class-incremental object detection. Our method utilizes a standard Stable Diffusion model to generate image-level replay data for all old and new tasks. Accordingly, the old detector and a stage-wise detector are applied to the synthetic images, respectively, to determine bounding box positions through pseudo-labeling. Furthermore, we propose a Similarity-based Cross Sampling mechanism to select more valuable confusing data between old and new tasks, which more effectively mitigates catastrophic forgetting and reduces the false alarm rate for the new task. Finally, all synthetic and real data are integrated for current-stage detector training, where the images generated for previous tasks are highly beneficial in minimizing the forgetting of existing knowledge, while those synthesized for the new task help bridge the domain gap between real and synthetic images. We conducted extensive experiments on …
Poster
Yu Zhou · Dian Zheng · Qijie Mo · Ren-Jie Lu · Kun-Yu Lin · Wei-Shi Zheng
[ ExHall D ]
Abstract
In this work, we present DEcoupLEd Distillation To Erase (DELETE), a general and strong unlearning method for any class-centric task. To derive this, we first propose a theoretical framework to analyze the general form of unlearning losses and decompose them into forgetting and retention terms. Through this theoretical framework, we point out that a class of previous methods can be formulated as a loss that implicitly optimizes the forgetting term while lacking supervision for the retention term, disturbing the distribution of the pre-trained model and struggling to adequately preserve knowledge of the remaining classes. To address this, we refine the retention term using "dark knowledge" and propose a mask distillation unlearning method. By applying a mask to separate forgetting logits from retention logits, our approach optimizes both the forgetting and refined retention components simultaneously, retaining knowledge of the remaining classes while ensuring thorough forgetting of the target class. Without access to the remaining data or intervention (i.e., as used in some works), we achieve state-of-the-art performance across various benchmarks. Moreover, DELETE is a general solution that can be applied to various downstream tasks, including face recognition, backdoor defense, and semantic segmentation, with strong performance.
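One way to realize the mask-separated loss is sketched below: distill the teacher's distribution over retained classes and add a term that suppresses the forget class. The specific forgetting term and the temperature are our assumptions, not necessarily the paper's formulation.

```python
# Sketch: mask-separated distillation for class unlearning, combining a
# retention term (dark knowledge over remaining classes) and a forgetting term.
import torch
import torch.nn.functional as F


def masked_distillation_unlearning(student_logits, teacher_logits, forget_class: int,
                                   tau: float = 2.0, lam: float = 1.0):
    keep = torch.ones(student_logits.size(1), dtype=torch.bool,
                      device=student_logits.device)
    keep[forget_class] = False
    # Retention: match the teacher's distribution over the remaining classes.
    p_t = F.softmax(teacher_logits[:, keep] / tau, dim=-1)
    log_p_s = F.log_softmax(student_logits[:, keep] / tau, dim=-1)
    retention = F.kl_div(log_p_s, p_t, reduction='batchmean') * tau ** 2
    # Forgetting: drive the forget-class probability toward zero.
    p_forget = F.softmax(student_logits, dim=-1)[:, forget_class]
    forgetting = -torch.log1p(-p_forget.clamp(max=1 - 1e-6)).mean()
    return retention + lam * forgetting


loss = masked_distillation_unlearning(torch.randn(8, 10), torch.randn(8, 10), forget_class=3)
```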
Poster
Mauricio Byrd Victorica · György Dán · Henrik Sandberg
[ ExHall D ]
Abstract
Adversarial patches are capable of misleading computer vision systems based on convolutional neural networks. Existing recovery methods suffer from at least one of three fundamental shortcomings: no information about the presence of patches in the scene, inability to efficiently handle noncontiguous patch attacks, and a strong reliance on fixed saliency thresholds. We propose Saliuitl, a recovery method independent of the number of patches and their shape, which, unlike prior works, explicitly detects patch attacks before attempting recovery. In our approach, detection is based on the attributes of a binarized feature map ensemble, which is generated by using an ensemble of saliency thresholds. If an attack is detected, Saliuitl recovers clean predictions by locating patches, guided by an ensemble of binarized feature maps, and inpainting them. We evaluate Saliuitl on widely used object detection and image classification benchmarks from the adversarial patch literature, and our results show that compared to recent state-of-the-art defenses, Saliuitl achieves a recovery rate up to 97.81 and 42.63 percentage points higher at the same rate of lost predictions for image classification and object detection, respectively. By design, Saliuitl has low computational complexity and is robust to adaptive white-box attacks. Our code is available at https://github.com/Saliuitl/Saliuitl/tree/main.
Poster
Jiacong Xu · Shao-Yuan Lo · Bardia Safaei · Vishal M. Patel · Isht Dwivedi
[ ExHall D ]
Abstract
Zero-Shot Anomaly Detection (ZSAD) is an emerging AD paradigm. Unlike the traditional unsupervised AD setting that requires a large number of normal samples to train a model, ZSAD is more practical for handling data-restricted real-world scenarios. Recently, Multimodal Large Language Models (MLLMs) have shown revolutionary reasoning capabilities in various vision tasks. However, the reasoning of image abnormalities remains underexplored due to the lack of corresponding datasets and benchmarks. To facilitate research in anomaly detection and reasoning, we establish the first visual instruction tuning dataset, Anomaly-Instruct-125k, and the evaluation benchmark, VisA-D&R. Through investigation with our benchmark, we reveal that current MLLMs like GPT-4o cannot accurately detect and describe fine-grained anomalous details in images. To address this, we propose Anomaly-OneVision (Anomaly-OV), the first specialist visual assistant for ZSAD and reasoning, based on LLaVA-OneVision. Inspired by human behavior in visual inspection, Anomaly-OV leverages a Look-Twice Feature Matching (LTFM) mechanism to adaptively select and emphasize abnormal visual tokens for its LLM. Extensive experiments demonstrate that Anomaly-OV achieves significant improvements over advanced generalist models in both detection and reasoning. Furthermore, extensions to medical and 3D anomaly reasoning are provided for future study.
Poster
Mojtaba Nafez · Amirhossein Koochakian · Arad Maleki · Jafar Habibi · Mohammad Rohban
[ ExHall D ]
Abstract
Anomaly Detection (AD) and Anomaly Localization (AL) are crucial in fields that demand high reliability, such as medical imaging and industrial monitoring. However, current AD and AL approaches are often susceptible to adversarial attacks due to limitations in training data, which typically include only normal, unlabeled samples. This study introduces PatchGuard, an adversarially robust AD and AL method that incorporates pseudo anomalies with localization masks within a Vision Transformer (ViT)-based architecture to address these vulnerabilities. We begin by examining the essential properties of pseudo anomalies, and follow with theoretical insights into the attention mechanisms required to enhance the adversarial robustness of AD and AL systems. We then present our approach, which leverages Foreground-Aware Pseudo-Anomalies to overcome the deficiencies of previous anomaly-aware methods. Our method incorporates these crafted pseudo-anomaly samples into a ViT-based framework, with adversarial training guided by a novel loss function designed to improve model robustness, as supported by our theoretical analysis. Experimental results on well-established industrial and medical datasets demonstrate that PatchGuard significantly outperforms previous methods in adversarial settings, achieving performance gains of 53.2% in AD and 68.5% in AL, while also maintaining competitive accuracy in non-adversarial settings.
Poster
Ankan Kumar Bhunia · Changjian Li · Hakan Bilen
[ ExHall D ]
Abstract
This paper introduces a novel anomaly detection (AD) problem aimed at identifying `odd-looking' objects within a scene by comparing them to other objects present. Unlike traditional AD benchmarks with fixed anomaly criteria, our task detects anomalies specific to each scene by inferring a reference group of regular objects. To address occlusions, we use multiple views of each scene as input, construct 3D object-centric models for each instance from 2D views, enhancing these models with geometrically consistent part-aware representations. Anomalous objects are then detected through cross-instance comparison. We also introduce two new benchmarks, ToysAD-8K and PartsAD-15K as testbeds for future research in this task. We provide a comprehensive analysis of our method quantitatively and qualitatively on these benchmarks. The datasets, source code, and models will be made publicly available upon publication.
Poster
Jia Guo · Shuai Lu · Weihang Zhang · Fang Chen · Hongen Liao · Huiqi Li
[ ExHall D ]
Abstract
Recent studies highlighted a practical setting of unsupervised anomaly detection (UAD) that builds a unified model for multi-class images. Despite various advancements addressing this challenging task, the detection performance under the multi-class setting still lags far behind state-of-the-art class-separated models. Our research aims to bridge this substantial performance gap. In this paper, we present Dinomaly, a minimalist reconstruction-based anomaly detection framework that harnesses pure Transformer architectures without relying on complex designs, additional modules, or specialized tricks. Given this powerful framework consisting of only Attentions and MLPs, we found four simple components that are essential to multi-class anomaly detection: (1) Scalable foundation Transformers that extracts universal and discriminative features, (2) Noisy Bottleneck where pre-existing Dropouts do all the noise injection tricks, (3) Linear Attention that naturally cannot focus, and (4) Loose Reconstruction that does not force layer-to-layer and point-by-point reconstruction. Extensive experiments are conducted across popular anomaly detection benchmarks including MVTec-AD, VisA, and Real-IAD. Our proposed Dinomaly achieves impressive image-level AUROC of __99.6__%, __98.7__%, and __89.3__% on the three datasets respectively, which is not only superior to state-of-the-art multi-class UAD methods, but also achieves the most advanced class-separated UAD records.
Poster
Fuyun Wang · Tong Zhang · Yuanzhi Wang · Yide Qiu · Xin Liu · Xu Guo · Zhen Cui
[ ExHall D ]
Abstract
In Open-set Supervised Anomaly Detection (OSAD), the existing methods typically generate pseudo anomalies to compensate for the scarcity of observed anomaly samples, while overlooking critical priors of normal samples, leading to less effective discriminative boundaries. To address this issue, we propose a Distribution Prototype Diffusion Learning (DPDL) method aimed at enclosing normal samples within a compact and discriminative distribution space. Specifically, we construct multiple learnable Gaussian prototypes to create a latent representation space for abundant and diverse normal samples and learn a Schrödinger bridge to facilitate a diffusive transition toward these prototypes for normal samples while steering anomaly samples away. Moreover, to enhance inter-sample separation, we design a dispersion feature learning way in hyperspherical space, which benefits the identification of out-of-distribution anomalies. Experimental results demonstrate the effectiveness and superiority of our proposed DPDL, achieving state-of-the-art performance on 9 public datasets.
Poster
Mohamed Afane · Gabrielle Ebbrecht · Ying Wang · Juntao Chen · Junaid Farooq
[ ExHall D ]
Abstract
Quantum Neural Networks (QNNs) offer promising capabilities for complex data tasks, but are often constrained by limited qubit resources and high entanglement, which can hinder scalability and efficiency. In this paper, we introduce Adaptive Threshold Pruning (ATP), an encoding method that reduces entanglement and optimizes data complexity for efficient computations in QNNs. ATP dynamically prunes non-essential features in the data based on adaptive thresholds, effectively reducing quantum circuit requirements while preserving high performance. Extensive experiments across multiple datasets demonstrate that ATP reduces entanglement entropy and improves adversarial robustness when combined with adversarial training methods like FGSM. Our results highlight ATP’s ability to balance computational efficiency and model resilience, achieving significant performance improvements with fewer resources, which will help make QNNs more feasible in practical, resource-constrained settings.
Poster
Yanda Chen · Gongwei Chen · Miao Zhang · Weili Guan · Liqiang Nie
[ ExHall D ]
Abstract
Dataset distillation (DD) excels in synthesizing a small number of images per class (IPC) but struggles to maintain its effectiveness in high-IPC settings. Recent works on dataset distillation demonstrate that combining distilled and real data can mitigate the effectiveness decay. However, our analysis of the combination paradigm reveals that the current one-shot and independent selection mechanism induces an incompatibility issue between distilled and real images. To address this issue, we introduce a novel curriculum coarse-to-fine selection (CCFS) method for efficient high-IPC dataset distillation. CCFS employs a curriculum selection framework for real data selection, where we leverage a coarse-to-fine strategy to select appropriate real data based on the current synthetic dataset in each curriculum. Extensive experiments demonstrate the effectiveness of CCFS, achieving significant improvements over the state-of-the-art: +6.6\% on CIFAR-10, +5.8\% on CIFAR-100, and +3.4\% on Tiny-ImageNet in high-IPC settings. Notably, we achieve 60.2\% test accuracy on ResNet-18 with a 20\% compression ratio of Tiny-ImageNet, yielding similar performance as full dataset training with only 0.3\% performance degradation.
Poster
Byeongho Heo · Taekyung Kim · Sangdoo Yun · Dongyoon Han
[ ExHall D ]
Abstract
Pre-training with random masked inputs has emerged as a novel trend in self-supervised training. However, supervised learning still faces a challenge in adopting masking augmentations, primarily due to unstable training. In this paper, we propose a novel way to involve masking augmentations dubbed Masked Sub-model (MaskSub). MaskSub consists of a main-model and a sub-model, the latter being a part of the former. The main-model undergoes conventional training recipes, while the sub-model receives intensive masking augmentations during training. MaskSub tackles the challenge by mitigating adverse effects through a relaxed loss function similar to a self-distillation loss. Our analysis shows that MaskSub significantly improves performance, with the training loss converging faster than in standard training, which suggests our method stabilizes the training process. We further validate MaskSub across diverse training scenarios and models, including DeiT-III training, MAE finetuning, CLIP finetuning, BERT training, and hierarchical architectures (ResNet and Swin Transformer). Our results show that MaskSub consistently achieves significant performance gains across all the cases. MaskSub provides a practical and effective solution for introducing additional regularization under various training recipes. Our code will be publicly available.
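A minimal sketch of one training step in this style, with a clean main pass and a masked sub pass sharing the same weights; the patch size, masking ratio, and KL form of the relaxed loss are illustrative choices of ours.

```python
# Sketch: main-model forward with the standard loss, plus a sub-model forward
# on patch-masked input trained to match the clean predictions.
import torch
import torch.nn.functional as F


def masksub_step(model, images, targets, mask_ratio: float = 0.5):
    logits_main = model(images)                       # main-model: standard recipe
    # Sub-model: same weights, but input patches are randomly dropped.
    B, C, H, W = images.shape
    patch = 16
    mask = (torch.rand(B, 1, H // patch, W // patch, device=images.device) > mask_ratio)
    mask = mask.float().repeat_interleave(patch, 2).repeat_interleave(patch, 3)
    logits_sub = model(images * mask)
    loss_main = F.cross_entropy(logits_main, targets)
    loss_sub = F.kl_div(F.log_softmax(logits_sub, dim=-1),
                        F.softmax(logits_main.detach(), dim=-1),
                        reduction='batchmean')        # relaxed, self-distillation-like
    return loss_main + loss_sub


# Toy usage with a stand-in classifier.
model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 16, 16), torch.nn.Flatten(),
                            torch.nn.Linear(8 * 14 * 14, 10))
loss = masksub_step(model, torch.randn(4, 3, 224, 224), torch.randint(0, 10, (4,)))
```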
Poster
Qing Zhou · Junyu Gao · Qi Wang
[ ExHall D ]
Abstract
The rapid growth of dataset scales has been a key driver in advancing deep learning research. However, as dataset scale increases, the training process becomes increasingly inefficient due to the presence of low-value samples, including excessive redundant samples, overly challenging samples, and inefficient easy samples that contribute little to model improvement. To address this challenge, we propose Scale Efficient Training (SeTa) for large datasets, a dynamic sample pruning approach that losslessly reduces training time. To remove low-value samples, SeTa first performs random pruning to eliminate redundant samples, then clusters the remaining samples according to their learning difficulty measured by loss. Building upon this clustering, a sliding window strategy is employed to progressively remove both overly challenging and inefficient easy clusters following an easy-to-hard curriculum. We conduct extensive experiments on large-scale synthetic datasets, including ToCa, SS1M, and ST+MJ, each containing over 3 million samples. SeTa reduces training costs by up to 50% while maintaining or improving performance, with minimal degradation even at 70% cost reduction. Furthermore, experiments on various scale real datasets across various backbones (including CNNs, Transformers, and Mambas) and diverse tasks (instruction tuning, multi-view stereo, geo-localization, composed image retrieval, referring image segmentation) demonstrate the powerful effectiveness and universality of …
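The selection loop described above might look roughly like this; the pruning rate, cluster count, and window schedule are illustrative, not the paper's settings.

```python
# Sketch of the three-step selection: random pruning of redundant samples,
# clustering by per-sample loss as a difficulty proxy, and an easy-to-hard
# sliding window over the clusters.
import numpy as np
from sklearn.cluster import KMeans


def seta_select(losses: np.ndarray, epoch: int, total_epochs: int,
                random_keep: float = 0.7, n_clusters: int = 10, window: int = 4):
    idx = np.random.permutation(len(losses))[:int(random_keep * len(losses))]
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(
        losses[idx].reshape(-1, 1))
    # Order clusters from easy (low loss) to hard (high loss).
    order = np.argsort([losses[idx][labels == c].mean() for c in range(n_clusters)])
    # Slide a window of clusters from the easy end toward the hard end.
    start = int((epoch / max(1, total_epochs - 1)) * (n_clusters - window))
    active = set(order[start:start + window])
    keep_mask = np.isin(labels, list(active))
    return idx[keep_mask]                         # indices of samples to train on


losses = np.random.rand(10_000)                   # per-sample losses from the last epoch
selected = seta_select(losses, epoch=3, total_epochs=30)
```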
Poster
Eliahu Horwitz · Bar Cavia · Jonathan Kahana · Yedid Hoshen
[ ExHall D ]
Abstract
The increasing availability of public models begs the question: can we train neural networks that use other networks as input? Such models allow us to study different aspects of a given neural network, for example, determining the categories in a model's training dataset. However, machine learning on model weights is challenging as they often exhibit significant variation unrelated to the models' semantic properties (nuisance variation). Here, we identify a key property of real-world models: most public models belong to a small set of Model Trees, where all models within a tree are fine-tuned from a common ancestor (e.g., a foundation model). Importantly, we find that within each tree there is less nuisance variation between models. Concretely, while learning across Model Trees requires complex architectures, even a linear classifier trained on a single model layer often works within trees. While effective, these linear classifiers are computationally expensive, especially when dealing with larger models that have many parameters. To address this, we introduce Probing Experts (ProbeX), a theoretically motivated and lightweight method. Notably, ProbeX is the first probing method specifically designed to learn from the weights of a single hidden model layer. We demonstrate the effectiveness of ProbeX by predicting the categories …
Poster
Sebastian Dziadzio · Vishaal Udandarao · Karsten Roth · Ameya Prabhu · Zeynep Akata · Samuel Albanie · Matthias Bethge
[ ExHall D ]
Abstract
Model merging combines "expert" models, each finetuned from a shared foundation model on diverse tasks and domains, into a single, more capable base model. However, existing model merging approaches assume all experts to be available simultaneously. In reality, new tasks and domains emerge continuously, prompting the need for a dynamic process of integrating these experts over time, which we call *temporal model merging*. The temporal dimension introduces unique challenges not addressed in prior work: At each task, should expert training start from the merged previous experts or from the original base model? Should all models be merged at every time step? Which merging techniques are best suited for temporal merging? Should different strategies be used for the training initialization and deployment phases? To tackle these questions, we propose a unified framework called TIME (Temporal Integration of Model Expertise) that defines temporal model merging across three axes: (1) Initialization Phase, (2) Deployment Phase, and (3) Merging Technique. Utilizing TIME, we study temporal model merging across model sizes, tasks, and compute budgets on the large-scale FoMo-in-Flux benchmark for continual multimodal pretraining. Systematic experiments across TIME and FoMo-in-Flux allow us to arrive at several crucial key insights for temporal model merging to better understand current limits and best practices for successful …
Poster
Xiaohan Qin · Xiaoxing Wang · Junchi Yan
[ ExHall D ]
Abstract
Multi-task learning (MTL) can leverage shared knowledge across tasks to improve data efficiency and generalization performance, and has been applied in various scenarios. However, task imbalance remains a major challenge for existing MTL methods. While prior works have attempted to mitigate inter-task unfairness through loss-based and gradient-based strategies, they still exhibit imbalanced performance across tasks on common benchmarks. This key observation motivates us to consider performance-level information as an explicit fairness indicator, which can precisely reflect the current optimization status of each task and accordingly help adjust the gradient aggregation process. Specifically, we utilize the performance variance among tasks as the fairness indicator and introduce a dynamic weighting strategy to gradually reduce the performance variance. Based on this, we propose PIVRG, a novel performance-informed variance reduction gradient aggregation approach. Extensive experiments show that PIVRG achieves SOTA performance across various benchmarks, spanning both supervised learning and reinforcement learning tasks with task numbers ranging from 2 to 40. Results from the ablation study also show that our approach can be integrated into existing methods, significantly enhancing their performance while reducing the performance variance among tasks, thus achieving fairer optimization.
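A toy version of performance-informed weighting is sketched below: tasks lagging behind the mean score get larger aggregation weights. The softmax form and temperature are our assumptions, not the exact PIVRG rule.

```python
# Sketch: weight per-task gradients by how far each task lags behind the
# average performance, then aggregate the shared-parameter gradient.
import torch


def aggregate_gradients(per_task_grads, task_scores, temperature: float = 1.0):
    """per_task_grads: list of flat gradient tensors; task_scores: higher = better."""
    scores = torch.tensor(task_scores, dtype=torch.float32)
    gap = scores.mean() - scores                        # positive for lagging tasks
    weights = torch.softmax(gap / temperature, dim=0)   # emphasize lagging tasks
    agg = sum(w * g for w, g in zip(weights, per_task_grads))
    return agg, weights


grads = [torch.randn(1000) for _ in range(3)]            # per-task gradients of shared params
agg, w = aggregate_gradients(grads, task_scores=[0.81, 0.64, 0.77])
```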
Poster
Sihao Liu · Yibo Yang · Xiaojie Li · David A. Clifton · Bernard Ghanem
[ ExHall D ]
Abstract
Online continual learning (OCL) seeks to learn new tasks from data streams that appear only once, while retaining knowledge of previously learned tasks. Most existing methods rely on replay, focusing on enhancing memory retention through regularization or distillation. However, they often overlook the adaptability of the model, limiting the ability to learn generalizable and discriminative features incrementally from online training data. To address this, we introduce a plug-and-play module, S6MOD, which can be integrated into most existing methods and directly improve adaptability. Specifically, S6MOD introduces an extra branch after the backbone, where a mixture of discretization selectively adjusts parameters in a selective state space model, enriching selective scan patterns such that the model can adaptively select the most sensitive discretization method for current dynamics. We further design a class-conditional routing algorithm for dynamic, uncertainty-based adjustment and implement a contrastive discretization loss to optimize it. Extensive experiments combining our module with various models demonstrate that S6MOD significantly enhances model adaptability, leading to substantial performance gains and achieving state-of-the-art results.
Poster
Fei Ye · Adrian Bors
[ ExHall D ]
Abstract
Recent continual learning (CL) research primarily addresses catastrophic forgetting within a straightforward learning framework where class and task information are predefined. However, in more realistic and challenging CL scenarios, such supervised information is typically absent. In this paper, we address this challenging CL scenario by introducing an innovative memory management approach that incorporates a dynamic memory system for storing selected representatives from evolving data, while a dynamically expandable memory system retains essential long-term knowledge. Specifically, the dynamically expandable memory system manages a series of memory distributions, each designed to represent the information from a distinct data category. We propose a new memory expansion mechanism that assesses the proximity between incoming samples and existing memory distributions, utilizing this evaluation to incrementally add new memory distributions to the system. Additionally, a novel memory distribution augmentation technique is proposed to selectively gather suitable samples for each memory distribution, enhancing statistical robustness over time. To prevent memory saturation before the training phase, we introduce a memory distribution reduction strategy that automatically eliminates overlapping memory distributions, ensuring adequate capacity for accommodating new information in subsequent learning episodes. We conduct a series of experiments demonstrating that our proposed approach attains state-of-the-art performance in both …
Poster
Zijian Gao · Wangwang Jia · Xingxing Zhang · Dulan Zhou · Kele Xu · Feng Dawei · Yong Dou · Xinjun Mao · Huaimin Wang
[ ExHall D ]
Abstract
Class-Incremental Learning (CIL) enables models to continuously learn new classes while mitigating catastrophic forgetting. Recently, Pre-Trained Models (PTMs) have greatly enhanced CIL performance, even when fine-tuning is limited to the first task. This advantage is particularly beneficial for CIL methods that freeze the feature extractor after first-task fine-tuning, such as analytic learning-based approaches using a least squares solution-based classification head to acquire knowledge recursively. In this work, we revisit the analytical learning approach combined with PTMs and identify its limitations in adapting to new classes, leading to sub-optimal performance. To address this, we propose the **Mo**mentum-based **A**nalytical **L**earning (**MoAL**) approach. MoAL achieves robust knowledge memorization via an analytical classification head and improves adaptivity to new classes through momentum-based adapter weight interpolation, also leading to forgetting outdated knowledge. Importantly, we introduce a knowledge rumination mechanism that leverages refined adaptivity, allowing the model to revisit and reinforce old knowledge, thereby improving performance on old classes. MoAL facilitates the acquisition of new knowledge and consolidates old knowledge, achieving a win-win outcome between plasticity and stability. Extensive experiments on multiple datasets and incremental settings demonstrate that MoAL significantly outperforms current state-of-the-art methods.
Poster
Arnav Mohanty Das · Gantavya Bhatt · Lilly Kumari · Sahil Verma · Jeff Bilmes
[ ExHall D ]
Abstract
Retrieval augmentation, the practice of retrieving additional data from large auxiliary pools, has emerged as an effective technique for enhancing model performance in the low-data regime, e.g., few-shot learning. Prior approaches have employed only \emph{nearest-neighbor} based strategies for data selection, which retrieve auxiliary samples with high similarity to instances in the training set. However, these approaches are prone to selecting highly redundant samples, since they fail to incorporate any notion of diversity. In our work, we first demonstrate that data selection strategies used in prior retrieval-augmented few-shot learning settings can be generalized using a class of functions known as Combinatorial Mutual Information (CMI) measures. We then propose COBRA (COmBinatorial Retrieval Augmentation), which employs an alternative CMI measure that considers both diversity and similarity to a target dataset. COBRA consistently outperforms previous retrieval approaches across datasets and few-shot learning techniques when used to retrieve samples from LAION-2B. Using COBRA introduces negligible computational overhead to the cost of retrieval, while providing significant gains in downstream model performance.
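A toy sketch of retrieval that balances similarity to the few-shot target set against redundancy among the samples already selected, in the spirit of a combinatorial-mutual-information objective; the facility-location-style gain, the `diversity_weight` parameter, and the function names are illustrative assumptions rather than COBRA's exact CMI measure.

```python
import numpy as np

def greedy_cmi_selection(aux_feats, target_feats, k, diversity_weight=1.0):
    # Greedy selection under a facility-location-style surrogate: reward mean
    # similarity to the target (few-shot) set, penalize redundancy with items
    # already selected. Features are assumed L2-normalized.
    sim_t = aux_feats @ target_feats.T        # (N_aux, N_target)
    sim_a = aux_feats @ aux_feats.T           # (N_aux, N_aux)
    selected = []
    for _ in range(k):
        best_i, best_gain = None, -np.inf
        for i in range(len(aux_feats)):
            if i in selected:
                continue
            relevance = sim_t[i].mean()
            redundancy = max((sim_a[i, j] for j in selected), default=0.0)
            gain = relevance - diversity_weight * redundancy
            if gain > best_gain:
                best_i, best_gain = i, gain
        selected.append(best_i)
    return selected

# Usage with random unit-norm features: retrieve 10 of 100 auxiliary samples.
rng = np.random.default_rng(0)
aux = rng.normal(size=(100, 16)); aux /= np.linalg.norm(aux, axis=1, keepdims=True)
tgt = rng.normal(size=(5, 16));  tgt /= np.linalg.norm(tgt, axis=1, keepdims=True)
chosen = greedy_cmi_selection(aux, tgt, k=10)
```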
Poster
Da-Wei Zhou · Zi-Wen Cai · Han-Jia Ye · Lijun Zhang · De-Chuan Zhan
[ ExHall D ]
Abstract
Domain-Incremental Learning (DIL) involves the progressive adaptation of a model to new concepts across different domains. While recent advances in pre-trained models provide a solid foundation for DIL, learning new concepts often results in the catastrophic forgetting of pre-trained knowledge. Specifically, sequential model updates can overwrite both the representation and the classifier with knowledge from the latest domain. Thus, it is crucial to develop a representation and corresponding classifier that accommodate all seen domains throughout the learning process. To this end, we propose DUal ConsolidaTion (Duct) to unify and consolidate historical knowledge at both the representation and classifier levels. By merging the backbone of different stages, we create a representation space suitable for multiple domains incrementally. The merged representation serves as a balanced intermediary that captures task-specific features from all seen domains. Additionally, to address the mismatch between consolidated embeddings and the classifier, we introduce an extra classifier consolidation process. Leveraging class-wise semantic information, we estimate the classifier weights of old domains within the latest embedding space. By merging historical and estimated classifiers, we align them with the consolidated embedding space, facilitating incremental classification. Extensive experimental results on four benchmark datasets demonstrate Duct's state-of-the-art performance.
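A minimal sketch of the two consolidation steps described above, assuming a plain convex average of backbone weights and class-mean prototypes as the re-estimated classifier; both choices, and all function names, are deliberate simplifications of Duct's scheme.

```python
import torch

def merge_backbones(state_dicts, coeffs):
    # Convex combination of backbone parameters from models trained on
    # successive domains; `coeffs` should sum to one.
    merged = {}
    for name in state_dicts[0]:
        merged[name] = sum(c * sd[name] for c, sd in zip(coeffs, state_dicts))
    return merged

def estimate_classifier_prototypes(features, labels, num_classes):
    # Re-estimate per-class classifier weights in the consolidated embedding
    # space as normalized class-mean prototypes (every class is assumed to
    # appear at least once in `labels`).
    dim = features.shape[1]
    protos = torch.zeros(num_classes, dim)
    for c in range(num_classes):
        protos[c] = features[labels == c].mean(dim=0)
    return torch.nn.functional.normalize(protos, dim=1)
```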
Poster
Aristotelis Ballas · Christos Diou
[ ExHall D ]
Abstract
Domain Generalization (DG) research has gained considerable traction as of late, since the ability to generalize to unseen data distributions is a requirement that eludes even state-of-the-art training algorithms. In this paper we observe that the initial iterations of model training play a key role in domain generalization effectiveness, since the loss landscape may be significantly different across the training and test distributions, contrary to the case of i.i.d. data. Conflicts between gradients of the loss components of each domain lead the optimization procedure to undesirable local minima that do not capture the domain-invariant features of the target classes. We propose alleviating domain conflicts in model optimization, by iteratively annealing the parameters of a model in the early stages of training and searching for points where gradients align between domains. By discovering a set of parameter values where gradients are updated towards the same direction for each data distribution present in the training set, the proposed Gradient-Guided Annealing (GGA) algorithm encourages models to seek out minima that exhibit improved robustness against domain shifts. The efficacy of GGA is evaluated on four widely accepted and challenging image classification domain generalization benchmarks, where its use alone is able to establish highly competitive or even state-of-the-art performance. Moreover, when combined with previously proposed domain-generalization algorithms it is able to consistently improve their …
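A small sketch of the gradient-alignment signal that such an annealing search relies on: per-domain gradients at the current parameters and their mean pairwise cosine similarity. The helper name and the use of one batch per domain are assumptions; the annealing schedule itself is omitted.

```python
import torch
import torch.nn.functional as F

def domain_gradient_alignment(model, loss_fn, domain_batches):
    # One (x, y) batch per training domain; returns the mean pairwise cosine
    # similarity between the per-domain gradients at the current parameters.
    grads = []
    for x, y in domain_batches:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        g = torch.cat([p.grad.flatten() for p in model.parameters() if p.grad is not None])
        grads.append(g.clone())
    sims = [F.cosine_similarity(grads[i], grads[j], dim=0)
            for i in range(len(grads)) for j in range(i + 1, len(grads))]
    return torch.stack(sims).mean()
```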
Poster
Xiangyu Chang · Fahim Faisal Niloy · Sk Miraj Ahmed · Srikanth Krishnamurthy · Basak Guler · Ananthram Swami · Samet Oymak · Amit K. Roy-Chowdhury
[ ExHall D ]
Abstract
Incorporating transformer models into edge devices poses a significant challenge due to the computational demands of adapting these large models across diverse applications. Parameter-efficient tuning (PET) methods like LoRA, Adapter, and Visual Prompt Tuning (VPT) allow for targeted adaptation by modifying only small parts of the transformer model. However, adapting to dynamic, unlabeled target distributions at test time remains complex. To address this, we introduce AdMiT: Adaptive Multi-Source Tuning in Dynamic Environments. AdMiT innovates by pre-training a set of PET modules, each optimized for different source distributions or tasks, and dynamically selecting and integrating a sparse subset of relevant modules when encountering a new, few-shot, unlabeled target distribution. This integration leverages Kernel Mean Embedding (KME)-based matching to align the target distribution with relevant source knowledge efficiently, without requiring additional routing networks or hyperparameter tuning. AdMiT achieves adaptation with a single inference step, making it particularly suitable for resource-constrained edge deployments. Furthermore, AdMiT preserves privacy by performing adaptation locally on each edge device, with no data exchange required. Our theoretical analysis establishes guarantees for AdMiT's generalization, while extensive benchmarks demonstrate that AdMiT consistently outperforms other PET methods across a range of tasks, achieving robust and efficient adaptation.
Poster
Hassan Mahmood · Ehsan Elhamifar
[ ExHall D ]
Abstract
Generating targeted universal perturbations for multi-label recognition is a combinatorially hard problem that requires exponential time and space complexity. To address the problem, we propose a compositional framework. We show that a simple independence assumption on label-wise universal perturbations naturally leads to an efficient optimization that requires learning affine convex cones spanned by label-wise universal perturbations, significantly reducing the problem complexity to linear time and space. During inference, the framework allows generating universal perturbations for novel combinations of classes in constant time. We demonstrate the scalability of our method on large datasets and target sizes, evaluating its performance on NUS-WIDE, MS-COCO, and OpenImages using state-of-the-art multi-label recognition models. Our results show that our approach outperforms baselines and achieves results comparable to methods with exponential complexity.
Poster
Tianhao Ma · Han Chen · Juncheng Hu · Yungang Zhu · Ximing Li
[ ExHall D ]
Abstract
Learning from label proportions (LLP), a challenging weakly-supervised learning task, aims to train a classifier by using bags of instances and the proportions of classes within bags, rather than annotated labels for each instance. Beyond the traditional bag-level loss, the mainstream methodology of LLP is to incorporate an auxiliary instance-level loss with pseudo-labels formed by predictions. Unfortunately, we empirically observed that the pseudo-labels are often inaccurate and even meaningless, especially for scenarios with large bag sizes, hurting the classifier induction. To alleviate this problem, we suggest a novel LLP method, namely Learning Label Proportions with Auxiliary High-confident Instance-level Loss (L^2P-AHIL). Specifically, we propose a dual entropy-based weight (DEW) method to adaptively measure the confidences of pseudo-labels. It simultaneously emphasizes accurate predictions at the bag level and avoids overly smooth predictions, which tend to be meaningless. We then form a high-confident instance-level loss with DEW, and jointly optimize it with the bag-level loss in a self-training manner. The experimental results on benchmark datasets show that L^2P-AHIL can surpass the existing baseline methods, and the performance gain can be more significant as the bag size increases.
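A rough sketch of a dual entropy-style confidence weight, assuming instance-level confidence is one minus the normalized prediction entropy and bag-level confidence is one minus the total-variation gap to the known label proportion; the exact DEW formula in the paper may differ.

```python
import torch
import torch.nn.functional as F

def dual_entropy_weight(instance_probs, bag_proportion):
    # Instance-level confidence: one minus normalized prediction entropy.
    # Bag-level confidence: one minus the total-variation gap between the
    # bag-averaged prediction and the known label proportion.
    eps = 1e-8
    n_cls = instance_probs.shape[1]
    ent = -(instance_probs * (instance_probs + eps).log()).sum(dim=1)
    inst_conf = 1.0 - ent / torch.log(torch.tensor(float(n_cls)))
    bag_pred = instance_probs.mean(dim=0)
    bag_conf = 1.0 - 0.5 * (bag_pred - bag_proportion).abs().sum()
    return inst_conf * bag_conf

# Usage: a bag of 4 instances over 3 classes with proportion [0.5, 0.25, 0.25].
probs = F.softmax(torch.randn(4, 3), dim=1)
weights = dual_entropy_weight(probs, torch.tensor([0.5, 0.25, 0.25]))
```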
Poster
Jae Hyeon Park · Joo Hyeon Jeon · Jae Yun Lee · Sangyeon Ahn · MinHee Cha · Min Geol Kim · Hyeok Nam · Sung In Cho
[ ExHall D ]
Abstract
This study addresses the limitations of existing dynamic pseudo-labeling techniques, which often utilize static or dynamic thresholds for confident sample selection. Traditional methods fail to capture the non-linear relationship between task accuracy and model confidence, particularly in the context of overconfidence, thus limiting learning opportunities for sensitive samples that significantly influence a model's generalization ability. To solve this, we propose a novel gradient pass-based dynamic pseudo-labeling (DPL) technique that incorporates high-entropy samples, which are typically overlooked. Our approach introduces two classifiers—low gradient pass (LGP) and high gradient pass (HGP)—to derive sensitive dynamic thresholds (SDT) and underconfident dynamic thresholds (UDT), respectively. By effectively combining these thresholds with those from converged and overconfident states, we aim to create a more adaptive and effective learning strategy. Our main contributions highlight the importance of considering both low and high-confidence samples in enhancing model robustness and generalization for improved PL performance.
Poster
Erik Wallin · Fredrik Kahl · Lars Hammarstrand
[ ExHall D ]
Abstract
Out-of-distribution (OOD) detection in deep learning has traditionally been framed as a binary task, where samples are either classified as belonging to the known classes or marked as OOD, with little attention given to the semantic relationships between OOD samples and the in-distribution (ID) classes. We propose a framework for detecting and classifying OOD samples in a given label hierarchy. Specifically, we aim to predict OOD data to their correct internal nodes of the label hierarchy, whereas the known ID classes should be predicted as their corresponding leaf nodes. Our approach leverages the label hierarchy to create a probabilistic model and we implement this model by using networks trained for ID classification at multiple hierarchy depths. We conduct experiments on three datasets with predefined label hierarchies and show the effectiveness of our method. Our code is provided as supplementary material.
Poster
Divya M Shanmugam · Helen Lu · Swami Sankaranarayanan · John Guttag
[ ExHall D ]
Abstract
A conformal classifier produces a set of predicted classes and provides a probabilistic guarantee that the set includes the true class. Unfortunately, it is often the case that conformal classifiers produce uninformatively large sets. In this work, we show that test-time augmentation (TTA)--a technique that introduces inductive biases during inference--reduces the size of the sets produced by conformal classifiers. Our approach is flexible, computationally efficient, and effective. It can be combined with any conformal score, requires no model retraining, and reduces prediction set sizes by 10%-14% on average. We conduct an evaluation of the approach spanning three datasets, three models, two established conformal scoring methods, different guarantee strengths, and several distribution shifts to show when and why test-time augmentation is a useful addition to the conformal pipeline.
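A minimal split-conformal sketch using the simple score 1 - p(true class), into which TTA-averaged probabilities can be plugged; the function names and the choice of score are assumptions, and any conformal score could be substituted, as the abstract notes.

```python
import numpy as np

def tta_probs(probs_per_aug):
    # Average class probabilities over test-time augmentations (flips, crops, ...).
    return np.mean(probs_per_aug, axis=0)

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    # Split-conformal quantile for the score s = 1 - p(true class).
    # Requires NumPy >= 1.22 for the `method` keyword.
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(scores, level, method="higher")

def prediction_set(test_probs, qhat):
    # All classes whose score falls below the calibrated threshold.
    return np.where(1.0 - test_probs <= qhat)[0]
```

In this sketch the augmentation-averaged probabilities would be used consistently for both the calibration set and the test point, with no retraining of the underlying model.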
Poster
Xiangtao Zhang · Sheng Li · Ao Li · Yipeng Liu · Fan Zhang · Ce Zhu · Le Zhang
[ ExHall D ]
Abstract
Heterogeneous Federated Learning (HFL) has received widespread attention due to its adaptability to different models and data. The HFL approach utilizing auxiliary models for knowledge transfer enhances flexibility. However, existing frameworks face the challenges of aggregation bias and local overfitting. To address these issues, we propose FedSCE. It reduces the degrees of freedom of updates and improves generalization performance by limiting specific layers of the local model to update within a local subspace. The subspace is dynamically updated to ensure coverage of the latest model update trajectory. Additionally, FedSCE evaluates client contributions based on the update distance of the auxiliary model in feature space and parameter space, achieving adaptive weighted aggregation. We validate our approach in both feature-skewed and label-skewed scenarios, demonstrating that on Office10, our method exceeds the best baseline by 3.87. Our source code will be released.
Poster
Gongxi Zhu · Donghao Li · Hanlin Gu · Yuan Yao · Lixin Fan · Yuxing Han
[ ExHall D ]
Abstract
Federated Learning (FL) is a promising approach for training machine learning models on decentralized data while preserving privacy. However, privacy risks, particularly Membership Inference Attacks (MIAs), which aim to determine whether a specific data point belongs to a target client’s training set, remain a significant concern. Existing methods for implementing MIAs in FL primarily analyze updates from the target client, focusing on metrics such as loss, gradient norm, and gradient difference. However, these methods fail to leverage updates from non-target clients, potentially underutilizing available information. In this paper, we first formulate a one-tailed likelihood-ratio hypothesis test based on the likelihood of updates from non-target clients. Building upon this formulation, we introduce a three-stage Membership Inference Attack (MIA) method, called FedMIA, which follows an "all for one" principle, leveraging updates from all clients across multiple communication rounds to enhance MIA effectiveness. Both theoretical analysis and extensive experimental results demonstrate that FedMIA outperforms existing MIAs in both classification and generative tasks. Additionally, it can be integrated as an extension to existing methods and is robust against various defense strategies, Non-IID data, and different federated structures.
Poster
Jiahao Xu · Zikai Zhang · Rui Hu
[ ExHall D ]
Abstract
The distributed nature of training makes Federated Learning (FL) vulnerable to backdoor attacks, where malicious model updates aim to compromise the global model’s performance on specific tasks. Existing defense methods show limited efficacy as they overlook the inconsistency between benign and malicious model updates regarding both general and fine-grained directions. To fill this gap, we introduce AlignIns, a novel defense method designed to safeguard FL systems against backdoor attacks. AlignIns looks into the direction of each model update through a direction alignment inspection process. Specifically, it examines the alignment of model updates with the overall update direction and analyzes the distribution of the signs of their significant parameters, comparing them with the principal sign across all model updates. Model updates that exhibit an unusual degree of alignment are considered malicious and are thus filtered out. We provide the theoretical analysis of the robustness of AlignIns and its propagation error in FL. Our empirical results on both independent and identically distributed (IID) and non-IID datasets demonstrate that AlignIns achieves higher robustness compared to the state-of-the-art defense methods. Code is available at \url{https://anonymous.4open.science/r/AlignIns}.
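A simplified sketch of the two inspection signals described above, assuming flattened client updates: cosine similarity with the mean update direction, and sign agreement of the largest-magnitude coordinates with the coordinate-wise majority sign. Thresholding these scores to filter clients, and all names, are illustrative choices rather than the exact AlignIns rule.

```python
import numpy as np

def alignment_scores(updates, top_frac=0.1):
    # `updates`: list of flattened client model updates.
    # Signal 1: cosine similarity with the mean update direction.
    # Signal 2: sign agreement of the largest-magnitude coordinates with the
    # coordinate-wise majority sign across clients.
    U = np.stack(updates)                                   # (clients, params)
    mean_dir = U.mean(axis=0)
    cos = U @ mean_dir / (np.linalg.norm(U, axis=1) * np.linalg.norm(mean_dir) + 1e-12)
    majority_sign = np.sign(np.sign(U).sum(axis=0))
    k = max(1, int(top_frac * U.shape[1]))
    agree = []
    for u in U:
        idx = np.argsort(-np.abs(u))[:k]
        agree.append((np.sign(u[idx]) == majority_sign[idx]).mean())
    return cos, np.array(agree)

# Updates whose scores deviate strongly from the median (e.g., by a robust
# z-score) would be flagged as suspicious and excluded from aggregation.
```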
Poster
Fan Xing · Zhuo Tian · Xuefeng Fan · Xiaoyi Zhou
[ ExHall D ]
Abstract
Reversible Adversarial Examples (RAE) are designed to protect the intellectual property of datasets. Such examples can function as imperceptible adversarial examples to erode the model performance of unauthorized users while allowing authorized users to remove the adversarial perturbations and recover the original samples for normal model training. With the rise of Self-Supervised Learning (SSL), an increasing number of unlabeled datasets and pre-trained encoders are available in the community. However, existing RAE methods not only rely on well-labeled datasets for training Supervised Learning (SL) models but also exhibit poor adversarial transferability when attacking SSL pre-trained encoders. To address these challenges, we propose RAEncoder, the first framework for RAEs without the need for labeled samples. RAEncoder aims to generate universal adversarial perturbations by targeting SSL pre-trained encoders. Unlike traditional RAE approaches, the pre-trained encoder outputs the feature distribution of the protected dataset rather than classification labels, enhancing both the attack success rate and transferability of RAEs. Extensive experiments are conducted on six pre-trained encoders and four SL models, covering aspects such as imperceptibility and transferability. Our results demonstrate that RAEncoder effectively protects unlabeled datasets from malicious infringements. Additional robustness experiments further confirm the security of RAEncoder in practical application scenarios.
Poster
Sizai Hou · Songze Li · Duanyi Yao
[ ExHall D ]
Abstract
Self-supervised learning (SSL) is pervasively exploited in training high-quality upstream encoders with a large amount of unlabeled data. However, it is found to be susceptible to backdoor attacks merely via polluting a small portion of training data. The victim encoders mismatch triggered inputs with target embeddings, e.g., matching a triggered cat input to an airplane embedding, such that the downstream tasks misbehave when the trigger is activated. Emerging backdoor attacks have shown great threats in different SSL paradigms such as contrastive learning and CLIP, while little research has been devoted to defending against such attacks. Besides, the existing defenses fall short in detecting advanced stealthy backdoors. To address these limitations, we propose a novel detection mechanism, DEDE, which detects the activation of the backdoor mapping given the co-occurrence of a victim encoder and trigger inputs. Specifically, DEDE trains a decoder for the SSL encoder on an auxiliary dataset (which can be out-of-distribution or even slightly poisoned), such that for any triggered input that misleads to the target embedding, the decoder outputs an image significantly different from the input. We empirically evaluate DEDE on both contrastive learning and CLIP models against various types of backdoor attacks, and demonstrate its superior performance …
Poster
Shixin Li · Chaoxiang He · Xiaojing Ma · Bin Benjamin Zhu · Shuo Wang · Hongsheng Hu · Dongmei Zhang · Linchen Yu
[ ExHall D ]
Abstract
Adversarial attacks threaten the integrity of deep neural networks (DNNs), particularly in high-stakes applications. This paper explores an innovative black-box adversarial attack strategy leveraging checkpoints from a single model’s training trajectory. Unlike traditional ensemble attacks that require multiple surrogate models of different architectures, our approach utilizes a single model’s diverse training checkpoints to craft adversarial examples. By categorizing the knowledge learned during training into task-intrinsic and task-irrelevant knowledge, we identify checkpoints that predominantly capture task-intrinsic knowledge, which generalizes across different models. We introduce an accuracy gap-based selection strategy to enhance the transferability of adversarial examples to models with different architectures. Extensive experiments on benchmark datasets, including ImageNet and CIFAR-10, demonstrate that our method consistently outperforms traditional model ensemble attacks in terms of transferability. Furthermore, our approach remains highly effective even with significantly reduced training data, offering a practical and resource-efficient solution for highly transferable adversarial attacks.
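A generic sketch of crafting transferable adversarial examples by averaging the loss over several training checkpoints inside an iterative FGSM loop; the step schedule and epsilon are placeholders, and the paper's accuracy-gap-based checkpoint selection is not shown.

```python
import torch
import torch.nn.functional as F

def checkpoint_ensemble_attack(checkpoints, x, y, eps=8/255, steps=10):
    # Iterative FGSM whose loss is averaged over several training checkpoints
    # of the same architecture (all assumed to be in eval mode).
    alpha = eps / steps
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = sum(F.cross_entropy(m(x_adv), y) for m in checkpoints) / len(checkpoints)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.clamp(torch.min(torch.max(x_adv, x - eps), x + eps), 0.0, 1.0)
    return x_adv
```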
Poster
Yuan Xiao · Yuchen Chen · Shiqing Ma · Chunrong Fang · Tongtong Bai · Mingzheng Gu · Yuxin Cheng · Yanwei Chen · Zhenyu Chen
[ ExHall D ]
Abstract
The robustness of neural network classifiers is important in the safety-critical domain and can be quantified by robustness verification. At present, efficient and scalable verification techniques are always sound but incomplete, and thus, the improvement of verified robustness results is the key criterion to evaluate the performance of incomplete verification approaches. The multi-variate function MaxPool is widely adopted yet challenging to verify. In this paper, we present \textbf{Ti-Lin}, a robustness verifier for MaxPool-based CNNs with \textbf{Ti}ght \textbf{Lin}ear Approximation. Following the sequel of minimizing the over-approximation zone of the non-linear function of CNNs, we are the first to propose the provably neuron-wise tightest linear bounds for the MaxPool function. By our proposed linear bounds, we can certify larger robustness results for CNNs. We evaluate the effectiveness of Ti-Lin on different verification frameworks with open-sourced benchmarks, including LeNet, PointNet, and networks trained on the MNIST, CIFAR-10, Tiny ImageNet and ModelNet40 datasets. Experimental results show that Ti-Lin significantly outperforms the state-of-the-art methods across all networks with up to 78.6\% improvement in terms of the certified accuracy with almost the same time consumption as the fastest tool. Our code is available at \url{https://anonymous.4open.science/r/Ti-Lin-cvpr-72EE}.
Poster
Quanjiang Li · Tingjin Luo · Jiahui Liao
[ ExHall D ]
Abstract
Incomplete features and label noise in multi-view multi-label data significantly undermine reliability and performance, motivating researchers to explore mechanisms for representation and information recovery. However, learning under such dual deficiencies is crucial but rarely studied. In this paper, we propose a theory-inspired Deep Multi-View Multi-Label Learning method with Incomplete Views and Noisy Labels, named DMMIvNL, to address these problems. Specifically, to promote the synthesis of task-relevant shared information and preserve the distinctiveness of individual features from limited views, we have developed a feature extraction module based on the information bottleneck theory, and formulated its theoretical upper bound into its objective. Meanwhile, we theoretically prove that minimizing the volume of the transition matrix ensures statistical consistency with classifier training. Besides, a cycle-consistent estimation principle is proposed in the volume minimization network to improve the recognition stability of multi-label noise. Moreover, inherent real semantic information and label correlations are leveraged as model regularization to reduce the risk of excessive noise fitting. Finally, extensive experimental results validate the effectiveness and robustness of our DMMIvNL.
Poster
Baili Xiao · Zhibin Dong · KE LIANG · Suyuan Liu · Siwei Wang · Tianrui Liu · Xingchen Hu · En Zhu · Xinwang Liu
[ ExHall D ]
Abstract
Multi-view clustering represents one of the most established paradigms within the field of unsupervised learning and has witnessed a surge in popularity in recent years. View-pair contrastive learning allows for consistently representing multiple views by maximizing the mutual information between each pair of views. This approach permits multi-view clustering to discern consistent latent representations across multiple views. However, two significant issues emerge when this approach is considered. i) It is challenging to ascertain which two views are most appropriate for contrastive learning when there are more than three views, particularly without prior knowledge. ii) When all views are included in contrastive learning, multi-view clustering performance is compromised by poor-quality views. To tackle these issues, we present a novel Efficient Dual Selection Mechanism for deep Multi-View Clustering framework, termed EASEMVC. Specifically, EASEMVC first constructs a view graph based on the OT distance between the bipartite graphs of each view. It then designs a view selection module to realize an efficient view-level selection process through the view topology relations in the view graph structure. Additionally, a cross-view sample graph structure is constructed at the sample level, with the sample topological relations in the cross-view sample graph structure being employed to generate reliable sample …
Poster
Jiyuan Liu · Xinwang Liu · chuankun Li · Xinhang Wan · Hao Tan · Yi Zhang · Weixuan Liang · Qian Qu · Yu Feng · Renxiang Guan · KE LIANG
[ ExHall D ]
Abstract
Multi-view clustering is a long-standing hot topic in machine learning communities, due to its capability of integrating data information from multiple sources and modalities. By utilizing the tensor Singular Value Decomposition (t-SVD) technique with the tensor rotation trick, recent advances have achieved remarkable improvements on clustering performance. However, we find this is attributed to the inadvertent use of sequential information of sorted data samples, i.e. inadvertent label use, which violates the unsupervised learning setting. On the other hand, existing large-scale approaches are mostly developed on the basis of matrix factorization or anchor techniques, and thereby fail to consider the similarities among all data samples, preventing further performance improvement. To address the above issues, we first analyze the tensor rotation trick and recommend removing it from tensor clustering. On this basis, a novel large-scale multi-view tensor clustering method is developed by incorporating the pair-wise similarities with an implicit linear kernel function. To solve the resultant optimization problem, we design an efficient algorithm of linear complexity. Moreover, extensive experiments are conducted and the corresponding results well support the aforementioned finding and validate the effectiveness and efficiency of the proposed method.
Poster
JungKyoo Shin · Bumsoo Kim · Eunwoo Kim
[ ExHall D ]
Abstract
Multi-modal understanding plays a crucial role in artificial intelligence by enabling models to jointly interpret inputs from different modalities. However, conventional approaches such as contrastive learning often struggle with modality discrepancies, leading to potential misalignments. In this paper, we propose a novel class anchor alignment approach that leverages class probability distributions for multi-modal representation learning. Our method, Class-anchor-ALigned generative Modeling (CALM), encodes class anchors as prompts to generate and align class probability distributions for each modality, enabling more flexible alignment. Furthermore, we introduce a cross-modal probabilistic variational autoencoder to model uncertainty in the alignment, enhancing the ability to capture deeper relationships between modalities and data variations. Extensive experiments on four benchmark datasets demonstrate that our approach significantly outperforms state-of-the-art methods, especially in out-of-domain evaluations. This highlights its superior generalization capabilities in multi-modal representation learning.
Poster
Siyuan Duan · Yuan Sun · Dezhong Peng · Zheng Liu · Xiaomin Song · Peng Hu
[ ExHall D ]
Abstract
Cross-modal retrieval aims to match related samples across distinct modalities, facilitating the retrieval and discovery of heterogeneous information. Although existing methods show promising performance, most are deterministic models and are unable to capture the uncertainty inherent in the retrieval outputs, leading to potentially unreliable results. To address this issue, we propose a novel framework called FUzzy Multimodal lEarning (FUME), which is able to self-estimate epistemic uncertainty, thereby embracing trusted cross-modal retrieval. Specifically, our FUME leverages the Fuzzy Set Theory to view the outputs of the classification network as a set of membership degrees and quantify category credibility by incorporating both possibility and necessity measures. However, directly optimizing the category credibility could mislead the model by over-optimizing the necessity for unmatched categories. To overcome this challenge, we present a novel fuzzy multimodal learning strategy, which utilizes label information to guide necessity optimization in the right direction, thereby indirectly optimizing category credibility and achieving accurate decision uncertainty quantification. Furthermore, we design an uncertainty merging scheme that accounts for decision uncertainties, thus further refining uncertainty estimates and boosting the trustworthiness of retrieval results. Extensive experiments on five benchmark datasets demonstrate that FUME remarkably improves both retrieval performance and reliability, offering a prospective solution …
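A small sketch of possibility/necessity-style scoring over class membership degrees, assuming the textbook definitions (possibility is a class's own membership degree, necessity is one minus the largest membership among the other classes) and a simple average as a credibility proxy; FUME's actual credibility quantification and label-guided optimization are more involved.

```python
import torch

def possibility_necessity(memberships):
    # `memberships`: (batch, classes) membership degrees in [0, 1].
    # Possibility of class c: its own membership degree.
    # Necessity of class c: one minus the largest membership among other classes.
    pos = memberships
    other_max = torch.stack([
        torch.cat([memberships[:, :c], memberships[:, c + 1:]], dim=1).max(dim=1).values
        for c in range(memberships.shape[1])
    ], dim=1)
    nec = 1.0 - other_max
    credibility = 0.5 * (pos + nec)   # simple proxy combining the two measures
    return pos, nec, credibility
```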
Poster
Tal Zeevi · Ravid Shwartz-Ziv · Yann LeCun · Lawrence Staib · John A Onofrey
[ ExHall D ]
Abstract
Accurate uncertainty estimation is crucial for deploying neural networks in risk-sensitive applications such as medical diagnosis. Monte Carlo Dropout is a widely used technique for approximating predictive uncertainty by performing stochastic forward passes with dropout during inference. However, using static dropout rates across all layers and inputs can lead to suboptimal uncertainty estimates, as it fails to adapt to the varying characteristics of individual inputs and network layers. Existing approaches optimize dropout rates during training using labeled data, resulting in fixed inference-time parameters that cannot adjust to new data distributions, compromising uncertainty estimates in Monte Carlo simulations. In this paper, we propose Rate-In, an algorithm that dynamically adjusts dropout rates during inference by quantifying the information loss induced by dropout in each layer's feature maps. By treating dropout as controlled noise injection and leveraging information-theoretic principles, Rate-In adapts dropout rates per layer and per input instance without requiring ground truth labels. By quantifying the functional information loss in feature maps, we adaptively tune dropout rates to maintain perceptual quality across diverse medical imaging tasks and architectural configurations. Our empirical results on synthetic data and real-world medical imaging tasks demonstrate that Rate-In improves calibration and sharpens uncertainty estimates compared to fixed or …
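A toy illustration of inference-time rate selection, assuming the "information loss" is approximated by the relative MSE between a feature map and its dropout-perturbed version and compared against a distortion budget; the criterion, the candidate rates, and the function name are placeholders for Rate-In's information-theoretic measure.

```python
import torch
import torch.nn.functional as F

def tune_dropout_rate(feat, budget=0.05, rates=(0.05, 0.1, 0.2, 0.3, 0.5)):
    # Pick, per input, the largest candidate dropout rate whose induced feature
    # distortion (relative MSE between the original and dropout-perturbed map)
    # stays within the budget; falls back to the smallest rate otherwise.
    chosen = rates[0]
    for p in rates:
        perturbed = F.dropout(feat, p=p, training=True)
        distortion = F.mse_loss(perturbed, feat) / (feat.pow(2).mean() + 1e-12)
        if distortion <= budget:
            chosen = p
    return chosen
```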
Poster
Divya Velayudhan · Abdelfatah Ahmed · Mohamad Alansari · Neha Gour · Abderaouf Behouch · Taimur Hassan · Syed Talal Wasim · Nabil Maalej · Muzammal Naseer · Jürgen Gall · Mohammed Bennamoun · Ernesto Damiani · Naoufel Werghi
[ ExHall D ]
Abstract
Advancements in Computer-Aided Screening (CAS) systems are essential for improving the detection of security threats in X-ray baggage scans. However, current datasets are limited in representing real-world, sophisticated threats and concealment tactics, and existing approaches are constrained by a closed-set paradigm with predefined labels. To address these challenges, we introduce STCray, the first multimodal X-ray baggage security dataset, comprising 46,642 image-caption paired scans across 21 threat categories, generated using an X-ray scanner for airport security. STCray is meticulously developed with our specialized protocol that ensures domain-aware, coherent captions, which yield multi-modal instruction-following data for X-ray baggage security. This allows us to train a domain-aware visual AI assistant named STING-BEE that supports a range of vision-language tasks, including scene comprehension, referring threat localization, visual grounding, and visual question answering (VQA), establishing novel baselines for multi-modal learning in X-ray baggage security. Further, STING-BEE shows state-of-the-art generalization in cross-domain settings. Our code, data, and pre-trained models will be made publicly available.
Poster
Yang Yue · Yulin Wang · Chenxin Tao · Pan Liu · Shiji Song · Gao Huang
[ ExHall D ]
Abstract
Humans can develop internal world models that encode common sense knowledge, telling them how the world works and predicting the consequences of their actions. This concept has emerged as a promising direction for establishing general-purpose machine-learning models in recent preliminary works, e.g., for visual representation learning. In this paper, we present CheXWorld, the first effort towards a self-supervised world model for radiographic images. Specifically, our work develops a unified framework that simultaneously models three aspects of medical knowledge essential for qualified radiologists, including 1) local anatomical structures describing the fine-grained characteristics of local tissues (e.g., architectures, shapes, and textures); 2) global anatomical layouts describing the global organization of the human body (e.g., layouts of organs and skeletons); and 3) domain variations that encourage CheXWorld to model the transitions across different appearance domains of radiographs (e.g., varying clarity, contrast, and exposure caused by collecting radiographs from different hospitals, devices, or patients). Empirically, we design tailored qualitative and quantitative analyses, revealing that CheXWorld successfully captures these three dimensions of medical knowledge. Furthermore, transfer learning experiments across eight medical image classification and segmentation benchmarks showcase that CheXWorld significantly outperforms existing SSL methods and large-scale medical foundation models. Code & pre-trained models will be …
Poster
Jianwei Zhao · XIN LI · Fan Yang · Qiang Zhai · Ao Luo · Yang Zhao · Hong Cheng · Huazhu Fu
[ ExHall D ]
Abstract
Whole Slide Image (WSI) classification poses unique challenges due to the vast image size and numerous non-informative regions, which introduce noise and cause data imbalance during feature aggregation. To address these issues, we propose MExD, an Expert-Infused Diffusion Model that combines the strengths of a Mixture-of-Experts (MoE) mechanism with a diffusion model for enhanced classification. MExD balances patch feature distribution through a novel MoE-based aggregator that selectively emphasizes relevant information, effectively filtering noise, addressing data imbalance, and extracting essential features. These features are then integrated via a diffusion-based generative process to directly yield the class distribution for the WSI. Moving beyond conventional discriminative approaches, MExD represents the first generative strategy in WSI classification, capturing fine-grained details for robust and precise results. Our MExD is validated on three widely-used benchmarks—Camelyon16, TCGA-NSCLC, and BRACS—consistently achieving state-of-the-art performance in both binary and multi-class tasks. The model and code will be made publicly available upon acceptance.
Poster
Xianrui Li · Yufei Cui · Jun Li · Antoni B. Chan
[ ExHall D ]
Abstract
Advances in medical imaging and deep learning have propelled progress in whole slide image (WSI) analysis, with multiple instance learning (MIL) showing promise for efficient and accurate diagnostics. However, conventional MIL models often lack adaptability to evolving datasets, as they rely on static training that cannot incorporate new information without extensive retraining. Applying continual learning (CL) to MIL models is a possible solution, but often sees limited improvements. In this paper, we analyze CL in the context of attention MIL models and find that forgetting is mainly concentrated in the attention layers of the MIL model. Using the results of this analysis we propose two components for improving CL on MIL: Attention Knowledge Distillation (AKD) and the Pseudo-Bag Memory Pool (PMP). AKD mitigates catastrophic forgetting by focusing on retaining attention layer knowledge between learning sessions, while PMP reduces the memory footprint by selectively storing only the most informative patches, or "pseudo-bags", from WSIs. Experimental evaluations demonstrate that our method significantly improves both accuracy and memory efficiency on diverse WSI datasets, outperforming current state-of-the-art CL methods. This work provides a foundation for CL in large-scale, weakly annotated clinical datasets, paving the way for more adaptable and resilient diagnostic models.
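A minimal sketch of distilling attention between learning sessions, assuming per-bag attention logits over patches from the previous and current MIL models and a temperature-scaled KL divergence; the loss form and names are illustrative, not necessarily the paper's exact AKD objective.

```python
import torch
import torch.nn.functional as F

def attention_distillation_loss(curr_attn_logits, prev_attn_logits, tau=1.0):
    # `*_attn_logits`: (num_bags, num_patches) unnormalized attention scores.
    # KL divergence between the previous session's softened attention
    # distribution and the current one, scaled by tau^2 as in standard KD.
    p_prev = F.softmax(prev_attn_logits / tau, dim=-1)
    log_p_curr = F.log_softmax(curr_attn_logits / tau, dim=-1)
    return F.kl_div(log_p_curr, p_prev, reduction="batchmean") * tau * tau
```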
Poster
Hang Shi · Chi Changxi · Peng Wan · Daoqiang Zhang · WEI SHAO
[ ExHall D ]
Abstract
The rapid development of spatial transcriptomics (ST) allows researchers to measure spatial-level gene expression in tissues. Although powerful, collecting ST data is expensive, and thus several studies aim to predict gene expression in ST by utilizing the corresponding H/E-stained pathology images. Existing ST-based gene expression prediction models either adopt pre-trained networks or rely on handcrafted features to describe the pathology images, and still lack a systematic way to combine the two into a spot-level representation that can reflect the topological profiles of different spots. On the other hand, all ST-based gene prediction models treat the prediction task for each gene independently, overlooking the fact that exploring potential interrelationships among genes can help improve the prediction performance for individual genes. To address the above issues, we propose a multi-modal topology-embedded graph learning algorithm guided by prior Gene Ontology similarity information (i.e., M2TGLGO) to predict spatially resolved genes from pathology images. Specifically, M2TGLGO co-learns the image representation of different spots from both deep and handcrafted features by considering the within-modal and inter-modal interactions. Next, to keep the topological structure among different spots, a spatial-oriented ranking module …
Poster
Marcus Nordström · Atsuto Maki · Henrik Hult
[ ExHall D ]
Abstract
In image segmentation, and specifically in medical image segmentation, the soft-Dice loss is often chosen instead of the more traditional cross-entropy loss to improve performance with respect to the Dice metric. Experimental work supporting this claim exists, but how and why the two loss functions lead to different predictions is not well understood. This paper explains the observed discrepancy as a consequence of how those loss functions are affected by label noise and what threshold is used to convert the predicted soft segmentation to predicted labels. In particular, it is shown (i) how the optimal solutions to the two loss functions diverge as the noise is increased, and (ii) how the optimal solutions to soft-Dice can be recovered by thresholding the solutions to cross-entropy with an a priori unknown but efficiently computable threshold. The theoretical results are supported by numerical experiments and it is concluded that cross-entropy with the alternative threshold yields the stability and informative label probability maps associated with cross-entropy without sacrificing the performance of soft-Dice.
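A small sketch contrasting the soft-Dice loss with the thresholding view, assuming binary targets and a held-out set on which the alternative threshold is picked by a grid search over hard Dice; the grid search is a practical stand-in for the paper's efficiently computable threshold.

```python
import numpy as np

def soft_dice_loss(prob, target, eps=1e-6):
    # Soft-Dice loss between a probability map `prob` and a binary `target`.
    inter = (prob * target).sum()
    return 1.0 - (2.0 * inter + eps) / (prob.sum() + target.sum() + eps)

def best_dice_threshold(prob, target, grid=np.linspace(0.05, 0.95, 19)):
    # Grid-search the threshold on a cross-entropy-trained probability map that
    # maximizes hard Dice on held-out data.
    def hard_dice(pred, tgt, eps=1e-6):
        return (2.0 * (pred * tgt).sum() + eps) / (pred.sum() + tgt.sum() + eps)
    scores = [hard_dice((prob >= t).astype(float), target) for t in grid]
    return float(grid[int(np.argmax(scores))])
```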
Poster
Yunhe Gao · Di Liu · Zhuowei Li · Yunsheng Li · Dongdong Chen · Mu Zhou · Dimitris N. Metaxas
[ ExHall D ]
Abstract
Medical image segmentation remains challenging due to the vast diversity of anatomical structures, imaging modalities, and segmentation tasks. While deep learning has made significant advances, current approaches struggle to generalize as they require task-specific training or fine-tuning on unseen classes. We present \textbf{Iris}, a novel In-context Reference Image guided Segmentation framework that enables flexible adaptation to novel tasks through the use of reference examples without fine-tuning. At its core, Iris features a lightweight context task encoding module that distills task-specific information from reference context image-label pairs. This rich context embedding information is used to guide the segmentation of target objects. Given its decoupled architecture for 3D data processing, Iris supports diverse inference strategies including one-shot inference, context example ensemble, object-level context example retrieval, and in-context tuning. Through comprehensive evaluation across twelve datasets, we demonstrate that Iris performs strongly compared to specialized supervised models on in-distribution tasks. On seven held-out datasets, Iris shows superior generalization to out-of-distribution data and unseen classes. Further, Iris's task encoding module can automatically discover anatomical relationships across datasets and modalities, offering insights into cross-modality medical objects without explicit anatomical supervision.
Poster
Junlong Cheng · Bin Fu · Jin Ye · Guoan Wang · Tianbin Li · Haoyu Wang · Ruoyu Li · He Yao · Chen Junren · Jingwen Li · Yanzhou Su · Min Zhu · Junjun He
[ ExHall D ]
Abstract
Interactive Medical Image Segmentation (IMIS) has long been constrained by the limited availability of large-scale, diverse, and densely annotated datasets, which hinders model generalization and consistent evaluation across different models. In this paper, we introduce the IMed-361M benchmark dataset, a significant advancement in general IMIS research. First, we collect and standardize over 6.4 million medical images and their corresponding ground truth masks from multiple data sources. Then, leveraging the strong object recognition capabilities of a vision foundational model, we automatically generated dense interactive masks for each image and ensured their quality through rigorous quality control and granularity management. Unlike previous datasets, which are limited by specific modalities or sparse annotations, IMed-361M spans 14 modalities and 204 segmentation targets, totaling 361 million masks—an average of 56 masks per image. Finally, we developed an IMIS baseline network on this dataset that supports high-quality mask generation through interactive inputs, including clicks, bounding boxes, text prompts, and their combinations. We evaluate its performance on medical image segmentation tasks from multiple perspectives, demonstrating superior accuracy and scalability compared to existing interactive segmentation models. To facilitate research on foundational models in medical computer vision, we release the IMed-361M and model at https://anonymous.4open.science/r/IMIS-Bench-FF8B.
Poster
Yanfeng Zhou · Lingrui Li · Le Lu · Minfeng Xu
[ ExHall D ]
Abstract
Semantic segmentation is a crucial prerequisite in clinical applications and computer-aided diagnosis. With the development of deep neural networks, biomedical image segmentation has achieved remarkable success. Encoder-decoder architectures that integrate convolutions and transformers are gaining attention for their potential to capture both global and local features. However, current designs face the contradiction that these two features cannot be continuously transmitted. In addition, some models lack a unified and standardized evaluation benchmark, leading to significant discrepancies in the experimental setup. In this study, we review and summarize these architectures and analyze their contradictions in design. We modify UNet and propose WNet to combine transformers and convolutions, addressing the transmission issue effectively. WNet captures long-range dependencies and local details simultaneously while ensuring their continuous transmission and multi-scale fusion. We integrate WNet into the nnUNet framework for unified benchmarking. Our model achieves state-of-the-art performance in biomedical image segmentation. Extensive experiments demonstrate their effectiveness on four 2D datasets (DRIVE, ISIC-2017, Kvasir-SEG, and CREMI) and four 3D datasets (Parse2022, AMOS22, BTCV, and ImageCAS). The code is available at https://github.com/XXX.
Poster
Yufan He · Pengfei Guo · Yucheng Tang · Andriy Myronenko · Vishwesh Nath · Ziyue Xu · Dong Yang · Can Zhao · Benjamin D. Simon · Mason Belue · Stephanie Anne Harmon · Baris Turkbey · Daguang Xu · Wenqi Li
[ ExHall D ]
Abstract
Foundation models for interactive segmentation in 2D natural images and videos have sparked significant interest in building 3D foundation models for medical imaging. However, the domain gaps and clinical use cases for 3D medical imaging require a dedicated model that diverges from existing 2D solutions. Specifically, such foundation models should support a full workflow that can actually reduce human effort. Treating 3D medical images as sequences of 2D slices and reusing interactive 2D foundation models seems straightforward, but 2D annotation is too time-consuming for 3D tasks. Moreover, for large cohort analysis, it is the highly accurate automatic segmentation models that reduce the most human effort. However, these models lack support for interactive corrections and lack zero-shot ability for novel structures, which is a key feature of "foundation". While reusing pre-trained 2D backbones in 3D enhances zero-shot potential, their performance on complex 3D structures still lags behind leading 3D models. To address these issues, we present VISTA3D, Versatile Imaging SegmenTation and Annotation model, which aims to solve all these challenges and requirements with one unified foundation model. VISTA3D is built on top of the well-established 3D segmentation pipeline, and it is the first model to achieve state-of-the-art performance in both 3D automatic (supporting …
Poster
Bastian Wittmann · Yannick Wattenberg · Tamaz Amiranashvili · Suprosanna Shit · Bjoern Menze
[ ExHall D ]
Abstract
Segmenting 3D blood vessels is a critical yet challenging task in medical image analysis. This is due to significant imaging modality-specific variations in artifacts, vascular patterns and scales, signal-to-noise ratios, and background tissues. These variations, along with domain gaps arising from varying imaging protocols, limit the generalization of existing supervised learning-based methods, requiring tedious voxel-level annotations for each dataset separately. While foundation models promise to alleviate this limitation, they typically fail to generalize to the task of blood vessel segmentation, posing a unique, complex problem. In this work, we present vesselFM, a foundation model designed specifically for the broad task of 3D blood vessel segmentation. Unlike previous models, vesselFM can effortlessly generalize to unseen domains. To achieve zero-shot generalization, we train vesselFM on three heterogeneous data sources: a large, curated annotated dataset, data generated by a domain randomization scheme, and data sampled from a flow matching-based generative model. Extensive evaluations show that vesselFM outperforms state-of-the-art medical image segmentation foundation models across four (pre-)clinically relevant imaging modalities in zero-, one-, and few-shot scenarios, therefore providing a universal solution for 3D blood vessel segmentation.
Poster
zhuangzhuang chen · hualiang wang · Chubin Ou · Xiaomeng Li
[ ExHall D ]
Abstract
Optical coherence tomography angiography (OCTA) shows its great importance in imaging microvascular networks by providing accurate 3D imaging of blood vessels, but it relies upon specialized sensors and expensive devices. For this reason, previous works show the potential to translate the readily available 3D Optical Coherence Tomography (OCT) images into 3D OCTA images. However, existing OCTA translation methods directly learn the mapping from the OCT domain to the OCTA domain in continuous and infinite space with guidance from only a single view, i.e., the OCTA projection map, resulting in suboptimal reconstruction results. To this end, we propose the multi-view Tri-alignment framework for OCT to OCTA 3D image translation in discrete and finite space, named \emph{MuTri}. In the first stage, we pre-train two vector-quantized variational auto-encoders (VQVAE) via the reconstruction of 3D OCT and 3D OCTA data, providing semantic priors for subsequent multi-view guidance. In the second stage, our multi-view tri-alignment facilitates another VQVAE model to learn the mapping from the OCT domain to the OCTA domain in discrete and finite space. Specifically, a contrastive-inspired semantic alignment is proposed to maximize the mutual information with the pre-trained models from OCT and OCTA views, to facilitate codebook learning. Meanwhile, a vessel structure …