

Timezone: America/Chicago

Registration Desk: Registration / Badge Pickup Fri 13 Jun 07:00 a.m.  


Remarks: Welcome & Awards Fri 13 Jun 08:30 a.m.  


Oral Session 1A: Image and Video Synthesis Fri 13 Jun 09:00 a.m.  

Oral
Daniel Geng · Charles Herrmann · Junhwa Hur · Forrester Cole · Serena Zhang · Tobias Pfaff · Tatiana Lopez-Guevara · Yusuf Aytar · Michael Rubinstein · Chen Sun · Oliver Wang · Andrew Owens · Deqing Sun

[ Karl Dean Ballroom ]

Abstract
Motion control is crucial for generating expressive and compelling video content; however, most existing video generation models rely mainly on text prompts for control, which struggle to capture the nuances of dynamic actions and temporal compositions. To this end, we train a video generation model conditioned on spatio-temporally sparse _or_ dense motion trajectories. In contrast to prior motion conditioning work, this flexible representation can encode any number of trajectories, object-specific or global scene motion, and temporally sparse motion; due to its flexibility we refer to this conditioning as _motion prompts_. While users may directly specify sparse trajectories, we also show how to translate high-level user requests into detailed, semi-dense motion prompts, a process we term _motion prompt expansion_. We demonstrate the versatility of our approach through various applications, including camera and object motion control, "interacting" with an image, motion transfer, and image editing. Our results showcase emergent behaviors, such as realistic physics, suggesting the potential of motion prompts for probing video models and interacting with future generative world models. Finally, we evaluate quantitatively, conduct a human study, and demonstrate strong performance.
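Below is a minimal, editor-added sketch (not the authors' implementation) of how spatio-temporally sparse point trajectories could be rasterized into a dense conditioning volume for a video model. The tensor layout, the `[x, y, visible]` track format, and the displacement-plus-mask encoding are assumptions for illustration only.

```python
import torch

def rasterize_motion_prompt(tracks, T, H, W):
    """Rasterize sparse point tracks into a dense conditioning volume.

    tracks: (N, T, 3) tensor of [x, y, visible] per track and frame
            (hypothetical layout; the paper's encoding may differ).
    Returns a (T, 3, H, W) volume holding per-pixel displacement to the
    next frame plus a visibility mask, zeros elsewhere.
    """
    cond = torch.zeros(T, 3, H, W)
    for t in range(T - 1):
        for n in range(tracks.shape[0]):
            x, y, vis = tracks[n, t]
            if vis < 0.5:
                continue
            xi, yi = int(x.clamp(0, W - 1)), int(y.clamp(0, H - 1))
            cond[t, 0, yi, xi] = tracks[n, t + 1, 0] - x   # dx to next frame
            cond[t, 1, yi, xi] = tracks[n, t + 1, 1] - y   # dy to next frame
            cond[t, 2, yi, xi] = 1.0                       # a trajectory passes here
    return cond

# Example: two random tracks over 8 frames on a 64x64 grid.
tracks = torch.rand(2, 8, 3) * torch.tensor([64.0, 64.0, 1.0])
cond = rasterize_motion_prompt(tracks, T=8, H=64, W=64)
print(cond.shape)  # torch.Size([8, 3, 64, 64])
```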
Oral
Ryan Burgert · Yuancheng Xu · Wenqi Xian · Oliver Pilarski · Pascal Clausen · Mingming He · Li Ma · Yitong Deng · Lingxiao Li · Mohsen Mousavi · Michael Ryoo · Paul Debevec · Ning Yu

[ Karl Dean Ballroom ]

Abstract
Generative modeling aims to transform chaotic noise into structured outputs that align with training data distributions. In this work, we enhance video diffusion generative models by introducing motion control as a structured component within latent space sampling. Specifically, we propose a novel real-time noise warping method that replaces random temporal Gaussianity with correlated warped noise derived from optical flow fields, enabling fine-grained motion control independent of model architecture and guidance type. We fine-tune modern video diffusion base models and provide a unified paradigm for a wide range of user-friendly motion control: local object motion control, global camera movement control, and motion transfer. By leveraging a real-time noise-warping algorithm that preserves spatial Gaussianity while efficiently maintaining temporal consistency, we enable flexible and diverse motion control applications with minimal trade-offs in pixel quality and temporal coherence. Extensive experiments and user studies demonstrate the advantages of our method in terms of visual quality, motion controllability, and temporal consistency, making it a robust and scalable solution for motion-controllable video synthesis.
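The following editor-added sketch only illustrates the general idea of deriving temporally correlated noise from optical flow. It uses naive bilinear warping followed by re-standardization, which does not preserve exact spatial Gaussianity the way the paper's real-time noise-warping algorithm does; the flow convention and shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def warp_noise_with_flow(noise, flow):
    """Warp a noise field backward along an optical-flow field.

    noise: (1, C, H, W) Gaussian noise for the previous frame.
    flow:  (1, 2, H, W) flow from the current frame to the previous one,
           in pixels (assumed convention).
    """
    _, _, H, W = noise.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).float()           # (H, W, 2), pixel coords
    grid = grid + flow[0].permute(1, 2, 0)                  # follow the flow
    # normalize coordinates to [-1, 1] for grid_sample
    grid[..., 0] = 2 * grid[..., 0] / (W - 1) - 1
    grid[..., 1] = 2 * grid[..., 1] / (H - 1) - 1
    warped = F.grid_sample(noise, grid.unsqueeze(0), align_corners=True)
    # bilinear sampling shrinks variance, so crudely re-standardize per channel
    warped = (warped - warped.mean(dim=(2, 3), keepdim=True)) / (
        warped.std(dim=(2, 3), keepdim=True) + 1e-6)
    return warped

noise = torch.randn(1, 4, 64, 64)                 # latent-space noise
flow = torch.zeros(1, 2, 64, 64); flow[:, 0] = 1.5  # shift right by 1.5 px
print(warp_noise_with_flow(noise, flow).shape)
```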
Oral
Pascal Chang · Sergio Sancho · Jingwei Tang · Markus Gross · Vinicius C. Azevedo

[ Karl Dean Ballroom ]

Abstract
Anamorphosis refers to a category of images that are intentionally distorted, making them unrecognizable when viewed directly. Their true form only reveals itself when seen from a specific viewpoint, which can be achieved through a catadioptric device such as a mirror or a lens. While the construction of these optical devices can be traced back to as early as the 17th century, the resulting images are only interpretable when viewed from a specific vantage point and tend to lose meaning when seen normally. In this paper, we revisit these famous optical illusions with a generative twist. With the help of latent rectified flow models, we propose a method to create anamorphic images that still retain a valid interpretation when viewed directly. To this end, we introduce _Laplacian Pyramid Warping_, a frequency-aware image warping technique key to generating high-quality visuals. Our work extends Visual Anagrams [Geng et al. 2024] to latent space models and to a wider range of spatial transforms, enabling the creation of novel generative perceptual illusions.
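An editor-added sketch of the general frequency-aware warping idea: decompose an image into a Laplacian pyramid, apply the spatial transform to each band at its own resolution, and recombine, rather than warping the full-resolution image once. The number of levels, the interpolation mode, and how this plugs into a latent flow model are assumptions, not the paper's procedure.

```python
import torch
import torch.nn.functional as F

def build_laplacian_pyramid(img, levels=4):
    """img: (1, C, H, W). Returns band-pass levels plus the low-pass residual."""
    pyr, cur = [], img
    for _ in range(levels - 1):
        down = F.avg_pool2d(cur, 2)
        up = F.interpolate(down, size=cur.shape[-2:], mode="bilinear", align_corners=False)
        pyr.append(cur - up)   # high-frequency band at this scale
        cur = down
    pyr.append(cur)            # low-pass residual
    return pyr

def laplacian_pyramid_warp(img, warp_fn, levels=4):
    """Apply a resolution-agnostic warp per pyramid level, then collapse."""
    warped = [warp_fn(level) for level in build_laplacian_pyramid(img, levels)]
    out = warped[-1]
    for band in reversed(warped[:-1]):
        out = F.interpolate(out, size=band.shape[-2:], mode="bilinear", align_corners=False) + band
    return out

# Example warp: a horizontal flip applied at every level.
img = torch.rand(1, 3, 128, 128)
flipped = laplacian_pyramid_warp(img, lambda x: torch.flip(x, dims=[-1]))
print(flipped.shape)  # torch.Size([1, 3, 128, 128])
```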
Oral
Yifan Zhou · Zeqi Xiao · Shuai Yang · Xingang Pan

[ Karl Dean Ballroom ]

Abstract
Latent Diffusion Models (LDMs) are known to have an unstable generation process, where even small perturbations or shifts in the input noise can lead to significantly different outputs. This hinders their applicability in applications requiring consistent results. In this work, we redesign LDMs to enhance consistency by making them shift-equivariant. While introducing anti-aliasing operations can partially improve shift-equivariance, significant aliasing and inconsistency persist due to the unique challenges in LDMs, including 1) aliasing amplification during VAE training and multiple U-Net inferences, and 2) self-attention modules that inherently lack shift-equivariance. To address these issues, we redesign the attention modules to be shift-equivariant and propose an equivariance loss that effectively suppresses the frequency bandwidth of the features in the continuous domain. The resulting alias-free LDM (AF-LDM) achieves strong shift-equivariance and is also robust to irregular warping. Extensive experiments demonstrate that AF-LDM produces significantly more consistent results than vanilla LDM across various applications, including video editing and image-to-image translation.
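An editor-added sketch of how the shift-equivariance this abstract targets can be measured: compare the model applied to a shifted input against the shifted output of the model. The toy convolution standing in for an LDM and the shift amounts are placeholders.

```python
import torch

def shift_equivariance_error(model, x, dx=8, dy=0):
    """Mean absolute difference between model(shift(x)) and shift(model(x)).

    A perfectly shift-equivariant model yields zero (up to boundary effects).
    `model` is any image-to-image network; here a toy conv stands in for an LDM.
    """
    shifted_in = torch.roll(x, shifts=(dy, dx), dims=(-2, -1))
    out_of_shifted = model(shifted_in)
    shifted_out = torch.roll(model(x), shifts=(dy, dx), dims=(-2, -1))
    return (out_of_shifted - shifted_out).abs().mean().item()

# Toy stand-in: a single conv layer is equivariant to circular shifts
# only when its padding is also circular.
conv = torch.nn.Conv2d(3, 3, 3, padding=1, padding_mode="circular")
x = torch.rand(1, 3, 64, 64)
with torch.no_grad():
    print(shift_equivariance_error(conv, x))   # ~0 for circular padding
```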
Oral
Ziqi Pang · Tianyuan Zhang · Fujun Luan · Yunze Man · Hao Tan · Kai Zhang · William Freeman · Yu-Xiong Wang

[ Karl Dean Ballroom ]

Abstract
We introduce RandAR, a decoder-only visual autoregressive (AR) model capable of generating images in arbitrary token orders. Unlike previous decoder-only AR models that rely on a predefined generation order, RandAR removes this inductive bias, unlocking new capabilities in decoder-only generation. Our essential design enabling random order is to insert a "position instruction token" before each image token to be predicted, representing the spatial location of the next image token. Trained on randomly permuted token sequences -- a more challenging task than fixed-order generation -- RandAR achieves performance comparable to its conventional raster-order counterpart. More importantly, decoder-only transformers trained on random orders acquire new capabilities. To address the efficiency bottleneck of AR models, RandAR adopts parallel decoding with KV-Cache at inference time, enjoying a 2.5x acceleration without sacrificing generation quality. Additionally, RandAR supports inpainting, outpainting, and resolution extrapolation in a zero-shot manner. We hope RandAR inspires new directions for decoder-only visual generation models and broadens their applications across diverse scenarios.
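An editor-added sketch of the sequence construction the abstract describes: a position instruction token is interleaved before each image token under a random permutation. The separate position vocabulary offset and token shapes are assumptions for illustration.

```python
import torch

def build_randar_sequence(image_tokens, num_positions):
    """Interleave position-instruction tokens with image tokens in a random order.

    image_tokens: (L,) discrete token ids in raster order (L = num_positions).
    Returns a (2L,) sequence [pos_{p1}, tok_{p1}, pos_{p2}, tok_{p2}, ...],
    where position ids are offset into a separate vocabulary range (assumed).
    """
    perm = torch.randperm(num_positions)
    pos_vocab_offset = 100_000            # hypothetical offset separating the vocabularies
    seq = torch.empty(2 * num_positions, dtype=torch.long)
    seq[0::2] = perm + pos_vocab_offset   # "where the next image token goes"
    seq[1::2] = image_tokens[perm]        # the image token at that location
    return seq, perm

image_tokens = torch.randint(0, 16384, (256,))   # e.g. a 16x16 grid of VQ tokens
seq, perm = build_randar_sequence(image_tokens, 256)
print(seq.shape, seq[:4])
```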

Oral Session 1B: Interpretability and Evaluation Fri 13 Jun 09:00 a.m.  

Oral
Pengfei Zhou · Xiaopeng Peng · Jiajun Song · Chuanhao Li · Zhaopan Xu · Yue Yang · Ziyao Guo · Hao Zhang · Yuqi Lin · Yefei He · Lirui Zhao · Shuo Liu · Tianhua Li · Yuxuan Xie · Xiaojun Chang · Yu Qiao · Wenqi Shao · Kaipeng Zhang

[ ExHall A2 ]

Abstract
Multimodal Large Language Models (MLLMs) have made significant strides in visual understanding and generation tasks. However, generating interleaved image-text content remains a challenge, which requires integrated multimodal understanding and generation abilities. While the progress in unified models offers new solutions, existing benchmarks are insufficient for evaluating these methods due to data size and diversity limitations. To bridge this gap, we introduce OpenING, a comprehensive benchmark comprising 5,400 high-quality human-annotated instances across 56 real-world tasks. OpenING covers diverse daily scenarios such as travel guides, design, and brainstorming, offering a robust platform for challenging interleaved generation methods. In addition, we present IntJudge, a judge model for evaluating open-ended multimodal generation methods. Trained with a novel data pipeline, our IntJudge achieves an agreement rate of 82.42% with human judgments, outperforming GPT-based evaluators by 11.34%. Extensive experiments on OpenING reveal that current interleaved generation methods still have substantial room for improvement. Key findings on interleaved image-text generation are further presented to guide the development of next-generation models. The benchmark, code, and judge models will be released.
Oral
Faridoun Mehri · Mahdieh Baghshah · Mohammad Taher Pilehvar

[ ExHall A2 ]

Abstract
Why do gradient-based explanations struggle with Transformers, and how can we improve them? We identify gradient flow imbalances in Transformers that violate FullGrad-completeness, a critical property for attribution faithfulness that CNNs naturally possess. To address this issue, we introduce LibraGrad—a theoretically grounded post-hoc approach that corrects gradient imbalances through pruning and scaling of backward paths, without changing the forward pass or adding computational overhead. We evaluate LibraGrad using three metric families: Faithfulness, which quantifies prediction changes under perturbations of the most and least relevant features; Completeness Error, which measures attribution conservation relative to model outputs; and Segmentation AP, which assesses alignment with human perception. Extensive experiments across 8 architectures, 4 model sizes, and 4 datasets show that LibraGrad universally enhances gradient-based methods, outperforming existing white-box methods—including Transformer-specific approaches—across all metrics. We demonstrate superior qualitative results through two complementary evaluations: precise text-prompted region highlighting on CLIP models and accurate class discrimination between co-occurring animals on ImageNet-finetuned models—two settings on which existing methods often struggle. LibraGrad is effective even on the attention-free MLP-Mixer architecture, indicating potential for extension to other modern architectures. Our code is freely available.
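An editor-added sketch of the "Completeness Error" family of metrics the abstract evaluates: how far attributions are from conserving the change in model output (shown here with plain gradient-times-input on a toy linear model). LibraGrad's actual pruning and scaling of backward paths is not reproduced.

```python
import torch

def completeness_error(model, x, target_class):
    """|sum of attributions - (f(x) - f(0))| for gradient x input attributions.

    Completeness holds exactly for a linear model; the gap quantifies how far a
    given attribution method is from conserving the model's output.
    """
    x = x.clone().requires_grad_(True)
    out = model(x)[0, target_class]
    grad = torch.autograd.grad(out, x)[0]
    attribution = (grad * x).sum()
    baseline_out = model(torch.zeros_like(x))[0, target_class]
    return (attribution - (out.detach() - baseline_out.detach())).abs().item()

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x = torch.rand(1, 3, 32, 32)
print(completeness_error(model, x, target_class=0))   # ~0 for this linear model
```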
Oral
Damien Teney · Liangze Jiang · Florin Gogianu · Ehsan Abbasnejad

[ ExHall A2 ]

Abstract
Common choices of architecture give neural networks a preference for fitting data with simple functions. This simplicity bias is known to be key to their success. This paper explores the limits of this assumption. Building on recent work that showed that activation functions are the origin of the simplicity bias (Teney, 2024), we introduce a method to meta-learn activation functions to modulate this bias. **Findings.** We discover multiple tasks where the assumption of simplicity is inadequate, and standard ReLU architectures are therefore suboptimal. In these cases, we find activation functions that perform better by inducing a prior of higher complexity. Interestingly, these cases correspond to domains where neural networks have historically struggled: tabular data, regression tasks, cases of shortcut learning, and algorithmic grokking tasks. In comparison, the simplicity bias proves adequate on image tasks, where learned activations are nearly identical to ReLUs and GeLUs. **Implications.** (1) Contrary to common belief, the simplicity bias is not universally useful. There exist real tasks where it is suboptimal. (2) The suitability of ReLU models for image classification is not accidental. (3) The success of ML ultimately depends on the adequacy between data and architectures, and there may be benefits for architectures tailored to specific distributions of …
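An editor-added sketch of a parameterized activation whose shape can be tuned by learnable weights, which is the kind of object an outer meta-learning loop (not shown) would optimize. The particular basis of functions is an assumption, not the paper's parameterization.

```python
import torch
import torch.nn as nn

class LearnableActivation(nn.Module):
    """Activation expressed as a learnable mix of simple basis functions.

    With weights [1, 0, 0, 0] it reduces to ReLU; other weights induce
    different (e.g. higher-complexity) priors.
    """
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.tensor([1.0, 0.0, 0.0, 0.0]))

    def forward(self, x):
        basis = torch.stack([torch.relu(x), torch.tanh(x), torch.sin(x), x], dim=0)
        return torch.einsum("k,k...->...", self.w, basis)

net = nn.Sequential(nn.Linear(8, 32), LearnableActivation(), nn.Linear(32, 1))
x = torch.rand(4, 8)
print(net(x).shape)   # torch.Size([4, 1])
```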
Oral
Matt Deitke · Christopher Clark · Sangho Lee · Rohun Tripathi · Yue Yang · Jae Sung Park · Reza Salehi · Niklas Muennighoff · Kyle Lo · Luca Soldaini · Jiasen Lu · Taira Anderson · Erin Bransom · Kiana Ehsani · Huong Ngo · Yen-Sung Chen · Ajay Patel · Mark Yatskar · Chris Callison-Burch · Andrew Head · Rose Hendrix · Favyen Bastani · Eli VanderBilt · Nathan Lambert · Yvonne Chou · Arnavi Chheda-Kothary · Jenna Sparks · Sam Skjonsberg · Michael Schmitz · Aaron Sarnat · Byron Bischoff · Pete Walsh · Christopher Newell · Piper Wolters · Tanmay Gupta · Kuo-Hao Zeng · Jon Borchardt · Dirk Groeneveld · Crystal Nam · Sophie Lebrecht · Caitlin Wittlif · Carissa Schoenick · Oscar Michel · Ranjay Krishna · Luca Weihs · Noah A. Smith · Hannaneh Hajishirzi · Ross Girshick · Ali Farhadi · Aniruddha Kembhavi

[ ExHall A2 ]

Abstract
Today's most advanced vision-language models (VLMs) remain proprietary. The strongest open-weight models rely heavily on synthetic data from proprietary VLMs to achieve good performance, effectively distilling these closed VLMs into open ones. As a result, the community has been missing foundational knowledge about how to build performant VLMs from scratch. We present **Molmo**, a new family of VLMs that are state-of-the-art in their class of openness. Our key contribution is a collection of new datasets, including a dataset of highly detailed image captions for pre-training called **PixMo**, a free-form image Q&A dataset for fine-tuning, and an innovative 2D pointing dataset, all collected without the use of external VLMs. The success of our approach relies on careful modeling choices, a well-tuned training pipeline, and, most critically, the quality of our newly collected datasets. Our best-in-class 72B model not only outperforms others in the class of open weight and data models, but also outperforms larger proprietary models including Claude 3.5 Sonnet, and Gemini 1.5 Pro and Flash, second only to GPT-4o based on both academic benchmarks and on a large human evaluation. Our model weights, new datasets, and source code will all be released.
Oral
Xiao Guo · Xiufeng Song · Yue Zhang · Xiaohong Liu · Xiaoming Liu

[ ExHall A2 ]

Abstract
Deepfake detection is a long-established research topic crucial for combating the spread of malicious misinformation. Unlike previous methods that provide either binary classification results or textual explanations for deepfake detection, we propose a novel method that delivers both simultaneously. Our method harnesses the multi-modal learning power of the pre-trained CLIP and the unprecedented interpretability of large language models (LLMs) to enhance both the generalization and interpretability of deepfake detection. Specifically, we introduce a multi-modal face forgery detector (M2F2-Det) that employs specially designed face forgery prompt learning, integrating zero-shot learning capabilities of the pre-trained CLIP to improve generalization to unseen forgeries. Also, M2F2-Det incorporates the LLM to provide detailed explanations for detection decisions, offering strong interpretability by bridging the gap between natural language and the subtle nuances of facial forgery detection. Empirically, we evaluate M2F2-Det for both detection and sentence generation tasks, on both of which M2F2-Det achieves state-of-the-art performance, showing its effectiveness in detecting and explaining diverse and unseen forgeries. Code and models will be released upon publication.

Oral Session 1C: Image Processing and Deep Architectures Fri 13 Jun 09:00 a.m.  

Oral
Nick Stracke · Stefan Andreas Baumann · Kolja Bauer · Frank Fundel · Björn Ommer

[ Davidson Ballroom ]

Abstract
Internal features from large-scale pre-trained diffusion models have recently been established as powerful semantic descriptors for a wide range of downstream tasks. Works that use these features generally need to add noise to images before passing them through the model to obtain the semantic features, as the models do not offer the most useful features when given images with little to no noise. We show that this noise has a critical impact on the usefulness of these features that cannot be remedied by ensembling with different random noises. We address this issue by introducing a lightweight, unsupervised fine-tuning method that enables diffusion backbones to provide high-quality, noise-free semantic features. We show that these features readily outperform previous diffusion features by a wide margin in a wide variety of extraction setups and downstream tasks, offering better performance than even ensemble-based methods at a fraction of the cost.
Oral
Meng Lou · Yizhou Yu

[ Davidson Ballroom ]

Abstract
In the human vision system, top-down attention plays a crucial role in perception, wherein the brain initially performs an overall but rough scene analysis to extract salient cues (i.e., overview first), followed by a finer-grained examination to make more accurate judgments (i.e., look closely next). However, recent efforts in ConvNet designs primarily focused on increasing kernel size to obtain a larger receptive field without considering this crucial biomimetic mechanism to further improve performance. To this end, we propose a novel pure ConvNet vision backbone, termed OverLoCK, which is carefully devised from both the architecture and mixer perspectives. Specifically, we introduce a biomimetic Deep-stage Decomposition Strategy (DDS) that fuses semantically meaningful context representations into middle and deep layers by providing dynamic top-down context guidance at both feature and kernel weight levels. To fully unleash the power of top-down context guidance, we further propose a novel **Cont**ext-**Mix**ing Dynamic Convolution (ContMix) that effectively models long-range dependencies while preserving inherent local inductive biases even when the input resolution increases. These properties are absent in previous convolutions. With the support from both DDS and ContMix, our OverLoCK exhibits notable performance improvement over existing methods. For instance, OverLoCK-T achieves a Top-1 accuracy of 84.2%, significantly surpassing …
Oral
Longyu Yang · Ping Hu · Shangbo Yuan · Lu Zhang · Jun Liu · Heng Tao Shen · Xiaofeng Zhu

[ Davidson Ballroom ]

Abstract
Existing LiDAR semantic segmentation models often suffer from decreased accuracy when exposed to adverse weather conditions. Recent methods addressing this issue focus on enhancing training data through weather simulation or universal augmentation techniques. However, few works have studied the negative impacts caused by the heterogeneous domain shifts in the geometric structure and reflectance intensity of point clouds. In this paper, we delve into this challenge and address it with a novel Geometry-Reflectance Collaboration (GRC) framework that explicitly separates feature extraction for geometry and reflectance. Specifically, GRC employs a dual-branch architecture designed to initially process geometric and reflectance features independently, thereby capitalizing on their distinct characteristics. Then, GRC adopts a robust multi-level feature collaboration module to suppress redundant and unreliable information from both branches. Consequently, without complex simulation or augmentation, our method effectively extracts intrinsic information about the scene while suppressing interference, thus achieving better robustness and generalization in adverse weather conditions. We demonstrate the effectiveness of GRC through comprehensive experiments on challenging benchmarks, showing that our method outperforms previous approaches and establishes new state-of-the-art results.
Oral
Xiaoyi Liu · Hao Tang

[ Davidson Ballroom ]

Abstract
We introduce DiffFNO, a novel framework for arbitrary-scale super-resolution that incorporates a Weighted Fourier Neural Operator (WFNO) enhanced by a diffusion process. DiffFNO's adaptive mode weighting mechanism in the Fourier domain effectively captures critical frequency components, significantly improving the reconstruction of high-frequency image details that are essential for super-resolution tasks. Additionally, we propose a Gated Fusion Mechanism to efficiently integrate features from the WFNO and an attention-based neural operator, enhancing the network's capability to capture both global and local image details. To further improve efficiency, DiffFNO employs a deterministic ODE sampling strategy called the Adaptive Time-step ODE Solver (AT-ODE), which accelerates inference by dynamically adjusting step sizes while preserving output quality. Extensive experiments demonstrate that DiffFNO achieves state-of-the-art results, outperforming existing methods across various scaling factors, including those beyond the training distribution, by a margin of 2–4 dB in PSNR. Our approach sets a new standard in super-resolution, delivering both superior accuracy and computational efficiency.
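An editor-added sketch of an FNO-style spectral convolution with an extra learnable per-mode weight, loosely mirroring the "adaptive mode weighting in the Fourier domain" described above. DiffFNO's actual WFNO, gated fusion, and diffusion components are not reproduced; the layer below is a generic illustration.

```python
import torch
import torch.nn as nn

class WeightedSpectralConv2d(nn.Module):
    """FNO-style spectral convolution with a learnable scalar weight per retained mode."""
    def __init__(self, channels, modes=16):
        super().__init__()
        self.modes = modes
        scale = 1.0 / channels
        self.w = nn.Parameter(scale * torch.randn(channels, channels, modes, modes, dtype=torch.cfloat))
        self.mode_weight = nn.Parameter(torch.ones(modes, modes))  # adaptive emphasis per frequency

    def forward(self, x):                      # x: (B, C, H, W)
        B, C, H, W = x.shape
        x_ft = torch.fft.rfft2(x)              # (B, C, H, W//2+1), complex
        out_ft = torch.zeros_like(x_ft)
        m = self.modes
        block = x_ft[:, :, :m, :m] * self.mode_weight   # re-weight low-frequency modes
        out_ft[:, :, :m, :m] = torch.einsum("bixy,ioxy->boxy", block, self.w)
        return torch.fft.irfft2(out_ft, s=(H, W))

layer = WeightedSpectralConv2d(channels=8, modes=12)
x = torch.rand(2, 8, 48, 48)
print(layer(x).shape)   # torch.Size([2, 8, 48, 48])
```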
Oral
Eric Kee · Adam Pikielny · Kevin Blackburn-Matzen · Marc Levoy

[ Davidson Ballroom ]

Abstract
We describe a system to remove real-world reflections from images for consumer photography. Our system operates on linear (RAW) photos, and accepts an optional contextual photo looking in the opposite direction (e.g., the "selfie" camera on a mobile device). This optional photo helps disambiguate what should be considered the reflection. The system is trained solely on synthetic mixtures of real-world RAW images, which we combine using a reflection simulation that is photometrically and geometrically accurate. Our system comprises a base model that accepts the captured photo and optional context photo as input, and runs at 256p, followed by an up-sampling model that transforms 256p images to full resolution. The system can produce images for review at 1K in 4.5 to 6.5 seconds on a MacBook or iPhone 14 Pro. We test on RAW photos that were captured in the field and embody typical consumer photos, and show that our RAW-image simulation yields SOTA performance.

Poster Session 1 Fri 13 Jun 10:30 a.m.  

Poster
Zhedong Zhang · Liang Li · Chenggang Yan · Chunshan Liu · Anton van den Hengel · Yuankai Qi

[ ExHall D ]

Abstract
Movie dubbing describes the process of transforming a script into speech that aligns temporally and emotionally with a given movie clip while exemplifying the speaker’s voice demonstrated in a short reference audio clip. This task demands the model bridge character performances and complicated prosody structures to build a high-quality video-synchronized dubbing track. The limited scale of movie dubbing datasets, along with the background noise inherent in audio data, hinder the acoustic modeling performance of trained models. To address these issues, we propose an acoustic-prosody disentangled two-stage method to achieve high-quality dubbing generation with precise prosody alignment. First, we propose a prosody-enhanced acoustic pre-training to develop robust acoustic modeling capabilities. Then, we freeze the pre-trained acoustic system and design a disentangled framework to model prosodic text features and dubbing style while maintaining acoustic quality. Additionally, we incorporate an in-domain emotion analysis module to reduce the impact of visual domain shifts across different movies, thereby enhancing emotion-prosody alignment. Extensive experiments show that our method performs favorably against the state-of-the-art models on two primary benchmarks. The demos are available at https://A2P-Dubber.github.io/.
Poster
Hao Li · Ju Dai · Xin Zhao · Feng Zhou · Junjun Pan · Lei Li

[ ExHall D ]

Abstract
In 3D speech-driven facial animation generation, existing methods commonly employ pre-trained self-supervised audio models as encoders. However, due to the prevalence of phonetically similar syllables with distinct lip shapes in language, these near-homophone syllables tend to exhibit significant coupling in self-supervised audio feature spaces, leading to the averaging effect in subsequent lip motion generation. To address this issue, this paper proposes a plug-and-play semantic decorrelation module—Wav2Sem. This module extracts semantic features corresponding to the entire audio sequence, leveraging the added semantic information to decorrelate audio encodings within the feature space, thereby achieving more expressive audio features. Extensive experiments across multiple Speech-driven models indicate that the Wav2Sem module effectively decouples audio features, significantly alleviating the averaging effect of phonetically similar syllables in lip shape generation, thereby enhancing the precision and naturalness of facial animations. The source code will be publicly available.
Poster
Xiaozhong Ji · Xiaobin Hu · Zhihong Xu · Junwei Zhu · Chuming Lin · Qingdong He · Jiangning Zhang · Donghao Luo · Yi Chen · Qin Lin · qinglin lu · Chengjie Wang

[ ExHall D ]

Abstract
The study of talking face generation mainly explores the intricacies of synchronizing facial movements and crafting visually appealing, temporally-coherent animations. However, due to the limited exploration of global audio perception, current approaches predominantly employ auxiliary visual and spatial knowledge to stabilize the movements, which often results in degraded naturalness and temporal inconsistencies. Considering the essence of audio-driven animation, the audio signal serves as the ideal and unique prior to adjust facial expressions and lip movements, without resorting to interference from any visual signals. Based on this motivation, we propose a novel paradigm, dubbed Sonic, to shift focus onto the exploration of global audio perception. To effectively leverage global audio knowledge, we disentangle it into intra- and inter-clip audio perception and collaborate with both aspects to enhance overall perception. For intra-clip audio perception: (1) Context-enhanced audio learning, in which long-range intra-clip temporal audio knowledge is extracted to provide facial expression and lip motion priors implicitly expressed as the tone and speed of speech. (2) Motion-decoupled Controller, in which the motion of the head and expression movement are disentangled and independently controlled by intra-audio clips. Most importantly, for inter-clip audio perception, as a bridge to connect the intra-clips …
Poster
Xuanchen Li · Jianyu Wang · Yuhao Cheng · Yikun Zeng · Xingyu Ren · Wenhan Zhu · Weiming Zhao · Yichao Yan

[ ExHall D ]

Abstract
Significant progress has been made for speech-driven 3D face animation, but most works focus on learning the motion of mesh/geometry, ignoring the impact of dynamic texture. In this work, we reveal that dynamic texture plays a key role in rendering high-fidelity talking avatars, and introduce a high-resolution 4D dataset TexTalk4D, consisting of 100 minutes of audio-synced scan-level meshes with detailed 8K dynamic textures from 100 subjects. Based on the dataset, we explore the inherent correlation between motion and texture, and propose a diffusion-based framework TexTalker to simultaneously generate facial motions and dynamic textures from speech. Furthermore, we propose a novel pivot-based style injection strategy to capture the complexity of different texture and motion styles, which allows disentangled control. TexTalker, as the first method to generate audio-synced facial motion with dynamic texture, not only outperforms prior art in synthesising facial motions, but also produces realistic textures that are consistent with the underlying facial movements.
Poster
Tim Büchner · Christoph Anders · Orlando Guntinas-Lichius · Joachim Denzler

[ ExHall D ]

Abstract
The relationship between muscle activity and resulting facial expressions is crucial for various fields, including psychology, medicine, and entertainment. The synchronous recording of facial mimicry and muscular activity via surface electromyography (sEMG) provides a unique window into these complex dynamics. Unfortunately, existing methods for facial analysis cannot handle electrode occlusion, rendering them ineffective. Even with occlusion-free reference images of the same person, variations in expression intensity and execution are unmatchable. Our electromyography-informed facial expression reconstruction (EIFER) approach is a novel method to restore faces under sEMG occlusion faithfully in an adversarial manner. We decouple facial geometry and visual appearance (e.g., skin texture, lighting, electrodes) by combining a 3D Morphable Model (3DMM) with neural unpaired image-to-image translation via reference recordings. Then, EIFER learns a bidirectional mapping between 3DMM expression parameters and muscle activity, establishing correspondence between the two domains. We validate the effectiveness of our approach through experiments on a dataset of synchronized sEMG recordings and facial mimicry, demonstrating faithful geometry and appearance reconstruction. Further, we synthesize expressions based on muscle activity and show how observed expressions can predict dynamic muscle activity. Consequently, EIFER introduces a new paradigm for facial electromyography, which could be extended to other forms of multi-modal face recordings.
Poster
Mingtao Guo · Guanyu Xing · Yanli Liu

[ ExHall D ]

Abstract
Relightable portrait animation aims to animate a static reference portrait to match the head movements and expressions of a driving video while adapting to user-specified or reference lighting conditions. Existing portrait animation methods fail to achieve relightable portraits because they do not separate and manipulate intrinsic (identity and appearance) and extrinsic (pose and lighting) features. In this paper, we present a Lighting Controllable Video Diffusion model (LCVD) for high-fidelity, relightable portrait animation. We address this limitation by distinguishing these feature types through dedicated subspaces within the feature space of a pre-trained image-to-video diffusion model. Specifically, we employ the 3D mesh, pose, and lighting-rendered shading hints of the portrait to represent the extrinsic attributes, while the reference represents the intrinsic attributes. In the training phase, we employ a reference adapter to map the reference into the intrinsic feature subspace and a shading adapter to map the shading hints into the extrinsic feature subspace. By merging features from these subspaces, the model achieves nuanced control over lighting, pose, and expression in generated animations. Extensive evaluations show that LCVD outperforms state-of-the-art methods in lighting realism, image quality, and video consistency, setting a new benchmark in relightable portrait animation.
Poster
Tuur Stuyck · Gene Wei-Chin Lin · Egor Larionov · Hsiaoyu Chen · Aljaž Božič · Nikolaos Sarafianos · Doug Roble

[ ExHall D ]

Abstract
Realistic hair motion is crucial for high-quality avatars, but it is often limited by the computational resources available for real-time applications. To address this challenge, we propose a novel neural approach to predict physically plausible hair deformations that generalizes to various body poses, shapes, and hair styles. Our model is trained using a self-supervised loss, eliminating the need for expensive data generation and storage. We demonstrate our method's effectiveness through numerous results across a wide range of pose and shape variations, showcasing its robust generalization capabilities and temporally smooth results. Our approach is highly suitable for real-time applications with an inference time of only a few milliseconds on consumer hardware and its ability to scale to predicting 1000 grooms in 0.3 seconds. Code will be released.
Poster
Weiqi Feng · Dong Han · Zekang Zhou · Shunkai Li · Xiaoqiang Liu · Pengfei Wan · Di ZHANG · Miao Wang

[ ExHall D ]

Abstract
Existing radiance field-based head avatar methods have mostly relied on pre-computed explicit priors (e.g., mesh, point) or neural implicit representations, making it challenging to achieve high fidelity with both computational efficiency and low memory consumption. To overcome this, we present GPAvatar, a novel and efficient Gaussian splatting-based method for reconstructing high-fidelity dynamic 3D head avatars from monocular videos. We extend Gaussians in 3D space to a high-dimensional embedding space encompassing Gaussian's spatial position and avatar expression, enabling the representation of the head avatar with arbitrary pose and expression. To enable splatting-based rasterization, a linear transformation is learned to project each high-dimensional Gaussian back to the 3D space, which is sufficient to capture expression variations instead of using complex neural networks. Furthermore, we propose an adaptive densification strategy that dynamically allocates Gaussians to regions with high expression variance, improving the facial detail representation. Experimental results on three datasets show that our method outperforms existing state-of-the-art methods in rendering quality and speed while reducing memory usage in training and rendering.
Poster
Hongrui Cai · Yuting Xiao · Xuan Wang · Jiafei Li · Yudong Guo · Yanbo Fan · Shenghua Gao · Juyong Zhang

[ ExHall D ]

Abstract
We introduce a novel approach to creating ultra-realistic head avatars and rendering them in real time (30 fps at 2048×1334 resolution). First, we propose a hybrid explicit representation that combines the advantages of two primitive-based efficient rendering techniques. A UV-mapped 3D mesh is utilized to capture sharp and rich textures on smooth surfaces, while 3D Gaussian Splatting is employed to represent complex geometric structures. In the pipeline of modeling an avatar, after tracking parametric models based on captured multi-view RGB videos, our goal is to simultaneously optimize the texture and opacity map of the mesh, as well as a set of 3D Gaussian splats localized and rigged onto the mesh facets. Specifically, we perform α-blending on the color and opacity values based on the merged and re-ordered z-buffer from the rasterization results of the mesh and 3DGS. This process involves the mesh and 3DGS adaptively fitting the captured visual information to outline a high-fidelity digital avatar. To avoid artifacts caused by Gaussian splats crossing the mesh facets, we design a stable hybrid depth sorting strategy. Experiments illustrate that our modeled results exceed those of state-of-the-art approaches.
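An editor-added toy illustration of per-pixel compositing over a merged, depth-sorted set of fragments from two renderers (mesh rasterization and Gaussian splatting): sort by depth, then alpha-blend front to back. The real pipeline operates on rasterized z-buffers and is differentiable, none of which is shown; fragment layout and names are assumptions.

```python
import torch

def composite_merged_fragments(depths, colors, alphas):
    """Front-to-back alpha compositing of per-pixel fragments from any number of sources.

    depths: (N,) fragment depths (smaller = closer), colors: (N, 3), alphas: (N,).
    Fragments from the mesh pass and the 3DGS pass are simply concatenated
    before calling this (assumed simplification of the merged, re-ordered z-buffer).
    """
    order = torch.argsort(depths)
    color = torch.zeros(3)
    transmittance = 1.0
    for i in order:
        color = color + transmittance * alphas[i] * colors[i]
        transmittance = transmittance * (1.0 - alphas[i])
    return color

# One opaque mesh fragment behind two translucent Gaussian fragments.
depths = torch.tensor([2.0, 0.8, 1.2])
colors = torch.tensor([[0.9, 0.1, 0.1], [0.1, 0.9, 0.1], [0.1, 0.1, 0.9]])
alphas = torch.tensor([1.0, 0.3, 0.5])
print(composite_merged_fragments(depths, colors, alphas))
```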
Poster
Jack Saunders · Charlie Hewitt · Yanan Jian · Marek Kowalski · Tadas Baltrusaitis · Yiye Chen · Darren Cosker · Virginia Estellers · Nicholas Gydé · Vinay P. Namboodiri · Benjamin E Lundell

[ ExHall D ]

Abstract
Gaussian Splatting has changed the game for real-time photo-realistic rendering. One of the most popular applications of Gaussian Splatting is to create animatable avatars, known as Gaussian Avatars. Recent works have pushed the boundaries of quality and rendering efficiency but suffer from two main limitations. Either they require expensive multi-camera rigs to produce avatars with free-view rendering, or they can be trained with a single camera but only rendered at high quality from this fixed viewpoint. An ideal model would be trained using a short monocular video or image from available hardware, such as a webcam, and rendered from any view. To this end, we propose GASP: Gaussian Avatars with Synthetic Priors. To overcome the limitations of existing datasets, we exploit the pixel-perfect nature of synthetic data to train a Gaussian Avatar prior. By fitting this prior model to a single photo or video and fine-tuning it, we get a high-quality Gaussian Avatar, which supports 360° rendering. Our prior is only required for fitting, not inference, enabling real-time application. Through our method, we obtain high-quality, animatable avatars from limited data which can be animated and rendered at 70 fps on commercial hardware.
Poster
Rong Wang · Fabian Prada · Ziyan Wang · Zhongshi Jiang · Chengxiang Yin · Junxuan Li · Shunsuke Saito · Igor Santesteban · Javier Romero · Rohan Joshi · Hongdong Li · Jason Saragih · Yaser Sheikh

[ ExHall D ]

Abstract
We present a novel method for reconstructing personalized 3D human avatars with realistic animation from only a few images. Due to the large variations in body shapes, poses, and cloth types, existing methods mostly require hours of per-subject optimization during inference, which limits their practical applications. In contrast, we learn a universal prior from over a thousand clothed humans to achieve instant feedforward generation and zero-shot generalization. Specifically, instead of rigging the avatar with shared skinning weights, we jointly infer personalized avatar shape, skinning weights, and pose-dependent deformations, which effectively improves overall geometric fidelity and reduces deformation artifacts. Moreover, to normalize pose variations and resolve coupled ambiguity between canonical shapes and skinning weights, we design a 3D canonicalization process to produce pixel-aligned initial conditions, which helps to reconstruct fine-grained geometric details. We then propose a multi-frame feature aggregation to robustly reduce artifacts introduced in canonicalization and fuse a plausible avatar preserving person-specific identities. Finally, we train the model in an end-to-end framework on a large-scale capture dataset, which contains diverse human subjects paired with high-quality 3D scans. Extensive experiments show that our method generates more authentic reconstruction and animation than state-of-the-arts, and can be directly generalized to inputs from casually …
Poster
Jingyu Zhuang · Di Kang · Linchao Bao · Liang Lin · Guanbin Li

[ ExHall D ]

Abstract
Text-driven avatar generation has gained significant attention owing to its convenience. However, existing methods typically model the human body with all garments as a single 3D model, limiting its usability, such as clothing replacement, and reducing user control over the generation process. To overcome the limitations above, we propose DAGSM, a novel pipeline that generates disentangled human bodies and garments from the given text prompts. Specifically, we model each part (e.g., body, upper/lower clothes) of the clothed human as one GS-enhanced mesh (GSM), which is a traditional mesh attached with 2D Gaussians to better handle complicated textures (e.g., woolen, translucent clothes) and produce realistic cloth animations. During the generation, we first create the unclothed body, followed by a sequence of individual cloth generation based on the body, where we introduce a semantic-based algorithm to achieve better human-cloth and garment-garment separation. To improve texture quality, we propose a view-consistent texture refinement module, including a cross-view attention mechanism for texture style consistency and an incident-angle-weighted denoising (IAW-DE) strategy to update the appearance. Extensive experiments have demonstrated that DAGSM generates high-quality disentangled avatars, supports clothing replacement and realistic animation, and outperforms the baselines in visual quality.
Poster
Zedong Chu · Feng Xiong · Meiduo Liu · Jinzhi Zhang · Mingqi Shao · Zhaoxu Sun · Di Wang · Mu Xu

[ ExHall D ]

Abstract
With the rapid evolution of 3D generation algorithms, the cost of producing 3D humanoid character models has plummeted, yet the field is impeded by the lack of a comprehensive dataset for automatic rigging—a pivotal step in character animation. Addressing this gap, we present HumanRig, the first large-scale dataset specifically designed for 3D humanoid character rigging, encompassing 11,434 meticulously curated T-posed meshes adhering to a uniform skeleton topology. Capitalizing on this dataset, we introduce an innovative, data-driven automatic rigging framework, which overcomes the limitations of GNN-based methods in handling complex AI-generated meshes. Our approach integrates a Prior-Guided Skeleton Estimator (PGSE) module, which uses 2D skeleton joints to provide a preliminary 3D skeleton, and a Mesh-Skeleton Mutual Attention Network (MSMAN) that fuses skeleton features with 3D mesh features extracted by a U-shaped point transformer. This enables a coarse-to-fine 3D skeleton joint regression and a robust skinning estimation, surpassing previous methods in quality and versatility. This work not only remedies the dataset deficiency in rigging research but also propels the animation industry towards more efficient and automated character rigging pipelines.
Poster
Yuanyou Xu · Zongxin Yang · Yi Yang

[ ExHall D ]

Abstract
Controllable generation has achieved substantial progress in both 2D and 3D domains, yet current conditional generation methods still face limitations in describing detailed shape structures. Skeletons can effectively represent and describe object anatomy and pose. Unfortunately, past studies are often limited to human skeletons. In this work, we generalize skeletal conditioned generation to arbitrary structures. First, we design a reliable mesh skeletonization pipeline to generate a large-scale mesh-skeleton paired dataset. Based on the dataset, a multi-view and 3D generation pipeline is built. We propose to represent 3D skeletons by Coordinate Color Encoding as 2D conditional images. A Skeletal Correlation Module is designed to extract global skeletal features for condition injection. After multi-view images are generated, 3D assets can be obtained by incorporating a large reconstruction model, followed by a UV texture refinement stage. As a result, our method achieves instant generation of multi-view and 3D contents which are aligned with given skeletons. The proposed techniques largely improve the object-skeleton alignment and generation quality. Project page at https://skdream3d.github.io/. Dataset, code, and models will be publicly released.
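An editor-added sketch of the simplest reading of "Coordinate Color Encoding": map normalized 3D joint coordinates to RGB so that the same joint keeps the same color in every rendered view. The normalization scheme and how the colors are splatted onto the 2D conditional skeleton image are assumptions.

```python
import torch

def coordinate_color_encoding(joints, bbox_min, bbox_max):
    """Map 3D joint positions to RGB colors by normalizing each axis to [0, 1].

    joints: (J, 3) world-space joint positions.
    Returns (J, 3) colors; the color is viewpoint-independent, so a joint keeps
    the same color across all rendered views.
    """
    return (joints - bbox_min) / (bbox_max - bbox_min + 1e-8)

joints = torch.tensor([[0.0, 1.6, 0.0],    # head
                       [0.0, 1.0, 0.0],    # pelvis
                       [0.3, 0.0, 0.0]])   # right foot
colors = coordinate_color_encoding(joints, joints.min(0).values, joints.max(0).values)
print(colors)
```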
Poster
Xingchao Yang · Takafumi Taketomi · Yuki Endo · Yoshihiro Kanamori

[ ExHall D ]

Abstract
Recovering high-quality 3D facial textures from single-view 2D images is a challenging task, especially under constraints of limited data and complex facial details such as makeup, wrinkles, and occlusions. In this paper, we introduce FreeUV, a novel ground-truth-free UV texture recovery framework that eliminates the need for annotated or synthetic UV data. FreeUV leverages a pre-trained Stable Diffusion model alongside a Cross-Assembly inference strategy to fulfill this objective. In FreeUV, separate networks are trained independently to focus on realistic appearance and structural consistency, and these networks are combined during inference to generate coherent textures. Our approach accurately captures intricate facial features and demonstrates robust performance across diverse poses and occlusions. Extensive experiments validate FreeUV's effectiveness, with results surpassing state-of-the-art methods in both quantitative and qualitative metrics. Additionally, FreeUV enables new applications, including local editing, facial feature interpolation, and multi-view texture recovery. By reducing data requirements, FreeUV offers a scalable solution for generating high-fidelity 3D facial textures suitable for real-world scenarios.
Poster
Gangjian Zhang · Nanjie Yao · Shunsi Zhang · hanfeng Zhao · Guoliang Pang · Jian Shu · Hao Wang

[ ExHall D ]

Abstract
This paper investigates the research task of reconstructing the 3D clothed human body from a monocular image. Due to the inherent ambiguity of single-view input, existing approaches leverage pre-trained SMPL(-X) estimation models or generative models to provide auxiliary information for human reconstruction. However, these methods capture only the general human body geometry and overlook specific geometric details, leading to inaccurate skeleton reconstruction, incorrect joint positions, and unclear cloth wrinkles. In response to these issues, we propose a multi-level geometry learning framework. Technically, we design three key components: skeleton-level enhancement, joint-level augmentation, and wrinkle-level refinement modules. Specifically, we effectively integrate the projected 3D Fourier features into a Gaussian reconstruction model, introduce perturbations to improve joint depth estimation during training, and refine coarse human wrinkles in a manner resembling the denoising process of a diffusion model. Extensive quantitative and qualitative experiments on two out-of-distribution test sets show the superior performance of our approach compared to state-of-the-art (SOTA) methods.
Poster
Zichen Tang · Yuan Yao · Miaomiao Cui · Liefeng Bo · Hongyu Yang

[ ExHall D ]

Abstract
Text-guided 3D human generation has advanced with the development of efficient 3D representations and 2D-lifting methods like score distillation sampling (SDS). However, current methods suffer from prolonged training times and often produce results that lack fine facial and garment details. In this paper, we propose GaussianIP, an effective two-stage framework for generating identity-preserving realistic 3D humans from text and image prompts. Our core insight is to leverage human-centric knowledge to facilitate the generation process. In stage 1, we propose a novel Adaptive Human Distillation Sampling (AHDS) method to rapidly generate a 3D human that maintains high identity consistency with the image prompt and achieves a realistic appearance. Compared to traditional SDS methods, AHDS better aligns with the human-centric generation process, enhancing visual quality with notably fewer training steps. To further improve the visual quality of the face and clothes regions, we design a View-Consistent Refinement (VCR) strategy in stage 2. Specifically, it iteratively produces detail-enhanced results of the multi-view images from stage 1, ensuring the 3D texture consistency across views via mutual attention and distance-guided attention fusion. Then a polished version of the 3D human can be achieved by directly performing reconstruction with the refined images. Extensive experiments demonstrate that …
Poster
Yingmao Miao · Zhanpeng Huang · Rui Han · Zibin Wang · Chenhao Lin · Chao Shen

[ ExHall D ]

Abstract
While virtual try-on for clothes and shoes with diffusion models has gained traction, virtual try-on for ornaments, such as bracelets, rings, earrings, and necklaces, remains largely unexplored. Due to the intricate tiny patterns and repeated geometric sub-structures in most ornaments, it is much more difficult to guarantee identity and appearance consistency under large pose and scale variances between ornaments and models. This paper proposes the task of virtual try-on for ornaments and presents a method to improve the geometric and appearance preservation of ornament virtual try-ons. Specifically, we estimate an accurate wearing mask to improve the alignments between ornaments and models in an iterative scheme alongside the denoising process. To preserve structure details, we further regularize attention layers to map the reference ornament mask to the wearing mask in an implicit way. Experiment results demonstrate that our method successfully wears ornaments from reference images onto target models, handling substantial differences in scale and pose while preserving identity and achieving realistic visual effects.
Poster
Sumit Chaturvedi · Mengwei Ren · Yannick Hold-Geoffroy · Jingyuan Liu · Julie Dorsey · ZHIXIN SHU

[ ExHall D ]

Abstract
We introduce SynthLight, a diffusion model for portrait relighting. Our approach frames image relighting as a re-rendering problem, where pixels must be transformed in response to changes in environmental lighting conditions. We synthesize a dataset to simulate this lighting-conditioned transformation with 3D head assets under varying lighting using a physically-based rendering engine. We propose two training and inference strategies to bridge the gap between the synthetic and real image domains: (1) multi-task training that takes advantage of real human portraits without lighting labels; (2) an inference time diffusion sampling procedure based on classifier-free guidance that leverages the input portrait to better preserve details. Our method generalizes to diverse real photographs, produces realistic illumination effects such as specular highlights and cast shadows, while preserving identity well. Our quantitative experiments on Light Stage testing data demonstrate results comparable to state-of-the-art relighting methods trained with Light Stage data. Our qualitative results on in-the-wild images show high quality illumination effects for portraits that have never been achieved before with traditional supervision.
Poster
Junying Wang · Jingyuan Liu · Xin Sun · Krishna Kumar Singh · ZHIXIN SHU · He Zhang · Jimei Yang · Nanxuan Zhao · Tuanfeng Y. Wang · Simon Su Chen · Ulrich Neumann · Jae Shin Yoon

[ ExHall D ]

Abstract
This paper introduces Comprehensive Relighting, the first all-in-one approach that can both control and harmonize the lighting from an image or video of humans with arbitrary body parts from any scene. Building such a generalizable model is extremely challenging due to the lack of datasets, restricting existing image-based relighting models to a specific scenario (e.g., face or static human). To address this challenge, we repurpose a pre-trained diffusion model as a general image prior and jointly model the human relighting and background harmonization in a coarse-to-fine framework. To further enhance the temporal coherence of the relighting, we introduce an unsupervised temporal lighting model that learns the lighting cycle consistency from many real-world videos without any ground truth. At inference time, our motion module is combined with the diffusion models through spatio-temporal feature blending algorithms without extra training, and we apply a new guided refinement as a post-processing step to preserve the high-frequency details from the input image. In the experiments, Comprehensive Relighting shows strong generalizability and lighting temporal coherence, outperforming existing image-based human relighting and harmonization methods.
Poster
Kenji Enomoto · Scott Cohen · Brian Price · TJ Rhodes

[ ExHall D ]

Abstract
This paper considers the long-standing problem of extracting alpha mattes from video using a known background. While various color-based or polarization-based approaches have been studied in past decades, the problem remains ill-posed because the solutions solely rely on either color or polarization. We introduce Polarized Color Screen Matting, a single-shot, per-pixel matting theory for alpha matte and foreground color recovery using both color and polarization cues. Through a theoretical analysis of our diffuse-specular polarimetric compositing equation, we derive practical closed-form matting methods with their solvability conditions. Our theory concludes that an alpha matte can be extracted without manual corrections using off-the-shelf equipment such as an LCD monitor, polarization camera, and unpolarized lights with calibrated color. Experiments on synthetic and real-world datasets verify the validity of our theory and show the capability of our matting methods on real videos with quantitative and qualitative comparisons to color-based and polarization-based matting methods.
Poster
Ning Ni · Libao Zhang

[ ExHall D ]

Abstract
Recently, improving the residual structure and designing efficient convolutions have become important branches of lightweight visual reconstruction model design. We have observed that the feature addition mode (FAM) in existing residual structures tends to lead to slow feature learning or stagnation in feature evolution, a phenomenon we define as network inertia. In addition, although blueprint separable convolutions (BSConv) have demonstrated the dominance of intra-kernel correlation, BSConv forces the blueprint to perform scale transformation on all channels, which may lead to incorrect intra-kernel correlation, introduce useless or disruptive features on some channels, and hinder the effective propagation of features. Therefore, in this paper, we rethink the FAM and BSConv for super-light visual reconstruction framework design. First, we design a novel linking mode, called feature diversity evolution link (FDEL), which aims to alleviate the phenomenon of network inertia by reducing the retention of previous low-level features, thereby promoting the evolution of feature diversity. Second, we propose blueprint controllable convolutions (B2Conv). B2Conv can adaptively select accurate intra-kernel correlations along the depth axis, effectively preventing the introduction of useless or disruptive features. Based on FDEL and B2Conv, we develop a super-light super-resolution (SR) framework SLVR for visual reconstruction. Both FDEL and B2Conv can …
Poster
Ping Wang · Lishun Wang · Gang Qu · Xiaodong Wang · Yulun Zhang · Xin Yuan

[ ExHall D ]

Abstract
Plug-and-play (PnP) and deep-unrolling solvers have become the de-facto standard tools for inverse problems in single-pixel imaging (SPI). The former, a class of iterative algorithms where regularization is implicitly performed by an off-the-shelf deep Gaussian denoiser, behaves well in terms of interpretation and generalization, but poorly in terms of accuracy and running time. By contrast, the latter results in better accuracy with faster inference, but cannot generalize well to varying degradations and cannot be interpreted as the proximal operator of any function. In this paper, we tackle the challenge of integrating the merits of both kinds of solvers into one. To this end, we propose a proximal optimization trajectory loss and an image restoration network to develop learned proximal restorers (LPRs) as PnP priors. LPRs approximate the proximal operator of an explicit restoration regularization and guarantee the fast convergence of PnP-HQS (half-quadratic splitting algorithm) and PnP-ADMM (alternating direction method of multipliers). Besides, extensive experiments show that our algorithms not only achieve state-of-the-art accuracy in almost real time but also generalize well to varying degradations caused by changing imaging settings (resolution and compression ratio). The source code will be released soon.
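An editor-added sketch of a generic PnP half-quadratic-splitting loop for a linear single-pixel measurement model y = Ax, where the proximal/denoising step is an arbitrary plugged-in callable. The step size, the closed-form data-fit solve, and the toy soft-threshold "restorer" are placeholders, not the paper's learned proximal restorer.

```python
import numpy as np

def pnp_hqs(y, A, restorer, n, iters=30, rho=1.0):
    """Plug-and-play HQS for y = A @ x with a plugged-in proximal restorer.

    x-step: closed-form least-squares data fit; z-step: restorer acting as a
    proximal operator (any callable z -> z_clean, e.g. a learned network).
    """
    z = A.T @ y
    AtA, Aty = A.T @ A, A.T @ y
    for _ in range(iters):
        # x-step: argmin_x ||A x - y||^2 + rho ||x - z||^2
        x = np.linalg.solve(AtA + rho * np.eye(n), Aty + rho * z)
        # z-step: proximal/denoising step supplied by the plug-and-play prior
        z = restorer(x)
    return z

rng = np.random.default_rng(0)
n, m = 64, 32                                    # 2x-compressed single-pixel measurements
x_true = np.zeros(n); x_true[::8] = 1.0          # sparse test signal
A = rng.standard_normal((m, n)) / np.sqrt(m)
y = A @ x_true
soft_threshold = lambda v: np.sign(v) * np.maximum(np.abs(v) - 0.05, 0.0)  # toy "restorer"
x_hat = pnp_hqs(y, A, soft_threshold, n)
print(np.round(x_hat[:16], 2))
```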
Poster
Bojian Wu · YIFAN PENG · Ruizhen Hu · Xiaowei Zhou

[ ExHall D ]

Abstract
The challenge of image-based 3D reconstruction for glossy objects lies in separating diffuse and specular components on glossy surfaces from captured images, a task complicated by the ambiguity in discerning lighting conditions and material properties using RGB data alone. While state-of-the-art methods rely on tailored and/or high-end equipment for data acquisition, which can be cumbersome and time-consuming, this work introduces a scalable polarization-aided approach that employs cost-effective acquisition tools. By attaching a linear polarizer to readily available RGB cameras, multi-view polarization images can be captured without the need for advance calibration or precise measurements of the polarizer angle, substantially reducing system construction costs. The proposed approach represents polarimetric BRDF, Stokes vectors, and polarization states of object surfaces as neural implicit fields. These fields, combined with the polarizer angle, are retrieved by optimizing the rendering loss of input polarized images. By leveraging fundamental physical principles for the implicit representation of polarization rendering, our method demonstrates superiority over existing techniques through experiments on public datasets and real captured images. Code and captured data will be made available upon acceptance for further evaluation.
Poster
Wei Xu · Charlie Wagner · Junjie Luo · Qi Guo

[ ExHall D ]

Abstract
Extracting depth information from photon-limited, defocused images is challenging because depth from defocus (DfD) relies on accurate estimation of defocus blur, which is fundamentally sensitive to image noise. We present a novel approach to robustly measure object depths from photon-limited images along the defocused boundaries. It is based on a new image patch representation, Blurry-Edges, that explicitly stores and visualizes a rich set of low-level patch information, including boundaries, color, and blurriness. We develop a deep neural network architecture that predicts the Blurry-Edges representation from a pair of differently defocused images, from which depth can be analytically calculated using a novel DfD relation we derive. Our experiment shows that our method achieves the highest depth estimation accuracy on photon-limited images compared to a broad range of state-of-the-art DfD methods.
Poster
Xiaoyan Xing · Konrad Groh · Sezer Karaoglu · Theo Gevers · Anand Bhattad

[ ExHall D ]

Abstract
We introduce LUMINET, a novel architecture that leverages generative models and latent intrinsic representations for effective lighting transfer. Given a source image and a target lighting image, LUMINET synthesizes a relit version of the source scene that captures the target's lighting. Our approach makes two key contributions: a data curation strategy to enhance training efficiency for effective relighting, and a modified diffusion-based ControlNet that processes both latent intrinsic properties from the source image and latent extrinsic properties from the target image. We further improve lighting transfer through a learned adaptor (MLP) that injects the target's latent extrinsic properties via cross-attention and fine-tuning. Unlike traditional ControlNet, which generates images with conditional maps from a single scene, LUMINET processes latent representations from two different images - preserving geometry and albedo from the source while transferring lighting characteristics from the target. Experiments demonstrate that our method successfully transfers complex lighting phenomena including specular highlights and indirect illumination across scenes with varying spatial layouts and materials, outperforming existing approaches on challenging indoor scenes using only images as input.
Poster
Chao Wang · Zhihao Xia · Thomas Leimkuehler · Karol Myszkowski · Xuaner Zhang

[ ExHall D ]

Abstract
While consumer displays increasingly support more than 10 stops of dynamic range, most image assets, such as internet photographs and generative AI content, remain limited to 8-bit low dynamic range (LDR), constraining their utility across high dynamic range (HDR) applications. Currently, no generative model can produce high-bit, high-dynamic-range content in a generalizable way. Existing LDR-to-HDR conversion methods often struggle to produce photorealistic details and physically plausible dynamic range in clipped areas. We introduce LEDiff, a method that equips a pretrained generative model with HDR content generation through latent-space fusion, inspired by image-space exposure fusion techniques. It also functions as an LDR-to-HDR converter, expanding the dynamic range of existing low-dynamic-range images. Our approach uses a small HDR dataset to enable a pretrained diffusion model to recover detail and dynamic range in clipped highlights and shadows. LEDiff brings HDR capabilities to existing generative models and converts any LDR image to HDR, creating photorealistic HDR outputs for image generation, image-based lighting (HDR environment map generation), and photographic effects such as depth-of-field simulation, where linear HDR data is essential for realistic quality.
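The "image-space exposure fusion" that the abstract cites as inspiration can be illustrated with a classic well-exposedness-weighted blend of an LDR exposure stack (Mertens-style). The sketch below is a minimal image-space version with hypothetical inputs; the paper performs an analogous fusion in the latent space of a diffusion model rather than on pixels.

```python
import numpy as np

def well_exposedness(img, sigma=0.2):
    """Per-pixel weight favoring values near mid-gray (one of the classic fusion cues)."""
    return np.exp(-((img - 0.5) ** 2) / (2 * sigma ** 2)).mean(axis=-1, keepdims=True)

def fuse_exposures(stack):
    """Weighted average of an exposure stack; `stack` is a list of HxWx3 float images in [0, 1]."""
    weights = np.stack([well_exposedness(img) for img in stack])      # N x H x W x 1
    weights = weights / (weights.sum(axis=0, keepdims=True) + 1e-8)   # normalize over exposures
    return (weights * np.stack(stack)).sum(axis=0)

# Hypothetical two-exposure stack of a tiny 4x4 image.
dark = np.random.rand(4, 4, 3) * 0.3    # underexposed: keeps highlight detail
bright = np.clip(dark * 4.0, 0.0, 1.0)  # overexposed: keeps shadow detail
fused = fuse_exposures([dark, bright])
print(fused.shape)  # (4, 4, 3)
```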
Poster
Chih-Hao Lin · Jia-Bin Huang · Zhengqin Li · Zhao Dong · Christian Richardt · Michael Zollhoefer · Tuotuo Li · Johannes Kopf · Shenlong Wang · Changil Kim

[ ExHall D ]

Abstract
Inverse rendering seeks to recover 3D geometry, surface material, and lighting from captured images, enabling advanced applications such as novel-view synthesis, relighting, and virtual object insertion. However, most existing techniques rely on high dynamic range (HDR) images as input, limiting accessibility for general users. In response, we introduce IRIS, an inverse rendering framework that recovers the physically based material, spatially-varying HDR lighting, and camera response functions from multi-view, low-dynamic-range (LDR) images. By eliminating the dependence on HDR input, we make inverse rendering technology more accessible. We evaluate our approach on real-world and synthetic scenes and compare it with state-of-the-art methods. Our results show that IRIS effectively recovers HDR lighting, accurate material, and plausible camera response functions, supporting photorealistic relighting and object insertion.
Poster
Hoon-Gyu Chung · Seokjun Choi · Seung-Hwan Baek

[ ExHall D ]

Abstract
Inverse rendering seeks to reconstruct both geometry and spatially varying BRDFs (SVBRDFs) from captured images. To address the inherent ill-posedness of inverse rendering, basis BRDF representations are commonly used, modeling SVBRDFs as spatially varying blends of a set of basis BRDFs. However, existing methods often yield basis BRDFs that lack intuitive separation and have limited scalability to scenes of varying complexity. In this paper, we introduce a differentiable inverse rendering method that produces interpretable basis BRDFs. Our approach models a scene using 2D Gaussians, where the reflectance of each Gaussian is defined by a weighted blend of basis BRDFs. We efficiently render an image from the 2D Gaussians and basis BRDFs using differentiable rasterization and impose a rendering loss with the input images. During this analysis-by-synthesis optimization process of differentiable inverse rendering, we dynamically adjust the number of basis BRDFs to fit the target scene while encouraging sparsity in the basis weights. This ensures that the reflectance of each Gaussian is represented by only a few basis BRDFs. This approach enables the reconstruction of accurate geometry and interpretable basis BRDFs that are spatially separated. Consequently, the resulting scene representation, comprising basis BRDFs and 2D Gaussians, supports physically-based novel-view relighting and intuitive scene editing.
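As a rough illustration of "reflectance as a sparse blend of basis BRDFs", the sketch below parameterizes each 2D Gaussian with a weight vector over K basis reflectances and adds a low-entropy penalty on the normalized weights so each Gaussian leans on only a few bases. All tensor names, the diffuse-only bases, and the entropy-style sparsity term are hypothetical stand-ins for the paper's full SVBRDF model, differentiable rasterizer, and sparsity formulation.

```python
import torch

K = 8                      # number of basis BRDFs (hypothetical)
N = 10000                  # number of 2D Gaussians (hypothetical)

basis_albedo = torch.rand(K, 3, requires_grad=True)          # simple diffuse-only bases
weight_logits = torch.zeros(N, K, requires_grad=True)        # per-Gaussian blend logits

def per_gaussian_reflectance():
    w = torch.softmax(weight_logits, dim=-1)                  # N x K, rows sum to 1
    return w @ basis_albedo, w                                 # N x 3 blended albedo

def sparsity_loss(w, eps=1e-8):
    # Low entropy => each Gaussian uses only a few bases.
    return -(w * (w + eps).log()).sum(dim=-1).mean()

albedo, w = per_gaussian_reflectance()
target = torch.rand(N, 3)                                      # stand-in for a rendering loss
loss = torch.nn.functional.mse_loss(albedo, target) + 0.01 * sparsity_loss(w)
loss.backward()
print(weight_logits.grad.shape)  # torch.Size([10000, 8])
```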
Poster
Samuel Rota Bulò · Lorenzo Porzi · Nemanja Bartolovic · Peter Kontschieder

[ ExHall D ]

Abstract
We present a novel, hardware rasterized rendering approach for ray-based 3D Gaussian Splatting (RayGS), obtaining both fast and high-quality results for novel view synthesis. Our work contains a mathematically rigorous and geometrically intuitive derivation about how to efficiently estimate all relevant quantities for rendering RayGS models, structured with respect to standard hardware rasterization shaders. Our solution is the first enabling rendering RayGS models at sufficiently high frame rates to support quality-sensitive applications like Virtual and Mixed Reality. Our second contribution enables alias-free rendering for RayGS, by addressing mip-related issues arising when rendering diverging scales during training and testing. We demonstrate significant performance gains, across different benchmark scenes, while retaining state-of-the-art appearance quality of RayGS.
Poster
Chun Gu · Xiaofei Wei · Li Zhang · Xiatian Zhu

[ ExHall D ]

Abstract
Inverse rendering aims to recover scene geometry, material properties, and lighting from multi-view images. Given the complexity of light-surface interactions, importance sampling is essential for the evaluation of the rendering equation, as it reduces variance and enhances the efficiency of Monte Carlo sampling. Existing inverse rendering methods typically rely on manually pre-defined, non-learnable importance samplers, which struggle to match the spatially and directionally varying integrand and thus suffer from high variance and suboptimal performance. To address this limitation, we propose the concept of learning a spatially and directionally aware importance sampler for the rendering equation to accurately and flexibly capture the unconstrained complexity of a typical scene. We further formulate TensoFlow, a generic approach for sampler learning in inverse rendering, enabling the sampler to closely match the integrand of the rendering equation spatially and directionally. Concretely, our sampler is parameterized by normalizing flows, allowing both directional sampling of incident light and probability density function (PDF) inference. To capture the characteristics of the sampler spatially, we learn a tensorial representation of the scene space, which imposes spatial conditions, together with the reflected direction, leading to spatially and directionally aware sampling distributions. Our model can be optimized by minimizing the difference between the integrand and …
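For context, the variance that a learned sampler targets comes from the standard importance-sampled Monte Carlo estimator of the rendering equation, L_o ≈ (1/N) Σ f_r(ω_i) L_i(ω_i) cosθ_i / p(ω_i). The sketch below estimates outgoing radiance for a Lambertian surface under a simple analytic sky using cosine-weighted hemisphere sampling, the kind of fixed, hand-crafted sampler that the paper replaces with a normalizing-flow PDF; all quantities are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_cosine_hemisphere(n):
    """Cosine-weighted directions about the +z normal; pdf(w) = cos(theta) / pi."""
    u1, u2 = rng.random(n), rng.random(n)
    r, phi = np.sqrt(u1), 2 * np.pi * u2
    x, y = r * np.cos(phi), r * np.sin(phi)
    z = np.sqrt(np.maximum(0.0, 1.0 - u1))
    dirs = np.stack([x, y, z], axis=1)
    pdf = z / np.pi
    return dirs, pdf

def incident_radiance(dirs):
    """Hypothetical sky: brighter toward +z."""
    return 0.5 + 0.5 * np.clip(dirs[:, 2], 0.0, 1.0)

albedo = 0.6
dirs, pdf = sample_cosine_hemisphere(4096)
brdf = albedo / np.pi                       # Lambertian BRDF
cos_theta = dirs[:, 2]
L_o = np.mean(brdf * incident_radiance(dirs) * cos_theta / pdf)
print(L_o)   # analytic value is albedo * (0.5 + 1/3) = 0.5
```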
Poster
Zhengqin Li · Dilin Wang · Ka chen · Zhaoyang Lv · Thu Nguyen-Phuoc · Milim Lee · Jia-Bin Huang · Lei Xiao · Yufeng Zhu · Carl Marshall · Carl Ren · Richard Newcombe · Zhao Dong

[ ExHall D ]

Abstract
We present the Large Inverse Rendering Model (LIRM), a transformer architecture that jointly reconstructs high-quality shape, materials, and radiance fields with view-dependent effects in less than a second. Our model builds upon the recent Large Reconstruction Models (LRMs) that achieve state-of-the-art sparse-view reconstruction quality. However, existing LRMs struggle to reconstruct unseen parts accurately and cannot recover glossy appearance or generate relightable 3D content that can be consumed by standard graphics engines. To address these limitations, we make three key technical contributions to build a more practical multi-view 3D reconstruction framework. First, we introduce an update model that allows us to progressively add more input views to improve our reconstruction. Second, we propose a hexa-plane neural SDF representation to better recover detailed textures, geometry, and material parameters. Third, we develop a novel neural directional-embedding mechanism to handle view-dependent effects. Trained on a large-scale shape and material dataset with a tailored coarse-to-fine training scheme, our model achieves compelling results. It compares favorably to optimization-based dense-view inverse rendering methods in terms of geometry and relighting accuracy, while requiring only a fraction of the inference time.
Poster
Yutao Feng · Xiang Feng · Yintong Shang · Ying Jiang · Chang Yu · Zeshun Zong · Tianjia Shao · Hongzhi Wu · Kun Zhou · Chenfanfu Jiang · Yin Yang

[ ExHall D ]

Abstract
We demonstrate the feasibility of integrating physics-based animations of solids and fluids with 3D Gaussian Splatting (3DGS) to create novel effects in virtual scenes reconstructed using 3DGS. Leveraging the coherence of the Gaussian Splatting and Position-Based Dynamics (PBD) in the underlying representation, we manage rendering, view synthesis, and the dynamics of solids and fluids in a cohesive manner. Similar to GaussianShader, we enhance each Gaussian kernel with an added normal, aligning the kernel's orientation with the surface normal to refine the PBD simulation. This approach effectively eliminates spiky noises that arise from rotational deformation in solids. It also allows us to integrate physically based rendering to augment the dynamic surface reflections on fluids. Consequently, our framework is capable of realistically reproducing surface highlights on dynamic fluids and facilitating interactions between scene objects and fluids from new views.
Poster
Aditya Chetan · Guandao Yang · Zichen Wang · Steve Marschner · Bharath Hariharan

[ ExHall D ]

Abstract
Neural fields have become widely used in various fields, from shape representation to neural rendering, and for solving partial differential equations (PDEs). With the advent of hybrid neural field representations like Instant NGP that leverage small MLPs and explicit representations, these models train quickly and can fit large scenes. Yet in many applications like rendering and simulation, hybrid neural fields can cause noticeable and unreasonable artifacts. This is because they do not yield accurate spatial derivatives needed for these downstream applications. In this work, we propose two ways to circumvent these challenges. Our first approach is a post hoc operator that uses local polynomial fitting to obtain more accurate derivatives from pre-trained hybrid neural fields. Additionally, we also propose a self-supervised fine-tuning approach that refines the hybrid neural field to yield accurate derivatives directly while preserving the initial signal. We show applications of our method to rendering, collision simulation, and solving PDEs. We observe that using our approach yields more accurate derivatives, reducing artifacts and leading to more accurate simulations in downstream applications.
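A minimal version of a post hoc local-polynomial-fitting operator: sample the (possibly noisy) field on a small stencil around a query point, fit a quadratic polynomial by least squares, and read the gradient off the linear coefficients. The field below is an analytic stand-in for a pre-trained hybrid neural field, and the stencil size and sample count are assumptions.

```python
import numpy as np

def field(p):
    """Stand-in for a hybrid neural field evaluated at points p (N x 3)."""
    x, y, z = p[:, 0], p[:, 1], p[:, 2]
    return np.sin(x) * y + 0.5 * z ** 2

def polyfit_gradient(query, h=0.05, n_samples=64, seed=0):
    """Least-squares quadratic fit in a radius-h neighborhood; gradient = linear coefficients."""
    rng = np.random.default_rng(seed)
    offsets = rng.uniform(-h, h, size=(n_samples, 3))
    vals = field(query[None, :] + offsets)
    o = offsets
    # Design matrix for f(q + o) ~ c0 + g.o + quadratic terms in o.
    A = np.column_stack([
        np.ones(n_samples), o[:, 0], o[:, 1], o[:, 2],
        o[:, 0] ** 2, o[:, 1] ** 2, o[:, 2] ** 2,
        o[:, 0] * o[:, 1], o[:, 0] * o[:, 2], o[:, 1] * o[:, 2],
    ])
    coeffs, *_ = np.linalg.lstsq(A, vals, rcond=None)
    return coeffs[1:4]          # gradient estimate at the query point

q = np.array([0.3, -0.7, 1.2])
print(polyfit_gradient(q))       # ~ [cos(0.3) * (-0.7), sin(0.3), 1.2]
```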
Poster
Feixiang He · Jiangbei Yue · Jialin Zhu · Armin Seyfried · Dan Casas · Julien Pettré · He Wang

[ ExHall D ]

Abstract
Video-based high-density crowd analysis and prediction has been a long-standing topic in computer vision. It is notoriously difficult due to, but not limited to, the lack of high-quality data and complex crowd dynamics. Consequently, it has been relatively understudied. In this paper, we propose a new approach that aims to learn from in-the-wild videos, often of low quality, where it is difficult to track individuals or count heads. The key novelty is a new physics prior to model crowd dynamics. We model high-density crowds as active matter, a continuum with active particles subject to stochastic forces, named 'crowd material'. Our physics model is combined with neural networks, resulting in a neural stochastic differential equation system which can mimic the complex crowd dynamics. Due to the lack of similar research, we adapt a range of existing methods that are close to ours, as well as other possible alternatives, for comparison. Through exhaustive evaluation, we show our model outperforms existing methods in analyzing and forecasting extremely high-density crowds. Furthermore, since our model is a continuous-time physics model, it can be used for simulation and analysis, providing strong interpretability. This is categorically different from most deep learning methods, which are discrete-time models and black boxes.
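A neural stochastic differential equation system of the kind described can be prototyped as a drift network plus constant diffusion integrated with Euler-Maruyama, dX = f_θ(X, t) dt + σ dW. The sketch below rolls out hypothetical low-dimensional crowd states; the actual "crowd material" prior and its coupling to video observations are beyond this snippet.

```python
import torch

class Drift(torch.nn.Module):
    """f_theta(x, t): a small MLP drift for a state of dimension d."""
    def __init__(self, d=4, hidden=64):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(d + 1, hidden), torch.nn.Tanh(), torch.nn.Linear(hidden, d))

    def forward(self, x, t):
        t_col = t.expand(x.shape[0], 1)
        return self.net(torch.cat([x, t_col], dim=-1))

def euler_maruyama(drift, x0, t0=0.0, t1=1.0, steps=100, sigma=0.1):
    """Integrate dX = f(X, t) dt + sigma dW with a fixed step size."""
    x, dt = x0, (t1 - t0) / steps
    for k in range(steps):
        t = torch.tensor([[t0 + k * dt]])
        noise = torch.randn_like(x) * (sigma * dt ** 0.5)
        x = x + drift(x, t) * dt + noise
    return x

drift = Drift()
x0 = torch.randn(32, 4)                   # 32 hypothetical crowd-state samples
print(euler_maruyama(drift, x0).shape)    # torch.Size([32, 4])
```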
Poster
Bojun Xiong · Jialun Liu · JiaKui Hu · Chenming Wu · Jinbo Wu · Xing Liu · Chen Zhao · Errui Ding · Zhouhui Lian

[ ExHall D ]

Abstract
Physically Based Rendering (PBR) materials play a crucial role in modern graphics, enabling photorealistic rendering across diverse environment maps. Developing an effective and efficient algorithm that is capable of automatically generating high-quality PBR materials rather than RGB texture for 3D meshes can significantly streamline the 3D content creation. Most existing methods leverage pre-trained 2D diffusion models for multi-view image synthesis, which often leads to severe inconsistency between the generated textures and input 3D meshes. This paper presents TexGaussian, a novel method that uses octant-aligned 3D Gaussian Splatting for rapid PBR material generation. Specifically, we place each 3D Gaussian on the finest leaf node of the octree built from the input 3D mesh to render the multiview images not only for the albedo map but also for roughness and metallic. Moreover, our model is trained in a regression manner instead of diffusion denoising, capable of generating the PBR material for a 3D mesh in a single feed-forward process. Extensive experiments on publicly available benchmarks demonstrate that our method synthesizes more visually pleasing PBR materials and runs faster than previous methods in both unconditional and text-conditional scenarios, which exhibit better consistency with the given geometry. Our code and trained models will be …
Poster
Guoxing Sun · Rishabh Dabral · Heming Zhu · Pascal Fua · Christian Theobalt · Marc Habermann

[ ExHall D ]

Abstract
Real-time free-view human rendering from sparse-view RGB inputs is a challenging task due to sensor scarcity and the tight time budget. To ensure efficiency, recent methods leverage 2D CNNs operating in texture space to learn rendering primitives. However, they either jointly learn geometry and appearance, or they ignore relevant sparse image information for geometry estimation, significantly harming visual quality and robustness to unseen body poses. To address this issue, we present Double Unprojected Textures, which at its core disentangles coarse geometric deformation estimation from appearance synthesis, enabling robust and photorealistic 4K rendering in real time. Specifically, we first introduce a novel image-conditioned template deformation network, which estimates the coarse deformation of the human template from a first unprojected texture. This updated geometry is then used to apply a second and more accurate texture unprojection. The resulting texture map has fewer artifacts and better alignment with input views, which benefits our learning of finer-level geometry and appearance represented by Gaussian splats. We validate the effectiveness and efficiency of the proposed method in quantitative and qualitative experiments, where it significantly surpasses other state-of-the-art methods.
Poster
Zhipeng Huang · Wangbo Yu · Xinhua Cheng · ChengShu Zhao · Yunyang Ge · Mingyi Guo · Li Yuan · Yonghong Tian

[ ExHall D ]

Abstract
Indoor scene texture synthesis has garnered significant interest due to its important potential applications in virtual reality, digital media, and creative arts. Existing diffusion model-based approaches either rely on per-view inpainting techniques, which are plagued by severe cross-view inconsistencies and conspicuous seams, or they resort to optimization-based approaches that entail substantial computational overhead. In this work, we present **RoomPainter**, a framework that seamlessly integrates efficiency and consistency to achieve high-fidelity texturing of indoor scenes. The core of RoomPainter features a zero-shot technique that effectively adapts a 2D diffusion model for 3D-consistent texture synthesis, along with a two-stage generation strategy that ensures both global and local consistency. Specifically, we introduce Attention-Guided Multi-View Integrated Sampling (**MVIS**) combined with a neighbor-integrated attention mechanism for zero-shot texture map generation. Using MVIS, we first generate a texture map for the entire room to ensure global consistency, then adopt its variant, namely attention-guided multi-view integrated repaint sampling (**MVRS**), to repaint individual instances within the room, thereby further enhancing local consistency. Experiments demonstrate that RoomPainter achieves superior performance for indoor scene texture synthesis in visual quality, global consistency, and generation efficiency.
Poster
Wei Cheng · Juncheng Mu · Xianfang Zeng · Xin Chen · Anqi Pang · Chi Zhang · Zhibin Wang · Bin Fu · Gang Yu · Ziwei Liu · Liang Pan

[ ExHall D ]

Abstract
Texturing is a crucial step in the 3D asset production workflow, which enhances the visual appeal and diversity of 3D assets. Despite recent advancements in Text-to-Texture (T2T) generation, existing methods often yield subpar results, primarily due to local discontinuities, inconsistencies across multiple views, and their heavy dependence on UV unwrapping outcomes. To tackle these challenges, we propose a novel generation-refinement 3D texturing framework called MVPaint, which can generate high-resolution, seamless textures while emphasizing multi-view consistency. MVPaint mainly consists of three key modules. 1) Synchronized Multi-view Generation (SMG). Given a 3D mesh model, MVPaint first simultaneously generates multi-view images by employing an SMG model, which leads to coarse texturing results with unpainted parts due to missing observations. 2) Spatial-aware 3D Inpainting (S3I). To ensure complete 3D texturing, we introduce the S3I method, specifically designed to effectively texture previously unobserved areas. 3) UV Refinement (UVR). Furthermore, MVPaint employs a UVR module to improve the texture quality in the UV space, which first performs a UV-space Super-Resolution, followed by a Spatial-aware Seam-Smoothing algorithm for revising spatial texturing discontinuities caused by UV unwrapping. Moreover, we establish two T2T evaluation benchmarks: the Objaverse T2T benchmark and the GSO T2T benchmark, based on selected high-quality 3D …
Poster
Qiao Yu · Xianzhi Li · Yuan Tang · Xu Han · Long Hu · yixue Hao · Min Chen

[ ExHall D ]

Abstract
Generating 3D meshes from a single image is an important but ill-posed task. Existing methods mainly adopt 2D multiview diffusion models to generate intermediate multiview images, and use the Large Reconstruction Model (LRM) to create the final meshes. However, the multiview images exhibit local inconsistencies, and the meshes often lack fidelity to the input image or look blurry. We propose Fancy123, featuring two enhancement modules and an unprojection operation to address the above three issues, respectively. The appearance enhancement module deforms the 2D multiview images to realign misaligned pixels for better multiview consistency. The fidelity enhancement module deforms the 3D mesh to match the input image. The unprojection of the input image and deformed multiview images onto LRM's generated mesh ensures high clarity, discarding LRM's predicted blurry-looking mesh colors. Extensive qualitative and quantitative experiments verify Fancy123's SoTA performance with significant improvement. Also, the two enhancement modules are plug-and-play and work at inference time, allowing seamless integration into various existing single-image-to-3D methods.
Poster
Nissim Maruani · Wang Yifan · Matthew Fisher · Pierre Alliez · Mathieu Desbrun

[ ExHall D ]

Abstract
This paper proposes a new 3D generative model that learns to synthesize shape variations based on a single example. While generative methods for 3D objects have recently attracted much attention, current techniques often lack geometric details and/or require long training times and large resources. Our approach remedies these issues by combining sparse voxel grids and multiscale point, normal, and color sampling within an encoder-free neural architecture that can be trained efficiently and in parallel. We show that our resulting variations better capture the fine details of their original input and can capture more general types of surfaces than previous SDF-based methods. Moreover, we offer interactive generation of 3D shape variants, allowing more human control in the design loop if needed.
Poster
Daoyi Gao · Mohd Yawar Nihal Siddiqui · Lei Li · Angela Dai

[ ExHall D ]

Abstract
Articulated 3D object generation is fundamental for creating realistic, functional, and interactable virtual assets which are not simply static. We introduce MeshArt, a hierarchical transformer-based approach to generate articulated 3D meshes with clean, compact geometry, reminiscent of human-crafted 3D models. We approach articulated mesh generation in a part-by-part fashion across two stages. First, we generate a high-level articulation-aware object structure; then, based on this structural information, we synthesize each part's mesh faces. Key to our approach is modeling both articulation structures and part meshes as sequences of quantized triangle embeddings, leading to a unified hierarchical framework with transformers for autoregressive generation. Object part structures are first generated as their bounding primitives and articulation modes; a second transformer, guided by these articulation structures, then generates each part's mesh triangles. To ensure coherency among generated parts, we introduce structure-guided conditioning that also incorporates local part mesh connectivity. MeshArt shows significant improvements over the state of the art, with a 57.1% improvement in structure coverage and a 209-point improvement in mesh generation FID.
Poster
Aleksei Bokhovkin · Quan Meng · Shubham Tulsiani · Angela Dai

[ ExHall D ]

Abstract
We present SceneFactor, a diffusion-based approach for large-scale 3D scene generation that enables controllable generation and effortless editing. SceneFactor enables text-guided 3D scene synthesis through our factored diffusion formulation, leveraging latent semantic and geometric manifolds for generation of arbitrary-sized 3D scenes. While text input enables easy, controllable generation, text guidance remains imprecise for intuitive, localized editing and manipulation of the generated 3D scenes. Our factored semantic diffusion generates a proxy semantic space composed of semantic 3D boxes that enables controllable editing of generated scenes by adding, removing, or changing the size of the semantic 3D proxy boxes, which in turn guides high-fidelity, consistent 3D geometric editing. Extensive experiments demonstrate that our approach enables high-fidelity 3D scene synthesis with effective controllable editing through our factored diffusion approach.
Poster
Ziya Erkoc · Can Gümeli · Chaoyang Wang · Matthias Nießner · Angela Dai · Peter Wonka · Hsin-Ying Lee · Peiye Zhuang

[ ExHall D ]

Abstract
We propose a training-free approach to 3D editing that enables the editing of a single shape and the reconstruction of a mesh within a few minutes. Leveraging 4-view images, user-guided text prompts, and rough 2D masks, our method produces an edited 3D mesh that aligns with the prompt. For this, our approach performs synchronized multi-view image editing in 2D. However, targeted regions to be edited are ambiguous due to projection from 3D to 2D. To ensure precise editing only in intended regions, we develop a 3D segmentation pipeline that detects edited areas in 3D space. Additionally, we introduce a merging algorithm to seamlessly integrate edited 3D regions with original input. Extensive experiments demonstrate the superiority of our method over previous approaches, enabling fast, high-quality editing while preserving unintended regions.
Poster
Quan Meng · Lei Li · Matthias Nießner · Angela Dai

[ ExHall D ]

Abstract
We present LT3SD, a novel latent diffusion model for large-scale 3D scene generation. Recent advances in diffusion models have shown impressive results in 3D object generation, but are limited in spatial extent and quality when extended to 3D scenes. To generate complex and diverse 3D scene structures, we introduce a latent tree representation to effectively encode both lower-frequency geometry and higher-frequency detail in a coarse-to-fine hierarchy. We can then learn a generative diffusion process in this latent 3D scene space, modeling the latent components of a scene at each resolution level. To synthesize large-scale scenes with varying sizes, we train our diffusion model on scene patches and synthesize arbitrary-sized output 3D scenes through shared diffusion generation across multiple scene patches. Through extensive experiments, we demonstrate the efficacy and benefits of LT3SD for large-scale, high-quality unconditional 3D scene generation and for probabilistic completion for partial scene observations.
Poster
Yian Zhao · Wanshi Xu · Ruochong Zheng · Pengchong Qiao · Chang Liu · Jie Chen

[ ExHall D ]

Abstract
The efficient rendering and explicit nature of 3DGS promote the advancement of 3D scene manipulation. However, existing methods typically encounter challenges in controlling the manipulation region and are unable to furnish the user with interactive feedback, which inevitably leads to unexpected results. Intuitively, incorporating interactive 3D segmentation tools can compensate for this deficiency. Nevertheless, existing segmentation frameworks impose a pre-processing step of scene-specific parameter training, which limits the efficiency and flexibility of scene manipulation. To deliver a 3D region control module that is well-suited for scene manipulation with reliable efficiency, we propose **i**nteractive **Seg**ment-and-**Man**ipulate 3D Gaussians (**iSegMan**), an interactive segmentation and manipulation framework that only requires simple 2D user interactions in any view. To propagate user interactions to other views, we propose Epipolar-guided Interaction Propagation (**EIP**), which innovatively exploits the epipolar constraint for efficient and robust interaction matching. To avoid scene-specific training and maintain efficiency, we further propose the novel Visibility-based Gaussian Voting (**VGV**), which obtains 2D segmentations from SAM and models region extraction as a voting game between 2D pixels and 3D Gaussians based on Gaussian visibility. Taking advantage of the efficient and precise region control of EIP and VGV, we put forth a **Manipulation Toolbox** to implement various functions on selected regions, enhancing the …
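The epipolar constraint behind such interaction propagation can be illustrated directly: a click x in view A maps to the line l' = F x in view B, and candidate matches in B can be scored by their distance to that line. The fundamental-matrix construction from two calibrated cameras below is standard multi-view geometry; the cameras, the click, and the distance test are hypothetical, and the paper's matching is richer than a pure distance check.

```python
import numpy as np

def skew(v):
    return np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0.0]])

def fundamental(K1, K2, R, t):
    """F for cameras P1 = K1[I|0], P2 = K2[R|t], so that x2^T F x1 = 0."""
    return np.linalg.inv(K2).T @ skew(t) @ R @ np.linalg.inv(K1)

def point_line_distance(line, pt_h):
    """Distance of a homogeneous 2D point to a homogeneous 2D line."""
    return abs(line @ pt_h) / np.hypot(line[0], line[1])

# Hypothetical intrinsics and relative pose between two training views.
K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])
R = np.eye(3)
t = np.array([0.2, 0.0, 0.0])            # pure sideways baseline
F = fundamental(K, K, R, t)

click_A = np.array([400.0, 260.0, 1.0])  # user click in view A (homogeneous)
epiline_B = F @ click_A                  # corresponding epipolar line in view B

candidates_B = np.array([[405.0, 259.5, 1.0], [100.0, 50.0, 1.0]])
dists = [point_line_distance(epiline_B, c) for c in candidates_B]
print(dists)                             # the first candidate lies near the line
```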
Poster
Jianxiong Shen · Yue Qian · Xiaohang Zhan

[ ExHall D ]

Abstract
Current 3D Gaussian Splatting methods often overlook structured information, leading to disorganized 3D Gaussians that compromise memory and structure efficiency, particularly in Level-of-Detail (LOD) management. This inefficiency results in excessive memory use, limiting their application in resource-sensitive environments like virtual reality and interactive simulations. To overcome these challenges, we introduce a scalable Gaussian Soup that enables high-performance LOD management with progressively reduced memory usage. Our method utilizes triangle primitives with embedded Gaussian splats and adaptive pruning/growing strategies to ensure high-quality scene rendering with significantly reduced memory demands. By embedding neural Gaussians within the triangle primitives through the triangle-MLP, we achieve further memory savings while maintaining rendering fidelity. Experimental results demonstrate that our approach achieves consistently superior performance to recent leading techniques across various LODs while progressively reducing memory usage.
Poster
Yifei Liu · Zhihang Zhong · Yifan Zhan · Sheng Xu · Xiao Sun

[ ExHall D ]

Abstract
While 3D Gaussian Splatting (3DGS) has demonstrated remarkable performance in novel view synthesis and real-time rendering, the high memory consumption due to the use of millions of Gaussians limits its practicality. To mitigate this issue, improvements have been made by pruning unnecessary Gaussians, either through a hand-crafted criterion or by using learned masks. However, these methods deterministically remove Gaussians based on a snapshot of the pruning moment, leading to sub-optimized reconstruction performance from a long-term perspective. To address this issue, we introduce MaskGaussian, which models Gaussians as probabilistic entities rather than permanently removing them, and utilize them according to their probability of existence. To achieve this, we propose a masked-rasterization technique that enables unused yet probabilistically existing Gaussians to receive gradients, allowing for dynamic assessment of their contribution to the evolving scene and adjustment of their probability of existence. Hence, the importance of Gaussians iteratively changes and the pruned Gaussians are selected diversely. Various experimental results demonstrate the superiority of the proposed method in achieving better rendering quality with fewer Gaussians, compared to previous pruning methods.
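One plausible way to let "pruned yet probabilistically existing" Gaussians keep receiving gradients is a Bernoulli existence mask with a straight-through estimator: the forward pass uses a hard 0/1 sample, while the backward pass flows through the underlying probability. The snippet below shows only this masking trick with hypothetical names and a dummy loss; the paper's masked-rasterization inside the splatting pipeline is not reproduced.

```python
import torch

N = 5000
logits = torch.zeros(N, requires_grad=True)        # per-Gaussian existence logits
opacity = torch.rand(N, requires_grad=True)        # stand-in for other Gaussian attributes

def sample_existence_mask(logits):
    """Hard Bernoulli sample in the forward pass, straight-through gradient to the probs."""
    probs = torch.sigmoid(logits)
    hard = (torch.rand_like(probs) < probs).float()
    return hard + probs - probs.detach()            # value == hard, grad flows via probs

mask = sample_existence_mask(logits)
effective_opacity = opacity * mask                  # masked Gaussians contribute nothing...
loss = effective_opacity.sum() / N                  # ...yet a dummy loss still reaches `logits`
loss.backward()
print(logits.grad.abs().sum() > 0)                  # True: pruned Gaussians also get gradients
```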
Poster
Kun Yang · Yuxiang Liu · Zeyu Cui · Yu Liu · Maojun Zhang · Shen Yan · Qing Wang

[ ExHall D ]

Abstract
Thermal infrared imaging offers the advantage of all-weather capability, enabling non-intrusive measurement of an object's surface temperature. Consequently, thermal infrared images are employed to reconstruct 3D models that accurately reflect the temperature distribution of a scene, aiding in applications such as building monitoring and energy management. However, existing approaches predominantly focus on static 3D reconstruction for a single time period, overlooking the impact of environmental factors on thermal radiation and failing to predict or analyze temperature variations over time. To address these challenges, we propose the NTR-Gaussian method, which treats temperature as a form of thermal radiation, incorporating elements like convective heat transfer and radiative heat dissipation. Our approach utilizes neural networks to predict thermodynamic parameters such as emissivity, convective heat transfer coefficient, and heat capacity. By integrating these predictions, we can accurately forecast thermal temperatures at various times throughout a nighttime scene. Furthermore, we introduce a dynamic dataset specifically for nighttime thermal imagery. Extensive experiments and evaluations demonstrate that NTR-Gaussian significantly outperforms comparison methods in thermal reconstruction, achieving a predicted temperature error within 1 degree Celsius.
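The thermodynamic terms named in the abstract (convective heat transfer and radiative dissipation) correspond to a lumped surface-energy balance, C dT/dt = -hA(T - T_air) - εσA(T⁴ - T_sky⁴), which can be stepped explicitly to forecast nighttime surface temperature. The coefficients below are hypothetical constants; in the paper the per-surface parameters are predicted by neural networks instead.

```python
import numpy as np

SIGMA = 5.670374419e-8   # Stefan-Boltzmann constant, W / (m^2 K^4)

def step_temperature(T, T_air, T_sky, dt, h=10.0, eps=0.9, area=1.0, heat_cap=5e4):
    """One explicit Euler step of a lumped surface-energy balance (all units SI)."""
    convection = -h * area * (T - T_air)
    radiation = -eps * SIGMA * area * (T ** 4 - T_sky ** 4)
    return T + dt * (convection + radiation) / heat_cap

# Forecast a surface cooling over a night (hypothetical conditions).
T, T_air, T_sky = 295.0, 288.0, 270.0   # Kelvin
for _ in range(12 * 60):                # 12 hours in 1-minute steps
    T = step_temperature(T, T_air, T_sky, dt=60.0)
print(T - 273.15)                        # final surface temperature in Celsius
```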
Poster
Yexing Xu · Longguang Wang · Minglin Chen · Sheng Ao · Li Li · Yulan Guo

[ ExHall D ]

Abstract
Although 3D Gaussian Splatting (3DGS) has demonstrated promising results in novel view synthesis, its performance degrades dramatically with sparse inputs and generates undesirable artifacts. As the number of training views decreases, the novel view synthesis task degrades to a highly under-determined problem such that existing methods suffer from the notorious overfitting issue. Interestingly, we observe that models with fewer Gaussian primitives exhibit less overfitting under sparse inputs. Inspired by this observation, we propose a Random Dropout Regularization (RDR) to exploit the advantages of low-complexity models to alleviate overfitting. In addition, to remedy the lack of high-frequency details in these models, an Edge-guided Splitting Strategy (ESS) is developed. With these two techniques, our method (termed DropoutGS) provides a simple yet effective plug-in approach to improve the generalization performance of existing 3DGS methods. Extensive experiments show that our DropoutGS produces state-of-the-art performance under sparse views on benchmark datasets including Blender, LLFF, and DTU.
Poster
Yecong Wan · Mingwen Shao · Yuanshuo Cheng · Wangmeng Zuo

[ ExHall D ]

Abstract
In this paper, we aim ambitiously for a realistic yet challenging problem, namely, how to reconstruct high-quality 3D scenes from sparse low-resolution views that simultaneously suffer from deficient perspectives and contents. Whereas existing methods only deal with either sparse views or low-resolution observations, they fail to handle such hybrid and complicated scenarios. To this end, we propose a novel Sparse-view Super-resolution 3D Gaussian Splatting framework, dubbed S2Gaussian, that can reconstruct structure-accurate and detail-faithful 3D scenes with only sparse and low-resolution views. S2Gaussian operates in a two-stage fashion. In the first stage, we initially optimize a low-resolution Gaussian representation with depth regularization and densify it to initialize the high-resolution Gaussians through a tailored Gaussian Shuffle operation. In the second stage, we refine the high-resolution Gaussians with super-resolved images generated from both the original sparse views and pseudo-views rendered by the low-resolution Gaussians, in which a customized blur-free inconsistency modeling scheme and a 3D robust optimization strategy are carefully designed to mitigate multi-view inconsistency and eliminate erroneous updates caused by imperfect supervision. Extensive experiments demonstrate superior results, in particular establishing new state-of-the-art performance with more consistent geometry and finer details.
Poster
Yihao Wang · Marcus Klasson · Matias Turkulainen · Shuzhe Wang · Juho Kannala · Arno Solin

[ ExHall D ]

Abstract
Gaussian splatting enables fast novel-view synthesis in static 3D environments. However, reconstructing real-world environments remains challenging as distractors or occluders break the multi-view consistency assumption required for good 3D reconstruction. Most existing methods for unconstrained novel-view synthesis rely on external semantic information from foundation models, which introduces additional computational overhead as pre-processing steps or during optimization. In this work, we propose a novel method that directly separates occluders and static scene elements purely based on volume-rendering of Gaussian primitives. We model occluders in training images as two-dimensional hallucinations embedded in the camera views and apply alpha-blending in a back-to-front manner in which a 3D scene and 2D occluders are explicitly modelled within the alpha-compositing stages. Our approach implicitly separates the scene into distinct static and distractor elements achieving comparable results to prior approaches with only photometric supervision. We demonstrate the effectiveness of our approach on in-the-wild datasets.
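The back-to-front alpha-blending of a 3D scene render with a per-image 2D occluder layer reduces to standard alpha compositing: the occluder sits in front of the static scene and is blended with its own alpha. The sketch below composites hypothetical arrays; in the paper both layers come from volume-rendered Gaussian primitives and the occluder layer is learned.

```python
import numpy as np

def composite_over(front_rgb, front_alpha, back_rgb):
    """Alpha-composite a front layer over a back layer (back-to-front order)."""
    return front_alpha * front_rgb + (1.0 - front_alpha) * back_rgb

H, W = 4, 6
scene_rgb = np.full((H, W, 3), 0.2)              # static 3D scene rendering
occluder_rgb = np.full((H, W, 3), 0.9)           # 2D occluder "hallucination" for this view
occluder_alpha = np.zeros((H, W, 1))
occluder_alpha[1:3, 2:4] = 1.0                   # occluder covers a small region

out = composite_over(occluder_rgb, occluder_alpha, scene_rgb)
print(out[0, 0], out[1, 2])                      # [0.2 ...] outside, [0.9 ...] under the occluder
```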
Poster
Zhihao Liu · Zhanglin Cheng · Naoto Yokoya

[ ExHall D ]

Abstract
Obtaining high-quality, practically usable 3D models of biological plants remains a significant challenge in computer vision and graphics. In this paper, we present a novel method for generating realistic 3D plant models from single-view photographs. Our approach employs a neural decomposition technique to learn a lightweight hierarchical box representation from the image, effectively capturing the structures and botanical features of plants. Then, this representation can be subsequently refined through a shape-guided parametric modeling module to produce complete 3D plant models. By combining hierarchical learning and parametric modeling, our method generates structured 3D plant assets with fine geometric details. Notably, through learning the decomposition in different levels of detail, our method can adapt to two distinct plant categories: outdoor trees and houseplants, each with unique appearance features. Within the scope of plant modeling, our method is the first comprehensive solution capable of reconstructing both plant categories from single-view images.
Poster
Xiang Li · Zixuan Huang · Anh Thai · James Rehg

[ ExHall D ]

Abstract
Symmetry is a ubiquitous and fundamental property in the visual world, serving as a critical cue for perception and structure interpretation. This paper investigates the detection of 3D reflection symmetry from a single RGB image, and reveals its significant benefit on single-image 3D generation. We introduce Reflect3D, a scalable, zero-shot symmetry detector capable of robust generalization to diverse and real-world scenarios. Inspired by the success of foundation models, our method scales up symmetry detection with a transformer-based architecture. We also leverage generative priors from multi-view diffusion models to address the inherent ambiguity in single-view symmetry detection. Extensive evaluations on various data sources demonstrate that Reflect3D establishes a new state-of-the-art in single-image symmetry detection. Furthermore, we show the practical benefit of incorporating detected symmetry into single-image 3D generation pipelines through a symmetry-aware optimization process. The integration of symmetry significantly enhances the structural accuracy, cohesiveness, and visual fidelity of the reconstructed 3D geometry and textures, advancing the capabilities of 3D content creation.
Poster
Zhao Dong · Ka chen · Zhaoyang Lv · Hong-Xing Yu · Yunzhi Zhang · Cheng Zhang · Yufeng Zhu · Stephen Tian · Zhengqin Li · Geordie Moffatt · Sean Christofferson · James Fort · Xiaqing Pan · Mingfei Yan · Jiajun Wu · Carl Ren · Richard Newcombe

[ ExHall D ]

Abstract
We introduce Digital Twin Catalog (DTC), a new large-scale photorealistic 3D object digital twin dataset. A digital twin of a 3D object is a highly detailed, virtually indistinguishable representation of a physical object, accurately capturing its shape, appearance, physical properties, and other attributes. Recent advances in neural-based 3D reconstruction and inverse rendering have significantly improved the quality of 3D object reconstruction. Despite these advancements, there remains a lack of a large-scale, digital twin quality real-world dataset and benchmark that can quantitatively assess and compare the performance of different reconstruction methods, as well as improve reconstruction quality through training or fine-tuning. Moreover, to democratize 3D digital twin creation, it is essential to integrate creation techniques with next-generation egocentric computing platforms, such as AR glasses. Currently, there is no dataset available to evaluate 3D object reconstruction using egocentric captured images. To address these gaps, the DTC dataset features 2,000 scanned digital twin-quality 3D objects, along with image sequences captured under different lighting conditions using DSLR cameras and egocentric AR glasses. This dataset establishes the first comprehensive real-world evaluation benchmark for 3D digital twin creation tasks, offering a robust foundation for comparing and improving existing reconstruction methods. We will make the full dataset …
Poster
Vitor Guizilini · Zubair Irshad · Dian Chen · Greg Shakhnarovich · Rares Andrei Ambrus

[ ExHall D ]

Abstract
Current methods for 3D scene reconstruction from sparse posed images employ intermediate 3D representations such as neural fields, voxel grids, or 3D Gaussians, to achieve multi-view consistent scene appearance and geometry. In this paper we introduce MVGD, a diffusion-based architecture capable of direct pixel-level generation of images and depth maps from novel viewpoints, given an arbitrary number of input views. Our method uses raymap conditioning to both augment visual features with spatial information from different viewpoints, as well as to guide the generation of images and depth maps from novel views. A key aspect of our approach is the multi-task generation of images and depth maps, using learnable task embeddings to guide the diffusion process towards specific modalities. We train this model on a collection of more than 60 million multi-view samples from publicly available datasets, and propose techniques to enable efficient and consistent learning in such diverse conditions. We also propose a novel strategy that enables the efficient training of larger models by incrementally fine-tuning smaller ones, with promising scaling behavior. Through extensive experiments, we report state-of-the-art results in multiple novel view synthesis benchmarks, as well as multi-view stereo and video depth estimation.
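A "raymap" in this sense is simply a per-pixel encoding of the camera ray (origin and direction) for a given pose and intrinsics, which can be concatenated with image features as conditioning. The sketch below builds such a map with standard pinhole geometry; the resolution, intrinsics, and pose are hypothetical, and the paper's exact encoding may differ (for example, Plücker coordinates).

```python
import numpy as np

def build_raymap(K, cam_to_world, height, width):
    """Per-pixel ray origins and directions (H x W x 6) for a pinhole camera."""
    v, u = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    pix = np.stack([u + 0.5, v + 0.5, np.ones_like(u, dtype=float)], axis=-1)   # H x W x 3
    dirs_cam = pix @ np.linalg.inv(K).T                        # unproject to camera space
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    dirs_world = dirs_cam @ R.T
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)
    origins = np.broadcast_to(t, dirs_world.shape)
    return np.concatenate([origins, dirs_world], axis=-1)      # H x W x 6

K = np.array([[500.0, 0, 128], [0, 500.0, 96], [0, 0, 1]])
pose = np.eye(4)                                               # camera at origin, looking down +z
raymap = build_raymap(K, pose, height=192, width=256)
print(raymap.shape)                                            # (192, 256, 6)
```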
Poster
Lingen Li · Zhaoyang Zhang · Yaowei Li · Jiale Xu · Wenbo Hu · Xiaoyu Li · Weihao Cheng · Jinwei Gu · Tianfan Xue · Ying Shan

[ ExHall D ]

Abstract
Recent advancements in generative models have significantly improved novel view synthesis (NVS) from multi-view data. However, existing methods still depend on external multi-view alignment processes, such as explicit pose estimation or pre-reconstruction, which limits their flexibility and accessibility, especially when alignment is unstable due to insufficient overlap or occlusions between views. In this paper, we propose NVComposer, a novel approach that eliminates the need for explicit external alignment. NVComposer enables the generative model to implicitly infer spatial and geometric relationships between multiple conditional views by introducing two key components: 1) an image-pose dual-stream diffusion model that simultaneously generates target novel views and condition camera poses, and 2) a geometry-aware feature alignment module that distills geometric priors from pretrained dense stereo models during training. Extensive experiments demonstrate that NVComposer achieves state-of-the-art performance in generative multi-view NVS tasks, removing the reliance on external alignment and thus improving model accessibility. Our approach shows substantial improvements in synthesis quality as the number of unposed input views increases, highlighting its potential for more flexible and accessible generative NVS systems.
Poster
Jingyu Lin · Jiaqi Gu · Lubin Fan · Bojian Wu · Yujing Lou · Renjie Chen · Ligang Liu · Jieping Ye

[ ExHall D ]

Abstract
Generating high-quality novel view renderings of 3D Gaussian Splatting (3DGS) in scenes featuring transient objects is challenging. We propose a novel hybrid representation, termed HybridGS, using 2D Gaussians for transient objects per image while maintaining traditional 3D Gaussians for the whole static scene. Note that 3DGS itself is better suited for modeling static scenes that assume multi-view consistency, but transient objects appear only occasionally and do not adhere to this assumption; we therefore model them as planar objects from a single view, represented with 2D Gaussians. Our novel representation decomposes the scene from the perspective of fundamental viewpoint consistency, making it more reasonable. Additionally, we present a novel multi-view regulated supervision method for 3DGS that leverages information from co-visible regions, further enhancing the distinctions between the transients and statics. Then, we propose a straightforward yet effective multi-stage training strategy to ensure robust training and high-quality view synthesis across various settings. Experiments on benchmark datasets show our state-of-the-art performance of novel view synthesis in both indoor and outdoor scenes, even in the presence of distracting elements.
Poster
Hanwen Liang · Junli Cao · Vidit Goel · Guocheng Qian · Sergei Korolev · Demetri Terzopoulos · Konstantinos N. Plataniotis · Sergey Tulyakov · Jian Ren

[ ExHall D ]

Abstract
This paper addresses a challenging question: how can we efficiently create high-quality, wide-scope 3D scenes from a single arbitrary image? Existing methods face several constraints, such as requiring multi-view data, time-consuming per-scene optimization, low visual quality, and distorted reconstructions for unseen areas. We propose a novel pipeline to overcome these limitations. Specifically, we introduce a large-scale reconstruction model that uses latents from a video diffusion model to predict 3D Gaussian Splatting, even from a single-condition image, in a feed-forward manner. The video diffusion model is designed to precisely follow a specified camera trajectory, allowing it to generate compressed latents that contain multi-view information while maintaining 3D consistency. We further train the 3D reconstruction model to operate on the video latent space with a progressive training strategy, enabling the generation of high-quality, wide-scope, and generic 3D scenes. Extensive evaluations on various datasets show that our model significantly outperforms existing methods for single-view 3D rendering, particularly with out-of-domain images. For the first time, we demonstrate that a 3D reconstruction model can be effectively built upon the latent space of a diffusion model to realize efficient 3D scene generation.
Poster
Zhen Lv · Yangqi Long · Congzhentao Huang · Cao Li · Chengfei Lv · Hao Ren · Dian Zheng

[ ExHall D ]

Abstract
Stereo video synthesis from a monocular input is a demanding task in the fields of spatial computing and virtual reality. The main challenges of this task lie in the insufficiency of high-quality paired stereo videos for training and the difficulty of maintaining spatio-temporal consistency between frames. Existing methods mainly handle these problems by directly applying novel view synthesis (NVS) methods to video, which is naturally unsuitable. In this paper, we introduce a novel self-supervised stereo video synthesis paradigm via a video diffusion model, termed SpatialDreamer, which meets the challenges head-on. Firstly, to address the stereo video data insufficiency, we propose a Depth-based Video Generation module (DVG), which employs a forward-backward rendering mechanism to generate paired videos with geometric and temporal priors. Leveraging data generated by DVG, we propose RefinerNet along with a self-supervised synthetic framework designed to facilitate efficient and dedicated training. More importantly, we devise a consistency control module, which consists of a metric of stereo deviation strength and a Temporal Interaction Learning (TIL) module to ensure geometric and temporal consistency, respectively. We evaluated the proposed method against various benchmark methods, with the results showcasing its superior performance.
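Depth-based generation of a stereo pair relies on the textbook relation between depth and disparity, d = f·B / z, which lets a left view be forward-warped into a synthetic right view (with holes left for later refinement). The minimal per-frame sketch below uses naive nearest-pixel splatting and hypothetical camera constants; the paper's forward-backward rendering and refinement stages are not shown.

```python
import numpy as np

def warp_left_to_right(left, depth, focal=500.0, baseline=0.1):
    """Forward-warp a left image to the right view using per-pixel depth (naive splatting)."""
    H, W, _ = left.shape
    right = np.zeros_like(left)
    filled = np.zeros((H, W), dtype=bool)
    disparity = focal * baseline / np.maximum(depth, 1e-6)    # in pixels
    for y in range(H):
        for x in range(W):
            xr = int(round(x - disparity[y, x]))              # right-view column
            if 0 <= xr < W:
                right[y, xr] = left[y, x]
                filled[y, xr] = True
    return right, filled                                       # holes where filled == False

left = np.random.rand(48, 64, 3)
depth = np.full((48, 64), 2.0)                                 # 2 m everywhere (hypothetical)
right, filled = warp_left_to_right(left, depth)
print(filled.mean())                                           # fraction of right view covered
```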
Poster
Yunzhi Yan · Zhen Xu · Haotong Lin · Haian Jin · Haoyu Guo · Yida Wang · Kun Zhan · XianPeng Lang · Hujun Bao · Xiaowei Zhou · Sida Peng

[ ExHall D ]

Abstract
This paper aims to tackle the problem of photorealistic view synthesis from vehicle sensor data. Recent advancements in neural scene representation have achieved notable success in rendering high-quality autonomous driving scenes, but the performance significantly degrades as the viewpoint deviates from the training trajectory. To mitigate this problem, we introduce StreetCrafter, a novel controllable video diffusion model that utilizes LiDAR point cloud renderings as pixel-level conditions, which fully exploits the generative prior for novel view synthesis while preserving precise camera control. Moreover, the utilization of pixel-level LiDAR conditions allows us to make accurate pixel-level edits to target scenes. In addition, the generative prior of StreetCrafter can be effectively incorporated into dynamic scene representations to achieve real-time rendering. Experiments on the Waymo Open and Pandaset datasets demonstrate that our model enables flexible control over viewpoint changes, enlarging the view synthesis regions for satisfactory rendering, and outperforms existing methods.
Poster
Jiadong Tang · Yu Gao · Dianyi Yang · Liqi Yan · Yufeng Yue · Yi Yang

[ ExHall D ]

Abstract
Drones have become essential tools for reconstructing wild scenes due to their outstanding maneuverability. Recent advances in radiance field methods have achieved remarkable rendering quality, providing a new avenue for 3D reconstruction from drone imagery. However, dynamic distractors in wild environments challenge the static scene assumption in radiance fields, while limited view constraints hinder the accurate capture of underlying scene geometry. To address these challenges, we introduce DroneSplat, a novel framework designed for robust 3D reconstruction from in-the-wild drone imagery. Our method adaptively adjusts masking thresholds by integrating local-global segmentation heuristics with statistical approaches, enabling precise identification and elimination of dynamic distractors in static scenes. We enhance 3D Gaussian Splatting with multi-view stereo predictions and a voxel-guided optimization strategy, supporting high-quality rendering under limited view constraints. For comprehensive evaluation, we provide a drone-captured 3D reconstruction dataset encompassing both dynamic and static scenes. Extensive experiments demonstrate that DroneSplat outperforms both 3DGS and NeRF baselines in handling in-the-wild drone imagery.
Poster
Cong Ruan · Yuesong Wang · Bin Zhang · Lili Ju · Tao Guan

[ ExHall D ]

Abstract
3D Gaussian Splatting (3DGS) has shown impressive performance in scene reconstruction, offering high rendering quality and rapid rendering speed with short training time. However, it often yields unsatisfactory results when applied to indoor scenes due to its poor ability to learn geometries without enough textural information. In this paper, we propose a new 3DGS-based method, "IndoorGS", which leverages the commonly found yet important geometric cues in indoor scenes to improve the reconstruction quality. Specifically, we first extract 2D lines from input images and fuse them into 3D line cues via feature-based matching, which can provide a structural understanding of the target scene. We then apply statistical outlier removal to refine Structure-from-Motion (SfM) points, ensuring robust cues in texture-rich areas. Based on these two types of cues, we further extract reliable 3D plane cues for textureless regions. Such geometric information will be utilized not only for initialization but also in the realization of a geometric-cue-guided adaptive density control (ADC) strategy. The proposed ADC approach is grounded in the principle of divide-and-conquer and optimizes the use of each type of geometric cue to enhance overall reconstruction performance. Extensive experiments on multiple indoor datasets show that our method can deliver much more …
Poster
Xiaohao Xu · Feng Xue · Shibo Zhao · Yike Pan · Sebastian Scherer · Xiaonan Huang

[ ExHall D ]

Abstract
Real-time multi-agent collaboration for ego-motion estimation and high-fidelity 3D reconstruction is vital for scalable spatial intelligence. However, traditional methods produce sparse, low-detail maps, while recent dense mapping approaches struggle with high latency. To overcome these challenges, we present MAC-Ego3D, a novel framework for real-time collaborative photorealistic 3D reconstruction via Multi-Agent Gaussian Consensus. MAC-Ego3D enables agents to independently construct, align, and iteratively refine local maps using a unified Gaussian splat representation. Through Intra-Agent Gaussian Consensus, it enforces spatial coherence among neighboring Gaussian splats within an agent. For global alignment, parallelized Inter-Agent Gaussian Consensus, which asynchronously aligns and optimizes local maps by regularizing multi-agent Gaussian splats, seamlessly integrates them into a high-fidelity 3D model. Leveraging Gaussian primitives, MAC-Ego3D supports efficient RGB-D rendering, enabling rapid inter-agent Gaussian association and alignment. MAC-Ego3D bridges local precision and global coherence, delivering higher efficiency, largely reducing localization error, and improving mapping fidelity. It establishes a new SOTA on synthetic and real-world benchmarks, achieving a 15× increase in inference speed, order-of-magnitude reductions in ego-motion estimation error for partial cases, and RGB PSNR gains of 4 to 10 dB. Our code will be made publicly available.
Poster
Sangmin Kim · Seunguk Do · Jaesik Park

[ ExHall D ]

Abstract
Reconstructing dynamic radiance fields from video clips is challenging, especially when entertainment videos like TV shows are given. The reconstruction is difficult due to (1) actors occluding each other and having diverse facial expressions, (2) cluttered stages, and (3) small baseline views or sudden shot changes. To address these issues, we present VirtualPCR, a comprehensive reconstruction pipeline that allows the editing of scenes like how video clips are made in a production control room (PCR). In VirtualPCR, a 3DLocator module locates recovered actors on the stage using a depth prior, and it estimates unseen human poses via interpolation. The proposed ShotMatcher module tracks the actors under shot changes. VirtualPCR has a face-fitting branch that dynamically recovers the actors' expressions. Experiments on the Sitcoms3D dataset show that our pipeline can reassemble TV shows with new cameras at different time steps. We also show that VirtualPCR offers interesting applications, such as scene editing, synthetic shot-making, relocating actors, and human pose manipulation.
Poster
Qiang Hu · Zihan Zheng · Houqiang Zhong · Sihua Fu · Li Song · Xiaoyun Zhang · Guangtao Zhai · Yanfeng Wang

[ ExHall D ]

Abstract
3D Gaussian Splatting (3DGS) has substantial potential for enabling photorealistic Free-Viewpoint Video (FVV) experiences. However, the vast number of Gaussians and their associated attributes poses significant challenges for storage and transmission. Existing methods typically handle dynamic 3DGS representation and compression separately, neglecting motion information and the rate-distortion (RD) trade-off during training, leading to performance degradation and increased model redundancy. To address this gap, we propose 4DGC, a novel rate-aware 4D Gaussian compression framework that significantly reduces storage size while maintaining superior RD performance for FVV. Specifically, 4DGC introduces a motion-aware dynamic Gaussian representation that utilizes a compact motion grid combined with sparse compensated Gaussians to exploit inter-frame similarities. This representation effectively handles large motions, preserving quality and reducing temporal redundancy. Furthermore, we present an end-to-end compression scheme that employs differentiable quantization and a tiny implicit entropy model to compress the motion grid and compensated Gaussians efficiently. The entire framework is jointly optimized using a rate-distortion trade-off. Extensive experiments demonstrate that 4DGC supports variable bitrates and consistently outperforms existing methods in RD performance across multiple datasets.
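Differentiable quantization with a small entropy model typically means rounding with a straight-through surrogate in the backward pass, plus a learned probability model whose negative log-likelihood approximates the bitrate, so a single loss can trade rate against distortion. The sketch below uses a factorized Gaussian entropy model and hypothetical tensors; the actual entropy model and motion-grid structure in the paper are more elaborate.

```python
import torch

def quantize_ste(x, step=0.05):
    """Round to a grid in the forward pass; identity gradient in the backward pass."""
    q = torch.round(x / step) * step
    return x + (q - x).detach()

def rate_bits(q, mu, log_sigma, step=0.05):
    """Approximate bits via a per-element Gaussian entropy model: -log2 P(q lands in its bin)."""
    dist = torch.distributions.Normal(mu, log_sigma.exp())
    p_bin = dist.cdf(q + step / 2) - dist.cdf(q - step / 2)
    return (-torch.log2(p_bin.clamp_min(1e-9))).sum()

motion_grid = torch.randn(1000, requires_grad=True)     # stand-in for compressible parameters
mu = torch.zeros(1000)
log_sigma = torch.zeros(1000, requires_grad=True)

q = quantize_ste(motion_grid)
distortion = ((q - motion_grid.detach()) ** 2).mean()    # stand-in for the rendering loss
loss = distortion + 1e-4 * rate_bits(q, mu, log_sigma)   # rate-distortion trade-off
loss.backward()
print(motion_grid.grad is not None, log_sigma.grad is not None)   # True True
```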
Poster
Yiming Liang · Tianhan Xu · Yuta Kikuchi

[ ExHall D ]

Abstract
We present Hierarchical Motion Representation (HiMoR), a novel deformation representation for 3D Gaussian primitives that is capable of achieving high-quality monocular dynamic 3D reconstruction. HiMoR's foundation is the insight that motions in everyday scenes can be decomposed into coarser motions that form the basis for finer details. Using a tree structure, HiMoR's nodes represent different levels of motion detail, with shallower nodes depicting coarse motion for temporal smoothness and deeper nodes denoting finer motion. Additionally, our model uses a few shared motion bases to represent the motions of different sets of nodes, aligning with the assumption that motion is generally low-rank. This motion representation design provides Gaussians with a more rational deformation, maximizing the use of temporal relationships to tackle the challenging task of monocular dynamic 3D reconstruction. We also propose using a more reliable perceptual metric as a substitute, given that pixel-level metrics for evaluating monocular dynamic 3D reconstruction can sometimes fail to effectively reflect true reconstruction quality. Extensive experiments demonstrate our method's efficacy in achieving superior novel view synthesis from challenging monocular videos with complex motions.
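The shared-motion-bases assumption can be written as a low-rank factorization: each node's trajectory is a weighted sum of K learned basis trajectories, x_n(t) = x_n(0) + Σ_k w_{n,k} B_k(t). The sketch below evaluates such a factorized motion model with hypothetical shapes; the coarse-to-fine tree over nodes described in the abstract is omitted here.

```python
import torch

N, K, T = 2048, 6, 30                  # nodes, shared bases, frames (hypothetical)
rest_pos = torch.randn(N, 3)           # node positions at t = 0
weights = torch.randn(N, K) * 0.1      # per-node blend weights over the shared bases
bases = torch.randn(K, T, 3) * 0.05    # each basis is a 3D displacement trajectory

def node_positions(frame):
    """Positions of all nodes at a given frame: low-rank blend of shared motion bases."""
    displacement = weights @ bases[:, frame, :]     # (N, K) @ (K, 3) -> (N, 3)
    return rest_pos + displacement

print(node_positions(frame=10).shape)   # torch.Size([2048, 3])
```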
Poster
Siyuan Shen · Tianjia Shao · Kun Zhou · Chenfanfu Jiang · Yin Yang

[ ExHall D ]

Abstract
This paper presents a novel pipeline named EnliveningGS, which enables active locomotion of 3D models represented with 3D Gaussian splatting (3DGS). We are inspired by the fact that real-world living creatures pose their bodies in a natural and physically meaningful manner by compressing or elongating muscle fibers embedded in the body. EnliveningGS aims to replicate similar functionality for 3DGS models so that the object within a 3DGS scene acts like a living creature rather than a static shape --- it walks, jumps, and twists in the scene under provided motion trajectories driven by muscle activations. While the concept is straightforward, many challenging technical difficulties need to be taken care of. Synthesizing realistic locomotion of a 3DGS model embodies an inverse physics problem of very high dimensions. The core challenge is how to efficiently and robustly model frictional contacts between an "enlivened model" and the environment, as it is the composition of contact/collision/friction forces triggered by muscle activation that generates the final movement of the object. We propose a hybrid numerical method mixing the LCP and penalty methods to tackle this NP-hard problem robustly. Our pipeline also addresses the limitations of existing 3DGS deformation algorithms and inpaints the missing information when models …
Poster
Hongye Cheng · Tianyu Wang · guangsi shi · Zexing Zhao · Yanwei Fu

[ ExHall D ]

Abstract
Co-speech gestures are crucial non-verbal cues that enhance speech clarity and expressiveness in human communication, which have attracted increasing attention in multimodal research. While existing methods have made strides in gesture accuracy, challenges remain in generating diverse and coherent gestures, as most approaches assume independence among multimodal inputs and lack explicit modeling of their interactions. In this work, we propose a novel multimodal learning method named HOP for co-speech gesture generation that captures the heterogeneous entanglement between gesture motion, audio rhythm, and text semantics, enabling the generation of coordinated gestures. By leveraging spatiotemporal graph modeling, we achieve the alignment of audio and action. Moreover, to enhance modality coherence, we build the audio-text semantic representation based on a reprogramming module, which is beneficial for cross-modality adaptation. Our approach enables the three modalities to learn from each other's features and represent them in the form of topological entanglement. Extensive experiments demonstrate that HOP achieves state-of-the-art performance, offering more natural and expressive co-speech gesture generation.
Poster
Haolin Liu · Xiaohang Zhan · Zizheng Yan · Zhongjin Luo · Yuxin Wen · Xiaoguang Han

[ ExHall D ]

Abstract
Establishing character shape correspondence is a critical and fundamental task in computer vision and graphics, with diverse applications including re-topology, attribute transfer, and shape interpolation. Current dominant functional map methods, while effective in controlled scenarios, struggle in real situations with more complex challenges such as non-isometric shape discrepancies. In response, we revisit registration-for-correspondence methods and tap their potential for more stable shape correspondence estimation. To overcome their common issues including unstable deformations and the necessity for pre-alignment or high-quality initial 3D correspondences, we introduce Stable-SCore: A Stable Registration-based Framework for 3D Shape Correspondence. We first re-purpose a foundation model for 2D character correspondence that ensures reliable and stable 2D mappings. Crucially, we propose a novel Semantic Flow Guided Registration approach that leverages 2D correspondence to guide mesh deformations. Our framework significantly surpasses existing methods in more challenging scenarios, and brings possibilities for a wide array of real applications, as demonstrated in our results.
Poster
Bohan Yu · Jinxiu Liang · Zhuofeng Wang · Bin Fan · Art Subpaasa · Boxin Shi · Imari Sato

[ ExHall D ]

Abstract
Hyperspectral imaging plays a critical role in numerous scientific and industrial fields. Conventional hyperspectral imaging systems often struggle with the trade-off between capture speed, spectral resolution, and bandwidth, particularly in dynamic environments. In this work, we present a novel event-based active hyperspectral imaging system designed for real-time capture with low bandwidth in dynamic scenes. By combining an event camera with a dynamic illumination strategy, our system achieves unprecedented temporal resolution while maintaining high spectral fidelity, all at a fraction of the bandwidth requirements of traditional systems. Unlike basis-based methods that sacrifice spectral resolution for efficiency, our approach enables continuous spectral sampling through an innovative "sweeping rainbow" illumination pattern synchronized with a rotating mirror array. The key insight is leveraging the sparse, asynchronous nature of event cameras to encode spectral variations as temporal contrasts, effectively transforming the spectral reconstruction problem into a series of geometric constraints. Extensive evaluations on both synthetic and real data demonstrate that our system outperforms state-of-the-art methods in temporal resolution while maintaining competitive spectral reconstruction quality.
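To make the timestamp-to-spectrum idea concrete, here is a toy mapping from event timestamps to the wavelength illuminating the scene at that instant, assuming a linear, periodic sweep. The sweep period and wavelength range are illustrative values, not the system's actual calibration.

import numpy as np

def event_wavelength(t, sweep_period, lam_min=400.0, lam_max=700.0):
    # Map event timestamps (seconds) to wavelengths (nm) under a linear,
    # periodic spectral sweep synchronized with the illumination.
    phase = np.mod(t, sweep_period) / sweep_period
    return lam_min + phase * (lam_max - lam_min)

# Example: events at 0, 5, and 15 ms under a 10 ms sweep map to 400, 550, 550 nm.
print(event_wavelength(np.array([0.0, 0.005, 0.015]), 0.010))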
Poster
Yaniv Benny · Lior Wolf

[ ExHall D ]

Abstract
This paper proposes a novel method for omnidirectional 360° perception. Most common previous methods relied on equirectangular projection. This representation is easily applicable to 2D operation layers but introduces distortions into the image. Other methods attempted to remove the distortions by maintaining a sphere representation but relied on complicated convolution kernels that failed to show competitive results. In this work, we introduce a transformer-based architecture that, by incorporating a novel "Spherical Local Self-Attention" and other spherically-oriented modules, successfully operates in the spherical domain and outperforms the state-of-the-art in 360° perception benchmarks for depth estimation and semantic segmentation. Our code is attached as supplementary material.
Poster
Huan Zheng · Wencheng Han · Jianbing Shen

[ ExHall D ]

Abstract
Recovering high-quality depth maps from compressed sources has gained significant attention due to the limitations of consumer-grade depth cameras and the bandwidth restrictions during data transmission. However, current methods still suffer from two challenges. First, bit-depth compression produces a uniform depth representation in regions with subtle variations, hindering the recovery of detailed information. Second, densely distributed random noise reduces the accuracy of estimating the global geometric structure of the scene. To address these challenges, we propose a novel framework, termed geometry-decoupled network (GDNet), for compressed depth map super-resolution that decouples the high-quality depth map reconstruction process by handling global and detailed geometric features separately. To be specific, we propose the fine geometry detail encoder (FGDE), which is designed to aggregate fine geometry details in high-resolution low-level image features while simultaneously enriching them with complementary information from low-resolution context-level image features. In addition, we develop the global geometry encoder (GGE) that aims at suppressing noise and extracting global geometric information effectively via constructing compact feature representation in a low-rank space. We conduct experiments on multiple benchmark datasets, demonstrating that our GDNet significantly outperforms current methods in terms of geometric consistency and detail recovery. Our codes will be available.
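The phrase "compact feature representation in a low-rank space" can be illustrated with a generic truncated-SVD projection that suppresses noise in a flattened feature map; this is a textbook stand-in, not the actual GGE module.

import torch

def low_rank_project(feat, rank=8):
    # Keep only the top-'rank' singular components of a (C, H*W) feature
    # map, a simple low-rank approximation that discards high-rank noise.
    U, S, Vh = torch.linalg.svd(feat, full_matrices=False)
    return U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank]

feat = torch.randn(64, 32 * 32)           # toy flattened feature map
compact = low_rank_project(feat, rank=8)  # same shape, rank-8 content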
Poster
Hongkai Lin · Dingkang Liang · Zhenghao Qi · Xiang Bai

[ ExHall D ]

Abstract
Underwater dense prediction, especially depth estimation and semantic segmentation, is crucial for comprehensively understanding underwater scenes. Nevertheless, high-quality and large-scale underwater datasets with dense annotations remain scarce because of the complex environment and the exorbitant data collection costs. This paper proposes a unified Text-to-Image and DEnse annotation generation method (TIDE) for underwater scenes. It relies solely on text as input to simultaneously generate realistic underwater images and multiple highly consistent dense annotations. Specifically, we unify the generation of text-to-image and text-to-dense annotations within a single model. The Implicit Layout Sharing mechanism (ILS) and a cross-modal interaction method called Time Adaptive Normalization (TAN) are introduced to jointly optimize the consistency between image and dense annotations. We synthesize a large underwater dataset using TIDE to validate the effectiveness of our method in underwater dense prediction tasks. The results demonstrate that our method effectively improves the performance of existing underwater dense prediction models and mitigates the scarcity of underwater data with dense annotations. Our method can offer new perspectives on alleviating data scarcity issues in other fields.
Poster
Jianing Li · Yunjian Zhang · Haiqian Han · Xiangyang Ji

[ ExHall D ]

Abstract
Conventional frame-based imaging for active stereo systems has encountered major challenges in fast-motion scenarios. However, how to design a novel paradigm for high-speed depth sensing still remains an open issue. In this paper, we propose a novel problem setting, namely active stereo event-based vision, which is the first to integrate binocular event cameras and an infrared projector for high-speed depth sensing. Technically, we first build a stereo camera prototype system and present a real-world dataset with over 21.5k spatiotemporal synchronized labels at 15 Hz, while also creating a realistic synthetic dataset with stereo event streams and 23.8k synchronized labels at 20 Hz. Then, we propose ActiveEventNet, a lightweight yet effective active event-based stereo matching neural network that learns to generate high-quality dense disparity maps from stereo event streams with low latency. Experiments demonstrate that our ActiveEventNet outperforms state-of-the-art methods while significantly reducing computational complexity. Our solution offers superior depth sensing compared to conventional stereo cameras in high-speed scenarios, while also achieving inference speeds of up to 150 FPS with our prototype. We believe that this novel paradigm will provide new insights into future depth sensing systems. Our dataset descriptions and open-source code will be available in the supplemental material.
Poster
Zidong Cao · Jinjing Zhu · Weiming Zhang · Hao Ai · Haotian Bai · Hengshuang Zhao · Lin Wang

[ ExHall D ]

Abstract
Recently, Depth Anything Models (DAMs) - a type of depth foundation model - have demonstrated impressive zero-shot capabilities across diverse perspective images. Despite their success, it remains an open question regarding DAMs' performance on panorama images that enjoy a large field-of-view (180°×360°) but suffer from spherical distortions. To address this gap, we conduct an empirical analysis to evaluate the performance of DAMs on panoramic images and identify their limitations. For this, we undertake comprehensive experiments to assess the performance of DAMs from three key factors: panoramic representations, 360 camera positions for capturing scenarios, and spherical spatial transformations. This way, we reveal some key findings, e.g., DAMs are sensitive to spatial transformations. We then propose a semi-supervised learning (SSL) framework to learn a panoramic DAM, dubbed PanDA. Under the umbrella of SSL, PanDA first learns a teacher model by fine-tuning DAM through joint training on synthetic indoor and outdoor panoramic datasets. Then, a student model is trained using large-scale unlabeled data, leveraging pseudo-labels generated by the teacher model. To enhance PanDA's generalization capability, Möbius transformation-based spatial augmentation (MTSA) is proposed to impose consistency regularization between the predicted depth maps from the original and spatially transformed ones. This subtly improves the student …
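A minimal sketch of the consistency-regularization idea, assuming PyTorch and generic invertible warps standing in for the Möbius-based augmentation; the model call, the transforms, and the L1 penalty are illustrative choices rather than the paper's exact formulation.

import torch

def consistency_loss(depth_model, image, transform, inverse_transform):
    # The depth predicted on a spatially transformed panorama should match
    # the (inverse-)transformed depth predicted on the original panorama.
    # 'transform' / 'inverse_transform' are any invertible warps acting on
    # image-shaped tensors (here standing in for Mobius-based augmentation).
    with torch.no_grad():
        depth_orig = depth_model(image)  # could also be a teacher's pseudo-label
    depth_aug = depth_model(transform(image))
    return torch.mean(torch.abs(inverse_transform(depth_aug) - depth_orig))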
Poster
Xunzhi Zheng · Dan Xu

[ ExHall D ]

Abstract
Learning accurate scene reconstruction without pose priors in neural radiance fields is challenging due to inherent geometric ambiguity. Recent developments either rely on correspondence priors for regularization or use off-the-shelf flow estimators to derive analytical poses. However, the potential for jointly learning scene geometry, camera poses, and dense flow within a unified neural representation remains largely unexplored. In this paper, we present Flow-NeRF, a unified framework that simultaneously optimizes scene geometry, camera poses, and dense optical flow all on-the-fly. To enable the learning of dense flow within the neural radiance field, we design and build a bijective mapping for flow estimation, conditioned on pose. To make the scene reconstruction benefit from the flow estimation, we develop an effective feature enhancement mechanism to pass canonical space features to world space representations, significantly enhancing scene geometry. We validate our model across four important tasks, i.e., novel view synthesis, depth estimation, camera pose prediction, and dense optical flow estimation, using several datasets. Our approach surpasses previous methods in almost all metrics for novel view synthesis and depth estimation and yields both qualitatively sound and quantitatively accurate novel-view flow. Our code and models will be made publicly available upon acceptance.
Poster
Jiaxi Deng · Yushen Wang · Haitao Meng · Zuoxun Hou · Yi Chang · Gang Chen

[ ExHall D ]

Abstract
Fast and reliable omnidirectional 3D sensing is essential to many applications such as autonomous driving, robotics and drone navigation. While many well-recognized methods have been developed to produce high-quality omnidirectional 3D information, they are too slow for real-time computation, limiting their feasibility in practical applications. Motivated by these shortcomings, we propose an efficient omnidirectional depth sensing framework, called OmniStereo, which generates high-quality 3D information in real-time. Unlike prior works, OmniStereo employs Cassini projection to simplify the photometric matching and introduces a lightweight stereo matching network to minimize computational overhead. Additionally, OmniStereo proposes a novel fusion method to handle depth discontinuities and invalid pixels complemented by a refinement module to reduce mapping-introduced errors and recover fine details. As a result, OmniStereo achieves state-of-the-art (SOTA) accuracy, surpassing the second-best method by over 32% in MAE, while maintaining real-time efficiency. It runs more than 16.5× faster than the second-most-accurate method on a TITAN RTX, achieving 12.3 FPS on the embedded device Jetson AGX Orin, underscoring its suitability for real-world deployment. The code will be open-sourced upon acceptance of the paper.
Poster
Luca Bartolomei · Fabio Tosi · Matteo Poggi · Stefano Mattoccia

[ ExHall D ]

Abstract
We introduce Stereo Anywhere, a novel stereo-matching framework that combines geometric constraints with robust priors from monocular depth Vision Foundation Models (VFMs). By elegantly coupling these complementary worlds through a dual-branch architecture, we seamlessly integrate stereo matching with learned contextual cues. Following this design, our framework introduces novel cost volume fusion mechanisms that effectively handle critical challenges such as textureless regions, occlusions, and non-Lambertian surfaces. Through our novel optical illusion dataset, MonoTrap, and extensive evaluation across multiple benchmarks, we demonstrate that our synthetic-only trained model achieves state-of-the-art results in zero-shot generalization, significantly outperforming existing solutions while showing remarkable robustness to challenging cases such as mirrors and transparencies.
Poster
Luigi Piccinelli · Christos Sakaridis · Mattia Segu · Yung-Hsu Yang · Siyuan Li · Wim Abbeloos · Luc Van Gool

[ ExHall D ]

Abstract
Monocular 3D estimation is crucial for visual perception. However, current methods fall short by relying on oversimplified assumptions, such as pinhole camera models or rectified images. These limitations severely restrict their general applicability, causing poor performance in real-world scenarios with fisheye or panoramic images and resulting in substantial context loss. To address this, we present UniK3D, the first generalizable method for monocular 3D estimation able to model any camera. Our method introduces a spherical 3D representation which allows for better disentanglement of camera and scene geometry and enables accurate metric 3D reconstruction for unconstrained camera models. Our camera component features a novel, model-independent representation of the pencil of rays, achieved through a learned superposition of spherical harmonics. We also introduce an angular loss, which, together with the camera module design, prevents the contraction of the 3D outputs for wide-view cameras. A comprehensive zero-shot evaluation on 13 diverse datasets demonstrates the state-of-the-art performance of UniK3D across 3D, depth, and camera metrics, with substantial gains in challenging large-field-of-view and panoramic settings, while maintaining top accuracy in conventional pinhole small-field-of-view domains. Code and models will be made public.
Poster
Yihan Wang · Linfei Pan · Marc Pollefeys · Viktor Larsson

[ ExHall D ]

Abstract
In this paper, we present a new Structure-from-Motion pipeline that uses a non-parametric camera projection model. The model is self-calibrated during the reconstruction process and can fit a wide variety of cameras, ranging from simple low-distortion pinhole cameras to more extreme optical systems such as fisheye or catadioptric cameras. The key component in our framework is an adaptive calibration procedure that can estimate partial calibrations, only modeling regions of the image where sufficient constraints are available. In experiments, we show that our method achieves comparable accuracy in scenarios where traditional Structure-from-Motion pipelines excel, and it outperforms its parametric counterparts in cases where they are unable to self-calibrate their parametric models.
Poster
Yohann Cabon · Lucas Stoffl · Leonid Antsfeld · Gabriela Csurka · Boris Chidlovskii · Jerome Revaud · Vincent Leroy

[ ExHall D ]

Abstract
DUSt3R introduced a novel paradigm in geometric computer vision by proposing a model that can provide dense and unconstrained Stereo 3D Reconstruction of arbitrary image collections with no prior information about camera calibration nor viewpoint poses. Under the hood, however, DUSt3R processes image pairs, regressing local 3D reconstructions that need to be aligned in a global coordinate system. The number of pairs, growing quadratically, is an inherent limitation that becomes especially concerning for robust and fast optimization in the case of large image collections. In this paper, we propose an extension of DUSt3R from pairs to multiple views, that addresses all aforementioned concerns. Indeed, we propose a Multi-view Network for Stereo 3D Reconstruction, or MUSt3R, that modifies the DUSt3R architecture by making it symmetric and extending it to directly predict 3D structure for all views in a common coordinate frame. We further endow the model with a multi-layer memory mechanism, which reduces the computational complexity and scales the reconstruction to large collections, inferring 3D pointmaps at high frame rates with limited added complexity. The framework is designed to perform 3D reconstruction both offline and online, and hence can be seamlessly applied to SfM and SLAM scenarios showing state-of-the-art …
Poster
Hana Bezalel · Dotan Ankri · Ruojin Cai · Hadar Averbuch-Elor

[ ExHall D ]

Abstract
We present a technique and benchmark dataset for estimating the relative 3D orientation between a pair of Internet images captured in an extreme setting, where the images have limited or non-overlapping fields of view. Prior work targeting extreme rotation estimation assumes constrained 3D environments and emulates perspective images by cropping regions from panoramic views. However, real images captured in the wild are highly diverse, exhibiting variation in both appearance and camera intrinsics. In this work, we propose a Transformer-based method for estimating relative rotations in extreme real-world settings, and contribute the ExtremeLandmarkPairs dataset, assembled from scene-level Internet photo collections. Our evaluation demonstrates that our approach succeeds in estimating the relative rotations in a wide variety of extreme-view Internet image pairs, outperforming various baselines, including dedicated rotation estimation techniques and contemporary 3D reconstruction methods. We will release our data, code, and trained models.
Poster
Wonbong Jang · Philippe Weinzaepfel · Vincent Leroy · Lourdes Agapito · Jerome Revaud

[ ExHall D ]

Abstract
We present Pow3R, a novel large 3D vision regression model that is highly versatile in the input modalities it accepts. Unlike previous feed-forward models that lack any mechanism to exploit known camera or scene priors at test time, Pow3R incorporates any combination of auxiliary information such as intrinsics, relative pose, dense or sparse depth, alongside input images, within a single network. Building upon the recent DUSt3R paradigm, a transformer-based architecture that leverages powerful pre-training, our lightweight and versatile conditioning acts as additional guidance for the network to predict more accurate estimates when auxiliary information is available. During training we feed the model with random subsets of modalities at each iteration, which enables the model to operate under different levels of known priors at test time. This in turn opens up new capabilities, such as performing inference at native image resolution or point-cloud completion. Our experiments on 3D reconstruction, depth completion, multi-view depth prediction, multi-view stereo, and multi-view pose estimation tasks yield state-of-the-art results and confirm the effectiveness of Pow3R at exploiting all available information.
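The training scheme of feeding random subsets of modalities at each iteration can be pictured with a small helper that keeps each auxiliary input with some probability; the modality names and keep probability below are assumptions for the example, not Pow3R's actual interface.

import random

def sample_modalities(batch, p_keep=0.5):
    # Randomly keep a subset of auxiliary inputs each iteration so the
    # network learns to operate under any combination of known priors.
    # 'batch' maps modality names (e.g. 'intrinsics', 'rel_pose',
    # 'sparse_depth') to tensors; images are always kept.
    kept = {"images": batch["images"]}
    for name, value in batch.items():
        if name != "images" and random.random() < p_keep:
            kept[name] = value
    return kept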
Poster
Maxime Pietrantoni · Gabriela Csurka · Torsten Sattler

[ ExHall D ]

Abstract
Visual localization is the task of estimating a camera pose in a known environment. In this paper, we utilize 3D Gaussian Splatting (3DGS)-based representations for accurate and privacy-preserving visual localization. We propose Gaussian Splatting Feature Fields (GSFFs), a scene representation for visual localization that combines an explicit geometry model (3DGS) with an implicit feature field. We leverage the dense geometric information and differentiable rasterization algorithm from 3DGS to learn robust feature representations grounded in 3D. In particular, we align a 3D scale-aware feature field and a 2D feature encoder in a common embedding space through a contrastive framework. Using a 3D structure-informed clustering procedure, we further regularize the representation learning and seamlessly convert the features to segmentations, which can be used for privacy-preserving visual localization. Localization is solved through pose refinement by aligning feature maps or segmentations from a query image and feature maps or segmentations rendered from the GSFFs scene representation. The resulting privacy- and non-privacy-preserving localization pipelines are evaluated on multiple real-world datasets, and outperform state-of-the-art baselines.
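One generic way to align two feature sets in a common embedding space is a symmetric InfoNCE objective, shown below as a plausible instantiation of the contrastive framework mentioned above; the temperature and the pairing-by-index convention are assumptions.

import torch
import torch.nn.functional as F

def contrastive_alignment(feat3d, feat2d, temperature=0.07):
    # Pull together paired 3D-field and 2D-encoder features (both (N, D))
    # in a shared embedding space while pushing apart non-paired ones.
    a = F.normalize(feat3d, dim=1)
    b = F.normalize(feat2d, dim=1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.shape[0], device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))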
Poster
Jonathan Astermark · Anders Heyden · Viktor Larsson

[ ExHall D ]

Abstract
In this paper, we speed up robust estimation from dense two-view correspondences. Previous work has shown that dense matchers can significantly improve both the accuracy and robustness of two-view relative pose estimation. However, the large number of matches comes with a significantly increased runtime during robust estimation. To avoid this, we propose an efficient match summarization scheme which provides comparable accuracy to using the full set of dense matches, while having 10-100x faster runtime. We validate our approach on standard benchmark datasets together with multiple state-of-the-art dense matchers.
Poster
Honggyu An · Jin Hyeon Kim · Seonghoon Park · Sunghwan Hong · Jaewoo Jung · Jisang Han · Seungryong Kim

[ ExHall D ]

Abstract
In this work, we analyze new aspects of cross-view completion, mainly through the analogy of cross-view completion and traditional self-supervised correspondence learning algorithms. Based on our analysis, we reveal that the cross-attention map of Croco-v2 best reflects this correspondence information, compared to other correlations from the encoder or decoder features. We further verify the effectiveness of the cross-attention map by evaluating it on both zero-shot and supervised dense geometric correspondence and on multi-frame depth estimation.
Poster
David Yifan Yao · Albert J. Zhai · Shenlong Wang

[ ExHall D ]

Abstract
This paper presents a unified approach to understanding dynamic scenes from casual videos. Large pretrained vision models, such as vision-language, video depth prediction, motion tracking, and segmentation models, offer promising capabilities. However, training a single model for comprehensive 4D understanding remains challenging. We introduce Uni4D, a multi-stage optimization framework that harnesses multiple pretrained models to advance dynamic 3D modeling, including static/dynamic reconstruction, camera pose estimation, and dense 3D motion tracking. Our results show state-of-the-art performance in dynamic 4D modeling with superior visual quality. Notably, Uni4D requires no retraining or fine-tuning, highlighting the effectiveness of repurposing large visual models for 4D understanding.
Poster
Yuzhen Liu · Qiulei Dong

[ ExHall D ]

Abstract
Relative camera pose estimation between two images is a fundamental task in 3D computer vision. Recently, many relative pose estimation networks have been explored for learning a mapping from two input images to their corresponding relative pose; however, the relative poses estimated by these methods do not have the intrinsic Pose Permutation Equivariance (PPE) property: the estimated relative pose from Image A to Image B should be the inverse of that from Image B to Image A. It means that permuting the input order of two images would cause these methods to obtain inconsistent relative poses. To address this problem, we first introduce the concept of PPE mapping, which indicates such a mapping that captures the intrinsic PPE property of relative poses. Then by enforcing the aforementioned PPE property, we propose a general framework for relative pose estimation, called EquiPose, which could easily accommodate various relative pose estimation networks in literature as its baseline models. We further theoretically prove that the proposed EquiPose framework could guarantee that its obtained mapping is a PPE mapping. Given a pre-trained baseline model, the proposed EquiPose framework could improve its performance even without fine-tuning, and could further boost its performance with fine-tuning. Experimental results …
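The PPE property itself is easy to state in code: composing the estimated pose from Image A to Image B with the estimate from Image B to Image A should give the identity. The check below is only an illustration of the property (and could double as a consistency penalty); it is not the EquiPose construction.

import torch

def ppe_consistency(pose_ab, pose_ba):
    # pose_ab, pose_ba: (B, 4, 4) homogeneous relative-pose estimates in the
    # two input orders. For a PPE mapping, pose_ab @ pose_ba is the identity,
    # so this residual is zero for every image pair.
    eye = torch.eye(4, device=pose_ab.device).expand_as(pose_ab)
    return torch.mean(torch.abs(pose_ab @ pose_ba - eye))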
Poster
Krispin Wandel · Hesheng Wang

[ ExHall D ]

Abstract
Semantic correspondence has made tremendous progress through recent advancements in large vision models (LVMs). While these LVMs have been shown to reliably capture local semantics, the same cannot currently be said for capturing global geometric relationships between semantic object regions. This problem leads to unreliable performance for semantic correspondence between images with extreme view variation. In this work, we aim to leverage monocular depth estimates to capture these geometric relationships for more robust and data-efficient semantic correspondence. First, we introduce a simple but effective method to build 3D object-class representations from monocular depth estimates and LVM features using a sparsely annotated image correspondence dataset. Second, we formulate an alignment energy that can be minimized using gradient descent to obtain an alignment between the 3D object-class representation and the object-class instance in the input RGB-image. Our method achieves state-of-the-art matching accuracy in multiple categories on the challenging SPair-71k dataset, increasing the PCK@0.1 score by more than 10 points on three categories and overall by 3.3 points from 85.6% to 88.9%.
Poster
Yufu Wang · Yu Sun · Priyanka Patel · Kostas Daniilidis · Michael J. Black · Muhammed Kocabas

[ ExHall D ]

Abstract
Human pose and shape (HPS) estimation presents challenges in diverse scenarios such as crowded scenes, person-person interactions, and single-view reconstruction. Existing approaches lack mechanisms to incorporate auxiliary "side information" that could enhance reconstruction accuracy in such challenging scenarios. Furthermore, the most accurate methods rely on cropped person detections and cannot exploit scene context while methods that process the whole image often fail to detect people and are less accurate than methods that use crops. While recent language-based methods explore HPS reasoning through large language or vision-language models, their metric accuracy is well below the state of the art. In contrast, we present PromptHMR, a transformer-based promptable method that reformulates HPS estimation through spatial and semantic prompts. Our method processes full images to maintain scene context and accepts multiple input modalities: spatial prompts like face or body bounding boxes, and semantic prompts like language descriptions or interaction labels. PromptHMR demonstrates robust performance across challenging scenarios: estimating people from bounding boxes as small as faces in crowded scenes, improving body shape estimation through language descriptions, modeling person-person interactions, and producing temporally coherent motions in videos. Experiments on benchmarks show that PromptHMR achieves state-of-the-art performance while offering flexible prompt-based control over the HPS …
Poster
Yalong Xu · Lin Zhao · Chen Gong · Guangyu Li · Di Wang · Nannan Wang

[ ExHall D ]

Abstract
Top-down approaches for human pose estimation (HPE) have reached a high level of sophistication, exemplified by models such as HRNet and ViTPose. Nonetheless, the low efficiency of top-down methods is a recognized issue that has not been sufficiently explored in current research. Our analysis suggests that the primary cause of inefficiency stems from the substantial diversity found in pose samples. On one hand, simple poses can be accurately estimated without requiring the computational resources of larger models. On the other hand, a more prominent issue arises from the abundance of bounding boxes, which remain excessive even after NMS. In this paper, we present a straightforward yet effective dynamic framework called DynPose, designed to match diverse pose samples with the most appropriate models, thereby ensuring optimal performance and high efficiency. Specifically, the framework contains a lightweight router and two pre-trained HPE models: one small and one large. The router is optimized to classify samples and dynamically determine the appropriate inference paths. Extensive experiments demonstrate the effectiveness of the framework. For example, using ResNet-50 and HRNet-W32 as the pre-trained models, our DynPose achieves an almost 50% increase in speed over HRNet-W32 while maintaining the same level of accuracy. More importantly, the framework can be …
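A minimal sketch of router-based dynamic inference in the spirit of the framework described above: a lightweight router scores each person crop, and only hard samples go through the large model. The model names, the router output convention, and the threshold are all illustrative.

import torch

def dynamic_pose_inference(router, small_model, large_model, crops, threshold=0.5):
    # Route each crop to the small or large pose model based on a predicted
    # difficulty score, so easy samples skip the expensive network.
    with torch.no_grad():
        scores = router(crops).squeeze(-1)  # (N,) assumed difficulty scores
        easy = scores < threshold
        out = [None] * crops.shape[0]
        if easy.any():
            for idx, pose in zip(easy.nonzero(as_tuple=True)[0].tolist(),
                                 small_model(crops[easy])):
                out[idx] = pose
        if (~easy).any():
            for idx, pose in zip((~easy).nonzero(as_tuple=True)[0].tolist(),
                                 large_model(crops[~easy])):
                out[idx] = pose
    return out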
Poster
Huan Ren · Wenfei Yang · Shifeng Zhang · Tianzhu Zhang

[ ExHall D ]

Abstract
Category-level object pose estimation aims to determine the pose and size of arbitrary objects within given categories. Existing two-stage correspondence-based methods first establish correspondences between camera and object coordinates, and then acquire the object pose using a pose fitting algorithm. In this paper, we conduct a comprehensive analysis of this paradigm and introduce two crucial essentials: 1) shape-sensitive and pose-invariant feature extraction for accurate correspondence prediction, and 2) outlier correspondence removal for robust pose fitting. Based on these insights, we propose a simple yet effective correspondence-based method called SpotPose, which includes two stages. During the correspondence prediction stage, pose-invariant geometric structure of objects is thoroughly exploited to facilitate shape-sensitive holistic interaction among keypoint-wise features. During the pose fitting stage, outlier scores of correspondences are explicitly predicted to facilitate efficient identification and removal of outliers. Experimental results on CAMERA25, REAL275 and HouseCat6D benchmarks demonstrate that the proposed SpotPose outperforms state-of-the-art approaches by a large margin. Our code will be released.
Poster
Ming-Feng Li · Xin Yang · Fu-En Wang · Hritam Basak · Yuyin Sun · Shreekant Gayaka · Min Sun · Cheng-Hao Kuo

[ ExHall D ]

Abstract
6D object pose estimation has shown strong generalizability to novel objects. However, existing methods often require either a complete, well-reconstructed 3D model or numerous reference images that fully cover the object. Estimating 6D poses from partial references, which capture only fragments of an object’s appearance and geometry, remains challenging. To address this, we propose UA-Pose, an uncertainty-aware approach for 6D object pose estimation and online object completion specifically designed for partial references. We assume access to either (1) a limited set of RGBD images with known poses or (2) a single 2D image. For the first case, we initialize a partial object 3D model based on the provided images and poses, while for the second, we use image-to-3D techniques to generate an initial object 3D model. Our method integrates uncertainty into the incomplete 3D model, distinguishing between seen and unseen regions. This uncertainty enables confidence assessment in pose estimation and guides an uncertainty-aware sampling strategy for online object completion, enhancing robustness in pose estimation accuracy and improving object completeness. We evaluate our method on the YCB-Video, YCBInEOAT, and HO3D datasets, including RGBD sequences of YCB objects manipulated by robots and human hands. Experimental results demonstrate significant performance improvements over existing methods, particularly when …
Poster
Bin Tan · Rui Yu · Yujun Shen · Nan Xue

[ ExHall D ]

Abstract
This paper presents PlanarSplatting, an ultra-fast and accurate surface reconstruction approach for multiview indoor images. We take the 3D planes as the main objective due to their compactness and structural expressiveness in indoor scenes, and develop an explicit optimization framework that learns to fit the expected surface of indoor scenes by splatting the 3D planes into 2.5D depth and normal maps. As our PlanarSplatting operates directly on the 3D plane primitives, it eliminates the dependencies on 2D/3D plane detection and plane matching and tracking for planar surface reconstruction. Furthermore, thanks to the essential merits of the plane-based representation and a CUDA-based implementation of the planar splatting functions, PlanarSplatting reconstructs an indoor scene in 3 minutes while achieving significantly better geometric accuracy. Owing to this ultra-fast reconstruction speed, the largest quantitative evaluation on the ScanNet and ScanNet++ datasets, over hundreds of scenes, clearly demonstrates the advantages of our method. We believe that our accurate and ultra-fast planar surface reconstruction method will be applied to structured data curation for surface reconstruction in the future. The code of our CUDA implementation will be publicly available.
Poster
Xiuqiang Song · Li Jin · Zhengxian Zhang · Jiachen Li · Fan Zhong · Guofeng Zhang · Xueying Qin

[ ExHall D ]

Abstract
In this paper, we introduce a novel, truly prior-free 3D object tracking method that operates without being given any model or training priors. Unlike existing methods that typically require pre-defined 3D models or specific training datasets as priors, which limit their applicability, our method is free from these constraints. Our method consists of a geometry generation module and a pose optimization module. Its core idea is to enable these two modules to automatically and iteratively enhance each other, thereby gradually building all the necessary information for the tracking task. We thus call the method Bidirectional Iterative Tracking (BIT). The geometry generation module starts without priors and gradually generates high-precision mesh models for tracking, while the pose optimization module generates additional data during object tracking to further refine the generated models. Moreover, the generated 3D models can be stored and easily reused, allowing for seamless integration into various other tracking systems, not just ours. Experimental results demonstrate that BIT outperforms many existing methods, even those that extensively utilize prior knowledge, while BIT does not rely on such information. Additionally, the generated 3D models deliver results comparable to actual 3D models, highlighting their superior and innovative qualities.
Poster
Guiyu Zhao · Sheng Ao · Ye Zhang · Kai Xu · Yulan Guo

[ ExHall D ]

Abstract
Obtaining enough high-quality correspondences is crucial for robust registration. Existing correspondence refinement methods mostly follow the paradigm of outlier removal, which either fails to correctly identify the accurate correspondences under extreme outlier ratios, or selects too few correct correspondences to support robust registration. To address this challenge, we propose a novel approach named Regor, a progressive correspondence regenerator that generates higher-quality matches while remaining sufficiently robust to numerous outliers. In each iteration, we first apply prior-guided local grouping and generalized mutual matching to generate the local region correspondences. A powerful center-aware three-point consistency is then presented to achieve local correspondence correction, instead of removal. Further, we employ global correspondence refinement to obtain accurate correspondences from a global perspective. Through progressive iterations, this process yields a large number of high-quality correspondences. Extensive experiments on both indoor and outdoor datasets demonstrate that the proposed Regor significantly outperforms existing outlier removal techniques. More critically, our approach obtains 10 times more correct correspondences than outlier removal methods. As a result, our method is able to achieve robust registration even with weak features. The code will be released.
Poster
Amir Etefaghi Daryani · M. Usman Maqbool Bhutta · Byron Hernandez · Henry Medeiros

[ ExHall D ]

Abstract
Multi-view object detection in crowded environments presents significant challenges, particularly for occlusion management across multiple camera views. This paper introduces a novel approach that extends conventional multi-view detection to operate directly within each camera's image space. Our method finds object bounding boxes in images from various perspectives without resorting to a bird’s eye view (BEV) representation. Thus, our approach removes the need for camera calibration by leveraging a learnable architecture that facilitates flexible transformations and improves feature fusion across perspectives to increase detection accuracy. Our model achieves Multi-Object Detection Accuracy (MODA) scores of 95.0% and 96.5% on the Wildtrack and MultiviewX datasets, respectively, significantly advancing the state of the art in multi-view detection. Furthermore, it demonstrates robust performance even without ground truth annotations, highlighting its resilience and practicality in real-world applications. These results emphasize the effectiveness of our calibration-free, multi-view object detector.
Poster
Theo Bodrito · Olivier Flasseur · Julien Mairal · Jean Ponce · Maud Langlois · Anne-Marie Lagrange

[ ExHall D ]

Abstract
The search for exoplanets is an active field in astronomy, with direct imaging as one of the most challenging methods due to faint exoplanet signals buried within stronger residual starlight. Successful detection requires advanced image processing to separate the exoplanet signal from this nuisance component. This paper presents a novel statistical model that captures nuisance fluctuations using a multi-scale approach, leveraging problem symmetries and a joint spectral channel representation grounded in physical principles. Our model integrates into an interpretable, end-to-end learnable framework for simultaneous exoplanet detection and flux estimation. The proposed algorithm is evaluated against the state of the art using datasets from the SPHERE instrument operating at the Very Large Telescope (VLT). It significantly improves the precision-recall trade-off, notably on challenging datasets that are otherwise unusable by astronomers. The proposed approach is computationally efficient, robust to varying data quality, and well suited for large-scale observational surveys.
Poster
Huy Nguyen · Kien Nguyen Thanh · Akila Pemasiri · Feng Liu · Sridha Sridharan · Clinton Fookes

[ ExHall D ]

Abstract
We introduce AG-VPReID, a challenging large-scale benchmark dataset for aerial-ground video-based person re-identification (ReID), comprising 6,632 identities, 32,321 tracklets, and 9.6 million frames captured from drones (15-120m altitude), CCTV, and wearable cameras. This dataset presents a real-world benchmark to investigate the robustness of Person ReID approaches against the unique challenges of cross-platform aerial-ground settings. To address these challenges, we propose AG-VPReID-Net, an end-to-end framework combining three complementary streams: (1) an Adapted Temporal-Spatial Stream addressing motion pattern inconsistencies and temporal feature learning, (2) a Normalized Appearance Stream using physics-informed techniques to tackle resolution and appearance changes, and (3) a Multi-Scale Attention Stream handling scale variations across drone altitudes. Our approach integrates complementary visual-semantic information from all streams to generate robust, viewpoint-invariant person representations. Extensive experiments demonstrate that AG-VPReID-Net outperforms state-of-the-art approaches on both our new dataset and other existing video-based ReID benchmarks, showcasing its effectiveness and generalizability. The relatively lower performance of all state-of-the-art approaches, including our proposed approach, on our new dataset highlights its challenging nature. The AG-VPReID dataset, code and models will be released upon publication.
Poster
Shuo Wang · Wanting Li · Yongcai Wang · Zhaoxin Fan · Zhe Huang · xudong cai · Jian Zhao · Deying Li

[ ExHall D ]

Abstract
Deep visual odometry has demonstrated great advancements through learning-to-optimize techniques. This approach heavily relies on the visual matching across frames. However, ambiguous matching in challenging scenarios leads to significant errors in geometric modeling and bundle adjustment optimization, which undermines the accuracy and robustness of pose estimation. To address this challenge, this paper proposes MambaVO, which conducts robust initialization, Mamba-based sequential matching refinement, and smoothed training to enhance the matching quality and improve the pose estimation in deep visual odometry. Specifically, when a new frame is received, it is matched with the closest keyframe in the maintained Point-Frame Graph (PFG) via the semi-dense-based Geometric Initialization Module (GIM). Then the initialized PFG is processed by a proposed Geometric Mamba Module (GMM), which exploits the matching features to refine the overall inter-frame pixel-to-pixel matching. The refined PFG is finally processed by deep BA to optimize the poses and the map. To deal with the gradient variance, a Trending-Aware Penalty (TAP) is proposed to smooth training by balancing the pose loss and the matching loss to enhance convergence and stability. A loop closure module is finally applied to enable MambaVO++. On public benchmarks, MambaVO and MambaVO++ demonstrate SOTA accuracy performance, while ensuring real-time …
Poster
Hongyu Sun · Qiuhong Ke · Ming Cheng · Yongcai Wang · Deying Li · Chenhui Gou · Jianfei Cai

[ ExHall D ]

Abstract
This paper proposes a general solution to enable point cloud recognition models to handle distribution shifts at test time. Unlike prior methods, which rely heavily on training data—often inaccessible during online inference—and are limited to recognizing a fixed set of point cloud classes predefined during training, we explore a more practical and challenging scenario: adapting the model solely based on online test data to recognize both previously seen classes and novel, unseen classes at test time. To this end, we develop Point-Cache, a hierarchical cache model that captures essential clues of online test samples, particularly focusing on the global structure of point clouds and their local-part details. Point-Cache, which serves as a rich 3D knowledge base, is dynamically managed to prioritize the inclusion of high-quality samples. Designed as a plug-and-play module, our method can be flexibly integrated into large multimodal 3D models to support open-vocabulary point cloud recognition. Notably, our solution operates with efficiency comparable to zero-shot inference, as it is entirely training-free. Point-Cache demonstrates substantial gains across 8 challenging benchmarks and 4 representative large 3D models, highlighting its effectiveness.
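A heavily simplified, training-free cache in the spirit described above: confident test samples are stored with their predicted labels, and later predictions blend cache similarity into the zero-shot logits. This flat cache ignores the hierarchical global/local design and is only meant to convey the mechanism; all names and thresholds are illustrative.

import torch
import torch.nn.functional as F

class FeatureCache:
    def __init__(self, max_size=256):
        self.feats, self.labels, self.max_size = [], [], max_size

    def add(self, feat, label, confidence, min_conf=0.9):
        # Only keep high-confidence online samples, up to a fixed budget.
        if confidence >= min_conf and len(self.feats) < self.max_size:
            self.feats.append(F.normalize(feat, dim=-1))
            self.labels.append(int(label))

    def predict(self, feat, zero_shot_logits, num_classes, alpha=1.0):
        # Blend similarity to cached samples into the zero-shot prediction.
        if not self.feats:
            return zero_shot_logits
        keys = torch.stack(self.feats)                # (M, D)
        sims = F.normalize(feat, dim=-1) @ keys.t()   # (M,)
        onehot = F.one_hot(torch.tensor(self.labels), num_classes).float()
        return zero_shot_logits + alpha * sims @ onehot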
Poster
Zimo Wang · Cheng Wang · Taiki Yoshino · Sirui Tao · Ziyang Fu · Tzu-Mao Li

[ ExHall D ]

Abstract
We propose a method, HotSpot, for optimizing neural signed distance functions, based on a relation between the solution of a screened Poisson equation and the distance function. Existing losses such as the eikonal loss cannot guarantee the recovered implicit function to be a distance function, even when the implicit function satisfies the eikonal equation almost everywhere. Furthermore, the eikonal loss suffers from stability issues in optimization, and the remedies that introduce area or divergence minimization can lead to oversmoothing. We address these challenges by designing a loss function that, when minimized, converges to the true distance function, is stable, and naturally penalizes large surface area. We provide theoretical analysis and experiments on both challenging 2D and 3D datasets and show that our method provides better surface reconstruction and more accurate distance approximation.
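For context, the standard eikonal regularizer that the abstract argues is insufficient on its own looks roughly like the sketch below (the proposed HotSpot loss is not reproduced here; sdf_net and the sampling of query points are assumptions).

import torch

def eikonal_loss(sdf_net, points):
    # Encourage |grad f(x)| = 1 at the sampled query points. Satisfying this
    # almost everywhere still does not guarantee a true distance function,
    # which is the limitation the work above addresses.
    points = points.clone().requires_grad_(True)
    values = sdf_net(points)
    grad, = torch.autograd.grad(values.sum(), points, create_graph=True)
    return ((grad.norm(dim=-1) - 1.0) ** 2).mean()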
Poster
Yuanqi Li · Jingcheng Huang · Hongshen Wang · Peiyuan Lv · Yansong Liu · Jiuming Zheng · Jie Guo · Yanwen Guo

[ ExHall D ]

Abstract
The proliferation of Light Detection and Ranging (LiDAR) technology has facilitated the acquisition of three-dimensional point clouds, which are integral to applications in VR, AR, and Digital Twin. Oriented normals, critical for 3D reconstruction and scene analysis, cannot be directly extracted from scenes using LiDAR due to its operational principles. Previous traditional and learning-based methods are prone to inaccuracies under uneven point distributions and noise, owing to their dependence on local geometry features. This paper addresses the challenge of estimating oriented point normals by introducing a point cloud normal estimation framework via hybrid angular and Euclidean distance encoding (HAE). Our method overcomes the limitations of local geometric information by combining angular and Euclidean spaces to extract features from both point cloud coordinates and light rays, leading to more accurate normal estimation. The core of our network consists of an angular distance encoding module, which leverages both ray directions and point coordinates for unoriented normal refinement, and a ray feature fusion module for normal orientation that is robust to noise. We also provide a point cloud dataset with ground truth normals, generated by a virtual scanner, which reflects real scanning distributions and noise profiles.
Poster
Jiangbei Hu · Yanggeng Li · Fei Hou · Junhui Hou · Zhebin Zhang · Shengfa Wang · Na Lei · Ying He

[ ExHall D ]

Abstract
Unsigned distance fields (UDFs) provide a versatile framework for representing a diverse array of 3D shapes, encompassing both watertight and non-watertight geometries. Traditional UDF learning methods typically require extensive training on large 3D shape datasets, which is costly and necessitates re-training for new datasets. This paper presents a novel neural framework, *LoSF-UDF*, for reconstructing surfaces from 3D point clouds by leveraging local shape functions to learn UDFs. We observe that 3D shapes manifest simple patterns in localized regions, prompting us to develop a training dataset of point cloud patches characterized by mathematical functions that represent a continuum from smooth surfaces to sharp edges and corners. Our approach learns features within a specific radius around each query point and utilizes an attention mechanism to focus on the crucial features for UDF estimation. Despite being highly lightweight, with only 653 KB of trainable parameters and a modest-sized training dataset with 0.5 GB storage, our method enables efficient and robust surface reconstruction from point clouds without requiring shape-specific training. Furthermore, our method exhibits enhanced resilience to noise and outliers in point clouds compared to existing methods. We conduct comprehensive experiments and comparisons across various datasets, including synthetic and real-scanned point clouds, to …
Poster
An Li · Zhe Zhu · Mingqiang Wei

[ ExHall D ]

Abstract
Existing point cloud completion methods, which typically depend on predefined synthetic training datasets, encounter significant challenges when applied to out-of-distribution, real-world scans. To overcome this limitation, we introduce a zero-shot completion framework, termed GenPC, designed to reconstruct high-quality real-world scans by leveraging explicit 3D generative priors. Our key insight is that recent feed-forward 3D generative models, trained on extensive internet-scale data, have demonstrated the ability to perform 3D generation from single-view images in a zero-shot setting. To harness this for completion, we first develop a Depth Prompting module that links partial point clouds with image-to-3D generative models by leveraging depth images as a stepping stone. To retain the original partial structure in the final results, we design the Geometric Preserving Fusion module that aligns the generated shape with the input by adaptively adjusting its pose and scale. Extensive experiments on widely used benchmarks validate the superiority and generalizability of our approach, bringing us a step closer to robust real-world scan completion.
Poster
Ziyi Wang · Yanran Zhang · Jie Zhou · Jiwen Lu

[ ExHall D ]

Abstract
The scale diversity of point cloud data presents significant challenges in developing unified representation learning techniques for 3D vision. Currently, there are few unified 3D models, and no existing pre-training method is equally effective for both object- and scene-level point clouds. In this paper, we introduce UniPre3D, the FIRST unified pre-training method that can be seamlessly applied to point clouds of ANY scale and 3D models of ANY architecture. Our approach predicts Gaussian primitives as the pre-training task and employs differentiable Gaussian splatting to render images, enabling precise pixel-level supervision and end-to-end optimization. To further regulate the complexity of the pre-training task and direct the model's focus toward geometric structures, we integrate 2D features from pre-trained image models to incorporate well-established texture knowledge. We validate the universal effectiveness of our proposed method through extensive experiments across a variety of object- and scene-level tasks, using diverse point cloud models as backbones.
Poster
Ziyin Zeng · Mingyue Dong · Jian Zhou · Huan Qiu · Zhen Dong · Man Luo · Bijun Li

[ ExHall D ]

Abstract
Due to the irregular and disordered data structure in 3D point clouds, prior works have focused on designing more sophisticated local representation methods to capture these complex local patterns. However, the recognition performance has saturated over the past few years, indicating that increasingly complex and redundant designs no longer make improvements to local learning. This phenomenon prompts us to diverge from the trend in 3D vision and instead pursue an alternative and successful solution: deeper neural networks. In this paper, we propose DeepLA-Net, a series of very deep networks for point cloud analysis. The key insight of our approach is to exploit a small but mighty local learning block, which uses 10× fewer FLOPs, enabling the construction of very deep networks. Furthermore, we design a training supervision strategy to ensure smooth gradient backpropagation and optimization in very deep networks. We construct the DeepLA-Net family with a depth of up to 120 blocks, at least 5× deeper than recent methods, trained on a single RTX 3090. An ensemble of the DeepLA-Net achieves state-of-the-art performance on classification and segmentation tasks of S3DIS Area5 (+2.2% mIoU), ScanNet test set (+1.6% mIoU), ScanObjectNN (+2.1% OA), and ShapeNetPart (+0.9% cls.mIoU).
Poster
Chengzhi Wu · Yuxin Wan · Hao Fu · Julius Pfrommer · Zeyun Zhong · Junwei Zheng · Jiaming Zhang · Jürgen Beyerer

[ ExHall D ]

Abstract
Driven by the increasing demand for accurate and efficient representation of 3D data in various domains, point cloud sampling has emerged as a pivotal research topic in 3D computer vision. Recently, learning-to-sample methods have garnered growing interest from the community, particularly for their ability to be jointly trained with downstream tasks. However, previous learning-based sampling methods either lead to unrecognizable sampling patterns by generating a new point cloud or produce biased sampling results by focusing excessively on sharp edge details. Moreover, they all overlook the natural variations in point distribution across different shapes, applying a similar sampling strategy to all point clouds. In this paper, we propose a Sparse Attention Map and Bin-based Learning method (termed SAMBLE) to learn shape-specific sampling strategies for point cloud shapes. SAMBLE effectively achieves an improved balance between sampling edge points for local details and preserving uniformity in the global shape, resulting in superior performance across multiple common point cloud downstream tasks, even in scenarios with few-point sampling.
Poster
Jianan Ye · Weiguang Zhao · Xi Yang · Guangliang Cheng · Kaizhu Huang

[ ExHall D ]

Abstract
Point cloud anomaly detection under the anomaly-free setting poses significant challenges as it requires accurately capturing the features of 3D normal data to identify deviations indicative of anomalies. Current efforts focus on devising reconstruction tasks, such as acquiring normal data representations by restoring normal samples from altered, pseudo-anomalous counterparts. Our findings reveal that distributing attention equally across normal and pseudo-anomalous data tends to dilute the model's focus on anomalous deviations. The challenge is further compounded by the inherently disordered and sparse nature of 3D point cloud data. In response to those predicaments, we introduce an innovative approach that emphasizes learning point offsets, targeting more informative pseudo-abnormal points, thus fostering more effective distillation of normal data representations. We also have crafted an augmentation technique that is steered by normal vectors, facilitating the creation of credible pseudo anomalies that enhance the efficiency of the training process. Our comprehensive experimental evaluation on the Anomaly-ShapeNet and Real3D-AD datasets evidences that our proposed method outperforms existing state-of-the-art approaches, achieving an average enhancement of 9.0% and 1.4% in the AUC-ROC detection metric across these datasets, respectively.
Poster
Shaocheng Yan · Yiming Wang · Kaiyan Zhao · Pengcheng Shi · Zhenjun Zhao · Yongjun Zhang · Jiayuan Li

[ ExHall D ]

Abstract
Heuristic information for consensus set sampling is essential for correspondence-based point cloud registration, but existing approaches typically rely on supervised learning or expert-driven parameter tuning. In this work, we propose HeMoRa, a new unsupervised framework that trains a Heuristic information Generator (HeGen) to estimate sampling probabilities for correspondences using a Multi-order Reward Aggregator (MoRa) loss. The core of MoRa is to train HeGen through extensive trials and feedback, enabling unsupervised learning. While this process can be implemented using policy optimization, directly applying the policy gradient to optimize HeGen presents challenges such as sensitivity to noise and low reward efficiency. To address these issues, we propose a Maximal Reward Propagation (MRP) mechanism that enhances the training process by prioritizing noise-free signals and improving reward utilization. Experimental results show that equipped with HeMoRa, the consensus set sampler achieves improvements in both robustness and accuracy. For example, on the 3DMatch dataset with FCGF feature, the registration recall of our unsupervised methods (Ours+SM and Ours+SC2) even outperforms the state-of-the-art supervised method VBreg. Our code is available at https://anonymous.4open.science/r/HeMoRa-CVPR/README.md.
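A generic REINFORCE-style update is one way to picture training a sampling-probability generator from trial-and-feedback rewards; the reward callable, the Bernoulli sampling, and the single-sample update below are assumptions, and the paper's MoRa loss and MRP mechanism are not reproduced.

import torch

def policy_gradient_step(prob_logits, registration_reward, optimizer):
    # prob_logits: learnable per-correspondence logits (registered with the
    # optimizer). registration_reward: caller-supplied function scoring a
    # sampled consensus set, e.g. by the inlier count of the fitted transform.
    probs = torch.sigmoid(prob_logits)
    picks = torch.bernoulli(probs)                 # sampled consensus set
    reward = float(registration_reward(picks))     # scalar feedback
    log_prob = (picks * torch.log(probs + 1e-8) +
                (1 - picks) * torch.log(1 - probs + 1e-8)).sum()
    loss = -reward * log_prob
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward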
Poster
Zihui Zhang · Weisheng Dai · Hongtao Wen · Bo Yang

[ ExHall D ]

Abstract
We study the hard problem of unsupervised 3D semantic segmentation on raw point clouds without needing human labels in training. Existing methods usually formulate this problem into learning per-point local features followed by a simple grouping strategy, lacking the ability to discover additional and possibly richer semantic priors beyond local features. In this paper, we introduce LogoSP to learn 3D semantics from both local and global point features. The key to our approach is to discover 3D semantic information by grouping superpoints according to their global patterns in the frequency domain, thus generating highly accurate semantic pseudo-labels for training a segmentation network. Extensive experiments on two popular indoor scene datasets and a challenging outdoor dataset show that our LogoSP clearly surpasses all existing unsupervised methods by large margins, achieving the state-of-the-art performance for unsupervised 3D semantic segmentation. Notably, our investigation into the learned global patterns reveals that they truly represent meaningful 3D semantics in the absence of human labels during training.
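As a rough illustration of grouping by global frequency-domain patterns, the sketch below clusters toy per-superpoint descriptors on their magnitude spectra to obtain pseudo-labels. The descriptor construction, the tiny k-means, and the cluster count are illustrative assumptions rather than LogoSP's actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: S superpoints, each summarised by a D-dimensional descriptor
# in which two synthetic "classes" carry different periodic patterns.
S, D, K = 300, 64, 4
feats = rng.normal(size=(S, D))
feats[:100] += 2.0 * np.sin(np.linspace(0, 8 * np.pi, D))               # low-frequency pattern
feats[100:200] += 2.0 * np.sign(np.sin(np.linspace(0, 40 * np.pi, D)))  # high-frequency pattern

# "Global pattern in the frequency domain": normalised magnitude spectrum.
spec = np.abs(np.fft.rfft(feats, axis=1))
spec /= np.linalg.norm(spec, axis=1, keepdims=True) + 1e-8

def kmeans(x, k, iters=50):
    centers = x[rng.choice(len(x), k, replace=False)]
    for _ in range(iters):
        assign = ((x[:, None] - centers[None]) ** 2).sum(-1).argmin(axis=1)
        centers = np.stack([x[assign == c].mean(0) if np.any(assign == c) else centers[c]
                            for c in range(k)])
    return assign

# Grouping superpoints in frequency space yields the semantic pseudo-labels
# that would then supervise a segmentation network.
pseudo_labels = kmeans(spec, K)
print(np.bincount(pseudo_labels, minlength=K))
```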
Poster
Runmao Yao · Yi Du · Zhuoqun Chen · Haoze Zheng · Chen Wang

[ ExHall D ]

Abstract
Room reidentification (ReID) is a challenging yet essential task with numerous applications in fields such as augmented reality (AR) and homecare robotics. Existing visual place recognition (VPR) methods, which typically rely on global descriptors or aggregate local features, often struggle in cluttered indoor environments densely populated with man-made objects. These methods tend to overlook the crucial role of object-oriented information. To address this, we propose AirRoom, an object-aware pipeline that integrates multi-level object-oriented information—from global context to object patches, object segmentation, and keypoints—utilizing a coarse-to-fine retrieval approach. Extensive experiments on four newly constructed datasets—MPReID, HMReID, GibsonReID, and ReplicaReID—demonstrate that AirRoom outperforms state-of-the-art (SOTA) models across nearly all evaluation metrics, with improvements ranging from 6% to 80%. Moreover, AirRoom exhibits significant flexibility, allowing various modules within the pipeline to be substituted with different alternatives without compromising overall performance. It also shows robust and consistent performance under diverse viewpoint variations.
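A coarse-to-fine retrieval of the kind described above can be sketched as a global-descriptor shortlist followed by object-level re-ranking. The code below is a hypothetical minimal version in NumPy; the descriptor shapes, the max-mean aggregation, and the shortlist size are assumptions, not AirRoom's exact pipeline.

```python
import numpy as np

def coarse_to_fine_retrieve(q_global, db_global, q_local, db_local, topk=5):
    """Two-stage retrieval: shortlist candidate rooms by global-descriptor
    cosine similarity, then re-rank the shortlist by aggregated object-level
    similarity (max over database objects, mean over query objects)."""
    def normalize(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)
    sims = normalize(db_global) @ normalize(q_global)
    shortlist = np.argsort(-sims)[:topk]
    ql = normalize(q_local)
    fine = []
    for i in shortlist:
        s = normalize(db_local[i]) @ ql.T        # (n_db_objects, n_query_objects)
        fine.append(s.max(axis=0).mean())
    return shortlist[np.argsort(-np.array(fine))]

rng = np.random.default_rng(0)
q_global, db_global = rng.normal(size=64), rng.normal(size=(100, 64))
q_local = rng.normal(size=(6, 32))                              # query object descriptors
db_local = [rng.normal(size=(int(rng.integers(4, 9)), 32)) for _ in range(100)]
print(coarse_to_fine_retrieve(q_global, db_global, q_local, db_local)[:3])
```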
Poster
Fajwel Fogel · Yohann PERRON · Nikola Besic · Laurent Saint-André · Agnès Pellissier-Tanon · Thomas Boudras · Martin Schwartz · Ibrahim Fayad · Alexandre d'Aspremont · Loic Landrieu · Philippe Ciais

[ ExHall D ]

Abstract
Estimating canopy height and its changes at meter resolution from satellite imagery is a significant challenge in computer vision with critical environmental applications. However, the lack of open-access datasets at this resolution hinders the reproducibility and evaluation of models. We introduce Open-Canopy, the first open-access, country-scale benchmark for very high-resolution (1.5 m) canopy height estimation, covering over 87,000 km² across France with 1.5 m resolution satellite imagery and aerial LiDAR data. Additionally, we present Open-Canopy-∆, a benchmark for canopy height change detection between images from different years at tree level—a challenging task for current computer vision models. We evaluate state-of-the-art architectures on these benchmarks, highlighting significant challenges and opportunities for improvement. Our datasets and code are publicly available at [URL].
Poster
Xin Jin · Haisheng Su · Kai Liu · CONG MA · Wei Wu · Fei HUI · Junchi Yan

[ ExHall D ]

Abstract
Recent advances in LiDAR 3D detection have demonstrated the effectiveness of Transformer-based frameworks in capturing the global dependencies from point cloud spaces, which serialize the 3D voxels into the flattened 1D sequence for iterative self-attention. However, the spatial structure of 3D voxels will be inevitably destroyed during the serialization process. Besides, due to the considerable number of 3D voxels and quadratic complexity of Transformers, multiple sequences are grouped before feeding to Transformers, leading to a limited receptive field. Inspired by the impressive performance of State Space Models (SSM) achieved in the field of 2D vision tasks, in this paper, we propose a novel Unified Mamba (UniMamba), which seamlessly integrates the merits of 3D convolution and SSM in a concise multi-head manner, aiming to perform "local and global" spatial context aggregation efficiently and simultaneously. Specifically, a UniMamba block is designed which mainly consists of spatial locality modeling, complementary Z-order serialization and local-global sequential aggregator. The spatial locality modeling module integrates 3D submanifold convolution to capture the dynamic spatial position embedding before serialization. Then the efficient Z-order curve is adopted for serialization both horizontally and vertically. Furthermore, the local-global sequential aggregator adopts the channel grouping strategy to efficiently encode both "local and …
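The Z-order (Morton) serialization mentioned above can be illustrated with a standard bit-interleaving routine: sorting voxels by their Morton key flattens the 3D grid into a 1D sequence while keeping spatial neighbours close. This is a generic sketch of the curve itself (10-bit coordinates assumed), not UniMamba's serialization module.

```python
import numpy as np

def part1by2(x):
    """Spread the lower 10 bits of x so consecutive bits end up 3 apart."""
    x = x & 0x3FF
    x = (x | (x << 16)) & 0xFF0000FF
    x = (x | (x << 8)) & 0x0300F00F
    x = (x | (x << 4)) & 0x030C30C3
    x = (x | (x << 2)) & 0x09249249
    return x

def morton3d(xyz):
    """Z-order (Morton) key for integer voxel coordinates in [0, 1023]^3."""
    return part1by2(xyz[:, 0]) | (part1by2(xyz[:, 1]) << 1) | (part1by2(xyz[:, 2]) << 2)

# Sorting voxels by their Morton key gives the flattened 1D order in which
# spatially nearby voxels tend to stay adjacent in the sequence.
voxels = np.random.default_rng(0).integers(0, 64, size=(8, 3))
order = np.argsort(morton3d(voxels))
print(voxels[order])
```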
Poster
Qiming Xia · Wenkai Lin · Haoen Xiang · Xun Huang · Siheng Chen · Zhen Dong · Cheng Wang · Chenglu Wen

[ ExHall D ]

Abstract
Unsupervised 3D object detection serves as an important solution for offline 3D object annotation. However, due to data sparsity and limited views, the clustering-based label fitting in unsupervised object detection often generates low-quality pseudo-labels. Multi-agent collaborative datasets, which involve sharing complementary observations among agents, hold the potential to break through this bottleneck. In this paper, we introduce a novel unsupervised method that learns to Detect Objects from Multi-Agent LiDAR scans, termed DOtA, without using external labels. DOtA first uses the internally shared ego-pose and ego-shape of collaborative agents to initialize the detector, leveraging the generalization performance of neural networks to infer preliminary labels. Subsequently, DOtA uses the complementary observations between agents to perform multi-scale encoding on preliminary labels, then decodes high-quality and low-quality labels. These labels are further used as prompts to guide a correct feature learning process, thereby enhancing the performance of the unsupervised object detection task. Extensive experiments on the V2V4Real and OPV2V datasets show that our DOtA outperforms state-of-the-art unsupervised 3D object detection methods. Additionally, we also validate the effectiveness of the DOtA labels under various collaborative perception frameworks. The code will be publicly available.
Poster
Longyu Yang · Ping Hu · Shangbo Yuan · Lu Zhang · Jun Liu · Heng Tao Shen · Xiaofeng Zhu

[ ExHall D ]

Abstract
Existing LiDAR semantic segmentation models often suffer from decreased accuracy when exposed to adverse weather conditions. Recent methods addressing this issue focus on enhancing training data through weather simulation or universal augmentation techniques. However, few works have studied the negative impacts caused by the heterogeneous domain shifts in the geometric structure and reflectance intensity of point clouds. In this paper, we delve into this challenge and address it with a novel Geometry-Reflectance Collaboration (GRC) framework that explicitly separates feature extraction for geometry and reflectance. Specifically, GRC employs a dual-branch architecture that initially processes geometric and reflectance features independently, thereby capitalizing on their distinct characteristics. Then, GRC adopts a robust multi-level feature collaboration module to suppress redundant and unreliable information from both branches. Consequently, without complex simulation or augmentation, our method effectively extracts intrinsic information about the scene while suppressing interference, thus achieving better robustness and generalization in adverse weather conditions. We demonstrate the effectiveness of GRC through comprehensive experiments on challenging benchmarks, showing that our method outperforms previous approaches and establishes new state-of-the-art results.
Poster
R.D. Lin · Pengcheng Weng · Yinqiao Wang · Han Ding · Jinsong Han · Fei Wang

[ ExHall D ]

Abstract
LiDAR point cloud semantic segmentation plays a crucial role in autonomous driving. In recent years, semi-supervised methods have gained popularity due to their significant reduction in annotation labor and time costs. Current semi-supervised methods typically focus on point cloud spatial distribution or consider short-term temporal representations, e.g., only two adjacent frames, often overlooking the rich long-term temporal properties inherent in autonomous driving scenarios. We observe that nearby objects, such as roads and vehicles, remain stable while driving, whereas distant objects exhibit greater variability in category and shape. This natural phenomenon is also captured by LiDAR, which reflects lower temporal sensitivity for nearby objects and higher sensitivity for distant ones. To leverage these characteristics, we propose HiLoTs, which learns high-temporal sensitivity and low-temporal sensitivity representations from continuous LiDAR frames. These representations are further enhanced and fused using a cross-attention mechanism. Additionally, we employ a teacher-student framework to align the representations learned by the labeled and unlabeled branches, effectively utilizing the large amounts of unlabeled data. Experimental results on the SemanticKITTI and nuScenes datasets demonstrate that our proposed HiLoTs outperforms state-of-the-art semi-supervised methods, and achieves performance close to LiDAR+Camera multimodal approaches.
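The teacher-student alignment mentioned above is commonly realized as a mean-teacher setup: the teacher tracks an exponential moving average of the student and supervises it on unlabeled frames through a consistency term. The sketch below shows only that generic mechanism with toy parameters and predictions; it is an assumption for illustration, not HiLoTs' specific architecture or its cross-attention fusion.

```python
import numpy as np

rng = np.random.default_rng(0)

def ema_update(teacher, student, m=0.99):
    """Mean-teacher update: the teacher tracks an exponential moving average
    of the student's weights (the teacher is never updated by backprop)."""
    return {k: m * teacher[k] + (1.0 - m) * student[k] for k in teacher}

def consistency_loss(p_student, p_teacher):
    """Soft consistency between student and teacher class probabilities on
    unlabeled points; in practice the teacher side is treated as a constant."""
    return float(np.mean((p_student - p_teacher) ** 2))

# Toy weights and toy per-point class probabilities for unlabeled LiDAR points.
teacher = {"w": rng.normal(size=(8, 8))}
student = {"w": teacher["w"] + 0.1 * rng.normal(size=(8, 8))}
teacher = ema_update(teacher, student)

p_student = rng.dirichlet(np.ones(5), size=100)
p_teacher = rng.dirichlet(np.ones(5), size=100)
print("consistency:", consistency_loss(p_student, p_teacher))
```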
Poster
Zakaria Laskar · Tomas Vojir · Matej Grcic · Iaroslav Melekhov · Shankar Gangisetty · Juho Kannala · Jiri Matas · Giorgos Tolias · C.V. Jawahar

[ ExHall D ]

Abstract
Before deployment in the real world, deep neural networks require thorough evaluation of how they handle both knowns (inputs represented in the training data) and unknowns (anomalies). This is especially important for scene understanding tasks with safety-critical applications, such as autonomous driving. Existing datasets allow evaluation of only either knowns or unknowns - but not both, which is required to establish “in the wild” suitability of deep neural network models. To bridge this gap, we propose a novel anomaly segmentation dataset, ISSU, featuring a diverse set of anomaly inputs from cluttered real-world environments. The dataset is twice as large as existing anomaly segmentation datasets and provides training, validation and test sets for controlled in-domain evaluation. The test set consists of a static and a temporal part, with the latter comprising videos. The dataset provides annotations for both closed-set classes (knowns) and anomalies, enabling closed-set and open-set evaluation. The dataset covers diverse conditions, such as domain and cross-sensor shift and illumination variation, and allows ablation of anomaly detection methods with respect to these variations. Evaluation results of current state-of-the-art methods confirm the need for improvements, especially in domain generalization and in the segmentation of small and large objects. The benchmarking code and the dataset will be made available.
Poster
Ben Agro · Sergio Casas · Patrick Wang · Thomas Gilles · Raquel Urtasun

[ ExHall D ]

Abstract
To perceive, humans use memory to fill in gaps caused by our limited visibility, whether due to occlusion or our narrow field of view. However, most 3D object detectors are limited to using sensor evidence from a short temporal window (0.1s-0.3s). In this work, we present a simple and effective add-on for enhancing any existing 3D object detector with long-term memory regardless of its sensor modality (e.g., LiDAR, camera) and network architecture. We propose a model to effectively align and fuse object proposals from a detector with object proposals from a memory bank of past predictions, exploiting trajectory forecasts to align proposals across time. We propose a novel schedule to train our model on temporal data that balances data diversity and the gap between training and inference. By applying our method to existing LiDAR and camera-based detectors on the Waymo Open Dataset (WOD) and Argoverse 2 Sensor (AV2) dataset, we demonstrate significant improvements in detection performance (+2.5 to +7.6 AP points). Our method attains the best performance on the WOD 3D detection leaderboard among online methods (excluding ensembles or test-time augmentation).
Poster
Cédric Vincent · Taehyoung Kim · Henri Meeß

[ ExHall D ]

Abstract
Semantic segmentation from RGB cameras is essential to the perception of autonomous flying vehicles. The stability of predictions through the captured videos is paramount to their reliability and, by extension, to the trustworthiness of the agents. In this paper, we propose a lightweight video semantic segmentation approach—suited to onboard real-time inference—achieving high temporal consistency on aerial data through Semantic Similarity Propagation (SSP) across frames. SSP temporally propagates the predictions of an efficient image segmentation model with global registration alignment to compensate for camera movements. It combines the current estimation and the prior prediction with linear interpolation using weights computed from the feature similarities of the two frames. Because data availability is a challenge in this domain, we propose a consistency-aware Knowledge Distillation training procedure for sparsely labeled datasets with few annotations. Using a large image segmentation model as a teacher to train the efficient SSP, we leverage the strong correlations between labeled and unlabeled frames in the same training videos to obtain high-quality supervision on all frames. KD-SSP obtains a significant temporal consistency increase over the base image segmentation model of 12.5% and 6.7% TC on UAVid and RuralScapes respectively, with higher accuracy and comparable inference speed. On these aerial datasets, …
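The propagation step described above, blending the prior prediction with the current one using weights derived from feature similarity, can be sketched as below. The cosine-similarity weighting, the toy feature maps, and the omission of the global registration/warping step are simplifying assumptions; the paper's exact weighting function may differ.

```python
import numpy as np

def propagate(prev_pred, cur_pred, prev_feat, cur_feat):
    """Blend the (already registration-aligned) previous prediction with the
    current one, weighting each pixel by feature similarity. The cosine-based
    weight in [0, 1] is an illustrative choice, not the paper's exact function."""
    num = (prev_feat * cur_feat).sum(-1)
    den = np.linalg.norm(prev_feat, axis=-1) * np.linalg.norm(cur_feat, axis=-1) + 1e-8
    w = 0.5 * (num / den + 1.0)
    return w[..., None] * prev_pred + (1.0 - w[..., None]) * cur_pred

rng = np.random.default_rng(0)
H, W, C, K = 4, 5, 16, 3                                   # tiny frame, K classes
prev_feat = rng.normal(size=(H, W, C))
cur_feat = prev_feat + 0.1 * rng.normal(size=(H, W, C))    # mostly consistent frames
prev_pred = rng.dirichlet(np.ones(K), size=(H, W))
cur_pred = rng.dirichlet(np.ones(K), size=(H, W))
fused = propagate(prev_pred, cur_pred, prev_feat, cur_feat)
print(fused.shape, float(fused.sum(-1).mean()))            # per-pixel probabilities still sum to 1
```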
Poster
Lingdong Kong · Dongyue Lu · Xiang Xu · Lai Xing Ng · Wei Tsang Ooi · Benoit Cottereau

[ ExHall D ]

Abstract
Cross-platform adaptation in event-based dense perception is crucial for deploying event cameras across diverse settings, such as vehicles, drones, and quadrupeds, each with unique motion dynamics, viewpoints, and class distributions. In this work, we introduce EventFly, a framework for robust cross-platform adaptation in event camera perception. Our approach comprises three key components: i) Event Activation Prior (EAP), which identifies high-activation regions in the target domain to minimize prediction entropy, fostering confident, domain-adaptive predictions; ii) EventBlend, a data-mixing strategy that integrates source and target event voxel grids based on EAP-driven similarity and density maps, enhancing feature alignment; and iii) EventMatch, a dual-discriminator technique that aligns features from source, target, and blended domains for better domain-invariant learning. To holistically assess cross-platform adaptation abilities, we introduce EXPo, a large-scale benchmark with diverse samples across vehicle, drone, and quadruped platforms. Extensive experiments validate our effectiveness, demonstrating substantial gains over popular adaptation methods. We hope this work can pave the way for adaptive, high-performing event perception across diverse and complex environments.
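As a rough illustration of the data-mixing idea behind EventBlend, the sketch below swaps target-domain event voxels into a source grid wherever a hypothetical activation map is high. The binary threshold and the random toy grids are assumptions; the actual strategy builds its mask from EAP-driven similarity and density maps.

```python
import numpy as np

def event_blend(src_voxels, tgt_voxels, activation_map, thresh=0.5):
    """Copy target-domain event voxels where the activation prior is high and
    keep source voxels elsewhere (a simplified, binary-mask version of the
    blending idea)."""
    mask = (activation_map > thresh).astype(src_voxels.dtype)   # (H, W) spatial mask
    return mask[None] * tgt_voxels + (1.0 - mask[None]) * src_voxels

rng = np.random.default_rng(0)
T, H, W = 5, 32, 32                     # event voxel grid: time bins x height x width
src = rng.random((T, H, W))
tgt = rng.random((T, H, W))
act = rng.random((H, W))                # stand-in for an Event Activation Prior map
mixed = event_blend(src, tgt, act)
print(mixed.shape)
```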
Poster
Tianchen Deng · Guole Shen · Chen Xun · Shenghai Yuan · Tongxing Jin · Hongming Shen · Yanbo Wang · Jingchuan Wang · Hesheng Wang · Danwei Wang · Weidong Chen

[ ExHall D ]

Abstract
Neural implicit scene representations have recently shown promising results in dense visual SLAM. However, existing implicit SLAM algorithms are constrained to single-agent scenarios and struggle with large indoor scenes and long sequences. Existing multi-agent SLAM frameworks cannot meet the constraints of communication bandwidth. To this end, we propose the first distributed multi-agent collaborative SLAM framework with distributed mapping and camera tracking, joint scene representation, intra-to-inter loop closure, and multi-submap fusion. Specifically, our proposed distributed neural mapping and tracking framework only needs peer-to-peer communication, which can greatly improve multi-agent cooperation and communication efficiency. A novel intra-to-inter loop closure method is designed to achieve local (single-agent) and global (multi-agent) consistency. Furthermore, to the best of our knowledge, there is no real-world dataset for NeRF-based/GS-based SLAM that provides both continuous-time trajectory ground truth and high-accuracy 3D mesh ground truth. To this end, we propose the first real-world indoor neural SLAM (INS) dataset covering both single-agent and multi-agent scenarios, ranging from small rooms to large-scale scenes, with high-accuracy ground truth for both 3D mesh and continuous-time camera trajectory. This dataset can advance the development of the community. Experiments on various datasets demonstrate the superiority of the proposed method in mapping, tracking, and communication. The dataset …
Poster
Xin Ye · Burhaneddin Yaman · Sheng Cheng · Feng Tao · Abhirup Mallik · Liu Ren

[ ExHall D ]

Abstract
Bird's-eye-view (BEV) representations play a crucial role in autonomous driving tasks. Despite recent advancements in BEV generation, inherent noise, stemming from sensor limitations and the learning process, remains largely unaddressed, resulting in suboptimal BEV representations that adversely impact the performance of downstream tasks. To address this, we propose BEVDiffuser, a novel diffusion model that effectively denoises BEV feature maps using the ground-truth object layout as guidance. BEVDiffuser can be operated in a plug-and-play manner during training time to enhance existing BEV models without requiring any architectural modifications. Extensive experiments on the challenging nuScenes dataset demonstrate BEVDiffuser's exceptional denoising and generation capabilities, which enable significant enhancement to existing BEV models, as evidenced by notable improvements of 12.3% in mAP and 10.1% in NDS achieved for 3D object detection without introducing additional computational complexity. Moreover, substantial improvements in long-tail object detection and under challenging weather and lighting conditions further validate BEVDiffuser's effectiveness in denoising and enhancing BEV representations.
Poster
Dubing Chen · Huan Zheng · Jin Fang · Xingping Dong · Xianfei Li · Wenlong Liao · Tao He · Pai Peng · Jianbing Shen

[ ExHall D ]

Abstract
We present GDFusion, a temporal fusion method for vision-based 3D semantic occupancy prediction (VisionOcc). GDFusion opens up the underexplored aspects of temporal fusion within the VisionOcc framework, with a focus on both temporal cues and fusion strategies. It systematically examines the entire VisionOcc pipeline, identifying three fundamental yet previously overlooked temporal cues: scene-level consistencies, motion calibration, and geometric complementation. These cues capture diverse facets of temporal evolution and provide distinctive contributions across various modules in the general VisionOcc framework. To effectively fuse temporal signals of different representations, we introduce a novel fusion strategy by reinterpreting vanilla RNNs. This approach uses gradient descent on features to unify the integration of diverse temporal information. Extensive experiments on nuScenes demonstrate that GDFusion significantly outperforms established baselines, delivering a consistent increase in mIoU between 2.2% and 4.7% with less memory consumption. Codes will be made publicly available.
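The reinterpretation of recurrent fusion as gradient descent on features can be illustrated with a toy quadratic energy: the current BEV-style feature map is refined by a few gradient steps toward aligned historical cues. The energy, step size, and weights below are illustrative assumptions, not GDFusion's actual objective.

```python
import numpy as np

def fuse_by_gradient_descent(cur_feat, hist_feats, weights, lr=0.3, steps=5):
    """Refine the current feature map with a few gradient steps on a toy
    quadratic energy E = 0.5 * sum_i w_i * ||f - h_i||^2 that pulls it toward
    each (assumed already aligned) historical cue."""
    f = cur_feat.copy()
    for _ in range(steps):
        grad = sum(w * (f - h) for w, h in zip(weights, hist_feats))
        f -= lr * grad
    return f

rng = np.random.default_rng(0)
cur = rng.normal(size=(32, 100, 100))                  # C x H x W BEV-style features
hist = [cur + 0.2 * rng.normal(size=cur.shape) for _ in range(3)]
fused = fuse_by_gradient_descent(cur, hist, weights=[0.5, 0.3, 0.2])
print(float(np.abs(fused - cur).mean()))
```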
Poster
Zhimin Liao · Ping Wei · Shuaijia Chen · Haoxuan Wang · Ziyang Ren

[ ExHall D ]

Abstract
3D occupancy and scene flow offer a detailed and dynamic representation of 3D space, which is particularly critical for autonomous driving applications. Recognizing the sparsity and complexity of 3D space, previous vision-centric methods have employed implicit learning-based approaches to model spatial and temporal information. However, these approaches struggle to capture local details and diminish the model's spatial discriminative ability. To address these challenges, we propose a novel explicit state-based modeling method designed to leverage the occupied state to renovate the 3D features. Specifically, we propose a sparse occlusion-aware attention mechanism, integrated with a cascade refinement strategy, which accurately renovates 3D features with the guidance of occupied state information. Additionally, we introduce a novel sparse-based method to model long-term dynamic information, conserving computation while preserving spatial information. Compared to the previous state-of-the-art methods, our efficient explicit renovation strategy not only delivers superior performance in terms of RayIoU and MAVE for occupancy and scene flow prediction but also markedly reduces GPU memory usage during training, bringing it down to less than 8.7GB.
Poster
Pan Yin · Kaiyu Li · Xiangyong Cao · Jing Yao · Lei Liu · Xueru Bai · Feng Zhou · Deyu Meng

[ ExHall D ]

Abstract
Recently, road graph extraction has garnered increasing attention due to its crucial role in autonomous driving, navigation, etc. However, accurately and efficiently extracting road graphs remains a persistent challenge, primarily due to the severe scarcity of labeled data. To address this limitation, we collect a global-scale satellite road graph extraction dataset, i.e., the Global-Scale dataset. Specifically, the Global-Scale dataset is 20× larger than the largest existing public road extraction dataset and spans over 13,800 km² globally. Additionally, we develop a novel road graph extraction model, i.e., SAM-Road++, which adopts a node-guided resampling method to alleviate the mismatch issue between training and inference in SAM-Road, a pioneering state-of-the-art road graph extraction model. Furthermore, we propose a simple yet effective "extended-line" strategy in SAM-Road++ to mitigate the occlusion issue on the road. Extensive experiments demonstrate the validity of the collected Global-Scale dataset and the proposed SAM-Road++ method, particularly highlighting its superior predictive power in unseen regions.
Poster
Chenxu Zhou · Lvchang Fu · Sida Peng · Yunzhi Yan · Zhanhua Zhang · chen yong · Jiazhi Xia · Xiaowei Zhou

[ ExHall D ]

Abstract
This paper targets the challenge of real-time LiDAR re-simulation in dynamic driving scenarios. Recent approaches utilize neural radiance fields combined with the physical modeling of LiDAR sensors to achieve high-fidelity re-simulation results. Unfortunately, these methods face limitations due to high computational demands in large-scale scenes and cannot perform real-time LiDAR rendering. To overcome these constraints, we propose LiDAR-RT, a novel framework that supports real-time, physically accurate LiDAR re-simulation for driving scenes. Our primary contribution is the development of an efficient and effective rendering pipeline, which integrates Gaussian primitives and hardware-accelerated ray tracing technology. Specifically, we model the physical properties of LiDAR sensors using Gaussian primitives with learnable parameters and incorporate scene graphs to handle scene dynamics. Building upon this scene representation, our framework first constructs a bounding volume hierarchy (BVH), then casts rays for each pixel and generates novel LiDAR views through a differentiable rendering algorithm. Importantly, our framework supports realistic rendering with flexible scene editing operations and various sensor configurations. Extensive experiments across multiple public benchmarks demonstrate that our method outperforms state-of-the-art methods in terms of rendering quality and efficiency. Codes will be released to ensure reproducibility.
Poster
Jingqiu Zhou · Lue Fan · Linjiang Huang · Zhaoxiang Zhang · Xiaoyu Shi · Si Liu · Hongsheng Li

[ ExHall D ]

Abstract
Driving scene reconstruction and rendering have advanced significantly using 3D Gaussian Splatting. However, most prior research has focused on the rendering quality along a pre-recorded vehicle path and struggles to generalize to out-of-path viewpoints, which is caused by the lack of high-quality supervision in those out-of-path views. To address this issue, we introduce an Inverse View Warping technique to create compact and high-quality images as supervision for the reconstruction of the out-of-path views, enabling high-quality rendering results for those views. For accurate and robust inverse view warping, a depth bootstrap strategy is proposed to obtain on-the-fly dense depth maps during the optimization process, overcoming the sparsity and incompleteness of LiDAR depth data. Our method achieves superior in-path and out-of-path reconstruction and rendering performance on the widely used Waymo Open dataset. In addition, a simulator-based benchmark is proposed to obtain the out-of-path ground truth and quantitatively evaluate the performance of out-of-path rendering, where our method outperforms previous methods by a significant margin.
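View warping with known depth, the ingredient underlying the warping-based supervision above, boils down to lifting each pixel with its depth, transforming it into the novel camera, and reprojecting. The sketch below shows only that generic reprojection; the intrinsics, the pose, the constant toy depth, and the absence of occlusion handling or splatting are assumptions, not the paper's Inverse View Warping or its depth bootstrap strategy.

```python
import numpy as np

def warp_to_novel_view(img, depth, K, T_src_to_tgt):
    """Lift each source pixel with its depth, move it into a novel (out-of-path)
    camera, and return its projected location plus colour for splatting."""
    H, W, _ = img.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)
    rays = pix @ np.linalg.inv(K).T                      # unproject to camera rays (z = 1)
    pts_src = rays * depth.reshape(-1, 1)                # 3D points in the source frame
    pts_h = np.concatenate([pts_src, np.ones((len(pts_src), 1))], axis=1)
    pts_tgt = (pts_h @ T_src_to_tgt.T)[:, :3]            # points in the target frame
    proj = pts_tgt @ K.T
    uv_tgt = proj[:, :2] / np.clip(proj[:, 2:3], 1e-6, None)
    valid = proj[:, 2] > 0                               # keep points in front of the camera
    return uv_tgt[valid], img.reshape(-1, 3)[valid]

# Tiny example: a novel camera shifted 0.5 m, flat toy depth of 5 m everywhere.
K = np.array([[100.0, 0, 32], [0, 100.0, 24], [0, 0, 1]])
T = np.eye(4); T[0, 3] = -0.5
img = np.random.default_rng(0).random((48, 64, 3))
depth = np.full((48, 64), 5.0)
uv, colors = warp_to_novel_view(img, depth, K, T)
print(uv.shape, colors.shape)
```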
Poster
Chaojun Ni · Guosheng Zhao · Xiaofeng Wang · Zheng Zhu · Wenkang Qin · Guan Huang · Chen Liu · Yuyin Chen · Yida Wang · Xueyang Zhang · Yifei Zhan · Kun Zhan · Peng Jia · XianPeng Lang · Xingang Wang · Wenjun Mei

[ ExHall D ]

Abstract
Closed-loop simulation is crucial for end-to-end autonomous driving. Existing sensor simulation methods (e.g., NeRF and 3DGS) reconstruct driving scenes based on conditions that closely mirror training data distributions. However, these methods struggle with rendering novel trajectories, such as lane changes. Recent work, DriveDreamer4D, has demonstrated that integrating world model knowledge alleviates these issues. Although the training-free integration approach is efficient, it still struggles to render larger maneuvers, such as multi-lane shifts. Therefore, we introduce ReconDreamer, which enhances driving scene reconstruction through incremental integration of world model knowledge. Specifically, based on the world model, DriveRestorer is proposed to mitigate ghosting artifacts via online restoration. Additionally, we propose a progressive data update strategy to ensure high-quality rendering for larger maneuvers. Notably, ReconDreamer is the first method to effectively render large maneuvers (e.g., across multiple lanes, spanning up to 6 meters). Additionally, experimental results demonstrate that ReconDreamer outperforms Street Gaussians in NTA-IoU, NTL-IoU, and FID, with relative improvements of 24.87%, 6.72%, and 29.97%. Furthermore, ReconDreamer surpasses DriveDreamer4D with PVG during large maneuver rendering, as verified by a relative improvement of 195.87% in the NTA-IoU metric and a comprehensive user study.
Poster
Shuhan Tan · John Wheatley Lambert · Hong Jeon · Sakshum Kulshrestha · Yijing Bai · Jing Luo · Dragomir Anguelov · Mingxing Tan · Chiyu “Max” Jiang

[ ExHall D ]

Abstract
The goal of traffic simulation is to augment a potentially limited amount of manually-driven miles that is available for testing and validation, with a much larger amount of simulated synthetic miles. The culmination of this vision would be a generative simulated city, where given a map of the city and an autonomous vehicle (AV) software stack, the simulator can seamlessly simulate the trip from point A to point B by populating the city around the AV and controlling all aspects of the scene, from animating the dynamic agents (e.g., vehicles, pedestrians) to controlling the traffic light states. We refer to this vision as CitySim, which requires an agglomeration of simulation technologies: scene generation to populate the initial scene, agent behavior modeling to animate the scene, occlusion reasoning, dynamic scene generation to seamlessly spawn and remove agents, and environment simulation for factors such as traffic lights. While some key technologies have been separately studied in various works, others such as dynamic scene generation and environment simulation have received less attention in the research community. We propose SceneDiffuser++, the first end-to-end generative world model trained on a single loss function capable of point A-to-B simulation on a city scale integrating all the …
Poster
Ziyang Xie · Zhizheng Liu · Zhenghao Peng · Wayne Wu · Bolei Zhou

[ ExHall D ]

Abstract
The sim-to-real gap has long posed a significant challenge for robot learning in simulation, preventing the deployment of learned models in the real world. Previous work has primarily focused on domain randomization and system identification to mitigate this gap. However, these methods are often limited by the inherent constraints of the simulation and graphics engines. In this work, we propose Vid2Sim, a novel framework that effectively bridges the sim-to-real gap through a scalable and cost-efficient real-to-sim pipeline for neural 3D scene reconstruction and simulation. Given a monocular video as input, Vid2Sim can generate photo-realistic and physically interactable 3D simulation environments to enable the reinforcement learning of visual navigation agents in complex urban environments. Extensive experiments demonstrate that Vid2Sim significantly improves the performance of urban navigation in the digital twins and the real world by 31.2% and 68.3% in success rate compared with agents trained with prior simulation methods. Code and data will be made publicly available.
Poster
Yuchen Xia · Quan Yuan · Guiyang Luo · Xiaoyuan Fu · Yang Li · Xuanhan Zhu · Tianyou Luo · Siheng Chen · Jinglin Li

[ ExHall D ]

Abstract
Collaborative perception in autonomous driving significantly enhances the perception capabilities of individual agents. Immutable heterogeneity in collaborative perception, where agents have different and fixed perception networks, presents a major challenge: the semantic gap between their exchanged intermediate features must be bridged without modifying the perception networks. Most existing methods bridge the semantic gap through interpreters. However, they either require training a new interpreter for each new agent type, limiting extensibility, or rely on a two-stage interpretation via an intermediate standardized semantic space, causing cumulative semantic loss. To achieve both extensibility in immutable heterogeneous scenarios and low-loss feature interpretation, we propose PolyInter, a polymorphic feature interpreter. It contains an extension point through which emerging new agents can seamlessly integrate by overriding only their specific prompts, which are learnable parameters intended to guide the interpretation, while reusing PolyInter’s remaining parameters. By leveraging polymorphism, our design ensures that a single interpreter is sufficient to accommodate diverse agents and interpret their features into the ego agent’s semantic space. Experiments conducted on the OPV2V dataset demonstrate that PolyInter improves collaborative perception precision by up to 11.1% compared to SOTA interpreters, while comparable results can be achieved by training only 1.4% of PolyInter's parameters when adapting to new agents.
Poster
Zebin Xing · Xingyu Zhang · Yang Hu · Bo Jiang · Tong He · Qian Zhang · Xiaoxiao Long · Wei Yin

[ ExHall D ]

Abstract
In this paper, we propose an end-to-end autonomous driving method focused on generating high-quality multimodal trajectories. In autonomous driving scenarios, there is rarely a single suitable trajectory; multimodal trajectory generation aims to provide multiple trajectory candidates, enhancing the system's reliability and human-computer interaction. Although recent works have shown success, they suffer from trajectory selection complexity and reduced trajectory quality due to high trajectory divergence and inconsistencies between guidance and scene information. To address these issues, we introduce GoalFlow, a novel method that effectively constrains the generative process to produce high-quality, multimodal trajectories. To resolve the trajectory divergence problem inherent in diffusion-based methods, GoalFlow constrains the generated trajectories by introducing a goal point. To tackle the issue of information inconsistency, GoalFlow establishes a novel scoring mechanism that selects the most appropriate goal point from the candidate points based on scene information. Furthermore, GoalFlow employs an efficient generative method, Flow Matching, to generate multimodal trajectories, and incorporates a refined scoring mechanism to select the optimal trajectory from the candidates. Our experimental results, validated on Navsim, demonstrate that GoalFlow achieves state-of-the-art performance, delivering robust multimodal trajectories for autonomous driving. Compared to the state-of-the-art methods, which scored 84.0, our model achieves an improvement of …
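Flow Matching, the generative method GoalFlow builds on, trains a network to regress a velocity field along a simple path between noise and data. The sketch below constructs the standard linear-path training pair for toy 2D trajectories; the goal point and scene conditioning are only noted in a comment, and everything here is a generic illustration rather than GoalFlow's implementation.

```python
import numpy as np

def flow_matching_pair(x0, x1, t):
    """Conditional flow matching with a linear (rectified) path: the training
    input is x_t = (1-t) * x0 + t * x1 and the regression target is the
    constant velocity x1 - x0."""
    xt = (1.0 - t)[:, None, None] * x0 + t[:, None, None] * x1
    v_target = x1 - x0
    return xt, v_target

rng = np.random.default_rng(0)
B, T_steps = 8, 20
x1 = np.cumsum(rng.normal(size=(B, T_steps, 2)), axis=1)   # toy future trajectories (data)
x0 = rng.normal(size=(B, T_steps, 2))                      # noise samples
t = rng.random(B)
xt, v = flow_matching_pair(x0, x1, t)
# A trajectory decoder would be trained with || v_pred(xt, t, goal, scene) - v ||^2,
# where the goal point enters as extra conditioning (not modeled in this sketch).
print(xt.shape, v.shape)
```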
Poster
Zikang Zhou · Hengjian Zhou · Haibo Hu · Zihao WEN · Jianping Wang · Yung-Hui Li · Yu-Kai Huang

[ ExHall D ]

Abstract
Anticipating the multimodality of future events lays the foundation for safe autonomous driving. However, multimodal motion prediction for traffic agents has been clouded by the lack of multimodal ground truth. Existing works predominantly adopt the winner-take-all training strategy to tackle this challenge, yet still suffer from limited trajectory diversity and misaligned mode confidence. While some approaches address these limitations by generating excessive trajectory candidates, they necessitate a post-processing stage to identify the most representative modes, a process lacking universal principles and compromising trajectory accuracy. We are thus motivated to introduce ModeSeq, a new multimodal prediction paradigm that models modes as sequences. Unlike the common practice of decoding multiple plausible trajectories in one shot, ModeSeq requires motion decoders to infer the next mode step by step, thereby more explicitly capturing the correlation between modes and significantly enhancing the ability to reason about multimodality. Leveraging the inductive bias of sequential mode prediction, we also propose the Early-Match-Take-All (EMTA) training strategy to diversify the trajectories further. Without relying on dense mode prediction or rule-based trajectory selection, ModeSeq considerably improves the diversity of multimodal output while attaining satisfactory trajectory accuracy, resulting in balanced performance on motion prediction benchmarks. Moreover, ModeSeq naturally emerges with the …
Poster
Yichen Xie · Runsheng Xu · Tong He · Jyh-Jing Hwang · Katie Z Luo · Jingwei Ji · Hubert Lin · Letian Chen · Yiren Lu · Zhaoqi Leng · Dragomir Anguelov · Mingxing Tan

[ ExHall D ]

Abstract
The latest advancements in multi-modal large language models (MLLMs) have spurred a strong renewed interest in end-to-end motion planning approaches for autonomous driving. Many end-to-end approaches rely on human annotations to learn intermediate perception and prediction tasks, while purely self-supervised approaches—which directly learn from sensor inputs to generate planning trajectories without human annotations—often underperform the state of the art. We observe a key gap in the input representation space: end-to-end approaches built on MLLMs are often pretrained with reasoning tasks in perspective view space rather than the native 3D space that autonomous vehicles plan in. To this end, we propose PaLI-Driver, based on the popular PaLI vision-language model. PaLI-Driver uses a novel sparse volume strategy to seamlessly transform the strong visual representation of MLLMs from perspective view to 3D space without the need to finetune the vision encoder. This representation aggregates multiview and multi-frame visual inputs and enables better prediction of planning trajectories in 3D space. To validate our method, we run experiments on both nuScenes and our in-house collected dataset X-Planning. Results show that PaLI-Driver performs favorably against existing supervised multi-task approaches while requiring no human annotations. It also demonstrates great scalability when pretrained on large volumes of …
Poster
Yifan Wang · Jian Zhao · Zhaoxin Fan · Xin Zhang · Xuecheng Wu · Yudian Zhang · Lei Jin · Xinyue Li · Gang Wang · Mengxi Jia · Ping Hu · Zheng Zhu · Xuelong Li

[ ExHall D ]

Abstract
Unmanned Aerial Vehicles (UAVs) are widely adopted across various fields, yet they raise significant privacy and safety concerns, demanding robust monitoring solutions. Existing anti-UAV methods primarily focus on position tracking but fail to capture UAV behavior and intent. To address this, we introduce a novel task—UAV Tracking and Intent Understanding (UTIU)—which aims to track UAVs while inferring and describing their motion states and intent for a more comprehensive monitoring approach. To tackle the task, we propose JTD-UAV, the first joint tracking and intent description framework based on large language models. Our dual-branch architecture integrates UAV tracking with Visual Question Answering (VQA), allowing simultaneous localization and behavior description. To benchmark this task, we introduce the TDUAV dataset, the largest dataset for joint UAV tracking and intent understanding, featuring 1,328 challenging video sequences, over 163K annotated thermal frames, and 3K VQA pairs. Our benchmark demonstrates the effectiveness of JTD-UAV, and both the dataset and code will be publicly available.
Poster
Ruiqi Qiu · JUN GONG · Xinyu Zhang · Siqi Luo · Bowen Zhang · Yi Cen

[ ExHall D ]

Abstract
The ability to adapt to varying observation lengths is crucial for trajectory prediction tasks, particularly in scenarios with limited observation lengths or missing data. Existing approaches mainly focus on introducing novel architectures or additional structural components, which substantially increase model complexity and present challenges for integration into existing models. We argue that current network architectures are sufficiently sophisticated to handle the Observation Length Shift problem, with the key challenge lying in improving feature representation for trajectories with limited lengths. To tackle this issue, we introduce a general and effective contrastive learning approach, called Contrastive Learning for Length Shift (CLLS). By incorporating contrastive learning during the training phase, our method encourages the model to extract length-invariant features, thus mitigating the impact of observation length variations. Furthermore, to better accommodate length adaptation tasks, we introduce a lightweight RNN network that, combined with CLLS, achieves state-of-the-art performance in both general prediction and observation length shift tasks. Experimental results demonstrate that our approach outperforms existing methods across multiple widely-used trajectory prediction datasets.
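Contrastive learning for length invariance, as proposed above, typically pairs each full-length observation with a truncated view of itself and applies an InfoNCE-style objective. The sketch below computes such a loss on toy embeddings; the encoder, the truncation scheme, and the temperature are assumptions, not the exact CLLS formulation.

```python
import numpy as np

def info_nce(z_full, z_short, tau=0.1):
    """InfoNCE-style loss: the embedding of a truncated observation should be
    closest to the embedding of its own full-length trajectory (positives on
    the diagonal), pushing the encoder toward length-invariant features."""
    z_full = z_full / np.linalg.norm(z_full, axis=1, keepdims=True)
    z_short = z_short / np.linalg.norm(z_short, axis=1, keepdims=True)
    logits = z_short @ z_full.T / tau                 # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

rng = np.random.default_rng(0)
B, D = 16, 32
z_full = rng.normal(size=(B, D))                      # stand-in for full-length embeddings
z_short = z_full + 0.05 * rng.normal(size=(B, D))     # stand-in for truncated-view embeddings
print("loss:", info_nce(z_full, z_short))
```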
Poster
Dianze Li · Jianing Li · Xu Liu · Xiaopeng Fan · Yonghong Tian

[ ExHall D ]

Abstract
Integrating frames and events has become a widely accepted solution for various tasks in challenging scenarios. However, most multimodal methods directly convert events into image-like formats synchronized with frames and process each stream through separate two-branch backbones, making it difficult to fully exploit the spatiotemporal events while limiting inference frequency to the frame rate. To address these problems, we propose a novel asynchronous collaborative graph representation, namely ACGR, which is the first trial to explore a unified graph framework for asynchronously processing frames and events with high performance and low latency. Technically, we first construct unimodal graphs for frames and events to preserve their spatiotemporal properties and sparsity. Then, an asynchronous collaborative alignment module is designed to align and fuse frames and events into a unified graph and the ACGR is generated through graph convolutional networks. Finally, we innovatively introduce domain adaptation to enable cross-modal interactions between frames and events by aligning their feature spaces. Experimental results show that our approach outperforms state-of-the-art methods in both object detection and depth estimation tasks, while significantly reducing computational latency and achieving real-time inference up to 200 Hz. Our code will be open-sourced and available in the supplementary material.
Poster
Huangyue Yu · Baoxiong Jia · Yixin Chen · Yandan Yang · Puhao Li · Rongpeng Su · Jiaxin Li · Qing Li · Wei Liang · Song-Chun Zhu · Tengyu Liu · Siyuan Huang

[ ExHall D ]

Abstract
Embodied AI (EAI) research depends on high-quality and diverse 3D scenes to enable effective skill acquisition, sim-to-real transfer, and domain generalization. Recent 3D scene datasets are still limited in scalability due to their dependence on artist-driven designs and challenges in replicating the diversity of real-world objects. To address these limitations and automate the creation of 3D simulatable scenes, we present METASCENES, a large-scale 3D scene dataset constructed from real-world scans. It features 706 scenes with 15,366 objects across a wide array of types, arranged in realistic layouts, with visually accurate appearances and physical plausibility. Leveraging recent advancements in object-level modeling, we provide each object with a curated set of candidates, ranked through human annotation for optimal replacements based on geometry, texture, and functionality. These annotations enable a novel multi-modal alignment model, SCAN2SIM, which facilitates automated and high-quality asset replacement. We further validate the utility of our dataset with two benchmarks: Micro-Scene Synthesis for small object layout generation and cross-domain vision-language navigation (VLN). Results confirm the potential of METASCENES to enhance EAI by supporting more generalizable agent learning and sim-to-real applications, introducing new possibilities for EAI research.
Poster
Dongyue Lu · Lingdong Kong · Tianxin Huang · Gim Hee Lee

[ ExHall D ]

Abstract
Identifying affordance regions on 3D objects from semantic cues is essential for robotics and human-machine interaction. However, existing 3D affordance learning methods struggle with generalization and robustness due to limited annotated data and a reliance on 3D backbones focused on geometric encoding, which often lack resilience to real-world noise and data corruption. We propose GEAL, a novel framework designed to enhance the generalization and robustness of 3D affordance learning by leveraging large-scale pre-trained 2D models. We employ a dual-branch architecture with Gaussian splatting to establish consistent mappings between 3D point clouds and 2D representations, enabling realistic 2D renderings from sparse point clouds. A granularity-adaptive fusion module and a 2D-3D consistency alignment module further strengthen cross-modal alignment and knowledge transfer, allowing the 3D branch to benefit from the rich semantics and generalization capacity of 2D models. To holistically assess the robustness, we introduce two new corruption-based benchmarks: PIAD-C and LASO-C. Extensive experiments on public datasets and our benchmarks show that GEAL consistently outperforms existing methods across seen and novel object categories, as well as corrupted data, demonstrating robust and adaptable affordance prediction under diverse conditions. We will release our datasets and source-code as open source upon paper acceptance.
Poster
Chunlin Yu · Hanqing Wang · Ye Shi · Haoyang Luo · Sibei Yang · Jingyi Yu · Jingya Wang

[ ExHall D ]

Abstract
3D affordance segmentation aims to link human instructions to touchable regions of 3D objects for embodied manipulations. Existing efforts typically adhere to single-object, single-affordance paradigms, where each affordance type or explicit instruction strictly corresponds to a specific affordance region, and they are unable to handle long-horizon tasks. Such a paradigm cannot actively reason about complex user intentions that often imply sequential affordances. In this paper, we introduce the Sequential 3D Affordance Reasoning task, which extends the traditional paradigm by reasoning from cumbersome user intentions and then decomposing them into a series of segmentation maps. Toward this, we construct the first instruction-based affordance segmentation benchmark that includes reasoning over both single and sequential affordances, comprising 180K instruction-point cloud pairs. Based on the benchmark, we propose our model, SeqAfford, to unlock the 3D multi-modal large language model with additional affordance segmentation abilities, which ensures reasoning with world knowledge and fine-grained affordance grounding in a cohesive framework. We further introduce a multi-granular language-point integration module to endow 3D dense prediction. Extensive experimental evaluations show that our model excels over well-established methods and exhibits open-world generalization with sequential reasoning abilities.
Poster
Qingqing Zhao · Yao Lu · Moo Jin Kim · Zipeng Fu · Zhuoyang Zhang · Yecheng Wu · Max Li · Qianli Ma · Song Han · Chelsea Finn · Ankur Handa · Tsung-Yi Lin · Gordon Wetzstein · Ming-Yu Liu · Donglai Xiang

[ ExHall D ]

Abstract
Vision-language-action models (VLAs) have shown potential in leveraging pretrained vision-language models and diverse robot demonstrations for learning generalizable sensorimotor control. While this paradigm effectively utilizes large-scale data from both robotic and non-robotic sources, current VLAs primarily focus on direct input-output mappings, lacking the intermediate reasoning steps crucial for complex manipulation tasks. As a result, existing VLAs lack temporal planning or reasoning capabilities. In this paper, we introduce a method that incorporates explicit visual chain-of-thought (CoT) reasoning into vision-language-action models (VLAs) by predicting future image frames autoregressively as visual goals before generating a short action sequence to achieve these goals. We introduce CoT-VLA, a state-of-the-art 7B VLA that can understand and generate visual and action tokens. Our experimental results demonstrate that CoT-VLA achieves strong performance, outperforming the state-of-the-art VLA model by 17% in real-world manipulation tasks and 6% in simulation benchmarks.
Poster
Zhenyu Wu · Yuheng Zhou · Xiuwei Xu · Ziwei Wang · Haibin Yan

[ ExHall D ]

Abstract
Mobile manipulation is the fundamental challenge for robotics to assist humans with diverse tasks and environments in everyday life. However, conventional mobile manipulation approaches often struggle to generalize across different tasks and environments because of the lack of large-scale training. In contrast, recent advances in vision-language-action (VLA) models have shown impressive generalization capabilities, but these foundation models are developed for fixed-base manipulation tasks. Therefore, we propose an efficient policy adaptation framework to transfer pre-trained VLA models from fixed-base manipulation to mobile manipulation, so that high generalization ability across tasks and environments can be achieved in the mobile manipulation policy. Specifically, we utilize pre-trained VLA models to generate waypoints of the end-effector with high generalization ability. We design motion planning objectives for the mobile base and the robot arm, which aim at maximizing the physical feasibility of the trajectory. Finally, we present an efficient bi-level objective optimization framework for trajectory generation, where the upper-level optimization predicts waypoints for base movement to enhance the manipulator policy space, and the lower-level optimization selects the optimal end-effector trajectory to complete the manipulation task. Extensive experimental results on OVMM and the real world demonstrate that our method achieves a 4.2% higher success rate than the state-of-the-art mobile manipulation, and …
Poster
Yuheng Ji · Huajie Tan · Jiayu Shi · Xiaoshuai Hao · Yuan Zhang · Hengyuan Zhang · Pengwei Wang · Mengdi Zhao · Yao Mu · Pengju An · Xinda Xue · Qinghang Su · Huaihai Lyu · Xiaolong Zheng · Jiaming Liu · Zhongyuan Wang · Shanghang Zhang

[ ExHall D ]

Abstract
Recent advancements in Multimodal Large Language Models (MLLMs) have shown remarkable capabilities across various multimodal contexts. However, their application in robotic scenarios, particularly for long-horizon manipulation tasks, reveals significant limitations. These limitations arise from the current MLLMs lacking three essential robotic brain capabilities: Planning Capability, which involves decomposing complex manipulation instructions into manageable sub-tasks; Affordance Perception, the ability to recognize and interpret the affordances of interactive objects; and Trajectory Prediction, the foresight to anticipate the complete manipulation trajectory necessary for successful execution. To enhance the robotic brain's core capabilities from abstract to concrete, we introduce ShareRobot, a high-quality heterogeneous dataset that labels multi-dimensional information such as task planning, object affordance, and end-effector trajectory. ShareRobot's diversity and accuracy have been meticulously refined by three human annotators. Building on this dataset, we developed RoboBrain, an MLLM-based model that combines robotic and general multi-modal data, utilizes a multi-stage training strategy, and incorporates long videos and high-resolution images to improve its robotic manipulation capabilities. Extensive experiments demonstrate that RoboBrain achieves state-of-the-art performance across various robotic tasks, highlighting its potential to advance robotic brain capabilities.
Poster
Tianxing Chen · Yao Mu · Zhixuan Liang · Zanxin Chen · ShijiaPeng · Qiangyu Chen · Mingkun Xu · Ruizhen Hu · Hongyuan Zhang · Xuelong Li · Ping Luo

[ ExHall D ]

Abstract
Recent advances in imitation learning for 3D robotic manipulation have shown promising results with diffusion-based policies. However, achieving human-level dexterity requires seamless integration of geometric precision and semantic understanding. We present G3Flow, a novel framework that constructs real-time semantic flow, a dynamic, object-centric 3D semantic representation by leveraging foundation models. Our approach uniquely combines 3D generative models for digital twin creation, vision foundation models for semantic feature extraction, and robust pose tracking for continuous semantic flow updates. This integration enables complete semantic understanding even under occlusions while eliminating manual annotation requirements. By incorporating semantic flow into diffusion policies, we demonstrate significant improvements in both terminal-constrained manipulation and cross-object generalization. Extensive experiments across five simulation tasks show that G3Flow consistently outperforms existing approaches, achieving up to 68.3% and 50.1% average success rates on terminal-constrained manipulation and cross-object generalization tasks respectively. Our results demonstrate the effectiveness of G3Flow in enhancing real-time dynamic semantic feature understanding for robotic manipulation policies.
Poster
Zhixuan Liang · Yao Mu · Yixiao Wang · Fei Ni · Tianxing Chen · Wenqi Shao · Wei Zhan · Masayoshi Tomizuka · Ping Luo · Mingyu Ding

[ ExHall D ]

Abstract
Dexterous manipulation with contact-rich interactions is crucial for advanced robotics. While recent diffusion-based planning approaches show promise for simpler manipulation tasks, they often produce unrealistic ghost states (e.g., the object automatically moves without hand contact) or lack adaptability when handling complex sequential interactions. In this work, we introduce DexDiffuser, an interaction-aware diffusion planning framework for adaptive dexterous manipulation. DexDiffuser models joint state-action dynamics through a dual-phase diffusion process which consists of pre-interaction contact alignment and post-contact goal-directed control, enabling goal-adaptive generalizable dexterous manipulation. Additionally, we incorporate dynamics model-based dual guidance and leverage large language models for automated guidance function generation, enhancing generalizability for physical interactions and facilitating diverse goal adaptation through language cues. Experiments on physical interaction tasks such as door opening, pen and block re-orientation, and hammer striking demonstrate DexDiffuser's effectiveness on goals outside training distributions, achieving over twice the average success rate (59.2% vs. 29.5%) compared to existing methods. Our framework achieves 70.0% success on 30-degree door opening, 40.0% and 36.7% on pen and block half-side re-orientation respectively, and 46.7% on hammer nail half drive, highlighting its robustness and flexibility in contact-rich manipulation.
Poster
Guangyan Chen · Te Cui · Meiling Wang · Yang Chengcai · Mengxiao Hu · Haoyang Lu · Yao Mu · Zicai Peng · Tianxing Zhou · XINRAN JIANG · Yi Yang · Yufeng Yue

[ ExHall D ]

Abstract
Learning from demonstration is a powerful method for robotic skill acquisition. However, the significant expense of collecting such action-labeled robot data presents a major bottleneck. Video data, a rich data source encompassing diverse behavioral and physical knowledge, emerges as a promising alternative. In this paper, we present GraphMimic, a novel paradigm that leverages video data via graph-to-graphs generative modeling, which pre-trains models to generate future graphs conditioned on the graph within a video frame. Specifically, GraphMimic abstracts video frames into object and visual action vertices, and constructs graphs for state representations. The graph generative modeling network then effectively models internal structures and spatial relationships within the constructed graphs, aiming to generate future graphs. The generated graphs serve as conditions for the control policy, mapping to robot actions. Our concise approach captures important spatial relations and enhances future graph generation accuracy, enabling the acquisition of robust policies from limited action-labeled data. Furthermore, the transferable graph representations facilitate the effective learning of manipulation skills from cross-embodiment videos. Our experiments show that GraphMimic achieves superior performance using merely 20% action-labeled data. Moreover, our method outperforms the state-of-the-art method by over 17% and 23% in simulation and real-world experiments, and delivers improvements of over …
Poster
Yun Liu · Chengwen Zhang · Ruofan Xing · Bingda Tang · Bowen Yang · Li Yi

[ ExHall D ]

Abstract
Understanding how humans cooperatively rearrange household objects is critical for VR/AR and human-robot interaction. However, in-depth studies on modeling these behaviors are under-researched due to the lack of relevant datasets. We fill this gap by presenting CORE4D, a novel large-scale 4D human-object-human interaction dataset focusing on collaborative object rearrangement, which encompasses diverse compositions of various object geometries, collaboration modes, and 3D scenes. With 1K human-object-human motion sequences captured in the real world, we enrich CORE4D by contributing an iterative collaboration retargeting strategy to augment motions to a variety of novel objects. Leveraging this approach, CORE4D comprises a total of 11K collaboration sequences spanning 3K real and virtual object shapes. Benefiting from extensive motion patterns provided by CORE4D, we benchmark two tasks aiming at generating human-object interaction: human-object motion forecasting and interaction synthesis. Extensive experiments demonstrate the effectiveness of our collaboration retargeting strategy and indicate that CORE4D has posed new challenges to existing human-object interaction generation methodologies.
Poster
Alpár Cseke · Shashank Tripathi · Sai Kumar Dwivedi · Arjun Lakshmipathy · Agniv Chatterjee · Michael J. Black · Dimitrios Tzionas

[ ExHall D ]

Abstract
Recovering 3D Human-Object Interaction (HOI) from single color images is challenging due to depth ambiguities, occlusions, and the huge variation in object shape and appearance. Thus, past work requires controlled settings such as known object shapes and contacts, and tackles only limited object classes. Instead, we need methods that generalize to natural images and novel object classes. We tackle this in two main ways:(1) We collect PICO-db, a new dataset of natural images uniquely paired with dense 3D contact on both body and object meshes. To this end, we use images from the recent DAMON dataset that are paired with contacts, but these contacts are only annotated on a canonical 3D body. In contrast, we seek contact labels on both the body and the object. To infer these given an image, we retrieve an appropriate 3D object mesh from a database by leveraging vision foundation models. Then, we project DAMON's body contact patches onto the object via a novel method needing only 2 clicks per patch. This minimal human input establishes rich contact correspondences between bodies and objects. (2) We exploit our new dataset of contact correspondences in a novel render-and-compare fitting method, called PICO-fit, to recover 3D body and …
Poster
Yiming Dou · Wonseok Oh · Yuqing Luo · Antonio Loquercio · Andrew Owens

[ ExHall D ]

Abstract
We study the problem of making 3D scene reconstructions interactive by asking the following question: can we predict the sounds of human hands interacting with a 3D reconstruction of a scene? We focus on human hands since they are versatile in their actions (e.g., tapping, scratching, patting) and very important for simulating human-like avatars in virtual reality applications. To predict the sound of hands, we train a video- and hand-conditioned rectified flow model on a novel dataset of 3D-aligned hand-scene interactions with synchronized audio. Evaluation through psychophysical studies shows that our generated sounds are frequently indistinguishable from real sounds, outperforming baselines lacking hand pose or visual scene information. Through quantitative evaluations, we show that the generated sounds accurately convey material properties and actions. We will release our code and dataset to support further development in interactive 3D reconstructions.
Poster
Jinglei Zhang · Jiankang Deng · Chao Ma · Rolandos Alexandros Potamias

[ ExHall D ]

Abstract
Despite recent advances in 3D hand pose estimation, current methods predominantly focus on single-image 3D hand reconstruction in the camera frame, overlooking the world-space motion of the hands. This limitation prohibits their direct use in egocentric video settings, where hands and camera are continuously in motion. In this work, we propose HaWoR, a high-fidelity method for hand motion reconstruction in world coordinates from egocentric videos. We propose to decouple the task by reconstructing the hand motion in the camera space and estimating the camera trajectory in the world coordinate system. To achieve precise camera trajectory estimation, we propose an adaptive egocentric SLAM framework that addresses the shortcomings of traditional SLAM methods, providing robust performance under challenging camera dynamics. To ensure robust hand motion trajectories, even when the hands move out of the view frustum, we devise a novel motion infiller network that effectively completes the missing frames of the sequence. Through extensive quantitative and qualitative evaluations, we demonstrate that HaWoR achieves state-of-the-art performance on both hand motion reconstruction and world-frame camera trajectory estimation across different egocentric benchmark datasets. Code and models will be made publicly available for research purposes.
Poster
Jeonghwan Kim · Jisoo Kim · Jeonghyeon Na · Hanbyul Joo

[ ExHall D ]

Abstract
To enable machines to understand the way humans interact with the physical world in daily life, 3D interaction signals should be captured in natural settings, allowing people to engage with multiple objects in a range of sequential and casual manipulations. To achieve this goal, we introduce our ParaHome system designed to capture dynamic 3D movements of humans and objects within a common home environment. Our system features a multi-view setup with 70 synchronized RGB cameras, along with wearable motion capture devices including an IMU-based body suit and hand motion capture gloves. By leveraging the ParaHome system, we collect a new human-object interaction dataset, including 486 minutes of sequences across 208 captures with 38 participants, offering advancements with three key aspects: (1) capturing body motion and dexterous hand manipulation motion alongside multiple objects within a contextual home environment; (2) encompassing sequential and concurrent manipulations paired with text descriptions; and (3) including articulated objects with multiple parts represented by 3D parameterized models. We present detailed design justifications for our system, and perform key generative modeling experiments to demonstrate the potential of our dataset.
Poster
Jing Gao · Ce Zheng · Laszlo Jeni · Zackory Erickson

[ ExHall D ]

Abstract
In-bed human mesh recovery can be crucial and enabling for several healthcare applications, including sleep pattern monitoring, rehabilitation support, and pressure ulcer prevention. However, it is difficult to collect large real-world visual datasets in this domain, in part due to privacy and expense constraints, which in turn presents significant challenges for training and deploying deep learning models. Existing in-bed human mesh estimation methods often rely heavily on real-world data, limiting their ability to generalize across different in-bed scenarios, such as varying coverings and environmental settings. To address this, we propose a Sim-to-Real Transfer Framework for in-bed human mesh recovery from overhead depth images, which leverages large-scale synthetic data alongside limited or no real-world samples. We introduce a diffusion model that bridges the gap between synthetic data and real data to support generalization in real-world in-bed pose and body inference scenarios. Extensive experiments and ablation studies validate the effectiveness of our framework, demonstrating significant improvements in robustness and adaptability across diverse healthcare scenarios.
Poster
Songpengcheng Xia · Yu Zhang · Zhuo Su · Xiaozheng Zheng · Zheng Lv · Guidong Wang · Yongjie Zhang · Qi Wu · Lei Chu · Ling Pei

[ ExHall D ]

Abstract
Estimating full-body motion using the tracking signals of head and hands from VR devices holds great potential for various applications. However, the sparsity and unique distribution of observations present a significant challenge, resulting in an ill-posed problem with multiple feasible solutions (i.e., hypotheses). This amplifies uncertainty and ambiguity in full-body motion estimation, especially for the lower-body joints. Therefore, we propose a new method, EnvPoser, that employs a two-stage framework to perform full-body motion estimation using sparse tracking signals and pre-scanned environment from VR devices. EnvPoser models the multi-hypothesis nature of human motion through an uncertainty-aware estimation module in the first stage. In the second stage, we refine these multi-hypothesis estimates by integrating semantic and geometric environmental constraints, ensuring that the final motion estimation aligns realistically with both the environmental context and physical interactions. Qualitative and quantitative experiments on two public datasets demonstrate that our method achieves state-of-the-art performance, highlighting significant improvements in human motion estimation within motion-environment interaction scenarios. Code will be released upon paper acceptance.
Poster
German Barquero · Nadine Bertsch · Manojkumar Marramreddy · Carlos Chacón · Filippo Arcadu · Ferran Rigual · Nicky Sijia He · Cristina Palmero · Sergio Escalera · Yuting Ye · Robin Kips

[ ExHall D ]

Abstract
In extended reality (XR), generating full-body motion of the users is important to understand their actions, drive their virtual avatars for social interaction, and convey a realistic sense of presence. While prior works focused on spatially sparse and always-on input signals from motion controllers, many XR applications opt for vision-based hand tracking for reduced user friction and better immersion. Compared to controllers, hand tracking signals are less accurate and can even be missing for an extended period of time. To handle such unreliable inputs, we present Rolling Prediction Model (RPM), an online and real-time approach that generates smooth full-body motion from temporally and spatially sparse input signals. Our model generates 1) accurate motion that matches the inputs (i.e., tracking mode) and 2) plausible motion when inputs are missing (i.e., synthesis mode). More importantly, RPM generates seamless transitions from tracking to synthesis, and vice versa. To demonstrate the practical importance of handling noisy and missing inputs, we present GORP, the first dataset of realistic sparse inputs from a commercial virtual reality (VR) headset with paired high quality body motion ground truth. GORP provides >14 hours of VR gameplay data from 28 people using motion controllers (spatially sparse) and hand tracking (spatially …
Poster
Dong Wei · Xiaoning Sun · Xizhan Gao · Shengxiang Hu · Huaijiang Sun

[ ExHall D ]

Abstract
We investigate a new task in human motion prediction, which aims to forecast future body poses from historically observed sequences while accounting for arbitrary latency. This differs from existing works that assume an ideal scenario where future motions can be "instantaneously" predicted, thereby neglecting time delays caused by network transmission and algorithm execution. Addressing this task requires tackling two key challenges: the length of the latency period can vary significantly across samples, and the prediction model must be efficient. In this paper, we propose ALIEN, which treats the motion as a continuous function parameterized by a neural network, enabling predictions under any latency condition. By incorporating Mamba-like linear attention as a hyper-network and designing subsequent low-rank modulation, ALIEN efficiently learns a set of implicit neural representation weights from the observed motion to encode instance-specific information. Additionally, our model integrates the primary motion prediction task with an extra-designed variable-delay pose reconstruction task in a unified multi-task learning framework, enhancing its ability to capture richer motion patterns. Extensive experiments demonstrate that our approach outperforms state-of-the-art baselines adapted for our new task, while maintaining competitive performance in the traditional prediction setting.
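To make the "motion as a continuous function of time" idea concrete, here is a minimal sketch under assumed dimensions: a GRU encodes the observed sequence and a small MLP maps (encoding, query time) to a pose, so a prediction can be queried at whatever latency occurs. This illustrates only the general idea, not the ALIEN architecture (which uses Mamba-like linear attention as a hyper-network with low-rank modulation).

```python
# Minimal sketch (not the ALIEN architecture): treat future motion as a continuous
# function of time so a pose can be queried at any latency.
import torch
import torch.nn as nn

class ContinuousMotionField(nn.Module):
    def __init__(self, obs_dim=66, hidden=256, pose_dim=66):
        super().__init__()
        self.encoder = nn.GRU(obs_dim, hidden, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, pose_dim),
        )

    def forward(self, observed, t_query):
        # observed: (B, T_obs, obs_dim); t_query: (B, K) future times in seconds.
        _, h = self.encoder(observed)                 # h: (1, B, hidden)
        h = h[-1].unsqueeze(1).expand(-1, t_query.shape[1], -1)
        t = t_query.unsqueeze(-1)                     # (B, K, 1)
        return self.head(torch.cat([h, t], dim=-1))   # (B, K, pose_dim)

model = ContinuousMotionField()
poses = model(torch.randn(2, 50, 66), torch.tensor([[0.08, 0.12, 0.40]] * 2))
print(poses.shape)  # torch.Size([2, 3, 66])
```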
Poster
Cecilia Curreli · Dominik Muhle · Abhishek Saroha · Zhenzhang Ye · Riccardo Marin · Daniel Cremers

[ ExHall D ]

Abstract
Probabilistic human motion prediction aims to forecast multiple possible future movements from past observations. While current approaches report high diversity and realism, they often generate motions with undetected limb stretching and jitter. To address this, we introduce SkeletonDiffusion, a latent diffusion model that embeds an explicit inductive bias on the human body within its architecture and training. Our model is trained with a novel nonisotropic Gaussian diffusion formulation that aligns with the natural kinematic structure of the human skeleton. Results show that our approach outperforms conventional isotropic alternatives, consistently generating realistic predictions while avoiding artifacts such as limb distortion. Additionally, we identify a limitation in commonly used diversity metrics, which may inadvertently favor models that produce inconsistent limb lengths within the same sequence. SkeletonDiffusion sets a new benchmark on three real-world datasets, outperforming various baselines across multiple evaluation metrics.
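The nonisotropic idea can be sketched as follows: build a joint covariance from the skeleton's adjacency so that sampled noise is more correlated between connected joints. This is an illustrative construction (toy skeleton, Laplacian-based covariance), not SkeletonDiffusion's exact diffusion formulation.

```python
# Hedged sketch of nonisotropic joint noise: noise correlated along kinematic
# connections, built from a toy skeleton graph (illustration only).
import numpy as np

def skeleton_covariance(edges, num_joints, strength=0.5):
    adj = np.zeros((num_joints, num_joints))
    for i, j in edges:
        adj[i, j] = adj[j, i] = 1.0
    laplacian = np.diag(adj.sum(axis=1)) - adj
    # Precision grows with the graph Laplacian; its inverse yields a covariance
    # whose correlations decay with graph distance between joints.
    precision = np.eye(num_joints) + strength * laplacian
    return np.linalg.inv(precision)

edges = [(0, 1), (1, 2), (2, 3), (1, 4), (4, 5)]   # toy 6-joint chain with a branch
cov = skeleton_covariance(edges, num_joints=6)
noise = np.random.multivariate_normal(np.zeros(6), cov, size=(16,))  # (16, 6)
# Adjacent joints typically end up more correlated than distant ones.
print(noise.shape, round(cov[0, 1], 3), round(cov[0, 5], 3))
```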
Poster
Jianwei Tang · Hong Yang · Tengyue Chen · Jian-Fang Hu

[ ExHall D ]

Abstract
Action-driven stochastic human motion prediction aims to generate future motion sequences of a pre-defined target action based on given past observed sequences performing non-target actions. This task primarily presents two challenges. Firstly, generating smooth transition motions is hard because transition speeds vary across actions. Secondly, action characteristics are difficult to learn because some actions are similar to one another. These issues cause the predicted results to be unreasonable and inconsistent. We therefore propose two memory banks, the Soft-transition Action Bank (STAB) and the Action Characteristic Bank (ACB), to tackle the problems above. The STAB stores action transition information. It is equipped with a novel soft searching approach, which encourages the model to focus on multiple possible action categories of observed motions. The ACB records action characteristics, which provide more prior information for predicting a certain action. To better fuse the features retrieved from the two banks, we further propose the Adaptive Attention Adjustment (AAA) strategy. Extensive experiments on four motion prediction datasets demonstrate that our approach consistently outperforms previous state-of-the-art methods.
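A minimal sketch of the "soft searching" retrieval: rather than committing to a single bank entry, the query attends over all entries and retrieves a softmax-weighted mixture. Bank size and feature dimensions below are illustrative, not the paper's.

```python
# Soft retrieval over a memory bank (illustrative sketch).
import torch
import torch.nn.functional as F

def soft_retrieve(query, bank_keys, bank_values, temperature=0.1):
    # query: (B, D); bank_keys/bank_values: (N, D)
    scores = query @ bank_keys.t() / temperature   # (B, N)
    weights = F.softmax(scores, dim=-1)            # soft weighting over all entries
    return weights @ bank_values                   # (B, D) mixture of bank features

bank_keys = torch.randn(12, 128)    # e.g. one key per action category
bank_values = torch.randn(12, 128)
query = torch.randn(4, 128)         # encoded observed motion
retrieved = soft_retrieve(query, bank_keys, bank_values)
print(retrieved.shape)  # torch.Size([4, 128])
```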
Poster
Jiayi Su · Youhe Feng · Zheng Li · Jinhua Song · Yangfan He · Botao Ren · Botian Xu

[ ExHall D ]

Abstract
This paper presents a novel framework for modeling and conditional generation of 3D articulated objects. Constrained by flexibility-quality trade-offs, existing methods are often limited to using predefined structures or retrieving shapes from static datasets. To address these challenges, we parameterize an articulated object as a tree of tokens and employ a transformer to generate both the object's high-level geometry code and its kinematic relations. Subsequently, each sub-part's geometry is further decoded using a signed-distance-function (SDF) shape prior, facilitating the synthesis of high-quality 3D shapes. Our approach enables the generation of diverse objects with high-quality geometry and a varying number of parts. Comprehensive experiments on conditional generation from text descriptions demonstrate the effectiveness and flexibility of our method.
Poster
Dian Shao · Mingfei Shi · Shengda Xu · Haodong Chen · Yongle Huang · Binglu Wang

[ ExHall D ]

Abstract
Although remarkable progress has been achieved in video generation, synthesizing physically plausible human actions remains an unresolved challenge, especially when addressing fine-grained semantics and complex temporal dynamics. For instance, generating gymnastics routines such as “two turns on one leg with the free leg optionally below horizontal” poses substantial difficulties for current video generation methods, which often fail to produce satisfactory results. To address this, we propose FinePhys, a Fine-grained human action generation framework incorporating Physics for effective skeletal guidance. Specifically, FinePhys first performs online 2D pose estimation and then accomplishes dimension lifting through in-context learning. Recognizing that such data-driven 3D pose estimations may lack stability and interpretability, we incorporate a physics-based module that re-estimates motion dynamics using Euler-Lagrange equations, calculating joint accelerations bidirectionally across the temporal dimension. The physically predicted 3D poses are then fused with data-driven poses to provide multi-scale 2D heatmap-based guidance for the video generation process. Evaluated on three fine-grained action subsets from FineGym (FX-JUMP, FX-TURN, and FX-SALTO), FinePhys significantly outperforms competitive baselines. Comprehensive qualitative results further demonstrate FinePhys's ability to generate more natural and plausible fine-grained human actions.
Poster
Ting-Hsuan Liao · Yi Zhou · Yu Shen · Chun-Hao P. Huang · Saayan Mitra · Jia-Bin Huang · Uttaran Bhattacharya

[ ExHall D ]

Abstract
We explore how body shapes influence human motion synthesis, an aspect often overlooked in existing text-to-motion generation methods due to the ease of learning a homogenized, canonical body shape. However, this homogenization can distort the natural correlations between different body shapes and their motion dynamics. Our method addresses this gap by generating body-shape-aware human motions from natural language prompts. We utilize a finite scalar quantization-based variational autoencoder (FSQ-VAE) to quantize motion into discrete tokens and then leverage continuous body shape information to de-quantize these tokens back into continuous, detailed motion. Additionally, we harness the capabilities of a pretrained language model to predict both continuous shape parameters and motion tokens, facilitating the synthesis of text-aligned motions and decoding them into shape-aware motions. We evaluate our method quantitatively and qualitatively, and also conduct a comprehensive perceptual study to demonstrate its efficacy in generating shape-aware motions.
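Finite scalar quantization itself is simple to sketch: each latent channel is squashed into a bounded range, rounded to a small fixed set of levels, and trained with a straight-through gradient. The level counts below are illustrative and not the paper's configuration.

```python
# Hedged sketch of finite scalar quantization (FSQ) with a straight-through estimator.
import torch

def fsq(z, levels=(8, 8, 8, 5, 5)):
    # z: (..., len(levels)) continuous latents, one channel per level count.
    L = torch.tensor(levels, dtype=z.dtype, device=z.device)
    half = (L - 1) / 2
    bounded = torch.tanh(z) * half      # map each channel into [-half, half]
    quantized = torch.round(bounded)
    # Forward uses rounded values; backward passes gradients through unchanged.
    return bounded + (quantized - bounded).detach()

z = torch.randn(2, 16, 5, requires_grad=True)   # (batch, time, channels)
tokens = fsq(z)
tokens.sum().backward()
print(tokens.shape, z.grad is not None)
```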
Poster
Xuan Wang · Kai Ruan · Xing Zhang · Gaoang Wang

[ ExHall D ]

Abstract
Text-driven motion generation has made significant strides in recent years. However, most work has primarily focused on human motion, largely ignoring the rich and diverse behaviors exhibited by animals. Modeling animal motion has important applications in wildlife conservation, animal ecology, and biomechanics. Animal motion modeling presents unique challenges due to species diversity, varied morphological structures, and different behavioral patterns in response to identical textual descriptions.To address these challenges, we propose \textbf{AniMo} for text-driven animal motion generation. AniMo consists of two stages: motion tokenization and text-to-motion generation. In the motion tokenization stage, we encode motions using a joint-aware spatiotemporal encoder with species-aware feature adaptive modulation, enabling the model to adapt to diverse skeletal structures across species. In the text-to-motion generation stage, we employ masked modeling to jointly learn the mappings between textual descriptions and motion tokens.Additionally, we introduce AniMo4D, a large-scale dataset containing 78,149 motion sequences and 185,435 textual descriptions across 114 animal species.Experimental results show that AniMo achieves superior performance on both the AniMo4D and AnimalML3D datasets, effectively capturing diverse morphological structures and behavioral patterns across various animal species.
Poster
Yifeng Ma · Jinwei Qi · Chaonan Ji · Peng Zhang · Bang Zhang · Zhidong Deng · Liefeng Bo

[ ExHall D ]

Abstract
This paper introduces a new control signal for facial motion generation: timeline control. Compared to audio and text signals, timelines provide more fine-grained control, such as generating specific facial motions with precise timing. Users can specify a multi-track timeline of facial actions arranged in temporal intervals, allowing precise control over the timing of each action. To model the timeline control capability, We first annotate the time intervals of facial actions in natural facial motion sequences at a frame-level granularity. This process is facilitated by Toeplitz Inverse Covariance-based Clustering to minimize human labor. Based on the annotations, we propose a diffusion-based generation model capable of generating facial motions that are natural and accurately aligned with input timelines. Our method supports text-guided motion generation by using ChatGPT to convert text into timelines. Experimental results show that our method can annotate facial action intervals with satisfactory accuracy, and produces natural facial motions accurately aligned with timelines.
Poster
Ruineng Li · Daitao Xing · Huiming Sun · Yuanzhou Ha · Jinglin Shen · Chiuman Ho

[ ExHall D ]

Abstract
Human-centric motion control in video generation remains a critical challenge, particularly when jointly controlling camera movements and human poses in scenarios like the iconic Grammy Glambot moment. While recent video diffusion models have made significant progress, existing approaches struggle with limited motion representations and inadequate integration of camera and human motion controls. In this work, we present TokenMotion, the first DiT-based video diffusion framework that enables fine-grained control over camera motion, human motion, and their joint interaction. We represent camera trajectories and human poses as spatio-temporal tokens to enable local control granularity. Our approach introduces a unified modeling framework utilizing a decouple-and-fuse strategy, bridged by a human-aware dynamic mask that effectively handles the spatially varying nature of combined motion signals. Through extensive experiments, we demonstrate TokenMotion's effectiveness across both text-to-video and image-to-video paradigms, consistently outperforming current state-of-the-art methods in human-centric motion control tasks. Our work represents a significant advancement in controllable video generation, with particular relevance for creative production applications ranging from films to short-form content.
Poster
Inès Hyeonsu Kim · Seokju Cho · Gabriel Huang · Jung Yi · Joon-Young Lee · Seungryong Kim

[ ExHall D ]

Abstract
Point tracking in videos is a fundamental task with applications in robotics, video editing, and more. While many vision tasks benefit from pre-trained feature backbones to improve generalizability, point tracking has primarily relied on simpler backbones trained from scratch on synthetic data, which may limit robustness in real-world scenarios. Additionally, point tracking requires temporal awareness to ensure coherence across frames, but using temporally-aware features is still underexplored. Most current methods employ a two-stage process: an initial coarse prediction followed by a refinement stage to inject temporal information and correct errors from the coarse stage. This approach, however, is computationally expensive and potentially redundant if the feature backbone itself captures sufficient temporal information. In this work, we introduce Chrono, a feature backbone specifically designed for point tracking with built-in temporal awareness. Leveraging pre-trained representations from the self-supervised learner DINOv2 and enhanced with a temporal adapter, Chrono effectively captures long-term temporal context, enabling precise prediction even without the refinement stage. Experimental results demonstrate that Chrono achieves state-of-the-art performance in a refiner-free setting on the TAP-Vid-DAVIS and TAP-Vid-Kinetics datasets, among common feature backbones used in point tracking as well as DINOv2, with exceptional efficiency. Code and pre-trained weights will be publicly available.
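A temporal adapter of the general kind described here can be sketched as a lightweight residual module that convolves frozen per-frame features over the time axis. The code below is an assumption-laden illustration (feature sizes and the depthwise-conv design are made up), not Chrono's actual adapter.

```python
# Illustrative temporal adapter: frozen per-frame tokens (e.g. from DINOv2) are
# refined with a small temporal convolution and added back residually.
import torch
import torch.nn as nn

class TemporalAdapter(nn.Module):
    def __init__(self, dim=768, bottleneck=128, kernel_size=3):
        super().__init__()
        self.down = nn.Conv1d(dim, bottleneck, kernel_size=1)
        self.temporal = nn.Conv1d(bottleneck, bottleneck, kernel_size,
                                  padding=kernel_size // 2, groups=bottleneck)
        self.up = nn.Conv1d(bottleneck, dim, kernel_size=1)

    def forward(self, feats):
        # feats: (B, T, N, D) per-frame patch tokens.
        B, T, N, D = feats.shape
        x = feats.permute(0, 2, 3, 1).reshape(B * N, D, T)   # convolve over time
        x = self.up(torch.relu(self.temporal(self.down(x))))
        x = x.reshape(B, N, D, T).permute(0, 3, 1, 2)
        return feats + x                                      # residual refinement

adapter = TemporalAdapter()
out = adapter(torch.randn(1, 8, 196, 768))
print(out.shape)  # torch.Size([1, 8, 196, 768])
```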
Poster
Yuhong Zhang · Guanlin Wu · Ling-Hao Chen · Zhuokai Zhao · Jing Lin · Xiaoke Jiang · Jiamin WU · Zhuoheng Li · Hao Frank Yang · Haoqian Wang · Lei Zhang

[ ExHall D ]

Abstract
In this paper, we present a novel framework designed to reconstruct long-sequence 3D human motion in world coordinates from in-the-wild videos with multiple shot transitions. Such long-sequence in-the-wild motions are highly valuable for applications such as motion generation and motion understanding, but are challenging to recover due to abrupt shot transitions, partial occlusions, and dynamic backgrounds present in such videos. Existing methods primarily focus on single-shot videos, where continuity is maintained within a single camera view, or simplify multi-shot alignment in camera space only. In this work, we tackle these challenges by integrating an enhanced camera pose estimation with Human Motion Recovery (HMR), incorporating a shot transition detector and a robust alignment module for accurate pose and orientation continuity across shots. By leveraging a custom motion integrator, we effectively mitigate the problem of foot sliding and ensure temporal consistency in human pose. Extensive evaluations on our multi-shot dataset, created from public 3D human datasets, demonstrate the robustness of our method in reconstructing realistic human motion in world coordinates.
Poster
Daikun Liu · Lei Cheng · Teng Wang · Changyin Sun

[ ExHall D ]

Abstract
Recent learning-based methods for event-based optical flow estimation utilize cost volumes for pixel matching but suffer from redundant computations and limited scalability to higher resolutions for flow refinement. In this work, we take advantage of the complementarity between temporally dense feature differences of adjacent event frames and cost volume and present a lightweight event-based optical flow network (EDCFlow) to achieve high-quality flow estimation at a higher resolution. Specifically, an attention-based multi-scale temporal feature difference layer is developed to capture diverse motion patterns at high resolution in a computation-efficient manner. An adaptive fusion of high-resolution difference motion features and low-resolution correlation motion features is performed to enhance motion representation and model generalization. Notably, EDCFlow can serve as a plug-and-play refinement module for RAFT-like event-based methods to enhance flow details. Extensive experiments demonstrate that EDCFlow achieves a better model accuracy-complexity trade-off compared to existing methods, offering superior generalization.
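The temporal feature-difference cue can be illustrated with a toy layer: differences between features of adjacent event frames are weighted by a learned attention map and summed. This simplified sketch omits EDCFlow's multi-scale design and its fusion with the cost volume.

```python
# Simplified temporal feature-difference layer (illustration, not EDCFlow).
import torch
import torch.nn as nn

class TemporalDifferenceLayer(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 1, 1), nn.Sigmoid(),
        )

    def forward(self, feats):
        # feats: (B, T, C, H, W) features of consecutive event frames.
        diffs = feats[:, 1:] - feats[:, :-1]                 # (B, T-1, C, H, W)
        B, Tm1, C, H, W = diffs.shape
        flat = diffs.reshape(B * Tm1, C, H, W)
        weights = self.attn(flat).reshape(B, Tm1, 1, H, W)   # per-location attention
        return (diffs * weights).sum(dim=1)                  # aggregated motion feature

layer = TemporalDifferenceLayer()
motion = layer(torch.randn(2, 5, 64, 48, 64))
print(motion.shape)  # torch.Size([2, 64, 48, 64])
```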
Poster
Eric Kee · Adam Pikielny · Kevin Blackburn-Matzen · Marc Levoy

[ ExHall D ]

Abstract
We describe a system to remove real-world reflections from images for consumer photography. Our system operates on linear (RAW) photos, and accepts an optional contextual photo looking in the opposite direction (e.g., the "selfie" camera on a mobile device). This optional photo helps disambiguate what should be considered the reflection. The system is trained solely on synthetic mixtures of real-world RAW images, which we combine using a reflection simulation that is photometrically and geometrically accurate. Our system comprises a base model that accepts the captured photo and optional context photo as input, and runs at 256p, followed by an up-sampling model that transforms 256p images to full resolution. The system can produce images for review at 1K in 4.5 to 6.5 seconds on a MacBook or iPhone 14 Pro. We test on RAW photos that were captured in the field and embody typical consumer photos, and show that our RAW-image simulation yields SOTA performance.
Poster
Yan Zaoming · Pengcheng Lei · Tingting Wang · Faming Fang · Junkang Zhang · Yaomin Huang · Haichuan Song

[ ExHall D ]

Abstract
Blurry video frame interpolation (BVFI), which aims to generate high-frame-rate clear videos from low-frame-rate blurry inputs, is a challenging yet significant task in computer vision. Current state-of-the-art approaches typically rely on linear or quadratic models to estimate intermediate motion. However, these methods often overlook depth-related changes, such as object size and viewing angle variations, which occur as objects move through the scene. This paper proposes a novel approach to addressing this challenge by leveraging differential curves that can describe motion velocity and depth changes. Specifically, we introduce an explicit framework, termed Differential Curve-guided Blurry Video Multi-Frame Interpolation (DC-BMFI), to eliminate the effects of depth variation caused by object motion on the BVFI task. In contrast to existing methods that utilize optical flow for 2D awareness, we introduce an MPNet submodule within the DC-BMFI framework to estimate 3D scene flow, thereby enhancing the awareness of depth and velocity changes. To estimate the 3D scene flow from video frames, we propose a submodule termed UBNet to transform video frames into 3D camera-space point maps and then estimate scene flow between point maps. Extensive experiments demonstrate that the proposed DC-BMFI surpasses state-of-the-art performance on simulated and real-world datasets.
Poster
Wenbo Hu · Xiangjun Gao · Xiaoyu Li · Sijie Zhao · Xiaodong Cun · Yong Zhang · Long Quan · Ying Shan

[ ExHall D ]

Abstract
Estimating video depth in open-world scenarios is challenging due to the diversity of videos in appearance, content motion, camera movement, and length. We present DepthCrafter, an innovative method for generating temporally consistent long depth sequences with intricate details for open-world videos, without requiring any supplementary information such as camera poses or optical flow. The generalization ability to open-world videos is achieved by training the video-to-depth model from a pre-trained image-to-video diffusion model, through our meticulously designed three-stage training strategy. Our training approach enables the model to generate depth sequences with variable lengths at one time, up to 110 frames, and harvest both precise depth details and rich content diversity from realistic and synthetic datasets. We also propose an inference strategy that can process extremely long videos through segment-wise estimation and seamless stitching. Comprehensive evaluations on multiple datasets reveal that DepthCrafter achieves state-of-the-art performance in open-world video depth estimation under zero-shot settings. Furthermore, DepthCrafter facilitates various downstream applications, including depth-based visual effects and conditional video generation.
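The segment-wise inference strategy for long videos can be sketched generically: estimate depth over overlapping windows and blend the overlaps when stitching. The code below illustrates this pattern with a toy per-segment estimator; it is not DepthCrafter's actual inference procedure.

```python
# Generic overlapping-window inference with linear blending (illustration only).
import numpy as np

def stitch_segments(num_frames, estimate, window=110, overlap=25):
    depth = np.zeros(num_frames)
    weight = np.zeros(num_frames)
    start = 0
    while start < num_frames:
        end = min(start + window, num_frames)
        segment = estimate(start, end)          # per-frame depth for [start, end)
        ramp = np.ones(end - start)
        if start > 0:                           # fade the new segment in over the overlap
            k = min(overlap, end - start)
            ramp[:k] = np.linspace(0.0, 1.0, k)
        depth[start:end] += segment * ramp
        weight[start:end] += ramp
        if end == num_frames:
            break
        start = end - overlap
    return depth / np.maximum(weight, 1e-8)

# Toy "model": constant depth per segment, so the blended transitions are visible.
fake_estimate = lambda s, e: np.full(e - s, fill_value=float(s))
stitched = stitch_segments(300, fake_estimate)
print(stitched.shape, stitched[:3], stitched[-3:])
```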
Poster
Baorui Ma · Huachen Gao · Haoge Deng · Zhengxiong Luo · Tiejun Huang · Lulu Tang · Xinlong Wang

[ ExHall D ]

Abstract
Recent 3D generation models typically rely on limited-scale 3D `gold-labels' or 2D diffusion priors for 3D content creation. However, their performance is upper-bounded by constrained 3D priors due to the lack of scalable learning paradigms. In this work, we present See3D, a visual-conditional multi-view diffusion model trained on large-scale Internet videos for open-world 3D creation. The model aims to Get 3D knowledge by solely Seeing the visual contents from the vast and rapidly growing video data --- You See it, You Got it. To achieve this, we first scale up the training data using a proposed data curation pipeline that automatically filters out multi-view inconsistencies and insufficient observations from source videos. This results in a high-quality, richly diverse, large-scale dataset of multi-view images, termed WebVi3D, containing 320M frames from 16M video clips. Nevertheless, learning generic 3D priors from videos without explicit 3D geometry or camera pose annotations is nontrivial, and annotating poses for web-scale videos is prohibitively expensive. To eliminate the need for pose conditions, we introduce an innovative visual-condition - a purely 2D-inductive visual signal generated by adding time-dependent noise to the masked video data. Finally, we introduce a novel visual-conditional 3D generation framework by integrating See3D into a …
Poster
Daniel Geng · Charles Herrmann · Junhwa Hur · Forrester Cole · Serena Zhang · Tobias Pfaff · Tatiana Lopez-Guevara · Yusuf Aytar · Michael Rubinstein · Chen Sun · Oliver Wang · Andrew Owens · Deqing Sun

[ ExHall D ]

Abstract
Motion control is crucial for generating expressive and compelling video content; however, most existing video generation models rely mainly on text prompts for control, which struggle to capture the nuances of dynamic actions and temporal compositions. To this end, we train a video generation model conditioned on spatio-temporally sparse _or_ dense motion trajectories. In contrast to prior motion conditioning work, this flexible representation can encode any number of trajectories, object-specific or global scene motion, and temporally sparse motion; due to its flexibility we refer to this conditioning as _motion prompts_. While users may directly specify sparse trajectories, we also show how to translate high-level user requests into detailed, semi-dense motion prompts, a process we term _motion prompt expansion_. We demonstrate the versatility of our approach through various applications, including camera and object motion control, "interacting" with an image, motion transfer, and image editing. Our results showcase emergent behaviors, such as realistic physics, suggesting the potential of motion prompts for probing video models and interacting with future generative world models. Finally, we evaluate quantitatively, conduct a human study, and demonstrate strong performance.
Poster
Ryan Burgert · Yuancheng Xu · Wenqi Xian · Oliver Pilarski · Pascal Clausen · Mingming He · Li Ma · Yitong Deng · Lingxiao Li · Mohsen Mousavi · Michael Ryoo · Paul Debevec · Ning Yu

[ ExHall D ]

Abstract
Generative modeling aims to transform chaotic noise into structured outputs that align with training data distributions. In this work, we enhance video diffusion generative models by introducing motion control as a structured component within latent space sampling. Specifically, we propose a novel real-time noise warping method that replaces random temporal Gaussianity with correlated warped noise derived from optical flow fields, enabling fine-grained motion control independent of model architecture and guidance type. We fine-tune modern video diffusion base models and provide a unified paradigm for a wide range of user-friendly motion control: local object motion control, global camera movement control, and motion transfer. By leveraging a real-time noise-warping algorithm that preserves spatial Gaussianity while efficiently maintaining temporal consistency, we enable flexible and diverse motion control applications with minimal trade-offs in pixel quality and temporal coherence. Extensive experiments and user studies demonstrate the advantages of our method in terms of visual quality, motion controllability, and temporal consistency, making it a robust and scalable solution for motion-controllable video synthesis.
Poster
Karran Pandey · Yannick Hold-Geoffroy · Matheus Gadelha · Niloy J. Mitra · Karan Singh · Paul Guerrero

[ ExHall D ]

Abstract
Predicting diverse object motions from a single static image remains challenging, as current video generation models often entangle object movement with camera motion and other scene changes. While recent methods can predict specific motions from motion arrow input, they rely on synthetic data and predefined motions, limiting their application to complex scenes. We introduce Motion Modes, a training-free approach that explores a pre-trained image-to-video generator’s latent distribution to discover various distinct and plausible motions focused on selected objects in static images. We achieve this by employing a flow generator guided by energy functions designed to disentangle object and camera motion. Additionally, we use an energy inspired by particle guidance to diversify the generated motions, without requiring explicit training data. Experimental results demonstrate that Motion Modes generates realistic and varied object animations, surpassing previous methods and sometimes even human predictions regarding plausibility and diversity. Code will be released upon acceptance.
Poster
Wonjoon Jin · Qi Dai · Chong Luo · Seung-Hwan Baek · Sunghyun Cho

[ ExHall D ]

Abstract
This paper presents FloVD, a novel optical-flow-based video diffusion model for camera-controllable video generation. FloVD leverages optical flow maps to represent motions of the camera and moving objects. Since optical flow can be directly estimated from videos, this approach allows for the use of arbitrary training videos without ground-truth camera parameters. To synthesize natural object motion while supporting detailed camera control, we divide the camera-controllable video generation task into two sub-problems: object motion synthesis and flow-conditioned video synthesis. Specifically, we introduce an object motion synthesis model along with 3D-structure-based warping to generate foreground and background motions, which are fed into the flow-conditioned video diffusion model as conditional input, enabling camera-controlled video synthesis. Extensive experiments demonstrate the superiority of our method over previous approaches in terms of accurate camera control and natural object motion synthesis.
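Flow-conditioned synthesis builds on standard flow-based warping, which is easy to sketch: sample the source frame at locations displaced by the flow field. The snippet below shows this generic backward-warping building block, not FloVD itself.

```python
# Generic backward warping with an optical-flow field (standard building block).
import torch
import torch.nn.functional as F

def warp_with_flow(frame, flow):
    # frame: (B, C, H, W); flow: (B, 2, H, W) in pixels, channels = (dx, dy).
    B, _, H, W = frame.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack([xs, ys], dim=0).float().unsqueeze(0).to(frame)  # (1, 2, H, W)
    coords = grid + flow
    # Normalize sampling coordinates to [-1, 1] for grid_sample.
    coords_x = 2.0 * coords[:, 0] / (W - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (H - 1) - 1.0
    sample_grid = torch.stack([coords_x, coords_y], dim=-1)             # (B, H, W, 2)
    return F.grid_sample(frame, sample_grid, align_corners=True)

frame = torch.rand(1, 3, 64, 64)
flow = torch.zeros(1, 2, 64, 64)
flow[:, 0] = 2.0                      # sample from 2 px to the right of each location
warped = warp_with_flow(frame, flow)
print(warped.shape)  # torch.Size([1, 3, 64, 64])
```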
Poster
David Junhao Zhang · Roni Paiss · Shiran Zada · Nikhil Karnad · David E. Jacobs · Yael Pritch · Inbar Mosseri · Mike Zheng Shou · Neal Wadhwa · Nataniel Ruiz

[ ExHall D ]

Abstract
Recently, breakthroughs in video modeling have allowed for controllable camera trajectories in generated videos. However, these methods cannot be directly applied to user-provided videos that are not generated by a video model. In this paper, we present ReCapture, a method for generating new videos with novel camera trajectories from a single user-provided video. Our method allows us to re-generate the reference video, with all its existing scene motion, from vastly different angles and with cinematic camera motion. Notably, using our method we can also plausibly hallucinate parts of the scene that were not observable in the reference video. Our method works by (1) generating a noisy anchor video with a new camera trajectory using multiview diffusion models or depth-based point cloud rendering and then (2) regenerating the anchor video into a clean and temporally consistent reangled video using our proposed masked video fine-tuning technique.
Poster
Zhenghao Zhang · Junchao Liao · Menghao Li · Zuozhuo Dai · Bingxue Qiu · Siyu Zhu · Long Qin · Weizhi Wang

[ ExHall D ]

Abstract
Recent advancements in Diffusion Transformer (DiT) have demonstrated remarkable proficiency in producing high-quality video content. Nonetheless, the potential of transformer-based diffusion models for effectively generating videos with controllable motion remains an area of limited exploration. This paper introduces Tora, the first trajectory-oriented DiT framework that integrates textual, visual, and trajectory conditions concurrently for video generation. Specifically, Tora consists of a Trajectory Extractor (TE), a Spatial-Temporal DiT, and a Motion-guidance Fuser (MGF). The TE encodes arbitrary trajectories into hierarchical spacetime motion patches with a 3D video compression network. The MGF integrates the motion patches into the DiT blocks to generate consistent videos following trajectories. Our design aligns seamlessly with DiT’s scalability, allowing precise control of video content’s dynamics with diverse durations, aspect ratios, and resolutions. Extensive experiments demonstrate Tora’s excellence in achieving high motion fidelity, while also meticulously simulating the movement of the physical world.
Poster
Shengzhi Wang · Yingkang Zhong · Jiangchuan Mu · Kai WU · Mingliang Xiong · Wen Fang · Mingqing Liu · Hao Deng · Bin He · Gang Li · Qingwen Liu

[ ExHall D ]

Abstract
Due to control limitations in the denoising process and the lack of training, zero-shot video editing methods often struggle to meet user instructions, resulting in generated videos that are visually unappealing and fail to fully satisfy expectations. To address this problem, we propose Align-A-Video, a video editing pipeline that incorporates human feedback through reward fine-tuning. Our approach consists of two key steps: 1) Deterministic Reward Fine-tuning. To reduce optimization costs for expected noise distributions, we propose a deterministic reward tuning strategy. This method improves tuning stability by increasing sample determinism, allowing the tuning process to be completed in minutes; 2) Feature Propagation Across Frames. We optimize a selected anchor frame and propagate its features to the remaining frames, improving both visual quality and semantic fidelity. This approach avoids temporal consistency degradation from reward optimization. Extensive qualitative and quantitative experiments confirm the effectiveness of using reward fine-tuning in Align-A-Video, significantly improving the overall quality of video generation.
Poster
Xin Jin · Simon Niklaus · Zhoutong Zhang · Zhihao Xia · Chun-Le Guo · Yuting Yang · Jiawen Chen · Chongyi Li

[ ExHall D ]

Abstract
Denoising is a crucial step in many video processing pipelines such as in interactive editing, where high quality, speed, and user control are essential. While recent approaches achieve significant improvements in denoising quality by leveraging deep learning, they are prone to unexpected failures due to discrepancies between training data distributions and the wide variety of noise patterns found in real-world videos. These methods also tend to be slow and lack user control. In contrast, traditional denoising methods perform reliably on in-the-wild videos and run relatively quickly on modern hardware. However, they require manually tuning parameters for each input video, which is not only tedious but also requires skill. We bridge the gap between these two paradigms by proposing a differentiable denoising pipeline based on traditional methods. A neural network is then trained to predict the optimal denoising parameters for each specific input, resulting in a robust and efficient approach that also supports user control.
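The core idea, a classical denoiser made differentiable so that a network can predict its parameters per input, can be sketched with a toy operator. Below, a Gaussian blur whose strength is predicted by a tiny CNN is trained end-to-end; the paper's pipeline builds on a real traditional video denoiser rather than this toy.

```python
# Toy differentiable "traditional" denoiser with network-predicted parameters.
import torch
import torch.nn as nn

def gaussian_blur(x, sigma, kernel_size=9):
    # Differentiable w.r.t. sigma: the kernel is built from sigma inside the graph.
    coords = torch.arange(kernel_size, device=x.device) - kernel_size // 2
    kernel_1d = torch.exp(-coords.float() ** 2 / (2 * sigma ** 2))
    kernel_1d = kernel_1d / kernel_1d.sum()
    kernel = (kernel_1d[:, None] * kernel_1d[None, :]).unsqueeze(0).unsqueeze(0)
    kernel = kernel.repeat(x.shape[1], 1, 1, 1)            # one kernel per channel
    return nn.functional.conv2d(x, kernel, padding=kernel_size // 2, groups=x.shape[1])

class ParameterPredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 1))

    def forward(self, noisy):
        return nn.functional.softplus(self.net(noisy)) + 0.1   # positive blur strength

predictor = ParameterPredictor()
noisy = torch.rand(1, 3, 64, 64)
sigma = predictor(noisy).squeeze()
denoised = gaussian_blur(noisy, sigma)
loss = (denoised - torch.rand_like(denoised)).abs().mean()     # e.g. L1 to a clean target
loss.backward()                                                # gradients reach the predictor
print(denoised.shape, sigma.item() > 0)
```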
Poster
Yifan Bian · Chuanbo Tang · Li Li · Dong Liu

[ ExHall D ]

Abstract
Most Neural Video Codecs (NVCs) only employ temporal references to generate temporal-only contexts and latent prior. These temporal-only NVCs fail to handle large motions or emerging objects due to limited contexts and misaligned latent prior. To relieve these limitations, we propose a Spatially Embedded Video Codec (SEVC), in which the low-resolution video is compressed for spatial references. Firstly, our SEVC leverages both spatial and temporal references to generate augmented motion vectors and hybrid spatial-temporal contexts. Secondly, to address the misalignment issue in the latent prior and enrich the prior information, we introduce a spatial-guided latent prior augmented by multiple temporal latent representations. At last, we design a joint spatial-temporal optimization to learn quality-adaptive bit allocation for spatial references, further boosting rate-distortion performance. Experimental results show that our SEVC effectively alleviates the limitations in handling large motions or emerging objects, and also reduces the bitrate by 11.9% relative to the previous state-of-the-art NVC while providing an additional low-resolution bitstream.
Poster
Zihao Zhang · Haoran Chen · Haoyu Zhao · Guansong Lu · Yanwei Fu · Hang Xu · Zuxuan Wu

[ ExHall D ]

Abstract
Handling complex or nonlinear motion patterns has long posed challenges for video frame interpolation. Although recent advances in diffusion-based methods offer improvements over traditional optical flow-based approaches, they still struggle to generate sharp, temporally consistent frames in scenarios with large motion. To address this limitation, we introduce EDEN, an Enhanced Diffusion for high-quality large-motion vidEo frame iNterpolation. Our approach first utilizes a transformer-based tokenizer to produce refined latent representations of the intermediate frames for diffusion models. We then enhance the diffusion transformer with temporal attention across the process and incorporate a start-end frame difference embedding to guide the generation of dynamic motion. Extensive experiments demonstrate that EDEN achieves state-of-the-art results across popular benchmarks, including nearly a 10% LPIPS reduction on DAVIS and SNU-FILM, and an 8% improvement on DAIN-HD.
Poster
Yuantong Zhang · Zhenzhong Chen

[ ExHall D ]

Abstract
Space-time video resampling aims to conduct both spatial-temporal downsampling and upsampling processes to achieve high-quality video reconstruction. Although there has been much progress, some major challenges still exist, such as how to preserve motion information during temporal resampling while avoiding blurring artifacts, and how to achieve flexible temporal and spatial resampling factors. In this paper, we introduce an Invertible Motion Steganography Module (IMSM), designed to preserve motion information in high-frame-rate videos. This module embeds motion information from high-frame-rate videos into downsampled frames with lower frame rates in a visually imperceptible manner. Its reversible nature allows the motion information to be recovered, facilitating the reconstruction of high-frame-rate videos. Furthermore, we propose a 3D implicit feature modulation technique that enables continuous spatiotemporal resampling. With tailored training strategies, our method supports flexible frame rate conversions, including non-integer changes like 30 FPS to 24 FPS and vice versa. Extensive experiments show that our method significantly outperforms existing solutions across multiple datasets in various video resampling tasks with high flexibility. Codes will be made available upon publication.
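Invertibility of the kind the IMSM relies on can be illustrated with a simple additive coupling: information added in the forward pass can be subtracted out exactly in the inverse pass. The module below is a toy demonstration of that mechanism, not the paper's learned steganography module.

```python
# Toy invertible additive coupling: the embedded signal is exactly recoverable.
import torch
import torch.nn as nn

class AdditiveCoupling(nn.Module):
    def __init__(self, channels=3):
        super().__init__()
        self.f = nn.Conv2d(channels, channels, 3, padding=1)   # acts on the motion branch
        self.g = nn.Conv2d(channels, channels, 3, padding=1)   # acts on the image branch

    def forward(self, image, motion):
        stego = image + self.f(motion)          # motion hidden inside the frame
        motion_out = motion + self.g(stego)
        return stego, motion_out

    def inverse(self, stego, motion_out):
        motion = motion_out - self.g(stego)     # exact recovery, by construction
        image = stego - self.f(motion)
        return image, motion

coupling = AdditiveCoupling()
image, motion = torch.rand(1, 3, 32, 32), torch.rand(1, 3, 32, 32)
stego, m_out = coupling(image, motion)
image_rec, motion_rec = coupling.inverse(stego, m_out)
print(torch.allclose(image_rec, image, atol=1e-5),
      torch.allclose(motion_rec, motion, atol=1e-5))
```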
Poster
Xingguang Zhang · Nicholas M Chimitt · Xijun Wang · Yu Yuan · Stanley H. Chan

[ ExHall D ]

Abstract
Atmospheric turbulence is a major source of image degradation in long-range imaging systems. Although numerous deep learning-based turbulence mitigation (TM) methods have been proposed, many are slow, memory-hungry, and do not generalize well. In the spatial domain, methods based on convolutional operators have a limited receptive field, so they cannot handle a large spatial dependency required by turbulence. In the temporal domain, methods relying on self-attention can, in theory, leverage the lucky effects of turbulence, but their quadratic complexity makes it difficult to scale to many frames. Traditional recurrent aggregation methods face parallelization challenges. In this paper, we present a new TM method based on two concepts: (1) A turbulence mitigation network based on the Selective State Space Model (MambaTM). MambaTM provides a global receptive field in each layer across spatial and temporal dimensions while maintaining linear computational complexity. (2) Learned Latent Phase Distortion (LPD). LPD guides the state space model. Unlike classical Zernike-based representations of phase distortion, the new LPD map uniquely captures the actual effects of turbulence, significantly improving the model’s capability to estimate degradation by reducing the ill-posedness. Our proposed method exceeds current state-of-the-art networks on various synthetic and real-world TM benchmarks with significantly faster inference speed. The …
Poster
Yiran Xu · Taesung Park · Richard Zhang · Yang Zhou · Eli Shechtman · Feng Liu · Jia-Bin Huang · Difan Liu

[ ExHall D ]

Abstract
Video super-resolution (VSR) models achieve temporal consistency but often produce blurrier results than their image-based counterparts due to limited generative capacity. This prompts the question: can we adapt a generative image upsampler for VSR while preserving temporal consistency? We introduce VideoGigaGAN, a new generative VSR model that combines high-frequency detail with temporal stability, building on the large-scale GigaGAN image upsampler. Simple adaptations of GigaGAN for VSR led to flickering issues, so we propose techniques to enhance temporal consistency. We validate the effectiveness of VideoGigaGAN by comparing it with state-of-the-art VSR models on public datasets and showcasing video results with 8× upsampling.
Poster
Yunpeng Qu · Kun Yuan · Qizhi Xie · Ming Sun · Chao Zhou · Jian Wang

[ ExHall D ]

Abstract
Video Quality Assessment (VQA), which intends to predict the perceptual quality of videos, has attracted increasing attention. Due to factors like motion blur or specific distortions, the quality of different regions in a video varies. Recognizing the region-wise local quality within a video is beneficial for assessing overall quality and can guide us in adopting fine-grained enhancement or transcoding strategies. Due to the difficulty and heavy cost of annotating region-wise quality, the lack of ground-truth constraints from relevant datasets further complicates the utilization of local perception. Inspired by the Human Visual System (HVS), in which overall quality is closely related to the local texture of different regions and their visual saliency, we propose a Saliency-guided Local Perception VQA (SLP-VQA) framework, which aims to effectively assess both saliency and local texture, thereby facilitating the assessment of overall quality. Our framework extracts global visual saliency and allocates attention using FusionWindow Attention (FWA) while incorporating a Local Perception Constraint (LPC) to mitigate the reliance of regional texture perception on neighboring areas. Compared to SOTA methods, SLP-VQA obtains significant improvements across multiple scenarios on five VQA benchmarks. Furthermore, to assess local perception, we establish a new Local Perception Visual Quality (LPVQ) dataset with …
Poster
Jianyi Wang · Zhijie Lin · Meng Wei · Yang Zhao · Ceyuan Yang · Chen Change Loy · Lu Jiang

[ ExHall D ]

Abstract
Video restoration poses non-trivial challenges in maintaining fidelity while recovering temporally consistent details from unknown degradations in the wild. Despite recent advances in diffusion-based restoration, these methods often face limitations in generation capability and sampling efficiency. In this work, we present SeedVR, a diffusion transformer designed to handle real-world video restoration with arbitrary length and resolution. The core design of SeedVR lies in the shifted window attention that facilitates effective restoration on long video sequences. SeedVR further supports variable-sized windows near the boundary of both spatial and temporal dimensions, overcoming the resolution constraints of traditional window attention. Equipped with contemporary practices, including causal video autoencoder, mixed image and video training, and progressive training, SeedVR achieves highly-competitive performance on both synthetic and real-world benchmarks, as well as AI-generated videos. Extensive experiments demonstrate SeedVR's superiority over existing methods for generic video restoration.
Poster
Weichen Dai · Wu Hexing · Xiaoyang Weng · Yuxin Zheng · Yuhang Ming · Wanzeng Kong

[ ExHall D ]

Abstract
As a fundamental visual task, optical flow estimation has widespread applications in computer vision. However, it faces significant challenges under adverse lighting conditions, where low texture and noise make accurate optical flow estimation particularly difficult. In this paper, we propose an optical flow method that employs implicit image enhancement through multi-modal synergistic training. To supplement the scene information missing in the original low-quality image, we utilize a high-low frequency feature enhancement network. The enhancement network is implicitly guided by multi-modal data and the specific subsequent tasks, enabling the model to learn multi-modal knowledge that enhances feature information suitable for optical flow estimation during inference. By using RGBD multi-modal data, the proposed method avoids the reliance on images captured from the same view, a common limitation in traditional image enhancement methods. During training, the encoded features extracted from the enhanced images are synergistically supervised by features from the RGBD fusion as well as by the optical flow task. Experiments conducted on both synthetic and real datasets demonstrate that the proposed method significantly improves performance on public datasets.
Poster
Yutong Wang · Jiajie Teng · Jiajiong Cao · Yuming Li · Chenguang Ma · Hongteng Xu · Dixin Luo

[ ExHall D ]

Abstract
As a very common type of video, face videos often appear in movies, talk shows, live broadcasts, and other scenes. Real-world online videos are often plagued by degradations such as blurring and quantization noise, due to the high compression ratio caused by high communication costs and limited transmission bandwidth. These degradations have a particularly serious impact on face videos because the human visual system is highly sensitive to facial details. Despite the significant advancement in video face enhancement, current methods still suffer from i) long processing time and ii) inconsistent spatial-temporal visual effects (e.g., flickering). This study proposes a novel and efficient blind video face enhancement method to overcome the above two challenges, restoring high-quality videos from their compressed low-quality versions with an effective de-flickering mechanism. In particular, the proposed method develops upon a 3D-VQGAN backbone associated with spatial-temporal codebooks recording high-quality portrait features and residual-based temporal information. We develop a two-stage learning framework for the model. In Stage I, we learn the model with a regularizer mitigating the codebook collapse problem. In Stage II, we learn two transformers to look up codes from the codebooks and further update the encoder of low-quality videos. Experiments conducted on the VFHQ-Test dataset demonstrate that our method …
Poster
Xinan Xie · Qing Zhang · Wei-Shi Zheng

[ ExHall D ]

Abstract
While event-based deblurring methods have demonstrated impressive results, they are impractical for consumer photos captured by cell phones and digital cameras that are not equipped with an event sensor. To address this issue, we propose a novel deblurring framework called Event Generation Deblurring (EGDeblurring), which effectively deblurs an image by generating event guidance describing the motion information using a diffusion model. Specifically, we design a Motion Prior Generation Diffusion Model (MPG-Diff) and a Feature Extractor (FE) to produce prior information beneficial for the deblurring task, rather than generating the raw event representation. In order to achieve effective fusion of motion prior information with blurry images and produce high-quality results, we propose a Regression Deblurring network (RDNet) embedded with a Dual-Branch Fusion Block (DBFB) incorporating a multi-branch cross-modal attention mechanism. Experiments on multiple datasets demonstrate that our method outperforms the state-of-the-art methods.
Poster
Yanis Benidir · Nicolas Gonthier · Clement Mallet

[ ExHall D ]

Abstract
Bi-temporal change detection at scale based on Very High Resolution (VHR) images is crucial for Earth monitoring. This task remains poorly addressed even in the deep learning era: it either requires large volumes of annotated data – in the semantic case – or is limited to restricted datasets for binary set-ups. Most approaches do not exhibit the versatility required for temporal and spatial adaptation: simplicity in architecture design and pretraining on realistic and comprehensive datasets. Synthetic datasets are a key solution but still fail to handle complex and diverse scenes. In this paper, we present HySCDG, a generative pipeline for creating a large hybrid semantic change detection dataset that contains both real VHR images and inpainted ones, along with land cover semantic maps at both dates and the change map. Being semantically and spatially guided, HySCDG generates realistic images, leading to a comprehensive and hybrid transfer-proof dataset, FSC-180k. We evaluate FSC-180k on five change detection cases (binary and semantic), from zero-shot to mixed and sequential training, and also under low-data-regime training. Experiments demonstrate that pretraining on our hybrid dataset leads to a significant performance boost, outperforming SyntheWorld, a fully synthetic dataset, in every configuration. All codes, models, and data …
Poster
Hyejin Oh · Woo-Shik Kim · Sangyoon Lee · YungKyung Park · Jewon Kang

[ ExHall D ]

Abstract
Multispectral (MS) images contain richer spectral information than RGB images due to their increased number of channels and are widely used for various applications. However, achieving accurate illumination estimation in MS images remains challenging, as previous studies have struggled with spectral diversity and the inherent entanglement between the illuminant and surface reflectance spectra. To tackle these challenges, in this paper, we propose a novel illumination spectrum estimation technique for MS images via surface reflectance modeling and spatial-spectral feature generation (ISS). The proposed technique employs a learnable spectral unmixing (SU) block to enhance surface reflectance modeling, which has not previously been attempted for illumination spectrum estimation, and a feature mixing block to fuse spectral and spatial features of MS images with cross-attention. The features are refined iteratively and processed through a decoder to produce an illumination spectrum estimate. Experimental results demonstrate that the proposed technique achieves state-of-the-art performance in illumination spectrum estimation on various MS image datasets.
Poster
Jinyuan Liu · Bowei Zhang · Qingyun Mei · Xingyuan Li · Yang Zou · Zhiying Jiang · Long Ma · Risheng Liu · Xin Fan

[ ExHall D ]

Abstract
Infrared and visible image fusion integrates information from distinct spectral bands to enhance image quality by leveraging the strengths and mitigating the limitations of each modality. Existing approaches typically treat image fusion and subsequent high-level tasks as separate processes, resulting in fused images that offer only marginal gains in task performance and fail to provide constructive feedback for optimizing the fusion process. To overcome these limitations, we propose a Discriminative Cross-Dimension Evolutionary Learning Network, termed DCEvo, which simultaneously enhances visual quality and perception accuracy. Leveraging the robust search capabilities of Evolutionary Learning, our approach formulates the optimization of dual tasks as a multi-objective problem by employing an Evolutionary Algorithm to dynamically balance loss function parameters. Inspired by visual neuroscience, we integrate a Discriminative Enhancer (DE) within both the encoder and decoder, enabling the effective learning of complementary features from different modalities. Additionally, our Cross-Dimensional Feature Embedding Block facilitates mutual enhancement between high-dimensional task features and low-dimensional fusion features, ensuring a cohesive and efficient feature integration process. Experimental results on three benchmarks demonstrate that our method significantly outperforms state-of-the-art approaches, achieving an average improvement of 9.32% in visual quality while also enhancing subsequent high-level tasks.
Poster
Junming Hou · Xiaoyu Chen · Ran Ran · Xiaofeng Cong · Xinyang Liu · Jian Wei You · Liang-Jian Deng

[ ExHall D ]

Abstract
Pan-sharpening technology refers to generating a high-resolution (HR) multi-spectral (MS) image with broad applications by fusing a low-resolution (LR) MS image and HR panchromatic (PAN) image. While deep learning approaches have shown impressive performance in pan-sharpening, they generally require extensive hardware with high memory and computational power, limiting their deployment on resource-constrained satellites. In this study, we investigate the use of binary neural networks (BNNs) for pan-sharpening and observe that binarization leads to distinct information degradation across different frequency components of an image. Building on this insight, we propose a novel binary pan-sharpening network, termed BNNPan, structured around the Prior-Integrated Binary Frequency (PIBF) module that features three key ingredients: Binary Wavelet Transform Convolution, Latent Diffusion Prior Compensation, and Channel-wise Distribution Calibration. Specifically, the first decomposes input features into distinct frequency components using Wavelet Transform, then applies a “divide-and-conquer” strategy to optimize binary feature learning for each component, informed by the corresponding full-precision residual statistics. The second integrates a latent diffusion prior to compensate for compromised information during binarization, while the third performs channel-wise calibration to further refine feature representation. Our BNNPan, developed using the proposed techniques, achieves promising pan-sharpening performance while maintaining favorable computational overhead. Experiments on multiple remote sensing …
Poster
Xiaoyi Liu · Hao Tang

[ ExHall D ]

Abstract
We introduce DiffFNO, a novel framework for arbitrary-scale super-resolution that incorporates a Weighted Fourier Neural Operator (WFNO) enhanced by a diffusion process. DiffFNO's adaptive mode weighting mechanism in the Fourier domain effectively captures critical frequency components, significantly improving the reconstruction of high-frequency image details that are essential for super-resolution tasks. Additionally, we propose a Gated Fusion Mechanism to efficiently integrate features from the WFNO and an attention-based neural operator, enhancing the network's capability to capture both global and local image details. To further improve efficiency, DiffFNO employs a deterministic ODE sampling strategy called the Adaptive Time-step ODE Solver (AT-ODE), which accelerates inference by dynamically adjusting step sizes while preserving output quality. Extensive experiments demonstrate that DiffFNO achieves state-of-the-art results, outperforming existing methods across various scaling factors, including those beyond the training distribution, by a margin of 2–4 dB in PSNR. Our approach sets a new standard in super-resolution, delivering both superior accuracy and computational efficiency.
Poster
Haitao Wu · Qing Li · Changqing Zhang · Zhen He · Xiaomin Ying

[ ExHall D ]

Abstract
Can our brain signals faithfully reflect the original visual stimuli, even including high-frequency details? Although human perceptual and cognitive capacities enable us to process and remember visual information, these abilities are constrained by several factors, such as limited attentional resources and the finite capacity of visual working memory. When visual stimuli are processed by the human visual system into brain signals, some information is inevitably lost, leading to a discrepancy known as the _System GAP_. Additionally, perceptual and cognitive dynamics, along with technical noise in signal acquisition, reduce the fidelity of brain signals relative to the original visual stimuli, known as the _Random GAP_. When encoded brain signal representations are directly aligned with the corresponding pretrained image features, the System GAP and Random GAP between paired data challenge the model, requiring it to bridge these gaps. However, in the context of limited paired data, these gaps become difficult for the model to learn, leading to overfitting and poor generalization to new data. To address these GAPs, we propose a simple yet effective approach called the _Uncertainty-aware Blur Prior (UBP)_. It estimates the uncertainty within the paired data, reflecting the mismatch between brain signals and visual stimuli. Based on this uncertainty, UBP dynamically blurs the …
Poster
Jiuchen Chen · Xinyu Yan · Qizhi Xu · Kaiqi Li

[ ExHall D ]

Abstract
Global contextual information and local detail features are essential for haze removal tasks. Deep learning models perform well on small, low-resolution images, but they encounter difficulties with large, high-resolution ones due to GPU memory limitations. As a compromise, they often resort to image slicing or downsampling. The former diminishes global information, while the latter discards high-frequency details. To address these challenges, we propose DehazeXL, a haze removal method that effectively balances global context and local feature extraction, enabling end-to-end modeling of large images on mainstream GPU hardware. Additionally, to evaluate the efficiency of global context utilization in haze removal performance, we design a visual attribution method tailored to the characteristics of haze removal tasks. Finally, recognizing the lack of benchmark datasets for haze removal in large images, we have developed an ultra-high-resolution haze removal dataset (8KDehaze) to support model training and testing. It includes 10000 pairs of clear and hazy remote sensing images, each sized at 8192 × 8192 pixels. Extensive experiments demonstrate that DehazeXL can infer images up to 10240 × 10240 pixels with only 21 GB of memory, achieving state-of-the-art results among all evaluated methods. The source code and experimental dataset will soon be made publicly available.
Poster
Woo Kyoung Han · Byeonghun Lee · Hyunmin Cho · Sunghoon Im · Kyong Hwan Jin

[ ExHall D ]

Abstract
We quantify the upper bound on the size of the implicit neural representation (INR) model from a digital perspective. The upper bound of the model size increases exponentially as the required bit-precision increases. To this end, we present a bit-plane decomposition method that makes INR predict bit-planes, producing the same effect as reducing the upper bound of the model size. We validate our hypothesis that reducing the upper bound leads to faster convergence with constant model size. Our method achieves lossless representation in 2D image and audio fitting, even for high bit-depth signals, such as 16-bit, which was previously unachievable. We are also the first to identify a bit bias, whereby INR prioritizes the most significant bit (MSB). We expand the application of the INR task to bit depth expansion, lossless image compression, and extreme network quantization. Source codes are included in the supplementary material.
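The abstract only sketches the idea, so the following is a minimal NumPy sketch of the bit-plane decomposition it describes: a 16-bit signal is split into binary planes (each of which could be fitted by an INR head in the paper's setting), and stacking the planes back recovers the signal losslessly. All names and shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def to_bit_planes(x: np.ndarray, bit_depth: int = 16) -> np.ndarray:
    """Split an unsigned integer image into binary bit-planes (MSB first)."""
    planes = [(x >> b) & 1 for b in range(bit_depth - 1, -1, -1)]
    return np.stack(planes, axis=0).astype(np.uint8)

def from_bit_planes(planes: np.ndarray) -> np.ndarray:
    """Reassemble bit-planes (MSB first) into the original integer image."""
    bit_depth = planes.shape[0]
    weights = (1 << np.arange(bit_depth - 1, -1, -1)).reshape(-1, 1, 1)
    return (planes.astype(np.uint32) * weights).sum(axis=0)

# Toy 16-bit image; in the paper each plane would be predicted by an INR.
img = np.random.randint(0, 2**16, size=(32, 32), dtype=np.uint16)
planes = to_bit_planes(img, bit_depth=16)
assert np.array_equal(from_bit_planes(planes), img)  # lossless round trip
```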
Poster
Wei Long · Xingyu Zhou · Leheng Zhang · Shuhang Gu

[ ExHall D ]

Abstract
Transformer-based methods have achieved remarkable results in image super-resolution tasks because they can capture non-local dependencies in low-quality input images. However, this feature-intensive modeling approach is computationally expensive because it calculates the similarities between numerous features that are irrelevant to the query features when obtaining attention weights. These unnecessary similarity calculations not only degrade the reconstruction performance but also introduce significant computational overhead. How to accurately identify the features that are important to the current query features and avoid similarity calculations between irrelevant features remains an urgent problem. To address this issue, we propose a novel and effective Progressive Focused Transformer (PFT) that links all isolated attention maps in the network through Progressive Focused Attention (PFA) to focus attention on the most important tokens. PFA not only enables the network to capture more critical similar features, but also significantly reduces the computational cost of the overall network by filtering out irrelevant features before calculating similarities. Extensive experiments demonstrate the effectiveness of the proposed method, achieving state-of-the-art performance on various single image super-resolution benchmarks.
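The abstract does not give implementation details for PFT/PFA; below is a hedged PyTorch sketch of the general idea of focusing attention on a subset of keys: an attention map from an earlier stage selects the top-k keys per query, and the remaining keys are masked out before the softmax. The top-k rule and all names are assumptions for illustration only; a real implementation would also skip the masked similarity computations entirely, whereas this dense version only illustrates the selection.

```python
import torch
import torch.nn.functional as F

def focused_attention(q, k, v, prev_attn, keep: int):
    """Attention restricted to the `keep` keys per query that scored highest
    in a previous attention map (B, Nq, Nk); other keys are masked out."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5    # (B, Nq, Nk)
    topk = prev_attn.topk(keep, dim=-1).indices               # indices of kept keys
    mask = torch.full_like(scores, float("-inf"))
    mask.scatter_(-1, topk, 0.0)                               # 0 for kept keys, -inf otherwise
    attn = F.softmax(scores + mask, dim=-1)
    return attn @ v, attn                                      # attn can seed the next stage

B, Nq, Nk, d = 2, 64, 64, 32
q, k, v = (torch.randn(B, Nq, d) for _ in range(3))
prev_attn = torch.rand(B, Nq, Nk)                              # e.g. from an earlier block
out, attn = focused_attention(q, k, v, prev_attn, keep=16)
```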
Poster
Yuxuan Jiang · Ho Man Kwan · jasmine peng · Ge Gao · Fan Zhang · Xiaoqing Zhu · Joel Sole · David Bull

[ ExHall D ]

Abstract
Recent advances in implicit neural representations (INRs) have shown significant promise in modeling visual signals for various low-level vision tasks including image super-resolution (ISR). INR-based ISR methods typically learn continuous representations, providing flexibility for generating high-resolution images at any desired scale from their low-resolution counterparts. However, existing INR-based ISR methods utilize multi-layer perceptrons for parameterization in the network; this does not take into account the hierarchical structure present in local sampling points and hence constrains the representation capability. In this paper, we propose a new Hierarchical encoding based Implicit Image Function for continuous image super-resolution, HIIF, which leverages a novel hierarchical positional encoding that enhances the local implicit representation, enabling it to capture fine details at multiple scales. Our approach also embeds a multi-head linear attention mechanism within the implicit attention network by taking additional non-local information into account. Our experiments show that, when integrated with different backbone encoders, HIIF outperforms the state-of-the-art continuous image super-resolution methods by up to 0.17 dB in PSNR. The source code of HIIF will be made publicly available at www.github.com.
Poster
Yulu Bai · Jiahong Fu · Qi Xie · Deyu Meng

[ ExHall D ]

Abstract
Equivariant and invariant deep learning models have been developed to exploit intrinsic symmetries in data, demonstrating significant effectiveness in certain scenarios. However, these methods often suffer from limited representation accuracy and rely on strict symmetry assumptions that may not hold in practice. These limitations pose a significant drawback for image restoration tasks, which demand high accuracy and precise symmetry representation. To address these challenges, we propose a rotation-equivariant regularization strategy that adaptively enforces the appropriate symmetry constraints on the data while preserving the network's representational accuracy. Specifically, we introduce EQ-Reg, a regularizer designed to enhance rotation equivariance, which innovatively extends the insights of data-augmentation-based and equivariance-based methodologies. This is achieved through self-supervised learning and the spatial rotation and cyclic channel shift of feature maps deduced in the equivariant framework. Our approach is the first to enable a non-strictly equivariant network suitable for image restoration, providing a simple and adaptive mechanism for adjusting equivariance based on task requirements. Extensive experiments across three low-level tasks demonstrate the superior accuracy and generalization capability of our method, outperforming state-of-the-art approaches. We will release the source code once the paper is accepted for publication.
Poster
Fengjia Zhang · Samrudhdhi Rangrej · Tristan T Aumentado-Armstrong · Afsaneh Fazly · Alex Levinshtein

[ ExHall D ]

Abstract
Super-resolution (SR), a classical inverse problem in computer vision, is inherently ill-posed, inducing a distribution of plausible solutions for every input. However, the desired result is not simply the expectation of this distribution, which is the blurry image obtained by minimizing pixelwise error, but rather the sample with the highest image quality. A variety of techniques, from perceptual metrics to adversarial losses, are employed to this end. In this work, we explore an alternative: utilizing powerful non-reference image quality assessment (NR-IQA) models in the SR context. We begin with a comprehensive analysis of NR-IQA metrics on human-derived SR data, identifying both the accuracy (human alignment) and complementarity of different metrics. Then, we explore two methods of applying NR-IQA models to SR learning: (i) altering data sampling, by building on an existing multi-ground-truth SR framework, and (ii) directly optimizing a differentiable quality score. Our results demonstrate a more human-centric perception-distortion tradeoff, focusing less on non-perceptual pixelwise distortion, instead improving the balance between perceptual fidelity and human-tuned NR-IQA measures.
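As a hedged illustration of option (ii) above, directly optimizing a differentiable quality score, the sketch below mixes a pixel-fidelity term with a negated score from a frozen, differentiable NR-IQA model. `sr_net` and `nr_iqa_model` are placeholders (not a specific library or the authors' exact loss), and the toy stand-ins exist only so the sketch runs end to end.

```python
import torch
import torch.nn.functional as F

def sr_loss(sr_net, nr_iqa_model, lr, hr, quality_weight: float = 0.1):
    """Pixel fidelity plus a differentiable no-reference quality term.
    `nr_iqa_model` stands in for any frozen, differentiable NR-IQA scorer
    that returns higher values for better-looking images."""
    pred = sr_net(lr)
    pixel_term = F.l1_loss(pred, hr)
    quality_term = -nr_iqa_model(pred).mean()        # maximize predicted quality
    return pixel_term + quality_weight * quality_term

# Toy stand-ins so the sketch runs end to end.
sr_net = torch.nn.Sequential(torch.nn.Upsample(scale_factor=2),
                             torch.nn.Conv2d(3, 3, 3, padding=1))
nr_iqa_model = lambda img: img.mean(dim=(1, 2, 3))   # placeholder "quality" score
lr, hr = torch.rand(4, 3, 32, 32), torch.rand(4, 3, 64, 64)
loss = sr_loss(sr_net, nr_iqa_model, lr, hr)
loss.backward()
```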
Poster
Long Ma · Tengyu Ma · Ziye Li · Yuetong Wang · Jinyuan Liu · Chengpei Xu · Risheng Liu

[ ExHall D ]

Abstract
Recently, enhancing image quality in the original RAW domain has garnered significant attention, with denoising and reconstruction emerging as fundamental tasks. Although some works attempt to couple these tasks, they primarily focus on multi-stage learning while neglecting task associativity within a broader parameter space, leading to suboptimal performance. This work introduces a novel approach by rethinking denoising and reconstruction from a "backbone-head" perspective, leveraging the stronger shared parameter space offered by the backbone, compared to the encoder used in existing works. We derive task-specific heads with fewer parameters to mitigate learning pressure. By incorporating Chromaticity-aware attention into the backbone and introducing an adaptive denoising prior during training, we enable simultaneous reconstruction and denoising. Additionally, we design a dual-head interaction module to capture the latent correspondence between the two tasks, significantly enhancing multi-task accuracy. Extensive experiments validate the superiority of our approach.
Poster
Lingchen Sun · Rongyuan Wu · Zhiyuan Ma · Shuaizheng Liu · Qiaosi Yi · Lei Zhang

[ ExHall D ]

Abstract
Diffusion prior-based methods have shown impressive results in real-world image super-resolution (SR). However, most existing methods entangle pixel-level and semantic-level SR objectives in the training process, struggling to balance pixel-wise fidelity and perceptual quality. Meanwhile, users have varying preferences on SR results, so it is desirable to develop an adjustable SR model that can be tailored to different fidelity-perception preferences during inference without re-training. We present Pixel-level and Semantic-level Adjustable SR (PiSA-SR), which learns two LoRA modules upon the pre-trained stable-diffusion (SD) model to achieve improved and adjustable SR results. We first formulate the SD-based SR problem as learning the residual between the low-quality input and the high-quality output, then show that the learning objective can be decoupled into two distinct LoRA weight spaces: one is characterized by the ℓ2 loss for pixel-level regression, and the other is characterized by the LPIPS and classifier score distillation losses to extract semantic information from pre-trained classification and SD models. In its default setting, PiSA-SR can be performed in a single diffusion step, achieving leading real-world SR results in both quality and efficiency. By introducing two adjustable guidance scales on the two LoRA modules to control the strengths of pixel-wise fidelity and semantic-level details during inference, …
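The abstract is cut off mid-sentence, so as a rough, hedged illustration of how two adjustable guidance scales on two LoRA modules might combine at inference, the sketch below weights the residual predicted by each LoRA branch separately. The combination rule and names are assumptions for illustration, not the paper's exact formulation.

```python
import torch

def adjustable_sr_step(base_out, pixel_lora_delta, semantic_lora_delta,
                       lambda_pix: float = 1.0, lambda_sem: float = 1.0):
    """Combine a base prediction with two LoRA residual branches, each scaled
    by its own guidance weight (pixel fidelity vs. semantic detail)."""
    return base_out + lambda_pix * pixel_lora_delta + lambda_sem * semantic_lora_delta

x = torch.rand(1, 4, 64, 64)                           # e.g. an SD latent
pix_delta, sem_delta = torch.randn_like(x) * 0.1, torch.randn_like(x) * 0.1
more_faithful = adjustable_sr_step(x, pix_delta, sem_delta, lambda_pix=1.2, lambda_sem=0.6)
more_detailed = adjustable_sr_step(x, pix_delta, sem_delta, lambda_pix=0.8, lambda_sem=1.4)
```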
Poster
Xudong Li · Wenjie Nie · Yan Zhang · Runze Hu · Ke Li · Xiawu Zheng · Liujuan Cao

[ ExHall D ]

Abstract
In the Blind Image Quality Assessment (BIQA) field, accurately assessing the quality of authentically distorted images presents a substantial challenge due to the diverse distortion types in natural settings. Existing state-of-the-art IQA methods apply a sequence of mixed distortions to entire images to establish distortion priors, but are inadequate for images with spatially local distortions. To address this, we introduce a novel IQA framework that employs knowledge distillation tailored to perceive spatially heterogeneous distortions, enhancing quality-distortion awareness. Specifically, we introduce a novel Block-wise Degradation Modelling approach that applies distinct distortions to different spatial blocks of an image, thereby expanding varied distortion priors. Following this, we present a Block-wise Aggregation and Filtering module that enables fine-grained attention to the quality perception within different distortion areas of the image. Furthermore, to enhance the granularity of distortion perception across various regions while maintaining quality perception, we incorporate strategies of Contrastive Knowledge Distillation and Affinity Knowledge Distillation to learn the distortion discrimination power and distortion correlation of different regions, respectively. Extensive experiments on seven standard BIQA datasets demonstrate superior performance over state-of-the-art BIQA methods, i.e., achieving PLCC values of 0.947 (1.2% vs. 0.936 in KADID) and 0.735 (8.3% vs. 0.679 …
Poster
Jinho Jeong · Sangmin Han · Jinwoo Kim · Seon Joo Kim

[ ExHall D ]

Abstract
In this paper, we propose LSRNA, a novel framework for higher-resolution (exceeding 1K) image generation using diffusion models by leveraging super-resolution directly in the latent space. Existing diffusion models struggle with scaling beyond their training resolutions, often leading to structural distortions or content repetition. Reference-based methods address the issues by upsampling a low-resolution reference to guide higher-resolution generation. However, they face significant challenges: upsampling in latent space often causes manifold deviation, which degrades output quality. On the other hand, upsampling in RGB space tends to produce overly smoothed outputs. To overcome these limitations, LSRNA combines Latent space Super-Resolution (LSR) for manifold alignment and Region-wise Noise Addition (RNA) to enhance high-frequency details. Our extensive experiments demonstrate that LSRNA outperforms state-of-the-art reference-based methods across various resolutions and metrics, while showing the critical role of latent upsampling in preserving detail and sharpness.
Poster
Guangqian Guo · Yong Guo · Xuehui Yu · Wenbo Li · Yaoxing Wang · Shan Gao

[ ExHall D ]

Abstract
Despite their success, Segment Anything Models (SAMs) experience significant performance drops on severely degraded, low-quality images, limiting their effectiveness in real-world scenarios. To address this, we propose GleSAM, which utilizes Generative Latent space Enhancement to boost robustness on low-quality images, thus enabling generalization across various image qualities. Specifically, we adapt the concept of latent diffusion to SAM-based segmentation frameworks and perform the generative diffusion process in the latent space of SAM to reconstruct high-quality representation, thereby improving segmentation. Additionally, we introduce two techniques to improve compatibility between the pre-trained diffusion model and the segmentation framework. Our method can be applied to pre-trained SAM and SAM2 with only minimal additional learnable parameters, allowing for efficient optimization. We also construct the LQSeg dataset with a greater diversity of degradation types and levels for training and evaluating the model. Extensive experiments demonstrate that GleSAM significantly improves segmentation robustness on complex degradations while maintaining generalization to clear images. Furthermore, GleSAM also performs well on unseen degradations, underscoring the versatility of our approach and dataset. Codes and datasets will be available soon.
Poster
Yuhan Wang · Suzhi Bi · Ying-Jun Angela Zhang · Xiaojun Yuan

[ ExHall D ]

Abstract
The distortion-perception (DP) tradeoff reveals a fundamental conflict between distortion metrics (e.g., MSE and PSNR) and perceptual quality. Recent research has increasingly concentrated on evaluating denoising algorithms within the DP framework. However, existing algorithms either prioritize perceptual quality by sacrificing acceptable distortion, or focus on minimizing MSE for faithful restoration. When the goal shifts or noisy measurements vary, adapting to different points on the DP plane needs retraining or even re-designing the model. Inspired by recent advances in solving inverse problems using score-based generative models, we explore the potential of flexibly and optimally traversing DP tradeoffs using a single pre-trained score-based model. Specifically, we introduce a variance-scaled reverse diffusion process and theoretically characterize the marginal distribution. We then prove that the proposed sample process is an optimal solution to the DP tradeoff for conditional Gaussian distribution. Experimental results on two-dimensional and image datasets illustrate that a single score network can effectively and flexibly traverse the DP tradeoff for general denoising problems.
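The abstract does not spell out the sampler, so the following is a hedged sketch of a variance-scaled reverse step in a DDPM-style sampler: a single scale in [0, 1] shrinks the injected noise and thereby moves the output between the low-distortion (deterministic) and higher-perceptual-quality (fully stochastic) ends of the tradeoff. Symbols follow standard DDPM notation, and the scaling rule is an assumption for illustration rather than the paper's exact process.

```python
import torch

def variance_scaled_reverse_step(x_t, eps_pred, alpha_t, alpha_bar_t, sigma_t, scale: float):
    """One reverse step x_t -> x_{t-1} with the injected noise scaled by
    `scale` in [0, 1]: scale=0 gives a deterministic (low-distortion) update,
    scale=1 the usual stochastic (perceptually oriented) update."""
    mean = (x_t - (1 - alpha_t) / (1 - alpha_bar_t) ** 0.5 * eps_pred) / alpha_t ** 0.5
    noise = torch.randn_like(x_t)
    return mean + scale * sigma_t * noise

x_t = torch.randn(1, 3, 32, 32)
eps_pred = torch.randn_like(x_t)                 # would come from the score network
x_prev = variance_scaled_reverse_step(x_t, eps_pred, alpha_t=0.99, alpha_bar_t=0.5,
                                      sigma_t=0.05, scale=0.3)
```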
Poster
Zhifu Tian · Tao Hu · Chaoyang Niu · Di Wu · Shu Wang

[ ExHall D ]

Abstract
Scene-aware Adaptive Compressive Sensing (ACS) has attracted significant interest due to its promising capability for efficient and high-fidelity acquisition of scene images. ACS typically prescribes adaptive sampling allocation (ASA) based on previous samples in the absence of ground truth. However, when confronting unknown scenes, existing ACS methods often lack accurate judgment and robust feedback mechanisms for ASA, thus limiting the high-fidelity sensing of the scene. In this paper, we introduce a Sampling Innovation-Based ACS (SIB-ACS) method that can effectively identify and allocate sampling to challenging image reconstruction areas, culminating in high-fidelity image reconstruction. An innovation criterion is proposed to judge ASA by predicting the decrease in image reconstruction error attributable to sampling increments, thereby directing more samples towards regions where the reconstruction error diminishes significantly. A sampling innovation-guided multi-stage adaptive sampling (AS) framework is proposed, which iteratively refines the ASA through a multi-stage feedback process. For image reconstruction, we propose a Principal Component Compressed Domain Network (PCCD-Net), which efficiently and faithfully reconstructs images under AS scenarios. Extensive experiments demonstrate that the proposed SIB-ACS method significantly outperforms the state-of-the-art methods in terms of image reconstruction fidelity and visual effects.
Poster
Tomer Garber · Tom Tirer

[ ExHall D ]

Abstract
In recent years, it has become popular to tackle image restoration tasks with a single pretrained diffusion model (DM) and data-fidelity guidance, instead of training a dedicated deep neural network per task. However, such "zero-shot" restoration schemes currently require many Neural Function Evaluations (NFEs) for performing well, which may be attributed to the many NFEs needed in the original generative functionality of the DMs. Recently, faster variants of DMs have been explored for image generation. These include Consistency Models (CMs), which can generate samples via a couple of NFEs. However, existing works that use guided CMs for restoration still require tens of NFEs or per-task fine-tuning of the model, which leads to a performance drop if the assumptions made during fine-tuning are not accurate. In this paper, we propose a zero-shot restoration scheme that uses CMs and operates well with as little as 4 NFEs. It is based on a wise combination of several ingredients: better initialization, back-projection guidance, and above all a novel noise injection mechanism. We demonstrate the advantages of our approach for image super-resolution and inpainting. Interestingly, we show that the usefulness of our noise injection technique goes beyond CMs: it can also mitigate the performance degradation …
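As a minimal NumPy sketch of the back-projection ingredient mentioned above for a linear observation y = A x: the current estimate is corrected via the pseudo-inverse so that it becomes consistent with the measurements. The operator and shapes are toy assumptions; the paper combines this with a consistency model and a noise-injection mechanism that are not reproduced here.

```python
import numpy as np

def back_project(x_est: np.ndarray, y: np.ndarray, A: np.ndarray) -> np.ndarray:
    """Correct x_est so that it agrees with the measurements y = A x
    (data-consistency / back-projection step)."""
    A_pinv = np.linalg.pinv(A)
    return x_est + A_pinv @ (y - A @ x_est)

rng = np.random.default_rng(0)
x_true = rng.standard_normal(64)
A = rng.standard_normal((16, 64))          # toy degradation operator (e.g. downsampling)
y = A @ x_true
x_est = rng.standard_normal(64)            # e.g. a consistency-model prediction
x_corr = back_project(x_est, y, A)
assert np.allclose(A @ x_corr, y)          # measurement-consistent after the correction
```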
Poster
Zhi Jiang · Jingbo Hu · Ling Zhang · Gang Fu · Chunxia Xiao

[ ExHall D ]

Abstract
Despite significant advances in the field of specular highlight removal in recent years, existing methods predominantly focus on natural images, where highlights typically appear on raised or edged surfaces of objects. These highlights are often small and sparsely distributed. However, for text images such as cards and posters, the flat surfaces reflect light uniformly, resulting in large areas of highlights. Current methods struggle with these large-area highlights in text images, often producing severe visual artifacts or noticeable discrepancies between filled pixels and the original image in the central high-intensity highlight areas. To address these challenges, we propose the Hierarchical Adaptive Filtering Network (HAFNet). Our approach performs filtering at both the downsampled deep feature layer and the upsampled image reconstruction layer. By designing and applying the Adaptive Comprehensive Filtering Module (ACFM) and Adaptive Dilated Filtering Module (ADFM) at different layers, our method effectively restores semantic information in large-area specular highlight regions and recovers detail loss at various scales. The required filtering kernels are pre-generated by a prediction network, allowing them to adaptively adjust according to different images and their semantic content, enabling robust performance across diverse scenarios. Additionally, we utilize Unity3D to construct a comprehensive large-area highlight dataset featuring images with …
Poster
Yi Liu · Hao Zhou · Benlei Cui · Wenxiang Shang · Ran Lin

[ ExHall D ]

Abstract
Erase inpainting, or object removal, aims to precisely remove target objects within masked regions while preserving the overall consistency of the surrounding content. Although diffusion-based methods have made significant strides in the field of image inpainting, challenges remain regarding the emergence of unexpected objects or artifacts. We assert that the inexact diffusion pathways established by existing standard optimization paradigms constrain the efficacy of object removal. To tackle these challenges, we propose a novel Erase Diffusion, termed EraDiff, aimed at unleashing the potential power of standard diffusion in the context of object removal. In contrast to standard diffusion, EraDiff adapts both the optimization paradigm and the network to improve the coherence and elimination of the erasure results. We first introduce a Chain-Rectifying Optimization (CRO) paradigm, a sophisticated diffusion process specifically designed to align with the objectives of erasure. This paradigm establishes innovative diffusion transition pathways that simulate the gradual elimination of objects during optimization, allowing the model to accurately capture the intent of object removal. Furthermore, to mitigate deviations caused by artifacts during the sampling pathways, we develop a simple yet effective Self-Rectifying Attention (SRA) mechanism. The SRA calibrates the sampling pathways by altering self-attention activation, allowing the model to effectively …
Poster
Yichi Zhang · Zhihao Duan · Yuning Huang · Fengqing Zhu

[ ExHall D ]

Abstract
Learned image compression (LIC) using deep learning architectures has seen significant advancements, yet standard rate-distortion (R-D) optimization often encounters imbalanced updates due to diverse gradients of the rate and distortion objectives. This imbalance can lead to suboptimal optimization, where one objective dominates, thereby reducing overall compression efficiency. To address this challenge, we reformulate R-D optimization as a multi-objective optimization (MOO) problem and introduce two balanced R-D optimization strategies that adaptively adjust gradient updates to achieve more equitable improvements in both rate and distortion. The first proposed strategy utilizes a coarse-to-fine gradient descent approach along standard R-D optimization trajectories, making it particularly suitable for training LIC models from scratch. The second proposed strategy analytically addresses the reformulated optimization as a quadratic programming problem with an equality constraint, which is ideal for fine-tuning existing models. Experimental results demonstrate that both proposed methods enhance the R-D performance of LIC models, achieving around a 2% BD-Rate reduction with acceptable additional training cost, leading to a more balanced and efficient optimization process. The code will be made publicly available.
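Neither the coarse-to-fine schedule nor the quadratic-programming solution is specified in the abstract; as a hedged illustration of treating R-D optimization as a multi-objective problem, the sketch below rescales the rate and distortion gradients to equal norms before summing them, one simple way to keep either objective from dominating an update. The equal-norm heuristic and the toy model are assumptions for illustration, not the authors' strategies.

```python
import torch

def balanced_rd_step(params, rate, distortion, lr: float = 1e-4, eps: float = 1e-12):
    """One gradient step where the rate and distortion gradients are rescaled
    to the same norm before being combined, so neither objective dominates."""
    g_rate = torch.autograd.grad(rate, params, retain_graph=True)
    g_dist = torch.autograd.grad(distortion, params)
    rate_norm = torch.sqrt(sum((g ** 2).sum() for g in g_rate)) + eps
    dist_norm = torch.sqrt(sum((g ** 2).sum() for g in g_dist)) + eps
    with torch.no_grad():
        for p, gr, gd in zip(params, g_rate, g_dist):
            p -= lr * (gr / rate_norm + gd / dist_norm)

# Toy model standing in for an LIC analysis/synthesis transform.
model = torch.nn.Linear(8, 8)
x = torch.rand(16, 8)
y = model(x)
rate = y.abs().mean()                  # placeholder for the estimated bitrate
distortion = ((y - x) ** 2).mean()     # placeholder for the reconstruction error
balanced_rd_step(list(model.parameters()), rate, distortion)
```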
Poster
Yifan Zhou · Zeqi Xiao · Shuai Yang · Xingang Pan

[ ExHall D ]

Abstract
Latent Diffusion Models (LDMs) are known to have an unstable generation process, where even small perturbations or shifts in the input noise can lead to significantly different outputs. This hinders their applicability in applications requiring consistent results. In this work, we redesign LDMs to enhance consistency by making them shift-equivariant. While introducing anti-aliasing operations can partially improve shift-equivariance, significant aliasing and inconsistency persist due to the unique challenges in LDMs, including 1) aliasing amplification during VAE training and multiple U-Net inferences, and 2) self-attention modules that inherently lack shift-equivariance. To address these issues, we redesign the attention modules to be shift-equivariant and propose an equivariance loss that effectively suppresses the frequency bandwidth of the features in the continuous domain. The resulting alias-free LDM (AF-LDM) achieves strong shift-equivariance and is also robust to irregular warping. Extensive experiments demonstrate that AF-LDM produces significantly more consistent results than vanilla LDM across various applications, including video editing and image-to-image translation.
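A minimal PyTorch sketch of a shift-equivariance penalty of the kind described above: the feature map of a circularly shifted input is compared against the shifted feature map of the original input. The continuous-domain frequency treatment in the paper is not reproduced; this is only the discrete version of the constraint, with illustrative names.

```python
import torch
import torch.nn.functional as F

def shift_equivariance_loss(net, x, dx: int = 3, dy: int = 2):
    """Penalize the mismatch between net(shift(x)) and shift(net(x))
    for a circular spatial shift (a discrete proxy for shift-equivariance)."""
    shifted_in = torch.roll(x, shifts=(dy, dx), dims=(-2, -1))
    out = net(x)
    out_of_shifted = net(shifted_in)
    shifted_out = torch.roll(out, shifts=(dy, dx), dims=(-2, -1))
    return F.mse_loss(out_of_shifted, shifted_out)

net = torch.nn.Conv2d(3, 8, 3, padding=1, padding_mode="circular")
x = torch.rand(2, 3, 32, 32)
loss = shift_equivariance_loss(net, x)   # ~0 here: a circular conv is shift-equivariant
loss.backward()
```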
Poster
Pascal Chang · Sergio Sancho · Jingwei Tang · Markus Gross · Vinicius C. Azevedo

[ ExHall D ]

Abstract
Anamorphosis refers to a category of images that are intentionally distorted, making them unrecognizable when viewed directly. Their true form only reveals itself when seen from a specific viewpoint, which can be through some catadioptric device like a mirror or a lens. While the construction of these mathematical devices can be traced back to as early as the 17th century, they are only interpretable when viewed from a specific vantage point and tend to lose meaning when seen normally. In this paper, we revisit these famous optical illusions with a generative twist. With the help of latent rectified flow models, we propose a method to create anamorphic images that still retain a valid interpretation when viewed directly. To this end, we introduce _Laplacian Pyramid Warping_, a frequency-aware image warping technique key to generating high-quality visuals. Our work extends Visual Anagrams [Geng et al. 2024] to latent space models and to a wider range of spatial transforms, enabling the creation of novel generative perceptual illusions.
Poster
Sora Kim · Sungho Suh · Minsik Lee

[ ExHall D ]

Abstract
Diffusion models have achieved remarkable success in image generation, with applications broadening across various domains. Inpainting is one such application that can benefit significantly from diffusion models. Existing methods either hijack the reverse process of a pretrained diffusion model or cast the problem into a larger framework, i.e., conditioned generation. However, these approaches often require nested loops in the generation process or additional components for conditioning. In this paper, we present region-aware diffusion models (RAD) for inpainting with a simple yet effective reformulation of the vanilla diffusion models. RAD utilizes a different noise schedule for each pixel, which allows local regions to be generated asynchronously while considering the global image context. A plain reverse process requires no additional components, enabling RAD to achieve inference time up to 100 times faster than the state-of-the-art approaches. Moreover, we employ low-rank adaptation (LoRA) to fine-tune RAD based on other pretrained diffusion models, reducing computational burdens in training as well. Experiments demonstrate that RAD provides state-of-the-art results both qualitatively and quantitatively on the FFHQ, LSUN Bedroom, and ImageNet datasets.
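RAD's exact schedules are not given in the abstract; the sketch below only shows the generic shape of a forward diffusion step in which the noise level is a per-pixel map rather than a scalar, so masked (to-be-inpainted) regions can sit at a different noise level from known regions. The schedule construction here is an illustrative assumption, not the paper's design.

```python
import torch

def forward_diffuse_per_pixel(x0, alpha_bar_map):
    """Forward diffusion where alpha_bar is a per-pixel map (B, 1, H, W)
    instead of a scalar, so different regions sit at different noise levels."""
    noise = torch.randn_like(x0)
    x_t = alpha_bar_map.sqrt() * x0 + (1.0 - alpha_bar_map).sqrt() * noise
    return x_t, noise

x0 = torch.rand(1, 3, 64, 64)
mask = torch.zeros(1, 1, 64, 64)
mask[..., 16:48, 16:48] = 1.0                       # region to inpaint
# Illustrative choice: heavily noise the masked region, lightly noise the rest.
alpha_bar_map = torch.where(mask.bool(), torch.tensor(0.1), torch.tensor(0.9))
x_t, noise = forward_diffuse_per_pixel(x0, alpha_bar_map)
```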
Poster
Lucas Relic · Roberto Azevedo · Yang Zhang · Markus Gross · Christopher Schroers

[ ExHall D ]

Abstract
Generative neural image compression supports data representation with extremely low bitrate, allowing clients to synthesize details and consistently producing highly realistic images. By leveraging the similarities between quantization error and additive noise, diffusion-based generative image compression codecs can be built using a latent diffusion model to "denoise" the artifacts introduced by quantization. However, we identify three critical gaps in previous approaches following this paradigm (namely, the noise level, noise type, and discretization gaps) that result in the quantized data falling out of the data distribution known by the diffusion model. In this work, we propose a novel quantization-based forward diffusion process with theoretical foundations that tackles all three aforementioned gaps. We achieve this through universal quantization with a carefully tailored quantization schedule and a diffusion model trained with uniform noise. Compared to previous work, our proposal produces consistently realistic and detailed reconstructions, even at very low bitrates. In such a regime, we achieve the best rate-distortion-realism performance, outperforming previous related works.
Poster
Nick Stracke · Stefan Andreas Baumann · Kolja Bauer · Frank Fundel · Björn Ommer

[ ExHall D ]

Abstract
Internal features from large-scale pre-trained diffusion models have recently been established as powerful semantic descriptors for a wide range of downstream tasks. Works that use these features generally need to add noise to images before passing them through the model to obtain the semantic features, as the models do not offer the most useful features when given images with little to no noise. We show that this noise has a critical impact on the usefulness of these features that cannot be remedied by ensembling with different random noises. We address this issue by introducing a lightweight, unsupervised fine-tuning method that enables diffusion backbones to provide high-quality, noise-free semantic features. We show that these features readily outperform previous diffusion features by a wide margin in a wide variety of extraction setups and downstream tasks, offering better performance than even ensemble-based methods at a fraction of the cost.
Poster
Haosen Yang · Adrian Bulat · Isma Hadji · Hai X. Pham · Xiatian Zhu · Georgios Tzimiropoulos · Brais Martinez

[ ExHall D ]

Abstract
Diffusion models are proficient at generating high-quality images. They are however effective only when operating at the resolution used during training. Inference at a scaled resolution leads to repetitive patterns and structural distortions. Retraining at higher resolutions quickly becomes prohibitive. Thus, methods enabling pre-existing diffusion models to operate at flexible test-time resolutions are highly desirable. Previous works suffer from frequent artifacts and often introduce large latency overheads. We propose two simple modules that combine to solve these issues. We introduce a Frequency Modulation (FM) module that leverages the Fourier domain to improve the global structure consistency, and an Attention Modulation (AM) module which improves the consistency of local texture patterns, a problem largely ignored in prior works. Our method, coined FAM diffusion, can seamlessly integrate into any latent diffusion model and requires no additional training. Extensive qualitative results highlight the effectiveness of our method in addressing structural and local artifacts, while quantitative results show state-of-the-art performance. Also, our method avoids redundant inference tricks for improved consistency such as patch-based or progressive generation, leading to negligible latency overheads.
Poster
Yuchuan Tian · Jing Han · Chengcheng Wang · Yuchen Liang · Chao Xu · Hanting Chen

[ ExHall D ]

Abstract
Diffusion models have shown exceptional performance in visual generation tasks. Recently, these models have shifted from traditional U-Shaped CNN-Attention hybrid structures to fully transformer-based isotropic architectures. While these transformers exhibit strong scalability and performance, their reliance on complicated self-attention operations results in slow inference speeds. Contrary to these works, we rethink one of the simplest yet fastest modules in deep learning, the 3x3 Convolution, to construct a scaled-up purely convolutional diffusion model. We first discover that an Encoder-Decoder Hourglass design outperforms scalable isotropic architectures for Conv3x3, but it still falls short of our expectations. Further improving the architecture, we introduce sparse skip connections to reduce redundancy and improve scalability. Based on this architecture, we introduce conditioning improvements including stage-specific embeddings, mid-block condition injection, and conditional gating. These improvements lead to our proposed Diffusion CNN (DiC), which serves as a swift yet competitive diffusion architecture baseline. Experiments on various scales and settings show that DiC surpasses existing diffusion transformers by considerable margins in terms of performance while keeping a good speed advantage.
Poster
Yushu Wu · Zhixing Zhang · Yanyu Li · Yanwu Xu · Anil Kag · Yang Sui · Huseyin Coskun · Ke Ma · Aleksei Lebedev · Ju Hu · Dimitris N. Metaxas · Yanzhi Wang · Sergey Tulyakov · Jian Ren

[ ExHall D ]

Abstract
We have witnessed the unprecedented success of diffusion-based video generation over the past year. Recently proposed models from the community have wielded the power to generate cinematic and high-resolution videos with smooth motions from arbitrary input prompts. However, as a supertask of image generation, video generation models require more computation and are thus hosted mostly on cloud servers, limiting broader adoption among content creators. In this work, we propose a comprehensive acceleration framework to bring the power of the large-scale video diffusion model to the hands of edge users. From the network architecture scope, we initialize from a compact image backbone and search out the design and arrangement of temporal layers to maximize hardware efficiency. In addition, we propose a dedicated adversarial fine-tuning algorithm for our efficient model and reduce the denoising steps to 4. Our model, with only 0.6B parameters, can generate a 5-second video on an iPhone 16 PM within 5 seconds. Compared to server-side models that take minutes on powerful GPUs to generate a single video, we accelerate the generation by orders of magnitude while delivering on-par quality.
Poster
Ziqi Pang · Tianyuan Zhang · Fujun Luan · Yunze Man · Hao Tan · Kai Zhang · William Freeman · Yu-Xiong Wang

[ ExHall D ]

Abstract
We introduce RandAR, a decoder-only visual autoregressive (AR) model capable of generating images in arbitrary token orders. Unlike previous decoder-only AR models that rely on a predefined generation order, RandAR removes this inductive bias, unlocking new capabilities in decoder-only generation. Our essential design enabling random order is to insert a "position instruction token" before each image token to be predicted, representing the spatial location of the next image token. Trained on randomly permuted token sequences (a more challenging task than fixed-order generation), RandAR achieves comparable performance to its conventional raster-order counterpart. More importantly, decoder-only transformers trained on random orders acquire new capabilities. For the efficiency bottleneck of AR models, RandAR adopts parallel decoding with KV-Cache at inference time, enjoying a 2.5x acceleration without sacrificing generation quality. Additionally, RandAR supports inpainting, outpainting and resolution extrapolation in a zero-shot manner. We hope RandAR inspires new directions for decoder-only visual generation models and broadens their applications across diverse scenarios.
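A small sketch of the sequence construction described above: image tokens are randomly permuted, and each is preceded by an embedding of its spatial position, so a decoder-only model always knows where the next token it must predict will go. The embedding details and names are illustrative assumptions rather than the authors' implementation.

```python
import torch

def build_randar_sequence(image_tokens, pos_embed, tok_embed):
    """Interleave a position-instruction embedding before each (randomly
    ordered) image-token embedding: [pos_0, tok_0, pos_1, tok_1, ...]."""
    n = image_tokens.shape[0]
    order = torch.randperm(n)                        # random generation order
    pos = pos_embed(order)                           # (n, d): where the next token goes
    tok = tok_embed(image_tokens[order])             # (n, d): the tokens themselves
    seq = torch.stack([pos, tok], dim=1).reshape(2 * n, -1)
    return seq, order

num_positions, vocab, d = 256, 1024, 32              # e.g. a 16x16 token grid
pos_embed = torch.nn.Embedding(num_positions, d)
tok_embed = torch.nn.Embedding(vocab, d)
image_tokens = torch.randint(0, vocab, (num_positions,))
seq, order = build_randar_sequence(image_tokens, pos_embed, tok_embed)
print(seq.shape)                                     # torch.Size([512, 32])
```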
Poster
Zijian Zhou · Shikun Liu · Xiao Han · Haozhe Liu · Kam Woh Ng · Tian Xie · Yuren Cong · Hang Li · Mengmeng Xu · Juan-Manuel Pérez-Rúa · Aditya Patel · Tao Xiang · Miaojing Shi · Sen He

[ ExHall D ]

Abstract
Controllable person image generation aims to generate a person image conditioned on reference images, allowing precise control over the person's appearance or pose. However, prior methods often distort fine-grained textural details from the reference image, despite achieving high overall image quality. We attribute these distortions to inadequate attention to corresponding regions in the reference image. To address this, we propose learning flow fields in attention (Leffa), which explicitly guides the target query to attend to the correct reference key in the attention layer during training. Specifically, it is realized via a regularization loss on top of the attention map within a diffusion-based baseline. Our extensive experiments show that Leffa achieves state-of-the-art performance in controlling appearance (virtual try-on) and pose (pose transfer), significantly reducing fine-grained detail distortion while maintaining high image quality. Additionally, we show that our loss is model-agnostic and can be used to improve the performance of other diffusion models.
Poster
Xiao Zhang · Ruoxi Jiang · Rebecca Willett · Michael Maire

[ ExHall D ]

Abstract
We introduce nested diffusion models, an efficient and powerful hierarchical generative framework that substantially enhances the generation quality of diffusion models, particularly for images of complex scenes. Our approach employs a series of diffusion models to progressively generate latent variables at different semantic levels. Each model in this series is conditioned on the output of the preceding higher-level models, culminating in image generation. Hierarchical latent variables guide the generation process along predefined semantic pathways, allowing our approach to capture intricate structural details while significantly improving image quality. To construct these latent variables, we leverage a pre-trained visual encoder, which learns strong semantic visual representations, and modulate its capacity via dimensionality reduction and noise injection. Across multiple datasets, our system demonstrates significant enhancements in image quality for both unconditional and class/text conditional generation. Moreover, our unconditional generation system substantially outperforms the baseline conditional system. These advancements incur minimal computational overhead as the more abstract levels of our hierarchy work with lower-dimensional representations.
Poster
Myunsoo Kim · Donghyeon Ki · Seong-Woong Shim · Byung-Jun Lee

[ ExHall D ]

Abstract
As a highly expressive generative model, diffusion models have demonstrated exceptional success across various domains, including image generation, natural language processing, and combinatorial optimization. However, as data distributions grow more complex, training these models to convergence becomes increasingly computationally intensive. While diffusion models are typically trained using uniform timestep sampling, our research shows that the variance in stochastic gradients varies significantly across timesteps, with high-variance timesteps becoming bottlenecks that hinder faster convergence. To address this issue, we introduce a non-uniform timestep sampling method that prioritizes these more critical timesteps. Our method tracks the impact of gradient updates on the objective for each timestep, adaptively selecting those most likely to minimize the objective effectively. Experimental results demonstrate that this approach not only accelerates the training process, but also leads to improved performance at convergence. Furthermore, our method shows robust performance across various datasets, scheduling strategies, and diffusion architectures, outperforming previously proposed timestep sampling and weighting heuristics that lack this degree of robustness.
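The abstract does not define the exact tracking rule; below is a hedged sketch of one way to prioritize timesteps, keeping an exponential moving average of the per-timestep loss (as a proxy for its impact on the objective) and sampling timesteps proportionally to it. The EMA proxy and class design are assumptions for illustration only.

```python
import torch

class TimestepSampler:
    """Sample diffusion timesteps proportionally to a running estimate of how
    much each timestep currently contributes to the training objective."""

    def __init__(self, num_timesteps: int, ema: float = 0.9):
        self.ema = ema
        self.impact = torch.ones(num_timesteps)      # start from a uniform prior

    def sample(self, batch_size: int) -> torch.Tensor:
        probs = self.impact / self.impact.sum()
        return torch.multinomial(probs, batch_size, replacement=True)

    def update(self, timesteps: torch.Tensor, losses: torch.Tensor) -> None:
        for t, l in zip(timesteps.tolist(), losses.detach().tolist()):
            self.impact[t] = self.ema * self.impact[t] + (1 - self.ema) * l

sampler = TimestepSampler(num_timesteps=1000)
t = sampler.sample(batch_size=8)                     # feed these into the diffusion loss
per_sample_loss = torch.rand(8)                      # placeholder for the actual losses
sampler.update(t, per_sample_loss)
```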
Poster
Nanye Ma · Shangyuan Tong · Haolin Jia · Hexiang Hu · Yu-Chuan Su · Mingda Zhang · Xuan Yang · Yandong Li · Tommi Jaakkola · Xuhui Jia · Saining Xie

[ ExHall D ]

Abstract
Generative models have made significant impacts across various domains, largely due to their ability to scale during training by increasing data, computational resources, and model size, a phenomenon characterized by the scaling laws. Recent research has begun to explore inference-time scaling behavior in large language models (LLMs), revealing how performance can further improve with additional computation during inference. Unlike LLMs, diffusion models inherently possess the flexibility to adjust inference-time computation via the number of function evaluations (NFE), although the performance gains typically flatten after a few dozen steps. In this work, we present a framework on the inference-time scaling for diffusion models, that enables diffusion models to further benefit from the increased computation beyond the NFE plateau. Specifically, we consider a search problem aimed at identifying better noises during the sampling process. We structure the design space along two axes: the verifiers used to provide feedback, and the algorithms used to find better candidates. Through extensive experiments on class-conditioned and text-conditioned image generation benchmarks, our findings reveal that increasing inference-time compute leads to substantial improvements in the quality of samples generated by diffusion models, and with the complicated nature of images, combinations of the components in the framework can be …
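A minimal sketch of the simplest instance of the search framework described above: sample several starting noises, run the sampler on each, score the results with a verifier, and keep the best. `sampler` and `verifier` are placeholders for the paper's components; the toy stand-ins exist only so the sketch runs.

```python
import torch

def best_of_n_noise_search(sampler, verifier, n_candidates: int, shape):
    """Inference-time scaling by searching over initial noises: generate one
    sample per candidate noise, score each with a verifier, return the best."""
    best_sample, best_score = None, float("-inf")
    for _ in range(n_candidates):
        noise = torch.randn(shape)
        sample = sampler(noise)                      # full denoising run
        score = verifier(sample)                     # scalar feedback signal
        if score > best_score:
            best_sample, best_score = sample, score
    return best_sample, best_score

# Toy stand-ins; real ones would be a diffusion sampler and, e.g., a
# CLIP- or reward-model-based verifier.
sampler = lambda z: torch.tanh(z)
verifier = lambda x: x.mean().item()
sample, score = best_of_n_noise_search(sampler, verifier, n_candidates=4, shape=(1, 3, 32, 32))
```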
Poster
Hermann Kumbong · Xian Liu · Tsung-Yi Lin · Ming-Yu Liu · Xihui Liu · Ziwei Liu · Daniel Y Fu · Christopher Re · David W. Romero

[ ExHall D ]

Abstract
Visual AutoRegressive modeling (VAR) shows promise in bridging the speed and quality gap between autoregressive image models and diffusion models. VAR reformulates autoregressive modeling by decomposing an image into successive resolution scales. During inference, an image is generated by predicting all the tokens in the next (higher-resolution) scale, conditioned on all tokens in all previous (lower-resolution) scales. However, this formulation suffers from reduced image quality due to parallel generation of all tokens in a resolution scale; has sequence lengths scaling superlinearly in image resolution; and requires retraining to change the resolution sampling schedule. We introduce Hierarchical Masked AutoRegressive modeling (HMAR), a new image generation algorithm that alleviates these issues using next-scale prediction and masked prediction to generate high-quality images with fast sampling. HMAR reformulates next-scale prediction as a Markovian process, wherein the prediction of each resolution scale is conditioned only on tokens in its immediate predecessor instead of the tokens in all predecessor resolutions. When predicting a resolution scale, HMAR uses a controllable multi-step masked generation procedure to generate a subset of the tokens in each step. On ImageNet 256×256 and 512×512 benchmarks, HMAR models match or outperform parameter-matched VAR, diffusion, and autoregressive baselines. We develop efficient IO-aware …
Poster
Liao Qu · Huichao Zhang · Yiheng Liu · Xu Wang · Yi Jiang · Yiming Gao · Hu Ye · Daniel Kang Du · Zehuan Yuan · Xinglong Wu

[ ExHall D ]

Abstract
We present TokenFlow, a novel unified image tokenizer that bridges the long-standing gap between multimodal understanding and generation. Prior research has attempted to employ a single reconstruction-targeted Vector Quantization (VQ) encoder to unify these two tasks. We observe that understanding and generation require fundamentally different granularities of visual information. This leads to a critical trade-off, particularly compromising performance in multimodal understanding tasks. TokenFlow addresses this challenge through an innovative dual-codebook architecture that decouples semantic and pixel-level feature learning while maintaining their alignment via a shared mapping mechanism. This design enables direct access to both high-level semantic representations crucial for understanding tasks and fine-grained visual features essential for generation through shared indices. Our extensive experiments demonstrate TokenFlow's superiority across multiple dimensions. Leveraging TokenFlow, we demonstrate for the first time that discrete visual input can surpass LLaVA-1.5 13B in understanding performance, achieving a 7.2% average improvement. For image reconstruction, we achieve a strong FID score of 0.63 at 384×384 resolution. Moreover, TokenFlow establishes state-of-the-art performance in autoregressive image generation with a GenEval score of 0.55 at 256×256 resolution, achieving comparable results to SDXL.
Poster
Subhadeep Koley · Tapas Kumar Dutta · Aneeshan Sain · Pinaki Nath Chowdhury · Ayan Kumar Bhunia · Yi-Zhe Song

[ ExHall D ]

Abstract
While foundation models have revolutionised computer vision, their effectiveness for sketch understanding remains limited by the unique challenges of abstract, sparse visual inputs. Through systematic analysis, we uncover two fundamental limitations: Stable Diffusion (SD) struggles to extract meaningful features from abstract sketches (unlike its success with photos), and exhibits a pronounced frequency-domain bias that suppresses essential low-frequency components needed for sketch understanding. Rather than costly retraining, we address these limitations by strategically combining SD with CLIP, whose strong semantic understanding naturally compensates for SD's spatial-frequency biases. By dynamically injecting CLIP features into SD's denoising process and adaptively aggregating features across semantic levels, our method achieves state-of-the-art performance in sketch retrieval (+3.35%), recognition (+1.06%), segmentation (+29.42%), and correspondence learning (+21.22%), demonstrating the first truly universal sketch feature representation in the era of foundation models.
Poster
Roberto Henschel · Levon Khachatryan · Hayk Poghosyan · Daniil Hayrapetyan · Vahram Tadevosyan · Zhangyang Wang · Shant Navasardyan · Humphrey Shi

[ ExHall D ]

Abstract
Text-to-video diffusion models enable the generation of high-quality videos that follow text instructions, simplifying the process of producing diverse and individual content. Current methods excel in generating short videos (up to 16s), but produce hard-cuts when naively extended to long video synthesis. To overcome these limitations, we present StreamingT2V, an autoregressive method that generates long videos of up to 2 minutes or longer with seamless transitions. The key components are: (i) a short-term memory block called conditional attention module (CAM), which conditions the current generation on the features extracted from the preceding chunk via an attentional mechanism, leading to consistent chunk transitions, (ii) a long-term memory block called appearance preservation module (APM), which extracts high-level scene and object features from the first video chunk to prevent the model from forgetting the initial scene, and (iii) a randomized blending approach that allows for the autoregressive application of a video enhancer on videos of indefinite length, ensuring consistency across chunks. Experiments show that StreamingT2V produces more motion, while competing methods suffer from video stagnation when applied naively in an autoregressive fashion. Thus, we propose with StreamingT2V a high-quality seamless text-to-long video generator, surpassing competitors in both consistency and motion.
Poster
Hongjie Wang · Chih-Yao Ma · Yen-Cheng Liu · Ji Hou · Tao Xu · Jialiang Wang · Felix Juefei-Xu · Yaqiao Luo · Peizhao Zhang · Tingbo Hou · Peter Vajda · Niraj Jha · Xiaoliang Dai

[ ExHall D ]

Abstract
Text-to-video generation enhances content creation but is highly computationally intensive: The computational cost of Diffusion Transformers (DiTs) scales quadratically in the number of pixels. This makes minute-length video generation extremely expensive, limiting most existing models to generating videos of only 10-20 seconds in length. We propose a Linear-complexity text-to-video Generation (LinGen) framework whose cost scales linearly in the number of pixels. For the first time, LinGen enables high-resolution minute-length video generation on a single GPU without compromising quality. It replaces the computationally-dominant and quadratic-complexity block, self-attention, with a linear-complexity block called MATE, which consists of an MA-branch and a TE-branch. The MA-branch targets short-to-long-range correlations, combining a bidirectional Mamba2 block with our token rearrangement method, Rotary Major Scan, and our review tokens developed for long video generation. The TE-branch is a novel TEmporal Swin Attention block that focuses on temporal correlations between adjacent tokens and medium-range tokens. The MATE block addresses the adjacency preservation issue of Mamba and improves the consistency of generated videos significantly. Experimental results show that LinGen outperforms DiT (with a 75.6% win rate) in video quality with up to 15× (11.5×) FLOPs (latency) reduction. Furthermore, both automatic metrics and human evaluation demonstrate our LinGen-4B yields comparable video …
Poster
Yukun Wang · Longguang Wang · Zhiyuan Ma · Qibin Hu · Kai Xu · Yulan Guo

[ ExHall D ]

Abstract
Although the typical inversion-then-editing paradigm using text-to-image (T2I) models has demonstrated promising results, directly extending it to text-to-video (T2V) models still suffers from severe artifacts such as color flickering and content distortion. Consequently, current video editing methods primarily rely on T2I models, which inherently lack temporal-coherence generative ability, often resulting in inferior editing results. In this paper, we attribute the failure of the typical editing paradigm to: 1) Tight Spatial-temporal Coupling. The vanilla pivotal-based inversion strategy struggles to disentangle spatial-temporal information in the video diffusion model; 2) Complicated Spatial-temporal Layout. The vanilla cross-attention control strategy is deficient in preserving the unedited content. To address these limitations, we propose a spatial-temporal decoupled guidance (STDG) and multi-frame null-text optimization strategy to provide pivotal temporal cues for more precise pivotal inversion. Furthermore, we introduce a self-attention control strategy to maintain higher fidelity for precise partial content editing. Experimental results demonstrate that our method (termed VideoDirector) effectively harnesses the powerful temporal generation capabilities of T2V models, producing edited videos with state-of-the-art performance in accuracy, motion smoothness, realism, and fidelity to unedited content.
Poster
Dohun Lee · Bryan Sangwoo Kim · Geon Yeong Park · Jong Chul Ye

[ ExHall D ]

Abstract
Text-to-image (T2I) diffusion models have revolutionized visual content creation, but extending these capabilities to text-to-video (T2V) generation remains a challenge, particularly in preserving temporal consistency. Existing methods that aim to improve consistency often cause trade-offs such as reduced imaging quality and impractical computational time. To address these issues we introduce VideoGuide, a novel framework that enhances the temporal consistency of pretrained T2V models without the need for additional training or fine-tuning. Instead, VideoGuide leverages any pretrained video diffusion model (VDM) or itself as a guide during the early stages of inference, improving temporal quality by interpolating the guiding model’s denoised samples into the sampling model's denoising process. The proposed method brings about significant improvement in temporal consistency and image fidelity, providing a cost-effective and practical solution that synergizes the strengths of various video diffusion models. Furthermore, we demonstrate prior distillation, revealing that base models can achieve enhanced text coherence by utilizing the superior data prior of the guiding model through the proposed method. Project Page: https://videoguide2025.github.io/
Poster
Xi Wang · Robin Courant · Marc Christie · Vicky Kalogeiton

[ ExHall D ]

Abstract
Recent advances in text-conditioned video diffusion have greatly improved video quality. However, these methods offer limited or sometimes no control to users on camera aspects, including dynamic camera motion, zoom, distorted lens and focus shifts. These motion and optical aspects are crucial for adding controllability and cinematic elements to generation frameworks, ultimately resulting in visual content that draws focus, enhances mood, and guides emotions according to filmmakers' controls. In this paper, we aim to close the gap between controllable video generation and camera optics. To achieve this, we propose AKiRa (Augmentation Kit on Rays), a novel augmentation framework that builds and trains a camera adapter with a complex camera model over an existing video generation backbone. It enables fine-tuned control over camera motion as well as complex optical parameters (focal length, distortion, aperture) to achieve cinematic effects such as zoom, fisheye effect, and bokeh. Extensive experiments demonstrate AKiRa's effectiveness in combining and composing camera optics while outperforming all state-of-the-art methods. This work sets a new landmark in controlled and optically enhanced video generation, paving the way for future camera diffusion methods.
Poster
Mingi Kwon · Shin seong Kim · Jaeseok Jeong · Yi-Ting Hsiao · Youngjung Uh

[ ExHall D ]

Abstract
Diffusion models have achieved remarkable success in text-to-image synthesis, largely attributed to the use of classifier-free guidance (CFG), which enables high-quality, condition-aligned image generation. CFG combines the conditional score (e.g., text-conditioned) with the unconditional score to control the output. However, the unconditional score is in charge of estimating the transition between manifolds of adjacent timesteps from x_t to x_{t-1}, which may inadvertently interfere with the trajectory toward the specific condition. In this work, we introduce a novel approach that leverages a geometric perspective on the unconditional score to enhance CFG performance when conditional scores are available. Specifically, we propose a method that filters the singular vectors of both conditional and unconditional scores using singular value decomposition. This filtering process aligns the unconditional score with the conditional score, thereby refining the sampling trajectory to stay closer to the manifold. Our approach improves image quality with negligible additional computation. We provide deeper insights into the score function behavior in diffusion models and present a practical technique for achieving more accurate and contextually coherent image synthesis.
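As a rough illustration of the kind of filtering described (the abstract does not specify the exact decomposition or filtering rule, so the choices below are assumptions), one could project the unconditional score onto the leading singular directions of the conditional score before applying standard CFG:

```python
import torch

def filtered_cfg(eps_cond: torch.Tensor, eps_uncond: torch.Tensor,
                 guidance_scale: float = 7.5, k: int = 8) -> torch.Tensor:
    """Project the unconditional score onto the k leading singular directions
    of the conditional score, then apply standard classifier-free guidance.
    Scores are assumed to have shape (C, H, W) for a single sample."""
    c, h, w = eps_cond.shape
    m_cond = eps_cond.reshape(c, h * w)
    _, _, vh = torch.linalg.svd(m_cond, full_matrices=False)
    vk = vh[:k]                                     # leading right-singular directions
    m_uncond = eps_uncond.reshape(c, h * w)
    m_uncond_aligned = (m_uncond @ vk.T) @ vk       # projection onto that subspace
    eps_uncond_f = m_uncond_aligned.reshape(c, h, w)
    return eps_uncond_f + guidance_scale * (eps_cond - eps_uncond_f)
```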
Poster
Zixuan Ye · Huijuan Huang · Xintao Wang · Pengfei Wan · Di ZHANG · Wenhan Luo

[ ExHall D ]

Abstract
Style control has been popular in video generation models. Existing methods often generate videos far from the given style, cause content leakage, and struggle to transfer one video to the desired style. Our first observation is that the style extraction stage matters, whereas existing methods emphasize global style but ignore local textures. In order to bring texture features while preventing content leakage, we filter out content-related patches while retaining style ones based on prompt-patch similarity; for global style extraction, we generate a paired style dataset through model illusion to facilitate contrastive learning, which greatly enhances the absolute style consistency. Moreover, to fill in the image-to-video gap, we train a lightweight motion adapter on still videos, which implicitly enhances the stylization extent and enables our image-trained model to be seamlessly applied to videos. Benefiting from these efforts, our approach, StyleMaster, not only achieves significant improvement in both style resemblance and temporal coherence, but also can easily generalize to video style transfer with a gray tile ControlNet. Extensive experiments and visualizations demonstrate that StyleMaster significantly outperforms competitors, effectively generating high-quality stylized videos that align with textual content and closely resemble the style of reference images.
Poster
Nadav Z. Cohen · Oron Nir · Ariel Shamir

[ ExHall D ]

Abstract
Balancing content fidelity and artistic style is a pivotal challenge in image generation. While traditional style transfer methods and modern Denoising Diffusion Probabilistic Models (DDPMs) strive to achieve this balance, they often struggle to do so without sacrificing either style, content, or sometimes both. This work addresses this challenge by analyzing the ability of DDPMs to maintain content and style equilibrium. We introduce a novel method to identify sensitivities within the DDPM attention layers, identifying specific layers that correspond to different stylistic aspects. By directing conditional inputs only to these sensitive layers, our approach enables fine-grained control over style and content, significantly reducing issues arising from over-constrained inputs. Our findings demonstrate that this method enhances recent stylization techniques by better aligning style and content, ultimately improving the quality of generated visual content.
Poster
Yufan Ren · Zicong Jiang · Tong Zhang · Søren Forchhammer · Sabine Süsstrunk

[ ExHall D ]

Abstract
Text-guided image editing using Text-to-Image (T2I) models often fails to yield satisfactory results, frequently introducing unintended modifications such as loss of local details and color alterations. In this paper, we analyze these failure cases and attribute them to the indiscriminate optimization across all frequency bands, even though only specific frequencies may require adjustment. To address this, we introduce a simple yet effective approach that enables selective optimization of specific frequency bands within spatially localized regions, allowing for precise edits. Our method leverages wavelets to decompose images into different spatial resolutions across multiple frequency bands, enabling precise modifications across different levels of detail. To extend the applicability of our approach, we also provide a comparative analysis of different frequency-domain techniques. Additionally, we extend our method to 3D texture editing by performing frequency decomposition on the triplane representation, achieving frequency-aware adjustments for 3D texture editing. Quantitative evaluations and user studies demonstrate the effectiveness of our method in producing high-quality and precise edits. Code will be released upon publication.
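A minimal sketch of frequency-selective editing in this spirit, assuming a PyWavelets decomposition and a simple rule that takes only chosen detail levels from an edited image inside a spatial mask (the paper's actual optimization happens inside the diffusion process, which is omitted here):

```python
import numpy as np
import pywt

def bandlimited_edit(image: np.ndarray, edited: np.ndarray, mask: np.ndarray,
                     levels_to_apply=(2, 3), wavelet: str = "haar",
                     n_levels: int = 3) -> np.ndarray:
    """Keep the original image everywhere except in the chosen wavelet detail
    levels (1 = coarsest detail, n_levels = finest) inside the spatial mask."""
    c_orig = pywt.wavedec2(image, wavelet, level=n_levels)
    c_edit = pywt.wavedec2(edited, wavelet, level=n_levels)
    merged = [c_orig[0]]                       # keep the coarse approximation
    for lvl in range(1, n_levels + 1):
        merged.append(c_edit[lvl] if lvl in levels_to_apply else c_orig[lvl])
    recon = pywt.waverec2(merged, wavelet)[: image.shape[0], : image.shape[1]]
    return np.where(mask > 0, recon, image)    # restrict the edit spatially
```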
Poster
Fengyi Fu · Lei Zhang · Mengqi Huang · Zhendong Mao

[ ExHall D ]

Abstract
Text-based image editing, which aims at generating rigid or non-rigid changes to images conditioned on the given text, has recently attracted considerable interest. Previous works mainly follow the multi-step denoising diffusion paradigm, which adopts a fixed text guidance intensity (i.e., editing intensity) to inject textual features, while ignoring the step-specific editing requirements. This work argues that the editing intensity at each denoising step should be adaptively adjusted conditioned on the historical editing degree, to provide accurate text guidance for the whole denoising process. We thereby propose a novel feedback editing framework (FeedEdit), a training-free method which explicitly exploits feedback regulation of editing intensity to ensure precise and harmonious editing at all steps. Specifically, we design (1) a Dynamic Editing Degree Perceiving module, based on specific frequency-domain filtering, to enhance and exploit the correlation between feature differences and editing degree; (2) a Proportional-Integral feedback controller, to automatically map the perceived editing errors into appropriate feedback control signals; and (3) a Phrase-level Regulating Strategy, to achieve fine-grained, function-specific regulation of textual features. Extensive experiments demonstrate the superiority of FeedEdit over existing methods in both editability and quality, especially for multi-function editing scenarios.
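The proportional-integral regulation of editing intensity can be sketched as follows; the gains, the notion of "editing degree", and the mapping onto a guidance scale are all assumptions made for illustration rather than values from the paper:

```python
class PIEditController:
    """Map the perceived editing error at each denoising step to an adjusted
    text-guidance intensity with a proportional-integral rule."""

    def __init__(self, kp: float = 0.6, ki: float = 0.1, base_scale: float = 7.5):
        self.kp, self.ki, self.base_scale = kp, ki, base_scale
        self.integral = 0.0

    def step(self, target_degree: float, perceived_degree: float) -> float:
        error = target_degree - perceived_degree   # editing lags behind -> positive error
        self.integral += error
        return self.base_scale + self.kp * error + self.ki * self.integral
```

Calling step once per denoising iteration yields a guidance scale that rises when editing lags the target degree and relaxes once the edit catches up.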
Poster
Duong H. Le · Tuan Pham · Sangho Lee · Christopher Clark · Aniruddha Kembhavi · Stephan Mandt · Ranjay Krishna · Jiasen Lu

[ ExHall D ]

Abstract
We introduce OneDiffusion, a single large-scale diffusion model designed to tackle a wide range of image synthesis and understanding tasks. It can generate images conditioned on text, depth, pose, layout, or semantic maps. It also handles super-resolution, multi-view generation, instant personalization, and text-guided image editing. Additionally, the same model is also effective for image understanding tasks such as depth estimation, pose estimation, open-vocab segmentation, and camera pose estimation, etc. Our unified model is trained with a simple but effective recipe: we cast all tasks as modeling a sequence of frames, injecting different noise scales into each frame during training. During inference, we can set any frame to a clean image for tasks like conditional image generation, camera pose estimation, or generating paired depth and images from a caption. Experimental results show that our proposed method achieves competitive performance on a wide range of tasks. By eliminating the need for specialized architectures or finetuning, OneDiffusion provides enhanced flexibility and scalability, reflecting the emerging trend towards general-purpose diffusion models in the field.
Poster
Yanfeng Li · Ka-Hou Chan · Yue Sun · Chan-Tong Lam · Tong Tong · Zitong YU · Keren Fu · Xiaohong Liu · Tao Tan

[ ExHall D ]

Abstract
Multi-object images are widely present in the real world, spanning various areas of daily life. Efficient and accurate editing of these images is crucial for applications such as augmented reality, advertisement design, and medical imaging. Stable Diffusion (SD) has ushered in a new era of high-quality image generation and editing. However, existing methods struggle to analyze abstract relationships between multiple objects, often yielding suboptimal performance. To address this, we propose MoEdit, an auxiliary-free method for multi-object image editing. This method enables high-quality editing of attributes in multi-object images, such as style, object features, and background, by maintaining quantity consistency between the input and output images. To preserve the concept of quantity, we introduce a Quantity Attention (QTTN) module to control the editing process. Additionally, we present a Feature Compensation (FeCom) module to enhance the robustness of object preservation. We also employ a null-text training strategy to retain the guiding capability of text prompts, making them plug-and-play modules. This method leverages the powerful image perception and generation capabilities of the SD model, enabling specific concepts to be preserved and altered between input and output images. Experimental results demonstrate that MoEdit achieves state-of-the-art performance in high-quality multi-object image editing. Code will …
Poster
Yingjing Xu · Jie Kong · Jiazhi Wang · Xiao Pan · Bo Lin · Qiang Liu

[ ExHall D ]

Abstract
In this paper, we focus on the task of instruction-based image editing. Previous works like InstructPix2Pix, InstructDiffusion, and SmartEdit have explored end-to-end editing. However, two limitations still remain: First, existing datasets suffer from low resolution, poor background consistency, and overly simplistic instructions. Second, current approaches mainly condition on the text while the rich image information is underexplored, therefore inferior in complex instruction following and maintaining background consistency. Targeting these issues, we first curated the AdvancedEdit dataset using a novel data construction pipeline, formulating a large-scale dataset with high visual quality, complex instructions, and good background consistency. Then, to further inject the rich image information, we introduce a two-stream bridging mechanism utilizing both the textual and visual features reasoned by the powerful Multimodal Large Language Models (MLLM) to guide the image editing process more precisely. Extensive results demonstrate that our approach, InsightEdit, achieves state-of-the-art performance, excelling in complex instruction following and maintaining high background consistency with the original image.
Poster
Mingdeng Cao · Xuaner Zhang · Yinqiang Zheng · Zhihao Xia

[ ExHall D ]

Abstract
This paper introduces a novel dataset construction pipeline that samples pairs of frames from videos and uses multimodal large language models (MLLMs) to generate editing instructions for training instruction-based image manipulation models. Video frames inherently preserve the identity of subjects and scenes, ensuring consistent content preservation during editing. Additionally, video data captures diverse, natural dynamics—such as non-rigid subject motion and complex camera movements—that are difficult to model otherwise, making it an ideal source for scalable dataset construction. Using this approach, we create a new dataset to train InstructMove, a model capable of instruction-based complex manipulations that are difficult to achieve with synthetically generated datasets. Our model demonstrates state-of-the-art performance in tasks such as adjusting subject poses, rearranging elements, and altering camera perspectives.
Poster
Mushui Liu · Dong She · Qihan Huang · Jiacheng Ying · Wanggui He · Jingxuan Pang · Yuanlei Hou · Siming Fu

[ ExHall D ]

Abstract
Subject-driven image personalization has seen notable advancements, especially with the advent of the ReferenceNet paradigm. ReferenceNet excels in integrating image reference features, making it highly applicable in creative and commercial settings. However, current implementations of ReferenceNet primarily operate as latent-level feature extractors, which limits their potential. This constraint hinders the provision of appropriate features to the denoising backbone across different timesteps, leading to suboptimal image consistency. In this paper, we revisit the extraction of reference features and propose TFCustom, a model framework designed to focus on reference image features at different temporal steps and frequency levels. Specifically, we first propose a synchronized ReferenceNet to extract reference image features while simultaneously optimizing noise injection and denoising for the reference image. We also propose a time-aware frequency feature refinement module that leverages high- and low-frequency filters, combined with time embeddings, to adaptively select the degree of reference feature injection. Additionally, to enhance the similarity between reference objects and the generated image, we introduce a novel reward-based loss that encourages greater alignment between the reference and generated images. Experimental results demonstrate state-of-the-art performance in both multi-object and single-object reference generation, with significant improvements in texture and textual detail generation over existing methods.
Poster
Pengfei Zhou · Xiaopeng Peng · Jiajun Song · Chuanhao Li · Zhaopan Xu · Yue Yang · Ziyao Guo · Hao Zhang · Yuqi Lin · Yefei He · Lirui Zhao · Shuo Liu · Tianhua Li · Yuxuan Xie · Xiaojun Chang · Yu Qiao · Wenqi Shao · Kaipeng Zhang

[ ExHall D ]

Abstract
Multimodal Large Language Models (MLLMs) have made significant strides in visual understanding and generation tasks. However, generating interleaved image-text content remains a challenge, which requires integrated multimodal understanding and generation abilities. While the progress in unified models offers new solutions, existing benchmarks are insufficient for evaluating these methods due to data size and diversity limitations. To bridge this gap, we introduce OpenING, a comprehensive benchmark comprising 5,400 high-quality human-annotated instances across 56 real-world tasks. OpenING covers diverse daily scenarios such as travel guides, design, and brainstorming, offering a robust platform for challenging interleaved generation methods. In addition, we present IntJudge, a judge model for evaluating open-ended multimodal generation methods. Trained with a novel data pipeline, our IntJudge achieves an agreement rate of 82.42% with human judgments, outperforming GPT-based evaluators by 11.34%. Extensive experiments on OpenING reveal that current interleaved generation methods still have substantial room for improvement. Key findings on interleaved image-text generation are further presented to guide the development of next-generation models. The benchmark, code and judge models will be released.
Poster
Edurne Bernal-Berdun · Ana Serrano · Belen Masia · Matheus Gadelha · Yannick Hold-Geoffroy · Xin Sun · Diego Gutierrez

[ ExHall D ]

Abstract
Images as an artistic medium often rely on specific camera angles and lens distortions to convey ideas or emotions; however, such precise control is missing in current text-to-image models. We propose an efficient and general solution that allows precise control over the camera when generating both photographic and artistic images. Unlike prior methods that rely on predefined shots, we rely solely on four simple extrinsic and intrinsic camera parameters, removing the need for pre-existing geometry, reference 3D objects, and multi-view data. We also present a novel dataset with more than 57,000 images, along with their text prompts and ground-truth camera parameters. Our evaluation shows precise camera control in text-to-image generation, surpassing traditional prompt engineering approaches.
Poster
Jialuo Li · Wenhao Chai · XINGYU FU · Haiyang Xu · Saining Xie

[ ExHall D ]

Abstract
This paper presents a novel approach to integrating scientific knowledge into generative models, enhancing their realism and consistency in image synthesis. We present SciScore, an end-to-end reward model that refines the assessment of generated images based on scientific knowledge, which is achieved by augmenting both the scientific comprehension and visual capabilities of the pre-trained CLIP model. We also introduce SciBench, an expert-annotated adversarial dataset comprising 30k image pairs with 9k prompts, covering a wide range of distinct scientific knowledge categories. Leveraging SciBench, we propose a two-stage training framework, comprising a supervised fine-tuning phase and a masked online fine-tuning phase, to incorporate scientific knowledge into existing generative models. Through comprehensive experiments, we demonstrate the effectiveness of our framework in establishing new standards for evaluating the scientific realism of generated content. Specifically, SciScore attains performance comparable to human level, demonstrating a 5% improvement similar to evaluations conducted by experienced human experts. Furthermore, by applying our proposed fine-tuning method to FLUX, we achieve a performance enhancement exceeding 50% based on SciScore.
Poster
Wataru Shimoda · Naoto Inoue · Daichi Haraguchi · Hayato Mitani · Seiichi Uchida · Kota Yamaguchi

[ ExHall D ]

Abstract
While recent text-to-image models can generate photorealistic images from text prompts that reflect detailed instructions, they still face significant challenges in accurately rendering words in the image. In this paper, we propose to retouch erroneous text renderings in the post-processing pipeline. Our approach, called Type-R, identifies typographical errors in the generated image, erases the erroneous text, regenerates text boxes for missing words, and finally corrects typos in the rendered words. Through extensive experiments, we show that Type-R, in combination with the latest text-to-image models such as Stable Diffusion or Flux, achieves the highest text rendering accuracy while maintaining image quality and also outperforms text-focused generation baselines in terms of balancing text accuracy and image quality.
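The four-stage post-processing loop reads naturally as a pipeline. The sketch below is purely schematic: the four callables (typo detector, text eraser, layout model, text renderer) are hypothetical stand-ins, not components shipped with Type-R.

```python
def type_r_loop(image, prompt, detect_typos, erase_text, layout_words, render_text):
    """Run the four described stages in sequence; each callable is a
    hypothetical stand-in (OCR typo detector, inpainting eraser,
    layout model, text renderer)."""
    errors = detect_typos(image, prompt)            # [(wrong_word, box), ...]
    image = erase_text(image, [box for _, box in errors])
    placements = layout_words(image, prompt)        # boxes for missing words
    for word, box in placements:
        image = render_text(image, word, box)       # re-render the corrected word
    return image
```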
Poster
Qihao Liu · Xi Yin · Alan L. Yuille · Andrew Brown · Mannat Singh

[ ExHall D ]

Abstract
Diffusion models, and their generalization, flow matching, have had a remarkable impact on the field of media generation. Here, the conventional approach is to learn the complex mapping from a simple source distribution of Gaussian noise to the target media distribution. For cross-modal tasks such as text-to-image generation, this same mapping from noise to image is learnt whilst including a conditioning mechanism in the model. One key and thus far relatively unexplored feature of flow matching is that, unlike diffusion models, it is not constrained to use noise as the source distribution. Hence, in this paper, we propose a paradigm shift, and ask the question of whether we can instead train flow matching models to learn a direct mapping from the distribution of one modality to the distribution of another, thus obviating the need for both the noise distribution and conditioning mechanism. We present a general and simple framework, CrossFlow, for cross-modal flow matching. We show the importance of applying Variational Encoders to the input data, and introduce a method to enable Classifier-free guidance. Surprisingly, for text-to-image, CrossFlow with a vanilla transformer without cross attention slightly outperforms standard flow matching, and we show that it scales better with training steps …
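A compact sketch of flow matching with a non-noise source, under the assumption (consistent with the abstract's use of variational encoders) that the text latent has already been mapped to the same shape as the image latent; the exact CrossFlow objective may differ:

```python
import torch

def crossmodal_flow_loss(model, text_latent: torch.Tensor,
                         image_latent: torch.Tensor) -> torch.Tensor:
    """Flow matching where the source sample is a text latent (already encoded
    to the image-latent shape) rather than Gaussian noise: the model predicts
    the constant velocity of the straight path between the two modalities."""
    b = text_latent.shape[0]
    t = torch.rand(b, device=text_latent.device)
    t_b = t.view(b, *([1] * (text_latent.dim() - 1)))     # broadcastable time
    x_t = (1.0 - t_b) * text_latent + t_b * image_latent  # linear path between modalities
    target_velocity = image_latent - text_latent
    pred_velocity = model(x_t, t)                         # model signature assumed
    return ((pred_velocity - target_velocity) ** 2).mean()
```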
Poster
Chao Feng · Ziyang Chen · Aleksander Holynski · Alexei A. Efros · Andrew Owens

[ ExHall D ]

Abstract
We show that the GPS tags within image metadata provide a useful control signal for image generation. We train GPS-to-image models and use them for tasks that require a fine-grained understanding of how images change their appearance throughout a city. In particular, we train a diffusion model to generate images conditioned on both GPS tag and text, and find that the resulting model learns to successfully vary the contents of its generated images using both control signals, such as by creating images that capture the distinctive appearance of different neighborhoods, parks, and landmarks. We also show that GPS conditioning enables us to "lift" 3D models from 2D GPS-to-image models using score distillation sampling, without the need for explicit camera pose estimation. Our evaluations suggest that our GPS-conditioned models successfully learn to generate images that vary based on location, and that GPS conditioning improves estimated 3D structure.
Poster
Zijie Li · Henry Li · Yichun Shi · Amir Barati Farimani · Yuval Kluger · Linjie Yang · Peng Wang

[ ExHall D ]

Abstract
Diffusion models have gained tremendous success in text-to-image generation, yet still lag behind on visual understanding tasks, an area dominated by autoregressive vision-language models. We propose a large-scale and fully end-to-end diffusion model for multi-modal understanding and generation that significantly improves on existing diffusion-based multimodal models, and is the first of its kind to support the full suite of vision-language modeling capabilities. Inspired by the multimodal diffusion transformer (MM-DiT) and recent advances in discrete diffusion language modeling, we leverage a cross-modal maximum likelihood estimation framework that simultaneously trains the conditional likelihoods of both images and text jointly under a single loss function, which is back-propagated through both branches of the diffusion transformer. The resulting model is highly flexible and capable of a wide range of tasks including image generation, captioning, and visual question answering. Our model attains competitive performance compared to recent unified image understanding and generation models, demonstrating the potential of multimodal diffusion modeling as a promising alternative to autoregressive next-token prediction models.
Poster
Rishubh Parihar · Vaibhav Agrawal · Sachidanand VS · R. Venkatesh Babu

[ ExHall D ]

Abstract
Existing approaches for controlling text-to-image diffusion models, while powerful, do not allow for explicit 3D object-centric control, such as precise control of object orientation. In this work, we address the problem of multi-object orientation control in text-to-image diffusion models. This enables the generation of diverse multi-object scenes with precise orientation control for each object. The key idea is to condition the diffusion model with a set of orientation-aware compass tokens, one for each object, along with text tokens. A light-weight encoder network predicts these compass tokens taking object orientation as input. The model is trained on a synthetic dataset of procedurally generated scenes, each containing one or two 3D assets on a plain background. However, directly training this framework results in poor orientation control and leads to entanglement among objects. To mitigate this, we intervene in the generation process and constrain the cross-attention maps of each compass token to its corresponding object regions. The trained model is able to achieve precise orientation control for a) complex objects not seen during training, and b) multi-object scenes with more than two objects, indicating strong generalization capabilities. Further, when combined with personalization methods, our method precisely controls the orientation of the …
Poster
Jiaxiu Jiang · Yabo Zhang · Kailai Feng · Xiaohe Wu · Wenbo Li · Renjing Pei · Fan Li · Wangmeng Zuo

[ ExHall D ]

Abstract
Customized text-to-image generation, which synthesizes images based on user-specified concepts, has made significant progress in handling individual concepts. However, when extended to multiple concepts, existing methods often struggle with properly integrating different models and avoiding the unintended blending of characteristics from distinct concepts. In this paper, we propose MC2, a novel approach for multi-concept customization that enhances flexibility and fidelity through inference-time optimization. MC2 enables the integration of multiple single-concept models with heterogeneous architectures. By adaptively refining attention weights between visual and textual tokens, our method ensures that image regions accurately correspond to their associated concepts while minimizing interference between concepts. Extensive experiments demonstrate that MC2 outperforms training-based methods in terms of prompt-reference alignment. Furthermore, MC2 can be seamlessly applied to text-to-image generation, providing robust compositional capabilities. To facilitate the evaluation of multi-concept customization, we also introduce a new benchmark, MC++. The code will be publicly available.
Poster
Bin Wu · Wuxuan Shi · Jinqiao Wang · Mang Ye

[ ExHall D ]

Abstract
Pre-trained Vision-Language Models (VLMs) require Continual Learning (CL) to efficiently update their knowledge and adapt to various downstream tasks without retraining from scratch. However, for VLMs, in addition to the loss of knowledge previously learned from downstream tasks, pre-training knowledge is also corrupted during continual fine-tuning. This issue is exacerbated by the unavailability of original pre-training data, leaving the VLM's generalization ability to degrade. In this paper, we propose GIFT, a novel continual fine-tuning approach that utilizes synthetic data to overcome catastrophic forgetting in VLMs. Taking advantage of recent advances in text-to-image synthesis, we employ a pre-trained diffusion model to recreate both pre-training and learned downstream task data. In this way, the VLM can revisit previous knowledge through distillation on matching diffusion-generated images and corresponding text prompts. Leveraging the broad distribution and high alignment between synthetic image-text pairs in the VLM's feature space, we propose a contrastive distillation loss along with an image-text alignment constraint. To further combat in-distribution overfitting and enhance distillation performance with a limited amount of generated data, we incorporate adaptive weight consolidation, utilizing Fisher information from these synthetic image-text pairs and achieving a better stability-plasticity balance. Extensive experiments demonstrate that our method consistently outperforms previous state-of-the-art approaches across various settings.
Poster
Florinel Croitoru · Vlad Hondru · Radu Tudor Ionescu · Nicu Sebe · Mubarak Shah

[ ExHall D ]

Abstract
Direct Preference Optimization (DPO) has been proposed as an effective and efficient alternative to reinforcement learning from human feedback (RLHF). In this paper, we propose a novel and enhanced version of DPO based on curriculum learning for text-to-image generation. Our method is divided into two training stages. First, a ranking of the examples generated for each prompt is obtained by employing a reward model. Then, increasingly difficult pairs of examples are sampled and provided to a text-to-image generative (diffusion or consistency) model. Generated samples that are far apart in the ranking are considered to form easy pairs, while those that are close in the ranking form hard pairs. In other words, we use the rank difference between samples as a measure of difficulty. The sampled pairs are split into batches according to their difficulty levels, which are gradually used to train the generative model. Our approach, Curriculum DPO, is compared against state-of-the-art fine-tuning approaches on six benchmarks, outperforming the competing methods in terms of text alignment, aesthetics and human preference. Our code is available at https://anonymous.4open.science/r/Curriculum-DPO-757F.
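The rank-gap curriculum can be illustrated with a small helper; the number of stages and the bucketing rule below are illustrative assumptions rather than the paper's exact schedule:

```python
def curriculum_pairs(ranked_samples, num_stages: int = 3):
    """Bucket preference pairs by rank gap so that large-gap (easy) pairs are
    trained on before small-gap (hard) ones. `ranked_samples` is a list of
    generations for one prompt, ordered from highest to lowest reward."""
    n = len(ranked_samples)
    assert n >= 2, "need at least two ranked samples per prompt"
    max_gap = n - 1
    stages = [[] for _ in range(num_stages)]
    for i in range(n):
        for j in range(i + 1, n):
            gap = j - i                          # larger gap -> easier pair
            stage = min(num_stages - 1, (max_gap - gap) * num_stages // max_gap)
            stages[stage].append((ranked_samples[i], ranked_samples[j]))
    return stages                                # train DPO on stages[0], then stages[1], ...
```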
Poster
Rui Zhao · Weijia Mao · Mike Zheng Shou

[ ExHall D ]

Abstract
Adapting generative models to specific domains presents an effective solution for satisfying specialized requirements. However, adapting to some complex domains remains challenging, especially when these domains require substantial paired data to capture the targeted distributions. Since unpaired data from a single modality, such as vision or language, is more readily available, we utilize the bidirectional mappings between vision and language learned by the unified generative model to enable training on unpaired data for domain adaptation. Specifically, we propose DoraCycle, which integrates two multimodal cycles: text-to-image-to-text and image-to-text-to-image. The model is optimized through cross-entropy loss computed at the cycle endpoints, where both endpoints share the same modality. This facilitates self-evolution of the model without reliance on annotated text-image pairs. Experimental results demonstrate that for tasks independent of paired knowledge, such as stylization, DoraCycle can effectively adapt the unified model using only unpaired data. For tasks involving new paired knowledge, such as specific identities, a combination of a small set of paired image-text examples and larger-scale unpaired data is sufficient for effective domain-oriented adaptation. The code and model weights will be released.
Poster
Cong Xie · Han Zou · Ruiqi Yu · Yan Zhang · Zhan Zhenpeng

[ ExHall D ]

Abstract
In this work, we are interested in achieving both high text controllability and overall appearance consistency in the generation of personalized human characters. We propose a novel framework, named SerialGen, which is a serial generation method consisting of two stages: first, a standardization stage that standardizes reference images, and then a personalized generation stage based on the standardized reference. Furthermore, we introduce two modules aimed at enhancing the standardization process. Our experimental results validate the proposed framework's ability to produce personalized images that faithfully recover the reference image's overall appearance while accurately responding to a wide range of text prompts. Through thorough analysis, we highlight the critical contribution of the proposed serial generation method and standardization model, evidencing enhancements in appearance consistency between reference and output images and across serial outputs generated from diverse text prompts. The term "Serial" in this work carries a double meaning: it refers to the two-stage method and also underlines our ability to generate serial images with consistent appearance throughout.
Poster
Yuanbo Yang · Jiahao Shao · Xinyang Li · Yujun Shen · Andreas Geiger · Yiyi Liao

[ ExHall D ]

Abstract
In this work, we introduce Prometheus, a 3D-aware latent diffusion model for text-to-3D generation at both object and scene levels in seconds. We formulate 3D scene generation as multi-view, feed-forward, pixel-aligned 3D Gaussian generation within the latent diffusion paradigm. To ensure generalizability, we build our model upon a pre-trained text-to-image generation model with only minimal adjustments, and further train it using a large number of images from both single-view and multi-view datasets. Furthermore, we introduce an RGB-D latent space into 3D Gaussian generation to disentangle appearance and geometry information, enabling efficient feed-forward generation of 3D Gaussians with better fidelity and geometry. Extensive experimental results demonstrate the effectiveness of our method in both feed-forward 3D Gaussian reconstruction and text-to-3D generation.
Poster
Silin Gao · Sheryl Mathew · Li Mi · Sepideh Mamooler · Mengjie Zhao · Hiromi Wakaki · Yuki Mitsufuji · Syrielle Montariol · Antoine Bosselut

[ ExHall D ]

Abstract
Visual narrative generation transforms textual narratives into sequences of images illustrating the content of the text. However, generating visual narratives that are faithful to the input text and self-consistent across generated images remains an open challenge, due to the lack of knowledge constraints used for planning the stories. In this work, we propose a new benchmark, VinaBench, to address this challenge. Our benchmark annotates the underlying commonsense and discourse constraints in visual narrative samples, offering systematic scaffolds for learning the implicit strategies of visual storytelling. Based on the incorporated narrative constraints, we further propose novel metrics to closely evaluate the consistency of generated narrative images and the alignment of generations with the input textual narrative. Our results across three generative vision models demonstrate that learning with our VinaBench's knowledge constraints effectively improves the faithfulness and cohesion of generated visual narratives.
Poster
Bonan Li · Zicheng Zhang · Xingyi Yang · Xinchao Wang

[ ExHall D ]

Abstract
Generating dense multiview images from text prompts is crucial for creating high-fidelity 3D assets. Nevertheless, existing methods struggle with space-view correspondences, resulting in sparse and low-quality outputs. In this paper, we introduce CoSER, a novel consistent dense Multiview Text-to-Image Generator for Text-to-3D, achieving both efficiency and quality by meticulously learning neighbor-view coherence and further alleviating ambiguity through the swift traversal of all views. For achieving neighbor-view consistency, each viewpoint densely interacts with adjacent viewpoints to perceive the global spatial structure, and aggregates information along motion paths explicitly defined by physical principles to refine details. To further enhance cross-view consistency and alleviate content drift, CoSER rapidly scans all views in a spiral bidirectional manner to capture holistic information and then scores each point based on semantic material. Subsequently, we conduct weighted down-sampling along the spatial dimension based on scores, thereby facilitating prominent information fusion across all views with lightweight computation. Technically, the core module is built by integrating the attention mechanism with a selective state space model, exploiting the robust learning capabilities of the former and the low overhead of the latter. Extensive evaluation shows that CoSER is capable of producing dense, high-fidelity, content-consistent multiview images that can be flexibly integrated into …
Poster
Zeqi Gu · Yin Cui · Max Li · Fangyin Wei · Yunhao Ge · Jinwei Gu · Ming-Yu Liu · Abe Davis · Yifan Ding

[ ExHall D ]

Abstract
Designing 3D scenes is traditionally a challenging and laborious task that demands both artistic expertise and proficiency with complex software. Recent advances in text-to-3D generation have greatly simplified this process by letting users create scenes based on simple text descriptions. However, as these methods generally require extra training or in-context learning, their performance is often hindered by the limited availability of high-quality 3D data. In contrast, modern text-to-image models learned from web-scale images can generate scenes with diverse, reliable spatial layouts and consistent, visually appealing styles. Our key insight is that instead of learning directly from 3D scenes, we can leverage generated 2D images as an intermediary to guide 3D synthesis. In light of this, we introduce ArtiScene, a training-free automated pipeline for scene design that integrates the flexibility of free-form text-to-image generation with the diversity and reliability of 2D intermediary layouts. We generate the 2D intermediary image from a scene description, extract object shapes and appearances, create 3D models, and assemble them into the final scene with geometry, position, and pose extracted from the same image. Being generalizable to a wide range of scenes and styles, ArtiScene is shown to outperform state-of-the-art benchmarks by a large margin in layout …
Poster
Jiaxin Ge · Zora Zhiruo Wang · Xuhui Zhou · Yi-Hao Peng · Sanjay Subramanian · Qinyue Tan · Maarten Sap · Alane Suhr · Daniel Fried · Graham Neubig · Trevor Darrell

[ ExHall D ]

Abstract
Designing structured visuals such as presentation slides is essential for communicative needs, necessitating both content creation and visual planning skills to deliver insights. In this work, we study automated slide generation, where models produce slide presentations from natural language (NL) instructions of varying specificity. We first introduce our SLIDESBENCH benchmark, with 585 examples derived from slide decks across 10 domains. SLIDESBENCH supports evaluations that are both (i) reference-based to measure similarity to a target slide, and (ii) reference-free to measure the design quality of generated slides alone. We benchmark end-to-end image generation and program generation methods with varied models, and find that programmatic methods produce higher-quality slides in user-interactable formats. Built on the success of program generation, we create PRESENTER, an 8B LLAMA slide generation model trained on 7k (NL, slide code) pairs, and achieve results comparable to the strongest model GPT-4o. Beyond one-pass slide generation, we explore iterative design refinement with model self-critiques and effectively improve element layout in slides. We hope that our work could provide a basis for future work on generating structured visuals.
Poster
Xi Wang · Hongzhen Li · Heng Fang · YICHEN PENG · Haoran Xie · Xi Yang · Chuntao Li

[ ExHall D ]

Abstract
Image rendering from line drawings is vital in design, and image generation technologies reduce its cost; however, professional line drawings demand the preservation of complex details. Text prompts struggle with accuracy, and image translation struggles with consistency and fine-grained control. We present LineArt, a framework that transfers complex appearance onto detailed design drawings, facilitating design and artistic creation. It generates high-fidelity materials while preserving structural accuracy by simulating hierarchical visual cognition and integrating human artistic experience to guide the diffusion process. LineArt overcomes the limitations of current methods in terms of difficulty in fine-grained control and style degradation in design drawings. It requires no precise 3D modeling, physical property specs, or network training, making it more convenient for design tasks. LineArt consists of two stages: a multi-frequency lines fusion module to supplement the input design drawing with detailed structural information and a two-part painting process for Base Layer Shaping and Surface Layer Coloring. We also present a new design drawing dataset, ProLines, for evaluation. The experiments show that LineArt performs better in accuracy, realism, and material precision compared to state-of-the-art methods.
Poster
Siyuan Bian · Chenghao Xu · Yuliang Xiu · Artur Grigorev · Zhen Liu · Cewu Lu · Michael J. Black · Yao Feng

[ ExHall D ]

Abstract
We introduce ChatGarment, a novel approach that leverages large vision-language models (VLMs) to automate the estimation, generation, and editing of 3D garment sewing patterns from images or text descriptions. Unlike previous methods that often lack robustness and interactive editing capabilities, ChatGarment finetunes a VLM to produce GarmentCode, a JSON-based, language-friendly format for 2D sewing patterns, enabling both estimating and editing from images and text instructions. To optimize performance, we refine GarmentCode by expanding its support for more diverse garment types and simplifying its structure, making it more efficient for VLM finetuning. Additionally, we develop an automated data construction pipeline to generate a large-scale dataset of image-to-sewing-pattern and text-to-sewing-pattern pairs, empowering ChatGarment with strong generalization across various garment types. Extensive evaluations demonstrate ChatGarment’s ability to accurately reconstruct, generate, and edit garments from multimodal inputs, highlighting its potential to revolutionize workflows in fashion and gaming applications.
Poster
Haobin Zhong · Shuai He · Anlong Ming · Huadong Ma

[ ExHall D ]

Abstract
The Personalized Aesthetics Assessment (PAA) aims to accurately predict an individual's unique perception of aesthetics. With the surging demand for customization, PAA enables applications to generate personalized outcomes by aligning with individual aesthetic preferences. The prevailing PAA paradigm involves two stages: pre-training and fine-tuning, but it faces three inherent challenges: 1) The model is pre-trained using datasets of the Generic Aesthetics Assessment (GAA), but the collective preferences of GAA lead to conflicts in individualized aesthetic predictions. 2) The scope and stage of personalized surveys are related to both the user and the assessed object; however, the prevailing personalized surveys fail to adequately address assessed objects' characteristics. 3) During application usage, the cumulative multimodal feedback from an individual holds great value that should be considered for improving the PAA model but unfortunately attracts insufficient attention. To address the aforementioned challenges, we introduce a new PAA paradigm called PAA+, which is structured into three distinct stages: pre-training, fine-tuning, and domain-incremental learning. Furthermore, to better reflect individual differences, we employ a familiar and intuitive application, physique aesthetics assessment (PhysiqueAA), to validate the PAA+ paradigm. We propose a dataset called PhysiqueAA50K, consisting of over 50,000 fully annotated physique images. Furthermore, we develop a PhysiqueAA …
Poster
Zirun Guo · Tao Jin

[ ExHall D ]

Abstract
Diffusion customization methods have achieved impressive results with only a minimal number of user-provided images. However, existing approaches customize concepts collectively, whereas real-world applications often require sequential concept integration. This sequential nature can lead to catastrophic forgetting, where previously learned concepts are lost. In this paper, we investigate concept forgetting and concept confusion in the continual customization. To tackle these challenges, we present a comprehensive approach that combines shift embedding, concept-binding prompts and memory preservation regularization, supplemented by a priority queue which can adaptively update the importance and occurrence order of different concepts. Through comprehensive experiments, we demonstrate that our approach outperforms all the baseline methods consistently and significantly in both quantitative and qualitative analyses.
Poster
Qianlong Xiang · Miao Zhang · Yuzhang Shang · Jianlong Wu · Yan Yan · Liqiang Nie

[ ExHall D ]

Abstract
Diffusion models (DMs) have demonstrated exceptional generative capabilities across various domains, including image, video, and so on. A key factor contributing to their effectiveness is the high quantity and quality of data used during training. However, mainstream DMs now consume increasingly large amounts of data. For example, training a Stable Diffusion model requires billions of image-text pairs. This enormous data requirement poses significant challenges for training large DMs due to high data acquisition costs and storage expenses. To alleviate this data burden, we propose a novel scenario: using existing DMs as data sources to train new DMs with any architecture. We refer to this scenario as Data-Free Knowledge Distillation for Diffusion Models (DKDM), where the generative ability of DMs is transferred to new ones in a data-free manner. To tackle this challenge, we make two main contributions. First, we introduce a DKDM objective that enables the training of new DMs via distillation, without requiring access to the data. Second, we develop a dynamic iterative distillation method that efficiently extracts time-domain knowledge from existing DMs, enabling direct retrieval of training data without the need for a prolonged generative process. To the best of our knowledge, we are the first to explore …
Poster
Matan Rusanovsky · Shimon Malnick · Amir Jevnisek · Ohad Fried · Shai Avidan

[ ExHall D ]

Abstract
Diffusion models dominate the space of text-to-image generation, yet they may produce undesirable outputs, including explicit content or private data. To mitigate this, concept ablation techniques have been explored to limit the generation of certain concepts. In this paper, we reveal that the erased concept information persists in the model and that erased concept images can be generated using the right latent. Utilizing inversion methods, we show that there exist latent seeds capable of generating high quality images of erased concepts. Moreover, we show that these latents have likelihoods that overlap with those of images outside the erased concept. We extend this to demonstrate that for every image from the erased concept set, we can generate many seeds that generate the erased concept. Given the vast space of latents capable of generating ablated concept images, our results suggest that fully erasing concept information may be intractable, highlighting possible vulnerabilities in current concept ablation techniques.
Poster
Basim Azam · Naveed Akhtar

[ ExHall D ]

Abstract
Ethical issues around text-to-image (T2I) models demand a comprehensive control over the generative content. Existing techniques addressing these issues for responsible T2I models aim for the generated content to be fair and safe (non-violent/explicit). However, these methods remain bounded to handling the facets of responsibility concepts individually, while also lacking in interpretability. Moreover, they often require alteration to the original model, which compromises the model performance. In this work, we propose a unique technique to enable responsible T2I generation by simultaneously accounting for an extensive range of concepts for fair and safe content generation in a scalable manner. The key idea is to distill the target T2I pipeline with an external plug-and-play mechanism that learns an interpretable composite responsible space for the desired concepts, conditioned on the target T2I pipeline. We use knowledge distillation and concept whitening to enable this. At inference, the learned space is utilized to modulate the generative content. A typical T2I pipeline presents two plug-in points for our approach, namely: the text embedding space and the diffusion model latent space. We develop modules for both points and show the effectiveness of our approach with a range of strong results. Our code and models will be made …
Poster
Yimeng Zhang · Tiancheng Zhi · Jing Liu · Shen Sang · Liming Jiang · Qing Yan · Sijia Liu · Linjie Luo

[ ExHall D ]

Abstract
The ability to synthesize personalized group photos and specify the positions of each identity offers immense creative potential. While such imagery can be visually appealing, it presents significant challenges for existing technologies. A persistent issue is identity (ID) leakage, where injected facial features interfere with one another, resulting in low face resemblance, incorrect positioning, and visual artifacts. Existing methods suffer from limitations such as the reliance on segmentation models, increased runtime, or a high probability of ID leakage. To address these challenges, we propose ID-Patch, a novel method that provides robust association between identities and 2D positions. Our approach generates an ID patch and ID embeddings from the same facial features: the ID patch is positioned on the conditional image for precise spatial control, while the ID embeddings integrate with text embeddings to ensure high resemblance. Experimental results demonstrate that ID-Patch surpasses baseline methods across metrics such as face ID resemblance, ID-position association accuracy, and generation efficiency. Code and models will be made publicly available upon publication of the paper.
Poster
Hao Cheng · Erjia Xiao · Jiayan Yang · Jiahang Cao · Qiang Zhang · Jize Zhang · Kaidi Xu · Jindong Gu · Renjing Xu

[ ExHall D ]

Abstract
Current image generation models can effortlessly produce high-quality, highly realistic images, but this also increases the risk of misuse. In various Text-to-Image or Image-to-Image tasks, attackers can generate a series of images containing inappropriate content by simply editing the language modality input. Currently, to prevent this security threat, the various guard or defense methods that have been proposed also focus on defending the language modality. However, in practical applications, threats in the visual modality, particularly in tasks involving the editing of real-world images, pose greater security risks as they can easily infringe upon the rights of the image owner. Therefore, this paper uses a method named typographic attack to reveal that various image generation models also commonly face threats in the vision modality. Furthermore, we also evaluate the defense performance of various existing methods when facing threats in the vision modality and uncover their ineffectiveness. Finally, we propose the Vision Modal Threats in Image Generation Models (VMT-IGMs) dataset, which would serve as a baseline for evaluating the vision modality vulnerability of various image generation models. Warning: This paper includes explicit sexual content, discussions of pornography, racial terminology, and other content that may cause discomfort or distress. Potentially disturbing content has been blocked …
Poster
Xuanyu Zhang · Zecheng Tang · Zhipei Xu · Runyi Li · Youmin Xu · Bin Chen · Feng Gao · Jian Zhang

[ ExHall D ]

Abstract
With the rapid growth of generative AI and its widespread application in image editing, new risks have emerged regarding the authenticity and integrity of digital content. Existing versatile watermarking approaches suffer from trade-offs between tamper localization precision and visual quality. Constrained by the limited flexibility of previous frameworks, their localized watermark must remain fixed across all images. Under AIGC editing, their copyright extraction accuracy is also unsatisfactory. To address these challenges, we propose OmniGuard, a novel augmented versatile watermarking approach that integrates proactive embedding with passive, blind extraction for robust copyright protection and tamper localization. OmniGuard employs a hybrid forensic framework that enables flexible localization watermark selection and introduces a degradation-aware tamper extraction network for precise localization under challenging conditions. Additionally, a lightweight AIGC-editing simulation layer is designed to enhance robustness across global and local editing. Extensive experiments show that OmniGuard achieves superior fidelity, robustness, and flexibility. Compared to the recent state-of-the-art approach EditGuard, our method outperforms it by 4.25 dB in PSNR of the container image, 20.7% in F1-Score under noisy conditions, and 14.8% in average bit accuracy.
Poster
Yiren Song · Pei Yang · Hai Ci · Mike Zheng Shou

[ ExHall D ]

Abstract
Recently, zero-shot methods like InstantID have revolutionized identity-preserving generation. Unlike multi-image finetuning approaches such as DreamBooth, these zero-shot methods leverage powerful facial encoders to extract identity information from a single portrait photo, enabling efficient identity-preserving generation through a single inference pass. However, this convenience introduces new threats to the facial identity protection. This paper aims to safeguard portrait photos from unauthorized encoder-based customization. We introduce IDProtector, an adversarial noise encoder that applies imperceptible adversarial noise to portrait photos in a single forward pass. Our approach offers universal protection for portraits against multiple state-of-the-art encoder-based methods, including InstantID, IP-Adapter, and PhotoMaker, while ensuring robustness to common image transformations such as JPEG compression, resizing, and affine transformations. Experiments across diverse portrait datasets and generative models reveal that IDProtector generalizes effectively to unseen data and even closed-source proprietary models.
Poster
Mischa Dombrowski · Weitong Zhang · Hadrien Reynaud · Sarah Cechnicka · Bernhard Kainz

[ ExHall D ]

Abstract
Generative methods have reached a level of quality that is almost indistinguishable from real data. However, while individual samples may appear unique, generative models often exhibit limitations in covering the full data distribution. Unlike quality issues, diversity problems within generative models are not easily detected by simply observing single images or generated datasets, which means we need a specific measure to assess the diversity of these models. In this paper, we draw attention to the current lack of diversity in generative models and the inability of common metrics to measure this. We achieve this by framing diversity as an image retrieval problem, where we measure how many real images can be retrieved using synthetic data as queries. This yields the Image Retrieval Score (IRS), an interpretable, hyperparameter-free metric that quantifies the diversity of a generative model's output. IRS requires only a subset of synthetic samples and provides a statistical measure of confidence. Our experiments indicate that current feature extractors commonly used in generative model assessment are inadequate for evaluating diversity effectively. Consequently, we perform an extensive search for the best feature extractors to assess diversity. Evaluation reveals that current diffusion models converge to limited subsets of the real distribution, with no current state-of-the-art …
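One way to picture a retrieval-based diversity measure of this kind (the published IRS formulation and its confidence estimate may differ) is to ask what fraction of the real set is ever returned as the nearest neighbour of a synthetic query, given features from some pretrained extractor:

```python
import numpy as np

def retrieval_coverage(real_feats: np.ndarray, synth_feats: np.ndarray) -> float:
    """Use each synthetic feature as a query, retrieve its nearest real
    neighbour by cosine similarity, and report the fraction of distinct real
    images retrieved at least once (a proxy for distribution coverage)."""
    r = real_feats / np.linalg.norm(real_feats, axis=1, keepdims=True)
    s = synth_feats / np.linalg.norm(synth_feats, axis=1, keepdims=True)
    nearest = (s @ r.T).argmax(axis=1)        # index of the retrieved real image
    return np.unique(nearest).size / real_feats.shape[0]
```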
Poster
Tai Nguyen · Aref Azizpour · Matthew Stamm

[ ExHall D ]

Abstract
The emergence of advanced AI-based tools to generate realistic images poses significant challenges for forensic detection and source attribution, especially as new generative techniques appear rapidly. Traditional methods often fail to generalize to unseen generators due to reliance on features specific to known sources during training. To address this problem, we propose a novel approach that explicitly models forensic microstructures—subtle, pixel-level patterns unique to the image creation process. Using only real images in a self-supervised manner, we learn a set of diverse predictive filters to extract residuals that capture different aspects of these microstructures. By jointly modeling these residuals across multiple scales, we obtain a compact model whose parameters constitute a unique forensic self-description for each image. This self-description enables us to perform zero-shot detection of synthetic images, open-set source attribution of images, and clustering based on source without prior knowledge. Extensive experiments demonstrate that our method achieves superior accuracy and adaptability compared to competing techniques, advancing the state of the art in synthetic media forensics.
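A toy version of the residual-extraction idea, assuming simple masked convolutional predictors trained self-supervised on real images; the authors' multi-scale joint model and self-description are not reproduced here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualExtractor(nn.Module):
    """A bank of learned predictive filters estimates each pixel from its
    neighbourhood; the prediction residuals expose pixel-level microstructure."""

    def __init__(self, num_filters: int = 8, kernel_size: int = 5):
        super().__init__()
        self.weight = nn.Parameter(
            torch.randn(num_filters, 1, kernel_size, kernel_size) * 0.01)
        mask = torch.ones(1, 1, kernel_size, kernel_size)
        mask[..., kernel_size // 2, kernel_size // 2] = 0.0   # exclude the centre pixel
        self.register_buffer("mask", mask)
        self.pad = kernel_size // 2

    def forward(self, gray: torch.Tensor) -> torch.Tensor:
        # gray: (B, 1, H, W); output: one residual map per filter.
        pred = F.conv2d(gray, self.weight * self.mask, padding=self.pad)
        return gray - pred
```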
Poster
Jinwoo Kim · Sangmin Han · Jinho Jeong · Jiwoo Choi · Dongyoung Kim · Seon Joo Kim

[ ExHall D ]

Abstract
Object compositing, the task of placing and harmonizing objects in images of diverse visual scenes, has become an important task in computer vision with the rise of generative models.However, existing datasets lack the diversity and scale required to comprehensively explore real-world scenarios comprehensively. We introduce ORIDa (Object-centric Real-world Image Composition Dataset), a large-scale, real-captured dataset containing over 30,000 images featuring 200 unique objects, each of which is presented across varied positions and scenes. ORIDa has two types of data: factual-counterfactual sets and factual-only scenes. The factual-counterfactual sets consist of four factual images showing an object in different positions within a scene and a single counterfactual (or background) image of the scene without the object, resulting in five images per scene. The factual-only scenes include a single image containing an object in a specific context, expanding the variety of environments. To our knowledge, ORIDa is the first publicly available dataset with its scale and complexity for real-world image composition. Extensive analysis and experiments highlight the value of ORIDa as a resource for advancing further research in object compositing.
Poster
Dhananjaya Jayasundara · Sudarshan Rajagopalan · Yasiru Ranasinghe · Trac Tran · Vishal M. Patel

[ ExHall D ]

Abstract
Implicit Neural Representations (INRs) are increasingly recognized as a versatile data modality for representing discretized signals, offering benefits such as infinite query resolution and reduced storage requirements. Existing signal compression approaches for INRs typically employ one of two strategies: 1. direct quantization with entropy coding of the trained INR; 2. deriving a latent code on top of the INR through a learnable transformation. Thus, their performance is heavily dependent on the quantization and entropy coding schemes employed. In this paper, we introduce SINR, an innovative compression algorithm that leverages the patterns in the vector spaces formed by weights of INRs. We compress these vector spaces using a high-dimensional sparse code within a dictionary. Further analysis reveals that the atoms of the dictionary used to generate the sparse code do not need to be learned or transmitted to successfully recover the INR weights. We demonstrate that the proposed approach can be integrated with any existing INR-based signal compression technique. Our results indicate that SINR achieves substantial reductions in storage requirements for INRs across various configurations, outperforming conventional INR-based compression baselines. Furthermore, SINR maintains high-quality decoding across diverse data modalities, including images, occupancy fields, and Neural Radiance Fields.
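The central idea, coding INR weight vectors as sparse combinations over a dictionary that never needs to be transmitted, can be sketched with a seeded random Gaussian dictionary and greedy matching pursuit. This is an illustrative assumption about the construction, not the paper's exact algorithm.

```python
# Minimal sketch: code an INR weight vector as a sparse combination of atoms from a
# dictionary that both encoder and decoder can regenerate from a shared seed
# (illustrative assumption; SINR's actual dictionary and coding scheme may differ).
import numpy as np

def sparse_code(w: np.ndarray, dictionary: np.ndarray, k: int) -> np.ndarray:
    """Greedy matching pursuit: approximate w with k atoms of the dictionary."""
    code = np.zeros(dictionary.shape[1])
    residual = w.copy()
    for _ in range(k):
        j = np.abs(dictionary.T @ residual).argmax()   # most correlated atom
        coeff = dictionary[:, j] @ residual
        code[j] += coeff
        residual -= coeff * dictionary[:, j]
    return code

dim, n_atoms, k = 256, 4096, 16
rng = np.random.default_rng(42)                        # shared seed: dictionary is reproducible
D = rng.normal(size=(dim, n_atoms))
D /= np.linalg.norm(D, axis=0, keepdims=True)

w = rng.normal(size=dim)                               # a flattened INR weight vector
z = sparse_code(w, D, k)                               # only k (index, value) pairs need storing
w_hat = D @ z
print("relative reconstruction error:", np.linalg.norm(w - w_hat) / np.linalg.norm(w))
```

Because both sides can regenerate the dictionary from the shared seed, only the k nonzero indices and coefficients need to be stored or transmitted.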
Poster
Tiago Novello · Diana Aldana Moreno · André Araujo · Luiz Velho

[ ExHall D ]

Abstract
Sinusoidal neural networks have been shown effective as implicit neural representations (INRs) of low-dimensional signals, due to their smoothness and high representation capacity. However, initializing and training them remain empirical tasks that lack a deeper understanding to guide the learning process. To fill this gap, our work introduces a theoretical framework that explains the capacity property of sinusoidal networks and offers robust control mechanisms for initialization and training. Our analysis is based on a novel amplitude-phase expansion of the sinusoidal multilayer perceptron, showing how its layer compositions produce a large number of new frequencies expressed as integer combinations of the input frequencies. This relationship can be directly used to initialize the input neurons, as a form of spectral sampling, and to bound the network's spectrum while training. Our method, referred to as TUNER (TUNing sinusoidal nEtwoRks), greatly improves the stability and convergence of sinusoidal INR training, leading to detailed reconstructions, while preventing overfitting.
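As a concrete picture of initializing input neurons as a form of spectral sampling, here is a minimal sketch of a first sinusoidal layer whose weights are set to sampled integer frequencies; the frequency range and sampling distribution are illustrative assumptions rather than the exact TUNER procedure.

```python
# Minimal sketch of spectral-sampling-style initialization for the first layer of a
# sinusoidal MLP (assumed form: y = sin(W x + b)); the frequency range and sampling
# scheme here are illustrative, not the exact TUNER procedure.
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, max_freq: int = 32):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        with torch.no_grad():
            # Input neurons as sampled integer frequencies: later layers produce
            # integer combinations of these input frequencies.
            freqs = torch.randint(-max_freq, max_freq + 1, (out_dim, in_dim)).float()
            self.linear.weight.copy_(freqs)
            self.linear.bias.uniform_(-torch.pi, torch.pi)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sin(self.linear(x))

layer = SineLayer(in_dim=2, out_dim=256)       # e.g., an image INR with 2-D coordinates
coords = torch.rand(1024, 2) * 2 - 1           # coordinates in [-1, 1]
features = layer(coords)                       # (1024, 256)
```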
Poster
Yuki Kawana · Shintaro Shiba · Quan Kong · Norimasa Kobori

[ ExHall D ]

Abstract
We propose a novel 3D gaze estimation approach that learns spatial relationships between the subject, scene, and objects and outputs 3D gaze direction. Our method targets unconstrained settings, including cases where close-up views of the subject’s eyes are unavailable, such as when the subject is distant or facing away. Previous methods rely on either 2D appearance alone or limited spatial cues by using depth maps in non-learnable post-processing. Estimating 3D gaze direction from 2D observations in these settings is challenging; besides variations in subject, scene, and gaze direction, different camera poses produce various 2D appearances and 3D gaze directions, even when targeting the same 3D scene. To address this issue, we propose gaze-aware 3D context encoding. Our method represents the subject and scene as 3D poses and object positions as 3D context, learning spatial relationships in 3D space. Inspired by human vision, we geometrically align this context in a subject-centric space, significantly reducing the spatial complexity. Furthermore, we propose D3 (distance-direction-decomposed) positional encoding to better capture the spatial relationship between 3D context and gaze direction in direction and distance space. Experiments show substantial improvement, reducing mean angle error of leading baselines by 13\%–37\% on benchmark datasets in single-view settings.
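The distance-direction decomposition can be illustrated by a small routine that splits each subject-centric offset into a unit direction and a scalar distance and encodes the two separately; the sinusoidal frequencies below are an illustrative choice, not the paper's exact D3 formulation.

```python
# Minimal sketch of a distance-direction decomposed positional encoding: a relative
# 3-D offset is split into a unit direction and a scalar distance, each encoded
# separately (the sinusoidal frequencies used here are an illustrative choice).
import torch

def d3_encode(offsets: torch.Tensor, num_freqs: int = 4) -> torch.Tensor:
    """offsets: (N, 3) vectors from the subject to scene objects."""
    dist = offsets.norm(dim=-1, keepdim=True)                     # (N, 1)
    direction = offsets / dist.clamp_min(1e-8)                    # (N, 3) unit vectors
    freqs = 2.0 ** torch.arange(num_freqs)                        # (F,)
    dist_enc = torch.cat([torch.sin(dist * freqs), torch.cos(dist * freqs)], dim=-1)
    dir_enc = torch.cat([torch.sin(direction[..., None] * freqs).flatten(1),
                         torch.cos(direction[..., None] * freqs).flatten(1)], dim=-1)
    return torch.cat([dir_enc, dist_enc], dim=-1)

offsets = torch.randn(16, 3)        # subject-centric positions of 16 scene objects
enc = d3_encode(offsets)            # (16, 3*2*4 + 2*4) = (16, 32)
```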
Poster
Yunfeng Xiao · Xiaowei Bai · Baojun Chen · Hao Su · Hao He · Liang Xie · Erwei Yin

[ ExHall D ]

Abstract
3D gaze estimation is a challenging task due to two main issues. First, existing methods focus on analyzing dense features (e.g., large areas of pixels), which are sensitive to local noise (e.g., light spots, blurs) and increase computational complexity. Second, an eyeball model can exhibit multiple gaze directions, and the coupled representation between gazes and models increases learning difficulty. To address these issues, we propose \textbf{De\textsuperscript{2}Gaze}, a lightweight and accurate model-aware 3D gaze estimation method. In De\textsuperscript{2}Gaze, we introduce two improvements for deformable and decoupled representation learning. Specifically, first, we propose a deformable sparse attention mechanism that can adapt sparse sampling points to attention areas to avoid local noise influences. Second, a spatial decoupling network with a double-branch decoding architecture is proposed to disentangle the invariant (e.g., eyeball radius, position) and variable (e.g., gaze, pupil, iris) features from latent space. Compared to existing methods, De\textsuperscript{2}Gaze requires only sparse features, and achieves faster convergence speed, lower computational complexity, and higher accuracy in 3D gaze estimation. Qualitative and quantitative experiments show that De\textsuperscript{2}Gaze achieves state-of-the-art accuracy and semantic segmentation for 3D gaze estimation on the TEyeD dataset.
Poster
Tianyun Zhong · Chao Liang · Jianwen Jiang · Gaojie Lin · Jiaqi Yang · Zhou Zhao

[ ExHall D ]

Abstract
Diffusion-based audio-driven talking avatar methods have recently gained attention for their high-fidelity, vivid, and expressive results. However, their slow inference speed limits practical applications. Despite the development of various distillation techniques for diffusion models, we found that naive diffusion distillation methods do not yield satisfactory results. Distilled models exhibit reduced robustness with open-set input images and a decreased correlation between audio and video compared to teacher models, undermining the advantages of diffusion models. To address this, we propose FADA (Fast Diffusion Avatar Synthesis with Mixed-Supervised Multi-CFG Distillation). We first design a mixed-supervised loss to leverage data of varying quality and enhance the overall model capability as well as robustness. Additionally, we propose a multi-CFG distillation with learnable tokens to utilize the correlation between audio and reference image conditions, reducing the threefold inference runs caused by multi-CFG with acceptable quality degradation. Extensive experiments across multiple datasets show that FADA generates vivid videos comparable to recent diffusion model-based methods while achieving an NFE speedup of 4.17-12.5 times.
Poster
Juncheng Wang · Chao Xu · Cheng Yu · Lei Shang · Zhe Hu · Shujun Wang · Liefeng Bo

[ ExHall D ]

Abstract
Video-to-audio generation is essential for synthesizing realistic audio tracks that synchronize effectively with silent videos. Following the perspective of extracting essential signals from videos that can precisely control mature text-to-audio generative diffusion models, this paper shows how to balance the representation of mel-spectrograms in terms of completeness and complexity through a new approach called Mel Quantization-Continuum Decomposition (Mel-QCD). We decompose the mel-spectrogram into three distinct types of signals; by applying quantization or continuity to each, we can effectively predict them from video with a devised video-to-all (V2X) predictor. Then, the predicted signals are recomposed and fed into a ControlNet, along with a textual inversion design, to control the audio generation process. Our proposed Mel-QCD method demonstrates state-of-the-art performance across eight metrics, evaluating dimensions such as quality, synchronization, and semantic consistency.
Poster
Inho Kim · YOUNGKIL SONG · Jicheol Park · Won Hwa Kim · Suha Kwak

[ ExHall D ]

Abstract
Sound source localization (SSL) is the task of locating the source of sound within an image. Due to the lack of localization labels, the de facto standard in SSL has been to represent an image and audio as a single embedding vector each, and use them to learn SSL via contrastive learning. To this end, previous work samples one of the local image features as the image embedding and aggregates all local audio features to obtain the audio embedding, which is far from optimal due to the presence of noise and background irrelevant to the actual target in the input. We present a novel SSL method that addresses this chronic issue by joint slot attention on image and audio. To be specific, two slots competitively attend to image and audio features to decompose them into target and off-target representations, and only the target representations of image and audio are used for contrastive learning. Also, we introduce cross-modal attention matching to further align local features of image and audio. Our method achieves the best performance in almost all settings on three public benchmarks for SSL, and substantially outperforms all the prior work in cross-modal retrieval.
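The competition between a target slot and an off-target slot can be sketched as a single generic slot-attention step, where the softmax runs over the slot axis so the two slots compete for each local feature; the paper's full iterative update and cross-modal matching are omitted.

```python
# Minimal sketch of the "two slots competing over features" idea (generic slot
# attention step; not the paper's full iterative, cross-modal procedure).
import torch
import torch.nn as nn
import torch.nn.functional as F

def slot_attention_step(slots, feats, to_q, to_k, to_v):
    """slots: (B, 2, D) target/off-target slots, feats: (B, N, D) local features."""
    q = to_q(slots)                               # (B, 2, D)
    k, v = to_k(feats), to_v(feats)               # (B, N, D)
    logits = torch.einsum("bsd,bnd->bsn", q, k) / q.shape[-1] ** 0.5
    # Softmax over the slot axis: the two slots compete for each local feature.
    attn = F.softmax(logits, dim=1)
    attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-8)  # normalize per slot
    return torch.einsum("bsn,bnd->bsd", attn, v)  # updated slot representations

B, N, D = 4, 196, 256
to_q, to_k, to_v = nn.Linear(D, D), nn.Linear(D, D), nn.Linear(D, D)
slots = torch.randn(B, 2, D)        # slot 0: target, slot 1: off-target (by convention)
feats = torch.randn(B, N, D)        # local image (or audio) features
updated = slot_attention_step(slots, feats, to_q, to_k, to_v)
```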
Poster
Chen Liu · Liying Yang · Peike Li · Dadong Wang · Lincheng Li · Xin Yu

[ ExHall D ]

Abstract
Sound-guided object segmentation has drawn considerable attention for its potential to enhance multimodal perception. Previous methods primarily focus on developing advanced architectures to facilitate effective audio-visual interactions, without fully addressing the inherent challenges posed by the nature of audio, i.e., (1) feature confusion due to the overlapping nature of audio signals, and (2) audio-visual matching difficulty from the varied sounds produced by the same object. To address these challenges, we propose Dynamic Derivation and Elimination (DDESeg): a novel audio-visual segmentation framework. Specifically, to mitigate feature confusion, DDESeg reconstructs the semantic content of the mixed audio signal by enriching the distinct semantic information of each individual source, deriving representations that preserve the unique characteristics of each sound. To reduce the matching difficulty, we introduce a discriminative feature learning module, which enhances the semantic distinctiveness of generated audio representations. Considering that not all derived audio representations directly correspond to visual features (e.g., off-screen sounds), we propose a dynamic elimination module to filter out non-matching elements. This module facilitates targeted interaction between sounding regions and relevant audio semantics. By scoring the interacted features, we identify and filter out irrelevant audio information, ensuring accurate audio-visual alignment. Comprehensive experiments demonstrate that our framework achieves superior performance in …
Poster
Eitan Shaar · Ariel Shaulov · Gal Chechik · Lior Wolf

[ ExHall D ]

Abstract
In the domain of audio-visual event perception, which focuses on the temporal localization and classification of events across distinct modalities (audio and visual), existing approaches are constrained by the vocabulary available in their training data. This limitation significantly impedes their capacity to generalize to novel, unseen event categories. Furthermore, the annotation process for this task is labor-intensive, requiring extensive manual labeling across modalities and temporal segments, limiting the scalability of current methods. Current state-of-the-art models ignore the shifts in event distributions over time, reducing their ability to adjust to changing video dynamics. Additionally, previous methods rely on late fusion to combine audio and visual information. While straightforward, this approach results in a significant loss of multimodal interactions. To address these challenges, we propose Audio-Visual Adaptive Video Analysis (AV2A), a model-agnostic approach that requires no further training and integrates an early fusion technique to retain richer multimodal interactions. AV2A also includes a within-video label shift algorithm, leveraging input video data and predictions from prior frames to dynamically adjust event distributions for subsequent frames. Moreover, we present the first training-free, open-vocabulary baseline for audio-visual event perception, demonstrating that AV2A achieves substantial improvements over naive training-free baselines. We demonstrate the effectiveness of AV2A
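The within-video label-shift idea can be sketched as a running class prior estimated from earlier frames that re-weights the current frame's predictions; the EMA form and re-weighting rule below are illustrative assumptions, not the exact AV2A algorithm.

```python
# Minimal sketch of a within-video label-shift adjustment: keep a running estimate of
# the event-class prior from earlier frames and use it to re-weight the current frame's
# predictions (the EMA form and re-weighting rule are illustrative assumptions).
import numpy as np

def adjust_predictions(frame_probs, running_prior, momentum=0.9):
    """frame_probs, running_prior: (num_classes,) probability vectors."""
    adjusted = frame_probs * running_prior
    adjusted /= adjusted.sum()
    # Update the prior with the (adjusted) current-frame prediction.
    new_prior = momentum * running_prior + (1 - momentum) * adjusted
    return adjusted, new_prior / new_prior.sum()

num_classes = 10
prior = np.full(num_classes, 1.0 / num_classes)
for probs in np.random.dirichlet(np.ones(num_classes), size=30):  # per-frame predictions
    probs, prior = adjust_predictions(probs, prior)
```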
Poster
Zitang Zhou · Ke Mei · Yu Lu · Tianyi Wang · Fengyun Rao

[ ExHall D ]

Abstract
This paper introduces HarmonySet, a comprehensive dataset designed to advance video-music understanding. HarmonySet consists of 48,328 diverse video-music pairs, annotated with detailed information on rhythmic synchronization, emotional alignment, thematic coherence, and cultural relevance. We propose a multi-step human-machine collaborative framework for efficient annotation, combining human insights with machine-generated descriptions to identify key transitions and assess alignment across multiple dimensions. Additionally, we introduce a novel evaluation framework with tasks and metrics to assess the multi-dimensional alignment of video and music, including rhythm, emotion, theme, and cultural context. Our extensive experiments demonstrate that HarmonySet, along with the proposed evaluation framework, significantly improves the ability of multimodal models to capture and analyze the intricate relationships between video and music. Project page: https://harmonyset.github.io/.
Poster
Sanchayan Santra · Vishal Chudasama · Pankaj Wasnik · Vineeth Balasubramanian

[ ExHall D ]

Abstract
Precise Event Spotting (PES) aims to identify events and their class from long, untrimmed videos, particularly in sports. The main objective of PES is to detect the event at the exact moment it occurs. Existing methods mainly rely on features from a large pre-trained network, which may not be ideal for the task. Furthermore, these methods overlook the issue of imbalanced event class distribution present in the data, negatively impacting performance in challenging scenarios. This paper demonstrates that an appropriately designed network, trained end-to-end, can outperform state-of-the-art (SOTA) methods. Particularly, we propose a network with a convolutional spatio-temporal feature extractor enhanced with our proposed Adaptive Spatio-Temporal Refinement Module (ASTRM) and a long-range temporal module. The ASTRM enhances the features with spatio-temporal information, while the long-range temporal module extracts global context from the data by modeling long-range dependencies. To address the class imbalance issue, we introduce the Soft Instance Contrastive (SoftIC) loss that promotes feature compactness and class separation. Extensive experiments show that the proposed method is efficient and outperforms the SOTA methods, specifically in more challenging settings.
Poster
Bingjie Gao · Xinyu Gao · Xiaoxue Wu · yujie zhou · Yu Qiao · Li Niu · Xinyuan Chen · Yaohui Wang

[ ExHall D ]

Abstract
Text-to-video (T2V) generative models trained on large-scale datasets have made remarkable advancements. Well-performing prompts are often model-specific and coherent with the distribution of the training prompts of T2V models. Previous studies utilize large language models to directly optimize user-provided prompts to be consistent with the distribution of training prompts, lacking refined guidance that considers both prompt vocabulary and specific sentence format. In this paper, we introduce a retrieval-augmented prompt optimization framework for T2V generation. The user-provided prompts are augmented with relevant and diverse modifiers retrieved from a built relation graph, and then refactored into the format of training prompts through a fine-tuned language model. To balance the retrieval speed and vocabulary diversity of the relation graph, we propose a two-branch optimization mechanism that determines whether the better prompt comes from our method or directly from a large language model. Extensive experiments demonstrate that our proposed method can effectively enhance both the static and dynamic dimensions of generated videos, demonstrating the significance of prompt optimization for simple user-provided prompts.
Poster
Xin Yan · Yuxuan Cai · Qiuyue Wang · Yuan Zhou · Wenhao Huang · Huan Yang

[ ExHall D ]

Abstract
We introduce Presto, a novel video diffusion model designed to generate 15-second videos with long-range coherence and rich content. Extending video generation to maintain scenario diversity over long durations presents significant challenges. To address this, we propose a Segmented Cross-Attention (SCA) strategy, which splits hidden states into segments along the temporal dimension, allowing each segment to cross-attend to a corresponding sub-caption. SCA requires no additional parameters, enabling seamless incorporation into current DiT-based architectures. To facilitate high-quality long video generation, we build the LongTake-HD dataset, consisting of 261k content-rich videos with scenario coherence, annotated with an overall video caption and five progressive sub-captions. Experiments show that our Presto achieves 78.5% on the VBench Semantic Score and 100% on the Dynamic Degree, outperforming existing state-of-the-art video generation methods. This demonstrates that our proposed Presto significantly enhances the content richness, maintains long-range coherence, and captures intricate textual details. All code and model weights will be made publicly available.
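A Segmented Cross-Attention-style operation can be sketched as splitting the latent sequence along the temporal axis and letting each segment cross-attend only to its own sub-caption; the shapes and the shared attention module below are assumptions for illustration.

```python
# Minimal sketch of a Segmented Cross-Attention (SCA)-style operation: the latent
# sequence is split into temporal segments and each segment cross-attends only to its
# own sub-caption embedding (shapes and the shared attention module are assumptions).
import torch
import torch.nn as nn

def segmented_cross_attention(hidden, sub_captions, attn: nn.MultiheadAttention):
    """
    hidden:       (B, T, D) video latents ordered along the temporal axis
    sub_captions: list of S tensors, each (B, L_s, D) -- one text embedding per segment
    """
    S = len(sub_captions)
    segments = hidden.chunk(S, dim=1)                  # split along the temporal dimension
    outputs = []
    for seg, cap in zip(segments, sub_captions):
        out, _ = attn(query=seg, key=cap, value=cap)   # segment attends to its sub-caption
        outputs.append(out)
    return torch.cat(outputs, dim=1)                   # reassemble the full sequence

B, T, D, S = 2, 120, 512, 5
attn = nn.MultiheadAttention(embed_dim=D, num_heads=8, batch_first=True)
hidden = torch.randn(B, T, D)
caps = [torch.randn(B, 32, D) for _ in range(S)]       # five progressive sub-captions
out = segmented_cross_attention(hidden, caps, attn)
```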
Poster
Zhengrong Yue · Shaobin Zhuang · Kunchang Li · Yanbo Ding · Yali Wang

[ ExHall D ]

Abstract
Despite the recent advancement in video stylization, most existing methods struggle to render any video with complex transitions based on an open style description in the user query. To fill this gap, we introduce a generic multi-agent system for video stylization, V-Stylist, built on a novel collaboration and reflection paradigm of multi-modal large language models. Specifically, our V-Stylist is a systematic workflow with three key roles: (1) The Video Parser decomposes the input video into a number of shots and generates text prompts of key shot content. Via a concise video-to-shot prompting paradigm, it allows our V-Stylist to effectively handle videos with complex transitions. (2) The Style Parser identifies the style in the user query and progressively searches for the matched style model from a style tree. Via a robust tree-of-thought searching paradigm, it allows our V-Stylist to precisely specify vague style preferences in the open user query. (3) The Style Artist leverages the matched model to render all the video shots into the required style. Via a novel multi-round self-reflection paradigm, it allows our V-Stylist to adaptively adjust detail control according to the style requirement. With such a distinct design of mimicking human professionals, our V-Stylist achieves a major breakthrough over the primary challenges for effective and automatic video stylization. Moreover, we further construct a new …
Poster
Huiyu Duan · Qiang Hu · Wang Jiarui · Liu Yang · Zitong Xu · Lu Liu · Xiongkuo Min · Chunlei Cai · Tianxiao Ye · Xiaoyun Zhang · Guangtao Zhai

[ ExHall D ]

Abstract
The rapid growth of user-generated content (UGC) videos has produced an urgent need for effective video quality assessment (VQA) algorithms to monitor video quality and guide optimization and recommendation procedures. However, current VQA models generally only give an overall rating for a UGC video, which lacks fine-grained labels for serving video processing and recommendation applications. To address the challenges and promote the development of UGC videos, we establish the first large-scale Fine-grained Video quality assessment Database, termed FineVD, which comprises 6104 UGC videos with fine-grained quality scores and descriptions across multiple dimensions. Based on this database, we propose a Fine-grained Video Quality assessment (FineVQ) model to learn the fine-grained quality of UGC videos, with the capabilities of quality rating, quality scoring, and quality attribution. Extensive experimental results demonstrate that our proposed FineVQ can produce fine-grained video-quality results and achieve state-of-the-art performance on FineVD and other commonly used UGC-VQA datasets. Both FineVD and FineVQ will be released upon publication.
Poster
Kevin Qinghong Lin · Mike Zheng Shou

[ ExHall D ]

Abstract
Human daily activities can be concisely narrated as sequences of routine events (e.g., turning off an alarm) in video streams, forming an event vocabulary. Motivated by this, we introduce **VLog**, a novel video understanding framework that defines video narrations as a vocabulary, going beyond the typical subword vocabularies in existing generative video-language models. Built on the lightweight language model GPT-2, **VLog** features three key innovations:
1. **A Generative Retrieval Model**, marrying the language model's complex reasoning capabilities with contrastive retrieval's efficient similarity search.
2. **A Hierarchical Vocabulary**, derived from large-scale video narrations using our narration pair encoding algorithm, enabling efficient indexing of specific events (e.g., cutting a tomato) by identifying broader scenarios (e.g., kitchen) with expressive postfixes (e.g., by the left hand).
3. **A Vocabulary Update Strategy**, leveraging generative models to extend the vocabulary for novel events encountered during inference.

To validate our approach, we introduce **VidCab-Eval**, a development set requiring concise narrations with reasoning relationships (e.g., before and after). Experiments on **EgoSchema**, **COIN**, and **HiREST** further demonstrate the effectiveness of **VLog**, highlighting its ability to generate concise, contextually accurate, and efficient narrations. This offers a novel perspective on video understanding.
Poster
Zicheng Zhang · Ziheng Jia · Haoning Wu · Chunyi Li · Zijian Chen · Yingjie Zhou · Wei Sun · Xiaohong Liu · Xiongkuo Min · Weisi Lin · Guangtao Zhai

[ ExHall D ]

Abstract
With the rising interest in research on Large Multi-modal Models (LMMs) for video understanding, many studies have emphasized general video comprehension capabilities, neglecting the systematic exploration into video quality understanding. To address this oversight, we introduce Q-Bench-Video in this paper, a new benchmark specifically designed to evaluate LMMs' proficiency in discerning video quality. a) To ensure video source diversity, Q-Bench-Video encompasses videos from natural scenes, AI-generated Content (AIGC), and Computer Graphics (CG). b) Building on the traditional multiple-choice questions format with the Yes-or-No and What-How categories, we include Open-ended questions to better evaluate complex scenarios. Additionally, we incorporate the video pair quality comparison question to enhance comprehensiveness. c) Beyond the traditional Technical, Aesthetic, and Temporal distortions, we have expanded our evaluation aspects to include the dimension of AIGC distortions, which addresses the increasing demand for video generation. Finally, we collect a total of 2,378 question-answer pairs and test them on 12 open-source & 5 proprietary LMMs. Our findings indicate that while LMMs have a foundational understanding of video quality, their performance remains incomplete and imprecise, with a notable discrepancy compared to human performance. Through Q-Bench-Video, we seek to catalyze community interest, stimulate further research, and unlock the untapped potential of …
Poster
Wei Li · Bing Hu · Rui Shao · Leyang Shen · Liqiang Nie

[ ExHall D ]

Abstract
First-person video assistants are highly anticipated to enhance our daily life through online video dialogue. However, existing online video assistants often sacrifice assistant efficacy for real-time efficiency by processing low-frame-rate videos with coarse-grained visual features. To overcome the trade-off between efficacy and efficiency, we propose "**F**ast & **S**low Video-Language Thinker" as on**LI**ne vide**O** assista**N**t, **LION-FS**, achieving real-time, proactive, temporally accurate, and contextually precise responses. LION-FS adopts a two-stage optimization strategy: **1) Fast Path: Routing-Based Response Determination** evaluates frame-by-frame whether an immediate response is necessary. To enhance response determination accuracy and handle higher frame-rate inputs efficiently, we employ Token Aggregation Routing to dynamically fuse spatiotemporal features without increasing token numbers, while utilizing Token Dropping Routing to eliminate redundant features; and **2) Slow Path: Multi-granularity Keyframe Augmentation** optimizes keyframes during response generation. To provide comprehensive and detailed responses beyond atomic actions constrained by training data, fine-grained spatial features and human-environment interaction features are extracted through multi-granular pooling. They are further integrated into a meticulously designed multimodal Thinking Template to guide more precise response generation. Comprehensive evaluations on online video tasks demonstrate that LION-FS achieves state-of-the-art efficacy and efficiency. The codes will be released soon.
Poster
Kaixuan Wu · Xinde Li · Xinglin Li · Chuanfei Hu · Guoliang Wu

[ ExHall D ]

Abstract
In this paper, a novel benchmark for audio-visual question answering continual learning (AVQACL) is introduced, aiming to study fine-grained scene understanding and spatial-temporal reasoning in videos under a continual learning setting. To facilitate this multimodal continual learning task, we create two audio-visual question answering continual learning datasets, named Split-AVQA and Split-MUSIC-AVQA, based on the AVQA and MUSIC-AVQA datasets, respectively. The experimental results suggest that existing models exhibit limited cognitive and reasoning abilities and experience catastrophic forgetting when processing three modalities simultaneously in a continuous data stream. To address the above challenges, we propose a novel continual learning method that incorporates question-guided cross-modal information fusion (QCIF) to focus on question-relevant details for improved feature representation and task-specific knowledge distillation with spatial-temporal feature constraints (TKD-STFC) to preserve the spatial-temporal reasoning knowledge acquired from previous dynamic scenarios. Furthermore, a question semantic consistency constraint (QSCC) is employed to ensure that the model maintains a consistent understanding of question semantics across tasks throughout the continual learning process. Extensive experimental results on the Split-AVQA and Split-MUSIC-AVQA datasets illustrate that our method achieves state-of-the-art audio-visual question answering continual learning performance. Code is available in the Supplementary Material.
Poster
Huabin Liu · Filip Ilievski · Cees G. M. Snoek

[ ExHall D ]

Abstract
This paper proposes the first video-grounded entailment tree reasoning method for commonsense video question answering (VQA). Despite the remarkable progress of large visual-language models (VLMs), there are growing concerns that they learn spurious correlations between videos and likely answers, reinforced by their black-box nature and remaining benchmarking biases. Our method explicitly grounds VQA tasks to video fragments in four steps: entailment tree construction, video-language entailment verification, tree reasoning, and dynamic tree expansion. A vital benefit of the method is its generalizability to current video- and image-based VLMs across reasoning types. To support fair evaluation, we devise a de-biasing procedure based on large-language models that rewrite VQA benchmark answer sets to enforce model reasoning. Systematic experiments on existing and de-biased benchmarks highlight the impact of our method components across benchmarks, VLMs, and reasoning types.
Poster
Ziyang Wang · Shoubin Yu · Elias Stengel-Eskin · Jaehong Yoon · Feng Cheng · Gedas Bertasius · Mohit Bansal

[ ExHall D ]

Abstract
Long-form video understanding has been a challenging task due to the high redundancy in video data and the abundance of query-irrelevant information. To tackle this challenge, we propose VideoTree, a training-free framework which builds a query-adaptive and hierarchical video representation for LLM reasoning over long-form videos. First, VideoTree extracts query-relevant information from the input video through an iterative process, progressively refining the selection of keyframes based on their relevance to the query. Furthermore, VideoTree leverages the inherent hierarchical structure of long video data, which is often overlooked by existing LLM-based methods. Specifically, we incorporate multi-granularity information into a tree-based representation, allowing VideoTree to extract query-relevant details from long videos in a coarse-to-fine manner. This enables the model to effectively handle a wide range of video queries with varying levels of detail. Finally, VideoTree aggregates the hierarchical query-relevant information within the tree structure and feeds it into an LLM reasoning model to answer the query. Our experiments show that VideoTree improves both reasoning accuracy and efficiency compared to existing methods. Specifically, VideoTree outperforms the existing training-free approaches on the popular EgoSchema and NExT-QA benchmarks with less inference time, achieving 61.1% and 75.6% accuracy on the test set without additional video-specific training. …
Poster
Haiyi Qiu · Minghe Gao · Long Qian · Kaihang Pan · Qifan Yu · Juncheng Li · Wenjie Wang · Siliang Tang · Yueting Zhuang · Tat-seng Chua

[ ExHall D ]

Abstract
Video Large Language Models (Video-LLMs) have recently shown strong performance in basic video understanding tasks, such as captioning and coarse-grained question answering, but struggle with compositional reasoning that requires multi-step spatio-temporal inference across object relations, interactions, and events. The hurdles to enhancing this capability include extensive manual labor, the lack of spatio-temporal compositionality in existing data, and the absence of explicit reasoning supervision. In this paper, we propose STEP, a novel graph-guided self-training method that enables Video-LLMs to generate reasoning-rich fine-tuning data from any raw videos to improve themselves. Specifically, we first induce a Spatial-Temporal Scene Graph (STSG) representation of diverse videos to capture fine-grained, multi-granular video semantics. Then, the STSG guides the derivation of multi-step reasoning Question-Answer (QA) data with Chain-of-Thought (CoT) rationales. Both answers and rationales are integrated as the training objective, aiming to enhance the model's reasoning abilities through supervision over explicit reasoning steps. Experimental results demonstrate the effectiveness of STEP across models of varying scales, with a significant 21.3\% improvement in tasks requiring three or more reasoning steps. Furthermore, it achieves superior performance with a minimal amount of self-generated rationale-enriched training samples in both compositional reasoning and comprehensive understanding benchmarks, highlighting the broad applicability and vast potential. The code …
Poster
Kangsan Kim · Geon Park · Youngwan Lee · Woongyeong Yeo · Sung Ju Hwang

[ ExHall D ]

Abstract
Recent advancements in video large multimodal models (LMMs) have significantly improved their video understanding and reasoning capabilities. However, their performance drops on out-of-distribution (OOD) tasks that are underrepresented in training data. Traditional methods like fine-tuning on OOD datasets are impractical due to high computational costs. While in-context learning (ICL) with demonstration examples has shown promising generalization performance in language tasks and image-language tasks without fine-tuning, applying ICL to video-language tasks faces challenges due to the limited context length in Video LMMs, as videos require longer token lengths. To address these issues, we propose VideoICL, a novel video in-context learning framework for OOD tasks that introduces a similarity-based relevant example selection strategy and a confidence-based iterative inference approach. This allows the framework to select the most relevant examples and rank them by similarity, to be used for inference. If the generated response has low confidence, our framework selects new examples and performs inference again, iteratively refining the results until a high-confidence response is obtained. This approach improves OOD video understanding performance by extending the effective context length without incurring high costs. The experimental results on multiple benchmarks demonstrate significant performance gains, especially in domain-specific scenarios, laying the groundwork for broader video comprehension applications.
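The select-then-verify loop can be sketched as follows; all function arguments (ranking, inference, and confidence scoring) are placeholders rather than an actual API.

```python
# Minimal sketch of a VideoICL-style loop: rank candidate examples by similarity to the
# query, run inference with the top-k, and retry with the next batch of examples when
# the response confidence is low (function names are placeholders, not an actual API).
def video_icl(query, candidates, rank_by_similarity, run_lmm, confidence,
              k=4, threshold=0.8, max_rounds=3):
    ranked = rank_by_similarity(query, candidates)     # most similar examples first
    best_answer, best_conf = None, -1.0
    for round_idx in range(max_rounds):
        examples = ranked[round_idx * k:(round_idx + 1) * k]
        if not examples:
            break
        answer = run_lmm(query, examples)              # in-context inference
        conf = confidence(answer)                      # e.g., mean token log-probability
        if conf > best_conf:
            best_answer, best_conf = answer, conf
        if conf >= threshold:                          # confident enough: stop early
            break
    return best_answer
```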
Poster
Zhuoming Liu · Yiquan Li · Khoi D Nguyen · Yiwu Zhong · Yin Li

[ ExHall D ]

Abstract
We present PAVE, a framework for adapting pre-trained video large language models to downstream tasks featuring temporal supplementary signals, such as audio, camera pose, or high frame rate videos. PAVE adapts these models through "patching", introducing a small number of additional parameters and operations without modifying the base model architecture or pre-trained weights. We demonstrate that PAVE effectively adapts video LLMs for tasks including audio-visual understanding and 3D reasoning, surpassing state-of-the-art task-specific models, while using less than 1% additional parameters and FLOPs. Furthermore, when applied to high-frame-rate videos, PAVE enhances video understanding, improving the performance of strong base models. Our analysis also highlights that this framework generalizes well across different video LLMs.
Poster
Shuming Liu · Chen Zhao · Tianqi Xu · Bernard Ghanem

[ ExHall D ]

Abstract
Large vision-language models (VLMs) have shown promising progress in various video understanding tasks. However, their potential for long-form video analysis is limited by their high computational resource requirements and constrained context windows. Traditional methods, particularly uniform frame sampling, often allocate resources to irrelevant content, reducing effectiveness in real-world scenarios. This paper introduces BOLT to BOost Large VLMs without additional Training through an extensive study of frame selection strategies for large VLMs. To provide a realistic evaluation of VLMs in long-form video understanding, we first present a multi-source retrieval evaluation setting. Our findings show that uniform sampling significantly underperforms when dealing with noisy contexts, highlighting the importance of selecting the right frames. Furthermore, we introduce several frame selection strategies based on query-frame similarity and analyze their effectiveness in enhancing VLM performance without retraining. We find that inverse transform sampling with refined query descriptions yields the most substantial performance improvement, boosting accuracy on the Video-MME benchmark from 49.94% to 53.8%. Our code will be released.
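Inverse transform sampling over query-frame similarity can be sketched as follows; this illustrates the general technique and does not reproduce BOLT's exact query refinement or temperature settings.

```python
# Minimal sketch of inverse transform sampling over query-frame similarity scores for
# training-free frame selection (an illustration of the general technique; the paper's
# exact refinement of query descriptions is not reproduced here).
import numpy as np

def select_frames(similarity: np.ndarray, num_frames: int) -> np.ndarray:
    """similarity: (T,) query-frame similarity; returns indices of selected frames."""
    s = similarity - similarity.min()
    probs = s / s.sum() if s.sum() > 0 else np.full_like(s, 1.0 / len(s))
    cdf = np.cumsum(probs)
    # Evenly spaced quantiles mapped through the inverse CDF: frames in high-similarity
    # regions are sampled more densely, while coverage of the whole video is retained.
    quantiles = (np.arange(num_frames) + 0.5) / num_frames
    return np.searchsorted(cdf, quantiles)

T = 600
sim = np.random.rand(T)                 # e.g., similarity between the query text and each frame
frames = select_frames(sim, num_frames=32)
```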
Poster
Zhenpeng Huang · Xinhao Li · Jiaqi Li · Jing Wang · Xiangyu Zeng · Cheng Liang · Tao Wu · Xi Chen · Liang Li · Limin Wang

[ ExHall D ]

Abstract
Multimodal Large Language Models (MLLMs) have shown significant progress in video understanding. However, applying these models to real-world streaming scenarios, such as autonomous driving, augmented reality, and surveillance, presents unique challenges due to real-time dynamics. This paper introduces the Online Spatial-Temporal Video Understanding Benchmark, OVBench, designed specifically to evaluate models’ capacity to interpret spatiotemporal features in streaming contexts. Unlike existing offline benchmarks, OVBench integrates six core task types, each defined across three temporal contexts (past, current, and future) to capture real-time complexity, resulting in a comprehensive set of 15 subtasks based on diverse datasets. To address the unique data constraints and architectural challenges in streaming video understanding, we present our strong baseline model, VideoChat-Online. This model incorporates a hierarchical memory bank architecture that effectively balances spatial and temporal representation, achieving state-of-the-art performance across online and offline scenarios. Our approach surpasses the existing state-of-the-art offline model Qwen2-VL 7B and the online model Flash-VStream by 4.19\% and 23.7\% on OVBench, respectively. The results demonstrate VideoChat-Online’s efficacy in providing real-time responses while maintaining high accuracy in spatiotemporal comprehension.
Poster
Gengyuan Zhang · Mang Ling Ada Fok · Jialu Ma · Yan Xia · Philip H.S. Torr · Daniel Cremers · Volker Tresp · Jindong Gu

[ ExHall D ]

Abstract
Localizing events in videos based on semantic queries is a pivotal task in video understanding, with the growing significance of user-oriented applications like video search. Yet, current research predominantly relies on natural language queries (NLQs), overlooking the potential of using multimodal queries (MQs) that integrate images to more flexibly represent semantic queries— especially when it is difficult to express non-verbal or unfamiliar concepts in words. To bridge this gap, we introduce ICQ, a new benchmark designed for localizing events in videos with MQs, alongside an evaluation dataset ICQ-Highlight. To accommodate and evaluate existing video localization models for this new task, we propose 3 Multimodal Query Adaptation methods and a Surrogate Fine-tuning on pseudo-MQs strategy. We systematically benchmark 12 state-of-the-art backbone models, spanning from specialized video localization models to Video LLMs, across diverse data domains. Our experiments highlight the high potential of MQs in real-world applications. We believe this benchmark is a first step toward advancing MQs in video event localization. Our code and dataset will be publicly available.
Poster
Junho Kim · Hyunjun Kim · Hosu Lee · Yong Man Ro

[ ExHall D ]

Abstract
Despite advances in Large Multi-modal Models, applying them to long and untrimmed video content remains challenging due to limitations in context length and substantial memory overhead. These constraints often lead to significant information loss and reduced relevance in the model responses. With the exponential growth of video data across web platforms, understanding long-form video is crucial for advancing generalized intelligence. In this paper, we introduce SALOVA: Segment-Augmented LOng Video Assistant, a novel video-LLM framework designed to enhance the comprehension of lengthy video content through targeted retrieval process. We address two main challenges to achieve it: (i) We present the SceneWalk dataset, a high-quality collection of 87.8K long videos, each densely captioned at the segment level to enable models to capture scene continuity and maintain rich descriptive context. (ii) We develop robust architectural designs integrating dynamic routing mechanism and spatio-temporal projector to efficiently retrieve and process relevant video segments based on user queries. Our framework mitigates the limitations of current video-LMMs by allowing for precise identification and retrieval of relevant video segments in response to queries, thereby improving the contextual relevance of the generated responses. Through extensive experiments, SALOVA demonstrates enhanced capability in processing complex long-form videos, showing significant capability to …
Poster
Sheng Zhou · Junbin Xiao · Qingyun Li · Yicong Li · Xun Yang · Dan Guo · Meng Wang · Tat-seng Chua · Angela Yao

[ ExHall D ]

Abstract
We introduce EgoTextVQA, a novel and rigorously constructed benchmark for egocentric QA assistance involving scene text. EgoTextVQA contains 1.5K ego-view videos and 7K scene-text aware questions that reflect real-user needs in outdoor driving and indoor house-keeping activities. The questions are designed to elicit identification and reasoning on scene text in an egocentric and dynamic environment. With EgoTextVQA, we comprehensively evaluate 10 prominent multimodal large language models. Currently, all models struggle, and the best results (Gemini Pro) are around 33\% accuracy, highlighting the severe deficiency of these techniques in egocentric QA assistance. Our further investigations suggest that precise temporal grounding and multi-frame reasoning, along with high resolution and auxiliary scene-text inputs, are key for better performance. With thorough analyses and heuristic suggestions, we hope EgoTextVQA can serve as a solid testbed for research in egocentric scene-text QA assistance.
Poster
Felix Vogel · Walid Bousselham · Anna Kukleva · Nina Shvetsova · Hilde Kuehne

[ ExHall D ]

Abstract
Vision-language foundation models have shown impressive capabilities across various zero-shot tasks, including training-free localization and grounding, primarily focusing on localizing objects in images. However, leveraging those capabilities to localize actions and events in videos is challenging as actions have less physical outline and are usually described by higher-level concepts. In this work, we propose VideoGEM, the first training-free action grounding method based on pretrained image- and video-language backbones. Namely, we adapt the self-self attention formulation of GEM to activity grounding. By doing so, we observe that high-level semantic concepts, such as actions, usually emerge in the higher layers of the image- and video-language models. We, therefore, propose a layer weighting in the self-attention path to prioritize higher layers. Additionally, we introduce a dynamic weighting method to automatically tune layer weights to capture each layer’s relevance to a specific prompt. Finally, we introduce a prompt decomposition, processing action, verb, and object prompts separately, resulting in a better localization of actions. We evaluate the proposed approach on three image- and video-language backbones, CLIP, OpenCLIP, and ViCLIP, and on four video grounding datasets, V-HICO, DALY, YouCook-Interactions, and GroundingYouTube, showing that the proposed training-free approach is able to outperform current trained state-of-the-art approaches for video …
Poster
Aaryan Garg · Akash Kumar · Yogesh S. Rawat

[ ExHall D ]

Abstract
In this work we study Weakly Supervised Spatio-Temporal Video Grounding (WSTVG), a challenging task of localizing subjects spatio-temporally in videos using only textual queries and no bounding box supervision. Inspired by recent advances in vision-language foundation models, we investigate their utility for WSTVG, leveraging their zero-shot grounding capabilities. However, we find that a simple adaptation lacks essential spatio-temporal grounding abilities. To bridge this gap, we introduce Tubelet Referral Grounding (TRG), which connects textual queries to tubelets to enable spatio-temporal predictions. Despite its promise, TRG struggles with compositional action understanding and dense scene scenarios. To address these limitations, we propose STPro, a novel progressive learning framework with two key modules: (1) Sub-Action Temporal Curriculum Learning (SA-TCL), which incrementally builds compositional action understanding, and (2) Congestion-Guided Spatial Curriculum Learning (CG-SCL), which adapts the model to complex scenes by spatially increasing task difficulty. STPro achieves state-of-the-art results on three benchmark datasets, with improvements of 1.0% on VidSTG-Declarative and 3.0% on HCSTVG-v1. Code and models will be released publicly.
Poster
Claudia Cuttano · Gabriele Trivigno · Gabriele Rosi · Carlo Masone · Giuseppe Averta

[ ExHall D ]

Abstract
Referring Video Object Segmentation (RVOS) relies on natural language expressions to segment an object in a video clip. Existing methods restrict reasoning either to independent short clips, losing global context, or process the entire video offline, impairing their application in a streaming fashion. In this work, we aim to surpass these limitations and design an RVOS method capable of effectively operating in streaming-like scenarios while retaining contextual information from past frames. We build upon the Segment-Anything 2 (SAM2) model, which provides robust segmentation and tracking capabilities and is naturally suited for streaming processing. We make SAM2 wiser by empowering it with natural language understanding and explicit temporal modeling at the feature extraction stage, without fine-tuning its weights and without outsourcing modality interaction to external models. To this end, we introduce a novel adapter module that injects temporal information and multi-modal cues into the feature extraction process. We further reveal the phenomenon of tracking bias in SAM2 and propose a learnable module to adjust its tracking focus when the current frame features suggest a new object more aligned with the caption. Our proposed method, \ours, achieves state-of-the-art results across various benchmarks, by adding a negligible overhead of just 4.2 M parameters.
Poster
Nan Huang · Wenzhao Zheng · Chenfeng Xu · Kurt Keutzer · Shanghang Zhang · Angjoo Kanazawa · Qianqian Wang

[ ExHall D ]

Abstract
Moving object segmentation is a crucial task for achieving a high-level understanding of visual scenes and has numerous downstream applications. Humans can effortlessly segment moving objects in videos. Previous work has largely relied on optical flow to provide motion cues; however, this approach often results in imperfect predictions due to challenges such as partial motion, complex deformations, motion blur and background distractions. We propose a novel approach for moving object segmentation that combines long-range trajectory motion cues with DINO-based semantic features and leverages SAM2 for pixel-level mask densification through an iterative prompting strategy. Our model employs Spatio-Temporal Trajectory Attention and Motion-Semantic Decoupled Embedding to prioritize motion while integrating semantic support. Extensive testing on diverse datasets demonstrates state-of-the-art performance, excelling in challenging scenarios and fine-grained segmentation of multiple objects.
Poster
Haiyang Mei · Pengyu Zhang · Mike Zheng Shou

[ ExHall D ]

Abstract
Foundation models like the Segment Anything Model (SAM) have significantly advanced promptable image segmentation in computer vision. However, extending these capabilities to videos presents substantial challenges, particularly in ensuring precise and temporally consistent mask propagation in dynamic scenes. SAM 2 attempts to address this by training a model on massive image and video data from scratch to learn complex spatiotemporal associations, resulting in huge training costs that hinder research and practical deployment. In this paper, we introduce SAM-I2V, an effective image-to-video upgradation method for cultivating a promptable video segmentation (PVS) model. Our approach strategically upgrades the pre-trained SAM to support PVS, significantly reducing training complexity and resource requirements. To achieve this, we introduce two key innovations: (i) a novel memory-as-prompt mechanism that leverages object memory to ensure segmentation consistency across dynamic scenes; and (ii) a new memory filtering mechanism designed to select the most informative historical information, thereby avoiding errors and noise interference and enhancing the stability of segmentation. Comprehensive experiments demonstrate that our method achieves over 90\% of SAM 2's performance while using only 0.2\% of its training cost. Our work presents a resource-efficient pathway to PVS, lowering barriers for further research in PVS model design and enabling broader …
Poster
Andrei Dumitriu · Florin Tatui · Florin Miron · Aakash Ralhan · Radu Tudor Ionescu · Radu Timofte

[ ExHall D ]

Abstract
Rip currents are strong, localized and narrow currents of water that flow outwards into the sea, causing numerous beach-related injuries and fatalities worldwide. Accurate identification of rip currents remains challenging due to their amorphous nature and a lack of annotated data, which often requires expert knowledge. To address these issues, we present RipVIS, a large-scale video instance segmentation benchmark explicitly designed for rip current segmentation. RipVIS is an order of magnitude larger than previous datasets, featuring 140 videos (204,056 frames), out of which 115 videos (161,600 frames) contain rip currents, collected from various sources, including drones, mobile phones, and fixed beach cameras. Our dataset encompasses diverse visual contexts, such as wave-breaking patterns, sediment flows, and water color variations, across multiple global locations, including the USA, Greece, Portugal, Romania, Costa Rica, Sri Lanka, New Zealand and Australia. Most videos are annotated at 5 FPS to ensure accuracy in dynamic scenarios, supplemented by an additional 25 videos (42,456 frames) without rip currents. We conduct comprehensive ablation studies on Mask R-CNN, Cascade Mask R-CNN, YOLACT, and YOLO11, fine-tuning these models for the task of rip current segmentation. Baseline results are reported across standard metrics, with a particular focus on the F2 score …
Poster
Olga Zatsarynna · Emad Bahrami · Yazan Abu Farha · Gianpiero Francesca · Jürgen Gall

[ ExHall D ]

Abstract
Our work addresses the problem of stochastic long-term dense anticipation. The goal of this task is to predict actions and their durations several minutes into the future based on provided video observations. Anticipation over extended horizons introduces high uncertainty, as a single observation can lead to multiple plausible future outcomes. To address this uncertainty, stochastic models are designed to predict several potential future action sequences. Recent work has further proposed to incorporate uncertainty modelling for observed frames by simultaneously predicting per-frame past and future actions in a unified manner. While such joint modelling of actions is beneficial, it requires long-range temporal capabilities to connect events across distant past and future time points. However, the previous work struggles to achieve such a long-range understanding due to its limited and/or sparse receptive field. To alleviate this issue, we propose a novel MANTA (MAmba for ANTicipation) network. Our model enables effective long-term temporal modelling even for very long sequences while maintaining linear complexity in sequence length. We demonstrate that our approach achieves state-of-the-art results on three datasets—Breakfast, 50Salads, and Assembly101—while also significantly improving computational and memory efficiency.
Poster
yilong wang · Zilin Gao · Qilong Wang · Zhaofeng Chen · Peihua Li · Qinghua Hu

[ ExHall D ]

Abstract
Going beyond few-shot action recognition (FSAR), cross-domain FSAR (CDFSAR) has attracted recent research interest by solving the domain gap in source-to-target transfer learning. Existing CDFSAR methods mainly focus on joint training of source and target data to mitigate the side effect of the domain gap. However, such methods suffer from two limitations: First, pair-wise joint training requires retraining deep models in the case of one source dataset and multiple target ones, which incurs heavy computation cost, especially for large source and small target data. Second, after joint training, pre-trained models are adapted to the target domain in a straightforward manner, hardly exploiting the full potential of pre-trained models and thus limiting recognition performance. To overcome the above limitations, this paper proposes a simple yet effective baseline, namely Temporal-Aware Model Tuning (TAMT), for CDFSAR. Specifically, our TAMT involves a decoupled paradigm of pre-training on source data and fine-tuning on target data, which avoids retraining for multiple target datasets with a single source. To effectively and efficiently explore the potential of pre-trained models in transferring to the target domain, our TAMT proposes a Hierarchical Temporal Tuning Network (HTTN), whose core involves local temporal-aware adapters (TAA) and a global temporal-aware moment tuning (GTMT). Particularly, TAA learns few …
Poster
Shaopeng Yang · Jilong Wang · Saihui Hou · Xu Liu · Chunshui Cao · Liang Wang · Yongzhen Huang

[ ExHall D ]

Abstract
Gait sequences exhibit sequential structures and contextual relationships similar to those in natural language, where each element—whether a word or a gait step—is connected to its predecessors and successors. This similarity enables the transformation of gait sequences into "texts" containing identity-related information. Large Language Models (LLMs), designed to understand and generate sequential data, can thus be utilized for gait sequence modeling to enhance gait recognition performance. Leveraging these insights, we make a pioneering effort to apply LLMs to gait recognition, which we refer to as GaitLLM. Specifically, we propose the Gait-to-Language (G2L) module, which converts gait sequences into a textual format suitable for LLMs, and the Language-to-Gait (L2G) module, which maps the LLM's output back to the gait feature space, thereby bridging the gap between LLM outputs and gait recognition. Notably, GaitLLM leverages the powerful modeling capabilities of LLMs without relying on complex architectural designs, improving gait recognition performance with only a small number of trainable parameters. Our method achieves state-of-the-art results on four popular gait datasets—SUSTech1K, CCPG, Gait3D, and GREW—demonstrating the effectiveness of applying LLMs in this domain. This work highlights the potential of LLMs to significantly enhance gait recognition, paving the way for future research and practical applications.
Poster
Lorenzo Mur-Labadia · Jose J. Guerrero · Ruben Martinez-Cantin

[ ExHall D ]

Abstract
Environment understanding in egocentric videos is an important step for applications like robotics, augmented reality and assistive technologies. These videos are characterized by dynamic interactions and a strong dependence on the wearer’s engagement with the environment. Traditional approaches often focus on isolated clips or fail to integrate rich semantic and geometric information, limiting scene comprehension. We introduce Dynamic Image-Video Feature Fields (DIV-FF), a framework that decomposes the egocentric scene into persistent, dynamic, and actor-based components while integrating both image and video-language features. Our model enables detailed segmentation, captures affordances, understands the surroundings and maintains consistent understanding over time. DIV-FF outperforms state-of-the-art methods, particularly in dynamically evolving scenarios, demonstrating its potential to advance long-term, spatio-temporal scene understanding.
Poster
Shengeng Tang · Jiayi He · Lechao Cheng · Jingjing Wu · Dan Guo · Richang Hong

[ ExHall D ]

Abstract
Generating continuous sign language videos from discrete segments is challenging due to the need for smooth transitions that preserve natural flow and meaning. Traditional approaches that simply concatenate isolated signs often result in abrupt transitions, disrupting video coherence. To address this, we propose a novel framework, Sign-D2C, that employs a conditional diffusion model to synthesize contextually smooth transition frames, enabling the seamless construction of continuous sign language sequences. Our approach transforms the unsupervised problem of transition frame generation into a supervised training task by simulating the absence of transition frames through random masking of segments in long-duration sign videos. The model learns to predict these masked frames by denoising Gaussian noise, conditioned on the surrounding sign observations, allowing it to handle complex, unstructured transitions. During inference, we apply a linearly interpolating padding strategy that initializes missing frames through interpolation between boundary frames, providing a stable foundation for iterative refinement by the diffusion model. Extensive experiments on the PHOENIX14T, USTC-CSL100, and USTC-SLR500 datasets demonstrate the effectiveness of our method in producing continuous, natural sign language videos.
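The linearly interpolating padding used at inference can be sketched as a simple blend between the boundary frames of two segments, which then serves as the initialization for diffusion-based refinement; the tensor shapes are illustrative.

```python
# Minimal sketch of linearly interpolating padding to initialize missing transition
# frames between two sign segments before diffusion-based refinement (the denoising
# model itself is abstracted away; shapes are illustrative).
import torch

def interpolate_transition(last_frame, first_frame, num_missing: int):
    """Linearly blend from the last frame of segment A to the first frame of segment B."""
    alphas = torch.linspace(0, 1, num_missing + 2)[1:-1]      # exclude the boundary frames
    return torch.stack([(1 - a) * last_frame + a * first_frame for a in alphas])

C, H, W = 3, 256, 256
seg_a_end = torch.rand(C, H, W)
seg_b_start = torch.rand(C, H, W)
init_frames = interpolate_transition(seg_a_end, seg_b_start, num_missing=8)  # (8, C, H, W)
# init_frames would then be iteratively refined by the conditional diffusion model.
```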
Poster
Zezeng Li · Xiaoyu Du · Na Lei · Liming Chen · Weimin Wang

[ ExHall D ]

Abstract
Adversarial attacks exploit the vulnerability of deep models against adversarial samples. Existing point cloud attackers are tailored to specific models, iteratively optimizing perturbations based on gradients in either a white-box or black-box setting. Despite their promising attack performance, they often struggle to produce transferable adversarial samples due to overfitting to the specific parameters of surrogate models. To overcome this issue, we shift our focus to the data distribution itself and introduce a novel approach named NoPain, which employs optimal transport (OT) to identify the inherent singular boundaries of the data manifold for cross-network point cloud attacks. Specifically, we first calculate the OT mapping from noise to the target feature space, then identify singular boundaries by locating non-differentiable positions. Finally, we sample along singular boundaries to generate adversarial point clouds. Once the singular boundaries are determined, NoPain can efficiently produce adversarial samples without the need for iterative updates or guidance from surrogate classifiers. Extensive experiments demonstrate that the proposed end-to-end method outperforms baseline approaches in terms of both transferability and efficiency, while also maintaining notable advantages even against defense strategies. The source code will be publicly available.
Poster
Li Lin · Santosh Santosh · Mingyang Wu · Xin Wang · Shu Hu

[ ExHall D ]

Abstract
AI-generated faces have enriched human life in areas such as entertainment, education, and art. However, they also pose misuse risks. Detecting AI-generated faces is therefore crucial, yet current detectors show biased performance across different demographic groups. Biases can be mitigated by designing algorithmic fairness methods, which usually require demographically annotated face datasets for model training. However, no existing dataset encompasses both demographic attributes and diverse generative methods simultaneously, which hinders the development of fair detectors for AI-generated faces. In this work, we introduce the AI-Face dataset, the first million-scale demographically annotated AI-generated face image dataset, including real faces, faces from deepfake videos, and faces generated by Generative Adversarial Networks and Diffusion Models. Based on this dataset, we conduct the first comprehensive fairness benchmark to assess various AI face detectors and provide valuable insights and findings to promote the future fair design of AI face detectors. Our AI-Face dataset and benchmark code are publicly available at https://anonymous.4open.science/r/AI_Face_FairnessBench-E417.
Poster
Fengfan Zhou · Bangjie Yin · Hefei Ling · Qianyu Zhou · Wenxuan Wang

[ ExHall D ]

Abstract
Face Recognition (FR) models are vulnerable to adversarial examples that subtly manipulate benign face images, underscoring the urgent need to improve the transferability of adversarial attacks in order to expose the blind spots of these systems. Existing adversarial attack methods often overlook the potential benefits of augmenting the surrogate model with diverse initializations, which limits the transferability of the generated adversarial examples. To address this gap, we propose a novel attack method called Diverse Parameters Augmentation (DPA), which enhances surrogate models by incorporating diverse parameter initializations, resulting in a broader and more diverse set of surrogate models. Specifically, DPA consists of two key stages: Diverse Parameters Optimization (DPO) and Hard Model Aggregation (HMA). In the DPO stage, we initialize the parameters of the surrogate model using both pre-trained and random parameters. Subsequently, we save the models in the intermediate training process to obtain a diverse set of surrogate models. During the HMA stage, we enhance the feature maps of the diversified surrogate models by incorporating beneficial perturbations optimized in directions opposite to those of adversarial perturbations, thereby further improving the transferability. Experimental results demonstrate that our proposed attack method can effectively enhance the transferability of the crafted adversarial face …
Poster
Mohammad Saadabadi Saadabadi · Sahar Rahimi Malakshan · Ali Dabouei · Srinjoy Das · Jeremy M. Dawson · Nasser M Nasrabadi

[ ExHall D ]

Abstract
Aiming to reduce the computational cost of Softmax in the massive label space of Face Recognition (FR) benchmarks, recent studies estimate the output using a subset of identities. Although promising, the computational cost still scales linearly with the number of identities in the dataset, only at a reduced ratio. A shared characteristic among available FR methods is the employment of atomic scalar labels during training. Consequently, the input-to-label matching is performed through a dot product between the feature vector of the input and the Softmax centroids. In this work, we present a simple yet effective method that substitutes scalar labels with a structured identity code, i.e., a sequence of integers. Specifically, we propose a tokenization scheme that transforms atomic scalar labels into structured identity codes. Then, we train an FR backbone to predict the code for each input instead of its scalar label. As a result, the associated computational cost becomes logarithmic w.r.t. the number of identities. We demonstrate the benefits of the proposed method by conducting experiments on LFW, CFP-FP, CPLFW, CALFW, AgeDB, IJB-B, and IJB-C using different backbone network architectures. In particular, with less training computational load, our method outperforms its competitors by 1.52% and 0.6% at TAR@FAR=1e4 on …
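A minimal sketch of the label tokenization idea, assuming a base-radix encoding so that code length grows logarithmically with the number of identities; the radix and resulting code length are illustrative choices, not the paper's exact scheme.

```python
import math

def label_to_identity_code(label: int, num_identities: int, radix: int = 16) -> list[int]:
    """Map a scalar identity label to a fixed-length sequence of integers
    (its base-`radix` digits). Code length grows as log_radix(num_identities)."""
    code_len = max(1, math.ceil(math.log(num_identities, radix)))
    digits = []
    for _ in range(code_len):
        digits.append(label % radix)
        label //= radix
    return digits[::-1]  # most-significant digit first

# Example: 1,000,000 identities with radix 16 gives 5-token codes.
print(label_to_identity_code(123456, num_identities=1_000_000))  # [1, 14, 2, 4, 0]
```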
Poster
Li Lun · Kunyu Feng · Qinglong Ni · Ling Liang · Yuan Wang · Ying Li · dunshan yu · Xiaoxin CUI

[ ExHall D ]

Abstract
Spiking neural networks (SNNs) have shown their competence in handling spatial-temporal event-based data with low energy consumption. Similar to conventional artificial neural networks (ANNs), SNNs are also vulnerable to gradient-based adversarial attacks, wherein gradients are calculated by spatial-temporal back-propagation (STBP) and surrogate gradients (SGs). However, the SGs may be invisible for an inference-only model as they do not influence the inference results, and current gradient-based attacks are ineffective for binary dynamic images captured by the dynamic vision sensor (DVS). While some approaches addressed the issue of invisible SGs through universal SGs, their SGs lack a correlation with the victim model, resulting in sub-optimal performance. Moreover, the imperceptibility of existing SNN-based binary attacks is still insufficient. In this paper, we introduce an innovative potential-dependent surrogate gradient (PDSG) method to establish a robust connection between the SG and the model, thereby enhancing the adaptability of adversarial attacks across various models with invisible SGs. Additionally, we propose the sparse dynamic attack (SDA) to effectively attack binary dynamic images. Utilizing a generation-reduction paradigm, SDA can fully optimize the sparsity of adversarial perturbations. Experimental results demonstrate that our PDSG and SDA outperform state-of-the-art SNN-based attacks across various models and datasets. Specifically, our PDSG achieves 100% …
Poster
Ziqi Li · Tao Gao · Yisheng An · Ting Chen · Jing Zhang · Yuanbo Wen · Mengkun Liu · Qianxi Zhang

[ ExHall D ]

Abstract
Brain-inspired spiking neural networks (SNNs) have the capability of energy-efficient processing of temporal information. However, leveraging the rich dynamic characteristics of SNNs and prior works in artificial neural networks (ANNs) to construct an effective object detection model for visual tasks remains an open question for further exploration. To develop a directly-trained, low-energy-consumption, and high-performance multi-scale SNN model, we propose a novel interpretable object detection framework, the Multi-scale Spiking Detector (MSD). Initially, we propose a spiking convolutional neuron as a core component of the Optic Nerve Nucleus Block (ONNB), designed to significantly enhance the deep feature extraction capabilities of SNNs. ONNB enables direct training with improved energy efficiency, demonstrating superior performance compared to state-of-the-art ANN-to-SNN conversion and SNN techniques. In addition, we propose a Multi-scale Spiking Detection Framework to emulate the biological response and comprehension of stimuli from different objects. Within this framework, spiking multi-scale fusion and the spiking detector are employed to integrate features across different depths and to detect response outcomes, respectively. Our method outperforms state-of-the-art ANN detectors, with only 7.8 M parameters and 6.43 mJ energy consumption. MSD obtains a mean average precision (mAP) of 62.0% and 66.3% on the COCO and Gen1 datasets, respectively.
Poster
Tian Gao · Yu Zhang · Zhiyuan Zhang · Huajun Liu · Kaijie Yin · Cheng-Zhong Xu · Hui Kong

[ ExHall D ]

Abstract
Model binarization has made significant progress in enabling real-time and energy-efficient computation for convolutional neural networks (CNN), offering a potential solution to the deployment challenges faced by Vision Transformers (ViTs) on edge devices. However, due to the structural differences between CNN and Transformer architectures, simply applying binary CNN strategies to the ViT models will lead to a significant performance drop. To tackle this challenge, we propose BHViT, a binarization-friendly hybrid ViT architecture and its full binarization model with the guidance of three important observations. Initially, BHViT utilizes the local information interaction and hierarchical feature aggregation technique from coarse to fine levels to address redundant computations stemming from excessive tokens. Then, a novel module based on shift operations is proposed to enhance the performance of the binary Multilayer Perceptron (MLP) module without significantly increasing computational overhead. In addition, an innovative attention matrix binarization method based on quantization decomposition is proposed to evaluate the token's importance in the binarized attention matrix. Finally, we propose a regularization loss to address the inadequate optimization caused by the incompatibility between the weight oscillation in the binary layers and the Adam Optimizer. Extensive experimental results demonstrate that our proposed algorithm achieves SOTA performance among binary ViT …
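For readers unfamiliar with binarized training, the snippet below shows a generic sign binarization with a straight-through estimator, a standard building block of binary networks; it is not BHViT's specific attention-binarization, shift module, or regularization loss.

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Generic sign binarization with a straight-through estimator (STE):
    forward passes sign(x); backward passes the gradient through unchanged
    wherever |x| <= 1. Illustrative only, not BHViT's exact scheme."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return grad_out * (x.abs() <= 1).to(grad_out.dtype)

# Example: binarize a small tensor and backpropagate through it.
x = torch.randn(4, requires_grad=True)
y = BinarizeSTE.apply(x)
y.sum().backward()
print(y, x.grad)
```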
Poster
Zhenyu Cui · Jiahuan Zhou · Yuxin Peng

[ ExHall D ]

Abstract
Lifelong person re-identification (LReID) aims to match the same person using sequentially collected data. However, due to the long-term nature of lifelong learning, the inevitable changes in human clothes prevent the model from relying on unified discriminative information (e.g., clothing style) to match the same person in the streaming data, demanding differentiated cloth-irrelevant information. Unfortunately, existing LReID methods typically fail to leverage such knowledge, exacerbating catastrophic forgetting. Therefore, in this paper, we focus on a challenging practical task called Cloth-Hybrid Lifelong Person Re-identification (CH-LReID), which requires matching the same person wearing different clothes using sequentially collected data. A Differentiated Knowledge Consolidation (DKC) framework is designed to unify and balance distinct knowledge across streaming data. The core idea is to adaptively balance differentiated knowledge and compatibly consolidate cloth-relevant and cloth-irrelevant information. To this end, a Differentiated Knowledge Transfer (DKT) module and a Latent Knowledge Consolidation (LKC) module are designed to adaptively discover differentiated new knowledge, while eliminating the derived domain shift of old knowledge via reconstructing the old latent feature space, respectively. Then, to further alleviate the catastrophic conflict between differentiated new and old knowledge, we propose a Dual-level Distribution Alignment (DDA) module to align …
Poster
Jingwei Zhang · Anh Tien Nguyen · Xi Han · Vincent Quoc-Huy Trinh · Hong Qin · Dimitris Samaras · Mahdi Hosseini

[ ExHall D ]

Abstract
Efficiently modeling large 2D contexts is essential for various fields including Giga-Pixel Whole Slide Imaging (WSI) and remote sensing. Transformer-based models offer high parallelism but face challenges due to their quadratic complexity for handling long sequences. Recently, Mamba introduced a selective State Space Model (SSM) with linear complexity and high parallelism, enabling effective and efficient modeling of wide context in 1D sequences. However, extending Mamba to vision tasks, which inherently involve 2D structures, results in spatial discrepancies due to the limitations of 1D sequence processing. On the other hand, current 2D SSMs inherently model 2D structures but they suffer from prohibitively slow computation due to the lack of efficient parallel algorithms. In this work, we propose 2DMamba, a novel 2D selective SSM framework that incorporates the 2D spatial structure of images into Mamba, with a highly optimized hardware-aware operator, achieving both spatial continuity and computational efficiency. We validate the versatility of our approach on both WSIs and natural images. Extensive experiments on 10 public datasets for WSI classification and survival analysis show that 2DMamba improves up to 2.48% in AUC, 3.11% in F1 score, 2.47% in accuracy and 5.52% in C-index. Additionally, integrating our method with VMamba for natural imaging …
Poster
Jeffri Erwin Murrugarra Llerena · José Henrique Marques · Claudio Jung

[ ExHall D ]

Abstract
Oriented Object Detection (OOD) has received increased attention in recent years, being a suitable solution for detecting elongated objects in remote sensing analysis. In particular, using regression loss functions based on Gaussian distributions has become attractive since they yield simple and differentiable terms. However, existing solutions are still based on regression heads that produce Oriented Bounding Boxes (OBBs), and the known problem of angular boundary discontinuity persists. In this work, we propose a regression head for OOD that directly produces Gaussian distributions based on the Cholesky matrix decomposition. The proposed head, named GauCho, theoretically mitigates the boundary discontinuity problem and is fully compatible with recent Gaussian-based regression loss functions. Furthermore, we advocate using Oriented Ellipses (OEs) to represent oriented objects, which relates to GauCho through a bijective function and alleviates the encoding ambiguity problem for circular objects. Our experimental results show that GauCho can be a viable alternative to the traditional OBB head, achieving results comparable to or better than state-of-the-art detectors for the challenging DOTA dataset. Our code will be made available (URL omitted for blind review).
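A minimal sketch of a Cholesky-parameterized Gaussian head: three unconstrained regression outputs are mapped to a positive-definite covariance and then to an oriented ellipse. The exact parameterization (exp on the diagonal) is an assumption for illustration and not necessarily the one GauCho uses.

```python
import numpy as np

def cholesky_head_to_gaussian(raw):
    """Map three unconstrained outputs to a valid 2D covariance via a
    lower-triangular Cholesky factor L (positive diagonal enforced with exp),
    then recover the oriented ellipse it represents."""
    l11, l21, l22 = raw
    L = np.array([[np.exp(l11), 0.0],
                  [l21,         np.exp(l22)]])
    cov = L @ L.T                         # symmetric positive definite by construction
    evals, evecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    major, minor = np.sqrt(evals[1]), np.sqrt(evals[0])
    angle = np.arctan2(evecs[1, 1], evecs[0, 1])  # orientation of the major axis
    return cov, (major, minor), angle

# Example: arbitrary raw head outputs.
cov, axes, theta = cholesky_head_to_gaussian([0.3, 0.1, -0.2])
print(cov, axes, theta)
```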
Poster
Biplab Das · Viswanath Gopalakrishnan

[ ExHall D ]

Abstract
In this work, we introduce Camouflage Anything, a novel and robust approach to generating camouflaged datasets. To the best of our knowledge, we are the first to apply Controlled Out-painting and Representation Engineering (CO + RE) for generating realistic camouflaged images, with the objective of hiding any segmented object coming from a generic or salient database. Our proposed method uses a novel control design to out-paint a given segmented object with a camouflaged background. We also uncover the role of representation engineering in enhancing the quality of generated camouflage datasets. We address the limitations of the existing metrics FID and KID in capturing 'camouflage quality' by proposing a novel metric, CamOT. CamOT uses Optimal Transport between the foreground and background (boundary) Gaussian Mixture Models (GMMs) of the camouflaged object in question to assign an image quality score. Furthermore, we conduct LoRA-based fine-tuning of the robust BiRefNet baseline with our generated camouflaged datasets, leading to notable improvements in camouflaged object segmentation accuracy. The experimental results showcase the efficacy and potential of Camouflage Anything, outperforming existing methods in camouflage generation tasks.
Poster
Datao Tang · Xiangyong Cao · Xuan Wu · Jialin Li · Jing Yao · Xueru Bai · Dongsheng Jiang · Yin Li · Deyu Meng

[ ExHall D ]

Abstract
Remote sensing image object detection (RSIOD) aims to identify and locate specific objects within satellite or aerial imagery. However, there is a scarcity of labeled data in current RSIOD datasets, which significantly limits the performance of current detection algorithms. Although existing techniques, e.g., data augmentation and semi-supervised learning, can mitigate this scarcity issue to some extent, they are heavily dependent on high-quality labeled data and perform worse on rare object classes. To address this issue, this paper proposes a layout-controllable diffusion generative model (i.e., AeroGen) tailored for RSIOD. To our knowledge, AeroGen is the first model to simultaneously support horizontal and rotated bounding box condition generation, thus enabling the generation of high-quality synthetic images that meet specific layout and object category requirements. Additionally, we propose an end-to-end data augmentation framework that integrates a diversity-conditioned generator and a filtering mechanism to enhance both the diversity and quality of generated data. Experimental results demonstrate that the synthetic data produced by our method are of high quality and diversity. Furthermore, the synthetic RSIOD data can significantly improve the detection performance of existing RSIOD models, i.e., the mAP metrics on the DIOR, DIOR-R, and HRSC datasets are improved by 3.7%, 4.3%, and 2.43%, respectively.
Poster
Zhe Shan · Yang Liu · Lei Zhou · Cheng Yan · Heng Wang · Xia Xie

[ ExHall D ]

Abstract
The availability of large-scale remote sensing video data underscores the importance of high-quality interactive segmentation. However, challenges such as small object sizes, ambiguous features, and limited generalization make it difficult for current methods to achieve this goal. In this work, we propose RSVOS-SAM, a method designed to achieve high-quality interactive masks while preserving generalization across diverse remote sensing data. The RSVOS-SAM is built upon three key innovations: 1) LoRA-based fine-tuning, which enables efficient domain adaptation while maintaining SAM’s generalization ability, 2) Enhancement of deep network layers to improve the discriminability of extracted features, thereby reducing misclassifications, and 3) Integration of global context with local boundary details in the mask decoder to generate high-quality segmentation masks. Additionally, we redesign the data pipeline to ensure the model learns to better handle objects at varying scales during training while focusing on high-quality predictions during inference. Experiments on remote sensing video datasets show that the redesigned data pipeline boosts the IoU by 6%, while RSVOS-SAM increases the IoU by 13%. Finally, when evaluated on existing remote sensing object tracking datasets, RSVOS-SAM demonstrates impressive zero-shot capabilities, generating masks that closely resemble manual annotations. These results confirm RSVOS-SAM as a powerful tool for fine-grained segmentation in …
Poster
Phuc Nguyen · Minh Luu · Anh Tran · Cuong Pham · Khoi Nguyen

[ ExHall D ]

Abstract
Existing 3D instance segmentation methods frequently encounter issues with over-segmentation, leading to redundant and inaccurate 3D proposals that complicate downstream tasks. This challenge arises from their unsupervised merging approach, where dense 2D instance masks are lifted across frames into point clouds to form 3D candidate proposals without direct supervision. These candidates are then hierarchically merged based on heuristic criteria, often resulting in numerous redundant segments that fail to combine into precise 3D proposals. To overcome these limitations, we propose a 3D-Aware 2D Mask Tracking module that uses robust 3D priors from a 2D mask segmentation and tracking foundation model (SAM-2) to ensure consistent object masks across video frames. Rather than merging all visible superpoints across views to create a 3D mask, our 3D Mask Optimization module leverages a dynamic programming algorithm to select an optimal set of views, refining the superpoints to produce a final 3D proposal for each object. Our approach achieves comprehensive object coverage within the scene while reducing unnecessary proposals, which could otherwise impair downstream applications. Evaluations on ScanNet200 and ScanNet++ confirm the effectiveness of our method, with improvements across Class-Agnostic, Open-Vocabulary, and Open-Ended 3D Instance Segmentation tasks.
Poster
Joey Wilson · Marcelino M. de Almeida · Sachit Mahajan · Martin Labrie · Maani Ghaffari · Omid Ghasemalizadeh · Min Sun · Cheng-Hao Kuo · Arnab Sen

[ ExHall D ]

Abstract
In this paper, we present a novel algorithm for quantifying uncertainty and information gained within 3D Gaussian Splatting (3D-GS) through P-Optimality. While 3D-GS has proven to be a useful world model with high-quality rasterizations, it does not natively quantify uncertainty. Quantifying uncertainty in parameters of 3D-GS is necessary to understand the information gained from acquiring new images as in active perception, or identify redundant images which can be removed from memory due to resource constraints in online 3D-GS SLAM. We propose to quantify uncertainty and information gain in 3D-GS by reformulating the problem through the lens of optimal experimental design, which is a classical solution to measuring information gain. By restructuring information quantification of 3D-GS through optimal experimental design, we arrive at multiple solutions, of which T-Optimality and D-Optimality perform the best quantitatively and qualitatively as measured on two popular datasets. Additionally, we propose a block diagonal approximation of the 3D-GS uncertainty, which provides a measure of correlation for computing more accurate information gain, at the expense of a greater computation cost.
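The two best-performing criteria reported above can be illustrated on a generic approximate Fisher information matrix; the Gauss-Newton approximation and the per-view Jacobians below are assumptions used only to show how T- and D-optimality would rank candidate views, not the paper's 3D-GS formulation.

```python
import numpy as np

def t_optimality(fisher: np.ndarray) -> float:
    """T-optimality: trace of the (approximate) Fisher information."""
    return float(np.trace(fisher))

def d_optimality(fisher: np.ndarray) -> float:
    """D-optimality: log-determinant of the (approximate) Fisher information
    (slogdet is used for numerical stability)."""
    sign, logdet = np.linalg.slogdet(fisher)
    return float(logdet) if sign > 0 else float("-inf")

# Example: score two candidate views by the information their Jacobians contribute.
rng = np.random.default_rng(0)
J_a, J_b = rng.normal(size=(50, 6)), rng.normal(size=(5, 6))
for name, J in [("view A", J_a), ("view B", J_b)]:
    F = J.T @ J + 1e-6 * np.eye(6)      # Gauss-Newton approximation of Fisher info
    print(name, t_optimality(F), d_optimality(F))
```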
Poster
Runsong Zhu · Shi Qiu · ZHENGZHE LIU · Ka-Hei Hui · Qianyi Wu · Pheng-Ann Heng · Chi-Wing Fu

[ ExHall D ]

Abstract
Lifting multi-view 2D instance segmentation to a radiance field has proven to be effective in enhancing 3D understanding. Existing methods for addressing multi-view inconsistency in 2D segmentations either use linear assignment for end-to-end lifting, resulting in inferior results, or adopt a two-stage solution, which is limited by complex pre- or post-processing. In this work, we design a new object-aware lifting approach to realize an end-to-end lifting pipeline based on the 3D Gaussian representation, such that we can jointly learn Gaussian-level point features and a global object-level codebook across multiple views. To start, we augment each Gaussian point with an additional Gaussian-level feature learned using a contrastive loss to encode instance information. Importantly, we introduce a learnable object-level codebook to account for individual objects in the scene for an explicit object-level understanding and associate the encoded object-level features with the Gaussian-level point features for segmentation predictions. Further, we formulate the association learning module and the noisy label filtering module for effective and robust codebook learning. We conduct experiments on three benchmarks: the LERF-Masked, Replica, and Messy Rooms datasets. Both qualitative and quantitative results show that our approach clearly outperforms existing methods in terms of segmentation quality and time efficiency.
Poster
Wenxuan Guo · Xiuwei Xu · Ziwei Wang · Jianjiang Feng · Jie Zhou · Jiwen Lu

[ ExHall D ]

Abstract
In this paper, we propose an efficient multi-level convolution architecture for 3D visual grounding. Conventional methods struggle to meet the requirements of real-time inference due to their two-stage or point-based architectures. Inspired by the success of multi-level fully sparse convolutional architectures in 3D object detection, we aim to build a new 3D visual grounding framework following this technical route. However, since in the 3D visual grounding task the 3D scene representation must interact deeply with text features, a sparse convolution-based architecture is inefficient for this interaction due to the large number of voxel features. To this end, we propose text-guided pruning (TGP) and completion-based addition (CBA) to deeply fuse the 3D scene representation and text features in an efficient way through gradual region pruning and target completion. Specifically, TGP iteratively sparsifies the 3D scene representation and thus efficiently interacts the voxel features with text features by cross-attention. To mitigate the effect of pruning on delicate geometric information, CBA adaptively fixes the over-pruned region by voxel completion with negligible computational overhead. Compared with previous single-stage methods, our method achieves top inference speed and surpasses the previous fastest method by 100% in FPS. Our method also achieves state-of-the-art accuracy even compared with two-stage methods, with …
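A rough sketch of the text-guided pruning idea, assuming voxel relevance is scored by scaled dot-product attention against text tokens and a fixed keep ratio per round; the real TGP operates inside a sparse-convolution network with learned cross-attention, which is not reproduced here.

```python
import numpy as np

def text_guided_prune(voxel_feats: np.ndarray,
                      text_feats: np.ndarray,
                      keep_ratio: float = 0.5) -> np.ndarray:
    """One pruning round: score each voxel by its maximum attention weight
    against the text tokens and keep only the top fraction, so later
    cross-attention runs on far fewer voxel features."""
    scores = voxel_feats @ text_feats.T / np.sqrt(voxel_feats.shape[1])
    relevance = scores.max(axis=1)            # best-matching text token per voxel
    k = max(1, int(len(relevance) * keep_ratio))
    keep = np.argsort(-relevance)[:k]
    return voxel_feats[keep]

# Example: three pruning rounds shrink 8000 voxel features to 1000.
rng = np.random.default_rng(0)
voxels, text = rng.normal(size=(8000, 128)), rng.normal(size=(16, 128))
for _ in range(3):
    voxels = text_guided_prune(voxels, text)
print(voxels.shape[0])  # 1000
```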
Poster
Yijie Tang · Jiazhao Zhang · Yuqing Lan · Yulan Guo · Dezun Dong · Chenyang Zhu · Kai Xu

[ ExHall D ]

Abstract
Online 3D open-vocabulary segmentation of a progressively reconstructed scene is both a critical and challenging task for embodied applications. With the success of visual foundation models (VFMs) in the image domain, leveraging 2D priors to address 3D online segmentation has become a prominent research focus. Since segmentation results provided by 2D priors often require spatial consistency to be lifted into the final 3D segmentation, an efficient method for identifying spatial overlap among 2D masks is essential; yet existing methods rarely achieve this in real time, mainly limiting their use to offline approaches. To address this, we propose an efficient method that lifts 2D masks generated by VFMs into a unified 3D instance using a hashing technique. By employing voxel hashing for efficient 3D scene querying, our approach reduces the time complexity of costly spatial overlap queries from O(n²) to O(n). Accurate spatial associations further enable 3D merging of 2D masks through simple similarity-based filtering in a zero-shot manner, making our approach more robust to incomplete and noisy data. Evaluated on the ScanNet and SceneNN benchmarks, our approach achieves state-of-the-art performance in online, open-vocabulary 3D instance segmentation with leading efficiency.
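The O(n²)-to-O(n) overlap query can be illustrated with a plain hash table over quantized voxel coordinates; the voxel size and the counting scheme below are assumptions for this sketch, not the paper's implementation.

```python
from collections import defaultdict
import numpy as np

def voxel_keys(points: np.ndarray, voxel_size: float):
    """Quantize 3D points to integer voxel coordinates usable as hash keys."""
    return map(tuple, np.floor(points / voxel_size).astype(np.int64))

def mask_overlap(points_a: np.ndarray, points_b: np.ndarray, voxel_size: float = 0.05) -> int:
    """Count voxels shared by two mask back-projections in O(n) by hashing
    voxel coordinates instead of comparing all point pairs."""
    table = defaultdict(int)
    for key in voxel_keys(points_a, voxel_size):
        table[key] += 1
    return sum(1 for key in set(voxel_keys(points_b, voxel_size)) if key in table)

# Example: two noisy point sets covering partially overlapping regions.
rng = np.random.default_rng(1)
a = rng.uniform(0.0, 1.0, size=(2000, 3))
b = rng.uniform(0.5, 1.5, size=(2000, 3))
print(mask_overlap(a, b))
```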
Poster
Muzhi Zhu · Yuzhuo Tian · Hao Chen · Chunluan Zhou · Qingpei Guo · Yang Liu · Ming Yang · Chunhua Shen

[ ExHall D ]

Abstract
While MLLMs have demonstrated adequate image understanding capabilities, they still struggle with pixel-level comprehension, limiting their practical applications. Current evaluation tasks like VQA and visual grounding remain too coarse to assess fine-grained pixel comprehension accurately. Though segmentation is foundational for pixel-level understanding, existing methods often require MLLMs to generate implicit tokens, decoded through external pixel decoders. This approach disrupts the MLLM’s text output space, potentially compromising language capabilities and reducing flexibility and extensibility, while failing to reflect the model’s intrinsic pixel-level understanding. Thus, we introduce the Human-Like Mask Annotation Task (HLMAT), a new paradigm where MLLMs mimic human annotators using interactive segmentation tools. Modeling segmentation as a multi-step Markov Decision Process, HLMAT enables MLLMs to iteratively generate text-based click points, achieving high-quality masks without architectural changes or implicit tokens. Through this setup, we develop SegAgent, a model fine-tuned on human-like annotation trajectories, which achieves performance comparable to SOTA methods and supports additional tasks like mask refinement and annotation filtering. HLMAT provides a protocol for assessing fine-grained pixel understanding in MLLMs and introduces a vision-centric, multi-step decision-making task that facilitates exploration of MLLMs’ visual reasoning abilities. Our adaptations of the policy improvement method StaR and PRM-guided tree search further enhance model robustness in …
Poster
Savya Khosla · Sethuraman T V · Alexander G. Schwing · Derek Hoiem

[ ExHall D ]

Abstract
We present RELOCATE, a simple training-free baseline designed to perform the challenging task of visual query localization in long videos. To eliminate the need for task-specific training and efficiently handle long videos, RELOCATE leverages a region-based representation derived from pretrained vision models. At a high level, it follows the classic object localization approach: (1) identify all objects in each video frame, (2) compare the objects with the given query and select the most similar ones, and (3) perform bidirectional tracking to get a spatio-temporal response. However, we propose some key enhancements to handle small objects, cluttered scenes, partial visibility, and varying appearances. Notably, we refine the selected objects for accurate localization and generate additional visual queries to capture visual variations. We evaluate RELOCATE on the challenging Ego4D Visual Query 2D Localization dataset, establishing a new baseline that outperforms prior task-specific methods by 49% (relative improvement) in spatio-temporal average precision.
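Step (2) of the pipeline above, comparing candidate regions with the visual query, reduces to a cosine-similarity ranking; the feature dimensions below are placeholders, and the frozen vision model that would produce the features is not included.

```python
import numpy as np

def select_matching_regions(query_feat: np.ndarray,
                            region_feats: np.ndarray,
                            top_k: int = 3):
    """Rank candidate region features by cosine similarity to the visual
    query feature and return the indices and scores of the top-k matches."""
    q = query_feat / np.linalg.norm(query_feat)
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    scores = r @ q
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]

# Example: 100 candidate regions with 256-d features from a frozen vision model.
rng = np.random.default_rng(2)
regions = rng.normal(size=(100, 256))
query = regions[42] + 0.1 * rng.normal(size=256)   # a perturbed copy of region 42
print(select_matching_regions(query, regions))      # region 42 should rank first
```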
Poster
Rong Li · Shijie Li · Lingdong Kong · Xulei Yang · Junwei Liang

[ ExHall D ]

Abstract
3D Visual Grounding (3DVG) aims to locate objects in 3D scenes based on textual descriptions, which is essential for applications like augmented reality and robotics. Traditional 3DVG approaches rely on annotated 3D datasets and predefined object categories, limiting scalability and adaptability. To overcome these limitations, we introduce SeeGround, a zero-shot 3DVG framework leveraging 2D Vision-Language Models (VLMs) trained on large-scale 2D data. SeeGround represents 3D scenes as a hybrid of query-aligned rendered images and spatially enriched text descriptions, bridging the gap between 3D data and 2D-VLM input formats. We propose two modules: the Perspective Adaptation Module, which dynamically selects viewpoints for query-relevant image rendering, and the Fusion Alignment Module, which integrates 2D images with 3D spatial descriptions to enhance object localization. Extensive experiments on ScanRefer and Nr3D demonstrate that our approach outperforms existing zero-shot methods by large margins. Notably, we exceed weakly supervised methods and rival some fully supervised ones, outperforming the previous SOTA by 7.7% on ScanRefer and 7.1% on Nr3D, showcasing its effectiveness in the complex 3DVG task.
Poster
Zhenyang Liu · Yikai Wang · Sixiao Zheng · Tongying Pan · Longfei Liang · Yanwei Fu · Xiangyang Xue

[ ExHall D ]

Abstract
Open-vocabulary 3D visual grounding and reasoning aim to localize objects in a scene based on implicit language descriptions, even when they are occluded. This ability is crucial for tasks such as vision-language navigation and autonomous robotics. However, current methods struggle because they rely heavily on fine-tuning with 3D annotations and mask proposals, which limits their ability to handle diverse semantics and common knowledge required for effective reasoning. To address this, we propose ReasonGrounder, an LVLM-guided framework that uses hierarchical 3D feature Gaussian fields for adaptive grouping based on physical scale, enabling open-vocabulary 3D grounding and reasoning. ReasonGrounder interprets implicit instructions using large vision-language models (LVLM) and localizes occluded objects through 3D Gaussian splatting. By incorporating 2D segmentation masks from the Segment Anything Model (SAM) and multi-view CLIP embeddings, ReasonGrounder selects Gaussian groups based on object scale, enabling accurate localization through both explicit and implicit language understanding, even in novel, occluded views. We also contribute ReasoningGD, a new dataset containing over 10K scenes and 2 million annotations for evaluating open-vocabulary 3D grounding and amodal perception under occlusion. Experiments show that ReasonGrounder significantly improves 3D grounding accuracy in real-world scenarios.
Poster
Ziyang Zhou · Pinghui Wang · Zi Liang · Haitao Bai · Ruofei Zhang

[ ExHall D ]

Abstract
The advancement of 3D understanding and representation is a crucial step for the next phase of autonomous driving, robotics, augmented and virtual reality, 3D gaming and 3D e-commerce products. However, existing research has primarily focused on point clouds to perceive 3D objects and scenes, overlooking the rich visual details offered by multi-view images, thereby limiting the potential of 3D semantic representation. This paper introduces OpenView, a novel representation method that integrates both point clouds and multi-view images to form a unified 3D representation. OpenView comprises a unique fusion framework, sequence-independent modeling, a cross-modal fusion encoder, and a progressive hard learning strategy. Our experiments demonstrate that OpenView outperforms the state-of-the-art by 11.5% and 5.5% on the R@1 metric for cross-modal retrieval and the Top-1 metric for zero-shot classification tasks, respectively. Furthermore, we showcase some applications of OpenView: 3D retrieval, 3D captioning and hierarchical data clustering, highlighting its generality in the field of 3D representation learning.
Poster
Austin Stone · Hagen Soltau · Robert Geirhos · Xi Yi · Ye Xia · Bingyi Cao · Kaifeng Chen · Abhijit Ogale · Jonathon Shlens

[ ExHall D ]

Abstract
Visual imagery does not consist of solitary objects, but instead reflects the composition of a multitude of fluid concepts. While there have been great advances in visual representation learning, such advances have focused on building better representations for a small number of discrete objects bereft of an understanding of how these objects are interacting. One can observe this limitation in representations learned through captions or contrastive learning -- where the learned model treats an image essentially as a bag of words. Several works have attempted to address this limitation through the development of bespoke learned architectures to directly address the shortcomings in compositional learning. In this work, we focus on simple, and scalable approaches. In particular, we demonstrate that by substantially improving weakly labeled data, i.e. captions, we can vastly improve the performance of standard contrastive learning approaches. Previous CLIP models achieved near chance rate on challenging tasks probing compositional learning. However, our simple approach boosts performance of CLIP substantially and surpasses all bespoke architectures. Furthermore, we showcase our results on a relatively new captioning benchmark derived from DOCCI. We demonstrate through a series of ablations that a standard CLIP model trained with enhanced data may demonstrate impressive performance on …
Poster
Keyu Guo · Yongle Huang · Shijie Sun · Xiangyu Song · Mingtao Feng · Zedong Liu · Huansheng Song · Tiantian Wang · Jianxin Li · Naveed Akhtar · Ajmal Mian

[ ExHall D ]

Abstract
Language and binocular vision play a crucial role in human understanding of the world. Advancements in artificial intelligence have also made it possible for machines to develop 3D perception capabilities essential for high-level scene understanding. However, only monocular cameras are often available in practice due to cost and space constraints. Enabling machines to achieve accurate 3D understanding from a monocular view is practical but presents significant challenges. We introduce MonoMulti-3DVG, a novel task aimed at achieving multi-object 3D Visual Grounding (3DVG) based on monocular RGB images, allowing machines to better understand and interact with the 3D world. To this end, we construct a large-scale benchmark dataset, MonoMulti3D-ROPE, and propose a model, CyclopsNet, that integrates a State-Prompt Visual Encoder (SPVE) module with a Denoising Alignment Fusion (DAF) module to achieve robust multi-modal semantic alignment and fusion. This leads to more stable and robust multi-modal joint representations for downstream tasks. Experimental results show that our method significantly outperforms existing techniques on the MonoMulti3D-ROPE dataset. Our dataset and code are available in the Supplementary Material to ease reproducibility.
Poster
Hongyan Zhi · Peihao Chen · Junyan Li · Shuailei Ma · Xinyu Sun · Tianhang Xiang · Yinjie Lei · Mingkui Tan · Chuang Gan

[ ExHall D ]

Abstract
Research on 3D Vision-Language Models (3D-VLMs) is gaining increasing attention, as they are crucial for developing embodied AI within 3D scenes, such as visual navigation and embodied question answering. Due to the high density of visual features, especially in large 3D scenes, accurately locating task-relevant visual information is challenging. Existing works attempt to segment all objects and consider their features as scene representations. However, these task-agnostic object features include much redundant information and miss details of the task-relevant area. To tackle these problems, we propose LSceneLLM, an adaptive framework that automatically identifies task-relevant areas by leveraging the LLM's visual preference for different tasks, followed by a plug-and-play scene magnifier module to capture fine-grained details in focused areas. Specifically, a dense token selector examines the attention map of the LLM to identify visual preferences for the instruction input. It then magnifies fine-grained details of the focused area. An adaptive self-attention module is leveraged to fuse the coarse-grained and selected fine-grained visual information. To comprehensively evaluate the large scene understanding ability of 3D-VLMs, we further introduce a cross-room understanding benchmark, XR-Scene, which contains a series of large scene understanding tasks including XR-QA, XR-EmbodiedPlanning, and XR-SceneCaption. Experiments show that our method surpasses existing methods on both large …
Poster
Jiajun Deng · Tianyu He · Li Jiang · Tianyu Wang · Feras Dayoub · Ian Reid

[ ExHall D ]

Abstract
Current 3D Large Multimodal Models (3D LMMs) have shown tremendous potential in 3D-vision-based dialogue and reasoning. However, how to further enhance 3D LMMs to achieve fine-grained scene understanding and facilitate flexible human-agent interaction remains a challenging problem. In this work, we introduce 3D-LLaVA, a simple yet highly powerful 3D LMM designed to act as an intelligent assistant in comprehending, reasoning, and interacting with the 3D world. Unlike existing top-performing methods that rely on complicated pipelines, such as offline multi-view feature extraction or additional task-specific heads, 3D-LLaVA adopts a minimalist design with an integrated architecture and only takes point clouds as input. At the core of 3D-LLaVA is a new Omni Superpoint Transformer (OST), which integrates three functionalities: (1) a visual feature selector that converts and selects visual tokens, (2) a visual prompt encoder that embeds interactive visual prompts into the visual token space, and (3) a referring mask decoder that produces 3D masks based on text description. This versatile OST is empowered by hybrid pretraining to obtain perception priors and leveraged as the visual connector that bridges the 3D data to the LLM. After performing unified instruction tuning, our 3D-LLaVA reports impressive results on various benchmarks. The code and model will be …
Poster
Benlin Liu · Yuhao Dong · Yiqin Wang · Zixian Ma · Yansong Tang · Luming Tang · Yongming Rao · Wei-Chiu Ma · Ranjay Krishna

[ ExHall D ]

Abstract
Multimodal language models (MLLMs) are increasingly being applied in real-world environments, necessitating their ability to interpret 3D spaces and comprehend temporal dynamics. Current methods often rely on specialized architectural designs or task-specific fine-tuning to achieve this. We introduce Coarse Correspondences, a simple lightweight method that enhances MLLMs’ spatial-temporal reasoning with 2D images as input, without modifying the architecture or requiring task-specific fine-tuning. Our method uses a lightweight tracking model to identify primary object correspondences between frames in a video or across different image viewpoints, and then conveys this information to MLLMs through visual prompting. We demonstrate that this simple training-free approach brings substantial gains to GPT4-V/O consistently on four benchmarks that require spatial-temporal reasoning, including +20.5% improvement on ScanQA, +9.7% on OpenEQA’s episodic memory subset, +6.0% on the long-form video benchmark EgoSchema, and +11% on the R2R navigation benchmark. Additionally, we show that Coarse Correspondences can also enhance open-source MLLMs’ spatial reasoning (by +6.9% on ScanQA) when applied in both training and inference and that the improvement can generalize to unseen datasets such as SQA3D (+3.1%). Taken together, we show that Coarse Correspondences effectively and efficiently boosts models’ performance on downstream tasks requiring spatial-temporal reasoning.
Poster
Efstathios Karypidis · Ioannis Kakogeorgiou · Spyros Gidaris · Nikos Komodakis

[ ExHall D ]

Abstract
Semantic future prediction is important for autonomous systems navigating dynamic environments. This paper introduces FUTURIST, a method for multimodal future semantic prediction that uses a unified and efficient visual sequence transformer architecture. Our approach incorporates a multimodal masked visual modeling objective and a novel masking mechanism designed for multimodal training. This allows the model to effectively integrate visible information from various modalities, improving prediction accuracy. Additionally, we propose a VAE-free hierarchical tokenization process, which reduces computational complexity, streamlines the training pipeline, and enables end-to-end training with high-resolution, multimodal inputs. We validate FUTURIST on the Cityscapes dataset, demonstrating state-of-the-art performance in future semantic segmentation for both short- and mid-term forecasting.
Poster
Weiming Ren · Huan Yang · Jie Min · Cong Wei · Wenhu Chen

[ ExHall D ]

Abstract
Current large multimodal models (LMMs) face significant challenges in processing and comprehending long-duration or high-resolution videos, which is mainly due to the lack of high-quality datasets. To address this issue from a data-centric perspective, we propose VISTA, a simple yet effective video spatiotemporal augmentation framework that synthesizes long-duration and high-resolution video instruction-following pairs from existing video-caption datasets. VISTA spatially and temporally combines videos to create new synthetic videos with extended durations and enhanced resolutions, and subsequently produces question-answer pairs pertaining to these newly synthesized videos. Based on this paradigm, we develop seven video augmentation methods and curate VISTA-400K, a video instruction-following dataset aimed at enhancing long-duration and high-resolution video understanding. Finetuning various video LMMs on our data resulted in an average improvement of 3.3% across four challenging benchmarks for long-video understanding. Furthermore, we introduce the first comprehensive high-resolution video understanding benchmark HRVideoBench, on which our finetuned models achieve a 6.5% performance gain. These results highlight the effectiveness of our framework. Our dataset, code and finetuned models will be released.
Poster
Haoqiang Kang · Enna Sachdeva · Piyush Gupta · Sangjae Bae · Kwonjoon Lee

[ ExHall D ]

Abstract
Vision-Language Models (VLMs) have recently shown promising advancements in sequential decision-making tasks through task-specific fine-tuning. However, common fine-tuning methods, such as Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) techniques like Proximal Policy Optimization (PPO), present notable limitations: SFT assumes Independent and Identically Distributed (IID) data, while PPO focuses on maximizing cumulative rewards. These limitations often restrict solution diversity and hinder generalization in multi-step reasoning tasks. To address these challenges, we introduce GFlowVLM, a novel framework that fine-tunes VLMs using Generative Flow Networks (GFlowNets) to promote the generation of diverse solutions for complex reasoning tasks. GFlowVLM models the environment as a non-Markovian decision process, allowing it to capture long-term dependencies essential for real-world applications. It takes observations and task descriptions as inputs to prompt chain-of-thought (CoT) reasoning, which subsequently guides action selection. We use task-based rewards to fine-tune VLMs with GFlowNets. This approach enables VLMs to outperform prior fine-tuning methods, including SFT and RL. Empirical results demonstrate the effectiveness of GFlowVLM on complex tasks such as card games (NumberLine, BlackJack) and embodied planning tasks (ALFWorld), showing enhanced training efficiency, solution diversity, and stronger generalization capabilities across both in-distribution and out-of-distribution scenarios.
Poster
Cheng Chen · Yunpeng Zhai · Yifan Zhao · Jinyang Gao · Bolin Ding · Jia Li

[ ExHall D ]

Abstract
In-context learning (ICL), a predominant trend in instruction learning, aims at enhancing the performance of large language models by providing clear task guidance and examples, improving their capability in task understanding and execution. This paper investigates ICL on Large Vision-Language Models (LVLMs) and explores the policies of multi-modal demonstration selection. Existing research efforts in ICL face significant challenges: First, they rely on pre-defined demonstrations or heuristic selecting strategies based on human intuition, which are usually inadequate for covering diverse task requirements, leading to sub-optimal solutions; Second, individually selecting each demonstration fails to model the interactions between them, resulting in information redundancy. Unlike these prevailing efforts, we propose a new exploration-exploitation reinforcement learning framework, which explores policies to fuse multi-modal information and adaptively select adequate demonstrations as an integrated whole. The framework allows LVLMs to optimize themselves by continually refining their demonstrations through self-exploration, enabling the ability to autonomously identify and generate the most effective selection policies for in-context learning. Experimental results verify that our approach achieves significant performance improvements on four Visual Question-Answering (VQA) datasets, demonstrating its effectiveness in enhancing the generalization capability of few-shot LVLMs. The code will be publicly available.
Poster
Mahtab Bigverdi · Zelun Luo · Cheng-Yu Hsieh · Ethan Shen · Dongping Chen · Linda Shapiro · Ranjay Krishna

[ ExHall D ]

Abstract
Multimodal language models (MLMs) continue to struggle with fundamental visual perception tasks that specialist models excel at: tasks that require reasoning over 3D benefit from depth estimation; similarly, reasoning over 2D object instances benefits from object detection. Yet, MLMs cannot produce intermediate depth or boxes to reason over. Finetuning MLMs on relevant data does not generalize well, and outsourcing computation to specialized vision tools is too compute-intensive and memory-inefficient. To address this, we introduce Perception Tokens, intrinsic image representations designed to assist reasoning tasks where language is insufficient. Perception tokens act as auxiliary reasoning tokens, akin to chain-of-thought prompts in language models. For example, in a depth-related task, an MLM augmented with perception tokens can reason by generating a depth map as tokens, enabling it to solve the problem effectively. We propose Aurora, a training method that augments MLMs with perception tokens for improved reasoning over visual inputs. Aurora employs a VQVAE to transform intermediate image representations, such as depth maps or bounding boxes, into a tokenized format, which is then used in a multi-task training framework. Aurora achieves notable improvements across counting benchmarks: +10.8% on BLINK, +11.3% on CVBench, and +8.3% on SEED-Bench, outperforming finetuning approaches in generalization across datasets. It also improves …
Poster
Akhil Perincherry · Jacob Krantz · Stefan Lee

[ ExHall D ]

Abstract
Vision-and-Language Navigation (VLN) agents are tasked with navigating an unseen environment using natural language instructions. In this work, we study if visual representations of sub-goals implied by the instructions can serve as navigational cues and lead to increased navigation performance. To synthesize these visual representations or "imaginations", we leverage a text-to-image diffusion model on landmark references contained in segmented instructions. These imaginations are provided to VLN agents as an added modality to act as landmark cues, and an auxiliary loss is added to explicitly encourage relating these with their corresponding referring expressions. Our findings reveal an increase in success rate (SR) of 1 point and up to 0.5 points in success scaled by inverse path length (SPL) across agents. These results suggest that the proposed approach reinforces visual understanding compared to relying on language instructions alone.
Poster
Fan Yang · Ru Zhen · Jianing Wang · Yanhao Zhang · Haoxiang Chen · Haonan Lu · Sicheng Zhao · Guiguang Ding

[ ExHall D ]

Abstract
AIGC images are prevalent across various fields, yet they frequently suffer from quality issues like artifacts and unnatural textures. Specialized models aim to predict defect region heatmaps but face two primary challenges: (1) lack of explainability, failing to provide reasons and analyses for subtle defects, and (2) inability to leverage common sense and logical reasoning, leading to poor generalization. Multimodal large language models (MLLMs) promise better comprehension and reasoning but face their own challenges: (1) difficulty in fine-grained defect localization due to the limitations in capturing tiny details; and (2) constraints in providing pixel-wise outputs necessary for precise heatmap generation. To address these challenges, we propose HEIE: a novel MLLM-Based Hierarchical Explainable image Implausibility Evaluator. We introduce the CoT-Driven Explainable Trinity Evaluator, which integrates heatmaps, scores, and explanation outputs, using CoT to decompose complex tasks into subtasks of increasing difficulty and enhance interpretability. Our Adaptive Hierarchical Implausibility Mapper synergizes low-level image features with high-level mapper tokens from LLMs, enabling precise local-to-global hierarchical heatmap predictions through an uncertainty-based adaptive token approach. Moreover, we propose a new dataset: Expl-AIGI-Eval, designed to facilitate interpretable implausibility evaluation of AIGC images. Our method demonstrates state-of-the-art performance through extensive experiments. Our dataset and code are included in the …
Poster
Ailin Deng · Tri Cao · Zhirui Chen · Bryan Hooi

[ ExHall D ]

Abstract
Vision-Language Models (VLMs) excel in integrating visual and textual information for vision-centric tasks, but their handling of inconsistencies between modalities is underexplored. We investigate VLMs' modality preferences when faced with visual data and varied textual inputs in vision-centered settings. By introducing textual variations to four vision-centric tasks and evaluating ten Vision-Language Models (VLMs), we discover a "blind faith in text" phenomenon: VLMs disproportionately trust textual data over visual data when inconsistencies arise, leading to significant performance drops under corrupted text and raising safety concerns. We analyze factors influencing this text bias, including instruction prompts, language model size, text relevance, token order, and the interplay between visual and textual certainty. While certain factors, such as scaling up the language model size, slightly mitigate text bias, others like token order can exacerbate it due to positional biases inherited from language models. To address this issue, we explore supervised fine-tuning with text augmentation and demonstrate its effectiveness in reducing text bias. Additionally, we provide a theoretical analysis suggesting that the blind faith in text phenomenon may stem from an imbalance of pure text and multi-modal data during training. Our findings highlight the need for balanced training and careful consideration of modality interactions in VLMs to enhance …
Poster
Christopher Chou · Lisa Dunlap · Wei-Lin Chiang · Koki Mashita · Krishna Mandal · Trevor Darrell · Ion Stoica · Joseph Gonzalez

[ ExHall D ]

Abstract
The growing adoption and capabilities of vision-language models (VLMs) demand benchmarks that reflect real-world user interactions. We introduce VisionArena, the largest existing dataset of crowdsourced real-world conversations between users and VLMs. While most visual question-answering datasets focus on close-ended problems or synthetic scenarios, VisionArena contains a wide variety of closed- and open-ended problems across 230K conversations, 73K unique users, 138 languages, and 45 VLMs. VisionArena consists of VisionArena-Chat, a set of 200K single-turn and multi-turn chat logs with a VLM; VisionArena-Battle, a set of 30K conversations between a user and 2 anonymous VLMs with preference votes; and VisionArena-Bench, an automatic benchmark consisting of 500 diverse user prompts which can be used to cheaply approximate model rankings. We analyze these datasets and highlight the types of questions asked by users, the influence of style on user preference, and areas where models often fall short. We find that more open-ended questions like captioning and humor are heavily influenced by style, which causes certain models like Reka Flash and InternVL, which are tuned for style, to perform significantly better on these categories compared to other categories. We show that VisionArena-Chat and VisionArena-Battle can be used for post-training to align VLMs to human preferences …
Poster
Wen Yin · Yong Wang · Guiduo Duan · Dongyang Zhang · XIN Hu · Yuan-Fang Li · Tao He

[ ExHall D ]

Abstract
Visual Emotion Recognition (VER) is a critical yet challenging task aimed at inferring the emotional states of individuals based on visual cues. However, recent approaches predominantly focus on single domains, e.g., realistic images or stickers, limiting VER models' cross-domain generalizability. To address this limitation, we introduce an Unsupervised Cross-Domain Visual Emotion Recognition (UCDVER) task, which aims to generalize visual emotion recognition from the source domain (e.g., realistic images) to the low-resource target domain (e.g., stickers) in an unsupervised manner. Compared to conventional unsupervised domain adaptation problems, UCDVER presents two key challenges: significant emotional expression variability and an affective distribution shift. To mitigate these issues, we propose the Knowledge-aligned Counterfactual-enhancement Diffusion Perception (KCDP) framework for UCDVER. Specifically, KCDP first leverages a vision-language model to align emotional representations in a shared knowledge space and guides diffusion models for improved visual affective perception. Furthermore, a Counterfactual-Enhanced Language-image Emotional Alignment (CLIEA) method generates high-quality pseudo-labels for the target domain. Extensive experiments demonstrate that our approach surpasses state-of-the-art models in both perceptibility and generalization, e.g., gaining a 12% improvement over the SOTA VER model TGCA-PVT.
Poster
Yangyu Huang · Tianyi Gao · Haoran Xu · Qihao Zhao · Yang Song · Zhipeng Gui · Tengchao Lv · Hao Chen · Lei Cui · Scarlett Li · Furu Wei

[ ExHall D ]

Abstract
Geologic maps, as fundamental diagrams in geology, provide critical insights into the structure and composition of Earth's subsurface and surface. These maps are indispensable in various fields, including disaster detection, resource exploration, and civil engineering. Despite their significance, current Multimodal Large Language Models (MLLMs) often fall short in geologic map understanding. This gap is primarily due to the challenging nature of cartographic generalization, which involves handling high-resolution maps, managing multiple associated components, and requiring domain-specific knowledge. To quantify this gap, we construct GeoMap-Bench, the first-ever benchmark for evaluating MLLMs in geologic map understanding, which assesses the full-scale abilities in extracting, referring, grounding, reasoning, and analyzing. To bridge this gap, we introduce GeoMap-Agent, the inaugural agent designed for geologic map understanding, which features three modules: Hierarchical Information Extraction (HIE), Domain Knowledge Injection (DKI), and Prompt-enhanced Question Answering (PEQA). Inspired by the interdisciplinary collaboration among human scientists, an AI expert group acts as consultants, utilizing a diverse tool pool to comprehensively analyze questions. Through comprehensive experiments, GeoMap-Agent achieves an overall score of 0.811 on GeoMap-Bench, significantly outperforming the 0.369 of GPT-4o. Our work, emPowering gEologic mAp holistiC undErstanding (PEACE) with MLLMs, paves the way for advanced AI applications in geology, enhancing the efficiency and accuracy of geological investigations.
Poster
Chengyue Huang · Brisa Maneechotesuwan · Shivang Chopra · Zsolt Kira

[ ExHall D ]

Abstract
Visual question answering (VQA) systems face significant challenges when adapting to real-world data shifts, especially in multi-modal contexts. While robust fine-tuning strategies are essential for maintaining performance across in-distribution (ID) and out-of-distribution (OOD) scenarios, current evaluation settings are primarily unimodal or specific to particular types of OOD shift, offering limited insight into the complexities of multi-modal contexts. In this work, we propose a new benchmark, FRAMES-VQA (Fine-Tuning Robustness Across Multi-Modal Shifts in VQA), for evaluating robust fine-tuning for VQA tasks. We utilize ten existing VQA benchmarks, including VQAv2, IV-VQA, VQA-CP, OK-VQA, and others, and categorize them into ID, near-OOD, and far-OOD datasets covering uni-modal, multi-modal, and adversarial distribution shifts. We first conduct a comprehensive comparison of existing robust fine-tuning methods. We then quantify the distribution shifts by calculating the Mahalanobis distance using uni-modal and multi-modal embeddings extracted from various models. Further, we perform an extensive analysis to explore the interactions between uni- and multi-modal shifts as well as modality importance for ID and OOD samples. These analyses offer valuable guidance on developing more robust fine-tuning methods to handle multi-modal distribution shifts. We will release our code to support future work in this area.
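As a reading aid for the shift-quantification step above, here is a minimal sketch of measuring distribution shift with the Mahalanobis distance over pooled embeddings; the `mahalanobis_shift` helper, the use of a pseudo-inverse, and the averaging over samples are illustrative assumptions rather than the benchmark's exact protocol.

```python
# Sketch: quantify how far OOD embeddings drift from an ID Gaussian fit.
import numpy as np

def mahalanobis_shift(id_embeddings: np.ndarray, ood_embeddings: np.ndarray) -> float:
    """Mean Mahalanobis distance of OOD embeddings w.r.t. a Gaussian fit on ID embeddings."""
    mu = id_embeddings.mean(axis=0)                     # (D,) ID mean
    cov = np.cov(id_embeddings, rowvar=False)           # (D, D) ID covariance
    cov_inv = np.linalg.pinv(cov)                       # pseudo-inverse for stability
    diff = ood_embeddings - mu                          # (N, D)
    d2 = np.einsum("nd,dk,nk->n", diff, cov_inv, diff)  # squared distances
    return float(np.sqrt(np.clip(d2, 0.0, None)).mean())

# Hypothetical usage: larger values indicate a stronger shift from the ID data.
# id_feats, ood_feats = encode(id_split), encode(ood_split)   # `encode` is assumed
# print(mahalanobis_shift(id_feats, ood_feats))
```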
Poster
Miran Heo · Min-Hung Chen · De-An Huang · Sifei Liu · Subhashree Radhakrishnan · Seon Joo Kim · Yu-Chiang Frank Wang · Ryo Hachiuma

[ ExHall D ]

Abstract
We present Omni-RGPT, a multimodal large language model designed to facilitate region-level comprehension for both images and videos. To achieve consistent region representation across spatio-temporal dimensions, we introduce Token Mark, a set of discretized tokens highlighting the target regions within the visual feature space. These tokens are directly embedded into spatial regions using region prompts (e.g., boxes or masks) and simultaneously incorporated into the text prompt to specify the target, establishing a direct connection between visual and language tokens. To further support robust video understanding without requiring tracklets, we introduce an auxiliary task that guides Token Mark by leveraging the consistency of the tokens, enabling stable region interpretation across video sequences. Additionally, we introduce RegVID-300k, a large-scale region-level video instruction dataset curated from diverse public video sources. Omni-RGPT achieves state-of-the-art results on visual commonsense reasoning benchmarks in image-based (VCR) and video-based (Causal-VidQA) tasks while also demonstrating strong performance in captioning and Referring Expression Comprehension (REC) tasks.
Poster
Wenbo Chen · Zhen Xu · Ruotao Xu · Si Wu · Hau San Wong

[ ExHall D ]

Abstract
The goal of visual grounding is to establish connections between target objects and textual descriptions. Large Language Models (LLMs) have demonstrated strong comprehension abilities across a variety of visual tasks. To establish precise associations between the text and the corresponding visual region, we propose a Task-aware Cross-modal feature Refinement Transformer with LLMs for visual grounding, referred to as TCRT. To enable the LLM, trained solely on text, to understand images, we introduce an LLM adaptation module that extracts text-related visual features to bridge the domain discrepancy between the textual and visual modalities. We feed the text and visual features into the LLM to obtain task-aware priors. To enable these priors to guide the feature fusion process, we further incorporate a cross-modal feature fusion module, which allows task-aware embeddings to refine visual features and facilitates information interaction between the Referring Expression Comprehension (REC) and Referring Expression Segmentation (RES) tasks. We have performed extensive experiments to verify the effectiveness of the main components and demonstrate the superior performance of the proposed TCRT over state-of-the-art end-to-end visual grounding methods on RefCOCO, RefCOCOg, RefCOCO+ and ReferItGame.
Poster
Yue Han · Jiangning Zhang · Junwei Zhu · Runze Hou · Xiaozhong Ji · Chuming Lin · Xiaobin Hu · Xuezhucun Xue · Yong Liu

[ ExHall D ]

Abstract
Multimodal Large Language Models (MLLMs) have shown remarkable performance in image understanding, generation, and editing, with recent advancements achieving pixel-level grounding with reasoning. However, these models, designed for common objects, struggle with fine-grained face understanding. In this work, we introduce the \textbf{\textit{FacePlayGround-240K}} dataset, the first large-scale, pixel-grounded face caption and question-answer (QA) dataset, meticulously curated for alignment pretraining and instruction-tuning. We present the \textbf{\textit{GroundingFace}} framework, specifically designed to enhance fine-grained face understanding. This framework significantly augments the capabilities of existing grounding models in face part segmentation and face attribute comprehension, while preserving general scene understanding. Comprehensive experiments validate that our approach surpasses current state-of-the-art models in pixel-grounded face captioning/QA and various downstream tasks, including face captioning, referring segmentation, and zero-shot face attribute recognition.
Poster
Yang Bai · Yucheng Ji · Min Cao · Jinqiao Wang · Mang Ye

[ ExHall D ]

Abstract
Traditional text-based person retrieval (TPR) relies on a single-shot text query to retrieve the target person, assuming that the query completely captures the user's search intent. However, in real-world scenarios, it can be challenging to ensure the information completeness of such single-shot text. To address this limitation, we propose chat-based person retrieval (ChatPR), a new paradigm that takes an interactive dialogue as the query, engaging the user in conversational context to progressively refine the query for accurate person retrieval. The primary challenge in ChatPR is the lack of available dialogue-image paired data. To overcome this challenge, we establish ChatPedes, the first dataset designed for ChatPR, which is constructed by leveraging large language models to automate question generation and simulate user responses. Additionally, to bridge the modality gap between dialogues and images, we propose a dialogue-refined cross-modal alignment (DiaNA) framework, which leverages two adaptive attribute refiners to bottleneck the conversational and visual information for fine-grained cross-modal alignment. Moreover, we propose a dialogue-specific data augmentation strategy, random round retaining, to further enhance the model's generalization ability across varying dialogue lengths. Extensive experiments demonstrate that DiaNA significantly outperforms existing TPR approaches, highlighting the effectiveness of conversational interactions …
Poster
ruotian peng · Haiying He · Yake Wei · Yandong Wen · Di Hu

[ ExHall D ]

Abstract
High-quality image captions play a crucial role in improving the performance of cross-modal applications such as text-to-image generation, text-to-video generation, and text-image retrieval. To generate long-form, high-quality captions, many recent studies have employed multimodal large language models (MLLMs). However, current MLLMs often produce captions that lack fine-grained details or suffer from hallucinations, a challenge that persists in both open-source and closed-source models. Inspired by Feature-Integration theory, which suggests that attention must focus on specific regions to integrate visual information effectively, we propose a divide-then-aggregate strategy. Our method first divides the image into semantic and spatial patches to extract fine-grained details, enhancing the model's local perception of the image. These local details are then hierarchically aggregated to generate a comprehensive global description. To address hallucinations and inconsistencies in the generated captions, we apply a semantic-level filtering process during hierarchical aggregation. This training-free pipeline can be applied to both open-source models (LLaVA-1.5, LLaVA-1.6, Mini-Gemini) and closed-source models (Claude-3.5-Sonnet, GPT-4o, GLM-4V-Plus). Extensive experiments demonstrate that our method generates more detailed, reliable captions, advancing multimodal description generation without requiring model retraining.
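To make the divide-then-aggregate strategy concrete, the sketch below splits an image into a grid of patches, captions each patch and the full image, and aggregates the results; `caption` and `summarize` are hypothetical stand-ins for MLLM/LLM calls and this is not the paper's exact pipeline, which also performs semantic patching and semantic-level filtering.

```python
# Sketch: training-free divide-then-aggregate captioning over a simple spatial grid.
from PIL import Image

def divide_then_aggregate(image: Image.Image, caption, summarize, grid: int = 2) -> str:
    """`caption(img) -> str` and `summarize(global_cap, local_caps) -> str` are assumed callables."""
    w, h = image.size
    local_captions = []
    for i in range(grid):            # rows
        for j in range(grid):        # columns
            patch = image.crop((j * w // grid, i * h // grid,
                                (j + 1) * w // grid, (i + 1) * h // grid))
            local_captions.append(caption(patch))     # fine-grained local details
    global_caption = caption(image)                   # coarse scene-level description
    # Hierarchically aggregate local details into one comprehensive description.
    return summarize(global_caption, local_captions)
```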
Poster
Likai Tian · Jian Zhao · Zechao Hu · Zhengwei Yang · Hao Li · Lei Jin · Zheng Wang · Xuelong Li

[ ExHall D ]

Abstract
Composed Image Retrieval (CIR) is a multi-modal task that seeks to retrieve target images by harmonizing a reference image with a modified instruction. The main challenge in CIR lies in compositional conflicts between the reference image (e.g., blue, long sleeve) and the modified instruction (e.g., grey, short sleeve). Previous works attempt to mitigate such conflicts through feature-level manipulation, commonly employing learnable masks to obscure conflicting features within the reference image. However, the inherent complexity of feature spaces poses significant challenges to precise conflict neutralization, thereby leading to uncontrollable results. To this end, this paper proposes the Compositional Conflict Identification and Neutralization (CCIN) framework, which sequentially identifies and neutralizes compositional conflicts for effective CIR. Specifically, CCIN comprises two core modules: 1) a Compositional Conflict Identification module, which utilizes LLM-based analysis to identify specific conflicting attributes, and 2) a Compositional Conflict Neutralization module, which first generates a kept instruction to preserve non-conflicting attributes, then neutralizes conflicts under the collaborative guidance of both the kept and modified instructions. Furthermore, an orthogonal parameter regularization loss is introduced to emphasize the distinction between target and conflicting features. Extensive experiments demonstrate the superiority of CCIN over state-of-the-art methods.
Poster
You Li · Fan Ma · Yi Yang

[ ExHall D ]

Abstract
The Zero-shot Composed Image Retrieval (ZSCIR) task requires retrieving images that match the query image and the relative captions. Current methods focus on projecting the query image into the text feature space, subsequently combining it with features of the query texts for retrieval. However, retrieving images only with the text features cannot guarantee detailed alignment due to the natural gap between images and text. In this paper, we introduce Imagined Proxy for CIR, a training-free method that creates a proxy image aligned with the query image and text description, enhancing query representation in the retrieval process. We first leverage the large language model's generalization capability to generate an image layout, and then apply both the query text and image for conditional generation. The robust query features are enhanced by merging the proxy image, query image, and text semantic perturbation. Our newly proposed balancing metric integrates text-based and proxy retrieval similarities, allowing for more accurate retrieval of the target image while incorporating image-side information into the process. Experiments on three public datasets demonstrate that our method significantly improves retrieval performance. We achieve state-of-the-art (SOTA) results on the CIRR dataset with a Recall@K of 70.07 at K=10. Additionally, we achieved an improvement in Recall@10 on the FashionIQ dataset, rising …
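A minimal sketch of a balancing metric in the spirit described above, assuming cosine-similarity retrieval over precomputed embeddings; the mixing weight `alpha` and the encoder interfaces are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch: mix text-side and proxy-image-side similarities into one retrieval score.
import torch
import torch.nn.functional as F

def balanced_scores(text_query_feat: torch.Tensor,
                    proxy_image_feat: torch.Tensor,
                    gallery_feats: torch.Tensor,
                    alpha: float = 0.5) -> torch.Tensor:
    t = F.normalize(text_query_feat, dim=-1)     # (D,)  composed text-side query
    p = F.normalize(proxy_image_feat, dim=-1)    # (D,)  generated proxy image feature
    g = F.normalize(gallery_feats, dim=-1)       # (N, D) gallery image features
    return alpha * (g @ t) + (1 - alpha) * (g @ p)   # (N,) final retrieval scores
```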
Poster
Chuong Huynh · Jinyu Yang · Ashish Tawari · Mubarak Shah · Son Dinh Tran · Raffay Hamid · Trishul Chilimbi · Abhinav Shrivastava

[ ExHall D ]

Abstract
Composed Image Retrieval (CIR) is a complex task that aims to retrieve images based on a multimodal query. Typical training data consists of triplets containing a reference image, a textual description of desired modifications, and the target image, which are expensive and time-consuming to acquire. The scarcity of CIR datasets has led to zero-shot approaches utilizing synthetic triplets or leveraging vision-language models (VLMs) with ubiquitous web-crawled image-caption pairs. However, these methods have significant limitations: synthetic triplets suffer from limited scale, lack of diversity, and unnatural modification text, while image-caption pairs hinder joint embedding learning of the multimodal query due to the absence of triplet data. Moreover, existing approaches struggle with complex and nuanced modification texts that demand sophisticated fusion and understanding of vision and language modalities. We present CoLLM, a one-stop framework that effectively addresses these limitations. Our approach generates triplets on-the-fly from image-caption pairs, enabling supervised training without manual annotation. We leverage Large Language Models (LLMs) to generate joint embeddings of reference images and modification texts, facilitating deeper multimodal fusion. Additionally, we introduce Multi-Text CIR (MTCIR), a large-scale dataset comprising 3.4M samples, and refine existing CIR benchmarks (CIRR and Fashion-IQ) to enhance evaluation reliability. Experimental results demonstrate that CoLLM …
Poster
Zhenxing Zhang · Yaxiong Wang · Lechao Cheng · Zhun Zhong · Dan Guo · Meng Wang

[ ExHall D ]

Abstract
We present ASAP, a new framework for detecting and grounding multi-modal media manipulation (DGM4). Upon thorough examination, we observe that accurate fine-grained cross-modal semantic alignment between the image and text is vital for accurate manipulation detection and grounding. However, existing DGM4 methods pay little attention to cross-modal alignment, which limits further gains in detection accuracy. To remedy this issue, this work advances semantic alignment learning to promote the task. Particularly, we utilize off-the-shelf Multimodal Large Language Models (MLLMs) and Large Language Models (LLMs) to construct image-text pairs, especially for the manipulated instances. Subsequently, cross-modal alignment learning is performed to enhance the semantic alignment. Besides the explicit auxiliary clues, we further design a Manipulation-Guided Cross Attention (MGCA) mechanism to provide implicit guidance for augmenting manipulation perception. With the ground truth available during training, MGCA encourages the model to concentrate more on manipulated components while downplaying normal ones, enhancing the model's ability to capture manipulations. Extensive experiments are conducted on the DGM4 dataset; the results demonstrate that our model surpasses comparison methods by a clear margin.
Poster
Yikun Liu · Pingan Chen · jiayin cai · Xiaolong Jiang · Yao Hu · Jiangchao Yao · Yanfeng Wang · Weidi Xie

[ ExHall D ]

Abstract
With the rapid advancement of multimodal information retrieval, increasingly complex retrieval tasks have emerged. Existing methods predominantly rely on task-specific fine-tuning of vision-language models, often those trained with image-text contrastive learning. In this paper, we explore the possibility of re-purposing generative Large Multimodal Models (LMMs) for retrieval. This approach enables us to unify all retrieval tasks under the same formulation and, more importantly, allows for extrapolation towards unseen retrieval tasks without additional training. Our contributions can be summarised in the following aspects: (i) We introduce **LamRA**, a versatile framework designed to empower LMMs with sophisticated retrieval and reranking capabilities. (ii) For retrieval, we adopt a two-stage training strategy comprising language-only pre-training and multimodal instruction tuning to progressively enhance the LMM's retrieval performance. (iii) For reranking, we employ joint training for both pointwise and listwise reranking, offering two distinct ways to further boost retrieval performance. (iv) Extensive experimental results underscore the efficacy of our method in handling more than ten retrieval tasks, demonstrating robust performance in both supervised and zero-shot settings, including scenarios involving previously unseen retrieval tasks. The code and model will be made publicly available for reproduction.
Poster
Yuchen Duan · Zhe Chen · Yusong Hu · Weiyun Wang · Shenglong Ye · Botian Shi · Lewei Lu · Qibin Hou · Tong Lu · Hongsheng Li · Jifeng Dai · Wenhai Wang

[ ExHall D ]

Abstract
Despite significant progress in multimodal large language models (MLLMs), their performance on complex, multi-page document comprehension remains inadequate, largely due to the lack of high-quality, document-level datasets. While current retrieval-augmented generation (RAG) methods offer partial solutions, they suffer from issues such as fragmented retrieval contexts, multi-stage error accumulation, and the extra time cost of retrieval. In this work, we present a high-quality document-level dataset, Doc-750K, designed to support in-depth understanding of multimodal documents. This dataset includes diverse document structures, extensive cross-page dependencies, and real question-answer pairs derived from original documents. Building on the dataset, we develop a native multimodal model, Docopilot, which can accurately handle document-level dependencies without relying on RAG. Experiments demonstrate that Docopilot achieves superior coherence, accuracy, and efficiency in document understanding tasks and multi-turn interactions, setting a new baseline for document-level multimodal understanding. Data, code, and models shall be released.
Poster
Wenhui Liao · Jiapeng Wang · Hongliang Li · Chengyu Wang · Jun Huang · Lianwen Jin

[ ExHall D ]

Abstract
Text-rich document understanding (TDU) refers to analyzing and comprehending documents containing substantial textual content. With the rapid evolution of large language models (LLMs), they have been widely leveraged for TDU due to their remarkable versatility and generalization. In this paper, we introduce DocLayLLM, an efficient multi-modal extension of LLMs specifically designed for TDU. By lightly integrating visual patch tokens and 2D positional tokens into LLMs and encoding the document content using the LLMs themselves, we fully take advantage of the document comprehension capability of LLMs and enhance their perception of OCR information. We have also deeply considered the role of chain-of-thought (CoT) and innovatively proposed the techniques of CoT Pre-training and CoT Annealing. Our DocLayLLM can achieve remarkable performances with lightweight training settings, showcasing its efficiency and effectiveness. Experimental results demonstrate that our DocLayLLM outperforms existing OCR-dependent methods and OCR-free competitors.
Poster
Jeongryong Lee · Yejee Shin · Geonhui Son · Dosik Hwang

[ ExHall D ]

Abstract
The modality gap between vision and text embeddings in CLIP presents a significant challenge for zero-shot image captioning, limiting effective cross-modal representation. Traditional approaches, such as noise injection and memory-based similarity matching, attempt to address this gap, yet these methods either rely on indirect alignment or relatively naive solutions with heavy computation. Diffusion Bridge introduces a novel approach to directly reduce this modality gap by leveraging Denoising Diffusion Probabilistic Models (DDPM), trained exclusively on text embeddings to model their distribution. Our approach is motivated by the observation that, while paired vision and text embeddings are relatively close, a modality gap still exists due to stable regions created by the contrastive loss. This gap can be interpreted as noise in cross-modal mappings, which we approximate as Gaussian noise. To bridge this gap, we employ a reverse diffusion process, where image embeddings are strategically introduced at an intermediate step in the reverse process, allowing them to be refined progressively toward the text embedding distribution. This process transforms vision embeddings into text-like representations closely aligned with paired text embeddings, effectively minimizing residual discrepancies. Experimental results demonstrate that these text-like vision embeddings significantly enhance alignment with their paired text embeddings, leading to improved zero-shot …
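The intermediate-step injection described above can be sketched as follows, assuming a DDPM noise predictor `eps_model` trained on text embeddings and a standard beta schedule; the signature of `eps_model` and the choice of posterior variance are assumptions for illustration, not the paper's exact sampler.

```python
# Sketch: refine an image embedding toward the text-embedding distribution by
# treating it as a partially noised sample and running the DDPM reverse chain.
import torch

@torch.no_grad()
def bridge(image_emb: torch.Tensor, eps_model, betas: torch.Tensor, t_start: int) -> torch.Tensor:
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    # Inject the image embedding at an intermediate step t_start of the reverse process.
    x = torch.sqrt(alpha_bar[t_start]) * image_emb + \
        torch.sqrt(1 - alpha_bar[t_start]) * torch.randn_like(image_emb)
    for t in range(t_start, -1, -1):
        eps = eps_model(x, torch.tensor([t]))                       # assumed interface
        coef = (1 - alphas[t]) / torch.sqrt(1 - alpha_bar[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])             # DDPM posterior mean
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise                     # simple beta_t variance
    return x                                                        # text-like embedding
```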
Poster
Matt Deitke · Christopher Clark · Sangho Lee · Rohun Tripathi · Yue Yang · Jae Sung Park · Reza Salehi · Niklas Muennighoff · Kyle Lo · Luca Soldaini · Jiasen Lu · Taira Anderson · Erin Bransom · Kiana Ehsani · Huong Ngo · Yen-Sung Chen · Ajay Patel · Mark Yatskar · Chris Callison-Burch · Andrew Head · Rose Hendrix · Favyen Bastani · Eli VanderBilt · Nathan Lambert · Yvonne Chou · Arnavi Chheda-Kothary · Jenna Sparks · Sam Skjonsberg · Michael Schmitz · Aaron Sarnat · Byron Bischoff · Pete Walsh · Christopher Newell · Piper Wolters · Tanmay Gupta · Kuo-Hao Zeng · Jon Borchardt · Dirk Groeneveld · Crystal Nam · Sophie Lebrecht · Caitlin Wittlif · Carissa Schoenick · Oscar Michel · Ranjay Krishna · Luca Weihs · Noah A. Smith · Hannaneh Hajishirzi · Ross Girshick · Ali Farhadi · Aniruddha Kembhavi

[ ExHall D ]

Abstract
Today's most advanced vision-language models (VLMs) remain proprietary. The strongest open-weight models rely heavily on synthetic data from proprietary VLMs to achieve good performance, effectively distilling these closed VLMs into open ones. As a result, the community has been missing foundational knowledge about how to build performant VLMs from scratch. We present \textbf{Molmo}, a new family of VLMs that are state-of-the-art in their class of openness. Our key contribution is a collection of new datasets, including a dataset of highly detailed image captions for pre-training called \textbf{PixMo}, a free-form image Q\&A dataset for fine-tuning, and an innovative 2D pointing dataset, all collected without the use of external VLMs. The success of our approach relies on careful modeling choices, a well-tuned training pipeline, and, most critically, the quality of our newly collected datasets. Our best-in-class 72B model not only outperforms others in the class of open weight and data models, but also outperforms larger proprietary models including Claude 3.5 Sonnet, and Gemini 1.5 Pro and Flash, second only to GPT-4o based on both academic benchmarks and on a large human evaluation. Our model weights, new datasets, and source code will all be released.
Poster
Guotao liang · Baoquan Zhang · Zhiyuan Wen · Junteng Zhao · Yunming Ye · Guangming Ye · Yao He

[ ExHall D ]

Abstract
Image quantization is a crucial technique in image generation, aimed at learning a codebook that encodes an image into a discrete token sequence. Recent advancements have seen researchers exploring the learning of multi-modal codebooks (i.e., text-aligned codebooks) by utilizing image caption semantics, aiming to enhance codebook performance in cross-modal tasks. However, existing image-text paired datasets exhibit a notable flaw in that the text descriptions tend to be overly concise, failing to adequately describe the images and provide sufficient semantic knowledge, resulting in limited alignment of text and codebook at a fine-grained level. In this paper, we propose a novel Text-Augmented Codebook Learning framework, named TA-VQ, which generates longer text for each image using a visual-language model for improved text-aligned codebook learning. However, the long text presents two key challenges: how to encode the text and how to align the codebook and text. To tackle these two challenges, we propose to split the long text into multiple granularities for encoding, i.e., word, phrase, and sentence, so that the long text can be fully encoded without losing any key semantic knowledge. Following this, a hierarchical encoder and a novel sampling-based alignment strategy are designed to achieve fine-grained codebook-text alignment. Additionally, our method can be seamlessly integrated into existing VQ models. …
Poster
Hyungyu Choi · Young Kyun Jang · Chanho Eom

[ ExHall D ]

Abstract
Vision-language models like CLIP have shown impressive capabilities in aligning images and text, but they often struggle with lengthy, detailed text descriptions due to their training focus on concise captions. We present GOAL (Global-local Object Alignment Learning), a novel fine-tuning method that enhances CLIP's ability to handle lengthy text by leveraging both global and local semantic alignments. Our approach consists of two key components: Local Image-Sentence Matching (LISM), which identifies corresponding pairs between image segments and descriptive sentences, and Token Similarity-based Learning (TSL), which efficiently propagates local element attention through these matched pairs. Evaluating GOAL on three new benchmarks for image-lengthy text retrieval, we demonstrate significant improvements over baseline CLIP fine-tuning, establishing a simple yet effective approach for adapting CLIP to detailed textual descriptions. Through extensive experiments, we show that our method's focus on local semantic alignment alongside global context leads to more nuanced and representative embeddings, particularly beneficial for tasks requiring fine-grained understanding of lengthy text descriptions.
Poster
Anjia Cao · Xing Wei · Zhiheng Ma

[ ExHall D ]

Abstract
Language-image pre-training faces significant challenges due to limited data in specific formats and the constrained capacities of text encoders. While prevailing methods attempt to address these issues through data augmentation and architecture modifications, they continue to struggle with processing long-form text inputs, and the inherent limitations of traditional CLIP text encoders lead to suboptimal downstream generalization. In this paper, we propose FLAME (Frozen Large lAnguage Models Enable data-efficient language-image pre-training) that leverages frozen large language models as text encoders, naturally processing long text inputs and demonstrating impressive multilingual generalization. FLAME comprises two key components: 1) a multifaceted prompt distillation technique for extracting diverse semantic representations from long captions, which better aligns with the multifaceted nature of images, and 2) a facet-decoupled attention mechanism, complemented by an offline embedding strategy, to ensure efficient computation. Extensive empirical evaluations demonstrate FLAME's superior performance. When trained on CC3M, FLAME surpasses the previous state-of-the-art by 4.9\% in ImageNet top-1 accuracy. On YFCC15M, FLAME surpasses the WIT-400M-trained CLIP by 44.4\% in average image-to-text recall@1 across 36 languages, and by 34.6\% in text-to-image recall@1 for long-context retrieval on Urban-1k. The code will be released soon.
Poster
Shuokai Pan · Gerti Tuzi · Sudarshan Sreeram · Dibakar Gope

[ ExHall D ]

Abstract
Despite the revolutionary breakthroughs of large-scale text-to-image diffusion models for complex vision and downstream tasks, their extremely high computational and storage costs limit their usability. Quantization of diffusion models has been explored in recent works to reduce compute costs and memory bandwidth usage. To further improve inference time, fast convolution algorithms such as Winograd can be used for convolution layers, which account for a significant portion of computations in diffusion models. However, the significant quality loss of fully quantized Winograd using existing coarser-grained post-training quantization methods, combined with the complexity and cost of finetuning the Winograd transformation matrices for such large models to recover quality, makes them unsuitable for large-scale foundation models. Motivated by the presence of a large range of values in them, we investigate the impact of finer-grained group-wise quantization in quantizing diffusion models. While group-wise quantization can largely handle the fully quantized Winograd convolution, it struggles to deal with the large distribution imbalance in a sizable portion of the Winograd domain computation. To reduce range differences in the Winograd domain, we propose finetuning only the scale parameters of the Winograd transform matrices without using any domain-specific training data. Because our method does not depend on any training …
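For readers unfamiliar with the finer-grained group-wise quantization discussed above, here is a minimal symmetric per-group weight-quantization sketch; the group size, bit width, and symmetric rounding are illustrative choices, not the paper's Winograd-domain scheme.

```python
# Sketch: per-group symmetric fake-quantization of a weight tensor.
import torch

def groupwise_quantize(w: torch.Tensor, group_size: int = 64, n_bits: int = 8) -> torch.Tensor:
    """Assumes w.numel() is divisible by group_size (padding is omitted for brevity)."""
    orig_shape = w.shape
    groups = w.reshape(-1, group_size)                        # one scale per group of weights
    qmax = 2 ** (n_bits - 1) - 1
    scale = groups.abs().amax(dim=1, keepdim=True) / qmax     # per-group scale
    q = torch.clamp(torch.round(groups / scale), -qmax - 1, qmax)
    return (q * scale).reshape(orig_shape)                    # dequantized ("fake-quant") weights
```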
Poster
Yassine Ouali · Adrian Bulat · ALEXANDROS XENOS · Anestis Zaganidis · Ioannis Maniadis Metaxas · Brais Martinez · Georgios Tzimiropoulos

[ ExHall D ]

Abstract
Contrastively-trained Vision-Language Models (VLMs) like CLIP have become the de facto approach for discriminative vision-language representation learning. However, these models have limited language understanding, often exhibiting a "bag of words" behavior. At the same time, Large Vision-Language Models (LVLMs), which combine vision encoders with LLMs, have been shown capable of detailed vision-language reasoning, yet their autoregressive nature renders them less suitable for discriminative tasks. In this work, we propose to combine "the best of both worlds": a new training approach for discriminative fine-tuning of LVLMs that results in strong discriminative and compositional capabilities. Essentially, our approach converts a generative LVLM into a discriminative one, unlocking its capability for powerful image-text discrimination combined with enhanced language understanding. Our contributions include (1) A carefully designed training/optimization framework that utilizes image-text pairs of variable length and granularity for training the model with both contrastive and next-token prediction losses. This is accompanied by ablation studies that justify the necessity of our framework's components. (2) A parameter-efficient adaptation method using a combination of soft prompting and LoRA adapters. (3) Significant improvements over state-of-the-art CLIP-like models of similar size, including standard image-text retrieval benchmarks and notable gains in compositionality.
Poster
Tianyu Chen · Xingcheng Fu · Yisen Gao · Haodong Qian · Yuecen Wei · Kun Yan · Haoyi Zhou · Jianxin Li

[ ExHall D ]

Abstract
Modern vision-language models (VLMs) develop patch embedding and convolution backbones within vector spaces, especially Euclidean ones, from their very foundation. When expanding VLMs to galaxy scale for understanding astronomical phenomena, the integration of spherical space for planetary orbits and hyperbolic space for black holes raises two formidable challenges. a) The current pre-training model is confined to Euclidean space rather than a comprehensive geometric embedding. b) The predominant architecture lacks suitable backbones for anisotropic physical geometries. In this paper, we introduce Galaxy-Walker, a geometry-aware VLM for universe-level vision understanding tasks. We propose a geometry prompt that generates geometry tokens by random walks across diverse spaces on a multi-scale physical graph, along with a geometry adapter that compresses and reshapes the space anisotropy in a mixture-of-experts manner. Extensive experiments demonstrate the effectiveness of our approach, with Galaxy-Walker achieving state-of-the-art performance in both galaxy property estimation (R² scores up to 0.91) and morphology classification tasks (up to +0.17 F1 improvement on challenging features), significantly outperforming both domain-specific models and general-purpose VLMs.
Poster
Zhijian Liu · Ligeng Zhu · Baifeng Shi · Zhuoyang Zhang · Yuming Lou · Shang Yang · Haocheng Xi · Shiyi Cao · Yuxian Gu · Dacheng Li · Xiuyu Li · Haotian Tang · Yunhao Fang · Yukang Chen · Cheng-Yu Hsieh · De-An Huang · An-Chieh Cheng · Jinyi Hu · Sifei Liu · Ranjay Krishna · Pavlo Molchanov · Jan Kautz · Danny Yin · Song Han · Yao Lu

[ ExHall D ]

Abstract
Visual language models (VLMs) have made significant strides in accuracy in recent years, but their efficiency has often been overlooked. This paper presents NVILA, a family of open frontier VLMs designed to optimize both efficiency and accuracy. Building upon VILA, we improve its model architecture by first scaling up spatial and temporal resolutions, and then compressing visual tokens. This "scale-then-compress" approach allows NVILA to efficiently process high-resolution images and long videos. We also conduct a systematic study to improve NVILA's efficiency throughout its entire lifecycle—from training and fine-tuning to deployment. NVILA is both efficient and accurate, reducing training costs by 4.5×, fine-tuning memory usage by 3.4×, prefilling latency by 1.8× and decoding throughput by 1.2×. It also achieves state-of-the-art results across diverse image and video benchmarks. We will release our implementation and trained models to support full reproducibility.
Poster
Jing Bi · Lianggong Bruce Wen · Zhang Liu · JunJia Guo · Yunlong Tang · Bingjie Wang · Chenliang Xu

[ ExHall D ]

Abstract
Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated remarkable progress in visual understanding. This impressive leap raises a compelling question: how can language models, initially trained solely on linguistic data, effectively interpret and process visual content? This paper aims to address this question with systematic investigation across 4 model families and 4 model scales, uncovering a unique class of attention heads that focus specifically on visual content. Our analysis reveals a strong correlation between the behavior of these attention heads, the distribution of attention weights, and their concentration on visual tokens within the input. These findings enhance our understanding of how LLMs adapt to multimodal tasks, demonstrating their potential to bridge the gap between textual and visual understanding. This work paves the way for the development of AI systems capable of engaging with diverse modalities.
Poster
Xudong LU · Yinghao Chen · chencheng Chen · Hui Tan · Boheng Chen · yina xie · Rui Hu · Guanxin tan · Renshou Wu · Yan Hu · Yi Zeng · Lei Wu · Liuyang Bian · Zhaoxiong Wang · Long Liu · Yanzhou Yang · Han Xiao · Aojun Zhou · Yafei Wen · Xiaoxin Chen · Shuai Ren · Hongsheng Li

[ ExHall D ]

Abstract
The emergence and growing popularity of multimodal large language models (MLLMs) have significant potential to enhance various aspects of daily life, from improving communication to facilitating learning and problem-solving. Mobile phones, as essential daily companions, represent the most effective and accessible deployment platform for MLLMs, enabling seamless integration into everyday tasks. However, deploying MLLMs on mobile phones presents challenges due to limitations in memory size and computational capability, making it difficult to achieve smooth and real-time processing without extensive optimization. In this paper, we present **BlueLM-V-3B**, an algorithm and system co-design approach specifically tailored for the efficient deployment of MLLMs on mobile platforms. To be specific, we redesign the dynamic resolution scheme adopted by mainstream MLLMs and implement system optimization for hardware-aware deployment to optimize model inference on mobile phones. BlueLM-V-3B boasts the following key highlights: (1) **Small Size**: BlueLM-V-3B features a language model with 2.7B parameters and a vision encoder with 400M parameters. (2) **Fast Speed**: BlueLM-V-3B achieves a generation speed of **24.4** token/s on the MediaTek Dimensity 9300 processor with 4-bit LLM weight quantization. (3) **Strong Performance**: BlueLM-V-3B has attained the highest average score of **66.1** on the OpenCompass benchmark among models with ≤ 4B parameters and surpassed …
Poster
Junyan Lin · Haoran Chen · Yue Fan · Yingqi Fan · Xin Jin · Hui Su · Jinlan Fu · Xiaoyu Shen

[ ExHall D ]

Abstract
Multimodal Large Language Models (MLLMs) have made significant advancements in recent years, with visual features playing an increasingly critical role in enhancing model performance. However, the integration of multi-layer visual features in MLLMs remains underexplored, particularly with regard to optimal layer selection and fusion strategies. Existing methods often rely on arbitrary design choices, leading to suboptimal outcomes. In this paper, we systematically investigate two core aspects of multi-layer visual feature fusion: (1) selecting the most effective visual layers and (2) identifying the best fusion approach with the language model. Our experiments reveal that while combining visual features from multiple stages improves generalization, incorporating additional features from the same stage typically leads to diminished performance. Furthermore, we find that direct fusion of multi-layer visual features at the input stage consistently yields superior and more stable performance across various configurations.
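A minimal sketch of the kind of direct multi-layer fusion at the input stage discussed above: features from selected vision-encoder layers are concatenated along the channel axis and projected into the language model's embedding space. The layer indices and the single linear projector are assumptions for illustration, not the paper's specific configuration.

```python
# Sketch: fuse multi-layer visual features before feeding them to the language model.
import torch
import torch.nn as nn

class MultiLayerFusion(nn.Module):
    def __init__(self, vis_dim: int, llm_dim: int, layers=(8, 16, 24)):
        super().__init__()
        self.layers = layers                                   # which encoder layers to fuse
        self.proj = nn.Linear(vis_dim * len(layers), llm_dim)  # project fused features to LLM width

    def forward(self, hidden_states):
        # hidden_states: list of per-layer tensors, each of shape (B, N, vis_dim)
        fused = torch.cat([hidden_states[i] for i in self.layers], dim=-1)
        return self.proj(fused)                                # (B, N, llm_dim) visual tokens
```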
Poster
Xiao Guo · Xiufeng Song · Yue Zhang · Xiaohong Liu · Xiaoming Liu

[ ExHall D ]

Abstract
Deepfake detection is a long-established research topic crucial for combating the spread of malicious misinformation. Unlike previous methods that provide either binary classification results or textual explanations for deepfake detection, we propose a novel method that delivers both simultaneously. Our method harnesses the multi-modal learning power of the pre-trained CLIP and the unprecedented interpretability of large language models (LLMs) to enhance both the generalization and interpretability of deepfake detection. Specifically, we introduce a multi-modal face forgery detector (M2F2-Det) that employs specially designed face forgery prompt learning, integrating zero-shot learning capabilities of the pre-trained CLIP to improve generalization to unseen forgeries. Also, M2F2-Det incorporates the LLM to provide detailed explanations for detection decisions, offering strong interpretability by bridging the gap between natural language and the subtle nuances of facial forgery detection. Empirically, we evaluate M2F2-Det for both detection and sentence generation tasks, on both of which M2F2-Det achieves state-of-the-art performance, showing its effectiveness in detecting and explaining diverse and unseen forgeries. Code and models will be released upon publication.
Poster
Shiyao Li · Yingchun Hu · Xuefei Ning · Xihui Liu · Ke Hong · xiaotao jia · Xiuhong Li · Yaqi Yan · PEI RAN · Guohao Dai · Shengen Yan · Huazhong Yang · Yu Wang

[ ExHall D ]

Abstract
Vision-Language Models (VLMs) have already enabled a variety of real-world applications. The large parameter size of VLMs brings large memory and computation overhead which poses significant challenges for deployment. Post-Training Quantization (PTQ) is an effective technique to reduce the memory and computation overhead. Existing PTQ methods mainly focus on the language modality in large language models (LLMs), without considering the differences across other modalities. In this paper, we discover that there is a significant difference in sensitivity between language and vision tokens in large VLMs. Therefore, treating tokens from different modalities equally, as in existing PTQ methods, may over-emphasize the insensitive modalities, leading to significant accuracy loss. To deal with the above issue, we propose a simple yet effective method, Modality-Balanced Quantization (MBQ), for large VLMs. Specifically, MBQ incorporates the different sensitivities across modalities during the calibration process to minimize the reconstruction loss for better quantization parameters. Extensive experiments show that MBQ can significantly improve task accuracy by up to 4.4% and 11.6% under W3 and W4A8 quantization for 7B to 70B VLMs, compared to SOTA baselines. Additionally, we implement a W3 GPU kernel that fuses the dequantization and GEMV operators, achieving a 1.4x speedup on LLaVA-onevision-7B on the RTX …
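A minimal sketch of a modality-weighted calibration objective in the spirit of the sensitivity argument above: per-token reconstruction error between full-precision and quantized outputs is weighted differently for language and vision tokens. The weights and masking interface are illustrative assumptions, not the paper's exact objective.

```python
# Sketch: modality-balanced reconstruction loss for searching quantization parameters.
import torch

def modality_balanced_loss(fp_out: torch.Tensor,
                           q_out: torch.Tensor,
                           is_text_token: torch.Tensor,
                           w_text: float = 4.0,
                           w_vision: float = 1.0) -> torch.Tensor:
    # fp_out, q_out: (B, N, D) layer outputs from full-precision / quantized weights
    # is_text_token: (B, N) boolean mask marking language tokens
    err = (fp_out - q_out).pow(2).mean(dim=-1)                      # (B, N) per-token error
    w = torch.where(is_text_token,
                    torch.full_like(err, w_text),                   # up-weight sensitive modality
                    torch.full_like(err, w_vision))
    return (w * err).sum() / w.sum()
```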
Poster
Qianhan Feng · Wenshuo Li · Tong Lin · Xinghao Chen

[ ExHall D ]

Abstract
Vision-Language Models (VLMs) bring powerful understanding and reasoning capabilities to multimodal tasks. Meanwhile, the need for capable artificial intelligence on mobile devices, such as AI assistant software, is also growing. Some efforts try to migrate VLMs to edge devices to expand their application scope. Simplifying the model structure is a common approach, but as the model shrinks, the trade-off between performance and size becomes more and more difficult. Knowledge distillation (KD) can help models improve comprehensive capabilities without increasing size or data volume. However, most existing large-model distillation techniques only consider applications on single-modal LLMs, or only use teachers to create new data environments for students. None of these methods take into account the distillation of the most important cross-modal alignment knowledge in VLMs. We propose a method called Align-KD to guide the student model to learn the cross-modal matching that occurs at the shallow layers. The teacher also helps the student learn the projection of vision tokens into the text embedding space based on the focus of the text. Under the guidance of Align-KD, the 1.7B MobileVLM V2 model can learn rich knowledge from the 7B teacher model with a lightweight training-loss design, and achieve an average …
Poster
Xianwei Zhuang · Zhihong Zhu · Yuxin Xie · Liming Liang · Yuexian Zou

[ ExHall D ]

Abstract
Large Vision-Language Models (LVLMs) may produce outputs that are unfaithful to reality, also known as visual hallucinations (VH), which significantly impedes their real-world usage. To alleviate VH, various decoding strategies have been proposed to enhance visual information. However, many of these methods may require secondary decoding and rollback, which significantly reduces inference speed. In this work, we propose an efficient plug-and-play decoding algorithm via Visual-Aware Sparsification (VASparse) from the perspective of token sparsity for mitigating VH. VASparse is inspired by two empirical observations: (1) the sparse activation of attention in LVLMs, and (2) visual-agnostic token sparsification exacerbates VH. Based on these insights, we propose a novel token sparsification strategy that balances efficiency and trustworthiness. Specifically, VASparse implements a visual-aware token selection strategy during decoding to reduce redundant tokens while preserving visual context effectively. Additionally, we innovatively introduce a sparse-based visual contrastive decoding method to recalibrate the distribution of hallucinated outputs without the time overhead associated with secondary decoding. Subsequently, VASparse recalibrates attention scores to penalize attention sinking of LVLMs towards text tokens. Extensive experiments across four popular benchmarks confirm the effectiveness of VASparse in mitigating VH across different LVLM families without requiring additional training or post-processing. Impressively, VASparse achieves state-of-the-art performance …
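A minimal sketch of a visual-aware token selection step of the kind described above: visual tokens are ranked by the attention they receive and only a fraction is kept, while text tokens are always retained. The scoring rule and keep ratio are assumptions for illustration, not VASparse's exact strategy.

```python
# Sketch: keep the most-attended visual tokens and all text tokens during decoding.
import torch

def select_tokens(attn: torch.Tensor, is_visual: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """attn: (heads, N, N) attention weights; is_visual: (N,) boolean mask over tokens."""
    received = attn.mean(dim=0).sum(dim=0)                    # (N,) attention each token receives
    n_keep_visual = int(is_visual.sum().item() * keep_ratio)  # how many visual tokens to keep
    vis_scores = received.masked_fill(~is_visual, float("-inf"))
    keep_visual = vis_scores.topk(n_keep_visual).indices      # top-k visual tokens
    keep_text = (~is_visual).nonzero(as_tuple=True)[0]        # never drop text tokens
    return torch.cat([keep_text, keep_visual]).sort().values  # indices of tokens to keep
```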
Poster
Dokyoon Yoon · Youngsook Song · Woomyoung Park

[ ExHall D ]

Abstract
Multimodal Large Language Models (MLLMs) frequently suffer from hallucination issues, generating information about objects that are not present in input images during vision-language tasks. These hallucinations particularly undermine model reliability in practical applications requiring accurate object identification. To address this challenge, we propose TL-DPO, a preference learning approach that mitigates hallucinations by focusing on targeted areas where they occur. To implement this, we build a dataset containing hallucinated responses, correct responses, and target information (i.e., objects present in the images and the corresponding chunk positions in responses affected by hallucinations). By applying a preference learning method restricted to these specific targets, the model can filter out irrelevant signals and focus on correcting hallucinations. This allows the model to produce more factual responses by concentrating solely on relevant information. Experimental results demonstrate that TL-DPO effectively reduces hallucinations across multiple vision hallucination tasks, improving the reliability and performance of MLLMs without diminishing overall performance.
Poster
Junzhe Chen · Tianshu Zhang · Shiyu Huang · Yuwei Niu · Linfeng Zhang · Lijie Wen · Xuming Hu

[ ExHall D ]

Abstract
Despite the recent breakthroughs achieved by Large Vision Language Models (LVLMs) in understanding and responding to complex visual-textual contexts, their inherent hallucination tendencies limit their practical application in real-world scenarios that demand high levels of precision. Existing methods typically either fine-tune the LVLMs using additional data, which incurs extra costs in manual annotation and computational resources, or perform comparisons at the decoding stage, which may eliminate useful language priors for reasoning while introducing inference-time overhead. Therefore, we propose ICT, a lightweight, training-free method that calculates an intervention direction to shift the model's focus towards different levels of visual information, enhancing its attention to high-level and fine-grained visual details. During the forward pass, the intervention is applied to the attention heads that encode the overall image information and the fine-grained object details, effectively mitigating over-reliance on language priors and thereby alleviating hallucinations. Extensive experiments demonstrate that ICT achieves strong performance with a small amount of data and generalizes well across different datasets and models. Our code will be public.
Poster
Tobia Poppi · Tejaswi Kasarla · Pascal Mettes · Lorenzo Baraldi · Rita Cucchiara

[ ExHall D ]

Abstract
Addressing the retrieval of unsafe content from vision-language models such as CLIP is an important step towards real-world integration. Current efforts have relied on unlearning techniques that try to erase the model’s knowledge of unsafe concepts. While effective in reducing unwanted outputs, unlearning limits the model's capacity to discern between safe and unsafe content. In this work, we introduce a novel approach that shifts from unlearning to an awareness paradigm by leveraging the inherent hierarchical properties of the hyperbolic space. We propose to encode safe and unsafe content as an entailment hierarchy, where both are placed in different regions of hyperbolic space. Our HySAC, Hyperbolic Safety-Aware CLIP, employs entailment loss functions to model the hierarchical and asymmetrical relations between safe and unsafe image-text pairs. This modelling – ineffective in standard vision-language models due to their reliance on Euclidean embeddings – endows the model with awareness of unsafe content, enabling it to serve as both a multimodal unsafe classifier and a flexible content retriever, with the option to dynamically redirect unsafe queries toward safer alternatives or retain the original output. Extensive experiments show that our approach not only enhances safety recognition, but also establishes a more adaptable and interpretable framework for …
Poster
Jin Wang · Chenghui Lv · Xian Li · Shichao Dong · Huadong Li · kelu Yao · Chao Li · Wenqi Shao · Ping Luo

[ ExHall D ]

Abstract
Recently, the rapid development of AIGC has significantly boosted the diversity of fake media spread on the Internet, posing unprecedented threats to social security, politics, law, and more. To detect the increasingly **diverse** malicious fake media in the new era of AIGC, recent studies have proposed to exploit Large Vision Language Models (LVLMs) to design **robust** forgery detectors due to their impressive performance on a **wide** range of multimodal tasks. However, there is still no comprehensive benchmark designed to assess LVLMs' discerning capabilities on forged media. To fill this gap, we present Forensics-Bench, a new forgery detection evaluation benchmark suite to assess LVLMs across massive forgery detection tasks, requiring comprehensive recognition, localization, and reasoning capabilities on diverse forgeries. Forensics-Bench comprises 63,292 meticulously curated multiple-choice visual questions, covering 112 unique forgery detection types from 5 perspectives: forgery semantics, forgery modalities, forgery tasks, forgery types, and forgery models. We conduct thorough evaluations on 22 open-sourced LVLMs and 3 proprietary models (GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet), highlighting the significant challenges of comprehensive forgery detection posed by Forensics-Bench. We anticipate that Forensics-Bench will motivate the community to advance the frontier of LVLMs, striving for all-around forgery detectors in the era of AIGC.
Poster
Haoyu Zhang · Yangyang Guo · Mohan Kankanhalli

[ ExHall D ]

Abstract
Vision-Language (V-L) pre-trained models such as CLIP show prominent capabilities in various downstream tasks. Despite this promise, V-L models are notoriously limited by their inherent social biases. A typical demonstration is that V-L models often produce biased predictions against specific groups of people, significantly undermining their real-world applicability. Existing approaches endeavor to mitigate the social bias problem in V-L models by removing biased attribute information from model embeddings. However, after our revisiting of these methods, we find that their bias removal is frequently accompanied by greatly compromised V-L alignment capabilities. We then reveal that this performance degradation stems from the unbalanced debiasing in image and text embeddings. To address this issue, we propose a novel V-L debiasing framework to align image and text biases followed by removing them from both modalities. By doing so, our method achieves multi-modal bias mitigation while maintaining the V-L alignment in the debiased embeddings. Additionally, we advocate a new evaluation protocol that can 1) holistically quantify the model debiasing and V-L alignment ability, and 2) evaluate the generalization of social bias removal models. We believe this work will offer new insights and guidance for future studies addressing the social bias problem in CLIP.
Poster
Shin'ya Yamaguchi · Dewei Feng · Sekitoshi Kanai · Kazuki Adachi · Daiki Chijiwa

[ ExHall D ]

Abstract
Contrastive language image pre-training (CLIP) is an essential component of building modern vision-language foundation models. While CLIP demonstrates remarkable zero-shot performance on downstream tasks, the multi-modal feature spaces still suffer from a modality gap, which is a gap between image and text feature clusters that limits downstream task performance. Although existing works attempt to address the modality gap by modifying pre-training or fine-tuning, they struggle with heavy training costs on large datasets or degradation of zero-shot performance. This paper presents CLIP-Refine, a post-pre-training method for CLIP models at a phase between pre-training and fine-tuning. CLIP-Refine aims to align the feature space with one epoch of training on small image-text datasets without zero-shot performance degradation. To this end, we introduce two techniques: random feature alignment (RaFA) and hybrid contrastive-distillation (HyCD). RaFA aligns the image and text features to follow a shared prior distribution by minimizing the distance to random reference vectors sampled from the prior. HyCD updates the model with hybrid soft labels generated by combining ground-truth image-text pair labels and outputs from the pre-trained CLIP model. This helps the model both retain past knowledge and learn new knowledge for feature alignment. Our extensive experiments with multiple classification and retrieval tasks …
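A minimal sketch of the random-feature-alignment idea described above, under the assumption that the shared prior is an isotropic Gaussian projected to the unit sphere; the distance function and sampling scheme are illustrative, not CLIP-Refine's exact losses (HyCD is omitted).

```python
# Sketch: pull image and text features toward shared random reference vectors from a prior.
import torch
import torch.nn.functional as F

def rafa_loss(img_feat: torch.Tensor, txt_feat: torch.Tensor) -> torch.Tensor:
    # img_feat, txt_feat: (B, D) L2-normalized embeddings from the two encoders
    ref = F.normalize(torch.randn_like(img_feat), dim=-1)   # reference vectors sampled from the prior
    return ((img_feat - ref).pow(2).sum(-1) +
            (txt_feat - ref).pow(2).sum(-1)).mean()         # both modalities move toward the prior
```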
Poster
Karsten Roth · Zeynep Akata · Dima Damen · Ivana Balazevic · Olivier J Henaff

[ ExHall D ]

Abstract
Large-scale multimodal representation learning successfully optimizes for zero-shot transfer at test time. Yet the standard pretraining paradigm (contrastive learning on large amounts of image-text data) does not explicitly encourage representations to support few-shot adaptation. In this work, we propose a simple, but carefully designed extension to multimodal pretraining which enables representations to accommodate additional context. Using this objective, we show that vision-language models can be trained to exhibit significantly increased few-shot adaptation: across 21 downstream tasks, we find up to four-fold improvements in test-time sample efficiency, and average few-shot adaptation gains of over 5\%, while retaining zero-shot generalization performance across model scales and training durations. In particular, equipped with simple, training-free, metric-based adaptation mechanisms, our representations surpass significantly more complex optimization-based adaptation schemes.
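As an example of the training-free, metric-based adaptation mechanisms mentioned above, here is a generic nearest-prototype sketch over frozen embeddings; it is a standard baseline used for illustration, not the paper's specific mechanism.

```python
# Sketch: few-shot adaptation without training, via class prototypes and cosine similarity.
import torch
import torch.nn.functional as F

def prototype_classify(support_feats: torch.Tensor,
                       support_labels: torch.Tensor,
                       query_feats: torch.Tensor,
                       n_classes: int) -> torch.Tensor:
    # Class prototypes are the mean support embedding per class.
    protos = torch.stack([
        support_feats[support_labels == c].mean(dim=0) for c in range(n_classes)
    ])                                                           # (C, D)
    sims = F.normalize(query_feats, dim=-1) @ F.normalize(protos, dim=-1).T
    return sims.argmax(dim=-1)                                   # predicted class per query
```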
Poster
Yi Zhang · Yi-Xuan Deng · Meng-Hao Guo · Shi-Min Hu

[ ExHall D ]

Abstract
Vision-language models (VLMs) like CLIP have been widely used in various specific tasks. Parameter-efficient fine-tuning (PEFT) methods, such as prompt and adapter tuning, have become key techniques for adapting these models to specific domains. However, existing approaches rely on prior knowledge to manually identify the locations requiring fine-tuning. Adaptively selecting which parameters in VLMs should be tuned remains unexplored. In this paper, we propose CLIP with Adaptive Selective Tuning (CLIP-AST), which automatically selects critical parameters in VLMs to fine-tune for specific tasks. It leverages the adaptive learning rate in the optimizer and improves model performance without extra parameter overhead. We conduct extensive experiments on 13 benchmarks, such as ImageNet, Food101, and Flowers102, with different settings, including few-shot learning, base-to-novel class generalization, and out-of-distribution evaluation. The results show that CLIP-AST consistently outperforms the original CLIP model as well as its variants and achieves state-of-the-art (SOTA) performance in all cases. For example, with 16-shot learning, CLIP-AST surpasses GraphAdapter and PromptSRC by 3.56\% and 2.20\% in average accuracy on 11 datasets, respectively. Code will be publicly available.
Poster
Yabin Wang · Zhiwu Huang · Xiaopeng Hong

[ ExHall D ]

Abstract
This paper identifies OpenSDI, a challenge for spotting diffusion-generated images in open-world settings. In response to this challenge, we define a new benchmark, the OpenSDI dataset (OpenSDID), which stands out from existing datasets due to its diverse use of large vision-language models that simulate open-world diffusion-based manipulations. Another outstanding feature of OpenSDID is its inclusion of both detection and localization tasks for images manipulated globally and locally by diffusion models. To address the OpenSDI challenge, we propose a Synergizing Pretrained Models (SPM) scheme to build up a mixture of foundation models. This approach exploits a collaboration mechanism with multiple pretrained foundation models to enhance generalization in the OpenSDI context, moving beyond traditional training by synergizing multiple pretrained models through prompting and attending strategies. Building on this scheme, we introduce MaskCLIP, an SPM-based model that aligns Contrastive Language-Image Pre-Training (CLIP) with Masked Autoencoder (MAE). Extensive evaluations on OpenSDID show that MaskCLIP significantly outperforms current state-of-the-art methods for the OpenSDI challenge, achieving remarkable relative improvements of 14.23\% in IoU (14.11\% in F1) and 2.05\% in accuracy (2.38\% in F1) compared to the second-best model in detection and localization tasks, respectively.
Poster
Jianyu LAI · Sixiang Chen · yunlong lin · Tian Ye · Yun Liu · Song Fei · Zhaohu Xing · Hongtao Wu · Weiming Wang · Lei Zhu

[ ExHall D ]

Abstract
Snowfall poses a significant challenge to visual data processing, requiring specialized desnowing algorithms. However, current models often struggle with generalization due to their reliance on synthetic datasets, creating a domain gap. Evaluating real snowfall images is difficult due to the lack of ground truth. To tackle these issues, we introduce a large-scale, high-quality dataset of 10,000 annotated real snow scenes, develop a dataset of 36k preference pairs based on human expert rankings, enhance multimodal large language models' perception of snowfall images using direct preference optimization (DPO), and refine desnowing models through a mean-teacher semi-supervised framework with high-quality pseudo-label screening. This framework substantially improves the generalization and performance of desnowing models on real snowfall images.
Poster
Meng Lou · Yizhou Yu

[ ExHall D ]

Abstract
In the human vision system, top-down attention plays a crucial role in perception, wherein the brain initially performs an overall but rough scene analysis to extract salient cues (i.e., overview first), followed by a finer-grained examination to make more accurate judgments (i.e., look closely next). However, recent efforts in ConvNet designs primarily focused on increasing kernel size to obtain a larger receptive field without considering this crucial biomimetic mechanism to further improve performance. To this end, we propose a novel pure ConvNet vision backbone, termed OverLoCK, which is carefully devised from both the architecture and mixer perspectives. Specifically, we introduce a biomimetic Deep-stage Decomposition Strategy (DDS) that fuses semantically meaningful context representations into middle and deep layers by providing dynamic top-down context guidance at both feature and kernel weight levels. To fully unleash the power of top-down context guidance, we further propose a novel **Cont**ext-**Mix**ing Dynamic Convolution (ContMix) that effectively models long-range dependencies while preserving inherent local inductive biases even when the input resolution increases. These properties are absent in previous convolutions. With the support from both DDS and ContMix, our OverLoCK exhibits notable performance improvement over existing methods. For instance, OverLoCK-T achieves a Top-1 accuracy of 84.2\%, significantly surpassing …
Poster
Damien Teney · Liangze Jiang · Florin Gogianu · Ehsan Abbasnejad

[ ExHall D ]

Abstract
Common choices of architecture give neural networks a preference for fitting data with simple functions. This simplicity bias is considered key to their success. This paper explores the limits of this assumption. Building on recent work showing that activation functions are the origin of the simplicity bias (Teney, 2024), we introduce a method to meta-learn activation functions to modulate this bias. **Findings.** We discover multiple tasks where the assumption of simplicity is inadequate and standard ReLU architectures are therefore suboptimal. In these cases, we find activation functions that perform better by inducing a prior of higher complexity. Interestingly, these cases correspond to domains where neural networks have historically struggled: tabular data, regression tasks, cases of shortcut learning, and algorithmic grokking tasks. In comparison, the simplicity bias proves adequate on image tasks, where learned activations are nearly identical to ReLUs and GeLUs. **Implications.** (1) Contrary to common belief, the simplicity bias is not universally useful. There exist real tasks where it is suboptimal. (2) The suitability of ReLU models for image classification is not accidental. (3) The success of ML ultimately depends on the adequacy between data and architectures, and there may be benefits for architectures tailored to specific distributions of …
Poster
Faridoun Mehri · Mahdieh Baghshah · Mohammad Taher Pilehvar

[ ExHall D ]

Abstract
Why do gradient-based explanations struggle with Transformers, and how can we improve them? We identify gradient flow imbalances in Transformers that violate FullGrad-completeness, a critical property for attribution faithfulness that CNNs naturally possess. To address this issue, we introduce LibraGrad—a theoretically grounded post-hoc approach that corrects gradient imbalances through pruning and scaling of backward paths, without changing the forward pass or adding computational overhead. We evaluate LibraGrad using three metric families: Faithfulness, which quantifies prediction changes under perturbations of the most and least relevant features; Completeness Error, which measures attribution conservation relative to model outputs; and Segmentation AP, which assesses alignment with human perception. Extensive experiments across 8 architectures, 4 model sizes, and 4 datasets show that LibraGrad universally enhances gradient-based methods, outperforming existing white-box methods—including Transformer-specific approaches—across all metrics. We demonstrate superior qualitative results through two complementary evaluations: precise text-prompted region highlighting on CLIP models and accurate class discrimination between co-occurring animals on ImageNet-finetuned models—two settings on which existing methods often struggle. LibraGrad is effective even on the attention-free MLP-Mixer architecture, indicating potential for extension to other modern architectures. Our code is freely available.
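The abstract's key operation, rescaling backward paths without touching the forward pass, can be pictured with a PyTorch backward hook. The sketch below is a deliberately simplified toy (a single uniform scale on one layer) and does not reproduce LibraGrad's completeness-derived pruning and scaling rules; `attach_backward_scaling` is a hypothetical helper.

```python
import torch
import torch.nn as nn

def attach_backward_scaling(module: nn.Module, scale: float):
    """Toy illustration: rescale the gradient flowing back through `module` without
    changing its forward pass. LibraGrad's actual pruning/scaling rules come from its
    FullGrad-completeness analysis and are not reproduced here."""
    def hook(mod, grad_input, grad_output):
        # Returning a new grad_input tuple replaces the gradient w.r.t. the module's inputs.
        return tuple(g * scale if g is not None else None for g in grad_input)
    return module.register_full_backward_hook(hook)

# Usage sketch: dampen the backward path through one layer during attribution.
layer = nn.Linear(16, 16)
handle = attach_backward_scaling(layer, scale=0.5)
x = torch.randn(4, 16, requires_grad=True)
layer(x).sum().backward()   # the gradient reaching x is scaled by 0.5
handle.remove()
```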
Poster
Kevin Miller · Aditya Gangrade · Samarth Mishra · Kate Saenko · Venkatesh Saligrama

[ ExHall D ]

Abstract
Zero-shot multi-label recognition (MLR) with Vision-Language Models (VLMs) faces significant challenges without training data, model tuning, or architectural modifications. Existing approaches require prompt tuning or architectural adaptations, limiting zero-shot applicability. Our work proposes a novel solution treating VLMs as black boxes, leveraging scores without training data or ground truth. Using large language model insights on object co-occurrence, we introduce compound prompts grounded in realistic object combinations. Analysis of these prompt scores reveals VLM biases and "AND"/"OR" signal ambiguities, notably that maximum compound scores are surprisingly suboptimal compared to second-highest scores. We address these through a debiasing and score-fusion algorithm that corrects image bias and clarifies VLM response behaviors. Our method enhances other zero-shot approaches, consistently improving their results. Experiments show superior mean Average Precision (mAP) compared to methods requiring training data, achieved through refined object ranking for robust zero-shot MLR.
Poster
Haozhen Zhang · Zhaogeng Liu · Hualin Zhang · Xingchen Li · Wanli Shi · Bin Gu · Yi Chang

[ ExHall D ]

Abstract
Visual Prompt Learning (VPL) has emerged as a powerful strategy for harnessing the capabilities of large-scale pre-trained models (PTMs) to tackle specific downstream tasks. However, the opaque nature of PTMs in many real-world applications has led to a growing interest in gradient-free approaches within VPL. A significant challenge with existing black-box VPL methods lies in the high dimensionality of visual prompts, which necessitates considerable API queries for tuning, thereby impacting efficiency. To address this issue, we propose a novel query-efficient framework for black-box visual prompting, designed to generate input-dependent visual prompts efficiently for large-scale black-box PTMs. Our framework is built upon the insight of reparameterizing prompts using neural networks, improving the typical pre-training-fine-tuning paradigm through the subspace learning strategy to maximize efficiency and adaptability from the perspective of both initial weights and parameter dimensionality. This tuning intrinsically optimizes low-dimensional representations within the well-learned subspace, enabling the efficient adaptation of the network to downstream tasks. Our approach significantly reduces the necessity for substantial API queries to PTMs, presenting an efficient method for leveraging large-scale black-box PTMs in visual prompting tasks. Most experimental results across various benchmarks demonstrate the effectiveness of our method, showcasing substantial reductions in the number of required API queries to …
Poster
Xueyu Liu · Rui Wang · Yexin Lai · Guangze Shi · Feixue Shao · Fang Hao · Jianan Zhang · Jia Shen · Yongfei Wu · Wen Zheng

[ ExHall D ]

Abstract
Powered by extensive curated training data, the Segment Anything Model (SAM) demonstrates impressive generalization capabilities in open-world scenarios, effectively guided by user-provided prompts. However, the class-agnostic characteristic of SAM renders its segmentation accuracy highly dependent on prompt quality. In this paper, we propose a novel plug-and-play dual-space Point Prompt Optimizer (PPO) designed to enhance prompt distribution through deep reinforcement learning (DRL)-based heterogeneous graph optimization. PPO optimizes initial prompts for any task without requiring additional training, thereby improving SAM’s downstream segmentation performance. Specifically, PPO constructs a dual-space heterogeneous graph, leveraging the robust feature-matching capabilities of a foundational pre-trained model to create internal feature and physical distance matrices. A DRL policy network iteratively refines the distribution of prompt points, optimizing segmentation predictions. We conducted experiments on four public datasets. The ablation study explores the necessity and balance of optimizing prompts in both feature and physical spaces. The comparative study shows that PPO enables SAM to surpass recent one-shot methods. Additionally, experiments with different initial prompts demonstrate PPO's generality across prompts generated by various methods. In conclusion, PPO redefines the prompt optimization problem as a heterogeneous graph optimization task, using DRL to construct an effective, plug-and-play prompt optimizer. This approach holds potential for …
Poster
Xueyi Ke · Satoshi Tsutsui · Yayun Zhang · Bihan Wen

[ ExHall D ]

Abstract
Infants develop complex visual understanding rapidly, even preceding the acquisition of linguistic inputs. As computer vision seeks to replicate the human vision system, understanding infant visual development may offer valuable insights. In this paper, we present an interdisciplinary study exploring this question: can a computational model that imitates the infant learning process develop broader visual concepts that extend beyond the vocabulary it has heard, similar to how infants naturally learn? To investigate this, we analyze a recently published model in Science by Vong et al., which is trained on longitudinal, egocentric images of a single child paired with transcribed parental speech. We introduce a training-free framework that can discover visual concept neurons hidden in the model's internal representations. Our findings show that these neurons can classify objects outside the model's original vocabulary. Furthermore, we compare the visual representations in infant-like models with those in modern computer vision models, such as CLIP or ImageNet pre-trained models, highlighting key similarities and differences. Ultimately, our work bridges cognitive science and computer vision by analyzing the internal representations of a computational model trained on an infant's visual and linguistic inputs.
Poster
Li Ren · Chen Chen · Liqiang Wang · Kien A. Hua

[ ExHall D ]

Abstract
Visual Prompt Tuning (VPT) has become a promising Parameter-Efficient Fine-Tuning (PEFT) approach for Vision Transformer (ViT) models, partially fine-tuning learnable tokens while keeping most model parameters frozen. Recent research has explored modifying the connection structures of the prompts. However, the fundamental correlation and distribution between the prompts and image tokens remain unexplored. In this paper, we leverage _metric learning_ techniques to investigate how the distribution of prompts affects fine-tuning performance. Specifically, we propose a novel framework, **D**istribution **A**ware **V**isual **P**rompt Tuning (DA-VPT), to guide the distributions of the prompts by learning the distance metric from their class-related semantic data. Our method demonstrates that the prompts can serve as an effective bridge to share semantic information between image patches and the class token. We extensively evaluated our approach on popular benchmarks in both recognition and segmentation tasks. The results demonstrate that our approach enables more effective and efficient fine-tuning of ViT models by leveraging semantic information to guide the learning of the prompts, leading to improved performance on various downstream vision tasks. The code will be released.
Poster
wenlong yu · Qilong Wang · Chuang Liu · Dong Li · Qinghua Hu

[ ExHall D ]

Abstract
Explainability is a critical factor influencing the wide deployment of deep vision models (DVMs). Concept-based post-hoc explanation methods can provide both global and local insights into model decisions. However, current methods in this field struggle to automatically construct accurate and sufficient linguistic explanations for global concepts and local circuits. In particular, the intrinsic polysemanticity of semantic Visual Concepts (VCs) impedes the interpretability of concepts and DVMs, and has been severely underestimated. In this paper, we propose a Chain-of-Explanation (CoE) approach to address these issues. Specifically, CoE automates the decoding and description of VCs to construct global concept explanation datasets. Further, to alleviate the effect of polysemanticity on model explainability, we design a concept polysemanticity disentanglement and filtering mechanism to distinguish the most contextually relevant concept atoms. Besides, a Concept Polysemanticity Entropy (CPE), as a measure of model interpretability, is formulated to quantify the degree of concept uncertainty, upgrading the modeling of deterministic concepts to uncertain concept atom distributions. Ultimately, CoE automatically enables linguistic local explanations of the decision-making process of DVMs by tracing the concept circuit. GPT-4o and human-based experiments demonstrate the effectiveness of CPE and the superiority of CoE, achieving an average absolute improvement …
Poster
Arpita Chowdhury · Dipanjyoti Paul · Zheda Mai · Jianyang Gu · Ziheng Zhang · Kazi Sajeed Mehrab · Elizabeth Campolongo · Daniel Rubenstein · Charles Stewart · Anuj Karpatne · Tanya Berger-Wolf · Yu Su · Wei-Lun Chao

[ ExHall D ]

Abstract
We present a simple usage of pre-trained Vision Transformers (ViTs) for fine-grained analysis, aiming to identify and localize the traits that distinguish visually similar categories (e.g., different bird species or dog breeds). Pre-trained ViTs such as DINO have shown remarkable capabilities to extract localized, informative features. However, saliency maps like Grad-CAM can hardly point out the traits: they often locate the whole object with a blurred, coarse heatmap, not the traits. We propose a novel approach, **Prompt Class Attention Map (Prompt-CAM)**, to the rescue. Prompt-CAM learns class-specific prompts for a pre-trained ViT and uses the corresponding outputs for classification. To classify an image correctly, the true-class prompt must attend to the unique image patches not seen in other classes' images, i.e., traits. As such, the true class's multi-head attention maps reveal traits and their locations. Implementation-wise, Prompt-CAM is almost a free lunch, obtained by simply modifying the prediction head of Visual Prompt Tuning (VPT). This makes Prompt-CAM fairly easy to train and apply, sharply contrasting other interpretable methods that design specific models and training processes. It is even simpler than the recently published INterpretable TRansformer (INTR), whose encoder-decoder architecture prevents it from leveraging pre-trained ViTs. Extensive empirical studies on a dozen datasets …
Poster
Aaron Serianni · Tyler Zhu · Olga Russakovsky · Vikram V. Ramaswamy

[ ExHall D ]

Abstract
Computer vision models have been shown to exhibit and amplify biases across a wide array of datasets and tasks. Existing methods for quantifying bias in classification models primarily focus on dataset distribution and model performance on subgroups, overlooking the internal workings of a model. We introduce the Attention-IoU (Attention Intersection over Union) metric and related scores, which use attention maps to reveal biases within a model's internal representations and identify image features potentially causing the biases. First, we validate Attention-IoU on the synthetic Waterbirds dataset, showing that the metric accurately measures model bias. We then analyze the CelebA dataset, finding that Attention-IoU uncovers correlations beyond accuracy disparities. By investigating individual attributes with respect to the protected attribute Male, we examine the distinct ways biases are represented in CelebA. Lastly, by subsampling the training set to change attribute correlations, we demonstrate that Attention-IoU reveals potential confounding variables not present in dataset labels.
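As a rough illustration of an attention-map IoU, here is a hedged sketch of a soft IoU between two normalized heatmaps; the paper's exact normalization, thresholding, and dataset-level aggregation may differ, and `attention_iou` is a hypothetical name.

```python
import torch

def attention_iou(map_a: torch.Tensor, map_b: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Hedged sketch of a soft IoU between two attention heatmaps; the paper's exact
    normalization and aggregation over a dataset may differ."""
    a = map_a / (map_a.amax(dim=(-2, -1), keepdim=True) + eps)   # scale each map to [0, 1]
    b = map_b / (map_b.amax(dim=(-2, -1), keepdim=True) + eps)
    intersection = torch.minimum(a, b).sum(dim=(-2, -1))
    union = torch.maximum(a, b).sum(dim=(-2, -1))
    return intersection / (union + eps)

# Example: compare an attribute's attention maps against ground-truth masks for a batch.
heatmaps = torch.rand(8, 14, 14)
masks = torch.rand(8, 14, 14)
print(attention_iou(heatmaps, masks).mean())
```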
Poster
Guangda Ji · Silvan Weder · Francis Engelmann · Marc Pollefeys · Hermann Blum

[ ExHall D ]

Abstract
Neural network performance scales with both model size and data volume, as shown in language and image processing. This requires scaling-friendly architectures and large datasets. While transformers have been adapted for 3D vision, a 'GPT-moment' remains elusive due to limited training data. We introduce ARKit LabelMaker, the first large-scale, real-world 3D dataset with dense semantic annotations. Specifically, we enhance ARKitScenes with automatically generated dense labels using an extended LabelMaker pipeline, tailored for large-scale pre-training. Training on this dataset improves accuracy across architectures, achieving state-of-the-art results on ScanNet and ScanNet200, with notable gains on tail classes. We compare our results with self-supervised methods and synthetic data, evaluating the effects on downstream tasks and zero-shot generalization. The dataset will be publicly available.
Poster
Andrey Gizdov · Shimon Ullman · Daniel Harari

[ ExHall D ]

Abstract
Large multimodal models (LMMs) typically process visual inputs with uniform resolution across the entire field of view, leading to inefficiencies when non-critical image regions are processed as precisely as key areas. Inspired by the human visual system's foveated approach, we apply a biologically inspired method to leading architectures such as MDETR, BLIP2, InstructBLIP, LLaVA, and ViLT, and evaluate their performance with variable resolution inputs. Results show that foveated sampling boosts accuracy in visual tasks like question answering and object detection under tight pixel budgets, improving performance by up to 2.7% on the GQA dataset, 2.1% on SEED-Bench, and 2.0% on VQAv2 compared to uniform sampling. Furthermore, our research indicates that indiscriminate resolution increases yield diminishing returns, with models achieving up to 80% of their full capability using just 3% of the pixels, even on complex tasks. Foveated sampling also prompts more human-like processing within models, such as neuronal selectivity and globally-acting self-attention in vision transformers. This paper provides a foundational analysis of foveated sampling's impact on existing models, suggesting that more efficient architectural adaptations mimicking human visual processing are a promising research avenue for the community.
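A minimal sketch of foveated sampling under stated assumptions (full resolution inside a square fovea, a uniformly downsampled periphery); the models in the paper may use a smoother, biologically motivated falloff, and the function name `foveate` is a placeholder.

```python
import torch
import torch.nn.functional as F

def foveate(image: torch.Tensor, fixation: tuple, fovea_radius: int, periphery_scale: int = 4) -> torch.Tensor:
    """Toy foveated sampler (an assumption, not the paper's scheme): the periphery is
    blurred by down/upsampling while a window around the fixation keeps full detail."""
    _, _, H, W = image.shape
    periphery = F.interpolate(image, scale_factor=1.0 / periphery_scale, mode="bilinear", align_corners=False)
    periphery = F.interpolate(periphery, size=(H, W), mode="bilinear", align_corners=False)
    cy, cx = fixation
    y0, y1 = max(0, cy - fovea_radius), min(H, cy + fovea_radius)
    x0, x1 = max(0, cx - fovea_radius), min(W, cx + fovea_radius)
    out = periphery.clone()
    out[:, :, y0:y1, x0:x1] = image[:, :, y0:y1, x0:x1]   # paste back the full-resolution fovea
    return out

# Usage: foveate a 224x224 image around its center before feeding it to a VLM.
img = torch.rand(1, 3, 224, 224)
foveated = foveate(img, fixation=(112, 112), fovea_radius=48)
```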
Poster
Weiming Zhuang · Chen Chen · Zhizhong Li · Sina Sajadmanesh · Jingtao Li · Jiabo Huang · Vikash Sehwag · Vivek Sharma · Hirotaka Shinozaki · Felan Carlo Garcia · Yihao Zhan · Naohiro Adachi · Ryoji Eki · Michael Spranger · Peter Stone · Lingjuan Lyu

[ ExHall D ]

Abstract
While existing vision and multi-modal foundation models can handle multiple computer vision tasks, they often suffer from significant limitations, including huge demand for data and computational resources during training and inconsistent performance across vision tasks at deployment time. To address these challenges, we introduce Argus (named after Argus Panoptes, the hundred-eyed, "all-seeing" giant of Greek mythology), a compact and versatile vision foundation model designed to support a wide range of vision tasks through a unified multitask architecture. Argus employs a two-stage training strategy: (i) multitask pretraining over core vision tasks with a shared backbone that includes a lightweight adapter to inject task-specific inductive biases, and (ii) scalable and efficient adaptation to new tasks by fine-tuning only the task-specific decoders. Extensive evaluations demonstrate that Argus, despite its relatively compact and training-efficient design of merely 100M backbone parameters (only 13.6\% of which are trained using 1.6M images), competes with and even surpasses much larger models. Compared to state-of-the-art foundation models, Argus not only covers a broader set of vision tasks but also matches or outperforms the models with similar sizes on 12 tasks. We expect that Argus will accelerate the real-world adoption of vision foundation models in resource-constrained scenarios.
Poster
Unki Park · Seongmoon Jeong · Jang Youngchan · Gyeong-Moon Park · Jong Hwan Ko

[ ExHall D ]

Abstract
The field of computer vision initially focused on human visual perception and has progressively expanded to encompass machine vision, with ongoing advancements in technology expected to drive further expansion. Consequently, image compressors must respond effectively not only to human visual perception but also to current and future machine vision tasks. Towards this goal, this paper proposes a Test-Time Fine-Tuning (TTFT) approach for adapting Learned Image Compression (LIC) to multiple tasks. A large-scale LIC model, originally trained for human perception, is adapted to both closed-set and open-set machine tasks through TTFT using Singular Value Decomposition based Low-Rank Adaptation (SVD-LoRA). The SVD-LoRA layers are applied to both the encoder and decoder of the backbone LIC model, with a modified learning scheme employed for the decoder during TTFT to train only the singular values, preventing excessive bitstream overhead. This enables instance-specific optimization for the target task. Experimental results demonstrate that the proposed method effectively adapts the backbone compressor to diverse machine tasks, outperforming competing methods.
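The decoder-side scheme of training only the singular values can be sketched as follows; the rank, the residual handling, and the class name `SVDLoRALinear` are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SVDLoRALinear(nn.Module):
    """Illustrative sketch (not the paper's implementation): factor a frozen linear
    weight by SVD and make only the leading singular values trainable, as in the
    decoder-side scheme described above. Rank and residual handling are assumptions."""
    def __init__(self, linear: nn.Linear, rank: int = 8):
        super().__init__()
        U, S, Vh = torch.linalg.svd(linear.weight.data, full_matrices=False)
        self.register_buffer("U", U[:, :rank])                  # frozen left singular vectors
        self.register_buffer("Vh", Vh[:rank, :])                # frozen right singular vectors
        self.sigma = nn.Parameter(S[:rank].clone())             # trainable singular values only
        self.register_buffer("residual",
                             linear.weight.data - self.U @ torch.diag(S[:rank]) @ self.Vh)
        self.bias = linear.bias

    def forward(self, x):
        weight = self.U @ torch.diag(self.sigma) @ self.Vh + self.residual
        return F.linear(x, weight, self.bias)

# Usage: wrap a decoder projection and fine-tune only the singular values at test time.
layer = SVDLoRALinear(nn.Linear(192, 192), rank=8)
optimizer = torch.optim.Adam([layer.sigma], lr=1e-3)
```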
Poster
Sofia Casarin · Sergio Escalera · Oswald Lanz

[ ExHall D ]

Abstract
Training-free Neural Architecture Search (NAS) efficiently identifies high-performing neural networks using zero-cost (ZC) proxies. Unlike multi-shot and one-shot NAS approaches, ZC-NAS is both (i) time-efficient, eliminating the need for model training, and (ii) interpretable, with proxy designs often theoretically grounded. Despite rapid developments in the field, current SOTA ZC proxies are typically constrained to well-established convolutional search spaces. With the rise of Large Language Models shaping the future of deep learning, this work extends ZC proxy applicability to Vision Transformers (ViTs). We present a new benchmark using the Autoformer search space evaluated on 6 distinct tasks, and propose Layer-Sample Wise Activation with Gradients information (L-SWAG), a novel, generalizable metric that characterises both convolutional and transformer architectures across 14 tasks. Additionally, previous works highlighted how different proxies contain complementary information, motivating the need for a ML model to identify useful combinations. To further enhance ZC-NAS, we therefore introduce LIBRA-NAS (Low Information gain and Bias Re-Alignment), a method that strategically combines proxies to best represent a specific benchmark. Integrated into the NAS search, LIBRA-NAS outperforms evolution and gradient-based NAS techniques by identifying an architecture with a 17.0% test error on ImageNet1k in just 0.1 GPU days.
Poster
Zekang Yang · Wang ZENG · Sheng Jin · Chen Qian · Ping Luo · Wentao Liu

[ ExHall D ]

Abstract
Designing effective neural architectures poses a significant challenge in deep learning. While Neural Architecture Search (NAS) automates the search for optimal architectures, existing methods are often constrained by predetermined search spaces and may miss critical neural architectures. In this paper, we introduce NADER (Neural Architecture Design via multi-agEnt collaboRation), a novel framework that formulates neural architecture design (NAD) as a LLM-based multi-agent collaboration problem. NADER employs a team of specialized agents to enhance a base architecture through iterative modification. Current LLM-based NAD methods typically operate independently, lacking the ability to learn from past experiences, which results in repeated mistakes and inefficient exploration. To address this issue, we propose the Reflector, which effectively learns from immediate feedback and long-term experiences. Additionally, unlike previous LLM-based methods that use code to represent neural architectures, we utilize a graph-based representation. This approach allows agents to focus on design aspects without being distracted by coding. We demonstrate the effectiveness of NADER in discovering high-performing architectures beyond predetermined search spaces through extensive experiments on benchmark tasks, showcasing its advantages over state-of-the-art methods.
Poster
Minghao Fu · Hao Yu · Jie Shao · Junjie Zhou · Ke Zhu · Jianxin Wu

[ ExHall D ]

Abstract
Deep neural networks, while achieving remarkable success across diverse tasks, demand significant resources, including computation, GPU memory, bandwidth, storage, and energy. Network quantization, as a standard compression and acceleration technique, reduces storage costs and enables potential inference acceleration by discretizing network weights and activations into a finite set of integer values. However, current quantization methods are often complex and sensitive, requiring extensive task-specific hyperparameters, where even a single misconfiguration can impair model performance, limiting generality across different models and tasks. In this paper, we propose Quantization without Tears (QwT), a method that simultaneously achieves quantization speed, accuracy, simplicity, and generality. The key insight of QwT is to incorporate a lightweight additional structure into the quantized network to mitigate information loss during quantization. This structure consists solely of a small set of linear layers, keeping the method simple and efficient. More importantly, it provides a closed-form solution, allowing us to improve accuracy effortlessly in under 2 minutes. Extensive experiments across various vision, language, and multimodal tasks demonstrate that QwT is both highly effective and versatile. In fact, our approach offers a robust solution for network quantization that combines simplicity, accuracy, and adaptability, which provides new insights for the design of novel quantization …
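Since the abstract highlights a closed-form solution built from a small set of linear layers, here is a hedged sketch of fitting one such compensation layer by ridge regression on a calibration batch; the shapes, the ridge term, and the function name `fit_compensation` are assumptions, not the paper's exact recipe.

```python
import torch

def fit_compensation(z_quant: torch.Tensor, z_full: torch.Tensor, ridge: float = 1e-4):
    """Sketch of one closed-form compensation layer: ridge regression mapping quantized
    features to the quantization residual, fit on a calibration batch.
    z_quant, z_full: (N, D) activations from the quantized and full-precision blocks."""
    residual = z_full - z_quant
    X = torch.cat([z_quant, torch.ones(z_quant.shape[0], 1)], dim=1)   # append a bias column
    A = X.T @ X + ridge * torch.eye(X.shape[1])                        # (X^T X + ridge * I)
    W = torch.linalg.solve(A, X.T @ residual)                          # normal equations, closed form
    weight, bias = W[:-1].T, W[-1]
    return weight, bias

# Usage on a toy calibration set: add the predicted residual back to the quantized output.
z_fp = torch.randn(512, 64)
z_q = z_fp + 0.05 * torch.randn(512, 64)    # stand-in for the quantized block's activations
w, b = fit_compensation(z_q, z_fp)
compensated = z_q + (z_q @ w.T + b)
```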
Poster
Hongjun Wang · Wonmin Byeon · Jiarui Xu · Jinwei Gu · Ka Chun Cheung · Jan Kautz · Xiaolong Wang · Kai Han · Sifei Liu

[ ExHall D ]

Abstract
We present the Generalized Spatial Propagation Network (GSPN), a new attention mechanism optimized for vision tasks that inherently captures 2D spatial structures. Existing attention models, including transformers, linear attention, and state-space models like Mamba, process multi-dimensional data as 1D sequences, compromising spatial coherence and efficiency. GSPN overcomes these limitations by directly operating on spatially coherent image data and forming dense pairwise connections through a unique line-scan approach. Central to GSPN is the Stability-Context Condition, which ensures stable, context-aware propagation across 2D sequences and reduces the effective sequence length to N, significantly enhancing computational efficiency. With learnable, input-dependent weights and no reliance on positional embeddings, GSPN achieves superior spatial fidelity and state-of-the-art performance in vision tasks, including ImageNet classification, class-guided image generation, and text-to-image generation. Notably, GSPN accelerates SD-XL with softmax-attention by over 84× when generating 16K images. Codes will be released upon publication.
Poster
Weihao Yu · Xinchao Wang

[ ExHall D ]

Abstract
Mamba, an architecture with RNN-like token mixer of state space model (SSM), was recently introduced to address the quadratic complexity of the attention mechanism and subsequently applied to vision tasks. Nevertheless, the performance of Mamba for vision is often underwhelming when compared with convolutional and attention-based models. In this paper, we delve into the essence of Mamba, and conceptually conclude that Mamba is ideally suited for tasks with long-sequence and autoregressive characteristics. For vision tasks, as image classification on ImageNet does not align with either characteristic, we hypothesize that Mamba is not necessary for this task; Detection and segmentation tasks on COCO or ADE20K are also not autoregressive, yet they adhere to the long-sequence characteristic, so we believe it is still worthwhile to explore Mamba's potential for these tasks. To empirically verify our hypotheses, we construct a series of models named MambaOut through stacking Mamba blocks while removing their core token mixer, SSM. Experimental results strongly support our hypotheses. Specifically, our MambaOut model surpasses all visual Mamba models on ImageNet image classification, indicating that Mamba is indeed unnecessary for this task. As for detection and segmentation, MambaOut cannot match the performance of state-of-the-art visual Mamba models, demonstrating the potential of …
Poster
Haoyang He · Jiangning Zhang · Yuxuan Cai · Hongxu Chen · Xiaobin Hu · Zhenye Gan · Yabiao Wang · Chengjie Wang · Yunsheng Wu · Lei Xie

[ ExHall D ]

Abstract
Previous research on lightweight models has primarily focused on CNNs and Transformer-based designs. CNNs, with their local receptive fields, struggle to capture long-range dependencies, while Transformers, despite their global modeling capabilities, are limited by quadratic computational complexity in high-resolution scenarios. Recently, state-space models have gained popularity in the visual domain due to their linear computational complexity. Despite their low FLOPs, current lightweight Mamba-based models exhibit suboptimal throughput. In this work, we propose the MobileMamba framework, which balances efficiency and performance. We design a three-stage network to enhance inference speed significantly. At a fine-grained level, we introduce the Multi-Receptive Field Feature Interaction (MRFFI) module, comprising the Long-Range Wavelet Transform-Enhanced Mamba (WTE-Mamba), Efficient Multi-Kernel Depthwise Deconvolution (MK-DeConv), and Eliminate Redundant Identity components. This module integrates multi-receptive-field information and enhances high-frequency detail extraction. Additionally, we employ training and testing strategies to further improve performance and efficiency. MobileMamba achieves up to 83.6% Top-1 accuracy, surpassing existing state-of-the-art methods, while running up to 21× faster than LocalVim on GPU. Extensive experiments on high-resolution downstream tasks demonstrate that MobileMamba surpasses current efficient models, achieving an optimal balance between speed and accuracy.
Poster
Hao Yu · Tangyu Jiang · Shuning Jia · Shannan Yan · Shunning Liu · Haolong Qian · Guanghao Li · Shuting Dong · Chun Yuan

[ ExHall D ]

Abstract
The Transformer architecture has revolutionized numerous fields since it was proposed, and its effectiveness largely depends on the ability to encode positional information. Traditional position encoding methods exhibit significant limitations due to their lack of robustness and flexibility. Rotary Positional Encoding (RoPE) was proposed to alleviate these issues by integrating positional information through rotations of the embeddings in the attention mechanism. However, RoPE requires manually defined rotation matrices with a limited transformation space, constraining the model's capacity. In this work, we propose **ComRoPE**, which generalizes **RoPE** by defining it in terms of trainable **com**muting angle matrices. Specifically, we demonstrate that pairwise commutativity of these matrices is essential for RoPE to achieve scalability and positional robustness. We formally define the RoPE Equation, an essential condition that ensures consistent performance under position offsets. Based on this theoretical analysis, we present two types of trainable commuting angle matrices as sufficient solutions to the RoPE Equation, which significantly improve performance, surpassing the current state-of-the-art method by 1.6% at training resolution and 2.9% at higher resolution on the ImageNet-1K dataset. Furthermore, our framework generalizes to existing RoPE formulations and offers new insights for future positional encoding research. To ensure reproducibility, the source code and instructions are available …
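For readers unfamiliar with the consistency requirement the abstract calls the RoPE Equation, the following LaTeX sketch paraphrases it under standard RoPE conventions; the exact statement and the parameterization of the trainable matrices in ComRoPE may differ.

```latex
% Paraphrased sketch (not the paper's exact formulation).
% Positions p \in \mathbb{R}^k are encoded by rotations generated by angle matrices A_1, \dots, A_k:
R(p) = \exp\!\Big(\sum_{i=1}^{k} p_i A_i\Big), \qquad A_i^\top = -A_i .
% Attention scores depend only on relative position when
R(p)^\top R(q) = R(q - p) \quad \text{for all } p, q,
% which is guaranteed whenever the generators pairwise commute:
A_i A_j = A_j A_i \quad \text{for all } i, j .
% Classic RoPE fixes the A_i as 2x2 block-diagonal rotation generators;
% ComRoPE instead makes the commuting A_i trainable.
```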
Poster
Yuwei Sun · Hideya Ochiai · Zhirong Wu · Stephen Lin · Ryota Kanai

[ ExHall D ]

Abstract
Emerging from the pairwise attention in conventional Transformers, there is a growing interest in sparse attention mechanisms that align more closely with localized, contextual learning in the biological brain. Existing studies such as the Coordination method employ iterative cross-attention mechanisms with a bottleneck to enable the sparse association of inputs. However, these methods are parameter-inefficient and fail in more complex relational reasoning tasks. To this end, we propose the Associative Transformer (AiT) to enhance the association among sparsely attended input tokens, improving parameter efficiency and performance in various classification and relational reasoning tasks. AiT leverages a learnable explicit memory comprising specialized priors that guide bottleneck attentions to facilitate the extraction of diverse localized tokens. Moreover, AiT employs an associative memory-based token reconstruction using a Hopfield energy function. Extensive empirical experiments demonstrate that AiT requires significantly fewer parameters and attention layers while outperforming a broad range of sparse Transformer models. Additionally, AiT outperforms SOTA sparse Transformer models, including the Coordination method, on the Sort-of-CLEVR dataset.
Poster
Jon Donnelly · Zhicheng Guo · Alina Jade Barnett · Hayden McTavish · Chaofan Chen · Cynthia Rudin

[ ExHall D ]

Abstract
Interpretability is critical for machine learning models in high-stakes settings because it allows users to verify the model's reasoning. In computer vision, prototypical part models (ProtoPNets) have become the dominant model type to meet this need. Users can easily identify flaws in ProtoPNets, but fixing problems in a ProtoPNet requires slow, difficult retraining that is not guaranteed to resolve the issue. This problem is called the "interaction bottleneck." We solve the interaction bottleneck for ProtoPNets by simultaneously finding many equally good ProtoPNets (i.e., a draw from a "Rashomon set"). We show that our framework -- called Proto-RSet -- quickly produces many accurate, diverse ProtoPNets, allowing users to correct problems in real time while maintaining performance guarantees with respect to the training set. We demonstrate the utility of this method in two settings: 1) removing synthetic bias introduced to a bird-identification model and 2) debugging a skin cancer identification model. This tool empowers non-machine-learning experts, such as clinicians or domain experts, to quickly refine and correct machine learning models _without_ repeated retraining by machine learning experts.
Poster
Xin Lin · Chong Shi · Zuopeng Yang · Haojin Tang · Zhili Zhou

[ ExHall D ]

Abstract
Recent open-vocabulary human-object interaction (OV-HOI) detection methods primarily rely on large language models (LLMs) to generate auxiliary descriptions and leverage knowledge distilled from CLIP to detect unseen interaction categories. Despite their effectiveness, these methods face two challenges: (1) feature granularity deficiency, due to reliance on last-layer visual features for text alignment, leading to the neglect of crucial object-level details from intermediate layers; (2) semantic similarity confusion, resulting from CLIP's inherent biases toward certain classes, while LLM-generated descriptions based solely on labels fail to adequately capture inter-class similarities. To address these challenges, we propose a stratified granular comparison network. First, we introduce a granularity sensing alignment module that aggregates global semantic features with local details, refining interaction representations and ensuring robust alignment between intermediate visual features and text embeddings. Second, we develop a hierarchical group comparison module that recursively compares and groups classes using LLMs, generating fine-grained and discriminative descriptions for each interaction category. Experimental results on two widely used benchmark datasets, SWIG-HOI and HICO-DET, demonstrate that our method achieves state-of-the-art results in OV-HOI detection. Codes will be released on GitHub.
Poster
Kiet A. Nguyen · Adheesh Juvekar · Tianjiao Yu · Muntasir Wahed · Ismini Lourentzou

[ ExHall D ]

Abstract
Recent advances in Large Vision-Language Models (LVLMs) have sparked significant progress in general-purpose vision tasks through visual instruction tuning. While some works have demonstrated the capability of LVLMs to generate segmentation masks that align phrases with natural language descriptions in a single image, they struggle with segmentation-grounded comparisons across multiple images, particularly at finer granularities such as object parts. In this paper, we introduce the new task of *part-focused semantic co-segmentation*, which seeks to identify and segment common and unique objects and parts across multiple images. To address this task, we present CALICO, the first LVLM that can segment and reason over multiple masks across images, enabling object comparison based on their constituent parts. CALICO features two novel components: a Correspondence Extraction Module, which captures semantic-rich information to identify part-level correspondences between objects, and a Correspondence Adaptation Module, which embeds this information into the LLM and facilitates multi-image understanding in a parameter-efficient manner. To support training and evaluation, we curate MIXEDPARTS, a comprehensive multi-image segmentation dataset containing 2.4M samples across 44K images with diverse object and part categories. Experimental results show CALICO, finetuned on only 0.3\% of its architecture, achieves robust performance in part-focused semantic co-segmentation. Code, models, and …
Poster
Zelin Peng · Zhengqin Xu · Zhilin Zeng · Changsong Wen · Yu Huang · Menglin Yang · feilong tang · Wei Shen

[ ExHall D ]

Abstract
CLIP, a foundational vision-language model, has emerged as a powerful tool for open-vocabulary semantic segmentation. While freezing the text encoder preserves its powerful embeddings, recent studies show that fine-tuning both the text and image encoders jointly significantly enhances segmentation performance, especially for classes from open sets. In this work, we explain this phenomenon from the perspective of hierarchical alignment: during fine-tuning, the hierarchy level of image embeddings shifts from image-level to pixel-level. We achieve this by leveraging hyperbolic space, which naturally encodes hierarchical structures. Our key observation is that, during fine-tuning, the hyperbolic radius of CLIP’s text embeddings decreases, facilitating better alignment with the pixel-level hierarchical structure of visual data. Building on this insight, we propose HyperCLIP, a novel fine-tuning strategy that adjusts the hyperbolic radius of the text embeddings through scaling transformations. By doing so, HyperCLIP equips CLIP with segmentation capability while introducing only a small number of learnable parameters. Our experiments demonstrate that HyperCLIP achieves state-of-the-art performance on open-vocabulary semantic segmentation tasks across three benchmarks, while fine-tuning only approximately 4\% of the total parameters of CLIP. More importantly, we observe that after adjustment, CLIP's text embeddings exhibit a relatively fixed hyperbolic radius across datasets, suggesting that the …
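The scaling transformation on the text embeddings can be pictured with the toy module below; it is only a sketch, assuming a single learnable scalar applied before any hyperbolic mapping, and the class name `RadiusScaler` is hypothetical rather than HyperCLIP's actual design.

```python
import torch
import torch.nn as nn

class RadiusScaler(nn.Module):
    """Toy sketch: adjust the hyperbolic radius of text embeddings with one learnable
    scale applied to their Euclidean representation. The single-scalar parameterization
    is an assumption, not HyperCLIP's exact design."""
    def __init__(self, init_scale: float = 1.0):
        super().__init__()
        self.log_scale = nn.Parameter(torch.tensor(float(init_scale)).log())

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        # Under the Poincare-ball exponential map, shrinking an embedding's Euclidean norm
        # reduces its hyperbolic radius, moving it toward more generic hierarchy levels.
        return self.log_scale.exp() * text_emb

# Usage: rescale frozen CLIP text embeddings while training only a handful of parameters.
scaler = RadiusScaler(init_scale=0.8)
text_features = torch.randn(80, 512)
scaled = scaler(text_features)
```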
Poster
Pei Wang · Zhaowei Cai · Hao Yang · Ashwin Swaminathan · R. Manmatha · Stefano Soatto

[ ExHall D ]

Abstract
Traditional segmentation models, while effective in isolated tasks, often fail to generalize to more complex and open-ended segmentation problems, such as free-form, open-vocabulary, and in-the-wild scenarios. To bridge this gap, we propose to scale up image segmentation across diverse datasets and tasks so that knowledge from different tasks and datasets can be integrated while improving generalization ability. QueryMeldNet, a novel segmentation framework, is introduced and designed to scale seamlessly across both data size and task diversity. It is built upon a dynamic object query mechanism called query meld, which fuses different types of queries using cross-attention. This hybrid approach enables the model to balance between instance- and stuff-level segmentation, providing enhanced scalability for handling diverse object types. We further enhance scalability by leveraging synthetic data, generating segmentation masks and captions for pixel-level and open-vocabulary tasks, and drastically reducing the need for costly human annotations. By training on multiple datasets and tasks at scale, QueryMeldNet continuously improves performance as the volume and diversity of data and tasks increase. It exhibits strong generalization capabilities, boosting performance on the open-set segmentation benchmark SeginW by 7 points. These advancements mark a key step toward universal, scalable segmentation models capable of addressing the demands of real-world applications.
Poster
Amin Karimi · Charalambos Poullis

[ ExHall D ]

Abstract
Few-shot semantic segmentation (FSS) aims to enable models to segment novel/unseen object classes using only a limited number of labeled examples. However, current FSS methods frequently struggle with generalization due to incomplete and biased feature representations, especially when support images do not capture the full appearance variability of the target class. To improve the FSS pipeline, we propose a novel framework that utilizes large language models (LLMs) to adapt general class semantic information to the query image. Furthermore, the framework employs dense pixel-wise matching to identify similarities between query and support images, resulting in enhanced FSS performance. Inspired by reasoning-based segmentation frameworks, our method, named DSV-LFS, introduces an additional token into the LLM vocabulary, allowing a multimodal LLM to generate a "semantic prompt" from class descriptions. In parallel, a dense matching module identifies visual similarities between the query and support images, generating a "visual prompt". These prompts are then jointly employed to guide the prompt-based decoder for accurate segmentation of the query image. Comprehensive experiments on the benchmark datasets Pascal-5i and COCO-20i demonstrate that our framework achieves state-of-the-art performance by a significant margin, demonstrating superior generalization to novel classes and robustness across diverse scenarios.
Poster
Yuchen Zhu · Cheng Shi · Dingyou Wang · Jiajin Tang · Zhengxuan Wei · Yu Wu · Guanbin Li · Sibei Yang

[ ExHall D ]

Abstract
Class-incremental/Continual image segmentation (CIS) aims to train an image segmenter in stages, where the set of available categories differs at each stage. To leverage the built-in objectness of query-based transformers, which mitigates catastrophic forgetting of mask proposals, current methods often decouple mask generation from the continual learning process. This study, however, identifies two key issues with decoupled frameworks: loss of plasticity and heavy reliance on input data order. To address these, we conduct an in-depth investigation of the built-in objectness and find that highly aggregated image features provide a shortcut for queries to generate masks through simple feature alignment. Based on this, we propose SimCIS, a simple yet powerful baseline for CIS. Its core idea is to directly select image features for query assignment, ensuring "perfect alignment" to preserve objectness, while simultaneously allowing queries to select new classes to promote plasticity. To further combat catastrophic forgetting of categories, we introduce cross-stage consistency in selection and an innovative "visual query"-based replay mechanism. Experiments demonstrate that SimCIS consistently outperforms state-of-the-art methods across various segmentation tasks, settings, splits, and input data orders. All models and codes will be made publicly available.
Poster
Seun-An Choe · Keon Hee Park · Jinwoo Choi · Gyeong-Moon Park

[ ExHall D ]

Abstract
Unsupervised domain adaptation for semantic segmentation (UDA-SS) aims to transfer knowledge from labeled synthetic data (source) to unlabeled real-world data (target). Traditional UDA-SS methods work on the assumption that the category settings between the source and target domains are known in advance. However, in real-world scenarios, the category settings of the source and target are unknown due to the lack of labels in the target, resulting in the existence of target-private or source-private classes. Traditional UDA-SS methods struggle with this change, leading to negative transfer and performance degradation. To address these issues, we propose Universal Domain Adaptation for Semantic Segmentation (UniDA-SS) for the first time, aiming to achieve good performance even when the category settings of source and target are unknown. We identify the core problem in the UniDA-SS scenario as the lowered confidence scores of common classes in the target domain, which leads to confusion with private classes. To solve this problem, we propose two methods. The first method is Domain-Specific Prototype-based Distinction, which divides a class into two prototypes, each with domain-specific features, and then weights the common class based on the observation that a truly common class will be similar to both prototypes. The second method is Target-based …
Poster
Yuhan Liu · Yixiong Zou · Yuhua Li · Ruixuan Li

[ ExHall D ]

Abstract
Cross-Domain Few-Shot Segmentation (CDFSS) is proposed to transfer the pixel-level segmentation capabilities learned from large-scale source-domain datasets to downstream target-domain datasets, with only a few annotated images per class. In this paper, we focus on a well-observed but unresolved phenomenon in CDFSS: for target domains, particularly those distant from the source domain, segmentation performance peaks at the very early epochs and declines sharply as source-domain training proceeds. We delve into this phenomenon for an interpretation: low-level features are vulnerable to domain shifts, leading to sharper loss landscapes during source-domain training, which is the devil of CDFSS. Based on this phenomenon and interpretation, we further propose a method that includes two plug-and-play modules: one to flatten the loss landscapes for low-level features during source-domain training as a novel sharpness-aware minimization method, and the other to directly supplement target-domain information to the model during target-domain testing by low-level-based calibration. Extensive experiments on four target datasets validate our rationale and demonstrate that our method significantly surpasses the state-of-the-art method in CDFSS by 3.71\% and 5.34\% average MIoU in 1-shot and 5-shot scenarios, respectively.
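The first plug-and-play module, a sharpness-aware minimization step restricted to low-level features, might look roughly like the sketch below; the layer prefix, the radius `rho`, and the overall loop structure are assumptions rather than the authors' code.

```python
import torch

def sam_step_low_level(model, loss_fn, batch, optimizer, rho=0.05, prefix="backbone.layer1"):
    """Hedged SAM sketch applied only to low-level parameters (prefix is an assumption):
    ascend within an L2 ball on those parameters, recompute the loss, restore, then update."""
    params = [p for n, p in model.named_parameters() if n.startswith(prefix) and p.requires_grad]

    loss_fn(model, batch).backward()                     # gradients at the current weights
    grads = [p.grad for p in params if p.grad is not None]
    grad_norm = torch.norm(torch.stack([g.norm() for g in grads])) + 1e-12

    eps = []
    with torch.no_grad():                                # ascent step on low-level params only
        for p in params:
            e = rho * p.grad / grad_norm if p.grad is not None else torch.zeros_like(p)
            p.add_(e)
            eps.append(e)
    optimizer.zero_grad()

    loss_fn(model, batch).backward()                     # gradient at the perturbed point
    with torch.no_grad():                                # restore the original low-level weights
        for p, e in zip(params, eps):
            p.sub_(e)
    optimizer.step()
    optimizer.zero_grad()
```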
Poster
Yan Yang · Liyuan Pan · Dongxu Li · Liu Liu

[ ExHall D ]

Abstract
This paper studies zero-shot object recognition using event camera data. Guided by CLIP, which is pre-trained on RGB images, existing approaches achieve zero-shot object recognition by optimizing embedding similarities between event data and RGB images respectively encoded by an event encoder and the CLIP image encoder. Alternatively, several methods learn RGB frame reconstructions from event data for the CLIP image encoder. However, they often result in suboptimal zero-shot performance. This study develops an event encoder without relying on additional reconstruction networks. We theoretically analyze the performance bottlenecks of previous approaches: the embedding optimization objectives are prone to suffer from the spatial sparsity of event data, causing semantic misalignments between the learned event embedding space and the CLIP text embedding space. To mitigate this issue, we explore a scalar-wise modulation strategy. Furthermore, to scale up the number of event and RGB data pairs for training, we also study a pipeline for synthesizing event data from static RGB images en masse. Experimentally, we demonstrate an attractive scaling property in the number of parameters and synthesized data. We achieve superior zero-shot object recognition performance on extensive standard benchmark datasets, even compared with past supervised learning approaches. For example, our model with a ViT/B-16 backbone achieves …
Poster
Xianing Chen · Si Huo · Borui Jiang · Hailin Hu · Xinghao Chen

[ ExHall D ]

Abstract
Few-shot counting estimates the number of target objects in an image using only a few annotated exemplars. However, domain shift severely hinders existing methods from generalizing to unseen scenarios. This falls into the realm of single-domain generalization, which remains unexplored in few-shot counting. To solve this problem, we begin by analyzing the main limitations of current methods, which typically follow a standard pipeline that extracts object prototypes from exemplars and then matches them with image features to construct a correlation map. We argue that existing methods overlook the significance of learning highly generalized prototypes. Building on this insight, we propose the first domain-generalization few-shot counter, Universal Representation Matching, termed URM. Our primary contribution is the discovery that incorporating universal vision-language representations distilled from a large-scale pretrained vision-language model into the correlation construction process substantially improves robustness to domain shifts without compromising in-domain performance. As a result, URM achieves state-of-the-art performance on both the in-domain and the newly introduced domain-generalization settings.
Poster
Hao Tan · Zichang Tan · Jun Li · Ajian Liu · Jun Wan · Zhen Lei

[ ExHall D ]

Abstract
Identifying multiple novel classes in an image, known as open-vocabulary multi-label recognition, is a challenging task in computer vision. Recent studies explore the transfer of powerful vision-language models such as CLIP. However, these approaches face two critical challenges: (1) The local semantics of CLIP are disrupted due to its global pre-training objectives, resulting in unreliable regional predictions. (2) The matching property between image regions and candidate labels has been neglected, relying instead on naive feature aggregation such as average pooling, which leads to spurious predictions from irrelevant regions. In this paper, we present RAM (Recover And Match), a novel framework that effectively addresses the above issues. To tackle the first problem, we propose Ladder Local Adapter (LLA) to enforce the model to refocus on its surrounding local regions, recovering local semantics in a memory-friendly way. For the second issue, we propose Knowledge-Constrained Optimal Transport (KCOT) to suppress meaningless matching to non-GT labels by formulating the task as an optimal transport problem. As a result, RAM achieves state-of-the-art performance on NUS-WIDE, MS-COCO, RAPv1, PA100K, MultiScene and MLRSNet datasets from three distinct domains, and shows great potential to boost the existing methods. Code will be made available.
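The matching step can be pictured as entropic optimal transport; the sketch below uses plain Sinkhorn with uniform marginals and does not model the knowledge constraints the paper adds, so treat it as a simplified stand-in with hypothetical names.

```python
import torch

def sinkhorn_match(sim: torch.Tensor, n_iters: int = 50, eps: float = 0.05) -> torch.Tensor:
    """Simplified stand-in for transport-based region-to-label matching: entropic OT
    with uniform marginals (the paper's knowledge constraints are not modeled).
    sim: (R, L) similarities between R image regions and L candidate labels."""
    R, L = sim.shape
    K = torch.exp(sim / eps)                   # Gibbs kernel
    a = torch.full((R,), 1.0 / R)              # uniform region marginal
    b = torch.full((L,), 1.0 / L)              # uniform label marginal
    v = torch.full((L,), 1.0 / L)
    for _ in range(n_iters):
        u = a / (K @ v + 1e-12)
        v = b / (K.T @ u + 1e-12)
    return torch.diag(u) @ K @ torch.diag(v)   # transport plan: soft region-label assignment

# Usage: weight label scores by the plan instead of average-pooling over regions.
similarities = torch.randn(49, 20)             # e.g., 7x7 regions vs. 20 candidate labels
plan = sinkhorn_match(similarities)
label_scores = (plan * similarities).sum(dim=0)
```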
Poster
Dongseob Kim · Hyunjung Shim

[ ExHall D ]

Abstract
Multi-label classification is crucial for comprehensive image understanding, yet acquiring accurate annotations is challenging and costly. To address this, a recent study suggests exploiting unsupervised multi-label classification leveraging CLIP, a powerful vision-language model. Despite CLIP's proficiency, it suffers from view-dependent predictions and inherent bias, limiting its effectiveness. We propose a novel method that addresses these issues by leveraging multiple views near target objects, guided by Class Activation Mapping (CAM) of the classifier, and debiasing pseudo-labels derived from CLIP predictions. Our Classifier-guided CLIP Distillation (CCD) enables selecting multiple local views without extra labels and debiasing predictions to enhance classification performance. Experimental results validate our method's superiority over existing techniques across diverse datasets. The code will be publicly available.
Poster
Phi Vu Tran

[ ExHall D ]

Abstract
Recent years have witnessed tremendous advances on modern visual recognition systems. Despite such progress, many vision models still struggle with the open problem of learning from few exemplars. This paper focuses on the task of object detection in the setting where object classes follow a natural long-tailed distribution. Existing approaches to long-tailed detection resort to external ImageNet labels to augment the low-shot training instances. However, such dependency on a large labeled database is impractical and has limited utility in realistic scenarios. We propose a more versatile approach to leverage optional unlabeled images, which are easy to collect without the burden of human annotations. Our framework is straightforward and intuitive, and consists of three simple steps: (1) pre-training on abundant head classes; (2) transfer learning on scarce tail classes; and (3) fine-tuning on a sampled set of both head and tail classes. Our approach can be viewed as an improved head-to-tail model transfer paradigm without the added complexities of meta-learning or knowledge distillation, as was required in past research. By harnessing supplementary unlabeled images, without the need for extra image labels, SimLTD establishes new record results on the challenging LVIS v1 benchmark across both supervised and semi-supervised settings.
Poster
Aming Wu · Cheng Deng

[ ExHall D ]

Abstract
To accelerate the safe deployment of object detectors, we focus on reducing the impact of both covariate and semantic shifts. And we consider a realistic yet challenging scenario, namely Open-Domain Unknown Object Detection (ODU-OD), which aims to detect unknown objects in unseen target domains without accessing any auxiliary data. Towards ODU-OD, it is feasible to learn a robust discriminative boundary by synthesizing virtual features. Generally, perception, memory, and imagination are three essential capacities for human beings. Through multi-level perception and rich memory about known objects, the characteristics of unknown objects can be imagined sufficiently, enhancing the ability of discriminating known from unknown objects. Inspired by this idea, an approach of World Feature Simulation (WFS) is proposed, mainly consisting of a multi-level perception, memory recorder, and unknown-feature generator. Specifically, after extracting the features of the input, we separately employ a Mamba and Graph Network to obtain the global-level and connective-level representations. Next, a codebook containing multiple learnable codewords is defined to preserve fragmented memory of known objects. Meanwhile, we perform a modulated operation on the memory to form the imagination bank involving unknown characteristics. Finally, to alleviate the impact of lacking supervision data, based on the multi-level representation and imagination bank, …
Poster
Marc-Antoine Lavoie · Anas Mahmoud · Steven L. Waslander

[ ExHall D ]

Abstract
The current state-of-the-art methods in domain adaptive object detection (DAOD) use Mean Teacher self-labelling, where a teacher model, directly derived as an exponential moving average of the student model, is used to generate labels on the target domain which are then used to improve both models in a positive loop. This couples learning and generating labels on the target domain, and other recent works also leverage the generated labels to add additional domain alignment losses. We believe this coupling is brittle and excessively constrained: there is no guarantee that a student trained only on source data can generate accurate target domain labels and initiate the positive feedback loop, and much better target domain labels can likely be generated by using a large pretrained network that has been exposed to much more data. Vision foundational models are exactly such models, and they have shown impressive task generalization capabilities even when frozen. We want to leverage these models for DAOD and introduce DINO Teacher, which consists of two components. First, we train a new labeller on source data only using a large frozen DINOv2 backbone and show it generates more accurate labels than Mean Teacher. Next, we align the student's source and …
Poster
Zhixiong Nan · Xianghong Li · Tao Xiang · Jifeng Dai

[ ExHall D ]

Abstract
Based on an analysis of the cascaded decoder architecture commonly adopted in existing DETR-like models, this paper proposes a new decoder architecture. The cascaded decoder architecture constrains object queries to update in the cascaded direction, enabling object queries to learn only relatively limited information from image features. However, the challenges of object detection in natural scenes (e.g., objects that are extremely small, heavily occluded, or confusingly mixed with the background) require an object detection model to fully utilize image features, which motivates us to propose a new decoder architecture with a parallel Multi-time Inquiries (MI) mechanism. MI enables object queries to learn more comprehensive information, and our MI-based model, MI-DETR, outperforms all existing DETR-like models on the COCO benchmark under different backbones and training epochs, achieving +2.3 AP and +0.6 AP improvements over the most representative model DINO and the SOTA model Relation-DETR with a ResNet-50 backbone. In addition, a series of diagnostic and visualization experiments demonstrate the effectiveness, rationality, and interpretability of MI.
Poster
Huixin Sun · Runqi Wang · Yanjing Li · Linlin Yang · Shaohui Lin · Xianbin Cao · Baochang Zhang

[ ExHall D ]

Abstract
Deep learning has significantly advanced the object detection field. However, tiny object detection (TOD) remains a challenging problem. We provide a new analysis method that examines the TOD challenge through occlusion-based attribution analysis in the frequency domain. We observe that tiny objects become less distinct after feature encoding and can benefit from the removal of high-frequency information. In this paper, we propose a novel approach named Spectral Enhancement for Tiny object detection (SET), which amplifies the frequency signatures of tiny objects in a heterogeneous architecture. SET includes two modules. The Hierarchical Background Smoothing (HBS) module suppresses high-frequency noise in the background through adaptive smoothing operations. The Adversarial Perturbation Injection (API) module leverages adversarial perturbations to increase feature saliency in critical regions and prompt the refinement of object features during training. Extensive experiments on four datasets demonstrate the effectiveness of our method. In particular, SET boosts the prior art RFLA by 3.2% AP on the AI-TOD dataset.
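The frequency-domain observation above (tiny-object features can benefit from removing high-frequency content) can be probed with a simple radial low-pass filter. The sketch below is an independent illustration with made-up tensor shapes, not the SET implementation.

```python
# Illustrative sketch (not the paper's code): suppress high-frequency content
# of a feature map with a radial low-pass mask in the 2-D FFT domain.
import torch

def radial_lowpass(x: torch.Tensor, keep_ratio: float = 0.25) -> torch.Tensor:
    """x: (B, C, H, W); zero out frequencies outside a centered radius."""
    _, _, H, W = x.shape
    freq = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    yy, xx = torch.meshgrid(torch.arange(H) - H // 2,
                            torch.arange(W) - W // 2, indexing="ij")
    radius = (yy ** 2 + xx ** 2).float().sqrt()
    mask = (radius <= keep_ratio * min(H, W) / 2).float()
    freq = freq * mask
    return torch.fft.ifft2(torch.fft.ifftshift(freq, dim=(-2, -1))).real

feat = torch.randn(2, 64, 32, 32)   # hypothetical backbone feature map
smoothed = radial_lowpass(feat)     # compare detector outputs on both versions
```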
Poster
Wenxi Chen · Raymond A. Yeh · Shaoshuai Mou · Yan Gu

[ ExHall D ]

Abstract
Out-of-distribution (OOD) detection is the task of identifying inputs that deviate from the training data distribution. This capability is essential for the safe deployment of deep computer vision models in open-world environments. In this work, we propose a post-hoc method, Perturbation-Rectified OOD detection (PRO), based on the insight that prediction confidence for OOD inputs is more susceptible to reduction under perturbation than for in-distribution (IND) inputs. From this observation, we propose a meta-score function that searches for local minimum scores near the original inputs by applying gradient descent. This procedure enhances the separability between IND and OOD samples. Importantly, the approach improves OOD detection performance without complex modifications to the underlying model architectures or training protocols. To validate our approach, we conduct extensive experiments using the OpenOOD benchmark. Our approach further pushes the limit of softmax-based OOD detection and is the leading post-hoc method for small-scale models. On a CIFAR-10 model with adversarial training, PRO effectively detects near-OOD inputs, achieving a reduction of more than 10% in FPR@95 compared to state-of-the-art methods.
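A minimal sketch of the perturbation-rectified scoring idea, assuming a max-softmax base score and a generic classifier; step count and step size are illustrative, and this is not the released PRO code.

```python
# Hedged sketch: run a few gradient steps that *decrease* a base OOD score
# (max softmax probability) around the input, then report the minimized score.
# OOD confidence tends to collapse faster than IND confidence.
import torch
import torch.nn.functional as F

def perturbation_rectified_score(model, x, steps: int = 3, step_size: float = 1e-3):
    x = x.clone().detach().requires_grad_(True)
    for _ in range(steps):
        score = F.softmax(model(x), dim=1).max(dim=1).values.sum()
        grad, = torch.autograd.grad(score, x)
        x = (x - step_size * grad.sign()).detach().requires_grad_(True)
    with torch.no_grad():
        return F.softmax(model(x), dim=1).max(dim=1).values  # lower => more OOD-like

# Toy usage with a hypothetical classifier.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
scores = perturbation_rectified_score(model, torch.randn(8, 3, 32, 32))
```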
Poster
Kaichen Yang · Junjie Cao · Zeyu Bai · Zhixun Su · Andrea Tagliasacchi

[ ExHall D ]

Abstract
We introduce the Pose and Illumination agnostic Anomaly Detection (PIAD) problem, a generalization of pose-agnostic anomaly detection (PAD). Being illumination agnostic is critical, as it relaxes the assumption that training data for an object has to be acquired under the same light configuration as the query images that we want to test. Moreover, even if the object is placed within the same capture environment, being illumination agnostic implies that we can relax the assumption that the relative pose between the environment light and the query object has to match the one in the training data. We introduce a new dataset to study this problem, containing both synthetic and real-world examples, propose a new baseline for PIAD, and demonstrate how our baseline provides state-of-the-art results in both PAD and PIAD, not only on the newly proposed dataset but also on existing datasets that were designed for the simpler PAD problem. Source code and data will be made publicly available upon paper acceptance.
Poster
wenxin ma · Xu Zhang · Qingsong Yao · Fenghe Tang · Chenxu Wu · Yingtai Li · Rui Yan · Zihang Jiang · S Kevin Zhou

[ ExHall D ]

Abstract
Anomaly detection (AD) identifies outliers for applications like defect and lesion detection. While CLIP shows promise for zero-shot AD tasks due to its strong generalization capabilities, its inherent Anomaly-Unawareness leads to limited discrimination between normal and abnormal features. To address this problem, we propose Anomaly-Aware CLIP (AA-CLIP), which enhances CLIP's anomaly discrimination ability in both text and visual spaces while preserving its generalization capability. AA-CLIP is achieved by a straightforward and effective two-stage approach: it first creates anomaly-aware text anchors to clearly differentiate normal and abnormal semantics, then aligns patch-level visual features with these anchors for precise anomaly localization. AA-CLIP uses lightweight linear residual adapters to maintain CLIP's generalization and improves AD performance efficiently. Extensive experiments validate AA-CLIP as a resource-efficient solution for zero-shot AD tasks, achieving state-of-the-art results in industrial and medical applications. (code is available in Supplementary Material)
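As a rough illustration of anchor-based patch scoring (with random stand-in tensors rather than the actual CLIP encoders or the AA-CLIP adapters), an anomaly map can be read off from the similarity of patch features to a normal/abnormal anchor pair:

```python
# Illustration only: given patch embeddings and a pair of normal/abnormal text
# anchors, per-patch anomaly probability follows from a temperature-scaled
# softmax over the two cosine similarities.
import torch
import torch.nn.functional as F

patch_feats = F.normalize(torch.randn(1, 196, 512), dim=-1)   # (B, N_patches, D), stand-in
text_anchors = F.normalize(torch.randn(2, 512), dim=-1)       # [normal, abnormal] anchors

logits = patch_feats @ text_anchors.t() / 0.07                 # temperature is illustrative
anomaly_map = logits.softmax(dim=-1)[..., 1].view(1, 14, 14)   # P(abnormal) per patch
```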
Poster
Ziming Huang · Xurui Li · Haotian Liu · Feng Xue · Yuzhe Wang · Yu Zhou

[ ExHall D ]

Abstract
Recently, multi-class anomaly classification has garnered increasing attention. Previous methods directly cluster anomalies but often struggle due to the lack of anomaly-prior knowledge. Acquiring this knowledge faces two issues: non-prominent and weak-semantics anomalies. In this paper, we propose AnomalyNCD, a multi-class anomaly classification network compatible with different anomaly detection methods. To address the non-prominence of anomalies, we design main element binarization (MEBin) to obtain anomaly-centered images, ensuring anomalies are learned while avoiding the impact of incorrect detections. Next, to learn anomalies with weak semantics, we design mask-guided representation learning, which focuses on isolated anomalies guided by masks and reduces confusion from erroneous inputs through re-corrected pseudo labels. Finally, to enable flexible classification at both region and image levels, we develop a region merging strategy that determines the overall image category based on the classified anomaly regions. Our method outperforms the state-of-the-art works on the MVTec AD and MTD datasets. Compared with current methods, AnomalyNCD combined with a zero-shot anomaly detection method achieves a 10.8% F1 gain, 8.8% NMI gain, and 9.5% ARI gain on MVTec AD, and a 12.8% F1 gain, 5.7% NMI gain, and 10.8% ARI gain on MTD. The code will be publicly available.
Poster
Xiaofan Li · Xin Tan · Zhuo Chen · Zhizhong Zhang · Ruixin Zhang · Rizen Guo · Guannan Jiang · Yulong Chen · Yanyun Qu · Lizhuang Ma · Yuan Xie

[ ExHall D ]

Abstract
With the rise of generative models, there is a growing interest in unifying all tasks within a generative framework. Anomaly detection methods also fall into this scope and utilize diffusion models to generate or reconstruct normal samples when given arbitrary anomaly images. However, our study found that the diffusion model suffers from severe faithfulness hallucination and catastrophic forgetting, and thus cannot cope with unpredictable pattern increments. To mitigate these problems, we propose a continual diffusion model that uses gradient projection to achieve stable continual learning. Gradient projection regularizes model updates by modifying the gradient toward directions that protect the learned knowledge. As a double-edged sword, however, it incurs a huge memory cost due to the Markov process. Hence, we propose an iterative singular value decomposition method based on the transitive property of linear representation, which consumes little memory and incurs almost no performance loss. Finally, considering the diffusion model's risk of over-fitting to normal images, we propose an anomaly-masked network to enhance the condition mechanism of the diffusion model. For continual anomaly detection, our method achieves first place in 17 of 18 settings on MVTec and VisA.
Poster
Shibin Mei · Hang Wang · Bingbing Ni

[ ExHall D ]

Abstract
Geodesic distance serves as a reliable means of measuring distance in nonlinear spaces, and such nonlinear manifolds are prevalent in current multimodal learning. In these scenarios, some samples may exhibit high similarity yet convey different semantics, making traditional distance metrics inadequate for distinguishing between positive and negative samples. This paper introduces geodesic distance as a distance metric in multimodal learning for the first time, to mine correlations between samples and address the limitations of common distance metrics. Our approach incorporates a comprehensive series of strategies to adapt geodesic distance to current multimodal learning. Specifically, we construct a graph structure to represent the adjacency relationships among samples by thresholding the distances between them, and then apply a shortest-path algorithm to obtain geodesic distances within this graph. To facilitate efficient computation, we further propose a hierarchical graph structure built through clustering, combined with incremental update strategies for dynamic status updates. Extensive experiments across various downstream tasks validate the effectiveness of our proposed method, demonstrating its capability to capture complex relationships between samples and improve the performance of multimodal learning models.
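The graph construction and shortest-path step described above can be sketched in a few lines (an independent minimal illustration; the hierarchical and incremental components are omitted):

```python
# Minimal sketch: threshold pairwise Euclidean distances to build an adjacency
# graph, then approximate geodesic distance with Dijkstra shortest paths.
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import cdist

def geodesic_distances(feats: np.ndarray, threshold: float) -> np.ndarray:
    d = cdist(feats, feats)                  # dense pairwise Euclidean distances
    adj = np.where(d <= threshold, d, 0.0)   # zeros are treated as non-edges
    return shortest_path(adj, method="D", directed=False)

feats = np.random.randn(100, 32)
geo = geodesic_distances(feats, threshold=3.0)   # np.inf marks disconnected pairs
```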
Poster
Seonggon Kim · Juncheol Shin · Seung-taek Woo · Eunhyeok Park

[ ExHall D ]

Abstract
It has become increasingly important to optimize backpropagation to reduce memory usage and computational overhead. Achieving this goal is highly challenging, as multiple objectives must be considered jointly while maintaining training quality. In this paper, we focus on matrix multiplication, which accounts for the largest portion of training costs, and analyze its backpropagation in detail to identify lightweight techniques that offer the best benefits. Based on this analysis, we introduce a novel method, Hadamard-based Optimized Training (HOT). In this approach, we apply Hadamard-based optimizations, such as Hadamard quantization and Hadamard low-rank approximation, selectively and with awareness of the suitability of each optimization for different backward paths. Additionally, we introduce two enhancements: activation buffer compression and layer-wise quantizer selection. Our extensive analysis shows that HOT achieves up to 75% memory savings and a 2.6× acceleration on real GPUs, with negligible accuracy loss compared to FP32 precision.
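A minimal sketch of Hadamard-based quantization in isolation (the general rotate-then-quantize trick; block size, bit width, and where it sits in the backward pass are assumptions, not the HOT recipe):

```python
# Rotate activations with an orthonormal Hadamard matrix to spread outliers,
# apply symmetric int8 fake-quantization, then rotate back.
import torch
from scipy.linalg import hadamard

def hadamard_quantize(x: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """x: (N, D) with D a power of two."""
    D = x.shape[-1]
    H = torch.tensor(hadamard(D), dtype=x.dtype) / D ** 0.5      # orthonormal rotation
    xr = x @ H
    scale = xr.abs().amax(dim=-1, keepdim=True) / (2 ** (bits - 1) - 1)
    xq = torch.clamp((xr / scale).round(), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return (xq * scale) @ H.t()                                   # dequantize, rotate back

x = torch.randn(4, 128)
x_hat = hadamard_quantize(x)   # small reconstruction error despite 8-bit storage
```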
Poster
Zhiqiang Shen · Ammar Sherif · Zeyuan Yin · Shitong Shao

[ ExHall D ]

Abstract
Recent advances in dataset distillation have led to solutions in two main directions. The conventional batch-to-batch matching mechanism is ideal for small-scale datasets and includes bi-level optimization methods on models and syntheses, such as FRePo, RCIG, and RaT-BPTT, as well as other methods like distribution matching, gradient matching, and weight trajectory matching. Conversely, batch-to-global matching typifies decoupled methods, which are particularly advantageous for large-scale datasets. This approach has garnered substantial interest within the community, as seen in SRe2L, G-VBSM, WMDD, and CDA. A primary challenge with the second approach is the lack of diversity among syntheses within each class, since samples are optimized independently and the same global supervision signals are reused across different synthetic images. In this study, we propose a new Diversity-driven EarlyLate Training (DELT) scheme to enhance the diversity of images in batch-to-global matching with less computation. Our approach is conceptually simple yet effective: it partitions predefined IPC samples into smaller subtasks and employs local optimizations to distill each subset into distributions from distinct phases, reducing the uniformity induced by the unified optimization process. These distilled images from the subtasks demonstrate effective generalization when applied to the entire task. We conduct extensive experiments on CIFAR, Tiny-ImageNet, ImageNet-1K, …
Poster
Jiamu Zhang · Shaochen (Henry) Zhong · Andrew Ye · Zirui Liu · Sebastian Zhao · Kaixiong Zhou · Li Li · Soo-Hyun Choi · Rui Chen · Xia Hu · Shuai Xu · Vipin Chaudhary

[ ExHall D ]

Abstract
Densely structured pruning methods — which generate pruned models in a fully dense format, allowing immediate compression benefits without additional demands — are evolving owing to their practical significance. Traditional techniques in this domain mainly revolve around coarser granularities, such as filter pruning, thereby limiting their performance due to restricted pruning freedom. Recent advancements in Grouped Kernel Pruning (GKP) have enabled the utilization of finer granularity while maintaining the densely structured format. We observe that existing GKP methods often introduce dynamic operations to different aspects of their procedures, many at the cost of added complications and/or imposed limitations, e.g., requiring an expensive mixture of clustering schemes, or having dynamic pruning rates and sizes among groups, which leads to reliance on custom architecture support for the pruned models. In this work, we argue that the best way to introduce such dynamic operations to GKP is to make Conv2d(groups) (a.k.a. the group count) flexible under an integral optimization, leveraging its ideal alignment with the infrastructure support for Grouped Convolution. Pursuing this direction, we present a one-shot, post-train, data-agnostic GKP method that is more performant, adaptive, and efficient than its predecessors, while simultaneously being far more user-friendly with little-to-no hyper-parameter tuning …
Poster
Fu Feng · Yucheng Xie · Jing Wang · Xin Geng

[ ExHall D ]

Abstract
The growing complexity of model parameters underscores the significance of pre-trained models; however, the constraints encountered during deployment necessitate models of variable sizes. Consequently, the traditional pre-training and fine-tuning paradigm fails to address initialization problems when target models are incompatible with pre-trained ones. We tackle this issue from a multitasking perspective and introduce **WAVE**, which incorporates a set of shared **W**eight templates for **A**daptive initialization of **V**ariable-siz**E**d Models. During initialization, target models initialize the corresponding weight scalers tailored to their sizes, which are sufficient to learn the connection rules of the weight templates, based on the Kronecker product, from a limited amount of data. To construct the weight templates, WAVE utilizes the _Learngene_ framework, which condenses size-agnostic knowledge from ancestry models into the weight templates as learngenes through knowledge distillation. This process structures the pre-trained models' knowledge according to the rules of the weight templates. We provide a comprehensive benchmark for learngenes, with extensive experiments demonstrating WAVE's efficiency. Results show that WAVE achieves state-of-the-art performance when initializing models of various depths and widths. With only 10 epochs of training, n variable-sized models initialized by WAVE outperform directly pre-trained models trained over 150 epochs, particularly for smaller models, achieving approximately …
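A hedged sketch of the template-plus-scaler construction as read from the abstract (shapes, template count, and variable names are made up): a target layer's weight is assembled as a sum of Kronecker products between shared templates and per-model learnable scalers.

```python
# Illustration: shared small templates, per-model scalers, Kronecker assembly.
import torch

templates = [torch.randn(16, 16) for _ in range(4)]                  # shared, size-agnostic
scalers = [torch.randn(4, 8, requires_grad=True) for _ in range(4)]  # learned per target size

# Target weight of shape (4*16, 8*16) = (64, 128): sum of Kronecker products.
weight = sum(torch.kron(s, t) for s, t in zip(scalers, templates))
print(weight.shape)  # torch.Size([64, 128])
```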
Poster
Hongxu chen · Zhen Wang · Runshi Li · Bowei Zhu · Long Chen

[ ExHall D ]

Abstract
Low-rank adaptations (LoRA) are widely used to fine-tune large models across various domains for specific downstream tasks. While task-specific LoRAs are often available, concerns about data privacy and intellectual property can restrict access to training data, limiting the acquisition of a multi-task model through gradient-based training. In response, LoRA merging presents an effective solution by combining multiple LoRAs into a unified adapter while maintaining data privacy. Prior works on LoRA merging primarily frame it as an optimization problem, yet these approaches face several limitations, including the rough assumption about the input features utilized in optimization, massive sample requirements, and an unbalanced optimization objective. These limitations can significantly degrade performance. To address these, we propose a novel optimization-based method, named IterIS: 1) We formulate LoRA merging as an advanced optimization problem to mitigate the rough assumption. Additionally, we employ an iterative inference-solving framework in our algorithm, which progressively refines the optimization objective for improved performance. 2) We introduce an efficient regularization term to reduce the need for massive samples (requiring only 1-5% of the unlabeled samples compared to prior methods). 3) We utilize adaptive weights in the optimization objective to mitigate potential imbalances in the LoRA merging process. Our method demonstrates …
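To make the optimization-based merging idea concrete, here is a minimal least-squares sketch under simplifying assumptions (a single linear layer, a closed-form solve instead of the paper's iterative inference-solving loop): find one merged update whose outputs match each task adapter's outputs on a few unlabeled calibration activations.

```python
# Hedged sketch of optimization-based LoRA merging (not the IterIS algorithm).
import torch

def merge_loras(loras, acts, lam: float = 1e-4):
    """loras: list of (A, B) with A: (r, d_in), B: (d_out, r); acts: list of (n_i, d_in)."""
    d_in, d_out = loras[0][0].shape[1], loras[0][1].shape[0]
    G = torch.zeros(d_in, d_in)                   # sum of Gram matrices X_i^T X_i
    R = torch.zeros(d_in, d_out)
    for (A, B), X in zip(loras, acts):
        dW = B @ A                                # task-specific update (d_out, d_in)
        G += X.t() @ X
        R += X.t() @ X @ dW.t()
    return torch.linalg.solve(G + lam * torch.eye(d_in), R).t()   # merged ΔW (d_out, d_in)

loras = [(torch.randn(8, 64), torch.randn(32, 8)) for _ in range(3)]
acts = [torch.randn(100, 64) for _ in range(3)]        # unlabeled calibration inputs
delta_w = merge_loras(loras, acts)                     # applied as W + ΔW at inference
```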
Poster
Qiang Wang · Xiang Song · Yuhang He · Jizhou Han · Chenhao Ding · Xinyuan Gao · Yihong Gong

[ ExHall D ]

Abstract
Deep neural networks (DNNs) often underperform in real-world, dynamic settings where data distributions change over time. Domain Incremental Learning (DIL) offers a solution by enabling continuous model adaptation, with Parameter-Isolation DIL (PIDIL) emerging as a promising paradigm to reduce knowledge conflicts. However, existing PIDIL methods struggle with parameter selection accuracy, especially as the number of domains grows. To address this, we propose SOYO, a lightweight framework that improves domain selection in PIDIL. SOYO introduces a Gaussian Mixture Compressor (GMC) and Domain Feature Resampler (DFR) to store and balance prior domain data efficiently, while a Multi-level Domain Feature Fusion Network (MDFN) enhances domain feature extraction. Our framework supports multiple Parameter-Efficient Fine-Tuning (PEFT) methods and is validated across tasks such as image classification, object detection, and speech enhancement. Experimental results on six benchmarks demonstrate SOYO’s consistent superiority over existing baselines, showcasing its robustness and adaptability in complex, evolving environments. The source code will be released to the community.
Poster
Yuhao Zhou · Yuxin Tian · Jindi Lv · Mingjia Shi · Yuanxi Li · Qing Ye · Shuhao Zhang · Jiancheng Lv

[ ExHall D ]

Abstract
In the realm of high-frequency data streams, achieving real-time learning within varying memory constraints is paramount. This paper presents Ferret, a comprehensive framework designed to enhance the _online accuracy_ of Online Continual Learning (OCL) algorithms while dynamically adapting to varying memory budgets. Ferret employs a fine-grained pipeline parallelism strategy combined with an iterative gradient compensation algorithm, ensuring seamless handling of high-frequency data with minimal latency and effectively counteracting the challenge of stale gradients in parallel training. To adapt to varying memory budgets, its automated model partitioning and pipeline planning optimize performance regardless of memory limitations. Extensive experiments across 20 benchmarks and 5 integrated OCL algorithms show Ferret's remarkable efficiency, achieving up to 3.7× lower memory overhead to reach the same online accuracy compared to competing methods. Furthermore, Ferret consistently outperforms these methods across diverse memory budgets, underscoring its superior adaptability. These findings position Ferret as a premier framework for efficient and adaptive OCL in real-time environments.
Poster
Xiaohan Zou · Wenchao Ma · Shu Zhao

[ ExHall D ]

Abstract
Recent advances in prompt-based learning have significantly improved image and video class-incremental learning. However, the prompts learned by these methods often fail to capture the diverse and informative characteristics of videos, and struggle to generalize effectively to future tasks and classes. To address these challenges, this paper proposes modeling the distribution of space-time prompts conditioned on the input video using a diffusion model. This generative approach allows the proposed model to naturally handle the diverse characteristics of videos, leading to more robust prompt learning and enhanced generalization capabilities. Additionally, we develop a mechanism that transfers the token relationship modeling capabilities of a pre-trained image transformer to spatio-temporal modeling for videos. Our approach has been thoroughly evaluated across four established benchmarks, showing remarkable improvements over existing state-of-the-art methods in video class-incremental learning.
Poster
Hao Yu · Xin Yang · Le Zhang · Hanlin Gu · Tianrui Li · Lixin Fan · Qiang Yang

[ ExHall D ]

Abstract
Federated continual learning (FCL) allows each client to continually update its knowledge from task streams, enhancing the applicability of federated learning in real-world scenarios. However, FCL needs to address not only spatial data heterogeneity between clients but also temporal data heterogeneity between tasks. In this paper, empirical experiments demonstrate that such input-level heterogeneity significantly affects the model's internal parameters and outputs, leading to severe spatial-temporal catastrophic forgetting of previous and local knowledge. To this end, we propose Federated Tail Anchor (FedTA) to mix a trainable Tail Anchor with the frozen output features to adjust their position in the feature space, thereby overcoming parameter forgetting and output forgetting. Three novel components are also included in FedTA: Input Enhancement for improving the performance of pre-trained models on downstream tasks; Selective Input Knowledge Fusion for fusing heterogeneous local knowledge on the server side; and Best Global Prototype Selection for finding the best anchor point for each class in the feature space. Extensive experiments demonstrate that FedTA not only outperforms existing FCL methods but also effectively preserves the relative positions of features, remaining unaffected by spatial and temporal changes.
Poster
Takuma Fukuda · Hiroshi Kera · Kazuhiko Kawamoto

[ ExHall D ]

Abstract
We propose Adapter Merging with Centroid Prototype Mapping (ACMap), an exemplar-free framework for class-incremental learning (CIL) that addresses both catastrophic forgetting and scalability. While existing methods trade off inference time against accuracy, ACMap consolidates task-specific adapters into a single adapter, ensuring constant inference time across tasks without compromising accuracy. The framework employs adapter merging to build a shared subspace that aligns task representations and mitigates forgetting, while centroid prototype mapping maintains high accuracy through consistent adaptation in the shared subspace. To further improve scalability, an early stopping strategy limits adapter merging as tasks increase. Extensive experiments on five benchmark datasets demonstrate that ACMap matches state-of-the-art accuracy while maintaining inference speed comparable to the fastest existing methods.
Poster
Guannan Lai · Yujie Li · Xiangkun Wang · Junbo Zhang · Tianrui Li · Xin Yang

[ ExHall D ]

Abstract
Class Incremental Learning (CIL) requires a model to continuously learn new classes without forgetting previously learned ones. While recent studies have significantly alleviated the problem of catastrophic forgetting (CF), more and more research reveals that the order in which classes appear has a significant influence on CIL models. Specifically, prioritizing the learning of classes with lower similarity enhances the model's generalization performance and its ability to mitigate forgetting. Hence, it is imperative to develop an order-robust class incremental learning model that maintains stable performance even when faced with varying levels of class similarity in different orders. In response, we first provide additional theoretical analysis, which reveals that when the similarity among a group of classes is lower, the model demonstrates increased robustness to the class order. Then, we introduce a novel **G**raph-**D**riven **D**ynamic **S**imilarity **G**rouping (**GDDSG**) method, which leverages a graph coloring algorithm for class-based similarity grouping. The proposed approach trains independent CIL models for each group of classes, ultimately combining these models to facilitate joint prediction. Experimental results demonstrate that our method effectively addresses the issue of class order sensitivity while achieving optimal performance in both model accuracy and anti-forgetting capability. Our code is included in the supplementary materials.
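The grouping step can be pictured with an off-the-shelf greedy coloring (an illustration of the general idea; the paper's exact coloring algorithm and similarity measure may differ): connect classes whose prototypes are too similar, then color the graph so that each color class, i.e., each group, contains only mutually dissimilar classes.

```python
# Sketch: similarity graph over class prototypes + greedy graph coloring.
import networkx as nx
import numpy as np

def group_classes(prototypes: np.ndarray, tau: float = 0.5):
    norm = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sims = norm @ norm.T
    g = nx.Graph()
    g.add_nodes_from(range(len(prototypes)))
    for i in range(len(prototypes)):
        for j in range(i + 1, len(prototypes)):
            if sims[i, j] > tau:
                g.add_edge(i, j)                   # "too similar" classes must be separated
    coloring = nx.coloring.greedy_color(g)         # class_id -> group_id
    groups = {}
    for cls, grp in coloring.items():
        groups.setdefault(grp, []).append(cls)
    return groups                                  # train one CIL model per group

groups = group_classes(np.random.randn(20, 64))
```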
Poster
Vaibhav Rathore · Shubhranil B · Saikat Dutta · Sarthak Mehrotra · Zsolt Kira · Biplab Banerjee

[ ExHall D ]

Abstract
Generalized Class Discovery (GCD) clusters base and novel classes in a target domain, using supervision from a source domain with only base classes. Current methods often falter with distribution shifts and typically require access to target data during training, which can sometimes be impractical. To address this issue, we introduce the novel paradigm of Domain Generalization in GCD (DG-GCD), where only source data is available for training, while the target domain, with a distinct data distribution, remains unseen until inference. To this end, our solution aims to construct a domain-independent, discriminative embedding space for GCD. The core innovation is an episodic training strategy that enhances cross-domain generalization by adapting a base model on tasks derived from source and synthetic domains generated by a foundation model. Each episode focuses on a cross-domain GCD task, diversifying task setups over episodes and combining open-set domain adaptation with a novel margin loss and representation learning to optimize the feature space progressively. To capture the effects of fine-tuning on the base model, we extend task arithmetic by adaptively weighting the local task vectors concerning the fine-tuned models based on their GCD performance on a validation distribution. This episodic update mechanism boosts the adaptability of the base …
Poster
Yue Zhang · Mingyue Bin · Yuyang Zhang · Zhongyuan Wang · Zhen Han · Chao Liang

[ ExHall D ]

Abstract
Unsupervised domain adaptation (UDA) aims to learn discriminative features from a labeled source domain by supervised learning and to transfer the knowledge to an unlabeled target domain via distribution alignment. However, in some real-world scenarios, e.g., public safety or access control, it’s difficult to obtain sufficient source domain data, which hinders the application of existing UDA methods. To this end, this paper investigates a realistic but rarely studied problem called one-shot unsupervised domain adaptation (OSUDA), where there is only one example per category in the source domain and abundant unlabeled samples in the target domain. Compared with UDA, OSUDA faces dual challenges in both feature learning and domain alignment due to the lack of sufficient source data. To address these challenges, we propose a simple but effective link-based contrastive learning (LCL) method for OSUDA. On the one hand, with the help of in-domain links that indicate whether two samples are from the same cluster, LCL can learn discriminative features with abundant unlabeled target data. On the other hand, by constructing cross-domain links that show whether two clusters are bidirectionally matched, LCL can realize accurate domain alignment with only one source sample per category. Extensive experiments conducted on 4 public domain …
Poster
Weiming Liu · Jun Dan · Fan Wang · Xinting Liao · Junhao Dong · Hua Yu · Shunjie Dong · Lianyong Qi

[ ExHall D ]

Abstract
Nowadays, domain adaptation techniques have been widely investigated for sharing knowledge from a labeled source domain to an unlabeled target domain. However, the target domain may include data samples that belong to unknown categories in real-world scenarios. Moreover, the target domain cannot access the source data samples due to privacy-preserving restrictions. In this paper, we focus on the source-free open set domain adaptation problem, which includes two main challenges: how to distinguish known from unknown target samples, and how to exploit useful source information to provide trustworthy pseudo labels for known target samples. Existing approaches that directly apply conventional domain alignment methods can lead to sample mismatch and misclassification in this scenario. To overcome these issues, we propose the Distinguish Then Exploit model (DTE) with two components, i.e., weight barcode estimation and sparse label assignment. Weight barcode estimation first calculates the marginal probability of target samples via partially unbalanced optimal transport and then quantizes the barcode results to distinguish unknown target samples. Sparse label assignment utilizes sparse sample-label matching via a proximal term to fully exploit useful source information. Our empirical study on several datasets shows that DTE outperforms state-of-the-art models on the source-free open set domain adaptation problem.
Poster
Shinnosuke Matsuo · Riku Togashi · Ryoma Bise · Seiichi Uchida · Masahiro Nomura

[ ExHall D ]

Abstract
Active learning (AL) is a label-efficient machine learning paradigm that focuses on selectively annotating high-value instances to maximize learning efficiency. Its effectiveness can be further enhanced by incorporating weak supervision, which uses rough yet cost-effective annotations instead of exact (i.e., full) but expensive annotations. We introduce a novel AL framework, Instance-wise Supervision-Level Optimization (ISO), which not only selects the instances to annotate but also determines their optimal annotation level within a fixed annotation budget. Its optimization criterion leverages the value-to-cost ratio (VCR) of each instance while ensuring diversity among the selected instances. In classification experiments, ISO consistently outperforms traditional AL methods and surpasses a state-of-the-art AL approach that combines full and weak supervision, achieving higher accuracy at a lower overall cost.
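A toy sketch of budget-constrained selection by value-to-cost ratio (a greedy stand-in for the paper's optimization; the value and cost estimates, as well as the diversity term, are assumed to come from elsewhere):

```python
# Greedy VCR selection under an annotation budget.
def select_instances(values, costs, budget):
    order = sorted(range(len(values)), key=lambda i: values[i] / costs[i], reverse=True)
    chosen, spent = [], 0.0
    for i in order:
        if spent + costs[i] <= budget:
            chosen.append(i)
            spent += costs[i]
    return chosen

# Instance 2 offers the best value per unit cost, so it is picked first.
print(select_instances([0.9, 0.4, 0.8], [3.0, 1.0, 1.5], budget=4.0))   # [2, 1]
```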
Poster
Sk Miraj Ahmed · Umit Basaran · Dripta S. Raychaudhuri · Arindam Dutta · Rohit Kundu · Fahim Faisal Niloy · Basak Guler · Amit K. Roy-Chowdhury

[ ExHall D ]

Abstract
As machine learning becomes more pervasive and data privacy regulations evolve, the ability to remove private or copyrighted information from trained models is becoming an increasingly critical requirement. Existing unlearning methods often rely on the assumption of having access to the entire training dataset during the forgetting process. However, this assumption may not hold in practical scenarios where the original training data are not accessible, i.e., the source-free setting. To address this challenge, we focus on the source-free unlearning scenario, where an unlearning algorithm must be capable of removing specific data from a trained model without requiring access to the original training dataset. Building on recent work, we present a method that can estimate the Hessian of the unknown remaining training data, a crucial component required for efficient unlearning. Leveraging this estimation technique, our method enables efficient zero-shot unlearning while providing robust theoretical guarantees on the unlearning performance and maintaining performance on the remaining data. Extensive experiments over a wide range of datasets verify the efficacy of our method.
Poster
Taero Kim · Subeen Park · Sungjun Lim · Yonghan Jung · Krikamol Muandet · Kyungwoo Song

[ ExHall D ]

Abstract
Learning robust models under distribution shifts between training and test datasets is a fundamental challenge in machine learning. While learning invariant features across environments is a popular approach, it often assumes that these features are fully observed in both training and test sets—a condition frequently violated in practice. When models rely on invariant features absent in the test set, their robustness in new environments can deteriorate. To tackle this problem, we introduce a novel learning principle called the Sufficient Invariant Learning (SIL) framework, which focuses on learning a sufficient subset of invariant features rather than relying on a single feature. After demonstrating the limitation of existing invariant learning methods, we propose a new algorithm, Adaptive Sharpness-aware Group Distributionally Robust Optimization (ASGDRO), to learn diverse invariant features by seeking common flat minima across the environments. We theoretically demonstrate that finding a common flat minima enables robust predictions based on diverse invariant features. Empirical evaluations on multiple datasets, including our new benchmark, confirm ASGDRO's robustness against distribution shifts, highlighting the limitations of existing methods.
Poster
Zhiwei Ling · Yachen Chang · Hailiang Zhao · Xinkui Zhao · Kingsum Chow · Shuiguang Deng

[ ExHall D ]

Abstract
Deep neural networks (DNNs) have been widely criticized for their overconfidence when dealing with out-of-distribution (OOD) samples, highlighting the critical need for effective OOD detection to ensure the safe deployment of DNNs in real-world settings. Existing post-hoc OOD detection methods primarily enhance the discriminative power of logit-based approaches by reshaping sample features, yet they often neglect critical information inherent in the features themselves. In this paper, we propose the Class-Aware Relative Feature-based method (CARef), which utilizes the error between a sample's feature and its class-aware average feature as a discriminative criterion. To further refine this approach, we introduce the Class-Aware Decoupled Relative Feature-based method (CADRef), which decouples sample features based on the alignment of signs between the relative feature and the corresponding model weights, enhancing the discriminative capabilities of CARef. Extensive experimental results across multiple datasets and models demonstrate that both proposed methods exhibit effectiveness and robustness in OOD detection compared to state-of-the-art methods. Specifically, our two methods outperform the best baseline by 2.82% and 3.27% in AUROC, with improvements of 4.03% and 6.32% in FPR95, respectively.
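A minimal re-implementation sketch of the stated criterion (a reading of the abstract, not the released code), scoring a sample by the deviation of its feature from the average feature of its predicted class:

```python
# Class-aware relative-feature score: larger normalized deviation => more OOD-like.
import torch

def class_means(train_feats, train_labels, num_classes):
    return torch.stack([train_feats[train_labels == c].mean(0) for c in range(num_classes)])

def relative_feature_score(feat, logits, means):
    pred = logits.argmax(dim=-1)
    rel = feat - means[pred]                      # relative feature w.r.t. predicted class
    return rel.norm(dim=-1) / feat.norm(dim=-1)   # normalized deviation

train_feats, train_labels = torch.randn(500, 128), torch.randint(0, 10, (500,))
means = class_means(train_feats, train_labels, num_classes=10)
scores = relative_feature_score(torch.randn(4, 128), torch.randn(4, 10), means)
```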
Poster
Zheng Wang · Zihui Wang · Zheng Wang · Xiaoliang Fan · Cheng Wang

[ ExHall D ]

Abstract
Federated learning (FL) is emerging as a promising technique for collaborative learning without local data leaving their devices. However, clients' data originating from diverse domains may degrade model performance due to domain shifts, preventing the model from learning a consistent representation space. In this paper, we propose a novel FL framework, Federated Domain Shift Eraser (FDSE), to improve model performance by differently erasing each client's domain skew and enhancing their consensus. First, we formulate the model forward pass as an iterative deskewing process that alternately extracts and deskews features. This is efficiently achieved by decomposing each original layer in the neural network into a Domain-agnostic Feature Extractor (DFE) and a Domain-specific Skew Eraser (DSE). Then, a regularization term is applied to guarantee the effectiveness of feature deskewing by pulling the local statistics of DSE's outputs close to the globally consistent ones. Finally, DFE modules are fairly aggregated and broadcast to all the clients to maximize their consensus, and DSE modules are personalized for each client via similarity-aware aggregation to erase their domain skew differently. Comprehensive experiments were conducted on three datasets to confirm the advantages of our method in terms of accuracy, efficiency, and generalizability.
Poster
Run He · Kai Tong · Di Fang · Han Sun · Ziqian Zeng · Haoran Li · Tianyi Chen · Huiping Zhuang

[ ExHall D ]

Abstract
In this paper, we introduce analytic federated learning (AFL), a new training paradigm that brings analytical (i.e., closed-form) solutions to federated learning (FL) with pre-trained models. Our AFL draws inspiration from analytic learning, a gradient-free technique that trains neural networks with analytical solutions in one epoch. In the local client training stage, AFL facilitates one-epoch training, eliminating the necessity for multi-epoch updates. In the aggregation stage, we derive an absolute aggregation (AA) law. This AA law allows single-round aggregation, reducing heavy communication overhead and achieving fast convergence by removing the need for multiple aggregation rounds. More importantly, AFL exhibits invariance to data partitioning, meaning that regardless of how the full dataset is distributed among clients, the aggregated result remains identical. This could spawn various potential benefits, such as data heterogeneity invariance and client-number invariance. We conduct experiments across various FL settings, including extremely non-IID ones and scenarios with a large number of clients (e.g., 1000). In all these settings, our AFL consistently performs competitively while existing FL techniques encounter various obstacles.
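The partition-invariance property has a textbook analogue for a closed-form (ridge-regression) head, sketched below; the paper's absolute aggregation law may differ in detail, but the mechanism is the same: clients share sufficient statistics, and their sums do not depend on how the data were split.

```python
# Sketch: closed-form aggregation from per-client sufficient statistics.
import torch
import torch.nn.functional as F

def local_stats(feats, onehot):
    """Each client shares only (X^T X, X^T Y), never its raw data."""
    return feats.t() @ feats, feats.t() @ onehot

def aggregate(stats, lam: float = 1e-3):
    G = sum(s[0] for s in stats)
    R = sum(s[1] for s in stats)
    return torch.linalg.solve(G + lam * torch.eye(G.shape[0]), R)   # closed-form head

X = torch.randn(1000, 64)
Y = F.one_hot(torch.randint(0, 10, (1000,)), num_classes=10).float()
w_two = aggregate([local_stats(X[:500], Y[:500]), local_stats(X[500:], Y[500:])])
w_five = aggregate([local_stats(X[i::5], Y[i::5]) for i in range(5)])
assert torch.allclose(w_two, w_five, atol=1e-4)   # same result for any partition
```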
Poster
Naveen Kumar Kummari · Ranjeet Ranjan Jha · Krishna Mohan Chalavadi · Ravindra Babu Tallamraju

[ ExHall D ]

Abstract
Ensuring auditability and verifiability in federated learning (FL) is both challenging and essential to guarantee that local data remain untampered and client updates are trustworthy. Recent FL frameworks assess client contributions through a trusted central server using various client selection and aggregation techniques. However, reliance on a central server creates a single point of failure, making the system vulnerable to privacy-centric attacks and limiting its ability to audit and verify client-side data contributions due to restricted access. In addition, data quality and fairness evaluations are often inadequate, failing to distinguish between high-impact contributions and those from low-quality or poisoned data. To address these challenges, we propose Federated Auditable and Verifiable Data valuation (FAVD), a privacy-preserving method that ensures auditability and verifiability of client contributions through data valuation, independent of any central authority or predefined training algorithm. FAVD utilizes shared local data density functions to construct a global density function, aligning data contributions and facilitating effective valuation prior to local model training. This proactive approach improves transparency in data valuation and ensures that only benign updates are generated, even in the presence of malicious data. Further, to mitigate the privacy risks associated with sharing data density functions, we add Gaussian noise to each client's …
Poster
Tae-Young Lee · Sundong Park · Minwoo Jeon · Hyoseok Hwang · Gyeong-Moon Park

[ ExHall D ]

Abstract
As concerns regarding privacy in deep learning continue to grow, individuals are increasingly apprehensive about the potential exploitation of their personal knowledge in trained models. Despite several research efforts to address this, existing methods often fail to consider the real-world demand from users for complete knowledge erasure. Furthermore, our investigation reveals that existing methods risk leaking personal knowledge through embedding features. To address these issues, we introduce the novel concept of Knowledge Deletion (KD), an advanced task that considers both concerns, and provide an appropriate metric, named the Knowledge Retention score (KR), for assessing knowledge retention in the feature space. To achieve this, we propose a novel training-free erasing approach named Erasing Space Concept (ESC), which restricts the important subspace for the forgetting knowledge by eliminating the relevant activations in the feature space. In addition, we suggest ESC with Training (ESC-T), which uses a learnable mask to better balance the trade-off between forgetting and preserving knowledge in KD. Our extensive experiments on various datasets and models demonstrate that our proposed methods achieve the fastest and state-of-the-art performance. Notably, our methods are applicable to diverse forgetting scenarios, such as the facial domain setting, demonstrating their generalizability.
Poster
Jiate Li · Meng Pang · Yun Dong · Binghui Wang

[ ExHall D ]

Abstract
Graph neural networks (GNNs) are becoming the de facto method to learn on graph data and have achieved the state of the art on node and graph classification tasks. However, recent works show GNNs are vulnerable to training-time poisoning attacks: marginally perturbing edges, nodes, and node features of training graphs can largely degrade GNN performance. Most previous defenses against such attacks are empirical and are soon broken by adaptive or stronger ones. A few provable defenses provide robustness guarantees, but have large gaps when applied in practice: 1) all restrict the attacker's capability to only one type of perturbation; 2) all are designed for a particular GNN task; and 3) their robustness guarantees are not 100% accurate. In this work, we bridge all these gaps by developing PGNNCert, the first certified defense of GNNs against poisoning attacks under arbitrary (edge, node, and node feature) perturbations with deterministic (100% accurate) robustness guarantees. PGNNCert is also applicable to the two most widely studied node and graph classification tasks. Extensive evaluations on multiple node and graph classification datasets and GNNs demonstrate the effectiveness of PGNNCert in provably defending against arbitrary poisoning perturbations. PGNNCert is also shown to significantly outperform the state-of-the-art certified defenses against …
Poster
Keke Tang · Chao Hou · Weilong Peng · Xiang Fang · Zhize Wu · Yongwei Nie · Wenping Wang · Zhihong Tian

[ ExHall D ]

Abstract
Deep neural networks (DNNs) often exhibit out-of-distribution (OOD) overconfidence, producing overly confident predictions on OOD samples. We attribute this issue to the inherent over-complexity of DNNs and investigate two key aspects: capacity and nonlinearity. First, we demonstrate that reducing model capacity through knowledge distillation can effectively mitigate OOD overconfidence. Second, we show that selectively reducing nonlinearity by removing ReLU operations further alleviates the issue. Building on these findings, we present a practical guide to model simplification, combining both strategies to significantly reduce OOD overconfidence. Extensive experiments validate the effectiveness of this approach in mitigating OOD overconfidence and demonstrate its superiority over state-of-the-art methods. Additionally, our simplification strategies can be combined with existing OOD detection techniques to further enhance OOD detection performance. Codes will be made publicly available upon acceptance.
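The "reduce nonlinearity" step can be prototyped by swapping selected ReLU modules for identities in a trained network; which layers to touch is a design question the paper studies, so the choice of the last ResNet stage below is arbitrary.

```python
# Sketch: remove ReLU nonlinearities from the last stage of a ResNet-18.
import torch.nn as nn
from torchvision.models import resnet18

model = resnet18(weights=None)          # pretrained weights would be used in practice
for module in model.layer4.modules():
    for name, child in list(module.named_children()):
        if isinstance(child, nn.ReLU):
            setattr(module, name, nn.Identity())   # drop the nonlinearity in place
```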
Poster
Ping Guo · Cheng Gong · Fei Liu · Xi Lin · Zhichao Lu · Qingfu Zhang · Zhenkun Wang

[ ExHall D ]

Abstract
Crafting adversarial examples is crucial for evaluating and enhancing the robustness of Deep Neural Networks (DNNs), presenting a challenge equivalent to maximizing a non-differentiable 0-1 loss function. However, existing single-objective methods, namely adversarial attacks that focus on a surrogate loss function, do not fully harness the benefits of engaging multiple loss functions, owing to an insufficient understanding of their synergistic and conflicting nature. To overcome these limitations, we propose the Multi-Objective Set-based Attack (MOS Attack), a novel adversarial attack framework leveraging multiple loss functions and automatically uncovering their interrelations. The MOS Attack adopts a set-based multi-objective optimization strategy, enabling the incorporation of numerous loss functions without additional parameters. It also automatically mines synergistic patterns among various losses, facilitating the generation of potent adversarial attacks with fewer objectives. Extensive experiments show that our MOS Attack outperforms single-objective attacks. Furthermore, by harnessing the identified synergistic patterns, MOS Attack continues to show superior results with a reduced number of loss functions.
Poster
Banglong Liu · Niuniu Qi · Xia Zeng · Lydia Dehbi · Zhengfeng Yang

[ ExHall D ]

Abstract
Polynomial inequality proving is fundamental to many mathematical disciplines and finds wide application in diverse fields. Traditional algebraic methods are based on searching for a positive definite polynomial representation over a set of basis polynomials. However, these methods are limited by the truncation degree. To address this issue, this paper proposes an approach based on reinforcement learning to find a Krivine-basis representation for proving polynomial inequalities. Specifically, we formulate the inequality proving problem as a linear programming (LP) problem and encode it as a basis selection problem using reinforcement learning (RL), achieving a non-negative Krivine basis. Moreover, a fast multivariate polynomial multiplication method based on the Fast Fourier Transform (FFT) is deployed to enhance the efficiency of the action space search. Furthermore, we have implemented a tool called APPIRL (Automated Proof of Polynomial Inequalities via Reinforcement Learning). Experimental evaluation on benchmark problems demonstrates the feasibility and effectiveness of our approach. In addition, APPIRL can be successfully applied to attack the maximum stable set problem.
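The FFT-based polynomial multiplication mentioned above is the standard convolution-theorem trick; the univariate sketch below illustrates it (the tool itself handles multivariate polynomials).

```python
# Multiply two polynomials given as coefficient vectors (lowest degree first).
import numpy as np

def poly_mul_fft(p: np.ndarray, q: np.ndarray) -> np.ndarray:
    n = len(p) + len(q) - 1
    size = 1 << (n - 1).bit_length()                     # next power of two >= n
    prod = np.fft.irfft(np.fft.rfft(p, size) * np.fft.rfft(q, size), size)[:n]
    return np.round(prod, 10)

# (1 + x) * (1 + x) = 1 + 2x + x^2
print(poly_mul_fft(np.array([1.0, 1.0]), np.array([1.0, 1.0])))   # [1. 2. 1.]
```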
Poster
HaiMing Xu · Qianqian Wang · Boyue Wang · Quanxue Gao

[ ExHall D ]

Abstract
Multi-view clustering, while effective in integrating information from diverse data sources, may lead to biased outcomes when sensitive attributes are involved. Despite the substantial progress in recent research, most existing methods suffer from limited interpretability and lack strong mathematical theoretical foundations. In this work, we propose a novel approach, **D**eep **F**air **M**ulti-**V**iew **C**lustering with **A**ttention **K**olmogorov-**A**rnold **N**etwork (DFMVC-AKAN), designed to generate fair clustering results while maintaining robust performance. DFMVC-AKAN integrates attention mechanisms with multi-view data reconstruction to enhance both clustering accuracy and fairness. The model introduces an attention mechanism and Kolmogorov-Arnold Networks (KAN), which together address the challenges of feature fusion and the influence of sensitive attributes in multi-view data. The attention mechanism enables the model to dynamically focus on the most relevant features across different views, while KAN provides a nonlinear feature representation capable of efficiently approximating arbitrary multivariate continuous functions, thereby capturing complex relationships and latent patterns within the data. Experimental results on four datasets containing sensitive attributes demonstrate that DFMVC-AKAN significantly improves fairness and clustering performance compared to state-of-the-art methods.
Poster
yuzhuo dai · Jiaqi Jin · Zhibin Dong · Siwei Wang · Xinwang Liu · En Zhu · Xihong Yang · Xinbiao Gan · Yu Feng

[ ExHall D ]

Abstract
In incomplete multi-view clustering (IMVC), missing data introduces noise, causing view prototypes to shift and exhibit inconsistent semantics across views. A feasible solution is to explore cross-view consistency in paired complete observations for imputation and alignment. However, the existing paradigm is limited to the instance or cluster level. The former emphasizes instance alignment across views and wrongly regards unpaired observations with semantic consistency as negative pairs; the latter focuses on cluster counterparts across views but ignores intra-cluster observations within views. Neither can construct a semantic space shared by all view data to learn a consensus semantic representation. Meanwhile, excessive reliance on consistency results in unreliable imputation and alignment without incorporating view-specific cluster information. To this end, we propose an IMVC framework, imputation- and alignment-free consensus semantics learning (FreeCSL). For consensus semantics learning, a consensus prototype, linearly weighted by all available data, is introduced to promote consensus assignments between paired observations in a shared semantic space, which we call prototype-based contrastive clustering. For cluster semantics enhancement, we design a heuristic graph clustering method based on a modularity metric for each specific view to recover cluster structure with intra-cluster compactness and inter-cluster separation. Extensive experiments demonstrate that, compared to existing state-of-the-art competitors, FreeCSL is better at achieving confident and robust …
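For the modularity-based recovery of per-view cluster structure, a greedy modularity routine over a k-nearest-neighbour graph gives the flavour (a stand-in for the paper's heuristic, with an arbitrary choice of k):

```python
# Sketch: build a kNN similarity graph for one view, then extract communities
# that maximize modularity.
import networkx as nx
import numpy as np
from networkx.algorithms.community import greedy_modularity_communities

def view_clusters(feats: np.ndarray, k: int = 10):
    g = nx.Graph()
    g.add_nodes_from(range(len(feats)))
    sims = feats @ feats.T
    for i in range(len(feats)):
        for j in np.argsort(-sims[i])[1:k + 1]:          # k most similar neighbours
            g.add_edge(i, int(j))
    return [sorted(c) for c in greedy_modularity_communities(g)]

clusters = view_clusters(np.random.randn(200, 16))        # list of index sets per cluster
```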
Poster
Zijian Dong · Yilei Wu · Chongyao Chen · Yingtian Zou · Yichi Zhang · Juan Helen Zhou

[ ExHall D ]

Abstract
In representation learning, uniformity refers to the uniform feature distribution in the latent space (*i.e.*, unit hypersphere). Previous work has shown that improving uniformity contributes to the learning of under-represented classes. However, most of the previous work focused on classification; the representation space of imbalanced regression remains unexplored. Classification-based methods are not suitable for regression tasks because they cluster features into distinct groups without considering the continuous and ordered nature essential for regression. In a geometric aspect, we uniquely focus on ensuring uniformity in the latent space for imbalanced regression through two key losses: **enveloping** and **homogeneity**. The enveloping loss encourages the induced trace to uniformly occupy the surface of a hypersphere, while the homogeneity loss ensures smoothness, with representations evenly spaced at consistent intervals. Our method integrates these geometric principles into the data representations via a **S**urrogate-driven **R**epresentation **L**earning (**SRL**) framework. Experiments with real-world regression and operator learning tasks highlight the importance of uniformity in imbalanced regression and validate the efficacy of our geometry-based loss functions.
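A hedged sketch of the two geometric objectives as read from the abstract (the paper's exact formulations may differ): a uniformity-style enveloping term on the hypersphere, and a homogeneity term that evens out consecutive gaps along the label-ordered trace of representations.

```python
# Illustrative losses for uniform occupancy and consistent spacing on the sphere.
import torch
import torch.nn.functional as F

def enveloping_loss(z: torch.Tensor, t: float = 2.0) -> torch.Tensor:
    z = F.normalize(z, dim=-1)
    return torch.pdist(z).pow(2).mul(-t).exp().mean().log()   # uniformity-style term

def homogeneity_loss(z: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    z = F.normalize(z, dim=-1)[torch.argsort(y.flatten())]    # order points by label value
    gaps = (z[1:] - z[:-1]).norm(dim=-1)
    return gaps.var()                                         # penalize uneven spacing

z, y = torch.randn(256, 64, requires_grad=True), torch.rand(256, 1)
loss = enveloping_loss(z) + homogeneity_loss(z, y)
```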
Poster
Shanglin Liu · Jianming Lv · Jingdan Kang · Huaidong Zhang · Zequan Liang · Shengfeng He

[ ExHall D ]

Abstract
Multimodal unsupervised domain adaptation leverages unlabeled data in the target domain to enhance multimodal systems continuously. While current state-of-the-art methods encourage interaction between sub-models of different modalities through pseudo-labeling and feature-level exchange, varying sample quality across modalities can lead to the propagation of inaccurate information, resulting in error accumulation. To address this, we propose Modal-Affinity Multimodal Domain Adaptation (MODfinity), a method that dynamically manages multimodal information flow through fine-grained control over teacher model selection, guiding information intertwining at both feature and label levels. By treating labels as an independent modality, MODfinity enables balanced performance assessment across modalities, employing a novel modal-affinity measurement to evaluate information quality. Additionally, we introduce a modal-affinity distillation technique to control sample-level information exchange, ensuring reliable multimodal interaction based on affinity evaluations within the feature space. Extensive experiments on three multimodal datasets demonstrate that our framework consistently outperforms state-of-the-art methods, particularly in high-noise environments.
Poster
Yingxue Xu · Fengtao ZHOU · Chenyu Zhao · Yihui Wang · Can Yang · Hao Chen

[ ExHall D ]

Abstract
The integration of multimodal data including pathology images and gene profiles is widely applied in precise survival prediction. Despite recent advances in multimodal survival models, collecting complete modalities for multimodal fusion still poses a significant challenge, hindering their application in clinical settings. Current approaches tackling incomplete modalities often fall short, as they typically compensate for only a limited part of the knowledge of missing modalities. To address this issue, we propose a Distilled Prompt Learning framework (DisPro) to utilize the strong robustness of Large Language Models (LLMs) to missing modalities, which employs two-stage prompting for compensation of comprehensive information for missing modalities. In the first stage, Unimodal Prompting (UniPro) distills the knowledge distribution of each modality, preparing for supplementing modality-specific knowledge of the missing modality in the subsequent stage. In the second stage, Multimodal Prompting (MultiPro) leverages available modalities as prompts for LLMs to infer the missing modality, which provides modality-common information. Simultaneously, the unimodal knowledge acquired in the first stage is injected into multimodal inference to compensate for the modality-specific knowledge of the missing modality. Extensive experiments covering various missing scenarios demonstrated the superiority of the proposed method. The code will be released upon acceptance.
Poster
Wei Li · jiawei jiang · Jie Wu · Kaihao Yu · Jianwei Zheng

[ ExHall D ]

Abstract
Interpretability and consistency have long been crucial factors in MRI reconstruction. While interpretability has been significantly improved by the emerging deep unfolding networks, current solutions still suffer from inconsistency issues and produce inferior anatomical structures. Especially in out-of-distribution cases, e.g., when the acceleration rate (AR) varies, the generalization performance is often catastrophic. To counteract this dilemma, we propose an innovative Linear Mamba Operator (LMO) to ensure consistency and generalization while still enjoying desirable interpretability. Theoretically, we argue that mapping between function spaces, rather than between signal instances, provides a solid foundation for high generalization. Technically, LMO achieves a good balance between global integration, facilitated by a state space model that scans the whole function domain, and local integration, endowed with the appealing property of continuous-discrete equivalence. On that basis, learning holistic features can be guaranteed, tapping the potential of maximizing data consistency. Quantitative and qualitative results demonstrate that LMO significantly outperforms other state-of-the-art methods. More importantly, LMO is the only model that, when the AR changes, achieves the performance of retraining without any retraining steps. Code is attached and will be released on GitHub.
Poster
Xiao Wang · Fuling Wang · Yuehang Li · Qingchuan Ma · Shiao Wang · Bo Jiang · Jin Tang

[ ExHall D ]

Abstract
X-ray image-based medical report generation (MRG) is a pivotal area in artificial intelligence that can significantly reduce diagnostic burdens and patient wait times. Despite significant progress, we believe that the task has reached a bottleneck due to the limited benchmark datasets and the existing large models' insufficient capability enhancements in this specialized domain. Specifically, the recently released CheXpert Plus dataset lacks comparative evaluation algorithms and their results, providing only the dataset itself. This situation makes the training, evaluation, and comparison of subsequent algorithms challenging. Thus, we conduct a comprehensive benchmarking of existing mainstream X-ray report generation models and large language models (LLMs) on the CheXpert Plus dataset. We believe that the proposed benchmark can provide a solid comparative basis for subsequent algorithms and serve as a guide for researchers to quickly grasp the state-of-the-art models in this field. More importantly, we propose a large model for X-ray image report generation using a multi-stage pre-training strategy, including self-supervised autoregressive generation, X-ray-report contrastive learning, and supervised fine-tuning. Extensive experimental results indicate that the autoregressive pre-training based on Mamba effectively encodes X-ray images, and the image-text contrastive pre-training further aligns the feature spaces, achieving better experimental results. All the source codes …
Poster
Ying Chen · Guoan Wang · Yuanfeng Ji · Yanjun Li · Jin Ye · Tianbin Li · Ming Hu · Rongshan Yu · Yu Qiao · Junjun He

[ ExHall D ]

Abstract
Despite the progress made by multimodal large language models (MLLMs) in computational pathology, they remain limited by a predominant focus on patch-level analysis, missing essential contextual information at the whole-slide level. The lack of large-scale instruction datasets and the gigapixel scale of whole slide images (WSIs) pose significant developmental challenges. In this paper, we present SlideChat, the first vision-language assistant capable of understanding gigapixel whole-slide images, exhibiting excellent multimodal conversational capability and the ability to respond to complex instructions across diverse pathology scenarios. To support its development, we created SlideInstruction, the largest instruction-following dataset for WSIs, consisting of 4.2K WSI captions and 176K VQA pairs with multiple categories. Furthermore, we propose SlideBench, a multimodal benchmark that incorporates captioning and VQA tasks to assess SlideChat's capabilities in various settings such as microscopy, diagnosis, and clinical scenarios. Compared to both general and specialized MLLMs, SlideChat exhibits exceptional capabilities, achieving state-of-the-art performance on 18 of 22 tasks. For example, it achieved an overall accuracy of 81.17% on SlideBench-VQA (TCGA), and 54.15% on SlideBench-VQA (BCNB). We will fully release SlideChat, SlideInstruction and SlideBench as open-source resources to facilitate research and development in computational pathology.
Poster
Ziyuan Yang · Yingyu Chen · Zhiwen Wang · Hongming Shan · Yang Chen · Yi Zhang

[ ExHall D ]

Abstract
Reducing radiation doses benefits patients; however, the resultant low-dose computed tomography (LDCT) images often suffer from clinically unacceptable noise and artifacts. While deep learning (DL) shows promise in LDCT reconstruction, it requires large-scale data collection from multiple clients, raising privacy concerns. Federated learning (FL) has been introduced to address these privacy concerns; however, current methods are typically tailored to specific scanning protocols, which limits their generalizability and makes them less effective for unseen protocols. To address these issues, we propose SCAN-PhysFed, a novel SCanning- and ANatomy-level personalized Physics-Driven Federated learning paradigm for LDCT reconstruction. Since the noise distribution in LDCT data is closely tied to scanning protocols and the anatomical structures being scanned, we design a dual-level physics-informed way to address these challenges. Specifically, we incorporate physical and anatomical prompts into our physics-informed hypernetworks to capture scanning- and anatomy-specific information, enabling dual-level physics-driven personalization of imaging features. These prompts are derived from the scanning protocol and the radiology report generated by a medical large language model (MLLM), respectively. Subsequently, client-specific decoders project these dual-level personalized imaging features back into the image domain. Besides, to tackle the challenge of unseen data, we introduce a novel protocol vector-quantization strategy (PVQS), which ensures consistent …
Poster
Shaohao Rui · Lingzhi Chen · Zhenyu Tang · Lilong Wang · Mianxin Liu · Shaoting Zhang · Xiaosong Wang

[ ExHall D ]

Abstract
Self-supervised learning has greatly facilitated medical image analysis by reducing the training data requirements of real-world applications. Current paradigms predominantly rely on self-supervision within uni-modal image data, thereby neglecting the inter-modal correlations essential for effective learning of cross-modal image representations. This limitation is particularly significant for naturally grouped multi-modal data, e.g., multi-parametric MRI scans for a patient undergoing various functional imaging protocols in the same study. To bridge this gap, we conduct a novel multi-modal image pre-training with three proxy tasks to facilitate the learning of cross-modality representations and correlations using multi-modal brain MRI scans (over 2.4 million images in 16,022 scans of 3,755 patients), i.e., cross-modal image reconstruction, modality-aware contrastive learning, and modality template distillation. To demonstrate the generalizability of our pre-trained model, we conduct extensive experiments on various benchmarks with ten downstream tasks. The superior performance of our method is reported in comparison to state-of-the-art pre-training methods, with a Dice score improvement of 0.28%-14.47% across six segmentation benchmarks and a consistent accuracy boost of 0.65%-18.07% in four individual image classification tasks.
Poster
Qinghe Ma · Jian Zhang · Zekun Li · Lei Qi · Qian Yu · Yinghuan Shi

[ ExHall D ]

Abstract
Large pretrained visual foundation models exhibit impressive general capabilities. However, the extensive prior knowledge inherent in these models can sometimes be a double-edged sword when adapting them to downstream tasks in specific domains. In the context of semi-supervised medical image segmentation with domain shift, foundation models like MedSAM tend to make overconfident predictions, some of which are incorrect. The error accumulation hinders the effective utilization of unlabeled data and limits further improvements. In this paper, we introduce a Synergistic training framework for Foundation and Conventional models (SynFoC) to address the issue. We observe that a conventional model trained from scratch has the ability to correct the high-confidence mispredictions of the foundation model, while the foundation model can supervise it with high-quality pseudo-labels in the early training stages. Furthermore, to enhance the collaborative training effectiveness of both models and promote reliable convergence towards optimization, the consensus-divergence consistency regularization is proposed. We demonstrate the superiority of our method across four public multi-domain datasets. In particular, our method improves the Dice score by 10.31% on the Prostate dataset. Our code is available in the supplementary material.
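The abstract does not give the form of the consensus-divergence consistency regularization, so the snippet below is only a hedged sketch of how a consensus/divergence-style term between a foundation model and a scratch-trained conventional model could look on unlabeled data; the agreement mask and the weighting of the two terms are assumptions for illustration, not the SynFoC formulation.

```python
# Hypothetical sketch of a consensus-divergence style consistency term between a
# foundation model and a conventional model trained from scratch (not the authors'
# exact SynFoC loss; the weighting scheme here is an assumption for illustration).
import torch
import torch.nn.functional as F

def consensus_divergence_consistency(logits_foundation, logits_scratch):
    """Both inputs: (B, C, H, W) segmentation logits on the same unlabeled batch."""
    prob_f = logits_foundation.softmax(dim=1)
    prob_s = logits_scratch.softmax(dim=1)
    pseudo_f = prob_f.argmax(dim=1)          # foundation pseudo-labels
    pseudo_s = prob_s.argmax(dim=1)          # scratch-model pseudo-labels

    agree = (pseudo_f == pseudo_s).float()   # consensus mask, (B, H, W)

    # Consensus pixels: both models are pushed toward the agreed label.
    ce_f = F.cross_entropy(logits_foundation, pseudo_s, reduction="none")
    ce_s = F.cross_entropy(logits_scratch, pseudo_f, reduction="none")
    consensus_term = (agree * (ce_f + ce_s)).mean()

    # Divergence pixels: only softly align the two predictive distributions.
    divergence_term = ((1 - agree) * (prob_f - prob_s).pow(2).sum(dim=1)).mean()

    return consensus_term + divergence_term

# Example usage with random tensors standing in for real predictions.
if __name__ == "__main__":
    lf = torch.randn(2, 3, 64, 64)
    ls = torch.randn(2, 3, 64, 64)
    print(consensus_divergence_consistency(lf, ls))
```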
Poster
Tassilo Wald · Constantin Ulrich · Stanislav Lukyanenko · Andrei Goncharov · Alberto Paderno · Maximilian Miller · Leander Maerkisch · Paul F Jaeger · Klaus Maier-Hein

[ ExHall D ]

Abstract
Self-Supervised Learning (SSL) presents an exciting opportunity to unlock the potential of vast, untapped clinical datasets for various downstream applications that suffer from the scarcity of labeled data. While SSL has revolutionized fields like natural language processing and computer vision, its adoption in 3D medical image computing has been limited by three key pitfalls: Small pre-training dataset sizes, architectures inadequate for 3D medical image analysis, and insufficient evaluation practices. In this paper, we address these issues by i) leveraging a large-scale dataset of 39k 3D brain MRI volumes and ii) using a Residual Encoder U-Net architecture within the state-of-the-art nnU-Net framework. iii) A robust development framework, incorporating 5 development and 8 testing brain MRI segmentation datasets, allowed performance-driven design decisions to optimize the simple concept of Masked Auto Encoders (MAEs) for 3D CNNs. The resulting model not only surpasses previous SSL methods but also outperforms the strong nnU-Net baseline by an average of approximately 3 Dice points, setting a new state-of-the-art. Our code and models are made available.
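As a rough illustration of the Masked Auto Encoder concept applied to volumetric CNN inputs, the sketch below randomly hides a fraction of 3D patches by zeroing their voxels; the patch size, mask ratio, and zero-masking scheme are assumptions, not the nnU-Net-based setup used in the paper.

```python
# Illustrative sketch of random 3D patch masking for a masked autoencoder on a
# volumetric MRI input (patch size and mask ratio are assumptions, not the paper's).
import torch

def random_3d_patch_mask(volume, patch=16, mask_ratio=0.75):
    """volume: (B, C, D, H, W) with D, H, W divisible by `patch`.
    Returns the masked volume and the boolean mask over patches."""
    B, C, D, H, W = volume.shape
    gd, gh, gw = D // patch, H // patch, W // patch
    n_patches = gd * gh * gw
    n_masked = int(mask_ratio * n_patches)

    # Choose a random subset of patches to hide for each sample in the batch.
    scores = torch.rand(B, n_patches, device=volume.device)
    masked_idx = scores.argsort(dim=1)[:, :n_masked]
    mask = torch.zeros(B, n_patches, dtype=torch.bool, device=volume.device)
    batch_idx = torch.arange(B, device=volume.device).unsqueeze(1)
    mask[batch_idx, masked_idx] = True

    # Zero out the voxels of the masked patches (a CNN-friendly masking scheme).
    mask_grid = mask.view(B, 1, gd, gh, gw).float()
    mask_vox = mask_grid.repeat_interleave(patch, dim=2) \
                        .repeat_interleave(patch, dim=3) \
                        .repeat_interleave(patch, dim=4)
    return volume * (1.0 - mask_vox), mask

if __name__ == "__main__":
    vol = torch.randn(1, 1, 64, 64, 64)
    masked, mask = random_3d_patch_mask(vol)
    print(masked.shape, mask.float().mean().item())  # ~0.75 of patches hidden
```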
Poster
Feng Yu · Jiacheng Cao · Li Liu · Minghua Jiang

[ ExHall D ]

Abstract
Multimodal 3D segmentation involves a significant number of 3D convolution operations, which requires substantial computational resources and high-performance computing devices in MRI multimodal brain tumor segmentation. The key challenge in multimodal 3D segmentation is how to minimize network computational load while maintaining high accuracy. To address this issue, a novel lightweight parameter aggregation network (SuperLightNet) is proposed to realize an efficient encoder and decoder with high accuracy and low computation. A random multiview drop encoder is designed to learn the spatial structure of multimodal images through a random multi-view approach, addressing the high computational time complexity that has arisen in recent years with methods relying on transformers and Mamba. A learnable residual skip decoder is designed to incorporate learnable residual and group skip weights, addressing the reduced computational efficiency caused by the use of overly heavy convolution and deconvolution decoders. Experimental results demonstrate that the proposed method achieves a leading 95.59% reduction in parameter count, a 96.78% improvement in computational efficiency, a 96.86% enhancement in memory access performance, and an average performance gain of 0.21% on the BraTS2019 and BraTS2021 datasets in comparison with the state-of-the-art methods. Code is available at https://github.com/WTU1020-Medical-Segmentation/SuperLightNet.
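A hedged sketch of what a skip connection with learnable residual and group-wise skip weights might look like; the channel grouping, initialization, and placement of the convolution are assumptions for illustration, not the SuperLightNet decoder.

```python
# Hypothetical sketch of a skip connection with learnable residual and group-wise
# skip weights, in the spirit of a learnable residual skip decoder (the grouping
# and initialization below are assumptions for illustration).
import torch
import torch.nn as nn

class LearnableResidualSkip(nn.Module):
    def __init__(self, channels, groups=4):
        super().__init__()
        assert channels % groups == 0
        self.groups = groups
        # One weight per channel group for the encoder skip, one scalar for the residual.
        self.skip_weight = nn.Parameter(torch.ones(groups))
        self.res_weight = nn.Parameter(torch.ones(1))
        self.conv = nn.Conv3d(channels, channels, kernel_size=3, padding=1)

    def forward(self, decoder_feat, encoder_skip):
        C = encoder_skip.shape[1]
        w = self.skip_weight.repeat_interleave(C // self.groups).view(1, C, 1, 1, 1)
        fused = decoder_feat + w * encoder_skip                    # weighted group skip
        return self.res_weight * decoder_feat + self.conv(fused)  # learnable residual

if __name__ == "__main__":
    block = LearnableResidualSkip(channels=8)
    x = torch.randn(1, 8, 16, 16, 16)
    print(block(x, x).shape)
```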
Poster
Jiongtong Hu · Wei Zhuo · Jun Cheng · YINGYING LIU · Wufeng Xue · Dong Ni

[ ExHall D ]

Abstract
In clinical practice of echocardiography examinations, multiple planes containing heart structures from different views are usually required in the screening, diagnosis and treatment of cardiac disease. AI models for echocardiography have to be tailored to a specific plane due to dramatic structural differences, resulting in repeated development and extra complexity. An effective solution to this multi-plane segmentation (MPS) problem is in high demand for medical images, yet it has not been well investigated. In this paper, we propose a novel solution, EchoONE, for this problem with a SAM-based segmentation architecture, a prior-composable mask learning (PC-Mask) module for semantic-aware dense prompt generation, and a learnable CNN-branch with a simple yet effective local feature fusion and adaption (LFFA) module for SAM adapting. We extensively evaluated our method on multiple internal and external echocardiography datasets, and consistently achieved state-of-the-art performance for multi-source datasets with different heart planes. This is the first time that the MPS problem is solved in one model for echocardiography data. The code will be available at https://github.com/a2503/EchoONE.
Poster
Jinho Joo · Hyeseong Kim · Hyeyeon Won · Deukhee Lee · Taejoon Eo · Dosik Hwang

[ ExHall D ]

Abstract
This study introduces a novel zero-shot scan-specific self-supervised reconstruction method for magnetic resonance imaging (MRI) to reduce scan times. Conventional supervised reconstruction methods require large amounts of fully-sampled reference data, which is often impractical to obtain and can lead to artifacts by overly emphasizing learned patterns. Existing zero-shot scan-specific methods have attempted to overcome this data dependency but show limited performance due to insufficient utilization of k-space information and constraints derived from the MRI forward model. To address these limitations, we introduce a framework utilizing all acquired k-space measurements for both network inputs and training targets. While this framework suffers from training instability, we resolve these challenges through three key components: an Attention-guided K-space Selective Mechanism (AKSM) that provides indirect constraints for non-sampled k-space points, Iteration-wise K-space Masking (IKM) that enhances training stability, and a robust sensitivity map estimation model utilizing a cross-channel constraint that performs effectively even at high reduction factors. Experimental results on the FastMRI knee and brain datasets with reduction factors of 4 and 8 demonstrate that the proposed method achieves superior reconstruction quality and faster convergence compared to existing zero-shot scan-specific methods, making it suitable for practical clinical applications. Our code will be released on Github.
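As a rough sketch of the iteration-wise masking idea, the snippet below re-partitions the acquired k-space samples into an input subset and a held-out target subset at every training iteration; the split ratio and the single-coil layout are assumptions, not the paper's setting.

```python
# Illustrative sketch of iteration-wise k-space masking: at each training iteration
# the acquired k-space samples are randomly re-partitioned into an input subset and
# a target subset (the 80/20 split is an assumption, not the paper's setting).
import numpy as np

def iteration_kspace_split(kspace, acquired_mask, input_fraction=0.8, rng=None):
    """kspace: complex array (H, W); acquired_mask: bool array (H, W) of sampled points."""
    rng = rng or np.random.default_rng()
    idx = np.flatnonzero(acquired_mask)
    rng.shuffle(idx)
    n_input = int(input_fraction * idx.size)

    input_mask = np.zeros_like(acquired_mask)
    target_mask = np.zeros_like(acquired_mask)
    input_mask.flat[idx[:n_input]] = True
    target_mask.flat[idx[n_input:]] = True

    # The network sees only the input subset; the held-out subset supervises it.
    return kspace * input_mask, kspace * target_mask, input_mask, target_mask

if __name__ == "__main__":
    k = np.random.randn(64, 64) + 1j * np.random.randn(64, 64)
    acq = np.random.rand(64, 64) < 0.25          # ~4x undersampling pattern
    k_in, k_tgt, m_in, m_tgt = iteration_kspace_split(k, acq)
    print(m_in.sum(), m_tgt.sum())
```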
Poster
Xinxing Cheng · Tianyang Zhang · Wenqi Lu · Qingjie Meng · Alejandro F Frangi · Jinming Duan

[ ExHall D ]

Abstract
Deep learning-based image registration methods have shown state-of-the-art performance and rapid inference speeds. Despite these advances, many existing approaches fall short in capturing spatially varying information in non-local regions of feature maps due to the reliance on spatially-shared convolution kernels. This limitation leads to suboptimal estimation of deformation fields. In this paper, we propose a 3D Spatial-Awareness Convolution Block (SACB) to enhance the spatial information within feature representations. Our SACB estimates the spatial clusters within feature maps by leveraging feature similarity and subsequently parameterizes the adaptive convolution kernels across diverse regions. This adaptive mechanism generates the convolution kernels (weights and biases) tailored to spatial variations, thereby enabling the network to effectively capture spatially varying information. Building on SACB, we introduce a pyramid flow estimator (named SACB-Net) that integrates SACBs to facilitate multi-scale flow composition, particularly addressing large deformations. Experimental results on the brain IXI and LPBA datasets as well as Abdomen CT datasets demonstrate the effectiveness of SACB and the superiority of SACB-Net over the state-of-the-art learning-based registration methods. The code will be made publicly available.
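A hedged, simplified sketch of spatially adaptive convolution: pixels are softly assigned to learned prototypes by feature similarity, a small hypernetwork generates one kernel per prototype, and the per-prototype outputs are composed per pixel. The 2D layout, soft assignment, and hypernetwork design are assumptions, not the 3D SACB used in the paper.

```python
# Hypothetical 2D sketch of a spatially adaptive convolution driven by feature
# similarity (the paper's SACB is 3D and estimates explicit spatial clusters;
# the soft assignment and tiny hypernetwork here are simplifications).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatiallyAdaptiveConv2d(nn.Module):
    def __init__(self, channels, n_clusters=4, ksize=3):
        super().__init__()
        self.n_clusters, self.channels, self.ksize = n_clusters, channels, ksize
        self.prototypes = nn.Parameter(torch.randn(n_clusters, channels))
        out_params = channels * channels * ksize * ksize + channels
        self.hyper = nn.Linear(channels, out_params)   # kernel + bias per prototype

    def forward(self, x):
        B, C, H, W = x.shape
        # Soft cluster assignment from cosine similarity to the prototypes.
        feat = F.normalize(x, dim=1)                       # (B, C, H, W)
        proto = F.normalize(self.prototypes, dim=1)        # (R, C)
        assign = torch.einsum("bchw,rc->brhw", feat, proto).softmax(dim=1)

        out = 0.0
        for r in range(self.n_clusters):
            params = self.hyper(self.prototypes[r])
            w = params[: C * C * self.ksize ** 2].view(C, C, self.ksize, self.ksize)
            b = params[C * C * self.ksize ** 2:]
            y_r = F.conv2d(x, w, b, padding=self.ksize // 2)
            out = out + assign[:, r:r + 1] * y_r           # per-pixel composition
        return out

if __name__ == "__main__":
    block = SpatiallyAdaptiveConv2d(channels=8)
    x = torch.randn(1, 8, 32, 32)
    print(block(x).shape)
```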
Poster
Federico Bolelli · Kevin Marchesini · Niels van Nistelrooij · Luca Lumetti · Vittorio Pipoli · Elisa Ficarra · Shankeeth Vinayahalingam · Costantino Grana

[ ExHall D ]

Abstract
Cone-beam computed tomography (CBCT) is a standard imaging modality in orofacial and dental practices, providing essential 3D volumetric imaging of anatomical structures, including jawbones, teeth, sinuses, and neurovascular canals. Accurately segmenting these structures is fundamental to numerous clinical applications, such as surgical planning and implant placement. However, manual segmentation of CBCT scans is time-intensive and requires expert input, creating a demand for automated solutions through deep learning. Effective development of such algorithms relies on access to large, well-annotated datasets, yet current datasets are often privately stored or limited in scope and considered structures, especially concerning 3D annotations. This paper proposes a comprehensive, publicly accessible CBCT dataset with voxel-level 3D annotations of 42 distinct classes corresponding to maxillofacial structures. We validate the dataset by benchmarking state-of-the-art neural network models, including convolutional, transformer-based, and hybrid Mamba-based architectures, to evaluate segmentation performance across complex anatomical regions. Our work also explores adaptations to the nnU-Net framework to optimize multi-class segmentation for maxillofacial anatomy. The proposed dataset provides a fundamental resource for advancing maxillofacial segmentation and supports future research in automated 3D image analysis in digital dentistry.

Oral Session 2A: 3D Computer Vision Fri 13 Jun 01:00 p.m.  

Oral
Bowen Wen · Matthew Trepte · Oluwaseun Joseph Aribido · Jan Kautz · Orazio Gallo · Stan Birchfield

[ Karl Dean Ballroom ]

Abstract
Tremendous progress has been made in deep stereo matching to excel on benchmark datasets through per-domain fine-tuning. However, achieving strong zero-shot generalization — a hallmark of foundation models in other computer vision tasks — remains challenging for stereo matching. We introduce StereoAnything, a foundation model for stereo depth estimation designed to achieve strong zero-shot generalization. To this end, we first construct a large-scale (1M stereo pairs) synthetic training dataset featuring large diversity and high photorealism, followed by an automatic self-curation pipeline to remove ambiguous samples. We then design a number of network architecture components to enhance scalability, including a side-tuning feature backbone that adapts rich monocular priors from vision foundation models to mitigate the sim-to-real gap, and long-range context reasoning for effective cost volume filtering. Together, these components lead to strong robustness and accuracy across domains, establishing a new standard in zero-shot stereo depth estimation.
Oral
Ruicheng Wang · Sicheng Xu · Cassie Lee Dai · Jianfeng XIANG · Yu Deng · Xin Tong · Jiaolong Yang

[ Karl Dean Ballroom ]

Abstract
We present MoGe, a powerful model for recovering 3D geometry from monocular open-domain images. Given a single image, our model directly predicts a 3D point map of the captured scene with an affine-invariant representation, which is agnostic to true global scale and shift. This new representation precludes ambiguous supervision in training and facilitates effective geometry learning. Furthermore, we propose a set of novel global and local geometry supervision techniques that empower the model to learn high-quality geometry. These include a robust, optimal, and efficient point cloud alignment solver for accurate global shape learning, and a multi-scale local geometry loss promoting precise local geometry supervision. We train our model on a large, mixed dataset and demonstrate its strong generalizability and high accuracy. In our comprehensive evaluation on diverse unseen datasets, our model significantly outperforms state-of-the-art methods across all tasks, including monocular estimation of 3D point map, depth map, and camera field of view.
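Since the predicted point map is agnostic to global scale and shift, aligning it to a reference requires solving for those two quantities. The snippet below is a plain closed-form least-squares solver for a scalar scale and a 3D translation; the paper's solver is described as robust and optimal and may use a different (e.g., truncated) objective, so treat this as an illustrative baseline only.

```python
# Minimal closed-form least-squares alignment of a scale/shift-ambiguous point map
# to a reference point map (an ordinary L2 solver; not the paper's robust solver).
import numpy as np

def align_scale_shift(pred, ref):
    """pred, ref: (N, 3) point clouds in correspondence.
    Finds scalar s and translation t minimizing sum ||s * pred + t - ref||^2."""
    p_mean = pred.mean(axis=0)
    r_mean = ref.mean(axis=0)
    p_c, r_c = pred - p_mean, ref - r_mean
    s = (p_c * r_c).sum() / (p_c * p_c).sum()   # optimal scale
    t = r_mean - s * p_mean                     # optimal shift given the scale
    return s, t

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ref = rng.normal(size=(1000, 3))
    pred = (ref - 0.5) / 2.0                    # scale/shift-distorted copy
    s, t = align_scale_shift(pred, ref)
    print(np.abs(s * pred + t - ref).max())     # ~0 up to numerical error
```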
Oral
Haoyu Guo · He Zhu · Sida Peng · Haotong Lin · Yunzhi Yan · Tao Xie · Wenguan Wang · Xiaowei Zhou · Hujun Bao

[ Karl Dean Ballroom ]

Abstract
This paper aims to reconstruct the scene geometry from multi-view images with strong robustness and high quality. Previous learning-based methods incorporate neural networks into the multi-view stereo matching and have shown impressive reconstruction results. However, due to the reliance on matching across input images, they typically suffer from high GPU memory consumption and tend to fail in sparse view scenarios. To overcome this problem, we develop a new pipeline, named Murre, for multi-view geometry reconstruction of 3D scenes based on SfM-guided monocular depth estimation. For input images, Murre first recovers the SfM point cloud that captures the global scene structure, and then uses it to guide a conditional diffusion model to produce multi-view metric depth maps for the final TSDF fusion. By predicting the depth map from a single image, Murre bypasses the multi-view matching step and naturally resolves the issues of previous MVS-based methods. In addition, the diffusion-based model can easily leverage the powerful priors of 2D foundation models, achieving good generalization ability across diverse real-world scenes. To obtain multi-view consistent depth maps, our key design is providing effective guidance on the diffusion model through the SfM point cloud, which is a condensed form of multi-view information, highlighting the …
Oral
Zhenggang Tang · Yuchen Fan · Dilin Wang · Hongyu Xu · Rakesh Ranjan · Alexander G. Schwing · Zhicheng Yan

[ Karl Dean Ballroom ]

Abstract
Recent sparse view scene reconstruction advances like DUSt3R and MASt3R no longer require camera calibration and camera pose estimation. However, they only process a pair of views at a time to infer pixel-aligned pointmaps. When dealing with more than two views, a combinatorial number of error-prone pairwise reconstructions are usually followed by an expensive global optimization, which often fails to rectify the pairwise reconstruction errors. To handle more views, reduce errors, and improve inference time, we propose the fast single-stage feed-forward network MV-DUSt3R. At its core are multi-view decoder blocks which exchange information across any number of views while considering one reference view. To make our method robust to reference view selection, we further propose MV-DUSt3R+, which employs cross-reference-view blocks to fuse information across different reference view choices. To further enable novel view synthesis, we extend both by adding and jointly training Gaussian splatting heads. Experiments on multi-view stereo reconstruction, multi-view pose estimation, and novel view synthesis confirm that our methods improve significantly upon prior art. Code will be released.
Oral
Jianyuan Wang · Minghao Chen · Nikita Karaev · Andrea Vedaldi · Christian Rupprecht · David Novotny

[ Karl Dean Ballroom ]

Abstract
We present VGGN, a feed-forward neural network that directly infers all key 3D attributes of a scene, such as camera poses, point maps, depth maps, and 3D point tracks, from few or hundreds of its views. Unlike recent alternatives, VGGN does not need to use visual geometry optimization techniques to refine the results in post-processing, obtaining all quantities of interest directly. This approach is simple and more efficient, reconstructing hundreds of images in seconds. We train VGGN on a large number of publicly available datasets with 3D annotations and demonstrate its ability to achieve state-of-the-art results in multiple 3D tasks, including camera pose estimation, multi-view depth estimation, dense point cloud reconstruction, and 3D point tracking. This is a step forward in 3D computer vision, where models have been typically constrained to and specialized for single tasks. We extensively evaluate our method on unseen datasets to demonstrate its superior performance. We will release the code and trained model.
Oral
Weiyu Li · Jiarui Liu · Hongyu Yan · Rui Chen · Yixun Liang · Xuelin Chen · Ping Tan · Xiaoxiao Long

[ Karl Dean Ballroom ]

Abstract
We present a novel generative 3D modeling system, coined CraftsMan, which can generate high-fidelity 3D geometries with highly varied shapes, regular mesh topologies, and detailed surfaces, and, notably, allows for refining the geometry in an interactive manner. Despite the significant advancements in 3D generation, existing methods still struggle with lengthy optimization processes, self-occlusion, irregular mesh topologies, and difficulties in accommodating user edits, consequently impeding their widespread adoption and implementation in 3D modeling software. Our work is inspired by the craftsman, who usually roughs out the holistic figure of the work first and elaborates the surface details subsequently. Specifically, we first introduce a robust data preprocessing pipeline that utilizes visibility checks and winding numbers to maximize the use of existing 3D data. Leveraging this data, we employ a 3D-native DiT model that directly models the distribution of 3D data in latent space, generating coarse geometries with regular mesh topology in seconds. Subsequently, a normal-based geometry refiner enhances local surface details, which can be applied automatically or interactively with user input. Extensive experiments demonstrate that our method achieves high efficacy in producing superior quality 3D assets compared to existing methods.

Oral Session 2B: Human Motion Fri 13 Jun 01:00 p.m.  

Oral
Felix Taubner · Ruihang Zhang · Mathieu Tuli · David B. Lindell

[ ExHall A2 ]

Abstract
Reconstructing photorealistic and dynamic portrait avatars from images is essential to many applications including advertising, visual effects, and virtual reality. Depending on the application, avatar reconstruction involves different capture setups and constraints — for example, visual effects studios use camera arrays to capture hundreds of reference images, while content creators may seek to animate a single portrait image downloaded from the internet. As such, there is a large and heterogeneous ecosystem of methods for avatar reconstruction. Techniques based on multi-view stereo or neural rendering achieve the highest quality results, but require hundreds of reference images. Recent generative models produce convincing avatars from a single reference image, but their visual fidelity still lags behind multi-view techniques. Here, we present CAP4D: an approach that uses a morphable multi-view diffusion model to reconstruct photoreal 4D portrait avatars from any number of reference images (i.e., one to 100) and animate and render them in real time. Our approach demonstrates state-of-the-art performance for single-, few-, and multi-image 4D portrait avatar reconstruction, and takes steps to bridge the gap in visual fidelity between single-image and multi-view reconstruction techniques.
Oral
Jacob Yeung · Andrew Luo · Gabriel Sarch · Margaret Marie Henderson · Deva Ramanan · Michael J. Tarr

[ ExHall A2 ]

Abstract
While computer vision models have made incredible strides in static image recognition, they still do not match human performance in tasks that require the understanding of complex, dynamic motion. This is notably true for real-world scenarios where embodied agents face complex and motion-rich environments. Our approach leverages state-of-the-art video diffusion models to decouple static image representation from motion generation, enabling us to utilize fMRI brain activity for a deeper understanding of human responses to dynamic visual stimuli. Conversely, we also demonstrate that information about the brain's representation of motion can enhance the prediction of optical flow in artificial systems. Our novel approach leads to four main findings: (1) Visual motion, represented as fine-grained, object-level resolution optical flow, can be decoded from brain activity generated by participants viewing video stimuli; (2) Video encoders outperform image-based models in predicting video-driven brain activity; (3) Brain-decoded motion signals enable realistic video reanimation based only on the initial frame of the video; and (4) We extend prior work to achieve full video decoding from video-driven brain activity. This framework advances our understanding of how the brain represents spatial and temporal information in dynamic visual scenes. Our findings demonstrate the potential of combining brain imaging with …
Oral
Fangzhou Hong · Vladimir Guzov · Hyo Jin Kim · Yuting Ye · Richard Newcombe · Ziwei Liu · Lingni Ma

[ ExHall A2 ]

Abstract
As wearable devices become more prevalent, understanding the user's motion is crucial for improving contextual AI systems. We introduce EgoLM, a versatile framework designed for egocentric motion understanding using multi-modal data. EgoLM integrates the rich contextual information from egocentric videos and motion sensors afforded by wearable devices. It also combines dense supervision signals from motion and language, leveraging the vast knowledge encoded in pre-trained large language models (LLMs). EgoLM models the joint distribution of egocentric motions and natural language using LLMs, conditioned on observations from egocentric videos and motion sensors. It unifies a range of motion understanding tasks, including motion narration from video or motion data, as well as motion generation from text or sparse sensor data. Unique to wearable devices, it also enables a novel task to generate text descriptions from sparse sensors. Through extensive experiments, we validate the effectiveness of EgoLM in addressing the challenges of under-constrained egocentric motion learning, and demonstrate its capability as a generalist model through a variety of applications.
Oral
Yan Xia · Xiaowei Zhou · Etienne Vouga · Qixing Huang · Georgios Pavlakos

[ ExHall A2 ]

Abstract
In this paper, we introduce a method for reconstructing humans in 3D from a single image using a biomechanically accurate skeleton model. To achieve this, we train a transformer that takes an image as input and estimates the parameters of the model. Due to the lack of training data for this task, we build a pipeline to generate pseudo ground truth data and implement a training procedure that iteratively refines these pseudo labels for improved accuracy. Compared to state-of-the-art methods in 3D human pose estimation, our model achieves competitive performance on standard benchmarks, while it significantly outperforms them in settings with extreme 3D poses and viewpoints. This result highlights the benefits of using a biomechanical skeleton with realistic degrees of freedom for robust pose estimation. Additionally, we show that previous models frequently violate joint angle limits, leading to unnatural rotations. In contrast, our approach leverages the biomechanically plausible degrees of freedom leading to more realistic joint rotation estimates. We validate our approach across multiple human pose estimation benchmarks. We will make all code, models and data publicly available upon publication.
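As a toy illustration of why biomechanically realistic degrees of freedom help, the sketch below clamps per-joint angles to a range table so that hyperextension and over-flexion are removed; the joint names and limits are placeholder assumptions, not the authors' skeleton model.

```python
# Illustrative sketch of enforcing biomechanical joint-angle limits by clamping
# per-joint rotations to a range table (the ranges below are placeholder values,
# not the limits of the authors' biomechanical skeleton).
import numpy as np

# Degrees: (min, max) per rotational degree of freedom, per joint (hypothetical).
JOINT_LIMITS_DEG = {
    "knee_flexion":  (0.0, 140.0),
    "elbow_flexion": (0.0, 150.0),
    "hip_abduction": (-30.0, 45.0),
}

def clamp_to_limits(angles_deg):
    """angles_deg: dict joint_name -> angle in degrees; returns a clamped copy."""
    out = {}
    for name, angle in angles_deg.items():
        lo, hi = JOINT_LIMITS_DEG[name]
        out[name] = float(np.clip(angle, lo, hi))
    return out

if __name__ == "__main__":
    raw = {"knee_flexion": -12.0, "elbow_flexion": 160.0, "hip_abduction": 10.0}
    print(clamp_to_limits(raw))   # hyperextension and over-flexion are removed
```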
Oral
Guénolé Fiche · Simon Leglaive · Xavier Alameda-Pineda · Francesc Moreno-Noguer

[ ExHall A2 ]

Abstract
Human Mesh Recovery (HMR) from a single RGB image is a highly ambiguous problem, as an infinite set of 3D interpretations can explain the 2D observation equally well. Nevertheless, most HMR methods overlook this issue and make a single prediction without accounting for this ambiguity. A few approaches generate a distribution of human meshes, enabling the sampling of multiple predictions; however, none of them is competitive with the latest single-output model when making a single prediction. This work proposes a new approach based on masked generative modeling. By tokenizing the human pose and shape, we formulate the HMR task as generating a sequence of discrete tokens conditioned on an input image. We introduce MEGA, a MaskEd Generative Autoencoder trained to recover human meshes from images and partial human mesh token sequences. Given an image, our flexible generation scheme allows us to predict a single human mesh in deterministic mode or to generate multiple human meshes in stochastic mode. Experiments on in-the-wild benchmarks show that MEGA achieves state-of-the-art performance in deterministic and stochastic modes, outperforming single-output and multi-output approaches. Code and trained models will be released upon acceptance.
Oral
Liang Pan · Zeshi Yang · Zhiyang Dou · Wenjia Wang · Buzhen Huang · Bo Dai · Taku Komura · Jingbo Wang

[ ExHall A2 ]

Abstract
The synthesis of realistic and physically plausible human-scene interaction animations presents a critical and complex challenge in computer vision and embodied AI. Recent advances primarily focus on developing specialized character controllers for individual interaction tasks, such as contacting and carrying, often overlooking the need to establish a unified policy for versatile skills. This limitation hinders the ability to generate high-quality motions across a variety of challenging human-scene interaction tasks that require the integration of multiple skills, e.g., walking to a chair and sitting down while carrying a box. To address this issue, we present TokenHSI, a unified controller designed to synthesize various types of human-scene interaction animations. The key innovation of our framework is the use of tokenized proprioception for the simulated character, combined with various task observations, complemented by a masking mechanism that enables the selection of tasks on demand. In addition, our unified policy network is equipped with flexible input size capabilities, enabling efficient adaptation of learned foundational skills to new environments and tasks. By introducing additional input tokens to the pre-trained policy, we can not only modify interaction targets but also integrate learned skills to address diverse challenges. Overall, our framework facilitates the generation of a wide …

Oral Session 2C: Temporal Modeling and Action Recognition Fri 13 Jun 01:00 p.m.  

Oral
Laurie Bose · Piotr Dudek · Jianing Chen

[ Davidson Ballroom ]

Abstract
This paper presents a novel approach for joint point-feature detection and tracking, specifically designed for Pixel Processor Array sensors (PPA). Instead of standard pixels, PPA sensors consist of thousands of "pixel-processors", enabling massive parallel computation of visual data at the point of light capture. Our approach performs all computation within these pixel-processors, meaning no raw image data need ever leave the sensor. Instead, sensor output can be reduced to merely the locations of tracked features, and the descriptors of newly initialized features, minimizing data transfer between sensor and external processing. To achieve this we store feature descriptors inside every pixel-processor, adjusting the layout of these descriptors every frame. The PPA's architecture enables us to compute the response of every stored descriptor in parallel. This "response map" is utilized for both detection and tracking of point-features across the pixel-processor array. This approach is very fast: our implementation upon the SCAMP-7 PPA prototype runs at over 3000 FPS (Frames Per Second), tracking point-features reliably even under violent motion. This is the first work performing point-feature detection and tracking entirely "in-pixel".
Oral
Anna Manasyan · Maximilian Seitzer · Filip Radovic · Georg Martius · Andrii Zadaianchuk

[ Davidson Ballroom ]

Abstract
Unsupervised object-centric learning from videos is a promising approach to extract structured representations from large, unlabeled collections of videos. To support downstream tasks like autonomous control, these representations must be both compositional and temporally consistent. Existing approaches based on recurrent processing often lack long-term stability across frames because their training objective does not enforce temporal consistency. In this work, we introduce a novel object-level temporal contrastive loss for video object-centric models that explicitly promotes temporal consistency. Our method significantly improves the temporal consistency of the learned object-centric representations, yielding more reliable video decompositions that facilitate challenging downstream tasks such as unsupervised object dynamics prediction. Furthermore, the inductive bias added by our loss strongly improves object discovery, leading to state-of-the-art results on both synthetic and real-world datasets, outperforming even weakly-supervised methods that leverage motion masks as additional cues.
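A hedged sketch of an object-level temporal contrastive objective in the InfoNCE style: each slot at frame t treats the same slot index at frame t+1 as its positive and all other slots in the batch as negatives. The temperature and the identity matching of slots across frames are assumptions for illustration, not necessarily the paper's exact loss.

```python
# Hypothetical sketch of an object-level temporal contrastive (InfoNCE-style) loss:
# each object slot at frame t uses the same slot index at frame t+1 as its positive
# and all other slots in the batch as negatives (temperature and slot matching are
# assumptions for illustration).
import torch
import torch.nn.functional as F

def temporal_slot_contrastive(slots_t, slots_tp1, temperature=0.1):
    """slots_t, slots_tp1: (B, K, D) object slots at consecutive frames."""
    B, K, D = slots_t.shape
    z_t = F.normalize(slots_t.reshape(B * K, D), dim=-1)
    z_tp1 = F.normalize(slots_tp1.reshape(B * K, D), dim=-1)

    logits = z_t @ z_tp1.t() / temperature          # (B*K, B*K) similarity matrix
    targets = torch.arange(B * K, device=slots_t.device)
    return F.cross_entropy(logits, targets)

if __name__ == "__main__":
    s_t = torch.randn(2, 6, 32)
    s_tp1 = s_t + 0.05 * torch.randn_like(s_t)       # slightly moved objects
    print(temporal_slot_contrastive(s_t, s_tp1))
```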
Oral
SuBeen Lee · WonJun Moon · Hyun Seok Seong · Jae-Pil Heo

[ Davidson Ballroom ]

Abstract
Few-Shot Action Recognition (FSAR) aims to train a model with only a few labeled video instances. A key challenge in FSAR is handling divergent narrative trajectories for precise video matching. While the frame- and tuple-level alignment approaches have been promising, their methods heavily rely on pre-defined and length-dependent alignment units (e.g., frames or tuples), which limits flexibility for actions of varying lengths and speeds. In this work, we introduce a novel TEmporal Alignment-free Matching (TEAM) approach, which eliminates the need for temporal units in action representation and brute-force alignment during matching. Specifically, TEAM represents each video with a fixed set of pattern tokens that capture globally discriminative clues within the video instance regardless of action length or speed, ensuring its flexibility. Furthermore, TEAM is inherently efficient, using token-wise comparisons to measure similarity between videos, unlike existing methods that rely on pairwise comparisons for temporal alignment. Additionally, we propose an adaptation process that identifies and removes common information across novel classes, establishing clear boundaries even between novel categories. Extensive experiments demonstrate the effectiveness of TEAM.
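As a rough sketch of alignment-free, token-wise matching, the snippet below compares two videos through fixed-size pattern-token sets, one comparison per token index, with no pairwise temporal alignment; averaging cosine similarities in this way is an assumption, not TEAM's exact metric.

```python
# Illustrative sketch of temporal-alignment-free matching: each video is summarized
# by a fixed set of pattern tokens, and two videos are compared token-wise (same
# token index against same token index) rather than by pairwise frame alignment.
import torch
import torch.nn.functional as F

def tokenwise_video_similarity(tokens_a, tokens_b):
    """tokens_a, tokens_b: (M, D) fixed-size pattern token sets for two videos."""
    a = F.normalize(tokens_a, dim=-1)
    b = F.normalize(tokens_b, dim=-1)
    return (a * b).sum(dim=-1).mean()     # one comparison per token, no alignment

if __name__ == "__main__":
    query = torch.randn(8, 64)
    support = torch.randn(8, 64)
    print(tokenwise_video_similarity(query, support))
```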
Oral
Zhejun Zhang · Peter Karkus · Maximilian Igl · Wenhao Ding · Yuxiao Chen · Boris Ivanovic · Marco Pavone

[ Davidson Ballroom ]

Abstract
Traffic simulation aims to learn a policy for traffic agents that, when unrolled in closed-loop, faithfully recovers the joint distribution of trajectories observed in the real world. Inspired by large language models, tokenized multi-agent policies have recently become the state-of-the-art in traffic simulation. However, they are typically trained through open-loop behavior cloning, and thus suffer from covariate shift when executed in closed-loop during simulation. In this work, we present Closest Among Top-K (CAT-K) rollouts, a simple yet effective closed-loop fine-tuning strategy to mitigate covariate shift. CAT-K fine-tuning only requires existing trajectory data, without reinforcement learning or generative adversarial imitation. Concretely, CAT-K fine-tuning enables a small 7M-parameter tokenized traffic simulation policy to outperform a 102M-parameter model from the same model family, achieving the top spot on the Waymo Sim Agent Challenge leaderboard at the time of submission. Our code will be made publicly available.
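A hedged sketch of one Closest-Among-Top-K rollout step: keep the top-K candidate actions under the current policy and commit to the one whose successor state lands closest to the logged ground truth. The simulator interface, state distance, and K are placeholders, not the authors' exact setup.

```python
# Hypothetical sketch of a single CAT-K rollout step: among the top-K actions under
# the current policy, commit to the one whose successor state is closest to the
# logged ground-truth state (interfaces and distance metric are placeholders).
import numpy as np

def cat_k_step(action_logits, candidate_next_states, gt_next_state, k=5):
    """action_logits: (A,); candidate_next_states: (A, S) next state per action;
    gt_next_state: (S,). Returns the chosen action index."""
    top_k = np.argsort(action_logits)[-k:]                       # top-K by policy score
    dists = np.linalg.norm(candidate_next_states[top_k] - gt_next_state, axis=1)
    return int(top_k[int(np.argmin(dists))])                     # closest to the log

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    logits = rng.normal(size=32)
    nexts = rng.normal(size=(32, 4))                             # toy 4-D agent states
    gt = rng.normal(size=4)
    print(cat_k_step(logits, nexts, gt))
```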
Oral
Otto Brookes · Maksim Kukushkin · Majid Mirmehdi · Colleen Stephens · Paula Dieguez · Thurston Cleveland Hicks · Sorrel CZ Jones · Kevin C. Lee · Maureen S. McCarthy · Amelia C. Meier · NORMAND E. · Erin G. Wessling · Roman M. Wittig · Kevin Langergraber · Klaus Zuberbühler · Lukas Boesch · Thomas Schmid · Mimi Arandjelovic · Hjalmar S. Kühl · Tilo Burghardt

[ Davidson Ballroom ]

Abstract
Computer vision analysis of camera trap video footage is essential for wildlife conservation, as captured behaviours offer some of the earliest indicators of changes in population health. Recently, several high-impact animal behaviour datasets and methods have been introduced to encourage their use; however, the role of behaviour-correlated background information and its significant effect on out-of-distribution generalisation remain unexplored. In response, we present the PanAf-FGBG dataset, featuring 20 hours of wild chimpanzee behaviours, recorded at over 350 individual camera locations. Uniquely, it pairs every video containing a chimpanzee (referred to as a foreground video) with a corresponding background video (with no chimpanzee) from the same camera location. We present two views of the dataset: one with overlapping camera locations and one with disjoint locations. This setup enables, for the first time, direct evaluation of in-distribution and out-of-distribution conditions, and for the impact of backgrounds on behaviour recognition models to be quantified. All clips come with rich behavioural annotations and metadata including unique camera IDs and detailed textual scene descriptions. Additionally, we establish several baselines and present a highly effective latent-space normalisation technique that boosts out-of-distribution performance by +5.42% mAP for convolutional and +3.75% mAP for transformer-based models. Finally, we provide an …
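The abstract does not detail the latent-space normalisation, so the snippet below is only a hedged sketch of one simple variant: subtracting the paired background clip's embedding from the foreground clip's embedding (using the camera-location pairing the dataset provides) before classification. This is an assumption for illustration, not the authors' technique.

```python
# Hedged sketch of a background-aware latent-space normalisation: remove the
# location-specific bias carried by the paired background clip's embedding
# (this simple subtract-and-renormalise form is an assumption, not the paper's method).
import torch

def background_normalise(fg_embedding, bg_embedding, eps=1e-6):
    """fg_embedding, bg_embedding: (B, D) clip embeddings from the same camera location."""
    centred = fg_embedding - bg_embedding                    # remove location-specific bias
    return centred / (centred.norm(dim=-1, keepdim=True) + eps)

if __name__ == "__main__":
    fg = torch.randn(4, 256)
    bg = torch.randn(4, 256)
    print(background_normalise(fg, bg).shape)
```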
Oral
Yichen Xiao · Shuai Wang · Dehao Zhang · Wenjie Wei · Yimeng Shan · Xiaoli Liu · Yulin Jiang · Malu Zhang

[ Davidson Ballroom ]

Abstract
Transformers significantly raise the performance limits across various tasks, spurring research into integrating them into spiking neural networks. However, a notable performance gap remains between existing spiking Transformers and their artificial neural network counterparts. Here, we first analyze the reason for this gap and identify that the dot product ineffectively calculates similarity between spiking Queries (Q) and Keys (K). To address this challenge, we introduce an innovative α-XNOR similarity calculation method tailored for spike trains. α-XNOR similarity redefines the correlation of non-spike pairs as a specific value α, effectively overcoming the limitations of dot-product similarity caused by numerous non-spiking events. Additionally, considering the sparse nature of spike trains where spikes carry more information than non-spikes, the α-XNOR similarity correspondingly highlights the distinct importance of spikes over non-spikes. Extensive experiments demonstrate that our α-XNOR similarity significantly improves performance across different spiking Transformer architectures in various static and neuromorphic datasets. This is the first attempt to develop a spiking self-attention paradigm tailored for the binary characteristics of spike trains.
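One plausible reading of the α-XNOR similarity, written out as code: spike/spike coincidences contribute 1, non-spike pairs contribute a fixed value α, and mismatches contribute 0. The exact formulation and the value of α in the paper may differ, so treat this as an assumption.

```python
# One plausible reading of the α-XNOR similarity for binary spike trains: a
# spike/spike coincidence contributes 1, a non-spike/non-spike pair contributes
# alpha, and a mismatch contributes 0 (an assumption for illustration).
import torch

def alpha_xnor_similarity(q, k, alpha=0.25):
    """q, k: binary spike tensors of shape (..., T) with values in {0, 1}."""
    both_spike = q * k                      # 1 where both fire
    both_silent = (1 - q) * (1 - k)         # 1 where neither fires
    return (both_spike + alpha * both_silent).sum(dim=-1)

if __name__ == "__main__":
    q = (torch.rand(2, 16) < 0.2).float()   # sparse spike trains
    k = (torch.rand(2, 16) < 0.2).float()
    print(alpha_xnor_similarity(q, k))
```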

Invited Talk: Harry Shum

Keynote 1: Harry Shum




Poster Session 2 Fri 13 Jun 04:00 p.m.  

Poster
Kai Chen · Yunhao Gou · Runhui Huang · Zhili Liu · Daxin Tan · Jing Xu · Chunwei Wang · Yi Zhu · yihan zeng · Kuo Yang · Dingdong WANG · Kun Xiang · Haoyuan Li · Haoli Bai · Jianhua Han · Xiao-Hui Li · Weike Jin · Nian Xie · Yu Zhang · James Kwok · Hengshuang Zhao · Xiaodan Liang · Dit-Yan Yeung · Xiao Chen · Zhenguo Li · Wei Zhang · Qun Liu · Lanqing Hong · Lu Hou · Hang Xu

[ ExHall D ]

Abstract
GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and tones, marks a milestone for omni-modal foundation models. However, empowering Large Language Models to perceive and generate images, texts, and speeches end-to-end with publicly available data remains challenging for the open-source community. Existing vision-language models rely on external tools for speech processing, while speech-language models still suffer from limited or entirely absent vision-understanding capabilities. To address this gap, we propose EMOVA (EMotionally Omni-present Voice Assistant) to equip Large Language Models with end-to-end speech abilities while maintaining the leading vision-language performance. With a semantic-acoustic disentangled speech tokenizer, we surprisingly notice that omni-modal alignment can further enhance vision-language and speech abilities compared with the bi-modal aligned counterparts. Moreover, a lightweight style module is introduced for flexible speech style controls including emotions and pitches. For the first time, EMOVA achieves state-of-the-art performance on both the vision-language and speech benchmarks, and meanwhile supports omni-modal spoken dialogue with vivid emotions.
Poster
Xiumei Xie · Zikai Huang · Wenhao Xu · Peng Xiao · Xuemiao Xu · Huaidong Zhang

[ ExHall D ]

Abstract
Singing is a vital form of human emotional expression and social interaction, distinguished from speech by its richer emotional nuances and freer expressive style. Thus, investigating 3D facial animation driven by singing holds significant research value. Our work focuses on 3D singing facial animation driven by mixed singing audio, and to the best of our knowledge, no prior studies have explored this area. Additionally, the absence of existing 3D singing datasets poses a considerable challenge. To address this, we collect a novel audiovisual dataset, ChorusHead, which features synchronized mixed vocal audio and pseudo-3D FLAME motions for chorus singing. In addition, we propose a partner-aware 3D chorus head generation framework driven by mixed audio inputs. The proposed framework extracts emotional features from the background music and the dependencies between singers, and models head movement in a latent space of a Variational Autoencoder (VAE), enabling diverse interactive head animation generation. Extensive experimental results demonstrate that our approach effectively generates 3D facial animations of interacting singers, achieving notable improvements in realism and handling background music interference with strong robustness.
Poster
Antoni Bigata Casademunt · Michał Stypułkowski · Rodrigo Mira · Stella Bounareli · Konstantinos Vougioukas · Zoe Landgraf · Nikita Drobyshev · Maciej Zieba · Stavros Petridis · Maja Pantic

[ ExHall D ]

Abstract
Current audio-driven facial animation methods achieve impressive results for short videos but suffer from error accumulation and identity drift when extended to longer durations. Existing methods attempt to mitigate this through external spatial control, increasing long-term consistency but compromising the naturalness of motion. We propose KeyFace, a novel two-stage diffusion-based framework, to address these issues. In the first stage, keyframes are generated at a low frame rate, conditioned on audio input and an identity frame, to capture essential facial expressions and movements over extended periods of time. In the second stage, an interpolation model fills in the gaps between keyframes, ensuring smooth transitions and temporal coherence. To further enhance realism, we incorporate continuous emotion representations and handle a wide range of non-speech vocalizations (NSVs), such as laughter and sighs. We also introduce two new evaluation metrics for assessing lip synchronization and NSV generation. Experimental results show that KeyFace outperforms state-of-the-art methods in generating natural, coherent facial animations over extended durations, successfully encompassing NSVs and continuous emotions.
Poster
Rang Meng · Xingyu Zhang · Yuming Li · Chenguang Ma

[ ExHall D ]

Abstract
Recent work on human animation usually involves audio, pose, or movement map conditions, thereby achieving vivid animation quality. However, these methods often face practical challenges due to extra control conditions, cumbersome condition injection modules, or limitation to head-region driving. Hence, we ask if it is possible to achieve striking half-body human animation while simplifying unnecessary conditions. To this end, we propose a half-body human animation method, dubbed EchoMimicV2, that leverages a novel Audio-Pose Dynamic Harmonization strategy, including Pose Sampling and Audio Diffusion, to enhance half-body details, facial and gestural expressiveness, and meanwhile reduce condition redundancy. To compensate for the scarcity of half-body data, we utilize Head Partial Attention to seamlessly accommodate headshot data into our training framework, which can be omitted during inference, providing a free lunch for animation. Furthermore, we design the Phase-specific Denoising Loss to guide motion, detail, and low-level quality for animation in specific phases, respectively. Besides, we also present a novel benchmark for evaluating the effectiveness of half-body human animation. Extensive experiments and analyses demonstrate that EchoMimicV2 surpasses existing methods in both quantitative and qualitative evaluations. The source code and test dataset will be open-sourced.
Poster
Di Chang · Hongyi Xu · You Xie · Yipeng Gao · Zhengfei Kuang · Shengqu Cai · Chenxu Zhang · Guoxian Song · Chao Wang · Yichun Shi · Zeyuan Chen · Shijie Zhou · Linjie Luo · Gordon Wetzstein · Mohammad Soleymani

[ ExHall D ]

Abstract
We introduce X-Dyna, a novel zero-shot, diffusion-based pipeline for animating a single human image using facial expressions and body movements derived from a driving video, that generates realistic, context-aware dynamics for both the subject and the surrounding environment. Building on prior approaches centered on human pose control, X-Dyna addresses key factors underlying the loss of dynamic details, enhancing the lifelike qualities of human video animations. At the core of our approach is the Dynamics-Adapter, a lightweight module that effectively integrates reference appearance context into the spatial attentions of the diffusion backbone while preserving the capacity of motion modules in synthesizing fluid and intricate dynamic details. Beyond body pose control, we connect a local control module with our model to capture identity-disentangled facial expressions, facilitating accurate expression transfer for enhanced realism in animated scenes. Together, these components form a unified framework capable of learning physical human motion and natural scene dynamics from a diverse blend of human and scene videos. Comprehensive qualitative and quantitative evaluations demonstrate that X-Dyna outperforms state-of-the-art methods, creating highly lifelike and expressive animations.
Poster
Yiqun Mei · Mingming He · Li Ma · Julien Philip · Wenqi Xian · David M George · Xueming Yu · Gabriel Dedic · Ahmet Levent Taşel · Ning Yu · Vishal M. Patel · Paul Debevec

[ ExHall D ]

Abstract
Video portrait relighting remains challenging because the results need to be both photorealistic and temporally stable. This typically requires a strong model design that can capture complex facial reflections as well as intensive training on a high-quality paired video dataset, such as dynamic one-light-at-a-time (OLAT). In this work, we introduce Lux Post Facto, a novel portrait video relighting method that produces both photorealistic and temporally consistent lighting effects. From the model side, we design a new conditional video diffusion model built upon a state-of-the-art pre-trained video diffusion model, alongside a new lighting injection mechanism to enable precise control. This way we leverage strong spatial and temporal generative capability to generate plausible solutions to the ill-posed relighting problem. Our technique uses a hybrid dataset consisting of static expression OLAT data and in-the-wild portrait performance videos to jointly learn relighting and temporal modeling. This avoids the need to acquire paired video data in different lighting conditions. Our extensive experiments show that our model produces state-of-the-art results both in terms of photorealism and temporal consistency.
Poster
Shengjie Gong · Haojie Li · Jiapeng Tang · Dongming Hu · Shuangping Huang · Hao Chen · Tianshui Chen · Zhuoman Liu

[ ExHall D ]

Abstract
In this work, we introduce Monocular and Generalizable Gaussian Talking Head Animation (MGGTalk), which requires monocular datasets and generalizes to unseen identities without personalized re-training. Compared with previous 3D Gaussian Splatting (3DGS) methods that require elusive multi-view datasets or tedious personalized learning/inference, MGGTalk enables more practical and broader applications. However, in the absence of multi-view and personalized training data, the incompleteness of geometric and appearance information poses a significant challenge. To address these challenges, MGGTalk explores depth information to enhance geometric and facial symmetry characteristics to supplement both geometric and appearance features. Initially, based on the pixel-wise geometric information obtained from depth estimation, we incorporate symmetry operations and point cloud filtering techniques to ensure a complete and precise position parameter for 3DGS. Subsequently, we adopt a two-stage strategy with symmetric priors for predicting the remaining 3DGS parameters. We begin by predicting Gaussian parameters for the visible facial regions of the source image. These parameters are subsequently utilized to improve the prediction of Gaussian parameters for the non-visible regions. Extensive experiments demonstrate that MGGTalk surpasses previous state-of-the-art methods, achieving superior performance across various metrics.
Poster
Jiawei Zhang · Zijian Wu · Zhiyang Liang · Yicheng Gong · Dongfang Hu · Yao Yao · Xun Cao · Hao Zhu

[ ExHall D ]

Abstract
Reconstructing high-fidelity, animatable 3D head avatars from effortlessly captured monocular videos is a pivotal yet formidable challenge. Although significant progress has been made in rendering performance and manipulation capabilities, notable challenges remain, including incomplete reconstruction and inefficient Gaussian representation. To address these challenges, we introduce FATE — a novel method for reconstructing an editable full-head avatar from a single monocular video. FATE integrates a sampling-based densification strategy to ensure optimal positional distribution of points, improving rendering efficiency. A neural baking technique is introduced to convert discrete Gaussian representations into continuous attribute maps, facilitating intuitive appearance editing. Furthermore, we propose a universal completion framework to recover non-frontal appearance, culminating in a 360-renderable 3D head avatar. FATE outperforms previous approaches in both qualitative and quantitative evaluations, achieving state-of-the-art performance. To the best of our knowledge, FATE is the first animatable and 360 full-head monocular reconstruction method for a 3D head avatar. The code will be publicly released upon publication.
Poster
Felix Taubner · Ruihang Zhang · Mathieu Tuli · David B. Lindell

[ ExHall D ]

Abstract
Reconstructing photorealistic and dynamic portrait avatars from images is essential to many applications including advertising, visual effects, and virtual reality. Depending on the application, avatar reconstruction involves different capture setups and constraints — for example, visual effects studios use camera arrays to capture hundreds of reference images, while content creators may seek to animate a single portrait image downloaded from the internet. As such, there is a large and heterogeneous ecosystem of methods for avatar reconstruction. Techniques based on multi-view stereo or neural rendering achieve the highest quality results, but require hundreds of reference images. Recent generative models produce convincing avatars from a single reference image, but their visual fidelity still lags behind multi-view techniques. Here, we present CAP4D: an approach that uses a morphable multi-view diffusion model to reconstruct photoreal 4D portrait avatars from any number of reference images (i.e., one to 100) and animate and render them in real time. Our approach demonstrates state-of-the-art performance for single-, few-, and multi-image 4D portrait avatar reconstruction, and takes steps to bridge the gap in visual fidelity between single-image and multi-view reconstruction techniques.
Poster
Jiapeng Tang · Davide Davoli · Tobias Kirschstein · Liam Schoneveld · Matthias Nießner

[ ExHall D ]

Abstract
We propose a novel approach for reconstructing animatable 3D Gaussian avatars from monocular videos captured by commodity devices like smartphones. Photorealistic 3D head avatar reconstruction from such recordings is challenging due to limited observations, which leaves unobserved regions under-constrained and can lead to artifacts in novel views. To address this problem, we introduce a multi-view head diffusion model, leveraging its priors to fill in missing regions and ensure view consistency in Gaussian splatting renderings. To enable precise viewpoint control, we use normal maps rendered from FLAME-based head reconstruction, which provides pixel-aligned inductive biases. We also condition the diffusion model on VAE features extracted from the input image to preserve details of facial identity and appearance. For Gaussian avatar reconstruction, we distill multi-view diffusion priors by using iteratively denoised images as pseudo-ground truths, effectively mitigating over-saturation issues. To further improve photorealism, we apply latent upsampling to refine the denoised latent before decoding it into an image. We evaluate our method on the NeRSemble dataset, showing that GAF outperforms the previous state-of-the-art methods in novel view synthesis by a 5.34% higher SSIM score. Furthermore, we demonstrate higher-fidelity avatar reconstructions from monocular videos captured on commodity devices.
Poster
Chen Guo · Junxuan Li · Yash Kant · Yaser Sheikh · Shunsuke Saito · Chen Cao

[ ExHall D ]

Abstract
We present Vid2Avatar-Pro, a method to create photorealistic and animatable 3D human avatars from monocular in-the-wild videos. Building a high-quality avatar that supports animation with diverse poses from a monocular video is challenging because the observation of pose diversity and view points is inherently limited. The lack of pose variations typically leads to poor generalization to novel poses, and avatars can easily overfit to limited input view points, producing artifacts and distortions from other views. In this work, we address these limitations by leveraging a universal prior model (UPM) learned from a large corpus of multi-view clothed human performance capture data. We build our representation on top of expressive 3D Gaussians with canonical front and back maps shared across identities. Once the UPM is learned to accurately reproduce the large-scale multi-view human images, we fine-tune the model with an in-the-wild video via inverse rendering to obtain a personalized photorealistic human avatar that can be faithfully animated to novel human motions and rendered from novel views. The experiments show that our approach based on the learned universal prior sets a new state-of-the-art in monocular avatar reconstruction by substantially outperforming existing approaches relying only on heuristic regularization or a shape prior of …
Poster
Yufan Wu · Xuanhong Chen · Wen Li · Shunran Jia · Hualiang Wei · Kairui Feng · Jialiang CHEN · Yuhan Li · Ang He · Weimin Zhang · Bingbing Ni · Wenjun Zhang

[ ExHall D ]

Abstract
Despite significant advances in accurately estimating geometry in contemporary single-image 3D human reconstruction, creating a high-quality, efficient, and animatable 3D avatar remains an open challenge. Two key obstacles persist: incomplete observation and inconsistent 3D priors. To address these challenges, we propose SinG, aiming to achieve high-quality and efficient animatable 3D avatar reconstruction. At the heart of SinG are two key components: Kinematic Human Diffusion (KHD) and compact 3D distillation. The former is a foundational human model that samples within pose space to generate a highly 3D-consistent and high-quality sequence of human images, inferring unseen viewpoints and providing kinematic priors. The latter is a system that reconstructs a compact, high-quality 3D avatar even under imperfect priors, achieved through a novel regional Laplacian that enables precise and compact assembly of 3D primitives. Extensive experiments demonstrate that SinG enables lifelike, animatable human reconstructions, maintaining both high quality and inference efficiency.
Poster
Suzhen Wang · Weijie Chen · Wei Zhang · Minda Zhao · Lincheng Li · Rongsheng Zhang · Zhipeng Hu · Xin Yu

[ ExHall D ]

Abstract
Character customization, or "face crafting," is a vital feature in role-playing games (RPGs) that enhances player engagement by allowing the creation of personalized avatars. Existing automated methods often face challenges in maintaining style consistency and adaptability across different game engines due to their reliance on specific image domains. We introduce EasyCraft, an innovative end-to-end feedforward framework that automates character crafting by integrating both text and image inputs. Our approach utilizes a translator capable of converting facial images of any style into crafting parameters. To develop this translator, we first establish a unified feature distribution in the translator's image encoder through self-supervised learning on a large-scale dataset. Subsequently, we learn the mapping from the feature distribution to the crafting parameters specific to a game engine. By integrating text-to-image techniques with our translator, EasyCraft also supports precise, text-based character crafting. EasyCraft's integration of diverse inputs significantly enhances the versatility and accuracy of avatar creation. Extensive experiments on two RPG games demonstrate the effectiveness of our method, achieving state-of-the-art results and facilitating adaptability across various avatar engines.
Poster
Yuxin Yao · Zhi Deng · Junhui Hou

[ ExHall D ]

Abstract
This paper considers the problem of modeling articulated objects captured in 2D videos to enable novel view synthesis, while also being easily editable, drivable, and reposable. To tackle this challenging problem, we propose RigGS, a new paradigm that leverages 3D Gaussian representation and skeleton-based motion representation to model dynamic objects without utilizing additional template priors. Specifically, we first propose skeleton-aware node-controlled deformation, which deforms a canonical 3D Gaussian representation over time to initialize the modeling process, producing candidate skeleton nodes that are further simplified into a sparse 3D skeleton according to their motion and semantic information. Subsequently, based on the resulting skeleton, we design learnable skin deformations and pose-dependent detailed deformations, allowing the 3D Gaussian representation to be easily deformed to generate new actions and to render high-quality images from novel views. Extensive experiments demonstrate that our method can easily generate realistic new actions for objects and achieve high-quality rendering. The source code will be publicly available.
Poster
Yuxiang Mao · Zhenfeng Fan · Zhijie Zhang · Zhiheng Zhang · Shihong Xia

[ ExHall D ]

Abstract
Training a generic 3D face reconstruction model in a self-supervised manner using large-scale, in-the-wild 2D face image datasets enhances robustness to varying lighting conditions and occlusions while allowing the model to capture animatable wrinkle details across diverse facial expressions. However, a generic model often fails to adequately represent the unique characteristics of specific individuals. In this paper, we propose a method to train a generic base model and then transfer it to yield person-specific models by integrating lightweight adapters within the large-parameter ViT-MAE base model. These person-specific models excel at capturing individual facial shapes and detailed features while preserving the robustness and prior knowledge of detail variations from the base model. During training, we introduce a silhouette vertex re-projection loss to address boundary "landmark marching" issues on the 3D face caused by pose variations. Additionally, we employ an innovative teacher-student loss to leverage the inherent strengths of UNet in feature boundary localization for training our detail MAE. Quantitative and qualitative experiments demonstrate that our approach achieves state-of-the-art performance in face alignment, detail accuracy, and richness. The code will be released to the public upon the acceptance of this paper.
Poster
Wooseok Jang · Youngjun Hong · Geonho Cha · Seungryong Kim

[ ExHall D ]

Abstract
Manipulation of facial images to meet specific controls such as pose, expression, and lighting, also referred to as face rigging, is a complex task in computer vision. Existing methods are limited by their reliance on image datasets, which necessitates individual-specific fine-tuning and limits their ability to retain fine-grained identity and semantic details, reducing practical usability. To overcome these limitations, we introduce ControlFace, a novel face rigging method conditioned on 3DMM renderings that enables flexible, high-fidelity control. ControlFace employs dual-branch U-Nets: one, referred to as FaceNet, captures identity and fine details, while the other focuses on generation. To enhance control precision, a control mixer module encodes the correlated features between the target-aligned control and reference-aligned control, and a novel guidance method, reference control guidance, steers the generation process for better control adherence. By training on a facial video dataset, we fully utilize FaceNet’s rich representations while ensuring control adherence. Extensive experiments demonstrate ControlFace’s superior performance in identity preservation and control precision, highlighting its practicality. Code and pre-trained weights will be publicly available.
Poster
Yifang Xu · BenXiang Zhai · Yunzhuo Sun · Ming Li · Yang Li · Sidan Du

[ ExHall D ]

Abstract
Recent advancements in diffusion-based technologies have made significant strides, particularly in identity-preserved portrait generation (IPG). However, when using multiple reference images from the same ID, existing methods typically produce lower-fidelity portraits and struggle to customize face attributes precisely. To address these issues, this paper presents Method, a high-fidelity method for zero-shot portrait generation. Specifically, we first introduce the face refiner and landmark generator to obtain fine-grained multi-face features and 3D-aware face landmarks. The landmarks include the reference ID and the target attributes. Then, we design HiFi-Net to fuse multi-face features and align them with landmarks, which improves ID fidelity and face control. In addition, we devise an automated pipeline to construct an ID-based dataset for training Method. Extensive experimental results demonstrate that our method surpasses the state-of-the-art approaches in face similarity and controllability. Furthermore, our method is also compatible with previous SDXL-based works. Our code is available at -.
Poster
Hyeongjin Nam · Donghwan Kim · Jeongtaek Oh · Kyoung Mu Lee

[ ExHall D ]

Abstract
Most existing methods of 3D clothed human reconstruction from a single image consider the clothed human as a single object, without separating cloth from body. In this work, we present DeClotH, a novel framework that separately reconstructs 3D cloth and human body from a single image. This task has not yet been explored because of the extreme occlusion between cloth and the human body, which makes it complicated to infer the overall geometries and textures. Furthermore, while recent 3D human reconstruction methods show impressive results by leveraging text-to-image diffusion models, naively adopting such a strategy to this problem often provides incorrect guidance, especially in reconstructing 3D cloth. To address these challenges, we propose two core designs in our framework. First, to alleviate the occlusion issue, we leverage 3D template models of both cloth and human body as regularizations, which provide strong priors and prevent erroneous reconstruction caused by occlusion. Second, we leverage a cloth diffusion model specifically devised to provide contextual information about cloth appearances for reconstructing 3D cloth. Qualitative and quantitative experiments demonstrate that our proposed approaches are highly effective in reconstructing 3D cloth and the human body. The codes will be released.
Poster
Tengfei Xiao · Yue Wu · Yuelong Li · Can Qin · Maoguo Gong · Qiguang Miao · Wenping Ma

[ ExHall D ]

Abstract
Human pose generation is a complex task due to the non-rigid and highly variable nature of human body structures and appearances. Existing methods overlook the fundamental differences between spatial transformations of poses and texture generation for appearance, making them prone to overfitting. To this end, we propose a multi-pose generation framework driven by disentangled pose and appearance guidance. Our approach comprises a Global-aware Pose Generation module that iteratively generates pose embeddings, enabling effective control over non-rigid body deformations. Additionally, we introduce the Global-aware Transformer Decoder, which leverages similarity queries and attention mechanisms to achieve spatial transformations and enhance pose consistency through a Global-aware block. In the appearance generation phase, we condition a diffusion model on pose embeddings produced in the initial stage and introduce an appearance adapter that extracts high-level contextual semantic information from multi-scale features, enabling further refinement of pose appearance textures and providing appearance guidance. Extensive experiments on the UBC Fashion and TikTok datasets demonstrate that our framework achieves state-of-the-art results in both quality and fidelity, establishing it as a powerful approach for complex pose generation tasks.
Poster
Yuanbo Wang · Zhaoxuan Zhang · Jiajin Qiu · Dilong Sun · Zhengyu Meng · Xiaopeng Wei · Xin Yang

[ ExHall D ]

Abstract
Diffusion models have made breakthroughs in 3D generation tasks in recent years. However, current 3D diffusion models focus on reconstructing the target shape from images or a set of partial observations. While excelling at global context understanding, they struggle to capture the local details of complex shapes and are limited by occlusion and lighting conditions. To overcome these limitations, we utilize tactile images to capture local 3D shape information and propose a Touch2Shape model, which leverages a touch-conditioned diffusion model to explore and reconstruct the target shape from touch. For shape reconstruction, we have developed a touch embedding module to condition the diffusion model in creating a compact representation and a touch shape fusion module to refine the reconstructed shape. For shape exploration, we combine the diffusion model with reinforcement learning to train a shape exploration policy. This involves using the generated latent vector from the diffusion model to guide the touch exploration policy training through a novel reward design. Experiments validate that our method significantly improves reconstruction quality in both qualitative and quantitative analysis, and that our touch exploration policy further boosts reconstruction performance.
Poster
Zhiheng Liu · Ka Leong Cheng · Xi Chen · Jie Xiao · Hao Ouyang · Kai Zhu · Yu Liu · Yujun Shen · Qifeng Chen · Ping Luo

[ ExHall D ]

Abstract
Derived from diffusion models, MangaNinja specializes in the task of reference-guided line art colorization. We incorporate two thoughtful designs to ensure precise character detail transcription, including a patch shuffling module to facilitate correspondence learning between the reference color image and the target line art, and a point-driven control scheme to enable fine-grained color matching. Experiments on a self-collected benchmark demonstrate the superiority of our model over current solutions in terms of precise colorization. We further showcase the potential of the proposed interactive point control in handling challenging cases (*e.g.*, extreme poses and shadows), cross-character colorization, multi-reference harmonization, *etc.*, beyond the reach of existing algorithms.
Poster
Qingsen Yan · Yixu Feng · Cheng Zhang · Guansong Pang · Kangbiao Shi · Peng Wu · Wei Dong · Jinqiu Sun · Yanning Zhang

[ ExHall D ]

Abstract
Low-Light Image Enhancement (LLIE) is a crucial computer vision task that aims to restore detailed visual information from corrupted low-light images. Many existing LLIE methods operate in the standard RGB (sRGB) space and often produce color bias and brightness artifacts due to the high color sensitivity inherent in sRGB. While converting images to the Hue, Saturation and Value (HSV) color space helps resolve the brightness issue, it introduces significant red and black noise artifacts. To address this issue, we propose a new color space for LLIE, namely Horizontal/Vertical-Intensity (HVI), defined by polarized HS maps and learnable intensity. The former enforces small distances for red coordinates to remove the red artifacts, while the latter compresses the low-light regions to remove the black artifacts. To fully leverage the chromatic and intensity information, a novel Color and Intensity Decoupling Network (CIDNet) is further introduced to learn an accurate photometric mapping function under different lighting conditions in the HVI space. Comprehensive results from benchmark and ablation experiments show that the proposed HVI color space with CIDNet outperforms the state-of-the-art methods on 10 datasets.
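As a rough illustration of the polarized-HS idea described above, the numpy sketch below maps hue and saturation to a 2D polar plane and keeps a per-pixel intensity channel. The learnable intensity and the exact red-coordinate polarization of HVI are not reproduced; the function name and the parameter k are hypothetical placeholders.

```python
import numpy as np

def rgb_to_hvi_like(rgb, k=1.0):
    """Toy 'polarized HS + intensity' transform, loosely inspired by the HVI idea
    above. rgb: float array in [0, 1], shape (H, W, 3). Returns (H, W, 3)."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    v = rgb.max(axis=-1)                      # value / intensity proxy
    c = v - rgb.min(axis=-1)                  # chroma
    s = np.where(v > 0, c / np.maximum(v, 1e-8), 0.0)

    # Hue in [0, 1), standard HSV-style computation.
    h = np.zeros_like(v)
    mask = c > 1e-8
    r_eq = mask & (v == r)
    g_eq = mask & (v == g) & ~r_eq
    b_eq = mask & ~r_eq & ~g_eq
    h[r_eq] = ((g - b)[r_eq] / c[r_eq]) % 6
    h[g_eq] = ((b - r)[g_eq] / c[g_eq]) + 2
    h[b_eq] = ((r - g)[b_eq] / c[b_eq]) + 4
    h = h / 6.0

    # Polarize: hue as angle, (intensity-compressed) saturation as radius.
    radius = s * (v ** k)                     # k stands in for the learnable intensity
    h_map = radius * np.cos(2 * np.pi * h)    # "horizontal" coordinate
    v_map = radius * np.sin(2 * np.pi * h)    # "vertical" coordinate
    return np.stack([h_map, v_map, v], axis=-1)
```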
Poster
Tianfu Wang · Mingyang Xie · Haoming Cai · Sachin Shah · Christopher Metzler

[ ExHall D ]

Abstract
Transparent surfaces, such as glass, create complex reflections that obscure images, posing challenges for various computer vision applications. We present Flash-Split, a robust, two-stage flash-based approach for separating transmitted and reflected light using a single pair of flash/no-flash images, even if they are misaligned. Stage 1 performs separation in latent space via a dual-branch diffusion model, reducing the need for alignment by encoding both transmission and reflection with cross-attention. Stage 2 enhances image sharpness through conditional decoding, blending separated low-frequency and original high-frequency information. Flash-Split achieves state-of-the-art results in single-view reflection separation, validated on synthetic and real-world data, and outperforms existing methods in challenging scenarios.
Poster
Feiran Li · Haiyang Jiang · Daisuke Iso

[ ExHall D ]

Abstract
Noise synthesis is a promising solution for addressing the data shortage problem in data-driven low-light RAW image denoising. However, accurate noise synthesis methods often necessitate labor-intensive calibration and profiling procedures during preparation, preventing them from being adopted in practice at scale. This work introduces a practically simple noise synthesis pipeline based on detailed analyses of noise properties and extensive justification of widespread techniques. Compared to other approaches, our proposed pipeline eliminates the cumbersome system gain calibration and signal-independent noise profiling steps, reducing the preparation time for noise synthesis from days to hours. Meanwhile, our method exhibits strong denoising performance, showing up to a 0.54 dB PSNR improvement over the current state-of-the-art noise synthesis technique.
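For orientation, the baseline that calibration-based pipelines typically start from is a heteroscedastic Poisson-Gaussian model: shot noise scaled by a system gain plus signal-independent read noise. The sketch below illustrates only that generic baseline, not the calibration-free pipeline proposed above; the gain and read-noise values are hypothetical.

```python
import numpy as np

def synthesize_raw_noise(clean_raw, gain=2.0, read_std=1.5, rng=None):
    """Generic Poisson-Gaussian RAW noise synthesis (baseline illustration only).
    clean_raw: linear RAW intensities in DN, shape (H, W)."""
    rng = np.random.default_rng() if rng is None else rng
    # Shot (photon) noise: Poisson in electrons, then back to DN via the gain.
    electrons = np.clip(clean_raw / gain, 0, None)
    shot = rng.poisson(electrons) * gain
    # Signal-independent read noise (Gaussian here; real sensors are heavier-tailed).
    read = rng.normal(0.0, read_std, size=clean_raw.shape)
    return shot + read
```

The gain calibration and read-noise profiling that this toy model takes as given are exactly the steps the paper above reports eliminating.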
Poster
Hang Chen · Yin Xie · Xiaoxiu Peng · Lihu Sun · Wenkai Su · Xiaodong Yang · Chengming Liu

[ ExHall D ]

Abstract
Defocus deblurring is a challenging task due to the spatially varying blur. Recent works have shown impressive results in data-driven approaches using dual-pixel (DP) sensors. Quad-pixel (QP) sensors represent an advanced evolution of DP sensors, providing four distinct sub-aperture views in contrast to only two views offered by DP sensors. However, research on QP-based defocus deblurring is scarce. In this paper, we propose a novel end-to-end learning-based approach for defocus deblurring that leverages QP data. To achieve this, we design a QP defocus and all-in-focus image pair acquisition method and provide a QP Defocus Deblurring (QPDD) dataset containing 4,935 image pairs. We then introduce a Local-gate assisted Mamba Network (LMNet), which includes a two-branch encoder and a Simple Fusion Module (SFM) to fully utilize features of sub-aperture views. In particular, our LMNet incorporates a Local-gate assisted Mamba Block (LAMB) that mitigates local pixel forgetting and channel redundancy within Mamba, and effectively captures global and local dependencies. By extending the defocus deblurring task from a DP-based to a QP-based approach, we demonstrate significant improvements in restoring sharp images. Comprehensive experimental evaluations further indicate that our approach outperforms state-of-the-art methods.
Poster
Jun Myeong Choi · Annie N. Wang · Pieter Peers · Anand Bhattad · Roni Sengupta

[ ExHall D ]

Abstract
Image-based relighting of indoor rooms creates an immersive virtual understanding of the space, which is useful for interior design, virtual staging, and real estate. Relighting indoor rooms from a single image is especially challenging due to complex illumination interactions between multiple lights and cluttered objects featuring a large variety in geometrical and material complexity. Recently, generative models have been successfully applied to image-based relighting conditioned on a target image or a latent code, albeit without detailed local lighting control. In this paper, we introduce ScribbleLight, a generative model that supports local fine-grained control of lighting effects through scribbles that describe changes in lighting. Our key technical novelty is an Albedo-conditioned Stable Image Diffusion model that preserves the intrinsic color and texture of the original image after relighting and an encoder-decoder-based ControlNet architecture that enables geometry-preserving lighting effects with normal map and scribble annotations. We demonstrate ScribbleLight's ability to create different lighting effects (e.g., turning lights on/off, adding highlights, casting shadows, or adding indirect lighting from unseen lights) from sparse scribble annotations.
Poster
Xiulong Liu · Anurag Kumar · Paul Calamia · Sebastia Vicenc Amengual Gari · Calvin Murdock · Ishwarya Ananthabhotla · Philip W Robinson · Eli Shlizerman · Vamsi Krishna Ithapu · Ruohan Gao

[ ExHall D ]

Abstract
In mixed reality applications, a realistic acoustic experience in spatial environments is as crucial as the visual experience for achieving true immersion. Despite recent advances in neural approaches for Room Impulse Response (RIR) estimation, most existing methods are limited to the single environment on which they are trained, lacking the ability to generalize to new rooms with different geometries and surface materials. We aim to develop a unified model capable of reconstructing the spatial acoustic experience of any environment with minimum additional measurements. To this end, we present xRIR, a framework for cross-room RIR prediction. The core of our generalizable approach lies in combining a geometric feature extractor, which captures spatial context from panorama depth images, with a RIR encoder that extracts detailed acoustic features from only a few reference RIR samples. To evaluate our method, we introduce AcousticRooms, a new dataset featuring high-fidelity simulation of over 300,000 RIRs from 260 rooms. Experiments show that our method strongly outperforms a series of baselines. Furthermore, we successfully perform sim-to-real transfer by evaluating our model on four real-world environments, demonstrating the generalizability of our approach and the realism of our dataset.
Poster
Tao Xie · Xi Chen · Zhen Xu · Yiman Xie · Yudong Jin · Yujun Shen · Sida Peng · Hujun Bao · Xiaowei Zhou

[ ExHall D ]

Abstract
Reconstructing complex reflections in real-world scenes from 2D images is essential for achieving photorealistic novel view synthesis. Existing methods that utilize environment maps to model reflections from distant lighting often struggle with high-frequency reflection details and fail to account for near-field reflections. In this work, we introduce RefGS, a novel environment representation that employs a set of Gaussian primitives as an explicit 3D model for capturing reflections. Our approach models all reflections, regardless of their distance, into a unified set of Gaussian primitives, effectively representing high-frequency reflections from both near and distant light sources. To efficiently render these environment Gaussians, we developed a ray-tracing-based renderer that leverages the GPU's RT core for fast rendering. This allows us to jointly optimize our model for high-quality reconstruction while maintaining real-time rendering speeds. Results from multiple real-world and synthetic datasets demonstrate that our method produces significantly more detailed reflections, achieving the best rendering quality in real-time novel view synthesis. The code will be released upon acceptance.
Poster
Kaiwen Jiang · Venkataram Sivaram · Cheng Peng · Ravi Ramamoorthi

[ ExHall D ]

Abstract
Geometric reconstruction of opaque surfaces from images is a longstanding challenge in computer vision, with renewed interest from volumetric view synthesis algorithms using radiance fields. We leverage the geometry field proposed in recent work for stochastic opaque surfaces, which can then be converted to volume densities. We adapt Gaussian kernels or surfels to splat the geometry field rather than the volume, enabling precise reconstruction of opaque solids. Our first contribution is to derive an efficient and almost exact differentiable rendering algorithm for geometry fields parameterized by Gaussian surfels, removing current approximations that rely on Taylor series and the assumption of no self-attenuation. Next, we address the discontinuous loss landscape when surfels cluster near geometry, showing how to guarantee that the rendered color is a continuous function of the colors of the kernels, irrespective of ordering. Finally, we use latent representations with spherical harmonics encoded reflection vectors rather than spherical harmonics encoded colors to better address specular surfaces. We demonstrate significant improvement in the quality of reconstructed 3D surfaces on widely-used datasets.
Poster
Ishit Mehta · Manmohan Chandraker · Ravi Ramamoorthi

[ ExHall D ]

Abstract
Problems in differentiable rendering often involve optimizing scene parameters that cause motion in image space. The gradients for such parameters tend to be sparse, leading to poor convergence. While existing methods address this sparsity through proxy gradients such as topological derivatives or Lagrangian derivatives, they make simplifying assumptions about rendering. Multi-resolution image pyramids offer an alternative approach but prove unreliable in practice. We introduce a method that uses locally orderless images, where each pixel maps to a histogram of intensities that preserves local variations in appearance. Using an inverse rendering objective that minimizes histogram distance, our method extends support for sparsely defined image gradients and recovers optimal parameters. We validate our method on various inverse problems using both synthetic and real data.
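A minimal sketch of the locally orderless idea follows, assuming soft intensity binning plus Gaussian spatial pooling and an L1 histogram distance; the paper's exact kernels and objective are not given in the abstract, so every choice below is illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def locally_orderless_histogram(img, n_bins=8, bin_sigma=0.05, spatial_sigma=2.0):
    """Each pixel stores a soft histogram of intensities from its neighbourhood.
    img: grayscale image in [0, 1], shape (H, W). Returns (H, W, n_bins)."""
    centers = (np.arange(n_bins) + 0.5) / n_bins
    # Soft binning in intensity (tonal smoothing).
    soft = np.exp(-0.5 * ((img[..., None] - centers) / bin_sigma) ** 2)
    soft /= soft.sum(axis=-1, keepdims=True) + 1e-8
    # Spatial pooling per bin (scale-space smoothing).
    return np.stack([gaussian_filter(soft[..., b], spatial_sigma)
                     for b in range(n_bins)], axis=-1)

def histogram_distance(h1, h2):
    """L1 distance between two locally orderless histograms, averaged over pixels."""
    return np.abs(h1 - h2).sum(axis=-1).mean()
```

Because each pixel's histogram collects intensities from a neighbourhood, small image-space motion changes the histograms smoothly, which is what restores useful gradients where per-pixel differences would be sparse.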
Poster
JunYong Choi · Min-Cheol Sagong · SeokYeong Lee · Seung-Won Jung · Ig-Jae Kim · Junghyun Cho

[ ExHall D ]

Abstract
We propose a diffusion-based inverse rendering framework that decomposes a single RGB image into geometry, material, and lighting. Inverse rendering is inherently ill-posed, making it difficult to predict a single accurate solution. To address this challenge, recent generative model-based methods aim to present a range of possible solutions. However, the objectives of finding a single accurate solution and generating diverse possible solutions can often be conflicting. In this paper, we propose a channel-wise noise scheduling approach that allows a single diffusion model architecture to achieve two conflicting objectives. The resulting two diffusion models, trained with different channel-wise noise schedules, can predict a single highly accurate solution and present multiple possible solutions. Additionally, we introduce a technique to handle high-dimensional per-pixel environment maps in diffusion models. The experimental results on both synthetic and real-world datasets demonstrate the superiority of our two models and their complementary nature, highlighting the importance of considering both accuracy and diversity in inverse rendering.
Poster
Moritz Heep · Sven Behnke · Eduard Zell

[ ExHall D ]

Abstract
High-resolution normal maps, e.g. from photometric stereo, capture astonishing surface details. Since normals are estimated pixel-by-pixel, the subsequent normal integration is typically performed on the regular pixel grid. Consequently, resolving structures half the size requires doubling both image height and width. To avoid this quadratic growth with increasing image size, recent work has suggested approximating the pixel grid with lower-resolution 2D triangle meshes and integrating the triangle mesh afterwards. In our work, we analyse the local shape structure to increase the compression ratio and precision of 2D meshes. Previous work generated only isotropic meshes with equilateral triangles. Instead, our mesh decimation method leverages anisotropy by placing edges and vertices along ridges and furrows, thereby achieving a more accurate approximation. Our main contribution is the derivation of the renowned quadric error measure from mesh decimation for screen space applications and its combination with optimal Delaunay triangulations.
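For reference, the quadric error measure the abstract builds on, in its standard Garland-Heckbert form, scores a candidate vertex position $\bar v$ by the sum of squared distances to the planes of its incident triangles:

$$K_p = p\,p^{\mathsf T},\qquad Q = \sum_{p \in \mathrm{planes}(v)} K_p,\qquad \Delta(\bar v) = \bar v^{\mathsf T} Q\, \bar v,$$

where $p=(a,b,c,d)^{\mathsf T}$ encodes a plane with $a^2+b^2+c^2=1$ and $\bar v=(x,y,z,1)^{\mathsf T}$ is given in homogeneous coordinates. The screen-space derivation and the combination with optimal Delaunay triangulations are the paper's contributions and are not reproduced here.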
Poster
Xinran Yang · Donghao Ji · Yuanqi Li · Jie Guo · Yanwen Guo · Junyuan Xie

[ ExHall D ]

Abstract
Neural rendering techniques have made substantial progress in generating photo-realistic 3D scenes. The latest 3D Gaussian Splatting technique has achieved high quality novel view synthesis as well as fast rendering speed. However, 3D Gaussians lack proficiency in defining accurate 3D geometric structures despite their explicit primitive representations. This is due to the fact that Gaussians' attributes are primarily tailored and fine-tuned, through their anisotropic nature, for rendering diverse 2D images. To pave the way for efficient 3D reconstruction, we present Spherical Gaussians, a simple and effective representation for 3D geometric boundaries, from which we can directly reconstruct 3D feature curves from a set of calibrated multi-view images. Spherical Gaussians are optimized from a grid initialization with a view-based rendering loss, where a 2D edge map is rendered at a specific view and then compared to the ground-truth edge map extracted from the corresponding image, without the need for any 3D guidance or supervision. Since Spherical Gaussians serve as an intermediate representation for robust edges, we further introduce a novel optimization-based algorithm called SGCR to directly extract accurate parametric curves from aligned Spherical Gaussians. We demonstrate that SGCR outperforms existing state-of-the-art methods in 3D edge reconstruction while enjoying great efficiency.
Poster
Zeyi Xu · Jinfan Liu · Kuangxu Chen · Ye Chen · Zhangli Hu · Bingbing Ni

[ ExHall D ]

Abstract
Accurately and efficiently simulating complex fluid dynamics is a challenging task that has traditionally relied on computationally intensive methods. Neural network-based approaches, such as convolutional and graph neural networks, have partially alleviated this burden by enabling efficient local feature extraction. However, they struggle to capture long-range dependencies due to limited receptive fields, and Transformer-based models, while providing global context, incur prohibitive computational costs. To tackle these challenges, we propose AMR-Transformer, an efficient and accurate neural CFD-solving pipeline that integrates a novel adaptive mesh refinement scheme with a Navier-Stokes constraint-aware fast pruning module. This design encourages long-range interactions between simulation cells and facilitates the modeling of global fluid wave patterns, such as turbulence and shockwaves. Experiments show that our approach achieves significant gains in efficiency while preserving critical details, making it suitable for high-resolution physical simulations with long-range dependencies. On CFDBench, PDEBench and a new shockwave dataset, our pipeline demonstrates up to an order-of-magnitude improvement in accuracy over baseline models. Additionally, compared to ViT, our approach achieves a reduction in FLOPs of up to 60 times.
Poster
Jianhui Wang · Zhifei Yang · Yangfan He · Huixiong Zhang · Yuxuan Chen · Jingwei Huang

[ ExHall D ]

Abstract
Accurate material retrieval is critical for creating realistic 3D assets. Existing methods rely on datasets that capture shape-invariant and lighting-varied representations of materials, which are scarce and face challenges due to limited diversity and inadequate real-world generalization. Most current approaches adopt traditional image search techniques. They fall short in capturing the unique properties of material spaces, leading to suboptimal performance in retrieval tasks. Addressing these challenges, we introduce MaRI, a framework designed to bridge the feature space gap between synthetic and real-world materials. MaRI constructs a shared embedding space that harmonizes visual and material attributes through a contrastive learning strategy by jointly training an image encoder and a material encoder, bringing similar materials and images closer while separating dissimilar pairs within the feature space. To support this, we construct a comprehensive dataset comprising high-quality synthetic materials rendered with controlled shape variations and diverse lighting conditions, along with real-world materials processed and standardized using material transfer techniques. Extensive experiments demonstrate the superior performance, accuracy, and generalization capabilities of MaRI across diverse and complex material retrieval tasks, outperforming existing methods.
Poster
Xiancheng Sun · Mai Xu · Shengxi Li · Senmao Ma · Xin Deng · Lai Jiang · Shen gang

[ ExHall D ]

Abstract
Panoramic images play a pivotal role in emerging virtual reality and augmented reality scenarios; however, generating panoramic images is challenging due to their intrinsic spherical geometry and the spherical distortions caused by equirectangular projection (ERP). To address this, we start from the spherical manifold underlying panoramic images, and propose a novel spherical manifold convolution (SMConv) on the S2 manifold. Based on the SMConv operation, we propose a spherical manifold guided diffusion (SMGD) model for text-conditioned panoramic image generation, which can well accommodate the spherical geometry during generation. We further develop a novel evaluation method by calculating grouped Fréchet inception distance (FID) on cube-map projections, which can well reflect the quality of generated panoramic images, compared to existing methods that randomly crop ERP-distorted content. Experiment results demonstrate that our SMGD model achieves state-of-the-art generation quality and accuracy, whilst retaining the shortest sampling time in the text-conditioned panoramic image generation task.
Poster
Zilong Chen · Yikai Wang · Wenqiang Sun · Feng Wang · Yiwen Chen · Huaping Liu

[ ExHall D ]

Abstract
In this paper, we introduce MeshGen, an advanced image-to-3D pipeline that generates high-quality 3D meshes with detailed geometry and physically based rendering (PBR) textures. Addressing the challenges faced by existing 3D native diffusion models, such as suboptimal auto-encoder performance, limited controllability, poor generalization, and inconsistent image-based PBR texturing, MeshGen employs several key innovations to overcome these limitations. We pioneer a render-enhanced point-to-shape auto-encoder that compresses meshes into a compact latent space, by designing perceptual optimization with ray-based regularization. This ensures that the 3D shapes are accurately represented and reconstructed to preserve geometric details within the latent space. To address data scarcity and image-shape misalignment, we further propose geometric augmentation and generative rendering augmentation techniques, which enhance the model's controllability and generalization ability, allowing it to perform well even with limited public datasets. For texture generation, MeshGen employs a reference attention-based multi-view ControlNet for consistent appearance synthesis. This is further complemented by our multi-view PBR decomposer that estimates PBR components and a UV inpainter that fills invisible areas, ensuring a seamless and consistent texture across the 3D mesh. Our extensive experiments demonstrate that MeshGen largely outperforms previous methods in both shape and texture generation, setting a new standard for the quality …
Poster
Soumyaratna Debnath · Ashish Tiwari · Kaustubh Sadekar · Shanmuganathan Raman

[ ExHall D ]

Abstract
Recent advancements in learning-based methods have opened new avenues for exploring and interpreting various artistic forms, such as shadow art, origami, and sketch art, through computational models. One notable visual art form is 3D Anamorphic Art, in which a specific ensemble of arbitrarily shaped 3D objects creates a realistic and meaningful expression when observed from a particular viewpoint and loses its coherence over the other viewpoints. In this work, we build on insights from 3D Anamorphic Art to address the task of 3D object arrangement. We introduce RASP - a differentiable-rendering-based framework to arrange arbitrarily shaped 3D objects within a bounded volume (or container) via shadow-guided optimization aiming towards minimal inter-object spacing and near-maximal occupancy. Here, the projected silhouette of the 3D object arrangement serves as its shadow. Furthermore, we propose a novel SDF-based formulation to handle inter-object intersection and container extrusion. Since the 3D objects can also be "parts" of another 3D object, we demonstrate how RASP can be extended to part assembly alongside object packing. Finally, we present artistic illustrations of multi-view anamorphic art, achieving meaningful expressions from multiple viewpoints within a single ensemble.
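For context, SDF-based non-intersection and containment terms of the kind mentioned above are often written as hinge penalties. This is a generic formulation, using the convention that signed distances are negative inside a shape; the paper's exact losses are not given in the abstract:

$$\mathcal{L}_{\text{overlap}} = \sum_{i\neq j}\;\sum_{x\in S_i} \max\bigl(0,\,-\,\mathrm{sdf}_j(x)\bigr),\qquad \mathcal{L}_{\text{extrude}} = \sum_{i}\;\sum_{x\in S_i} \max\bigl(0,\;\mathrm{sdf}_{\mathrm{container}}(x)\bigr),$$

where $S_i$ is a set of surface samples of object $i$: the first term penalizes samples that fall inside another object, and the second penalizes samples that leave the container.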
Poster
Jesus Zarzar · Tom Monnier · Roman Shapovalov · Andrea Vedaldi · David Novotny

[ ExHall D ]

Abstract
We present the first large reconstruction model, Twinner, capable of recovering a scene's illumination as well as an object's geometry and material properties from only a few posed images. Twinner is based on the Large Reconstruction Model and innovates in three key ways: 1) We introduce a memory-efficient voxel-grid transformer whose memory scales only quadratically with the size of the voxel grid. 2) To deal with scarcity of high-quality ground-truth PBR-shaded models, we introduce a large fully-synthetic dataset of procedurally-generated PBR-textured objects lit with varied illumination. 3) To narrow the synthetic-to-real gap, we finetune the model on real-life datasets by means of a differentiable physically-based shading model, eschewing the need for ground-truth illumination or material properties which are challenging to obtain in real life. We demonstrate the efficacy of our model on the real-life StanfordORB benchmark where, given few input views, we achieve reconstruction quality significantly superior to existing feed-forward reconstruction networks, and comparable to significantly slower per-scene optimization methods.
Poster
Weiyu Li · Jiarui Liu · Hongyu Yan · Rui Chen · Yixun Liang · Xuelin Chen · Ping Tan · Xiaoxiao Long

[ ExHall D ]

Abstract
We present a novel generative 3D modeling system, coined CraftsMan, which can generate high-fidelity 3D geometries with highly varied shapes, regular mesh topologies, and detailed surfaces, and, notably, allows for refining the geometry in an interactive manner. Despite the significant advancements in 3D generation, existing methods still struggle with lengthy optimization processes, self-occlusion, irregular mesh topologies, and difficulties in accommodating user edits, consequently impeding their widespread adoption and implementation in 3D modeling software. Our work is inspired by the craftsman, who usually roughs out the holistic figure of the work first and elaborates the surface details subsequently. Specifically, we first introduce a robust data preprocessing pipeline that utilizes visibility checks and winding numbers to maximize the use of existing 3D data. Leveraging this data, we employ a 3D-native DiT model that directly models the distribution of 3D data in latent space, generating coarse geometries with regular mesh topology in seconds. Subsequently, a normal-based geometry refiner enhances local surface details, which can be applied automatically or interactively with user input. Extensive experiments demonstrate that our method achieves high efficacy in producing superior quality 3D assets compared to existing methods.
Poster
Jiantao Lin · Xin Yang · Meixi Chen · Xu Yingjie · Dongyu Yan · Leyi Wu · Xinli Xu · Lie XU · Shunsi Zhang · Ying-Cong Chen

[ ExHall D ]

Abstract
Diffusion models have achieved great success in generating 2D images. However, the quality and generalizability of 3D content generation remain limited. State-of-the-art methods often require large-scale 3D assets for training, which are challenging to collect. In this work, we introduce Kiss3DGen (Keep It Simple and Straightforward in 3D Generation), an efficient framework for generating, editing, and enhancing 3D objects by repurposing a well-trained 2D image diffusion model for 3D generation. Specifically, we fine-tune a diffusion model to generate "3D Bundle Image", a tiled representation composed of multi-view images and their corresponding normal maps. The normal maps are then used to reconstruct a 3D mesh, and the multi-view images provide texture mapping, resulting in a complete 3D model. This simple method effectively transforms the 3D generation problem into a 2D image generation task, maximizing the utilization of knowledge in pretrained diffusion models. Furthermore, we demonstrate that our Kiss3DGen model is compatible with various diffusion model techniques, enabling advanced features such as 3D editing, mesh and texture enhancement, etc. Through extensive experiments, we demonstrate the effectiveness of our approach, showcasing its ability to produce high-quality 3D models efficiently.
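To make the "3D Bundle Image" representation concrete, the toy sketch below tiles multi-view renderings and their normal maps into one image grid; the actual layout and number of views used by Kiss3DGen may differ, and the function name is hypothetical.

```python
import numpy as np

def make_bundle_image(views, normals):
    """Tile multi-view RGB renderings (top row) and their normal maps (bottom row)
    into a single bundle-image-style grid.
    views, normals: lists of (H, W, 3) arrays of equal length and size."""
    assert len(views) == len(normals)
    top = np.concatenate(views, axis=1)       # multi-view images side by side
    bottom = np.concatenate(normals, axis=1)  # corresponding normal maps
    return np.concatenate([top, bottom], axis=0)
```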
Poster
Minghao Chen · Roman Shapovalov · Iro Laina · Tom Monnier · Jianyuan Wang · David Novotny · Andrea Vedaldi

[ ExHall D ]

Abstract
Text- or image-to-3D generators and 3D scanners can now produce 3D assets with high-quality shapes and textures. These assets typically consist of a single, fused representation, like an implicit neural field, a Gaussian mixture, or a mesh, without any useful structure. However, most applications and creative workflows require assets to be made of several meaningful parts that can be manipulated independently. To address this gap, we introduce ParGen, a novel approach that generates 3D objects composed of meaningful parts starting from text, an image, or an unstructured 3D object. First, given multiple views of a 3D object, generated or rendered, a multi-view diffusion model extracts a set of plausible and view-consistent part segmentations, dividing the object into parts. Then, a second multi-view diffusion model takes each part separately, fills in the occlusions, and uses those completed views for 3D reconstruction by feeding them to a 3D reconstruction network. This completion process considers the context of the entire object to ensure that the parts integrate cohesively. The generative completion model can make up for the information missing due to occlusions; in extreme cases, it can hallucinate entirely invisible parts based on the input 3D asset. We evaluate our method on generated and …
Poster
Tongyuan Bai · Wangyuanfan Bai · Dong Chen · Tieru Wu · Manyi Li · Rui Ma

[ ExHall D ]

Abstract
Controllability plays a crucial role in the practical applications of 3D indoor scene synthesis. Existing works either allow rough language-based control, which is convenient but lacks fine-grained scene customization, or employ graph-based control, which offers better controllability but demands considerable knowledge for the cumbersome graph design process. To address these challenges, we present FreeScene, a user-friendly framework that enables both convenient and effective control for indoor scene synthesis. Specifically, FreeScene supports free-form user inputs including text description and/or reference images, allowing users to express versatile design intentions. The user inputs are adequately analyzed and integrated into a graph representation by a VLM-based Graph Designer. We then propose MG-DiT, a Mixed Graph Diffusion Transformer, which performs graph-aware denoising to enhance scene generation. Our MG-DiT not only excels at preserving graph structure but also offers broad applicability to various tasks, including, but not limited to, text-to-scene, graph-to-scene, and rearrangement, all within a single model. Extensive experiments demonstrate that FreeScene provides an efficient and user-friendly solution that unifies text-based and graph-based scene synthesis, outperforming state-of-the-art methods in terms of both generation quality and controllability in a range of applications.
Poster
Bahri Batuhan Bilecen · Yiğit Yalın · Ning Yu · Aysegul Dundar

[ ExHall D ]

Abstract
Generative Adversarial Networks (GANs) have emerged as powerful tools for high-quality image generation and real image editing by manipulating their latent spaces. Recent advancements in GANs include 3D-aware models such as EG3D, which feature efficient triplane-based architectures capable of reconstructing 3D geometry from single images. However, limited attention has been given to providing an integrated framework for 3D-aware, high-quality, reference-based image editing. This study addresses this gap by exploring and demonstrating the effectiveness of the triplane space for advanced reference-based edits. Our novel approach integrates encoding, automatic localization, spatial disentanglement of triplane features, and fusion learning to achieve the desired edits. Additionally, our framework demonstrates versatility and robustness across various domains, extending its effectiveness to animal face edits, partially stylized edits like cartoon faces, full-body clothing edits, and 360-degree head edits. Our method shows state-of-the-art performance over relevant latent direction, text, and image-guided 2D and 3D-aware diffusion and GAN methods, both qualitatively and quantitatively.
Poster
Hong-Xing Yu · Haoyi Duan · Charles Herrmann · William Freeman · Jiajun Wu

[ ExHall D ]

Abstract
We present WonderWorld, a novel framework for interactive 3D scene generation that enables users to interactively specify scene contents and layout and see the created scenes in low latency. The major challenge lies in achieving fast generation of 3D scenes. Existing scene generation approaches fall short of speed as they often require (1) progressively generating many views and depth maps, and (2) time-consuming optimization of the scene representations. Our approach does not need multiple views, and it leverages a geometry-based initialization that significantly reduces optimization time. Another challenge is generating coherent geometry that allows all scenes to be connected. We introduce the guided depth diffusion that allows partial conditioning of depth estimation. WonderWorld generates connected and diverse 3D scenes in less than 10 seconds on a single A6000 GPU, enabling real-time user interaction and exploration. We will release full code and software for reproducibility.
Poster
Aashish Rai · Dilin Wang · Mihir Jain · Nikolaos Sarafianos · Kefan Chen · Srinath Sridhar · Aayush Prakash

[ ExHall D ]

Abstract
3D Gaussian Splatting (3DGS) has demonstrated superior quality in modeling 3D objects and scenes. However, generating 3DGS remains challenging due to their discrete, unstructured, and permutation-invariant nature. In this work, we present a simple yet effective method to overcome these challenges. We utilize spherical mapping to transform 3DGS into a structured 2D representation, termed UVGS. UVGS can be viewed as multi-channel images, with feature dimensions as a concatenation of Gaussian attributes such as position, scale, color, opacity, and rotation. We further find that these heterogeneous features can be compressed into a lower-dimensional (e.g., 3-channel) shared feature space using a carefully designed multi-branch network. The compressed UVGS can be treated as typical RGB images. Remarkably, we discover that typical VAEs trained with latent diffusion models can directly generalize to this new representation without additional training. Our novel representation makes it effortless to leverage foundational 2D models, such as diffusion models, to directly model 3DGS. Additionally, one can simply increase the 2D UV resolution to accommodate more Gaussians, making UVGS a scalable solution compared to typical 3D backbones. This approach immediately unlocks various novel generation applications of 3DGS by inherently utilizing the already developed superior 2D generation capabilities. In our experiments, we demonstrate …
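The spherical-mapping step can be pictured with the toy numpy sketch below: Gaussian centers are mapped to (theta, phi) coordinates and their attributes are scattered into a multi-channel UV image. Collision handling, attribute normalization, and the multi-branch compression described above are not reproduced; the function name and resolution are hypothetical.

```python
import numpy as np

def gaussians_to_uv_image(xyz, attrs, resolution=256):
    """Toy spherical mapping of 3D Gaussians onto a 2D UV grid.
    xyz:   (N, 3) Gaussian centers.
    attrs: (N, C) remaining per-Gaussian attributes (scale, color, opacity, rotation).
    Returns: (resolution, resolution, 3 + C) multi-channel 'UVGS'-style image."""
    centered = xyz - xyz.mean(axis=0)
    r = np.linalg.norm(centered, axis=1) + 1e-8
    theta = np.arccos(np.clip(centered[:, 2] / r, -1.0, 1.0))   # polar angle in [0, pi]
    phi = np.arctan2(centered[:, 1], centered[:, 0])            # azimuth in [-pi, pi]

    u = ((phi + np.pi) / (2 * np.pi) * (resolution - 1)).astype(int)
    v = (theta / np.pi * (resolution - 1)).astype(int)

    uv_img = np.zeros((resolution, resolution, 3 + attrs.shape[1]), dtype=np.float32)
    uv_img[v, u, :3] = centered    # store position per texel...
    uv_img[v, u, 3:] = attrs       # ...plus the remaining attributes
    return uv_img                  # colliding Gaussians simply overwrite each other
```

Raising the UV resolution reduces texel collisions at the cost of memory, which mirrors the scalability argument made above.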
Poster
Youngdong Jang · Hyunje Park · Feng Yang · Heeju Ko · Euijin Choo · Sangpil Kim

[ ExHall D ]

Abstract
As 3D Gaussian Splatting (3D-GS) gains significant attention and its commercial usage increases, the need for watermarking technologies to prevent unauthorized use of the 3D-GS models and rendered images has become increasingly important. In this paper, we introduce a robust watermarking method for 3D-GS that secures ownership of both the model and its rendered images. Our proposed method remains robust against distortions in rendered images and model attacks while maintaining high rendering quality. To achieve these objectives, we present Frequency-Guided Densification (FGD), which removes 3D Gaussians based on their contribution to rendering quality, enhancing real-time rendering and the robustness of the message. FGD utilizes the Discrete Fourier Transform to split 3D Gaussians in high-frequency areas, improving rendering quality. Furthermore, we employ a gradient mask for 3D Gaussians and design a wavelet-subband loss to enhance rendering quality. Our experiments show that our method embeds the message in the rendered images invisibly and robustly against various attacks, including model distortion. Our method achieves state-of-the-art performance.
Poster
Alex Hanson · Allen Tu · Vasu Singla · Bethmage Mayuka Jayawardhana · Matthias Zwicker · Tom Goldstein

[ ExHall D ]

Abstract
Recent advancements in novel view synthesis have enabled real-time rendering speeds with high reconstruction accuracy. 3D Gaussian Splatting (3D-GS), a foundational point-based parametric 3D scene representation, models scenes as large sets of 3D Gaussians. However, complex scenes can consist of millions of Gaussians, resulting in high storage and memory requirements that limit the viability of 3D-GS on devices with limited resources. Current techniques for compressing these pretrained models by pruning Gaussians rely on combining heuristics to determine which Gaussians to remove. At high compression ratios, the pruned scenes suffer from heavy degradation of visual fidelity and loss of foreground details. In this paper, we propose a principled sensitivity pruning score that preserves visual fidelity and foreground details at significantly higher compression ratios than existing approaches. It is computed as a second-order approximation of the reconstruction error on the training views with respect to the spatial parameters of each Gaussian. Additionally, we propose a multi-round prune-refine pipeline that can be applied to any pretrained 3D-GS model without changing its training pipeline. After pruning 90% of Gaussians, a substantially higher percentage than previous methods, our PUP 3D-GS pipeline increases average rendering speed by 3.56× while retaining more salient foreground information and achieving higher …
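One way to write a second-order sensitivity of the reconstruction error with respect to a Gaussian's spatial parameters $x_i$ (its mean and scale) is via a Taylor expansion; which functional of the Hessian the paper actually uses as its score is not stated in the abstract:

$$\Delta\mathcal{L}_i \;\approx\; g_i^{\mathsf T}\,\Delta x_i \;+\; \tfrac{1}{2}\,\Delta x_i^{\mathsf T} H_i\,\Delta x_i,\qquad g_i=\nabla_{x_i}\mathcal{L},\quad H_i=\nabla^2_{x_i}\mathcal{L},$$

where $\mathcal{L}$ is the reconstruction error summed over the training views. Near convergence $g_i\approx 0$, so a per-Gaussian score can be taken from $H_i$ alone (e.g., its trace or log-determinant), and low-scoring Gaussians are pruned first.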
Poster
Luca Savant Aira · Gabriele Facciolo · Thibaud Ehret

[ ExHall D ]

Abstract
Recently, Gaussian splatting has emerged as a strong alternative to NeRF, demonstrating impressive 3D modeling capabilities while requiring only a fraction of the training and rendering time. In this paper, we show how the standard Gaussian splatting framework can be adapted for remote sensing, retaining its high efficiency. This enables us to achieve state-of-the-art performance in just a few minutes, compared to the day-long optimization required by the best-performing NeRF-based Earth observation methods. The proposed framework incorporates remote-sensing improvements from EO-NeRF, such as radiometric correction and shadow modeling, while introducing novel components, including sparsity, view consistency, and opacity regularizations.
Poster
Christopher Thirgood · Oscar Mendez · Erin Chao Ling · Jonathan Storey · Simon Hadfield

[ ExHall D ]

Abstract
We introduce HyperGS, a novel framework for Hyperspectral Novel View Synthesis (HNVS), based on a new latent 3D Gaussian Splatting (3DGS) technique. Our approach enables simultaneous spatial and spectral renderings by encoding material properties from multi-view 3D hyperspectral datasets. HyperGS reconstructs high-fidelity views from arbitrary perspectives with improved accuracy and speed, outperforming existing methods. To address the challenges of high-dimensional data, we perform view synthesis in a learned latent space, incorporating a pixel-wise adaptive density function and a pruning technique for increased training stability and efficiency. Additionally, we introduce the first HNVS benchmark, implementing a number of new baselines based on recent SOTA RGB-NVS techniques, alongside the small number of prior works on HNVS. We demonstrate HyperGS's robustness via extensive evaluations of real and simulated hyperspectral scenes. This work marks a significant step forward in efficient multi-view 3D geometry synthesis for hyperspectral modeling.
Poster
Junjin Xiao · Qing Zhang · Yongwei Nie · Lei Zhu · Wei-Shi Zheng

[ ExHall D ]

Abstract
This paper presents RoGSplat, a novel approach for synthesizing high-fidelity novel views of unseen humans from sparse multi-view images, while requiring no cumbersome per-subject optimization. Unlike previous methods that typically struggle with sparse views that have little overlap and are less effective in reconstructing complex human geometry, the proposed method enables robust reconstruction in such challenging conditions. Our key idea is to lift SMPL vertices to dense and reliable 3D prior points representing accurate human body geometry, and then regress human Gaussian parameters based on the points. To account for possible misalignment between the SMPL model and images, we propose to predict image-aligned 3D prior points by leveraging both pixel-level features and voxel-level features, from which we regress the coarse Gaussians. To enhance the ability to capture high-frequency details, we further render depth maps from the coarse 3D Gaussians to help regress fine-grained pixel-wise Gaussians. Experiments on several benchmark datasets demonstrate that our method outperforms state-of-the-art methods in novel view synthesis and cross-dataset generalization. Our code will be made publicly available.
Poster
Jinfeng Liu · Lingtong Kong · Bo Li · Dan Xu

[ ExHall D ]

Abstract
High dynamic range (HDR) novel view synthesis (NVS) aims to reconstruct HDR scenes by leveraging multi-view low dynamic range (LDR) images captured at different exposure levels. Current training paradigms with 3D tone mapping often result in unstable HDR reconstruction, while training with 2D tone mapping reduces the model's capacity to fit LDR images. Additionally, the global tone mapper used in existing methods can impede the learning of both HDR and LDR representations. To address these challenges, we present GaussHDR, which unifies 3D and 2D local tone mapping through 3D Gaussian splatting. Specifically, we design a residual local tone mapper for both 3D and 2D tone mapping that accepts an additional context feature as input. We then propose combining the dual LDR rendering results from both 3D and 2D local tone mapping at the loss level. Finally, recognizing that different scenes may exhibit varying balances between the dual results, we introduce uncertainty learning and use the uncertainties for adaptive modulation. Extensive experiments demonstrate that GaussHDR significantly outperforms state-of-the-art methods in both synthetic and real-world scenarios.
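For context, a standard way to let learned uncertainties modulate two loss terms, in the spirit of the adaptive modulation described above though not necessarily the paper's exact formulation, is:

$$\mathcal{L} \;=\; \frac{\mathcal{L}_{3\mathrm{D}}}{2\sigma_{3\mathrm{D}}^{2}} \;+\; \frac{\mathcal{L}_{2\mathrm{D}}}{2\sigma_{2\mathrm{D}}^{2}} \;+\; \log\sigma_{3\mathrm{D}} \;+\; \log\sigma_{2\mathrm{D}},$$

where $\sigma_{3\mathrm{D}}$ and $\sigma_{2\mathrm{D}}$ are learned uncertainties that down-weight whichever of the dual tone-mapped renderings is less reliable, while the log terms prevent the trivial solution of unbounded uncertainty.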
Poster
Antoine Guédon · Tomoki Ichikawa · Kohei Yamashita · Ko Nishino

[ ExHall D ]

Abstract
We present a novel appearance model that simultaneously realizes explicit high-quality 3D surface mesh recovery and photorealistic novel view synthesis from sparse view samples. Our key idea is to model the underlying scene geometry Mesh as an Atlas of Charts which we render with 2D Gaussian surfels (MAtCha Gaussians). MAtCha distills high-frequency scene surface details from an off-the-shelf monocular depth estimator and refines it through Gaussian surfel rendering. The Gaussian surfels are attached to the charts on the fly, satisfying the photorealism of neural volumetric rendering and the crisp geometry of a mesh model, i.e., two seemingly contradictory goals in a single model. At the core of MAtCha lies a novel neural deformation model and a structure loss that preserve the fine surface details distilled from learned monocular depths while addressing their fundamental scale ambiguities. Results of extensive experimental validation demonstrate MAtCha's state-of-the-art quality of surface reconstruction and photorealism on par with top contenders but with a dramatic reduction in the number of input views and computational time. We believe MAtCha will serve as a foundational tool for any visual application in vision, graphics, and robotics that requires explicit geometry in addition to photorealism.
Poster
Haeyun Choi · Heemin Yang · Janghyeok Han · Sunghyun Cho

[ ExHall D ]

Abstract
In this paper, we propose DeepDeblurRF, a novel radiance field deblurring approach that can synthesize high-quality novel views from blurred training views with significantly reduced training time. DeepDeblurRF leverages deep neural network (DNN)-based deblurring modules to enjoy their deblurring performance and computational efficiency. To effectively combine DNN-based deblurring and radiance field construction, we propose a novel radiance field (RF)-guided deblurring and an iterative framework that performs RF-guided deblurring and radiance field construction in an alternating manner. Moreover, DeepDeblurRF is compatible with various scene representations, such as voxel grids and 3D Gaussians, expanding its applicability. We also present BlurRF-Synth, the first large-scale synthetic dataset for training radiance field deblurring frameworks. We conduct extensive experiments on both camera motion blur and defocus blur, demonstrating that DeepDeblurRF achieves state-of-the-art novel-view synthesis quality with significantly reduced training time.
Poster
Junfeng Ni · Yu Liu · Ruijie Lu · ZiRui Zhou · Song-Chun Zhu · Yixin Chen · Siyuan Huang

[ ExHall D ]

Abstract
Decompositional reconstruction of 3D scenes, with complete shapes and detailed texture of all objects within, is intriguing for downstream applications but remains challenging, particularly with sparse views as input. Recent approaches incorporate semantic or geometric regularization to address this issue, but they suffer significant degradation in underconstrained areas and fail to recover occluded regions. We argue that the key to solving this problem lies in supplementing missing information for these areas. To this end, we propose DP-Recon, which employs diffusion priors in the form of Score Distillation Sampling (SDS) to optimize the neural representation of each individual object under novel views. This provides additional information for the underconstrained areas, but directly incorporating diffusion prior raises potential conflicts between the reconstruction and generative guidance. Therefore, we further introduce a visibility-guided approach to dynamically adjust the per-pixel SDS loss weights. Together these components enhance both geometry and appearance recovery while remaining faithful to input images. Extensive experiments across Replica and ScanNet++ demonstrate that our method significantly outperforms state-of-the-art methods. Notably, it achieves better object reconstruction under 10 views than the baselines under 100 views. Our method enables seamless text-based editing for geometry and appearance through SDS optimization, yielding decomposed object meshes with …
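For reference, the standard Score Distillation Sampling gradient (as introduced in DreamFusion), extended with a per-pixel weight of the kind described above, can be sketched as:

$$\nabla_{\theta}\mathcal{L}_{\mathrm{SDS}} \;=\; \mathbb{E}_{t,\epsilon}\!\left[\, w(t)\; m(p)\,\bigl(\hat{\epsilon}_{\phi}(x_t;\,y,\,t)-\epsilon\bigr)\,\frac{\partial x}{\partial\theta}\right],$$

where $x$ is the rendered image of an object under a novel view, $x_t$ its noised version at timestep $t$, $y$ the conditioning, $w(t)$ the usual timestep weighting, and $m(p)$ a per-pixel visibility-derived weight; how $m(p)$ is scheduled is the paper's contribution and is not reproduced here.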
Poster
Mohammad Asim · Christopher Wewer · Thomas Wimmer · Bernt Schiele · Jan Lenssen

[ ExHall D ]

Abstract
We introduce MET3R, a metric for multi-view consistency in generated images. Large-scale generative models for multi-view image generation are rapidly advancing the field of 3D inference from sparse observations. However, due to the nature of generative modeling, traditional reconstruction metrics are not suitable to measure the quality of generated outputs and metrics that are independent of the sampling procedure are desperately needed. In this work, we specifically address the aspect of consistency between generated multi-view images, which can be evaluated independently of the specific scene. MET3R uses DUSt3R to obtain dense 3D reconstructions from image pairs in a feed-forward manner, which are used to warp image contents from one view into the other. Then, feature maps of these images are compared to obtain a similarity score that is invariant to view-dependent effects. Using MET3R, we evaluate the consistency of a large set of previous methods and our own, open, multi-view latent diffusion model.
Poster
Zhenggang Tang · Yuchen Fan · Dilin Wang · Hongyu Xu · Rakesh Ranjan · Alexander G. Schwing · Zhicheng Yan

[ ExHall D ]

Abstract
Recent sparse-view scene reconstruction advances such as DUSt3R and MASt3R no longer require camera calibration and camera pose estimation. However, they only process a pair of views at a time to infer pixel-aligned pointmaps. When dealing with more than two views, a combinatorial number of error-prone pairwise reconstructions is usually followed by an expensive global optimization, which often fails to rectify the pairwise reconstruction errors. To handle more views, reduce errors, and improve inference time, we propose the fast single-stage feed-forward network MV-DUSt3R. At its core are multi-view decoder blocks which exchange information across any number of views while considering one reference view. To make our method robust to reference view selection, we further propose MV-DUSt3R+, which employs cross-reference-view blocks to fuse information across different reference view choices. To further enable novel view synthesis, we extend both by adding and jointly training Gaussian splatting heads. Experiments on multi-view stereo reconstruction, multi-view pose estimation, and novel view synthesis confirm that our methods improve significantly upon prior art. Code will be released.
Poster
Chenjie Cao · Chaohui Yu · Shang Liu · Fan Wang · Xiangyang Xue · Yanwei Fu

[ ExHall D ]

Abstract
We introduce MVGenMaster, a multi-view diffusion model enhanced with 3D priors to address versatile Novel View Synthesis (NVS) tasks. MVGenMaster leverages 3D priors that are warped using metric depth and camera poses, significantly enhancing both generalization and 3D consistency in NVS. Our model features a simple yet effective pipeline that can generate up to 100 novel views conditioned on variable reference views and camera poses with a single forward process. Additionally, we have developed a comprehensive large-scale multi-view image dataset called MvD-1M, comprising up to 1.6 million scenes, equipped with well-aligned metric depth to train MVGenMaster. Moreover, we present several training and model modifications to strengthen the model with scaled-up datasets. Extensive evaluations across in- and out-of-domain benchmarks demonstrate the effectiveness of our proposed method and data formulation. Models and codes will be released soon.
Poster
Maxim Shugaev · Vincent Chen · Maxim Karrenbach · Kyle Ashley · Bridget Kennedy · Naresh Cuntoor

[ ExHall D ]

Abstract
This work addresses the problem of novel view synthesis in diverse scenes from small collections of RGB images. We propose ERUPT (Efficient Rendering with Unposed Patch Transformer), a state-of-the-art scene reconstruction model capable of efficient scene rendering using unposed imagery. We introduce patch-based querying, in contrast to existing pixel-based queries, to reduce the compute required to render a target view. This makes our model highly efficient both during training and at inference, capable of rendering at 600 fps on commercial hardware. Notably, our model is designed to use a learned latent camera pose, which allows for training with unposed targets in datasets with sparse or inaccurate ground-truth camera poses. We show that our approach can generalize on large real-world data and introduce a new benchmark dataset (MSVS-1M) for latent view synthesis using street-view imagery collected from Mapillary. In contrast to NeRF and Gaussian Splatting, which require dense imagery and precise metadata, ERUPT can render novel views of arbitrary scenes with as few as five unposed input images. ERUPT achieves better rendered image quality than current state-of-the-art methods for unposed image synthesis tasks, reduces labeled data requirements by ~95%, and reduces computational requirements by an order of magnitude, providing efficient …
Poster
Ningli Xu · Rongjun Qin

[ ExHall D ]

Abstract
Generating consistent ground-view images from satellite imagery is challenging, primarily due to the large discrepancies in viewing angles and resolution between satellite and ground-level domains. Previous efforts mainly concentrated on single-view generation, often resulting in inconsistencies across neighboring ground views. In this work, we propose a novel cross-view synthesis approach designed to overcome these challenges by ensuring consistency across ground-view images generated from satellite views. Our method, based on a fixed latent diffusion model, introduces two conditioning modules: satellite-guided denoising, which extracts high-level scene layout to guide the denoising process, and satellite-temporal denoising, which captures camera motion to maintain consistency across multiple generated views. We further contribute a large-scale satellite-ground dataset containing over 100,000 perspective pairs to facilitate extensive ground scene or video generation. Experimental results demonstrate that our approach outperforms existing methods on perceptual and temporal metrics, achieving high photorealism and consistency in multi-view outputs.
Poster
Sibo Wu · Congrong Xu · Binbin Huang · Andreas Geiger · Anpei Chen

[ ExHall D ]

Abstract
Recently, 3D reconstruction and generation have demonstrated impressive novel view synthesis results, achieving high fidelity and efficiency. However, a notable conditioning gap can be observed between these two fields, e.g., scalable 3D scene reconstruction often requires densely captured views, whereas 3D generation typically relies on a single or no input view, which significantly limits their applications. We found that the source of this phenomenon lies in the misalignment between 3D constraints and generative priors. To address this problem, we propose a reconstruction-driven video diffusion model that learns to condition video frames on artifact-prone RGB-D renderings. Moreover, we propose a cyclical fusion pipeline that iteratively adds restoration frames from the generative model to the training set, enabling progressive expansion and addressing the viewpoint saturation limitations seen in previous reconstruction and generation pipelines. Our evaluation, including view synthesis from sparse views and masked inputs, validates the effectiveness of our approach.
Poster
Shengjun Zhang · Jinzhao Li · Xin Fei · Hao Liu · Yueqi Duan

[ ExHall D ]

Abstract
In this paper, we propose Scene Splatter, a momentum-based paradigm for video diffusion to generate generic scenes from a single image. Existing methods, which employ video generation models to synthesize novel views, suffer from limited video length and scene inconsistency, leading to artifacts and distortions during further reconstruction. To address this issue, we construct noisy samples from original features as momentum to enhance video details and maintain scene consistency. However, for latent features whose receptive field spans both known and unknown regions, such latent-level momentum restricts the generative ability of video diffusion in unknown regions. Therefore, we further introduce the aforementioned consistent video as pixel-level momentum for a video generated directly without momentum, for better recovery of unseen regions. Our cascaded momentum enables video diffusion models to generate both high-fidelity and consistent novel views. We further finetune the global Gaussian representations with enhanced frames and render new frames for the momentum update in the next step. In this manner, we can iteratively recover a 3D scene, avoiding the limitation of video length. Extensive experiments demonstrate the generalization capability and superior performance of our method in high-fidelity and consistent scene generation.
Poster
Tsai-Shien Chen · Aliaksandr Siarohin · Willi Menapace · Yuwei Fang · Kwot Sin Lee · Ivan Skorokhodov · Kfir Aberman · Jun-Yan Zhu · Ming-Hsuan Yang · Sergey Tulyakov

[ ExHall D ]

Abstract
Video personalization methods allow us to synthesize videos with specific concepts such as people, pets, and places. However, existing methods often focus on limited domains, require time-consuming optimization per subject, or support only a single subject. We present Video Alchemist—a video model with built-in multi-subject, open-set personalization capabilities for both foreground objects and background, eliminating the need for time-consuming test-time optimization. Our model is built on a new Diffusion Transformer module that fuses each conditional reference image and its corresponding subject-level text prompt with cross-attention layers. Developing such a large model presents two main challenges: dataset and evaluation. First, as paired datasets of reference images and videos are extremely hard to collect, we sample selected video frames as reference images and synthesize a clip of the target video. However, while models can easily denoise training videos given reference frames, they fail to generalize to new contexts. To mitigate this issue, we design a new automatic data construction pipeline with extensive image augmentations. Second, evaluating open-set video personalization is a challenge in itself. To address this, we introduce a personalization benchmark that focuses on accurate subject fidelity and supports diverse personalization scenarios. Finally, our extensive experiments show that our method significantly …
Poster
Haozhe Xie · Zhaoxi Chen · Fangzhou Hong · Ziwei Liu

[ ExHall D ]

Abstract
3D city generation with NeRF-based methods shows promising generation results but is computationally inefficient. Recently 3D Gaussian splatting (3D-GS) has emerged as a highly efficient alternative for object-level 3D generation. However, adapting 3D-GS from finite-scale 3D objects and humans to infinite-scale 3D cities is non-trivial. Unbounded 3D city generation entails significant storage overhead (out-of-memory issues), arising from the need to expand points to billions, often demanding hundreds of Gigabytes of VRAM for a city scene spanning 10km^2. In this paper, we propose **GaussianCity**, a generative Gaussian splatting framework dedicated to efficiently synthesizing unbounded 3D cities with a single feed-forward pass. Our key insights are two-fold: **1)** *Compact 3D Scene Representation*: We introduce BEV-Point as a highly compact intermediate representation, ensuring that the growth in VRAM usage for unbounded scenes remains constant, thus enabling unbounded city generation. **2)** *Spatial-aware Gaussian Attribute Decoder*: We present spatial-aware BEV-Point decoder to produce 3D Gaussian attributes, which leverages Point Serializer to integrate the structural and contextual characteristics of BEV points. Extensive experiments demonstrate that GaussianCity achieves state-of-the-art results in both drone-view and street-view 3D city generation. Notably, compared to CityDreamer, GaussianCity exhibits superior performance with a speedup of 60 times (10.72 FPS v.s. 0.18 FPS).
Poster
Xuanchi Ren · Tianchang Shen · Jiahui Huang · Huan Ling · Yifan Lu · Merlin Nimier-David · Thomas Müller · Alexander Keller · Sanja Fidler · Jun Gao

[ ExHall D ]

Abstract
We present GEN3C, a generative video model with precise Camera Control and temporal 3D Consistency. Prior video models already generate realistic videos, but they tend to leverage little 3D information, leading to inconsistencies, such as objects popping in and out of existence. Camera control, if implemented at all, is imprecise, because camera parameters are mere inputs to the neural network which must then infer how the video depends on the camera. In contrast, GEN3C is guided by a 3D cache: point clouds obtained by predicting the pixel-wise depth of seed images or previously generated frames. When generating the next frames, GEN3C is conditioned on the 2D renderings of the 3D cache with the new camera trajectory provided by the user. Crucially, this means that GEN3C neither has to remember what it previously generated nor does it have to infer the image structure from the camera pose. The model, instead, can focus all its generative power on previously unobserved regions, as well as advancing the scene state to the next frame. Our results demonstrate more precise camera control than prior work, as well as state-of-the-art results in sparse-view novel view synthesis, even in challenging settings such as driving scenes and monocular …
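As a rough illustration of the 3D-cache idea described above (not the authors' implementation), the sketch below back-projects a predicted depth map into a point cloud and reprojects it into a user-specified camera. Splatting and z-buffering are omitted, and the intrinsics `K` and extrinsics `(R, t)` are placeholder inputs assumed for illustration.

```python
import numpy as np

def unproject_depth(depth, K):
    """Hedged sketch of building a point-cloud "3D cache" from a seed image:
    back-project each pixel of a predicted depth map through intrinsics K."""
    h, w = depth.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T                  # camera ray per pixel
    return rays * depth.reshape(-1, 1)               # (h*w, 3) points in the camera frame

def project_points(points, K, R, t):
    """Render the cache into a new camera (R, t): rigid transform followed by
    perspective projection. Visibility handling is omitted in this sketch."""
    cam = points @ R.T + t
    uv = cam @ K.T
    return uv[:, :2] / uv[:, 2:3], cam[:, 2]          # pixel coordinates and depths
```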
Poster
Yingji Zhong · Zhihao Li · Dave Zhenyu Chen · Lanqing Hong · Dan Xu

[ ExHall D ]

Abstract
Despite recent successes in novel view synthesis using 3D Gaussian Splatting (3DGS), modeling scenes with sparse inputs remains a challenge. In this work, we address two critical yet overlooked issues in real-world sparse-input modeling: extrapolation and occlusion. To tackle these issues, we propose to use a reconstruction by generation pipeline that leverages learned priors from video diffusion models to provide plausible interpretations for regions outside the field of view or occluded. However, the generated sequences exhibit inconsistencies that do not fully benefit subsequent 3DGS modeling. To address the challenge of inconsistency, we introduce a novel scene-grounding guidance based on rendered sequences from an optimized 3DGS, which tames the diffusion model to generate consistent sequences. This guidance is training-free and does not require any fine-tuning of the diffusion model. To facilitate holistic scene modeling, we also propose a trajectory initialization method. It effectively identifies regions that are outside the field of view and occluded. We further design a scheme tailored for 3DGS optimization with generated sequences. Experiments demonstrate that our method significantly improves upon the baseline and achieves state-of-the-art performance on challenging benchmarks.
Poster
Jinxiu Liu · Shaoheng Lin · Yinxiao Li · Ming-Hsuan Yang

[ ExHall D ]

Abstract
Generating 360° panoramic and dynamic world content is important for spatial intelligence. The increasing demand for immersive AR/VR applications has heightened the need to generate high-quality scene-level dynamic content. However, most video diffusion models are constrained by limited resolution and aspect ratio, which restricts their applicability to scene-level dynamic content synthesis. In this work, we propose DynamicScaler, which addresses these challenges by enabling spatially scalable and panoramic dynamic scene synthesis that preserves coherence across panoramic scenes of arbitrary size. Specifically, we introduce an Offset Shifting Denoiser that facilitates efficient, synchronous, and coherent denoising of panoramic dynamic scenes via a fixed-resolution diffusion model applied through a seamlessly rotating window, ensuring seamless boundary transitions and consistency across the entire panoramic space while accommodating varying resolutions and aspect ratios. Additionally, we employ a Global Motion Guidance mechanism to ensure both local detail fidelity and global motion continuity. Extensive experiments demonstrate that our method achieves superior content and motion quality in panoramic scene-level video generation, offering a training-free, efficient, and scalable solution for immersive dynamic scene creation with only 11 GB of VRAM (video random-access memory) regardless of the output video resolution.
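The sketch below is a minimal, hypothetical interpretation of how offset-shifted windows with wrap-around could tile a panorama during denoising; the window width, stride, and shifting rule are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def rotating_windows(pano_width, win_width, step, offset):
    """Hedged sketch of offset-shifted denoising windows over a 360-degree panorama:
    fixed-width windows tile the horizontal axis with wrap-around, and the starting
    offset is shifted at every denoising step so window seams never align."""
    starts = (np.arange(0, pano_width, step) + offset * step // 2) % pano_width
    cols = (starts[:, None] + np.arange(win_width)[None, :]) % pano_width
    return cols    # (n_windows, win_width) column indices, wrapping past the seam

# e.g. a 4096-wide panorama denoised with 1024-wide windows, shifted each step:
idx_step0 = rotating_windows(4096, 1024, 1024, offset=0)
idx_step1 = rotating_windows(4096, 1024, 1024, offset=1)
```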
Poster
Remy Sabathier · Niloy J. Mitra · David Novotny

[ ExHall D ]

Abstract
Reconstructing dynamic assets from video data is central to many computer vision and graphics tasks. Existing 4D reconstruction approaches are limited by category-specific models or slow optimization-based methods. Inspired by the recent Large Reconstruction Model (LRM), we present the Large Interpolation Model (LIM), a transformer-based feed-forward solution, guided by a novel causal consistency loss, for interpolating implicit 3D representations across time. Given implicit 3D representations at times t0 and t1, LIM produces a deformed shape at any continuous time t ∈ [t0, t1], delivering high-quality interpolations in seconds per frame. Furthermore, LIM allows explicit mesh tracking across time, producing a consistently UV-textured mesh sequence ready for integration into existing production pipelines. We also use LIM, in conjunction with a diffusion-based multiview generator, to produce dynamic 4D reconstructions from monocular videos. We evaluate LIM on various dynamic datasets, benchmarking against image-space interpolation methods (e.g., FiLM) and direct triplane linear interpolation, and demonstrate clear advantages. In summary, LIM is the first feed-forward model capable of high-speed tracked 4D asset reconstruction across diverse categories.
Poster
Fangzhou Hong · Vladimir Guzov · Hyo Jin Kim · Yuting Ye · Richard Newcombe · Ziwei Liu · Lingni Ma

[ ExHall D ]

Abstract
As wearable devices become more prevalent, understanding the user's motion is crucial for improving contextual AI systems. We introduce EgoLM, a versatile framework designed for egocentric motion understanding using multi-modal data. EgoLM integrates the rich contextual information from egocentric videos and motion sensors afforded by wearable devices. It also combines dense supervision signals from motion and language, leveraging the vast knowledge encoded in pre-trained large language models (LLMs). EgoLM models the joint distribution of egocentric motions and natural language using LLMs, conditioned on observations from egocentric videos and motion sensors. It unifies a range of motion understanding tasks, including motion narration from video or motion data, as well as motion generation from text or sparse sensor data. Unique to wearable devices, it also enables a novel task to generate text descriptions from sparse sensors. Through extensive experiments, we validate the effectiveness of EgoLM in addressing the challenges of under-constrained egocentric motion learning, and demonstrate its capability as a generalist model through a variety of applications.
Poster
Jiahui Lei · Yijia Weng · Adam W Harley · Leonidas Guibas · Kostas Daniilidis

[ ExHall D ]

Abstract
We introduce 4D Motion Scaffolds (MoSca), a modern 4D reconstruction system designed to reconstruct and synthesize novel views of dynamic scenes from monocular videos captured casually in the wild. To address such a challenging and ill-posed inverse problem, we leverage prior knowledge from foundational vision models and lift the video data to a novel Motion Scaffold (MoSca) representation, which compactly and smoothly encodes the underlying motions/deformations. The scene geometry and appearance are then disentangled from the deformation field and are encoded by globally fusing the Gaussians anchored onto the MoSca and optimized via Gaussian Splatting. Additionally, camera focal length and poses can be solved using bundle adjustment without the need of any other pose estimation tools. Experiments demonstrate state-of-the-art performance on dynamic rendering benchmarks and its effectiveness on real videos.
Poster
Boyuan Chen · Hanxiao Jiang · Shaowei Liu · Saurabh Gupta · Yunzhu Li · Hao Zhao · Shenlong Wang

[ ExHall D ]

Abstract
Envisioning physically plausible outcomes from a single image requires a deep understanding of the world’s dynamics. To address this, we introduce MiniTwin, a novel framework that transforms a single image into an amodal, camera-centric, interactive 3D scene. By combining advanced image-based geometric and semantic understanding with physics-based simulation, MiniTwin creates an interactive 3D world from a static image, enabling us to "imagine" and simulate future scenarios based on user input. At its core, MiniTwin estimates 3D shapes, poses, physical and lighting properties of objects, thereby capturing essential physical attributes that drive realistic object interactions. This framework allows users to specify precise initial conditions, such as object speed or material properties, for enhanced control over generated video outcomes. We evaluate MiniTwin’s performance against closed-source state-of-the-art (SOTA) image-to-video models, including Pika, Kling, and Gen-3, showing MiniTwin's capacity to generate videos with realistic physics while offering greater flexibility and fine-grained control. Our results show that MiniTwin achieves a unique balance of photorealism, physical plausibility, and user-driven interactivity, opening new possibilities for generating dynamic, physics-grounded video from an image.
Poster
Marchellus Matthew · Nadhira Noor · In Kyu Park

[ ExHall D ]

Abstract
Fast 3D clothed human reconstruction from monocular video remains a significant challenge in computer vision, particularly in balancing computational efficiency with reconstruction quality. Current approaches are either focused on static image reconstruction but too computationally intensive, or achieve high quality through per-video optimization that requires minutes to hours of processing, making them unsuitable for real-time applications. To this end, we present TemPoFast3D, a novel method that leverages temporal coherency of human appearance to reduce redundant computation while maintaining reconstruction quality. Our approach is a "plug-and-play" solution that uniquely transforms pixel-aligned reconstruction networks to handle continuous video streams by maintaining and refining a canonical appearance representation through efficient coordinate mapping. Extensive experiments demonstrate that TemPoFast3D matches or exceeds state-of-the-art methods across standard metrics while providing high-quality textured reconstruction across diverse pose and appearance, with a maximum speed of 12 FPS.
Poster
Changan Chen · Juze Zhang · Shrinidhi Kowshika Lakshmikanth · Yusu Fang · Ruizhi Shao · Gordon Wetzstein · Li Fei-Fei · Ehsan Adeli

[ ExHall D ]

Abstract
Human communication is inherently multimodal, involving a combination of verbal and non-verbal cues such as speech, facial expressions, and body gestures. Modeling these behaviors is essential for understanding human interaction and for creating virtual characters that can communicate naturally in applications like games, films, and virtual reality. However, existing motion generation models are typically limited to specific input modalities—either speech, text, or motion data—and cannot fully leverage the diversity of available data. In this paper, we propose a novel framework that unifies verbal and non-verbal language using multimodal language models for human motion understanding and generation. This model is flexible in taking text, speech, and motion or any combination of them as input and output. Coupled with our novel pre-training strategy, our model not only achieves state-of-the-art performance on co-speech gesture generation but also requires much less data for training. Our unified model also unlocks an array of novel tasks such as editable gesture generation and emotion prediction from motion. We believe unifying the verbal and non-verbal language of human motion is essential for real-world applications, and language models offer a powerful approach to achieving this goal.
Poster
Zhenyu Zhou · Chengdong Dong · Ajay Kumar

[ ExHall D ]

Abstract
The primary obstacle in realizing the full potential of finger crease biometrics is the accurate identification of deformed knuckle patterns, often resulting from completely contactless imaging. Current methods struggle significantly with this task, yet accurate matching is crucial for applications ranging from forensic investigations, such as child abuse cases, to surveillance and mobile security. To address this challenge, our study introduces the largest publicly available dataset of deformed knuckle patterns, comprising 805,768 images from 351 subjects. We also propose a novel framework to accurately match knuckle patterns, even under severe pose deformations, by recovering interpretable knuckle crease keypoint feature templates. These templates can dynamically uncover graph structure and feature similarity among the matched correspondences. Our experiments, using the most challenging protocols, illustrate significantly outperforming results for matching such knuckle images. For the first time, we present and evaluate a theoretical model to estimate the uniqueness of 2D finger knuckle patterns, providing a more interpretable and accurate measure of distinctiveness, which is invaluable for forensic examiners in prosecuting suspects.
Poster
Yuhan Bao · Shaohua Gao · Wenyong Li · Kaiwei Wang

[ ExHall D ]

Abstract
High-speed autofocus in extreme scenes remains a significant challenge. Traditional methods rely on repeated sampling around the focus position, resulting in "focus hunting". Event-driven methods have advanced focusing speed and improved performance in low-light conditions; however, current approaches still require at least one lengthy round of "focus hunting", involving the collection of a complete focus stack. We introduce the Event Laplacian Product (ELP) focus detection function, which combines event data with grayscale Laplacian information, redefining focus search as a detection task. This innovation enables the first one-step event-driven autofocus, cutting focusing time by up to two-thirds and reducing focusing error by 24 times on the DAVIS346 dataset and 22 times on the EVK4 dataset. Additionally, we present an autofocus pipeline tailored for event-only cameras, achieving accurate results across a range of challenging motion and lighting conditions. All datasets and code will be made publicly available.
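A toy rendition of an event-weighted Laplacian focus measure is sketched below; the exact combination used by ELP is not reproduced here, so the weighting scheme and the `event_map` input are assumptions for illustration only.

```python
import numpy as np
from scipy.ndimage import laplace

def elp_style_focus_score(gray_frame: np.ndarray, event_map: np.ndarray) -> float:
    """Toy focus score in the spirit of an Event Laplacian Product: combine
    grayscale Laplacian energy with per-pixel event activity. `event_map` is a
    hypothetical per-pixel count of events accumulated over the same window
    as `gray_frame`; higher scores indicate sharper focus in this sketch."""
    lap = laplace(gray_frame.astype(np.float32))     # second-order spatial detail
    return float(np.sum(np.abs(lap) * event_map))    # weight detail by event activity
```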
Poster
Yanchen Dong · Ruiqin Xiong · Xiaopeng Fan · Zhaofei Yu · Yonghong Tian · Tiejun Huang

[ ExHall D ]

Abstract
Spike camera is a kind of neuromorphic camera with ultra-high temporal resolution, which can capture dynamic scenes by continuously firing spike signals. To capture color information, a color filter array (CFA) is employed on the sensor of the spike camera, resulting in Bayer-pattern spike streams. How to restore high-quality color images from the binary spike signals remains challenging. In this paper, we propose a motion-guided reconstruction method for spike cameras with CFA, utilizing color layout and estimated motion information. Specifically, we develop a joint motion estimation pipeline for the Bayer-pattern spike stream, exploiting the motion consistency of channels. We propose to estimate the missing pixels of each color channel according to temporally neighboring pixels of the corresponding color along the motion trajectory. As the spike signals are read out at discrete time points, there is quantization noise that impacts the image quality. Thus, we analyze the correlation of the noise in spatial and temporal domains and propose a self-supervised network utilizing a masked spike encoder to handle the noise. Experimental results on real-world captured Bayer-pattern spike streams show that our method can restore color images with better visual quality, compared with state-of-the-art methods. All the source codes and datasets will …
Poster
Kazuma Kitazawa · Takahito Aoto · Satoshi Ikehata · Tsuyoshi Takatani

[ ExHall D ]

Abstract
Recently, the energy-efficient photometric stereo method using an event camera (EventPS) has been proposed to recover surface normals from events triggered by changes in logarithmic Lambertian reflections under a moving directional light source. However, EventPS treats each event interval independently, making it sensitive to noise, shadows, and non-Lambertian reflections. This paper proposes Photometric Stereo based on Event Interval Profile (PS-EIP), a robust method that recovers pixelwise surface normals from a time-series profile of event intervals. By exploiting the continuity of the profile and introducing an outlier detection method based on profile shape, our approach enhances robustness against outliers from shadows and specular reflections. Experiments using real event data from 3D-printed objects demonstrate that PS-EIP significantly improves robustness to outliers compared to EventPS's deep-learning variant, EventPS-FCN, without relying on deep learning.
Poster
Yongfan Liu · Hyoukjun Kwon

[ ExHall D ]

Abstract
Stereo depth estimation is a fundamental component in augmented reality (AR) applications. Although AR applications require very low latency for real-time operation, traditional depth estimation models often rely on time-consuming preprocessing steps such as rectification to achieve high accuracy. In addition, algorithms built on non-standard ML operators, such as cost volume construction, incur significant latency, which is aggravated on compute-resource-constrained mobile platforms. We therefore develop hardware-friendly alternatives to the costly cost volume and preprocessing, and design two new models based on them, MultiHeadDepth and HomoDepth. Our approach to the cost volume replaces it with a new group-pointwise convolution-based operator and an approximation of cosine similarity based on layernorm and dot products. For online stereo rectification (preprocessing), we introduce a homography matrix prediction network with a rectification positional encoding (RPE), which delivers both low latency and robustness to unrectified images, eliminating the need for preprocessing. Our MultiHeadDepth, which includes the optimized cost volume, provides 11.8-30.3% improvements in accuracy and 22.9-25.2% reduction in latency compared to a state-of-the-art depth estimation model for AR glasses from industry. Our HomoDepth, which adds the optimized preprocessing (homography + RPE) on top of MultiHeadDepth, can process unrectified images and reduce the end-to-end latency by 44.5%. We adopt a multi-task learning …
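The cosine-similarity approximation mentioned above can be pictured as follows; this is a minimal sketch assuming LayerNorm without affine parameters and a simple scaled dot product, not the exact operator proposed in the paper.

```python
import torch
import torch.nn.functional as F

def approx_cosine(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Hedged sketch: approximate cosine similarity with LayerNorm + dot product.
    LayerNorm (no affine parameters) maps each feature vector to zero mean and
    unit variance, so the scaled dot product acts as a cosine-like similarity on
    the centered features. Inputs have shape (..., C)."""
    c = a.shape[-1]
    a_n = F.layer_norm(a, (c,))          # zero-mean, unit-variance along channels
    b_n = F.layer_norm(b, (c,))
    return (a_n * b_n).sum(dim=-1) / c   # scaled dot product as a cosine surrogate

# Compare against the exact cosine for reference:
x, y = torch.randn(4, 64), torch.randn(4, 64)
print(approx_cosine(x, y), F.cosine_similarity(x, y, dim=-1))
```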
Poster
Jinhong Wang · Jintai Chen · Jian liu · Dongqi Tang · Wentong Li · Weiqiang Wang · Danny Chen · Jian Wu

[ ExHall D ]

Abstract
This paper proposes a new autoregressive model as an effective and scalable monocular depth estimator. Our idea is simple: We tackle the monocular depth estimation (MDE) task with an autoregressive prediction paradigm, based on two core designs. First, our depth autoregressive model (DAR) treats the depth map of different resolutions as a set of tokens, and conducts the low-to-high resolution autoregressive objective with a patch-wise causal mask. Second, our DAR recursively discretizes the entire depth range into more compact intervals, and attains the coarse-to-fine granularity autoregressive objective in an ordinal-regression manner. By coupling these two autoregressive objectives, our DAR establishes new state-of-the-art (SOTA) on KITTI and NYU Depth v2 by clear margins. Further, our scalable approach allows us to scale the model up to 2.0B and achieve the best RMSE of 1.799 on the KITTI dataset (5% improvement) compared to 1.896 by the current SOTA (Depth Anything). DAR further showcases zero-shot generalization ability on unseen datasets. These results suggest that DAR yields superior performance with an autoregressive prediction paradigm, providing a promising approach to equip modern autoregressive large models (e.g., GPT-4o) with depth estimation capabilities.
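The coarse-to-fine ordinal discretization could look roughly like the sketch below, assuming uniform splitting of the currently selected bin; the actual bin layout and refinement rule in DAR may differ.

```python
import numpy as np

def refine_depth_bins(d_min: float, d_max: float, coarse_bin_idx: int, n_bins: int = 8):
    """Hedged sketch of coarse-to-fine ordinal discretization: given the coarse
    bin selected at the previous autoregressive step, recursively split that bin
    into finer intervals for the next step. Uniform splitting is assumed here
    purely for illustration."""
    edges = np.linspace(d_min, d_max, n_bins + 1)
    lo, hi = edges[coarse_bin_idx], edges[coarse_bin_idx + 1]
    return np.linspace(lo, hi, n_bins + 1)   # finer bin edges inside the chosen interval

# Example: start with a [0.1 m, 80 m] range, pick bin 2 at the coarse level, then refine.
fine_edges = refine_depth_bins(0.1, 80.0, coarse_bin_idx=2)
```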
Poster
Haoyu Guo · He Zhu · Sida Peng · Haotong Lin · Yunzhi Yan · Tao Xie · Wenguan Wang · Xiaowei Zhou · Hujun Bao

[ ExHall D ]

Abstract
This paper aims to reconstruct the scene geometry from multi-view images with strong robustness and high quality. Previous learning-based methods incorporate neural networks into the multi-view stereo matching and have shown impressive reconstruction results. However, due to the reliance on matching across input images, they typically suffer from high GPU memory consumption and tend to fail in sparse view scenarios. To overcome this problem, we develop a new pipeline, named Murre, for multi-view geometry reconstruction of 3D scenes based on SfM-guided monocular depth estimation. For input images, Murre first recovers the SfM point cloud that captures the global scene structure, and then uses it to guide a conditional diffusion model to produce multi-view metric depth maps for the final TSDF fusion. By predicting the depth map from a single image, Murre bypasses the multi-view matching step and naturally resolves the issues of previous MVS-based methods. In addition, the diffusion-based model can easily leverage the powerful priors of 2D foundation models, achieving good generalization ability across diverse real-world scenes. To obtain multi-view consistent depth maps, our key design is providing effective guidance on the diffusion model through the SfM point cloud, which is a condensed form of multi-view information, highlighting the …
Poster
Bowen Wen · Matthew Trepte · Oluwaseun Joseph Aribido · Jan Kautz · Orazio Gallo · Stan Birchfield

[ ExHall D ]

Abstract
Tremendous progress has been made in deep stereo matching to excel on benchmark datasets through per-domain fine-tuning. However, achieving strong zero-shot generalization — a hallmark of foundation models in other computer vision tasks — remains challenging for stereo matching. We introduce StereoAnything, a foundation model for stereo depth estimation designed to achieve strong zero-shot generalization. To this end, we first construct a large-scale (1M stereo pairs) synthetic training dataset featuring large diversity and high photorealism, followed by an automatic self-curation pipeline to remove ambiguous samples. We then design a number of network architecture components to enhance scalability, including a side-tuning feature backbone that adapts rich monocular priors from vision foundation models to mitigate the sim-to-real gap, and long-range context reasoning for effective cost volume filtering. Together, these components lead to strong robustness and accuracy across domains, establishing a new standard in zero-shot stereo depth estimation.
Poster
JunDa Cheng · Longliang Liu · Gangwei Xu · Xianqi Wang · Zhaoxing Zhang · Yong Deng · Jinliang Zang · Yurui Chen · zhipeng cai · Xin Yang

[ ExHall D ]

Abstract
Stereo matching recovers depth from image correspondences. Existing methods struggle to handle ill-posed regions with limited matching cues, such as occlusions and textureless areas. To address this, we propose MonSter, a novel method that leverages the complementary strengths of monocular depth estimation and stereo matching. MonSter integrates monocular depth and stereo matching into a dual-branch architecture to iteratively improve each other. Confidence-based guidance adaptively selects reliable stereo cues for monodepth scale-shift recovery, and utilizes explicit monocular depth priors to enhance stereo matching at ill-posed regions. Such iterative mutual enhancement enables MonSter to evolve monodepth priors from coarse object-level structures to pixel-level geometry, fully unlocking the potential of stereo matching. As shown in Fig. 2, MonSter ranks 1st across the five most commonly used leaderboards --- SceneFlow, KITTI 2012, KITTI 2015, Middlebury, and ETH3D --- achieving up to 49.5% improvement over the previous best method (Bad 1.0 on ETH3D). Comprehensive analysis verifies the effectiveness of MonSter in ill-posed regions. In terms of zero-shot generalization, MonSter significantly and consistently outperforms state-of-the-art methods across the board. Code will be released upon acceptance.
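One common way to realize the scale-shift recovery mentioned above is a confidence-masked linear least-squares fit, sketched below as an assumption-laden illustration rather than MonSter's exact procedure; the confidence threshold is a placeholder.

```python
import numpy as np

def recover_scale_shift(mono: np.ndarray, stereo: np.ndarray,
                        confidence: np.ndarray, thresh: float = 0.8):
    """Illustrative sketch (not the paper's exact method): solve for a global
    scale s and shift t so that s * mono + t best matches the stereo depth on
    pixels whose stereo confidence exceeds `thresh`, via linear least squares."""
    m = confidence > thresh
    A = np.stack([mono[m], np.ones(m.sum())], axis=1)   # columns: [d_mono, 1]
    (s, t), *_ = np.linalg.lstsq(A, stereo[m], rcond=None)
    return s, t

# The aligned depth s * mono + t can then serve as a prior in ill-posed regions.
```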
Poster
Juhyung Choi · Jinneyong Kim · Seokjun Choi · Jinwoo Lee · Samuel Brucker · Mario Bijelic · Felix Heide · Seung-Hwan Baek

[ ExHall D ]

Abstract
Achieving robust stereo 3D imaging under diverse illumination conditions is an important yet challenging task, largely due to the limited dynamic ranges (DRs) of cameras, which are significantly smaller than real-world DR. As a result, the accuracy of existing stereo depth estimation methods is often compromised by under- or over-exposed images. In this work, we introduce dual-exposure stereo for extended dynamic range 3D imaging. We develop an automatic dual-exposure control method that adjusts the dual exposures, diverging them when the scene DR exceeds the camera DR, thereby capturing information over a broader DR. From the captured dual-exposure stereo images, we estimate depth with a motion-aware dual-exposure stereo depth network. To validate our proposed method, we develop a robot-vision system, collect real-world stereo video datasets, and generate a synthetic dataset. Our approach outperforms traditional exposure control and depth estimation methods.
Poster
Kaining Zhang · Yuxin Deng · Jiayi Ma · Paolo Favaro

[ ExHall D ]

Abstract
Current deep homography estimation methods are constrained to processing low-resolution image pairs due to network architecture and computational limitations. For high-resolution images, downsampling is often required, which can greatly degrade estimation accuracy. In contrast, traditional methods, which match pixels and compute the homography from correspondences, offer greater resolution flexibility. In this work, we therefore return to the traditional formulation of homography estimation. Specifically, we propose GFNet, a Grid Flow regression Network that adapts the high-accuracy dense matching framework to homography estimation and improves efficiency with a grid-based strategy, estimating flow only over a coarse grid by leveraging the homography's global smoothness. We demonstrate the effectiveness of GFNet in a wide range of experiments on multiple datasets, including the common-scene MSCOCO, the multimodal datasets VIS-IR and GoogleMap, and the dynamic-scene VIRAT. Specifically, on 448×448 GoogleMap, GFNet achieves an improvement of +9.9% in auc@3 while reducing MACs by 47% compared to the SOTA dense matching method. Additionally, it shows a 1.7× improvement in auc@3 over the SOTA deep homography method.
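Given grid-level flow, fitting a homography from the induced correspondences could proceed as in the sketch below; the grid size, RANSAC threshold, the constant stand-in flow, and the use of OpenCV's robust estimator are illustrative choices, not details from GFNet.

```python
import numpy as np
import cv2

# Hedged sketch: once flow has been regressed on a coarse grid, a homography can
# be fit to the resulting sparse correspondences.
H_img, W_img, grid = 448, 448, 16
ys, xs = np.meshgrid(np.linspace(0, H_img - 1, grid),
                     np.linspace(0, W_img - 1, grid), indexing="ij")
src = np.stack([xs, ys], axis=-1).reshape(-1, 2).astype(np.float32)

# Stand-in for the predicted grid flow: a constant 2.5-pixel shift per grid point.
flow = np.full((grid * grid, 2), 2.5, dtype=np.float32)
dst = src + flow

H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)  # 3x3 homography
```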
Poster
Patrick Rim · Hyoungseob Park · Suchisrit Gangopadhyay · Ziyao Zeng · Younjoon Chung · Alex Wong

[ ExHall D ]

Abstract
We present ProtoDepth, a novel prototype-based approach for continual learning of unsupervised depth completion, the multimodal 3D reconstruction task of predicting dense depth maps from RGB images and sparse point clouds. The unsupervised learning paradigm is well-suited for continual learning, as ground truth is not needed. However, when training on new non-stationary distributions, depth completion models will catastrophically forget previously learned information. We address forgetting by learning prototype sets that adapt the latent features of a frozen pretrained model to new domains. Since the original weights are not modified, ProtoDepth does not forget when test-time domain identity is known. To extend ProtoDepth to the challenging setting where the test-time domain identity is withheld, we propose to learn domain descriptors that enable the model to select the appropriate prototype set for inference. We evaluate ProtoDepth on benchmark dataset sequences, where we reduce forgetting compared to baselines by 52.2% for indoor and 53.2% for outdoor to achieve the state of the art.
Poster
Jianyuan Wang · Minghao Chen · Nikita Karaev · Andrea Vedaldi · Christian Rupprecht · David Novotny

[ ExHall D ]

Abstract
We present VGGN, a feed-forward neural network that directly infers all key 3D attributes of a scene, such as camera poses, point maps, depth maps, and 3D point tracks, from a few or hundreds of its views. Unlike recent alternatives, VGGN does not need to use visual geometry optimization techniques to refine the results in post-processing, obtaining all quantities of interest directly. This approach is simple and more efficient, reconstructing hundreds of images in seconds. We train VGGN on a large number of publicly available datasets with 3D annotations and demonstrate its ability to achieve state-of-the-art results in multiple 3D tasks, including camera pose estimation, multi-view depth estimation, dense point cloud reconstruction, and 3D point tracking. This is a step forward in 3D computer vision, where models have been typically constrained to and specialized for single tasks. We extensively evaluate our method on unseen datasets to demonstrate its superior performance. We will release the code and trained model.
Poster
Qitao Zhao · Amy Lin · Jeff Tan · Jason Y. Zhang · Deva Ramanan · Shubham Tulsiani

[ ExHall D ]

Abstract
Current Structure-from-Motion (SfM) methods often adopt a two-stage pipeline involving learned or geometric pairwise reasoning followed by a global optimization. We instead propose a data-driven multi-view reasoning approach that directly infers cameras and 3D geometry from multi-view images. Our proposed framework, DiffusionSfM, parametrizes scene geometry and cameras as pixel-wise ray origins and endpoints in a global frame, and learns a transformer-based denoising diffusion model to predict these from multi-view input. We develop mechanisms to overcome practical challenges in training diffusion models with missing data and unbounded scene coordinates, and demonstrate that DiffusionSfM allows accurate prediction of 3D and cameras. We empirically validate our approach on challenging real world data and find that DiffusionSfM improves over prior classical and learning-based methods, while also naturally modeling uncertainty and allowing external guidance to be incorporated in inference.
Poster
Laurie Bose · Piotr Dudek · Jianing Chen

[ ExHall D ]

Abstract
This paper presents a novel approach for joint point-feature detection and tracking, specifically designed for Pixel Processor Array sensors (PPA). Instead of standard pixels, PPA sensors consist of thousands of "pixel-processors", enabling massively parallel computation of visual data at the point of light capture. Our approach performs all computation within these pixel-processors, meaning no raw image data need ever leave the sensor. Instead, sensor output can be reduced to merely the locations of tracked features, and the descriptors of newly initialized features, minimizing data transfer between sensor and external processing. To achieve this we store feature descriptors inside every pixel-processor, adjusting the layout of these descriptors every frame. The PPA's architecture enables us to compute the response of every stored descriptor in parallel. This "response map" is utilized for both detection and tracking of point-features across the pixel-processor array. This approach is very fast, our implementation upon the SCAMP-7 PPA prototype runs at over 3000 FPS (Frames Per Second), tracking point-features reliably even under violent motion. This is the first work performing point-feature detection and tracking entirely "in-pixel".
Poster
Zeran Ke · Bin Tan · Xianwei Zheng · Yujun Shen · Tianfu Wu · Nan Xue

[ ExHall D ]

Abstract
This paper studies the problem of Line Segment Detection (LSD) for the characterization of line geometry in images, with the aim of learning a domain-agnostic, robust LSD model that works well on any natural image. With a focus on scalable self-supervised learning of LSD, we revisit and streamline the fundamental designs of (deep and non-deep) LSD approaches to obtain a high-performing and efficient LSD learner, dubbed ScaleLSD, for the curation of line geometry at scale from over 10M unlabeled real-world images. Our ScaleLSD detects substantially more line segments from natural images than even the pioneering non-deep LSD approach, providing a more complete and accurate geometric characterization of images using line segments. Experimentally, our proposed ScaleLSD is comprehensively evaluated under the zero-shot protocol in detection performance, single-view 3D geometry estimation, two-view line segment matching, and multiview 3D line mapping, with excellent performance obtained in all cases. Based on this thorough evaluation, ScaleLSD is the first deep approach observed to outperform the pioneering non-deep LSD in all aspects we have tested, significantly expanding and reinforcing the versatility of the line geometry of images.
Poster
Guénolé Fiche · Simon Leglaive · Xavier Alameda-Pineda · Francesc Moreno-Noguer

[ ExHall D ]

Abstract
Human Mesh Recovery (HMR) from a single RGB image is a highly ambiguous problem, as an infinite set of 3D interpretations can explain the 2D observation equally well. Nevertheless, most HMR methods overlook this issue and make a single prediction without accounting for this ambiguity. A few approaches generate a distribution of human meshes, enabling the sampling of multiple predictions; however, none of them is competitive with the latest single-output model when making a single prediction. This work proposes a new approach based on masked generative modeling. By tokenizing the human pose and shape, we formulate the HMR task as generating a sequence of discrete tokens conditioned on an input image. We introduce MEGA, a MaskEd Generative Autoencoder trained to recover human meshes from images and partial human mesh token sequences. Given an image, our flexible generation scheme allows us to predict a single human mesh in deterministic mode or to generate multiple human meshes in stochastic mode. Experiments on in-the-wild benchmarks show that MEGA achieves state-of-the-art performance in deterministic and stochastic modes, outperforming single-output and multi-output approaches. Code and trained models will be released upon acceptance.
Poster
Yan Xia · Xiaowei Zhou · Etienne Vouga · Qixing Huang · Georgios Pavlakos

[ ExHall D ]

Abstract
In this paper, we introduce a method for reconstructing humans in 3D from a single image using a biomechanically accurate skeleton model. To achieve this, we train a transformer that takes an image as input and estimates the parameters of the model. Due to the lack of training data for this task, we build a pipeline to generate pseudo ground truth data and implement a training procedure that iteratively refines these pseudo labels for improved accuracy. Compared to state-of-the-art methods in 3D human pose estimation, our model achieves competitive performance on standard benchmarks, while it significantly outperforms them in settings with extreme 3D poses and viewpoints. This result highlights the benefits of using a biomechanical skeleton with realistic degrees of freedom for robust pose estimation. Additionally, we show that previous models frequently violate joint angle limits, leading to unnatural rotations. In contrast, our approach leverages the biomechanically plausible degrees of freedom leading to more realistic joint rotation estimates. We validate our approach across multiple human pose estimation benchmarks. We will make all code, models and data publicly available upon publication.
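As a small illustration of why biomechanical joint limits matter, the sketch below clamps predicted joint angles to per-joint ranges; the joint names and limit values are placeholders, not taken from the paper or from any specific skeleton model.

```python
import numpy as np

# Hedged sketch: enforcing per-joint angle limits, the kind of constraint a
# biomechanically accurate skeleton imposes. Limits are illustrative placeholders.
JOINT_LIMITS_DEG = {"knee_flexion": (0.0, 140.0), "elbow_flexion": (0.0, 150.0)}

def clamp_joint_angles(angles_deg: dict) -> dict:
    """Clamp each known joint angle (degrees) into its allowed range."""
    return {name: float(np.clip(a, *JOINT_LIMITS_DEG[name]))
            for name, a in angles_deg.items() if name in JOINT_LIMITS_DEG}

print(clamp_joint_angles({"knee_flexion": -12.0, "elbow_flexion": 160.0}))
```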
Poster
Dongki Jung · Jaehoon Choi · Yonghan Lee · Somi Jeong · Taejae Lee · Dinesh Manocha · Suyong Yeon

[ ExHall D ]

Abstract
We introduce the first learning-based dense matching algorithm, termed Equirectangular Projection-Oriented Dense Kernelized Feature Matching (EDM), specifically designed for omnidirectional images. Equirectangular projection (ERP) images, with their large fields of view, are particularly suited for dense matching techniques that aim to establish comprehensive correspondences across images. However, ERP images are subject to significant distortions, which we address by leveraging the spherical camera model and geodesic flow refinement in the dense matching method. To further mitigate these distortions, we propose spherical positional embeddings based on 3D Cartesian coordinates of the feature grid. Additionally, our method incorporates bidirectional transformations between spherical and Cartesian coordinate systems during refinement, utilizing a unit sphere to improve matching performance. We demonstrate that our proposed method achieves notable performance enhancements, with improvements of +26.72 and +42.62 in AUC@5° on the Matterport3D and Stanford2D3D datasets. Project page: https://anonymous000edm.github.io/
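A spherical positional embedding of the kind mentioned above can start from the 3D Cartesian coordinates of the ERP grid, as in this sketch; the angular convention and half-pixel offsets are assumptions for illustration.

```python
import numpy as np

def erp_grid_to_cartesian(h: int, w: int) -> np.ndarray:
    """Hedged sketch: map an equirectangular (ERP) feature grid to 3D Cartesian
    coordinates on the unit sphere, the kind of quantity a spherical positional
    embedding could be built from."""
    v, u = np.meshgrid(np.arange(h) + 0.5, np.arange(w) + 0.5, indexing="ij")
    lon = (u / w) * 2.0 * np.pi - np.pi          # longitude in [-pi, pi)
    lat = np.pi / 2.0 - (v / h) * np.pi          # latitude in [-pi/2, pi/2]
    x = np.cos(lat) * np.cos(lon)
    y = np.cos(lat) * np.sin(lon)
    z = np.sin(lat)
    return np.stack([x, y, z], axis=-1)          # (h, w, 3), unit-norm per pixel
```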
Poster
Yue Chen · Xingyu Chen · Anpei Chen · Gerard Pons-Moll · Yuliang Xiu

[ ExHall D ]

Abstract
Given that visual foundation models (VFMs) are trained on extensive datasets but often limited to 2D images, a natural question arises: how well do they understand the 3D world? With the differences in architecture and training protocols (i.e., objectives, proxy tasks), a unified framework to fairly and comprehensively probe their 3D awareness is urgently needed. Existing works on 3D probing suggest single-view 2.5D estimation (e.g., depth and normal) or two-view sparse 2D correspondence (e.g., matching and tracking). Unfortunately, these tasks ignore texture awareness, and require 3D data as ground-truth, which limits the scale and diversity of their evaluation set. To address these issues, we introduce Feat2GS , which readout 3D Gaussians attributes from VFM features extracted from unposed images. This allows us to probe 3D awareness for geometry and texture without requiring 3D data. Additionally, the disentanglement of 3DGS parameters – geometry (x, α, Σ) and texture (c) – enables separate analysis of texture and geometry awareness. Under Feat2GS , we conduct extensive experiments to probe the 3D awareness of several VFMs, and investigate the ingredients that lead to a 3D aware VFM. Building on these findings, we develop several variants that achieve state-of-the-art across diverse datasets. This makes Feat2GS …
Poster
Zimin Xia · Alex Alahi

[ ExHall D ]

Abstract
We propose a novel fine-grained cross-view localization method that estimates the 3 Degrees of Freedom pose of a ground-level image in an aerial image of the surroundings by matching fine-grained features between the two images. The pose is estimated by aligning a point plane generated from the ground image with a point plane sampled from the aerial image. To generate the ground points, we first map ground image features to a 3D point cloud. Our method then learns to select features along the height dimension to pool the 3D points to a Bird's-Eye-View (BEV) plane. This selection enables us to trace which feature in the ground image contributes to the BEV representation. Next, we sample a set of sparse matches from computed point correspondences between the two point planes and compute their relative pose using Procrustes alignment. Compared to the previous state-of-the-art, our method reduces the mean localization error by 42% on the VIGOR dataset. Qualitative results show that our method learns semantically consistent matches across ground and aerial views through weakly supervised learning from the ground truth camera pose.
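The Procrustes alignment step can be pictured as a weighted rigid fit in 2D over the sampled matches; the sketch below is a generic Kabsch-style solve under assumed uniform weights, not the paper's exact formulation.

```python
import numpy as np

def procrustes_2d(ground_pts: np.ndarray, aerial_pts: np.ndarray, weights=None):
    """Illustrative sketch of weighted rigid Procrustes alignment in 2D: recover
    a rotation R (1-DoF yaw) and translation t (2 DoF) mapping BEV ground-plane
    points onto their matched aerial-plane points."""
    if weights is None:
        weights = np.ones(len(ground_pts))
    w = weights / weights.sum()
    mu_g = (w[:, None] * ground_pts).sum(axis=0)
    mu_a = (w[:, None] * aerial_pts).sum(axis=0)
    H = (w[:, None] * (ground_pts - mu_g)).T @ (aerial_pts - mu_a)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                     # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = mu_a - R @ mu_g
    return R, t                                   # yaw from R plus 2D translation t
```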
Poster
Philip Doldo · Derek Everett · Amol Khanna · Andre T Nguyen · Edward Raff

[ ExHall D ]

Abstract
Projected Gradient Descent (PGD) under the L∞ ball has become one of the de facto methods used in adversarial robustness evaluation for computer vision (CV) due to its reliability and efficacy, making it a strong and easy-to-implement iterative baseline. However, PGD is computationally demanding to apply, especially when thousands of iterations are the current best-practice recommendation for generating an adversarial example for a single image. In this work, we introduce a simple novel method for early termination of PGD based on cycle detection, exploiting the geometry of how PGD is implemented in practice, and show that it can produce large speedup factors while providing the *exact* same estimate of model robustness as standard PGD. This method substantially speeds up PGD without sacrificing any attack strength, enabling evaluations of robustness that were previously computationally intractable.
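The cycle-detection idea can be illustrated by hashing the perturbation at each PGD step, as in the hedged sketch below; the hashing strategy and hyperparameters are assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def pgd_with_cycle_detection(model, x, y, eps=8/255, alpha=2/255, max_iters=1000):
    """Hedged sketch: standard L-infinity PGD plus a simple cycle check. For a
    deterministic model, the sign step and projection depend only on the current
    perturbation, so a repeated perturbation implies the iterates will loop and
    we can stop early without changing the attack's outcome."""
    delta = torch.zeros_like(x, requires_grad=True)
    seen = set()
    for _ in range(max_iters):
        loss = F.cross_entropy(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += alpha * grad.sign()
            delta.clamp_(-eps, eps)                          # project into the L-inf ball
            key = hash(delta.detach().cpu().numpy().tobytes())
            if key in seen:                                  # cycle detected: terminate early
                break
            seen.add(key)
    return (x + delta).detach()
```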
Poster
Ayush Shrivastava · Andrew Owens

[ ExHall D ]

Abstract
We present a method for finding cross-modal space-time correspondences. Given two images from different visual modalities, such as an RGB image and a depth map, our model finds which pairs of pixels within them correspond to the same physical points in the scene. To solve this problem, we extend the contrastive random walk framework to simultaneously learn cycle consistent feature representations for cross-modal and intra-modal matching. The resulting model is simple and has no explicit photoconsistency assumptions. It can be trained entirely using unlabeled data, without need for any spatially aligned multimodal image pairs. We evaluate our method on challenging RGB-to-depth and RGB-to-thermal matching problems (and vice versa), where we find that it obtains strong performance.
Poster
Gonglin Chen · Tianwen Fu · Haiwei Chen · Wenbin Teng · Hanyuan Xiao · Yajie Zhao

[ ExHall D ]

Abstract
As a core step in structure-from-motion and SLAM, robust feature detection and description under challenging scenarios such as significant viewpoint changes remain unresolved despite their ubiquity. While recent works have identified the importance of local features in modeling geometric transformations, these methods fail to learn the visual cues present in long-range relationships. We present RDD, a novel and robust keypoint detector/descriptor leveraging the deformable transformer, which captures global context and geometric invariance through deformable self-attention mechanisms. Specifically, we observed that deformable attention focuses on key locations, effectively reducing the search space complexity and modeling the geometric invariance. Furthermore, we introduce a novel refinement module for semi-dense matching, which does not rely on fine-level features. Our proposed methods outperform all state-of-the-art keypoint detection/description methods in feature matching, pose estimation, and visual localization tasks. To ensure comprehensive evaluation, we introduce two challenging benchmarks: one emphasizing large viewpoint and scale variations, and the other being a novel Air-to-Ground benchmark — an evaluation setting which has gained popularity in recent years for 3D reconstruction at different altitudes. Our code and benchmarks will be released upon acceptance of the paper.
Poster
JongMin Lee · Sungjoo Yoo

[ ExHall D ]

Abstract
We present Dense-SfM, a novel Structure from Motion (SfM) framework designed for dense and accurate 3D reconstruction from multi-view images. Sparse keypoint matching, which traditional SfM methods often rely on, limits both accuracy and point density, especially in texture-less areas. Dense-SfM addresses this limitation by integrating dense matching with a Gaussian Splatting (GS) based track extension which gives more consistent, longer feature tracks. To further improve reconstruction accuracy, Dense-SfM is equipped with a multi-view kernelized matching module leveraging transformer and Gaussian Process architectures, for robust track refinement across multi-views. Evaluations on the ETH3D and Texture-Poor SfM datasets show that Dense-SfM offers significant improvements in accuracy and density over state-of-the-art methods.
Poster
Yuto Matsubara · Ko Nishino

[ ExHall D ]

Abstract
We introduce a novel method for human shape and pose recovery that can fully leverage multiple static views. We target fixed-multiview people monitoring, including elderly care and safety monitoring, in which calibrated cameras can be installed at the corners of a room or an open space but whose configuration may vary depending on the environment. Our key idea is to formulate it as neural optimization. We achieve this with HeatFormer, a neural optimizer that iteratively refines the SMPL parameters given multiview images, which is fundamentally agnostic to the configuration of views. HeatFormer realizes this SMPL parameter estimation as heat map generation and alignment with a novel transformer encoder and decoder. We demonstrate the effectiveness of HeatFormer including its accuracy, robustness to occlusion, and generalizability through an extensive set of experiments. We believe HeatFormer can play a key role in passive human behavior modeling.
Poster
Ben Kaye · Tomas Jakab · Shangzhe Wu · Christian Rupprecht · Andrea Vedaldi

[ ExHall D ]

Abstract
The choice of data representation is a key factor in the success of deep learning in geometric tasks. For instance, DUSt3R has recently introduced the concept of viewpoint-invariant point maps, generalizing depth prediction, and showing that one can reduce all the key problems in the 3D reconstruction of static scenes to predicting such point maps. In this paper, we develop an analogous concept for a very different problem, namely, the reconstruction of the 3D shape and pose of deformable objects. To this end, we introduce the Dual Point Maps (DualPM), where a pair of point maps is extracted from the same image, one associating pixels to their 3D locations on the object, and the other to a canonical version of the object at rest pose. We also extend point maps to amodal reconstruction, seeing through self-occlusions to obtain the complete shape of the object. We show that 3D reconstruction and 3D pose estimation reduce to the prediction of the DualPMs. We demonstrate empirically that this representation is a good target for a deep network to predict; specifically, we consider modelling horses, showing that DualPMs can be trained purely on 3D synthetic data, consisting of a single model of a horse, …
Poster
Tuo Cao · Fei LUO · Jiongming Qin · Yu Jiang · Yusen Wang · Chunxia Xiao

[ ExHall D ]

Abstract
Traditional methods in pose estimation often rely on precise 3D models or additional data such as depth and normals, limiting their generalization, especially when objects undergo large translations or rotations. We propose iG-6DoF, a novel model-free 6D pose estimation method that uses iterative 3D Gaussian Splatting to estimate the pose of unseen objects. We first estimate an initial pose by leveraging multi-scale data augmentation and rotation-equivariant features to create a better pose hypothesis from a set of candidates. Then, we propose an iterative 3DGS approach that repeatedly renders the object and compares the rendered image with the input image to progressively improve pose estimation accuracy. The proposed end-to-end network consists of an object detector, a multi-scale rotation-equivariant feature based initial pose estimator, and a coarse-to-fine pose refiner. This combination allows our method to focus on the target object in a complex scene and to cope with large movements and weak textures. We conduct extensive experiments on benchmark and real datasets. Our method achieves state-of-the-art results on the LINEMOD, OnePose-LowTexture, and GenMOP datasets, demonstrating its strong generalization to unseen objects and robustness across various scenes.
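As a rough illustration of the render-and-compare idea, here is a generic gradient-based refinement loop, not the paper's coarse-to-fine refiner. It assumes a hypothetical differentiable `render_fn` (for example, a wrapper around a 3DGS rasterizer) and a 6-vector pose parameterization; the photometric L1 loss is an illustrative choice.

```python
import torch

def refine_pose(render_fn, target_image, pose_init, iters=100, lr=1e-2):
    # pose: 6-vector (axis-angle rotation + translation), refined by gradient
    # descent on a photometric loss between the render and the observation.
    pose = pose_init.clone().requires_grad_(True)
    optim = torch.optim.Adam([pose], lr=lr)
    for _ in range(iters):
        rendered = render_fn(pose)                      # [3, H, W], differentiable
        loss = torch.nn.functional.l1_loss(rendered, target_image)
        optim.zero_grad()
        loss.backward()
        optim.step()
    return pose.detach()
```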
Poster
Jaeguk Kim · Jaewoo Park · Keuntek Lee · Nam Ik Cho

[ ExHall D ]

Abstract
Estimating the 6D pose of unseen objects from monocular RGB images remains a challenging problem, especially due to the lack of prior object-specific knowledge. To tackle this issue, we propose RefPose, an innovative approach to object pose estimation that leverages a reference image and geometric correspondence as guidance. RefPose first predicts an initial pose by using object templates to render the reference image and establish the geometric correspondence needed for the refinement stage. During the refinement stage, RefPose estimates the geometric correspondence of the query based on the generated references and iteratively refines the pose through a render-and-compare approach. To enhance this estimation, we introduce a correlation volume-guided attention mechanism that effectively captures correlations between the query and reference images. Unlike traditional methods that depend on pre-defined object models, RefPose dynamically adapts to new object shapes by leveraging a reference image and geometric correspondence. This results in robust performance across previously unseen objects. Extensive evaluation on the BOP benchmark datasets shows that RefPose achieves state-of-the-art results while maintaining a competitive runtime.
Poster
Mengya Liu · Siyuan Li · Ajad Chhatkuli · Prune Truong · Luc Van Gool · Federico Tombari

[ ExHall D ]

Abstract
6D object pose estimation remains challenging for many applications due to dependencies on complete 3D models, multi-view images, or training limited to specific object categories. These requirements make generalization difficult for novel objects, for which neither 3D models nor multi-view images may be available. To address this, we propose a novel method, One2Any, that estimates the relative 6-degrees-of-freedom (DOF) object pose using only a single reference and a single query RGB-D image, without prior knowledge of the object's 3D model, multi-view data, or category constraints. We treat object pose estimation as an encoding-decoding process: first, we obtain a comprehensive Reference Object Pose Embedding (ROPE) that encodes an object’s shape, orientation, and texture from a single reference view. Using this embedding, a U-Net-based pose decoding module produces Reference Object Coordinates (ROC) for new views, enabling fast and accurate pose estimation. This simple encoding-decoding framework allows our model to be trained on any pair-wise pose data, enabling large-scale training and demonstrating great scalability. Experiments on multiple benchmark datasets demonstrate that our model generalizes well to novel objects, achieving state-of-the-art accuracy and robustness, even rivaling methods that require multi-view or CAD inputs, at a fraction of the compute.
Poster
Leonhard Sommer · Olaf Dünkel · Christian Theobalt · Adam Kortylewski

[ ExHall D ]

Abstract
3D morphable models (3DMMs) are a powerful tool to represent the possible shapes and appearances of an object category. Given a single test image, 3DMMs can be used to solve various tasks, such as predicting the 3D shape, pose, semantic correspondence, and instance segmentation of an object. Unfortunately, 3DMMs are only available for very few object categories that are of particular interest, like faces or human bodies, as they require a demanding 3D data acquisition and category-specific training process. In contrast, we introduce a new method, Common3D, that learns 3DMMs of common objects in a fully self-supervised manner from a collection of object-centric videos. For this purpose, our model represents objects as a learned 3D template mesh and a deformation field that is parameterized as an image-conditioned neural network. Different from prior works, Common3D represents the object appearance with neural features instead of RGB colors, which enables the learning of more generalizable representations through an abstraction from pixel intensities. Importantly, we train the appearance features using a contrastive objective by exploiting the correspondences defined through the deformable template mesh. This leads to higher quality correspondence features compared to related works and a significantly improved model performance at estimating 3D object …
Poster
Burak Bekci · Nassir Navab · Federico Tombari · Mahdi Saleh

[ ExHall D ]

Abstract
Shape completion, a crucial task in 3D computer vision, involves predicting and filling in the missing regions of scanned or partially observed objects. Current methods expect known pose or canonical coordinates and do not perform well under varying rotations, limiting their real-world applicability. We introduce **ESCAPE** (Equivariant Shape Completion via Anchor Point Encoding), a novel framework designed to achieve rotation-equivariant shape completion. Our approach employs a distinctive encoding strategy: it selects anchor points from a shape and represents each point by its distances to the anchor points. This enables the model to capture a consistent, rotation-equivariant understanding of the object’s geometry. ESCAPE leverages a transformer architecture to encode and decode the distance transformations, ensuring that generated shape completions remain accurate and equivariant under rotational transformations. Subsequently, we perform optimization to recover the predicted shapes from the encodings. Experimental evaluations demonstrate that ESCAPE achieves robust, high-quality reconstructions across arbitrary rotations and translations, showcasing its effectiveness in real-world applications without additional pose estimation modules.
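The anchor-distance encoding is easy to sketch. The snippet below is an illustrative stand-in, not the paper's implementation: anchors are picked with simple farthest-point sampling (the paper's anchor selection may differ), and the assertion demonstrates the rotation invariance of pairwise distances that the method relies on.

```python
import numpy as np

def farthest_point_anchors(points, k):
    # Greedy farthest-point sampling of k anchor points.
    anchors = [points[0]]
    for _ in range(k - 1):
        d = np.min(np.linalg.norm(points[:, None] - np.array(anchors)[None], axis=-1), axis=1)
        anchors.append(points[np.argmax(d)])
    return np.stack(anchors)

def anchor_distance_encoding(points, anchors):
    # Each point is represented by its distance to every anchor point: [N, k].
    return np.linalg.norm(points[:, None] - anchors[None], axis=-1)

pts = np.random.randn(256, 3)
anchors = farthest_point_anchors(pts, 8)
enc = anchor_distance_encoding(pts, anchors)

# Rotating points and anchors together leaves the encoding unchanged.
R = np.linalg.qr(np.random.randn(3, 3))[0]
enc_rot = anchor_distance_encoding(pts @ R.T, anchors @ R.T)
assert np.allclose(enc, enc_rot, atol=1e-6)
```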
Poster
Jiayang Ao · Yanbei Jiang · Qiuhong Ke · Krista A. Ehinger

[ ExHall D ]

Abstract
Understanding and reconstructing occluded objects is a challenging problem, especially in open-world scenarios where categories and contexts are diverse and unpredictable. Traditional methods, however, are typically restricted to closed sets of object categories, limiting their use in complex, open-world scenes. We introduce Open-World Amodal Appearance Completion, a training-free framework that expands amodal completion capabilities by accepting flexible text queries as input. Our approach generalizes to arbitrary objects specified by both direct terms and abstract queries. We term this capability reasoning amodal completion, where the system reconstructs the full appearance of the queried object based on the provided image and language query. Our framework unifies segmentation, occlusion analysis, and inpainting to handle complex occlusions and generates completed objects as RGBA elements, enabling seamless integration into applications such as 3D reconstruction and image editing. Extensive evaluations demonstrate the effectiveness of our approach in generalizing to novel objects and occlusions, establishing a new benchmark for amodal completion in open-world settings. The code and datasets will be released after paper acceptance.
Poster
Chuanyu Sun · Jiqing Zhang · Yang Wang · Huilin Ge · qianchen xia · Baocai Yin · Xin Yang

[ ExHall D ]

Abstract
Combining the advantages of conventional and event cameras for robust visual tracking has drawn extensive interest. However, existing tracking approaches rely heavily on complex cross-modal fusion modules, leading to higher computational complexity and training challenges. Besides, these methods generally ignore the effective integration of historical information, which is crucial to grasping changes in the target's appearance and motion trends. Given the recent advancements in Mamba's long-range modeling and linear complexity, we explore its potential for addressing the above issues in RGBE tracking tasks. Specifically, we first propose an efficient fusion module based on Mamba, which utilizes a simple gate-based interaction scheme to achieve effective modality-selective fusion. This module can be seamlessly integrated into the encoding layer of prevalent Transformer-based backbones. Moreover, we further present a novel historical decoder that leverages Mamba's advanced long-sequence modeling to effectively capture the target's appearance changes with autoregressive queries. Extensive experiments show that our proposed approach achieves state-of-the-art performance on multiple challenging short-term and long-term RGBE benchmarks. Besides, the effectiveness of each key Mamba-based component of our approach is evidenced by our thorough ablation study.
Poster
Albert Reed · Connor Hashemi · Dennis Melamed · Nitesh Menon · Keigo Hirakawa · Scott McCloskey

[ ExHall D ]

Abstract
Event-based sensors (EBS) are a promising new technology for star tracking due to their low latency and power efficiency, but prior work has thus far been evaluated exclusively in simulation with simplified signal models. We propose a novel algorithm for event-based star tracking, grounded in an analysis of the EBS circuit and an extended Kalman filter (EKF). We quantitatively evaluate our method using real night sky data, comparing its results with those from a space-ready active-pixel sensor (APS) star tracker. We demonstrate that our method is an order-of-magnitude more accurate than existing methods due to improved signal modeling and state estimation, while providing more frequent updates and greater motion tolerance than conventional APS trackers. We provide all code and the first dataset of events synchronized with APS solutions.
Poster
Fanqi Pu · Yifan Wang · Jiru Deng · Wenming Yang

[ ExHall D ]

Abstract
Perspective projection has been extensively utilized in monocular 3D object detection methods. It introduces geometric priors from 2D bounding boxes and 3D object dimensions to reduce the uncertainty of depth estimation. However, due to depth errors originating from the object's visual surface, the height of the bounding box often fails to represent the actual projected central height, which undermines the effectiveness of geometric depth. Direct prediction for the projected height unavoidably results in a loss of 2D priors, while multi-depth prediction with complex branches does not fully leverage geometric depth. This paper presents a Transformer-based monocular 3D object detection method called MonoDGP, which adopts perspective-invariant geometry errors to modify the projection formula. We also try to systematically discuss and explain the mechanisms and efficacy behind geometry errors, which serve as a simple but effective alternative to multi-depth prediction. Additionally, MonoDGP decouples the depth-guided decoder and constructs a 2D decoder only dependent on visual features, providing 2D priors and initializing object queries without the disturbance of 3D detection. To further optimize and fine-tune input tokens of the transformer decoder, we also introduce a Region Segmentation Head (RSH) that generates enhanced features and segment embeddings. Our monocular method demonstrates state-of-the-art performance on …
Poster
Rishubh Parihar · Srinjay Sarkar · Sarthak Vora · Jogendra Kundu Kundu · R. Venkatesh Babu

[ ExHall D ]

Abstract
Current monocular 3D detectors are held back by the limited diversity and scale of real-world datasets. While data augmentation certainly helps, it's particularly difficult to generate realistic scene-aware augmented data for outdoor settings. Most current approaches to synthetic data generation focus on realistic object appearance through improved rendering techniques. However, we show that where and how objects are positioned is just as crucial for training effective 3D monocular detectors. The key obstacle lies in automatically determining realistic object placement parameters, including position, dimensions, and directional alignment, when introducing synthetic objects into actual scenes. To address this, we introduce MonoPlace3D, a novel system that considers the 3D scene content to create realistic augmentations. Specifically, given a background scene, MonoPlace3D learns a distribution over plausible 3D bounding boxes. Subsequently, we render realistic objects and place them according to the locations sampled from the learned distribution. Our comprehensive evaluation on two standard datasets, KITTI and NuScenes, demonstrates that MonoPlace3D significantly improves the accuracy of multiple existing monocular 3D detectors while being highly data efficient.
Poster
Lu Sang · Zehranaz Canfes · Dongliang Cao · Riccardo Marin · Florian Bernard · Daniel Cremers

[ ExHall D ]

Abstract
Generating realistic intermediate shapes between non-rigidly deformed shapes is a challenging task in computer vision, especially with unstructured data (e.g., point clouds) where temporal consistency across frames is lacking and topologies are changing. Most interpolation methods are designed for structured data (i.e., meshes) and do not apply to real-world point clouds. In contrast, our approach leverages neural implicit representation (NIR) to enable shape deformation with freely changing topology. Unlike previous mesh-based methods, which learn vertex-based deformation fields, our method learns a continuous velocity field in Euclidean space, making it suitable for less structured data such as point clouds. Additionally, our method does not require intermediate-shape supervision during training; instead, we incorporate physical and geometrical constraints to regularize the velocity field. We reconstruct intermediate surfaces using a modified level-set equation, directly linking our NIR with the velocity field. Experiments show that our method significantly outperforms previous NIR approaches across various scenarios (e.g., noisy, partial, topology-changing, non-isometric shapes) and, for the first time, enables new applications like 4D Kinect sequence upsampling and real-world high-resolution mesh deformation.
Poster
Amine Ouasfi · Shubhendu Jena · Eric Marchand · Adnane Boukhayma

[ ExHall D ]

Abstract
We consider the challenging problem of learning Signed Distance Functions (SDF) from sparse and noisy 3D point clouds. In contrast to recent methods that depend on smoothness priors, our method, rooted in a distributionally robust optimization (DRO) framework, incorporates a regularization term that leverages samples from the uncertainty regions of the model to improve the learned SDFs. Thanks to tractable dual formulations, we show that this framework enables a stable and efficient optimization of SDFs in the absence of ground truth supervision. Using a variety of synthetic and real data evaluations from different modalities, we show that our DRO-based learning framework can improve SDF learning with respect to baselines and the state of the art.
Poster
Ruicheng Wang · Sicheng Xu · Cassie Lee Dai · Jianfeng XIANG · Yu Deng · Xin Tong · Jiaolong Yang

[ ExHall D ]

Abstract
We present MoGe, a powerful model for recovering 3D geometry from monocular open-domain images. Given a single image, our model directly predicts a 3D point map of the captured scene with an affine-invariant representation, which is agnostic to true global scale and shift. This new representation precludes ambiguous supervision in training and facilitates effective geometry learning. Furthermore, we propose a set of novel global and local geometry supervision techniques that empower the model to learn high-quality geometry. These include a robust, optimal, and efficient point cloud alignment solver for accurate global shape learning, and a multi-scale local geometry loss promoting precise local geometry supervision. We train our model on a large, mixed dataset and demonstrate its strong generalizability and high accuracy. In our comprehensive evaluation on diverse unseen datasets, our model significantly outperforms state-of-the-art methods across all tasks, including monocular estimation of 3D point map, depth map, and camera field of view.
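The affine-invariant evaluation protocol mentioned above can be made concrete with a simplified alignment step. The snippet below is a plain least-squares scale-and-shift fit, a stand-in for the paper's more robust and optimal global alignment solver; it assumes a predicted point map and its ground truth are both flattened to [N, 3].

```python
import torch

def align_scale_shift(pred, gt):
    # Closed-form solution of min_{s, t} || s * pred + t - gt ||^2
    # with a shared scalar scale s and a 3D shift t.
    pred_mean, gt_mean = pred.mean(0), gt.mean(0)
    pred_c, gt_c = pred - pred_mean, gt - gt_mean
    s = (pred_c * gt_c).sum() / (pred_c * pred_c).sum().clamp(min=1e-8)
    t = gt_mean - s * pred_mean
    return s * pred + t

aligned = align_scale_shift(torch.randn(1000, 3), torch.randn(1000, 3))
```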
Poster
Qirui Huang · Runze Zhang · Kangjun Liu · Minglun Gong · Hao Zhang · Hui Huang

[ ExHall D ]

Abstract
We introduce ArcPro, a novel learning framework built on architectural programs to recover structured 3D abstractions from highly sparse and low-quality point clouds. Specifically, we design a domain-specific language (DSL) to hierarchically represent building structures as a program, which can be efficiently converted into a mesh. We bridge feedforward and inverse procedural modeling by using a feedforward process for training data synthesis, allowing the network to make reverse predictions. We train an encoder-decoder on the points-program pairs to establish a mapping from unstructured point clouds to architectural programs, where a 3D convolutional encoder extracts point cloud features and a transformer decoder autoregressively predicts the programs in a tokenized form. Inference by our method is highly efficient and produces plausible and faithful 3D abstractions. Comprehensive experiments demonstrate that ArcPro outperforms both traditional architectural proxy reconstruction and learning-based abstraction methods. We further explore its potential when working with multi-view image and natural language inputs.
Poster
Johan Edstedt · André Mateus · Alberto Jaenal

[ ExHall D ]

Abstract
Structure-from-Motion (SfM) is the task of estimating 3D structure and camera poses from images. We define Collaborative SfM (ColabSfM) as sharing distributed SfM reconstructions. Sharing maps requires estimating a joint reference frame, which is typically referred to as registration. However, there is a lack of scalable methods and training datasets for registering SfM reconstructions. In this paper, we tackle this challenge by proposing the scalable task of point cloud registration for SfM reconstructions. We find that current registration methods cannot register SfM point clouds when trained on existing datasets. To this end, we propose an SfM registration dataset generation pipeline, leveraging partial reconstructions from synthetically generated camera trajectories for each scene. Finally, we propose a simple but impactful neural refiner on top of the SotA registration method RoITr that yields significant improvements, which we call RefineRoITr. Our extensive experimental evaluation shows that our proposed pipeline and model enable ColabSfM. Code and data will be made available.
Poster
Xu Han · Yuan Tang · Jinfeng Xu · Xianzhi Li

[ ExHall D ]

Abstract
We introduce Monarch Sparse Tuning (MoST), the first reparameterization-based parameter-efficient fine-tuning (PEFT) method for 3D representation learning. Unlike existing adapter-based and prompt-tuning 3D PEFT methods, MoST introduces no additional inference overhead and is compatible with any 3D representation learning backbone. At its core, we tailor a new family of structured matrices for 3D point clouds, Point Monarch, which can capture local geometric features of irregular points while offering high expressiveness. MoST reparameterizes the dense update weight matrices as our sparse Point Monarch matrices, significantly reducing parameters while retaining strong performance. Experiments on various backbones show that MoST is simple, effective, and highly generalizable. It captures local features in point clouds, achieving state-of-the-art results on multiple benchmarks, such as 97.5% acc. on ScanObjectNN (PB_50_RS) and 96.2% on ModelNet40 classification, while it can also combine with other matrix decompositions (e.g., Low-rank, Kronecker) to further reduce parameters.
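For readers unfamiliar with the structured-matrix family behind MoST, here is a generic Monarch-style linear layer: two block-diagonal factors separated by a fixed permutation. This is only a sketch of the general parameterization; the paper's Point Monarch variant adds point-cloud-specific structure that is not reproduced here, and the block count is an illustrative choice.

```python
import torch

class MonarchLinear(torch.nn.Module):
    def __init__(self, dim, nblocks=4):
        super().__init__()
        assert dim % nblocks == 0
        b = dim // nblocks
        self.nblocks, self.bsize = nblocks, b
        self.B1 = torch.nn.Parameter(torch.randn(nblocks, b, b) / b ** 0.5)
        self.B2 = torch.nn.Parameter(torch.randn(nblocks, b, b) / b ** 0.5)

    def forward(self, x):                                   # x: [..., dim]
        shape = x.shape
        x = x.reshape(-1, self.nblocks, self.bsize)
        x = torch.einsum('kij,nkj->nki', self.B1, x)        # block-diagonal factor 1
        x = x.transpose(1, 2).reshape(-1, self.nblocks, self.bsize)  # stride permutation
        x = torch.einsum('kij,nkj->nki', self.B2, x)        # block-diagonal factor 2
        return x.reshape(shape)

layer = MonarchLinear(384, nblocks=4)
out = layer(torch.randn(2, 196, 384))   # parameters: 2 * 4 * 96 * 96 vs. dense 384 * 384
```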
Poster
Liyan Chen · Gregory P. Meyer · Zaiwei Zhang · Eric M. Wolff · Paul Vernaza

[ ExHall D ]

Abstract
Recent efforts recognize the power of scale in 3D learning (e.g., PTv3) and attention mechanisms (e.g., FlashAttention). However, current point cloud backbones fail to holistically unify geometric locality, attention mechanisms, and GPU architectures in one view. In this paper, we introduce Flash3D Transformer, which aligns geometric locality and GPU tiling through a principled locality mechanism based on Perfect Spatial Hashing (PSH). The common alignment with GPU tiling naturally fuses our PSH locality mechanism with FlashAttention at negligible extra cost. This mechanism affords flexible design choices throughout the backbone that result in superior downstream task results. Flash3D outperforms state-of-the-art PTv3 results on benchmark datasets, delivering a 2.25x speed increase and a 2.4x memory efficiency boost. This efficiency enables scaling to wider attention scopes and larger models without additional overhead. Such scaling allows Flash3D to achieve even higher task accuracies than PTv3 under the same compute budget.
Poster
Song Wang · Xiaolu Liu · Lingdong Kong · Jianyun Xu · Chunyong Hu · Gongfan Fang · Wentong Li · Jianke Zhu · Xinchao Wang

[ ExHall D ]

Abstract
Self-supervised representation learning for point cloud has demonstrated effectiveness in improving pre-trained model performance across diverse tasks. However, as pre-trained models grow in complexity, fully fine-tuning them for downstream applications demands substantial computational and storage resources. Parameter-efficient fine-tuning (PEFT) methods offer a promising solution to mitigate these resource requirements, yet most current approaches rely on complex adapter and prompt mechanisms that increase tunable parameters. In this paper, we propose PointLoRA, a simple yet effective method that combines low-rank adaptation (LoRA) with multi-scale token selection to efficiently fine-tune point cloud models. Our approach embeds LoRA layers within the most parameter-intensive components of point cloud transformers, reducing the need for tunable parameters while enhancing global feature capture. Additionally, multi-scale token selection extracts critical local information to serve as prompts for downstream fine-tuning, effectively complementing the global context captured by LoRA. The experimental results across various pre-trained models and three challenging public datasets demonstrate that our approach achieves competitive performance with only 3.43% of the trainable parameters, making it highly effective for resource-constrained applications. Code will be publicly available.
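The low-rank adaptation building block referenced here is standard and easy to sketch. The snippet below shows a frozen linear layer augmented with a trainable rank-r update, W x + s B A x; the placement of such layers inside a point cloud transformer and the multi-scale token selection are specific to PointLoRA and are not reproduced, and the rank and scaling values are illustrative.

```python
import torch

class LoRALinear(torch.nn.Module):
    def __init__(self, base: torch.nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():        # freeze the pre-trained weights
            p.requires_grad = False
        self.A = torch.nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen path plus low-rank trainable update.
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

layer = LoRALinear(torch.nn.Linear(384, 384), rank=8)
out = layer(torch.randn(2, 128, 384))
```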
Poster
Feifei Shao · Ping Liu · Zhao Wang · Yawei Luo · Hongwei Wang · Jun Xiao

[ ExHall D ]

Abstract
Point cloud processing (PCP) encompasses tasks like reconstruction, denoising, registration, and segmentation, each often requiring specialized models to address unique task characteristics. While in-context learning (ICL) has shown promise across tasks by using a single model with task-specific demonstration prompts, its application to PCP reveals significant limitations. We identify inter-task and intra-task sensitivity issues in current ICL methods for PCP, which we attribute to inflexible sampling strategies lacking context adaptation at the point and prompt levels. To address these challenges, we propose MICAS, an advanced ICL framework featuring a multi-grained adaptive sampling mechanism tailored for PCP. MICAS introduces two core components: task-adaptive point sampling, which leverages inter-task cues for point-level sampling, and query-specific prompt sampling, which selects optimal prompts per query to mitigate intra-task sensitivity. To our knowledge, this is the first approach to introduce adaptive sampling tailored to the unique requirements of point clouds within an ICL framework. Extensive experiments show that MICAS not only efficiently handles various PCP tasks but also significantly outperforms existing methods. Notably, it achieves a remarkable 4.1% improvement in the part segmentation task and delivers consistent gains across various PCP applications.
Poster
Chuanfu Shen · Rui Wang · Lixin Duan · Shiqi Yu

[ ExHall D ]

Abstract
Point clouds have gained growing interest in gait recognition. However, current methods, which typically convert point clouds into 3D voxels, often fail to extract essential gait-specific features. In this paper, we explore gait recognition within 3D point clouds from the perspectives of architectural design and gait representation modeling. We highlight the significance of local and body-size features in 3D gait recognition and introduce S3GaitNet, a novel framework combining advanced local representation learning techniques with a novel size-aware learning mechanism. Specifically, S3GaitNet utilizes a Set Abstraction (SA) layer and a Pyramid Point Pooling (P3) layer for learning locally fine-grained gait representations directly from 3D point clouds. Both the SA and P3 layers can be further enhanced with size-aware learning to make the model aware of the actual size of the subjects. In the end, S3GaitNet not only outperforms current state-of-the-art methods, but also consistently demonstrates robust performance and great generalizability on two benchmarks. Our extensive experiments validate the effectiveness of size and local features in gait recognition. The code will be publicly available.
Poster
Yanlong Xu · Haoxuan Qu · Jun Liu · Wenxiao Zhang · Xun Yang

[ ExHall D ]

Abstract
The goal of point cloud localization based on linguistic description is to identify a 3D position using a textual description in large urban environments, which has potential applications in various fields, such as determining the location for vehicle pickup or goods delivery. Ideally, for a textual description and its corresponding 3D location, the objects around the 3D location should be fully described in the text description. However, in practical scenarios, e.g., vehicle pickup, passengers usually describe only the most significant and nearby parts of the surroundings instead of the entire environment. In response to this partially relevant challenge, we propose CMMLoc, an uncertainty-aware Cauchy-Mixture-Model (CMM) based framework for text-to-point-cloud Localization. To model the uncertain semantic relations between text and point cloud, we integrate CMM constraints as a prior during the interaction between the two modalities. We further design a spatial consolidation scheme to enable adaptive aggregation of different 3D objects with varying receptive fields. To achieve precise localization, we propose a cardinal direction integration module alongside a modality pre-alignment strategy, helping capture the spatial relationships among objects and bringing the 3D objects closer to the text modality. Comprehensive experiments validate that CMMLoc outperforms existing methods, achieving state-of-the-art results on the KITTI360Pose …
Poster
Ethan Griffiths · Maryam Haghighat · Simon Denman · Clinton Fookes · Milad Ramezani

[ ExHall D ]

Abstract
We present HOTFormerLoc, a novel and versatile Hierarchical Octree-based TransFormer, for large-scale 3D place recognition in both ground-to-ground and ground-to-aerial scenarios across urban and forest environments. Leveraging an octree-based structure, we propose a multi-scale attention mechanism that captures spatial and semantic features across granularities. To address the variable density of point distributions from common spinning lidar, we present cylindrical octree attention windows to better reflect the underlying distribution during attention. We introduce relay tokens to enable efficient global-local interactions and multi-scale representation learning at reduced computational cost. Our pyramid attentional pooling then synthesises a robust global descriptor for end-to-end place recognition in challenging environments. In addition, we introduce our novel dataset: In-house, a 3D cross-source dataset featuring point cloud data from aerial and ground lidar scans captured in dense forests. Point clouds in In-house contain representational gaps and distinctive attributes such as varying point densities and noise patterns, making it a challenging benchmark for cross-view localisation in the wild. Our results demonstrate that HOTFormerLoc achieves a top-1 average recall improvement of 5.5%–11.5% on the In-house benchmark. Furthermore, it consistently outperforms SOTA 3D place recognition methods, with an average performance gain of 5.8% on well-established urban and forest datasets. …
Poster
Yanqing Shen · Turcan Tuna · Marco Hutter · Cesar Cadena · Nanning Zheng

[ ExHall D ]

Abstract
Place recognition is essential to maintain global consistency in large-scale localization systems. While research in urban environments has progressed significantly using LiDARs or cameras, applications in natural forest-like environments remain largely underexplored. Furthermore, forests present particular challenges due to high self-similarity and substantial variations in vegetation growth over time. In this work, we propose a robust LiDAR-based place recognition method for natural forests, ForestLPR. We hypothesize that a set of cross-sectional images of the forest’s geometry at different heights contains the information needed to recognize revisiting a place. The cross-sectional images are represented by bird’s-eye view (BEV) density images of horizontal slices of the point cloud at different heights. Our approach utilizes a visual transformer as the shared backbone to produce sets of local descriptors and introduces a multi-BEV interaction module to attend to information at different heights adaptively. It is followed by an aggregation layer that produces a rotation-invariant place descriptor. We evaluated the efficacy of our method extensively on real-world data from public benchmarks as well as robotic datasets and compared it against the state-of-the-art (SOTA) methods. The results indicate that ForestLPR has consistently good performance on all evaluations and achieves an average increase of 7.38% and 9.11% on Recall@1 over …
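The input representation described here, BEV density images of horizontal point cloud slices, can be reproduced with a few lines of NumPy. The snippet below is a sketch only; the slice heights, grid extent, and resolution are illustrative values, not the paper's settings.

```python
import numpy as np

def bev_density_slices(points, z_edges=(0.5, 2.0, 4.0, 8.0), xy_range=50.0, res=0.5):
    # points: [N, 3]; returns one BEV density image per height slice.
    bins = int(2 * xy_range / res)
    images = []
    for z_lo, z_hi in zip(z_edges[:-1], z_edges[1:]):
        sl = points[(points[:, 2] >= z_lo) & (points[:, 2] < z_hi)]
        img, _, _ = np.histogram2d(
            sl[:, 0], sl[:, 1], bins=bins,
            range=[[-xy_range, xy_range], [-xy_range, xy_range]])
        images.append(img)
    return np.stack(images)          # [num_slices, H, W]

pts = np.random.uniform(-50, 50, size=(100000, 3))
pts[:, 2] = np.random.uniform(0.0, 10.0, size=len(pts))   # toy heights
densities = bev_density_slices(pts)
```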
Poster
Barza Nisar · Steven L. Waslander

[ ExHall D ]

Abstract
Self-supervised learning (SSL) on 3D point clouds has the potential to learn feature representations that can transfer to diverse sensors and multiple downstream perception tasks. However, recent SSL approaches fail to define pretext tasks that retain geometric information such as object pose and scale, which can be detrimental to the performance of downstream localization and geometry-sensitive 3D scene understanding tasks, such as 3D semantic segmentation and 3D object detection. Further, no methods exist that exhibit transferable performance across different LiDAR beam patterns. We propose PSA-SSL, a novel extension to point cloud SSL that learns LiDAR pattern-agnostic and object pose and size-aware (PSA) features. Our approach defines 1) a self-supervised bounding box regression pretext task, which retains object pose and size information, and 2) a LiDAR beam pattern augmentation on input point clouds, which encourages learning sensor-agnostic features. Our experiments demonstrate that with a single pre-trained model, our light-weight yet effective extensions achieve significant improvements on 3D semantic segmentation with limited labels (up to **+2.66 mIoU** for DepthContrast and **+2.39 mIoU** for SegContrast) across popular autonomous driving datasets (Waymo, NuScenes, SemanticKITTI). Moreover, our approach outperforms other state-of-the-art SSL methods on 3D semantic segmentation and shows improvements on 3D object detection when …
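A beam-pattern augmentation of the kind described in point 2) can be sketched as follows. This is not the paper's implementation: points are grouped into elevation-angle bins that stand in for LiDAR beams, and a random subset of beams is kept; the beam counts are illustrative.

```python
import numpy as np

def beam_pattern_augment(points, num_beams=64, keep_beams=32):
    # Approximate beam index from the elevation angle of each point.
    elev = np.arctan2(points[:, 2], np.linalg.norm(points[:, :2], axis=1))
    beam_id = np.digitize(elev, np.linspace(elev.min(), elev.max(), num_beams))
    # Randomly keep a subset of beams, emulating a sensor with fewer channels.
    kept = np.random.choice(num_beams + 1, size=keep_beams, replace=False)
    return points[np.isin(beam_id, kept)]

augmented = beam_pattern_augment(np.random.randn(120000, 3))
```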
Poster
Wen Li · Chen Liu · Shangshu Yu · dq Liu · Yin Zhou · Siqi Shen · Chenglu Wen · Cheng Wang

[ ExHall D ]

Abstract
Scene Coordinate Regression (SCR) achieves impressive results in outdoor LiDAR localization but requires days of training. Since training needs to be repeated for each new scene, long training times make these methods impractical for applications requiring time-sensitive system upgrades, such as autonomous driving, drones, robotics, etc. We identify large coverage areas and vast amounts of data in large-scale outdoor scenes as key challenges that limit fast training. In this paper, we propose LightLoc, the first method capable of efficiently learning localization in a new scene at light speed. Beyond freezing the scene-agnostic feature backbone and training only the scene-specific prediction heads, we introduce two novel techniques to address these challenges. First, we introduce sample classification guidance to assist regression learning, reducing ambiguity from similar samples and improving training efficiency. Second, we propose a redundant sample downsampling technique to remove well-learned frames during training, reducing training time without compromising accuracy. In addition, the fast training and confidence estimation characteristics of sample classification enable its integration into SLAM, effectively eliminating error accumulation. Extensive experiments on large-scale outdoor datasets demonstrate that LightLoc achieves state-of-the-art performance with just 1 hour of training—50× faster than existing methods. Our code will be made available upon acceptance.
Poster
Junsung Park · HwiJeong Lee · Inha Kang · Hyunjung Shim

[ ExHall D ]

Abstract
Existing domain generalization methods for LiDAR semantic segmentation under adverse weather struggle to accurately predict "things" categories compared to "stuff" categories. In typical driving scenes, "things" categories can be dynamic and associated with higher collision risks, making them crucial for safe navigation and planning. Recognizing the importance of "things" categories, we identify their performance drop as a serious bottleneck in existing approaches. We observed that adverse weather induces both degradation of semantic-level features and corruption of local features, leading to the misprediction of "things" as "stuff". To address semantic-level feature corruption, we bind each point feature to its superclass, preventing the misprediction of "things" classes into visually dissimilar categories. Additionally, to enhance robustness against local corruption caused by adverse weather, we define each LiDAR beam as a local region and propose a regularization term that aligns the clean data with its corrupted counterpart in feature space. Our method achieves state-of-the-art performance with a +2.6 mIoU gain on the SemanticKITTI-to-SemanticSTF benchmark and +7.9 mIoU on the SemanticPOSS-to-SemanticSTF benchmark. Notably, our method achieves a +4.8 and +7.9 mIoU improvement on "things" classes, respectively, highlighting its effectiveness.
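One plausible reading of the per-beam alignment regularizer is sketched below. This is an interpretation, not the authors' code: per-point features of a clean scan and its synthetically corrupted counterpart are assumed to be already extracted and index-aligned, and the loss pulls per-beam feature averages together.

```python
import torch
import torch.nn.functional as F

def beam_alignment_loss(clean_feats, corrupt_feats, beam_ids):
    # clean_feats, corrupt_feats: [N, C] per-point features; beam_ids: [N] beam index.
    loss = 0.0
    beams = beam_ids.unique()
    for b in beams:
        m = beam_ids == b
        loss = loss + F.mse_loss(corrupt_feats[m].mean(0), clean_feats[m].mean(0))
    return loss / beams.numel()

loss = beam_alignment_loss(torch.randn(5000, 64), torch.randn(5000, 64),
                           torch.randint(0, 64, (5000,)))
```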
Poster
Van-Tin Luu · Yong-Lin Cai · Vu-Hoang Tran · Wei-Chen Chiu · Yi-Ting Chen · Ching-Chun Huang

[ ExHall D ]

Abstract
This paper presents a groundbreaking approach - the first online automatic geometric calibration method for radar and camera systems. Given the significant data sparsity and measurement uncertainty in radar height data, achieving automatic calibration during system operation has long been a challenge. To address the sparsity issue, we propose a Dual-Perspective representation that gathers features from both frontal and bird’s-eye views. The frontal view contains rich but sensitive height information, whereas the bird’s-eye view provides robust features against height uncertainty. We thereby propose a novel Selective Fusion Mechanism to identify and fuse reliable features from both perspectives, reducing the effect of height uncertainty. Moreover, for each view, we incorporate a Multi-Modal Cross-Attention Mechanism to explicitly find location correspondences through cross-modal matching. During the training phase, we also design a Noise-Resistant Matcher to provide better supervision and enhance the robustness of the matching mechanism against sparsity and height uncertainty. Our experimental results, tested on the nuScenes dataset, demonstrate that our method significantly outperforms previous radar-camera auto-calibration methods, as well as existing state-of-the-art LiDAR-camera calibration techniques, establishing a new benchmark for future research.
Poster
Ting Li · Mao Ye · Tianwen Wu · Nianxin Li · Shuaifeng Li · Song Tang · Luping Ji

[ ExHall D ]

Abstract
Thermal object detection is a critical task in various fields, such as surveillance and autonomous driving. Current state-of-the-art (SOTA) models always leverage a prior Thermal-To-Visible (T2V) translation model to obtain visible spectrum information, followed by a cross-modality aggregation module to fuse information from both modalities. However, this fusion approach does not fully exploit the complementary visible spectrum information beneficial for thermal detection. To address this issue, we propose a novel cross-modal fusion method called Pseudo Visible Feature Fine-Grained Fusion (PFGF). Unlike previous cross-modal fusion methods, our approach explicitly models high-level relationships between cross-modal data, effectively fusing information at different granularities. Specifically, a graph is constructed with nodes generated from features at different levels of the backbone, along with pseudo-visual latent features produced by the T2V model. Each feature corresponds to a subgraph. An Inter-Mamba block is proposed to perform cross-modality fusion between nodes at the lowest scale, while a Cascade Knowledge Integration (CKI) strategy is used to fuse low-scale fused information into high-scale subgraphs in a cascade manner. After several iterations of graph node updating, each subgraph outputs an aggregated feature to the detection head. Experimental results demonstrate that our method achieves SOTA detection performance and is more efficient. Code, …
Poster
Konyul Park · Yecheol Kim · Daehun Kim · Jun Won Choi

[ ExHall D ]

Abstract
Modern autonomous driving perception systems utilize complementary multi-modal sensors, such as LiDAR and cameras. Although sensor fusion architectures enhance performance in challenging environments, they still suffer significant performance drops under severe sensor failures, such as LiDAR beam reduction, LiDAR drop, limited field of view, camera drop, and occlusion. This limitation stems from inter-modality dependencies in current sensor fusion frameworks. In this study, we introduce an efficient and robust LiDAR-camera 3D object detector, referred to as Immortal, which can achieve robust performance through a mixture of experts approach. Our Immortal fully decouples modality dependencies using three parallel expert decoders, which use camera features, LiDAR features, or a combination of both to decode object queries, respectively. We propose Mixture of Modal Experts (MoME) framework, where each query is decoded selectively using one of three expert decoders. MoME utilizes an Adaptive Query Router (AQR) to select the most appropriate expert decoder for each query based on the quality of camera and LiDAR features. This ensures that each query is processed by the best-suited expert, resulting in robust performance across diverse sensor failure scenarios. We evaluated the performance of Immortal on the nuScenes-R benchmark. Our Immortal achieved state-of-the-art performance in extreme weather and sensor …
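The per-query routing idea can be illustrated with a small gating module. The snippet below is a generic sketch in the spirit of a mixture of modal experts, not the paper's MoME or AQR: three placeholder expert decoders stand in for camera-only, LiDAR-only, and fused decoders, and the gate here routes on the query features alone rather than on feature-quality cues.

```python
import torch

class QueryRouter(torch.nn.Module):
    def __init__(self, dim, experts):
        super().__init__()
        self.gate = torch.nn.Linear(dim, len(experts))
        self.experts = torch.nn.ModuleList(experts)

    def forward(self, queries):                       # queries: [B, Q, dim]
        choice = self.gate(queries).argmax(dim=-1)    # hard expert choice per query
        out = torch.zeros_like(queries)
        for idx, expert in enumerate(self.experts):
            mask = choice == idx
            if mask.any():
                out[mask] = expert(queries[mask])     # decode only the routed queries
        return out

router = QueryRouter(256, [torch.nn.Linear(256, 256) for _ in range(3)])
decoded = router(torch.randn(2, 900, 256))
```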
Poster
chaocan xue · Bineng Zhong · Qihua Liang · Yaozong Zheng · Ning Li · Yuanliang Xue · Shuxiang Song

[ ExHall D ]

Abstract
Vision transformers (ViTs) have emerged as a popular backbone for visual tracking. However, complete ViT architectures are too cumbersome to deploy for unmanned aerial vehicle (UAV) tracking, which places a premium on efficiency. In this study, we discover that many layers within lightweight ViT-based trackers tend to learn relatively redundant and repetitive target representations. Based on this observation, we propose a similarity-guided layer adaptation approach to optimize redundant ViT layers dynamically. Our approach disables a large number of representation-similar layers and selectively retains only a single optimal layer among them to alleviate the precision drop. By incorporating this approach into existing ViTs, we tailor previously complete ViT architectures into an efficient similarity-guided layer-adaptive framework, namely SGLATrack, for real-time UAV tracking. Extensive experiments on six tracking benchmarks verify the effectiveness of the proposed approach, and show that our SGLATrack achieves a state-of-the-art real-time speed while maintaining competitive tracking precision.
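The layer-redundancy diagnosis can be sketched with a simple similarity probe. This is an illustration of the general idea only, not SGLATrack's selection rule: consecutive layers whose token features exceed a cosine-similarity threshold are flagged as candidates for skipping, and the threshold is an illustrative value.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def redundant_layers(layer_outputs, threshold=0.95):
    # layer_outputs: list of [N, C] token features, one entry per transformer layer.
    skip = []
    for i in range(1, len(layer_outputs)):
        prev = F.normalize(layer_outputs[i - 1].flatten(), dim=0)
        curr = F.normalize(layer_outputs[i].flatten(), dim=0)
        if torch.dot(prev, curr).item() > threshold:
            skip.append(i)              # layer i adds little over layer i-1
    return skip

skippable = redundant_layers([torch.randn(196, 384) for _ in range(12)])
```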
Poster
Vladimir Yugay · Theo Gevers · Martin R. Oswald

[ ExHall D ]

Abstract
Simultaneous localization and mapping (SLAM) systems with novel view synthesis capabilities are widely used in computer vision, with applications in augmented reality, robotics, and autonomous driving. However, existing approaches are limited to single-agent operation. Recent work has addressed this problem using a distributed neural scene representation. Unfortunately, existing methods are slow, cannot accurately render real-world data, are restricted to two agents, and have limited tracking accuracy. In contrast, we propose a rigidly deformable 3D Gaussian-based scene representation that dramatically speeds up the system. However, improving tracking accuracy and reconstructing a globally consistent map from multiple agents remains challenging due to trajectory drift and discrepancies across agents' observations. Therefore, we propose new tracking and map-merging mechanisms and integrate loop closure in the Gaussian-based SLAM pipeline. We evaluate our method on synthetic and real-world datasets and find it more accurate and faster than the state of the art.
Poster
ZaiPeng Duan · Xuzhong Hu · Pei An · Junfeng Ding · Jie Zhan · Chenxu Dang · Yunbiao Xu · Jie Ma

[ ExHall D ]

Abstract
Multimodal 3D occupancy prediction has garnered significant attention for its potential in autonomous driving. However, most existing approaches are single-modality: camera-based methods lack depth information, while LiDAR-based methods struggle with occlusions. Current lightweight methods primarily rely on the Lift-Splat-Shoot (LSS) pipeline, which suffers from inaccurate depth estimation and fails to fully exploit the geometric and semantic information of 3D LiDAR points. Therefore, we propose a novel multimodal occupancy prediction network called SDG-OCC, which incorporates a joint semantic and depth-guided view transformation coupled with a fusion-to-occupancy-driven active distillation. The enhanced view transformation constructs accurate depth distributions by integrating pixel semantics and co-point depth through diffusion and bilinear discretization. The fusion-to-occupancy-driven active distillation extracts rich semantic information from multimodal data and selectively transfers knowledge to image features based on LiDAR-identified regions. Finally, for optimal performance, we introduce SDG-Fusion, which uses fusion alone, and SDG-KL, which integrates both fusion and distillation for faster inference. Our method achieves state-of-the-art (SOTA) performance with real-time processing on the Occ3D-nuScenes dataset and shows comparable performance on the more challenging SurroundOcc-nuScenes dataset, demonstrating its effectiveness and robustness. The code will be released at https://github.com/
Poster
Ziyue Zhu · Shenlong Wang · Jin Xie · Jiang-Jiang Liu · Jingdong Wang · Jian Yang

[ ExHall D ]

Abstract
Recent advancements in camera-based occupancy prediction have focused on the simultaneous prediction of 3D semantics and scene flow, a task that presents significant challenges due to specific difficulties, e.g., occlusions and unbalanced dynamic environments. In this paper, we analyze these challenges and their underlying causes. To address them, we propose a novel regularization framework called VoxelSplat. This framework leverages recent developments in 3D Gaussian Splatting to enhance model performance in two key ways: (i) Enhanced Semantics Supervision through 2D Projection: During training, our method decodes sparse semantic 3D Gaussians from 3D representations and projects them onto the 2D camera view. This provides additional supervision signals in the camera-visible space, allowing 2D labels to improve the learning of 3D semantics. (ii) Scene Flow Learning: Our framework uses the predicted scene flow to model the motion of Gaussians, and is thus able to learn the scene flow of moving objects in a self-supervised manner using the labels of adjacent frames. Our method can be seamlessly integrated into various existing occupancy models, enhancing performance without increasing inference time. Extensive experiments on benchmark datasets demonstrate the effectiveness of VoxelSplat in improving the accuracy of both semantic occupancy and scene flow estimation.
Poster
Sicheng Zuo · Wenzhao Zheng · Yuanhui Huang · Jie Zhou · Jiwen Lu

[ ExHall D ]

Abstract
3D occupancy prediction is important for autonomous driving due to its comprehensive perception of the surroundings. To incorporate sequential inputs, most existing methods fuse representations from previous frames to infer the current 3D occupancy. However, they fail to consider the continuity of driving scenarios and ignore the strong prior provided by the evolution of 3D scenes (e.g., only dynamic objects move). In this paper, we propose a world-model-based framework to exploit scene evolutions for perception. We reformulate 3D occupancy prediction as a 4D occupancy forecasting problem conditioned on the current sensor input. We decompose the scene evolution into three factors: 1) ego motion alignment of static scenes; 2) local movements of dynamic objects; and 3) completion of newly-observed scenes. We then employ a Gaussian world model (GaussianWorld) to explicitly exploit these priors and infer the scene evolutions in the 3D Gaussian space considering the current RGB observation. We evaluate the effectiveness of our framework on the widely used nuScenes dataset. Our GaussianWorld improves the performance of the single-frame counterpart by over 2% in mIoU without introducing additional computations.
Poster
Chensheng Peng · Chengwei Zhang · Yixiao Wang · Chenfeng Xu · Yichen Xie · Wenzhao Zheng · Kurt Keutzer · Masayoshi Tomizuka · Wei Zhan

[ ExHall D ]

Abstract
We present DeSiRe-GS, a self-supervised Gaussian splatting representation, enabling effective static-dynamic decomposition and high-fidelity surface reconstruction in complex driving scenarios. Our approach employs a two-stage optimization pipeline of dynamic street Gaussians. In the first stage, we extract 2D motion masks based on the observation that 3D Gaussian Splatting inherently can reconstruct only the static regions in dynamic environments. These extracted 2D motion priors are then mapped into the Gaussian space in a differentiable manner, leveraging an efficient formulation of dynamic Gaussians in the second stage. Combined with the introduced geometric regularizations, our method is able to address the over-fitting issues caused by data sparsity in autonomous driving, reconstructing physically plausible Gaussians that align with object surfaces rather than floating in the air. Furthermore, we introduce temporal cross-view consistency to ensure coherence across time and viewpoints, resulting in high-quality surface reconstruction. Comprehensive experiments demonstrate the efficiency and effectiveness of DeSiRe-GS, surpassing prior self-supervised arts and achieving accuracy comparable to methods relying on external 3D bounding box annotations.
Poster
Runjian Chen · Wenqi Shao · Bo Zhang · Shaoshuai Shi · Li Jiang · Ping Luo

[ ExHall D ]

Abstract
Deep-learning-based autonomous driving (AD) perception introduces a promising picture for safe and environment-friendly transportation. However, the over-reliance on real labeled data in LiDAR perception limits the scale of on-road attempts. 3D real world data is notoriously time-and-energy-consuming to annotate and lacks corner cases like rare traffic participants. On the contrary, in simulators like CARLA, generating labeled LiDAR point clouds with corner cases is a piece of cake. However, introducing synthetic point clouds to improve real perception is non-trivial. This stems from two challenges: 1) sample efficiency of simulation datasets 2) simulation-to-real gaps. To overcome both challenges, we propose a plug-and-play method called JiSAM, shorthand for Jittering augmentation, domain-aware backbone and memory-based Sectorized AlignMent. In extensive experiments conducted on the famous AD dataset NuScenes, we demonstrate that, with SOTA 3D object detector, JiSAM is able to utilize the simulation data and only labels on 2.5% available real data to achieve comparable performance to models trained on all real data. Additionally, JiSAM achieves more than 15 mAPs on the objects not labeled in the real training set. We will release models and codes.
Poster
Yifan Chang · Junjie Huang · Xiaofeng Wang · Yun Ye · Zhujin LIANG · Yi Shan · Dalong Du · Xingang Wang

[ ExHall D ]

Abstract
Monocular 3D lane detection is a fundamental task in autonomous driving. Although sparse-point methods lower computational load and maintain high accuracy in complex lane geometries, current methods fail to fully leverage the geometric structure of lanes in both lane geometry representations and model design. In lane geometry representations, we present a theoretical analysis alongside experimental validation to verify that current sparse lane representation methods contain inherent flaws, resulting in potential errors of up to 20 meters, which raise significant safety concerns for driving. To address this issue, we propose a novel patching strategy to completely represent the full lane structure. To enable existing models to match this strategy, we introduce the EndPoint head (EP-head), which adds a patching distance to endpoints. The EP-head enables the model to predict more complete lane representations even with fewer preset points, effectively addressing existing limitations and paving the way for models that are faster and require fewer parameters in the future. In model design, to enhance the model's perception of lane structures, we propose the PointLane attention (PL-attention), which incorporates prior geometric knowledge into the attention mechanism. Extensive experiments demonstrate the effectiveness of the proposed methods on various state-of-the-art models. For instance, our methods …
Poster
Zehao Zhu · Yuliang Zou · Chiyu “Max” Jiang · Bo Sun · Vincent Casser · XIUKUN HUANG · Jiahao Wang · Zhenpei Yang · Ruiqi Gao · Leonidas Guibas · Mingxing Tan · Dragomir Anguelov

[ ExHall D ]

Abstract
Simulation is crucial for developing and evaluating autonomous vehicle (AV) systems. Recent literature builds on a new generation of generative models to synthesize highly realistic images for full-stack simulation. However, purely synthetically generated scenes are not grounded in reality and have difficulty in inspiring confidence in the relevance of its outcomes. Editing models, on the other hand, leverage source scenes from real driving logs, and enable the simulation of different traffic layouts, behaviors, and operating conditions such as weather and time of day. While image editing is an established topic in computer vision, it presents fresh sets of challenges in driving simulation: (1) the need for cross-camera 3D consistency, (2) learning "empty street" priors from driving data with foreground occlusions, and (3) obtaining paired image tuples of varied editing conditions while preserving consistent layout and geometry. To address these challenges, we propose SceneCrafter, a versatile editor for realistic 3D-consistent manipulation of driving scenes captured from multiple cameras. We build on recent advancements in multi-view diffusion models, using a fully controllable framework that scales seamlessly to multi-modality conditions like weather, time of day, agent boxes and high-definition maps. To generate paired data for supervising the editing model, we propose a novel …
Poster
Xinyuan Chang · Maixuan Xue · Xinran Liu · Zheng Pan · Xing Wei

[ ExHall D ]

Abstract
Ensuring adherence to traffic sign regulations is essential for both human and autonomous vehicle navigation. Current online mapping solutions often prioritize the construction of the geometric and connectivity layers of HD maps while overlooking the construction of the traffic regulation layer within HD maps. Addressing this gap, we introduce MapDR, a novel dataset designed for the extraction of Driving Rules from traffic signs and their association with vectorized, locally perceived HD Maps. MapDR features over 10,000 annotated video clips that capture the intricate correlation between traffic sign regulations and lanes. Built upon this benchmark and the newly defined task of integrating traffic regulations into online HD maps, we provide modular and end-to-end solutions: VLE-MEE and RuleVLM, offering a strong baseline for advancing autonomous driving technology. It fills a critical gap in the integration of traffic sign rules, contributing to the development of reliable autonomous driving systems.
Poster
Junhao Xu · Yanan Zhang · Zhi Cai · Di Huang

[ ExHall D ]

Abstract
Multi-agent collaborative perception enhances perceptual capabilities by utilizing information from multiple agents and is considered a fundamental solution to the problem of weak single-vehicle perception in autonomous driving. However, existing collaborative perception methods face a dilemma between communication efficiency and perception accuracy. To address this issue, we propose a novel communication-efficient collaborative perception framework based on supply-demand awareness and intermediate-late hybridization, dubbed as CoSDH. By modeling the supply-demand relationship between agents, the framework refines the selection of collaboration regions, reducing unnecessary communication cost while maintaining accuracy. In addition, we innovatively introduce the intermediate-late hybrid collaboration mode, where late-stage collaboration compensates for the performance degradation in collaborative perception under low communication bandwidth. Extensive experiments on multiple datasets, including both simulated and real-world scenarios, demonstrate that CoSDH achieves state-of-the-art detection accuracy and optimal bandwidth trade-offs, delivering superior detection precision under real communication bandwidths, thus proving its effectiveness and practical applicability. The code and model will be publicly available.
Poster
Yanhao Wu · Haoyang Zhang · Tianwei Lin · Alan Huang · Shujie Luo · Rui Wu · Congpei Qiu · Wei Ke · Tong Zhang

[ ExHall D ]

Abstract
Generative models in Autonomous Driving (AD) enable diverse scenario creation, yet existing methods fall short by capturing only a limited range of modalities, restricting their capability to generate controllable scenarios for comprehensive evaluation of AD systems. In this paper, we introduce a multimodal generation framework that incorporates four major data modalities, including the novel addition of a map modality. With tokenized modalities, our scene sequence generation framework autoregressively predicts each scene while managing computational demands through a two-stage approach. The Temporal AutoRegressive (TAR) component captures inter-frame dynamics for each modality, while the Ordered AutoRegressive (OAR) component aligns modalities within each scene by sequentially predicting tokens in a fixed order. To maintain coherence between the map and ego-action modalities, we introduce the Action-aware Map Alignment (AMA) module, which applies a transformation based on the ego-action to keep these modalities aligned. Our framework effectively generates complex, realistic driving scenarios over extended sequences, ensuring multimodal consistency and offering fine-grained control over scenario elements. Visualizations of generated multimodal driving scenes can be found in the supplementary materials.
Poster
Bozhou Zhang · Nan Song · Xin Jin · Li Zhang

[ ExHall D ]

Abstract
End-to-end autonomous driving unifies tasks in a differentiable framework, enabling planning-oriented optimization and attracting growing attention. Current methods aggregate historical information either through dense historical bird's-eye-view (BEV) features or by querying a sparse memory bank, following paradigms inherited from detection. However, we argue that these paradigms either omit historical information in motion planning or fail to align with its multi-step nature, which requires predicting or planning multiple future time steps. In line with the philosophy of "future is a continuation of past", we propose **BridgeAD**, which reformulates motion and planning queries as multi-step queries to differentiate the queries for each future time step. This design enables the effective use of historical prediction and planning by applying them to the appropriate parts of the end-to-end system based on the time steps, which improves both perception and motion planning. Specifically, historical queries for the current frame are combined with perception, while queries for future frames are integrated with motion planning. In this way, we bridge the gap between past and future by aggregating historical insights at every time step, enhancing the overall coherence and accuracy of the end-to-end autonomous driving pipeline. Extensive experiments on the nuScenes dataset in both open-loop and closed-loop settings demonstrate …
Poster
Wenzhuo Liu · Wenshuo Wang · Yicheng Qiao · Qiannan Guo · Jiayin Zhu · Pengfei Li · Zilong Chen · Huiming Yang · Zhiwei Li · Lening Wang · Tiao Tan · Huaping Liu

[ ExHall D ]

Abstract
Advanced driver assistance systems require a comprehensive understanding of the driver's mental/physical state and traffic context, but existing works often neglect the potential benefits of joint learning between these tasks. This paper proposes MMTL-UniAD, a unified multi-modal multi-task learning framework that simultaneously recognizes driver behavior (e.g., looking around, talking), driver emotion (e.g., anxiety, happiness), vehicle behavior (e.g., parking, turning), and traffic context (e.g., traffic jam, smooth traffic). A key challenge is avoiding negative transfer between tasks, which can impair learning performance. To address this, we introduce two key components into the framework: one is a multi-axis region attention network to extract global context-sensitive features, and the other is a dual-branch multimodal embedding to learn multimodal embeddings from both task-shared and task-specific features. The former uses a multi-attention mechanism to extract task-relevant features, mitigating negative transfer caused by task-unrelated features. The latter employs a dual-branch structure to adaptively adjust task-shared and task-specific parameters, enhancing cross-task knowledge transfer while reducing task conflicts. We assess MMTL-UniAD on the AIDE dataset through a series of ablation studies and show that it outperforms state-of-the-art methods across all four tasks.
Poster
Xinhao Liu · Jintong Li · Yicheng Jiang · Niranjan Sujay · Zhicheng Yang · Juexiao Zhang · John Abanes · Jing Zhang · Chen Feng

[ ExHall D ]

Abstract
Navigating dynamic urban environments presents significant challenges for embodied agents, requiring advanced spatial reasoning and adherence to common-sense norms. Despite progress, existing visual navigation methods struggle in map-free or off-street settings, limiting the deployment of autonomous agents like last-mile delivery robots. To overcome these obstacles, we propose a scalable, data-driven approach for human-like urban navigation by training agents on thousands of hours of in-the-wild city walking and driving videos sourced from the web. We introduce a simple and scalable data processing pipeline that extracts action supervision from these videos, enabling large-scale imitation learning without costly annotations. Our model learns sophisticated navigation policies to handle diverse challenges and critical scenarios. Experimental results show that training on large-scale, diverse datasets significantly enhances navigation performance, surpassing current methods. This work shows the potential of using abundant online video data to develop robust navigation policies for embodied agents in dynamic urban settings.
Poster
Mohamed Aghzal · Xiang Yue · Erion Plaku · Ziyu Yao

[ ExHall D ]

Abstract
Despite their promise to perform complex reasoning, large language models (LLMs) have been shown to have limited effectiveness in end-to-end planning. This has inspired an intriguing question: if these models cannot plan well, can they still contribute to the planning framework as a helpful plan evaluator? In this work, we generalize this question to consider LLMs augmented with visual understanding, i.e., Vision-Language Models (VLMs). We introduce **PathEval**, a novel benchmark evaluating VLMs as plan evaluators in complex path-planning scenarios. Succeeding in the benchmark requires a VLM to be able to abstract traits of optimal paths from the scenario description, demonstrate precise low-level perception on each path, and integrate this information to decide the better path. Our analysis of state-of-the-art VLMs reveals that these models face significant challenges on the benchmark. We observe that the VLMs can precisely abstract given scenarios to identify the desired traits and exhibit mixed performance in integrating the provided information. Yet, their vision component presents a critical bottleneck, with models struggling to perceive low-level details about a path. Our experimental results show that this issue cannot be trivially addressed via end-to-end fine-tuning; rather, task-specific discriminative adaptation of these vision encoders is needed for these VLMs to …
Poster
Sheng Fan · Rui Liu · Wenguan Wang · Yi Yang

[ ExHall D ]

Abstract
Navigation instruction generation (NIG), which offers interactive feedback and guidance to humans along a trajectory, is essential for developing embodied agents capable of human-machine communication and collaboration through natural language. Early data-driven methods directly map sequences of RGB frames to route descriptions on limited datasets. While recent approaches leverage Large Language Models (LLMs) to improve NIG, they often overlook the map representation of the navigation environment, which encodes multi-view semantic and topological information along the trajectory. Instead of solely inputting textual descriptions of the map into LLMs, we propose a scene map-based prompt tuning framework for NIG, MAPInstructor, which incorporates map priors for parameter-efficient updating of LLMs. MAPInstructor consists of (i) scene representation encoding, where egocentric observations are projected into 3D voxels for finer-grained scene understanding; (ii) mapping prompt tuning, which integrates the topological map representation of the entire route into the LLM-based decoder; and (iii) landmark uncertainty assessment, which reduces hallucinations in landmark prediction and further enhances instruction generation. Extensive experiments on three navigation datasets (i.e., R2R, REVERIE, RxR) confirm the generalization and effectiveness of our framework. Our code will be released.
Poster
Manon Dampfhoffer · Thomas Mesquida · Damien Joubert · Thomas Dalgaty · Pascal Vivet · Christoph Posch

[ ExHall D ]

Abstract
Event-based cameras asynchronously detect changes in light intensity with high temporal resolution, making them a promising alternative to RGB cameras for low-latency and low-power optical flow estimation. However, state-of-the-art convolutional neural network methods create frames from the event stream, therefore losing the opportunity to exploit events for both sparse computation and low-latency prediction. On the other hand, asynchronous event graph methods could leverage both, but at the cost of avoiding any form of time accumulation, which limits prediction accuracy. In this paper, we propose to break this accuracy-latency trade-off with a novel architecture combining an asynchronous, accumulation-free event branch and a periodic aggregation branch. The periodic branch performs feature aggregation on the event graphs of past data to extract global context information, which improves accuracy without introducing any latency. The solution can predict optical flow per event with a latency of tens of microseconds on asynchronous hardware, a gain of three orders of magnitude with respect to state-of-the-art frame-based methods, with 48x fewer operations per second. We show that the solution can detect rapid motion changes faster than a periodic output. This work proposes, for the first time, an effective solution for ultra low-latency and low-power optical flow …
Poster
Enshen Zhou · Qi Su · Cheng Chi · Zhizheng Zhang · Zhongyuan Wang · Tiejun Huang · Lu Sheng · He Wang

[ ExHall D ]

Abstract
Automatic detection and prevention of open-set failures are crucial in closed-loop robotic systems. Recent studies often struggle to simultaneously identify unexpected failures reactively after they occur and prevent foreseeable ones proactively. To this end, we propose Code-as-Monitor (CaM), a novel paradigm leveraging the vision-language model (VLM) for both open-set reactive and proactive failure detection. The core of our method is to formulate both tasks as a unified set of spatio-temporal constraint satisfaction problems and use VLM-generated code to evaluate them for real-time monitoring. To enhance the accuracy and efficiency of monitoring, we further introduce constraint elements that abstract constraint-related entities or their parts into compact geometric elements. This approach offers greater generality, simplifies tracking, and facilitates constraint-aware visual programming by leveraging these elements as visual prompts. Experiments show that CaM achieves a 28.7% higher success rate and reduces execution time by 31.8% under severe disturbances compared to baselines across three simulators and a real-world setting. Moreover, CaM can be integrated with open-loop control policies to form closed-loop systems, enabling long-horizon tasks in cluttered scenes with dynamic environments.
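As a hedged illustration of the "constraint satisfaction evaluated by generated code" idea described above (not the authors' actual generated monitors), the sketch below shows what a spatio-temporal monitor over tracked geometric elements might look like; `track_elements`, the element names, and the thresholds are all hypothetical.

```python
# Hypothetical sketch of a VLM-generated spatio-temporal constraint monitor,
# in the spirit of Code-as-Monitor (not the paper's actual generated code).
# `track_elements` is an assumed perception callback returning named 3D points.
import numpy as np

def monitor_grasp_constraint(track_elements, max_gap=0.03, horizon=30):
    """Flag a failure if the gripper-object gap exceeds `max_gap` metres
    for `horizon` consecutive frames after a grasp is declared."""
    violations = 0
    while True:
        elements = track_elements()          # e.g. {"gripper": np.array([...]), "object": np.array([...])}
        if elements is None:                 # episode ended without violation
            return True
        gap = np.linalg.norm(elements["gripper"] - elements["object"])
        violations = violations + 1 if gap > max_gap else 0
        if violations >= horizon:            # constraint violated persistently
            return False                     # signal failure to the closed-loop policy
```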
Poster
Raktim Gautam Goswami · Prashanth Krishnamurthy · Yann LeCun · Farshad Khorrami

[ ExHall D ]

Abstract
Vision-based pose estimation of articulated robots with unknown joint angles has applications in collaborative robotics and human-robot interaction tasks. Current frameworks use neural network encoders to extract image features and downstream layers to predict joint angles and robot pose. While images of robots inherently contain rich information about the robot's physical structure, existing methods often fail to leverage it fully, thereby limiting performance under occlusions and truncations. To address this, we introduce RoboPEPP, a method that fuses information about the robot's physical model into the encoder using a masking-based self-supervised embedding-predictive architecture. Specifically, we mask the robot's joints and pre-train an encoder-predictor model to infer the joints' embeddings from surrounding unmasked regions, enhancing the encoder's understanding of the robot's physical model. The pre-trained encoder-predictor pair, along with joint angle and keypoint prediction networks, is then fine-tuned for pose and joint angle estimation. Random masking of the input during fine-tuning and keypoint filtering during evaluation further improve robustness. Our method, evaluated on several datasets, achieves the best results in robot pose and joint angle estimation while being the least sensitive to occlusions and requiring the lowest execution time.
Poster
Weijie Zhou · Manli Tao · Chaoyang Zhao · Haiyun Guo · Honghui Dong · Ming Tang · Jinqiao Wang

[ ExHall D ]

Abstract
Understanding the environment and a robot's physical reachability is crucial for task execution. While state-of-the-art vision-language models (VLMs) excel in environmental perception, they often generate inaccurate or impractical responses in embodied visual reasoning tasks due to a lack of understanding of robotic physical reachability. To address this issue, we propose a unified representation of physical reachability across diverse robots, i.e., Space-Physical Reachability Map (S-P Map), and PhysVLM, a vision-language model that integrates this reachability information into visual reasoning. Specifically, the S-P Map abstracts a robot's physical reachability into a generalized spatial representation, independent of specific robot configurations, allowing the model to focus on reachability features rather than robot-specific parameters. Subsequently, PhysVLM extends traditional VLM architectures by incorporating an additional feature encoder to process the S-P Map, enabling the model to reason about physical reachability without compromising its general vision-language capabilities. To train and evaluate PhysVLM, we constructed a large-scale multi-robot dataset, Phys100K, and a challenging benchmark, EQA-phys, which includes tasks for six different robots in both simulated and real-world environments. Experimental results demonstrate that PhysVLM outperforms existing models, achieving a 14% improvement over GPT-4o on EQA-phys and surpassing advanced embodied VLMs such as RoboMamba and SpatialVLM on the RoboVQA-val and …
Poster
Ruihai Wu · Ziyu Zhu · Yuran Wang · Yue Chen · Jiarui Wang · Hao Dong

[ ExHall D ]

Abstract
Cluttered garments manipulation poses significant challenges in robotics due to the complex, deformable nature of garments and intricate garment relations. Unlike single-garment manipulation, cluttered scenarios require managing complex garment entanglements and interactions, while maintaining garment cleanliness and manipulation stability. To address these demands, we propose to learn point-level affordance, the dense representation modeling the complex space and multi-modal manipulation candidates, with novel designs for the awareness of garment geometry, structure, and inter-object relations. Additionally, we introduce an adaptation module, informed by learned affordance, to reorganize cluttered garments into configurations conducive to manipulation. Our framework demonstrates effectiveness over environments featuring diverse garment types and pile scenarios in both simulation and the real world.
Poster
Jiange Yang · Haoyi Zhu · Yating Wang · Gangshan Wu · Tong He · Limin Wang

[ ExHall D ]

Abstract
Learning from multiple domains is a primary factor that influences the generalization of a single unified robot system. In this paper, we aim to learn the trajectory prediction model by using broad out-of-domain data to improve its performance and generalization ability. The trajectory model is designed to predict any-point trajectories in the current frame given an instruction and can provide detailed control guidance for robotic policy learning. To handle the diverse out-of-domain data distribution, we propose a sparsely-gated MoE (**Top-1** gating strategy) architecture for the trajectory model, coined **Tra-MoE**. The sparse activation design enables a good balance between parameter cooperation and specialization, effectively benefiting from large-scale out-of-domain data while maintaining constant FLOPs per token. In addition, we further introduce an adaptive policy conditioning technique by learning 2D mask representations for predicted trajectories, which are explicitly aligned with image observations to guide action prediction more flexibly. We perform extensive experiments in both simulation and real-world scenarios to verify the effectiveness of Tra-MoE and the adaptive policy conditioning technique. We also conduct a comprehensive empirical study to train Tra-MoE, demonstrating that Tra-MoE consistently exhibits superior performance compared to the dense baseline model, even when the latter is scaled to match Tra-MoE's parameter count.
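To make the constant-FLOPs-per-token claim concrete, here is a minimal sketch of a sparsely-gated MoE layer with Top-1 routing of the kind the abstract names; the layer sizes, expert architecture, and the routing-weight scaling are assumptions, not the paper's implementation.

```python
# Minimal sketch of a Top-1 sparsely-gated MoE layer (PyTorch).
import torch
import torch.nn as nn

class Top1MoE(nn.Module):
    def __init__(self, dim, hidden, num_experts=8):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):                      # x: (tokens, dim)
        probs = self.gate(x).softmax(dim=-1)   # routing distribution per token
        weight, idx = probs.max(dim=-1)        # Top-1: exactly one expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():                     # only the routed tokens are computed,
                out[mask] = weight[mask].unsqueeze(-1) * expert(x[mask])  # keeping FLOPs/token constant
        return out

y = Top1MoE(dim=256, hidden=1024, num_experts=8)(torch.randn(16, 256))
```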
Poster
Shijie Wu · Yihang Zhu · Yunao Huang · Kaizhen Zhu · Jiayuan Gu · Jingyi Yu · Ye Shi · Jingya Wang

[ ExHall D ]

Abstract
Diffusion-based policies have shown impressive performance in robotic manipulation tasks while struggling with out-of-domain distributions. Recent efforts attempted to enhance generalization by improving the visual feature encoding for diffusion policy. However, their generalization is typically limited to the same category with similar appearances. Our key insight is that leveraging affordances—manipulation priors that define "where" and "how" an agent interacts with an object—can substantially enhance generalization to entirely unseen object instances and categories. We introduce the Diffusion Policy with transferable Affordance (AffordDP), designed for generalizable manipulation across novel categories. AffordDP models affordances through 3D contact points and post-contact trajectories, capturing the essential static and dynamic information for complex tasks. The transferable affordance from in-domain data to unseen objects is achieved by estimating a 6D transformation matrix using foundational vision models and point cloud registration techniques. More importantly, we incorporate affordance guidance during diffusion sampling that can refine action sequence generation. This guidance directs the generated action to gradually move towards the desired manipulation for unseen objects while keeping the generated action within the manifold of action space. Experimental results from both simulated and real-world environments demonstrate that AffordDP consistently outperforms previous diffusion-based methods, successfully generalizing to unseen instances and categories where …
Poster
Xia Wenke · Ruoxuan Feng · Dong Wang · Di Hu

[ ExHall D ]

Abstract
Building a generalizable self-correction system akin to human cognition is crucial for robots to recover from failures. Despite advancements in Multimodal Large Language Models (MLLMs) that empower robots with semantic reflection abilities for failure, translating this semantic reflection into "how to correct" fine-grained robotic actions remains a significant challenge. To address this gap, we build the Phoenix framework, which leverages motion instruction as a bridge to connect high-level semantic reflection with low-level robotic action correction. In this motion-based self-reflection framework, we start with a dual-process motion adjustment mechanism with MLLMs to translate the semantic reflection into coarse-grained motion instruction adjustment. To leverage this motion instruction for guiding "how to correct" fine-grained robotic actions, a multi-task motion-conditioned diffusion policy is proposed to integrate visual observations for high-frequency robotic action correction. By combining these two models, we can shift the demand for generalization capability from the low-level manipulation policy to the MLLMs-driven motion refinement model and facilitate precise, fine-grained robotic action correction. Utilizing this framework, we further develop a continual learning method to automatically improve the model's capability from interactions with dynamic environments. The experiments conducted in both the RoboMimic simulation and real-world scenarios prove the superior generalization and robustness of our framework across a variety of manipulation tasks.
Poster
Kailin Li · Puhao Li · Tengyu Liu · Yuyang Li · Siyuan Huang

[ ExHall D ]

Abstract
Human hands are central to environmental interactions, motivating increasing research on dexterous robotic manipulation. Data-driven embodied AI algorithms demand precise, large-scale, human-like manipulation sequences, which are challenging to obtain with conventional reinforcement learning or real-world teleoperation. This paper introduces ManipTrans, a novel two-stage method for efficiently transferring human bimanual skills to dexterous robotic hands in simulation. ManipTrans first pre-trains a generalist trajectory imitator for mimicking hand motion, then fine-tunes a specialist residual module for each skill under physics-based interaction constraints. This approach decouples motion imitation from physical effects, enabling efficient learning and accurate execution of complex bimanual tasks. Experiments show ManipTrans surpasses state-of-the-art methods in success rate, fidelity, and efficiency. Using ManipTrans, we transfer multiple hand-object datasets to robotic hands, creating DexManipNet, a large-scale dataset featuring previously unexplored tasks like pen capping and bottle unscrewing. DexManipNet includes 3.3K episodes of robotic manipulation and supports easy expansion with our framework, facilitating further policy training for dexterous hands and enabling real-world applications. Both code and dataset will be publicly available.
Poster
Mingju Gao · Yike Pan · Huan-ang Gao · Zongzheng Zhang · Wenyi Li · Hao Dong · Hao Tang · Li Yi · Hao Zhao

[ ExHall D ]

Abstract
As interest grows in world models that predict future states from current observations and actions, accurately modeling part-level dynamics has become increasingly relevant for various applications. Existing approaches, such as Puppet-Master, rely on fine-tuning large-scale pre-trained video diffusion models, which are impractical for real-world use due to the limitations of 2D video representation and slow processing times. To overcome these challenges, we present PartRM, a novel 4D reconstruction framework that simultaneously models appearance, geometry, and part-level motion from multi-view images of a static object. PartRM builds upon large 3D Gaussian reconstruction models, leveraging their extensive knowledge of appearance and geometry in static objects. To address data scarcity in 4D, we introduce the PartDrag-4D dataset, providing multi-view observations of part-level dynamics across over 20,000 states. We enhance the model’s understanding of interaction conditions with a multi-scale drag embedding module that captures dynamics at varying granularities. To prevent catastrophic forgetting during fine-tuning, we implement a two-stage training process that focuses sequentially on motion and appearance learning. Experimental results show that PartRM establishes a new state-of-the-art in part-level motion learning and can be applied in manipulation tasks in robotics. Our code, data, and models will be made publicly available to facilitate future research.
Poster
Jinlu Zhang · Yixin Chen · Zan Wang · Jie Yang · Yizhou Wang · Siyuan Huang

[ ExHall D ]

Abstract
Recent advances in 3D human-centric generation have made significant progress. However, existing methods still struggle with generating novel Human-Object Interactions (HOIs), particularly for open-set objects. We identify three main challenges of this task: precise human object relation reasoning, adaptive affordance parsing for unseen objects, and realistic human pose synthesis that aligns with the description and 3D object geometry. In this work, we propose a novel zero-shot 3D HOI generation framework, leveraging the knowledge from large-scale pretrained models without training from specific datasets. More specifically, we first generate an initial human pose by sampling multiple hypotheses through multi-view SDS based on the input text and object geometry. We then utilize a pre-trained 2D image diffusion model to parse unseen objects and extract contact points, avoiding the limitations imposed by existing 3D asset knowledge. Finally, we introduce a detailed optimization to generate fine-grained, precise and natural interaction, enforcing realistic 3D contact between the involved body parts, including hands in grasp, and 3D object. This is achieved by distilling relational feedback from LLMs to capture detailed human-object relations from the text inputs. Extensive experiments validate the effectiveness of our approach compared to prior work, particularly in terms of the fine-grained nature of interactions …
Poster
Aditya Prakash · Benjamin E Lundell · Dmitry Andreychuk · David Forsyth · Saurabh Gupta · Harpreet S. Sawhney

[ ExHall D ]

Abstract
We tackle the novel problem of predicting 3D hand motion and contact maps (or Interaction Trajectories) given a single RGB view, action text, and a 3D contact point on the object as input. Our approach consists of (1) Interaction Codebook: a VQVAE model to learn a latent codebook of hand poses and contact points, effectively tokenizing interaction trajectories, and (2) Interaction Predictor: a transformer-decoder module to predict the interaction trajectory from test time inputs by using an indexer module to retrieve a latent affordance from the learned codebook. To train our model, we develop a data engine that extracts 3D hand poses and contact trajectories from the diverse HoloAssist dataset. We evaluate our model on a benchmark that is 2.5-10X larger than existing works, in terms of diversity of objects and interactions observed, and test for generalization of the model across object categories, action categories, tasks, and scenes. Experimental results show the effectiveness of our approach over transformer & diffusion baselines across all settings.
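For readers unfamiliar with how a VQVAE "tokenizes" continuous trajectories, the sketch below shows the generic vector-quantization step such an Interaction Codebook relies on; the codebook size and feature dimension are assumptions, and this is the standard VQ-VAE mechanism rather than the authors' exact model.

```python
# Illustrative sketch of vector quantization: continuous hand-pose/contact
# features are snapped to their nearest codebook entries, yielding discrete tokens.
import torch

def quantize(z, codebook):
    """z: (N, d) encoder features; codebook: (K, d) learned embeddings.
    Returns token indices (N,) and quantized vectors (N, d)."""
    d2 = torch.cdist(z, codebook)             # pairwise distances to all codes
    tokens = d2.argmin(dim=-1)                # index of nearest code = discrete token
    z_q = codebook[tokens]                    # quantized latent fed to the decoder
    z_q = z + (z_q - z).detach()              # straight-through estimator for encoder gradients
    return tokens, z_q

codebook = torch.randn(512, 64)               # assumed 512-entry codebook, 64-dim latents
tokens, z_q = quantize(torch.randn(10, 64), codebook)
```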
Poster
Liang Pan · Zeshi Yang · Zhiyang Dou · Wenjia Wang · Buzhen Huang · Bo Dai · Taku Komura · Jingbo Wang

[ ExHall D ]

Abstract
The synthesis of realistic and physically plausible human-scene interaction animations presents a critical and complex challenge in computer vision and embodied AI. Recent advances primarily focus on developing specialized character controllers for individual interaction tasks, such as contacting and carrying, often overlooking the need to establish a unified policy for versatile skills. This limitation hinders the ability to generate high-quality motions across a variety of challenging human-scene interaction tasks that require the integration of multiple skills, e.g., walking to a chair and sitting down while carrying a box. To address this issue, we present TokenHSI, a unified controller designed to synthesize various types of human-scene interaction animations. The key innovation of our framework is the use of tokenized proprioception for the simulated character, combined with various task observations, complemented by a masking mechanism that enables the selection of tasks on demand. In addition, our unified policy network is equipped with flexible input size capabilities, enabling efficient adaptation of learned foundational skills to new environments and tasks. By introducing additional input tokens to the pre-trained policy, we can not only modify interaction targets but also integrate learned skills to address diverse challenges. Overall, our framework facilitates the generation of a wide …
Poster
Yumeng Liu · Xiaoxiao Long · Zemin Yang · Yuan Liu · Marc Habermann · Christian Theobalt · Yuexin Ma · Wenping Wang

[ ExHall D ]

Abstract
Our work aims to reconstruct hand-object interactions from a single-view image, which is a fundamental but ill-posed task. Unlike methods that reconstruct from videos, multi-view images, or predefined 3D templates, single-view reconstruction faces significant challenges due to inherent ambiguities and occlusions. These challenges are further amplified by the diverse nature of hand poses and the vast variety of object shapes and sizes. Our key insight is that current foundational models for segmentation, inpainting, and 3D reconstruction robustly generalize to in-the-wild images, which could provide strong visual and geometric priors for reconstructing hand-object interactions. Specifically, given a single image, we first design a novel pipeline to estimate the underlying hand pose and object shape using off-the-shelf large models. Furthermore, with the initial reconstruction, we employ a prior-guided optimization scheme, which optimizes hand pose to comply with 3D physical constraints and the 2D input image content. We perform experiments across several datasets and show that our method consistently outperforms baselines and faithfully reconstructs a diverse set of hand-object interactions.
Poster
Anna Manasyan · Maximilian Seitzer · Filip Radovic · Georg Martius · Andrii Zadaianchuk

[ ExHall D ]

Abstract
Unsupervised object-centric learning from videos is a promising approach to extract structured representations from large, unlabeled collections of videos. To support downstream tasks like autonomous control, these representations must be both compositional and temporally consistent. Existing approaches based on recurrent processing often lack long-term stability across frames because their training objective does not enforce temporal consistency. In this work, we introduce a novel object-level temporal contrastive loss for video object-centric models that explicitly promotes temporal consistency. Our method significantly improves the temporal consistency of the learned object-centric representations, yielding more reliable video decompositions that facilitate challenging downstream tasks such as unsupervised object dynamics prediction. Furthermore, the inductive bias added by our loss strongly improves object discovery, leading to state-of-the-art results on both synthetic and real-world datasets, outperforming even weakly-supervised methods that leverage motion masks as additional cues.
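The object-level temporal contrastive loss described above can be illustrated with a minimal InfoNCE-style sketch: each object slot at frame t should be closest to the same slot at frame t+1 rather than to other objects. The shapes, temperature, and the assumption that slots are already index-matched across frames are ours, not the paper's.

```python
# Minimal sketch of an object-level temporal contrastive loss (InfoNCE form).
import torch
import torch.nn.functional as F

def temporal_slot_contrastive(slots_t, slots_t1, tau=0.1):
    """slots_t, slots_t1: (num_slots, dim) object-centric features of two consecutive frames."""
    a = F.normalize(slots_t, dim=-1)
    b = F.normalize(slots_t1, dim=-1)
    logits = a @ b.T / tau                    # similarity of every slot pair across the two frames
    targets = torch.arange(a.size(0))         # positives are the same slot index (diagonal)
    return F.cross_entropy(logits, targets)

loss = temporal_slot_contrastive(torch.randn(7, 128), torch.randn(7, 128))
```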
Poster
Sirui Xu · Dongting Li · Yucheng Zhang · Xiyan Xu · Qi Long · Ziyin Wang · Yunzhi Lu · Shuchang Dong · Hezi Jiang · Akshat Gupta · Yu-Xiong Wang · Liangyan Gui

[ ExHall D ]

Abstract
While large-scale human motion capture datasets have advanced human motion generation, modeling and generating dynamic 3D human-object interactions (HOIs) remains challenging due to dataset limitations. These datasets often lack extensive, high-quality text-interaction pair data and exhibit artifacts such as contact penetration, floating, and incorrect hand motions. To address these issues, we introduce InterAct, a large-scale 3D HOI benchmark with key contributions in both dataset and methodology. First, we consolidate 21.81 hours of HOI data from diverse sources, standardizing and enriching them with detailed textual annotations. Second, we propose a unified optimization framework that enhances data quality by minimizing artifacts and restoring hand motions. Leveraging the insight of contact invariance, we preserve human-object relationships while introducing motion variations, thereby expanding the dataset to 30.70 hours. Third, we introduce six tasks to benchmark existing methods and develop a unified HOI generative model based on multi-task learning that achieves state-of-the-art results. Extensive experiments validate the utility of our dataset as a foundational resource for advancing 3D human-object interaction generation. The dataset will be publicly accessible to support further research in the field.
Poster
Prithviraj Banerjee · Sindi Shkodrani · Pierre Moulon · Shreyas Hampali · Shangchen Han · Fan Zhang · Linguang Zhang · Jade Fountain · Edward Miller · Selen Basol · Richard Newcombe · Robert Wang · Jakob Engel · Tomas Hodan

[ ExHall D ]

Abstract
We introduce HOT3D, a publicly available dataset for egocentric hand and object tracking in 3D. The dataset offers over 833 minutes (more than 3.7M images) of multi-view RGB/monochrome image streams showing 19 subjects interacting with 33 diverse rigid objects, multi-modal signals such as eye gaze or scene point clouds, as well as comprehensive ground truth annotations including 3D poses of objects, hands, and cameras, and 3D models of hands and objects. In addition to simple pick-up/observe/put-down actions, HOT3D contains scenarios resembling typical actions in a kitchen, office, and living room environment. The dataset is recorded by two head-mounted devices from Meta: Project Aria, a research prototype of light-weight AR/AI glasses, and Quest 3, a production VR headset sold in millions of units. Ground-truth poses were obtained by a professional motion-capture system using small optical markers attached to hands and objects. Hand annotations are provided in the UmeTrack and MANO formats and objects are represented by 3D meshes with PBR materials obtained by an in-house scanner. In our experiments, we demonstrate the effectiveness of multi-view egocentric data for three popular tasks: 3D hand tracking, 6DoF object pose estimation, and 3D lifting of unknown in-hand objects. The evaluated multi-view methods, whose benchmarking …
Poster
Brent Yi · Vickie Ye · Maya Zheng · Yunqi Li · Lea Müller · Georgios Pavlakos · Yi Ma · Jitendra Malik · Angjoo Kanazawa

[ ExHall D ]

Abstract
We present EgoAllo, a system for human motion estimation from a head-mounted device. Using only egocentric SLAM poses and images, EgoAllo guides sampling from a conditional diffusion model to estimate 3D body pose, height, and hand parameters that capture a device wearer's actions in the allocentric coordinate frame of the scene. To achieve this, our key insight is in representation: we propose spatial and temporal invariance criteria for improving model performance, from which we derive a head motion conditioning parameterization that improves estimation by up to 18%. We also show how the bodies estimated by our system can improve hand estimation: the resulting kinematic and temporal constraints can reduce world-frame errors in single-frame estimates by 40%.
Poster
Huakun Liu · Hiroki Ota · Xin Wei · Yutaro Hirao · Monica Perusquia-Hernandez · Hideaki Uchiyama · Kiyoshi Kiyokawa

[ ExHall D ]

Abstract
Sparse wearable inertial measurement units (IMUs) have gained popularity for estimating 3D human motion. However, challenges such as pose ambiguity, data drift, and limited adaptability to diverse bodies persist. To address these issues, we propose UMotion, an uncertainty-driven, online fusing-all state estimation framework for 3D human shape and pose estimation, supported by six integrated, body-worn ultra-wideband (UWB) distance sensors with IMUs. UWB sensors measure inter-node distances to infer spatial relationships, aiding in resolving pose ambiguities and body shape variations when combined with anthropometric data. Unfortunately, IMUs are prone to drift, and UWB sensors are affected by body occlusions. Consequently, we develop a tightly coupled Unscented Kalman Filter (UKF) framework that fuses uncertainties from sensor data and estimated human motion based on individual body shape. The UKF iteratively refines IMU and UWB measurements by aligning them with uncertain human motion constraints in real time, producing optimal estimates for each. Experiments on both synthetic and real-world datasets demonstrate the effectiveness of UMotion in stabilizing sensor data and its improvement over the state of the art in pose accuracy. The code will be available for research purposes.
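As a hedged, drastically simplified illustration of the UKF-based IMU/UWB fusion described above (not the paper's state or measurement models), the toy sketch below fuses a constant-velocity motion model with a single UWB range using filterpy; the 1-D state and the fixed anchor are assumptions for brevity.

```python
# Toy sketch of UKF sensor fusion with filterpy; state = [position, velocity],
# measurement = one UWB range to an anchor at the origin. All models are assumed.
import numpy as np
from filterpy.kalman import UnscentedKalmanFilter, MerweScaledSigmaPoints

dt = 0.01

def fx(x, dt):                       # motion model (IMU-driven propagation in practice)
    return np.array([x[0] + dt * x[1], x[1]])

def hx(x):                           # UWB measures distance to an anchor at 0.0
    return np.array([abs(x[0])])

points = MerweScaledSigmaPoints(n=2, alpha=1e-3, beta=2.0, kappa=0.0)
ukf = UnscentedKalmanFilter(dim_x=2, dim_z=1, dt=dt, fx=fx, hx=hx, points=points)
ukf.x = np.array([0.5, 0.0])
ukf.P *= 0.1                         # initial state uncertainty
ukf.R *= 0.05 ** 2                   # UWB ranging noise
ukf.Q *= 1e-4                        # process noise (IMU drift)

for z in [0.52, 0.55, 0.60]:         # streaming UWB ranges
    ukf.predict()
    ukf.update(np.array([z]))
```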
Poster
Jihyun Lee · Weipeng Xu · Alexander Richard · Shih-En Wei · Shunsuke Saito · Shaojie Bai · Te-Li Wang · Minhyuk Sung · Tae-Kyun Kim · Jason Saragih

[ ExHall D ]

Abstract
We present a novel framework for egocentric 3D motion tracking, focusing on learning high-fidelity social motions from head-mounted camera image inputs. Our approach leverages a motion diffusion model based on cascaded body-hand diffusion denoising for accurate whole-body motion tracking. We build our network upon a modified Transformer-based architecture using windowed relative-temporal attention to achieve better generalizability to arbitrary-length motions. Additionally, we propose a novel exemplar-based identity conditioning method to further boost tracking quality when prior information (i.e., anchor poses) about the target identity is available. In experiments, our framework achieves state-of-the-art whole-body motion tracking accuracy while enabling real-time inference (> 60 FPS) through diffusion distillation. Our supplementary video demonstrates that our approach can estimate significantly more natural social motions from challenging (e.g., occluded or truncated) egocentric input observations compared to the existing state of the art.
Poster
Xiaoning Sun · Dong Wei · Huaijiang Sun · Shengxiang Hu

[ ExHall D ]

Abstract
Making accurate predictions of human motions based on historical observation is a crucial technology for robots to collaborate with humans. Existing human motion prediction methods are all built under the ideal assumption that robots can instantaneously react, which ignores the time delay introduced during data processing & analysis and future reaction planning -- jointly known as "response latency". Consequently, the predictions made within this latency period become meaningless for practical use, as part of the time has passed and the corresponding real motions have already occurred before the robot delivers its reaction. In this paper, we argue that this seemingly meaningless prediction period can, however, be leveraged to enhance prediction accuracy significantly. We propose LAL, a Latency-aware Auxiliary Learning framework, which shifts the existing "reaction instantaneous" convention into a new motion prediction paradigm with both latency compatibility and utility. The framework consists of two branches handling different tasks: the primary branch learns to directly predict the valid target (excluding the beginning latency period) based on the observation, while the auxiliary branch learns the same target but based on the reformed observation with additional latency data incorporated. A direct and effective way of auxiliary feature sharing is enforced by our tailored consistency …
Poster
Daniel Etaat · Dvij Rajesh Kalaria · Nima Rahmanian · Shankar Sastry

[ ExHall D ]

Abstract
Table tennis is beyond a game of physical skill. Competitive players overcome the challenging high-speed and highly dynamic environment by anticipating their opponent’s intent – buying themselves the necessary time to respond. In this work, we take one step towards designing such an agent. Previous works have developed systems capable of real-time table tennis gameplay, though they often do not leverage anticipation; of the works that forecast opponent actions, their approaches are limited by dataset size and variety. Our paper contributes (1) a scalable system for reconstructing monocular video of table tennis matches in 3D, (2) a real-time Transformer-based model for anticipating opponent actions, and (3) an uncertainty-aware anticipatory controller for a KUKA robot arm. We demonstrate in simulation that our policy improves the ball return rate on high-speed hits from 44.3% to 61.1% as compared to a baseline non-anticipatory policy.
Poster
Sanjay Subramanian · Evonne Ng · Lea Müller · Dan Klein · Shiry Ginosar · Trevor Darrell

[ ExHall D ]

Abstract
We present a zero-shot pose optimization method that enforces accurate physical contact constraints when estimating the 3D pose of humans. Our central insight is that since language is often used to describe physical interaction, large pretrained text-based models can act as priors on pose estimation. We can thus leverage this insight to improve pose estimation by converting natural language descriptors, generated by a large multimodal model (LMM), into tractable losses to constrain the 3D pose optimization. Despite its simplicity, our method produces surprisingly compelling pose reconstructions of people in close contact, correctly capturing the semantics of the social and physical interactions. We demonstrate that our method rivals more complex state-of-the-art approaches that require expensive human annotation of contact points and training specialized models. Moreover, unlike previous approaches, our method provides a unified framework for resolving self-contact and person-to-person contact.
Poster
Mingzhen Huang · Fu-Jen Chu · Bugra Tekin · Kevin Liang · Haoyu Ma · Weiyao Wang · Xingyu Chen · Pierre Gleize · Hongfei Xue · Siwei Lyu · Kris Kitani · Matt Feiszli · Hao Tang

[ ExHall D ]

Abstract
We introduce HOIGPT, a token-based generative method that unifies 3D hand-object interaction (HOI) perception and generation, offering the first comprehensive solution for captioning and generating high-quality 3D HOI sequences from a diverse range of conditional signals (e.g., text, objects, partial sequences). At its core, HOIGPT utilizes a large language model to predict the bidirectional transformation between HOI sequences and natural language descriptions. Given text inputs, HOIGPT generates a sequence of hand and object meshes; given (partial) HOI sequences, HOIGPT generates text descriptions and completes the sequences. To facilitate HOI understanding with a large language model, this paper introduces two key innovations: (1) a novel physically grounded HOI tokenizer, the hand-object decomposed VQ-VAE, for discretizing HOI sequences, and (2) a motion-aware language model trained to process and generate both text and HOI tokens. Extensive experiments demonstrate that HOIGPT sets new state-of-the-art performance on both text generation (+2.01% R Precision) and HOI generation (-2.56 FID) across multiple tasks and benchmarks.
Poster
Yuan Wang · Yali Li · Lixiang Li · Shengjin Wang

[ ExHall D ]

Abstract
While flourishing developments have been witnessed in text-to-motion generation, synthesizing physically realistic, controllable, language-conditioned Human Scene Interactions (HSI) remains a relatively underexplored landscape. Current HSI methods naively rely on conditional Variational AutoEncoders (cVAEs) and diffusion models. They are typically associated with **limited modalities of control signals** and **task-specific framework designs**, leading to inflexible adaptation across various interaction scenarios and descriptive-unfaithful motions in diverse 3D physical environments. In this paper, we propose HSI-GPT, a General-Purpose **Large Scene-Motion-Language Model** that applies the "next-token prediction" paradigm of Large Language Models to the HSI domain. HSI-GPT not only exhibits remarkable flexibility to accommodate diverse control signals (3D scenes, textual commands, key-frame poses, as well as scene affordances), but also seamlessly supports various HSI-related tasks (e.g., multi-modal controlled HSI generation, HSI understanding, and general motion completion in 3D scenes). First, HSI-GPT quantizes textual descriptions and human motions into discrete, LLM-interpretable tokens with multi-modal tokenizers. Inspired by multi-modal learning, we develop a recipe for aligning mixed-modality tokens into the shared embedding space of LLMs. These interaction tokens are then organized into unified instruction-following prompts, allowing our HSI-GPT to be fine-tuned on prompt-based question-and-answer tasks. Extensive experiments and visualizations validate that our general-purpose HSI-GPT model delivers exceptional performance …
Poster
Seokhyeon Hong · Chaelin Kim · Serin Yoon · Junghyun Nam · Sihun Cha · Junyong Noh

[ ExHall D ]

Abstract
Text-driven motion generation has been significantly advanced with the rise of denoising diffusion models. However, previous methods often oversimplify representations for the skeletal joints, temporal frames, and textual words, limiting their ability to fully capture the information within each modality and their interactions. Moreover, when using pre-trained models for downstream tasks, such as editing, they typically require additional efforts, including manual interventions, optimization, or fine-tuning. In this paper, we introduce skeleton-aware latent diffusion (SALAD), a model that explicitly captures the intricate inter-relationships between joints, frames, and words. Furthermore, by leveraging cross-attention maps produced during the generation process, we enable the first zero-shot text-driven motion editing using a pre-trained SALAD model, requiring no additional user input beyond text prompts. Our approach significantly outperforms previous methods in terms of text-motion alignment without compromising generation quality, and demonstrates practical versatility by providing diverse editing capabilities beyond generation.
Poster
Yabiao Wang · Shuo Wang · Jiangning Zhang · Ke Fan · Jiafu Wu · Xuezhucun Xue · Yong Liu

[ ExHall D ]

Abstract
Human-human motion generation is essential for understanding humans as social beings. Current methods fall into two main categories: single-person-based methods and separate modeling-based methods. To delve into this field, we abstract the overall generation process into a general framework MetaMotion, which consists of two phases: temporal modeling and interaction mixing. For temporal modeling, the single-person-based methods concatenate two people into a single one directly, while the separate modeling-based methods skip the modeling of interaction sequences. The inadequate modeling described above resulted in sub-optimal performance and redundant model parameters. In this paper, we introduce TIMotion (Temporal and Interactive Modeling), an efficient and effective framework for human-human motion generation. Specifically, we first propose Causal Interactive Injection to model two separate sequences as a causal sequence leveraging the temporal and causal properties. Then we present Role-Evolving Scanning to adjust to the change in the active and passive roles throughout the interaction. Finally, to generate smoother and more rational motion, we design Localized Pattern Amplification to capture short-term motion patterns. Extensive experiments on InterHuman and InterX demonstrate that our method achieves superior performance. The project code will be released upon acceptance.
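One plausible reading of "model two separate sequences as a causal sequence" is a temporal interleaving of the two persons' pose streams, so that standard causal attention sees them as one ordered sequence; the sketch below shows that interleaving only, with assumed shapes, and is not claimed to be the paper's exact Causal Interactive Injection module.

```python
# Sketch: interleave two persons' motion features into one causal token sequence.
import torch

def causal_interactive_injection(motion_a, motion_b):
    """motion_a, motion_b: (T, D) per-person pose features.
    Returns a (2T, D) sequence ordered a_1, b_1, a_2, b_2, ..., a_T, b_T."""
    T, D = motion_a.shape
    merged = torch.stack([motion_a, motion_b], dim=1)   # (T, 2, D)
    return merged.reshape(2 * T, D)

seq = causal_interactive_injection(torch.randn(60, 256), torch.randn(60, 256))
```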
Poster
Qihang Fang · Chengcheng Tang · Bugra Tekin · Shugao Ma · Yanchao Yang

[ ExHall D ]

Abstract
We present HuMoCon, a novel motion-video understanding framework designed for advanced human behavior analysis. The core of our method is a human motion concept discovery framework that efficiently trains multi-modal encoders to extract semantically meaningful and generalizable features. HuMoCon addresses key challenges in motion concept discovery for understanding and reasoning, including the lack of explicit multi-modality feature alignment and the loss of high-frequency information in masked autoencoding frameworks. Our approach integrates a feature alignment strategy that leverages video for contextual understanding and motion for fine-grained interaction modeling, further with a velocity reconstruction mechanism to enhance high-frequency feature expression and mitigate temporal over-smoothing. Comprehensive experiments on standard benchmarks demonstrate that HuMoCon enables effective motion concept discovery and significantly outperforms state-of-the-art methods in training large models for human motion understanding.
Poster
Jiayi Gao · Zijin Yin · Changcheng Hua · Yuxin Peng · Kongming Liang · Zhanyu Ma · Jun Guo · Yang Liu

[ ExHall D ]

Abstract
The development of Text-to-Video (T2V) generation has made motion transfer possible, enabling the control of video motion based on existing footage. However, current techniques struggle to preserve the diversity and accuracy of motion while transferring it to subjects with varying shapes. Additionally, they fail to handle videos with multiple subjects, making it difficult to disentangle and control motion transfer to specific instances. To overcome these challenges, we introduce ConMo, a training-free framework that disentangles and recomposes the motions of subjects and camera movements. Unlike existing methods, ConMo separates individual subject motions and background motions from compound motion trajectories in complex videos, allowing for precise control and flexible manipulation, such as resizing and repositioning. This leads to more accurate motion transfer across subjects of different shapes and better handling of videos with multiple subjects. Extensive experiments demonstrate that ConMo can effectively achieve the aforementioned applications and surpasses existing state-of-the-art methods in terms of motion fidelity and temporal consistency. The code will be published.
Poster
Mingzhe Guo · Weiping Tan · Wenyu Ran · Liping Jing · Zhipeng Zhang

[ ExHall D ]

Abstract
Aiming to achieve class-agnostic perception in visual object tracking, current trackers commonly formulate tracking as a one-shot detection problem with the template-matching architecture. Despite the success, severe environmental variations in long-term tracking raise challenges to generalizing the tracker in novel situations. Temporal trackers try to fix it by preserving the time-validity of target information with historical predictions, e.g., updating the template. However, solely transmitting the previous observations instead of learning from them leads to an inferior capability of understanding the tracking scenario from past experience, which is critical for the generalization in new frames. To address this issue, we reformulate temporal learning in visual tracking as a History-to-Future process and propose a novel tracking framework DreamTrack. Our DreamTrack learns the temporal dynamics from past observations to dream the future variations of the environment, which boosts the generalization with the extended future information from history. Considering the uncertainty of future variation, multimodal prediction is designed to infer the target trajectory of each possible future situation. Without bells and whistles, DreamTrack achieves leading performance with real-time inference speed. In particular, DreamTrack obtains SUC scores of 76.6%/87.9% on LaSOT/TrackingNet, surpassing all recent SOTA trackers.
Poster
Seokju Cho · Gabriel Huang · Seungryong Kim · Joon-Young Lee

[ ExHall D ]

Abstract
Accurate depth estimation from monocular videos remains challenging due to ambiguities inherent in single-view geometry, as crucial depth cues like stereopsis are absent. However, humans often perceive relative depth intuitively by observing variations in the size and spacing of objects as they move. Inspired by this, we propose a novel method that infers relative depth by examining the spatial relationships and temporal evolution of a set of tracked 2D trajectories. Specifically, we use off-the-shelf point tracking models to capture 2D trajectories. Then, our approach employs spatial and temporal transformers to process these trajectories and directly infer depth changes over time. Evaluated on the TAPVid-3D benchmark, our method demonstrates robust zero-shot performance, generalizing effectively from synthetic to real-world datasets. Results indicate that our approach achieves temporally smooth, high-accuracy depth predictions across diverse domains.
Poster
Jiaqi Li · Yiran Wang · Jinghong Zheng · Junrui Zhang · Liao Shen · Tianqi Liu · Zhiguo Cao

[ ExHall D ]

Abstract
Depth estimation is a fundamental task in 3D vision. An ideal depth estimation model is expected to embrace meticulous detail, temporal consistency, and high efficiency. Although existing foundation models can perform well in certain specific aspects, most of them fall short of fulfilling all the above requirements simultaneously. In this paper, we present CH3Depth, an efficient and flexible model for depth estimation with flow matching to address this challenge. Specifically, 1) we reframe the optimization objective of flow matching as a scale-variable velocity field to improve accuracy. 2) To enhance efficiency, we propose non-uniform sampling to achieve better prediction with fewer sampling steps. 3) We design the Latent Temporal Stabilizer (LTS) to enhance temporal consistency by aggregating latent codes of adjacent frames, enabling our method to be lightweight and compatible for video depth estimation. CH3Depth achieves state-of-the-art performance in zero-shot evaluations across multiple image and video datasets, excelling in prediction accuracy, efficiency, and temporal consistency, highlighting its potential as the next foundation model for depth estimation.
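The abstract's combination of flow matching with non-uniform sampling can be illustrated generically: integrate a learned velocity field from noise toward data with step sizes that are not evenly spaced. The sketch below is an assumed Euler integrator with a cosine-spaced time grid; `velocity_net`, its signature, and the schedule are illustrative assumptions, not the CH3Depth implementation.

```python
# Sketch: flow-matching sampling with a non-uniform (cosine-spaced) step schedule.
import torch

@torch.no_grad()
def sample_flow(velocity_net, x, num_steps=8):
    # non-uniform time grid from t=0 (noise) to t=1 (data); steps cluster near t=0
    ts = 1.0 - torch.cos(torch.linspace(0.0, torch.pi / 2, num_steps + 1))
    for t0, t1 in zip(ts[:-1], ts[1:]):
        t_batch = t0 * torch.ones(x.size(0))        # broadcast current time to the batch
        v = velocity_net(x, t_batch)                # predicted velocity field at time t0
        x = x + (t1 - t0) * v                       # Euler step over the non-uniform interval
    return x
```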
Poster
Bingxin Ke · Dominik Narnhofer · Shengyu Huang · Lei Ke · Torben Peters · Katerina Fragkiadaki · Anton Obukhov · Konrad Schindler

[ ExHall D ]

Abstract
Video depth estimation lifts monocular video clips to 3D by inferring dense depth at every frame. Recent advances in single-image depth estimation, brought about by the rise of large foundation models and the use of synthetic training data, have fueled a renewed interest in video depth. However, naively applying a single-image depth estimator to every frame of a video disregards temporal continuity, which not only leads to flickering but may also break when camera motion causes sudden changes in depth range. An obvious and principled solution would be to build on top of video foundation models, but these come with their own limitations, including expensive training and inference, imperfect 3D consistency, and stitching routines for the fixed-length (short) outputs. We take a step back and demonstrate how to turn a single-image latent diffusion model (LDM) into a state-of-the-art video depth estimator. Our model, which we call **RollingDepth**, has two main ingredients: (i) a multi-frame depth estimator that is derived from a single-image LDM and maps very short video snippets (typically frame triplets) to depth snippets, and (ii) a robust, optimization-based registration algorithm that optimally assembles depth snippets sampled at various different frame rates back into a consistent video. RollingDepth is able …
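To give a feel for what "registering" depth snippets means, here is a toy least-squares version: fit a per-snippet scale and shift so that depths agree on the frames two snippets share, then apply that correction to the whole snippet. This is a simplification for illustration, not RollingDepth's actual optimizer.

```python
# Toy sketch: align a new depth snippet to an already-registered one via scale/shift.
import numpy as np

def align_snippet(depth_new, depth_ref_overlap, depth_new_overlap):
    """Fit s, t minimizing || s * depth_new_overlap + t - depth_ref_overlap ||^2,
    then apply them to the whole new snippet."""
    x = depth_new_overlap.ravel()
    y = depth_ref_overlap.ravel()
    A = np.stack([x, np.ones_like(x)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return s * depth_new + t

ref = np.random.rand(2, 64, 64) + 1.0           # already-registered snippet frames
new = 0.5 * ref + 0.2                           # same geometry, different scale/shift
aligned = align_snippet(new, ref[-1], new[-1])  # register on the shared overlapping frame
```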
Poster
Wonyong Seo · Jihyong Oh · Munchurl Kim

[ ExHall D ]

Abstract
Existing Video Frame Interpolation (VFI) models tend to suffer from time-to-location ambiguity when trained with videos of non-uniform motions, such as accelerating, decelerating, and changing directions, which often yield blurred interpolated frames. In this paper, we propose (i) a novel motion description map, the Bidirectional Motion field (BiM), to effectively describe non-uniform motions; (ii) a BiM-guided Flow Net (BiMFN) with a Content-Aware Upsampling Network (CAUN) for precise optical flow estimation; and (iii) Knowledge Distillation for VFI-centric Flow supervision (KDVCF) to supervise the motion estimation of the VFI model with VFI-centric teacher flows. The proposed VFI is called the Bidirectional Motion field-guided VFI (BiM-VFI) model. Extensive experiments show that our BiM-VFI model significantly surpasses recent state-of-the-art VFI methods by 26% and 45% improvements in LPIPS and STLPIPS respectively, yielding interpolated frames with much fewer blurs at arbitrary time instances.
Poster
Shiyi Liang · Yifan Bai · Yihong Gong · Xing Wei

[ ExHall D ]

Abstract
Recent advancements in visual object tracking have shifted towards a sequential generation paradigm, where object deformation and motion exhibit strong temporal dependencies. Despite the importance of these dependencies, widely adopted image-level pretrained backbones barely capture the dynamics in the consecutive video, which is the essence of tracking. Thus, we propose AutoRegressive Generative Pretraining (ARG), an unsupervised spatio-temporal learner, via generating the evolution of object appearance and motion in video sequences. Our method leverages a diffusion model to autoregressively generate the future frame appearance, conditioned on historical embeddings extracted by a general encoder. Furthermore, to ensure trajectory coherence, the same encoder is employed to learn trajectory consistency by generating coordinate sequences in a reverse autoregressive fashion, a process we term back-tracking. Further, we integrate the pretrained ARG into ARTrackV2, creating ARGTrack, which is further fine-tuned for tracking tasks. ARGTrack achieves state-of-the-art performance across multiple benchmarks, becoming the first tracker to surpass 80% AO on GOT-10k, while maintaining high efficiency. These results demonstrate the effectiveness of our approach in capturing temporal dependencies for continuous video tracking. The code will be released soon.
Poster
Yuyang Huang · Yabo Chen · Li Ding · Xiaopeng Zhang · Wenrui Dai · Junni Zou · Hongkai Xiong · Qi Tian

[ ExHall D ]

Abstract
Controllability of video generation has recently drawn attention in addition to the quality of generated videos. The main challenge of controllable video generation is to synthesize videos based on user-specified instance spatial locations and movement trajectories. However, existing methods suffer from a dilemma among resource consumption, generation quality, and user controllability. While zero-shot video generation is an efficient alternative to prohibitively expensive training-based generation, existing zero-shot methods cannot generate high-quality and motion-consistent videos under the control of layouts and movement trajectories. In this paper, we propose a novel zero-shot method named IM-Zero that improves instance-level motion-controllable video generation with enhanced control accuracy, motion consistency, and richness of details to address this problem. Specifically, we first present a motion generation stage that extracts motion and textural guidance from keyframe candidates produced by a pre-trained grounded text-to-image model to generate the desired coarse motion video. Subsequently, we develop a video refinement stage that injects the motion priors of pre-trained text-to-video models and the detail priors of pre-trained text-to-image models into the latents of coarse motion videos to further enhance video motion consistency and richness of details. To the best of our knowledge, IM-Zero is the first to simultaneously achieve high-quality video generation and allow control over both …
Poster
Hyeonho Jeong · Chun-Hao P. Huang · Jong Chul Ye · Niloy J. Mitra · Duygu Ceylan

[ ExHall D ]

Abstract
While recent foundational video generators produce visually rich output, they still struggle with appearance drift, where objects gradually degrade or change inconsistently across frames, breaking visual coherence. We hypothesize that this is because there is no explicit supervision in terms of spatial tracking at the feature level. We propose Track4Gen, a spatially aware video generator that combines video diffusion loss with point tracking across frames, providing enhanced spatial supervision on the diffusion features. Track4Gen merges the video generation and point tracking tasks into a single network by making minimal changes to existing video generation architectures. Using Stable Video Diffusion as a backbone, Track4Gen demonstrates that it is possible to unify video generation and point tracking, which are typically handled as separate tasks. Our extensive evaluations show that Track4Gen effectively reduces appearance drift, resulting in temporally stable and visually coherent video generation. Anonymous project page: https://anonymous1anonymous2.github.io/
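A hypothetical sketch of the joint objective the abstract describes: a standard denoising loss plus a tracking term that asks diffusion features at corresponding points in different frames to stay consistent. Shapes, the anchoring-to-frame-0 choice, and the weighting are illustrative assumptions, not Track4Gen's exact formulation.

```python
# Illustrative joint loss: video diffusion loss + point-tracking supervision on features.
import torch
import torch.nn.functional as F

def track4gen_style_loss(pred_noise, true_noise, feats, tracks, w_track=0.1):
    """
    pred_noise, true_noise: (B, T, C, H, W) denoiser output vs. target noise
    feats:  (B, T, D, H, W) intermediate diffusion features
    tracks: (B, N, T, 2) integer (x, y) locations of N tracked points per frame
    """
    diff_loss = F.mse_loss(pred_noise, true_noise)

    B, N, T, _ = tracks.shape
    b = torch.arange(B)[:, None, None].expand(B, N, T)
    t = torch.arange(T)[None, None, :].expand(B, N, T)
    x, y = tracks[..., 0], tracks[..., 1]
    point_feats = feats[b, t, :, y, x]              # (B, N, T, D) features along each track
    # assumption: features along a track should stay close to that track's frame-0 feature
    anchor = point_feats[:, :, :1].detach()
    track_loss = F.mse_loss(point_feats, anchor.expand_as(point_feats))

    return diff_loss + w_track * track_loss
```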
Poster
Xin Ma · Yaohui Wang · Gengyun Jia · Xinyuan Chen · Tien-Tsin Wong · Yuan-Fang Li · Cunjian Chen

[ ExHall D ]

Abstract
Diffusion models have achieved great progress in image animation due to their powerful generative capabilities. However, maintaining spatio-temporal consistency with the detailed information of the input static image over time (e.g., its style, background, and objects) and ensuring motion smoothness in animated video narratives guided by textual prompts remain challenging. In this paper, we introduce Cinemo, a novel image animation approach towards achieving better image consistency and motion smoothness. In general, we propose two effective strategies at the training and inference stages of Cinemo to accomplish our goal. At the training stage, Cinemo focuses on learning the distribution of motion residuals, rather than directly predicting subsequent frames with a motion diffusion model. At the inference stage, a noise refinement technique based on the discrete cosine transformation is introduced to mitigate sudden motion changes. Such strategies enable Cinemo to produce highly consistent, smooth, and motion-controllable results. Compared to previous methods, Cinemo offers simpler and more precise user controllability. Extensive experiments against several state-of-the-art methods, including both commercial tools and research approaches, across multiple metrics, demonstrate the effectiveness and superiority of our proposed approach.
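A speculative sketch of a DCT-based noise-refinement step in the spirit of the abstract: keep the low-frequency DCT band of a reference latent (derived from the input image) and the high-frequency band of freshly sampled noise. The cutoff fraction and the exact recipe are assumptions, not the paper's specification.

```python
# Mix low frequencies of a reference latent with high frequencies of sampled noise via DCT.
import numpy as np
from scipy.fft import dctn, idctn

def dct_refine_noise(noise, reference, low_freq_frac=0.25):
    """noise, reference: (C, H, W) arrays; returns refined noise of the same shape."""
    C, H, W = noise.shape
    kh, kw = int(H * low_freq_frac), int(W * low_freq_frac)
    out = np.empty_like(noise)
    for c in range(C):
        n_dct = dctn(noise[c], norm="ortho")
        r_dct = dctn(reference[c], norm="ortho")
        mixed = n_dct.copy()
        mixed[:kh, :kw] = r_dct[:kh, :kw]      # low frequencies come from the reference
        out[c] = idctn(mixed, norm="ortho")
    return out

refined = dct_refine_noise(np.random.randn(4, 32, 32),
                           np.random.randn(4, 32, 32))
```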
Poster
Peiqing Yang · Shangchen Zhou · Jixin Zhao · Qingyi Tao · Chen Change Loy

[ ExHall D ]

Abstract
Auxiliary-free human video matting methods, which rely solely on input frames, often struggle with complex or ambiguous backgrounds. To tackle this, we propose MatAnyone, a practical framework designed for target-assigned video matting. Specifically, building on a memory-based framework, we introduce a consistent memory propagation module via region-adaptive memory fusion, which adaptively combines memory from the previous frame. This ensures stable semantic consistency in core regions while maintaining fine details along object boundaries. For robust training, we present a larger, high-quality, and diverse dataset for video matting. Additionally, we incorporate a novel training strategy that efficiently leverages large-scale segmentation data, further improving matting stability. With this new network design, dataset, and training strategy, MatAnyone delivers robust, accurate video matting in diverse real-world scenarios, outperforming existing methods. The code and model will be publicly available.
Poster
Zhongrui Yu · Martina Megaro-Boldini · Robert Sumner · Abdelaziz Djelouah

[ ExHall D ]

Abstract
Extending the field of view of video content beyond its original version has many applications: immersive viewing experience with VR devices, reformatting 4:3 legacy content to today's viewing conditions with wide screens, or simply extending vertically captured phone videos. Many existing works focus on synthesizing the video using generative models only. Despite promising results, this strategy currently seems limited in terms of quality. In this work, we address this problem using two key ideas: 3D-supported outpainting for the static regions of the images, and leveraging a pre-trained video diffusion model to ensure realistic and temporally coherent results, particularly for the dynamic parts. In the first stage, we iterate between image outpainting and updating the 3D scene representation, for which we use 3D Gaussian Splatting. Then we consider dynamic objects independently per frame and inpaint missing pixels. Finally, we propose a denoising scheme that allows us to maintain known reliable regions and update the dynamic parts to obtain temporally realistic results. We achieve state-of-the-art video outpainting. This is validated quantitatively and through a user study. We are also able to extend the field of view far beyond the limits reached by existing methods.
Poster
Zhaoyi Tian · Feifeng Wang · Shiwei Wang · Zihao Zhou · Yao Zhu · Liquan Shen

[ ExHall D ]

Abstract
Recently, learned video compression (LVC) has been undergoing a period of rapid development. However, due to the absence of large, high-quality high dynamic range (HDR) video training data, LVC on HDR video remains unexplored. In this paper, we are the first to collect a large-scale HDR video benchmark dataset, named HDRVD2K, featuring a large quantity of videos, diverse scenes, and multiple motion types. HDRVD2K fills the gap in video training data and facilitates the development of LVC on HDR videos. Based on HDRVD2K, we further propose the first learned bit-depth scalable video compression (LBSVC) network for HDR videos by effectively exploiting bit-depth redundancy between videos of multiple dynamic ranges. To achieve this, we first propose a compression-friendly bit-depth enhancement module (BEM) to effectively predict original HDR videos based on compressed tone-mapped low dynamic range (LDR) videos and a dynamic range prior, instead of reducing redundancy only through spatio-temporal predictions. Our method greatly improves the reconstruction quality and compression performance on HDR videos. Extensive experiments demonstrate the effectiveness of HDRVD2K for learned HDR video compression and the strong compression performance of our proposed LBSVC network.
Poster
Wei Jiang · Junru Li · Kai Zhang · Li zhang

[ ExHall D ]

Abstract
In Learned Video Compression (LVC), improving inter prediction, such as enhancing temporal context mining and mitigating accumulated errors, is crucial for boosting rate-distortion performance. Existing LVCs mainly focus on mining temporal movements while neglecting non-local correlations among frames. Additionally, current contextual video compression models use a single reference frame, which is insufficient for handling complex movements. To address these issues, we propose leveraging non-local correlations across multiple frames to enhance temporal priors, significantly boosting rate-distortion performance. To mitigate error accumulation, we introduce a partial cascaded fine-tuning strategy that supports fine-tuning on full-length sequences with constrained computational resources. This method reduces the train-test mismatch in sequence lengths and significantly decreases accumulated errors. Based on the proposed techniques, we present a video compression scheme, ECVC. Experiments demonstrate that ECVC achieves state-of-the-art performance, reducing 10.5 and 11.5 more bit-rate than the previous SOTA method DCVC-FM over VTM-13.2 low delay B (LDB) under intra periods (IP) of 32 and 1, respectively.
Poster
Gang He · Weiran Wang · Guancheng Quan · Shihao Wang · Dajiang Zhou · Yunsong Li

[ ExHall D ]

Abstract
Quality degradation from video compression manifests both spatially along texture edges and temporally with continuous motion changes. Despite recent advances, extracting aligned spatiotemporal information from adjacent frames remains challenging. This is mainly due to limitations in receptive field size and computational complexity, which make existing methods struggle to efficiently enhance video quality. To address this issue, we propose RivuletMLP, an MLP-based network architecture. Specifically, our framework first employs a Spatiotemporal Dynamic Alignment (STDA) module to adaptively explore and align multi-frame feature information. Subsequently, we introduce two modules for feature reconstruction: a Spatiotemporal Flow Module (SFM) and a Benign Selection Compensation Module (BSCM). The SFM establishes non-local dependencies through an innovative feature permutation mechanism, recovering details while reducing computational cost. Additionally, the BSCM utilizes a collaborative learning strategy in both the spatial and frequency domains to alleviate compression-induced inter-frame motion discontinuities. Experimental results demonstrate that RivuletMLP achieves superior computational efficiency while maintaining powerful reconstruction capabilities.
Poster
Feng Liu · Shiwei Zhang · Xiaofeng Wang · Yujie Wei · Haonan Qiu · Yuzhong Zhao · Yingya Zhang · Qixiang Ye · Fang Wan

[ ExHall D ]

Abstract
As a fundamental backbone for video generation, diffusion models are challenged by low inference speed due to the sequential nature of denoising. Previous methods speed up the models by caching and reusing model outputs at uniformly selected timesteps. However, such a strategy neglects the fact that differences among model outputs are not uniform across timesteps, which hinders selecting the appropriate model outputs to cache, leading to a poor balance between inference efficiency and visual quality. In this study, we introduce Timestep Embedding Aware Cache (TeaCache), a training-free caching approach that estimates and leverages the fluctuating differences among model outputs across timesteps. Rather than directly using the time-consuming model outputs, TeaCache focuses on model inputs, which have a strong correlation with the model outputs while incurring negligible computational cost. TeaCache first modulates the noisy inputs using the timestep embeddings to ensure their differences better approximate those of the model outputs. TeaCache then introduces a rescaling strategy to refine the estimated differences and utilizes them to indicate output caching. Experiments show that TeaCache achieves up to 4.41× acceleration over Open-Sora-Plan with negligible (-0.07% VBench score) degradation of visual quality. Code is enclosed in the supplementary material.
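A minimal, hypothetical sketch of the caching idea in the abstract: estimate how much the timestep-embedding-modulated input has changed since the last real forward pass, and reuse the cached output while the accumulated change stays small. The modulation form, threshold, and accumulation rule are placeholders, not TeaCache's actual estimator or rescaling polynomial.

```python
# Skip the expensive denoiser call when the modulated input has barely changed.
import torch

class TimestepAwareCache:
    def __init__(self, model, threshold=0.1):
        self.model, self.threshold = model, threshold
        self.prev_mod_input = None
        self.cached_output = None
        self.accum = 0.0

    def __call__(self, x, t_emb):
        # assumption: t_emb is a per-sample scalar timestep embedding of shape (B,)
        mod_input = x * (1 + t_emb.view(-1, 1, 1, 1))
        if self.prev_mod_input is not None and self.cached_output is not None:
            rel_diff = ((mod_input - self.prev_mod_input).abs().mean()
                        / self.prev_mod_input.abs().mean()).item()
            self.accum += rel_diff
            self.prev_mod_input = mod_input
            if self.accum < self.threshold:
                return self.cached_output          # reuse, skip the model call
        self.prev_mod_input = mod_input
        self.accum = 0.0
        self.cached_output = self.model(x, t_emb)  # recompute and refresh the cache
        return self.cached_output
```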
Poster
Mingzhen Sun · Weining Wang · Li · Jiawei Liu · Jiahui Sun · Wanquan Feng · Shanshan Lao · SiYu Zhou · Qian HE · Jing Liu

[ ExHall D ]

Abstract
The task of video generation requires synthesizing visually realistic and temporally coherent video frames. Existing methods primarily use asynchronous auto-regressive models or synchronous diffusion models to address this challenge. However, asynchronous auto-regressive models often suffer from inconsistencies between training and inference, leading to issues such as error accumulation, while synchronous diffusion models are limited by their reliance on rigid sequence length. To address these issues, we introduce Auto-Regressive Diffusion (AR-Diffusion), a novel model that combines the strengths of auto-regressive and diffusion models for flexible, asynchronous video generation. Specifically, our approach leverages diffusion to gradually corrupt video frames in both training and inference, reducing the discrepancy between these phases. Inspired by auto-regressive generation, we incorporate a non-decreasing constraint on the corruption timesteps of individual frames, ensuring that earlier frames remain clearer than subsequent ones. This setup, together with temporal causal attention, enables flexible generation of videos with varying lengths while preserving temporal coherence. In addition, we design two specialized timestep schedulers: the FoPP scheduler for balanced timestep sampling during training, and the AD scheduler for flexible timestep differences during inference, supporting both synchronous and asynchronous generation. Extensive experiments demonstrate the superiority of our proposed method, by achieving competitive and state-of-the-art results …
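An illustrative sketch (not the paper's FoPP or AD schedulers) of the constraint the abstract describes: sample one corruption timestep per frame such that timesteps are non-decreasing over the frame index, so earlier frames stay cleaner than later ones.

```python
# Sample per-frame diffusion timesteps under a non-decreasing constraint.
import torch

def sample_nondecreasing_timesteps(num_frames, num_train_steps, batch_size=1):
    # draw unordered timesteps, then sort along the frame axis
    t = torch.randint(0, num_train_steps, (batch_size, num_frames))
    return torch.sort(t, dim=1).values      # (B, T), non-decreasing in frame index

ts = sample_nondecreasing_timesteps(num_frames=8, num_train_steps=1000, batch_size=2)
assert (ts[:, 1:] >= ts[:, :-1]).all()      # earlier frames never noisier than later ones
```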
Poster
Deyu Zhou · Quan Sun · Yuang Peng · Kun Yan · Runpei Dong · Duomin Wang · Zheng Ge · Nan Duan · Xiangyu Zhang

[ ExHall D ]

Abstract
We introduce Masked Autoregressive Video Generation (MAGI), a hybrid framework that combines the strengths of masked and causal modeling paradigms to achieve efficient video generation. MAGI utilizes masked modeling for intra-frame generation and causal modeling to capture temporal dependencies across frames. A key innovation is our Standard Teacher Forcing paradigm, which conditions masked frames on complete observation frames, enabling a smooth transition from token-level to frame-level autoregressive modeling. To mitigate issues like exposure bias, we incorporate targeted training strategies, setting a new benchmark for autoregressive video generation. Extensive experiments demonstrate that MAGI can generate long, coherent video sequences of over 100 frames, even when trained on as few as 16 frames, establishing its potential for scalable, high-quality video generation.
Poster
Shijun Shi · Jing Xu · Lijing Lu · Zhihang Li · Kai Hu

[ ExHall D ]

Abstract
Existing diffusion-based video super-resolution (VSR) methods are susceptible to introducing complex degradations and noticeable artifacts into high-resolution videos due to their inherent randomness. In this paper, we propose a noise-robust real-world VSR framework by incorporating self-supervised learning and Mamba into pre-trained latent diffusion models. To ensure content consistency across adjacent frames, we enhance the diffusion model with a global spatio-temporal attention mechanism using the Video State-Space block with a 3D Selective Scan module, which reinforces coherence at an affordable computational cost. To further reduce artifacts in generated details, we introduce a self-supervised ControlNet that leverages HR features as guidance and employs contrastive learning to extract degradation-insensitive features from LR videos. Finally, a three-stage training strategy based on a mixture of HR-LR videos is proposed to stabilize VSR training. The proposed Self-supervised ControlNet with Spatio-Temporal Continuous Mamba-based VSR algorithm achieves superior perceptual quality compared with state-of-the-art methods on real-world VSR benchmark datasets, validating the effectiveness of the proposed model design and training strategies.
Poster
Zonghui Guo · YingJie Liu · Jie Zhang · Haiyong Zheng · Shiguang Shan

[ ExHall D ]

Abstract
Face Forgery Video Detection (FFVD) is a critical yet challenging task in determining whether a digital facial video is authentic or forged. Existing FFVD methods typically focus on isolated spatial or coarsely fused spatiotemporal information, failing to leverage temporal forgery cues and thus resulting in unsatisfactory performance. We strive to unravel these cues across three progressive levels: momentary anomaly, gradual inconsistency, and cumulative distortion. Accordingly, we design a consecutive correlate module to capture momentary anomaly cues by correlating interactions among consecutive frames. Then, we devise a future guide module to unravel inconsistency cues by iteratively aggregating historical anomaly cues and gradually propagating them into future frames. Finally, we introduce a historical review module that unravels distortion cues via momentum accumulation from future to historical frames. These three modules form our Temporal Forgery Cue Unraveling (TFCU) framework, sequentially highlighting spatial discriminative features by unraveling temporal forgery cues bidirectionally between historical and future frames. Extensive experiments and ablation studies demonstrate the effectiveness of our TFCU method, achieving state-of-the-art performance across diverse unseen datasets and manipulation methods. Code will be made publicly available.
Poster
Zhiyao Wang · Xu Chen · Chengming Xu · Junwei Zhu · Xiaobin Hu · Jiangning Zhang · Chengjie Wang · Yuqi Liu · Yiyi Zhou · Rongrong Ji

[ ExHall D ]

Abstract
Face Restoration (FR) is a crucial area within image and video processing, focusing on reconstructing high-quality portraits from degraded inputs. Despite advancements in image FR, video FR remains relatively under-explored, primarily due to challenges related to temporal consistency, motion artifacts, and the limited availability of high-quality video data. Moreover, traditional face restoration typically prioritizes enhancing resolution and may not give as much consideration to related tasks such as facial colorization and inpainting. In this paper, we propose a novel approach for the Generalized Video Face Restoration (GVFR) task, which integrates video BFR, inpainting, and colorization tasks that we empirically show to benefit each other. We present a unified framework, termed Stable Video Face Restoration (SVFR), which leverages the generative and motion priors of Stable Video Diffusion (SVD) and incorporates task-specific information through a unified face restoration framework. A learnable task embedding is introduced to enhance task identification. Meanwhile, a novel Unified Latent Regularization (ULR) is employed to encourage shared feature representation learning among the different subtasks. To further enhance the restoration quality and temporal stability, we introduce facial prior learning and self-referred refinement as auxiliary strategies used for both training and inference. The proposed framework effectively combines …
Poster
Xin Zhang · Xue Yang · Yuxuan Li · Jian Yang · Ming-Ming Cheng · Xiang Li

[ ExHall D ]

Abstract
Rotated object detection has made significant progress in optical remote sensing. However, advancements in the Synthetic Aperture Radar (SAR) field lag behind, primarily due to the absence of a large-scale dataset. Annotating such a dataset is inefficient and costly. A promising solution is to employ a weakly supervised model (e.g., trained with available horizontal boxes only) to generate pseudo-rotated boxes for reference before manual calibration. Unfortunately, existing weakly supervised models exhibit limited accuracy in predicting the object's angle. Previous works attempt to enhance angle prediction by using angle resolvers that decouple angles into cosine and sine encodings. In this work, we first reevaluate these resolvers from a unified perspective of dimension mapping and expose that they share the same shortcoming: they overlook the unit cycle constraint inherent in these encodings, easily leading to prediction biases. To address this issue, we propose the Unit Cycle Resolver (UCR), which incorporates a unit circle constraint loss to improve angle prediction accuracy. Our approach can effectively improve the performance of existing state-of-the-art weakly supervised methods and even surpasses fully supervised models on existing optical benchmarks (i.e., the DOTA-v1.0 dataset). With the aid of UCR, we further annotate and introduce RSAR, the …
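A minimal sketch of a unit-circle constraint on (cos, sin) angle encodings, in the spirit of the abstract: penalize predicted encodings whose norm deviates from 1. The loss weight and how this term is combined with the detection loss are assumptions.

```python
# Constrain predicted (cos, sin) angle encodings to lie near the unit circle.
import torch

def unit_circle_loss(pred_cos, pred_sin):
    """pred_cos, pred_sin: tensors of predicted angle encodings."""
    norm_sq = pred_cos ** 2 + pred_sin ** 2
    return ((norm_sq - 1.0) ** 2).mean()

def decode_angle(pred_cos, pred_sin):
    return torch.atan2(pred_sin, pred_cos)   # angle recovered from the encoding

c = torch.randn(16, requires_grad=True)
s = torch.randn(16, requires_grad=True)
loss = unit_circle_loss(c, s)
loss.backward()
```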
Poster
Minh Kha Do · Kang Han · Phu Lai · Khoa T. Phan · Wei Xiang

[ ExHall D ]

Abstract
Foundation models for remote sensing have garnered increasing attention for their strong performance across various observation tasks. However, current models lack robustness when managing diverse input types and handling incomplete data in downstream tasks. In this paper, we propose RobSense, a robust multi-modal foundation model for Multi-spectral and Synthetic Aperture Radar data. RobSense is designed with modular components and pre-trained by a combination of temporal multi-modal and masked autoencoder strategies on a large-scale dataset. It can therefore effectively support diverse input types, from static to temporal and uni-modal to multi-modal. To further handle incomplete data, we incorporate two uni-modal latent reconstructors to recover rich representations from incomplete inputs, addressing variability in spectral bands and temporal sequence irregularities. Extensive experiments show that RobSense consistently outperforms state-of-the-art baselines on the complete dataset across four input types for segmentation and classification tasks. Furthermore, the proposed model outperforms the baselines by considerably larger margins as the missing rate increases on incomplete datasets.
Poster
Yuanye Liu · jinyang liu · Renwei Dian · Shutao Li

[ ExHall D ]

Abstract
Hyperspectral fusion imaging is challenged by high computational cost due to the abundant spectral information. We find that pixels in regions with smooth spatial-spectral structure can be reconstructed well using a shallow network, while only those in regions with complex spatial-spectral structure require a deeper network. However, existing methods process all pixels uniformly, which ignores this property. To leverage this property, we propose a Selective Re-learning Fusion Network (SRLF) that initially extracts features from all pixels uniformly and then selectively refines distorted feature points. Specifically, SRLF first employs a Preliminary Fusion Module with robust global modeling capability to generate a preliminary fusion feature. Afterward, it applies a Selective Re-learning Module to focus on improving distorted feature points in the preliminary fusion feature. To achieve targeted learning, we present a novel Spatial-Spectral Structure-Guided Selective Re-learning Mechanism (SSG-SRL) that integrates the observation model to identify the feature points with spatial or spectral distortions. Only these distorted points are sent to the corresponding re-learning blocks, reducing both computational cost and the risk of overfitting. Finally, we develop an SRLF-Net, composed of multiple cascaded SRLFs, which surpasses multiple state-of-the-art methods on several datasets with minimal computational cost.
Poster
Jie Huang · Haorui Chen · Jiaxuan Ren · Siran Peng · Liang-Jian Deng

[ ExHall D ]

Abstract
Currently, deep learning-based methods for remote sensing pansharpening have advanced rapidly. However, many existing methods struggle to fully leverage feature heterogeneity and redundancy, thereby limiting their effectiveness. To address these challenges across two key dimensions, we introduce a general adaptive dual-level weighting mechanism (ADWM), designed to enhance a wide range of existing deep-learning methods. First, Intra-Feature Weighting (IFW) evaluates correlations among channels within each feature and selectively weights them to reduce redundancy and enhance unique information. Second, Cross-Feature Weighting (CFW) adjusts contributions across layers based on inter-layer correlations, refining the final output by preserving key distinctions across feature depths. This dual-level weighting is efficiently implemented through our proposed Correlation-Aware Covariance Weighting (CACW), which generates weights by utilizing the correlations captured within the covariance matrix. Extensive experiments demonstrate the superior performance of ADWM compared to recent state-of-the-art (SOTA) methods. Furthermore, we validate the effectiveness of our approach through generality experiments, ablation studies, comparison experiments, and detailed visual analysis.
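A hypothetical sketch of channel weighting driven by a covariance/correlation matrix, loosely following the Intra-Feature Weighting idea above: channels that are highly correlated with many others (i.e. redundant) receive lower weight. The mapping from correlation to weight used here is an assumption, not the paper's CACW design.

```python
# Down-weight redundant channels based on the channel correlation matrix.
import torch

def covariance_channel_weights(feat, eps=1e-6):
    """feat: (B, C, H, W) -> per-channel weights of shape (B, C, 1, 1)."""
    B, C, H, W = feat.shape
    x = feat.flatten(2)                           # (B, C, HW)
    x = x - x.mean(dim=2, keepdim=True)
    cov = x @ x.transpose(1, 2) / (H * W - 1)     # (B, C, C) channel covariance
    std = cov.diagonal(dim1=1, dim2=2).clamp_min(eps).sqrt()
    corr = cov / (std.unsqueeze(2) * std.unsqueeze(1))
    redundancy = corr.abs().mean(dim=2)           # (B, C): mean |corr| with other channels
    weights = torch.softmax(-redundancy, dim=1) * C   # emphasize less-redundant channels
    return weights.view(B, C, 1, 1)

feat = torch.randn(2, 8, 16, 16)
reweighted = feat * covariance_channel_weights(feat)
```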
Poster
Haowen Bai · Jiangshe Zhang · Zixiang Zhao · Yichen Wu · Lilun Deng · Yukun Cui · Tao Feng · Shuang Xu

[ ExHall D ]

Abstract
Multi-modal image fusion aggregates information from multiple sensor sources, achieving superior visual quality and perceptual characteristics compared to any single source, often enhancing downstream tasks. However, current fusion methods for downstream tasks still use predefined fusion objectives that potentially mismatch the downstream tasks, limiting adaptive guidance and reducing model flexibility. To address this, we propose Task-driven Image Fusion (TDFusion), a fusion framework incorporating a learnable fusion loss guided by task loss. Specifically, our fusion loss includes learnable parameters modeled by a neural network called the loss generation module. This module is supervised by the loss of downstream tasks in a meta-learning manner. The learning objective is to minimize the task loss of the fused images, once the fusion module has been optimized by the fusion loss. Iterative updates between the fusion module and the loss module ensure that the fusion network evolves toward minimizing task loss, guiding the fusion process toward the task objectives. TDFusion’s training relies solely on the loss of downstream tasks, making it adaptable to any specific task. It can be applied to any architecture of fusion and task networks. Experiments demonstrate TDFusion’s performance in both fusion and task-related applications, including four public fusion datasets, semantic segmentation, and object detection. The code will be …
Poster
Guanglu Dong · Tianheng Zheng · Yuanzhouhan Cao · Linbo Qing · Chao Ren

[ ExHall D ]

Abstract
Recently, deep image deraining models based on paired datasets have made remarkable progress. However, they cannot be well applied in real-world applications due to the difficulty of obtaining real paired datasets and their poor generalization performance. In this paper, we propose a novel Channel Consistency Prior and Self-Reconstruction Strategy Based Unsupervised Image Deraining framework, CSUD, to tackle the aforementioned challenges. During training with unpaired data, CSUD is capable of generating high-quality pseudo clean and rainy image pairs which are used to enhance the performance of the deraining network. Specifically, to preserve more image background details while transferring rain streaks from rainy images to the unpaired clean images, we propose a novel Channel Consistency Loss (CCLoss) by introducing the Channel Consistency Prior (CCP) of rain streaks into the training process, thereby ensuring that the generated pseudo rainy images closely resemble the real ones. Furthermore, we propose a novel Self-Reconstruction (SR) strategy to alleviate the redundant information transfer problem of the generator, further improving the deraining performance and the generalization capability of our method. Extensive experiments on multiple synthetic and real-world datasets demonstrate that CSUD surpasses other state-of-the-art unsupervised methods in deraining performance and exhibits superior generalization capability.
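A speculative sketch of a channel-consistency-style loss: under the assumption (mine, not necessarily the paper's exact prior) that rain streaks are close to achromatic, the residual between a generated pseudo-rainy image and the clean image should look nearly the same in each color channel.

```python
# Penalize disagreement between color channels of the estimated rain-streak layer.
import torch

def channel_consistency_loss(pseudo_rainy, clean):
    """pseudo_rainy, clean: (B, 3, H, W) tensors in [0, 1]."""
    rain_layer = pseudo_rainy - clean                      # estimated streak layer
    mean_ch = rain_layer.mean(dim=1, keepdim=True)         # per-pixel channel mean
    return (rain_layer - mean_ch).abs().mean()             # channel disagreement penalty

loss = channel_consistency_loss(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64))
```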
Poster
Gehui Li · Bin Chen · Chen Zhao · Lei Zhang · Jian Zhang

[ ExHall D ]

Abstract
Exposure correction is a fundamental problem in computer vision and image processing. Recently, frequency domain-based methods have achieved impressive improvements, yet they still struggle with complex real-world scenarios under extreme exposure conditions. This is due to the local convolutional receptive fields failing to model long-range dependencies in the spectrum, and the non-generative learning paradigm being inadequate for retrieving lost details from severely degraded regions. In this paper, we propose Omnidirectional Spectral Mamba (OSMamba), a novel exposure correction network that incorporates the advantages of state space models and generative diffusion models to address these limitations. Specifically, OSMamba introduces an omnidirectional spectral scanning mechanism that adapts Mamba to the frequency domain to capture comprehensive long-range dependencies in both the amplitude and phase spectra of deep image features, hence enhancing illumination correction and structure recovery. Furthermore, we develop a dual-domain prior generator that learns from well-exposed images to generate a degradation-free diffusion prior containing correct information about severely under- and over-exposed regions for better detail restoration. Extensive experiments on multiple-exposure and mixed-exposure datasets demonstrate that the proposed OSMamba achieves state-of-the-art performance both quantitatively and qualitatively.
Poster
Boyun Li · Haiyu Zhao · Wenxin Wang · Peng Hu · Yuanbiao Gou · Xi Peng

[ ExHall D ]

Abstract
Recent advancements in Mamba have shown promising results in image restoration. These methods typically flatten 2D images into multiple distinct 1D sequences along rows and columns, process each sequence independently using selective scan operation, and recombine them to form the outputs. However, such a paradigm overlooks two vital aspects: i) the local relationships and spatial continuity inherent in natural images, and ii) the discrepancies among sequences unfolded through totally different ways. To overcome the drawbacks, we explore two problems in Mamba-based restoration methods: i) how to design a scanning strategy preserving both locality and continuity while facilitating restoration, and ii) how to aggregate the distinct sequences unfolded in totally different ways. To address these problems, we propose a novel Mamba-based Image Restoration model (MaIR), which consists of Nested S-shaped Scanning strategy (NSS) and Sequence Shuffle Attention block (SSA). Specifically, NSS preserves locality and continuity of the input images through the stripe-based scanning region and the S-shaped scanning path, respectively. SSA aggregates sequences through calculating attention weights within the corresponding channels of different sequences. Thanks to NSS and SSA, MaIR surpasses 40 baselines across 14 challenging datasets, achieving state-of-the-art performance on the tasks of image super-resolution, denoising, deblurring and dehazing. Our …
Poster
Yuhui Quan · Tianxiang Zheng · Zhiyuan Ma · Hui Ji

[ ExHall D ]

Abstract
The blind-spot principle has been a widely used tool in zero-shot image denoising but faces challenges with real-world noise that exhibits strong local correlations. Existing methods focus on reducing noise correlation, such as masking or reshuffling, which also weaken the pixel correlations needed for estimating missing pixels, resulting in performance degradation. In this paper, we first present a rigorous analysis of how noise correlation and pixel correlation impact the statistical risk of a linear blind-spot denoiser. We then propose using an implicit neural representation to resample noisy pixels, effectively reducing noise correlation while preserving the essential pixel correlations for successful blind-spot denoising. Extensive experiments show our method surpasses existing zero-shot denoising techniques on real-world noisy images.
Poster
Hang Xu · Jie Huang · Wei Yu · Jiangtong Tan · Zhen Zou · Feng Zhao

[ ExHall D ]

Abstract
Blind Super-Resolution (blind SR) aims to enhance the model's generalization ability under unknown degradations, yet it still encounters severe overfitting issues. Some previous methods inspired by dropout, which enhances generalization by regularizing features, have shown promising results in blind SR. Nevertheless, these methods focus solely on regularizing features before the final layer and overlook the need for generalization in features at intermediate layers. Without explicit regularization of features at intermediate layers, the blind SR network struggles to obtain well-generalized feature representations. However, the key challenge is that directly applying dropout to intermediate layers leads to a significant performance drop, which we attribute to the training-testing inconsistency and the cross-layer inconsistency it introduces. Therefore, we propose Adaptive Dropout, a new regularization method for blind SR models, which mitigates the inconsistency and facilitates application across intermediate layers of networks. Specifically, for the training-testing inconsistency, we re-design the form of dropout and integrate the features before and after dropout adaptively. For the inconsistency in generalization requirements across different layers, we innovatively design an adaptive training strategy to strengthen feature propagation by layer-wise annealing. Experimental results show that our method outperforms all past regularization methods on both synthetic and real-world benchmark datasets, and is also highly effective in other …
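A minimal sketch of the "integrate features before and after dropout" idea from the abstract. The blending weight and the layer-wise annealing schedule shown here are assumptions for illustration, not the paper's exact design.

```python
# Blend the dropped-out feature with the original feature instead of using dropout alone.
import torch
import torch.nn as nn

class AdaptiveDropout(nn.Module):
    def __init__(self, p=0.5, blend=0.5):
        super().__init__()
        self.drop = nn.Dropout2d(p)
        self.blend = blend          # could be annealed per layer or per epoch

    def forward(self, x):
        if not self.training:
            return x                # identity at test time: no train-test gap
        return self.blend * self.drop(x) + (1.0 - self.blend) * x

# e.g. deeper layers get a weaker dropout effect via layer-wise annealing (assumed schedule)
layers = [AdaptiveDropout(p=0.5, blend=0.5 * (0.8 ** i)) for i in range(4)]
x = torch.randn(2, 16, 8, 8)
for layer in layers:
    x = layer.train()(x)
```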
Poster
Haijin Zeng · Xiangming Wang · Yongyong Chen · Jingyong Su · Jie Liu

[ ExHall D ]

Abstract
Dynamic image degradations, including noise, blur and lighting inconsistencies, pose significant challenges in image restoration, often due to sensor limitations or adverse environmental conditions. Existing Deep Unfolding Networks (DUNs) offer stable restoration performance but require manual selection of degradation matrices for each degradation type, limiting their adaptability across diverse scenarios. To address this issue, we propose the Vision-Language-guided Unfolding Network (VLU-Net), a unified DUN framework for handling multiple degradation types simultaneously. VLU-Net leverages a Vision-Language Model (VLM) refined on degraded image-text pairs to align image features with degradation descriptions, selecting the appropriate transform for the target degradation. By integrating an automatic VLM-based gradient estimation strategy into the Proximal Gradient Descent (PGD) algorithm, VLU-Net effectively tackles complex multi-degradation restoration tasks while maintaining interpretability. Furthermore, we design a hierarchical feature unfolding structure to enhance the VLU-Net framework, efficiently synthesizing degradation patterns across various levels. VLU-Net is the first all-in-one DUN framework and outperforms current leading one-by-one and all-in-one end-to-end methods by 3.74 dB on the SOTS dehazing dataset and 1.70 dB on the Rain100L deraining dataset.
Poster
Xingyuan Li · Zirui Wang · Yang Zou · Zhixin Chen · Jun Ma · Zhiying Jiang · Long Ma · Jinyuan Liu

[ ExHall D ]

Abstract
Infrared imaging is essential for autonomous driving and robotic operations as a supportive modality due to its reliable performance in challenging environments. Despite its popularity, the limitations of infrared cameras, such as low spatial resolution and complex degradations, consistently challenge imaging quality and subsequent visual tasks. Hence, infrared image super-resolution (IISR) has been developed to address this challenge. While recent developments in diffusion models have greatly advanced this field, current methods either ignore the unique modal characteristics of infrared imaging or overlook machine perception requirements. To bridge these gaps, we propose DifIISR, an infrared image super-resolution diffusion model optimized for visual quality and perceptual performance. Our approach achieves task-based guidance for diffusion by injecting gradients derived from visual and perceptual priors into the noise during the reverse process. Specifically, we introduce an infrared thermal spectrum distribution regulation to preserve visual fidelity, ensuring that the reconstructed infrared images closely align with high-resolution images by matching their frequency components. Subsequently, we incorporate various visual foundation models as perceptual guidance for downstream visual tasks, infusing generalizable perceptual features beneficial for detection and segmentation. As a result, our approach achieves superior visual results while attaining state-of-the-art downstream task performance.
Poster
Haina Qin · Wenyang Luo · Bing Li · Weiming Hu · libin wang · DanDan Zheng · Jingdong Chen · Ming Yang

[ ExHall D ]

Abstract
Image restoration aims to recover high-quality (HQ) images from degraded low-quality (LQ) ones by reversing the effects of degradation. Existing generative models for image restoration, including diffusion and score-based models, often treat the degradation process as a stochastic transformation, which introduces inefficiency and complexity. In this work, we propose ResFlow, a novel image restoration framework that models the degradation process as a deterministic path using continuous normalizing flows. ResFlow augments the degradation process with an auxiliary process that disambiguates the uncertainty in HQ prediction to enable reversible modeling of the degradation process. ResFlow adopts entropy-preserving flow paths and learns the augmented degradation flow by matching the velocity field. ResFlow significantly improves the performance and speed of image restoration, completing the task in fewer than four sampling steps. Extensive experiments demonstrate that ResFlow achieves state-of-the-art results across various image restoration benchmarks, offering a practical and efficient solution for real-world applications. Our code will be made publicly available.
Poster
Siyang Wang · Naishan Zheng · Jie Huang · Feng Zhao

[ ExHall D ]

Abstract
Generative models trained on extensive high-quality datasets effectively capture the structural and statistical properties of clean images, rendering them powerful priors for transforming degraded features into clean ones in image restoration. VAR, a novel image generative paradigm, surpasses diffusion models in generation quality by applying a next-scale prediction approach. It progressively captures both global structures and fine-grained details through the autoregressive process, consistent with the multi-scale restoration principle widely acknowledged in the restoration community. Furthermore, we observe that during the image reconstruction process utilizing VAR, scale predictions automatically modulate the input, facilitating the alignment of representations at subsequent scales with the distribution of clean images. To harness VAR's adaptive distribution alignment capability in image restoration tasks, we formulate the multi-scale latent representations within VAR as the restoration prior, thus advancing our delicately designed VarFormer framework. The strategic application of these priors enables our VarFormer to achieve remarkable generalization on unseen tasks while also reducing training computational costs. Extensive experiments underscore that our VarFormer outperforms existing multi-task image restoration methods across various restoration tasks.
Poster
Chunyi Li · Yuan Tian · Xiaoyue Ling · Zicheng Zhang · Haodong Duan · Haoning Wu · Ziheng Jia · Xiaohong Liu · Xiongkuo Min · Guo Lu · Weisi Lin · Guangtao Zhai

[ ExHall D ]

Abstract
Image Quality Assessment (IQA) based on human subjective preferences has undergone extensive research in the past decades. However, with the development of communication protocols, the visual data consumption volume of machines has gradually surpassed that of humans. For machines, the preference depends on downstream tasks such as segmentation and detection, rather than visual appeal. Considering the huge gap between human and machine visual systems, this paper proposes the topic: **Image Quality Assessment for Machine Vision** for the first time. Specifically, we (1) defined the subjective preferences of machines, including downstream tasks, test models, and evaluation metrics; (2) established the Machine Preference Database (MPD), which contains 2.25M fine-grained annotations and 30k reference/distorted image pair instances; (3) verified the performance of mainstream IQA algorithms on MPD. Experiments show that current IQA metrics are human-centric and cannot accurately characterize machine preferences. We sincerely hope that MPD can promote the evolution of IQA from human to machine preferences.
Poster
Sidi Yang · Binxiao Huang · Yulun Zhang · Dahai Yu · Yujiu Yang · Ngai Wong

[ ExHall D ]

Abstract
While deep neural networks have revolutionized image denoising capabilities, their deployment on edge devices remains challenging due to substantial computational and memory requirements. To this end, we present DnLUT, an ultra-efficient lookup table-based framework that achieves high-quality color image denoising with minimal resource consumption. Our key innovation lies in two complementary components: a Pairwise Channel Mixer (PCM) that effectively captures inter-channel correlations and spatial dependencies in parallel, and a novel L-shaped convolution design that maximizes receptive field coverage while minimizing storage overhead. By converting these components into optimized lookup tables post-training, DnLUT achieves remarkable efficiency, requiring only 500KB of storage and 0.1% of the energy consumption of its CNN counterpart DnCNN, while delivering 20× faster inference. Extensive experiments demonstrate that DnLUT outperforms all existing LUT-based methods by over 1 dB in PSNR, establishing a new state-of-the-art in resource-efficient color image denoising.
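A toy sketch of the general LUT-for-inference idea behind such methods: a tiny network's response to every quantized input pattern is precomputed once into a table, so inference becomes a gather instead of a convolution. The patch size, bit depth, and the stand-in "network" below are toy choices, not DnLUT's actual components.

```python
# Precompute a lookup table over all quantized 2x2 patterns, then denoise by indexing.
import numpy as np

BITS = 4                                    # 4-bit quantized inputs
LEVELS = 2 ** BITS

def tiny_net(patch):                        # stand-in for a trained denoising net
    return patch.mean()                     # here: just a box filter

# build the table over all 2x2 patterns: LEVELS**4 entries
grid = np.stack(np.meshgrid(*[np.arange(LEVELS)] * 4, indexing="ij"), -1)
lut = np.apply_along_axis(tiny_net, -1, grid / (LEVELS - 1)).reshape(-1)

def lut_denoise(img):
    """img: (H, W) float in [0, 1]; returns the per-pixel LUT output."""
    q = np.clip((img * (LEVELS - 1)).round().astype(int), 0, LEVELS - 1)
    p = np.pad(q, ((0, 1), (0, 1)), mode="edge")
    idx = (((p[:-1, :-1] * LEVELS + p[:-1, 1:]) * LEVELS
            + p[1:, :-1]) * LEVELS + p[1:, 1:])
    return lut[idx]

out = lut_denoise(np.random.rand(32, 32))
```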
Poster
Jiamin Xu · Yuxin Zheng · Zelong Li · Chi Wang · Renshu Gu · Weiwei Xu · Gang Xu

[ ExHall D ]

Abstract
Achieving high-quality shadow removal with strong generalizability is challenging in scenes with complex global illumination. Due to the limited diversity in shadow removal datasets, current methods are prone to overfitting training data, often leading to reduced performance on unseen cases. To address this, we leverage the rich visual priors of a pre-trained Stable Diffusion (SD) model and propose a two-stage fine-tuning pipeline to adapt the SD model for stable and efficient shadow removal. In the first stage, we fix the VAE and fine-tune the denoiser in latent space, which yields substantial shadow removal but may lose some high-frequency details. To resolve this, we introduce a second stage, called the detail injection stage. This stage selectively extracts features from the VAE encoder to modulate the decoder, injecting fine details into the final results. Experimental results show that our method outperforms state-of-the-art shadow removal techniques. The cross-dataset evaluation further demonstrates that our method generalizes effectively to unseen data, enhancing the applicability of shadow removal methods.
Poster
Haonan Zhao · Qingyang Liu · Xinhao Tao · Li Niu · Guangtao Zhai

[ ExHall D ]

Abstract
Image composition involves integrating a foreground object into a background image to obtain a composite image. One of the key challenges is to produce realistic shadows for the inserted foreground object. Recently, diffusion-based methods have shown superior performance compared to GAN-based methods in shadow generation. However, they still struggle to generate shadows with plausible geometry in complex cases. In this paper, we focus on promoting diffusion-based methods by leveraging geometry priors. Specifically, we first predict the rotated bounding box and matched shadow shapes for the foreground shadow. Then, the geometry information of the rotated bounding box and matched shadow shapes is injected into ControlNet to facilitate shadow generation. Extensive experiments on both the DESOBAv2 dataset and real composite images validate the effectiveness of our proposed method.
Poster
Liangbin Xie · Daniil Pakhomov · Zhonghao Wang · Zongze Wu · Ziyan Chen · Yuqian Zhou · Haitian Zheng · Zhifei Zhang · Zhe Lin · Jiantao Zhou · Chao Dong

[ ExHall D ]

Abstract
This paper introduces TurboFill, a fast image inpainting model that augments a few-step text-to-image diffusion model with an inpainting adapter to achieve high-quality and efficient inpainting. While standard diffusion models can generate high quality results, they incur high computational cost. We address this limitation by directly training an inpainting adapter on a few-step distilled text-to-image model, specifically DMD2, with a novel 3-step adversarial training scheme to ensure realistic, structurally consistent, and visually harmonious inpainted regions. To evaluate TurboFill, we propose two benchmarks: DilationBench, which assesses model performance across variable mask sizes, and HumanBench, based on human feedback for complex prompts. Experiments show that TurboFill significantly outperforms both existing diffusion-based inpainting models and few-step inpainting methods, setting a new benchmark for practical high-performance inpainting tasks.
Poster
Donghui Feng · Zhengxue Cheng · Shen Wang · Ronghua Wu · Hongwei Hu · Guo Lu · Li Song

[ ExHall D ]

Abstract
In recent years, learned image compression has made tremendous progress, achieving impressive coding efficiency. Its coding gain mainly comes from non-linear neural network-based transforms and learnable entropy modeling. However, most recent efforts have focused solely on a strong backbone, and few studies consider low-complexity design. In this paper, we propose LALIC, linear attention modeling for learned image compression. Specifically, we propose to use Bi-RWKV blocks, utilizing the Spatial Mix and Channel Mix modules to achieve more compact feature extraction, and apply the Conv-based Omni-Shift module to adapt to two-dimensional latent representations. Furthermore, we propose an RWKV-based Spatial-Channel ConTeXt model (RWKV-SCCTX) that leverages Bi-RWKV to effectively model the correlation between neighboring features, further improving the RD performance. To our knowledge, ours is the first work to utilize efficient Bi-RWKV models with linear attention for learned image compression. Experimental results demonstrate that our method achieves competitive RD performance, outperforming VTM-9.1 by -14.84%, -15.20%, and -17.32% in BD-rate on the Kodak, Tecnick, and CLIC Professional validation datasets.
Poster
Hao Xu · Xiaolin Wu · Xi Zhang

[ ExHall D ]

Abstract
Recent research has explored integrating lattice vector quantization (LVQ) into learned image compression models. Due to its more efficient Voronoi covering of the vector space than scalar quantization (SQ), LVQ achieves better rate-distortion (R-D) performance than SQ, while still retaining the low complexity advantage of SQ. However, existing LVQ-based methods have two shortcomings: 1) the lack of a multirate coding mode, making them incapable of operating at different rates; and 2) the use of a fixed lattice basis, making them nonadaptive to changing source distributions. To overcome these shortcomings, we propose a novel adaptive LVQ method, which is the first among LVQ-based methods to achieve both rate and domain adaptation. By scaling the lattice basis vector, our method can adjust the density of lattice points to achieve various bit rate targets, achieving superior R-D performance over current SQ-based variable-rate models. Additionally, by using a learned invertible linear transformation between two different input domains, we can reshape the predefined lattice cell to better represent the target domain, further improving the R-D performance. To our knowledge, this paper represents the first attempt to propose a unified solution for rate adaptation and domain adaptation through quantizer design.
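An illustrative sketch of rate adaptation by scaling a lattice basis: the same basis, scaled by a factor s, yields a coarser or finer lattice, trading rate against distortion. Babai-style rounding is used as a simple approximate nearest-lattice-point step, and the hexagonal A2 basis is chosen only as an example; neither is the paper's specific quantizer.

```python
# Quantize vectors to a scaled lattice; smaller scale = finer lattice = higher rate.
import numpy as np

A2 = np.array([[1.0, 0.5],
               [0.0, np.sqrt(3) / 2]])      # columns = lattice basis vectors

def lvq_quantize(x, basis, scale):
    """x: (N, d) vectors -> (approximate) nearest points of the scaled lattice."""
    B = basis * scale
    coords = np.rint(np.linalg.solve(B, x.T))   # round the lattice coordinates
    return (B @ coords).T

x = np.random.randn(1000, 2)
for s in (0.25, 0.5, 1.0):
    xq = lvq_quantize(x, A2, s)
    mse = np.mean((x - xq) ** 2)
    n_codes = len({tuple(v) for v in np.round(xq, 6)})
    print(f"scale={s:4.2f}  mse={mse:.4f}  distinct codes used={n_codes}")
```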
Poster
Jinrui Yang · Qing Liu · Yijun Li · Soo Ye Kim · Daniil Pakhomov · Mengwei Ren · Jianming Zhang · Zhe Lin · Cihang Xie · Yuyin Zhou

[ ExHall D ]

Abstract
Recent advancements in large generative models, particularly diffusion-based methods, have significantly enhanced the capabilities of image editing. However, achieving precise control over image composition tasks remains a challenge. Layered representations, which allow for independent editing of image components, are essential for user-driven content creation, yet existing approaches often struggle to decompose an image into plausible layers with accurately retained transparent visual effects such as shadows and reflections. We propose LayerDecomp, a generative framework for image layer decomposition which outputs photorealistic clean backgrounds and high-quality transparent foregrounds with faithfully preserved visual effects. To enable effective training, we first introduce a dataset preparation pipeline that automatically scales up simulated multi-layer data with synthesized visual effects. To further enhance real-world applicability, we supplement this simulated dataset with camera-captured images containing natural visual effects. Additionally, we propose a consistency loss which enforces the model to learn accurate representations for the transparent foreground layer when ground-truth annotations are not available. Our method achieves superior quality in layer decomposition, outperforming existing approaches in object removal and spatial editing tasks across several benchmarks and multiple user studies, unlocking various creative possibilities for layer-wise image editing.
Poster
Dar-Yen Chen · Hmrishav Bandyopadhyay · Kai Zou · Yi-Zhe Song

[ ExHall D ]

Abstract
We introduce NitroFusion, a fundamentally different approach to single-step diffusion that achieves high-quality generation through a dynamic adversarial framework. While one-step methods offer dramatic speed advantages, they typically suffer from quality degradation compared to their multi-step counterparts. Just as a panel of art critics provides comprehensive feedback by specializing in different aspects like composition, color, and technique, our approach maintains a large pool of specialized discriminator heads that collectively guide the generation process. Each discriminator group develops expertise in specific quality aspects at different noise levels, providing diverse feedback that enables high-fidelity one-step generation. Our framework combines: (i) a dynamic discriminator pool with specialized discriminator groups to improve generation quality, (ii) strategic refresh mechanisms to prevent discriminator overfitting, and (iii) global-local discriminator heads for multi-scale quality assessment, together with unconditional/conditional training for balanced generation. Additionally, our framework uniquely supports flexible deployment through bottom-up refinement, allowing users to dynamically choose between 1-4 denoising steps with the same model for direct quality-speed trade-offs. Through comprehensive experiments, we demonstrate that NitroFusion significantly outperforms existing single-step methods across multiple evaluation metrics, particularly excelling in preserving fine details and global consistency.
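A minimal sketch of a pooled-discriminator setup in the spirit of the abstract: many lightweight heads, each assigned to a band of noise levels, jointly score a sample, and a random subset of heads is periodically re-initialized ("refreshed"). The head architecture, band assignment, and refresh rate are assumptions, not NitroFusion's actual design.

```python
# A pool of small discriminator heads routed by noise band, with periodic refresh.
import torch
import torch.nn as nn

class DiscriminatorPool(nn.Module):
    def __init__(self, feat_dim=128, n_heads=16, n_noise_bands=4):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(feat_dim, 1) for _ in range(n_heads))
        self.n_bands = n_noise_bands

    def forward(self, feats, noise_band):
        # route each sample to the heads owning its noise band, average their scores
        per_band = len(self.heads) // self.n_bands
        owned = self.heads[noise_band * per_band:(noise_band + 1) * per_band]
        return torch.stack([h(feats) for h in owned], dim=0).mean(dim=0)

    def refresh(self, frac=0.25):
        # re-initialize a random subset of heads to combat discriminator overfitting
        k = max(1, int(frac * len(self.heads)))
        for i in torch.randperm(len(self.heads))[:k]:
            self.heads[int(i)].reset_parameters()

pool = DiscriminatorPool()
score = pool(torch.randn(8, 128), noise_band=2)
pool.refresh()
```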
Poster
Lianghui Zhu · Zilong Huang · Bencheng Liao · Jun Hao Liew · Hanshu Yan · Jiashi Feng · Xinggang Wang

[ ExHall D ]

Abstract
Diffusion models with large-scale pre-training have achieved significant success in the field of visual content generation, particularly exemplified by Diffusion Transformers (DiT). However, DiT models have faced challenges with quadratic complexity efficiency, especially when handling long sequences. In this paper, we aim to incorporate the sub-quadratic modeling capability of Gated Linear Attention (GLA) into the 2D diffusion backbone. Specifically, we introduce Diffusion Gated Linear Attention Transformers (DiG), a simple, adoptable solution with minimal parameter overhead. We offer two variants, i.e., plain and U-shaped architectures, showing superior efficiency and competitive effectiveness. In addition to superior performance over DiT and other sub-quadratic-time diffusion models at 256×256 resolution, DiG demonstrates greater efficiency than these methods starting from a resolution of 512. Specifically, DiG-S/2 is 2.5× faster and saves 75.7% of GPU memory compared to DiT-S/2 at a resolution of 1792. Additionally, DiG-XL/2 is 4.2× faster than the Mamba-based model at a resolution of 1024 and 1.8× faster than DiT with FlashAttention-2 at a resolution of 2048.
Poster
Jacob Yeung · Andrew Luo · Gabriel Sarch · Margaret Marie Henderson · Deva Ramanan · Michael J. Tarr

[ ExHall D ]

Abstract
While computer vision models have made incredible strides in static image recognition, they still do not match human performance in tasks that require the understanding of complex, dynamic motion. This is notably true for real-world scenarios where embodied agents face complex and motion-rich environments. Our approach leverages state-of-the-art video diffusion models to decouple static image representation from motion generation, enabling us to utilize fMRI brain activity for a deeper understanding of human responses to dynamic visual stimuli. Conversely, we also demonstrate that information about the brain's representation of motion can enhance the prediction of optical flow in artificial systems. Our novel approach leads to four main findings: (1) Visual motion, represented as fine-grained, object-level resolution optical flow, can be decoded from brain activity generated by participants viewing video stimuli; (2) Video encoders outperform image-based models in predicting video-driven brain activity; (3) Brain-decoded motion signals enable realistic video reanimation based only on the initial frame of the video; and (4) We extend prior work to achieve full video decoding from video-driven brain activity. This framework advances our understanding of how the brain represents spatial and temporal information in dynamic visual scenes. Our findings demonstrate the potential of combining brain imaging with …
Poster
Lexington Whalen · Zhenbang Du · Haoran You · Chaojian Li · Sixu Li · Yingyan (Celine) Lin

[ ExHall D ]

Abstract
Training diffusion models (DMs) is highly computationally demanding, necessitating multiple forward and backward passes across numerous timesteps. This challenge has motivated the exploration of various efficient DM training techniques. In this paper, we propose EB-Diff-Train, a new and orthogonal efficient DM training approach by investigating and leveraging Early-Bird (EB) tickets, sparse subnetworks that manifest early in the training process and maintain high generation quality. We first investigate the existence of traditional EB tickets in DMs, enabling competitive generation quality without fully training a dense model. Then, we delve into the concept of diffusion-dedicated EB tickets, which draw on insights from the varying importance of different timestep regions/periods. These tickets adapt their sparsity levels according to the importance of corresponding timestep regions, allowing for aggressive sparsity during non-critical regions while conserving computational resources for crucial timestep regions. Building on this, we develop an efficient DM training technique that derives timestep-aware EB tickets, trains them in parallel, and combines them through an ensemble during inference for image generation. This approach can significantly reduce training time both spatially and temporally, achieving 2.9× to 5.8× speedups over training unpruned dense models and up to 10.3× faster training compared to standard train-prune-finetune pipelines, without compromising generative quality. Extensive experiments and ablation …
Poster
Yousef Yeganeh · Ioannis Charisiadis · Marta Hasny · Martin Hartenberger · Björn Ommer · Nassir Navab · Azade Farshad · Ehsan Adeli

[ ExHall D ]

Abstract
Scaling by training on large datasets has been shown to enhance the quality and fidelity of image generation and manipulation with diffusion models; however, such large datasets are not always accessible in medical imaging due to cost and privacy issues, which contradicts one of the main applications of such models to produce synthetic samples where real data is scarce. Also, finetuning on pre-trained general models has been a challenge due to the distribution shift between the medical domain and the pre-trained models. Here, we propose Latent Drift (LD) for diffusion models that can be adopted for any fine-tuning method to mitigate the issues faced by the distribution shift or employed in inference time as a condition. Latent Drifting enables diffusion models to be conditioned for medical images fitted for the complex task of counterfactual image generation, which is crucial to investigate how parameters such as gender, age, and adding or removing diseases in a patient would alter the medical images. We evaluate our method on three public longitudinal benchmark datasets of brain MRI and chest X-rays for counterfactual image generation. Our results demonstrate significant performance gains in various scenarios when combined with different fine-tuning schemes. The source code of this …
Poster
Jian Wang · Xin Lan · Ji-Zhe Zhou · Yuxin Tian · Jiancheng Lv

[ ExHall D ]

Abstract
Under limited-data settings, GANs often struggle to navigate and effectively exploit the input latent space. Consequently, images generated from adjacent variables in a sparse input latent space may exhibit significant discrepancies in realism, leading to suboptimal consistency regularization (CR) outcomes. To address this, we propose *SQ-GAN*, a novel approach that enhances CR by introducing a style space quantization scheme. This method transforms the sparse, continuous input latent space into a compact, structured discrete proxy space, allowing each element to correspond to a specific data point, thereby improving CR performance. Instead of direct quantization, we first map the input latent variables into a less entangled "style" space and apply quantization using a learnable codebook. This enables each quantized code to control distinct factors of variation. Additionally, we minimize the optimal transport distance to align the codebook codes with features extracted from the training data by a foundation model, embedding external knowledge into the codebook and establishing a semantically rich vocabulary that properly describes the training dataset. Extensive experiments demonstrate significant improvements in both discriminator robustness and generation quality with our method.
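The style-space quantization step described above resembles a standard learnable-codebook (VQ-VAE-style) quantizer; the sketch below shows that generic building block with nearest-code lookup, a commitment loss, and a straight-through gradient. It is a stand-in under stated assumptions, not SQ-GAN's exact formulation (in particular, the optimal-transport codebook alignment is omitted).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleQuantizer(nn.Module):
    """Generic learnable-codebook quantizer (VQ-VAE style), used here to
    illustrate mapping continuous style vectors onto a discrete proxy space."""
    def __init__(self, num_codes: int = 512, dim: int = 128, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        nn.init.normal_(self.codebook.weight, std=0.02)
        self.beta = beta

    def forward(self, style: torch.Tensor):
        # Pairwise squared distances between style vectors and codebook entries.
        d = (style.pow(2).sum(1, keepdim=True)
             - 2 * style @ self.codebook.weight.t()
             + self.codebook.weight.pow(2).sum(1))
        idx = d.argmin(dim=1)                       # nearest code per vector
        quantized = self.codebook(idx)
        # Codebook + commitment losses; straight-through estimator for gradients.
        loss = (F.mse_loss(quantized, style.detach())
                + self.beta * F.mse_loss(style, quantized.detach()))
        quantized = style + (quantized - style).detach()
        return quantized, idx, loss

# Usage: quantize a batch of mapped "style" vectors.
quantizer = StyleQuantizer()
z_style = torch.randn(32, 128)
z_q, codes, vq_loss = quantizer(z_style)
print(z_q.shape, codes.shape, vq_loss.item())
```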
Poster
Yu Cao · Zengqun Zhao · Ioannis Patras · Shaogang Gong

[ ExHall D ]

Abstract
Visual artifacts remain a persistent challenge in diffusion models, even with training on massive datasets. Current solutions primarily rely on supervised detectors, yet lack understanding of why these artifacts occur in the first place. We discover that a diffusion generative process undergoes three temporal phases: Profiling, Mutation, and Refinement. We observe that artifacts typically emerge during the Mutation phase, where certain regions exhibit anomalous score dynamics over time, causing abrupt disruptions in the normal evolution pattern. This reveals why existing methods, which quantify the spatial uncertainty of the final output and therefore do not model the temporal dynamics of the diffusion process, have proven inadequate for artifact localization. Based on these insights, we propose a novel method, ASCED (Abnormal Score Correction for Enhancing Diffusion), that detects artifacts by monitoring abnormal score dynamics during the diffusion process, together with a trajectory-aware, on-the-fly mitigation strategy that appropriately generates noise in the detected areas. Unlike most existing methods that apply post hoc corrections, e.g., by applying a noising-denoising scheme after generation, our mitigation strategy does not involve additional diffusion steps. Extensive experiments demonstrate that our proposed approach effectively reduces artifacts across diverse domains, matching or surpassing existing supervised methods without additional training.
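To make the "abnormal score dynamics" intuition concrete, here is a heavily simplified sketch that flags spatial regions whose step-to-step change in score magnitude is an outlier relative to the rest of the image. The z-score criterion and all names are illustrative assumptions; ASCED's actual detection and correction rules are not reproduced here.

```python
import torch

def flag_abnormal_regions(score_traj: torch.Tensor, z_thresh: float = 3.0) -> torch.Tensor:
    """score_traj: (T, H, W) per-step score magnitudes for one sample.
    Returns a boolean (H, W) mask of regions whose step-to-step change is
    an outlier relative to the image-wide statistics at that step."""
    deltas = (score_traj[1:] - score_traj[:-1]).abs()        # (T-1, H, W)
    mu = deltas.mean(dim=(1, 2), keepdim=True)
    sigma = deltas.std(dim=(1, 2), keepdim=True) + 1e-8
    z = (deltas - mu) / sigma                                # per-step z-scores
    return (z > z_thresh).any(dim=0)                         # flagged at any step

# Toy usage: a smooth trajectory with one small region that mutates abruptly.
T, H, W = 20, 16, 16
traj = torch.linspace(1.0, 0.1, T)[:, None, None].repeat(1, H, W)
traj[10, 4:8, 4:8] += 5.0        # abrupt disruption in a small patch
print(flag_abnormal_regions(traj).nonzero().shape)
```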
Poster
Hoigi Seo · Wongi Jeong · Kyungryeol Lee · Se Young Chun

[ ExHall D ]

Abstract
Diffusion models have shown remarkable performance in image synthesis, but they demand extensive computational and memory resources for training, fine-tuning, and inference. Although advanced quantization techniques have successfully minimized memory usage for inference, training and fine-tuning these quantized models still require large memory, possibly due to dequantization for accurate gradient computation and/or backpropagation for gradient-based algorithms. However, memory-efficient fine-tuning is particularly desirable for applications such as personalization, which often must run on edge devices like mobile phones with private data. In this work, we address this challenge by quantizing a diffusion model with personalization via Textual Inversion and by leveraging zeroth-order optimization on personalization tokens without dequantization, so that it does not require the gradient and activation storage for backpropagation that consumes considerable memory. Since gradient estimation using zeroth-order optimization is quite noisy for a single or a few images in personalization, we propose to denoise the estimated gradient by projecting it onto a subspace constructed from the past history of the tokens, dubbed Subspace Gradient. In addition, we investigate the influence of text embeddings in image generation, leading to our proposed timestep sampling, dubbed Partial Uniform Timestep Sampling, for sampling with effective …
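The two ingredients named in the abstract, a zeroth-order gradient estimate over personalization tokens and its projection onto a subspace built from past token iterates, can be sketched generically as below. The two-point (SPSA-style) estimator, the stand-in loss, and the history window size are assumptions for illustration, not the paper's exact algorithm.

```python
import torch

def zo_gradient(loss_fn, tokens: torch.Tensor, mu: float = 1e-3, n_samples: int = 8):
    """Two-point (SPSA-style) zeroth-order gradient estimate: no backprop, so no
    activation storage and no dequantization of model weights is needed."""
    grad = torch.zeros_like(tokens)
    for _ in range(n_samples):
        u = torch.randn_like(tokens)
        g = (loss_fn(tokens + mu * u) - loss_fn(tokens - mu * u)) / (2 * mu)
        grad += g * u
    return grad / n_samples

def project_to_subspace(grad: torch.Tensor, history: torch.Tensor) -> torch.Tensor:
    """Denoise the estimate by projecting onto the span of past token iterates.
    history: (k, d) flattened past tokens; grad: same shape as the token tensor."""
    q, _ = torch.linalg.qr(history.t())                  # orthonormal basis (d, k)
    flat = grad.flatten()
    return (q @ (q.t() @ flat)).view_as(grad)

# Toy usage with a stand-in loss; a real setup would instead evaluate the
# quantized diffusion model's denoising loss at the perturbed token embeddings.
target = torch.randn(4, 16)
loss_fn = lambda t: ((t - target) ** 2).mean()

tokens, history = torch.zeros(4, 16), []
for step in range(50):
    g = zo_gradient(loss_fn, tokens)
    if len(history) >= 4:                                # short history window
        g = project_to_subspace(g, torch.stack(history[-4:]))
    tokens = tokens - 0.5 * g
    history.append(tokens.flatten().clone())
print("final stand-in loss:", loss_fn(tokens).item())
```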
Poster
Zhenyu Wang · Jianxi Huang · Zhida Sun · Yuanhao Gong · Daniel Cohen-Or · Min Lu

[ ExHall D ]

Abstract
This work presents a progressive image vectorization technique that reconstructs a raster image as layer-wise vectors, from semantically aligned macro structures to finer details. Our approach introduces a new image simplification method leveraging the feature-averaging effect of the Score Distillation Sampling mechanism, achieving effective visual abstraction from detailed to coarse. Guided by the sequence of progressively simplified images, we propose a two-stage vectorization process of structural buildup and visual refinement, constructing the vectors in an organized and manageable manner. The resulting vectors are layered and well aligned with the target image's explicit and implicit semantic structures. Our method demonstrates high performance across a wide range of images. Comparative analysis with existing vectorization methods highlights our technique's superiority in creating vectors with high visual fidelity and, more importantly, achieving higher semantic alignment and a more compact layered representation.
Poster
Yiyang Ma · Xingchao Liu · Xiaokang Chen · Wen Liu · Chengyue Wu · Zhiyu Wu · Zizheng Pan · Zhenda Xie · Haowei Zhang · Xingkai Yu · Liang Zhao · Yisong Wang · Jiaying Liu · Chong Ruan

[ ExHall D ]

Abstract
We present **JanusFlow**, a powerful framework that unifies image understanding and generation in a single model. JanusFlow introduces a minimalist architecture that integrates autoregressive language models with rectified flow, a state-of-the-art method in generative modeling. Our key finding demonstrates that rectified flow can be straightforwardly trained within the large language model framework, eliminating the need for complex architectural modifications. To further improve the performance of our unified model, we adopt two key strategies: (i) decoupling the understanding and generation encoders, and (ii) aligning their representations during unified training. Extensive experiments show that JanusFlow achieves comparable or superior performance to specialized models in their respective domains, while significantly outperforming existing unified approaches. This work represents a step toward more efficient and versatile vision-language models.
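For readers unfamiliar with rectified flow, the generic training objective it refers to is a straight-line interpolation between noise and data with a velocity-regression loss; a toy sketch follows. The tiny network and data below are placeholders, not JanusFlow's architecture.

```python
import torch
import torch.nn as nn

class TinyVelocityNet(nn.Module):
    """Stand-in for a generation head; predicts velocity given x_t and t."""
    def __init__(self, dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim))
    def forward(self, x_t, t):
        return self.net(torch.cat([x_t, t[:, None]], dim=-1))

def rectified_flow_loss(model, x1: torch.Tensor) -> torch.Tensor:
    """Straight-path interpolation x_t = (1 - t) x0 + t x1 with target velocity x1 - x0."""
    x0 = torch.randn_like(x1)                    # noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device)
    x_t = (1 - t[:, None]) * x0 + t[:, None] * x1
    v_target = x1 - x0
    return ((model(x_t, t) - v_target) ** 2).mean()

model = TinyVelocityNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):
    batch = torch.randn(64, 32) * 0.5 + 1.0      # toy "data" distribution
    loss = rectified_flow_loss(model, batch)
    opt.zero_grad(); loss.backward(); opt.step()
print("rectified-flow loss:", loss.item())
```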
Poster
Hui Li · Mingwang Xu · Qingkun Su · Shan Mu · Li jiaye · Kaihui Cheng · Chen Yuxuan · Tan Chen · Mao Ye · Jingdong Wang · Siyu Zhu

[ ExHall D ]

Abstract
Recent advancements in visual generation technologies have markedly increased the scale and availability of video datasets, which are crucial for training effective video generation models. However, a significant lack of high-quality, human-centric video datasets presents a challenge to progress in this field. To bridge this gap, we introduce **OpenHumanVid**, a large-scale and high-quality human-centric video dataset characterized by precise and detailed captions that encompass both human appearance and motion states, along with supplementary human motion conditions, including skeleton sequences and speech audio. To validate the efficacy of this dataset and the associated training strategies, we propose an extension of existing classical diffusion transformer architectures and conduct further pretraining of our models on the proposed dataset. Our findings yield two critical insights: First, the incorporation of a large-scale, high-quality dataset substantially enhances evaluation metrics for generated human videos while preserving performance in general video generation tasks. Second, the effective alignment of text with human appearance, human motion, and facial motion is essential for producing high-quality video outputs. Based on these insights and corresponding methodologies, the straightforward extended network trained on the proposed dataset demonstrates a clear improvement in the generation of human-centric videos.
Poster
Minghong Cai · Xiaodong Cun · Xiaoyu Li · Wenze Liu · Zhaoyang Zhang · Yong Zhang · Ying Shan · Xiangyu Yue

[ ExHall D ]

Abstract
Sora-like video generation models have achieved remarkable progress with the Multi-Modal Diffusion Transformer (MM-DiT) architecture. However, current video generation models predominantly focus on single-prompt inputs, struggling to generate coherent scenes from multiple sequential prompts that better reflect real-world dynamic scenarios. While some pioneering works have explored multi-prompt video generation, they face significant challenges, including strict training data requirements, weak prompt following, and unnatural transitions. To address these problems, we propose DiTCtrl, the first training-free multi-prompt video generation method under MM-DiT architectures. Our key idea is to treat the multi-prompt video generation task as temporal video editing with smooth transitions. To achieve this goal, we first analyze MM-DiT's attention mechanism, finding that the 3D full attention behaves similarly to the cross/self-attention blocks in UNet-like diffusion models, enabling mask-guided, precise semantic control across different prompts with attention sharing for multi-prompt video generation. Based on our careful design, the video generated by DiTCtrl achieves smooth transitions and consistent object motion given multiple sequential prompts, without additional training. We also present MPVBench, a new benchmark specifically designed to evaluate multi-prompt video generation. Extensive experiments demonstrate that our method achieves state-of-the-art performance …
Poster
Shijie Wang · Samaneh Azadi · Rohit Girdhar · Sai Saketh Rambhatla · Chen Sun · Xi Yin

[ ExHall D ]

Abstract
Text-Image-to-Video (TI2V) generation aims to generate a video from an image following a text description, which is also referred to as text-guided image animation. Most existing methods struggle to generate videos that align well with the text prompts, particularly when motion is specified. To overcome this limitation, we introduce MotiF, a simple yet effective approach that directs the model's learning to the regions with more motion, thereby improving the text alignment and motion generation. We use optical flow to generate a motion heatmap and weight the loss according to the intensity of the motion. This simple objective leads to noticeable improvements and complements existing methods that utilize motion priors as model inputs. Additionally, due to the lack of a diverse benchmark for evaluating TI2V generation, we propose TI2V Bench, a dataset consisting of 320 image-text pairs for robust evaluation. We present a human evaluation protocol that asks the annotators to select an overall preference between two videos, followed by their justifications. Through a comprehensive evaluation, MotiF outperforms nine open-source models, achieving an average preference of 72%.
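The loss-weighting idea is simple enough to sketch: turn optical-flow magnitude into a per-pixel heatmap and use it to re-weight a standard epsilon-prediction loss. The normalization, the floor weight for static regions, and the function names below are illustrative assumptions rather than MotiF's exact recipe.

```python
import torch
import torch.nn.functional as F

def motion_heatmap(flow: torch.Tensor, floor: float = 0.1) -> torch.Tensor:
    """flow: (B, 2, H, W) optical flow. Returns (B, 1, H, W) weights in [floor, 1]."""
    mag = flow.norm(dim=1, keepdim=True)
    mag = mag / (mag.amax(dim=(2, 3), keepdim=True) + 1e-8)   # normalize per sample
    return floor + (1 - floor) * mag                          # keep some weight on static areas

def motion_weighted_loss(noise_pred: torch.Tensor, noise: torch.Tensor,
                         flow: torch.Tensor) -> torch.Tensor:
    """Standard epsilon-prediction MSE, re-weighted toward high-motion regions."""
    w = motion_heatmap(flow)                                  # (B, 1, H, W)
    per_pixel = F.mse_loss(noise_pred, noise, reduction="none").mean(dim=1, keepdim=True)
    return (w * per_pixel).sum() / w.sum()

# Toy usage with random tensors standing in for model outputs and precomputed flow.
B, C, H, W = 2, 4, 32, 32
flow = torch.randn(B, 2, H, W)
noise, noise_pred = torch.randn(B, C, H, W), torch.randn(B, C, H, W)
print(motion_weighted_loss(noise_pred, noise, flow).item())
```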
Poster
Zhengbo Zhang · Yuxi Zhou · DUO PENG · Joo Lim · Zhigang Tu · De Soh Soh · Lin Geng Foo

[ ExHall D ]

Abstract
One-shot controllable video editing (OCVE) is an important yet challenging task, aiming to propagate user edits made (using any image editing tool) on the first frame of a video to all subsequent frames, while ensuring content consistency between edited frames and source frames. To achieve this, prior methods employ DDIM inversion to transform source frames into latent noise, which is then fed into a pre-trained diffusion model, conditioned on the user-edited first frame, to generate the edited video. However, the DDIM inversion process accumulates errors, which hinder the latent noise from accurately reconstructing the source frames, ultimately compromising content consistency in the generated edited frames. To overcome this, our method eliminates the need for DDIM inversion by performing OCVE from a novel perspective based on visual prompting. Furthermore, inspired by consistency models that can perform multi-step consistency sampling to generate a sequence of content-consistent images, we propose content consistency sampling (CCS) to ensure content consistency between the generated edited frames and the source frames. Moreover, we introduce temporal-content consistency sampling (TCS), based on Stein Variational Gradient Descent, to ensure temporal consistency across the edited frames. Extensive experiments validate the effectiveness of our approach.
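Stein Variational Gradient Descent, the ingredient behind the proposed temporal-content consistency sampling, is standard; the sketch below shows one SVGD update with an RBF kernel on a toy Gaussian target. The score function and hyperparameters are placeholders, and none of this reproduces the paper's CCS/TCS formulations.

```python
import torch

def rbf_kernel(x: torch.Tensor, bandwidth: float = 1.0):
    """x: (n, d) particles. Returns kernel matrix K and dK(x_i, x_j)/dx_i."""
    diff = x[:, None, :] - x[None, :, :]                  # (n, n, d)
    sq_dist = (diff ** 2).sum(-1)
    K = torch.exp(-sq_dist / (2 * bandwidth ** 2))
    grad_K = -diff / bandwidth ** 2 * K[..., None]        # (n, n, d)
    return K, grad_K

def svgd_step(x: torch.Tensor, score_fn, step_size: float = 0.5) -> torch.Tensor:
    """One Stein Variational Gradient Descent update over a set of particles."""
    K, grad_K = rbf_kernel(x)
    score = score_fn(x)                                   # (n, d) grad log-density
    phi = (K @ score + grad_K.sum(dim=0)) / x.shape[0]    # driving + repulsive terms
    return x + step_size * phi

# Toy usage: particles drift toward a standard Gaussian while repelling each
# other, the kind of update one could reuse to keep a set of edited frames
# consistent with a target distribution yet mutually diverse.
score_fn = lambda x: -x                                   # score of N(0, I)
particles = torch.randn(16, 2) * 2.0 + 4.0
for _ in range(300):
    particles = svgd_step(particles, score_fn)
print("particle mean after SVGD:", particles.mean(dim=0))
```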
Poster
Or Madar · Ohad Fried

[ ExHall D ]

Abstract
Image tiling—the seamless connection of disparate images to create a coherent visual field—is crucial for applications such as texture creation, video game asset development, and digital art. Traditionally, tiles have been constructed manually, a method that poses significant limitations in scalability and flexibility. Recent research has attempted to automate this process using generative models. However, current approaches primarily focus on tiling textures and manipulating models for single-image generation, without inherently supporting the creation of multiple interconnected tiles across diverse domains. This paper presents Tiled Diffusion, a novel approach that extends the capabilities of diffusion models to accommodate the generation of cohesive tiling patterns across various domains of image synthesis that require tiling. Our method supports a wide range of tiling scenarios, from self-tiling to complex many-to-many connections, enabling seamless integration of multiple images. Tiled Diffusion automates the tiling process, eliminating the need for manual intervention and enhancing creative possibilities in various applications, such as seamless tiling of existing images, tiled texture creation, and 360° synthesis.
Poster
Lingjun Mao · Zineng Tang · Alane Suhr

[ ExHall D ]

Abstract
We study the perception of color illusions by vision-language models. Color illusions, in which a person's visual system perceives color differently from the actual color, are well studied in human vision. However, it remains underexplored whether vision-language models (VLMs), trained on large-scale human data, exhibit similar perceptual biases when confronted with such color illusions. We propose an automated framework for generating color illusion images, resulting in RCID (Realistic Color Illusion Dataset), a dataset of 19,000 realistic illusion images. Our experiments show that all studied VLMs exhibit perceptual biases similar to human vision. Finally, we train a model to distinguish both human perception and actual pixel differences.
Poster
Fatemeh Behrad · Tinne Tuytelaars · Johan Wagemans

[ ExHall D ]

Abstract
The capacity of Vision transformers (ViTs) to handle variable-sized inputs is often constrained by computational complexity and batch processing limitations. Consequently, ViTs are typically trained on small, fixed-size images obtained through downscaling or cropping. While reducing computational burden, these methods result in significant information loss, negatively affecting tasks like image aesthetic assessment. We introduce Charm, a novel tokenization approach that preserves Composition, High-resolution, Aspect Ratio, and Multi-scale information simultaneously. Charm prioritizes high-resolution details in specific regions while downscaling others, enabling shorter fixed-size input sequences for ViTs while incorporating essential information. Charm is designed to be compatible with pre-trained ViTs and their learned positional embeddings. By providing multiscale input and introducing variety to input tokens, Charm improves ViT performance and generalizability for image aesthetic assessment. We avoid cropping or changing the aspect ratio to further preserve information. Extensive experiments demonstrate significant performance improvements on various image aesthetic and quality assessment datasets (up to 8.1%) using a lightweight ViT backbone. Code and pre-trained models will be made publicly available upon acceptance.
Poster
Jamie Wynn · Zawar Qureshi · Jakub Powierza · Jamie Watson · Mohamed Sayed

[ ExHall D ]

Abstract
Exploring real-world spaces using novel-view synthesis is fun, and reimagining those worlds in a different style adds another layer of excitement. Stylized worlds can also be used for downstream tasks where there is limited training data and a need to expand a model's training distribution. Most current novel-view synthesis stylization techniques lack the ability to convincingly change geometry. This is because any geometry change requires increased style strength which is often capped for stylization stability and consistency. In this work, we propose a new autoregressive 3D Gaussian Splatting stylization method. As part of this method, we contribute a new RGBD diffusion model that allows for strength control over appearance and shape stylization. To ensure consistency across stylized frames, we use a combination of novel depth-guided cross attention, feature injection, and a Warp ControlNet conditioned on composite frames for guiding the stylization of new frames. We validate our method via extensive qualitative results, quantitative experiments, and a user study.
Poster
Hyelin Nam · Jaemin Kim · Dohun Lee · Jong Chul Ye

[ ExHall D ]

Abstract
While text-to-video diffusion models have made significant strides, many still face challenges in generating videos with temporal consistency. Within diffusion frameworks, guidance techniques have proven effective in enhancing output quality during inference; however, applying these methods to video diffusion models introduces additional complexity of handling computations across entire sequences. To address this, we propose a novel framework called MotionPrompt that guides the video generation process via optical flow. Specifically, we train a discriminator to distinguish optical flow between random pairs of frames from real videos and generated ones. Given that prompts can influence the entire video, we optimize learnable token embeddings during reverse sampling steps by using gradients from a trained discriminator applied to random frame pairs. This approach allows our method to generate visually coherent video sequences that closely reflect natural motion dynamics, without compromising the fidelity of the generated content. We demonstrate the effectiveness of our approach across various models. Project Page: https://motionprompt.github.io/anon7165/
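A rough sketch of the guidance loop described above: a frozen discriminator scores the motion between random frame pairs, and only learnable token embeddings are updated from its gradient. Because a differentiable optical-flow estimator would not fit in a few lines, a frame-difference proxy stands in for flow here; every module and hyperparameter below is an illustrative assumption, not MotionPrompt's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionDiscriminator(nn.Module):
    """Frozen discriminator over a motion signal between two frames. The paper's
    discriminator is trained on optical flow; a differentiable frame difference
    is used below purely as a stand-in motion proxy."""
    def __init__(self, frame_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(frame_dim, 64), nn.SiLU(), nn.Linear(64, 1))
    def forward(self, frame_a, frame_b):
        return self.net(frame_b - frame_a)            # realness logit for the motion

class ToyVideoGenerator(nn.Module):
    """Maps learnable prompt tokens to a short sequence of frame features."""
    def __init__(self, token_dim: int = 16, frame_dim: int = 64, n_frames: int = 8):
        super().__init__()
        self.proj = nn.Linear(token_dim, frame_dim * n_frames)
        self.n_frames, self.frame_dim = n_frames, frame_dim
    def forward(self, tokens):
        return self.proj(tokens.mean(dim=0)).view(self.n_frames, self.frame_dim)

disc, gen = MotionDiscriminator().eval(), ToyVideoGenerator().eval()
for module in (disc, gen):                             # both networks stay frozen;
    for p in module.parameters():                      # only the tokens are optimized
        p.requires_grad_(False)

tokens = nn.Parameter(torch.randn(4, 16))              # learnable token embeddings
opt = torch.optim.Adam([tokens], lr=1e-2)

for step in range(100):
    frames = gen(tokens)
    i, j = torch.randperm(frames.shape[0])[:2]         # random pair of frames
    logit = disc(frames[i], frames[j])
    # Non-saturating GAN-style guidance: push the pair's motion toward "real".
    loss = F.softplus(-logit).mean()
    opt.zero_grad(); loss.backward(); opt.step()
print("final guidance loss:", loss.item())
```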
Poster
Ye Wang · Ruiqi Liu · Jiang Lin · Fei Liu · Zili Yi · Yilin Wang · Rui Ma

[ ExHall D ]

Abstract
In this paper, we introduce OmniStyle-1M, a large-scale paired style transfer dataset comprising over one million content-style-stylized image triplets across 1,000 diverse style categories, each enhanced with textual descriptions and instruction prompts. We show that OmniStyle-1M can not only improve the generalization of style transfer models through supervised training but also facilitates precise control over target stylization. Especially, to ensure the quality of the dataset, we introduce OmniFilter, a comprehensive style transfer quality assessment framework, which filters high-quality triplets based on content preservation, style consistency, and aesthetic appeal. Building upon this foundation, we propose OmniStyle, a framework based on the Diffusion Transformer (DiT) architecture designed for high-quality and efficient style transfer. This framework supports both instruction-guided and image-guided style transfer, generating high resolution outputs with exceptional detail. Extensive qualitative and quantitative evaluations demonstrate OmniStyle's superior performance compared to existing approaches, highlighting its efficiency and versatility. OmniStyle-1M and its accompanying methodologies provide a significant contribution to advancing high-quality style transfer, offering a valuable resource for the research community.
Poster
Noam Rotstein · Gal Yona · Daniel Silver · Roy Velich · David Bensaid · Ron Kimmel

[ ExHall D ]

Abstract
Generative image editing has recently seen substantial progress with the emergence of image diffusion models, though critical challenges remain. Models tend to lack fidelity, altering essential aspects of original images and still struggling to accurately follow complex edit instructions. Simultaneously, video generation has made remarkable strides, with models that effectively function as consistent and continuous world simulators. In this paper, we propose merging these two fields by utilizing image-to-video models for image editing. We reformulate image editing as a temporal process, using pretrained video models to create smooth transitions from the original image to the desired edit. This approach traverses the image manifold continuously, ensuring consistent edits while preserving the original image's key aspects. Our approach achieves state-of-the-art results on text-based image editing benchmarks, demonstrating significant improvements in both edit accuracy and image preservation.
Poster
Ziqi Cai · Shuchen Weng · Yifei Xia · Boxin Shi

[ ExHall D ]

Abstract
Achieving joint control over material properties, lighting, and high-level semantics in images is essential for applications in digital media, advertising, and interactive design. Existing methods often isolate these properties, lacking a cohesive approach to manipulating materials, lighting, and semantics simultaneously. We introduce PhyS-EdiT, a novel diffusion-based model that enables precise control over four critical material properties: roughness, metallicity, albedo, and transparency while integrating lighting and semantic adjustments within a single framework. To facilitate this disentangled control, we present PR-TIPS, a large and diverse synthetic dataset designed to improve the disentanglement of material and lighting effects. PhyS-EdiT incorporates a dual-network architecture and robust training strategies to balance low-level physical realism with high-level semantic coherence, supporting localized and continuous property adjustments. Extensive experiments demonstrate the superiority of PhyS-EdiT in editing both synthetic and real-world images, achieving state-of-the-art performance on material, lighting, and semantic editing tasks.
Poster
Omri Avrahami · Or Patashnik · Ohad Fried · Egor Nemchinov · Kfir Aberman · Dani Lischinski · Daniel Cohen-Or

[ ExHall D ]

Abstract
Diffusion models have revolutionized the field of content synthesis and editing. Recent models have replaced the traditional UNet architecture with the Diffusion Transformer (DiT), and employed flow-matching for improved training and sampling. However, they exhibit limited generation diversity. In this work, we leverage this limitation to perform consistent image edits via selective injection of attention features. The main challenge is that, unlike the UNet-based models, DiT lacks a coarse-to-fine synthesis structure, making it unclear in which layers to perform the injection. Therefore, we propose an automatic method to identify "vital layers" within DiT, crucial for image formation, and demonstrate how these layers facilitate a range of controlled stable edits, from non-rigid modifications to object addition, using the same mechanism. Next, to enable real-image editing, we introduce an improved image inversion method for flow models. Finally, we evaluate our approach through qualitative and quantitative comparisons, along with a user study, and demonstrate its effectiveness across multiple applications.
Poster
Daneul Kim · Jaeah Lee · Jaesik Park

[ ExHall D ]

Abstract
Most real-world image editing tasks require multiple sequential edits to achieve desired results. Current editing approaches, primarily designed for single-object modifications, struggle with sequential editing, especially with maintaining previous edits while naturally adapting new objects into the existing content. These limitations significantly hinder complex editing scenarios where multiple objects need to be modified while preserving their contextual relationships. We address this fundamental challenge through two key proposals: enabling rough mask inputs that preserve existing content while naturally integrating new elements, and supporting consistent editing across multiple modifications. Our framework achieves this through layer-wise memory, which stores latent representations and prompt embeddings from previous edits. We propose Background Consistency Guidance that leverages memorized latents to maintain scene coherence and Multi-Query Disentanglement in cross-attention that ensures natural adaptation to existing content. To evaluate our method, we present a new benchmark dataset incorporating semantic alignment metrics and interactive editing scenarios. Through comprehensive experiments, we demonstrate superior performance in iterative image editing tasks with minimal user effort, requiring only rough masks while maintaining high-quality results throughout multiple editing steps.
Poster
Jiteng Mu · Nuno Vasconcelos · Xiaolong Wang

[ ExHall D ]

Abstract
Recent progress in controllable image generation and editing is largely driven by diffusion-based methods. Although diffusion models perform exceptionally well in specific tasks with tailored designs, establishing a unified model is still challenging. In contrast, autoregressive models inherently feature a unified tokenized representation, which simplifies the creation of a single foundational model for various tasks. In this work, we propose EditAR, a single unified autoregressive framework for a variety of conditional image generation tasks, e.g., image editing, depth-to-image, edge-to-image, and segmentation-to-image. The model takes both images and instructions as inputs, and predicts the edited image tokens in a vanilla next-token paradigm. To enhance the text-to-image alignment, we further propose to distill the knowledge from foundation models into the autoregressive modeling process. We evaluate its effectiveness across diverse tasks on established benchmarks, showing competitive performance to various state-of-the-art task-specific methods.
Poster
Vittorio Pippi · Fabio Quattrini · Silvia Cascianelli · Alessio Tonioni · Rita Cucchiara

[ ExHall D ]

Abstract
Styled Handwritten Text Generation (HTG) has recently received attention from the computer vision and document analysis communities, which have developed several solutions, either GAN- or diffusion-based, that achieved promising results. Nonetheless, these strategies fail to generalize to novel styles and have technical constraints, particularly in terms of maximum output length and training efficiency. To overcome these limitations, in this work, we propose a novel framework for text-image generation, dubbed Emuru. Our approach leverages a powerful text-image representation model (a variational autoencoder) combined with an autoregressive Transformer. Our approach enables the generation of styled text images conditioned on textual content and style examples, such as specific fonts or handwriting styles. We train our model solely on a diverse, synthetic dataset of English text rendered in over 100,000 typewritten and calligraphy fonts, which gives it the capability to reproduce unseen styles (both fonts and users' handwriting) in zero-shot. To the best of our knowledge, Emuru is the first autoregressive model for HTG, and the first designed specifically for generalization to novel styles. Moreover, our model generates images without background artifacts, which are easier to use for downstream applications. Extensive evaluation on both typewritten and handwritten, any-length text image generation scenarios demonstrates the …
Poster
Yu Yuan · Xijun Wang · Yichen Sheng · Prateek Chennuri · Xingguang Zhang · Stanley H. Chan

[ ExHall D ]

Abstract
Image generation today can produce somewhat realistic images from text prompts. However, if one asks the generator to synthesize a particular camera setting, such as creating different fields of view using a 24mm lens versus a 70mm lens, the generator will not be able to interpret and generate scene-consistent images. This limitation not only hinders the adoption of generative tools in photography applications but also exemplifies a broader issue of bridging the gap between data-driven models and the physical world. In this paper, we introduce the concept of Generative Photography, a framework designed to control camera intrinsic settings during content generation. The core innovations of this work are the concepts of Dimensionality Lifting and Contrastive Camera Learning, which achieve continuous and consistent transitions across different camera settings. Experimental results show that our method produces significantly more scene-consistent photorealistic images than state-of-the-art models such as Stable Diffusion 3 and FLUX.
Poster
Sean J. Liu · Nupur Kumari · Ariel Shamir · Jun-Yan Zhu

[ ExHall D ]

Abstract
Text-to-image models are powerful tools for image creation. However, the generation process is akin to a dice roll and makes it difficult to achieve a _single_ image that captures everything a user wants. In this paper, we propose a framework for creating the desired image by _compositing_ it from various parts of generated images, in essence forming a _Generative Photomontage_. Given a stack of images generated by ControlNet using the same input condition and different seeds, we let users select desired parts from the generated results using a brush stroke interface. We introduce a novel technique that takes in the user's brush strokes, segments the generated images using a graph-based optimization in diffusion feature space, and then composites the segmented regions via a new feature-space blending method. Our method faithfully preserves the user-selected regions while compositing them harmoniously. We demonstrate that our flexible framework can be used for many applications, including generating new appearance combinations, fixing incorrect shapes and artifacts, and improving prompt alignment. We show compelling results for each application and demonstrate that our method outperforms existing image blending methods and various baselines.
Poster
Han Yang · Chuanguang Yang · Qiuli Wang · Zhulin An · Weilun Feng · Libo Huang · Yongjun Xu

[ ExHall D ]

Abstract
The rapid development of diffusion models has fueled a growing demand for customized image generation. However, current customization methods face several limitations: 1) they typically accept either image or text conditions alone; 2) customization in complex visual scenarios often leads to subject leakage or confusion; 3) image-conditioned outputs tend to suffer from inconsistent backgrounds; and 4) computational costs are high. To address these issues, this paper introduces Multi-party Collaborative Attention Control (MCA-Ctrl), a tuning-free approach that enables high-quality image customization under both text and complex visual conditions. Specifically, MCA-Ctrl leverages two key operations within the self-attention layer to coordinate multiple parallel diffusion processes and guide the target image generation. This approach allows MCA-Ctrl to capture the content and appearance of specific subjects while maintaining semantic consistency with the conditional input. Additionally, to mitigate the subject leakage and confusion issues common in complex visual scenarios, we introduce a Subject Localization Module that extracts precise subject and editable image layers based on user instructions. Extensive quantitative and human evaluation experiments demonstrate that MCA-Ctrl outperforms previous methods in zero-shot image customization, effectively addressing the aforementioned issues.
Poster
Yifan Pu · Yiming Zhao · Zhicong Tang · Ruihong Yin · Haoxing Ye · Yuhui Yuan · Dong Chen · Jianmin Bao · Sirui Zhang · Yanbin Wang · Lin Liang · Lijuan Wang · Ji Li · Xiu Li · Zhouhui Lian · Gao Huang · Baining Guo

[ ExHall D ]

Abstract
Multi-layer image generation is a fundamental task that enables users to isolate, select, and edit specific image layers, thereby revolutionizing interactions with generative models. In this paper, we introduce the Anonymous Region Transformer (ART), which facilitates the direct generation of variable multi-layer transparent images based on a global text prompt and an anonymous region layout. Inspired by Schema theory, this anonymous region layout allows the generative model to autonomously determine which set of visual tokens should align with which text tokens, which is in contrast to the previously dominant semantic layout for the image generation task. In addition, the layer-wise region crop mechanism, which only selects the visual tokens belonging to each anonymous region, significantly reduces attention computation costs and enables the efficient generation of images with numerous distinct layers (e.g., 50+). When compared to the full attention approach, our method is over 12 times faster and exhibits fewer layer conflicts. Furthermore, we propose a high-quality multi-layer transparent image autoencoder that supports the direct encoding and decoding of the transparency of variable multi-layer images in a joint manner. By enabling precise control and scalable layer generation, ART establishes a new paradigm for interactive content creation.
Poster
Lunhao Duan · Shanshan Zhao · Wenjun Yan · Yinglun Li · Qing-Guo Chen · Zhao Xu · Weihua Luo · Kaifu Zhang · Mingming Gong · Gui-Song Xia

[ ExHall D ]

Abstract
Recently, text-to-image generation models have achieved remarkable advancements, particularly with diffusion models facilitating high-quality image synthesis from textual descriptions. However, these models often struggle with achieving precise control over pixel-level layouts, object appearances, and global styles when using text prompts alone. To mitigate this issue, previous works introduce conditional images as auxiliary inputs for image generation, enhancing control but typically necessitating specialized models tailored to different types of reference inputs. In this paper, we explore a new approach to unify controllable generation within a single framework. Specifically, we propose the unified image-instruction adapter (UNIC-Adapter) built on the Multi-Modal-Diffusion Transformer architecture, to enable flexible and controllable generation across diverse conditions without the need for multiple specialized models. Our UNIC-Adapter effectively extracts multi-modal instruction information by incorporating both conditional images and task instructions, injecting this information into the image generation process through a cross-attention mechanism enhanced by Rotary Position Embedding. Experimental results across a variety of tasks, including pixel-level spatial control, subject-driven image generation, and style-image-based image synthesis, demonstrate the effectiveness of our UNIC-Adapter in unified controllable image generation.
Poster
Jian Yang · Dacheng Yin · Yizhou Zhou · Fengyun Rao · Wei Zhai · Yang Cao · Zheng-Jun Zha

[ ExHall D ]

Abstract
Recent advancements in multi-modal large language models have propelled the development of joint probabilistic models capable of both image understanding and generation. However, we have identified that recent methods suffer from loss of image information during the understanding task, due to either image discretization or diffusion denoising steps. To address this issue, we propose a novel Multi-Modal Auto-Regressive (MMAR) probabilistic modeling framework. Unlike the discretization line of methods, MMAR takes in continuous-valued image tokens to avoid information loss in an efficient way. Differing from diffusion-based approaches, we disentangle the diffusion process from the auto-regressive backbone model by employing a light-weight diffusion head on top of each auto-regressed image patch embedding. In this way, when the model transitions from image generation to understanding through text generation, the backbone model's hidden representation of the image is not limited to the last denoising step. To successfully train our method, we also propose a theoretically proven technique that addresses the numerical stability issue and a training strategy that balances the generation and understanding task goals. Extensive evaluations on 18 image understanding benchmarks show that MMAR significantly outperforms most of the existing joint multi-modal models, surpassing the method that employs a pre-trained CLIP vision encoder. Meanwhile, MMAR is able to …
Poster
Chaehun Shin · Jooyoung Choi · Heeseung Kim · Sungroh Yoon

[ ExHall D ]

Abstract
Subject-driven text-to-image generation aims to produce images of a new subject within a desired context by accurately capturing both the visual characteristics of the subject and the semantic content of a text prompt. Traditional methods rely on time- and resource-intensive fine-tuning for subject alignment, while recent zero-shot approaches leverage on-the-fly image prompting, often sacrificing subject alignment. In this paper, we introduce Diptych Prompting, a novel zero-shot approach that reinterprets subject-driven generation as an inpainting task with precise subject alignment, by leveraging the emergent property of diptych generation in large-scale text-to-image models. Diptych Prompting arranges an incomplete diptych with the reference image in the left panel, and performs text-conditioned inpainting on the right panel. We further prevent unwanted content leakage by removing the background in the reference image and improve fine-grained details in the generated subject by enhancing attention weights between the panels during inpainting. Experimental results confirm that our approach significantly outperforms zero-shot image prompting methods, resulting in images that are visually preferred by users. Additionally, our method supports not only subject-driven generation but also stylized image generation and subject-driven image editing, demonstrating versatility across diverse image generation applications.
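The diptych setup itself is easy to picture with a small sketch: place the background-removed reference in the left panel and mask the right panel for text-conditioned inpainting. The canvas layout, gap size, and prompt wording below are illustrative assumptions.

```python
import numpy as np

def make_diptych_inputs(reference: np.ndarray, gap: int = 16):
    """Build the diptych canvas and inpainting mask: the reference image sits in
    the left panel, the right panel (mask == 1) is left for text-conditioned
    inpainting. reference: (H, W, 3) uint8 array, background already removed."""
    h, w, _ = reference.shape
    canvas = np.zeros((h, 2 * w + gap, 3), dtype=np.uint8)
    canvas[:, :w] = reference                      # left panel: the subject reference
    mask = np.zeros((h, 2 * w + gap), dtype=np.uint8)
    mask[:, w + gap:] = 1                          # right panel: to be inpainted
    return canvas, mask

# Toy usage; a real pipeline would feed canvas + mask to an inpainting model with
# a diptych-style prompt describing the same subject in the desired new context.
ref = np.full((256, 256, 3), 200, dtype=np.uint8)
canvas, mask = make_diptych_inputs(ref)
print(canvas.shape, mask.shape, mask.mean())
```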
Poster
Jierun Chen · Dongting Hu · Xijie Huang · Huseyin Coskun · Arpit Sahni · Aarush Gupta · Anujraaj Goyal · Dishani Lahiri · Rajesh Singh · Yerlan Idelbayev · Junli Cao · Yanyu Li · Kwang-Ting Cheng · Mingming Gong · S.-H. Gary Chan · Sergey Tulyakov · Anil Kag · Yanwu Xu · Jian Ren

[ ExHall D ]

Abstract
Existing text-to-image (T2I) diffusion models face several limitations, including large model sizes, slow runtime, and low-quality generation on mobile devices. This paper aims to address all of these challenges by developing an extremely small and fast T2I model that generates high-resolution and high-quality images on mobile platforms. We propose several techniques to achieve this goal. First, we systematically examine the design choices of the network architecture to reduce model parameters and latency, while ensuring high-quality generation. Second, to further improve generation quality, we employ cross-architecture knowledge distillation from a much larger model, using a multi-level approach to guide the training of our model from scratch. Third, we enable a few-step generation by integrating adversarial guidance with knowledge distillation. Our model, for the first time, demonstrates the generation of 1024x1024 px images on a mobile device in 1.2 to 2.3 seconds. On ImageNet-1K, our model, with only 372M parameters, achieves an FID of 2.06 for 256x256 px generation. On T2I benchmarks (i.e., GenEval and DPG-Bench), our model with merely 379M parameters surpasses large-scale models with billions of parameters at a significantly smaller size (e.g., 7x smaller than SDXL, 14x smaller than IF-XL).
Poster
Runtao Liu · Haoyu Wu · Zheng Ziqiang · Chen Wei · Yingqing He · Renjie Pi · Qifeng Chen

[ ExHall D ]

Abstract
Recent progress in generative diffusion models has greatly advanced text-to-video generation. While text-to-video models trained on large-scale, diverse datasets can produce varied outputs, these generations often deviate from user preferences, highlighting the need for preference alignment on pre-trained models. Although Direct Preference Optimization (DPO) has demonstrated significant improvements in language and image generation, we pioneer its adaptation to video diffusion models and propose a VideoDPO pipeline by making several key adjustments. Unlike previous image alignment methods that focus solely on either (i) visual quality or (ii) semantic alignment between text and videos, we comprehensively consider both dimensions and construct a preference score accordingly, which we term the OmniScore. We design a pipeline to automatically collect preference pair data based on the proposed OmniScore and discover that re-weighting these pairs based on the score significantly impacts overall preference alignment. Our experiments demonstrate substantial improvements in both visual quality and semantic alignment, ensuring that no preference aspect is neglected.
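A hedged sketch of the kind of objective this implies: a DPO-style loss over per-sample denoising errors of preferred and rejected videos, with a per-pair weight standing in for a preference-score (e.g. OmniScore) gap. The exact form of VideoDPO's loss and re-weighting is not reproduced here; the beta value and error inputs are placeholders.

```python
import torch
import torch.nn.functional as F

def weighted_diffusion_dpo_loss(err_w, err_l, ref_err_w, ref_err_l,
                                pair_weight, beta: float = 5.0):
    """DPO-style objective on per-sample denoising errors of preferred (w) and
    rejected (l) videos, with a per-pair weight (e.g. from a preference-score gap).
    All inputs are 1-D tensors of the same length."""
    margin = (err_w - ref_err_w) - (err_l - ref_err_l)
    per_pair = -F.logsigmoid(-beta * margin)
    return (pair_weight * per_pair).sum() / pair_weight.sum()

# Toy usage with synthetic errors: pairs with a larger score gap get more weight.
n = 8
err_w, err_l = torch.rand(n) * 0.1, torch.rand(n) * 0.1 + 0.05
ref_err_w, ref_err_l = torch.full((n,), 0.08), torch.full((n,), 0.08)
score_gap = torch.rand(n)                       # stand-in for an OmniScore difference
pair_weight = score_gap / score_gap.sum()
print(weighted_diffusion_dpo_loss(err_w, err_l, ref_err_w, ref_err_l, pair_weight).item())
```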
Poster
Meihua Dang · Anikait Singh · Linqi Zhou · Stefano Ermon · Jiaming Song

[ ExHall D ]

Abstract
RLHF techniques like DPO can significantly improve the generation quality of text-to-image diffusion models. However, these methods optimize for a single reward that aligns model generation with population-level preferences, neglecting the nuances of individual users' beliefs or values. This lack of personalization limits the efficacy of these models. To bridge this gap, we introduce PPD, a multi-reward optimization objective that aligns diffusion models with personalized preferences. With PPD, a diffusion model learns the individual preferences of a population of users in a few-shot way, enabling generalization to unseen users. Specifically, our approach (1) leverages a vision-language model (VLM) to extract personal preference embeddings from a small set of pairwise preference examples, and then (2) incorporates the embeddings into diffusion models through cross attention. Conditioning on user embeddings, the text-to-image models are fine-tuned with the DPO objective, simultaneously optimizing for alignment with the preferences of multiple users. Empirical results demonstrate that our method effectively optimizes for multiple reward functions and can interpolate between them during inference. In real-world user scenarios, with as few as four preference examples from a new user, our approach achieves an average win rate of 76% over Stable Cascade, generating images that more accurately reflect specific user …
Poster
Jeeyung Kim · Erfan Esmaeili Fakhabi · Qiang Qiu

[ ExHall D ]

Abstract
In text-to-image diffusion models, the cross-attention map of each text token indicates the specific image regions attended. Comparing these maps of syntactically related tokens provides insights into how well the generated image reflects the text prompt. For example, in the prompt, “a black car and a white clock”, the cross-attention maps for “black” and “car” should focus on overlapping regions to depict a black car, while “car” and “clock” should not. Incorrect overlapping in the maps generally produces generation flaws such as missing objects and incorrect attribute binding. Our study makes two key observations when investigating this issue in existing text-to-image models: (1) the similarity in text embeddings between different tokens, which are used as conditioning inputs, can cause their cross-attention maps to focus on the same image regions; and (2) text embeddings often fail to faithfully capture the syntactic relations already present within text attention maps. As a result, such syntactic relationships can be overlooked in the cross-attention module, leading to inaccurate image generation. To address this, we propose a method that directly transfers syntactic relations from the text attention maps to the cross-attention module via a test-time optimization. Our approach leverages this inherent yet unexploited information within text attention maps to enhance image-text semantic alignment across …
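One way to picture the test-time optimization is an objective that rewards overlap between the cross-attention maps of syntactically related tokens and penalizes overlap for unrelated ones; a cosine-similarity sketch follows. The pair lists, the similarity measure, and where the gradient is applied are illustrative assumptions, not the paper's method.

```python
import torch
import torch.nn.functional as F

def attention_overlap(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between two (H, W) cross-attention maps."""
    return F.cosine_similarity(a.flatten(), b.flatten(), dim=0)

def syntactic_alignment_loss(attn_maps: torch.Tensor,
                             related: list, unrelated: list) -> torch.Tensor:
    """attn_maps: (num_tokens, H, W) cross-attention maps for one prompt.
    related / unrelated: lists of (i, j) token-index pairs derived from a parse,
    e.g. ("black", "car") is related while ("car", "clock") is not."""
    loss = attn_maps.new_zeros(())
    for i, j in related:
        loss = loss + (1 - attention_overlap(attn_maps[i], attn_maps[j]))
    for i, j in unrelated:
        loss = loss + attention_overlap(attn_maps[i], attn_maps[j])
    return loss

# Toy usage: in a real sampler this loss would be minimized w.r.t. the latent
# (or attention inputs) at selected denoising steps.
maps = torch.rand(4, 16, 16, requires_grad=True)     # e.g. tokens: a, black, car, clock
loss = syntactic_alignment_loss(maps, related=[(1, 2)], unrelated=[(2, 3)])
loss.backward()
print("loss:", loss.item(), "grad norm:", maps.grad.norm().item())
```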
Poster
SeungJu Cha · Kwanyoung Lee · Ye-Chan Kim · Hyunwoo Oh · Dong-Jin Kim

[ ExHall D ]

Abstract
Recent large-scale text-to-image diffusion models generate photorealistic images but often struggle to accurately depict interactions between humans and objects due to their limited ability to differentiate various interaction words. In this work, we propose VerbDiff to address the challenge of capturing nuanced interactions within text-to-image diffusion models. VerbDiff is a novel text-to-image generation model that weakens the bias between interaction words and objects, enhancing the understanding of interactions. Specifically, we disentangle various interaction words from frequency-based anchor words and leverage localized interaction regions from generated images to help the model better capture semantics in distinctive words without extra conditions. Our approach enables the model to accurately understand the intended interaction between humans and objects, producing high-quality images with accurate interactions aligned with specified verbs. Extensive experiments on the HICO-DET dataset demonstrate the effectiveness of our method compared to previous approaches.
Poster
Shuailei Ma · Kecheng Zheng · Ying Wei · Wei Wu · Fan Lu · Yifei Zhang · Chen-Wei Xie · Biao Gong · Jiapeng Zhu · Yujun Shen

[ ExHall D ]

Abstract
Although text-to-image (T2I) models have recently thrived as visual generative priors, their reliance on high-quality text-image pairs makes scaling up expensive. We argue that grasping the cross-modality alignment is not a necessity for a sound visual generative prior, whose focus should be on texture modeling. Such a philosophy inspires us to study image-to-image (I2I) generation, where models can learn from in-the-wild images in a self-supervised manner. We first develop a pure vision-based training framework, Lumos, and confirm the feasibility and the scalability of learning I2I models. We then find that, as an upstream task of T2I, our I2I model serves as a more foundational visual prior and achieves on-par or better performance than existing T2I models using only 1/10 text-image pairs for fine-tuning. We further demonstrate the superiority of I2I priors over T2I priors on some text-irrelevant vision tasks, like image-to-3D and image-to-video. The code and the models will be made public.
Poster
Gianni Franchi · Nacim Belkhir · Dat NGUYEN · Guoxuan Xia · Andrea Pilzer

[ ExHall D ]

Abstract
Uncertainty quantification in text-to-image (T2I) generative models is crucial for understanding model behavior and improving output reliability. In this paper, we are the first to quantify and evaluate the uncertainty of T2I models with respect to the prompt. Alongside adapting existing approaches designed to measure uncertainty in the image space, we also introduce Prompt-based UNCertainty Estimation for T2I models (PUNC), a novel method leveraging Large Vision-Language Models (LVLMs) to better address uncertainties arising from the semantics of the prompt and generated images. PUNC utilizes an LVLM to caption a generated image, and then compares the caption with the original prompt in the more semantically meaningful text space. PUNC also enables the disentanglement of both aleatoric and epistemic uncertainties via precision and recall, which image-space approaches are unable to do. Extensive experiments demonstrate that PUNC outperforms state-of-the-art uncertainty estimation techniques across various settings. Uncertainty quantification in text-to-image generation models can be used in various applications, including bias detection, copyright protection, and OOD detection. We also introduce a comprehensive dataset of text prompts and generation pairs to foster further research in uncertainty quantification for generative models. Our findings illustrate that PUNC not only achieves competitive performance but also enables novel applications in …
Poster
Wei Chen · Lin Li · Yongqi Yang · Bin Wen · Fan Yang · Tingting Gao · Yu Wu · Long Chen

[ ExHall D ]

Abstract
Interleaved image-text generation has emerged as a crucial multimodal task, aiming at creating sequences of interleaved visual and textual content given a query. Despite notable advancements in recent multimodal large language models (MLLMs), generating integrated image-text sequences that exhibit narrative coherence and entity and style consistency remains challenging due to poor training data quality. To address this gap, we introduce CoMM, a high-quality Coherent interleaved image-text MultiModal dataset designed to enhance the coherence, consistency, and alignment of generated multimodal content. Initially, CoMM harnesses raw data from diverse sources, focusing on instructional content and visual storytelling, establishing a foundation for coherent and consistent content. To further refine the data quality, we devise a multi-perspective filter strategy that leverages advanced pre-trained models to ensure the development of sentences, consistency of inserted images, and semantic alignment between them. Various quality evaluation metrics are designed to prove the high quality of the filtered dataset. Meanwhile, extensive few-shot experiments on various downstream tasks demonstrate CoMM's effectiveness in significantly enhancing the in-context learning capabilities of MLLMs. Moreover, we propose four new tasks to evaluate MLLMs' interleaved generation abilities, supported by a comprehensive evaluation framework. We believe CoMM opens a new avenue for advanced MLLMs with superior …
Poster
Yifan Gao · Zihang Lin · Chuanbin Liu · Min Zhou · Tiezheng Ge · Bo Zheng · Hongtao Xie

[ ExHall D ]

Abstract
Product posters, which integrate subject, scene, and text, are crucial promotional tools for attracting customers. Creating such posters using modern image generation methods is valuable, while the main challenge lies in accurately rendering text, especially for complex writing systems like Chinese, which contains over 10,000 individual characters. In this work, we identify the key to precise text rendering as constructing a character-discriminative visual feature as a control signal. Based on this insight, we propose a robust character-wise representation as control and we develop TextRenderNet, which achieves a high text rendering accuracy of over 90%. Another challenge in poster generation is maintaining the fidelity of user-specific products. We address this by introducing SceneGenNet, an inpainting-based model, and propose subject fidelity feedback learning to further enhance fidelity. Based on TextRenderNet and SceneGenNet, we present PosterMaker, an end-to-end generation framework. To optimize PosterMaker efficiently, we implement a two-stage training strategy that decouples text rendering and background generation learning. Experimental results show that PosterMaker outperforms existing baselines by a remarkable margin, which demonstrates its effectiveness.
Poster
Gemma Canet Tarrés · Zhe Lin · Zhifei Zhang · He Zhang · Andrew Gilbert · John Collomosse · Soo Ye Kim

[ ExHall D ]

Abstract
We introduce the first generative model capable of simultaneous multi-object compositing, guided by both text and layout. Our model allows for the addition of multiple objects within a scene, capturing a range of interactions from simple positional relations (e.g., next to, in front of) to complex actions requiring reposing (e.g., hugging, playing guitar). When an interaction implies additional props, like 'taking a selfie', our model autonomously generates these supporting objects. By jointly training for compositing and subject-driven generation, also known as customization, we achieve a more balanced integration of textual and visual inputs for text-driven object compositing. As a result, we obtain a versatile model with state-of-the-art performance in both tasks. We further present a data generation pipeline leveraging visual and language models to effortlessly synthesize multimodal, aligned training data.
Poster
Sankalp Sinha · Mohammad Sadil Khan · Muhammad Usama · Shino Sam · Didier Stricker · Sk Aziz Ali · Muhammad Zeshan Afzal

[ ExHall D ]

Abstract
Generating high-fidelity 3D content from text prompts remains a significant challenge in computer vision due to the limited size, diversity, and annotation depth of the existing datasets. To address this, we introduce MARVEL-40M+, an extensive dataset with 40 million text annotations for over 8.9 million 3D assets aggregated from seven major 3D datasets. Our contribution is a novel multi-stage annotation pipeline that integrates open-source pretrained multi-view VLMs and LLMs to automatically produce multi-level descriptions, ranging from detailed (150-200 words) to concise semantic tags (10-20 words). This structure supports both fine-grained 3D reconstruction and rapid prototyping. Furthermore, we incorporate human metadata from source datasets into our annotation pipeline to add domain-specific information in our annotation and reduce VLM hallucinations. Additionally, we develop MARVEL-FX3D, a two-stage text-to-3D pipeline. We fine-tune Stable Diffusion with our annotations and use a pretrained image-to-3D network to generate 3D textured meshes within 15s. Extensive evaluations show that MARVEL-40M+ significantly outperforms existing datasets in annotation quality and linguistic diversity, achieving win rates of 72.41% by GPT-4 and 73.40% by human evaluators.
Poster
HsiaoYuan Hsu · Yuxin Peng

[ ExHall D ]

Abstract
In poster design, content-aware layout generation is crucial for automatically arranging visual-textual elements on the given image. With limited training data, existing work has focused on image-centric enhancement. However, this neglects the diversity of layouts and fails to cope with shape-variant elements or diverse design intents in generalized settings. To this end, we propose a layout-centric approach that leverages the layout knowledge implicit in large language models (LLMs) to create posters for omnifarious purposes, hence the name PosterO. Specifically, it structures layouts from datasets as trees in SVG language by universal shape, design intent vectorization, and hierarchical node representation. Then, it applies LLMs during inference to predict new layout trees by in-context learning with intent-aligned example selection. After layout trees are generated, we can seamlessly realize them into poster designs by editing the chat with LLMs. Extensive experimental results have demonstrated that PosterO can generate visually appealing layouts for given images, achieving new state-of-the-art performance across various benchmarks. To further explore PosterO's abilities under generalized settings, we built PStylish7, the first dataset with multi-purpose posters and various-shaped elements, further offering a challenging test for advanced research. Code and dataset will be publicly available upon acceptance.
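To illustrate the "layout as an SVG tree" representation, the sketch below serializes a hierarchy of layout elements into an SVG string that an LLM could consume as an in-context example or emit as a prediction. The node schema and attribute names are illustrative assumptions, not PosterO's actual format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class LayoutNode:
    """One layout element; children give the hierarchical (tree) structure."""
    kind: str                 # e.g. "underlay", "title", "text", "logo"
    x: int
    y: int
    w: int
    h: int
    children: List["LayoutNode"] = field(default_factory=list)

def node_to_svg(node: LayoutNode, indent: int = 1) -> str:
    pad = "  " * indent
    inner = "".join(node_to_svg(c, indent + 1) for c in node.children)
    return (f'{pad}<rect class="{node.kind}" x="{node.x}" y="{node.y}" '
            f'width="{node.w}" height="{node.h}"/>\n{inner}')

def layout_to_svg(nodes: List[LayoutNode], canvas=(512, 768)) -> str:
    body = "".join(node_to_svg(n) for n in nodes)
    return f'<svg width="{canvas[0]}" height="{canvas[1]}">\n{body}</svg>'

# A layout tree serialized this way can serve both as an in-context example
# and as the output format requested from the LLM for a new poster image.
layout = [LayoutNode("underlay", 32, 540, 448, 180, children=[
    LayoutNode("title", 48, 560, 416, 60),
    LayoutNode("text", 48, 640, 416, 48),
])]
print(layout_to_svg(layout))
```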
Poster
Jiawei Lin · Shizhao Sun · Danqing Huang · Ting Liu · Ji Li · Jiang Bian

[ ExHall D ]

Abstract
In this work, we investigate automatic design composition from multimodal graphic elements. Although recent studies have developed various generative models for graphic design, they usually face the following limitations: they only focus on certain subtasks and are far from achieving the design composition task; they do not consider the hierarchical information of graphic designs during the generation process. To tackle these issues, we introduce the layered design principle into Large Multimodal Models (LMMs) and propose a novel approach, called LaDeCo, to accomplish this challenging task. Specifically, LaDeCo first performs layer planning for a given element set, dividing the input elements into different semantic layers according to their contents. Based on the planning results, it subsequently predicts element attributes that control the design composition in a layer-wise manner, and includes the rendered image of previously generated layers into the context. With this insightful design, LaDeCo decomposes the difficult task into smaller manageable steps, making the generation process smoother and clearer. The experimental results demonstrate the effectiveness of LaDeCo in design composition. Furthermore, we show that LaDeCo enables some interesting applications in graphic design, such as resolution adjustment, element filling, design variation, etc. In addition, it even outperforms the specialized models in some design subtasks without any task-specific training.
Poster
Kiyohiro Nakayama · Jan Ackermann · Timur Levent Kesdogan · Yang Zheng · Maria Korosteleva · Olga Sorkine-Hornung · Leonidas Guibas · Guandao Yang · Gordon Wetzstein

[ ExHall D ]

Abstract
Apparel is essential to human life, offering protection, mirroring cultural identities, and showcasing personal style. Yet, the creation of garments remains a time-consuming process, largely due to the manual work involved in designing them. To simplify this process, we introduce AIpparel, a large multimodal model for generating and editing sewing patterns. Our model fine-tunes state-of-the-art large multimodal models (LMMs) on a custom-curated large-scale dataset of over 120,000 unique garments, each with multimodal annotations including text, images, and sewing patterns. Additionally, we propose a novel tokenization scheme that concisely encodes these complex sewing patterns so that LLMs can learn to predict them efficiently. AIpparel achieves state-of-the-art performance in single-modal tasks, including text-to-garment and image-to-garment prediction, and it enables novel multimodal garment generation applications such as interactive garment editing.
Poster
Jing Lin · Yao Feng · Weiyang Liu · Michael J. Black

[ ExHall D ]

Abstract
Numerous methods have been proposed to detect, estimate, and analyze properties of people in images, including the estimation of 3D pose, shape, contact, human-object interaction, emotion, and more. While widely applicable in vision and other areas, such methods require expert knowledge to select, use, and interpret the results. To address this, we introduce ChatHuman, a language-driven system that combines and integrates the skills of these specialized methods. ChatHuman functions as an assistant proficient in utilizing, analyzing, and interacting with tools specific to 3D human tasks, adeptly discussing and resolving related challenges. Built on a Large Language Model (LLM) framework, ChatHuman is trained to autonomously select, apply, and interpret a diverse set of tools in response to user inputs. Adapting LLMs to 3D human tasks presents challenges, including the need for domain-specific knowledge and the ability to interpret complex 3D outputs. The innovative features of ChatHuman include leveraging academic publications to instruct the LLM how to use the tools, employing a retrieval-augmented generation model to create in-context learning examples for managing new tools, and effectively discriminating and integrating tool results to enhance tasks related to 3D humans by transforming specialized 3D outputs into comprehensible formats. Our experiments demonstrate that ChatHuman surpasses …
Poster
Akshay R. Kulkarni · Ge Yan · Chung-En Sun · Tuomas Oikarinen · Tsui-Wei Weng

[ ExHall D ]

Abstract
Concept bottleneck models (CBM) aim to produce inherently interpretable models that rely on human-understandable concepts for their predictions. However, the existing approach to designing interpretable generative models based on CBMs is neither efficient nor scalable, as it requires expensive generative model training from scratch as well as real images with labor-intensive concept supervision. To address these challenges, we present two novel and low-cost methods to build interpretable generative models through post-hoc interpretability, and we name them concept-bottleneck autoencoder (CB-AE) and concept controller (CC). Our approach enables efficient and scalable training by using generated images, and our method can work with minimal to no concept supervision. Our proposed methods generalize across modern generative model families including generative adversarial networks and diffusion models. We demonstrate the superior interpretability and steerability of our methods on numerous standard datasets like CelebA, CelebA-HQ, and CUB with large improvements (25% on average) over the prior work, while being 4-15× faster to train. Finally, we also perform a large-scale user study to validate the interpretability and steerability of our methods.
Poster
lingyun zhang · Yu Xie · Yanwei Fu · Ping Chen

[ ExHall D ]

Abstract
As large-scale diffusion models continue to advance, they excel at producing high-quality images but often generate unwanted content, such as sexually explicit or violent imagery. Existing methods for concept removal generally guide the image generation process but can unintentionally modify unrelated regions, leading to inconsistencies with the original model. We propose a novel approach for targeted concept replacement in diffusion models, enabling specific concepts to be removed without affecting non-target areas. Our method introduces a dedicated concept localizer for precisely identifying the target concept during the denoising process, trained with few-shot learning to require minimal labeled data. Within the identified region, we introduce a training-free Dual Prompts Cross-Attention (DPCA) module to substitute the target concept, ensuring minimal disruption to surrounding content. We evaluate our method on concept localization precision and replacement efficiency. Experimental results demonstrate that our method achieves superior precision in localizing target concepts and performs coherent concept replacement with minimal impact on non-target areas, outperforming existing approaches.
Poster
Chen Chen · Daochang Liu · Mubarak Shah · Chang Xu

[ ExHall D ]

Abstract
Text-to-image diffusion models have demonstrated remarkable capabilities in creating images highly aligned with user prompts, yet their proclivity for memorizing training set images has sparked concerns about the originality of the generated images and privacy issues, potentially leading to legal complications for both model owners and users, particularly when the memorized images contain proprietary content. Although methods to mitigate these issues have been suggested, enhancing privacy often results in a significant decrease in the utility of the outputs, as indicated by text-alignment scores. To bridge the research gap, we introduce a novel method, PRSS, which refines the classifier-free guidance approach in diffusion models by integrating prompt re-anchoring (PR) to improve privacy and incorporating semantic prompt search (SS) to enhance utility. Extensive experiments across various privacy levels demonstrate that our approach consistently improves the privacy-utility trade-off, establishing a new state-of-the-art.
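For readers unfamiliar with the guidance mechanism this abstract builds on, the sketch below shows the standard classifier-free guidance combination of conditional and unconditional noise predictions. It is only illustrative background, not the PRSS method itself (the prompt re-anchoring and semantic prompt search refinements are not reproduced here), and it assumes PyTorch plus hypothetical noise tensors from a diffusion U-Net.

```python
import torch

def classifier_free_guidance(eps_cond: torch.Tensor,
                             eps_uncond: torch.Tensor,
                             guidance_scale: float = 7.5) -> torch.Tensor:
    # Standard CFG: push the prediction away from the unconditional estimate
    # and toward the prompt-conditioned one. PRSS (per the abstract) refines
    # this step; those refinements are not shown here.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Hypothetical usage with latent-space noise predictions of shape (B, C, H, W).
eps_c = torch.randn(1, 4, 64, 64)   # conditioned on the text prompt
eps_u = torch.randn(1, 4, 64, 64)   # unconditional (empty prompt)
eps = classifier_free_guidance(eps_c, eps_u)
```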
Poster
Yingdong Shi · Changming Li · Yifan Wang · Yongxiang Zhao · Anqi Pang · Sibei Yang · Jingyi Yu · Kan Ren

[ ExHall D ]

Abstract
Diffusion models have demonstrated impressive capabilities in synthesizing diverse content. However, despite their high-quality outputs, these models often perpetuate social biases, including those related to gender and race. These biases can potentially contribute to harmful real-world consequences, reinforcing stereotypes and exacerbating inequalities in various social contexts. While existing research on diffusion bias mitigation has predominantly focused on guiding content generation, it often neglects the intrinsic mechanisms within diffusion models that causally drive biased outputs. In this paper, we investigate the internal processes of diffusion models, identifying specific decision-making mechanisms, termed bias features, embedded within the model architecture. By directly manipulating these features, our method precisely isolates and adjusts the elements responsible for bias generation, permitting granular control over the bias levels in the generated content. Through experiments on both unconditional and conditional diffusion models across various social bias attributes, we demonstrate our method's efficacy in managing generation distribution while preserving image quality. We also dissect the discovered model mechanism, revealing different intrinsic features controlling fine-grained aspects of generation, boosting further research on mechanistic interpretability of diffusion models.
Poster
Sangwon Jang · June Suk Choi · Jaehyeong Jo · Kimin Lee · Sung Ju Hwang

[ ExHall D ]

Abstract
Text-to-image diffusion models have achieved remarkable success in generating high-quality content from text prompts. However, their reliance on publicly available data and the growing trend of data sharing for fine-tuning make these models particularly vulnerable to data poisoning attacks. In this work, we introduce the Silent Branding Attack, a novel data poisoning method that manipulates text-to-image diffusion models to generate images containing specific brand logos or symbols without any text triggers. We find that when certain visual patterns appear repeatedly in the training data, the model learns to reproduce them naturally in its outputs, even when the prompts do not mention them. Leveraging this, we develop an automated data poisoning algorithm that unobtrusively injects logos into original images, ensuring they blend naturally and remain undetected. Models trained on this poisoned dataset generate images containing logos without degrading image quality or text alignment. We experimentally validate our silent branding attack across two realistic settings on large-scale high-quality image datasets and style personalization datasets, achieving high success rates even without a specific text trigger. Human evaluation and quantitative metrics including logo detection show that our method can stealthily embed logos.
Poster
Zilan Wang · Junfeng Guo · Jiacheng Zhu · Yiming Li · Heng Huang · Muhao Chen · Zhengzhong Tu

[ ExHall D ]

Abstract
Recent advances in large-scale text-to-image (T2I) diffusion models have enabled a variety of downstream applications, including style customization, subject-driven personalization, and conditional generation. As T2I models require extensive data and computational resources for training, they constitute highly valued intellectual property (IP) for their legitimate owners, yet this also makes them attractive targets for unauthorized fine-tuning by adversaries seeking to leverage these models for customized, usually profitable applications. Existing IP protection methods for diffusion models generally involve embedding watermark patterns and then verifying ownership through examination of generated outputs, or inspecting the model’s feature space. However, these techniques are inherently ineffective in practical scenarios when the watermarked model undergoes fine-tuning, and the feature space is inaccessible during verification (i.e., the black-box setting). The model is prone to forgetting the previously learned watermark knowledge when it adapts to a new task. To address this challenge, we propose SleeperMark, a novel framework designed to embed resilient watermarks into T2I diffusion models. SleeperMark explicitly guides the model to disentangle the watermark information from the semantic concepts it learns, allowing the model to retain the embedded watermark while continuing to be fine-tuned to new downstream tasks. Our extensive experiments demonstrate the effectiveness of SleeperMark across various types of diffusion …
Poster
Gaozhi Liu · Silu Cao · Zhenxing Qian · Xinpeng Zhang · Sheng Li · Wanli Peng

[ ExHall D ]

Abstract
The proliferation of digital images on the Internet has provided unprecedented convenience, but also poses significant risks of malicious theft and misuse. Digital watermarking has long been researched as an effective tool for copyright protection. However, it often falls short when addressing partial image theft, a common yet little-researched issue in practical applications. Most existing schemes typically require the entire image as input to extract watermarks. However, in practice, malicious users often steal only a portion of the image to create new content. The stolen portion can have arbitrary shape or content, be fused with a new background, and may have undergone geometric transformations, making it challenging for current methods to extract correctly. To address the issues above, we propose WOFA (Watermarking One for All), a robust watermarking scheme against partial image theft. First of all, we define the entire process of partial image theft and construct a dataset accordingly. To gain robustness against partial image theft, we then design a comprehensive distortion layer that incorporates the process of partial image theft and several common distortions in the channel. For easier network convergence, we employ a multi-level network structure on the basis of the commonly used embedder-distortion layer-extractor architecture and adopt …
Poster
Ali Salar · Qing Liu · Yingli Tian · Guoying Zhao

[ ExHall D ]

Abstract
The rapid growth of social media has led to the widespread sharing of individual portrait images, which pose serious privacy risks due to the capabilities of automatic face recognition (AFR) systems for mass surveillance. Hence, protecting facial privacy against unauthorized AFR systems is essential. Inspired by the generation capability of the emerging diffusion models, recent methods employ diffusion models to generate adversarial face images for privacy protection. However, they suffer from the diffusion purification effect, leading to a low protection success rate (PSR). In this paper, we first propose learning unconditional embeddings to increase the learning capacity for adversarial modifications and then use them to guide the modification of the adversarial latent code to weaken the diffusion purification effect. Moreover, we integrate an identity-preserving structure to maintain structural consistency between the original and generated images, allowing human observers to recognize the generated image as having the same identity as the original. Extensive experiments conducted on two public datasets, i.e., CelebA-HQ and LADN, demonstrate the superiority of our approach. The protected faces generated by our method outperform those produced by existing facial privacy protection approaches in terms of transferability and natural appearance. The code is available at https://anonymous.4open.science/r/CVPR_2025-EACE.
Poster
Jeongsoo Park · Andrew Owens

[ ExHall D ]

Abstract
One of the key challenges of detecting AI-generated images is spotting images that have been created by previously unseen generative models. We argue that the limited diversity of the training data is a major obstacle to addressing this problem, and we propose a new dataset that is significantly larger and more diverse than prior works. As part of creating this dataset, we systematically download thousands of text-to-image latent diffusion models and sample images from them. We also collect images from dozens of popular open source and commercial models. The resulting dataset contains 2.7M images that have been sampled from 4803 different models. These images collectively capture a wide range of scene content, generator architectures, and image processing settings. Using this dataset, we study the generalization abilities of fake image detectors. Our experiments suggest that detection performance improves as the number of models in the training set increases, even when these models have similar architectures. We also find that increasing the diversity of the models improves detection performance, and that our trained detectors generalize better than those trained on other datasets.
Poster
Nan Zhong · Haoyu Chen · Yiran Xu · Zhenxing Qian · Xinpeng Zhang

[ ExHall D ]

Abstract
The prevalence of AI-generated images has evoked concerns regarding the potential misuse of image generation technologies. In response, numerous detection methods aim to identify AI-generated images by analyzing generative artifacts. Unfortunately, most detectors quickly become obsolete with the development of generative models. In this paper, we first design a low-level feature extractor that transforms spatial images into a feature space, where different source images exhibit distinct distributions. The pretext task for the feature extractor is to distinguish between images that differ only at the pixel level. This image set comprises the original image as well as versions that have been subjected to varying levels of noise and subsequently denoised using a pre-trained diffusion model. We employ the diffusion model as a denoising tool rather than an image generation tool. Then, we frame the AI-generated image detection task as a one-class classification problem. We estimate the low-level intrinsic feature distribution of real photographic images and identify features that deviate from this distribution as indicators of AI-generated images. We evaluate our method against over 20 different generative models, including those represented in the GenImage and DRCT-2M datasets. Extensive experiments demonstrate its effectiveness on AI-generated images produced not only by diffusion models but also by GANs, flow-based models, and …
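As a rough illustration of the one-class framing described above (not the paper's actual feature extractor or density model, which the abstract does not specify), one could fit a Gaussian to low-level features of real photographs and flag large Mahalanobis distances as likely AI-generated; the feature dimensionality and threshold below are placeholders.

```python
import numpy as np

def fit_real_distribution(real_feats: np.ndarray):
    """Fit a Gaussian to features extracted from real photographs."""
    mu = real_feats.mean(axis=0)
    cov = np.cov(real_feats, rowvar=False) + 1e-6 * np.eye(real_feats.shape[1])
    return mu, np.linalg.inv(cov)

def anomaly_score(feat: np.ndarray, mu: np.ndarray, cov_inv: np.ndarray) -> float:
    """Mahalanobis distance; larger values deviate more from the real-image distribution."""
    d = feat - mu
    return float(np.sqrt(d @ cov_inv @ d))

# Hypothetical usage with 32-dimensional low-level features.
real_feats = np.random.randn(1000, 32)           # stand-in for real-photo features
mu, cov_inv = fit_real_distribution(real_feats)
score = anomaly_score(np.random.randn(32), mu, cov_inv)
is_generated = score > 10.0                       # threshold tuned on validation data
```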
Poster
Jingwei Zhang · Mohammad Jalali · Cheuk Ting Li · Farzan Farnia

[ ExHall D ]

Abstract
An interpretable comparison of generative models requires the identification of sample types produced differently by each of the involved models. While several quantitative scores have been proposed in the literature to rank different generative models, such score-based evaluations do not reveal the nuanced differences between the generative models in capturing various sample types. In this work, we attempt to solve a differential clustering problem to detect sample types expressed differently by two generative models. To solve the differential clustering problem, we propose a method called Fourier-based Identification of Novel Clusters (FINC) to identify modes produced by a generative model with a higher frequency in comparison to a reference distribution. FINC provides a scalable stochastic algorithm based on random Fourier features to estimate the eigenspace of kernel covariance matrices of two generative models and utilize the principal eigendirections to detect the sample types present more dominantly in each model. We demonstrate the application of the FINC method to large-scale computer vision datasets and generative model frameworks. Our numerical results suggest the scalability of the developed Fourier-based method in highlighting the sample types produced with different frequencies by widely-used generative models.
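The sketch below illustrates the general idea of comparing two generative models through random Fourier features and the spectrum of their (approximate) kernel covariance difference; it is a simplified stand-in under assumed Gaussian-kernel features and pooled image embeddings, not the FINC algorithm as published.

```python
import numpy as np

def random_fourier_features(x: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Map samples to random Fourier features approximating a Gaussian kernel."""
    D = W.shape[1]
    return np.sqrt(2.0 / D) * np.cos(x @ W + b)

def differential_eigendirections(feats_test: np.ndarray, feats_ref: np.ndarray, k: int = 5):
    """Top eigendirections of the covariance difference; directions with large
    positive eigenvalues correspond to sample types produced more frequently
    by the test model than by the reference."""
    C_test = feats_test.T @ feats_test / len(feats_test)
    C_ref = feats_ref.T @ feats_ref / len(feats_ref)
    evals, evecs = np.linalg.eigh(C_test - C_ref)
    order = np.argsort(evals)[::-1][:k]
    return evals[order], evecs[:, order]

# Hypothetical usage with 64-d image embeddings from two generative models.
rng = np.random.default_rng(0)
W = rng.normal(size=(64, 512))                   # kernel bandwidth folded into W
b = rng.uniform(0, 2 * np.pi, size=512)
feats_a = random_fourier_features(rng.normal(size=(2000, 64)), W, b)
feats_b = random_fourier_features(rng.normal(size=(2000, 64)), W, b)
vals, dirs = differential_eigendirections(feats_a, feats_b)
```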
Poster
Otto Brookes · Maksim Kukushkin · Majid Mirmehdi · Colleen Stephens · Paula Dieguez · Thurston Cleveland Hicks · Sorrel CZ Jones · Kevin C. Lee · Maureen S. McCarthy · Amelia C. Meier · NORMAND E. · Erin G. Wessling · Roman M. Wittig · Kevin Langergraber · Klaus Zuberbühler · Lukas Boesch · Thomas Schmid · Mimi Arandjelovic · Hjalmar S. Kühl · Tilo Burghardt

[ ExHall D ]

Abstract
Computer vision analysis of camera trap video footage is essential for wildlife conservation, as captured behaviours offer some of the earliest indicators of changes in population health. Recently, several high-impact animal behaviour datasets and methods have been introduced to encourage their use; however, the role of behaviour-correlated background information and its significant effect on out-of-distribution generalisation remain unexplored. In response, we present the PanAf-FGBG dataset, featuring 20 hours of wild chimpanzee behaviours, recorded at over 350 individual camera locations. Uniquely, it pairs every video containing a chimpanzee (referred to as a foreground video) with a corresponding background video (with no chimpanzee) from the same camera location. We present two views of the dataset: one with overlapping camera locations and one with disjoint locations. This setup enables, for the first time, direct evaluation under in-distribution and out-of-distribution conditions, and allows the impact of backgrounds on behaviour recognition models to be quantified. All clips come with rich behavioural annotations and metadata including unique camera IDs and detailed textual scene descriptions. Additionally, we establish several baselines and present a highly effective latent-space normalisation technique that boosts out-of-distribution performance by +5.42% mAP for convolutional and +3.75% mAP for transformer-based models. Finally, we provide an …
Poster
Dhananjaya Jayasundara · Heng Zhao · Demetrio Labate · Vishal M. Patel

[ ExHall D ]

Abstract
Implicit Neural Representations (INRs) are continuous function learners for conventional digital signal representations. With the aid of positional embeddings and/or exhaustively fine-tuned activation functions, INRs have surpassed many limitations of traditional discrete representations. However, existing works find a continuous representation for the digital signal using only a single, fixed activation function throughout the INR, and matching the INR to the given signal has not yet been explored. As current INRs are not matched to the signal they represent, we hypothesize that this approach could restrict the representation power and generalization capabilities of INRs, limiting their broader applicability. One way to match the INR to the signal being represented is to match the activation function of each layer so as to minimize the mean squared loss. In this paper, we introduce MIRE, a method to find the highly matched activation function for each layer of the INR through dictionary learning. To showcase the effectiveness of the proposed method, we utilize a dictionary that includes seven activation atoms: Raised Cosines (RC), Root Raised Cosines (RRC), Prolate Spheroidal Wave Function (PSWF), Sinc, Gabor Wavelet, Gaussian, and Sinusoidal. Experimental results demonstrate that MIRE not only significantly improves INR …
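To make the layer-wise matching idea concrete, here is a heavily simplified sketch: given a fixed linear readout, pick the activation atom whose output minimizes the mean squared reconstruction error. The atom list, shapes, and selection procedure are illustrative assumptions and only loosely mirror the dictionary-learning approach named in the abstract.

```python
import numpy as np

# A subset of candidate activation "atoms" (raised-cosine and PSWF atoms omitted).
ATOMS = {
    "sine":     lambda x: np.sin(30.0 * x),
    "gaussian": lambda x: np.exp(-(x ** 2) / 0.02),
    "gabor":    lambda x: np.exp(-(x ** 2) / 0.02) * np.cos(30.0 * x),
    "sinc":     np.sinc,
}

def select_activation(pre_act: np.ndarray, target: np.ndarray, readout: np.ndarray):
    """Return the atom whose activated features best reconstruct the target
    (in the MSE sense) under a fixed linear readout."""
    best_name, best_mse = None, np.inf
    for name, act in ATOMS.items():
        pred = act(pre_act) @ readout
        mse = float(np.mean((pred - target) ** 2))
        if mse < best_mse:
            best_name, best_mse = name, mse
    return best_name, best_mse

# Hypothetical usage with random pre-activations, a 1-D target signal, and a readout.
rng = np.random.default_rng(0)
name, mse = select_activation(rng.normal(size=(256, 64)),
                              rng.normal(size=(256, 1)),
                              rng.normal(size=(64, 1)))
```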
Poster
Rui Wang · Shaocheng Jin · Ziheng Chen · Xiaoqing Luo · Xiaojun Wu

[ ExHall D ]

Abstract
Covariance matrices have proven highly effective across many scientific fields. Since these matrices lie within the Symmetric Positive Definite (SPD) manifold, a Riemannian space with intrinsic non-Euclidean geometry, the primary challenge in representation learning is to respect this underlying geometric structure. Drawing inspiration from the success of Euclidean deep learning, researchers have developed neural networks on the SPD manifolds for more faithful covariance embedding learning. A notable advancement in this area is the implementation of Riemannian batch normalization (RBN), which has been shown to improve the performance of SPD network models. Nonetheless, the Riemannian metric beneath the existing RBN might fail to effectively deal with ill-conditioned SPD matrices (ICSM), undermining the effectiveness of RBN. In contrast, the Bures-Wasserstein metric (BWM) demonstrates superior performance for ill-conditioning. In addition, the recently introduced Generalized BWM (GBWM) parameterizes the vanilla BWM via an SPD matrix, allowing for a more nuanced representation of the vibrant geometries of the SPD manifold. Therefore, we propose a novel RBN algorithm based on the GBW geometry, incorporating a learnable metric parameter. Moreover, the deformation of GBWM by matrix power is also introduced to further enhance the representational capacity of GBWM-based RBN. Experimental results on different datasets validate the effectiveness of …
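For context, the vanilla Bures-Wasserstein distance between two SPD matrices that the abstract refers to is given below; the generalized (GBWM) and power-deformed variants the paper proposes are not reproduced here.

$$
d_{\mathrm{BW}}^{2}(A,B) \;=\; \operatorname{tr}(A) + \operatorname{tr}(B) - 2\,\operatorname{tr}\!\Big(\big(A^{1/2} B A^{1/2}\big)^{1/2}\Big), \qquad A, B \in \mathrm{SPD}(n).
$$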
Poster
Sara Al-Emadi · Yin Yang · Ferda Ofli

[ ExHall D ]

Abstract
Object detectors have achieved remarkable performance in many applications; however, these deep learning models are typically designed under the i.i.d. assumption, meaning they are trained and evaluated on data sampled from the same (source) distribution. In real-world deployment, however, target distributions often differ from source data, leading to substantial performance degradation. Domain Generalisation (DG) seeks to bridge this gap by enabling models to generalise to Out-Of-Distribution (OOD) data without access to target distributions during training, enhancing robustness to unseen conditions. In this work, we examine the generalisability and robustness of state-of-the-art object detectors under real-world distribution shifts, focusing particularly on spatial domain shifts. Despite the need, a standardised benchmark dataset specifically designed for assessing object detection under realistic DG scenarios is currently lacking. To address this, we introduce RWDS, a suite of three novel DG benchmarking datasets (available in the Supplementary Materials) that focus on humanitarian and climate change applications. These datasets allow investigation of domain shifts across (i) climate zones and (ii) various disasters and geographic regions. To our knowledge, these are the first DG benchmarking datasets tailored for object detection in real-world, high-impact contexts. We aim for these datasets to serve as valuable resources for …
Poster
Michele Mazzamuto · Antonino Furnari · Yoichi Sato · Giovanni Maria Farinella

[ ExHall D ]

Abstract
We address the challenge of unsupervised mistake detection in egocentric video of skilled human activities through the analysis of gaze signals. While traditional methods rely on manually labeled mistakes, our approach does not require mistake annotations, hence overcoming the need of domain-specific labeled data. Based on the observation that eye movements closely follow object manipulation activities, we assess to what extent eye-gaze signals can support mistake detection, proposing to identify deviations in attention patterns measured through a gaze tracker with respect to those estimated by a gaze prediction model. Since predicting gaze in video is characterized by high uncertainty, we propose a novel gaze completion task, where eye fixations are predicted from visual observations and partial gaze trajectories, and contribute a novel gaze completion approach which explicitly models correlations between gaze information and local visual tokens. Inconsistencies between predicted and observed gaze trajectories act as an indicator to identify mistakes. Experiments highlight the effectiveness of the proposed approach in different settings, with relative gains up to +14%, +11%, and +5% in EPIC-Tent, HoloAssist and IndustReal respectively, remarkably matching results of supervised approaches without seeing any labels. We further show that gaze-based analysis is particularly useful in the presence of skilled …
Poster
Changchang Sun · Gaowen Liu · Charles Fleming · Yan Yan

[ ExHall D ]

Abstract
Recently, conditional diffusion models have gained increasing attention due to their impressive results for cross-modal synthesis. Typically, existing methods aim to achieve strong alignment between the conditioning input and the generated output by training a time-conditioned U-Net augmented with a cross-attention mechanism. In this paper, we focus on the problem of generating music synchronized with rhythmic visual cues of the given dance video. Considering that bi-directional guidance is more beneficial for training a diffusion model, we propose to improve the quality of generated music and its synchronization with dance videos by adopting both positive and negative rhythmic information as conditions (PN-Diffusion), for which dual diffusion and reverse processes are devised. Different from existing dance-to-music diffusion models, PN-Diffusion consists of a noise prediction objective for positive conditioning and an additional noise prediction objective for negative conditioning to train a sequential multi-modal U-Net structure. To ensure the accurate definition and selection of negative conditioning, we ingeniously leverage temporal correlations between music and dance videos, where positive and negative rhythmic visual cues and motion information are captured by playing dance videos forward and backward, respectively. By subjectively and objectively evaluating input-output correspondence in terms of dance-music beats and the quality of generated music, experimental results on …
Poster
Mingfei Chen · Israel D. Gebru · Ishwarya Ananthabhotla · Christian Richardt · Dejan Markovic · Steven Krenn · Todd Keebler · Jacob Sandakly · Alexander Richard · Eli Shlizerman

[ ExHall D ]

Abstract
We introduce SoundVista, a method to generate the ambient sound of an arbitrary scene at novel viewpoints. Given a pre-acquired recording of the scene from sparsely distributed microphones, SoundVista can synthesize the sound of that scene from an unseen target viewpoint. The method learns the underlying acoustic transfer function that relates the signals acquired at the distributed microphones to the signal at the target viewpoint, using a limited number of known recordings. Unlike existing works, our method does not require constraints or prior knowledge of sound source details. Moreover, our method efficiently adapts to diverse room layouts, reference microphone configurations and unseen environments. To enable this, we introduce a visual-acoustic binding module that learns visual embeddings linked with local acoustic properties from panoramic RGB and depth data. We first leverage these embeddings to optimize the placement of reference microphones in any given scene. During synthesis, we leverage multiple embeddings extracted from reference locations to get adaptive weights for their contribution, conditioned on target viewpoint. We benchmark the task on both publicly available data and real-world settings. We demonstrate significant improvements over existing methods.
Poster
Sung Jin Um · Dongjin Kim · Sangmin Lee · Jung Uk Kim

[ ExHall D ]

Abstract
Sound source localization task aims to localize each region of sound-making objects in visual scenes. Recent methods, which rely on simple audio-visual correspondence, often struggle to accurately localize each object in complex environments, such as those with visually similar silent objects. To address these challenges, we propose a novel sound source localization framework, which incorporates detailed contextual information for fine-grained sound source localization. Our approach utilizes Multimodal Large Language Models (MLLMs) to generate detailed information through understanding audio-visual scenes. To effectively incorporate generated detail information, we propose two loss functions: Object-aware Contrastive Alignment (OCA) loss and Object Region Isolation (ORI) loss. By utilizing these losses, our method effectively performs precise localization through fine-grained audio-visual correspondence. Our extensive experiments on MUSIC and VGGSound datasets demonstrate significant improvements in both single- and multi-source sound localization. Our code and generated detail information will be made publicly available.
Poster
Shaofei Huang · Rui Ling · Tianrui Hui · Hongyu Li · Xu Zhou · Shifeng Zhang · Si Liu · Richang Hong · Meng Wang

[ ExHall D ]

Abstract
Audio-Visual Segmentation (AVS) aims to segment sound-producing objects in video frames based on the associated audio signal. Prevailing AVS methods typically adopt an audio-centric Transformer architecture, where object queries are derived from audio features. However, audio-centric Transformers suffer from two limitations: perception ambiguity caused by the mixed nature of audio, and weakened dense prediction ability due to visual detail loss. To address these limitations, we propose a new Vision-Centric Transformer (VCT) framework that leverages vision-derived queries to iteratively fetch corresponding audio and visual information, enabling queries to better distinguish between different sounding objects from mixed audio and accurately delineate their contours. Additionally, we also introduce a Prototype Prompted Query Generation (PPQG) module within our VCT framework to generate vision-derived queries that are both semantically aware and visually rich through audio prototype prompting and pixel context grouping, facilitating audio-visual information aggregation. Extensive experiments demonstrate that our VCT framework achieves new state-of-the-art performances on three subsets of the AVSBench dataset.
Poster
Jinxing Zhou · Dan Guo · Ruohao Guo · Yuxin Mao · Jingjing Hu · Yiran Zhong · Xiaojun Chang · Meng Wang

[ ExHall D ]

Abstract
The Audio-Visual Event Localization (AVEL) task aims to temporally locate and classify video events that are both audible and visible. Most research in this field assumes a closed-set setting, which restricts these models' ability to handle test data containing event categories absent (unseen) during training. Recently, a few studies have explored AVEL in an open-set setting, enabling the recognition of unseen events as "unknown", but without providing category-specific semantics. In this paper, we advance the field by introducing the Open-Vocabulary Audio-Visual Event Localization (OV-AVEL) problem, which requires localizing audio-visual events and predicting explicit categories for both seen and unseen data at inference. To address this new task, we propose the OV-AVEBench dataset, comprising 24,800 videos across 67 real-life audio-visual scenes (seen:unseen = 46:21), each with manual segment-level annotation. We also establish three evaluation metrics for this task. Moreover, we investigate two baseline approaches, one training-free and one using a further fine-tuning paradigm. Specifically, we utilize the unified multimodal space from the pretrained ImageBind model to extract audio, visual, and textual (event classes) features. The training-free baseline then determines predictions by comparing the consistency of audio-text and visual-text feature similarities. The fine-tuning baseline incorporates lightweight temporal layers to encode temporal relations within the audio and visual modalities, using OV-AVEBench …
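A minimal sketch of what a training-free, similarity-consistency baseline of this kind might look like is shown below; the embedding dimensions, threshold, and agreement rule are assumptions for illustration and are not the paper's exact baseline.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def predict_segment(audio_emb, visual_emb, text_embs, agree_thresh=0.5):
    """Training-free prediction for one segment: keep an event class only when
    the audio-text and visual-text similarities point to the same class."""
    a_sim = cosine_sim(audio_emb[None], text_embs)[0]
    v_sim = cosine_sim(visual_emb[None], text_embs)[0]
    a_cls, v_cls = int(a_sim.argmax()), int(v_sim.argmax())
    if a_cls == v_cls and min(a_sim[a_cls], v_sim[v_cls]) > agree_thresh:
        return a_cls          # audible-and-visible event of this class
    return -1                 # background / no audio-visual event

# Hypothetical usage with ImageBind-style embeddings (random stand-ins here).
rng = np.random.default_rng(0)
cls = predict_segment(rng.normal(size=512), rng.normal(size=512),
                      rng.normal(size=(67, 512)))
```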
Poster
Hanlin Wang · Zhan Tong · Kecheng Zheng · Yujun Shen · Limin Wang

[ ExHall D ]

Abstract
The Audio Description (AD) task aims to generate descriptions of visual elements for visually impaired individuals to help them access long-form video content, such as movies. With video features, text, a character bank, and context information as inputs, the generated ADs can refer to characters by name and provide reasonable, contextual descriptions that help the audience understand the storyline of the movie. To achieve this goal, we propose to leverage pre-trained foundation models through a simple and unified framework that generates ADs from an interleaved multimodal sequence as input, termed Uni-AD. To enhance the alignment of features across various modalities with finer granularity, we introduce a simple and lightweight module that maps video features into the textual feature space. Moreover, we also propose a character-refinement module to provide more precise information by identifying the main characters who play a more significant role in the video context. With these unique designs, we further incorporate contextual information and a contrastive loss into our architecture to generate smoother and more contextual ADs. Experiments on the MAD-eval dataset show that Uni-AD can achieve state-of-the-art performance on AD generation, which demonstrates the effectiveness of our approach.
Poster
Jiayuan Rao · Haoning Wu · Hao Jiang · Ya Zhang · Yanfeng Wang · Weidi Xie

[ ExHall D ]

Abstract
As a globally celebrated sport, soccer has attracted widespread interest from fans around the world. This paper aims to develop a comprehensive multi-modal framework for soccer video understanding. Specifically, we make the following contributions in this paper: (i) we introduce **SoccerReplay-1988**, the largest multi-modal soccer dataset to date, featuring videos and detailed annotations from 1,988 complete matches, with an automated annotation pipeline; (ii) we present the first visual-language foundation model in the soccer domain, **MatchVision**, which leverages spatiotemporal information across soccer videos and excels in various downstream tasks; (iii) we conduct extensive experiments and ablation studies on action classification, commentary generation, and multi-view foul recognition, and demonstrate state-of-the-art performance on all of them, substantially outperforming existing models, which demonstrates the superiority of our proposed data and model. We believe that this work will offer a standard paradigm for sports understanding research. The code and model will be publicly available for reproduction.
Poster
S P Sharan · Minkyu Choi · Sahil Shah · Harsh Goel · Mohammad Omama · Sandeep P. Chinchali

[ ExHall D ]

Abstract
Recent advancements in text-to-video models such as Sora, Gen-3, MovieGen, and CogVideoX are pushing the boundaries of synthetic video generation, with adoption seen in fields like robotics, autonomous driving, and entertainment. As these models become prevalent, various metrics and benchmarks have emerged to evaluate the quality of the generated videos. However, these metrics emphasize visual quality and smoothness, neglecting temporal fidelity and text-to-video alignment, which are crucial for safety-critical applications. To address this gap, we introduce NeuS-V, a novel synthetic video evaluation metric that rigorously assesses text-to-video alignment using neuro-symbolic formal verification techniques. Our approach first converts the prompt into a formally defined Temporal Logic (TL) specification and translates the generated video into an automaton representation. Then, it evaluates the text-to-video alignment by formally checking the video automaton against the TL specification. Furthermore, we present a dataset of temporally extended prompts to evaluate state-of-the-art video generation models against our benchmark. We find that NeuS-V achieves over 5× higher correlation with human evaluations compared to existing metrics. Our evaluation further reveals that current video generation models perform poorly on these temporally complex prompts, highlighting the need for future work in improving text-to-video generation capabilities.
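To give a flavor of checking a temporal-logic specification against per-frame predicates, here is a toy sketch of three standard operators (eventually, always, until) evaluated over boolean traces; the real system works on an automaton representation and formally defined TL specifications, so this is only a conceptual illustration with hypothetical predicates.

```python
from typing import Sequence

def eventually(trace: Sequence[bool]) -> bool:
    """F p: the predicate holds in at least one frame."""
    return any(trace)

def always(trace: Sequence[bool]) -> bool:
    """G p: the predicate holds in every frame."""
    return all(trace)

def until(p: Sequence[bool], q: Sequence[bool]) -> bool:
    """p U q: q eventually holds, and p holds at every frame before that point."""
    for t, q_t in enumerate(q):
        if q_t:
            return all(p[:t])
    return False

# Hypothetical per-frame predicates from a vision model, e.g. for a prompt like
# "the car drives until it reaches the red light":
car_driving  = [True, True, True, False, False]
at_red_light = [False, False, False, True, True]
satisfied = until(car_driving, at_red_light)
```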
Poster
Kaiyue Sun · Kaiyi Huang · Xian Liu · Yue Wu · Zihan Xu · Zhenguo Li · Xihui Liu

[ ExHall D ]

Abstract
Text-to-video (T2V) generative models have advanced significantly, yet their ability to compose different objects, attributes, actions, and motions into a video remains unexplored. Previous text-to-video benchmarks also neglect this important ability for evaluation. In this work, we conduct the first systematic study on compositional text-to-video generation. We propose T2V-CompBench, the first benchmark tailored for compositional text-to-video generation. T2V-CompBench encompasses diverse aspects of compositionality, including consistent attribute binding, dynamic attribute binding, spatial relationships, motion binding, action binding, object interactions, and generative numeracy. We further carefully design evaluation metrics of multimodal large language model (MLLM)-based, detection-based, and tracking-based metrics, which can better reflect the compositional text-to-video generation quality of seven proposed categories with 1400 text prompts. The effectiveness of the proposed metrics is verified by correlation with human evaluations. We also benchmark various text-to-video generative models and conduct in-depth analysis across different models and various compositional categories. We find that compositional text-to-video generation is highly challenging for current models, and we hope our attempt could shed light on future research in this direction.
Poster
Kangyi Wu · Pengna Li · Jingwen Fu · Yizhe Li · Yang Wu · Yuhan Liu · Jinjun Wang · Sanping Zhou

[ ExHall D ]

Abstract
Dense video captioning aims to localize and caption all events in arbitrary untrimmed videos. Although previous methods have achieved appealing results, they still face the issue of temporal bias, i.e., models tend to focus more on events with certain temporal characteristics. Specifically, 1) the temporal distribution of events in training datasets is uneven. Models trained on these datasets will pay less attention to out-of-distribution events. 2) long-duration events have more frame features than short ones and will attract more attention. To address this, we argue that events, with varying temporal characteristics, should be treated equally when it comes to dense video captioning. Intuitively, different events tend to have distinct visual differences due to varied camera views, backgrounds, or subjects. Inspired by that, we intend to utilize visual features to have an approximate perception of possible events and pay equal attention to them. In this paper, we introduce a simple but effective framework, called Event-Equalized Dense Video Captioning (E2DVC), to overcome the temporal bias and treat all possible events equally. Specifically, an event perception module (EPM) is proposed to do uneven clustering on visual frame features to generate pseudo-events. We enforce the model's attention to these pseudo-events through the pseudo-event initialization module (PEI). A …
Poster
Qiuheng Wang · Yukai Shi · Jiarong Ou · Rui Chen · Ke Lin · Jiahao Wang · Boyuan Jiang · Haotian Yang · Mingwu Zheng · Xin Tao · Fei Yang · Pengfei Wan · Di ZHANG

[ ExHall D ]

Abstract
As visual generation technologies continue to advance, the scale of video datasets has expanded rapidly, and the quality of these datasets is critical to the performance of video generation models. We argue that temporal splitting, detailed captions, and video quality filtering are three key factors that determine dataset quality. However, existing datasets exhibit various limitations in these areas. To address these challenges, we introduce Koala-36M, a large-scale, high-quality video dataset featuring accurate temporal splitting, detailed captions, and superior video quality. The core of our approach lies in improving the consistency between fine-grained conditions and video content. Specifically, we employ a linear classifier on probability distributions to enhance the accuracy of transition detection, ensuring better temporal consistency. We then provide structured captions for the segmented videos, with an average length of 200 words, to improve text-video alignment. Additionally, we develop a Video Training Suitability Score (VTSS) that integrates multiple sub-metrics, allowing us to filter high-quality videos from the original corpus. Finally, we incorporate several metrics into the training process of the generation model, further refining the fine-grained conditions. Our experiments demonstrate the effectiveness of our data processing pipeline and the quality of the proposed Koala-36M dataset.
Poster
Fida Mohammad Thoker · Letian Jiang · Chen Zhao · Bernard Ghanem

[ ExHall D ]

Abstract
Masked video modeling, such as VideoMAE, is an effective paradigm for video self-supervised learning (SSL). However, such methods are primarily based on reconstructing pixel-level details of natural videos, which have substantial temporal correlation, limiting their capability for semantic representation and sufficient encoding of motion dynamics. To address these issues, this paper introduces a novel SSL approach for video representation learning, dubbed SMILE, by infusing both spatial and motion semantics. In SMILE, we leverage image-language pretrained models, such as CLIP, to guide the learning process with high-level spatial semantics. We enhance the representation of motion by introducing synthetic motion patterns in the training data, allowing the model to capture more complex and dynamic content. Furthermore, using SMILE, we establish a new video self-supervised learning paradigm, capable of learning robust video representations without requiring any video data. We have carried out extensive experiments on 7 datasets with various downstream scenarios. SMILE surpasses current state-of-the-art SSL methods, showcasing its effectiveness in learning more discriminative and generalizable video representations.
Poster
Wenyi Hong · Yean Cheng · Zhuoyi Yang · Weihan Wang · Lefan Wang · Xiaotao Gu · Shiyu Huang · Yuxiao Dong · Jie Tang

[ ExHall D ]

Abstract
In recent years, vision language models (VLMs) have made significant advancements in video understanding. However, a crucial capability — fine-grained motion comprehension — remains under-explored in current benchmarks. To address this gap, we propose MotionBench, a comprehensive evaluation benchmark designed to assess the fine-grained motion comprehension of video understanding models. MotionBench evaluates models' motion-level perception through six primary categories of motion-oriented question types and includes data collected from diverse sources, ensuring a broad representation of real-world video content. Experimental results reveal that existing VLMs perform poorly in understanding fine-grained motions. To enhance VLM's ability to perceive fine-grained motion within a limited LLM sequence length, we conduct extensive experiments reviewing VLM architectures optimized for video feature compression, and propose a novel and efficient Through-Encoder (TE) Fusion method. Experiments show that higher frame rate inputs together with TE Fusion yield improvements in motion understanding, yet there is still substantial room for enhancement. Our benchmark aims to guide and motivate the development of more capable video understanding models, emphasizing the importance of fine-grained motion comprehension.
Poster
Ziyang Luo · Haoning Wu · Dongxu Li · Jing Ma · Mohan Kankanhalli · Junnan Li

[ ExHall D ]

Abstract
Large multimodal models (LMMs) with advanced video analysis capabilities have recently garnered significant attention. However, most evaluations rely on traditional methods like multiple-choice question answering in benchmarks such as VideoMME and LongVideoBench, which often lack the depth needed to capture the complex demands of real-world users. To address this limitation—and due to the prohibitive cost and slow pace of human annotation for video tasks—we introduce VideoAutoArena, an arena-style benchmark inspired by LMSYS Chatbot Arena's framework, designed to automatically assess LMMs' video analysis abilities. VideoAutoArena utilizes user simulation to generate open-ended, adaptive questions that rigorously assess model performance in video understanding. The benchmark features an automated, scalable evaluation framework, incorporating a modified ELO Rating System for fair and continuous comparisons across multiple LMMs. To validate our automated judging system, we construct a "gold standard" using a carefully curated subset of human annotations, demonstrating that our arena strongly aligns with human judgment while maintaining scalability. Additionally, we introduce a fault-driven evolution strategy, progressively increasing question complexity to push models toward handling more challenging video analysis scenarios. Experimental results demonstrate that VideoAutoArena effectively differentiates among state-of-the-art LMMs, providing insights into model strengths and areas for improvement. To further streamline our evaluation, …
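For reference, the textbook Elo update used in arena-style comparisons is sketched below; VideoAutoArena uses a modified ELO Rating System whose modifications are not described in the abstract, so only the standard update is shown, with a hypothetical K-factor.

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One standard Elo update after a pairwise comparison.

    score_a is 1.0 if model A wins, 0.0 if it loses, and 0.5 for a tie.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

# Hypothetical usage: model A (rated 1500) beats model B (rated 1600).
print(elo_update(1500.0, 1600.0, score_a=1.0))
```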
Poster
Yilun Zhao · Lujing Xie · Haowei Zhang · Guo Gan · Weiyuan Chen · Yitao Long · Tongyan Hu · Zhijian Xu · Chengye Wang · Chuhan Li · Ziyao Shangguan · Yixin Liu · Zhenwen Liang · Zhiyuan Hu · Chen Zhao · Arman Cohan

[ ExHall D ]

Abstract
We introduce MMVU, a comprehensive expert-level, multi-discipline benchmark for evaluating foundation models in video understanding. Compared to prior benchmarks, MMVU features three key advancements. First, it challenges models to apply domain-specific knowledge and perform expert-level reasoning to analyze specialized-domain videos, moving beyond the basic visual perception typically assessed in current video benchmarks. Second, each example is annotated by human experts from scratch. We implement strict data quality controls, validating that correct answers cannot be inferred solely from textual cues or answer shortcuts. Finally, each example is enriched with expert-annotated reasoning rationales and relevant domain knowledge, facilitating in-depth analysis. Our evaluation of 20 frontier models highlights a significant performance gap between the leading models and human experts. Through comprehensive error analysis, case studies, and an exploration of retrieval-augmented generation methods, we offer actionable insights to guide future advancements.
Poster
Yunlong Tang · JunJia Guo · Hang Hua · Susan Liang · Mingqian Feng · Xinyang Li · Rui Mao · Chao Huang · Jing Bi · Zeliang Zhang · Pooyan Fazli · Chenliang Xu

[ ExHall D ]

Abstract
The advancement of Multimodal Large Language Models (MLLMs) has enabled significant progress in multimodal understanding, expanding their capacity to analyze video content. However, existing evaluation benchmarks for MLLMs primarily focus on abstract video comprehension, lacking a detailed assessment of their ability to understand video compositions, the nuanced interpretation of how visual elements combine and interact within highly compiled video contexts. We introduce VidComposition, a new benchmark specifically designed to evaluate the video composition understanding capabilities of MLLMs using carefully curated compiled videos and cinematic-level annotations. VidComposition includes 982 videos with 1706 multiple-choice questions, covering various compositional aspects such as camera movement, angle, shot size, narrative structure, character actions and emotions, etc. Our comprehensive evaluation of 33 open-source and proprietary MLLMs reveals a significant performance gap between human and model capabilities. This highlights the limitations of current MLLMs in understanding complex, compiled video compositions and offers insights into areas for further improvement. Our benchmark will be publicly available for evaluating more models.
Poster
Yujie Lu · Yale Song · Lorenzo Torresani · William Yang Wang · Tushar Nagarajan

[ ExHall D ]

Abstract
We investigate complex video question answering via chain-of-evidence reasoning: identifying sequences of temporal spans from multiple relevant parts of the video, together with visual evidence within them. Existing models struggle with multi-step reasoning as they uniformly sample a fixed number of frames, which can miss critical evidence distributed nonuniformly throughout the video. Moreover, they lack the ability to temporally localize such evidence in the broader context of the full video, which is required for answering complex questions. We propose a framework to enhance existing VideoQA datasets with evidence reasoning chains, automatically constructed by searching for optimal intervals of interest in the video with supporting evidence, that maximizes the likelihood of answering a given question. We train our model (ViTED) to generate these evidence chains directly, enabling it to both localize evidence windows as well as perform multi-step reasoning across them in long-form video content. We show the value of our evidence-distilled models on a suite of long video QA benchmarks where we outperform state-of-the-art approaches that lack evidence reasoning capabilities.
Poster
Yudong Han · Qingpei Guo · Liyuan Pan · Liu Liu · Yu Guan · Ming Yang

[ ExHall D ]

Abstract
The challenge in LLM-based video understanding lies in preserving visual and semantic information in long videos while maintaining a memory-affordable token count. However, redundancy and correspondence in videos have hindered the performance potential of existing methods. Through statistical learning on current datasets, we observe that redundancy occurs in both repeated and answer-irrelevant frames, and the corresponding frames vary with different questions. This suggests the possibility of adopting dynamic encoding to balance detailed video information preservation with token budget reduction. To this end, we propose a dynamic cooperative network, DynFocus, for memory-efficient video encoding. Specifically, it comprises (i) a Dynamic Event Prototype Estimation (DPE) module that dynamically selects meaningful frames for question answering, and (ii) a Compact Cooperative Encoding (CCE) module that encodes meaningful frames with detailed visual appearance and the remaining frames with sketchy perception separately. We evaluate our method on five publicly available benchmarks, and experimental results consistently demonstrate that our method achieves competitive performance.
Poster
Yudi Shi · Shangzhe Di · Qirui Chen · Weidi Xie

[ ExHall D ]

Abstract
This paper tackles the problem of video question answering (VideoQA), a task that often requires multi-step reasoning and a profound understanding of spatial-temporal dynamics. While large video-language models perform well on benchmarks, they often lack explainability and spatial-temporal grounding. In this paper, we propose **A**gent-**o**f-**T**houghts **D**istillation (**AoTD**), a method that enhances models by incorporating automatically generated Chain-of-Thoughts (CoTs) into the instruction-tuning process. Specifically, we leverage an agent-based system to decompose complex questions into sub-tasks, and address them with specialized vision models, the intermediate results are then treated as reasoning chains. We also introduce a verification mechanism using a large language model (LLM) to ensure the reliability of generated CoTs. Extensive experiments demonstrate that AoTD improves the performance on multiple-choice and open-ended benchmarks.
Poster
Yuanbin Man · Ying Huang · Chengming Zhang · Bingzhe Li · Wei Niu · Miao Yin

[ ExHall D ]

Abstract
The advancements in large language models (LLMs) have propelled the improvement of video understanding tasks by incorporating LLMs with visual models. However, most existing LLM-based models (e.g., VideoLLaMA, VideoChat) are constrained to processing short-duration videos. Recent approaches attempt to understand long-term videos by extracting and compressing visual features into a fixed memory size. Nevertheless, those methods leverage only the visual modality to merge video tokens and overlook the correlation between visual and textual queries, leading to difficulties in effectively handling complex question-answering tasks. To address the challenges of long videos and complex prompts, we propose AdaCM2, which, for the first time, introduces an adaptive cross-modality memory reduction approach to video-text alignment in an auto-regressive manner on video streams. Our extensive experiments on various video understanding tasks, such as video captioning, video question answering, and video classification, demonstrate that AdaCM2 achieves state-of-the-art performance across multiple datasets while significantly reducing memory usage. Notably, it achieves a 4.5% improvement across multiple tasks in the LVU dataset with a GPU memory consumption reduction of up to 65%.
Poster
SuBeen Lee · WonJun Moon · Hyun Seok Seong · Jae-Pil Heo

[ ExHall D ]

Abstract
Few-Shot Action Recognition (FSAR) aims to train a model with only a few labeled video instances. A key challenge in FSAR is handling divergent narrative trajectories for precise video matching. While the frame- and tuple-level alignment approaches have been promising, their methods heavily rely on pre-defined and length-dependent alignment units (e.g., frames or tuples), which limits flexibility for actions of varying lengths and speeds. In this work, we introduce a novel TEmporal Alignment-free Matching (TEAM) approach, which eliminates the need for temporal units in action representation and brute-force alignment during matching. Specifically, TEAM represents each video with a fixed set of pattern tokens that capture globally discriminative clues within the video instance regardless of action length or speed, ensuring its flexibility. Furthermore, TEAM is inherently efficient, using token-wise comparisons to measure similarity between videos, unlike existing methods that rely on pairwise comparisons for temporal alignment. Additionally, we propose an adaptation process that identifies and removes common information across novel classes, establishing clear boundaries even between novel categories. Extensive experiments demonstrate the effectiveness of TEAM.
Poster
Shehreen Azad · Vibhav Vineet · Yogesh S. Rawat

[ ExHall D ]

Abstract
Despite advancements in multimodal large language models (MLLMs), current approaches struggle in medium-to-long video understanding due to frame and context length limitations. As a result, these models often depend on frame sampling, which risks missing key information over time and lacks task-specific relevance. To address these challenges, we introduce **HierarQ**, a task-aware hierarchical Q-Former based framework that sequentially processes frames to bypass the need for frame sampling, while avoiding LLM's context length limitations. We introduce a lightweight two-stream language-guided feature modulator to incorporate task awareness in video understanding, with the entity stream capturing frame-level object information within a short context and the scene stream identifying their broader interactions over a longer period of time. Each stream is supported by dedicated memory banks, which enable our proposed **Hierar**chical **Q**uerying transformer (HierarQ) to effectively capture short and long-term context. Extensive evaluations on **10** video benchmarks across video understanding, question answering, and captioning tasks demonstrate HierarQ’s state-of-the-art performance across most datasets, proving its robustness and efficiency for comprehensive video analysis. All code will be made available upon acceptance.
Poster
Yue Wu · Zhaobo Qi · Junshu Sun · Yaowei Wang · Qingming Huang · Shuhui Wang

[ ExHall D ]

Abstract
The development of video-language self-supervised models based on mask learning has significantly advanced downstream video tasks. These models leverage masked reconstruction to facilitate joint learning of visual and linguistic information. However, a recent study reveals that reconstructing image features yields superior downstream performance compared to video feature reconstruction. We hypothesize that this performance gap stems from how masking strategies influence the model's attention to temporal dynamics. To validate this hypothesis, we conduct two sets of experiments demonstrating that alignment between the masked object and the reconstruction target is crucial for effective video-language self-supervised learning. Based on these findings, we propose a spatio-temporal masking strategy (STM) for video-language model pretraining that operates across adjacent frames, together with a decoder that leverages semantic information to enhance the spatio-temporal representations of masked tokens. Through the combination of the masking strategy and the reconstruction decoder, STM compels the model to learn spatio-temporal feature representations more comprehensively. Experimental results across three video understanding downstream tasks validate the superiority of our method.
Poster
Zhihang Liu · Chen-Wei Xie · Pandeng Li · Liming Zhao · Longxiang Tang · Yun Zheng · Chuanbin Liu · Hongtao Xie

[ ExHall D ]

Abstract
Recent Multi-modal Large Language Models (MLLMs) have been challenged by the computational overhead resulting from massive video frames, often alleviated through compression strategies. However, visual content does not contribute equally to user instructions, and existing strategies (e.g., average pooling) inevitably lead to the loss of potentially useful information. To tackle this, we propose the **H**ybrid-level Instruction **I**njection Strategy for **C**onditional Token C**om**pression in MLLMs (HICom), utilizing the instruction as a condition to guide compression at both local and global levels. This encourages the compression to retain the maximum amount of user-focused information while reducing visual tokens to minimize the computational burden. Specifically, the instruction condition is injected into the grouped visual tokens at the local level and the learnable tokens at the global level, and we conduct the attention mechanism to complete the conditional compression. From the hybrid-level compression, the instruction-relevant visual parts are highlighted while the temporal-spatial structure is also preserved for easier understanding by LLMs. To further unleash the potential of HICom, we introduce a new conditional pre-training stage with our proposed dataset HICom-248K. Experiments show that HICom can obtain distinguished video understanding ability with fewer tokens, increasing performance by 3.88% on average on three multiple-choice QA …
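The following is a speculative sketch of what instruction-conditioned compression of grouped visual tokens could look like at the local level; the module name, grouping scheme, and attention layout are assumptions for illustration only and are not taken from the paper.

```python
import torch
import torch.nn as nn

class InstructionConditionedPooling(nn.Module):
    """Minimal sketch of instruction-guided token compression: visual tokens
    are split into local groups, and each group is pooled by attending from an
    instruction-derived query. Layer names and dimensions are illustrative."""
    def __init__(self, dim: int = 768, group_size: int = 4):
        super().__init__()
        self.group_size = group_size
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, visual: torch.Tensor, instr: torch.Tensor) -> torch.Tensor:
        # visual: [B, T, D] frame/patch tokens, instr: [B, D] pooled instruction embedding
        B, T, D = visual.shape
        G = T // self.group_size
        groups = visual[:, :G * self.group_size].reshape(B * G, self.group_size, D)
        query = instr.unsqueeze(1).expand(B, G, D).reshape(B * G, 1, D)
        pooled, _ = self.attn(query, groups, groups)     # one output token per group
        return pooled.reshape(B, G, D)                   # compressed visual tokens

# Usage: compress 64 visual tokens to 16 under a user instruction (toy tensors)
vis, txt = torch.randn(2, 64, 768), torch.randn(2, 768)
compressed = InstructionConditionedPooling()(vis, txt)   # [2, 16, 768]
```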
Poster
Jinhui Ye · Zihan Wang · Haosen Sun · Keshigeyan Chandrasegaran · Zane Durante · Cristobal Eyzaguirre · Yonatan Bisk · Juan Carlos Niebles · Ehsan Adeli · Li Fei-Fei · Jiajun Wu · Manling Li

[ ExHall D ]

Abstract
Efficient understanding of long-form videos remains a significant challenge in computer vision. In this work, we revisit temporal search paradigms for long-form video understanding, studying a fundamental issue pertaining to all state-of-the-art (SOTA) long-context vision-language models (VLMs). In particular, our contributions are two-fold: **First**, we formulate temporal search as a **Long Video Haystack** problem, i.e., finding a minimal set of relevant frames (typically one to five) among tens of thousands of frames from real-world long videos given specific queries. To validate our formulation, we create **LV-Haystack**, the first benchmark containing 3,874 human-annotated instances with fine-grained evaluation metrics for assessing keyframe search quality and computational efficiency. Experimental results on LV-Haystack highlight a significant research gap in temporal search capabilities, with SOTA keyframe selection methods achieving only a 2.1% temporal F1 score on the LVBench subset. **Next**, inspired by visual search in images, we re-think temporal search and propose a lightweight keyframe searching framework, T, which casts the expensive temporal search as a spatial search problem. T leverages superior visual localization capabilities typically used in images and introduces an adaptive zooming-in mechanism that operates across both temporal and spatial dimensions. Our extensive experiments show that when integrated with existing methods, T significantly improves …
Poster
Hongyu Li · Jinyu Chen · Ziyu Wei · Shaofei Huang · Tianrui Hui · Jialin Gao · Xiaoming Wei · Si Liu

[ ExHall D ]

Abstract
Recent advancements in multimodal large language models (MLLMs) have shown promising results, yet existing approaches struggle to effectively handle both temporal and spatial localization simultaneously. This challenge stems from two key issues: first, incorporating spatial-temporal localization introduces a vast number of coordinate combinations, complicating the alignment of linguistic and visual coordinate representations; second, encoding fine-grained temporal and spatial information during video feature compression is inherently difficult. To address these issues, we propose LLaVA-ST, an MLLM for fine-grained spatial-temporal multimodal understanding. In LLaVA-ST, we propose Language-Aligned Positional Embedding, which embeds the textual coordinate special token into the visual space, simplifying the alignment of fine-grained spatial-temporal correspondences. Additionally, we design the Spatial-Temporal Packer, which decouples the feature compression of temporal and spatial resolutions into two distinct point-to-region attention processing streams. Furthermore, we propose the ST-Align dataset with 4.3M training samples for fine-grained spatial-temporal multimodal understanding. With ST-Align, we present a progressive training pipeline that aligns the visual and textual features through sequential coarse-to-fine stages. LLaVA-ST achieves outstanding performance on 12 benchmarks requiring fine-grained temporal, spatial, or spatial-temporal interleaved multimodal understanding. Our code, data, and benchmark will be released.
Poster
Yunseok Jang · Yeda Song · Sungryull Sohn · Lajanugen Logeswaran · Tiange Luo · Dong-Ki Kim · GyungHoon Bae · Honglak Lee

[ ExHall D ]

Abstract
Recent advancements in Large Language Models (LLMs) and Vision-Language Models (VLMs) have sparked significant interest in developing agents capable of mobile operating system (mobile OS) navigation. We introduce MONDAY (Mobile OS Navigation Task Dataset for Agents from YouTube), a large-scale dataset of 313K annotated frames from 20K instructional videos capturing diverse real-world mobile OS navigation across multiple platforms. Models trained on MONDAY demonstrate robust cross-platform generalization capabilities, consistently outperforming models trained on existing single-OS datasets while achieving 21.41%p better performance on previously unseen mobile OS configurations. To enable continuous dataset expansion as mobile platforms evolve, we present an automated framework that leverages publicly available video content to create comprehensive task datasets without manual annotation. Our framework combines robust OCR-based scene detection (95.04% F1-score), near-perfect UI component detection (99.87% hit ratio), and novel multi-step action identification to extract reliable action sequences across diverse interface configurations. We contribute both the MONDAY dataset and our automated collection framework to facilitate future research in mobile OS navigation.
Poster
Xun Jiang · Zhiyi Huang · Xing Xu · Jingkuan Song · Fumin Shen · Heng Tao Shen

[ ExHall D ]

Abstract
Natural Language-based Egocentric Task Verification (NLETV) aims to equip agents with the ability to determine whether the operation flows of procedural tasks in egocentric videos align with natural language instructions. Describing rules with natural language enables generalizable applications, but also raises challenges of cross-modal heterogeneity and hierarchical misalignment. In this paper, we propose a novel approach termed Procedural Heterogeneous Graph Completion (PHGC), which addresses these challenges with heterogeneous graphs representing the logic in rules and operation flows. Specifically, our PHGC method mainly consists of three key components: (1) a Heterogeneous Graph Construction module that defines objective states and operation flows as vertices, with temporal and sequential relations as edges; (2) a Cross-Modal Path Finding module that aligns semantic relations between hierarchical video and text elements; and (3) a Discriminative Entity Representation module that excavates hidden entities, integrating general logical relations and discriminative cues to reveal the final verification results. Additionally, we construct a new dataset called CSV-NL comprising realistic videos. Extensive experiments on two benchmark datasets covering both digital and physical scenarios, i.e., EgoTV and CSV-NL, demonstrate that our proposed PHGC establishes state-of-the-art performance across different settings. Our implementation is available at https://anonymous.4open.science/r/PHGC-7A1B.
Poster
Yichen Xiao · Shuai Wang · Dehao Zhang · Wenjie Wei · Yimeng Shan · Xiaoli Liu · Yulin Jiang · Malu Zhang

[ ExHall D ]

Abstract
Transformers significantly raise the performance limits across various tasks, spurring research into integrating them into spiking neural networks. However, a notable performance gap remains between existing spiking Transformers and their artificial neural network counterparts. Here, we first analyze the reason for this gap and identify that the dot product ineffectively calculates similarity between spiking Queries (Q) and Keys (K). To address this challenge, we introduce an innovative α-XNOR similarity calculation method tailored for spike trains. α-XNOR similarity redefines the correlation of non-spike pairs as a specific value α, effectively overcoming the limitations of dot-product similarity caused by numerous non-spiking events. Additionally, considering the sparse nature of spike trains, where spikes carry more information than non-spikes, the α-XNOR similarity correspondingly highlights the distinct importance of spikes over non-spikes. Extensive experiments demonstrate that our α-XNOR similarity significantly improves performance across different spiking Transformer architectures on various static and neuromorphic datasets. This is the first attempt to develop a spiking self-attention paradigm tailored to the binary characteristics of spike trains.
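A minimal sketch of the α-XNOR idea as stated in the abstract: spike/spike agreements contribute fully, non-spike/non-spike pairs contribute a small value α instead of zero, and mismatched pairs contribute nothing. The normalization and the default α value are assumptions for illustration.

```python
import torch

def alpha_xnor_similarity(q: torch.Tensor, k: torch.Tensor, alpha: float = 0.1) -> torch.Tensor:
    """Similarity between binary spike trains: spike-spike agreements count
    as 1, silent-silent pairs count as a small alpha (instead of 0 as in the
    dot product), and mismatched pairs count as 0.
    q, k: {0, 1} tensors with time in the last dimension [..., T]."""
    spike_match = q * k                  # both neurons fired at this step
    silent_match = (1 - q) * (1 - k)     # neither fired at this step
    return (spike_match + alpha * silent_match).sum(dim=-1)

# Usage: attention scores between spiking Queries and Keys over T=8 time steps
Q = torch.randint(0, 2, (4, 8)).float()  # 4 query spike trains
K = torch.randint(0, 2, (4, 8)).float()  # 4 key spike trains
scores = alpha_xnor_similarity(Q.unsqueeze(1), K.unsqueeze(0))  # [4, 4] score matrix
```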
Poster
Andong Deng · Tongjia Chen · Shoubin Yu · Taojiannan Yang · Lincoln Spencer · Yapeng Tian · Ajmal Mian · Mohit Bansal · Chen Chen

[ ExHall D ]

Abstract
In this paper, we introduce Motion-Grounded Video Reasoning, a new motion understanding task that requires generating visual answers (video segmentation masks) according to the input question, and hence needs implicit spatiotemporal reasoning and grounding. This task extends existing spatiotemporal grounding work focusing on explicit action/motion grounding to a more general format by enabling implicit reasoning via questions. To facilitate the development of the new task, we collect a large-scale dataset called GROUNDMORE, which comprises 1,715 video clips and 249K object masks that are deliberately designed with 4 question types (Causal, Sequential, Counterfactual, and Descriptive) for benchmarking deep and comprehensive motion reasoning abilities. GROUNDMORE uniquely requires models to generate visual answers, providing a more concrete and visually interpretable response than plain text. It evaluates models on both spatiotemporal grounding and reasoning, fostering progress on complex challenges in motion-related video reasoning, temporal perception, and pixel-level understanding. Furthermore, we introduce a novel baseline model named Motion-Grounded Video Reasoning Assistant (MORA). MORA incorporates the multimodal reasoning ability of the Multimodal LLM, the pixel-level perception capability of the grounding model (SAM), and the temporal perception ability of a lightweight localization head. MORA achieves respectable performance on GROUNDMORE, outperforming the best existing visual grounding baseline model by an average of 21.5% relatively. We hope this novel and challenging task will pave the way for future advancements in robust …
Poster
Adnen Abdessaied · Anna Rohrbach · Marcus Rohrbach · Andreas Bulling

[ ExHall D ]

Abstract
We present **V2Dial** – a novel expert-based model specifically geared towards simultaneously handling image and video input data for multimodal conversational tasks. Current multimodal models primarily focus on simpler tasks (e.g., VQA, VideoQA, video-text retrieval) and often neglect the more challenging conversational counterparts, such as video and visual/image dialog. Moreover, work on the two conversational tasks has evolved separately despite their apparent similarities, which limits their potential applicability. To this end, we propose to unify both tasks using a single model that, for the first time, jointly learns the spatial and temporal features of images and videos by routing them through dedicated experts and aligns them using matching and contrastive learning techniques. Furthermore, we systematically study the domain shift between the two tasks by investigating whether and to what extent these seemingly related tasks can mutually benefit from their respective training data. Extensive evaluations on the widely used video and visual dialog datasets of AVSD and VisDial show that our model achieves new state-of-the-art results across four benchmarks both in zero-shot and fine-tuning settings.
Poster
Rohith Peddi · Saurabh . · Ayush Abhay Shrivastava · Parag Singla · Vibhav Giridhar Gogate

[ ExHall D ]

Abstract
Spatio-Temporal Scene Graphs (STSGs) provide a concise and expressive representation of dynamic scenes by modelling objects and their evolving relationships over time. However, real-world visual relationships often exhibit a long-tailed distribution, causing existing methods for tasks like Video Scene Graph Generation (VidSGG) and Scene Graph Anticipation (SGA) to produce biased scene graphs. To this end, we propose **ImparTail**, a novel training framework that leverages curriculum learning and loss masking to mitigate bias in the generation and anticipation of spatio-temporal scene graphs. Our approach gradually decreases the dominance of the head relationship classes during training and focuses more on tail classes, leading to more balanced training. Furthermore, we introduce two new tasks—Robust Spatio-Temporal Scene Graph Generation and Robust Scene Graph Anticipation—designed to evaluate the robustness of STSG models against distribution shifts. Extensive experiments on the Action Genome dataset demonstrate that our framework significantly enhances the unbiased performance and robustness of STSG models compared to existing methods.
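Below is a hedged sketch of curriculum-style loss masking that progressively suppresses head relationship classes; the linear schedule, the frequency-rank criterion, and the 50% cap are illustrative assumptions, not the paper's exact rule.

```python
import torch

def curriculum_loss_mask(labels: torch.Tensor,
                         class_freq: torch.Tensor,
                         progress: float,
                         max_masked_frac: float = 0.5) -> torch.Tensor:
    """Sketch of curriculum loss masking for long-tailed relations: as training
    progresses (progress in [0, 1]), a growing share of the most frequent (head)
    classes is dropped from the loss so tail classes receive relatively more
    gradient. The linear schedule and the 50% cap are illustrative assumptions."""
    ranks = class_freq.argsort(descending=True).argsort()   # rank 0 = most frequent
    n_masked = int(progress * max_masked_frac * len(class_freq))
    return (ranks[labels] >= n_masked).float()              # 1 = keep, 0 = mask

# Usage: weight a per-sample cross-entropy loss (toy values)
labels = torch.tensor([0, 3, 7, 1])                          # relationship class ids
freq = torch.tensor([5000., 1200., 800., 300., 120., 60., 30., 10.])
per_sample_loss = torch.rand(4)                              # stand-in for CE losses
mask = curriculum_loss_mask(labels, freq, progress=0.6)
loss = (per_sample_loss * mask).sum() / mask.sum().clamp(min=1)
```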
Poster
Lang Lin · Xueyang Yu · Ziqi Pang · Yu-Xiong Wang

[ ExHall D ]

Abstract
This paper proposes a novel framework utilizing multi-modal large language models (MLLMs) for referring video object segmentation (RefVOS). Previous MLLM-based methods commonly struggle with the dilemma between "Ref" and "VOS": they either specialize in understanding a few key frames (global reasoning) or in tracking objects on continuous frames (local reasoning), and rely on external VOS models or frame selectors to mitigate the other end of the challenge. However, our framework GLUS shows that Global and Local consistency can be Unified into a single video Segmentation MLLM: a set of sparse "context frames" provides global information, while a stream of continuous "query frames" conducts local object tracking. This is further supported by jointly training the MLLM with a pre-trained VOS memory bank to simultaneously digest short-range and long-range temporal information. To improve the information efficiency within the limited context window of MLLMs, we introduce object contrastive learning to distinguish hard false-positive objects and a self-refined framework to identify crucial frames and perform propagation. By collectively integrating these insights, our GLUS delivers a simple yet effective baseline, achieving new state-of-the-art for MLLMs on the MeViS and Ref-Youtube-VOS benchmarks.
Poster
Qin Liu · Jianfeng Wang · Zhengyuan Yang · Linjie Li · Kevin Lin · Marc Niethammer · Lijuan Wang

[ ExHall D ]

Abstract
Semi-supervised video object segmentation (VOS) has been largely driven by space-time memory (STM) networks, which store past frame features in a spatiotemporal memory to segment the current frame via softmax attention. However, STM networks face memory limitations due to the quadratic complexity of softmax matching, restricting their applicability as video length and resolution increase. To address this, we propose LiVOS, a lightweight memory network that employs linear matching via linear attention, reformulating memory matching into a recurrent process that reduces the quadratic attention matrix to a constant-size, spatiotemporal-agnostic 2D state. To enhance selectivity, we introduce gated linear matching, where a data-dependent gate matrix is multiplied with the state matrix to control what information to retain or discard. Experiments on diverse benchmarks demonstrate the effectiveness of our method. It achieves 64.8 J&F on MOSE and 85.1 J&F on DAVIS, surpassing all non-STM methods and narrowing the gap with STM-based approaches. For longer and higher-resolution videos, it matches STM-based methods with 53% less GPU memory and supports 4096p inference on a 32G consumer-grade GPU--a previously cost-prohibitive capability--opening the door for long and high-resolution video foundation models.
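A minimal sketch of gated linear matching as described above: the growing spatiotemporal memory is replaced by a recurrently updated, constant-size 2D state that is modulated by a data-dependent gate and read out with the query features. The positive feature map (elu + 1), the shapes, and the normalization are assumptions for illustration, not LiVOS's exact implementation.

```python
import torch
import torch.nn.functional as F

def phi(x: torch.Tensor) -> torch.Tensor:
    """Positive feature map commonly used in linear attention (assumption)."""
    return F.elu(x) + 1.0

def gated_linear_matching(queries, keys, values, gates):
    """Memory matching as a recurrence with a constant-size state: for each
    frame t, S <- g_t * S + phi(k_t)^T v_t, then read out with phi(q_t).
    Shapes: queries/keys [T, N, Dk], values [T, N, Dv], gates [T, Dk, Dv]."""
    T, N, Dk = keys.shape
    Dv = values.shape[-1]
    S = torch.zeros(Dk, Dv)                 # constant-size 2D state
    z = torch.zeros(Dk)                     # running normalizer
    outputs = []
    for t in range(T):
        k, v, q = phi(keys[t]), values[t], phi(queries[t])
        S = gates[t] * S + k.T @ v          # gated state update
        z = z + k.sum(dim=0)
        out = (q @ S) / (q @ z).clamp(min=1e-6).unsqueeze(-1)
        outputs.append(out)
    return torch.stack(outputs)             # [T, N, Dv] readouts per frame

# Usage with toy shapes
T, N, Dk, Dv = 4, 64, 32, 16
out = gated_linear_matching(torch.randn(T, N, Dk), torch.randn(T, N, Dk),
                            torch.randn(T, N, Dv), torch.sigmoid(torch.randn(T, Dk, Dv)))
```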
Poster
Muchao Ye · Weiyang Liu · Pan He

[ ExHall D ]

Abstract
The rapid advancement of vision-language models (VLMs) has established a new paradigm in video anomaly detection (VAD): leveraging VLMs to simultaneously detect anomalies and provide comprehensible explanations for the decisions. Existing work in this direction often assumes the complex reasoning required for VAD exceeds the capabilities of pretrained VLMs. Consequently, these approaches either incorporate specialized reasoning modules during inference or rely on instruction tuning datasets through additional training to adapt VLMs for VAD. However, such strategies often incur substantial computational costs or data annotation overhead. To address these challenges in explainable VAD, we introduce a verbalized learning framework named VERA that enables VLMs to perform VAD without model parameter modifications. Specifically, VERA automatically decomposes the complex reasoning required for VAD into reflections on simpler, more focused guiding questions capturing distinct abnormal patterns. It treats these reflective questions as learnable parameters and optimizes them through data-driven verbal interactions between learner and optimizer VLMs, using coarsely labeled training data. During inference, VERA embeds the learned questions into model prompts to guide VLMs in generating segment-level anomaly scores, which are then refined into frame-level scores via the fusion of scene and temporal contexts. Experimental results on challenging benchmarks demonstrate that the learned questions …
Poster
Yuzhi Huang · Chenxin Li · Haitao Zhang · Zixu Lin · yunlong lin · Hengyu Liu · Wuyang Li · Xinyu Liu · Jiechao Gao · Yue Huang · Xinghao Ding · Yixuan Yuan

[ ExHall D ]

Abstract
Video anomaly detection (VAD) is crucial in scenarios such as surveillance and autonomous driving, where timely detection of unexpected activities is essential. Although existing methods have primarily focused on detecting anomalies in videos—either by identifying anomalous frames or objects—they often neglect finer-grained analysis, such as anomalous pixels, which limits their ability to capture a broader range of anomalies. To address this challenge, we propose an innovative VAD framework called Track Any Anomalous Object (*TAO*), a granular video anomaly detection framework that, for the first time, integrates the detection of multiple fine-grained anomalous objects into a unified pipeline. Unlike methods that assign anomaly scores to every pixel at each moment, our approach transforms the problem into pixel-level tracking of anomalous objects. By linking anomaly scores to subsequent tasks such as image segmentation and video tracking, our method eliminates the need for threshold selection and achieves more precise anomaly localization, even in long and challenging video sequences. Experiments on extensive datasets demonstrate that *TAO* achieves state-of-the-art performance, marking new progress for VAD by providing a practical, granular, and holistic solution.
Poster
Zhanzhong Pang · Fadime Sener · Angela Yao

[ ExHall D ]

Abstract
Online Action Detection (OAD) detects actions in streaming videos using past observations. State-of-the-art OAD approaches model past observations and their interactions with an anticipated future. The past is encoded using short- and long-term memories to capture immediate and long-range dependencies, while anticipation compensates for missing future context. We uncover a training-inference discrepancy in existing OAD methods that hinders learning effectiveness: training uses varying lengths of short-term memory, while inference relies on a full-length short-term memory. As a remedy, we propose a Context-enhanced Memory-Refined Transformer (CMeRT). CMeRT introduces a context-enhanced encoder to improve frame representations using additional near-past context. It also features a memory-refined decoder that leverages near-future generation to enhance performance. CMeRT improves learning for OAD and achieves state-of-the-art results in online detection and anticipation on THUMOS'14, CrossTask, and EPIC-Kitchens-100.
Poster
Ziyi Liu · Yangcen Liu

[ ExHall D ]

Abstract
Weakly-supervised Temporal Action Localization (WTAL) has achieved notable success but still suffers from a lack of snippet-level annotations, leading to a performance and framework gap with fully-supervised methods. While recent approaches employ pseudo labels for training, three key challenges remain unresolved: generating high-quality pseudo labels, making full use of different priors, and optimizing training methods with noisy labels. Motivated by these perspectives, we propose PseudoFormer, a novel two-branch framework that bridges the gap between weakly and fully-supervised Temporal Action Localization (TAL). We first introduce RickerFusion, which maps all predicted action proposals to a global shared space to generate pseudo labels with better quality. Then, we leverage both snippet-level and proposal-level labels with different priors from the weak branch to train the regression-based model in the full branch. Finally, an uncertainty mask and an iterative refinement mechanism are applied for training with noisy pseudo labels. PseudoFormer achieves state-of-the-art results on the two commonly used benchmarks, THUMOS14 and ActivityNet1.3, and also extends to Point-supervised TAL (PTAL).
Poster
Yang Chen · Jingcai Guo · Song Guo · Dacheng Tao

[ ExHall D ]

Abstract
Zero-shot skeleton action recognition is a non-trivial task that requires robust unseen generalization with prior knowledge from only seen classes and shared semantics. Existing methods typically build skeleton-semantics interactions through uncontrollable mappings and conspicuous representations, and thus can hardly capture the intricate and fine-grained relationships needed for effective cross-modal transferability. To address these issues, we propose a novel dyNamically Evolving dUal skeleton-semantic syneRgistic framework with the guidance of cOntext-aware side informatioN (dubbed Neuron), to explore more fine-grained cross-modal correspondence from micro to macro perspectives at both spatial and temporal levels. Concretely, 1) we first construct spatial-temporal evolving micro-prototypes and integrate dynamic context-aware side information to capture the intricate and synergistic skeleton-semantic correlations step-by-step, progressively refining cross-modal alignment; and 2) we introduce spatial compression and temporal memory mechanisms to guide the growth of the spatial-temporal micro-prototypes, enabling them to absorb structure-related spatial representations and regularity-dependent temporal patterns. Notably, these processes are analogous to the learning and growth of neurons, equipping the framework with the capacity to generalize to new action categories. Extensive experiments on various benchmark datasets demonstrate the superiority of the proposed method. The code is anonymously available at: https://anonymous.4open.science/r/Neuron.
Poster
Xinqi Liu · Li Zhou · Zikun Zhou · Jianqiu Chen · Zhenyu He

[ ExHall D ]

Abstract
The vision-language tracking task aims to perform object tracking based on various modality references. Existing Transformer-based vision-language tracking methods have made remarkable progress by leveraging the global modeling ability of self-attention. However, current approaches still face challenges in effectively exploiting the temporal information and dynamically updating reference features during tracking. Recently, the State Space Model (SSM), known as Mamba, has shown astonishing ability in efficient long-sequence modeling. Particularly, its state space evolving process demonstrates promising capabilities in memorizing multimodal temporal information with linear complexity. Witnessing its success, we propose a Mamba-based vision-language tracking model to exploit its state space evolving ability in temporal space for robust multimodal tracking, dubbed MambaVLT. In particular, our approach mainly integrates a time-evolving hybrid state space block and a selective locality enhancement block, to capture contextual information for multimodal modeling and adaptive reference feature update. Besides, we introduce a modality-selection module that dynamically adjusts the weighting between visual and language references, mitigating potential ambiguities from either reference type. Extensive experimental results show that our method performs favorably against state-of-the-art trackers across diverse benchmarks.
Poster
Youngjoon Jang · Haran Raajesh · Liliane Momeni · Gul Varol · Andrew Zisserman

[ ExHall D ]

Abstract
Our objective is to translate continuous sign language into spoken language text. Inspired by the way human interpreters rely on context for accurate translation, we incorporate additional contextual cues, together with the signing video, into a novel translation model. Specifically, besides visual sign recognition features that encode the input video, we integrate complementary textual information from (i) captions describing the background show; (ii) pseudo-glosses transcribing the signing; and (iii) the translation of previous sentences. These are automatically extracted and input along with the visual features to a pre-trained large language model (LLM), which we fine-tune to generate spoken language translations in text form. Through extensive ablation studies, we show the positive contribution of each input cue to the translation performance. We train and evaluate our approach on BOBSL -- the largest British Sign Language dataset currently available. We show that our contextual approach significantly enhances the quality of the translations compared to previously reported results on BOBSL, and also to state-of-the-art methods that we implement as baselines. Furthermore, we demonstrate the generality of our approach by applying it also to How2Sign, an American Sign Language dataset, and achieve competitive results.
Poster
Song Xia · Yi Yu · Wenhan Yang · MEIWEN DING · Zhuo Chen · Ling-Yu Duan · Alex C. Kot · Xudong Jiang

[ ExHall D ]

Abstract
By locally encoding raw data into intermediate features, collaborative inference enables end users to leverage powerful deep learning models without exposing sensitive raw data to cloud servers. However, recent studies have revealed that these intermediate features may not sufficiently preserve privacy, as information can be leaked and raw data can be reconstructed via model inversion attacks (MIAs). Obfuscation-based methods, such as noise corruption, adversarial representation learning, and information filters, enhance inversion robustness by empirically obfuscating task-irrelevant redundancy. However, methods for quantifying such redundancy remain elusive, and the explicit mathematical relation between this redundancy and worst-case robustness against inversion has not yet been established. To address this, we first theoretically prove that the conditional entropy of inputs given intermediate features provides a guaranteed lower bound on the reconstruction mean square error (MSE) under any MIA. Then, we derive a differentiable and solvable measure for bounding this conditional entropy based on Gaussian mixture estimation and propose a conditional entropy maximization (CEM) algorithm to enhance inversion robustness. Experimental results on four datasets demonstrate the effectiveness and adaptability of our proposed CEM; without compromising feature utility or computing efficiency, integrating the proposed CEM into obfuscation-based defense mechanisms consistently boosts their inversion robustness, …
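For readers unfamiliar with this type of guarantee, one standard information-theoretic bound of the flavor the abstract describes is sketched below; the paper's exact statement and constant may differ. The intuition is that the larger the conditional entropy h(X | Z) of the raw input X given the shared features Z, the larger the unavoidable reconstruction error of any inversion attack.

```latex
% Illustrative form of an entropy-based lower bound on inversion error
% (assumed standard form; not necessarily the paper's exact statement).
% For any inversion attack producing an estimate \hat{X} = g(Z) of the
% d-dimensional input X from the intermediate features Z:
\frac{1}{d}\,\mathbb{E}\!\left[\lVert X - \hat{X} \rVert_2^2\right]
\;\ge\; \frac{1}{2\pi e}\, \exp\!\left(\frac{2}{d}\, h(X \mid Z)\right)
% Hence maximizing h(X | Z), as a conditional entropy maximization scheme
% does, raises the worst-case MSE floor for every MIA simultaneously.
```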
Poster
Shuaiwei Yuan · Junyu Dong · Yuezun Li

[ ExHall D ]

Abstract
With the advancement of AI generative techniques, Deepfake faces have become incredibly realistic and nearly indistinguishable to the human eye. To counter this, Deepfake detectors have been developed as reliable tools for assessing face authenticity. These detectors are typically built on Deep Neural Networks (DNNs) and trained using third-party datasets. However, this protocol raises a new security risk that can seriously undermine the trustworthiness of Deepfake detectors: once third-party data providers maliciously insert poisoned (corrupted) data, Deepfake detectors trained on these datasets will have "backdoors" injected that cause abnormal behavior when presented with samples containing specific triggers. This is a practical concern, as third-party providers may distribute or sell these triggers to malicious users, allowing them to manipulate detector performance and escape accountability. This paper investigates this risk in depth and describes a solution to stealthily infect Deepfake detectors. Specifically, we develop a trigger generator that can synthesize passcode-controlled, semantic-suppression, adaptive, and invisible trigger patterns, ensuring both the stealthiness and effectiveness of these triggers. We then discuss two poisoning scenarios, dirty-label poisoning and clean-label poisoning, to accomplish the injection of backdoors. Extensive experiments demonstrate the effectiveness, stealthiness, and practicality of our method compared to several baselines.
Poster
Hossein Kashiani · Niloufar Alipour Talemi · Fatemeh Afghah

[ ExHall D ]

Abstract
Deepfake detectors often struggle to generalize to novel forgery types due to biases learned from limited training data. In this paper, we identify a new type of model bias in the frequency domain, termed spectral bias, where detectors overly rely on specific frequency bands, restricting their ability to generalize across unseen forgeries. To address this, we propose FreqDebias, a frequency debiasing framework that mitigates spectral bias through two complementary strategies. First, we introduce a novel Forgery Mixup (Fo-Mixup) augmentation, which dynamically diversifies frequency characteristics of training samples. Second, we incorporate a dual consistency regularization (CR), which enforces both local consistency using class activation maps (CAMs) and global consistency through a von Mises-Fisher (vMF) distribution on a hyperspherical embedding space. This dual CR mitigates over-reliance on certain frequency components by promoting consistent representation learning under both local and global supervision. Extensive experiments show that FreqDebias significantly enhances cross-domain generalization and outperforms state-of-the-art methods in both cross-domain and in-domain settings.
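One plausible way to "dynamically diversify frequency characteristics of training samples" is amplitude-spectrum mixing, sketched below; this is a common recipe offered for intuition and is not claimed to be the paper's exact Fo-Mixup operation.

```python
import torch

def amplitude_mixup(x1: torch.Tensor, x2: torch.Tensor, lam: float = 0.5) -> torch.Tensor:
    """Frequency-domain augmentation sketch: mix the amplitude spectra of two
    face images while keeping the phase of the first, so the sample's frequency
    characteristics are diversified without destroying its spatial layout.
    This is a common recipe, not necessarily the paper's formulation.
    x1, x2: [C, H, W] images in [0, 1]."""
    f1 = torch.fft.fft2(x1)
    f2 = torch.fft.fft2(x2)
    amp = lam * f1.abs() + (1.0 - lam) * f2.abs()   # mixed amplitude spectrum
    mixed = amp * torch.exp(1j * f1.angle())        # keep x1's phase structure
    return torch.fft.ifft2(mixed).real.clamp(0, 1)

# Usage (toy tensors)
a, b = torch.rand(3, 224, 224), torch.rand(3, 224, 224)
aug = amplitude_mixup(a, b, lam=0.7)
```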
Poster
Guocheng Qian · Kuan-Chieh Wang · Or Patashnik · Negin Heravi · Daniil Ostashev · Sergey Tulyakov · Daniel Cohen-Or · Kfir Aberman

[ ExHall D ]

Abstract
We introduce Omni-ID, a novel facial representation designed specifically for generative tasks. Omni-ID encodes holistic information about an individual's appearance across diverse expressions and poses within a fixed-size representation. It consolidates information from a varied number of unstructured input images into a structured representation, where each entry represents certain global or local identity features. Our approach leverages a few-to-many identity reconstruction training paradigm, where a few images of an individual serve as input to reconstruct multiple target images of the same individual in varied poses and expressions. To train the Omni-ID encoder, we use a multi-decoder framework that leverages the complementary strengths of different decoders during the representation learning phase. Unlike conventional representations, such as CLIP and ArcFace, which are typically learned through discriminative or contrastive objectives, Omni-ID is optimized with a generative objective, resulting in a more comprehensive and nuanced identity capture for generative tasks. Trained on our MFHQ dataset, a multi-view facial image collection, Omni-ID demonstrates substantial improvements over conventional representations across various generative tasks.
Poster
Kang You · Ziling Wei · Jing Yan · Boning Zhang · Qinghai Guo · Yaoyu Zhang · Zhezhi He

[ ExHall D ]

Abstract
Visual streaming perception (VSP) involves online intelligent processing of sequential frames captured by vision sensors, enabling real-time decision-making in applications such as autonomous driving, UAVs, and AR/VR. However, the computational efficiency of VSP on edge devices remains a challenge due to power constraints and the underutilization of temporal dependencies between frames. While spiking neural networks (SNNs) offer biologically inspired event-driven processing with potential energy benefits, their practical advantage over artificial neural networks (ANNs) for VSP tasks remains unproven. In this work, we introduce a novel framework, VISTREAM, which leverages the Law of Charge Conservation (LoCC) property in ST-BIF neurons and a differential encoding (DiffEncode) scheme to optimize SNN inference for VSP. By encoding temporal differences between neighboring frames and eliminating frequent membrane resets, VISTREAM achieves significant computational efficiency while maintaining accuracy equivalent to its ANN counterpart. We provide theoretical proofs of equivalence and validate VISTREAM across diverse VSP tasks, including object detection, tracking, and segmentation, demonstrating substantial energy savings without compromising performance.
Poster
Kairong Yu · Chengting Yu · Tianqing Zhang · Xiaochen Zhao · Shu Yang · Hongwei Wang · Qiang Zhang · Qi Xu

[ ExHall D ]

Abstract
Spiking Neural Networks (SNNs), inspired by the human brain, offer significant computational efficiency through discrete spike-based information transfer. Despite their potential to reduce inference energy consumption, a performance gap persists between SNNs and Artificial Neural Networks (ANNs), primarily due to current training methods and inherent model limitations. While recent research has aimed to enhance SNN learning by employing knowledge distillation (KD) from ANN teacher networks, traditional distillation techniques often overlook the distinctive spatiotemporal properties of SNNs, thus failing to fully leverage their advantages. To overcome these challenges, we propose a novel logit distillation method characterized by temporal separation and entropy regularization. This approach improves existing SNN distillation techniques by performing distillation learning on logits across different time steps, rather than merely on aggregated output features. Furthermore, the integration of entropy regularization stabilizes model optimization and further boosts performance. Extensive experimental results indicate that our method surpasses prior SNN distillation strategies, whether based on logit distillation, feature distillation, or a combination of both. The code will be available on GitHub.
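A hedged sketch of what per-time-step logit distillation with an entropy term might look like, assuming the SNN student emits logits at every time step and the ANN teacher emits a single set of logits; the temperature, the weight, and the sign of the entropy term are assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def temporal_separation_kd_loss(student_logits_t: torch.Tensor,
                                teacher_logits: torch.Tensor,
                                tau: float = 4.0,
                                ent_weight: float = 0.1) -> torch.Tensor:
    """Sketch of logit distillation with temporal separation: the SNN student's
    logits at every time step [T, B, C] are each distilled against the ANN
    teacher's logits [B, C], instead of distilling only the time-averaged
    output; an entropy term on the averaged student prediction regularizes
    optimization. Temperature, weight, and the entropy sign are assumptions."""
    T = student_logits_t.shape[0]
    teacher_p = F.softmax(teacher_logits / tau, dim=-1)
    kd = 0.0
    for t in range(T):                                   # per-time-step distillation
        log_p_s = F.log_softmax(student_logits_t[t] / tau, dim=-1)
        kd = kd + F.kl_div(log_p_s, teacher_p, reduction="batchmean") * tau * tau
    kd = kd / T
    mean_p = F.softmax(student_logits_t.mean(dim=0), dim=-1)
    entropy = -(mean_p * mean_p.clamp(min=1e-8).log()).sum(dim=-1).mean()
    return kd + ent_weight * entropy                      # entropy sign is an assumption

# Usage: T=4 time steps, batch of 8, 10 classes (toy tensors)
loss = temporal_separation_kd_loss(torch.randn(4, 8, 10), torch.randn(8, 10))
```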
Poster
Shiyang Zhou · Haijin Zeng · Yunfan Lu · Tong Shao · Ke Tang · Yongyong Chen · Jie Liu · Jingyong Su

[ ExHall D ]

Abstract
Quad Bayer demosaicing is the central challenge for enabling the widespread application of Hybrid Event-based Vision Sensors (HybridEVS). Although existing learning-based methods that leverage long-range dependency modeling have achieved promising results, their complexity severely limits deployment on mobile devices for real-world applications. To address these limitations, we propose a lightweight Mamba-based binary neural network designed for efficient and high-performing demosaicing of HybridEVS RAW images. First, to effectively capture both global and local dependencies, we introduce a hybrid Binarized Mamba-Transformer architecture that combines the strengths of the Mamba and Swin Transformer architectures. Next, to significantly reduce computational complexity, we propose a binarized semantic Mamba (BiS-Mamba), which binarizes all projections while retaining the core Selective Scan in full precision. BiS-Mamba also incorporates additional semantic information to enhance global context and mitigate precision loss. We conduct quantitative and qualitative experiments to demonstrate the effectiveness of BMTNet in both performance and computational efficiency, providing a lightweight demosaicing solution suited for real-world edge devices. Codes and models are available in the supplementary materials.
Poster
Yan Jiang · Hao Yu · Xu Cheng · Haoyu Chen · Zhaodong Sun · Guoying Zhao

[ ExHall D ]

Abstract
Aiming to match pedestrian images captured under varying lighting conditions, visible-infrared person re-identification (VI-ReID) has drawn intensive research attention and achieved promising results. However, in real-world surveillance contexts, data is distributed across multiple devices/entities, raising privacy and ownership concerns that make existing centralized training impractical for VI-ReID. To tackle these challenges, we propose L2RW, a benchmark that brings VI-ReID closer to real-world applications. The rationale of L2RW is that integrating decentralized training into VI-ReID can address privacy concerns in scenarios with limited data-sharing regulation. Specifically, we design protocols and corresponding algorithms for different privacy sensitivity levels. In our new benchmark, we ensure that model training is done under the conditions that: 1) data from each camera remains completely isolated, or 2) different data entities (e.g., data controllers of a certain region) can selectively share data. In this way, we simulate scenarios with strict privacy constraints, which is closer to real-world conditions. Intensive experiments with various server-side federated algorithms are conducted, showing the feasibility of decentralized VI-ReID training. Notably, when evaluated in unseen domains (i.e., new data entities), our L2RW, trained with isolated data (privacy-preserved), achieves performance comparable to state-of-the-art methods trained with shared data (privacy-unrestricted). We hope this work …
Poster
Leiye Liu · Miao Zhang · Jihao Yin · Tingwei Liu · Wei Ji · Yongri Piao · Huchuan Lu

[ ExHall D ]

Abstract
Recently, state space models (SSMs), particularly Mamba, have attracted significant attention from scholars due to their ability to effectively balance computational efficiency and performance. However, most existing visual Mamba methods flatten images into 1D sequences using predefined scan orders, which results in the model being less capable of utilizing the spatial structural information of the image during feature extraction. To address this issue, we propose a novel visual foundation model called DefMamba. This model includes a multi-scale backbone structure and deformable Mamba (DM) blocks, which dynamically adjust the scanning path to prioritize important information, thus enhancing the capture and processing of relevant input features. By incorporating a deformable scanning (DS) strategy, this model significantly improves its ability to learn image structures and to detect changes in object details. Numerous experiments have shown that DefMamba achieves state-of-the-art performance in various visual tasks, including image classification, object detection, instance segmentation, and semantic segmentation. The code will be open-sourced after the paper is published.
Poster
Woojin Lee · Hyugjae Chang · Jaeho Moon · Jaehyup Lee · Munchurl Kim

[ ExHall D ]

Abstract
Weakly supervised Oriented Object Detection (WS-OOD) has gained attention as a cost-effective alternative to fully supervised methods, providing efficiency and high accuracy. Among weakly supervised approaches, horizontal bounding box (HBox)-supervised OOD stands out for its ability to directly leverage existing HBox annotations while achieving the highest accuracy under weak supervision settings. This paper introduces ABBSPO, a framework for WS-OOD built on adaptive bounding box scaling and symmetry-prior-based orientation prediction. ABBSPO addresses the limitations of previous HBox-supervised OOD methods, which compare ground truth (GT) HBoxes directly with the minimum circumscribed rectangles of predicted RBoxes, often leading to inaccuracies. To overcome this, we propose: (i) Adaptive Bounding Box Scaling (ABBS), which appropriately scales the GT HBoxes to suit the size of each predicted RBox, ensuring more accurate prediction of RBox scales; and (ii) a Symmetric Prior Angle (SPA) loss that uses the inherent symmetry of aerial objects for self-supervised learning, addressing the issue in previous methods where learning fails if they consistently make incorrect predictions for all three augmented views (original, rotated, and flipped). Extensive experimental results demonstrate that ABBSPO achieves state-of-the-art results, outperforming existing methods.
Poster
Zhong-Yu Li · Xin Jin · Bo-Yuan Sun · Chun-Le Guo · Ming-Ming Cheng

[ ExHall D ]

Abstract
Existing object detection methods often assume sRGB input, which is compressed from RAW data using an ISP originally designed for visualization. However, such compression might lose crucial information for detection, especially under complex light and weather conditions. We introduce the AODRaw dataset, which offers 7,785 high-resolution real RAW images with 135,601 annotated instances spanning 62 categories, capturing a broad range of indoor and outdoor scenes under 9 distinct light and weather conditions. Based on AODRaw, which supports both RAW and sRGB object detection, we provide a comprehensive benchmark for evaluating current detection methods. We find that sRGB pre-training constrains the potential of RAW object detection due to the domain gap between sRGB and RAW, prompting us to pre-train directly on the RAW domain. However, it is harder for RAW pre-training to learn rich representations than sRGB pre-training due to camera noise. To assist RAW pre-training, we distill knowledge from an off-the-shelf model pre-trained on the sRGB domain. As a result, we achieve substantial improvements under diverse and adverse conditions without relying on extra pre-processing modules. The code and dataset will be made publicly available.
Poster
Zhejun Zhang · Peter Karkus · Maximilian Igl · Wenhao Ding · Yuxiao Chen · Boris Ivanovic · Marco Pavone

[ ExHall D ]

Abstract
Traffic simulation aims to learn a policy for traffic agents that, when unrolled in closed-loop, faithfully recovers the joint distribution of trajectories observed in the real world. Inspired by large language models, tokenized multi-agent policies have recently become the state-of-the-art in traffic simulation. However, they are typically trained through open-loop behavior cloning, and thus suffer from covariate shift when executed in closed-loop during simulation. In this work, we present Closest Among Top-K (CAT-K) rollouts, a simple yet effective closed-loop fine-tuning strategy to mitigate covariate shift. CAT-K fine-tuning only requires existing trajectory data, without reinforcement learning or generative adversarial imitation. Concretely, CAT-K fine-tuning enables a small 7M-parameter tokenized traffic simulation policy to outperform a 102M-parameter model from the same model family, achieving the top spot on the Waymo Sim Agent Challenge leaderboard at the time of submission. Our code will be made publicly available.
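A minimal sketch of one Closest-Among-Top-K rollout step as described in the abstract: among the policy's top-K candidate motion tokens, the one whose decoded position is closest to the logged ground truth is executed, keeping the closed-loop rollout near the data while still exposing the policy to its own states. The distance metric and the token-to-position decoding interface are illustrative assumptions, not the authors' implementation.

```python
import torch

def cat_k_rollout_step(logits: torch.Tensor,
                       candidate_positions: torch.Tensor,
                       gt_next_position: torch.Tensor,
                       k: int = 32) -> torch.Tensor:
    """One Closest-Among-Top-K style rollout step (sketch): take the policy's
    top-k motion tokens, decode each candidate to a next position, and execute
    the candidate closest to the logged ground-truth next position.
    logits: [V] token scores, candidate_positions: [V, 2] decoded (x, y) per
    token, gt_next_position: [2]. Decoding and distance are assumptions."""
    topk = torch.topk(logits, k=k).indices                       # [k] candidate tokens
    dists = (candidate_positions[topk] - gt_next_position).norm(dim=-1)
    chosen = topk[dists.argmin()]                                # closest to the log
    return chosen                                                # token to execute and supervise

# Usage inside a fine-tuning rollout (toy numbers: vocabulary of 512 motion tokens)
chosen_token = cat_k_rollout_step(torch.randn(512), torch.randn(512, 2),
                                  torch.tensor([1.0, 0.5]))
```

The chosen tokens can then serve as behavior-cloning targets on the states the policy actually visits, which is how closed-loop fine-tuning of this kind differs from open-loop teacher forcing.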
Poster
Minshan XIE · Jian Lin · Hanyuan Liu · Chengze Li · Tien-Tsin Wong

[ ExHall D ]

Abstract
Manga, a popular form of multimodal artwork, has traditionally been overlooked in deep learning advancements due to the absence of a robust dataset and comprehensive annotation. Manga segmentation is key to the digital migration of manga. There exists a significant domain gap between manga and natural images, which causes most existing learning-based methods to fail. To address this gap, we introduce an augmented segmentation annotation for the Manga109 dataset, a collection of 109 manga volumes that offers intricate artworks in a rich variety of styles. We introduce detailed annotations that extend beyond the original simple bounding boxes to segmentation masks with pixel-level precision, providing object category, location, and instance information that can be used for semantic segmentation and instance segmentation. We also provide a comprehensive analysis of our annotation dataset from various aspects, and we measure the improvement of a state-of-the-art segmentation model after training it with our augmented dataset. The benefits of this augmented dataset are profound, with the potential to significantly enhance manga analysis algorithms and catalyze novel developments in digital art processing and cultural analytics. The annotations will be publicly available upon acceptance.
Poster
qian deng · Le Hui · Jin Xie · Jian Yang

[ ExHall D ]

Abstract
Bounding box supervision has gained considerable attention in weakly supervised 3D instance segmentation. While this approach alleviates the need for extensive point-level annotations, obtaining accurate bounding boxes in practical applications remains challenging. To this end, we explore inaccurate bounding boxes, named sketchy bounding boxes, which are imitated by perturbing ground-truth bounding boxes with scaling, translation, and rotation. In this paper, we propose Sketchy-3DIS, a novel weakly supervised 3D instance segmentation framework, which jointly learns a pseudo labeler and a segmentator to improve performance under sketchy bounding-box annotations. Specifically, we first propose an adaptive box-to-point pseudo labeler that adaptively learns to assign points located in the overlapping parts of two sketchy bounding boxes to the correct instance, resulting in compact and pure pseudo instance labels. Then, we present a coarse-to-fine instance segmentator that first predicts coarse instances from the entire point cloud and then learns fine instances based on the regions of the coarse instances. Finally, by using the pseudo instance labels to supervise the instance segmentator, we can gradually generate high-quality instances through joint training. Extensive experiments show that our method achieves state-of-the-art performance on both the ScanNetV2 and S3DIS benchmarks, and even outperforms several fully supervised methods using …
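A small sketch of how a "sketchy" box can be imitated from a ground-truth 3D box, following the perturbations named in the abstract (scaling, translation, rotation); the box parameterization and the perturbation magnitudes are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def make_sketchy_box(box: np.ndarray,
                     scale_range: float = 0.1,
                     trans_range: float = 0.1,
                     rot_range_deg: float = 10.0,
                     rng=None) -> np.ndarray:
    """Imitate an inaccurate ('sketchy') 3D bounding box by perturbing a
    ground-truth box [cx, cy, cz, dx, dy, dz, yaw] with random scaling,
    translation (relative to box size), and rotation. Magnitudes are
    illustrative assumptions."""
    if rng is None:
        rng = np.random.default_rng()
    center, dims, yaw = box[:3].copy(), box[3:6].copy(), box[6]
    dims *= 1.0 + rng.uniform(-scale_range, scale_range, size=3)      # scaling
    center += rng.uniform(-trans_range, trans_range, size=3) * dims   # translation
    yaw += np.deg2rad(rng.uniform(-rot_range_deg, rot_range_deg))     # rotation
    return np.concatenate([center, dims, [yaw]])

# Usage
gt = np.array([1.0, 2.0, 0.5, 4.0, 1.8, 1.6, 0.3])
noisy = make_sketchy_box(gt)
```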
Poster
Edward LOO · Jiacheng Deng

[ ExHall D ]

Abstract
3D instance segmentation aims to predict a set of object instances in a scene, representing them as binary foreground masks with corresponding semantic labels. Currently, transformer-based methods are gaining increasing attention due to their elegant pipelines and superior predictions. However, these methods primarily focus on modeling the external relationships between scene features and query features through mask attention. They lack effective modeling of the internal relationships among scene features as well as between query features. In light of these disadvantages, we propose Relation3D: Enhancing Relation Modeling for Point Cloud Instance Segmentation. Specifically, we introduce an adaptive superpoint aggregation module and a contrastive learning-guided superpoint refinement module to better represent superpoint features (scene features) and leverage contrastive learning to guide the updates of these features. Furthermore, our relation-aware self-attention mechanism enhances the capabilities of modeling relationships between queries by incorporating positional and geometric relationships into the self-attention mechanism. Extensive experiments on the ScanNetV2, ScanNet++, ScanNet200 and S3DIS datasets demonstrate the superior performance of Relation3D.
Poster
Shuai Liu · Mingyue Cui · Boyang Li · Quanmin Liang · Tinghe Hong · Kai Huang · yunxiao shan · Kai Huang

[ ExHall D ]

Abstract
Fully sparse 3D detectors have recently gained significant attention due to their efficiency in long-range detection. However, sparse 3D detectors extract features only from non-empty voxels, which impairs long-range interactions and causes missing center features. The former weakens the feature extraction capability, while the latter hinders network optimization. To address these challenges, we introduce the Fully Sparse Hybrid Network (FSHNet). FSHNet incorporates a proposed SlotFormer block to enhance the long-range feature extraction capability of existing sparse encoders. The SlotFormer divides sparse voxels using a slot partition approach, which, compared to traditional window partition, provides a larger receptive field. Additionally, we propose a dynamic sparse label assignment strategy to deeply optimize the network by providing more high-quality positive samples. To further enhance performance, we introduce a sparse upsampling module to refine downsampled voxels, preserving fine-grained details crucial for detecting small objects. Extensive experiments on the Waymo, nuScenes, and Argoverse2 benchmarks demonstrate the effectiveness of FSHNet. The code will be publicly available.
Poster
Weijie Wei · Osman Ülger · Fatemeh Karimi Nejadasl · Theo Gevers · Martin R. Oswald

[ ExHall D ]

Abstract
Open-vocabulary segmentation methods offer promising capabilities in detecting unseen object categories, but the vocabulary must be known and needs to be provided by a human, either via a text prompt or pre-labeled datasets, thus limiting their scalability. We propose 3D-AVS, a method for Auto-Vocabulary Segmentation of 3D point clouds for which the vocabulary is unknown and auto-generated for each input at runtime, thus eliminating the human in the loop and typically providing a substantially larger vocabulary for richer annotations. 3D-AVS first recognizes semantic entities from image or point cloud data and then segments all points with the automatically generated vocabulary. Our method incorporates both image-based and point-based recognition, enhancing robustness under challenging lighting conditions where geometric information from LiDAR is especially valuable. Our point-based recognition features a Sparse Masked Attention Pooling (SMAP) module to enrich the diversity of recognized objects. To address the challenges of evaluating unknown vocabularies and avoid annotation biases from label synonyms, hierarchies, or semantic overlaps, we introduce the annotation-free Text-Point Semantic Similarity (TPSS) metric for assessing segmentation quality. Our evaluations on nuScenes and ScanNet demonstrate our method's ability to generate semantic classes with accurate point-wise segmentations.
Poster
Chenyi Zhang · Ting Liu · Xiaochao Qu · Luoqi Liu · Yao Zhao · Yunchao Wei

[ ExHall D ]

Abstract
Interactive segmentation is a pivotal task in computer vision, focused on predicting precise masks with minimal user input. Although the click has recently become the most prevalent form of interaction due to its flexibility and efficiency, its advantages diminish as the complexity and detail of target objects increase, because it is time-consuming and user-unfriendly to precisely locate and click on narrow, fine regions. To tackle this problem, we propose NTClick, a powerful click-based interactive segmentation method capable of predicting accurate masks even with imprecise user clicks when dealing with intricate targets. We first introduce a novel interaction form called the Noise-tolerant Click, a type of click that does not require the user's precise localization when selecting fine regions. Then, we design a two-stage workflow, consisting of an Explicit Coarse Perception network for initial estimation and a High Resolution Refinement network for final classification. Quantitative results across extensive datasets demonstrate that NTClick not only maintains an efficient and flexible interaction mode but also significantly outperforms existing methods in segmentation accuracy.
Poster
Cong Wei · Haoxian Tan · Yujie Zhong · Yong Liu · Jie Hu · Dengjie Li · Zheng Zhao · Yujiu Yang

[ ExHall D ]

Abstract
This paper aims to address universal segmentation for image and video perception with the strong reasoning ability empowered by Visual Large Language Models (VLLMs). Despite significant progress in current unified segmentation methods, limitations in adapting to both image and video scenarios, as well as to complex reasoning segmentation, make it difficult for them to handle various challenging instructions and achieve an accurate understanding of fine-grained visual-text correlations. We propose HyperSeg, the first VLLM-based universal segmentation model for pixel-level image and video perception, encompassing generic segmentation tasks and more complex reasoning perception tasks that require challenging reasoning abilities and world knowledge. Besides, to fully leverage the recognition capacity of VLLMs and fine-grained visual information, HyperSeg incorporates hybrid entity recognition and fine-grained visual perceiver modules for distinct segmentation tasks. Combined with a temporal adapter, HyperSeg achieves a comprehensive understanding of space-time information. Experimental results validate the effectiveness of our insights in resolving universal image and video segmentation tasks, including the more complex reasoning perception tasks.
Poster
Chinedu Innocent Nwoye · Kareem elgohary · Anvita A. Srinivas · Fauzan Zaid · Joël L. Lavanchy · Nicolas Padoy

[ ExHall D ]

Abstract
Tool tracking in surgical videos is essential for advancing computer-assisted interventions, such as skill assessment, safety zone estimation, and human-machine collaboration. However, the lack of context-rich datasets limits AI applications in this field. Existing datasets rely on overly generic tracking formalizations that fail to capture surgical-specific dynamics, such as tools moving out of the camera’s view or exiting the body. This results in less clinically relevant trajectories and a lack of flexibility for real-world surgical applications. Methods trained on these datasets often struggle with visual challenges such as smoke, reflection, and bleeding, further exposing the limitations of current approaches. We introduce CholecTrack20, a specialized dataset for multi-class, multi-tool tracking in surgical procedures. It redefines tracking formalization with three perspectives: (1) intraoperative, (2) intracorporeal, and (3) visibility, enabling adaptable and clinically meaningful tool trajectories. The dataset comprises 20 full-length surgical videos, annotated at 1 fps, yielding over 35K frames and 65K labeled tool instances. Annotations include spatial location, category, identity, operator, phase, and visual challenges. Benchmarking state-of-the-art methods on CholecTrack20 reveals significant performance gaps, with current approaches (<45% HOTA) failing to meet the accuracy required for clinical translation. These findings motivate the need for advanced and intuitive tracking algorithms and …
Poster
Jiawei Fu · ZHANG Tiantian · Kai Chen · Qi Dou

[ ExHall D ]

Abstract
Scene graph generation is a pivotal task in computer vision, aiming to identify all visual relation tuples within an image. Recent methods involving triplets have sought to enhance task performance by integrating triplets as contextual features for more precise predicate identification at the component level. However, challenges remain due to interference from multi-role objects in overlapping tuples within complex environments, which impairs the model's ability to distinguish and align specific triplet features for reasoning over the diverse semantics of multi-role objects. To address these issues, we introduce a novel framework that incorporates a triplet alignment model into a hybrid reciprocal transformer architecture, starting from using triplet mask features to guide the learning of component-level relation graphs. To effectively distinguish multi-role objects characterized by overlapping visual relation tuples, we introduce a triplet alignment loss, which provides multi-role objects with aligned triplet features and helps customize them. Additionally, we explore the inherent connectivity between hybrid aligned triplet and component features through a bidirectional refinement module, which enhances feature interaction and reciprocal reinforcement. Experimental results demonstrate that our model achieves state-of-the-art performance on the Visual Genome and Action Genome datasets, underscoring its effectiveness and adaptability. The code will be available.
Poster
Brian Chao · Hung-Yu Tseng · Lorenzo Porzi · Chen Gao · Tuotuo Li · Qinbo Li · Ayush Saraf · Jia-Bin Huang · Johannes Kopf · Gordon Wetzstein · Changil Kim

[ ExHall D ]

Abstract
3D Gaussian Splatting (3DGS) has recently emerged as a state-of-the-art 3D reconstruction and rendering technique due to its high-quality results and fast training and rendering time. However, pixels covered by the same Gaussian are always shaded in the same color up to a Gaussian falloff scaling factor. Furthermore, the finest geometric detail any individual Gaussian can represent is a simple ellipsoid. These properties of 3DGS greatly limit the expressivity of individual Gaussian primitives. To address these issues, we draw inspiration from texture and alpha mapping in traditional graphics and integrate it with 3DGS. Specifically, we propose a new generalized Gaussian appearance representation that augments each Gaussian with alpha (A), RGB, or RGBA texture maps to model spatially varying color and opacity across the extent of each Gaussian. As such, each Gaussian can represent a richer set of texture patterns and geometric structures, instead of just a single color and ellipsoid as in naive Gaussian Splatting. Surprisingly, we found that the expressivity of Gaussians can be greatly improved by using alpha-only texture maps, and further augmenting Gaussians with RGB texture maps achieves the highest expressivity. We validate our method on a wide variety of standard benchmark datasets and our own custom …
Poster
Wei Deng · Mengshi Qi · Huadong Ma

[ ExHall D ]

Abstract
Large Vision-Language Models (VLMs), such as GPT-4, have achieved remarkable success across various fields. However, there is little work studying 3D indoor scene generation with VLMs. This paper considers this task as a planning problem subject to spatial and layout common-sense constraints. To solve the problem with a VLM, we propose a new global-local tree search algorithm. Globally, the method places each object sequentially and explores multiple placements during each placement process, where the problem space is represented as a tree. To reduce the depth of the tree, we decompose the scene structure hierarchically, i.e., room level, region level, floor-object level, and supported-object level. The algorithm independently generates the floor objects in different regions and the supported objects placed on different floor objects. Locally, we also decompose the sub-task, the placement of each object, into multiple steps. The algorithm searches the tree of the problem space. To leverage the VLM to produce object positions, we discretize the top-down view space into a dense grid and fill each cell with diverse emojis to make the cells distinct. We prompt the VLM with the emoji grid, and the VLM produces a reasonable location for the object by describing …
Poster
Sayan Deb Sarkar · Ondrej Miksik · Marc Pollefeys · Daniel Barath · Iro Armeni

[ ExHall D ]

Abstract
Multi-modal 3D object understanding has gained significant attention, yet current approaches often rely on rigid object-level modality alignment or assume complete data availability across all modalities. We present CrossOver, a novel framework for cross-modal 3D scene understanding via flexible, scene-level modality alignment. Unlike traditional methods that require paired data for every object instance, CrossOver learns a unified, modality-agnostic embedding space for scenes by aligning modalities—RGB images, point clouds, CAD models, floorplans, and text descriptions—without explicit object semantics. Leveraging dimensionality-specific encoders, a multi-stage training pipeline, and emergent cross-modal behaviors, CrossOver supports robust scene retrieval and object localization, even with missing modalities. Evaluations on ScanNet and 3RScan datasets show its superior performance across diverse metrics, highlighting CrossOver's adaptability for real-world applications in 3D scene understanding.
Poster
Duo Zheng · Shijia Huang · Liwei Wang

[ ExHall D ]

Abstract
The rapid advancement of Multimodal Large Language Models (MLLMs) has significantly impacted various multimodal tasks. However, these models face challenges in tasks that require spatial understanding within 3D environments. Efforts to enhance MLLMs, such as incorporating point cloud features, have been made, yet a considerable gap remains between the models' learned representations and the inherent complexity of 3D scenes. This discrepancy largely stems from the training of MLLMs on predominantly 2D data, which restricts their effectiveness in comprehending 3D spaces. To address this issue, in this paper, we propose a novel generalist model, i.e., Video-3D LLM, for 3D scene understanding. By treating 3D scenes as dynamic videos and incorporating 3D position encoding into these representations, our Video-3D LLM aligns video representations with real-world spatial contexts more accurately. Additionally, we have implemented a maximum coverage sampling technique to optimize the balance between computational costs and performance efficiency. Extensive experiments demonstrate that our model achieves state-of-the-art performance on several 3D scene understanding benchmarks, including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
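The maximum coverage sampling mentioned above can be pictured as a greedy selection of frames. The sketch below is a minimal illustration under the assumption that each frame "covers" a set of scene-voxel ids; the function and variable names are hypothetical and do not reproduce the authors' implementation.

```python
# Greedy maximum-coverage frame sampling (illustrative sketch).
def greedy_max_coverage(frame_voxels: list, budget: int) -> list:
    """frame_voxels[i] is the set of scene-voxel ids visible in frame i."""
    covered, chosen = set(), []
    for _ in range(budget):
        best = max(
            (i for i in range(len(frame_voxels)) if i not in chosen),
            key=lambda i: len(frame_voxels[i] - covered),
            default=None,
        )
        if best is None or not (frame_voxels[best] - covered):
            break                           # nothing new left to cover
        chosen.append(best)
        covered |= frame_voxels[best]
    return chosen

# Toy example: four frames observing overlapping parts of a scene.
frames = [{1, 2, 3}, {3, 4}, {5, 6, 7, 8}, {2, 8}]
print(greedy_max_coverage(frames, budget=2))   # -> [2, 0]
```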
Poster
Xuewu Lin · Tianwei Lin · Alan Huang · HONGYU XIE · Zhizhong Su

[ ExHall D ]

Abstract
In embodied intelligence systems, a key component is the 3D perception algorithm, which enables agents to understand their surrounding environments. Previous algorithms primarily rely on point clouds, which, despite offering precise geometric information, still constrain perception performance due to inherent sparsity, noise, and data scarcity. In this work, we introduce a novel image-centric 3D perception model, BIP3D, which leverages expressive image features with explicit 3D position encoding to overcome the limitations of point-centric methods. Specifically, we leverage pre-trained 2D vision foundation models to enhance semantic understanding, and introduce a spatial enhancer module to improve spatial understanding. Together, these modules enable BIP3D to achieve multi-view, multi-modal feature fusion and end-to-end 3D perception. In our experiments, BIP3D outperforms current state-of-the-art results on the EmbodiedScan benchmark, achieving improvements of 5.69% in the 3D detection task and 15.25% in the 3D visual grounding task.
Poster
Atharv Mahesh Mane · Dulanga Weerakoon · Vigneshwaran Subbaraju · Sougata Sen · Sanjay Sarma · Archan Misra

[ ExHall D ]

Abstract
3-Dimensional Embodied Reference Understanding (3D-ERU) is a complex vision-language task, crucial for supporting interactions between humans and situated AI agents. 3D-ERU combines a language description and an accompanying pointing gesture to identify the most relevant target object in a given 3D scene. While previous research has extensively explored pure language-based 3D grounding, there has been limited exploration of 3D-ERU, which also incorporates human pointing gestures. To address this gap, we introduce a data augmentation framework -- **Imputer** -- and use it to curate a new challenging benchmark dataset -- **ImputeRefer** -- for 3D-ERU, by incorporating human pointing gestures into existing 3D scene datasets that only contain language instructions. We also propose **Ges3ViG**, a novel model for 3D-ERU that achieves a 30% improvement in accuracy compared to other 3D-ERU models and a 9% improvement compared to purely language-based 3D grounding models. Our code and data will be released publicly upon acceptance of the paper.
Poster
YUEJIAO SU · Yi Wang · Qiongyang Hu · Chuang Yang · Lap-Pui Chau

[ ExHall D ]

Abstract
Egocentric interaction perception is one of the essential branches in investigating human-environment interaction, which lays the basis for developing next-generation intelligent systems. However, existing egocentric interaction understanding methods cannot yield coherent textual and pixel-level responses simultaneously according to user queries, which lacks flexibility for varying downstream application requirements. To comprehend egocentric interactions exhaustively, this paper presents a novel task named Egocentric Interaction Reasoning and pixel Grounding (Ego-IRG). Taking an egocentric image with the query as input, Ego-IRG is the first task that aims to resolve the interactions through three crucial steps: analyzing, answering, and pixel grounding, which results in fluent textual and fine-grained pixel-level responses. Another challenge is that existing datasets cannot meet the conditions for the Ego-IRG task. To address this limitation, this paper creates the Ego-IRGBench dataset based on extensive manual efforts, which includes over 20k egocentric images with 1.6 million queries and corresponding multimodal responses about interactions. Moreover, we design a unified ANNEXE model to generate text- and pixel-level outputs utilizing multimodal large language models, which enables a comprehensive interpretation of egocentric interactions. The experiments on the Ego-IRGBench exhibit the effectiveness of our ANNEXE model compared with other works.
Poster
Zaijing Li · Yuquan Xie · Rui Shao · Gongwei Chen · Dongmei Jiang · Liqiang Nie

[ ExHall D ]

Abstract
Building an agent that can mimic human behavior patterns to accomplish various open-world tasks is a long-term goal. To enable agents to effectively learn behavioral patterns across diverse tasks, a key challenge lies in modeling the intricate relationships among observations, actions, and language. To this end, we propose Optimus-2, a novel Minecraft agent that incorporates a Multimodal Large Language Model (MLLM) for high-level planning, alongside a Goal-Observation-Action Conditioned Policy (GOAP) for low-level control. GOAP contains (1) an Action-guided Behavior Encoder that models causal relationships between observations and actions at each timestep, then dynamically interacts with the historical observation-action sequence, consolidating it into fixed-length behavior tokens, and (2) an MLLM that aligns behavior tokens with open-ended language instructions to predict actions auto-regressively. Moreover, we introduce a high-quality Minecraft Goal-Observation-Action (MGOA) dataset, which contains 25,000 videos across 8 atomic tasks, providing about 30M goal-observation-action pairs. The automated construction method, along with the MGOA dataset, can contribute to the community's efforts in training Minecraft agents. Extensive experimental results demonstrate that Optimus-2 exhibits superior performance across atomic tasks, long-horizon tasks, and open-ended instruction tasks in Minecraft.
Poster
Di Zhang · Jingdi Lei · Junxian Li · Xunzhi Wang · Yujie Liu · Zonglin Yang · Jiatong LI · Weida Wang · Suorong Yang · Jianbo Wu · Peng Ye · Wanli Ouyang · Dongzhan Zhou

[ ExHall D ]

Abstract
Vision-language models (VLMs) have shown remarkable advancements in multimodal reasoning tasks. However, they still often generate inaccurate or irrelevant responses due to issues like hallucinated image understandings or unrefined reasoning paths. To address these challenges, we introduce Critic-V, a novel framework inspired by the Actor-Critic paradigm to boost the reasoning capability of VLMs. This framework decouples the reasoning process and the critic process by integrating two independent components: the Reasoner, which generates reasoning paths based on visual and textual inputs, and the Critic, which provides constructive critique to refine these paths. In this approach, the Reasoner generates reasoning responses according to text prompts, which can evolve iteratively as a policy based on feedback from the Critic. This interaction process is theoretically grounded in a reinforcement learning framework where the Critic offers natural language critiques instead of scalar rewards, enabling more nuanced feedback to boost the Reasoner's capability on complex reasoning tasks. The Critic model is trained using Direct Preference Optimization (DPO), leveraging a preference dataset of critiques ranked by Rule-based Reward (RBR) to enhance its critic capabilities. Evaluation results show that the Critic-V framework significantly outperforms existing methods, including GPT-4V, on 5 out of 8 benchmarks, especially regarding reasoning accuracy and efficiency. …
Poster
Yuhao Dong · Zuyan Liu · Hai-Long Sun · Jingkang Yang · Winston Hu · Yongming Rao · Ziwei Liu

[ ExHall D ]

Abstract
Large Language Models (LLMs) demonstrate enhanced capabilities and reliability by reasoning more, evolving from Chain-of-Thought prompting to product-level solutions like OpenAI o1. Despite various efforts to improve LLM reasoning, high-quality long-chain reasoning data and optimized training pipelines remain inadequately explored in vision-language tasks. In this paper, we present Insight-V, an early effort to 1) scalably produce long and robust reasoning data for complex multi-modal tasks, and 2) build an effective training pipeline to enhance the reasoning capabilities of multi-modal large language models (MLLMs). Specifically, to create long and structured reasoning data without human labor, we design a two-step pipeline with a progressive strategy to generate sufficiently long and diverse reasoning paths and a multi-granularity assessment method to ensure data quality. We observe that directly supervising MLLMs with such long and complex reasoning data will not yield ideal reasoning ability. To tackle this problem, we design a multi-agent system consisting of a reasoning agent dedicated to performing long-chain reasoning and a summary agent trained to judge and summarize reasoning results. We further incorporate an iterative DPO algorithm to enhance the reasoning agent's generation stability and quality. Based on the popular LLaVA-NeXT model, our method shows an average improvement of 7.5% across …
Poster
Jae Sung Park · Zixian Ma · Linjie Li · Chenhao Zheng · Cheng-Yu Hsieh · Ximing Lu · Khyathi Chandu · Quan Kong · Norimasa Kobori · Ali Farhadi · Yejin Choi · Ranjay Krishna

[ ExHall D ]

Abstract
Understanding and reasoning over visual relationships—spatial, functional, interactional, social, etc.—are considered to be a fundamental component of human cognition. Yet, despite the major advances in visual comprehension in multimodal language models, precise reasoning over relationships remains a challenge. We introduce Robin: an MLM instruction-tuned with densely annotated relationships, capable of constructing high-quality dense scene graphs at scale. To train Robin, we curate SVG, a scene-graph-based instruction tuning dataset containing 33K images and 855K relationships for 170K objects, built by completing the missing relations in existing scene graphs using GPT4-V and a carefully designed filtering process—combining rule-based and model-based filtering techniques to ensure high quality. To generate more accurate and rich scene graphs at scale for any image, we introduce SG-EDIT: a self-distillation framework where GPT-4o refines Robin's predicted scene graphs by removing unlikely relations and/or suggesting relevant ones. Results show that our Robin-3B model, despite being trained on less than 3 million instances, outperforms similar-size models trained on over 300 million instances on relationship understanding benchmarks, and even surpasses larger models of up to 13B parameters. Notably, it achieves state-of-the-art performance in referring expression comprehension with a score of 88.2, surpassing the previous best of 87.4. Our results suggest that training on …
Poster
Yexin Liu · Zhengyang Liang · Yueze Wang · Xianfeng Wu · feilong tang · Muyang He · Jian Li · Zheng Liu · Harry Yang · Ser-Nam Lim · Bo Zhao

[ ExHall D ]

Abstract
**M**ultimodal **L**arge **L**anguage **M**odels (MLLMs) have displayed remarkable performance in multimodal tasks, particularly in visual comprehension. However, we reveal that MLLMs often generate incorrect answers even when they understand the visual content. To this end, we manually construct a benchmark with 12 categories and design evaluation metrics that assess the degree of error in MLLM responses even when the visual content is seemingly understood. Based on this benchmark, we test 15 leading MLLMs and analyze the distribution of attention maps and logits of some MLLMs. Our investigation identifies two primary issues: 1) most instruction tuning datasets predominantly feature questions that "directly" relate to the visual content, leading to a bias in MLLMs' responses to other indirect questions, and 2) MLLMs' attention to visual tokens is notably lower than to system and question tokens. We further observe that attention scores between questions and visual tokens, as well as the model's confidence in the answers, are lower in response to misleading questions than to straightforward ones. To address the first challenge, we introduce a paired positive and negative data construction pipeline to diversify the dataset. For the second challenge, we propose to enhance the model's focus on visual content during decoding by …
Poster
Geng Li · Jinglin Xu · Yunzhen Zhao · Yuxin Peng

[ ExHall D ]

Abstract
Humans can effortlessly locate desired objects in cluttered environments, relying on a cognitive mechanism known as visual search to efficiently filter out irrelevant information and focus on task-related regions. Inspired by this process, we propose DyFo (Dynamic Focus), a training-free dynamic focusing visual search method that enhances fine-grained visual understanding in large multimodal models (LMMs). Unlike existing approaches which require additional modules or data modifications, DyFo leverages a bidirectional interaction between LMMs and visual experts, using a Monte Carlo Tree Search (MCTS) algorithm to simulate human-like focus adjustments. This enables LMMs to focus on key visual regions while filtering out irrelevant content without the need for vocabulary expansion or specialized localization modules. Experimental results demonstrate that DyFo significantly improves fine-grained visual understanding and reduces hallucination issues in LMMs, achieving superior performance across both fixed and variable resolution models.
Poster
Vésteinn Snæbjarnarson · Kevin Du · Niklas Stoehr · Serge Belongie · Ryan Cotterell · Nico Lang · Stella Frank

[ ExHall D ]

Abstract
When a vision-and-language model (VLM) is prompted to identify an entity in an image, it may err on the side of caution and answer with "tree", instead of a more specific description such as "pine tree". Traditional binary accuracy metrics cannot differentiate between wrong predictions and insufficiently specific ones. They also do not give partial credit for close answers: "pine tree" for a Norway spruce should be better than "cypress", taxonomically speaking, but string-matching-based similarity measures will reject both equally. To address this shortcoming, we propose a framework for evaluating open-ended text predictions against a taxonomic hierarchy, using measures of hierarchical precision and recall to measure the level of correctness and specificity of predictions. We first show that existing text similarity measures and accuracy-based evaluation metrics do not capture taxonomic similarity well. We then develop and compare different methods to map textual VLM predictions onto a taxonomy. This allows us to compute hierarchical similarity measures between the free-form outputs and the ground truth labels. Finally, we analyze modern VLMs on fine-grained visual classification tasks based on our taxonomic evaluation. We find that models respond differently to instructions prompting for more specific answers, with GPT4V responding most specifically and others showing a trade-off between …
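To make the hierarchical precision/recall idea concrete, here is a minimal sketch over a toy taxonomy. The taxonomy, label names, and the ancestor-overlap formulation are illustrative assumptions, not the paper's released evaluation code.

```python
# Toy taxonomy as child -> parent edges (illustrative, not the paper's data).
TOY_TAXONOMY = {
    "norway spruce": "spruce", "spruce": "pinaceae",
    "pine tree": "pine", "pine": "pinaceae",
    "cypress": "cupressaceae",
    "pinaceae": "conifer", "cupressaceae": "conifer", "conifer": "tree",
}

def ancestors(label):
    """Return the label together with all of its ancestors."""
    chain = {label}
    while label in TOY_TAXONOMY:
        label = TOY_TAXONOMY[label]
        chain.add(label)
    return chain

def hierarchical_pr(pred, gold):
    """Ancestor-overlap hierarchical precision and recall."""
    p, g = ancestors(pred), ancestors(gold)
    overlap = len(p & g)
    return overlap / len(p), overlap / len(g)

print(hierarchical_pr("pine tree", "norway spruce"))  # (0.6, 0.6)  close miss
print(hierarchical_pr("cypress", "norway spruce"))    # (0.5, 0.4)  worse miss
print(hierarchical_pr("tree", "norway spruce"))       # (1.0, 0.2)  correct but vague
```

The last case shows why the two measures are useful together: an under-specific answer keeps perfect hierarchical precision but pays in recall.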
Poster
Kartik Thakral · Tamar Glaser · Tal Hassner · Mayank Vatsa · Richa Singh

[ ExHall D ]

Abstract
Existing unlearning algorithms in text-to-image generative models often fail to preserve knowledge of semantically related concepts when removing specific target concepts—a challenge known as _adjacency_. To address this, we propose **FADE** (_Fine-grained Attenuation for Diffusion Erasure_), introducing adjacency-aware unlearning in diffusion models. FADE comprises two components: (1) the **Concept Lattice**, which identifies an adjacency set of related concepts, and (2) **Mesh Modules**, employing a structured combination of Expungement, Adjacency, and Guidance loss components. These enable precise erasure of target concepts while preserving fidelity across related and unrelated concepts. Evaluated on datasets like Stanford Dogs, Oxford Flowers, CUB, I2P, Imagenette, and ImageNet-1k, FADE effectively removes target concepts with minimal impact on correlated concepts, achieving at least a **12% improvement in retention performance** over state-of-the-art methods.
Poster
Zhikai Li · Xuewen Liu · Dongrong Joe Fu · Jianquan Li · Qingyi Gu · Kurt Keutzer · Zhen Dong

[ ExHall D ]

Abstract
The rapid advancement of visual generative models necessitates efficient and reliable evaluation methods. The Arena platform, which gathers user votes on model comparisons, can rank models according to human preferences. However, traditional Arena methods, while established, require an excessive number of comparisons for the ranking to converge and are vulnerable to preference noise in voting, suggesting the need for better approaches tailored to contemporary evaluation challenges. In this paper, we introduce K-Sort Arena, an efficient and reliable platform based on a key insight: images and videos possess higher perceptual intuitiveness than texts, enabling rapid evaluation of multiple samples simultaneously. Consequently, K-Sort Arena employs K-wise comparisons, allowing K models to engage in free-for-all competitions, which yield much richer information than pairwise comparisons. To enhance the robustness of the system, we leverage probabilistic modeling and Bayesian updating techniques. We propose an exploration-exploitation-based matchmaking strategy to facilitate more informative comparisons. In our experiments, K-Sort Arena exhibits 16.3× faster convergence compared to the widely used ELO algorithm. To further validate the superiority and obtain a comprehensive leaderboard, we collect human feedback via crowdsourced evaluations of numerous cutting-edge text-to-image and text-to-video models. Thanks to its high efficiency, K-Sort Arena can continuously incorporate emerging models and update the leaderboard …
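As a rough illustration of why K-wise comparisons carry more information than pairwise ones, the sketch below expands a single K-model ranking into its implied pairwise Elo updates. This is only an intuition aid; the platform itself relies on probabilistic modeling and Bayesian updating rather than this plain Elo rule, and the model names are placeholders.

```python
import itertools

def elo_update(ratings, winner, loser, k=32.0):
    """Standard pairwise Elo update (baseline for the intuition only)."""
    expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
    ratings[winner] += k * (1.0 - expected_win)
    ratings[loser] -= k * (1.0 - expected_win)

def kwise_update(ratings, ranking, k=32.0):
    """One K-wise free-for-all vote implies K*(K-1)/2 pairwise outcomes."""
    for better, worse in itertools.combinations(ranking, 2):  # ranking: best -> worst
        elo_update(ratings, better, worse, k)

ratings = {m: 1000.0 for m in ["model_a", "model_b", "model_c", "model_d"]}
kwise_update(ratings, ranking=["model_c", "model_a", "model_d", "model_b"])
print(ratings)   # a single vote already moves every model's rating
```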
Poster
Xuecheng Wu · Heli Sun · Yifan Wang · Jiayu Nie · Jie Zhang · Yabing Wang · Junxiao Xue · Liang He

[ ExHall D ]

Abstract
Affective Video Facial Analysis (AVFA) is important for advancing emotion-aware AI, yet persistent data scarcity in AVFA presents challenges. Recently, the self-supervised learning (SSL) technique of Masked Autoencoders (MAE) has gained significant attention, particularly in its audio-visual adaptation. Insights from general domains suggest that scaling is vital for unlocking impressive improvements, though its effects on AVFA remain largely unexplored. Additionally, capturing both intra- and inter-modal correlations through robust representations is a crucial challenge in this field. To tackle these gaps, we introduce AVF-MAE++, a series of audio-visual MAEs designed to explore the impact of scaling on AVFA with a focus on advanced correlation modeling. Our method incorporates a novel audio-visual dual masking strategy and an improved modality encoder with a holistic view to better support scalable pre-training. Furthermore, we propose the Iteratively Audio-Visual Correlations Learning Module to improve the capture of correlations within the SSL framework, bridging the limitations of prior methods. To support smooth adaptation and mitigate overfitting, we also introduce a progressive semantics injection strategy, which structures training in three stages. Extensive experiments across 17 datasets, spanning three key AVFA tasks, demonstrate the superior performance of AVF-MAE++, establishing new state-of-the-art results. Ablation studies provide further insights into the critical design …
Poster
Xiaoqin Wang · Xusen Ma · Xianxu Hou · Meidan Ding · Yudong Li · Junliang Chen · Wenting Chen · Xiaoyang Peng · Linlin Shen

[ ExHall D ]

Abstract
Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in various tasks. However, effectively evaluating these MLLMs on face perception remains largely unexplored. To address this gap, we introduce FaceBench, a dataset featuring hierarchical multi-view and multi-level attributes specifically designed to assess the comprehensive face perception abilities of MLLMs. Initially, we construct a hierarchical facial attribute structure, which encompasses five views with up to three levels of attributes, totaling over 210 attributes and 700 attribute values. Based on this structure, the proposed FaceBench consists of 49,919 visual question-answering (VQA) pairs for evaluation and 23,841 pairs for fine-tuning. Moreover, we further develop a robust face perception MLLM baseline, Face-LLaVA, by multi-modal training with our proposed face instruction-tuning data. Extensive experiments on various mainstream MLLMs and Face-LLaVA are conducted to test their face perception ability, and the results are also compared with human performance. The results reveal that existing MLLMs are far from satisfactory in understanding fine-grained facial attributes, while our Face-LLaVA significantly outperforms existing open-source models with a small amount of training data and is comparable to commercial ones like GPT-4o and Gemini. The dataset will be released upon acceptance of this work.
Poster
Qihui Zhang · Munan Ning · Zheyuan Liu · Yanbo Wang · Jiayi Ye · Yue Huang · Shuo Yang · Xiao Chen · Yibing Song · Li Yuan

[ ExHall D ]

Abstract
Multimodal Large Language Models (MLLMs) have emerged to tackle the challenges of Visual Question Answering (VQA), sparking a new research focus on conducting objective evaluations of these models. Existing evaluation mechanisms face limitations due to the significant human workload required to design Q&A pairs for visual images, which inherently restricts the scale and scope of evaluations. Although automated MLLM-as-judge approaches attempt to reduce human workload through mutual model evaluations, they often introduce biases. To address these problems, we propose an unsupervised evaluation method: the **U**nsupervised **P**eer review **M**LLM **E**valuation (UPME) framework. This framework utilizes only image data, allowing models to automatically generate questions and conduct peer review assessments of answers from other models, effectively alleviating the reliance on human workload. Additionally, we introduce a vision-language scoring system to mitigate bias issues, which focuses on three aspects: (i) response correctness; (ii) the model's capability of visual understanding and reasoning; and (iii) relevance of text-image matching. Experimental results demonstrate that UPME achieves a Pearson correlation of 0.944 with human evaluations on the MMstar dataset and 0.814 on the ScienceQA dataset, indicating that our UPME framework closely aligns with human-designed QA benchmarks and inherent human preferences.
Poster
Silin Cheng · Yang Liu · Xinwei He · Sebastien Ourselin · Lei Tan · Luo

[ ExHall D ]

Abstract
Weakly supervised referring expression comprehension (WREC) and segmentation (WRES) aim to learn object grounding based on a given expression using weak supervision signals like image-text pairs. While these tasks have traditionally been modeled separately, we argue that they can benefit from joint learning in a multi-task framework. To this end, we propose WeakMCN, a novel multi-task collaborative network that effectively combines WREC and WRES with a dual-branch architecture. Specifically, the WREC branch is formulated as anchor-based contrastive learning, which also acts as a teacher to supervise the WRES branch. In WeakMCN, we propose two innovative designs to facilitate multi-task collaboration, namely Dynamic Visual Feature Enhancement (DVFE) and Collaborative Consistency Module (CCM). DVFE dynamically combines various pre-trained visual knowledge to meet different task requirements, while CCM promotes cross-task consistency from the perspective of optimization. Extensive experimental results on three popular REC and RES benchmarks, i.e., RefCOCO, RefCOCO+, and RefCOCOg, consistently demonstrate performance gains of WeakMCN over state-of-the-art single-task alternatives, e.g., up to 3.91% and 13.11% on RefCOCO for WREC and WRES tasks, respectively. Furthermore, experiments also validate the strong generalization ability of WeakMCN in both semi-supervised REC and RES settings against existing methods, e.g., +8.94% for semi-REC and +7.71% for semi-RES …
Poster
Jiansheng Li · Xingxuan Zhang · Hao Zou · Yige Guo · Renzhe Xu · Yilong Liu · Chuzhao Zhu · Yue He · Peng Cui

[ ExHall D ]

Abstract
Current object detectors often suffer significant performance degradation in real-world applications when encountering distributional shifts, posing serious risks in high-stakes domains such as autonomous driving and medical diagnosis. Consequently, the out-of-distribution (OOD) generalization capability of object detectors has garnered increasing attention from researchers. Despite this growing interest, there remains a lack of a large-scale, comprehensive dataset and evaluation benchmark with fine-grained annotations tailored to assess OOD generalization on more intricate tasks like object detection and grounding. To address this gap, we introduce COUNTS, a large-scale OOD dataset with object-level annotations. COUNTS encompasses 14 natural distributional shifts, over 222K samples, and more than 1,196K labeled bounding boxes. Leveraging COUNTS, we introduce two novel benchmarks: OODOD and OODG. OODOD is designed to comprehensively evaluate the OOD generalization capabilities of object detectors by utilizing controlled distribution shifts between training and testing data. OODG, on the other hand, aims to assess the OOD generalization of grounding abilities in multimodal large language models (MLLMs). Our findings reveal that, while large models and extensive pre-training data substantially enhance performance in in-distribution (IID) scenarios, significant limitations and opportunities for improvement persist in OOD contexts for both object detectors and MLLMs. In visual grounding tasks, even the …
Poster
Federico Cocchi · Nicholas Moratelli · Marcella Cornia · Lorenzo Baraldi · Rita Cucchiara

[ ExHall D ]

Abstract
Multimodal LLMs (MLLMs) are the natural extension of large language models to handle multimodal inputs, combining text and image data. They have recently garnered attention due to their capability to address complex tasks involving both modalities. However, their effectiveness is limited to the knowledge acquired during training, which restricts their practical utility. In this work, we introduce a novel method to enhance the adaptability of MLLMs by integrating external knowledge sources. Our proposed model, Reflective LLaVA (ReflectiVA), utilizes reflective tokens to dynamically determine the need for external knowledge and predict the relevance of information retrieved from an external database. Tokens are trained following a two-stage two-model training recipe. This ultimately enables the MLLM to manage external knowledge while preserving fluency and performance on tasks where external knowledge is not needed. Through our experiments, we demonstrate the efficacy of ReflectiVA for knowledge-based visual question answering, highlighting its superior performance compared to existing methods. Source code and models will be publicly released.
Poster
Daniel Samira · Edan Habler · Yuval Elovici · Asaf Shabtai

[ ExHall D ]

Abstract
The proliferation of multi-modal generative models has introduced new privacy and security challenges, especially due to the risks of memorization and unintentional disclosure of sensitive information. This paper focuses on the vulnerability of multi-modal image captioning models to membership inference attacks (MIAs). These models, which synthesize textual descriptions from visual content, could inadvertently reveal personal or proprietary data embedded in their training datasets. We explore the feasibility of MIAs in the context of such models. Specifically, our approach leverages a variance-based strategy tailored for image captioning models, utilizing only image data without knowing the corresponding caption. We introduce the means-of-variance threshold attack (MVTA) and confidence-based weakly supervised attack (C-WSA) based on the metric, means-of-variance (MV), to assess variability among vector embeddings. Our experiments demonstrate that these models are susceptible to MIAs, indicating substantial privacy risks. The effectiveness of our methods is validated through rigorous evaluations on these real-world models, confirming the practical implications of our findings.
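A minimal sketch of the variance-based intuition behind the means-of-variance score is shown below. How the captions are sampled and embedded, and where the decision threshold comes from, are assumptions made here for illustration; the function names are hypothetical.

```python
import torch

def means_of_variance(caption_embs: torch.Tensor) -> float:
    """Mean over dimensions of the variance across sampled caption embeddings.

    caption_embs: (num_samples, dim) embeddings of several captions generated
    for the same image. Lower variability is taken as a hint of membership.
    """
    return caption_embs.var(dim=0).mean().item()

def mvta(caption_embs: torch.Tensor, threshold: float) -> bool:
    """Means-of-variance threshold attack: flag the image as a likely training member."""
    return means_of_variance(caption_embs) < threshold

# Toy usage with random vectors standing in for embedded captions.
samples = torch.randn(8, 384)
print(means_of_variance(samples), mvta(samples, threshold=0.9))
```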
Poster
Jiayu Jiang · Changxing Ding · Wentao Tan · Junhong Wang · JIN Tao · Xiangmin Xu

[ ExHall D ]

Abstract
Text-to-image person re-identification (ReID) aims to retrieve the images of a person of interest based on textual descriptions. One main challenge for this task is the high cost of manually annotating large-scale databases, which affects the generalization ability of ReID models. Recent works handle this problem by leveraging Multi-modal Large Language Models (MLLMs) to describe pedestrian images automatically. However, the captions produced by MLLMs lack diversity in description styles. To address this issue, we propose a Human Annotator Modeling (HAM) approach to enable MLLMs to mimic the description styles of thousands of human annotators. Specifically, we first extract style features from human textual descriptions and perform clustering on them. This allows us to group textual descriptions with similar styles into the same cluster. Then, we employ a prompt to represent each of these clusters and apply prompt learning to mimic the description styles of different human annotators. Furthermore, we define a style feature space and perform uniform sampling in this space to obtain more diverse clustering prototypes, which further enriches the diversity of the MLLM-generated captions. Finally, we adopt HAM to automatically annotate a massive-scale database for text-to-image ReID. Extensive experiments on this database demonstrate that it significantly improves the generalization …
Poster
Huakai Lai · Guoxin Xiong · Huayu Mai · Xiang Liu · Tianzhu Zhang

[ ExHall D ]

Abstract
Video-Text Retrieval (VTR) is a core task in multi-modal understanding, drawing growing attention from both academia and industry in recent years. While numerous VTR methods have achieved success, most of them assume accurate visual-text correspondences during training, which is difficult to ensure in practice due to ubiquitous noise, known as noisy correspondences (NC). In this work, we rethink how to mitigate the NC from the perspective of representative reference features (termed agents), and propose a novel relation-aware purified consistency (RPC) network to amend direct pairwise correlation, including representative agents construction and relation-aware ranking distribution alignment. The proposed RPC enjoys several merits. First, to learn the agents well without any correspondence supervision, we customize the agents construction according to the three characteristics of reliability, representativeness, and resilience. Second, the ranking distribution-based alignment process leverages the structural information inherent in inter-pair relationships, making it more robust compared to individual comparisons. Extensive experiments on five datasets under different settings demonstrate the efficacy and robustness of our method.
Poster
Yiheng Li · Yang Yang · Zichang Tan · Huan Liu · Weihua Chen · Xu Zhou · Zhen Lei

[ ExHall D ]

Abstract
To tackle the threat of fake news, the task of detecting and grounding multi-modal media manipulation (DGM4) has received increasing attention. However, most state-of-the-art methods fail to explore the fine-grained consistency within local contents, usually resulting in an inadequate perception of detailed forgery and unreliable results. In this paper, we propose a novel approach named Contextual-Semantic Consistency Learning (CSCL) to enhance the fine-grained forgery perception ability for DGM4. Two branches for the image and text modalities are established, each of which contains two cascaded decoders, i.e., a Contextual Consistency Decoder (CCD) and a Semantic Consistency Decoder (SCD), to capture within-modality contextual consistency and across-modality semantic consistency, respectively. Both CCD and SCD adhere to the same criteria for capturing fine-grained forgery details. To be specific, each module first constructs consistency features by leveraging additional supervision from the heterogeneous information of each token pair. Then, forgery-aware reasoning or aggregating is adopted to deeply seek forgery cues based on the consistency features. Extensive experiments on DGM4 datasets prove that CSCL achieves new state-of-the-art performance, especially for the results of grounding manipulated content.
Poster
Davide Talon · Federico Girella · Ziyue Liu · Marco Cristani · Yiming Wang

[ ExHall D ]

Abstract
Natural language goes beyond dryly describing visual content. It contains rich abstract concepts to express feeling, creativity and properties that cannot be directly perceived. Yet, current research in Vision Language Models (VLMs) has not shed light on abstract-oriented language. Our research breaks new ground by uncovering its wide presence and under-estimated value, with extensive analysis. Particularly, we focus our investigation on the fashion domain, a highly representative field with abstract expressions. By analyzing recent large-scale multimodal fashion datasets, we find that abstract terms have a dominant presence, rivaling the concrete ones, providing novel information, and being useful in the retrieval task. However, a critical challenge emerges: current general-purpose or fashion-specific VLMs are pre-trained with databases that lack sufficient abstract words in their text corpora, thus hindering their ability to effectively represent abstract-oriented language. We propose a training-free and model-agnostic method, Abstract-to-Concrete Translator (ACT), to shift abstract representations towards well-represented concrete ones in the VLM latent space, using pre-trained models and existing multimodal databases. On the text-to-image retrieval task, despite being training-free, ACT outperforms the fine-tuned VLMs in both same- and cross-dataset settings, exhibiting its effectiveness with a strong generalization capability. Moreover, the improvement introduced by ACT is consistent with various VLMs, making it a …
Poster
Zengrong Lin · Zheng Wang · Tianwen Qian · Pan Mu · Sixian Chan · Cong Bai

[ ExHall D ]

Abstract
Cross-modal retrieval aims to bridge the semantic gap between different modalities, such as visual and textual data, enabling accurate retrieval across them. Despite significant advancements with models like CLIP that align cross-modal representations, a persistent challenge remains: the hubness problem, where a small subset of samples (hubs) dominate as nearest neighbors, leading to biased representations and degraded retrieval accuracy. Existing methods often mitigate hubness through post-hoc normalization techniques, relying on prior data distributions that may not be practical in real-world scenarios. In this paper, we directly mitigate hubness during training and introduce NeighborRetr, a novel method that effectively balances the learning of hubs and adaptively adjusts the relations of various kinds of neighbors. Our approach not only mitigates the hubness problem but also enhances retrieval performance, achieving state-of-the-art results on multiple cross-modal retrieval benchmarks. Furthermore, NeighborRetr demonstrates robust generalization to new domains with substantial distribution shifts, highlighting its effectiveness in real-world applications.
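The hubness problem itself is easy to measure: count how often each gallery item appears among the top-k neighbours of the queries, where a heavily skewed count distribution means a few hubs dominate. The diagnostic below is a generic illustration of that phenomenon, not NeighborRetr's training-time mechanism, and the embeddings are random stand-ins.

```python
import torch
import torch.nn.functional as F

def k_occurrence(query_embs: torch.Tensor, gallery_embs: torch.Tensor, k: int = 10):
    """How many times each gallery item appears in the queries' top-k lists."""
    sims = F.normalize(query_embs, dim=-1) @ F.normalize(gallery_embs, dim=-1).T
    topk = sims.topk(k, dim=-1).indices                      # (num_queries, k)
    return torch.bincount(topk.flatten(), minlength=gallery_embs.shape[0])

# Toy check: even random embeddings give uneven counts; real video-text
# embedding spaces are typically far more skewed towards a few hub items.
texts, videos = torch.randn(2000, 512), torch.randn(500, 512)
counts = k_occurrence(texts, videos, k=10)
print(counts.float().mean().item(), counts.max().item())
```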
Poster
Xin Zhang · Yanzhao Zhang · Wen Xie · Mingxin Li · Ziqi Dai · Dingkun Long · Pengjun Xie · Meishan Zhang · Wenjie Li · Min Zhang

[ ExHall D ]

Abstract
Universal Multimodal Retrieval (UMR) aims to enable search across various modalities using a unified model, where queries and candidates can consist of pure text or images, or a combination of both. Previous work has attempted to adopt multimodal large language models (MLLMs) to realize UMR using only text data. However, our preliminary experiments demonstrate that a more diverse multimodal training data can further unlock the potential of MLLMs. Despite its effectiveness, the existing multimodal training data is highly imbalanced in terms of modality, which motivates us to develop a training data synthesis pipeline and construct a large-scale, high-quality fused-modal training dataset. Based on the synthetic training data, we develop the General Multimodal Embedder (GME), an MLLM-based dense retriever designed for UMR. Furthermore, we construct a comprehensive UMR Benchmark (UMRB) to evaluate the effectiveness of our approach. Experimental results show that our method achieves state-of-the-art performance among existing UMR methods. Last, we provide in-depth analyses of model scaling, training strategies, and perform ablation studies on both the model and synthetic data.
Poster
Davide Caffagni · Sara Sarto · Marcella Cornia · Lorenzo Baraldi · Rita Cucchiara

[ ExHall D ]

Abstract
Cross-modal retrieval is gaining increasing efficacy and interest from the research community, thanks to large-scale training, novel architectural and learning designs, and its application in LLMs and multimodal LLMs. In this paper, we move a step forward and design an approach that allows for multimodal queries -- composed of both an image and a text -- and can search within collections of multimodal documents, where images and text are interleaved. Our model, ReT, employs multi-level representations extracted from different layers of both visual and textual backbones, at both the query and document side. To allow for multi-level and cross-modal understanding and feature extraction, ReT employs a novel Transformer-based recurrent cell that integrates both textual and visual features at different layers, and leverages sigmoidal gates inspired by the classical design of LSTMs. Extensive experiments on the M2KR and M-BEIR benchmarks show that ReT achieves state-of-the-art performance across diverse settings.
Poster
Qirui Jiao · Daoyuan Chen · Yilun Huang · Bolin Ding · Yaliang Li · Ying Shen

[ ExHall D ]

Abstract
High-performance Multimodal Large Language Models (MLLMs) rely heavily on data quality. This study introduces a novel data synthesis method, leveraging insights from contrastive learning and image difference captioning to enhance fine-grained image recognition in MLLMs. By analyzing object differences in detailed regions between similar images, we challenge the model to identify both matching and distinct components. Specifically, our method first creates pairs of similar images that highlight object variations. After that, we introduce a Difference Area Generator for identifying object differences, followed by a Difference Captions Generator for describing the differences. The outcome is a high-quality dataset of "object replacement" samples, named Img-Diff, which can be expanded as needed thanks to its automated construction. We use the generated dataset to finetune state-of-the-art (SOTA) MLLMs such as InternVL2, yielding comprehensive improvements across numerous image difference and Visual Question Answering tasks. For instance, the trained models notably surpass the SOTA models GPT-4V and Gemini on the MMVP benchmark. Additionally, we conduct thorough evaluations to confirm the dataset's diversity, quality, and robustness, presenting several insights on the synthesis of such a contrastive dataset. We release our codes and dataset to encourage further research on multimodal data synthesis and MLLMs' fundamental capabilities for image understanding.
Poster
Reza Abbasi · Ali Nazari · Aminreza Sefid · Mohammadali Banayeeanzade · Mohammad Rohban · Mahdieh Baghshah

[ ExHall D ]

Abstract
Contrastive Language-Image Pre-training (CLIP) models excel in zero-shot classification, yet face challenges in complex multi-object scenarios. This study offers a comprehensive analysis of CLIP's limitations in these contexts using a specialized dataset, ComCO, designed to evaluate CLIP's encoders in diverse multi-object scenarios. Our findings reveal significant biases: the text encoder prioritizes first-mentioned objects, and the image encoder favors larger objects. Through retrieval and classification tasks, we quantify these biases across multiple CLIP variants and trace their origins to CLIP's training process, supported by analyses of the LAION dataset and training progression. Our image-text matching experiments show substantial performance drops when object size or token order changes, underscoring CLIP's instability with rephrased but semantically similar captions. Extending this to longer captions and text-to-image models like Stable Diffusion, we demonstrate how prompt order influences object prominence in generated images.
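A quick text-side probe of the first-mention bias can be run with an off-the-shelf CLIP checkpoint, as sketched below; the captions and probe design are illustrative and do not reproduce the ComCO protocol.

```python
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPTokenizer

name = "openai/clip-vit-base-patch32"     # any public CLIP checkpoint should work
model = CLIPModel.from_pretrained(name)
tok = CLIPTokenizer.from_pretrained(name)

def embed(texts):
    with torch.no_grad():
        feats = model.get_text_features(**tok(texts, padding=True, return_tensors="pt"))
    return F.normalize(feats, dim=-1)

multi = embed(["a photo of a dog and a cat", "a photo of a cat and a dog"])
single = embed(["a photo of a dog", "a photo of a cat"])
print(multi @ single.T)   # rows: the two orderings; columns: dog, cat
# If the first-mention bias holds, each ordering sits closer to its first object.
```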
Poster
Yifei Zhang · Chang Liu · Jin Wei · Xiaomeng Yang · Yu ZHOU · Can Ma · Xiangyang Ji

[ ExHall D ]

Abstract
Text images are unique in their dual nature, encompassing both visual and linguistic information. The visual component encompasses structural and appearance-based features, while the linguistic dimension incorporates contextual and semantic elements. In scenarios with degraded visual quality, linguistic patterns serve as a crucial supplement for comprehension, highlighting the necessity of integrating both aspects for robust scene text recognition (STR). Contemporary STR approaches often use language models or semantic reasoning modules to capture linguistic features, typically requiring large-scale annotated datasets. Self-supervised learning, which lacks annotations, presents challenges in disentangling linguistic features related to the global context. Typically, sequence contrastive learning emphasizes the alignment of local features, while masked image modeling (MIM) tends to exploit local structures to reconstruct visual patterns, resulting in limited linguistic knowledge. In this paper, we propose a Linguistics-aware Masked Image Modeling (LMIM) approach, which channels linguistic information into the decoding process of MIM through a separate branch. Specifically, we design a linguistics alignment module to extract vision-independent features as linguistic guidance, using inputs with different visual appearances. As features extend beyond mere visual structures, LMIM must consider the global context to achieve reconstruction. Extensive experiments on various benchmarks quantitatively demonstrate our state-of-the-art performance, and attention visualizations qualitatively …
Poster
Dongliang Luo · Hanshen Zhu · Ziyang Zhang · Dingkang Liang · Xudong Xie · Yuliang Liu · Xiang Bai

[ ExHall D ]

Abstract
Most previous scene text spotting methods rely on high-quality manual annotations to achieve promising performance. To reduce their expensive costs, we study semi-supervised text spotting (SSTS) to exploit useful information from unlabeled images. However, directly applying existing semi-supervised methods of general scenes to SSTS will face new challenges: 1) inconsistent pseudo labels between detection and recognition tasks, and 2) sub-optimal supervisions caused by inconsistency between teacher/student. Thus, we propose a new Semi-supervised framework for End-to-end Text Spotting, namely SemiETS that leverages the complementarity of text detection and recognition. Specifically, it gradually generates reliable hierarchical pseudo labels for each task, thereby reducing noisy labels. Meanwhile, it extracts important information in locations and transcriptions from bidirectional flows to improve consistency. Extensive experiments on three datasets under various settings demonstrate the effectiveness of SemiETS on arbitrary-shaped text. For example, it outperforms previous state-of-the-art SSL methods by a large margin on end-to-end spotting (+8.7\%, +5.6\%, and +2.6\% H-mean under 0.5\%, 1\%, and 2\% labeled data settings on Total-Text, respectively). More importantly, it still improves upon a strongly supervised text spotter trained with plenty of labeled data by 2.0\%. Compelling domain adaptation ability shows practical potential. Moreover, our method demonstrates consistent improvement on different text …
Poster
Seil Kang · Jinyeong Kim · Junhyeok Kim · Seong Jae Hwang

[ ExHall D ]

Abstract
Visual grounding seeks to localize the image region corresponding to a free-form text description. Recently, the strong multimodal capabilities of Large Vision-Language Models (LVLMs) have driven substantial improvements in visual grounding, though they inevitably require fine-tuning and additional model components to explicitly generate bounding boxes or segmentation masks. However, we discover that a few attention heads in frozen LVLMs demonstrate strong visual grounding capabilities. We refer to these heads, which consistently capture object locations related to text semantics, as localization heads. Using localization heads, we introduce a straightforward and effective training-free visual grounding framework that utilizes text-to-image attention maps from localization heads to identify the target objects. Surprisingly, only three out of thousands of attention heads are sufficient to achieve competitive localization performance compared to existing LVLM-based visual grounding methods that require fine-tuning. Our findings suggest that LVLMs can innately ground objects based on a deep comprehension of the text-image relationship, as they implicitly focus on relevant image regions to generate informative text outputs. All the source codes will be made available to the public.
Poster
Teng Hu · Jiangning Zhang · Ran Yi · Jieyu Weng · Yabiao Wang · Xianfang Zeng · Xuezhucun Xue · Lizhuang Ma

[ ExHall D ]

Abstract
Employing LLMs for visual generation has recently become a research focus. However, existing methods primarily transfer the LLM architecture to visual generation but rarely investigate the fundamental differences between language and vision. This oversight may lead to suboptimal utilization of visual generation capabilities within the LLM framework. In this paper, we explore the characteristics of the visual embedding space under the LLM framework and discover that the correlation between visual embeddings can help achieve more stable and robust generation results. We present **IAR**, an **I**mproved **A**uto**R**egressive Visual Generation Method that enhances the training efficiency and generation quality of LLM-based visual generation models. Firstly, we propose a Codebook Rearrangement strategy that uses a balanced k-means clustering algorithm to rearrange the visual codebook into clusters, ensuring high similarity among visual features within each cluster. Leveraging the rearranged codebook, we propose a Cluster-oriented Cross-entropy Loss that guides the model to correctly predict the cluster where the token is located. This approach ensures that even if the model predicts the wrong token index, there is a high probability that the predicted token is located in the correct cluster, which significantly enhances the generation quality and robustness. Extensive experiments demonstrate that our method consistently enhances the model …
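A minimal sketch of a cluster-oriented cross-entropy on top of ordinary token logits is given below. The balanced k-means codebook rearrangement and the way this term is combined with the standard token loss are assumptions made for illustration, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def cluster_cross_entropy(logits, target_tokens, token_to_cluster, num_clusters):
    """Penalise probability mass that falls outside the target token's cluster.

    logits: (B, V) next-token logits; token_to_cluster: (V,) cluster id per
    codebook entry (e.g. from balanced k-means on the codebook embeddings).
    """
    probs = logits.softmax(dim=-1)
    cluster_probs = torch.zeros(logits.shape[0], num_clusters, device=logits.device)
    cluster_probs.index_add_(1, token_to_cluster, probs)     # sum token probs per cluster
    target_clusters = token_to_cluster[target_tokens]
    return F.nll_loss(torch.log(cluster_probs + 1e-9), target_clusters)

V, C, B = 1024, 32, 8                                        # vocab, clusters, batch
loss = cluster_cross_entropy(torch.randn(B, V), torch.randint(0, V, (B,)),
                             torch.randint(0, C, (V,)), C)
print(loss)
```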
Poster
Hongrui Jia · Chaoya Jiang · Haiyang Xu · Wei Ye · Mengfan Dong · Ming Yan · Ji Zhang · Fei Huang · Shikun Zhang

[ ExHall D ]

Abstract
As language models continue to scale, Large Language Models (LLMs) have exhibited emerging capabilities in In-Context Learning (ICL), enabling them to solve language tasks by prefixing a few in-context demonstrations (ICDs) as context. Inspired by these advancements, researchers have extended these techniques to develop Large Multimodal Models (LMMs) with ICL capabilities. However, existing LMMs face a critical issue: they often fail to effectively leverage the visual context in multimodal demonstrations and instead simply follow textual patterns. This indicates that LMMs do not achieve effective alignment between multimodal demonstrations and model outputs. To address this problem, we propose Symbol Demonstration Direct Preference Optimization (SymDPO). Specifically, SymDPO aims to break the traditional paradigm of constructing multimodal demonstrations by using random symbols to replace text answers within instances. This forces the model to carefully understand the demonstration images and establish a relationship between the images and the symbols to answer questions correctly. We validate the effectiveness of this method on multiple benchmarks, demonstrating that with SymDPO, LMMs can more effectively understand the multimodal context within examples and utilize this knowledge to answer questions better.
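The symbol-substitution step that SymDPO builds its preference data on can be pictured with a small helper like the one below; the demonstration format and symbol vocabulary are illustrative choices, not the paper's exact recipe.

```python
import random
import string

def symbolize_demonstrations(icds, seed=0):
    """Replace each in-context demonstration's text answer with a random symbol.

    icds: list of dicts with keys "image", "question", "answer". After the swap,
    the model must link the image content to the symbol to answer correctly.
    """
    rng = random.Random(seed)
    answer_to_symbol, symbolized = {}, []
    for demo in icds:
        sym = answer_to_symbol.setdefault(
            demo["answer"], "".join(rng.choices(string.ascii_uppercase, k=3)))
        symbolized.append({**demo, "answer": sym})
    return symbolized, answer_to_symbol

demos = [{"image": "img_001.jpg", "question": "What animal is this?", "answer": "dog"},
         {"image": "img_002.jpg", "question": "What animal is this?", "answer": "cat"}]
print(symbolize_demonstrations(demos))
```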
Poster
Mohsen Gholami · Mohammad Akbari · Kevin Cannons · Yong Zhang

[ ExHall D ]

Abstract
In this work, we propose an extreme compression technique for Large Multimodal Models (LMMs). While previous studies have explored quantization as an efficient post-training compression method for Large Language Models (LLMs), low-bit compression for multimodal models remains under-explored. The redundant nature of inputs in multimodal models results in a highly sparse attention matrix. We theoretically and experimentally demonstrate that the attention matrix's sparsity bounds the compression error of the Query and Key weight matrices. Based on this, we introduce CASP, a model compression technique for LMMs. Our approach performs a data-aware low-rank decomposition on the Query and Key weight matrices, followed by quantization across all layers based on an optimal bit allocation process. CASP is compatible with any quantization technique and enhances state-of-the-art 2-bit quantization methods (AQLM and QuIP#) by an average of 21% on image- and video-language benchmarks. The code is provided in the supplementary materials.
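The low-rank step on the Query/Key weights can be pictured with a plain truncated SVD, as in the sketch below; CASP's data-aware decomposition and its optimal bit allocation for the subsequent quantization are omitted, so this only illustrates the factorization itself.

```python
import torch

def low_rank_factorize(weight: torch.Tensor, rank: int):
    """weight (out, in) ~= A (out, rank) @ B (rank, in) via truncated SVD."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    A = U[:, :rank] * S[:rank]
    B = Vh[:rank, :]
    return A, B

W_q = torch.randn(1024, 1024)              # stand-in for an attention Q/K matrix
A, B = low_rank_factorize(W_q, rank=128)
rel_err = torch.linalg.norm(W_q - A @ B) / torch.linalg.norm(W_q)
print(f"kept rank 128/1024, relative reconstruction error {rel_err:.3f}")
```

With random weights the truncation error is large, since random matrices are close to full rank; the paper's argument is that the sparsity of the attention matrix bounds the error that actually matters for the Query and Key projections of real LMMs.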
Poster
Hao Yin · Guangzong Si · Zilei Wang

[ ExHall D ]

Abstract
Multimodal large language models (MLLMs) improve performance on vision-language tasks by integrating visual features from pre-trained vision encoders into large language models (LLMs). However, how MLLMs process and utilize visual information remains unclear. In this paper, a shift in the dominant flow of visual information is uncovered: (1) in shallow layers, strong interactions are observed between image tokens and instruction tokens, where most visual information is injected into instruction tokens to form cross-modal semantic representations; (2) in deeper layers, image tokens primarily interact with each other, aggregating the remaining visual information to optimize semantic representations within the visual modality. Based on these insights, we propose Hierarchical Modality-Aware Pruning (HiMAP), a plug-and-play inference acceleration method that dynamically prunes image tokens at specific layers, reducing computational costs by approximately 65% without sacrificing performance. Our findings offer a new understanding of visual information processing in MLLMs and provide a state-of-the-art solution for efficient inference.
Poster
Saeed Ranjbar Alvar · Gursimran Singh · Mohammad Akbari · Yong Zhang

[ ExHall D ]

Abstract
Large Multimodal Models (LMMs) have emerged as powerful models capable of understanding various data modalities, including text, images, and videos. LMMs encode both text and visual data into tokens that are then combined and processed by an integrated Large Language Model (LLM). Including visual tokens substantially increases the total token count, often by thousands. The increased input length for the LLM significantly raises the complexity of inference, resulting in high latency in LMMs. To address this issue, token pruning methods, which remove part of the visual tokens, have been proposed. Existing token pruning methods either require extensive calibration and fine-tuning or rely on suboptimal importance metrics, which results in increased redundancy among the retained tokens. In this paper, we first formulate token pruning as a Max-Min Diversity Problem (MMDP), where the goal is to select a subset such that the diversity among the selected tokens is maximized. We then solve the MMDP to obtain the selected subset and prune the rest. The proposed method, DivPrune, reduces redundancy and achieves the highest diversity of the selected tokens. By ensuring high diversity, the selected tokens better represent the original tokens, enabling effective performance even at high pruning ratios without requiring fine-tuning. Extensive experiments with …
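Max-min diversity selection can be approximated greedily with farthest-point sampling. The sketch below illustrates that idea on dummy visual tokens; it is not the authors' DivPrune code, and the cosine-distance metric, seed token, and budget are assumptions.

```python
import torch

def maxmin_diverse_select(tokens: torch.Tensor, k: int) -> torch.Tensor:
    """tokens: (N, d) visual token features; returns indices of k mutually diverse tokens."""
    feats = torch.nn.functional.normalize(tokens, dim=-1)
    dist = 1.0 - feats @ feats.T          # pairwise cosine distances
    selected = [0]                        # seed with an arbitrary token (assumption)
    min_dist = dist[0].clone()            # distance of every token to the selected set
    for _ in range(k - 1):
        nxt = int(torch.argmax(min_dist)) # farthest token from the current selection
        selected.append(nxt)
        min_dist = torch.minimum(min_dist, dist[nxt])
    return torch.tensor(selected)

visual_tokens = torch.randn(576, 1024)    # e.g., a 24x24 grid of vision tokens
keep = maxmin_diverse_select(visual_tokens, k=64)
print(keep.shape)                         # 64 retained token indices; the rest are pruned
```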
Poster
Longrong Yang · Dong Shen · Chaoxiang Cai · Kaibing Chen · Fan Yang · Tingting Gao · Di ZHANG · Xi Li

[ ExHall D ]

Abstract
Large Vision-Language Models (LVLMs) have achieved significant progress in recent years. However, the expensive inference cost limits the realistic deployment of LVLMs. Some works find that visual tokens are redundant and compress tokens to reduce the inference cost. These works identify important non-redundant tokens as target tokens, then prune the remaining tokens (non-target tokens) or merge them into target tokens. However, target token identification faces the token importance-redundancy dilemma. Besides, token merging and pruning face a dilemma between disrupting target token information and losing non-target token information. To solve these problems, we propose a novel visual token compression scheme, named Libra-Merging. In target token identification, Libra-Merging selects the most important tokens from spatially discrete intervals, achieving a more robust token importance-redundancy trade-off than relying on a hyper-parameter. In token compression, when non-target tokens are dissimilar to target tokens, Libra-Merging does not merge them into the target tokens, thus avoiding disrupting target token information. Meanwhile, Libra-Merging condenses these non-target tokens into an information compensation token to prevent losing important non-target token information. Our method can serve as a plug-in for diverse LVLMs, and extensive experimental results demonstrate its effectiveness. The code will be publicly available.
Poster
Yiyang Du · Xiaochen Wang · Chi Chen · Jiabo Ye · Yiru Wang · Peng Li · Ming Yan · Ji Zhang · Fei Huang · Zhifang Sui · Maosong Sun · Yang Liu

[ ExHall D ]

Abstract
Recently, model merging methods have demonstrated powerful strengths in combining abilities from multiple Large Language Models (LLMs) across various tasks. While previous model merging methods mainly focus on merging homogeneous models with identical architectures, they meet challenges when dealing with Multimodal Large Language Models (MLLMs), which are inherently heterogeneous, with differences in model architecture and asymmetry in the parameter space. In this work, we propose AdaMMS, a novel model merging method tailored for heterogeneous MLLMs. Our method tackles the challenges in three steps: mapping, merging, and searching. Specifically, we first design a mapping function between models so that model merging can be applied to MLLMs with different architectures. Then we apply linear interpolation on model weights to actively adapt to the asymmetry in the heterogeneous MLLMs. Finally, in the hyper-parameter searching step, we propose an unsupervised hyper-parameter selection method for model merging. As the first model merging method capable of merging heterogeneous MLLMs without labeled data, AdaMMS is shown by extensive experiments on various model combinations to outperform previous model merging methods on various vision-language benchmarks.
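A minimal sketch of the merging step described above: linear interpolation of parameters that share names (after an architecture mapping) between two state dicts. The mapping itself and the unsupervised search for the interpolation coefficient are omitted; the names and shapes below are illustrative.

```python
import torch

def interpolate_weights(sd_a: dict, sd_b: dict, alpha: float) -> dict:
    """Return alpha * sd_a + (1 - alpha) * sd_b for parameters shared after mapping."""
    merged = {}
    for name, wa in sd_a.items():
        wb = sd_b.get(name)
        if wb is not None and wb.shape == wa.shape:
            merged[name] = alpha * wa + (1.0 - alpha) * wb
        else:
            merged[name] = wa.clone()     # keep model A's weights where B has no counterpart
    return merged

sd_a = {"proj.weight": torch.randn(4, 4), "head.weight": torch.randn(2, 4)}
sd_b = {"proj.weight": torch.randn(4, 4)}
merged = interpolate_weights(sd_a, sd_b, alpha=0.6)
print({k: v.shape for k, v in merged.items()})
```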
Poster
zefeng zhang · Hengzhu Tang · Jiawei Sheng · Zhenyu Zhang · YiMing Ren · Zhenyang Li · Dawei Yin · Duohe Ma · Tingwen Liu

[ ExHall D ]

Abstract
Multimodal Large Language Models (MLLMs) excel in various tasks, yet often struggle with modality bias, tending to rely heavily on a single modality or prior knowledge when generating responses. In this paper, we propose a debiased preference optimization dataset, RLAIF-V-Bias, and introduce a Noise-Aware Preference Optimization (NAPO) algorithm. Specifically, we first construct the dataset by introducing perturbations to reduce the informational content of certain modalities, prompting the model to overly rely on a specific modality when generating responses. To address the inevitable noise in automatically constructed data, we combine the noise-robust Mean Absolute Error (MAE) with the Binary Cross-Entropy (BCE) in Direct Preference Optimization (DPO) using a negative Box-Cox transformation and dynamically adjust the algorithm’s noise robustness based on the evaluated noise levels in the data. Extensive experiments validate our approach, demonstrating not only its effectiveness in mitigating modality bias but also its significant role in minimizing hallucinations.
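For intuition, the negative Box-Cox (generalized cross-entropy) transform interpolates between the BCE-style log-sigmoid loss used in DPO (q approaching 0) and a bounded, MAE-like loss (q = 1). The sketch below applies it to a DPO-style implicit-reward margin; the exact NAPO weighting and its dynamic adjustment of q from estimated noise levels are assumptions.

```python
import torch

def box_cox_dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected,
                     beta: float = 0.1, q: float = 0.5):
    """DPO implicit-reward margin passed through a negative Box-Cox (L_q) loss."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    p = torch.sigmoid(margin)              # probability that the preference is satisfied
    if q == 0.0:
        return -torch.log(p).mean()        # recovers the standard DPO / BCE objective
    return ((1.0 - p.pow(q)) / q).mean()   # q = 1 recovers the bounded, MAE-like robust loss

# Toy per-sample log-likelihoods from the policy and the frozen reference model.
lp_c, lp_r = torch.tensor(-12.0), torch.tensor(-15.0)
ref_c, ref_r = torch.tensor(-13.0), torch.tensor(-14.0)
print(box_cox_dpo_loss(lp_c, lp_r, ref_c, ref_r, q=0.7))
```

In the abstract's terms, q would be raised when the estimated noise level in the constructed preference data is high and lowered when the data is clean.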
Poster
Mingyang Song · Xiaoye Qu · Jiawei Zhou · Yu Cheng

[ ExHall D ]

Abstract
Large Vision-Language Models (LVLMs) have achieved significant progress in combining visual comprehension with language generation. Despite this success, the training data of LVLMs still suffers from Long-Tail (LT) problems, where the data distribution is highly imbalanced. Previous works have mainly focused on traditional VLM architectures, i.e., CLIP or ViT, and specific tasks such as recognition and classification. Nevertheless, LVLMs (e.g., LLaVA) and more general tasks (e.g., Visual Question Answering and Visual Reasoning) remain under-explored. In this paper, we first conduct an in-depth analysis of the LT issues in LVLMs and identify two core causes: the overrepresentation of head concepts and the underrepresentation of tail concepts. Based on these observations, we propose an Adaptive Data Refinement Framework (ADR), which consists of two stages: Data Rebalancing (DR) and Data Synthesis (DS). In the DR stage, we adaptively rebalance the redundant data based on entity distributions, while in the DS stage, we leverage Denoising Diffusion Probabilistic Models (DDPMs) and scarce images to supplement underrepresented portions. Through comprehensive evaluations across eleven benchmarks, our proposed ADR effectively mitigates the long-tail problem in the training data, improving the average performance of LLaVA 1.5 by a relative 4.36%, without increasing the training data volume. Our code and data will be …
Poster
Yinan Liang · Ziwei Wang · Xiuwei Xu · Jie Zhou · Jiwen Lu

[ ExHall D ]

Abstract
While multimodal large language models demonstrate strong performance in complex reasoning tasks, they pose significant challenges related to model complexity during deployment, especially for resource-limited devices. In this paper, we propose an automatic pruning method for large vision-language models to enhance the efficiency of multimodal reasoning. Conventional methods rely on the training data of the original model to select the proper pruning ratio for different network components. However, these methods are impractical for large vision-language models due to the unaffordable search costs caused by web-scale training corpora. In contrast, our approach leverages only a small number of samples to search for the desired pruning policy by maximizing its generalization ability on unknown training data while maintaining the model accuracy, enabling an optimal trade-off between accuracy and efficiency for large vision-language models. Specifically, we formulate the generalization gap of the pruning strategy using the structural risk minimization principle. Based on both task performance and generalization capability, we iteratively search for the optimal pruning policy within a given search space and optimize the vision projector to evolve the search space toward a higher performance upper bound. We conduct extensive experiments on the ScienceQA, Vizwiz, MM-vet, and LLaVA-Bench …
Poster
Qizhou Chen · Chengyu Wang · Dakan Wang · Taolin Zhang · Wangyue Li · Xiaofeng He

[ ExHall D ]

Abstract
Model editing aims to correct inaccurate knowledge, update outdated information, and incorporate new data into Large Language Models (LLMs) without the need for retraining. This task poses challenges in lifelong scenarios where edits must be continuously applied for real-world applications. While some editors demonstrate strong robustness for lifelong editing in pure LLMs, Vision LLMs (VLLMs), which incorporate an additional vision modality, are not directly adaptable to existing LLM editors. In this paper, we propose LiveEdit, a lifelong vision language model editing framework that bridges the gap between lifelong LLM editing and VLLMs. We begin by training an editing expert generator to independently produce low-rank experts for each editing instance, with the goal of correcting the relevant responses of the VLLM. A hard filtering mechanism is developed to utilize visual semantic knowledge, thereby coarsely eliminating visually irrelevant experts for input queries during the inference stage of the post-edited model. Finally, to integrate visually relevant experts, we introduce a soft routing mechanism based on textual semantic relevance to achieve multi-expert fusion. For evaluation, we establish a benchmark for lifelong VLLM editing. Extensive experiments demonstrate that LiveEdit offers significant advantages in lifelong VLLM editing scenarios. Further experiments validate the rationality and effectiveness of each …
Poster
Zuopeng Yang · Jiluan Fan · Anli Yan · Erdun Gao · Xin Lin · Tao Li · Kanghua Mo · Changyu Dong

[ ExHall D ]

Abstract
Multimodal Large Language Models (MLLMs) bridge the gap between visual and textual data, enabling a range of advanced applications. However, complex internal interactions among visual elements and their alignment with text can introduce vulnerabilities, which may be exploited to bypass safety mechanisms. To address this, we analyze the relationship between image content and task and find that the complexity of subimages, rather than their content, is key. Building on this insight, we propose the Distraction Hypothesis, followed by a novel framework called Contrasting Subimage Distraction Jailbreaking (CS-DJ), to achieve jailbreaking by disrupting the alignment of MLLMs through multi-level distraction strategies. CS-DJ consists of two components: structured distraction, achieved through query decomposition that induces a distributional shift by fragmenting harmful prompts into sub-queries, and visual-enhanced distraction, realized by constructing contrasting subimages to disrupt the interactions among visual elements within the model. This dual strategy disperses the model’s attention, reducing its ability to detect and mitigate harmful content. Extensive experiments across five representative scenarios and four popular closed-source MLLMs, including GPT-4o-mini, GPT-4o, GPT-4V, and Gemini-1.5-Flash, demonstrate that CS-DJ achieves an average attack success rate of 52.40% and an average ensemble attack success rate of 74.10%. These results reveal the potential of distraction-based …
Poster
Siyuan Liang · Jiawei Liang · Tianyu Pang · Chao Du · Aishan Liu · Mingli Zhu · Xiaochun Cao · Dacheng Tao

[ ExHall D ]

Abstract
Instruction tuning enhances large vision-language models (LVLMs) but increases their vulnerability to backdoor attacks due to their open design. Unlike prior studies in static settings, this paper explores backdoor attacks in LVLM instruction tuning across mismatched training and testing domains. We introduce a new evaluation dimension, backdoor domain generalization, to assess attack robustness under visual and text domain shifts. Our findings reveal two insights: (1) backdoor generalizability improves when distinctive trigger patterns are independent of specific data domains or model architectures, and (2) triggers placed preferentially over clean semantic regions significantly enhance attack generalization. Based on these insights, we propose a multimodal attribution backdoor attack (MABA) that injects domain-agnostic triggers into critical areas using attributional interpretation. Experiments with OpenFlamingo, Blip-2, and Otter show that MABA significantly boosts the generalization attack success rate by 36.4%, achieving a 97% success rate at a 0.2% poisoning rate. This study reveals limitations in current evaluations and highlights how enhanced backdoor generalizability poses a security threat to LVLMs, even without test data access.
Poster
Xin Luo · Fu Xueming · Zihang Jiang · S Kevin Zhou

[ ExHall D ]

Abstract
The increasing adoption of large-scale models under 7 billion parameters in both language and vision domains enables inference tasks on a single consumer-grade GPU but makes fine-tuning models of this scale, especially 7B models, challenging. This limits the applicability of pruning methods that require full fine-tuning. Meanwhile, pruning methods that do not require fine-tuning perform well at low sparsity levels (10%-50%) but struggle at mid-to-high sparsity levels (50%-70%), where the error behaves equivalently to that of semi-structured pruning. To address these issues, this paper introduces ICP, which finds a balance between full fine-tuning and zero fine-tuning. First, Sparsity Rearrange is used to reorganize the predefined sparsity levels, followed by Block-wise Compensate Pruning, which alternates pruning and compensation on the model’s backbone, fully utilizing inference results while avoiding full model fine-tuning. Experiments show that ICP improves performance at mid-to-high sparsity levels compared to baselines, with only a slight increase in pruning time and no additional peak memory overhead.
Poster
Lianyu Wang · Meng Wang · Huazhu Fu · Daoqiang Zhang

[ ExHall D ]

Abstract
Vision-language models (VLMs) like CLIP (Contrastive Language-Image Pre-Training) have seen remarkable success in visual recognition, highlighting the increasing need to safeguard the intellectual property (IP) of well-trained models. Effective IP protection extends beyond ensuring authorized usage; it also necessitates restricting model deployment to authorized data domains, particularly when the model is fine-tuned for specific target domains. However, current IP protection methods often rely solely on the visual backbone, which may lack sufficient semantic richness. To bridge this gap, we introduce IP-CLIP, a lightweight IP protection strategy tailored to CLIP, employing a prompt-based learning approach. By leveraging the frozen visual backbone of CLIP, we extract both image style and content information, incorporating them into the learning of the IP prompt. This strategy acts as a robust barrier, effectively preventing the unauthorized transfer of features from authorized domains to unauthorized ones. Additionally, we propose a style-enhancement branch that constructs feature banks for both authorized and unauthorized domains. This branch integrates self-enhanced and cross-domain features, further strengthening IP-CLIP’s capability to block features from unauthorized domains. Finally, we present three new metrics designed to better balance the performance degradation of authorized and unauthorized domains. Comprehensive experiments in various scenarios demonstrate its promising potential for application …
Poster
Wang Xuan · Xitong Gao · Dongping Liao · Tianrui Qin · Yu-liang Lu · Cheng-Zhong Xu

[ ExHall D ]

Abstract
In the age of pervasive machine learning applications, protecting digital content from unauthorized use has become a pressing concern. Unlearnable examples (UEs), i.e., data modified with imperceptible perturbations to inhibit model training while preserving human usability, have emerged as a promising approach. However, existing UE methods assume unauthorized trainers have extensive exposure to UEs or that models are trained from scratch, which may not hold in practical scenarios. This paper investigates the effectiveness of UEs under the few-shot learning paradigm, pitting them against visual prompt learning (VPL) models that leverage pretrained vision-language models (VLMs), like CLIP, capable of generalizing to new classes with minimal data. To address this, we introduce an adaptive UE framework to generate unlearnable examples that specifically target the VPL process. In addition, we propose a novel UE countermeasure, A3, with cross-modal adversarial feature alignment, specifically designed to circumvent UEs under few-shot VPL. Experimental evaluations on 7 datasets show that A3 outperforms existing VPL methods, achieving up to 33% higher performance in learning from UEs. For example, in the scenario involving -bounded EM perturbations, A3 has an average harmonic mean accuracy across 7 datasets of 82.43%, compared to CoCoOp's baseline of 65.47%. Our findings highlight the limitations …
Poster
Zequn Zeng · Yudi Su · Jianqiao Sun · Tiansheng Wen · Hao Zhang · Zhengjue Wang · Bo Chen · Hongwei Liu · Jiawei Ma

[ ExHall D ]

Abstract
Concept-based models are inherently interpretable, as they map black-box representations to human-understandable concepts, making the decision-making process more transparent. These models allow users to understand the reasoning behind predictions, which is crucial for high-stakes applications. However, they often introduce domain-specific concepts that contribute to the final predictions, which can undermine their generalization capabilities. In this paper, we propose a novel Language-guided Concept-Erasing (lanCE) framework. Specifically, we empirically demonstrate that pre-trained vision-language models (VLMs) can approximate distinct visual domain shifts via a large domain descriptor set. Based on these findings, we introduce a novel plug-in domain descriptor orthogonality (DDO) regularizer to mitigate the impact of these domain-specific concepts on the final predictions. To simulate a wide range of unseen visual domains, we generate a set of domain descriptors by prompting large language models (LLMs). Notably, our proposed DDO regularizer is agnostic to the design of concept-based models and thus can be widely integrated into various such models. By integrating the proposed DDO regularizer into several prevailing models and evaluating them on two standard domain generalization benchmarks and three new benchmarks introduced in this paper, we demonstrate that DDO loss can significantly improve the out-of-distribution (OOD) generalization capabilities over the previous state-of-the-art …
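As a rough illustration of an orthogonality-style regularizer in the spirit of DDO, the sketch below penalizes alignment between concept weight directions and a bank of LLM-generated domain-descriptor embeddings, assumed to live in the same VLM text space. The actual DDO formulation may differ; all names here are hypothetical.

```python
import torch

def ddo_style_penalty(concept_weights: torch.Tensor, domain_desc_emb: torch.Tensor) -> torch.Tensor:
    """concept_weights: (num_concepts, d); domain_desc_emb: (num_domains, d)."""
    cw = torch.nn.functional.normalize(concept_weights, dim=-1)
    dd = torch.nn.functional.normalize(domain_desc_emb, dim=-1)
    cos = cw @ dd.T                        # cosine alignment between concepts and domains
    return (cos ** 2).mean()               # push concept directions orthogonal to domain directions

W_concept = torch.randn(32, 512)           # hypothetical concept projection weights
D_desc = torch.randn(100, 512)             # embeddings of LLM-generated domain descriptors
penalty = ddo_style_penalty(W_concept, D_desc)
print(penalty)                             # added to the task loss with a weighting coefficient
```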
Poster
Xueqing Wu · Yuheng Ding · Bingxuan Li · Pan Lu · Da Yin · Kai-Wei Chang · Nanyun Peng

[ ExHall D ]

Abstract
The ability of large vision-language models (LVLMs) to critique and correct their reasoning is an essential building block towards their self-improvement. However, a systematic analysis of such capabilities in LVLMs is still lacking. We propose VISCO, the first benchmark to extensively analyze the fine-grained critique and correction capabilities of LVLMs. Compared to existing work that uses a single scalar value to critique the entire reasoning [4], VISCO features **dense** and **fine-grained** critique, requiring LVLMs to evaluate the correctness of each step in the chain-of-thought and provide natural language explanations to support their judgments. Extensive evaluation of 24 LVLMs demonstrates that human-written critiques significantly enhance the performance after correction, showcasing the potential of the self-improvement strategy. However, the model-generated critiques are less helpful and sometimes detrimental to the performance, suggesting that critique is the crucial bottleneck. We identify three common patterns in critique failures: failure to critique visual perception, reluctance to "say no", and exaggerated assumption of error propagation. To address these issues, we propose an effective LookBack strategy that revisits the image to verify each piece of information in the initial reasoning. LookBack significantly improves critique and correction performance by up to 13.5%.
Poster
Qiyuan Dai · Sibei Yang

[ ExHall D ]

Abstract
Vision-Language Models (VLMs) have become prominent in open-world image recognition for their strong generalization abilities. Yet, their effectiveness in practical applications is compromised by domain shifts and distributional changes, especially when test data distributions diverge from training data. Therefore, the paradigm of test-time adaptation (TTA) has emerged, enabling the use of online off-the-shelf data at test time, supporting independent sample predictions, and eliminating reliance on test annotations. Traditional TTA methods, however, often rely on costly training or optimization processes, or make unrealistic assumptions about accessing or storing historical training and test data. Instead, this study proposes FreeTTA, a training-free and universally available method that makes no such assumptions, to enhance the flexibility of TTA. More importantly, FreeTTA is the first to explicitly model the test data distribution, enabling the use of intrinsic relationships among test samples to enhance predictions of individual samples without requiring simultaneous access to them, a direction not previously explored. FreeTTA achieves these advantages by introducing an online EM algorithm that utilizes zero-shot predictions from VLMs as priors to iteratively compute the posterior probabilities of each online test sample and update parameters. Experiments demonstrate that FreeTTA achieves stable and significant improvements compared to state-of-the-art methods across 15 datasets in both cross-domain and out-of-distribution …
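A minimal sketch of an online EM update that uses zero-shot VLM predictions as priors, in the spirit of the description above (not the authors' FreeTTA code). The class means initialized from text embeddings, the momentum update, and the temperature are assumptions.

```python
import torch
import torch.nn.functional as F

class OnlineEM:
    """Online EM over class means, with zero-shot VLM predictions as the E-step prior."""

    def __init__(self, text_embeds: torch.Tensor, momentum: float = 0.99, tau: float = 0.01):
        self.mu = F.normalize(text_embeds, dim=-1)   # class means start at the text embeddings
        self.m = momentum
        self.tau = tau

    def step(self, feat: torch.Tensor, zero_shot_logits: torch.Tensor) -> torch.Tensor:
        feat = F.normalize(feat, dim=-1)
        prior = zero_shot_logits.softmax(-1)                  # zero-shot prior from the VLM
        lik = (feat @ self.mu.T / self.tau).softmax(-1)       # likelihood under current means
        post = prior * lik
        post = post / post.sum()                              # E-step: per-sample posterior
        upd = post.unsqueeze(-1) * feat.unsqueeze(0)          # M-step: online mean update
        self.mu = F.normalize(self.m * self.mu + (1 - self.m) * upd, dim=-1)
        return post                                           # refined prediction for this sample

em = OnlineEM(torch.randn(10, 512))        # 10 classes, 512-d CLIP-like embeddings
p = em.step(torch.randn(512), torch.randn(10))
print(p.argmax())
```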
Poster
Mingjin Zhang · Xiaolong Li · Fei Gao · Jie Guo · Xinbo Gao · Jing Zhang

[ ExHall D ]

Abstract
Infrared Small Target Detection (IRSTD) aims to identify low signal-to-noise ratio small targets in infrared images with complex backgrounds, which is crucial for various applications. However, existing IRSTD methods typically rely solely on image modalities for processing, which fail to fully capture contextual information, leading to limited detection accuracy and adaptability in complex environments. Inspired by vision-language models, this paper proposes a novel framework, SAIST, which integrates textual information with image modalities to enhance IRSTD performance. The framework consists of two main components: Scene Recognition Contrastive Language-Image Pretraining (SR-CLIP) and CLIP-guided Segment Anything Model (CG-SAM). SR-CLIP generates a set of visual descriptions through object-object similarity and object-scene relevance, embedding them into learnable prompts to refine the textual description set. This reduces the domain gap between vision and language, generating precise textual and visual prompts. CG-SAM utilizes the prompts generated by SR-CLIP to accurately guide the Mask Decoder in learning prior knowledge of background features, while incorporating infrared imaging equations to improve small target recognition in complex backgrounds and significantly reduce the false alarm rate. Additionally, this paper introduces the first multimodal IRSTD dataset, MIRSTD, which contains abundant image-text pairs. Experimental results demonstrate that the proposed SAIST method outperforms existing state-of-the-art …
Poster
Changsong Wen · Zelin Peng · Yu Huang · Xiaokang Yang · Wei Shen

[ ExHall D ]

Abstract
Domain generalization (DG) aims to train a model on source domains that can generalize well to unseen domains. Recent advances in Vision-Language Models (VLMs), such as CLIP, exhibit remarkable generalization capabilities across a wide range of data distributions, benefiting tasks like DG. However, CLIP is pre-trained by aligning images with their descriptions, which inevitably captures domain-specific details. Moreover, adapting CLIP to source domains with limited feature diversity introduces bias. These limitations hinder the model's ability to generalize across domains. In this paper, we propose a new DG approach that learns with diverse text prompts. These text prompts incorporate varied contexts to imitate different domains, enabling the DG model to learn domain-invariant features. The text prompts guide the DG model's learning in three aspects: feature suppression, which uses these prompts to identify domain-sensitive features and suppress them; feature consistency, which ensures the model's features are robust to domain variations imitated by the diverse prompts; and feature diversification, which diversifies features based on the prompts to mitigate bias. Experimental results show that our approach improves domain generalization performance on five datasets of the DomainBed benchmark, achieving state-of-the-art results.
Poster
Junyi Wu · Yan Huang · Min Gao · Yuzhen Niu · Yuzhong Chen · Qiang Wu

[ ExHall D ]

Abstract
Pedestrian attribute recognition (PAR) seeks to predict multiple semantic attributes associated with a specific pedestrian. There are two types of approaches for PAR: unimodal frameworks and bimodal frameworks. The former seeks robust visual features, but its main shortcoming is that it does not exploit the semantic features of the linguistic modality. The latter utilizes prompt learning techniques to integrate linguistic data; however, static prompt templates and simple bimodal concatenation cannot capture the extensive intra-class attribute variability or support active collaboration between modalities. In this paper, we propose an Enhanced Visual-Semantic Interaction with Tailored Prompts (EVSITP) framework for PAR. We present an Image-Conditional Dual-Prompt Initialization Module (IDIM) to adaptively generate context-sensitive prompts from visual inputs. Subsequently, a Prompt Enhancement and Regularization Module (PERM) is proposed to strengthen the linguistic information from IDIM. We further design a Bimodal Mutual Interaction Module (BMIM) to ensure bidirectional communication between modalities. In addition, existing PAR datasets are collected over short periods in limited scenarios, which do not align with real-world conditions. Therefore, we annotate a long-term person re-identification dataset to create a new PAR dataset, Celeb-PAR. Experiments on several challenging PAR datasets show that our method outperforms state-of-the-art approaches.
Poster
Zilin Xiao · Pavel Suma · Ayush Sachdeva · Hao-Jen Wang · Giorgos Kordopatis-Zilos · Giorgos Tolias · Vicente Ordonez

[ ExHall D ]

Abstract
We introduce LOCORE, Long-Context Re-ranker, a model that takes as input local descriptors corresponding to an image query and a list of gallery images and outputs similarity scores between the query and each gallery image. This model is used for image retrieval, where typically a first ranking is performed with an efficient similarity measure, and then a shortlist of top-ranked images is re-ranked based on a more fine-grained similarity measure. Compared to existing methods that perform pair-wise similarity estimation with local descriptors or list-wise re-ranking with global descriptors, LOCORE is the first method to perform list-wise re-ranking with local descriptors. To achieve this, we leverage efficient long-context sequence models to effectively capture the dependencies between query and gallery images at the local-descriptor level. During testing, we process long shortlists with a sliding window strategy that is tailored to overcome the context size limitations of sequence models. Our approach achieves superior performance compared with other re-rankers on established image retrieval benchmarks of landmarks (ROxf and RPar), products (SOP), fashion items (In-Shop), and bird species (CUB-200) while having significantly lower forward latency compared with pair-wise local feature re-rankers.
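The sliding-window strategy can be illustrated independently of the re-ranker itself: score the query against overlapping chunks of the shortlist so no single forward pass exceeds the model's context size, then keep the best score per gallery image. Window size, stride, and the max-aggregation rule below are assumptions, and the scorer is a stand-in.

```python
from typing import Callable, List

def sliding_window_rerank(query, shortlist: List, score_window: Callable,
                          window: int = 100, stride: int = 50) -> List:
    """Re-rank a long shortlist with a scorer whose context holds at most `window` images."""
    scores = {}
    for start in range(0, max(1, len(shortlist) - window + stride), stride):
        chunk = shortlist[start:start + window]
        for idx, s in zip(range(start, start + len(chunk)), score_window(query, chunk)):
            scores[idx] = max(s, scores.get(idx, float("-inf")))  # keep best score per image
    order = sorted(scores, key=scores.get, reverse=True)
    return [shortlist[i] for i in order]

def dummy_scorer(query, chunk):            # stand-in for the list-wise re-ranking model
    return [float(len(str(item))) for item in chunk]

ranked = sliding_window_rerank("query", [f"img_{i}" for i in range(230)], dummy_scorer)
print(ranked[:3])
```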
Poster
Jie Wang · Nana Yu · Zihao Zhang · Yahong Han

[ ExHall D ]

Abstract
Existing co-salient object detection (CoSOD) methods generally employ a three-stage architecture (i.e., encoding, consensus extraction & dispersion, and prediction) along with a typical full fine-tuning paradigm. Although they yield certain benefits, they exhibit two notable limitations: 1) This architecture relies on encoded features to facilitate consensus extraction, but the meticulously extracted consensus does not provide timely guidance to the encoding stage. 2) This paradigm involves globally updating all parameters of the model, which is parameter-inefficient and hinders the effective representation of knowledge within the foundation model for this task. Therefore, in this paper, we propose an interaction-effective and parameter-efficient concise architecture for the CoSOD task, addressing two key limitations. It introduces, for the first time, a parameter-efficient prompt tuning paradigm and seamlessly embeds consensus into the prompts to formulate task-specific Visual Consensus Prompts (VCP). Our VCP aims to induce the frozen foundation model to perform better on CoSOD tasks by formulating task-specific visual consensus prompts with minimized tunable parameters. Concretely, the primary insight of the purposeful Consensus Prompt Generator (CPG) is to enforce limited tunable parameters to focus on co-salient representations and generate consensus prompts. The formulated Consensus Prompt Disperser (CPD) leverages consensus prompts to form task-specific visual consensus prompts, …
Poster
Nuo Chen · Ming Jiang · Qi Zhao

[ ExHall D ]

Abstract
Deep saliency models, which predict what parts of an image capture our attention, are often like black boxes. This limits their use, especially in areas where understanding why a model makes a decision is crucial. Our research tackles this challenge by building a saliency model that can not only identify what is important in an image, but also explain its choices in a way that makes sense to humans. We achieve this by using vision-language models to reason about images and by focusing the model's attention on the most crucial information using a contextual prioritization mechanism. Unlike prior approaches that rely on fixation descriptions or soft-attention based semantic aggregation, our method directly models the reasoning steps involved in saliency prediction, generating selectively prioritized explanations that clarify why specific regions are prioritized. Comprehensive evaluations demonstrate the effectiveness of our model in generating high-quality saliency maps and coherent, contextually relevant explanations. This research is a step towards more transparent and trustworthy AI systems that can help us understand and navigate the world around us.
Poster
Ziheng Zhang · Jianyang Gu · Arpita Chowdhury · Zheda Mai · David Carlyn · Tanya Berger-Wolf · Yu Su · Wei-Lun Chao

[ ExHall D ]

Abstract
Class activation map (CAM) has been broadly investigated to highlight image regions that contribute to class predictions. Despite its simplicity to implement and computational efficiency, CAM often struggles to identify discriminative regions in fine-grained classification tasks. Previous efforts address this by introducing more sophisticated explanation processes on the model prediction, at a cost of extra complexity. In this paper, we propose Finer-CAM, which retains CAM's efficiency while achieving precise localization of discriminative regions. Our insight is that the deficiency of CAM is not about how it explains but **what it explains**. Previous methods independently look at all the possible hints that explain the target class prediction, which inevitably also activates regions predictive of similar classes. By explicitly comparing and spotting the difference between the target class and similar classes, Finer-CAM suppresses features shared with other classes, and emphasizes the unique key details of the target class. The method allows adjustable comparing strength, enabling Finer-CAM to focus coarsely on general object contours or discriminatively on fine-grained details. The method is compatible with various CAM methods, and can be extended to multi-modal models to accurately activate specific concepts. Quantitatively, we demonstrate that masking out the top 5% activated pixels by Finer-CAM leads …
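A minimal sketch of the comparison idea: instead of weighting feature maps by the target class alone, weight them by the difference between the target's channel-wise gradients and those of a similar class, which suppresses shared evidence. This illustrates the principle rather than the authors' Finer-CAM code; the adjustable strength mirrors the "comparing strength" mentioned above.

```python
import torch

def comparative_cam(feature_maps: torch.Tensor, grads_target: torch.Tensor,
                    grads_similar: torch.Tensor, strength: float = 1.0) -> torch.Tensor:
    """feature_maps: (C, H, W); grads_*: (C,) channel-wise gradients of two class logits."""
    weights = grads_target - strength * grads_similar          # emphasize class-unique channels
    cam = torch.relu((weights[:, None, None] * feature_maps).sum(0))
    return cam / (cam.max() + 1e-8)

fm = torch.randn(256, 14, 14)              # toy feature maps from the last block
cam = comparative_cam(fm, torch.randn(256), torch.randn(256), strength=0.7)
print(cam.shape)                           # (14, 14); strength=0 reduces to a plain Grad-CAM weighting
```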
Poster
Junru Zhao · Tianqin Li · Dunhan Jiang · Shenghao Wu · Alan Ramirez · Tai Sing Lee

[ ExHall D ]

Abstract
David Marr’s seminal theory of human perception proposes that visual representation learning should follow a multi-stage process, prioritizing the derivation of boundary and surface properties before forming semantic object representations. In contrast, contrastive representation learning frameworks typically bypass this explicit multi-stage approach, defining their objective as the direct learning of a semantic representation space for objects. While effective in general contexts, this approach sacrifices the inductive biases of vision, leading to slower convergence and shortcut learning such as texture bias. In this work, we demonstrate that leveraging Marr’s multi-stage theory—by first constructing boundary and surface-level representations using perceptual constructs from early visual processing stages and subsequently training for object semantics—leads to 2x faster convergence on ResNet18, improved final representations on semantic segmentation, depth estimation, and object recognition, and enhanced robustness and out-of-distribution capability. Together, we propose a pretraining stage before the general contrastive representation pretraining to further enhance the final representation quality and reduce the overall convergence time via inductive bias from human vision systems.
Poster
Baifeng Shi · Boyi Li · Han Cai · Yao Lu · Sifei Liu · Marco Pavone · Jan Kautz · Song Han · Trevor Darrell · Pavlo Molchanov · Danny Yin

[ ExHall D ]

Abstract
High-resolution perception of visual details is crucial for daily tasks. Current vision pre-training, however, is still limited to low resolutions (e.g., 384x384) due to the quadratic cost of processing larger images. We introduce PS3, for Pre-training with Scale-Selective Scaling, that scales CLIP-style vision pre-training to 4K resolution with a near-constant cost. Instead of processing entire global images, PS3 is pre-trained to selectively process local regions and contrast them with local detailed captions, allowing it to learn detailed representation at high resolution with greatly reduced computational overhead. The pre-trained PS3 is able to both encode the global low-resolution image and select local high-resolution regions to process based on their saliency or relevance to a text prompt. When applied to multi-modal LLMs (MLLMs), PS3 demonstrates performance that effectively scales with the pre-training resolution and significantly improves over baselines without high-resolution pre-training. We also find current benchmarks do not require recognizing details at 4K resolution, which motivates us to propose 4KPro, a new benchmark that evaluates visual perception at 4K resolution, on which PS3 outperforms state-of-the-art MLLMs, including a 13% improvement over GPT-4o.
Poster
Enrico Fini · Mustafa Shukor · Xiujun Li · Philipp Dufter · Michal Klein · David Haldimann · Sai Aitharaju · Victor Guilherme Turrisi da Costa · Louis Béthune · Zhe Gan · Alexander Toshev · Marcin Eichner · Moin Nabi · Yinfei Yang · Joshua Susskind · Alaaeldin El-Nouby

[ ExHall D ]

Abstract
We introduce a novel method for pre-training of large-scale vision encoders. Building on recent advancements in autoregressive pre-training of vision models, we extend this framework to a multimodal setting, i.e., images and text. In this paper, we present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process, scalability, and remarkable performance across a range of downstream tasks. This is achieved by pairing the vision encoder with a multimodal decoder that autoregressively generates raw image patches and text tokens. Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification. Notably, our AIMV2-3B encoder achieves 89.5% accuracy on ImageNet-1k with a frozen trunk. Furthermore, AIMV2 consistently outperforms state-of-the-art contrastive models (e.g., CLIP, SigLIP) in multimodal image understanding across diverse settings.
Poster
Tianran Chen · Jiarui Chen · Baoquan Zhang · Zhehao Yu · Shidong Chen · Rui Ye · Xutao Li · Yunming Ye

[ ExHall D ]

Abstract
Parameter-Efficient Fine-Tuning (PEFT) is a fundamental research problem in computer vision, which aims to tune only a few parameters for efficient storage and adaptation of pre-trained vision models. Recently, the sensitivity-aware parameter-efficient fine-tuning method (SPT) addresses this problem by identifying sensitive parameters and then leveraging their sparse characteristic to combine unstructured and structured tuning for PEFT. However, existing methods only focus on the sparsity of sensitive parameters but overlook their distribution, which results in additional storage burden and limited performance improvement. In this paper, we find that the distribution of sensitive parameters is not chaotic but concentrated in a small number of rows or columns of each parameter matrix. Inspired by this fact, we propose a Compact Dynamic-Rank Adaptation-based tuning method for Sensitivity-aware Parameter efficient fine-Tuning, called CDRA-SPT. Specifically, we first identify the sensitive parameters that require tuning for each downstream task. Then, we reorganize the sensitive parameters according to their rows and columns into a compact sub-parameter matrix. Finally, a dynamic-rank adaptation is designed and applied at the sub-parameter matrix level for PEFT. Its advantage is that the dynamic-rank characteristic of the sub-parameter matrix can be fully exploited for PEFT. Extensive experiments show that our method achieves superior performance over …
Poster
Long Zhou · Fereshteh Shakeri · Aymen Sadraoui · Mounir Kaaniche · Jean-Christophe Pesquet · Ismail Ben Ayed

[ ExHall D ]

Abstract
Transductive few-shot learning has recently triggered wide attention in computer vision. Yet, current methods introduce key hyper-parameters, which control the prediction statistics of the test batches, such as the level of class balance, affecting performance significantly. Such hyper-parameters are empirically grid-searched over validation data, and their configurations may vary substantially with the target dataset and pre-training model, making such empirical searches both sub-optimal and computationally intractable. In this work, we advocate and introduce the unrolling paradigm, also referred to as “learning to optimize”, in the context of few-shot learning, thereby learning efficiently and effectively a set of optimized hyperparameters. Specifically, we unroll a generalization of the ubiquitous Expectation-Maximization (EM) optimizer into a neural network architecture, mapping each of its iterates to a layer and learning a set of key hyper-parameters over validation data. Our unrolling approach covers various statistical feature distributions and pre-training paradigms, including recent foundational vision-language models and standard vision-only classifiers. We report comprehensive experiments, which cover a breadth of fine-grained downstream image classification tasks, showing significant gains brought by the proposed unrolled EM algorithm over iterative variants. The achieved improvements reach up to 10% and 7.5% on vision-only and vision-language benchmarks, respectively. The source code and learned …
Poster
Yunze Liu · Li Yi

[ ExHall D ]

Abstract
Hybrid Mamba-Transformer networks have recently garnered broad attention. These networks can leverage the scalability of Transformers while capitalizing on Mamba's strengths in long-context modeling and computational efficiency. However, the challenge of effectively pretraining such hybrid networks remains an open question. Existing methods, such as Masked Autoencoders (MAE) or autoregressive (AR) pretraining, primarily focus on single-type network architectures. In contrast, pretraining strategies for hybrid architectures must be effective for both Mamba and Transformer components. Based on this, we propose Masked Autoregressive Pretraining (MAP) to pretrain a hybrid Mamba-Transformer vision backbone network. This strategy combines the strengths of both MAE and Autoregressive pretraining, improving the performance of Mamba and Transformer modules within a unified paradigm. Experimental results show that the hybrid Mamba-Transformer vision backbone network pretrained with MAP significantly outperforms other pretraining strategies, achieving state-of-the-art performance. We validate the method's effectiveness on both 2D and 3D datasets and provide detailed ablation studies to support the design choices for each component. We will release the code and checkpoints.
Poster
Zhuguanyu Wu · Jiayi Zhang · Jiaxin Chen · Jinyang Guo · Di Huang · Yunhong Wang

[ ExHall D ]

Abstract
Vision Transformers (ViTs) have become one of the most commonly used visual backbones. While ViTs generally achieve higher accuracy than equivalently sized CNNs, they often experience significant accuracy loss during quantized deployment, especially with ultra-low bit post-training quantization (PTQ) schemes. Current reconstruction-based PTQ methods commonly used for CNNs experience significant accuracy drops on ViTs, due to inaccurate estimation of output importance and the large accuracy drop from quantizing post-GELU activations. To address these issues, we propose APHQ-ViT, a new PTQ paradigm based on average perturbation Hessian (APH) importance estimation. To assess output importance, we thoroughly analyze the approximations used in previous works with the Hessian loss and the reasons for their low accuracy, and propose a more precise average perturbation Hessian loss. To address the issue of quantizing post-GELU activations, we propose an MLP-Reconstruction (MR) technique that reconstructs the MLP using the average perturbation Hessian importance estimation. In addition to reducing the activation range, MLP-Reconstruction replaces the activation function of the MLP from GELU to ReLU using a small unlabeled calibration set, further enhancing the model's accuracy. With only linear quantizers, APHQ-ViT outperforms existing methods by a substantial margin on 3-bit and 4-bit PTQ of ViTs in different vision tasks.
Poster
Yoojin Jung · Byung Cheol Song

[ ExHall D ]

Abstract
Deep learning-based computer vision systems adopt complex and large architectures to improve performance, yet they face challenges in deployment on resource-constrained mobile and edge devices. To address this issue, model compression techniques such as pruning, quantization, and matrix factorization have been proposed; however, these compressed models are often highly vulnerable to adversarial attacks. We introduce the Efficient Ensemble Defense (EED) technique, which diversifies the compression of a single base model based on different pruning importance scores and enhances ensemble diversity to achieve high adversarial robustness and resource efficiency. EED dynamically determines the number of necessary sub-models during the inference stage, minimizing unnecessary computations while maintaining high robustness. On the CIFAR-10 and SVHN datasets, EED demonstrated state-of-the-art robustness performance compared to existing adversarial pruning techniques, along with an inference speed improvement of up to 1.86 times. This proves that EED is a powerful defense solution in resource-constrained environments.
Poster
Zhaozhi Wang · Yue Liu · Yunjie Tian · Yunfan Liu · Yaowei Wang · Qixiang Ye

[ ExHall D ]

Abstract
Visual representation models leveraging attention mechanisms are challenged by significant computational overhead, particularly when pursuing large receptive fields. In this study, we aim to mitigate this challenge by introducing the Heat Conduction Operator (HCO) built upon the physical heat conduction principle. HCO conceptualizes image patches as heat sources and models their correlations through adaptive thermal energy diffusion, enabling robust visual representations. HCO enjoys a computational complexity of O(N^1.5), as it can be implemented using discrete cosine transformation (DCT) operations. HCO is plug-and-play; combining it with deep learning backbones produces visual representation models (termed vHeat) with global receptive fields. Experiments across vision tasks demonstrate that, beyond the stronger performance, vHeat achieves up to 3x higher throughput, 80% less GPU memory allocation, and 35% fewer computational FLOPs compared to the Swin-Transformer.
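For intuition, the spectral solution of the 2D heat equation damps DCT coefficients by exp(-k (u^2 + v^2) t), which is what makes a DCT-based implementation natural. The sketch below shows that textbook solution on a single-channel feature map; it is not the vHeat module, and k and t are illustrative.

```python
import numpy as np
from scipy.fft import dctn, idctn

def heat_conduct(feature_map: np.ndarray, k: float = 1.0, t: float = 0.1) -> np.ndarray:
    """Diffuse a (H, W) feature map as a heat field for time t via its DCT spectrum."""
    H, W = feature_map.shape
    coeffs = dctn(feature_map, norm="ortho")
    u = np.pi * np.arange(H)[:, None] / H
    v = np.pi * np.arange(W)[None, :] / W
    decay = np.exp(-k * (u ** 2 + v ** 2) * t)   # low frequencies survive, high ones decay
    return idctn(coeffs * decay, norm="ortho")

x = np.random.randn(56, 56)
y = heat_conduct(x, k=2.0, t=0.5)
print(y.shape)
```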
Poster
Ao Wang · Hui Chen · Zijia Lin · Jungong Han · Guiguang Ding

[ ExHall D ]

Abstract
Vision network designs, including Convolutional Neural Networks and Vision Transformers, have significantly advanced the field of computer vision. Yet, their complex computations pose challenges for practical deployments, particularly in real-time applications. To tackle this issue, researchers have explored various lightweight and efficient network designs. However, existing lightweight models predominantly leverage self-attention mechanisms and convolutions for token mixing. This dependence brings limitations in effectiveness and efficiency in the perception and aggregation processes of lightweight networks, hindering the balance between performance and efficiency under limited computational budgets. In this paper, we draw inspiration from the dynamic heteroscale vision ability inherent in the efficient human vision system and propose a "See Large, Focus Small" strategy for lightweight vision network design. We introduce LS (Large-Small) convolution, which combines large-kernel perception and small-kernel aggregation. It can efficiently capture a wide range of perceptual information and achieve precise feature aggregation for dynamic and complex visual representations, thus enabling proficient processing of visual information. Based on LS convolution, we present LSNet, a new family of lightweight models. Extensive experiments demonstrate that LSNet achieves superior performance and efficiency over existing lightweight networks in various vision tasks. Codes and models will be publicly available.
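A hedged sketch of a "see large, focus small" token-mixing block: a large-kernel depthwise convolution for broad perception followed by a small-kernel convolution for local aggregation. The real LS convolution likely differs in structure; the kernel sizes and residual form are assumptions.

```python
import torch
import torch.nn as nn

class LargeSmallBlock(nn.Module):
    """Large-kernel depthwise perception followed by small-kernel aggregation (illustrative)."""

    def __init__(self, dim: int, large_k: int = 7, small_k: int = 3):
        super().__init__()
        self.perceive = nn.Conv2d(dim, dim, large_k, padding=large_k // 2, groups=dim)  # see large
        self.aggregate = nn.Conv2d(dim, dim, small_k, padding=small_k // 2)             # focus small
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.aggregate(self.act(self.perceive(x)))   # residual token mixing

block = LargeSmallBlock(64)
out = block(torch.randn(1, 64, 56, 56))
print(out.shape)
```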
Poster
Nikaan Nikzad · YI LIAO · Yongsheng Gao · Jun Zhou

[ ExHall D ]

Abstract
Over the past few years, vision transformers (ViTs) have consistently demonstrated remarkable performance across various visual recognition tasks. However, attempts to enhance their robustness have yielded limited success, mainly focusing on different training strategies, input patch augmentation, or network structural enhancements. These approaches often involve extensive training and fine-tuning, which are time-consuming and resource-intensive. To tackle these obstacles, we introduce a novel approach named Spatial Autocorrelation Token Analysis (SATA). By harnessing spatial relationships between token features, SATA enhances both the representational capacity and robustness of ViT models. This is achieved through the analysis and grouping of tokens according to their spatial autocorrelation scores prior to their input into the Feed-Forward Network (FFN) block of the self-attention mechanism. Importantly, SATA seamlessly integrates into existing pre-trained ViT baselines without requiring retraining or additional fine-tuning, while concurrently improving efficiency by reducing the computational load of the FFN units. Experimental results show that the baseline ViTs enhanced with SATA not only achieve a new state-of-the-art top-1 accuracy on ImageNet-1K image classification (94.9%) but also establish new state-of-the-art performance across multiple robustness benchmarks, including ImageNet-A (top-1=63.6%), ImageNet-R (top-1=79.2%), and ImageNet-C (mCE=13.6%), all without requiring additional training or fine-tuning of baseline models.
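A rough illustration of scoring tokens by spatial autocorrelation, here a Moran's-I-style agreement between each patch token and its 4-neighborhood average on the token grid, followed by a simple split into high- and low-autocorrelation groups. How SATA actually groups tokens and routes them through the FFN is not shown; the threshold and neighborhood are assumptions.

```python
import torch
import torch.nn.functional as F

def autocorrelation_scores(tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """tokens: (h*w, d) patch tokens laid out on an h x w grid; returns one score per token."""
    x = tokens.reshape(h, w, -1)
    x = x - x.mean(dim=(0, 1), keepdim=True)
    neigh = torch.zeros_like(x)                      # average of the 4-neighborhood
    neigh[1:] += x[:-1]
    neigh[:-1] += x[1:]
    neigh[:, 1:] += x[:, :-1]
    neigh[:, :-1] += x[:, 1:]
    neigh = neigh / 4.0                              # border tokens are approximated, fine for a sketch
    return F.cosine_similarity(x.reshape(h * w, -1), neigh.reshape(h * w, -1), dim=-1)

tokens = torch.randn(14 * 14, 384)
scores = autocorrelation_scores(tokens, 14, 14)
high_group = tokens[scores > scores.median()]        # e.g., route the two groups differently before the FFN
print(high_group.shape)
```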
Poster
Benjamin Bergner · Christoph Lippert · Aravindh Mahendran

[ ExHall D ]

Abstract
The adoption of Vision Transformers (ViTs) in resource-constrained applications necessitates improvements in inference throughput. To this end, several token pruning and merging approaches have been proposed that improve efficiency by successively reducing the number of tokens. However, it remains an open problem to design a token reduction method that is fast, maintains high performance, and is applicable to various vision tasks. In this work, we present a token pruner that uses auxiliary prediction heads that learn to select tokens end-to-end based on task relevance. These auxiliary heads can be removed after training, leading to throughput close to that of a random pruner. We evaluate our method on image classification, semantic segmentation, object detection, and instance segmentation, and show speedups of 1.5 to 4x with small drops in performance. As a best case, on the ADE20k semantic segmentation benchmark, we observe a 2x speedup relative to the no-pruning baseline, with a negligible performance penalty of 0.1 median mIoU across 5 seeds.
Poster
Joshua Fixelle

[ ExHall D ]

Abstract
Recent advancements in computer vision have highlighted the scalability of Vision Transformers (ViTs) across various tasks, yet challenges remain in balancing adaptability, computational efficiency, and the ability to model higher-order relationships. Vision Graph Neural Networks (ViGs) offer an alternative by leveraging graph-based methodologies but are hindered by the computational bottlenecks of clustering algorithms used for edge generation. To address these issues, we propose the Hypergraph Vision Transformer (HgVT), which incorporates a hierarchical bipartite hypergraph structure into the vision transformer framework to capture higher-order semantic relationships while maintaining computational efficiency. HgVT leverages population and diversity regularization for dynamic hypergraph construction without clustering, and expert edge pooling to enhance semantic extraction and facilitate graph-based image retrieval. Empirical results demonstrate that HgVT achieves state-of-the-art performance on image classification and competitive performance on image retrieval, positioning it as an efficient and adaptable framework for semantic-based vision tasks.
Poster
Zhijie Zhu · Lei Fan · Maurice Pagnucco · Yang Song

[ ExHall D ]

Abstract
Classifying images with an interpretable decision-making process is a long-standing problem in computer vision. In recent years, Prototypical Part Networks have gained traction as an approach to self-explainable neural networks, due to their ability to mimic human visual reasoning by providing explanations based on prototypical object parts. However, the quality of the explanations generated by these methods leaves room for improvement, as the prototypes usually focus on repetitive and redundant concepts. Leveraging recent advances in prototype learning, we present a framework for part-based interpretable image classification that learns a set of semantically distinctive object parts for each class, and provides diverse and comprehensive explanations. The core of our method is to learn the part-prototypes in a non-parametric fashion, through clustering deep features extracted from foundation vision models that encode robust semantic information. To quantitatively evaluate the quality of explanations provided by ProtoPNets, we introduce Distinctiveness Score and Comprehensiveness Score. Through evaluation on CUB-200-2011, Stanford Cars and Stanford Dogs datasets, we show that our framework compares favourably against existing ProtoPNets while achieving better interpretability.
Poster
Fanding Huang · Jingyan Jiang · Qinting Jiang · Li Hebei · Faisal Nadeem Khan · Zhi Wang

[ ExHall D ]

Abstract
Vision-language models (VLMs) face significant challenges in test-time adaptation to novel domains. While cache-based methods show promise by leveraging historical information, they struggle with both caching unreliable feature-label pairs and indiscriminately using single-class information during querying, significantly compromising adaptation accuracy. To address these limitations, we propose COSMIC (Clique-Oriented Semantic Multi-space Integration for CLIP), a robust test-time adaptation framework that enhances adaptability through multi-granular, cross-modal semantic caching and graph-based querying mechanisms. Our framework introduces two key innovations: the Dual Semantics Graph (DSG) and the Clique Guided Hyper-class (CGH). The Dual Semantics Graph constructs complementary semantic spaces by incorporating textual features, coarse-grained CLIP features, and fine-grained DINOv2 features to capture rich semantic relationships. Building upon these dual graphs, the Clique Guided Hyper-class component leverages structured class relationships to enhance prediction robustness through correlated class selection. Extensive experiments demonstrate COSMIC's superior performance across multiple benchmarks, achieving significant improvements over state-of-the-art methods: a 15.81% gain on out-of-distribution tasks and 5.33% on cross-domain generation with CLIP RN-50.
Poster
Jiho Choi · Seonho Lee · Minhyun Lee · Seungho Lee · Hyunjung Shim

[ ExHall D ]

Abstract
Open-Vocabulary Part Segmentation (OVPS) is an emerging field for recognizing fine-grained object parts in unseen categories. We identify two primary challenges in OVPS: (1) the difficulty in aligning part-level image-text correspondence, and (2) the lack of structural understanding in segmenting object parts. To address these issues, we propose PartCATSeg, a novel framework that integrates object-aware part-level cost aggregation, compositional loss, and structural guidance from DINO. Our approach employs a disentangled cost aggregation strategy that handles object and part-level costs separately, enhancing the precision of part-level segmentation. We also introduce a compositional loss to better capture part-object relationships, compensating for the limited part annotations. Additionally, structural guidance from DINO features improves boundary delineation and inter-part understanding. Extensive experiments on Pascal-Part-116, ADE20K-Part-234, and PartImageNet datasets demonstrate that our method significantly outperforms state-of-the-art approaches, setting a new baseline for robust generalization to unseen part categories.
Poster
Vladan Stojnić · Yannis Kalantidis · Jiri Matas · Giorgos Tolias

[ ExHall D ]

Abstract
We propose a training-free method for open-vocabulary semantic segmentation using Vision-and-Language Models (VLMs). Our approach enhances the initial per-patch predictions of VLMs through label propagation, which jointly optimizes predictions by incorporating patch-to-patch relationships. Since VLMs are primarily optimized for cross-modal alignment and not for intra-modal similarity, we use a Vision Model (VM) that is observed to better capture these relationships. We address resolution limitations inherent to patch-based encoders by applying label propagation at the pixel level as a refinement step, significantly improving segmentation accuracy near class boundaries. Our method, called LPOSS+, performs inference over the entire image, avoiding window-based processing and thereby capturing contextual interactions across the full image. LPOSS+ achieves state-of-the-art performance across a diverse set of datasets.
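A minimal sketch of the label-propagation step: start from per-patch VLM predictions and iteratively smooth them over a kNN affinity graph built from vision-model features, using the standard update Z = alpha * S @ Z + (1 - alpha) * Y. Hyper-parameters and the pixel-level refinement near boundaries are omitted; this is not the authors' LPOSS+ code.

```python
import torch
import torch.nn.functional as F

def propagate_labels(init_logits: torch.Tensor, feats: torch.Tensor,
                     alpha: float = 0.9, iters: int = 20, knn: int = 8) -> torch.Tensor:
    """init_logits: (N, C) per-patch class scores; feats: (N, d) vision-model features."""
    f = F.normalize(feats, dim=-1)
    sim = f @ f.T
    topv, topi = sim.topk(knn, dim=-1)                       # sparsify to a kNN affinity graph
    W = torch.zeros_like(sim).scatter_(1, topi, topv.clamp(min=0))
    W = 0.5 * (W + W.T)                                      # symmetrize
    d = W.sum(-1).clamp(min=1e-8).rsqrt()
    S = d[:, None] * W * d[None, :]                          # normalized affinity
    Y = init_logits.softmax(-1)
    Z = Y.clone()
    for _ in range(iters):
        Z = alpha * (S @ Z) + (1 - alpha) * Y                # standard label-propagation update
    return Z

refined = propagate_labels(torch.randn(196, 21), torch.randn(196, 768))
print(refined.shape)
```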
Poster
Reza Qorbani · Gianluca Villani · Theodoros Panagiotakopoulos · Marc Botet Colomer · Linus Härenstam-Nielsen · Mattia Segu · Pier Luigi Dovesi · Jussi Karlgren · Daniel Cremers · Federico Tombari · Matteo Poggi

[ ExHall D ]

Abstract
Open-vocabulary semantic segmentation models associate vision and text to label pixels from an undefined set of classes using textual queries, providing versatile performance on novel datasets. However, large shifts between training and test domains degrade their performance, requiring fine-tuning for effective real-world application. We introduce Semantic Library Adaptation (SemLa), a novel framework for training-free, test-time domain adaptation. SemLa leverages a library of LoRA-based adapters indexed with CLIP embeddings, dynamically merging the most relevant adapters based on proximity to the target domain in the embedding space. This approach constructs an ad-hoc model tailored to each specific input without additional training. Our method scales efficiently, enhances explainability by tracking adapter contributions, and inherently protects data privacy, making it ideal for sensitive applications. Comprehensive experiments on an 18-domain benchmark built over 10 standard datasets demonstrate SemLa's superior adaptability and performance across diverse settings, establishing a new standard in domain adaptation for open-vocabulary semantic segmentation.
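A hedged sketch of retrieval-based adapter merging in the spirit described above: each LoRA adapter in the library is indexed by a CLIP embedding, and at test time the adapters closest to the input's embedding are merged with similarity-weighted coefficients. The data layout, top-k, and softmax weighting are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def merge_nearest_adapters(input_emb: torch.Tensor, adapter_embs: torch.Tensor,
                           adapter_weights: list, top_k: int = 3) -> dict:
    """adapter_weights: list of LoRA-delta state dicts, aligned row-wise with adapter_embs."""
    sims = F.cosine_similarity(input_emb[None], adapter_embs, dim=-1)
    vals, idx = sims.topk(top_k)                   # adapters closest to the input's embedding
    coeffs = vals.softmax(-1)                      # similarity-proportional merge weights
    merged = {}
    for c, i in zip(coeffs, idx):
        for name, delta in adapter_weights[int(i)].items():
            merged[name] = merged.get(name, 0) + c * delta    # convex combination of deltas
    return merged                                  # add these deltas onto the frozen backbone

library_embs = torch.randn(20, 512)                # one CLIP embedding per stored adapter
library = [{"lora.down": torch.randn(16, 512), "lora.up": torch.randn(512, 16)} for _ in range(20)]
merged = merge_nearest_adapters(torch.randn(512), library_embs, library, top_k=3)
print(list(merged))
```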
Poster
Hritam Basak · Zhaozheng Yin

[ ExHall D ]

Abstract
Domain Adaptation (DA) and Semi-supervised Learning (SSL) converge in Semi-supervised Domain Adaptation (SSDA), where the objective is to transfer knowledge from a source domain to a target domain using a combination of limited labeled target samples and abundant unlabeled target data. Although intuitive, a simple amalgamation of DA and SSL is suboptimal in semantic segmentation for two major reasons: (1) previous methods, while able to learn good segmentation boundaries, are prone to confuse classes with similar visual appearance due to limited supervision; and (2) a skewed and imbalanced training data distribution favors source representation learning while impeding exploration of the limited information about tailed classes. Language guidance can serve as a pivotal semantic bridge, facilitating robust class discrimination and mitigating visual ambiguities by leveraging the rich semantic relationships encoded in pre-trained language models to enhance feature representations across domains. Therefore, we propose the first language-guided SSDA setting for semantic segmentation in this work. Specifically, we harness the semantic generalization capabilities inherent in vision-language models (VLMs) to establish a synergistic framework within the SSDA paradigm. To address the inherent class-imbalance challenges in long-tailed distributions, we introduce class-balanced segmentation loss formulations that effectively regularize the learning process. Through extensive experimentation across diverse domain …
Poster
Xiangfeng Xu · Pinyi Zhang · Wenxuan Huang · Yunhang Shen · Haosheng Chen · Jingzhong Lin · Wei Li · Gaoqi He · Jiao Xie · Shaohui Lin

[ ExHall D ]

Abstract
Weakly supervised semantic segmentation (WSSS) has garnered considerable attention due to its effective reduction of annotation costs. Most approaches utilize Class Activation Maps (CAM) to produce pseudo-labels, thereby localizing target regions using only image-level annotations. However, the prevalent methods relying on vision transformers (ViT) encounter an "over-expansion" issue, i.e., CAM incorrectly expands high activation values from the target object to the background regions, as it is difficult to learn pixel-level local intrinsic inductive bias in ViT from weak supervision. To solve this problem, we propose a Progressive Confidence Region Expansion (PCRE) framework for WSSS, which gradually learns a faithful mask over the target region and utilizes this mask to correct the confusion in CAM. PCRE has two key components: "Confidence Region Mask Expansion" (CRME) and "Class-Prototype Enhancement" (CPE). CRME progressively expands the mask from the small region with the highest confidence, eventually encompassing the entire target, thereby avoiding unintended over-expansion. CPE aims to enhance mask generation in CRME by leveraging the similarity between the learned, dataset-level class prototypes and patch features as supervision to optimize the mask output from CRME. Extensive experiments demonstrate that our method outperforms the existing single-stage and multi-stage approaches on the PASCAL VOC and MS COCO benchmark …
Poster
Hongmei Yin · Tingliang Feng · Fan Lyu · Fanhua Shang · Hongying Liu · Wei Feng · Liang Wan

[ ExHall D ]

Abstract
In this work, we focus on continual semantic segmentation (CSS), where segmentation networks are required to continuously learn new classes without erasing knowledge of previously learned ones. Although storing images of old classes and directly incorporating them into the training of new models has proven effective in mitigating catastrophic forgetting in classification tasks, this strategy presents notable limitations in CSS. Specifically, the stored and new images with partial category annotations lead to confusion between unannotated categories and the background, complicating model fitting. To tackle this, this paper proposes EIR, which not only preserves knowledge of old classes and eliminates background confusion by storing instances of old classes, but also mitigates the background shift present in the new images by integrating the stored instances with them. By effectively resolving background shifts in both stored and new images, EIR alleviates catastrophic forgetting in the CSS task, thereby enhancing the model's capacity for CSS. Experimental results validate the efficacy of our approach, which significantly outperforms state-of-the-art CSS methods.
Poster
Zhaoyang Li · Yuan Wang · Wangkai Li · Tianzhu Zhang · Xiang Liu

[ ExHall D ]

Abstract
Cross-Domain Few-Shot Segmentation (CD-FSS) extends the generalization ability of Few-Shot Segmentation (FSS) beyond a single domain, enabling more practical applications. However, directly employing conventional FSS methods suffers from severe performance degradation in cross-domain settings, primarily due to the sensitivity of both the features and the support-to-query matching process across domains. Existing methods for CD-FSS either focus on domain adaptation of features or delve into designing matching strategies for enhanced cross-domain robustness. Nonetheless, they overlook the fact that these two issues are interdependent and should be addressed jointly. In this work, we tackle these two issues within a unified framework by optimizing features in the frequency domain and enhancing the matching process in the spatial domain, working jointly to handle the deviations introduced by the domain gap. To this end, we propose a coherent Dual-Agent Optimization (DATO) framework, including a consistent mutual aggregation (CMA) module and a correlation rectification strategy (CRS). In the consistent mutual aggregation module, we employ a set of agents to learn domain-invariant features across domains, and then use these features to enhance the original representations for feature adaptation. In the correlation rectification strategy, the agent-aggregated domain-invariant features serve as a bridge, transforming the support-to-query matching process into a referable feature space and …
Poster
Dongyao Jiang · Haodong Jing · Yongqiang Ma · Nanning Zheng

[ ExHall D ]

Abstract
Human reasoning naturally combines concepts to identify unseen compositions, a capability that Compositional Zero-Shot Learning (CZSL) aims to replicate in machine learning models. However, we observe that focusing solely on typical image classification tasks in CZSL may limit models' compositional generalization potential. To address this, we introduce C-EgoExo, a video-based benchmark, along with a compositional action recognition task to enable more comprehensive evaluations. Inspired by human reasoning processes, we propose a Dual-branch Hybrid Discrimination (DHD) framework, featuring two branches that decode visual inputs in distinct observation sequences. Through a cross-attention mechanism and a contextual dependency encoder, DHD effectively mitigates challenges posed by conditional variance. We further design a Copula-based orthogonal decoding loss to counteract contextual interference in primitive decoding. Our approach demonstrates outstanding performance across diverse CZSL tasks, excelling in both image-based and video-based modalities and in attribute-object and action-object compositions, setting a new benchmark for CZSL evaluation.
Poster
Zeliang Zhang · Gaowen Liu · Charles Fleming · Ramana Kompella · Chenliang Xu

[ ExHall D ]

Abstract
Foundation models (FMs) such as CLIP have demonstrated impressive zero-shot performance across various tasks by leveraging large-scale, unsupervised pre-training. However, they often inherit harmful or unwanted knowledge from noisy internet-sourced datasets, which compromises their reliability in real-world applications. Existing model unlearning methods either rely on access to pre-trained datasets or focus on coarse-grained unlearning (e.g., entire classes), leaving a critical gap for fine-grained unlearning. In this paper, we address the challenging scenario of selectively forgetting specific portions of knowledge within a class—without access to pre-trained data—while preserving the model’s overall performance. We propose a novel three-stage approach that progressively unlearns targeted knowledge while mitigating over-forgetting. Our method consists of (1) a forgetting stage to fine-tune on samples to be forgotten, (2) a reminding stage to restore the model’s performance on retaining samples, and (3) a restoring stage to recover zero-shot capabilities using model souping. We also introduce knowledge distillation to handle the distribution disparity between forgetting/retaining samples and the unseen pre-trained data. Extensive experiments demonstrate that our approach effectively unlearns specific subgroups while maintaining strong zero-shot performance on other tasks and datasets, outperforming baseline unlearning methods.
Poster
Yiyang Chen · Tianyu Ding · Lei Wang · Jing Huo · Yang Gao · Wenbin Li

[ ExHall D ]

Abstract
Few-shot Class-Incremental Learning (FSCIL) challenges models to adapt to new classes with limited samples, presenting greater difficulties than traditional class-incremental learning. While existing approaches rely heavily on visual models and require additional training during base or incremental phases, we propose a training-free framework that leverages pre-trained visual-language models like CLIP. At the core of our approach is a novel Bi-level Modality Calibration (BiMC) strategy. Our framework initially performs intra-modal calibration, combining LLM-generated fine-grained category descriptions with visual prototypes from the base session to achieve precise classifier estimation. This is further complemented by inter-modal calibration that fuses pre-trained linguistic knowledge with task-specific visual priors to mitigate modality-specific biases. To enhance prediction robustness, we introduce additional metrics and strategies that maximize the utilization of limited data. Extensive experimental results demonstrate that our approach significantly outperforms existing methods. The code will be made publicly available.
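To make the training-free calibration idea concrete, the following is a minimal sketch of building a classifier by fusing text embeddings (e.g., CLIP-encoded, LLM-generated class descriptions) with visual prototypes from the base session. The simple convex fusion weight is an illustrative assumption, not necessarily the paper's exact bi-level calibration.

```python
# Minimal sketch of a training-free classifier that fuses text embeddings with
# visual prototypes. The convex fusion below is an illustrative choice.
import numpy as np

def build_classifier(text_embs, visual_protos, lam=0.5):
    """text_embs, visual_protos: (C, D) class representations from each modality."""
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    v = visual_protos / np.linalg.norm(visual_protos, axis=1, keepdims=True)
    w = lam * t + (1.0 - lam) * v                 # inter-modal fusion
    return w / np.linalg.norm(w, axis=1, keepdims=True)

def classify(features, classifier):
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    return (f @ classifier.T).argmax(axis=1)      # cosine-similarity prediction

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = build_classifier(rng.normal(size=(10, 64)), rng.normal(size=(10, 64)))
    print(classify(rng.normal(size=(4, 64)), w))
```

Because the classifier is assembled from frozen embeddings, no gradient updates are required in either the base or the incremental sessions.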
Poster
Yuanpei Liu · Zhenqi He · Kai Han

[ ExHall D ]

Abstract
Generalized Category Discovery (GCD) is an intriguing open-world problem that has garnered increasing attention within the research community. Given a dataset that includes both labelled and unlabelled images, GCD aims to categorize all images in the unlabelled subset, regardless of whether they belong to known or unknown classes. In GCD, the common practice typically involves applying a sphere projection operator at the end of the self-supervised pretrained backbone. This approach leads to subsequent operations being conducted within Euclidean or spherical geometry. However, both of these geometries have been shown to be suboptimal for encoding samples that possess hierarchical structures. In contrast, hyperbolic space exhibits a unique property where its volume grows exponentially relative to the radius, making it particularly suitable for embedding hierarchical data. In this paper, we are the first to investigate category discovery within hyperbolic space. We propose a simple yet effective framework for learning hierarchy-aware representations and classifiers for GCD through the lens of hyperbolic geometry. Specifically, we initiate this process by transforming a self-supervised backbone pretrained in Euclidean space into hyperbolic space via exponential mapping, then performing all representation and classifier optimization in the hyperbolic domain. We evaluate our proposed framework on widely used parametric and …
Poster
Qianqian Shen · Yunhan Zhao · Nahyun Kwon · Jeeeun Kim · Yanan Li · Shu Kong

[ ExHall D ]

Abstract
Instance detection (InsDet) aims to localize specific object instances within novel scene imagery based on given visual references. Technically, it requires proposal detection to identify all possible object instances, followed by instance-level matching to pinpoint the ones of interest. Its open-world nature supports its wide-ranging applications from robotics to AR/VR, but also presents significant challenges: methods must generalize to unknown testing data distributions because (1) the testing scene imagery is unseen during training, and (2) there are domain gaps between visual references and detected proposals. Existing methods attempt to tackle these challenges by synthesizing diverse training examples or utilizing off-the-shelf foundation models (FMs). However, they only partially capitalize on the available open-world information. In this paper, we approach InsDet from an Open-World perspective, introducing our method IDOW. We find that, while pretrained FMs yield high recall in instance detection, they are not specifically optimized for instance-level feature matching. To address this, we adapt pretrained FMs for improved instance-level matching using open-world data. Our approach incorporates metric learning along with novel data augmentations, which sample distractors as negative examples and synthesize novel-view instances to enrich the visual references. Extensive experiments demonstrate that our method significantly outperforms prior works, achieving >10 …
Poster
Yun Zhu · Le Hui · Hang Yang · Jianjun Qian · Jin Xie · Jian Yang

[ ExHall D ]

Abstract
Both indoor and outdoor scene perceptions are essential for embodied intelligence. However, current sparse supervised 3D object detection methods focus solely on outdoor scenes without considering indoor settings. To this end, we propose a unified sparse supervised 3D object detection method for both indoor and outdoor scenes through learning class prototypes to effectively utilize unlabeled objects. Specifically, we first propose a prototype-based object mining module that converts the unlabeled object mining into a matching problem between class prototypes and unlabeled features. By using optimal transport matching results, we assign prototype labels to high-confidence features, thereby achieving the mining of unlabeled objects. We then present a multi-label cooperative refinement module to effectively recover missed detections in sparse supervised 3D object detection through pseudo label quality control and prototype label cooperation. Experiments show that our method achieves state-of-the-art performance under the one object per scene sparse supervised setting across indoor and outdoor datasets. With only one labeled object per scene, our method achieves about 78%, 90%, and 96% performance compared to the fully supervised detector on ScanNet V2, SUN RGB-D, and KITTI, respectively, highlighting the scalability of our method.
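For intuition about the optimal-transport matching step, the sketch below pairs class prototypes with unlabeled features via entropy-regularized Sinkhorn transport and keeps only high-confidence assignments as pseudo prototype labels. The Sinkhorn solver, uniform marginals, and confidence threshold are standard, illustrative choices rather than the paper's exact module.

```python
# Sketch of optimal-transport matching between class prototypes and unlabeled
# features for mining pseudo labels. Solver and threshold are illustrative.
import numpy as np

def sinkhorn(cost, eps=0.05, n_iters=100):
    """Entropy-regularized optimal transport with uniform marginals; cost: (N, K)."""
    K = np.exp(-cost / eps)
    a = np.full(cost.shape[0], 1.0 / cost.shape[0])   # uniform row marginal
    b = np.full(cost.shape[1], 1.0 / cost.shape[1])   # uniform column marginal
    v = np.ones(cost.shape[1])
    for _ in range(n_iters):
        u = a / (K @ v + 1e-9)
        v = b / (K.T @ u + 1e-9)
    return u[:, None] * K * v[None, :]                # transport plan (N, K)

def mine_pseudo_labels(features, prototypes, conf=0.8):
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    plan = sinkhorn(1.0 - f @ p.T)                    # cost = 1 - cosine similarity
    probs = plan / plan.sum(axis=1, keepdims=True)
    labels = probs.argmax(axis=1)
    keep = probs.max(axis=1) >= conf                  # only high-confidence matches
    return labels, keep

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels, keep = mine_pseudo_labels(rng.normal(size=(32, 16)), rng.normal(size=(3, 16)))
    print(labels.shape, int(keep.sum()))
```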
Poster
Boyong He · Yuxiang Ji · Qianwen Ye · Zhuoyue Tan · Liaoni Wu

[ ExHall D ]

Abstract
Domain generalization (DG) for object detection aims to enhance detectors' performance in unseen scenarios. This task remains challenging due to complex variations in real-world applications. Recently, diffusion models have demonstrated remarkable capabilities in diverse scene generation, which inspires us to explore their potential for improving DG tasks. Instead of generating images, our method extracts multi-step intermediate features during the diffusion process to obtain domain-invariant features for generalized detection. Furthermore, we propose an efficient knowledge transfer framework that enables detectors to inherit the generalization capabilities of diffusion models through feature and object-level alignment, without increasing inference time. We conduct extensive experiments on six challenging DG benchmarks. The results demonstrate that our method achieves substantial improvements of 14.0% mAP over existing DG approaches across different domains and corruption types. Notably, our method even outperforms most domain adaptation methods without accessing any target domain data. Moreover, the diffusion-guided detectors show consistent improvements of 15.9% mAP on average compared to the baseline. Our work aims to present an effective approach for domain-generalized detection and provide potential insights for robust visual recognition in real-world scenarios.
Poster
Chang-Bin Zhang · Yujie Zhong · Kai Han

[ ExHall D ]

Abstract
Existing methods enhance the training of detection transformers by incorporating an auxiliary one-to-many assignment. In this work, we treat the model as a multi-task framework, simultaneously performing one-to-one and one-to-many predictions. We investigate the roles of each component in the transformer decoder across these two training targets, including self-attention, cross-attention, and feed-forward network. Our empirical results demonstrate that any independent component in the decoder can effectively learn both targets simultaneously, even when other components are shared. This finding leads us to propose a multi-route training mechanism, featuring a primary route for one-to-one prediction and two auxiliary training routes for one-to-many prediction. We enhance the training mechanism with a novel instructive self-attention that dynamically and flexibly guides object queries for one-to-many prediction. The auxiliary routes are removed during inference, ensuring no impact on model architecture or inference cost. We conduct extensive experiments on various baselines, achieving consistent improvements.
Poster
Dennis Jacob · Chong Xiang · Prateek Mittal

[ ExHall D ]

Abstract
Deep learning techniques have enabled vast improvements in computer vision technologies. Nevertheless, these models are vulnerable to adversarial patch attacks which catastrophically impair performance. The physically realizable nature of these attacks calls for certifiable defenses, which feature provable guarantees on robustness. While certifiable defenses have been successfully applied to single-label classification, limited work has been done for multi-label classification. In this work, we present PatchDEMUX, a certifiably robust framework for multi-label classifiers against adversarial patches. Our approach is a generalizable method which can provably extend any existing certifiable defense for single-label classification; this is done by considering the multi-label classification task as a series of isolated binary classification problems. In addition, because one patch can only be placed at one location, we further develop a novel certification procedure that provides a tighter robustness certification bound. Using the current state-of-the-art (SOTA) single-label certifiable defense PatchCleanser as a backbone, we find that PatchDEMUX can achieve non-trivial robustness on the MSCOCO 2014 validation dataset while maintaining high clean performance.
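The decomposition idea can be illustrated very simply: treat each label as its own binary problem and certify it independently with any single-label certifiable defense. The `certify_binary` callable below is a hypothetical stand-in for such a defense (e.g., a PatchCleanser-style procedure); the paper's tighter joint certification is not reproduced here.

```python
# Sketch of treating multi-label patch-robustness certification as a series of
# independent binary problems. `certify_binary` is a hypothetical stand-in for
# any single-label certifiable defense.
from typing import Callable, List

def certify_multilabel(image, classes: List[str],
                       certify_binary: Callable[[object, str], bool]) -> dict:
    """Return, per class, whether the presence/absence decision is certified
    robust against a single adversarial patch."""
    return {c: certify_binary(image, c) for c in classes}

if __name__ == "__main__":
    # Toy stand-in: certify a class iff its "score" clears a margin.
    scores = {"person": 0.9, "dog": 0.55, "car": 0.1}
    demo = lambda img, c: abs(scores[c] - 0.5) > 0.3
    print(certify_multilabel(None, list(scores), demo))
    # {'person': True, 'dog': False, 'car': True}
```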
Poster
Ramchandran Muthukumar · Ambar Pal · Jeremias Sulam · Rene Vidal

[ ExHall D ]

Abstract
State-of-the-art machine learning systems are vulnerable to small perturbations to their input, where _small_ is defined according to a threat model that assigns a positive threat to each perturbation. Most prior works define a task-agnostic, isotropic, and global threat, like the ℓp norm, where the magnitude of the perturbation fully determines the degree of the threat and neither the direction of the attack nor its position in space matter. However, common corruptions in computer vision, such as blur, compression, or occlusions, are not well captured by such threat models. This paper proposes a novel threat model called Projected Displacement (PD) to study robustness beyond existing isotropic and global threat models. The proposed threat model measures the threat of a perturbation via its alignment with _unsafe directions_, defined as directions in the input space along which a perturbation of sufficient magnitude changes the ground truth class label. Unsafe directions are identified locally for each input based on observed training data. In this way, the PD-threat model exhibits anisotropy and locality. The PD-threat model is computationally efficient and can be easily integrated into existing robustness pipelines. Experiments on Imagenet-1k data indicate that, for any input, the set of perturbations with small PD …
Poster
Kai Mao · Ping Wei · Yiyang Lian · Yangyang Wang · Nanning Zheng

[ ExHall D ]

Abstract
Zero-shot and few-shot anomaly detection have recently garnered widespread attention due to their potential applications. Most current methods require an auxiliary dataset in the same modality as the test images for training, making them ineffective in cross-modal situations. However, cross-modal anomaly detection is an essential task with both application and research value, and existing methods usually underperform on it. Instead, we propose a framework that can be trained using data from a variety of pre-existing modalities and generalizes well to unseen modalities. The model consists of (1) the Transferable Visual Prototype, which directly learns normal/abnormal semantics in the visual space; (2) a Prototype Harmonization strategy that adaptively utilizes the Transferable Visual Prototypes from various modalities for inference on the unknown modality; and (3) a Visual Discrepancy Inference for inference under the few-shot setting, further enhancing performance. In the zero-shot setting, the proposed method achieves AUROC improvements of 4.1%, 6.1%, 7.6%, and 6.8% over the best competing methods in the RGB, 3D, MRI/CT, and Thermal modalities, respectively. In the few-shot setting, our model also achieves the highest AUROC/AP on ten datasets in four modalities, substantially outperforming existing methods.
Poster
Wei Luo · Yunkang Cao · Haiming Yao · Xiaotian Zhang · Jianan Lou · Yuqi Cheng · Weiming Shen · Wenyong Yu

[ ExHall D ]

Abstract
Anomaly detection (AD) is essential for industrial inspection, yet existing methods typically rely on "comparing" test images to normal references from a training set. However, variations in appearance and positioning often complicate the alignment of these references with the test image, limiting detection accuracy. We observe that most anomalies manifest as local variations, meaning that even within anomalous images, valuable normal information remains. We argue that this information is useful and may be more aligned with the anomalies since both the anomalies and the normal information originate from the same image. Therefore, rather than relying on external normality from the training set, we propose INP-Former, a novel method that extracts Intrinsic Normal Prototypes (INPs) directly from the test image. Specifically, we introduce the INP Extractor, which linearly combines normal tokens to represent INPs. We further propose an INP Coherence Loss to ensure INPs can faithfully represent normality for the test image. These INPs then guide the INP-Guided Decoder to reconstruct only normal tokens, with reconstruction errors serving as anomaly scores. Additionally, we propose a Soft Mining Loss to prioritize hard-to-optimize samples during training. INP-Former achieves state-of-the-art performance in single-class, multi-class, and few-shot AD tasks across MVTec-AD, VisA, and Real-IAD, positioning …
Poster
wenqiao Li · BoZhong Zheng · Xiaohao Xu · Jinye Gan · Fading Lu · Xiang Li · Na Ni · Zheng Tian · Xiaonan Huang · Shenghua Gao · Yingna Wu

[ ExHall D ]

Abstract
Object anomaly detection is essential for industrial quality inspection, yet traditional single-sensor methods face critical limitations. They fail to capture the wide range of anomaly types, as single sensors are often constrained to either external appearance, geometric structure, or internal properties. To overcome these challenges, we introduce MulSen-AD, the first high-resolution, multi-sensor anomaly detection dataset tailored for industrial applications. MulSen-AD unifies data from RGB cameras, laser scanners, and lock-in infrared thermography, effectively capturing external appearance, geometric deformations, and internal defects. The dataset spans 15 industrial products with diverse, real-world anomalies. We also present MulSen-AD Bench, a benchmark designed to evaluate multi-sensor methods, and propose MulSen-TripleAD, a decision-level fusion algorithm that integrates these three modalities for robust, unsupervised object anomaly detection. Our experiments demonstrate that multi-sensor fusion substantially outperforms single-sensor approaches, achieving 96.1% AUROC in object-level detection accuracy. These results highlight the importance of integrating multi-sensor data for comprehensive industrial anomaly detection. The dataset and code are publicly available to support further research.
Poster
Shun Wei · Jielin Jiang · Xiaolong Xu

[ ExHall D ]

Abstract
Anomaly detection (AD) is a crucial visual task aimed at recognizing abnormal patterns within samples. However, most existing AD methods suffer from limited generalizability, as they are primarily designed for domain-specific applications, such as industrial scenarios, and often perform poorly when applied to other domains. This challenge largely stems from the inherent discrepancies in features across domains. To bridge this domain gap, we introduce UniNet, a generic unified framework that incorporates effective feature selection and contrastive learning-guided anomaly discrimination. UniNet comprises student-teacher models and a bottleneck, featuring several vital innovations: First, we propose domain-related feature selection, where the student is guided to select and focus on representative features from the teacher with domain-relevant priors, while restoring them effectively. Second, a similarity contrastive loss function is developed to strengthen the correlations among homogeneous features. Meanwhile, a margin loss function is proposed to enforce the separation between the similarities of abnormality and normality, effectively improving the model's ability to discriminate anomalies. Third, we propose a weighted decision mechanism for dynamically evaluating the anomaly score to achieve robust AD. Large-scale experiments on 11 datasets from various domains show that UniNet surpasses state-of-the-art methods.
Poster
Ruihan Xu · Haokui Zhang · Yaowei Wang · Wei Zeng · Shiliang Zhang

[ ExHall D ]

Abstract
The growing use of deep learning necessitates efficient network design and deployment, making neural predictors vital for estimating attributes such as accuracy and latency. Recently, Graph Neural Networks (GNNs) and transformers have shown promising performance in representing neural architectures. However, each method has its disadvantages. GNNs lack the capability to represent complicated features, while transformers generalize poorly as the architecture depth grows. To mitigate the above problems, we rethink neural architecture topology and show that sibling nodes are pivotal yet overlooked in previous research. Thus, we propose a novel predictor leveraging the strengths of GNNs and transformers to learn the enhanced topology. We introduce a novel token mixer that considers siblings, and a new channel mixer named bidirectional graph isomorphism feed-forward network. Our approach consistently achieves promising performance in both accuracy and latency prediction, providing valuable insights for learning Directed Acyclic Graph (DAG) topology. The code will be released.
Poster
Minh-Tuan Tran · Trung Le · Xuan-May Le · Thanh-Toan Do · Dinh Phung

[ ExHall D ]

Abstract
Dataset distillation has gained popularity as a technique for compressing large datasets into smaller, more efficient representations while retaining essential information for model training. Data features can be broadly divided into two types: instance-specific features, which capture unique, fine-grained details of individual examples, and class-general features, which represent shared, broad patterns across a class. However, previous approaches often struggle to balance these; some focus solely on class-general features, missing finer instance details, while others concentrate on instance-specific features, overlooking the shared characteristics essential for class-level understanding. In this paper, we propose the Non-Critical Region Refinement Dataset Distillation (NRR-DD) method, which preserves the instance-specific and fine-grained regions in synthetic data while enriching non-critical regions with more class-general information. This approach enables our models to leverage all pixel information to capture both types of features, thereby improving overall performance. Furthermore, we introduce Distance-Based Representative (DBR) knowledge transfer, which eliminates the need for soft labels in training by relying solely on the distance between synthetic data predictions and one-hot encoded labels. Experimental results demonstrate that our NRR-DD achieves state-of-the-art performance on both small-scale and large-scale datasets. Additionally, by storing only two distances per instance, our method achieves comparable results across various settings. Code …
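To illustrate the distance-based knowledge transfer idea, the sketch below records the distance between a prediction and the one-hot label for each synthetic instance, then trains a student to reproduce that distance instead of matching stored soft labels. The abstract mentions storing two distances per instance; a single L2 distance and an MSE penalty are shown here purely as illustrative assumptions.

```python
# Illustrative sketch of distance-based representative (DBR) knowledge transfer:
# store a prediction-to-one-hot distance per instance rather than a soft label,
# and train the student to match it. Distance and loss form are assumptions.
import torch
import torch.nn.functional as F

def record_distance(teacher_logits, labels, num_classes):
    one_hot = F.one_hot(labels, num_classes).float()
    probs = teacher_logits.softmax(dim=-1)
    return (probs - one_hot).norm(dim=-1)            # one scalar per instance

def dbr_loss(student_logits, labels, stored_dist, num_classes):
    one_hot = F.one_hot(labels, num_classes).float()
    probs = student_logits.softmax(dim=-1)
    dist = (probs - one_hot).norm(dim=-1)
    return F.cross_entropy(student_logits, labels) + F.mse_loss(dist, stored_dist)

if __name__ == "__main__":
    torch.manual_seed(0)
    labels = torch.randint(0, 10, (8,))
    stored = record_distance(torch.randn(8, 10), labels, 10)
    print(dbr_loss(torch.randn(8, 10, requires_grad=True), labels, stored, 10))
```

Storing a couple of scalars per instance is what keeps the memory footprint far below that of full soft-label distillation.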
Poster
Shu Yang · Chengting Yu · Lei Liu · Hanzhi Ma · Aili Wang · Erping Li

[ ExHall D ]

Abstract
Spiking Neural Networks (SNNs) have garnered considerable attention as a potential alternative to Artificial Neural Networks (ANNs). Recent studies have highlighted SNNs' potential on large-scale datasets. For SNN training, two main approaches exist: direct training and ANN-to-SNN (ANN2SNN) conversion. To fully leverage existing ANN models in guiding SNN learning, either direct ANN-to-SNN conversion or ANN-SNN distillation training can be employed. In this paper, we propose an ANN-SNN distillation framework from the ANN-to-SNN perspective, designed with a block-wise replacement strategy for ANN-guided learning. By generating intermediate hybrid models that progressively align SNN feature spaces to those of ANN through rate-based features, our framework naturally incorporates rate-based backpropagation as a training method. Our approach achieves results comparable to or better than state-of-the-art SNN distillation methods, showing both training and learning efficiency.
Poster
Zesen Cheng · Hang Zhang · Kehan Li · Sicong Leng · Zhiqiang Hu · Fei Wu · Deli Zhao · Xin Li · Lidong Bing

[ ExHall D ]

Abstract
Contrastive loss is a powerful approach for representation learning, where larger batch sizes enhance performance by providing more negative samples to better distinguish between similar and dissimilar data. However, the full instantiation of the similarity matrix demands substantial GPU memory, making large batch training highly resource-intensive. To address this, we propose a tile-based computation strategy that partitions the contrastive loss calculation into small blocks, avoiding full materialization of the similarity matrix. Additionally, we introduce a multi-level tiling implementation to leverage the hierarchical structure of distributed systems, using ring-based communication at the GPU level to optimize synchronization and fused kernels at the CUDA core level to reduce I/O overhead. Experimental results show that the proposed method significantly reduces GPU memory usage in contrastive loss. For instance, it enables contrastive training of a CLIP-ViT-L/14 model with a batch size of 4M using only 8 A800 80GB GPUs, without sacrificing accuracy. Compared to state-of-the-art memory-efficient solutions, it achieves a two-order-of-magnitude reduction in memory while maintaining comparable speed. The code will be made publicly available.
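The core tiling idea can be shown in a few lines: the B x B similarity matrix is consumed block by block with a streaming log-sum-exp, so the full matrix is never materialized. This is a single-GPU sketch of that idea only; the paper's multi-level tiling, ring communication, and fused CUDA kernels are not reproduced, and all names are illustrative.

```python
# Sketch of tile-based contrastive (InfoNCE) loss: the similarity matrix is
# processed in tiles with a streaming log-sum-exp instead of being materialized.
import torch

def tiled_infonce(img_feats, txt_feats, tau=0.07, tile=256):
    """img_feats, txt_feats: (B, D), assumed L2-normalized. Image-to-text loss."""
    b = img_feats.shape[0]
    pos = (img_feats * txt_feats).sum(dim=-1) / tau               # positive logits, (B,)
    lse = torch.full((b,), float("-inf"), device=img_feats.device)
    for start in range(0, b, tile):
        block = img_feats @ txt_feats[start:start + tile].T / tau  # (B, tile) only
        lse = torch.logaddexp(lse, torch.logsumexp(block, dim=-1)) # streaming LSE
    return (lse - pos).mean()                                      # -log softmax of positives

if __name__ == "__main__":
    torch.manual_seed(0)
    f1 = torch.nn.functional.normalize(torch.randn(1024, 64), dim=-1)
    f2 = torch.nn.functional.normalize(torch.randn(1024, 64), dim=-1)
    full = torch.nn.functional.cross_entropy(f1 @ f2.T / 0.07, torch.arange(1024))
    print(torch.allclose(tiled_infonce(f1, f2), full, atol=1e-5))  # True
```

Peak activation memory scales with B x tile rather than B x B, which is what makes very large contrastive batches feasible.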
Poster
Christoforos N. Spartalis · Theodoros Semertzidis · Efstratios Gavves · Petros Daras

[ ExHall D ]

Abstract
This paper presents LoTUS, a novel Machine Unlearning (MU) method that eliminates the influence of training samples from pre-trained models. LoTUS smooths the prediction probabilities of the model, mitigating its overconfidence that stems from data memorization, up to an information-theoretic bound. We evaluate LoTUS on the Transformer and ResNet18 models, against seven baseline methods, on four public datasets. Beyond established MU benchmarks, we evaluate unlearning on a large-scale dataset (ImageNet1k) which deters retraining, simulating real-world conditions. Moreover, we introduce the novel Retrain-Free Jensen-Shannon Divergence (RF-JSD) metric to enable evaluation under real-world conditions. Experimental results show that LoTUS outperforms state-of-the-art methods in terms of both efficiency and effectiveness. We will share code.
Poster
Yongqi Huang · Peng Ye · Chenyu Huang · Jianjian Cao · Lin Zhang · Baopu Li · Gang Yu · Tao Chen

[ ExHall D ]

Abstract
Upcycled Mixture-of-Experts (MoE) models have shown great potential in various tasks by converting the original Feed-Forward Network (FFN) layers in pre-trained dense models into MoE layers. However, these models still suffer from significant parameter inefficiency due to the introduction of multiple experts. In this work, we propose a novel DeRS (**De**compose, **R**eplace, and **S**ynthesis) paradigm to overcome this shortcoming, which is motivated by our observations about the unique redundancy mechanisms of upcycled MoE experts. Specifically, DeRS decomposes the experts into one expert-shared base weight and multiple expert-specific delta weights, and subsequently represents these delta weights in lightweight forms. Our proposed DeRS paradigm can be applied to enhance parameter efficiency in two different scenarios, including: 1) DeRS Compression for inference stage, using sparsification or quantization to compress vanilla upcycled MoE models; and 2) DeRS Upcycling for training stage, employing lightweight sparse or low-rank matrices to efficiently upcycle dense models into MoE models. Extensive experiments across three different tasks show that the proposed methods can achieve extreme parameter efficiency while maintaining the performance for both training and compression of upcycled MoE models.
Poster
Xiaohan Qin · Xiaoxing Wang · Junchi Yan

[ ExHall D ]

Abstract
Multi-task learning (MTL) has gained widespread application for its ability to transfer knowledge across tasks, improving resource efficiency and generalization. However, gradient conflicts from different tasks remain a major challenge in MTL. Previous gradient-based and loss-based methods primarily focus on gradient optimization in shared parameters, often overlooking the potential of task-specific parameters. This work points out that task-specific parameters not only capture task-specific information but also influence the gradients propagated to shared parameters, which in turn affects gradient conflicts. Motivated by this insight, we propose ConsMTL, which models MTL as a bi-level optimization problem: in the upper-level optimization, we perform gradient aggregation on shared parameters to find a joint update vector that minimizes gradient conflicts; in the lower-level optimization, we introduce an additional loss for task-specific parameters guiding the k gradients of shared parameters to gradually converge towards the joint update vector. Our design enables the optimization of both shared and task-specific parameters to consistently alleviate gradient conflicts. Extensive experiments show that ConsMTL achieves state-of-the-art performance across various benchmarks with task numbers ranging from 2 to 40, demonstrating its superior performance.
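As a rough picture of the upper-level step, the sketch below aggregates per-task gradients on the shared parameters into a single joint update vector with reduced conflict. The abstract does not specify the aggregation rule, so a PCGrad-style pairwise projection is used here purely as an illustrative conflict-reducing choice; the lower-level loss on task-specific parameters is not reproduced.

```python
# Sketch of aggregating per-task shared-parameter gradients into a joint update
# vector with reduced conflict (PCGrad-style projection as an example rule).
import numpy as np

def joint_update_vector(task_grads, rng=None):
    """task_grads: (K, P) flattened gradients of K task losses w.r.t. shared params."""
    rng = rng or np.random.default_rng(0)
    projected = task_grads.astype(float).copy()
    k = len(task_grads)
    for i in range(k):
        for j in rng.permutation(k):
            if i == j:
                continue
            dot = projected[i] @ task_grads[j]
            if dot < 0:        # conflict: remove the component opposing task j
                projected[i] -= dot / (task_grads[j] @ task_grads[j] + 1e-12) * task_grads[j]
    return projected.mean(axis=0)      # joint update vector for shared parameters

if __name__ == "__main__":
    g = np.array([[1.0, 0.0], [-0.5, 1.0]])   # two conflicting task gradients
    print(joint_update_vector(g))             # conflict-reduced joint direction
```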
Poster
Hankyul Kang · Gregor Seifer · Donghyun Lee · Jongbin Ryu

[ ExHall D ]

Abstract
According to Ebbinghaus' forgetting curve theory, we can enhance memory retention by learning extensive data and taking adequate rest. This means that in order to effectively retain new knowledge, it is essential to learn it thoroughly and ensure sufficient rest so that our brains can memorize without forgetting. The main takeaway from this theory is that learning extensive data at once necessitates sufficient rest before learning the same data again. This aspect of human long-term memory retention can be effectively utilized to address the continual learning of neural networks. Retaining new knowledge for a long period of time without catastrophic forgetting is the critical problem of continual learning. Therefore, based on Ebbinghaus' theory, we introduce the Respacing method that adjusts the learning schedules to optimize the recall interval between retraining the same samples. The proposed Respacing method allows the network to get enough rest to learn extensive knowledge from the same samples with a recall interval of sufficient length. To this end, we specifically present two approaches: 1) a view-batch replay method that guarantees the optimal recall interval, and 2) a view-batch self-supervised learning that acquires extensive knowledge from a single training sample at a time. We empirically show that …
Poster
Huiyi Wang · Haodong Lu · Lina Yao · Dong Gong

[ ExHall D ]

Abstract
Continual learning (CL) aims to continually accumulate knowledge from a non-stationary data stream without catastrophic forgetting of learned knowledge, requiring a balance between stability and adaptability. Relying on the generalizable representation in pre-trained models (PTMs), PTM-based CL methods perform effective continual adaptation on downstream tasks by adding learnable adapters or prompts upon the frozen PTMs. However, many existing PTM-based CL methods use restricted adaptation on a fixed set of these modules to avoid forgetting, suffering from limited CL ability. Periodically adding task-specific modules results in linear model growth rate and impaired knowledge reuse. We propose **S**elf-**E**xpansion of pre-trained models with **M**odularized **A**daptation (SEMA), a novel approach to enhance the control of stability-plasticity balance in PTM-based CL. SEMA automatically decides to reuse or add adapter modules on demand in CL, depending on whether significant distribution shift that cannot be handled is detected at different representation levels. We design modular adapter consisting of a functional adapter and a representation descriptor. The representation descriptors are trained as a distribution shift indicator and used to trigger self-expansion signals. For better composing the adapters, an expandable weighting router is learned jointly for mixture of adapter outputs. SEMA enables better knowledge reuse and sub-linear expansion rate. …
Poster
Bowen Zheng · Da-Wei Zhou · Han-Jia Ye · De-Chuan Zhan

[ ExHall D ]

Abstract
The ability to learn new concepts while preserving the learned knowledge is desirable for learning systems in Class-Incremental Learning (CIL). Recently, feature expansion of the model has become a prevalent solution for CIL, where the old features are fixed during the training of the new task while new features are expanded for the new tasks. However, such task-specific features learned from the new task may collide with the old features, leading to misclassification between tasks. Therefore, the expanded model is often encouraged to capture diverse features from the new task, aiming to avoid such collision. However, the existing solution is largely restricted to the samples from the current task, because of poor access to previous samples. To promote the learning and transferring of diverse features across tasks, we propose a framework called Task-Agnostic Guided Feature Expansion (TagFex). Firstly, it captures task-agnostic features continually with a separate model, providing extra task-agnostic features for subsequent tasks. Secondly, to obtain useful features from the task-agnostic model for the current task, it aggregates the task-agnostic features with the task-specific feature using a merge attention. Then the aggregated feature is transferred back into the task-specific feature for inference, helping the task-specific model capture diverse features. Extensive experiments show the effectiveness and superiority of TagFex …
Poster
Mohamad Hassan N C · Divyam Gupta · Mainak Singha · SAI BHARGAV RONGALI · Ankit Jha · Muhammad Haris Khan · Biplab Banerjee

[ ExHall D ]

Abstract
We introduce Low-Shot Open-Set Domain Generalization (LSOSDG), a novel paradigm unifying low-shot learning with open-set domain generalization (ODG). While prompt-based methods using models like CLIP have advanced DG, they falter in low-data regimes (e.g., 1-shot) and lack precision in detecting open-set samples with fine-grained semantics related to training classes. To address these challenges, we propose OSLoPrompt, an advanced prompt-learning framework for CLIP with two core innovations. First, to manage limited supervision across source domains and improve DG, we introduce a domain-agnostic prompt-learning mechanism that integrates adaptable domain-specific cues and visually guided semantic attributes through a novel cross-attention module, besides being supported by learnable domain- and class-generic visual prompts to enhance cross-modal adaptability. Second, to improve outlier rejection during inference, we classify unfamiliar samples as "unknown" and train specialized prompts with systematically synthesized pseudo-open samples that maintain fine-grained relationships to known classes, generated through a targeted query strategy with off-the-shelf foundation models. This strategy enhances feature learning, enabling our model to detect open samples with varied granularity more effectively. Extensive evaluations across five benchmarks demonstrate that OSLoPrompt establishes a new state-of-the-art in LSOSDG, significantly outperforming existing methods.
Poster
Huitong Chen · Yu Wang · Yan Fan · Guosong Jiang · Qinghua Hu

[ ExHall D ]

Abstract
Class incremental learning (CIL) aims to enable models to continuously learn new classes without catastrophically forgetting old ones. A promising direction is to learn and use prototypes of classes during incremental updates. Despite their simplicity and intuition, we find that such methods suffer from inadequate representation capability and unsatisfactory feature overlap. These two factors cause class-wise confusion and limited performance. In this paper, we develop a Confusion-REduced AuTo-Encoder classifier (CREATE) for CIL. Specifically, our method employs a lightweight auto-encoder module to learn a compact manifold for each class in the latent subspace, constraining samples to be well reconstructed only on the semantically correct auto-encoder. Thus, the representation stability and capability of class distributions are enhanced, alleviating the potential class-wise confusion problem. To further distinguish the overlapped features, we propose a confusion-aware latent space separation loss that ensures samples are closely distributed in their corresponding low-dimensional manifold while keeping away from the distributions of drifted features from other classes. Our method demonstrates stronger representational capacity and discrimination ability by learning disentangled manifolds and reduces class confusion. Extensive experiments on multiple datasets and settings show that CREATE outperforms other state-of-the-art methods by up to 5.41%.
Poster
Haopeng Sun · Yingwei Zhang · Lumin Xu · Sheng Jin · Ping Luo · Chen Qian · Wentao Liu · Yiqiang Chen

[ ExHall D ]

Abstract
In real-world applications, deep neural networks may encounter constantly changing environments, where the test data originates from continually shifting unlabeled target domains. This problem, known as Unsupervised Continual Domain Shift Learning (UCDSL), poses practical difficulties. Existing methods for UCDSL aim to learn domain-invariant representations for all target domains. However, due to the existence of an adaptivity gap, the invariant representation may theoretically lead to large joint errors. To overcome the limitation, we propose a novel UCDSL method, called Multi-Prototype Modeling (MPM). Our model comprises two key components: (1) Multi-Prototype Learning (MPL) for acquiring domain-specific representations using multiple domain-specific prototypes. MPL achieves domain-specific error minimization instead of enforcing feature alignment across different domains. (2) Bi-Level Graph Enhancer (BiGE) for enhancing domain-level and category-level representations, resulting in more accurate predictions. We provide theoretical and empirical analysis to demonstrate the effectiveness of our proposed method. We evaluate our approach on multiple benchmark datasets and show that our model surpasses state-of-the-art methods across all datasets, highlighting its effectiveness and robustness in handling unsupervised continual domain shift learning. Codes will be publicly accessible.
Poster
Dexuan Zhang · Thomas Westfechtel · Tatsuya Harada

[ ExHall D ]

Abstract
Existing domain adaptation systems can hardly be applied to real-world problems with new classes appearing at deployment time, especially in source-free scenarios where multiple source domains do not share the label space despite being given a few labeled target data. To address this, we define a novel problem setting: multi-source semi-supervised open-set domain adaptation, and propose a learning theory via joint error, effectively tackling strong domain shift. To generalize the algorithm to source-free cases, we introduce a computationally efficient and architecture-flexible attention-based feature generation module. Extensive experiments on various data sets demonstrate the significant improvement of our proposed algorithm over baselines.
Poster
Chen-Chen Zong · Sheng-Jun Huang

[ ExHall D ]

Abstract
Active learning (AL), which iteratively queries the most informative examples from a large pool of unlabeled candidates for model training, faces significant challenges in the presence of open-set classes. Existing methods either prioritize query examples likely to belong to known classes, indicating low epistemic uncertainty (EU), or focus on querying those with highly uncertain predictions, reflecting high aleatoric uncertainty (AU). However, they both yield suboptimal performance, as low EU corresponds to limited useful information, and closed-set AU metrics for unknown class examples are less meaningful. In this paper, we propose an Energy-based Active Open-set Annotation (EAOA) framework, which effectively integrates EU and AU to achieve superior performance. EAOA features a (C+1)-class detector and a target classifier, incorporating an energy-based EU measure and a margin-based energy loss designed for the detector, alongside an energy-based AU measure for the target classifier. Another crucial component is the target-driven adaptive sampling strategy. It first forms a smaller candidate set with low EU scores to ensure closed-set properties, making AU metrics meaningful. Subsequently, examples with high AU scores are queried to form the final query set, with the candidate set size adjusted adaptively. Extensive experiments show that EAOA achieves state-of-the-art performance while maintaining high query …
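To make the two-stage sampling concrete, the sketch below filters unlabeled examples by a low energy-based epistemic-uncertainty (EU) score from the (C+1)-class detector and then queries the ones with the highest aleatoric-uncertainty (AU) score from the target classifier. The energy definition is the standard negative log-sum-exp of logits; the adaptive candidate sizing and the margin-based energy loss are not reproduced, and the exact AU measure is an assumption.

```python
# Sketch of energy-based two-stage query selection: low-EU filtering with the
# (C+1)-class detector, then high-AU querying with the target classifier.
import numpy as np

def energy(logits, T=1.0):
    z = logits / T
    m = z.max(axis=1)
    return -T * (m + np.log(np.exp(z - m[:, None]).sum(axis=1)))   # stable -logsumexp

def select_queries(detector_logits, classifier_logits, candidate_size, budget):
    eu = energy(detector_logits)                      # low energy ~ likely a known class
    candidates = np.argsort(eu)[:candidate_size]      # stage 1: closed-set-like candidates
    au = energy(classifier_logits[candidates])        # stage 2: higher energy ~ more uncertain
    return candidates[np.argsort(au)[-budget:]]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    det = rng.normal(size=(1000, 11))                 # (C+1)-class detector logits
    clf = rng.normal(size=(1000, 10))                 # target classifier logits
    print(select_queries(det, clf, candidate_size=200, budget=20))
```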
Poster
Tianxiang Yin · Ningzhong Liu · Han Sun

[ ExHall D ]

Abstract
Active learning (AL) and semi-supervised learning (SSL) both aim to reduce annotation costs: AL selectively annotates high-value samples from the unlabeled data, while SSL leverages abundant unlabeled data to improve model performance. Although these two appear intuitively compatible, directly combining them remains challenging due to fundamental differences in their frameworks. Current semi-supervised active learning (SSAL) methods often lack theoretical foundations and design AL strategies tailored to a specific SSL algorithm rather than genuinely integrating the two fields. In this paper, we incorporate AL objectives into the overall risk formulation within the mainstream pseudo-label-based SSL framework, clarifying key differences between SSAL and traditional AL scenarios. To bridge these gaps, we propose a feature re-alignment module that aligns the features of unlabeled data under different augmentations by leveraging clustering and consistency constraints. Experimental results demonstrate that our module enables flexible combinations of SOTA methods from both AL and SSL, yielding more efficient algorithm performance.
Poster
Yijie Liu · Xinyi Shang · Yiqun Zhang · Yang Lu · Chen Gong · Jing-Hao Xue · Hanzi Wang

[ ExHall D ]

Abstract
Federated Semi-Supervised Learning (FSSL) aims to leverage unlabeled data across clients with limited labeled data to train a global model with strong generalization ability. Most FSSL methods rely on consistency regularization with pseudo-labels, converting predictions from local or global models into hard pseudo-labels as supervisory signals. However, we discover that the quality of pseudo-labels is largely degraded by data heterogeneity, an intrinsic facet of federated learning. In this paper, we study the problem of FSSL in depth and show that (1) heterogeneity exacerbates pseudo-label mismatches, further degrading model performance and convergence, and (2) local and global models' predictive tendencies diverge as heterogeneity increases. Motivated by these findings, we propose a simple and effective method called **S**emi-supervised **A**ggregation for **G**lobally-Enhanced **E**nsemble (SAGE), which can flexibly correct pseudo-labels based on confidence discrepancies. This strategy effectively mitigates performance degradation caused by incorrect pseudo-labels and enhances consensus between local and global models. Experimental results demonstrate that SAGE outperforms existing FSSL methods in both performance and convergence. Our code is available at https://anonymous.4open.science/r/CVPR2025-1926-code.
Poster
Yuchuan Li · Jae-Mo Kang · Il-Min Kim

[ ExHall D ]

Abstract
In real-world AI applications, training datasets are often contaminated, containing a mix of in-distribution (ID) and out-of-distribution (OOD) samples without labels. This contamination poses a significant challenge for developing and training OOD detection models, as nearly all existing methods assume access to a clean training dataset of only ID samples—a condition rarely met in real-world scenarios. Customizing each existing OOD detection method to handle such contamination is impractical, given the vast number of diverse methods designed for clean data. To address this issue, we propose a universal, model-agnostic framework that integrates with nearly all existing OOD detection methods, enabling training on contaminated datasets while achieving high OOD detection accuracy on test datasets. Additionally, our framework provides an accurate estimation of the unknown proportion of OOD samples within the training dataset—an important and distinct challenge in its own right. Our approach introduces a novel dynamic weighting function and transition mechanism within an iterative training structure, enabling both reliable estimation of the OOD sample proportion of the training data and precise OOD detection on test data. Extensive evaluations across diverse datasets, including ImageNet-1k, demonstrate that our framework accurately estimates OOD sample proportions of training data and substantially enhances OOD detection accuracy on …
Poster
Li Li · Huixian Gong · Hao Dong · Tiankai Yang · Zhengzhong Tu · Yue Zhao

[ ExHall D ]

Abstract
Out-of-distribution (OOD) detection is crucial for ensuring the robustness of machine learning models by identifying samples that deviate from the training distribution. While traditional OOD detection has predominantly focused on single-modality inputs, such as images, recent advancements in multimodal models have shown the potential of utilizing multiple modalities (e.g., video, optical flow, audio) to improve detection performance. However, existing approaches often neglect intra-class variability within in-distribution (ID) data, assuming that samples of the same class are perfectly cohesive and consistent. This assumption can lead to performance degradation, especially when prediction discrepancies are indiscriminately amplified across all samples. To address this issue, we propose Dynamic Prototype Updating (DPU), a novel plug-and-play framework for multimodal OOD detection that accounts for intra-class variations. Our method dynamically updates class center representations for each class by measuring the variance of similar samples within each batch, enabling tailored adjustments. This approach allows us to intensify prediction discrepancies based on the updated class centers, thereby enhancing the model’s robustness and generalization across different modalities. Extensive experiments on two tasks, five datasets, and nine base OOD algorithms demonstrate that DPU significantly improves OOD detection performances, setting a new state-of-the-art in multimodal OOD detection, including improvements up to 80% …
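As a rough illustration of dynamic prototype updating, the sketch below refreshes each class center per batch with an update strength modulated by the intra-class variance observed in that batch, so tightly clustered classes move their prototypes more aggressively. The variance-to-momentum mapping is an illustrative assumption, not the paper's exact rule.

```python
# Sketch of dynamic prototype updating: per-batch class-center updates whose
# momentum depends on intra-class variance. The mapping is an assumption.
import numpy as np

def update_prototypes(prototypes, feats, labels, base_momentum=0.9):
    protos = prototypes.copy()
    for c in np.unique(labels):
        batch_c = feats[labels == c]
        var = batch_c.var(axis=0).mean()                     # intra-class variability
        momentum = base_momentum + (1 - base_momentum) * (var / (1 + var))
        # Higher variance -> momentum closer to 1 -> smaller, more cautious update.
        protos[c] = momentum * protos[c] + (1 - momentum) * batch_c.mean(axis=0)
    return protos

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    protos = rng.normal(size=(5, 32))
    feats, labels = rng.normal(size=(64, 32)), rng.integers(0, 5, 64)
    print(np.abs(update_prototypes(protos, feats, labels) - protos).mean())
```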
Poster
Luyuan Xie · Tianyu Luan · Wenyuan Cai · Guochen Yan · Zhaoyu Chen · Nan Xi · Yuejian Fang · Qingni Shen · Zhonghai Wu · Junsong Yuan

[ ExHall D ]

Abstract
Federated learning has wide applications in the medical field. It enables knowledge sharing among different healthcare institutes while protecting patients' privacy. However, existing federated learning systems are typically centralized, requiring clients to upload client-specific knowledge to a central server for aggregation. This centralized approach integrates the knowledge from each client on a central server, where the knowledge is already undermined during aggregation before it reaches back to each client. Besides, the centralized approach also creates a dependency on the central server, which may affect training stability if the server malfunctions or connections are unstable. To address these issues, we propose a decentralized federated learning framework named dFLMoE. In our framework, clients directly exchange lightweight head models with each other. After exchanging, each client treats both local and received head models as individual experts, and utilizes a client-specific Mixture of Experts (MoE) approach to make collective decisions. This design not only reduces the knowledge damage with client-specific aggregations but also removes the dependency on the central server to enhance the robustness of the framework. We validate our framework on multiple medical tasks, demonstrating that our method evidently outperforms state-of-the-art approaches under both model homogeneity and heterogeneity settings.
Poster
Zhuoyao Wang · Fan Yi · Peizhu Gong · Caitou He · Cheng Jin · Weizhong Zhang

[ ExHall D ]

Abstract
Batch normalization (BN) is a standard technique for training deep neural networks. However, its effectiveness in Federated Learning (FL) applications, which typically involve heterogeneous training data and resource-limited clients, is compromised for two reasons. One is that the population statistics (i.e., mean and variance) of the heterogeneous datasets differ greatly, which leads to inconsistent BN layers among the client models and finally causes these models to drift further away from each other. The other is that estimating statistics from a mini-batch can be inaccurate since the batch size has to be small in resource-limited clients. In this paper, we propose a technique named Population Normalization for FL, in which the statistics are learned as trainable parameters during training instead of calculated from mini-batches directly as BN does. Thus, our normalization layers are homogeneous among the clients, and the impact of small batch size is eliminated as the model can be well trained even when the batch size equals one. To make our method more flexible in real applications, we investigate the role of the stochastic uncertainty existing in the statistic estimation of BN and show that when a large batch size is available we can inject simple artificial noise into our method …
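A minimal sketch of the core idea, assuming a standard BN-like parameterization: the normalization statistics are trainable parameters rather than quantities estimated from each mini-batch, so the layer behaves identically for any batch size, including one. Initialization and parameterization details are assumptions, not the paper's exact design.

```python
# Minimal sketch of a normalization layer with learned population statistics
# instead of per-mini-batch estimates. Details are illustrative assumptions.
import torch
import torch.nn as nn

class PopulationNorm(nn.Module):
    def __init__(self, num_features, eps=1e-5):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(num_features))       # learned mean
        self.log_var = nn.Parameter(torch.zeros(num_features))  # learned log-variance
        self.gamma = nn.Parameter(torch.ones(num_features))     # affine scale
        self.beta = nn.Parameter(torch.zeros(num_features))     # affine shift
        self.eps = eps

    def forward(self, x):                                       # x: (N, C, H, W)
        shape = (1, -1, 1, 1)
        x_hat = (x - self.mu.view(shape)) / torch.sqrt(self.log_var.exp().view(shape) + self.eps)
        return self.gamma.view(shape) * x_hat + self.beta.view(shape)

if __name__ == "__main__":
    layer = PopulationNorm(16)
    out = layer(torch.randn(1, 16, 8, 8))   # works even with batch size 1
    print(out.shape)
```

Since every client learns the same kind of parameters rather than computing local batch statistics, the layers stay homogeneous across clients during federated aggregation.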
Poster
Hasin Us Sami · Swapneel Sen · Amit K. Roy-Chowdhury · Srikanth Krishnamurthy · Basak Guler

[ ExHall D ]

Abstract
Federated learning (FL) allows multiple data-owners to collaboratively train machine learning models by exchanging local gradients, while keeping their private data on-device. To simultaneously enhance privacy and training efficiency, recently parameter-efficient fine-tuning (PEFT) of large-scale pretrained models has gained substantial attention in FL. While keeping a pretrained (backbone) model frozen, each user fine-tunes only a few lightweight modules to be used in conjunction, to fit specific downstream applications. Accordingly, only the gradients with respect to these lightweight modules are shared with the server. In this work, we investigate how the privacy of the fine-tuning data of the users can be compromised via a malicious design of the pretrained model and trainable adapter modules. We demonstrate gradient inversion attacks on a popular PEFT mechanism, the adapter, which allow an attacker to reconstruct local data samples of a target user, using only the accessible adapter gradients. Via extensive experiments, we demonstrate that a large batch of fine-tuning images can be retrieved with high fidelity. Our attack highlights the need for privacy-preserving mechanisms for PEFT, while opening up several future directions. Our code is available at https://github.com/PEFTLeak/Inversion.
Poster
Jeonghwan Park · Niall McLaughlin · Ihsen Alouani

[ ExHall D ]

Abstract
Adversarial attacks remain a significant threat that can jeopardize the integrity of Machine Learning (ML) models. In particular, query-based black-box attacks can generate malicious noise without having access to the victim model's architecture, making them practical in real-world contexts. The community has proposed several defenses against adversarial attacks, only for them to be broken by more advanced and adaptive attack strategies. In this paper, we propose a framework that detects whether an adversarial noise instance is being generated. Unlike existing stateful defenses that detect adversarial noise generation by monitoring the input space, our approach learns adversarial patterns in the input update similarity space. Specifically, we propose to observe a new metric called Delta Similarity (DS), which we show captures adversarial behavior more effectively. We evaluate our approach against 8 state-of-the-art attacks, including adaptive attacks, where the adversary is aware of the defense and tries to evade detection. We find that our approach is significantly more robust than existing defenses in terms of both specificity and sensitivity.
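The abstract does not spell out how DS is computed, so the NumPy fragment below is only a hypothetical illustration of what working in an input-update similarity space (rather than the raw input space) can look like: consecutive queries from one account are turned into updates, the cosine similarity of consecutive updates is tracked, and the change (delta) of that similarity trace is what a detector would inspect.

```python
import numpy as np

def delta_similarity_trace(queries: np.ndarray) -> np.ndarray:
    """Hypothetical illustration (not the paper's exact DS definition):
    map a stream of queries into an input-update similarity space by
    measuring how the cosine similarity between consecutive updates changes."""
    flat = queries.reshape(len(queries), -1)
    updates = np.diff(flat, axis=0)                       # input updates between consecutive queries
    norms = np.linalg.norm(updates, axis=1, keepdims=True) + 1e-12
    unit = updates / norms
    sims = np.sum(unit[1:] * unit[:-1], axis=1)           # cosine similarity of consecutive updates
    return np.diff(sims)                                  # "delta" of the similarity trace

# Query-based attacks tend to produce highly structured update sequences,
# so a detector can be trained on such traces instead of on raw inputs.
```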
Poster
George Kamberov

[ ExHall D ]

Abstract
Many machine learning (ML) classifiers are claimed to outperform humans, but they still make mistakes that humans do not. The most notorious examples of such mistakes are adversarial visual metamers. This paper aims to define and investigate the phenomenon of adversarial Doppelgängers (AD), which includes adversarial visual metamers, and to compare the performance and robustness of ML classifiers to human performance. We find that AD are inputs that are close to each other with respect to a perceptual metric defined in this paper, and show that AD are qualitatively different from the usual adversarial examples. The vast majority of classifiers are vulnerable to AD, and robustness-accuracy trade-offs may not improve them. Some classification problems may not admit any AD-robust classifiers because the underlying classes are ambiguous. We provide criteria that can be used to determine whether a classification problem is well defined or not; describe the structure and attributes of AD-robust classifiers; and introduce and explore the notions of conceptual entropy and regions of conceptual ambiguity for classifiers that are vulnerable to AD attacks, along with methods to bound the AD fooling rate of an attack. We define the notion of classifiers that exhibit hyper-sensitive behavior, that is, classifiers whose …
Poster
Wei Li · Pin-Yu Chen · Sijia Liu · Ren Wang

[ ExHall D ]

Abstract
Deep neural networks are susceptible to backdoor attacks, where adversaries manipulate model predictions by inserting malicious samples into the training data. Identifying suspicious training data in order to unveil potential backdoor samples remains a significant challenge. In this paper, we propose a novel method, Prediction Shift Backdoor Detection (PSBD), leveraging an uncertainty-based approach that requires minimal unlabeled clean validation data. PSBD is motivated by an intriguing Prediction Shift (PS) phenomenon, where poisoned models' predictions on clean data often shift away from the true labels towards certain other labels when dropout is applied during inference, while backdoor samples exhibit less PS. We hypothesize that PS results from a neuron bias effect, which makes neurons favor features of certain classes. PSBD identifies backdoor training samples by computing the Prediction Shift Uncertainty (PSU), the variance in probability values when dropout layers are toggled on and off during model inference. Extensive experiments have been conducted to verify the effectiveness and efficiency of PSBD, which achieves state-of-the-art results among mainstream detection methods.
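A minimal PyTorch sketch of computing PSU, under the assumption that "dropout on" is realized by putting only the dropout modules into training mode and that the variance is taken over several stochastic passes; the number of passes and the per-class averaging are illustrative choices rather than the paper's exact recipe.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def prediction_shift_uncertainty(model: nn.Module, x: torch.Tensor, n_passes: int = 20) -> torch.Tensor:
    """Per-sample variance of predicted probabilities with dropout toggled
    off and on during inference (illustrative sketch of PSU)."""
    model.eval()
    probs = [torch.softmax(model(x), dim=-1)]          # dropout off

    for m in model.modules():                          # turn on only the dropout layers
        if isinstance(m, nn.Dropout):
            m.train()
    probs += [torch.softmax(model(x), dim=-1) for _ in range(n_passes)]
    model.eval()                                       # restore inference mode

    stacked = torch.stack(probs)                       # (n_passes + 1, B, C)
    return stacked.var(dim=0).mean(dim=-1)             # (B,)

# Per the abstract, backdoor samples exhibit less prediction shift, so training
# samples with unusually low PSU would be flagged as suspected backdoor data.
```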
Poster
Keyizhi Xu · Chi Zhang · Zhan Chen · Zhongyuan Wang · Chunxia Xiao · Chao Liang

[ ExHall D ]

Abstract
Multi-exit neural networks represent a promising approach to enhancing model inference efficiency, yet like common neural networks, they suffer from significantly reduced robustness against adversarial attacks. While some defense methods have been proposed to strengthen the adversarial robustness of multi-exit neural networks, we identify a long-neglected flaw in the evaluation of previous studies: *simply using a fixed set of exits for attack may lead to an overestimation of their defense capacity*. Based on this finding, our work explores the following three key aspects of the adversarial robustness of multi-exit neural networks: (1) we discover that a mismatch between the network exits used by the attacker and the defender is responsible for the overestimated robustness of previous defense methods; (2) by finding the best strategy in a two-player zero-sum game, we propose AIMER as an improved evaluation scheme to measure the intrinsic robustness of multi-exit neural networks; (3) going further, we introduce the NEED defense method under the evaluation of AIMER, which optimizes the defender's strategy by finding a Nash equilibrium of the game. Experiments over 3 datasets, 7 architectures, 6 attacks and 4 baselines show that AIMER evaluates the robustness 13.52% lower than previous methods under AutoAttack, while the robust performance …
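Once a payoff matrix has been measured (for example, robust accuracy when the defender deploys exit i and the attacker targets exit j), the equilibrium strategies of such a finite zero-sum game can be found with a standard linear program. The SciPy sketch below shows that generic computation, not the authors' AIMER/NEED implementation; the payoff numbers are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def zero_sum_equilibrium(payoff: np.ndarray):
    """Mixed strategy of the row (maximizing) player and the game value of a
    finite two-player zero-sum game, via linear programming (illustrative sketch)."""
    m, n = payoff.shape
    # Variables: [x_1, ..., x_m, v]; maximize v  <=>  minimize -v.
    c = np.zeros(m + 1)
    c[-1] = -1.0
    # Constraints: payoff.T @ x >= v for every column strategy of the opponent.
    A_ub = np.hstack([-payoff.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # Probabilities sum to one.
    A_eq = np.ones((1, m + 1))
    A_eq[0, -1] = 0.0
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m], res.x[-1]   # mixed strategy over exits, game value

# Rows: defender's exit choice; columns: attacker's exit choice; entries: robust accuracy (made up).
A = np.array([[0.42, 0.18, 0.25],
              [0.20, 0.45, 0.22],
              [0.27, 0.24, 0.40]])
strategy, value = zero_sum_equilibrium(A)
```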
Poster
Hanrui Zhao · Niuniu Qi · Mengxin Ren · Banglong Liu · Shuming Shi · Zhengfeng Yang

[ ExHall D ]

Abstract
A polynomial Lyapunov function V(x) provides a mathematically rigorous certificate that converts stability analysis into an efficiently solvable optimization problem. Traditional numerical methods rely on user-defined templates, while emerging neural V(x) approaches offer flexibility but exhibit poor generalization when built from naive *Square* polynomial networks. In this paper, we propose a novel learning-enabled polynomial V(x) synthesis approach, in which a data-driven machine learning process, guided by target-based sampling, fits candidate V(x) that are naturally compatible with sum-of-squares (SOS) soundness verification. The framework is structured as an iterative loop between a *Learner* and a *Verifier*: the *Learner* trains an expressive polynomial V(x) network via polynomial expansions, while the *Verifier* encodes the learned candidates with SOS constraints to identify a true V(x) by solving LMI feasibility problems. The entire procedure is driven by a high-accuracy counterexample guidance technique to further enhance efficiency. Experimental results demonstrate that our approach outperforms both SMT-based polynomial neural Lyapunov function synthesis and the traditional SOS method.
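For context, the soundness conditions such a Verifier certifies are the standard Lyapunov conditions for a system \dot{x} = f(x) together with their SOS relaxation; the generic form (not necessarily the paper's exact encoding) is

V(0) = 0, \qquad V(x) > 0 \ \forall x \neq 0, \qquad \dot{V}(x) = \nabla V(x)^{\top} f(x) < 0 \ \forall x \neq 0,

and a sufficient SOS certificate, for some small \epsilon > 0, is

V(x) - \epsilon \lVert x \rVert^{2} \in \Sigma[x], \qquad -\nabla V(x)^{\top} f(x) - \epsilon \lVert x \rVert^{2} \in \Sigma[x],

where \Sigma[x] denotes the cone of sum-of-squares polynomials; each membership reduces to an LMI feasibility test over the Gram matrix of the corresponding polynomial, which is the problem the Verifier solves.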
Poster
Jing Wang · Songhe Feng · Kristoffer Knutsen Wickstrøm · Michael C. Kampffmeyer

[ ExHall D ]

Abstract
Most multi-view clustering approaches assume that all views are available for clustering. However, this assumption is often unrealistic, as views are incrementally accumulated over time, creating a need for continual multi-view clustering (CMVC) methods. Current approaches to CMVC rely on late fusion, where a new model is typically learned individually for each view to obtain the corresponding partition matrix, which is then used to update a consensus matrix via a moving average. These approaches are prone to view-specific noise and struggle to adapt to large gaps between different views. To address these shortcomings, we reconsider CMVC from the perspective of domain adaptation and propose AdaptCMVC, which learns how to incrementally accumulate knowledge of new views as they become available and prevents catastrophic forgetting. Specifically, a self-training framework is introduced to extend the model to new views, particularly designed to be robust to view-specific noise. Further, to combat catastrophic forgetting, a structure alignment mechanism is proposed to enable the model to explore the global group structure across multiple views. Experiments on several multi-view benchmarks demonstrate the effectiveness of our proposed method on the CMVC task. Our code is released in the supplementary materials.
Poster
Liang Chen · Zhe Xue · Yawen Li · Meiyu Liang · Yan Wang · Anton van den Hengel · Yuankai Qi

[ ExHall D ]

Abstract
Deep multi-view clustering methods utilize information from multiple views to achieve enhanced clustering results and have gained increasing popularity in recent years. Most existing methods typically focus on either inter-view or intra-view relationships, aiming to align information across views or analyze structural patterns within individual views. However, they often incorporate inter-view complementary information in a simplistic manner, while overlooking the complex, high-order relationships within multi-view data and the interactions among samples, resulting in an incomplete utilization of the rich information available. Instead, we propose a multi-scale approach that exploits all of the available information. We first introduce a dual graph diffusion module guided by a consensus graph. This module leverages inter-view information to enhance the representation of both nodes and edges within each view. Secondly, we propose a novel contrastive loss function based on hypergraphs to more effectively model and leverage complex intra-view data relationships. Finally, we propose to adaptively learn fusion weights at the sample level, which enables a more flexible and dynamic aggregation of multi-view information. Extensive experiments on eight datasets show favorable performance of the proposed method compared to state-of-the-art approaches, demonstrating its effectiveness across diverse scenarios.
Poster
David Mildenberger · Paul Hager · Daniel Rueckert · Martin J. Menten

[ ExHall D ]

Abstract
Supervised contrastive learning (SupCon) has proven to be a powerful alternative to the standard cross-entropy loss for classification of multi-class balanced datasets. However, it struggles to learn well-conditioned representations of datasets with long-tailed class distributions. This problem is potentially exacerbated for binary imbalanced distributions, which are commonly encountered in many real-world problems such as medical diagnosis. In experiments on seven binary datasets of natural and medical images, we show that the performance of SupCon decreases with increasing class imbalance. To substantiate these findings, we introduce two novel metrics that evaluate the quality of the learned representation space. By measuring the class distribution in local neighborhoods, we are able to uncover structural deficiencies of the representation space that classical metrics cannot detect. Informed by these insights, we propose two new supervised contrastive learning strategies tailored to binary imbalanced datasets that improve the structure of the representation space and increase downstream classification accuracy over standard SupCon by up to 35%.
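The two metrics themselves are not defined in the abstract; as a rough, hypothetical illustration of measuring the class distribution in local neighborhoods of an embedding space, one could report the average same-class fraction among each point's k nearest neighbors, per class, as sketched below.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_class_balance(embeddings: np.ndarray, labels: np.ndarray, k: int = 10):
    """Average fraction of same-class points among each sample's k nearest
    neighbors, reported per class (hypothetical neighborhood metric)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(embeddings)
    _, idx = nn.kneighbors(embeddings)
    neighbor_labels = labels[idx[:, 1:]]                       # drop the point itself
    same_class = (neighbor_labels == labels[:, None]).mean(axis=1)
    return {c: float(same_class[labels == c].mean()) for c in np.unique(labels)}

# For a well-conditioned representation of a binary imbalanced dataset, the
# minority class should not be swallowed by majority-class neighborhoods.
```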
Poster
Xingxuan Zhang · Jiansheng Li · Wenjing Chu · junjia hai · Renzhe Xu · Yuqing Yang · Shikai Guan · Jiazheng Xu · Liping Jing · Peng Cui

[ ExHall D ]

Abstract
We investigate the generalization boundaries of current Large Multimodal Models (LMMs) via comprehensive evaluation under out-of-distribution scenarios and domain-specific tasks. We evaluate their zero-shot generalization across synthetic images, real-world distributional shifts, and specialized datasets like medical and molecular imagery. Empirical results indicate that LMMs struggle with generalization beyond common training domains, limiting their direct application without adaptation. To understand the cause of unreliable performance, we analyze three hypotheses: semantic misinterpretation, visual feature extraction insufficiency, and mapping deficiency. Results identify mapping deficiency as the primary hurdle. To address this problem, we show that in-context learning (ICL) can significantly enhance LMMs' generalization, opening new avenues for overcoming generalization barriers. We further explore the robustness of ICL under distribution shifts and show its vulnerability to domain shifts, label shifts, and spurious correlation shifts between in-context examples and test data.
Poster
Xingjian Li · Qiming Zhao · Neelesh Bisht · Mostofa Uddin Uddin · Jin Yu Kim · Bryan Zhang · Min Xu

[ ExHall D ]

Abstract
In recent years, the interpretability of Deep Neural Networks (DNNs) has garnered significant attention, particularly due to their widespread deployment in critical domains like healthcare, finance, and autonomous systems. To address the challenge of understanding how DNNs make decisions, Explainable AI (XAI) methods, such as saliency maps, have been developed to provide insights into the inner workings of these models. This paper introduces DiffCAM, a novel XAI method designed to overcome limitations in existing Class Activation Map (CAM)-based techniques, which often rely on decision boundary gradients to estimate feature importance. DiffCAM differentiates itself by considering the actual data distribution of the reference class, identifying feature importance based on how a target example differs from reference examples. This approach captures the most discriminative features without relying on decision boundaries or prediction results, making DiffCAM applicable to a broader range of models, including foundation models. Through extensive experiments, we demonstrate the superior performance and flexibility of DiffCAM in providing meaningful explanations across diverse datasets and scenarios.
Poster
Renshuai Tao · Haoyu Wang · Yuzhe Guo · Hairong Chen · Li Zhang · Xianglong Liu · Yunchao Wei · Yao Zhao

[ ExHall D ]

Abstract
To detect prohibited items in challenging categories, human inspectors typically rely on images from two distinct views (vertical and side). Can AI detect prohibited items from dual-view X-ray images the way humans do? Existing X-ray datasets often suffer from limitations such as single-view imaging or insufficient sample diversity. To address these gaps, we introduce the Large-scale Dual-view X-ray (LDXray) dataset, which consists of 353,646 instances across 12 categories, providing a diverse and comprehensive resource for training and evaluating models. To emulate human intelligence in dual-view detection, we propose the Auxiliary-view Enhanced Network (AENet), a novel detection framework that leverages both the main and auxiliary views of the same object. The main-view pipeline focuses on detecting common categories, while the auxiliary-view pipeline handles more challenging categories using "expert models" learned from the main view. Extensive experiments on the LDXray dataset demonstrate that the dual-view mechanism significantly enhances detection performance, e.g., achieving improvements of up to +24.7% for the challenging category of umbrellas. Furthermore, our results show that AENet exhibits strong generalization across seven different detection models.
Poster
Kang Liu · Zhuoqi Ma · Xiaolu Kang · Yunan Li · Kun XIE · Zhicheng Jiao · Qiguang Miao

[ ExHall D ]

Abstract
Automated radiology report generation offers an effective solution to alleviate radiologists' workload. However, most existing methods focus primarily on single or fixed-view images to model current disease conditions, which limits diagnostic accuracy and overlooks disease progression. Although some approaches utilize longitudinal data to track disease progression, they still rely on single images to analyze current visits. To address these issues, we propose enhanced contrastive learning with Multi-view Longitudinal data to facilitate chest X-ray Report Generation, named MLRG. Specifically, we introduce a multi-view longitudinal contrastive learning method that integrates spatial information from current multi-view images and temporal information from longitudinal data. This method also utilizes the inherent spatiotemporal information of radiology reports to supervise the pre-training of visual and textual representations. Subsequently, we present a tokenized absence encoding technique to flexibly handle missing patient-specific prior knowledge, allowing the model to produce more accurate radiology reports based on available prior knowledge. Extensive experiments on MIMIC-CXR, MIMIC-ABN, and Two-view CXR datasets demonstrate that our MLRG outperforms recent state-of-the-art methods, achieving a 2.3% BLEU-4 improvement on MIMIC-CXR, a 5.5% F1 score improvement on MIMIC-ABN, and a 2.7% F1 RadGraph improvement on Two-view CXR.
Poster
Yuxuan Sun · Yixuan Si · Chenglu Zhu · Xuan Gong · Kai Zhang · Pingyi Chen · Ye Zhang · Zhongyi Shui · Tao Lin · Lin Yang

[ ExHall D ]

Abstract
The emergence of large multimodal models (LMMs) has brought significant advancements to pathology. Previous research has primarily focused on separately training patch-level and whole-slide image (WSI)-level models, limiting the integration of learned knowledge across patches and WSIs and resulting in redundant models. In this work, we introduce CPath-Omni, the first 15-billion-parameter LMM designed to unify patch-level and WSI-level image analysis, consolidating a variety of tasks at both levels, including classification, visual question answering, captioning, and visual referring prompting. Extensive experiments demonstrate that CPath-Omni achieves state-of-the-art (SOTA) performance across seven diverse tasks on 39 out of 42 datasets, outperforming or matching task-specific models trained for individual tasks. Additionally, we develop a specialized pathology CLIP-based visual processor for CPath-Omni, CPath-CLIP, which, for the first time, integrates different vision models and incorporates a large language model as a text encoder to build a more powerful CLIP model that achieves SOTA performance on nine zero-shot and four few-shot datasets. Our findings highlight CPath-Omni's ability to unify diverse pathology tasks, demonstrating its potential to streamline and advance the field. The code and datasets will be made available.
Poster
Amaya Gallagher-Syed · Henry Senior · Omnia Alwazzan · Elena Pontarini · Michele Bombardieri · Costantino Pitzalis · Myles J. Lewis · Michael R Barnes · Luca Rossi · Greg Slabaugh

[ ExHall D ]

Abstract
The development of biologically interpretable and explainable models remains a key challenge in computational pathology, particularly for multistain immunohistochemistry analysis. We present BioX-CPath, an explainable graph neural network architecture for Whole Slide Image classification that leverages both spatial and semantic features across multiple stains. At its core, BioX-CPath introduces a novel Stain-Aware Attention Pooling (SAAP) module that generates biologically meaningful, stain-aware patient embeddings. Our approach achieves state-of-the-art performance on both Rheumatoid Arthritis (accuracy of 0.90 ±0.019) and Sjogren's Disease (accuracy of 0.84 ±0.018) multistain datasets. Beyond performance metrics, BioX-CPath provides interpretable insights through stain attention scores, entropy measures, and stain interaction scores that align with known pathological mechanisms. This biological grounding, combined with strong classification performance, makes BioX-CPath particularly suitable for clinical applications where interpretability is key. Code will be made public upon paper acceptance.
Poster
Junjie Zhou · Jiao Tang · Yingli Zuo · Peng Wan · Daoqiang Zhang · WEI SHAO

[ ExHall D ]

Abstract
The integrative analysis of histopathological images and genomic data has received increasing attention for survival prediction of human cancers. However, existing studies typically assume that all modalities are available. In practice, collecting genomic data is costly, which sometimes makes it unavailable for testing samples. A common way of tackling such incompleteness is to generate the genomic representations from the pathology images. Nevertheless, this strategy still faces two challenges: (1) gigapixel whole slide images (WSIs) are huge and thus hard to represent; (2) it is difficult to generate genomic embeddings with diverse function categories in a unified generative framework. To address these challenges, we propose a Conditional Latent Differentiation Variational AutoEncoder (LD-CVAE) for robust multimodal survival prediction, even with missing genomic data. Specifically, a Variational Information Bottleneck Transformer (VIB-Trans) module is proposed to learn compressed pathological representations from the gigapixel WSIs. To generate different functional genomic features, we develop a novel Latent Differentiation Variational AutoEncoder (LD-VAE) to learn the common and specific posteriors for the genomic embeddings with diverse functions. Finally, we use the product-of-experts technique to integrate the genomic common posterior and image posterior for …
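For reference, when the experts are diagonal Gaussians the product-of-experts fusion has a standard closed form (stated here from the general PoE literature, not taken from the paper): with expert posteriors N(\mu_i, \sigma_i^2) and precisions T_i = \sigma_i^{-2}, the fused distribution is again Gaussian with

\sigma^2 = \Big(\sum_i T_i\Big)^{-1}, \qquad \mu = \sigma^2 \sum_i T_i \mu_i,

i.e., a precision-weighted average of the expert means, so the modality with the more confident posterior dominates the fused estimate.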
Poster
Yuxin Li · Zihao Zhu · Yuxiang Zhang · Yifan Chen · Zhibin Yu

[ ExHall D ]

Abstract
Semi-supervised polyp segmentation has made significant progress in recent years as a potential solution for computer-assisted treatment. Since depth images provide information complementary to RGB images that helps segment problematic regions, depth-assisted polyp segmentation has gained much attention. However, how best to exploit depth information is still worth studying: existing RGB-D segmentation methods rely on depth data in the inference stage, limiting their clinical applications. To tackle this problem, we propose a semi-supervised polyp segmentation framework based on the mean teacher architecture. We establish an auxiliary student network with depth images as input during the training stage, and the proposed depth-guided cross-modal mutual learning strategy is adopted to promote the learning of complementary information between the different student networks. At the same time, the high-confidence pseudo-labels generated by the auxiliary student network are used to guide the learning of the main student network from different perspectives, and no depth images are required in the inference phase. In addition, we introduce a depth-guided patch augmentation method to improve the model's learning performance in difficult regions of unlabeled polyp images. Experimental results show that our method achieves state-of-the-art performance under different label conditions on five polyp datasets.
Poster
Suruchi Kumari · Pravendra Singh

[ ExHall D ]

Abstract
Despite the remarkable progress of deep learning-based methods in medical image segmentation, their use in clinical practice remains limited for two main reasons. First, obtaining a large medical dataset with precise annotations to train segmentation models is challenging. Second, most current segmentation techniques generate a single deterministic segmentation mask for each image. However, in real-world scenarios, there is often significant uncertainty regarding what defines the "correct" segmentation, and various expert annotators might provide different segmentations for the same image. To tackle both of these problems, we propose Annotation Ambiguity Aware Semi-Supervised Medical Image Segmentation (AmbiSSL). AmbiSSL combines a small amount of multi-annotator labeled data and a large set of unlabeled data to generate diverse and plausible segmentation maps. Our method consists of three key components: (1) the Diverse Pseudo-Label Generation (DPG) module utilizes multiple decoders, created by performing randomized pruning on the original backbone decoder; these pruned decoders enable the generation of a diverse pseudo-label set; (2) a Semi-Supervised Latent Distribution Learning (SSLDL) module constructs a common latent space by utilizing both ground truth annotations and the pseudo-label set; and (3) a Cross-Decoder Supervision (CDS) module, which enables pruned decoders to guide each other's learning. We evaluated the proposed method on two publicly …
Poster
Shijie Chang · Xiaoqi Zhao · Lihe Zhang · Tiancheng Wang

[ ExHall D ]

Abstract
Recently emerged in-context-learning-based (ICL-based) models have the potential to unify medical lesion segmentation. However, due to their cross-fusion designs, existing ICL-based unified segmentation models fail to accurately localize lesions with low-matched reference sets. Considering that the query itself can be regarded as a high-matched reference, which better indicates the target, we design a self-referencing mechanism that adaptively extracts self-referring indicator vectors from the query based on coarse predictions, thus effectively overcoming the negative impact caused by low-matched reference sets. To further facilitate the self-referencing mechanism, we introduce reference indicator generation to efficiently extract reference information for coarse predictions instead of using cross-fusion modules, which heavily rely on reference sets. Our designs successfully address the challenges of applying ICL to unified medical lesion segmentation, forming a novel framework named SR-ICL. Our method achieves state-of-the-art results on 8 medical lesion segmentation tasks with only 4 image-mask pairs as reference. Notably, SR-ICL still accomplishes remarkable performance even when using weak reference annotations such as boxes and points, and maintains fixed, low memory consumption even as more tasks are combined. We hope that SR-ICL can provide new insights for the clinical application of medical lesion segmentation.
Poster
Lexin Fang · Yunyang Xu · Xiang Ma · Xuemei Li · Caiming Zhang

[ ExHall D ]

Abstract
Deep learning has achieved significant advancements in medical image segmentation, but existing models still face challenges in accurately segmenting lesion regions. The main reason is that some lesion regions in medical images have unclear boundaries, irregular shapes, and small tissue density differences, leading to label ambiguity. However, existing models treat all data equally, without taking quality differences into account during training, so noisy labels negatively impact model training and destabilize feature representations. In this paper, a **d**ata-driven **a**lternate **le**arning (**DALE**) paradigm is proposed to optimize the model's training process, achieving stable and high-precision segmentation. The paradigm focuses on two key points: (1) reducing the impact of noisy labels, and (2) calibrating unstable representations. To mitigate the negative impact of noisy labels, a loss consistency-based collaborative optimization method is proposed, and its effectiveness is theoretically demonstrated. Specifically, label confidence parameters are introduced to dynamically adjust the influence of labels of different confidence levels during model training, thus reducing the influence of noisy labels. To calibrate the learning bias of unstable representations, a distribution alignment method is proposed. This method restores the underlying distribution of unstable representations, thereby enhancing the discriminative capability of fuzzy region representations. Extensive …
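A minimal PyTorch sketch of the per-sample label-confidence idea: each training sample gets a trainable confidence weight that scales its loss, so likely noisy labels are down-weighted. The sigmoid parameterization and the regularizer that keeps weights from collapsing to zero are illustrative assumptions, not the paper's exact loss-consistency formulation.

```python
import torch
import torch.nn as nn

class ConfidenceWeightedLoss(nn.Module):
    """Per-sample trainable label-confidence weights that down-weight likely
    noisy labels during training (illustrative sketch)."""

    def __init__(self, num_samples: int, reg: float = 0.1):
        super().__init__()
        # One confidence logit per training sample; sigmoid keeps weights in (0, 1).
        self.confidence_logits = nn.Parameter(torch.zeros(num_samples))
        self.reg = reg

    def forward(self, per_sample_loss: torch.Tensor, sample_idx: torch.Tensor) -> torch.Tensor:
        w = torch.sigmoid(self.confidence_logits[sample_idx])
        # The -log(w) term discourages the trivial solution of driving all weights to zero.
        return (w * per_sample_loss).mean() + self.reg * (-torch.log(w + 1e-8)).mean()

# Usage: compute an unreduced segmentation loss per sample, then weight it.
criterion = ConfidenceWeightedLoss(num_samples=1000)
raw_loss = torch.rand(8)                      # e.g., per-sample Dice or CE losses
idx = torch.randint(0, 1000, (8,))            # indices of the samples in this batch
loss = criterion(raw_loss, idx)
```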
Poster
Md Mostafijur Rahman · Radu Marculescu

[ ExHall D ]

Abstract
Recent 3D deep networks such as SwinUNETR, SwinUNETRv2, and 3D UX-Net have shown promising performance by leveraging self-attention and large-kernel convolutions to capture the volumetric context. However, their substantial computational requirements limit their use in real-time and resource-constrained environments. The high #FLOPs and #Params in these networks stem largely from complex decoder designs with high-resolution layers and excessive channel counts. In this paper, we propose EffiDec3D, an optimized 3D decoder that employs a channel reduction strategy across all decoder stages, which sets the number of channels to the minimum needed for accurate feature representation. Additionally, EffiDec3D removes the high-resolution layers when their contribution to segmentation quality is minimal. Our optimized EffiDec3D decoder achieves a 96.4% reduction in #Params and a 93.0% reduction in #FLOPs compared to the decoder of the original 3D UX-Net. Similarly, for SwinUNETR and SwinUNETRv2 (which share an identical decoder), we observe reductions of 94.9% in #Params and 86.2% in #FLOPs. Our extensive experiments on 12 different medical imaging tasks confirm that EffiDec3D not only significantly reduces the computational demands, but also maintains a performance level comparable to original models, thus establishing a new standard for efficient 3D medical image segmentation. We will make the source code public upon …
Poster
Jiao Xu · Xin Chen · Lihe Zhang

[ ExHall D ]

Abstract
In this paper, we present a new dynamic collaborative network for semi-supervised 3D vessel segmentation, termed DiCo. Conventional mean teacher (MT) methods typically employ a static approach, where the roles of the teacher and student models are fixed. However, due to the complexity of 3D vessel data, the teacher model may not always outperform the student model, leading to cognitive biases that can limit performance. To address this issue, we propose a dynamic collaborative network that allows the two models to dynamically switch their teacher-student roles. Additionally, we introduce a multi-view integration module to capture various perspectives of the inputs, mirroring the way doctors conduct medical analysis. We also incorporate adversarial supervision to constrain the shape of the segmented vessels in unlabeled data. In this process, the 3D volume is projected into 2D views to mitigate the impact of label inconsistencies. Experiments demonstrate that our DiCo method sets new state-of-the-art performance on three 3D vessel segmentation benchmarks. Code and models will be made available.
Poster
Peirong Liu · Ana Lawry Aguila · Juan Iglesias

[ ExHall D ]

Abstract
Data-driven machine learning has made significant strides in medical image analysis. However, most existing methods are tailored to specific modalities and assume a particular resolution (often isotropic). This limits their generalizability in clinical settings, where variations in scan appearance arise from differences in sequence parameters, resolution, and orientation. Furthermore, most general-purpose models are designed for healthy subjects and suffer from performance degradation when pathology is present. We introduce UNA (Unraveling Normal Anatomy), the first modality-agnostic learning approach for normal brain anatomy reconstruction that can handle both healthy scans and cases with pathology. We propose a fluid-driven anomaly randomization method that generates an unlimited number of realistic pathology profiles on-the-fly. UNA is trained on a combination of synthetic and real data, and can be applied directly to real images with potential pathology without the need for fine-tuning. We demonstrate UNA's effectiveness in reconstructing healthy brain anatomy and showcase its direct application to anomaly detection, using both simulated and real images from 3D healthy and stroke datasets, including CT and MRI scans. By bridging the gap between healthy and diseased images, UNA enables the use of general-purpose models on diseased images, opening up new opportunities for large-scale analysis of uncurated clinical images …
Poster
Wensheng Cheng · Zhenghong Li · Jiaxiang Ren · Hyomin Jeong · Congwu Du · Yingtian Pan · Haibin Ling

[ ExHall D ]

Abstract
Estimating blood flow speed is essential in many medical and physiological applications, yet it is extremely challenging due to complex vascular structure and flow dynamics, particularly for cerebral cortex regions. Existing techniques, such as Optical Doppler Tomography (ODT), generally require complex hardware control and signal processing, and still suffer from inherent system-level artifacts. To address these challenges, we propose a new learning-based approach named OCTA-Flow, which directly estimates vascular blood flow speed from Optical Coherence Tomography Angiography (OCTA) images, which are commonly used for vascular structure analysis. OCTA-Flow employs several novel components to achieve this goal. First, using an encoder-decoder architecture, OCTA-Flow leverages ODT data as pseudo labels during training, thus bypassing the difficulty of collecting ground truth data. Second, to capture the relationship between vessels of varying scales and their flow speed, we design an Adaptive Window Fusion module that employs multiscale window attention. Third, to mitigate ODT artifacts, we incorporate a Conditional Random Field Decoder that promotes smoothness and consistency in the estimated blood flow. Together, these innovations enable OCTA-Flow to effectively produce accurate flow estimation, suppress the artifacts in ODT, and enhance practicality, benefiting from the established techniques of OCTA data acquisition. The code and data will be made publicly …
Poster
Shufan Xi · Zexian Liu · Junlin Chang · Hongyu Wu · Xiaogang Wang · Aimin Hao

[ ExHall D ]

Abstract
3D intraoral scan meshes are widely used in digital dentistry diagnosis, and segmenting them is a critical preliminary task. Numerous approaches have been devised for precise tooth segmentation. Current deep learning-based methods can segment the crown with high accuracy; however, segmentation accuracy at the junction between the crown and the gum is still below average. This is because existing down-sampling methods are unable to effectively preserve the geometric details at the junction. To address these problems, we propose CrossTooth, a boundary-preserving segmentation method that combines selective 3D mesh downsampling, which retains more vertices in the tooth-gingiva area, with cross-modal discriminative boundary features extracted from multi-view rendered images, enhancing the geometric representation of the segmentation network. Using a point network as a backbone and incorporating complementary image features, CrossTooth significantly improves segmentation accuracy, as demonstrated by experiments on a public intraoral scan dataset.