Poster
Ziqiao Peng · Yanbo Fan · Haoyu Wu · Xuan Wang · Hongyan Liu · Jun He · Zhaoxin Fan
[ ExHall D ]
Abstract
In face-to-face conversations, individuals need to switch between speaking and listening roles seamlessly. Existing 3D talking head generation models focus solely on speaking or listening, neglecting the natural dynamics of interactive conversation, which leads to unnatural interactions and awkward transitions. To address this issue, we propose a new task—multi-round dual-speaker interaction for 3D talking head generation—which requires models to handle and generate both speaking and listening behaviors in continuous conversation. To solve this task, we introduce DualTalk, a novel unified framework that integrates the dynamic behaviors of speakers and listeners to simulate realistic and coherent dialogue interactions. This framework not only synthesizes lifelike talking heads when speaking but also generates continuous and vivid non-verbal feedback when listening, effectively capturing the interplay between the roles. We also create a new dataset featuring 50 hours of multi-round conversations with over 1,000 characters, where participants continuously switch between speaking and listening roles. Extensive experiments demonstrate that our method significantly enhances the naturalness and expressiveness of 3D talking heads in dual-speaker conversations. Code and dataset will be released upon acceptance.
Poster
Lee Chae-Yeon · Hyun-Bin Oh · EunGi Han · Kim Sung-Bin · Suekyeong Nam · Tae-Hyun Oh
[ ExHall D ]
Abstract
Recent advancements in speech-driven 3D talking head generation have achieved impressive progress in lip synchronization. However, existing models still fall short in capturing a perceptual alignment between diverse speech characteristics and lip movements. In this work, we define essential criteria—temporal synchronization, lip readability, and expressiveness—for perceptually accurate lip movements in response to speech signals. We also introduce a speech-mesh synchronized representation that captures the intricate correspondence between speech and facial mesh. We plug in this representation as a perceptual loss to guide lip movements, ensuring they are perceptually aligned with the given speech. Additionally, we utilize this representation as a perceptual metric and introduce two other physically-grounded lip synchronization metrics to evaluate these three criteria. Experiments demonstrate that training 3D talking head models with our perceptual loss significantly enhances all three aspects of perceptually accurate lip synchronization. Codes will be released if accepted.
Poster
Dingcheng Zhen · Shunshun Yin · Shiyang Qin · Hou Yi · Ziwei Zhang · Siyuan Liu · Gan Qi · Ming Tao
[ ExHall D ]
Abstract
In this work, we introduce the first autoregressive framework for real-time, audio-driven portrait animation, a.k.a. talking head. Beyond the challenge of lengthy animation times, a critical challenge in realistic talking head generation lies in preserving the natural movement of diverse body parts. To this end, we propose Teller, the first streaming audio-driven portrait animation framework with autoregressive motion generation. Specifically, Teller first decomposes facial and body detail animation into two components: Facial Motion Latent Generation (FMLG) based on an autoregressive transformer, and movement authenticity refinement using an Efficient Temporal Module (ETM). Concretely, FMLG employs a Residual VQ model to map the facial motion latent from the implicit keypoint-based model into discrete motion tokens, which are then temporally sliced with audio embeddings. This enables the AR transformer to learn real-time, stream-based mappings from audio to motion. Furthermore, Teller incorporates ETM to capture finer motion details. This module ensures the physical consistency of body parts and accessories, such as neck muscles and earrings, improving the realism of these movements. Teller is designed to be efficient, surpassing the inference speed of diffusion-based models (Hallo 20.93s vs. Teller 0.92s for one second of video generation), and achieves real-time streaming performance of up to 25 FPS. Extensive experiments …
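A minimal sketch of the streaming idea described above: an autoregressive transformer emits one discrete motion token per frame, conditioned on per-frame audio features acting as memory. The module sizes, the 128-d audio features, and the greedy decoding loop are illustrative assumptions and stand in for Teller's FMLG and its Residual VQ codebook rather than reproducing them.

```python
# Toy streaming audio-to-motion-token generation with an autoregressive transformer.
# All names, sizes, and the codebook are hypothetical placeholders.
import torch
import torch.nn as nn

VOCAB, D_MODEL, N_FRAMES = 512, 256, 25  # e.g. 25 motion tokens ~ 1 s at 25 FPS

class ToyAudioToMotion(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB + 1, D_MODEL)    # +1 for a BOS token
        self.audio_proj = nn.Linear(128, D_MODEL)          # 128-d audio features (assumed)
        layer = nn.TransformerDecoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, VOCAB)

    @torch.no_grad()
    def generate(self, audio_feats):                       # audio_feats: (1, T, 128)
        mem = self.audio_proj(audio_feats)                 # audio slice used as memory
        tokens = torch.full((1, 1), VOCAB)                 # start from BOS
        for _ in range(N_FRAMES):                          # stream one token per frame
            x = self.tok_emb(tokens)
            causal = nn.Transformer.generate_square_subsequent_mask(x.size(1))
            h = self.decoder(x, mem, tgt_mask=causal)
            next_tok = self.head(h[:, -1]).argmax(-1, keepdim=True)
            tokens = torch.cat([tokens, next_tok], dim=1)
        return tokens[:, 1:]                               # discrete motion tokens

motion_tokens = ToyAudioToMotion().generate(torch.randn(1, 25, 128))
print(motion_tokens.shape)  # (1, 25)
```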
Poster
Jiahao Cui · Hui Li · Qingkun Su · Hanlin Shang · Kaihui Cheng · Yuqi Ma · Shan Mu · Hang Zhou · Jingdong Wang · Siyu Zhu
[ ExHall D ]
Abstract
Existing methodologies for animating portrait images encounter significant challenges, particularly in addressing non-frontal perspectives, rendering dynamic objects surrounding the portrait, and generating immersive, realistic backgrounds across various scenarios. This paper proposes a novel approach that integrates a diffusion framework with a transformer-based architecture to enhance the realism and dynamism of portrait animations. Our methodology introduces three key innovations. First, we employ speech audio conditioning through cross-attention mechanisms to ensure precise alignment between audio signals and facial dynamics. Second, we incorporate an identity reference network into the diffusion transformer framework, thereby preserving facial identity consistently across video sequences. Third, our approach facilitates long-duration video extrapolation through motion frames, enabling the generation of extended video clips. We validated our method through experiments conducted on benchmark datasets and newly proposed wild datasets, demonstrating substantial improvements over previous methods in generating realistic portraits characterized by diverse orientations within dynamic and immersive scenes.
Poster
Shuyuan Tu · Zhen Xing · Xintong Han · Zhi-Qi Cheng · Qi Dai · Chong Luo · Zuxuan Wu
[ ExHall D ]
Abstract
Current diffusion models for human image animation struggle to ensure identity (ID) consistency. This paper presents StableAnimator, the first end-to-end ID-preserving video diffusion framework, which synthesizes high-quality videos without any post-processing, conditioned on a reference image and a sequence of poses. Building upon a video diffusion model, StableAnimator contains carefully designed modules for both training and inference striving for identity consistency. In particular, StableAnimator begins by computing image and face embeddings with off-the-shelf extractors, respectively; the face embeddings are further refined by interacting with the image embeddings through a global content-aware Face Encoder. Then, StableAnimator introduces a novel distribution-aware ID Adapter that prevents interference caused by temporal layers while preserving ID via alignment. During inference, we propose a novel Hamilton-Jacobi-Bellman (HJB) equation-based optimization to further enhance the face quality. We demonstrate that solving the HJB equation can be integrated into the diffusion denoising process, and the resulting solution constrains the denoising path and thus benefits ID preservation. Experiments on multiple benchmarks show the effectiveness of StableAnimator both qualitatively and quantitatively.
Poster
Yuan Li · Ziqian Bai · Feitong Tan · Zhaopeng Cui · Sean Fanello · Yinda Zhang
[ ExHall D ]
Abstract
We propose a novel 3D-aware diffusion-based method for generating photorealistic talking head videos directly from a single identity image and explicit control signals (e.g., expressions). Our method generates Multiplane Images (MPIs) that ensure geometric consistency, making them ideal for immersive viewing experiences like binocular videos for VR headsets. Unlike existing methods that often require a separate stage or joint optimization to reconstruct a 3D representation (such as NeRF or 3D Gaussians), our approach directly generates the final output through a single denoising process, eliminating the need for post-processing steps to render novel views efficiently. To effectively learn from monocular videos, we introduce a training mechanism that reconstructs the output MPI randomly in either the target or the reference camera space. This approach enables the model to simultaneously learn sharp image details and underlying 3D information. Extensive experiments demonstrate the effectiveness of our method, which achieves competitive avatar quality and novel-view rendering capabilities, even without explicit 3D reconstruction or high-quality multi-view training data.
Poster
yating wang · Xuan Wang · Ran Yi · Yanbo Fan · Jichen Hu · Jingcheng Zhu · Lizhuang Ma
[ ExHall D ]
Abstract
Recent studies have combined 3D Gaussians and 3D Morphable Models (3DMM) to construct high-quality 3D head avatars. In this line of research, existing methods either fail to capture dynamic textures or incur significant overhead in runtime speed or storage space. To this end, we propose a novel method that addresses all the aforementioned demands. Specifically, we introduce an expressive and compact representation that encodes the texture-related attributes of the 3D Gaussians in a tensorial format. We store the appearance of the neutral expression in static tri-planes, and represent dynamic texture details for different expressions using lightweight 1D feature lines, which are then decoded into opacity offsets relative to the neutral face. We further propose an adaptive truncated opacity penalty and class-balanced sampling to improve generalization across different expressions. Experiments show that this design enables accurate capture of dynamic facial details while maintaining real-time rendering and significantly reducing storage costs, thus broadening the applicability to more scenarios.
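A hedged sketch of the tensorial texture decoding described above: static tri-plane features for the neutral appearance are combined with an expression-weighted sum of lightweight 1D feature lines and decoded into a per-Gaussian opacity offset. All shapes, the sampling scheme, and the linear decoder are assumptions made for illustration, not the paper's architecture.

```python
import torch
import torch.nn.functional as F

N_GAUSS, C, RES, N_EXPR = 1000, 8, 64, 10

triplanes = torch.randn(3, C, RES, RES)          # static neutral-appearance planes
feat_lines = torch.randn(N_EXPR, C, RES)         # 1D feature lines for dynamic detail
decoder = torch.nn.Linear(3 * C + C, 1)          # maps features -> opacity offset

def sample_plane(plane, uv):                     # uv in [-1, 1], shape (N, 2)
    grid = uv.view(1, -1, 1, 2)
    out = F.grid_sample(plane[None], grid, align_corners=True)
    return out.view(plane.shape[0], -1).t()      # (N, C)

xyz = torch.rand(N_GAUSS, 3) * 2 - 1             # canonical Gaussian positions
expr = torch.rand(N_EXPR)                        # expression coefficients (e.g. 3DMM)

# static part: sample the three axis-aligned planes and concatenate
static = torch.cat([sample_plane(triplanes[i], xyz[:, [j, k]])
                    for i, (j, k) in enumerate([(0, 1), (0, 2), (1, 2)])], dim=-1)

# dynamic part: expression-weighted sum of 1D feature lines, sampled along x
coords = torch.stack([xyz[:, 0], torch.zeros(N_GAUSS)], dim=-1)   # (N, 2)
lines = torch.stack([sample_plane(feat_lines[e][:, None, :], coords)
                     for e in range(N_EXPR)])                     # (E, N, C)
dynamic = (expr.view(-1, 1, 1) * lines).sum(0)                    # (N, C)

opacity_offset = decoder(torch.cat([static, dynamic], dim=-1))    # (N, 1)
print(opacity_offset.shape)
```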
Poster
Di Liu · Teng Deng · Giljoo Nam · Yu Rong · Stanislav Pidhorskyi · Junxuan Li · Jason Saragih · Dimitris N. Metaxas · Chen Cao
[ ExHall D ]
Abstract
Photorealistic 3D head avatar reconstruction faces critical challenges in modeling dynamic face-hair interactions and achieving cross-identity generalization, particularly during expressions and head movements. We present LUCAS, a novel Universal Prior Model (UPM) for codec avatar modeling that disentangles face and hair through a layered representation. Unlike previous UPMs that treat hair as an integral part of the head, our approach separates the modeling of the hairless head and hair into distinct branches. LUCAS is the first to introduce a mesh-based UPM, facilitating real-time rendering on devices. LUCAS can be integrated with Gaussian Splatting to enhance visual fidelity, a feature particularly beneficial for rendering complex hairstyles. Experimental results indicate that LUCAS outperforms existing single-mesh and Gaussian-based avatar models in both quantitative and qualitative assessments, including evaluations on held-out subjects in zero-shot driving scenarios. LUCAS demonstrates superior dynamic performance in managing head pose changes, expression transfer, and hairstyle variations, thereby advancing the state-of-the-art in 3D head avatar reconstruction.
Poster
SooHyun Lee · SeoYeon Kim · HeeKyung Lee · Won-Sik Cheong · Joo Ho Lee
[ ExHall D ]
Abstract
Multi-person avatar reconstruction from sparse multiview videos is challenging. The independent reconstruction of individual avatars often fails to capture the geometric relationships among multiple instances, resulting in inter-penetrations between avatars. While some researchers have resolved this issue using neural volumetric rendering techniques, these approaches suffer from huge computational costs for rendering and training. In this paper, we propose a multi-person avatar reconstruction method that reconstructs 3D avatars while preserving the geometric relations between people. Our 2D Gaussian Splatting (2DGS)-based avatar representation allows us to represent geometrically accurate surfaces of multiple instances that support sharp inside-outside tests. To efficiently influence the occluded instances, we design a differentiable multi-layer alpha blending system compatible with the GS rendering pipeline. We mitigate inter-penetrations among avatars by penalizing segmentation discrepancies and seeing through near-contact regions to reveal penetrating parts. We also utilize monocular priors to enhance quality in less-observed and textureless surfaces. Our proposed method achieves fast reconstruction while maintaining state-of-the-art performance in terms of geometry and rendering quality. We demonstrate the efficiency and effectiveness of our method on a multi-person dataset containing close interactions.
Poster
Lingteng Qiu · Shenhao Zhu · Qi Zuo · Xiaodong Gu · Yuan Dong · Junfei Zhang · Chao Xu · Zhe Li · Weihao Yuan · Liefeng Bo · Guanying Chen · Zilong Dong
[ ExHall D ]
Abstract
Generating animatable human avatars from a single image is essential for various digital human modeling applications. Existing 3D reconstruction methods often struggle to capture fine details in animatable models, while generative approaches for controllable animation, though avoiding explicit 3D modeling, suffer from viewpoint inconsistencies in extreme poses and computational inefficiencies. In this paper, we address these challenges by leveraging the power of generative models to produce detailed multi-view canonical pose images, which help resolve ambiguities in animatable human reconstruction. We then propose a robust method for 3D reconstruction of inconsistent images, enabling real-time rendering during inference. Specifically, we adapt a transformer-based Text-to-Video model to generate multi-view canonical pose images and normal maps, pretraining on a large-scale monocular video dataset to improve generalization. To handle view inconsistencies, we recast the reconstruction problem as a 4D task and introduce an efficient 3D modeling approach using 4D Gaussian Splatting. Experiments demonstrate that our method achieves photorealistic, real-time animation of 3D human avatars from in-the-wild images, showcasing its effectiveness and generalization capability.
Poster
Zhichao Zhai · Guikun Chen · Wenguan Wang · Dong Zheng · Jun Xiao
[ ExHall D ]
Abstract
Decoupling from customized parametric templates represents a crucial step toward the creation of fully flexible, animatable articulated models. While existing template-free methods can achieve high-fidelity reconstruction in observed views, they struggle to recover plausible canonical models, resulting in suboptimal animation quality. This limitation stems from overlooking the fundamental ambiguities in canonical reconstruction, where multiple canonical models could explain the same observed views. This work reveals the entanglement between canonical ambiguities and incorrect skinning, and presents a self-supervised framework that learns both plausible skinning and accurate canonical geometry using only sparse pose data. Our method, TAGA, uses explicit 3D Gaussians as skinning carriers and characterizes ambiguities as "Ambiguous Gaussians" with incorrect skinning weights. TAGA then corrects ambiguous Gaussians in the observation space using anomaly detection. With the corrected ones, we enforce cycle consistency constraints on both geometry and skinning to refine the corresponding Gaussians in the canonical space through a new backward method. Compared to existing state-of-the-art template-free methods, TAGA delivers superior visual fidelity for novel views and poses, while significantly improving training and rendering speeds. Experiments on challenging datasets with limited pose variations further demonstrate the robustness and generality of TAGA. The code will be released.
Poster
Mingze Sun · Junting Dong · Junhao Chen · Yurun Chen · Xinyu Jiang · Shiwei Mao · Puhua Jiang · Jingbo Wang · Bo Dai · Ruqi Huang
[ ExHall D ]
Abstract
Recent advances in generative models have enabled high-quality 3D character reconstruction from multi-modal inputs. However, animating these generated characters remains a challenging task, especially for complex elements like garments and hair, due to the lack of large-scale datasets and effective rigging methods. To address this gap, we curate AnimeRig, a large-scale dataset with detailed skeleton and skinning annotations. Building upon this, we propose DRiVE, a novel framework for generating and rigging 3D human characters with intricate structures. Unlike existing methods, DRiVE utilizes a 3D Gaussian representation, facilitating efficient animation and high-quality rendering. We further introduce GSDiff, a 3D Gaussian-based diffusion module that predicts joint positions as spatial distributions, overcoming the limitations of regression-based approaches. Extensive experiments demonstrate that DRiVE achieves precise rigging results, enabling realistic dynamics for clothing and hair, and surpassing previous methods in both quality and versatility. The code and dataset will be made public for academic use upon acceptance.
Poster
Yifang Men · Yuan Yao · Miaomiao Cui · Liefeng Bo
[ ExHall D ]
Abstract
Character video synthesis aims to produce realistic videos of animatable characters within lifelike scenes. As a fundamental problem in the computer vision and graphics community, 3D works typically require multi-view captures for per-case training, which severely limits their applicability for modeling arbitrary characters in a short time. Recent 2D methods break this limitation via pre-trained diffusion models, but they struggle with flexible control, pose generality, and scene interaction. To this end, we propose MIMO, a novel framework which can not only synthesize realistic character videos with controllable attributes (i.e., character, motion and scene) provided by simple user inputs, but also simultaneously achieve advanced scalability to arbitrary characters, generality to novel 3D motions, and applicability to interactive real-world scenes in a unified framework. The core idea is to encode the 2D video to compact spatial codes, considering the inherent 3D nature of video occurrence. Concretely, we lift the 2D frame pixels into 3D using monocular depth estimators, and decompose the video clip into three spatial components (i.e., main human, underlying scene, and floating occlusion) in hierarchical layers based on the 3D depth. These components are further encoded to canonical identity code, structured motion code and full scene code, which are utilized …
Poster
Satyajit Tourani · Siddharth Tourani · Arif Mahmood · Muhammad Haris Khan
[ ExHall D ]
Abstract
Unsupervised landmark and head pose estimation is fundamental in fields like biometrics, augmented reality, and emotion recognition, offering accurate spatial data without relying on labeled datasets. It enhances scalability, adaptability, and generalization across diverse settings where manual labeling is costly. In this work we exploit Stable Diffusion to approach the challenging problem of unsupervised landmark and head pose estimation and make the following contributions. (a) We propose a semantic-aware landmark localization algorithm including a consistent landmark selection technique. (b) To encode landmarks and their holistic configuration, we propose learning image-aware textual embeddings. (c) A novel algorithm for landmark-guided 3D head pose estimation is also proposed. (d) We refine the landmarks using head pose via a novel 3D-rendering-based augmentation and pose-based batching technique, and the refined landmarks in turn improve the head pose. (e) We report a new state-of-the-art in unsupervised facial landmark estimation across five challenging datasets including AFLW2000, MAFL, Cat-Heads, LS3D and a facial landmark tracking benchmark 300VW. In unsupervised head pose estimation, we outperform existing methods on BIWI and AFLW2000 by visible margins. Moreover, our method provides a significant training speed-up over the existing best unsupervised landmark detection method.
Poster
Yuxi Mi · Zhizhou Zhong · Yuge Huang · Qiuyang Yuan · Xuan Zhao · Jianqing Xu · Shouhong Ding · ShaoMing Wang · Rizen Guo · Shuigeng Zhou
[ ExHall D ]
Abstract
Identity-preserving face synthesis aims to generate synthetic face images of virtual subjects that can substitute real-world data for training face recognition models. While prior arts strive to create images with consistent identities and diverse styles, they face a trade-off between them. Identifying their limitation of treating style variation as subject-agnostic and observing that real-world persons actually have distinct, subject-specific styles, this paper introduces MorphFace, a diffusion-based face generator. The generator learns fine-grained facial styles, e.g., shape, pose and expression, from the renderings of a 3D morphable model (3DMM). It also learns identities from an off-the-shelf recognition model. To create virtual faces, the generator is conditioned on novel identities of unlabeled synthetic faces, and novel styles that are statistically sampled from a real-world prior distribution. The sampling especially accounts for both intra-subject variation and subject distinctiveness. A context blending strategy is employed to enhance the generator's responsiveness to identity and style conditions. Extensive experiments show that MorphFace outperforms the best prior arts in face recognition efficacy.
Poster
Michelle Guo · Matt Jen-Yuan Chiang · Igor Santesteban · Nikolaos Sarafianos · Hsiaoyu Chen · Oshri Halimi · Aljaž Božič · Shunsuke Saito · Jiajun Wu · Karen Liu · Tuur Stuyck · Egor Larionov
[ ExHall D ]
Abstract
We introduce a novel approach to reconstruct simulation-ready garments with intricate appearance. Despite recent advancements, existing methods often struggle to balance the need for accurate garment reconstruction with the ability to generalize to new poses and body shapes or require large amounts of data to achieve this. In contrast, our method only requires a multi-view capture of a single static frame. We represent garments as hybrid mesh-embedded 3D Gaussian splats (or simply Gaussians), where the Gaussians capture near-field shading and high-frequency details, while the mesh encodes far-field albedo and optimized reflectance parameters. We achieve novel pose generalization by exploiting the mesh from our hybrid approach, enabling physics-based simulation and surface rendering techniques, while also capturing fine details with Gaussians that accurately reconstruct garment details. Our optimized garments can be used for simulating garments on novel poses, and garment relighting.
Poster
Zeqing Wang · Qingyang Ma · Wentao Wan · Haojie Li · Keze Wang · Yonghong Tian
[ ExHall D ]
Abstract
Recent improvements in visual synthesis have significantly enhanced the depiction of generated human photos, which are pivotal due to their wide applicability and demand. Nonetheless, existing text-to-image or text-to-video models often generate low-quality human photos that might differ considerably from real-world body structures, referred to as "abnormal human bodies". Such abnormalities, typically deemed unacceptable, pose considerable challenges in detecting and repairing them within human photos. These challenges require precise abnormality recognition capabilities, which entail pinpointing both the location and the abnormality type. Intuitively, Visual Language Models (VLMs) that have obtained remarkable performance on various visual tasks are quite suitable for this task. However, their performance on abnormality detection in human photos is quite poor. Hence, it is important to highlight this task for the research community. In this paper, we first introduce a simple yet challenging task, i.e., Fine-grained Human-body Abnormality Detection (FHAD), and construct two high-quality datasets for evaluation. Then, we propose a meticulous framework, named HumanCalibrator, which identifies and repairs abnormalities in human body structures while preserving the other content. Experiments indicate that our HumanCalibrator achieves high accuracy in abnormality detection and shows clear improvements in visual comparisons while preserving the other visual content.
Poster
Nannan Li · Kevin Shih · Bryan A. Plummer
[ ExHall D ]
Abstract
Given an isolated garment image in a canonical product view and a separate image of a person, the virtual try-on task aims to generate a new image of the person wearing the target garment. Prior virtual try-on works face two major challenges in achieving this goal: a) the paired (human, garment) training data has limited availability; b) generating textures on the human that perfectly match those of the prompted garment is difficult, often resulting in distorted text and faded textures. Our work explores ways to tackle these issues through both synthetic data and model refinement. We introduce a garment extraction model that generates (human, synthetic garment) pairs from a single image of a clothed individual. The synthetic pairs can then be used to augment the training of virtual try-on. We also propose an Error-Aware Refinement-based Schrödinger Bridge (EARSB) that surgically targets localized generation errors for correcting the output of a base virtual try-on model. To identify likely errors, we propose a weakly-supervised error classifier that localizes regions for refinement, subsequently augmenting the Schrödinger Bridge's noise schedule with its confidence heatmap. Experiments on VITON-HD and DressCode-Upper demonstrate that our synthetic data augmentation enhances the performance of prior work, while EARSB …
Poster
Yuanwei Liu · Hui Wei · Chengyu Jia · Ruqi Xiao · Weijian Ruan · Xingxing Wei · Joey Tianyi Zhou · Zheng Wang
[ ExHall D ]
Abstract
Previous physical adversarial attacks have shown that carefully crafted perturbations can deceive face recognition systems, revealing critical security vulnerabilities. However, these attacks often struggle to impersonate multiple targets and frequently fail to bypass liveness detection. For example, attacks using human-skin masks are challenging to fabricate, inconvenient to swap between users, and often fail liveness detection due to facial occlusions. A projector, however, can generate content-rich light without obstructing the face, making it ideal for non-intrusive attacks. Thus, we propose a novel physical adversarial attack using a projector and explore the superposition of projected and natural light to create adversarial facial images. This approach eliminates the need for physical artifacts on the face, effectively overcoming these limitations. Specifically, our proposed ProjAttacker generates adversarial 3D textures that are projected onto human faces. To ensure physical realizability, we introduce a light reflection function that models complex optical interactions between projected light and human skin, accounting for reflection and diffraction effects. Furthermore, we incorporate camera Image Signal Processing (ISP) simulation to maintain the robustness of adversarial perturbations across real-world diverse imaging conditions. Comprehensive evaluations conducted in both digital and physical scenarios validate the effectiveness of our method. Codes will be publicly available.
Poster
Yu-Cheng Chiu · GUAN-RONG CHEN · Zihao Chen · Yan-Tsung Peng
[ ExHall D ]
Abstract
The primary goal of white balance (WB) for sRGB images is to correct inaccurate color temperatures, making images exhibit natural, neutral colors. While existing WB methods achieve reasonable results, they are limited by the global color adjustments applied during a camera’s post-sRGB processing and the restricted color diversity in current datasets, often leading to suboptimal color correction, particularly in images with pronounced color shifts. To address these limitations, we propose an Auxiliary Bimodal Cross-domain Transformer (ABC-Former) that enhances WB correction by leveraging complementary knowledge from multiple modalities. ABC-Former employs two auxiliary models to extract global color information from CIELab and RGB color histograms, complementing the primary model’s sRGB input processing. We introduce an Interactive Channel Attention (ICA) module to facilitate cross-modality knowledge transfer, integrating calibrated color features into image features for more precise WB results. Experimental evaluations on benchmark WB datasets show that ABC-Former achieves superior performance, outperforming state-of-the-art WB methods.
Poster
Rui Xu · Yuzhen Niu · Yuezhou Li · Huangbiao Xu · Wenxi Liu · Yuzhong Chen
[ ExHall D ]
Abstract
Existing low-light image enhancement (LLIE) and joint LLIE and deblurring (LLIE-deblur) models have made strides in addressing predefined degradations, yet they are often constrained by dynamically coupled degradations. To address these challenges, we introduce a Unified Receptance Weighted Key Value (URWKV) model with a multi-state perspective, enabling flexible and effective degradation restoration for low-light images. Specifically, we customize the core URWKV block to perceive and analyze complex degradations by leveraging multiple intra- and inter-stage states. First, inspired by the pupil mechanism in the human visual system, we propose Luminance-adaptive Normalization (LAN) that adjusts normalization parameters based on rich inter-stage states, allowing for adaptive, scene-aware luminance modulation. Second, we aggregate multiple intra-stage states through an exponential moving average approach, effectively capturing subtle variations while mitigating the information loss inherent in a single-state mechanism. To reduce the degradation effects commonly associated with conventional skip connections, we propose the State-aware Selective Fusion (SSF) module, which dynamically aligns and integrates multi-state features across encoder stages, selectively fusing contextual information. In comparison to state-of-the-art models, our URWKV model achieves superior performance on various benchmarks, while requiring significantly fewer parameters and computational resources.
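A tiny sketch of aggregating several intra-stage states with an exponential moving average, as mentioned above; the decay value and feature shapes are illustrative assumptions.

```python
import torch

def ema_aggregate(states, decay=0.9):
    """states: list of (B, C, H, W) feature tensors from successive blocks."""
    agg = states[0]
    for s in states[1:]:
        agg = decay * agg + (1.0 - decay) * s   # EMA keeps earlier context, adds new detail
    return agg

states = [torch.randn(2, 32, 16, 16) for _ in range(4)]
print(ema_aggregate(states).shape)  # torch.Size([2, 32, 16, 16])
```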
Poster
Guanzhou Lan · Qianli Ma · YUQI YANG · Zhigang Wang · Dong Wang · Xuelong Li · Bin Zhao
[ ExHall D ]
Abstract
The computational burden of the iterative sampling process remains a major challenge in diffusion-based Low-Light Image Enhancement (LLIE). Current acceleration methods, whether training-based or training-free, often lead to significant performance degradation, highlighting the trade-off between performance and efficiency. In this paper, we identify two primary factors contributing to performance degradation: fitting errors and the inference gap. Our key insight is that fitting errors can be mitigated by linearly extrapolating the incorrect score functions, while the inference gap can be reduced by shifting the Gaussian flow to a reflectance-aware residual space. Based on these insights, we design the Reflectance-Aware Trajectory Refinement (RATR) module, a simple yet effective module to refine the teacher trajectory using the reflectance component of images. Following this, we introduce Reflectance-aware Diffusion with Distilled Trajectory (ReDDiT), an efficient and flexible distillation framework tailored for LLIE. Our framework achieves performance comparable to previous diffusion-based methods that require many redundant steps in just 2 steps, while establishing new state-of-the-art (SOTA) results with 8 or 4 steps. Comprehensive experimental evaluations on 10 benchmark datasets validate the effectiveness of our method, consistently outperforming existing SOTA methods.
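A hedged sketch of the first insight above: two consecutive, possibly inaccurate, noise/score predictions are linearly extrapolated to refine the current estimate. The extrapolation weight is an illustrative assumption, not the paper's derived coefficient.

```python
import torch

def extrapolate_score(eps_prev, eps_curr, weight=0.5):
    """Linear extrapolation: move past the current prediction along the
    direction of change between the two most recent predictions."""
    return eps_curr + weight * (eps_curr - eps_prev)

eps_prev, eps_curr = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)
print(extrapolate_score(eps_prev, eps_curr).shape)
```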
Poster
Hesong Li · Ziqi Wu · Ruiwen Shao · Tao Zhang · Ying Fu
[ ExHall D ]
Abstract
Scanning Transmission Electron Microscopy (STEM) enables the observation of atomic arrangements at sub-angstrom resolution, allowing for atomically resolved analysis of the physical and chemical properties of materials. However, due to the effects of noise, electron beam damage, sample thickness, etc., obtaining satisfactory atomic-level images is often challenging. Enhancing STEM images can reveal clearer structural details of materials. Nonetheless, existing STEM image enhancement methods usually overlook unique features in the frequency domain, and existing datasets lack realism and generality. To resolve these issues, in this paper, we develop noise calibration, data synthesis, and enhancement methods for STEM images. We first present a STEM noise calibration method, which is used to synthesize more realistic STEM images. The parameters of background noise, scan noise, and pointwise noise are obtained by statistical analysis and fitting of real STEM images containing atoms. Then we use these parameters to develop a more general dataset that considers both regular and random atomic arrangements, and includes both HAADF and BF mode images. Finally, we design a spatial-frequency interactive network for STEM image enhancement, which can explore the information in the frequency domain formed by the periodicity of atomic arrangements. Experimental results show that our data is closer to real STEM …
Poster
Yujie Wang · Praneeth Chakravarthula · Baoquan Chen
[ ExHall D ]
Abstract
Gaussian Splatting techniques have recently enabled high-quality 3D scene reconstruction and real-time novel view synthesis. These approaches, however, are limited by the pinhole camera model and lack support for modeling and rendering defocus effects. Departing from this, we introduce DOF-GS --- a new framework that combines Gaussian Splatting with a finite-aperture camera model and explicit, differentiable defocus rendering, enabling it to function as a post-capture control tool. DOF-GS enables dynamic depth-of-field (DOF) adjustment through on-demand post-capture aperture and focal distance control for the first time, to the best of our knowledge. By using multi-view images with moderate defocus blur as input, our framework learns inherent camera characteristics and reconstructs sharp details of the underlying scene, in particular enabling rendering with varying DOF effects after capture and optimization. Additionally, our framework extracts circle-of-confusion cues during optimization to identify in-focus regions in input views, enhancing the reconstructed 3D scene details. Experimental results demonstrate that DOF-GS supports post-capture refocusing, adjustable defocus, and high-quality all-in-focus rendering from multi-view images with uncalibrated defocus blur.
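For intuition, a small sketch of the thin-lens circle-of-confusion implied by a finite-aperture camera model, the kind of defocus cue DOF-GS exploits to identify in-focus regions; the lens parameters below are arbitrary example values.

```python
import numpy as np

def circle_of_confusion(depth, focus_dist, focal_len, aperture):
    """CoC diameter (same units as focal_len) for points at `depth` (thin-lens model)."""
    return aperture * np.abs(depth - focus_dist) / depth * focal_len / (focus_dist - focal_len)

depths = np.array([0.5, 1.0, 2.0, 8.0])          # metres
coc = circle_of_confusion(depths, focus_dist=1.0, focal_len=0.05, aperture=0.01)
print(coc)  # ~0 at the focal plane, growing away from it
```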
Poster
Jingzhi Li · Zongwei Wu · Eduard Zamfir · Radu Timofte
[ ExHall D ]
Abstract
Accurate 3D object relighting in diverse unseen environments is crucial for realistic virtual object placement. Due to the albedo-lighting ambiguity, existing methods often fall short in producing faithful relights. Without proper constraints, observed training views can be explained by numerous combinations of lighting and material attributes, lacking physical correspondence with the actual environment maps used for relighting. In this work, we present ReCap, treating cross-environment captures as a multi-task target to provide the missing supervision that cuts through the entanglement. Specifically, ReCap jointly optimizes multiple lighting representations that share a common set of material attributes. This naturally harmonizes a coherent set of lighting representations around the mutual material attributes, exploiting commonalities and differences across varied object appearances. Such coherence enables physically sound lighting reconstruction and robust material estimation — both essential for accurate relighting. Together with a streamlined shading function and effective post-processing, ReCap outperforms the leading competitor by 3.4 dB in PSNR on an expanded relighting benchmark.
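A minimal sketch of the multi-task idea above: several per-environment lighting parameters are optimized jointly while sharing a single set of material attributes, so every capture constrains the same material. The Lambertian-style shading here is only a stand-in for the paper's streamlined shading function.

```python
import torch

n_envs, n_pts = 3, 1024
albedo = torch.rand(n_pts, 3, requires_grad=True)                     # shared material
lights = [torch.rand(3, requires_grad=True) for _ in range(n_envs)]   # per-capture lighting
targets = [torch.rand(n_pts, 3) for _ in range(n_envs)]               # observed colors per environment

opt = torch.optim.Adam([albedo, *lights], lr=1e-2)
for step in range(200):
    # every environment constrains the same albedo, disentangling it from lighting
    loss = sum(((albedo * l).clamp(0, 1) - t).pow(2).mean()
               for l, t in zip(lights, targets))
    opt.zero_grad()
    loss.backward()
    opt.step()
print(float(loss))
```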
Poster
Yue Fan · Ningjing Fan · Ivan Skorokhodov · Oleg Voynov · Savva Ignatyev · Evgeny Burnaev · Peter Wonka · Yiqun Wang
[ ExHall D ]
Abstract
We develop a method that recovers the surface, materials, and illumination of a scene from its posed multi-view images. In contrast to prior work, it does not require any additional data and can handle glossy objects or bright lighting. It is a progressive inverse rendering approach, which consists of three stages. In the first stage, we reconstruct the scene radiance and signed distance function (SDF) with a novel regularization strategy for specular reflections. We propose to explain a pixel color using both surface and volume rendering jointly, which allows for handling complex view-dependent lighting effects for surface reconstruction. In the second stage, we distill light visibility and indirect illumination from the learned SDF and radiance field using learnable mapping functions. Finally, we design a method for estimating the ratio of incoming direct light reflected in a specular manner and use it to reconstruct the materials and direct illumination. Experimental results demonstrate that the proposed method outperforms the current state-of-the-art in recovering surfaces, materials, and lighting without relying on any additional data.
Poster
Cheng-De Fan · Chen-Wei Chang · Yi-Ruei Liu · Jie-Ying Lee · Jiun-Long Huang · Yu-Chee Tseng · Yu-Lun Liu
[ ExHall D ]
Abstract
We present SpectroMotion, a novel approach that combines 3D Gaussian Splatting (3DGS) with physically-based rendering (PBR) and deformation fields to reconstruct dynamic specular scenes. Previous methods extending 3DGS to model dynamic scenes have struggled to represent specular surfaces accurately. Our method addresses this limitation by introducing a residual correction technique for accurate surface normal computation during deformation, complemented by a deformable environment map that adapts to time-varying lighting conditions. We implement a coarse-to-fine training strategy significantly enhancing scene geometry and specular color prediction. It is the only existing 3DGS method capable of synthesizing photorealistic real-world dynamic specular scenes, outperforming state-of-the-art methods in rendering complex, dynamic, and specular scenes.
Poster
Xingyu Chen · Zihao Feng · Kun Qian · Xinyu Zhang
[ ExHall D ]
Abstract
Radio frequency (RF) propagation modeling poses unique electromagnetic simulation challenges. While recent neural representations have shown success in visible spectrum rendering, the fundamentally different scales and physics of RF signals require novel modeling paradigms. In this paper, we introduce RFScape, a novel framework that bridges the gap between neural scene representation and RF propagation modeling. Our key insight is that complex RF-object interactions can be captured through object-centric neural representations while preserving the composability of traditional ray tracing. Unlike previous approaches that either rely on crude geometric approximations or require dense spatial sampling of entire scenes, RFScape learns per-object electromagnetic properties and enables flexible scene composition. Through extensive evaluation on real-world RF testbeds, we demonstrate that our approach achieves 13 dB improvement over conventional ray tracing and 5 dB over state-of-the-art neural baselines in modeling accuracy, while requiring only sparse training samples.
Poster
You Wang · Li Fang · Hao Zhu · Fei Hu · Long Ye · Zhan Ma
[ ExHall D ]
Abstract
Neural Radiance Fields (NeRF) have transformed novel view synthesis by modeling scene-specific volumetric representations directly from images. While generalizable NeRF models can generate novel views across unknown scenes by learning latent ray representations, their performance heavily depends on a large number of multi-view observations. However, with limited input views, these methods experience significant degradation in rendering quality. To address this limitation, we propose GoLF-NRT: a Global and Local feature Fusion-based Neural Rendering Transformer. GoLF-NRT enhances generalizable neural rendering from few input views by leveraging a 3D transformer with efficient sparse attention to capture global scene context. In parallel, it integrates local geometric features extracted along the epipolar line, enabling high-quality scene reconstruction from as few as 1 to 3 input views. Furthermore, we introduce an adaptive sampling strategy based on attention weights and kernel regression, improving the accuracy of transformer-based neural rendering. Extensive experiments on public datasets show that GoLF-NRT achieves state-of-the-art performance across varying numbers of input views, highlighting the effectiveness and superiority of our approach. We will open-source our code upon the paper's acceptance.
Poster
Jan Held · Renaud Vandeghen · Abdullah J Hamdi · Anthony Cioppa · Adrien Deliege · Silvio Giancola · Andrea Vedaldi · Bernard Ghanem · Marc Van Droogenbroeck
[ ExHall D ]
Abstract
Recent advances in radiance field reconstruction, such as 3D Gaussian Splatting (3DGS), have achieved high-quality novel view synthesis and fast rendering by representing scenes with compositions of Gaussian primitives. However, 3D Gaussians present several limitations for scene reconstruction. Accurately capturing hard edges is challenging without significantly increasing the number of Gaussians, creating a large memory footprint. Moreover, they struggle to represent flat surfaces, as they are diffused in space. Without hand-crafted regularizers, they tend to disperse irregularly around the actual surface. To circumvent these issues, we introduce a novel method, named 3D Convex Splatting (3DCS), which leverages 3D smooth convexes as primitives for modeling geometrically-meaningful radiance fields from multi-view images. Smooth convex shapes offer greater flexibility than Gaussians, allowing for a better representation of 3D scenes with hard edges and dense volumes using fewer primitives. Powered by our efficient CUDA-based rasterizer, 3DCS achieves superior performance over 3DGS on benchmarks such as Mip-NeRF360, Tanks and Temples, and Deep Blending. Specifically, our method attains an improvement of up to 0.81 in PSNR and 0.026 in LPIPS compared to 3DGS while maintaining high rendering speeds and reducing the number of required primitives. Our results highlight the potential of 3D Convex Splatting to become …
Poster
Stefano Esposito · Anpei Chen · Christian Reiser · Samuel Rota Bulò · Lorenzo Porzi · Katja Schwarz · Christian Richardt · Michael Zollhoefer · Peter Kontschieder · Andreas Geiger
[ ExHall D ]
Abstract
High-quality real-time view synthesis methods are based on volume rendering, splatting, or surface rendering. While surface-based methods generally are the fastest, they cannot faithfully model fuzzy geometry like hair. In turn, alpha-blending techniques excel at representing fuzzy materials but require an unbounded number of samples per ray (P1). Further overheads are induced by empty space skipping in volume rendering (P2) and sorting input primitives in splatting (P3). We present a novel representation for real-time view synthesis where the (P1) number of sampling locations is small and bounded, (P2) sampling locations are efficiently found via rasterization, and (P3) rendering is sorting-free. We achieve this by representing objects as semi-transparent multi-layer meshes, rendered in fixed order. First, we model surface layers as SDF shells with optimal spacing learned during training. Then, we bake them as meshes and fit UV textures. Unlike single-surface methods, our multi-layer representation effectively models fuzzy objects. In contrast to volume-based and splatting-based methods, our approach enables real-time rendering on low-cost smartphones.
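A sketch of the sorting-free compositing that a fixed, bounded stack of semi-transparent layers allows: layers are blended front to back in a fixed order, so no per-ray sorting is required. The four random RGBA layers are synthetic placeholders.

```python
import numpy as np

def composite_front_to_back(layers_rgba):
    """layers_rgba: (L, H, W, 4) in fixed front-to-back order."""
    h, w = layers_rgba.shape[1:3]
    color = np.zeros((h, w, 3))
    transmittance = np.ones((h, w, 1))
    for rgba in layers_rgba:                       # small, bounded number of samples per ray
        rgb, alpha = rgba[..., :3], rgba[..., 3:4]
        color += transmittance * alpha * rgb
        transmittance *= (1.0 - alpha)
    return color

layers = np.random.rand(4, 8, 8, 4)
print(composite_front_to_back(layers).shape)       # (8, 8, 3)
```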
Poster
Shu Wang · Yanbo Gao · Shuai Li · Chong Lv · Xun Cai · chuankun Li · Hui Yuan · jinglin zhang
[ ExHall D ]
Abstract
This paper presents MetricGrids, a novel grid-based neural representation that combines elementary metric grids in various metric spaces to approximate complex nonlinear signals. While grid-based representations are widely adopted for their efficiency and scalability, existing feature grids with linear indexing of continuous-space points can only provide degenerate linear latent-space representations, which cannot be adequately compensated by the subsequent compact decoder to represent complex nonlinear signals. To address this problem while keeping the simplicity of a regular grid structure, our approach builds upon the standard grid-based paradigm by constructing multiple elementary metric grids as high-order terms to approximate complex nonlinearities, following the Taylor expansion principle. Furthermore, we enhance model compactness with hash encoding based on the different sparsities of the grids to prevent detrimental hash collisions, and a high-order extrapolation decoder to reduce explicit grid storage requirements. Experimental results on both 2D and 3D reconstructions demonstrate the superior fitting and rendering accuracy of the proposed method across diverse signal types, validating its robustness and generalizability.
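A hedged illustration of combining elementary grids as successive Taylor-like terms: features sampled from each additional grid are raised to increasing powers before a small decoder. This mirrors only the stated principle; the actual metric spaces, sparsity-aware hashing, and extrapolation decoder of MetricGrids are not reproduced here.

```python
import torch
import torch.nn.functional as F

RES, C, ORDER = 32, 4, 3
grids = [torch.randn(1, C, RES, RES) for _ in range(ORDER)]   # one elementary grid per order
decoder = torch.nn.Linear(ORDER * C, 1)

def sample(grid, xy):                      # xy in [-1, 1], shape (N, 2)
    g = xy.view(1, -1, 1, 2)
    return F.grid_sample(grid, g, align_corners=True).view(C, -1).t()

xy = torch.rand(256, 2) * 2 - 1
# k-th grid contributes a (k+1)-th order term, loosely following a Taylor expansion
feats = torch.cat([sample(g, xy) ** (k + 1) for k, g in enumerate(grids)], dim=-1)
value = decoder(feats)                     # approximated nonlinear signal
print(value.shape)                         # (256, 1)
```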
Poster
Xiangjun Gao · Xiaoyu Li · Yiyu Zhuang · Qi Zhang · Wenbo Hu · Chaopeng Zhang · Yao Yao · Ying Shan · Long Quan
[ ExHall D ]
Abstract
Neural 3D representations such as Neural Radiance Fields (NeRFs), excel at producing photo-realistic rendering results but lack the flexibility for manipulation and editing which is crucial for content creation. Previous works have attempted to address this issue by deforming a NeRF in canonical space or manipulating the radiance field based on an explicit mesh. However, manipulating NeRF is not highly controllable and requires a long training and inference time. With the emergence of 3D Gaussian Splatting (3DGS), extremely high-fidelity novel view synthesis can be achieved using an explicit point-based 3D representation with much faster training and rendering speed. However, there is still a lack of effective means to manipulate 3DGS freely while maintaining rendering quality. In this work, we aim to tackle the challenge of achieving manipulable photo-realistic rendering. We propose to utilize a triangular mesh to manipulate 3DGS directly with self-adaptation. This approach reduces the need to design various algorithms for different types of Gaussian manipulation. By utilizing a triangle shape-aware Gaussian binding and adapting method, we can achieve 3DGS manipulation and preserve high-fidelity rendering after manipulation. Our approach is capable of handling large deformations, local manipulations, and even physics simulations while keeping high-quality rendering. Furthermore, we demonstrate that …
Poster
Dana Cohen-Bar · Daniel Cohen-Or · Gal Chechik · Yoni Kasten
[ ExHall D ]
Abstract
As 3D content creation continues to grow, transferring semantic textures between 3D meshes remains a significant challenge in computer graphics. While recent methods leverage text-to-image diffusion models for texturing, they often struggle to preserve the appearance of the source texture during texture transfer. We present TriTex, a novel approach that learns a volumetric texture field from a single textured mesh by mapping semantic features to surface colors. Using an efficient triplane-based architecture, our method enables semantic-aware texture transfer to a novel target mesh. Despite training on just one example, it generalizes effectively to diverse shapes within the same category. Extensive evaluation on our newly created benchmark dataset shows that TriTex achieves superior texture transfer quality and fast inference times compared to existing methods. Our approach advances single-example texture transfer, providing a practical solution for maintaining visual coherence across related 3D models in applications like game development and simulation.
Poster
Armin Shafiee Sarvestani · Sheyang Tang · Zhou Wang
[ ExHall D ]
Abstract
Mesh quality assessment (MQA) models play a critical role in the design, optimization, and evaluation of mesh operation systems in a wide variety of applications. Current MQA models, whether model-based methods using topology-aware features or projection-based approaches working on rendered 2D projections, often fail to capture the intricate interactions between texture and 3D geometry. We introduce HybridMQA, a first-of-its-kind hybrid full-reference colored MQA framework that integrates model-based and projection-based approaches, capturing complex interactions between textural information and 3D structures for enriched quality representations. Our method employs graph learning to extract detailed 3D representations, which are then projected to 2D using a novel feature rendering process that precisely aligns them with colored projections. This enables the exploration of geometry-texture interactions via cross-attention, producing comprehensive mesh quality representations. Extensive experiments demonstrate HybridMQA’s superior performance across diverse datasets, highlighting its ability to effectively leverage geometry-texture interactions for a thorough understanding of mesh quality. Our implementation will be made publicly available.
Poster
Xiang Feng · Chang Yu · Zoubin Bi · Yintong Shang · Feng Gao · Hongzhi Wu · Kun Zhou · Chenfanfu Jiang · Yin Yang
[ ExHall D ]
Abstract
Recent image-to-3D reconstruction models have greatly advanced geometry generation, but they still struggle to faithfully generate realistic appearance. To address this, we introduce ARM, a novel method that reconstructs high-quality 3D meshes and realistic appearance from sparse-view images. The core of ARM lies in decoupling geometry from appearance, processing appearance within the UV texture space. Unlike previous methods, ARM improves texture quality by explicitly back-projecting measurements onto the texture map and processing them in a UV space module with a global receptive field. To resolve ambiguities between material and illumination in input images, ARM introduces a material prior that encodes semantic appearance information, enhancing the robustness of appearance decomposition. Trained on just 8 H100 GPUs, ARM outperforms existing methods both quantitatively and qualitatively.
Poster
Jing Li · Yihang Fu · Falai Chen
[ ExHall D ]
Abstract
Boundary representation (B-rep) of geometric models is a fundamental format in Computer-Aided Design (CAD). However, automatically generating valid and high-quality B-rep models remains challenging due to the complex interdependence between the topology and geometry of the models. Existing methods tend to prioritize geometric representation while giving insufficient attention to topological constraints, making it difficult to maintain structural validity and geometric accuracy. In this paper, we propose DTGBrepGen, a novel topology-geometry decoupled framework for B-rep generation that explicitly addresses both aspects. Our approach first generates valid topological structures through a two-stage process that independently models edge-face and edge-vertex adjacency relationships. Subsequently, we employ Transformer-based diffusion models for sequential geometry generation, progressively generating vertex coordinates, followed by edge geometries and face geometries which are represented as B-splines. Extensive experiments on diverse CAD datasets show that DTGBrepGen significantly outperforms existing methods in both topological validity and geometric accuracy, achieving higher validity rates and producing more diverse and realistic B-reps.
Poster
Yuan Li · Cheng Lin · Yuan Liu · Xiaoxiao Long · Chenxu Zhang · Ningna Wang · Xin Li · Wenping Wang · Xiaohu Guo
[ ExHall D ]
Abstract
The field of diffusion-based 3D generation has experienced tremendous progress in recent times. However, existing 3D generative models often produce overly dense and unstructured meshes, which are in stark contrast to the compact, structured and clear-edged CAD models created by human modelers. We introduce CADDreamer, a novel method for generating CAD objects from a single image. This method proposes a primitive-aware multi-view diffusion model, which perceives both local geometry and high-level structural semantics during the generation process. We encode primitive semantics into the color domain, and enforce the strong priors in pre-trained diffusion models to align with the well-defined primitives. As a result, we can infer multi-view normal maps and semantic maps from a single image, thereby reconstructing a mesh with primitive labels. Correspondingly, we propose a set of fitting and optimization methods to deal with the inevitable noise and distortion in generated primitives, ultimately producing a complete and seamless Boundary Representation (B-rep) of a Computer-Aided Design (CAD) model. Experimental results demonstrate that our method can effectively recover high-quality CAD objects from single-view images. Compared to existing 3D generation methods, the models produced by CADDreamer are compact in representation, clear in structure, sharp in boundaries, and watertight in topology.
Poster
Yiftach Edelstein · Or Patashnik · Dana Cohen-Bar · Lihi Zelnik-Manor
[ ExHall D ]
Abstract
Advancements in text-to-image diffusion models have led to significant progress in fast 3D content creation. One common approach is to generate a set of multi-view images of an object, and then reconstruct it into a 3D model. However, this approach bypasses the use of a native 3D representation of the object and is hence prone to geometric artifacts and limited in controllability and manipulation capabilities. An alternative approach involves native 3D generative models that directly produce 3D representations. These models, however, are typically limited in their resolution, resulting in lower quality 3D objects. In this work, we bridge the quality gap between methods that directly generate 3D representations and ones that reconstruct 3D objects from multi-view images. We introduce a multi-view to multi-view diffusion model called Sharp-It, which takes a 3D consistent set of multi-view images rendered from a low-quality object and enriches its geometric details and texture. The diffusion model operates on the multi-view set in parallel, in the sense that it shares features across the generated views. A high-quality 3D model can then be reconstructed from the enriched multi-view set. By leveraging the advantages of both 2D and 3D approaches, our method offers an efficient and controllable method …
Poster
Jianfeng XIANG · Zelong Lv · Sicheng Xu · Yu Deng · Ruicheng Wang · Bowen Zhang · Dong Chen · Xin Tong · Jiaolong Yang
[ ExHall D ]
Abstract
We introduce a novel 3D generation method for versatile and high-quality 3D asset creation. The cornerstone is a unified Structured LATent (SLAT) representation which allows decoding to different output formats, such as Radiance Fields, 3D Gaussians, and meshes. This is achieved by integrating a sparsely-populated 3D grid with dense multiview visual features extracted from a powerful vision foundation model, comprehensively capturing both structural (geometry) and textural (appearance) information while maintaining flexibility during decoding. We employ rectified flow transformers tailored for SLAT as our 3D generation models and train models with up to 2 billion parameters on a large 3D asset dataset of 500K diverse objects. Our model generates high-quality results with text or image conditions, significantly surpassing existing methods, including recent ones at similar scales. We showcase flexible output format selection and local 3D editing capabilities which were not offered by previous models. Code, model, and data will be released.
Poster
Xingyi Yang · Songhua Liu · Xinchao Wang
[ ExHall D ]
Abstract
The quality of 3D generative modeling has been notably improved by the adoption of 2D diffusion models. Despite this progress, the cumbersome optimization process per se presents a critical problem for efficiency. In this paper, we introduce Hash3D, a universal acceleration for 3D score distillation sampling (SDS) without model training. Central to Hash3D is the observation that images rendered from similar camera positions and diffusion time-steps often have redundant feature maps. By hashing and reusing these feature maps across nearby timesteps and camera angles, Hash3D eliminates unnecessary calculations. We implement this through an adaptive grid-based hashing. As a result, it largely speeds up the process of 3D generation. Surprisingly, this feature-sharing mechanism not only makes generation faster but also improves the smoothness and view consistency of the synthesized 3D objects. Our experiments covering 5 text-to-3D and 3 image-to-3D models demonstrate Hash3D's versatility to speed up optimization, enhancing efficiency by 1.5∼4×. Additionally, Hash3D's integration with 3D Gaussian splatting largely speeds up 3D model creation, reducing text-to-3D conversion to about 10 minutes and image-to-3D conversion to 30 seconds.
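A minimal sketch of the caching idea behind Hash3D: quantize the (camera pose, timestep) query into grid cells and reuse a previously computed feature for nearby queries instead of recomputing it. The cell sizes and the stand-in expensive_features call are illustrative assumptions, not the paper's adaptive hashing scheme.

```python
import math

cache = {}

def grid_key(camera_azimuth_deg, camera_elev_deg, timestep,
             angle_cell=10.0, t_cell=50):
    # nearby camera angles and timesteps fall into the same cell
    return (int(camera_azimuth_deg // angle_cell),
            int(camera_elev_deg // angle_cell),
            int(timestep // t_cell))

def expensive_features(azimuth, elevation, timestep):
    # stand-in for the costly diffusion feature computation
    return math.sin(azimuth) + math.cos(elevation) + timestep * 1e-3

def cached_features(azimuth, elevation, timestep):
    key = grid_key(azimuth, elevation, timestep)
    if key not in cache:                                  # compute only once per cell
        cache[key] = expensive_features(azimuth, elevation, timestep)
    return cache[key]

print(cached_features(42.0, 15.0, 430))
print(cached_features(44.5, 13.2, 410))   # falls in the same cell -> reused
```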
Poster
Trong-Tung Nguyen · Quang Nguyen · Khoi Nguyen · Anh Tran · Cuong Pham
[ ExHall D ]
Abstract
Recent advances in text-guided image editing enable users to perform image edits through simple text inputs, leveraging the extensive priors of multi-step diffusion-based text-to-image models. However, these methods often fall short of the speed demands required for real-world and on-device applications due to the costly multi-step inversion and sampling process involved. In response, we introduce SwiftEdit, a simple yet highly efficient editing tool that achieves instant text-guided image editing (in 0.23s). The advancement of SwiftEdit lies in its two novel contributions: a one-step inversion framework that enables one-step image reconstruction via inversion, and a mask-guided editing technique with our proposed attention rescaling mechanism to perform localized image editing. Extensive experiments are provided to demonstrate the effectiveness and efficiency of SwiftEdit. In particular, SwiftEdit enables instant text-guided image editing that is dramatically faster than previous multi-step methods (at least 50 times faster) while maintaining competitive performance in editing results.
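A hedged sketch of one way attention rescaling can localize edits: cross-attention weights toward the edited text tokens are boosted inside a spatial mask and suppressed outside, then renormalized. This is a generic illustration, not SwiftEdit's exact mechanism; the boost and suppression factors are arbitrary.

```python
import torch

def rescale_attention(attn, mask, edit_token_ids, boost=2.0, suppress=0.2):
    """attn: (heads, HW, T) cross-attention; mask: (HW,) edit region in {0, 1}."""
    scale = torch.where(mask.bool(), torch.tensor(boost), torch.tensor(suppress))
    attn = attn.clone()
    attn[:, :, edit_token_ids] *= scale.view(1, -1, 1)     # strengthen edit tokens inside the mask
    return attn / attn.sum(dim=-1, keepdim=True)           # renormalize over text tokens

attn = torch.rand(8, 64, 10)                               # 8 heads, 8x8 latents, 10 text tokens
mask = (torch.rand(64) > 0.5).float()
print(rescale_attention(attn, mask, edit_token_ids=[3, 4]).shape)
```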
Poster
Weiran Guang · Xiaoguang Gu · Mengqi Huang · Zhendong Mao
[ ExHall D ]
Abstract
Interactive drag editing of images is a valuable task that has gained considerable attention for its precision and controllability. However, existing approaches have primarily focused on manipulating the shape or movement of objects in the 2D plane. We propose to extend this drag-based editing task to 3D space. Firstly, we utilize the trajectory of two points to represent the rotational trajectory of the object. Gaussian maps of a circle and a square are centered at these two points, respectively; we use distinct shapes to ensure that symmetric views produce different object representations. Secondly, we introduce a lightweight mapping network to embed the object features into the two Gaussian maps, obtaining a continuous control condition that guides the model in learning the correspondence between the trajectory and the object. Finally, to overcome the limitations of current 3D object reconstruction datasets, which typically consist of object maps with transparent backgrounds, we affix random backgrounds to them. This modification helps improve the model's ability to ignore background interference when editing real images with complex backgrounds. Experiments show that our approach successfully achieves object rotation within the drag framework and generalizes well to real-world images.
Poster
Yihua Huang · Mingxian Lin · Yangtian Sun · Ziyi Yang · Xiaoyang Lyu · Yan-Pei Cao · Xiaojuan Qi
[ ExHall D ]
Abstract
Recently, Gaussian splatting has emerged as a robust technique for representing 3D scenes, enabling real-time rasterization and high-fidelity rendering. However, Gaussians' inherent radial symmetry and smoothness constraints limit their ability to represent complex shapes, often requiring thousands of primitives to approximate detailed geometry. We introduce Deformable Radial Kernel (DRK), which extends Gaussian splatting into a more general and flexible framework. Through learnable radial bases with adjustable angles and scales, DRK efficiently models diverse shape primitives while enabling precise control over edge sharpness and boundary curvature. Given DRK's planar nature, we further develop accurate ray-primitive intersection computation for depth sorting and introduce efficient kernel culling strategies for improved rasterization efficiency. Extensive experiments demonstrate that DRK outperforms existing methods in both representation efficiency and rendering quality, achieving state-of-the-art performance while dramatically reducing primitive count.
Poster
Hyojun Go · byeongjun park · Jiho Jang · Jin-Young Kim · Soonwoo Kwon · Changick Kim
[ ExHall D ]
Abstract
Text-based generation and editing of 3D scenes hold significant potential for streamlining content creation through intuitive user interactions. While recent advances leverage 3D Gaussian Splatting (3DGS) for high-fidelity and real-time rendering, existing methods are often specialized and task-focused, lacking a unified framework for both generation and editing. In this paper, we introduce SplatFlow, a comprehensive framework that addresses this gap by enabling direct 3DGS generation and editing. SplatFlow comprises two main components: a multi-view rectified flow (RF) model and a Gaussian Splatting Decoder (GSDecoder). The multi-view RF model operates in latent space, generating multi-view images, depths, and camera poses simultaneously, conditioned on text prompts—thus addressing challenges like diverse scene scales and complex camera trajectories in real-world settings. Then, the GSDecoder efficiently translates these latent outputs into 3DGS representations through a feed-forward 3DGS method. Leveraging training-free inversion and inpainting techniques, SplatFlow enables seamless 3DGS editing and supports a broad range of 3D tasks—including object editing, novel view synthesis, and camera pose estimation—within a unified framework without requiring additional complex pipelines. We validate SplatFlow's capabilities on the MVImgNet and DL3DV-7K datasets, demonstrating its versatility and effectiveness in various 3D generation, editing, and inpainting-based tasks.
Poster
Alex Hanson · Allen Tu · Geng Lin · Vasu Singla · Matthias Zwicker · Tom Goldstein
[ ExHall D ]
Abstract
3D Gaussian Splatting (3D-GS) is a recent 3D scene reconstruction technique that enables real-time rendering of novel views by modeling scenes as parametric point clouds of differentiable 3D Gaussians. However, its rendering speed and model size still present bottlenecks, especially in resource-constrained settings. In this paper, we identify and address two key inefficiencies in 3D-GS, achieving substantial improvements in rendering speed, model size, and training time. First, we optimize the rendering pipeline to precisely localize Gaussians in the scene, boosting rendering speed without altering visual fidelity. Second, we introduce a novel pruning technique and integrate it into the training pipeline, significantly reducing model size and training time while further raising rendering speed. Our Speedy-Splat approach combines these techniques to accelerate average rendering speed by a drastic 6.71× across scenes from the Mip-NeRF 360, Tanks & Temples, and Deep Blending datasets with 10.6× fewer primitives than 3D-GS.
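As a rough illustration of how pruning reduces primitive count, the sketch below drops Gaussians whose opacity falls under a threshold; this is a generic stand-in under our own assumptions, not Speedy-Splat's actual pruning criterion.

```python
# A generic pruning pass over Gaussian parameters, keeping only primitives
# whose opacity (or any importance score) exceeds a threshold.
import numpy as np

def prune_gaussians(means, opacities, scales, threshold=0.01):
    keep = opacities > threshold
    return means[keep], opacities[keep], scales[keep]

means = np.random.randn(1000, 3)
opacities = np.random.rand(1000)
scales = np.abs(np.random.randn(1000, 3))
means, opacities, scales = prune_gaussians(means, opacities, scales, threshold=0.3)
print(len(means), "primitives kept")
```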
Poster
Jinguang Tong · Xuesong li · Fahira Afzal Maken · Sundaram Muthu · Lars Petersson · Chuong Nguyen · Hongdong Li
[ ExHall D ]
Abstract
3D modeling of highly reflective objects remains challenging due to strong view-dependent appearances. While previous SDF-based methods can recover high-quality meshes, they are often time-consuming and tend to produce over-smoothed surfaces. In contrast, 3D Gaussian Splatting (3DGS) offers the advantage of high speed and detailed real-time rendering, but extracting surfaces from the Gaussians can be noisy due to the lack of geometric constraints. To bridge the gap between these approaches, we propose a novel reconstruction method called GS-2DGS for reflective objects based on 2D Gaussian Splatting (2DGS). Our approach combines the rapid rendering capabilities of Gaussian Splatting with additional geometric information from a foundation model. Experimental results on synthetic and real datasets demonstrate that our method significantly outperforms Gaussian-based techniques in terms of reconstruction and relighting and achieves performance comparable to SDF-based methods while being an order of magnitude faster.
Poster
Yiyang Shen · Kun Zhou · He Wang · Yin Yang · Tianjia Shao
[ ExHall D ]
Abstract
Recently, single-view 3D generation via Gaussian splatting has emerged and developed quickly. These methods learn 3D Gaussians from 2D RGB images generated by pre-trained multi-view diffusion (MVD) models, and have shown a promising avenue for 3D generation from a single image. Despite the current progress, they still suffer from inconsistency jointly caused by the geometric ambiguity in the 2D images and the lack of structure of 3D Gaussians, leading to distorted and blurry 3D object generation. In this paper, we propose to fix these issues with GS-RGBN, a new RGBN-volume Gaussian Reconstruction Model designed to generate high-fidelity 3D objects from single-view images. Our key insight is that a structured 3D representation can simultaneously mitigate the two aforementioned issues. To this end, we propose a novel hybrid Voxel-Gaussian representation, where a 3D voxel representation contains explicit 3D geometric information, eliminating the geometric ambiguity from 2D images. It also structures the Gaussians during learning so that the optimization tends to find better local optima. Our 3D voxel representation is obtained by a fusion module that aligns RGB features and surface normal features, both of which can be estimated from 2D images. Extensive experiments demonstrate the superiority of our methods over prior works in …
Poster
Yifan Liu · Keyu Fan · Weihao Yu · Chenxin Li · Hao Lu · Yixuan Yuan
[ ExHall D ]
Abstract
Recent advances in generalizable 3D Gaussian Splatting have demonstrated promising results in real-time high-fidelity rendering without per-scene optimization, yet existing approaches still struggle to handle unfamiliar visual content during inference on novel scenes due to limited generalizability. To address this challenge, we introduce MonoSplat, a novel framework that leverages rich visual priors from pre-trained monocular depth foundation models for robust Gaussian reconstruction. Our approach consists of two key components working in harmony: a Mono-Multi Feature Adapter that transforms monocular features into cross-view-aware multi-view representations, coupled with an Integrated Gaussian Prediction module that effectively fuses both feature types for precise Gaussian generation. Through the Adapter's lightweight attention mechanism, features are seamlessly aligned and aggregated across views while preserving valuable monocular priors, enabling the Prediction module to generate Gaussian primitives with accurate geometry and appearance. Through extensive experiments on diverse real-world datasets, we convincingly demonstrate that MonoSplat achieves superior reconstruction quality and generalization capability compared to existing methods, while maintaining computational efficiency with minimal trainable parameters. We will make our codes and models publicly available.
Poster
Han Zhou · Wei Dong · Jun Chen
[ ExHall D ]
Abstract
Directly employing 3D Gaussian Splatting (3DGS) on images with adverse illumination conditions exhibits considerable difficulty in achieving high-quality, normally-exposed representations because: (1) the limited Structure from Motion (SfM) points estimated in adverse illumination scenarios fail to capture sufficient scene details; (2) without ground-truth references, the intensive information loss, heavy noise, and color distortion pose substantial challenges for 3DGS to produce high-quality results; (3) combining existing exposure correction methods with 3DGS cannot achieve satisfactory performance, since their per-image enhancement process leads to illumination inconsistency between enhanced images from different viewpoints. To address these issues, we propose LITA-GS, a novel illumination-agnostic novel view synthesis method via reference-free 3DGS and physical priors. Firstly, we introduce an illumination-invariant physical prior extraction pipeline. Secondly, based on the extracted robust spatial structure prior, we develop a lighting-agnostic structure rendering strategy, which facilitates the optimization of the scene structure and object appearance. Moreover, a progressive denoising module is introduced to effectively suppress the noise within the light-invariant representation. We adopt an unsupervised strategy for the training of LITA-GS, and extensive experiments demonstrate that LITA-GS surpasses the state-of-the-art (SOTA) NeRF-based method by 1.7 dB in PSNR and 0.09 in SSIM while enjoying faster …
Poster
Zheng Chen · Chenming Wu · Zhelun Shen · Chen Zhao · Weicai Ye · Haocheng Feng · Errui Ding · Song-Hai Zhang
[ ExHall D ]
Abstract
Wide-baseline panoramic images are commonly used in applications such as VR and simulation rendering to reduce network bandwidth and storage requirements. However, synthesizing novel views from these panoramic images in real time remains a significant challenge, especially due to the high resolution and inherent distortions of panoramic imagery. Although existing 3D Gaussian splatting (3DGS) methods can produce photo-realistic views under narrow baselines, they often overfit the training views when dealing with wide-baseline panoramic images due to the difficulty of learning precise geometry from sparse 360-degree views. This paper presents Splatter-360, a novel end-to-end generalizable 3DGS framework specifically designed to handle wide-baseline panoramic images. Unlike previous approaches, Splatter-360 performs multi-view matching directly in the spherical domain by constructing a spherical cost volume through a spherical sweep algorithm, enhancing the network's depth perception and geometry estimation. Additionally, we introduce a 3D-aware bi-projection encoder to mitigate the distortions inherent in panoramic images and integrate cross-view attention to improve feature interactions across multiple viewpoints. This enables robust 3D-aware feature representations and real-time rendering capabilities. Experimental results on the HM3D and Replica datasets demonstrate that Splatter-360 significantly outperforms state-of-the-art NeRF and 3DGS methods (e.g., PanoGRF, MVSplat, DepthSplat, and HiSplat) in both synthesis quality and generalization performance …
Poster
Hyunwoo Park · Gun Ryu · Wonjun Kim
[ ExHall D ]
Abstract
Recently, 3D Gaussian splatting (3DGS) has gained considerable attention in the field of novel view synthesis due to its fast performance and excellent image quality. However, 3DGS in sparse-view settings (e.g., three-view inputs) often faces the problem of overfitting to training views, which significantly degrades the visual quality of novel view images. Many existing approaches have tackled this issue by using strong priors, such as 2D generative contextual information and external depth signals. In contrast, this paper introduces a prior-free method, called DropGaussian, with simple changes to 3D Gaussian splatting. Specifically, we randomly remove Gaussians during the training process in a manner similar to dropout, which allows the non-excluded Gaussians to receive larger gradients while improving their visibility. This makes the remaining Gaussians contribute more to the optimization process for rendering with sparse input views. This simple operation effectively alleviates the overfitting problem and enhances the quality of novel view synthesis. By simply applying DropGaussian to the original 3DGS framework, we achieve performance competitive with existing prior-based 3DGS methods in sparse-view settings of benchmark datasets without any additional complexity.
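A minimal sketch of the dropout-style idea follows, under our own assumptions; in particular, the inverted-dropout-style opacity compensation is our choice and may differ from the paper's scheme.

```python
# A dropout-style mask over Gaussians during training: each iteration, a
# random subset is excluded from rendering so the remaining ones receive
# larger gradients and greater visibility.
import numpy as np

def drop_gaussians(opacities, drop_rate=0.1, rng=None):
    rng = rng or np.random.default_rng()
    keep_mask = rng.random(opacities.shape[0]) >= drop_rate
    effective_opacity = opacities.copy()
    effective_opacity[~keep_mask] = 0.0                  # excluded this iteration
    effective_opacity[keep_mask] /= (1.0 - drop_rate)    # compensation, analogous to inverted dropout
    return effective_opacity, keep_mask

opacities = np.random.rand(500)
eff, mask = drop_gaussians(opacities, drop_rate=0.2)
print(mask.sum(), "of", len(opacities), "Gaussians rendered this step")
```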
Poster
Dian Zheng · Cheng Zhang · Xiao-Ming Wu · Cao Li · Chengfei Lv · Jian-Fang Hu · Wei-Shi Zheng
[ ExHall D ]
Abstract
Generating 360-degree panoramas from a narrow field of view (NFoV) image is a promising computer vision task for Virtual Reality (VR) applications. Existing methods mostly assess the generated panoramas with InceptionNet- or CLIP-based metrics, which tend to capture image quality and are not suitable for evaluating distortion. In this work, we first propose a distortion-specific CLIP, named Distort-CLIP, to accurately evaluate panorama distortion, and we uncover the "visual cheating" phenomenon in previous works (i.e., the tendency to improve visual results by sacrificing distortion accuracy). This phenomenon arises because prior methods employ a single network to learn the distinct tasks of panorama distortion and content completion at once, which leads the model to prioritize optimizing the latter. To address this, we propose PanoDecouple, a decoupled diffusion model framework that separates panorama generation into distortion guidance and content completion, aiming to generate panoramas with both accurate distortion and visual appeal. Specifically, we design a DistortNet for distortion guidance by imposing a panorama-specific distortion prior and a modified condition registration mechanism, and a ContentNet for content completion by imposing perspective image information. Additionally, a distortion correction loss function with Distort-CLIP is introduced to constrain the distortion explicitly. The extensive experiments validate that …
Poster
Yucheng Mao · Boyang Wang · Nilesh Kulkarni · Jeong Joon Park
[ ExHall D ]
Abstract
The computer vision community has developed numerous techniques for digitally restoring true scene information from single-view degraded photographs, an important yet extremely ill-posed task. In this work, we tackle image restoration from a different perspective by jointly denoising multiple photographs of the same scene. Our core hypothesis is that degraded images capturing a shared scene contain complementary information that, when combined, better constrains the restoration problem. To this end, we implement a powerful multi-view diffusion model that jointly generates uncorrupted views by extracting rich information from multi-view relationships. Our experiments show that our multi-view approach outperforms existing single-view image and even video-based methods on image deblurring and super-resolution tasks. Critically, our model is trained to output 3D consistent images, making it a promising tool for applications requiring robust multi-view integration, such as 3D reconstruction or pose estimation.
Poster
Hao Wen · Zehuan Huang · Yaohui Wang · Xinyuan Chen · Lu Sheng
[ ExHall D ]
Abstract
Existing image-to-3D creation methods typically split the task into two individual stages, multi-view image generation and 3D reconstruction, leading to two main limitations: (1) in the multi-view generation stage, the generated images struggle to preserve 3D consistency; (2) in the 3D reconstruction stage, there is a domain gap between real training data and the generated multi-view inputs seen during inference. To address these issues, we propose Ouroboros3D, an end-to-end trainable framework that integrates multi-view generation and 3D reconstruction into a recursive diffusion process through a feedback mechanism. Our framework operates through iterative cycles, each consisting of a denoising process and a reconstruction step. By incorporating a 3D-aware feedback mechanism, our multi-view generative model uses explicit 3D geometric information (e.g., texture, position) fed back from the reconstruction results of the previous cycle as conditions, thus modeling consistency at the 3D geometric level. Furthermore, through joint training of the multi-view generative and reconstruction models, we alleviate the domain gap in the reconstruction stage and enable mutual enhancement within the recursive process. Experimental results demonstrate that Ouroboros3D outperforms methods that treat these stages separately and those that combine them only during inference, achieving superior multi-view consistency and producing 3D models with higher geometric realism.
Poster
Wenyuan Zhang · Yixiao Yang · Han Huang · Liang Han · Kanle Shi · Yu-Shen Liu · Zhizhong Han
[ ExHall D ]
Abstract
Monocular depth priors have been widely adopted by neural rendering in multi-view based tasks such as 3D reconstruction and novel view synthesis. However, due to inconsistent predictions across views, how to more effectively leverage monocular cues in a multi-view context remains a challenge. Current methods treat the entire estimated depth map indiscriminately and use it as ground-truth supervision, ignoring the inherent inaccuracy and cross-view inconsistency of monocular priors. To resolve these issues, we propose MonoInstance, a general approach that explores the uncertainty of monocular depths to provide enhanced geometric priors for neural rendering and reconstruction. Our key insight lies in aligning the depths of each segmented instance from multiple views within a common 3D space, thereby casting the uncertainty estimation of monocular depths into a density measure within noisy point clouds. For high-uncertainty areas where depth priors are unreliable, we further introduce a constraint term that encourages the projected instances to align with corresponding instance masks on nearby views. MonoInstance is a versatile strategy which can be seamlessly integrated into various multi-view neural rendering frameworks. Our experimental results demonstrate that MonoInstance significantly improves the performance of both reconstruction and novel view synthesis on various benchmarks.
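The density-as-uncertainty idea can be sketched as follows: depths of one instance are unprojected from several views into a shared frame, and local point density serves as a confidence proxy. Camera conventions, the neighborhood size, and function names are our assumptions, not the paper's implementation.

```python
# Sketch: unproject an instance's monocular depths from multiple views into a
# common 3D frame, then use inverse kNN distance (local density) as a proxy
# for cross-view consistency/confidence.
import numpy as np
from scipy.spatial import cKDTree

def unproject(depth, K, cam_to_world):
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    rays = np.linalg.inv(K) @ np.stack([u, v, np.ones_like(u)], 0).reshape(3, -1)
    pts_cam = rays * depth.reshape(1, -1)
    pts_h = np.vstack([pts_cam, np.ones((1, pts_cam.shape[1]))])
    return (cam_to_world @ pts_h)[:3].T

def density_confidence(points, k=8):
    tree = cKDTree(points)
    dists, _ = tree.query(points, k=k + 1)   # first neighbor is the point itself
    mean_knn = dists[:, 1:].mean(axis=1)
    return 1.0 / (mean_knn + 1e-6)           # denser -> more consistent -> higher confidence

K = np.array([[500.0, 0, 32], [0, 500.0, 32], [0, 0, 1]])
pts = np.concatenate([unproject(np.full((64, 64), 2.0), K, np.eye(4)) for _ in range(2)])
print(density_confidence(pts).shape)
```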
Poster
Bo Ji · Angela Yao
[ ExHall D ]
Abstract
Standard 3D Gaussian Splatting (3DGS) relies on known or pre-computed camera poses and a sparse point cloud, obtained from structure-from-motion (SfM) preprocessing, to initialize and grow 3D Gaussians. We propose a novel SfM-Free 3DGS (SFGS) method for video input, eliminating the need for known camera poses and SfM preprocessing. Our approach introduces a hierarchical training strategy that trains and merges multiple 3D Gaussian representations -- each optimized for specific scene regions -- into a single, unified 3DGS model representing the entire scene. To compensate for large camera motions, we leverage video frame interpolation models. Additionally, we incorporate multi-source supervision to reduce overfitting and enhance representation. Experimental results reveal that our approach significantly surpasses state-of-the-art SfM-free novel view synthesis methods. On the Tanks and Temples dataset, we improve PSNR by an average of 2.25dB, with a maximum gain of 3.72dB in the best scene. On the CO3D-V2 dataset, we achieve an average PSNR boost of 1.74dB, with a top gain of 3.90dB. Codes will be released upon acceptance.
Poster
Xiangyu Liu · Xiaomei Zhang · Zhiyuan Ma · Xiangyu Zhu · Zhen Lei
[ ExHall D ]
Abstract
Recent advancements in 3D object reconstruction have been remarkable, yet most current 3D models rely heavily on existing 3D datasets. The scarcity of diverse 3D datasets results in limited generalization capabilities of 3D reconstruction models. In this paper, we propose a novel framework for boosting 3D reconstruction with multi-view refinement (MVBoost) by generating pseudo-GT data. The key to MVBoost is to combine the high accuracy of the multi-view generation model with the consistency of the 3D reconstruction model to create a reliable data source. Specifically, given a single-view input image, we employ a multi-view diffusion model to generate multiple views, followed by a large 3D reconstruction model to produce consistent 3D data. MVBoost then adaptively refines these multi-view images, rendered from the consistent 3D data, to build a large-scale multi-view dataset for training a feed-forward 3D reconstruction model. Additionally, an input-view optimization is designed to optimize the corresponding viewpoints based on the user's input image, ensuring that the most important viewpoint is accurately tailored to the user's needs. Extensive evaluations demonstrate that our method achieves superior reconstruction results and robust generalization compared to prior works.
Poster
Khiem Vuong · Anurag Ghosh · Deva Ramanan · Srinivasa G. Narasimhan · Shubham Tulsiani
[ ExHall D ]
Abstract
We explore the task of geometric reconstruction of images captured from a mixture of ground and aerial views. Current state-of-the-art learning-based approaches fail to handle the extreme viewpoint variation between aerial-ground image pairs. Our hypothesis is that the lack of high-quality, co-registered aerial-ground datasets for training is a key reason for this failure. Such data is difficult to assemble precisely because it is difficult to reconstruct in a scalable way. To overcome this challenge, we propose a scalable framework combining pseudo-synthetic renderings from 3D city-wide meshes (e.g., Google Earth) with real, ground-level crowd-sourced images (e.g., MegaDepth). The pseudo-synthetic data simulates a wide range of aerial viewpoints, while the real, crowd-sourced images help improve visual fidelity for ground-level images where mesh-based renderings lack sufficient detail, effectively bridging the domain gap between real images and pseudo-synthetic renderings. Using this hybrid dataset, we fine-tune several state-of-the-art algorithms and achieve significant improvements on real-world, zero-shot aerial-ground tasks. For example, we observe that baseline DUSt3R localizes fewer than 3% of aerial-ground pairs within 5 degrees of camera rotation error, while fine-tuning with our data raises accuracy to nearly 50%, addressing a major failure point in handling large viewpoint changes. Beyond camera estimation and scene reconstruction, …
Poster
Qihang Zhang · Shuangfei Zhai · Miguel Ángel Bautista · Kevin Miao · Alexander Toshev · Joshua Susskind · Jiatao Gu
[ ExHall D ]
Abstract
Recent advancements in diffusion models have set new benchmarks in image and video generation, enabling realistic visual synthesis across single- and multi-frame contexts. However, these models still struggle with efficiently and explicitly generating 3D-consistent content. To address this, we propose World-consistent Video Diffusion (WVD), a novel framework that incorporates explicit 3D supervision using XYZ images, which encode global 3D coordinates for each image pixel. More specifically, we train a diffusion transformer to learn the joint distribution of RGB and XYZ frames. This approach supports multi-task adaptability via a flexible inpainting strategy. For example, WVD can estimate XYZ frames from ground-truth RGB or generate novel RGB frames using XYZ projections along a specified camera trajectory. In doing so, WVD unifies tasks like single-image-to-3D generation, multi-view stereo, and camera-controlled video generation. Our approach demonstrates competitive performance across multiple benchmarks, providing a scalable solution for 3D-consistent video and image generation with a single pretrained model.
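For intuition, an XYZ image can be derived from a depth map, intrinsics, and a camera-to-world pose as below; this is a generic construction we assume for illustration, not the paper's data pipeline.

```python
# One way to build an "XYZ image" (per-pixel global 3D coordinates) from a
# depth map, intrinsics K, and a camera-to-world pose.
import numpy as np

def xyz_image(depth, K, cam_to_world):
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w) + 0.5, np.arange(h) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)   # (h, w, 3) homogeneous pixels
    rays = pix @ np.linalg.inv(K).T                    # camera-frame directions
    pts_cam = rays * depth[..., None]                  # (h, w, 3) camera-frame points
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return pts_cam @ R.T + t                           # (h, w, 3) world XYZ per pixel

depth = np.full((4, 4), 2.0)
K = np.array([[2.0, 0, 2.0], [0, 2.0, 2.0], [0, 0, 1.0]])
print(xyz_image(depth, K, np.eye(4))[0, 0])            # world-space point of pixel (0, 0)
```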
Poster
Haosen Yang · Chenhao Zhang · Wenqing Wang · Marco Volino · Adrian Hilton · Li Zhang · Xiatian Zhu
[ ExHall D ]
Abstract
Point management is critical for optimizing 3D Gaussian Splatting models, as point initiation (e.g., via structure from motion) is often distributionally inappropriate. Typically, the Adaptive Density Control (ADC) algorithm is adopted, leveraging view-averaged gradient magnitude thresholding for point densification, opacity thresholding for pruning, and regular all-points opacity resets. We reveal that this strategy is limited in tackling intricate/special image regions (e.g., transparent ones) due to its inability to identify all 3D zones requiring point densification, and its lack of an appropriate mechanism to handle ill-conditioned points with negative impacts (e.g., occlusion due to false high opacity). To address these limitations, we propose a Localized Point Management (LPM) strategy, capable of identifying those error-contributing zones in greatest need of both point addition and geometry calibration. Zone identification is achieved by leveraging the underlying multiview geometry constraints, subject to image rendering errors. We apply point densification in the identified zones and then reset the opacity of the points in front of these regions, creating a new opportunity to correct poorly conditioned points. Serving as a versatile plugin, LPM can be seamlessly integrated into existing static 3D and dynamic 4D Gaussian Splatting models with minimal additional cost. Experimental evaluations validate the efficacy of our LPM in boosting a variety of existing …
Poster
Sebastian Koch · Johanna Wald · Mirco Colosi · Narunas Vaskevicius · Pedro Hermosilla · Federico Tombari · Timo Ropinski
[ ExHall D ]
Abstract
Neural radiance fields are an emerging 3D scene representation and have recently even been extended to learn features for scene understanding by distilling open-vocabulary features from vision-language models. However, current methods primarily focus on object-centric representations, supporting object segmentation or detection, while understanding the semantic relationships between objects remains largely unexplored. To address this gap, we propose RelationField, the first method to extract inter-object relationships directly from neural radiance fields. RelationField represents relationships between objects as pairs of rays within a neural radiance field, effectively extending its formulation to include implicit relationship queries. To teach RelationField complex, open-vocabulary relationships, relationship knowledge is distilled from multi-modal LLMs. To evaluate RelationField, we solve open-vocabulary 3D scene graph generation tasks and relationship-guided instance segmentation, achieving state-of-the-art performance in both tasks.
Poster
Weikang Bian · Zhaoyang Huang · Xiaoyu Shi · Yijin Li · Fu-Yun Wang · Hongsheng Li
[ ExHall D ]
Abstract
4D video control is essential in video generation as it enables the use of sophisticated lens techniques, such as multi-camera shooting and dolly zoom, which are currently unsupported by existing methods. Training a video Diffusion Transformer (DiT) directly to control 4D content requires expensive multi-view videos. Inspired by Monocular Dynamic novel View Synthesis (MDVS) that optimizes a 4D representation and renders videos according to different 4D elements, such as camera pose and object motion editing, we bring pseudo 4D Gaussian fields to video generation. Specifically, we propose a novel framework that constructs a pseudo 4D Gaussian field with dense 3D point tracking and renders the Gaussian field for all video frames. Then we finetune a pretrained DiT to generate videos following the guidance of the rendered video, dubbed as GS-DiT. To boost the training of the GS-DiT, we also propose an efficient Dense 3D Point Tracking (D3D-PT) method for the pseudo 4D Gaussian field construction. Our D3D-PT outperforms SpatialTracker, the state-of-the-art sparse 3D point tracking method, in accuracy and accelerates the inference speed by two orders of magnitude. During the inference stage, GS-DiT can generate videos with the same dynamic content while adhering to different camera parameters, addressing a significant …
Poster
Ashish Kumar · A. N. Rajagopalan
[ ExHall D ]
Abstract
Neural Radiance Fields (NeRFs) have made significant advances in rendering novel photorealistic views for both static and dynamic scenes. However, most prior works assume ideal conditions of artifact-free visual inputs i.e., images and videos. In real scenarios, artifacts such as object motion blur, camera motion blur, or lens defocus blur are ubiquitous. Some recent studies have explored novel view synthesis using blurred input frames by examining either camera motion blur, defocus blur, or both. However, these studies are limited to static scenes. In this work, we enable NeRFs to deal with object motion blur whose local nature stems from the interplay between object velocity and camera exposure time. Often, the object motion is unknown and time varying, and this adds to the complexity of scene reconstruction. Sports videos are a prime example of how rapid object motion can significantly degrade video quality for static cameras by introducing motion blur. We present an approach for realizing motion blur-free novel views of dynamic scenes from input videos with object motion blur captured from static cameras spanning multiple poses. We propose a NeRF-based analytical framework that elegantly correlates object three-dimensional (3D) motion across views as well as time to the observed blurry videos. …
Poster
Seungjun Lee · Gim Hee Lee
[ ExHall D ]
Abstract
Reconstructing sharp 3D representations from blurry multi-view images is a long-standing problem in computer vision. Recent works attempt to enhance high-quality novel view synthesis from motion blur by leveraging event-based cameras, benefiting from their high dynamic range and microsecond temporal resolution. However, they often reach sub-optimal visual quality, either restoring inaccurate colors or losing fine-grained details. In this paper, we present DiET-GS, a diffusion prior and event stream-assisted motion deblurring 3DGS. Our framework effectively leverages blur-free event streams and a diffusion prior in a two-stage training strategy. Specifically, we introduce a novel framework that constrains 3DGS with the event double integral, achieving both accurate color and well-defined details. Additionally, we propose a simple technique to leverage the diffusion prior to further enhance edge details. Qualitative and quantitative results on both synthetic and real-world data demonstrate that DiET-GS produces higher-quality novel views than the existing baselines. The code will be publicly available.
Poster
Yifan Wang · Peishan Yang · Zhen Xu · Jiaming Sun · Zhanhua Zhang · chen yong · Hujun Bao · Sida Peng · Xiaowei Zhou
[ ExHall D ]
Abstract
This paper addresses the challenge of reconstructing dynamic 3D scenes with complex motions. Some recent works define 3D Gaussian primitives in a canonical space and use deformation fields to map canonical primitives to observation spaces, achieving real-time dynamic view synthesis. However, these methods often struggle to handle scenes with complex motions due to the difficulty of optimizing deformation fields. To overcome this problem, we propose FreeTimeGS, a novel 4D representation that allows Gaussian primitives to appear at arbitrary times and locations. In contrast to canonical Gaussian primitives, our representation possesses strong flexibility, improving the ability to model dynamic 3D scenes. In addition, we endow each Gaussian primitive with a motion function, allowing it to move to neighboring regions over time, which reduces temporal redundancy. Experimental results on several datasets show that the rendering quality of our method outperforms recent methods by a large margin. The code will be released for reproducibility.
Poster
Hao Li · Sicheng Li · Xiang Gao · AbudouaihatiBatuer · Lu Yu · Yiyi Liao
[ ExHall D ]
Abstract
Immersive video offers a free six-degrees-of-freedom (6-DoF) viewing experience, potentially playing a key role in future video technology. Recently, 4D Gaussian Splatting has gained attention as an effective approach for immersive video due to its high rendering efficiency and quality, though maintaining quality with manageable storage remains challenging. To address this, we introduce GIFStream, a novel 4D Gaussian representation using a canonical space and a deformation field enhanced with time-dependent feature streams. These feature streams enable complex motion modeling and allow efficient compression by leveraging their motion-awareness and temporal correspondence. Additionally, we incorporate both temporal and spatial compression networks for end-to-end compression. Experimental results show that GIFStream delivers high-quality immersive video at 30 Mbps, with real-time rendering and fast decoding on an RTX 4090.
Poster
Hongchi Xia · Entong Su · Marius Memmel · Arhan Jain · Raymond Yu · Numfor Mbiziwo-Tiapo · Ali Farhadi · Abhishek Gupta · Shenlong Wang · Wei-Chiu Ma
[ ExHall D ]
Abstract
Creating virtual digital replicas from real-world data unlocks significant potential across domains like gaming and robotics. In this paper, we present DRAWER, a novel framework that converts a video of a static indoor scene into a *photorealistic* and *interactive* digital environment. Our approach centers on two main contributions: (i) a reconstruction module based on a *dual scene representation* that reconstructs the scene with *fine-grained geometric details*, and (ii) an *articulation* module that identifies articulation types and hinge positions, reconstructs simulatable shapes and appearances, and integrates them into the scene. The resulting virtual environment is photorealistic, interactive, and runs in real time, with compatibility for game engines and robotic simulation platforms. We demonstrate the potential of DRAWER by using it to automatically create an interactive game in Unreal Engine and to enable real-to-sim-to-real transfer for robotics applications. Our paper includes multiple embedded videos; we recommend readers use Adobe Acrobat to view them.
Poster
Shoichiro Takeda · Yasunori Akagi
[ ExHall D ]
Abstract
We propose novel fast algorithms for the Gromov–Wasserstein problem (GW) using cyclic symmetry of input data. Such GW with cyclic symmetry naturally appears as an object matching task underlying various real-world computer vision applications, e.g., image registration, point cloud registration, stereo matching, and 3D reconstruction. Gradient-based algorithms have been used to solve GW, and our main idea is to use the following remarkable and non-trivial property: By setting the initial solution to have cyclic symmetry, all intermediate solutions and matrices appearing in the gradient-based algorithms have the same cyclic symmetry until convergence. Based on this property, our gradient-based algorithms restrict the solution space to have cyclic symmetry and update only one of the symmetric parts of solutions and matrices at each iteration, which results in fast computation. Furthermore, the original gradient-based algorithms and ours must solve the Optimal Transport problem (OT) at each iteration, but only in ours does this problem exhibit cyclic symmetry. This cyclic OT can be solved efficiently, and as a result, the total computational time of our algorithms is dramatically faster than the original ones. Experiments showed the effectiveness of our algorithms in synthetic and real-world data with strict and approximate cyclic symmetry, respectively.
Poster
Awais Nizamani · Hamid Laga · Guanjin Wang · Farid Boussaid · Mohammed Bennamoun · Anuj Srivastava
[ ExHall D ]
Abstract
We propose a novel framework for the statistical analysis of genus-zero 4D surfaces, i.e., 3D surfaces that deform and evolve over time. This problem is particularly challenging due to the arbitrary parameterizations of these surfaces and their varying deformation speeds, necessitating effective spatiotemporal registration. Traditionally, 4D surfaces are discretized, in space and time, before computing their spatiotemporal registrations, geodesics, and statistics. However, this approach may result in suboptimal solutions and, as we demonstrate in this paper, is not necessary. In contrast, we treat 4D surfaces as continuous functions in both space and time. We introduce Dynamic Spherical Neural Surfaces (D-SNS), an efficient, smooth, and continuous spatiotemporal representation for genus-zero 4D surfaces. We then demonstrate how to perform core 4D shape analysis tasks such as spatiotemporal registration, geodesic computation, and mean 4D shape estimation directly on these continuous representations, without upfront discretization and meshing. By integrating neural representations with classical Riemannian geometry and statistical shape analysis techniques, we provide the building blocks for enabling full functional shape analysis. We demonstrate the efficiency of the framework on 4D human and face datasets.
Poster
Paul Roetzer · Viktoria Ehm · Daniel Cremers · Zorah Lähner · Florian Bernard
[ ExHall D ]
Abstract
In this work we address various shape matching problems that can be cast as finding cyclic paths in a product graph. This involves for example 2D-3D shape matching, 3D shape matching, or the matching of a contour to a graph. In this context, matchings are typically obtained as the minimum cost cycle in the product graph. Instead, inspired by related works on model-based image segmentation, we consider minimum ratio cycles, which we combine with the recently introduced conjugate product graph in order to allow for higher-order matching costs. With that, on the one hand we avoid the bias of obtaining matchings that involve fewer/shorter edges, while on the other hand being able to impose powerful geometric regularisation, e.g. to avoid zig-zagging. In our experiments we demonstrate that this not only leads to improved matching accuracy in most cases, but also to significantly reduced runtimes (up to two orders of magnitude, depending on the setting). Our GPU implementation will be made publicly available upon acceptance.
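For readers unfamiliar with minimum ratio cycles, the toy sketch below computes the optimal ratio on a generic directed graph by binary search over lambda combined with Bellman-Ford negative-cycle detection; the conjugate product graph and higher-order costs from the paper are not modeled, and choosing unit edge weights reduces the problem to the minimum mean cycle.

```python
# Minimum ratio cycle: a cycle with sum(cost)/sum(weight) < lambda exists
# iff the graph with modified edge costs (c - lambda * w) contains a
# negative cycle (weights assumed positive). Binary search over lambda.

def has_negative_cycle(n, edges, lam):
    dist = [0.0] * n                      # implicit virtual source with 0-cost edges to all nodes
    for _ in range(n):
        updated = False
        for u, v, c, w in edges:
            nd = dist[u] + (c - lam * w)
            if nd < dist[v] - 1e-12:
                dist[v] = nd
                updated = True
        if not updated:
            return False                  # distances converged, no negative cycle
    return True                           # still relaxing after n passes => negative cycle

def min_ratio_cycle_value(n, edges, lo=0.0, hi=100.0, iters=60):
    # edges: (u, v, cost, weight) with weight > 0 (weight = 1 per edge => mean cycle)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if has_negative_cycle(n, edges, mid):
            hi = mid                      # a cheaper cycle exists, tighten the upper bound
        else:
            lo = mid
    return hi

edges = [(0, 1, 2.0, 1.0), (1, 2, 2.0, 1.0), (2, 0, 2.0, 1.0),  # cycle with ratio 2
         (0, 3, 1.0, 1.0), (3, 0, 1.0, 1.0)]                    # cycle with ratio 1
print(round(min_ratio_cycle_value(4, edges), 3))                # ~1.0
```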
Poster
Ryota Maeda · Yunseong Moon · Seung-Hwan Baek
[ ExHall D ]
Abstract
Light-matter interactions modify both the intensity and polarization state of light. Changes in polarization, represented by a Mueller matrix, encode detailed scene information. Existing optical ellipsometers capture Mueller-matrix images; however, they are often limited to static scenes due to long acquisition times. Here, we introduce Event Ellipsometer, a method for acquiring Mueller-matrix images of dynamic scenes. Our imaging system employs fast-rotating quarter-wave plates (QWPs) in front of a light source and an event camera that asynchronously captures intensity changes induced by the rotating QWPs. We develop an ellipsometric-event image formation model, a calibration method, and an ellipsometric-event reconstruction method. We experimentally demonstrate that Event Ellipsometer enables Mueller-matrix imaging at 30fps, extending ellipsometry to dynamic scenes.
Poster
Noah Stier · Alex Rich · Pradeep Sen · Tobias Höllerer
[ ExHall D ]
Abstract
Recent image-based 3D reconstruction methods have achieved excellent quality for indoor scenes using 3D convolutional neural networks. However, they rely on a high-resolution grid in order to achieve detailed output surfaces, which is quite costly in terms of compute time, and it results in large mesh sizes that are more expensive to store, transmit, and render. In this paper we propose a new solution to this problem, using adaptive sampling. By re-formulating the final layers of the network, we are able to analytically bound the local surface complexity, and set the local sample rate accordingly. Our method, AniGrad, achieves an order of magnitude reduction in both surface extraction latency and mesh size, while preserving mesh accuracy and detail.
Poster
Zetong Zhang · Manuel Kaufmann · Lixin Xue · Jie Song · Martin R. Oswald
[ ExHall D ]
Abstract
Creating a photorealistic scene and human reconstruction from a single monocular in-the-wild video figures prominently in the perception of a human-centric 3D world. Recent neural rendering advances have enabled holistic human-scene reconstruction but require pre-calibrated camera and human poses and days of training time. In this work, we introduce a novel unified framework that simultaneously performs camera tracking, human pose estimation, and human-scene reconstruction in an online fashion. 3D Gaussian Splatting is utilized to learn Gaussian primitives for humans and scenes efficiently, and reconstruction-based camera tracking and human pose estimation modules are designed to enable holistic understanding and effective disentanglement of pose and appearance. Specifically, we design a human deformation module to faithfully reconstruct details and enhance generalizability to out-of-distribution poses. To accurately learn the spatial correlation between the human and the scene, we introduce occlusion-aware human silhouette rendering and monocular geometric priors, which further improve reconstruction quality. Experiments on the EMDB and NeuMan datasets demonstrate superior or on-par performance with existing methods in human pose estimation, novel view synthesis, and runtime.
Poster
Hongtao Yu · Shaohui Song · Lihu Sun · Wenkai Su · Xiaodong Yang · Chengming Liu
[ ExHall D ]
Abstract
Quad Photodiode (QPD) sensors represent an evolution by providing four sub-views, whereas dual-pixel (DP) sensors are limited to two sub-views. In addition to enhancing auto-focus performance, QPD sensors also enable disparity estimation in horizontal and vertical directions. However, the characteristics of QPD sensors, including uneven illumination across sub-views and the narrow baseline, render algorithm design difficult. Furthermore, effectively utilizing the two-directional disparity of QPD sensors remains a challenge. The scarcity of QPD disparity datasets also limits the development of learning-based methods. In this work, we address these challenges by first proposing a DPNet for DP disparity estimation. Specifically, we design an illumination-invariant module to reduce the impact of illumination, followed by a coarse-to-fine module to estimate sub-pixel disparity. Building upon the DPNet, we further propose a QuadNet, which integrates the two-directional disparity via an edge-aware fusion module. To facilitate the evaluation of our approaches, we propose the first QPD disparity dataset QPD2K, comprising 2,100 real-world QPD images and corresponding disparity maps. Experiments demonstrate that our approaches achieve state-of-the-art performance in DP and QPD disparity estimation.
Poster
Songsong Yu · Yuxin Chen · Zhongang Qi · Zeke Xie · Yifan Wang · Lijun Wang · Ying Shan · Huchuan Lu
[ ExHall D ]
Abstract
With the rapid proliferation of 3D devices and the shortage of 3D content, stereo conversion is attracting increasing attention. Recent works introduce pretrained Diffusion Models (DMs) into this task. However, due to the scarcity of large-scale training data and comprehensive benchmarks, the optimal methodologies for employing DMs in stereo conversion and the accurate evaluation of stereo effects remain largely unexplored. In this work, we introduce the Mono2Stereo dataset, providing high-quality training data and benchmark to support in-depth exploration of stereo conversion. With this dataset, we conduct an empirical study that yields two primary findings. 1) The differences between the left and right views are subtle, yet existing metrics consider overall pixels, failing to concentrate on regions critical to stereo effects. 2) Mainstream methods adopt either one-stage left-to-right generation or warp-and-inpaint pipeline, facing challenges of degraded stereo effect and image distortion respectively. Based on these findings, we introduce a new evaluation metric, Stereo Intersection-over-Union, which prioritizes disparity and achieves a high correlation with human judgments on stereo effect. Moreover, we propose a strong baseline model, harmonizing the stereo effect and image quality simultaneously, and notably surpassing current mainstream methods. Our code and data will be open-sourced to promote further research in …
Poster
Hualie Jiang · Zhiqiang Lou · Laiyan Ding · Rui Xu · Minglang Tan · jerett · Rui Huang
[ ExHall D ]
Abstract
Stereo matching is a key technique for metric depth estimation in computer vision and robotics. Real-world challenges like occlusion and texture-less regions hinder accurate disparity estimation from binocular matching cues. Recently, monocular relative depth estimation has shown remarkable generalization using vision foundation models. Thus, to facilitate robust stereo matching with monocular depth cues, we incorporate a robust monocular relative depth model into the recurrent stereo-matching framework, building a new framework for depth foundation model-based stereo matching, DEFOM-Stereo. In the feature extraction stage, we construct the combined context and matching feature encoder by integrating features from conventional CNNs and DEFOM. In the update stage, we use the depth predicted by DEFOM to initialize the recurrent disparity and introduce a scale update module to refine the disparity at the correct scale. DEFOM-Stereo is verified to have performance comparable to state-of-the-art (SOTA) methods on the Scene Flow dataset and notably shows much stronger zero-shot generalization. Moreover, DEFOM-Stereo achieves SOTA performance on the KITTI 2012, KITTI 2015, Middlebury, and ETH3D leaderboards, ranking 1st on many metrics. The code and models will be made publicly available.
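One simple way to initialize a recurrent disparity field from monocular relative depth is to fit a scale and shift against a few confident disparities; the closed-form least-squares fit below is our illustrative assumption, whereas the paper's scale update module is learned.

```python
# Fit a scale and shift mapping monocular relative inverse depth to disparity
# from a handful of confident matches, then use the result as the initial
# disparity for a recurrent refinement. Illustrative sketch only.
import numpy as np

def init_disparity_from_mono(rel_inv_depth, sparse_uv, sparse_disp):
    # rel_inv_depth: HxW relative inverse depth; sparse_uv: Nx2 (x, y) pixels;
    # sparse_disp: N confident disparities (e.g., from an initial correlation).
    mono = rel_inv_depth[sparse_uv[:, 1], sparse_uv[:, 0]]
    A = np.stack([mono, np.ones_like(mono)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, sparse_disp, rcond=None)
    return a * rel_inv_depth + b

rel = np.random.rand(48, 64)
uv = np.stack([np.random.randint(0, 64, 30), np.random.randint(0, 48, 30)], 1)
disp0 = init_disparity_from_mono(rel, uv, 20 * rel[uv[:, 1], uv[:, 0]] + 1.0)
print(disp0.shape)  # (48, 64), approximately 20 * rel + 1
```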
Poster
Marwane Hariat · Antoine Manzanera · David Filliat
[ ExHall D ]
Abstract
Monocular depth estimation (MDE) with self-supervised training approaches struggles in low-texture areas, where photometric losses may lead to ambiguous depth predictions. To address this, we propose a novel technique that enhances spatial information by applying a distance transform over pre-semantic contours, augmenting discriminative power in low texture regions. Our approach jointly estimates pre-semantic contours, depth and ego-motion. The pre-semantic contours are leveraged to produce new input images, with variance augmented by the distance transform in uniform areas. This approach results in more effective loss functions, enhancing the training process for depth and ego-motion. We demonstrate theoretically that the distance transform is the optimal variance-augmenting technique in this context. Through extensive experiments on KITTI and Cityscapes, our model demonstrates robust performance, surpassing conventional self-supervised methods in MDE.
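A minimal sketch of the distance-transform step follows, assuming a binary contour mask and SciPy's Euclidean distance transform; the rest of the training pipeline (contour estimation, depth, and ego-motion losses) is omitted.

```python
# Distance transform over a binary contour map, producing a smoothly varying
# signal in otherwise uniform regions that can be concatenated with (or used
# to modulate) the input image.
import numpy as np
from scipy.ndimage import distance_transform_edt

def contour_distance_map(contour_mask):
    # contour_mask: HxW boolean, True on (pre-semantic) contour pixels.
    # distance_transform_edt measures distance to the nearest zero entry,
    # so invert the mask to get distance-to-contour for every pixel.
    return distance_transform_edt(~contour_mask)

mask = np.zeros((8, 8), dtype=bool)
mask[:, 4] = True                     # a vertical contour
print(contour_distance_map(mask)[0])  # [4. 3. 2. 1. 0. 1. 2. 3.]
```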
Poster
Weilong Yan · Ming Li · Li Haipeng · Shuwei Shao · Robby T. Tan
[ ExHall D ]
Abstract
Self-supervised depth estimation from monocular cameras in diverse outdoor conditions, such as daytime, rain, and nighttime, is challenging due to the difficulty of learning universal representations and the severe lack of labeled real-world adverse data. Previous methods either rely on synthetic inputs and pseudo-depth labels or directly apply daytime strategies to adverse conditions, resulting in suboptimal results. In this paper, we present the first synthetic-to-real robust depth estimation framework, incorporating motion and structure priors to capture real-world knowledge effectively. In the synthetic adaptation, we transfer motion-structure knowledge inside cost volumes for a more robust representation, using a frozen daytime model to train a depth estimator in synthetic adverse conditions. In the real adaptation, which targets the synthetic-to-real gap, the earlier-trained models identify weather-insensitive regions with a designed consistency-reweighting strategy to emphasize valid pseudo-labels. We further introduce a new regularization that gathers an explicit depth distribution prior to constrain the model on real-world data. Experiments show that our method outperforms the state-of-the-art across diverse conditions in multi-frame and single-frame settings. We achieve improvements of 7.5% in AbsRel and 4.3% in RMSE on average for the nuScenes and RobotCar datasets (daytime, nighttime, rain). In zero-shot evaluation on DrivingStereo (rain, fog), our method generalizes better than previous ones. …
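A consistency-reweighting step could look like the sketch below, which down-weights pseudo-depth labels where the adverse-condition prediction disagrees with the frozen daytime model; the exponential form and temperature are our assumptions, not the paper's strategy.

```python
# Per-pixel reweighting of pseudo-depth labels: higher weight where the
# adverse-condition prediction agrees with the daytime model's pseudo-depth.
import numpy as np

def consistency_weights(pred_depth, pseudo_depth, tau=0.1, eps=1e-6):
    rel_err = np.abs(pred_depth - pseudo_depth) / (pseudo_depth + eps)
    return np.exp(-rel_err / tau)          # weight in (0, 1], 1 = perfect agreement

pred = np.random.uniform(1, 10, (48, 64))
pseudo = pred * np.random.uniform(0.9, 1.1, pred.shape)
w = consistency_weights(pred, pseudo)
weighted_loss = (w * np.abs(pred - pseudo)).mean()
print(weighted_loss)
```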
Poster
Zador Pataki · Paul-Edouard Sarlin · Johannes Schönberger · Marc Pollefeys
[ ExHall D ]
Abstract
While Structure-from-Motion (SfM) has seen much progress over the years, state-of-the-art systems are prone to failure when facing extreme viewpoint changes in low-overlap or low-parallax conditions. Because capturing images that avoid both pitfalls is challenging, this severely limits the wider use of SfM, especially by non-expert users. In this paper, we overcome both limitations by augmenting the classical SfM paradigm with monocular depth and normal priors, which can be inferred by deep neural networks with increasing accuracy. Our approach is significantly more robust than existing ones in extreme low- or high-overlap scenarios but retains state-of-the-art performance in easier, nominal conditions thanks to a tight integration of monocular and multi-view constraints. We also show that monocular priors can help reject faulty associations due to symmetries, which is a long-standing problem for SfM. Thanks to principled uncertainty propagation, our approach is robust to errors in the priors, can handle priors inferred by different models with little tuning, and will thus easily benefit from future progress in monocular depth and normal estimation.
Poster
Daniil Sinitsyn · Linus Härenstam-Nielsen · Daniel Cremers
[ ExHall D ]
Abstract
We tackle the problem of automatic calibration of radially distorted cameras in challenging conditions. Accurately determining distortion parameters typically requires either 1) solving the full Structure from Motion (SfM) problem involving camera poses, 3D points, and the distortion parameters, which is only possible if many images with sufficient overlap are provided, or 2) relying heavily on learning-based methods that are comparatively less accurate. In this work, we demonstrate that distortion calibration can be decoupled from 3D reconstruction, maintaining the accuracy of SfM-based methods while avoiding many of the associated complexities. This is achieved by working in projective space, where the geometry is unique up to a homography, which encapsulates all camera parameters except for distortion. Our proposed method, Projective Radial Distortion Averaging, averages multiple distortion estimates in a fully projective framework without creating 3D points or running full bundle adjustment. By relying on pairwise projective relations, our method supports any feature-matching approach without constructing point tracks across multiple images.
Poster
Charalambos Tzamos · Viktor Kocur · Yaqing Ding · Daniel Barath · Zuzana Berger Haladova · Torsten Sattler · Zuzana Kukelova
[ ExHall D ]
Abstract
We study the challenging problem of estimating the relative pose of three calibrated cameras from four point correspondences. We propose novel, efficient solutions to this problem that are based on the simple idea of using four correspondences to estimate an approximate geometry of the first two views. We model this geometry either as an affine or as a fully perspective geometry, estimated using one additional approximate correspondence. We generate such an approximate correspondence with a very simple and efficient strategy, taking the new point as the mean of three corresponding input points. The new solvers are efficient and easy to implement, since they are based on existing efficient minimal solvers, i.e., the 4-point affine fundamental matrix, the well-known 5-point relative pose solver, and the P3P solver. Extensive experiments on real data show that the proposed solvers, when properly coupled with local optimization, achieve state-of-the-art results, with the novel solver based on approximate mean-point correspondences being more robust and precise than the affine-based solver.
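The mean-point trick itself is tiny: average three of the four correspondences in each view to synthesize an approximate fifth match, then hand all five to a 5-point solver. The use of OpenCV's findEssentialMat below is our stand-in for illustration; the paper plugs the idea into its own minimal solvers.

```python
# Build the approximate fifth correspondence as the mean of three input
# correspondences, then run a standard 5-point essential matrix estimation.
import numpy as np
import cv2

pts1 = np.random.rand(4, 2) * 100          # four correspondences in view 1
pts2 = pts1 + np.random.rand(4, 2) * 2     # and their matches in view 2

mean1 = pts1[:3].mean(axis=0, keepdims=True)
mean2 = pts2[:3].mean(axis=0, keepdims=True)
p1 = np.vstack([pts1, mean1]).astype(np.float64)
p2 = np.vstack([pts2, mean2]).astype(np.float64)

K = np.array([[100.0, 0, 50], [0, 100.0, 50], [0, 0, 1]])
E, _ = cv2.findEssentialMat(p1, p2, K, method=cv2.RANSAC)
print(None if E is None else E.shape)      # (3, 3), or stacked 5-point solutions
```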
Poster
Jianing Yang · Alexander Sax · Kevin Liang · Mikael Henaff · Hao Tang · Ang Cao · Joyce Chai · Franziska Meier · Matt Feiszli
[ ExHall D ]
Abstract
Multi-view 3D reconstruction remains a core challenge in computer vision, particularly in applications requiring accurate and scalable representations across diverse perspectives. Current leading methods such as DUSt3R employ a fundamentally pairwise approach, processing images in pairs and necessitating costly global alignment procedures to reconstruct from multiple views. In this work, we propose Fast 3D Reconstruction (Fast3R), a novel multi-view generalization to DUSt3R that achieves efficient and scalable 3D reconstruction by processing multiple views in parallel. Fast3R's Transformer-based architecture forwards N images in a single pass, bypassing the need for iterative alignment. Through extensive experiments on camera pose estimation and 3D reconstruction, Fast3R demonstrates state-of-the-art performance, with significant improvements in inference speed and reduced error accumulation. These results establish Fast3R as a robust alternative for multi-view applications, offering enhanced scalability without compromising reconstruction accuracy.
Poster
Shangzhan Zhang · Jianyuan Wang · Yinghao Xu · Nan Xue · Christian Rupprecht · Xiaowei Zhou · Yujun Shen · Gordon Wetzstein
[ ExHall D ]
Abstract
We present FLARE, a feed-forward model designed to infer high-quality camera poses and 3D geometry from uncalibrated sparse-view images (i.e., as few as 2-8 inputs), which is a challenging yet practical setting in real-world applications. Our solution features a cascaded learning paradigm with camera pose serving as the critical bridge, recognizing its essential role in mapping 3D structures onto 2D image planes. Concretely, FLARE starts with camera pose estimation, whose results condition the subsequent learning of geometric structure and appearance, optimized through the objectives of geometry reconstruction and novel-view synthesis. Utilizing large-scale public datasets for training, our method delivers state-of-the-art performance in the tasks of pose estimation, geometry reconstruction, and novel view synthesis, while maintaining inference efficiency (i.e., less than 0.5 seconds).
Poster
Runfeng Li · Mikhail Okunev · Zixuan Guo · Anh H Duong · Christian Richardt · Matthew O’Toole · James Tompkin
[ ExHall D ]
Abstract
We present a method to reconstruct dynamic scenes from monocular continuous-wave time-of-flight cameras using raw sensor samples that is as accurate as past methods and is 100× faster. Quickly achieving high-fidelity dynamic 3D reconstruction from a single viewpoint is a significant challenge in computer vision. Recent 3D Gaussian splatting methods often depend on multi-view data to produce satisfactory results and are otherwise brittle in their optimization. In time-of-flight radiance field reconstruction, the property of interest, depth, is not directly optimized, causing additional challenges. We describe how these problems have a large and underappreciated impact upon the optimization when using a fast primitive-based scene representation like 3D Gaussians. Then, we incorporate two heuristics into our optimization to improve the accuracy of scene geometry for under-constrained time-of-flight Gaussians. Experimental results show that our approach produces accurate reconstructions under constrained sensing conditions, including for fast motions like swinging baseball bats.
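For context on why depth is only indirectly observed, the sketch below shows a standard four-bucket conversion from raw continuous-wave correlation samples to phase and depth; the bucket ordering and modulation frequency are common conventions assumed for illustration, not necessarily the sensor model used in the paper.

```python
# Converting four raw correlation samples of a continuous-wave ToF pixel into
# phase and depth, assuming buckets sampled at phase offsets 0, 90, 180, 270 degrees.
import numpy as np

def tof_depth(q0, q1, q2, q3, f_mod=30e6, c=299_792_458.0):
    phase = np.arctan2(q1 - q3, q0 - q2)        # in (-pi, pi]
    phase = np.mod(phase, 2 * np.pi)            # wrap to [0, 2*pi)
    return c * phase / (4 * np.pi * f_mod)      # unambiguous range = c / (2 * f_mod)

# A target at 2 m with a 30 MHz modulation gives phase = 4*pi*f*d/c ≈ 2.52 rad.
true_phase = 4 * np.pi * 30e6 * 2.0 / 299_792_458.0
q = [np.cos(true_phase - k * np.pi / 2) for k in range(4)]
print(round(tof_depth(*q), 3))                  # ~2.0
```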
Poster
Lea Müller · Hongsuk Choi · Anthony Zhang · Brent Yi · Jitendra Malik · Angjoo Kanazawa
[ ExHall D ]
Abstract
We introduce ``Humans and Structure from Motion'', a novel approach for reconstructing multiple people within a metric world coordinate system from a sparse set of images capturing a scene. Our method jointly estimates human body pose, shape, camera positions, and scene structure, capturing the spatial relationships among people and their location in the environment. Unlike existing methods that require calibrated setups, our approach operates with minimal constraints by leveraging the strength of both human body priors and data-driven SfM. By leveraging multi-view geometry, our method is the first work that effectively recovers humans and scene structure without assumptions about human-scene contact. We evaluate our approach on two challenging benchmarks, EgoHumans and EgoExo4D, demonstrating significant improvements in human location estimation within the world coordinate frame (3.51m to 1.04m and 2.9m to 0.56m respectively). Notably, our results also reveal that incorporating human data in the classical SfM task improves camera pose estimation (RRA@15: 0.74 to 0.89 in EgoHumans), when multiple humans are used for correspondence. We will release our code and data.
Poster
Kai Luo · Hao Shi · Sheng Wu · Fei Teng · Mengfei Duan · Chang Huang · Yuhang Wang · Kaiwei Wang · Kailun Yang
[ ExHall D ]
Abstract
Panoramic imagery, with its 360° field of view, offers comprehensive information to support Multi-Object Tracking (MOT) in capturing spatial and temporal relationships of surrounding objects. However, most MOT algorithms are tailored for pinhole images with limited views, impairing their effectiveness in panoramic settings. Additionally, panoramic image distortions, such as resolution loss, geometric deformation, and uneven lighting, hinder direct adaptation of existing MOT methods, leading to significant performance degradation. To address these challenges, we propose OmniTrack, an omnidirectional MOT framework that incorporates Tracklet Management to introduce temporal cues, FlexiTrack Instances for object localization and association, and the CircularStatE Module to alleviate image and geometric distortions. This integration enables tracking in large field-of-view scenarios, even under rapid sensor motion. To mitigate the lack of panoramic MOT datasets, we introduce the QuadTrack dataset—a comprehensive panoramic dataset collected by a quadruped robot, featuring diverse challenges such as wide fields of view, intense motion, and complex environments. Extensive experiments on the public JRDB dataset and the newly introduced QuadTrack benchmark demonstrate the state-of-the-art performance of the proposed framework. OmniTrack achieves a HOTA score of 26.92% on JRDB, representing an improvement of 3.43%, and further achieves 23.45% on QuadTrack, surpassing the baseline by 6.81%. The dataset …
Poster
Jintao Zhang · Zimin Xia · Mingyue Dong · Shuhan Shen · Linwei Yue · Xianwei Zheng
[ ExHall D ]
Abstract
This paper proposes a multi-view collaborative matching strategy to address the issue of sparse and broken tracks in Structure-from-Motion. We observe that the two-view matching paradigms applied to image set matching often lead to unreliable correspondences when the selected independent image pairs exhibit weak connectivity, heavy occlusion, and drastic viewpoint changes. This is due to the significant loss of information during 3D-to-2D projection and the fact that two-view images can provide only a very limited perception of the holistic 3D scene. Accordingly, we propose a multi-view collaborative matching network (CoMatcher) that (i) leverages complementary context cues from different views to form a holistic understanding of the 3D scene and (ii) utilizes multi-view consistency constraints to infer a globally optimal solution. Extensive experiments on various complicated scenarios demonstrate the superiority of our multi-view collaborative matching strategy over the mainstream two-view matching paradigm.
Poster
WooJu Lee · Juhye Park · Dasol Hong · Changki Sung · Youngwoo Seo · DongWan Kang · Hyun Myung
[ ExHall D ]
Abstract
Accurate localization is essential for autonomous driving, but GNSS-based methods struggle in challenging environments such as urban canyons. Cross-view pose optimization offers an effective solution by directly estimating vehicle pose using satellite-view images. However, existing methods primarily rely on cross-view features at a given pose, neglecting fine-grained contexts for precision and global contexts for robustness against large initial pose errors. To overcome these limitations, we propose PIDLoc, a novel cross-view pose optimization approach inspired by the proportional-integral-derivative (PID) controller. The PIDLoc comprises the PID branches to model cross-view feature relationships and the spatially aware pose estimator (SPE) to estimate the pose from these relationships. The PID branches leverage feature differences for local context (P branch), aggregated feature differences for global context (I branch), and gradients of feature differences for precise pose adjustment (D branch) to enhance localization accuracy under large initial pose errors. Integrated with the PID branches, the SPE captures spatial relationships within the PID-branch features for consistent localization. Experimental results demonstrate that the PIDLoc achieves state-of-the-art performance in cross-view pose estimation on the KITTI dataset, reducing position error by 37.8% compared with the previous state-of-the-art.
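A minimal sketch of the PID analogy, assuming spatially aligned ground and satellite feature maps at the current pose hypothesis: the P term is the local feature difference, the I term aggregates that difference globally, and the D term approximates its spatial gradients with finite differences. Function names, shapes, and the finite-difference choice are illustrative assumptions, not the PIDLoc implementation.

```python
import torch

def pid_branches(f_ground, f_satellite):
    """Toy sketch of PID-style cross-view feature terms.

    f_ground, f_satellite: (C, H, W) feature maps assumed to be spatially
    aligned at the current pose hypothesis. This only illustrates the
    P/I/D analogy described in the abstract, not the paper's exact design.
    """
    diff = f_ground - f_satellite                      # P: local feature difference
    integral = diff.mean(dim=(1, 2), keepdim=True)     # I: aggregated (global) difference
    # D: spatial gradients of the difference, approximated by finite differences
    d_dx = diff[:, :, 1:] - diff[:, :, :-1]
    d_dy = diff[:, 1:, :] - diff[:, :-1, :]
    return diff, integral, (d_dx, d_dy)

fg = torch.randn(16, 32, 32)
fs = torch.randn(16, 32, 32)
p, i, (dx, dy) = pid_branches(fg, fs)
print(p.shape, i.shape, dx.shape, dy.shape)
```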
Poster
Shengze Wang · Jiefeng Li · Tianye Li · Ye Yuan · Henry Fuchs · Koki Nagano · Shalini De Mello · Michael Stengel
[ ExHall D ]
Abstract
Single-image human mesh recovery is a challenging task due to the ill-posed nature of simultaneous body shape, pose, and camera estimation. Existing estimators work well on images taken from afar, but they break down as the person moves close to the camera. Moreover, current methods fail to achieve both accurate 3D pose and 2D alignment at the same time. Error is mainly introduced by inaccurate perspective projection heuristically derived from orthographic parameters. To resolve this long-standing challenge, we present our method, BLADE, which accurately recovers perspective parameters from a single image without heuristic assumptions. We start from the inverse relationship between perspective distortion and the person's Z-translation Tz, and we show that Tz can be reliably estimated from the image. We then discuss the important role of Tz for accurate human mesh recovery from close-range images. Finally, we show that, once Tz and the 3D human mesh are estimated, one can accurately recover the focal length and full 3D translation. Extensive experiments on standard benchmarks and real-world close-range images show that our method is the first to accurately recover projection parameters from a single image, and consequently attain state-of-the-art accuracy on 3D pose estimation and 2D alignment for a wide …
Poster
Wanhua Li · Renping Zhou · Jiawei Zhou · Yingwei Song · Johannes Herter · Minghan Qin · Gao Huang · Hanspeter Pfister
[ ExHall D ]
Abstract
Learning 4D language fields to enable time-sensitive, open-ended language queries in dynamic scenes is essential for many real-world applications. While LangSplat successfully grounds CLIP features into 3D Gaussian representations, achieving precision and efficiency in 3D static scenes, it lacks the ability to handle dynamic 4D fields as CLIP, designed for static image-text tasks, cannot capture temporal dynamics in videos. Real-world environments are inherently dynamic, with object semantics evolving over time. Building a precise 4D language field necessitates obtaining pixel-aligned, object-wise video features, which current vision models struggle to achieve. To address these challenges, we propose 4D LangSplat, which learns 4D language fields to handle time-agnostic or time-sensitive open-vocabulary queries in dynamic scenes efficiently. 4D LangSplat bypasses learning the language field from vision features and instead learns directly from text generated from object-wise video captions via Multimodal Large Language Models (MLLMs). Specifically, we propose a multimodal object-wise video prompting method, consisting of visual and text prompts that guide MLLMs to generate detailed, temporally consistent, high-quality captions for objects throughout a video. These captions are encoded using a Large Language Model into high-quality sentence embeddings, which then serve as pixel-aligned, object-specific feature supervision, facilitating open-vocabulary text queries through shared embedding spaces. Recognizing …
Poster
Gyeongjin Kang · Jisang Yoo · Jihyeon Park · Seungtae Nam · Hyeonsoo Im · Shin sangheon · Sangpil Kim · Eunbyung Park
[ ExHall D ]
Abstract
We propose SelfSplat, a novel 3D Gaussian Splatting model designed to perform pose-free and 3D prior-free generalizable 3D reconstruction from unposed multi-view images. These settings are inherently ill-posed due to the lack of ground-truth data, learned geometric information, and the need to achieve accurate 3D reconstruction without fine-tuning, making it difficult for conventional methods to produce high-quality results. Our model addresses these challenges by effectively integrating explicit 3D representations with self-supervised depth and pose estimation techniques, resulting in reciprocal improvements in both pose accuracy and 3D reconstruction quality. Furthermore, we incorporate a matching-aware pose estimation network and a depth refinement module to enhance geometry consistency across views, ensuring more accurate and stable 3D reconstructions. To demonstrate the performance of our method, we evaluate it on large-scale real-world datasets, including RealEstate10K, ACID, and DL3DV. SelfSplat achieves superior results over previous state-of-the-art methods in both appearance and geometry quality, and also demonstrates strong cross-dataset generalization capabilities. Extensive ablation studies and analysis also validate the effectiveness of our proposed methods.
Poster
Xingyu Liu · Gu Wang · Ruida Zhang · Chenyangguang Zhang · Federico Tombari · Xiangyang Ji
[ ExHall D ]
Abstract
Unseen object pose estimation methods often rely on CAD models or multiple reference views, making the onboarding stage costly. To simplify reference acquisition, we aim to estimate the unseen object's pose through a single unposed RGB-D reference image. While previous works leverage reference images as pose anchors to limit the range of relative pose, our scenario presents significant challenges since the relative transformation could vary across the entire SE(3) space. Moreover, factors like occlusion, sensor noise, and extreme geometry could result in low viewpoint overlap. To address these challenges, we present a novel approach and benchmark, termed UNOPose, for UNseen One-reference-based object Pose estimation. Building upon a coarse-to-fine paradigm, UNOPose constructs an SE(3)-invariant reference frame to standardize object representation despite pose and size variations. To alleviate small overlap across viewpoints, we recalibrate the weight of each correspondence based on its predicted likelihood of being within the overlapping region. Evaluated on our proposed benchmark based on the BOP Challenge, UNOPose demonstrates superior performance, significantly outperforming traditional and learning-based methods in the one-reference setting and remaining competitive with CAD-model-based methods. The code and dataset will be available upon acceptance.
Poster
Junjie Chen · Weilong Chen · Yifan Zuo · Yuming Fang
[ ExHall D ]
Abstract
Category-agnostic pose estimation aims to locate keypoints on query images according to a few annotated support images for arbitrary novel classes. Existing methods generally extract support features via heatmap pooling, and fuse support and query features via cross-attention. However, these works neglect to mine fine-grained and structure-aware (FGSA) features from both support and query images, which are crucial for pixel-level keypoint localization. To this end, we propose a novel yet concise framework, which recurrently mines FGSA features from both support and query images. Specifically, we design an FGSA mining module based on the deformable attention mechanism. On the one hand, we mine fine-grained features by applying a deformable attention head over multi-scale feature maps. On the other hand, we mine structure-aware features by offsetting the reference points of keypoints to their linked keypoints. By means of the above module, we recurrently mine FGSA features from support and query images, and thus obtain better support features and query estimations. In addition, we propose to use mixup keypoints to pad various classes to a unified keypoint number, which could provide richer supervision than the zero padding used in existing works. We conduct extensive experiments and in-depth studies on the large-scale MP-100 dataset, and outperform …
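A toy sketch of the mixup-keypoint padding idea, assuming each padding slot is filled with a random convex combination of two existing keypoints rather than zeros; the exact sampling rule used in the paper is not specified here, so this is only an illustrative interpretation.

```python
import numpy as np

def mixup_pad_keypoints(keypoints, target_num, rng=None):
    """Pad a (K, 2) keypoint array to (target_num, 2) with mixup points.

    Instead of zero padding, each padding slot is a random convex
    combination of two existing keypoints. This is an illustrative
    interpretation of the 'mixup keypoints' idea, not the paper's exact rule.
    """
    if rng is None:
        rng = np.random.default_rng()
    k = keypoints.shape[0]
    pads = []
    for _ in range(target_num - k):
        i, j = rng.choice(k, size=2, replace=False)
        lam = rng.uniform(0.0, 1.0)
        pads.append(lam * keypoints[i] + (1.0 - lam) * keypoints[j])
    return np.vstack([keypoints] + pads)

kps = np.random.rand(13, 2)          # a class with 13 annotated keypoints
padded = mixup_pad_keypoints(kps, 17)
print(padded.shape)                  # (17, 2)
```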
Poster
Qingyuan Wang · Rui Song · Jiaojiao Li · Kerui Cheng · David Ferstl · Yinlin Hu
[ ExHall D ]
Abstract
We introduce SCFlow2, a plug-and-play refinement framework for 6D object pose estimation. Most recent 6D object pose methods rely on refinement to get accurate results. However, most existing refinements either suffer from noise in establishing correspondences or rely on retraining for novel objects. SCFlow2 is based on the SCFlow model designed for iterative RGB refinement with shape constraint, but formulates the additional depth as a regularization in the iteration via 3D scene flow for RGBD frames. The key design of SCFlow2 is the introduction of geometric constraints into the training of the recurrent matching network, combining the rigid-motion embeddings in 3D scene flow with a 3D shape prior of the target. We train the refinement network on a combination of the Objaverse, GSO, and ShapeNet datasets, and demonstrate on BOP datasets with novel objects that, after using our method, the results of most state-of-the-art methods improve significantly, without any retraining or fine-tuning.
Poster
Ziqin Huang · Gu Wang · Chenyangguang Zhang · Ruida Zhang · Xiu Li · Xiangyang Ji
[ ExHall D ]
Abstract
Recent advances in RGBD-based category-level object pose estimation have been limited by their reliance on precise depth information, restricting their broader applicability. In response, RGB-based methods have been developed. Among these methods, geometry-guided pose regression that originated from instance-level tasks has demonstrated strong performance. However, we argue that the NOCS map is an inadequate intermediate representation for geometry-guided pose regression methods, as its many-to-one correspondence with category-level pose introduces redundant instance-specific information, resulting in suboptimal performance. This paper identifies the intra-class variation problem inherent in pose regression based solely on the NOCS map and proposes the Intra-class Variation-Free Consensus (IVFC) map, a novel coordinate representation generated from the category-level consensus model. By leveraging the complementary strengths of the NOCS map and the IVFC map, we introduce GIVEPose, a framework that implements Gradual Intra-class Variation Elimination for category-level object pose estimation. Extensive evaluations on both synthetic and real-world datasets demonstrate that GIVEPose significantly outperforms existing state-of-the-art RGB-based approaches, achieving substantial improvements in category-level object pose estimation.
Poster
Wen-Hsuan Chu · Lei Ke · Jianmeng Liu · Mingxiao Huo · Pavel Tokmakov · Katerina Fragkiadaki
[ ExHall D ]
Abstract
We address the challenging problem of generating a dynamic 4D scene across views and over time from monocular videos. We target in-the-wild multi-object videos with heavy occlusions and propose Robust4DGen, a model that decomposes the scene into object tracks and optimizes a differentiable and deformable set of 3D Gaussians for each. Robust4DGen captures 2D occlusions from a 3D perspective by jointly splatting Gaussians of all objects to compute rendering errors in observed frames. Rather than relying on scene-level view generation models, which struggle to generalize due to the combinatorial complexity of scene views, we keep the Gaussian grouping information and additionally utilize object-centric, view-conditioned generative models for each entity to optimize score distillation objectives from unobserved viewpoints. We achieve this by applying differentiable affine transformations to jointly optimize both global image re-projection and object-centric score distillation objectives within a unified framework. To enable a thorough evaluation of generation and motion accuracy under multi-object occlusions, we annotate MOSE-PTS with accurate 2D point tracks, which is a subset of the challenging MOSE video segmentation benchmark. Through quantitative analysis and human evaluation, we demonstrate that our method generates more realistic 4D multi-object scenes and produces more accurate point tracks across spatial and temporal …
Poster
Guangzhao He · Chen Geng · Shangzhe Wu · Jiajun Wu
[ ExHall D ]
Abstract
The motion of deformable 4D objects lies on a low-dimensional manifold. To better capture this low dimensionality and enable better controllability, traditional approaches have devised heuristic representations, e.g., rigging, to manipulate dynamic objects intuitively. However, such representations are not scalable due to the need for expert knowledge of specific categories. Instead, we study the automatic exploration of such low-dimensional structures in a purely data-driven manner. Specifically, we design a novel representation that encodes deformable 4D objects into a sparse set of spatially grounded blobs and an instance-aware feature volume to disentangle the pose and instance information of the 3D shape. With such a representation, we can manipulate the pose of 3D objects intuitively by modifying the parameters of the blobs, while preserving the rich instance-specific information. We evaluate the proposed method on a variety of object categories and demonstrate the effectiveness of the proposed framework.
Poster
Zekai Shao · Yufan Hu · Bin Fan · Hongmin Liu
[ ExHall D ]
Abstract
Maintaining stable tracking of objects in domain shift scenarios is crucial for RGB-T tracking, prompting us to explore the use of unlabeled test sample information for effective online model adaptation. However, current Test-Time Adaptation (TTA) methods in RGB-T tracking dramatically change the model's internal parameters during long-term adaptation. At the same time, the gradient computations involved in the optimization process impose a significant computational burden. To address these challenges, we propose a Parameter Update-Recovery Adaptation (PURA) framework based on parameter decomposition. Firstly, our fast parameter update strategy adjusts model parameters using statistical information from test samples without requiring gradient calculations, ensuring consistency between the model and test data distribution. Secondly, our parameter decomposition recovery employs orthogonal decomposition to identify the principal update direction and recover parameters in this direction, aiding in the retention of critical knowledge. Finally, we leverage the information obtained from decomposition to provide feedback on the momentum during the update phase, ensuring a stable updating process. Experimental results demonstrate that PURA outperforms current state-of-the-art methods across multiple datasets, validating its effectiveness. The code is available in the Supplementary Materials.
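A rough sketch of the update-recovery idea, assuming the parameter update of a single weight matrix is decomposed with SVD and only its leading singular components are kept as the principal update direction; this is a generic illustration, not the exact PURA procedure.

```python
import numpy as np

def recover_principal_update(theta_old, theta_new, keep=1):
    """Keep only the dominant direction(s) of a parameter update.

    theta_old, theta_new: 2D weight matrices (e.g., a linear layer).
    The update is decomposed with SVD and only the top `keep` singular
    components are retained, a rough illustration of recovering parameters
    along the principal update direction (not the paper's exact steps).
    """
    delta = theta_new - theta_old
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    s_trunc = np.zeros_like(s)
    s_trunc[:keep] = s[:keep]
    delta_principal = (u * s_trunc) @ vt   # rank-`keep` approximation of the update
    return theta_old + delta_principal

w_old = np.random.randn(64, 64)
w_new = w_old + 0.01 * np.random.randn(64, 64)
w_rec = recover_principal_update(w_old, w_new, keep=2)
print(np.linalg.norm(w_rec - w_old), np.linalg.norm(w_new - w_old))
```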
Poster
Xinyu Xiang · Qinglong Yan · Hao Zhang · Jiayi Ma
[ ExHall D ]
Abstract
The research on adversarial attacks against trackers primarily concentrates on the RGB modality, whereas the methodology for attacking RGB-T multi-modal trackers has not been explored so far. This work represents an innovative attempt to develop an adaptive cross attack framework via multi-modal response decoupling, generating multi-modal adversarial patches to evade RGB-T trackers. Specifically, a modal-aware adaptive attack strategy is introduced to weaken the modality with high common information contribution alternately and iteratively, achieving the modal decoupling attack. In order to perturb the judgment of the modal balance mechanism in the tracker, we design a modal disturbance loss to increase the distance of the response map of the single-modal adversarial samples in the tracker. Besides, we also propose a novel spatio-temporal joint attack loss to progressively deteriorate the tracker's perception of the target. Moreover, the design of the shared adversarial shape enables the generated multi-modal adversarial patches to be readily deployed in real-world scenarios, effectively reducing the interference of the patch posting process on the shape attack of the infrared adversarial layer. Extensive digital and physical domain experiments demonstrate the effectiveness of our multi-modal adversarial patch attack.
Poster
Ahyun Seo · Minsu Cho
[ ExHall D ]
Abstract
Symmetry is crucial for understanding structural patterns and supports tasks such as object recognition and scene understanding. This paper focuses on rotational symmetry, where objects remain unchanged when rotated around a central axis, requiring the detection of rotation centers and supporting vertices. Traditional methods relied on hand-crafted feature matching for identifying rotation centers and vertices, while recent approaches use convolutional neural networks (CNNs) as segmentation models for rotation center detection. However, 2D-based models struggle to preserve 3D geometric properties due to distortions caused by viewpoint variation. To address this, we propose a rotation symmetry detection model that directly predicts rotation centers and vertices in 3D space, projecting the results back to 2D while maintaining structural consistency. By incorporating a vertex reconstruction stage that enforces 3D geometric priors—such as equal side lengths and interior angles for regular polygons—our model achieves greater robustness and geometric accuracy. Experiments on the DENDI dataset show that our approach outperforms previous state-of-the-art methods in rotation center detection and demonstrates the effectiveness of 3D geometric priors through ablation studies on vertex reconstruction.
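A small geometric sketch of the regular-polygon prior, assuming a predicted rotation center, symmetry axis, rotation order, and one vertex: the remaining vertices are reconstructed by rotating the first vertex about the axis (Rodrigues' formula), which by construction enforces equal side lengths and interior angles. The learned components of the paper are not shown.

```python
import numpy as np

def regular_polygon_vertices(center, axis, first_vertex, n_fold):
    """Reconstruct vertices of a regular polygon in 3D.

    Given a rotation center, a symmetry axis, one vertex, and the rotation
    order n_fold, the remaining vertices are obtained by rotating the first
    vertex about the axis. A generic geometric sketch of the 'regular
    polygon prior', not the paper's learned reconstruction.
    """
    axis = axis / np.linalg.norm(axis)
    v0 = first_vertex - center
    verts = []
    for k in range(n_fold):
        theta = 2.0 * np.pi * k / n_fold
        # Rodrigues' rotation formula
        v = (v0 * np.cos(theta)
             + np.cross(axis, v0) * np.sin(theta)
             + axis * np.dot(axis, v0) * (1.0 - np.cos(theta)))
        verts.append(center + v)
    return np.stack(verts)

verts = regular_polygon_vertices(np.zeros(3), np.array([0.0, 0.0, 1.0]),
                                 np.array([1.0, 0.0, 0.0]), n_fold=5)
print(np.round(verts, 3))
```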
Poster
Shining Wang · Yunlong Wang · Ruiqi Wu · Bingliang Jiao · Wenxuan Wang · Peng Wang
[ ExHall D ]
Abstract
The Aerial-Ground Person Re-identification (AGPReID) task faces the main challenge of significant appearance variations caused by different viewpoints, which make identity matching difficult. To address this issue, previous methods attempt to reduce viewpoint differences by exploiting critical attributes and decoupling viewpoints. While these methods can mitigate viewpoint differences to some extent, they still face two main issues: (1) difficulty in handling viewpoint diversity and (2) neglect of the contribution of local features. To effectively address these challenges, we design and implement the Self-Calibrating and Adaptive Prompt (SeCap) method for the AGPReID task. The core of this framework relies on the Prompt Re-calibration Module (PRM), which adaptively re-calibrates prompts based on the input. Combined with the Local Feature Refinement Module (LFRM), SeCap can extract view-invariant features from local features for AGPReID. Meanwhile, given the current scarcity of datasets in the AGPReID field, we further contribute two real-world Large-scale Aerial-Ground Person Re-Identification datasets, LAGPeR and G2APS-ReID. The former is collected and annotated by us independently, covering 4,231 unique identities and containing 63,841 high-quality images; the latter is reconstructed from the person search dataset G2APS. Through extensive experiments on AGPReID datasets, we demonstrate that SeCap is a feasible …
Poster
Eric Hedlin · Munawar Hayat · Fatih Porikli · Kwang Moo Yi · Shweta Mahajan
[ ExHall D ]
Abstract
To efficiently adapt large models or to train generative models of neural representations, Hypernetworks have drawn interest. While hypernetworks work well, training them is cumbersome, and often requires ground truth optimized weights for each sample. However, obtaining each of these weights is a training problem of its own---one needs to train, e.g., adaptation weights or even an entire neural field for hypernetworks to regress to. In this work, we propose a method to train hypernetworks, without the need for any per-sample ground truth. Our key idea is to learn a Hypernetwork `Field' and estimate the entire trajectory of network weight training instead of simply its converged state. In other words, we introduce an additional input to the Hypernetwork, the convergence state, which then makes it act as a neural field that models the entire convergence pathway of a task network. A critical benefit in doing so is that the gradient of the estimated weights at any convergence state must then match the gradients of the original task---this constraint alone is sufficient to train the Hypernetwork Field. We demonstrate the effectiveness of our method through the task of personalized image generation and 3D shape reconstruction from images and point clouds, demonstrating …
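One plausible reading of the gradient-matching constraint, sketched on a toy quadratic task: a small MLP maps a convergence state t to weights w_hat(t), and training only asks that d w_hat/dt follow the negative task gradient at w_hat. The task, the omission of per-sample conditioning, the sign convention, and the finite-difference derivative are all illustrative assumptions rather than the paper's formulation.

```python
import torch

# Toy sketch of a 'hypernetwork field': a small MLP maps a convergence state
# t in [0, 1] to a weight vector w_hat(t). The training signal below is the
# gradient-matching idea from the abstract: d w_hat / dt should follow the
# (negative) task gradient at w_hat. Task and architecture are assumptions.
torch.manual_seed(0)
dim = 8
w_star = torch.randn(dim)                      # optimum of a toy quadratic task

hyper = torch.nn.Sequential(
    torch.nn.Linear(1, 64), torch.nn.ReLU(), torch.nn.Linear(64, dim))
opt = torch.optim.Adam(hyper.parameters(), lr=1e-3)

def task_grad(w):                              # gradient of 0.5 * ||w - w_star||^2
    return w - w_star

eps = 1e-3
for step in range(500):
    t = torch.rand(32, 1)                      # random convergence states
    w_hat = hyper(t)
    dw_dt = (hyper(t + eps) - w_hat) / eps     # finite-difference d w_hat / dt
    loss = ((dw_dt + task_grad(w_hat)) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(float(loss))
```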
Poster
Takeshi Noda · Chao Chen · Junsheng Zhou · Weiqi Zhang · Yu-Shen Liu · Zhizhong Han
[ ExHall D ]
Abstract
Inferring signed distance functions (SDFs) from sparse point clouds remains a challenge in surface reconstruction. The key lies in the lack of detailed geometric information in sparse point clouds, which is essential for learning a continuous field. To resolve this issue, we present a novel approach that learns a dynamic deformation network to predict SDFs in an end-to-end manner. To parameterize a continuous surface from sparse points, we propose a bijective surface parameterization (BSP) that learns the global shape from local patches. Specifically, we construct a bijective mapping for sparse points from the parametric domain to 3D local patches, integrating patches into the global surface. Meanwhile, we introduce grid deformation optimization (GDO) into the surface approximation to optimize the deformation of grid points and further refine the parametric surfaces. Experimental results on synthetic and real scanned datasets demonstrate that our method significantly outperforms the current state-of-the-art methods.
Poster
Xinran Yang · Donghao Ji · Yuanqi Li · Junyuan Xie · Jie Guo · Yanwen Guo
[ ExHall D ]
Abstract
Point cloud reconstruction is a critical process in 3D representation and reverse engineering. When it comes to CAD models, edges are significant features that play a crucial role in characterizing the geometry of 3D shapes. However, few points are exactly sampled on edges during acquisition, resulting in apparent artifacts for the reconstruction task. Upsampling the point cloud is a direct technical route, but the main challenge is that the upsampled points may not align accurately with the model edges. To overcome this, we develop an integrated framework to estimate edges by jointly regressing three geometry features—point-to-edge direction, point-to-edge distance, and point normal. Benefiting from these features, we implement a novel refinement process that moves and produces more points lying accurately on the edges of the model, allowing for high-quality edge-preserving reconstruction. Experiments and comparisons against previous methods demonstrate our method's effectiveness and superiority.
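A minimal sketch of the refinement step, assuming the regressed features are available per point: points predicted to be near an edge are simply translated along the predicted point-to-edge direction by the predicted distance. The regression networks and any thresholding policy are assumptions, not the paper's exact pipeline.

```python
import numpy as np

def snap_points_to_edges(points, edge_dirs, edge_dists, dist_thresh=0.05):
    """Move points onto nearby model edges using regressed geometry features.

    points:     (N, 3) input point cloud.
    edge_dirs:  (N, 3) unit vectors from each point toward its nearest edge.
    edge_dists: (N,)   distances from each point to its nearest edge.
    Points within dist_thresh are translated along the predicted direction by
    the predicted distance, producing samples that lie (approximately) on the
    edges. This only sketches the refinement step described in the abstract.
    """
    near = edge_dists < dist_thresh
    snapped = points.copy()
    snapped[near] += edge_dirs[near] * edge_dists[near, None]
    return snapped, near

pts = np.random.rand(1000, 3)
dirs = np.random.randn(1000, 3)
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
dists = np.random.rand(1000) * 0.1
edge_pts, mask = snap_points_to_edges(pts, dirs, dists)
print(edge_pts.shape, int(mask.sum()))
```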
Poster
Lin Bie · Shouan Pan · Siqi Li · Yining Zhao · Yue Gao
[ ExHall D ]
Abstract
Although the fusion of images and LiDAR point clouds is crucial to many applications in computer vision, the relative poses of cameras and LiDAR scanners are often unknown. The general registration pipeline first establishes correspondences and then performs pose estimation based on the generated matches. However, 2D-3D correspondences are inherently challenging to establish due to the large gap between images and LiDAR point clouds. To this end, we build a bridge to alleviate the 2D-3D gap and propose a practical framework to align LiDAR point clouds to the virtual points generated by images. In this way, the modality gap is converted to the domain gap of point clouds. Moreover, we propose a virtual-spherical representation and an adaptive distribution sampling module to narrow the domain gap between virtual and LiDAR point clouds. Then, we explore reliable correspondence pattern consistency through a graph-based selection process. We improve the correspondence representation through a graph neural network. Experimental results demonstrate that our method outperforms state-of-the-art methods by more than 10.77% and 12.53% on the KITTI Odometry and nuScenes datasets, respectively. The results demonstrate that our method can effectively solve non-synchronized random frame registration.
Poster
Kang You · Tong Chen · Dandan Ding · M. Salman Asif · Zhan Ma
[ ExHall D ]
Abstract
Despite the substantial advancements demonstrated by learning-based neural models in the LiDAR Point Cloud Compression (LPCC) task, realizing real-time compression—an indispensable criterion for numerous industrial applications—remains a formidable challenge. This paper proposes RENO, the first real-time neural codec for 3D LiDAR point clouds, achieving superior performance with a lightweight model. RENO skips the octree construction and directly builds upon the multiscale sparse tensor representation. Instead of multi-stage inference, RENO devises sparse occupancy codes, which exploit cross-scale correlation and derive voxels' occupancy in a one-shot manner, greatly saving processing time. Experimental results demonstrate that the proposed RENO achieves real-time coding speed, 10 fps at 14-bit depth on a desktop platform (e.g., one RTX 3090 GPU) for both encoding and decoding processes, while providing 12.25% and 48.34% bit-rate savings compared to G-PCCv23 and Draco, respectively, at a similar quality. The RENO model size is merely 1 MB, making it attractive for practical applications. The source code will be made publicly available.
Poster
Changshuo Wang · Shuting He · Xiang Fang · Jiawei Han · Zhonghang Liu · Xin Ning · Weijun Li · Prayag Tiwari
[ ExHall D ]
Abstract
While existing pre-training-based methods have enhanced point cloud model performance, they have not fundamentally resolved the challenge of local structure representation in point clouds. The limited representational capacity of pure point cloud models continues to constrain the potential of cross-modal fusion methods and performance across various tasks. To address this challenge, we propose a Dynamic Acoustic Field Fitting Network (DAF-Net), inspired by physical acoustic principles. Specifically, we represent local point clouds as acoustic fields and introduce a novel Acoustic Field Convolution (AF-Conv), which treats local aggregation as an acoustic energy field modeling problem and captures fine-grained local shape awareness by dividing the local area into near field and far field. Furthermore, drawing inspiration from multi-frequency wave phenomena and dynamic convolution, we develop the Dynamic Acoustic Field Convolution (DAF-Conv) based on AF-Conv. DAF-Conv dynamically generates multiple weights based on local geometric priors, effectively enhancing adaptability to diverse geometric features. Additionally, we design a Global Shape-Aware (GSA) layer incorporating EdgeConv and multi-head attention mechanisms, which combines with DAF-Conv to form the DAF Block. These blocks are then stacked to create a hierarchical DAF-Net architecture. Extensive experiments on point cloud classification, part segmentation, and few-shot semantic segmentation demonstrate that DAF-Net significantly outperforms existing …
Poster
Xiaoyang Wu · Daniel DeTone · Duncan Frost · TIANWEI SHEN · Chris Xie · Nan Yang · Jakob Engel · Richard Newcombe · Hengshuang Zhao · Julian Straub
[ ExHall D ]
Abstract
In this paper, we question whether we have a reliable self-supervised point cloud model that can be used for diverse 3D tasks via simple linear probing, even with limited data and minimal computation. We find that existing 3D self-supervised learning approaches fall short when evaluated on representation quality through linear probing. We hypothesize that this is due to what we term the geometric shortcut, which causes representations to collapse to low-level spatial features. This challenge is unique to 3D and arises from the sparse nature of point cloud data. We address it through two key strategies: obscuring spatial information and enhancing the reliance on input features, ultimately composing a Sonata of 140k point clouds through self-distillation. Sonata is simple and intuitive, yet its learned representations are strong and reliable: zero-shot visualizations demonstrate semantic grouping, alongside strong spatial reasoning through nearest-neighbor relationships. Sonata demonstrates exceptional parameter and data efficiency, tripling linear probing accuracy (from 21.8% to 72.5%) on ScanNet and nearly doubling performance with only 1% of the data compared to previous approaches. Full fine-tuning further advances SOTA across both 3D indoor and outdoor perception tasks. All code and weights will be made available.
Poster
Qi Zhang · Jibin Peng · Zhao Huang · Wei Feng · Di Lin
[ ExHall D ]
Abstract
The recent progress in semantic point cloud segmentation is attributed to deep networks, which require a large amount of point cloud data for training. However, collecting substantial point-wise annotations of point clouds at affordable cost for end-to-end network training remains an open problem. In this paper, we propose Generative Hard Example Augmentation (GHEA) to generate novel examples of point clouds, which enrich the data for training the segmentation network. Firstly, GHEA employs the generative network to embed the discrepancy between the point clouds into the latent space. From the latent space, we sample multiple discrepancies for reshaping a point cloud to various examples, contributing to the richness of the training data. Secondly, GHEA mixes the reshaped point clouds by respecting their segmentation errors. This mixup allows the reshaped point clouds, which are difficult to segment, to serve as challenging examples for network training. We evaluate the effectiveness of GHEA, which helps popular segmentation networks improve their performance.
Poster
Yuzhou Liu · Lingjie Zhu · Hanqiao Ye · Shangfeng Huang · Xiang Gao · Xianwei Zheng · Shuhan Shen
[ ExHall D ]
Abstract
In this paper, we present BWFormer, a novel Transformer-based model for building wireframe reconstruction from airborne LiDAR point clouds. The problem is solved in a ground-up manner by detecting the building corners in 2D, then lifting and connecting them in 3D space, with additional data augmentation. Due to the 2.5D characteristic of airborne LiDAR point clouds, we simplify the problem by projecting the points onto the ground plane to produce a 2D height map. From the height map, a heat map of pixel-wise corner likelihood is first predicted to locate possible 2D corners. Then, 3D corners are predicted by a Transformer-based network with extra height embedding initialization. This 2D-to-3D corner detection strategy reduces the search space significantly. To recover the topological connections among the corners, edges are finally predicted from geometric and visual cues in the height map with the proposed edge attention mechanism, which extracts holistic features and preserves local details simultaneously. In addition, due to the limited datasets in the field and the irregularity of the point clouds, a conditional latent diffusion model for LiDAR scanning simulation is utilized for data augmentation. BWFormer surpasses other state-of-the-art methods, especially in reconstruction completeness. We commit to releasing all our code and pre-trained models.
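A simple sketch of the 2.5D projection step, assuming each ground-plane grid cell stores the maximum height of the LiDAR points falling into it; grid resolution and empty-cell handling are illustrative choices rather than the paper's settings.

```python
import numpy as np

def lidar_to_height_map(points, grid_res=0.5, empty_value=0.0):
    """Project a 2.5D airborne LiDAR point cloud onto the ground plane.

    points: (N, 3) array of x, y, z coordinates. Each grid cell stores the
    maximum z value of the points falling into it, yielding a 2D height map
    of the kind used as input for corner detection. Resolution and
    empty-cell handling are illustrative choices.
    """
    xy_min = points[:, :2].min(axis=0)
    ij = np.floor((points[:, :2] - xy_min) / grid_res).astype(int)
    h, w = ij[:, 0].max() + 1, ij[:, 1].max() + 1
    height_map = np.full((h, w), empty_value, dtype=np.float32)
    np.maximum.at(height_map, (ij[:, 0], ij[:, 1]), points[:, 2])
    return height_map

pts = np.random.rand(5000, 3) * np.array([50.0, 50.0, 10.0])
hm = lidar_to_height_map(pts)
print(hm.shape, hm.max())
```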
Poster
Justin Lazarow · David Griffiths · Gefen Kohavi · Francisco Crespo · Afshin Dehghan
[ ExHall D ]
Abstract
We consider indoor 3D object detection with respect to a single RGB(-D) frame acquired from a commodity handheld device. We seek to significantly advance the status quo with respect to both data and modeling. First, we establish that existing datasets have significant limitations to scale, accuracy, and diversity of objects. As a result, we introduce the **Cubify-Anything 1M (CA-1M) dataset**, which exhaustively labels over 400K 3D objects on over 1K highly accurate laser-scanned scenes with near-perfect registration to over 3.5K handheld, egocentric captures. Next, we establish **Cubify Transformer (CuTR)**, a fully Transformer 3D object detection baseline which rather than operating in 3D on point or voxel-based representations, predicts 3D boxes directly from 2D features derived from RGB(-D) inputs. While this approach lacks any 3D inductive biases, we show that paired with CA-1M, CuTR outperforms point-based methods on CA-1M - accurately recalling over 62% of objects in 3D, and is significantly more capable at handling noise and uncertainty present in commodity LiDAR-derived depth maps while also providing promising RGB only performance without architecture changes. Furthermore, by pre-training on CA-1M, CuTR can outperform point-based methods on a more diverse variant of SUN RGB-D - supporting the notion that while inductive biases in …
Poster
Mohamed Abdelsamad · Michael Ulrich · Claudius Glaeser · Abhinav Valada
[ ExHall D ]
Abstract
Masked autoencoders (MAE) have shown tremendous potential for self-supervised learning (SSL) in vision and beyond. However, point clouds from LiDARs used in automated driving are particularly challenging for MAEs since large areas of the 3D volume are empty. Consequently, existing work suffers from leaking occupancy information into the decoder and has significant computational complexity, thereby limiting the SSL pre-training to only 2D bird's eye view encoders in practice. In this work, we propose the novel neighborhood occupancy MAE (NOMAE) that overcomes the aforementioned challenges by employing masked occupancy reconstruction only in the neighborhood of non-masked voxels. We incorporate voxel masking and occupancy reconstruction at multiple scales with our proposed hierarchical mask generation technique to capture features of objects of different sizes in the point cloud. NOMAE is extremely flexible and can be directly employed for SSL in existing 3D architectures. We perform extensive evaluations on the nuScenes and Waymo Open datasets for the downstream perception tasks of semantic segmentation and 3D object detection, comparing with both discriminative and generative SSL methods. The results demonstrate that NOMAE sets a new state of the art on multiple benchmarks for multiple point cloud perception tasks.
Poster
Zhenxuan Zeng · Qiao Wu · Xiyu Zhang · Lin Yuanbo Wu · Pei An · Jiaqi Yang · Ji Wang · Peng Wang
[ ExHall D ]
Abstract
In real-world environments, a LiDAR point cloud registration method with robust generalization capabilities (across varying distances and datasets) is crucial for ensuring safety in autonomous driving and other LiDAR-based applications. However, current methods fall short in achieving this level of generalization. To address these limitations, we propose UGP, a pruned framework designed to enhance generalization power for LiDAR point cloud registration. The core insight in UGP is the elimination of cross-attention mechanisms to improve generalization, allowing the network to concentrate on intra-frame feature extraction. Additionally, we introduce a progressive self-attention module to reduce ambiguity in large-scale scenes and integrate Bird’s Eye View (BEV) features to incorporate semantic information about scene elements. Together, these enhancements significantly boost the network’s generalization performance. We validated our approach through various generalization experiments in multiple outdoor scenes. In cross-distance generalization experiments on KITTI and nuScenes, UGP achieved state-of-the-art mean Registration Recall rates of 94.5% and 91.4%, respectively. In cross-dataset generalization from nuScenes to KITTI, UGP achieved a state-of-the-art mean Registration Recall of 90.9%.
Poster
Yingping Liang · Yutao Hu · Wenqi Shao · Ying Fu
[ ExHall D ]
Abstract
Depth completion involves predicting dense depth maps from sparse LiDAR inputs, a critical task for applications such as autonomous driving and robotics. However, sparse depth annotations from sensors limit the availability of dense supervision, which is necessary for learning detailed geometric features. To overcome this limitation, we propose a two-stage knowledge distillation framework that leverages powerful monocular foundation models to provide dense supervision for depth completion. In the first stage, we introduce a pre-training strategy that generates diverse training data from natural images to distill geometric knowledge to depth completion. Specifically, we simulate LiDAR scans by utilizing monocular depth and mesh reconstruction, thereby creating training data without requiring ground-truth depth. Nonetheless, monocular depth estimation suffers from inherent scale ambiguity in real-world settings. To address this, in the second stage, we employ a scale- and shift-invariant loss (SSI Loss) to learn real-world scales when fine-tuning on real-world datasets. Our two-stage distillation framework enables depth completion models to harness the strengths of monocular foundation models. Experimental results show that models trained with our two-stage distillation framework achieve top-ranked performance on the KITTI benchmark, demonstrating improvements in both quantitative and qualitative metrics.
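A generic, MiDaS-style sketch of a scale- and shift-invariant loss, used here as a stand-in for the paper's SSI Loss: for each sample, a closed-form least-squares scale and shift align the prediction to the ground truth before an L1 error is taken. The exact variant used in the paper may differ.

```python
import torch

def ssi_loss(pred, target, mask):
    """Scale- and shift-invariant (SSI) depth loss sketch.

    pred, target: (B, H, W) predicted and ground-truth depth maps.
    mask: (B, H, W) boolean mask of valid ground-truth pixels.
    For each sample, a least-squares scale s and shift b (solved via the
    normal equations) align the prediction to the target before an L1 error
    is computed. Generic formulation; the paper's exact variant may differ.
    """
    losses = []
    for i in range(pred.shape[0]):
        p, d = pred[i][mask[i]], target[i][mask[i]]
        n = p.numel()
        spp, sp = (p * p).sum(), p.sum()
        spd, sd = (p * d).sum(), d.sum()
        det = spp * n - sp * sp                      # normal-equation determinant
        s = (spd * n - sp * sd) / det                # least-squares scale
        b = (spp * sd - sp * spd) / det              # least-squares shift
        losses.append((s * p + b - d).abs().mean())
    return torch.stack(losses).mean()

pred = torch.rand(2, 64, 64, requires_grad=True)
target = torch.rand(2, 64, 64)
mask = target > 0.05
print(float(ssi_loss(pred, target, mask)))
```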
Poster
Hou-I Liu · Christine Wu · Jen-Hao Cheng · Wenhao Chai · Shian-yun Wang · Gaowen Liu · Hugo Latapie · Jhih-Ciang Wu · Jenq-Neng Hwang · Hong-Han Shuai · Wen-Huang Cheng
[ ExHall D ]
Abstract
Monocular 3D object detection (Mono3D) holds noteworthy promise for autonomous driving applications owing to the cost-effectiveness and rich visual context of monocular camera sensors. However, depth ambiguity poses a significant challenge, as it requires extracting precise 3D scene geometry from a single image, resulting in suboptimal performance when transferring knowledge from a LiDAR-based teacher model to a camera-based student model. To address this issue, we introduce Monocular Teaching Assistant Knowledge Distillation (MonoTAKD) to enhance 3D perception in Mono3D. Our approach presents a robust camera-based teaching assistant model that effectively bridges the representation gap between different modalities for teacher and student models, addressing the challenge of inaccurate depth estimation. By defining 3D spatial cues as residual features that capture the differences between the teacher and the teaching assistant models, we leverage these cues into the student model, improving its 3D perception capabilities. Experimental results show that our MonoTAKD achieves state-of-the-art performance on the KITTI3D dataset. Additionally, we evaluate the performance on nuScenes and KITTI raw datasets to demonstrate the generalization of our model to multi-view 3D and unsupervised data settings.
Poster
Yunfei Long · Abhinav Kumar · Xiaoming Liu · Daniel Morris
[ ExHall D ]
Abstract
Radar hits reflect from points both on the boundary of and interior to object outlines. This results in a complex distribution of radar hits that depends on factors including object category, size and orientation. Current radar-camera fusion methods implicitly account for this with a black-box neural network. In this paper, we explicitly utilize a radar hit distribution model to assist fusion. First, we build a model to predict radar hit distributions conditioned on object properties obtained from a monocular detector. Second, we use the predicted distribution as a kernel to match actual measured radar points in the neighborhood of the monocular detections, generating matching scores at nearby positions. Finally, a fusion stage combines context with the kernel detector to refine the matching scores. Our method achieves state-of-the-art radar-camera detection performance on nuScenes. We will release the model and code upon publication.
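A toy sketch of the kernel-matching step, assuming the predicted radar-hit distribution is summarized by a single 2D Gaussian in the bird's-eye-view plane: candidate positions around a monocular detection are scored by summing the kernel evaluated at the measured radar hits. The learned distribution model and the fusion stage are not shown.

```python
import numpy as np

def radar_matching_scores(radar_xy, det_center, cov, offsets):
    """Score candidate positions near a monocular detection with a radar kernel.

    radar_xy:   (M, 2) measured radar hits in the bird's-eye-view plane.
    det_center: (2,)   BEV position of the monocular detection.
    cov:        (2, 2) covariance of the predicted radar-hit distribution
                (a simple Gaussian stands in for the learned model here).
    offsets:    (K, 2) candidate position offsets around the detection.
    Each candidate is scored by summing the kernel at the measured hits.
    """
    cov_inv = np.linalg.inv(cov)
    scores = []
    for off in offsets:
        d = radar_xy - (det_center + off)                  # (M, 2) residuals
        maha = np.einsum('mi,ij,mj->m', d, cov_inv, d)     # Mahalanobis distances
        scores.append(np.exp(-0.5 * maha).sum())
    return np.array(scores)

hits = np.random.randn(30, 2) * 0.8 + np.array([10.0, 5.0])
offsets = np.stack(np.meshgrid(np.linspace(-2, 2, 5),
                               np.linspace(-2, 2, 5)), -1).reshape(-1, 2)
scores = radar_matching_scores(hits, np.array([9.5, 5.2]),
                               np.diag([0.8, 0.8]), offsets)
print(offsets[scores.argmax()], scores.max())
```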
Poster
Xingyue Liu · Jiahao Qi · Chen Chen · Kangcheng Bin · Ping Zhong
[ ExHall D ]
Abstract
Visible-Infrared cross-modality Re-Identification (VI-ReID) aims to achieve around-the-clock target matching, benefiting from the strengths of both RGB and infrared (IR) modalities. However, the field is hindered by limited datasets, particularly for vehicle VI-ReID, and by challenges such as modality bias training (MBT), stemming from biased pre-training on ImageNet. To tackle the above issues, this paper introduces the UCM-VeID V2 dataset benchmark for vehicle VI-ReID, and proposes a new self-supervised pre-training method, Cross-Modality Patch-Mixed Self-supervised Learning (PMSL). The UCM-VeID V2 dataset features a significant increase in data volume, along with enhancements in multiple aspects. PMSL addresses MBT by learning modality-invariant features through Patch-Mixed Image Reconstruction (PMIR) and Modality Discrimination Adversarial Learning (MDAL), and enhances discriminability with Modality-Augmented Contrasting Cluster (MACC). Comprehensive experiments are carried out to validate the effectiveness of the proposed method.
Poster
Yunshuang Yuan · Yan Xia · Daniel Cremers · Monika Sester
[ ExHall D ]
Abstract
Cooperative perception can increase the view field and decrease the occlusion of an ego vehicle, hence improving the perception performance and safety of autonomous driving. Despite the success of previous works on cooperative object detection, they mostly operate on dense Bird's Eye View (BEV) feature maps, which is computationally demanding and can hardly be extended to long-range detection problems. More efficient fully sparse frameworks are rarely explored. In this work, we design a fully sparse framework, \textit{SparseAlign}, with three key features: an enhanced sparse 3D backbone, a query-based temporal context learning module, and a robust detection head specially tailored for sparse features. Extensive experimental results on both OPV2V and DairV2X datasets show that our framework, despite sparsity, outperforms the state of the art with less communication bandwidth requirements. In addition, experiments on the OPV2Vt and DairV2Xt datasets for time-aligned cooperative object detection also show a significant performance gain compared to the baseline works.
Poster
Luke Chen · Junyao Wang · Trier Mortlock · Pramod Khargonekar · Mohammad Al Faruque
[ ExHall D ]
Abstract
Uncertainty Quantification (UQ) is crucial for ensuring the reliability of machine learning models deployed in real-world autonomous systems. However, existing approaches typically quantify task-level output prediction uncertainty without considering epistemic uncertainty at the multimodal feature fusion level, leading to sub-optimal outcomes. Additionally, popular uncertainty quantification methods, e.g., Bayesian approximations, remain challenging to deploy in practice due to high computational costs in training and inference. In this paper, we propose HyperDUM, a novel deterministic uncertainty method (DUM) that efficiently quantifies feature-level epistemic uncertainty by leveraging hyperdimensional computing. Our method captures channel and spatial uncertainties through channel-wise and patch-wise projection and bundling techniques, respectively. Multimodal sensor features are then adaptively weighted to mitigate uncertainty propagation and improve feature fusion. Our evaluations show that HyperDUM on average outperforms state-of-the-art (SOTA) algorithms by up to 2.01%/1.27% in 3D object detection and by up to 1.29% over baselines in semantic segmentation tasks under various types of uncertainties. Notably, HyperDUM requires 2.36× fewer floating point operations and up to 38.30× fewer parameters than SOTA methods, providing an efficient solution for real-world autonomous systems.
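A generic hyperdimensional-computing sketch of projection and bundling, assuming a fixed random projection to bipolar hypervectors and bundling by element-wise majority (sign of the sum); the specific channel-wise and patch-wise operators of HyperDUM are not reproduced here.

```python
import numpy as np

def hd_project_and_bundle(features, hd_dim=4096, seed=0):
    """Project feature vectors to hypervectors and bundle them.

    features: (N, C) array, e.g., channel descriptors or patch features.
    Each feature is mapped to a bipolar hypervector with a fixed random
    projection, and the set is bundled by element-wise summation followed
    by a sign. Generic HDC sketch, not HyperDUM's specific operators.
    """
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((features.shape[1], hd_dim))
    hypervecs = np.sign(features @ proj)          # (N, hd_dim) bipolar codes
    bundle = np.sign(hypervecs.sum(axis=0))       # bundled prototype hypervector
    return hypervecs, bundle

feats = np.random.randn(32, 256)                  # e.g., 32 patch features
hv, proto = hd_project_and_bundle(feats)
# similarity of each patch hypervector to the bundle (could proxy uncertainty)
sims = (hv @ proto) / hv.shape[1]
print(proto.shape, sims.mean())
```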
Poster
Dongxu Wei · Zhiqi Li · Peidong Liu
[ ExHall D ]
Abstract
Prior works employing pixel-based Gaussian representation have demonstrated efficacy in feed-forward sparse-view reconstruction. However, such representation necessitates cross-view overlap for accurate depth estimation, and is challenged by object occlusions and frustum truncations. As a result, these methods require scene-centric data acquisition to maintain cross-view overlap and complete scene visibility to circumvent occlusions and truncations, which limits their applicability to scene-centric reconstruction. In contrast, in autonomous driving scenarios, a more practical paradigm is ego-centric reconstruction, which is characterized by minimal cross-view overlap and frequent occlusions and truncations. The limitations of pixel-based representation thus hinder the utility of prior works in this task. In light of this, this paper conducts an in-depth analysis of different representations, and introduces Omni-Gaussian representation with tailored network design to complement their strengths and mitigate their drawbacks. Experiments show that our method significantly surpasses state-of-the-art methods, pixelSplat and MVSplat, in ego-centric reconstruction, and achieves comparable performance to prior works in scene-centric reconstruction. Furthermore, we extend our method with diffusion models, pioneering feed-forward multi-modal generation of 3D driving scenes.
Poster
David T. Hoffmann · Syed Haseeb Raza · Hanqiu Jiang · Steffen Klingenhoefer · Denis Tananaev · Martin Meinke
[ ExHall D ]
Abstract
Scene flow estimation is a foundational task for many robotics applications, ranging from robust dynamic object detection to automatic labeling and sensor synchronization. Two distinct approaches to the problem have evolved: 1) Supervised and 2) optimization-based methods. While supervised methods are fast during inference and achieve high-quality results, they are limited by the need for large amounts of labeled training data and are susceptible to domain gaps. In contrast, unsupervised test-time optimization methods do not face the problem of domain gaps but usually suffer from substantial runtime or fail to converge to the right solution. Current optimization-based approaches often perform poorly on dynamic objects and mainly predict ego-motion. In this work, we mitigate several limitations of existing optimization-based methods. To this end, we 1) introduce a simple voxel grid-based model that exhibits advantageous characteristics compared to the standard MLP-based formulation and 2) introduce a new multi-frame loss formulation. We combine both contributions in our new method, termed Floxels. On our ego-motion compensated benchmark, based on nuScenes and Argoverse, Floxels achieves state of the art (SOTA) results and performs on par with a recently proposed SOTA supervised method. At the same time compute costs scale significantly more gracefully with point cloud …
Poster
Jingyi Xu · Xieyuanli Chen · Junyi Ma · Jiawei Huang · Jintao Xu · Yue Wang · Ling Pei
[ ExHall D ]
Abstract
The task of occupancy forecasting (OCF) involves utilizing past and present perception data to predict future occupancy states of autonomous vehicle surrounding environments, which is critical for downstream tasks such as obstacle avoidance and path planning. Existing 3D OCF approaches struggle to predict plausible spatial details for movable objects and suffer from slow inference speeds due to neglecting the bias and uneven distribution of changing occupancy states in both space and time. In this paper, we propose a novel spatiotemporal decoupling vision-based paradigm to explicitly tackle the bias and achieve both effective and efficient 3D OCF. To tackle spatial bias in empty areas, we introduce a novel spatial representation that decouples the conventional dense 3D format into 2D bird’s-eye view (BEV) occupancy with corresponding height values, enabling 3D OCF derived only from 2D predictions thus enhancing efficiency. To reduce temporal bias on static voxels, we design temporal decoupling to improve end-to-end OCF by temporally associating instances via predicted flows. We develop an efficient multi-head network EfficientOCF to achieve 3D OCF with our devised spatiotemporally decoupled representation. A new metric, conditional IoU (C-IoU), is also introduced to provide a robust 3D OCF performance assessment, especially in datasets with missing or incomplete …
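A simplified sketch of the spatial decoupling, assuming a dense boolean occupancy grid is reduced to a BEV occupancy mask plus, per occupied column, the height index of its topmost occupied voxel; the paper's representation is richer, so this only illustrates the general idea.

```python
import numpy as np

def decouple_occupancy(voxels):
    """Decouple a dense 3D occupancy grid into BEV occupancy plus heights.

    voxels: (X, Y, Z) boolean occupancy grid with Z as the height axis.
    Returns a 2D BEV occupancy mask and, for each occupied BEV cell, the
    index of the highest occupied voxel (only the column top is kept, so
    this simplified decoupling is lossy in general).
    """
    bev = voxels.any(axis=2)                              # (X, Y) occupancy
    # argmax over reversed Z finds the topmost occupied voxel per column
    top = voxels.shape[2] - 1 - np.argmax(voxels[:, :, ::-1], axis=2)
    height = np.where(bev, top, -1)                       # -1 marks empty columns
    return bev, height

vox = np.random.rand(8, 8, 16) > 0.9
bev, height = decouple_occupancy(vox)
print(bev.shape, height.shape, height.max())
```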
Poster
Rui Gong · Kim-Hui Yap · Weide Liu · Xulei Yang · Jun Cheng
[ ExHall D ]
Abstract
Online stereo rectification is critical for autonomous vehicles and robots in dynamic environments, where factors such as vibration, temperature fluctuations, and mechanical stress can affect rectification accuracy and severely degrade downstream stereo depth estimation. Current dominant approaches for online stereo rectification involve estimating relative camera poses in real time to derive rectification homographies. However, they do not directly optimize for rectification constraints, which leads to a gap. Additionally, the general-purpose correspondence matchers used in these methods are not trained for stereo rectification, while training of these matchers typically requires ground-truth correspondences which are not available in stereo rectification datasets. To address these limitations, we propose a matching-based stereo rectification framework that is directly optimized for rectification and does not require ground-truth correspondence annotations for training. Our framework incorporates a rectification-constrained estimator and applies multi-level, rectification-specific supervision that trains the matcher network for rectification without relying on ground-truth correspondences. Additionally, we create a new rectification dataset with ground-truth optical flow annotations, eliminating bias from evaluation metrics used in prior work that relied on pretrained keypoint matching or optical flow models. Extensive experiments show that our approach outperforms both state-of-the-art matching-based and matching-free methods in vertical flow metric by 10.7% on the …
Poster
Xiaolu Liu · Ruizi Yang · Song Wang · Wentong Li · Junbo Chen · Jianke Zhu
[ ExHall D ]
Abstract
Reliable high-definition (HD) map construction is crucial for the driving safety of autonomous vehicles. While recent studies demonstrate improved performance, their generalization capability across unfamiliar driving scenes remains unexplored. To tackle this issue, we propose \textbf{\textit{UIGenMap}}, an uncertainty-instructed structure injection approach for generalizable HD map vectorization, which performs uncertainty resampling over the statistical distribution and employs explicit instance features to reduce excessive reliance on training data. Specifically, we introduce the perspective-view (PV) detection branch to obtain explicit structural features, in which the uncertainty-aware decoder is designed to dynamically sample probability distributions considering the difference in scenes. With probabilistic embedding and selection, UI2DPrompt is proposed to construct PV learnable prompts. These PV prompts are integrated into the map decoder by a designed hybrid injection to compensate for neglected instance structures. To ensure real-time inference, a lightweight Mimic Query Distillation is designed to learn from PV prompts, which can serve as an efficient alternative to the flow of PV branches. Extensive experiments on challenging geographically disjoint (geo-based) data splits demonstrate that our UIGenMap achieves superior performance, with a +5.7 mAP improvement on the nuScenes dataset. Our code will be made publicly available.
Poster
Yunlong Lin · Zixu Lin · Haoyu Chen · Panwang Pan · Chenxin Li · Sixiang Chen · Kairun Wen · Yeying Jin · Wenbo Li · Xinghao Ding
[ ExHall D ]
Abstract
Vision-centric perception systems for autonomous driving often struggle with unpredictable and coupled weather degradations in the wild. Current solutions are often limited, as they either depend on specific degradation priors or suffer from significant domain gaps. To enable robust and autonomous operation in real-world conditions, we propose JarvisIR, a VLM-powered agent that leverages the VLM (e.g., Llava-Llama3) as a controller to manage multiple expert restoration models. To further enhance system robustness, reduce hallucinations, and improve generalizability in real-world adverse weather, JarvisIR employs a novel two-stage framework consisting of supervised fine-tuning and human feedback alignment. Specifically, to address the lack of paired data in real-world scenarios, the human feedback alignment enables the VLM to be fine-tuned effectively on large-scale real-world data in an unsupervised manner. To support the training and evaluation of JarvisIR, we introduce CleanBench, a comprehensive dataset consisting of high-quality and large-scale instruction-response pairs, including 150K synthetic entries and 80K real entries. Extensive experiments demonstrate that JarvisIR exhibits superior decision-making and restoration capabilities. Compared with existing methods, it achieves a 50% improvement in the average of all perception metrics on CleanBench-Real. Furthermore, it effectively supports high-level tasks, such as semantic segmentation and object detection.
Poster
Jingcheng Ni · Yuxin Guo · Yichen Liu · Rui Chen · Lewei Lu · Zehuan Wu
[ ExHall D ]
Abstract
World models that forecast environmental changes from actions are vital for autonomous driving models with strong generalization. Prevailing driving world models are mainly built on pixel-level video prediction models. Although these models can produce high-fidelity video sequences with advanced diffusion-based generators, they are constrained by their predictive duration and overall generalization capabilities. In this paper, we explore solving this problem by combining pixel-level generation loss with MAE-style feature-level context learning. In particular, we instantiate this target with three key designs: (1) a more scalable Diffusion Transformer (DiT) structure trained with an extra mask-construction task; (2) diffusion-related mask tokens to deal with the fuzzy relations between mask reconstruction and the generative diffusion process; (3) an extension of the mask-construction task to the spatial-temporal domain by utilizing row-wise masks for shifted self-attention rather than masked self-attention as in MAE. We then adopt a row-wise cross-view module to align with this mask design. Based on the above improvements, we propose MaskGWM: a Generalizable driving World Model embodied with Video Mask reconstruction. Our model contains two variants: MaskGWM-long, focusing on long-horizon prediction, and MaskGWM-mview, dedicated to multi-view generation. Comprehensive experiments on standard benchmarks validate the effectiveness of the proposed method, which include standard validation on the nuScenes dataset, …
Poster
Ze Yang · Jingkang Wang · Haowei Zhang · Sivabalan Manivasagam · Yun Chen · Raquel Urtasun
[ ExHall D ]
Abstract
High-quality 3D assets for traffic participants such as vehicles and motorcycles are critical for multi-sensor simulation, which is required for the safe end-to-end development of autonomy. Building assets from in-the-wild real-world data is key for diversity and realism, but existing neural-rendering based reconstruction methods are slow and generate assets that can only render close to the original viewpoints of observed actors, restricting usage in simulation. Recent diffusion-based generative models build complete and diverse assets, but perform poorly on in-the-wild driving scenes, where observed actors are captured under sparse and limited fields of view, and are partially occluded. In this work, we propose a 3D latent diffusion model that learns on in-the-wild LiDAR and camera data captured by a sensor platform and generates high-quality 3D assets with complete geometry and appearance. Key to our method is a "reconstruct-then-generate" approach that first leverages occlusion-aware neural rendering trained over multiple scenes to build a high-quality latent space for objects, and then trains a generative diffusion model that operates on the latent space. We show our method outperforms existing reconstruction-based and generation-based methods, unlocking diverse and scalable content creation for simulation.
Poster
Mariam Hassan · Sebastian Stapf · Ahmad Rahimi · Pedro M B Rezende · Yasaman Haghighi · David Brüggemann · Isinsu Katircioglu · Lin Zhang · Xiaoran Chen · Suman Saha · Marco Cannici · Elie Aljalbout · Botao Ye · Xi Wang · Aram Davtyan · Mathieu Salzmann · Davide Scaramuzza · Marc Pollefeys · Paolo Favaro · Alex Alahi
[ ExHall D ]
Abstract
World models predict future frames from past observations and actions, making them powerful simulators for ego-vision tasks with complex dynamics, such as autonomous driving. Nonetheless, existing world models for ego-vision mainly focus on the driving domain and the ego-vehicle's actions, limiting the complexity and diversity of the generated scenes. In this work, we propose \textit{GEM}, a diffusion-based world model with a generalized control strategy. By leveraging ego-trajectories and general image features, GEM not only allows for fine-grained control over the ego-motion, but also enables control over the motion of other objects in the scene and supports scene composition by inserting new objects. GEM is multimodal, capable of generating both videos and future depth sequences, providing rich semantic and spatial output contexts. Although our primary focus remains on the domain of autonomous driving, we explore the adaptability of GEM to other ego-vision domains such as human activity and drone navigation. To evaluate GEM’s controllability, we propose a comprehensive evaluation framework. The results show the effectiveness of GEM in controlling the motion of objects within the scene, with conditional generation outperforming unconditional generation by 68% and 79% on nuScenes and OpenDV respectively.
Poster
Inhwan Bae · Junoh Lee · Hae-Gon Jeon
[ ExHall D ]
Abstract
Modeling and reproducing crowd behaviors are important in various domains including psychology, robotics, transport engineering and virtual environments. Conventional methods have focused on synthesizing momentary scenes, which have difficulty in replicating the continuous nature of real-world crowds. In this paper, we introduce a novel method for automatically generating continuous, realistic crowd trajectories with heterogeneous behaviors and interactions among individuals. We first design a crowd emitter model. To do this, we obtain spatial layouts from single input images, including a segmentation map, appearance map, population density map and population probability, prior to crowd generation. The emitter then continually places individuals on the timeline by assigning independent behavior characteristics such as agents' type, pace, and start/end positions using diffusion models. Next, our crowd simulator produces their long-term locomotions. To simulate diverse actions, it can augment their behaviors based on a Markov chain. As a result, our overall framework populates the scenes with heterogeneous crowd behaviors by alternating between the proposed emitter and simulator. Note that all the components in the proposed framework are user-controllable. Lastly, we propose a benchmark protocol to evaluate the realism and quality of the generated crowds in terms of the scene-level population dynamics and the individual-level trajectory accuracy. …
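To illustrate the Markov-chain behavior augmentation mentioned above, here is a minimal sketch that samples a per-agent behavior label sequence from a hand-specified transition matrix. The state names and probabilities are invented for illustration and are not taken from the paper.

```python
import numpy as np

# hypothetical behavior states and transition probabilities (rows sum to 1)
STATES = ["walk", "stand", "run", "look_around"]
TRANSITIONS = np.array([
    [0.80, 0.10, 0.05, 0.05],   # from walk
    [0.30, 0.50, 0.05, 0.15],   # from stand
    [0.20, 0.05, 0.70, 0.05],   # from run
    [0.40, 0.30, 0.05, 0.25],   # from look_around
])

def sample_behavior_sequence(start="walk", steps=20, rng=None):
    """Sample a per-timestep behavior label sequence for one agent."""
    rng = np.random.default_rng() if rng is None else rng
    seq, state = [start], STATES.index(start)
    for _ in range(steps - 1):
        state = rng.choice(len(STATES), p=TRANSITIONS[state])
        seq.append(STATES[state])
    return seq

print(sample_behavior_sequence())
```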
Poster
Ziying Song · Caiyan Jia · Lin Liu · Hongyu Pan · Yongchang Zhang · Junming Wang · Xingyu Zhang · Shaoqing Xu · Lei Yang · Yadan Luo
[ ExHall D ]
Abstract
End-to-end autonomous driving frameworks enable seamless integration of perception and planning but often rely on one-shot trajectory prediction, which may lead to unstable control and vulnerability to occlusions in single-frame perception. To address this, we propose the Momentum-Aware Driving (MomAD) framework, which introduces trajectory momentum and perception momentum to stabilize and refine trajectory predictions. MomAD comprises two core components: (1) Topological Trajectory Matching (TTM) employs the Hausdorff Distance to select the optimal planning query that aligns with prior paths to ensure coherence; (2) Momentum Planning Interactor (MPI) cross-attends the selected planning query with historical queries to expand static and dynamic perception files. This enriched query, in turn, helps regenerate long-horizon trajectories and reduce collision risks. To mitigate noise arising from dynamic environments and detection errors, we introduce robust instance denoising during training, enabling the planning model to focus on critical signals and improve its robustness. To quantify planning stability, we introduce a novel Trajectory Prediction Consistency (TPC) metric. Experiments on the nuScenes dataset demonstrate that MomAD achieves superior long-term consistency (≥3s) compared to SOTA methods. Furthermore, we curate a Turning-nuScenes validation set to evaluate model performance in challenging turning scenarios, where MomAD reduces the collision rate by 26\% and TPC …
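To make the Topological Trajectory Matching step concrete, here is a minimal sketch of selecting, among candidate planning trajectories, the one closest to the previous plan under the symmetric Hausdorff distance. The array shapes and candidate construction are assumptions for illustration, not the authors' code.

```python
import numpy as np

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two 2D trajectories (Nx2, Mx2)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # pairwise distances
    return max(d.min(axis=1).max(), d.min(axis=0).max())

def select_planning_query(candidates, prior_traj):
    """Pick the candidate trajectory most consistent with the prior path."""
    dists = [hausdorff(c, prior_traj) for c in candidates]
    return int(np.argmin(dists)), min(dists)

# toy usage: a straight prior path and three candidates with increasing noise
prior = np.cumsum(np.ones((6, 2)) * 0.5, axis=0)
cands = [prior + np.random.randn(6, 2) * s for s in (0.05, 0.3, 1.0)]
print(select_planning_query(cands, prior))
```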
Poster
Shihao Wang · Zhiding Yu · Xiaohui Jiang · Shiyi Lan · Min Shi · Nadine Chang · Jan Kautz · Ying Li · Jose M. Alvarez
[ ExHall D ]
Abstract
The advances in vision-language models (VLMs) have led to a growing interest in autonomous driving to leverage their strong reasoning capabilities. However, extending these capabilities from 2D to full 3D understanding is crucial for real-world applications. To address this challenge, we propose OmniDrive, a holistic vision-language dataset that aligns agent models with 3D driving tasks through counterfactual reasoning. This approach enhances decision-making by evaluating potential scenarios and their outcomes, similar to human drivers considering alternative actions. Our counterfactual-based synthetic data annotation process generates large-scale, high-quality datasets, providing denser supervision signals that bridge planning trajectories and language-based reasoning. Further, we explore two advanced OmniDrive-Agent frameworks, namely Omni-L and Omni-Q, to assess the importance of vision-language alignment versus 3D perception, revealing critical insights into designing effective LLM-agents. Significant improvements on the DriveLM Q\&A benchmark and nuScenes open-loop planning demonstrate the effectiveness of our dataset and methods.
Poster
Weizhen Wang · Chenda Duan · Zhenghao Peng · Yuxin Liu · Bolei Zhou
[ ExHall D ]
Abstract
Vision Language Models (VLMs) show promise as embodied agents in many mobility applications, yet there is a lack of a generalizable platform for evaluating their spatial reasoning and embodied scene understanding. We introduce MetaVQA, a comprehensive benchmark that assesses and enhances VLMs’ understanding of spatial relationships and embodied dynamics in driving scenes through Visual-Question-Answering (VQA) and closed-loop simulation. MetaVQA collects various question-answer pairs from diverse real-world traffic scenarios through Set-of-Mark prompting and top-down view ground-truth annotations of nuScenes and Waymo datasets to ensure real-world and object-centric instructions. We demonstrate that fine-tuning VLMs on the MetaVQA dataset improves their spatial reasoning and embodied scene understanding in safety-critical simulations. Code and data will be made available.
Poster
Kai Chen · Xiaodong Zhao · Yujie Huang · GuoyuFang · Xiao Song · Ruiping Wang · Ziyuan Wang
[ ExHall D ]
Abstract
The analysis and prediction of agent trajectories are crucial for decision-making processes in intelligent systems, with precise short-term trajectory forecasting being highly significant across a range of applications. Agents and their social interactions have been quantified and modeled by researchers from various perspectives; however, substantial limitations exist in the current work due to the inherent high uncertainty of agent intentions and the complex higher-order influences among neighboring groups. SocialMOIF is proposed to tackle these challenges, concentrating on the higher-order intention interactions among neighboring groups while reinforcing the primary role of first-order intention interactions between neighbors and the target agent. This method develops a multi-order intention fusion model to achieve a more comprehensive understanding of both direct and indirect intention information. Within SocialMOIF, a trajectory distribution approximator is designed to guide the trajectories toward values that align more closely with the actual data, thereby enhancing model interpretability. Furthermore, a global trajectory optimizer is introduced to enable more accurate and efficient parallel predictions. By incorporating a novel loss function that accounts for distance and direction during training, experimental results demonstrate that the model outperforms previous state-of-the-art baselines across multiple metrics in both dynamic and static datasets.
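As a rough illustration of a loss that accounts for both distance and direction, the sketch below combines a per-step L2 position error with a cosine penalty on step displacement directions. The weighting and exact formulation are assumptions, not the paper's definition.

```python
import torch

def distance_direction_loss(pred, gt, w_dir=0.5, eps=1e-6):
    """pred, gt: (B, T, 2) trajectories.
    Distance term: mean L2 error per step.
    Direction term: 1 - cosine similarity between step displacement vectors."""
    dist = (pred - gt).norm(dim=-1).mean()
    dp, dg = pred[:, 1:] - pred[:, :-1], gt[:, 1:] - gt[:, :-1]
    cos = torch.nn.functional.cosine_similarity(dp, dg, dim=-1, eps=eps)
    direction = (1.0 - cos).mean()
    return dist + w_dir * direction

# toy usage with random trajectories
pred = torch.randn(4, 12, 2, requires_grad=True)
gt = torch.randn(4, 12, 2)
print(distance_direction_loss(pred, gt))
```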
Poster
Guillem Font Font · Antonio Rubio · Luis Ferraz · Antonio Agudo
[ ExHall D ]
Abstract
Multi-agent trajectory modeling has primarily focused on forecasting future states, often overlooking broader tasks like trajectory completion, which are crucial for real-world applications such as correcting tracking data. Existing methods also generally predict agents' states without offering any state-wise measure of uncertainty. Moreover, popular multi-modal sampling methods lack any error probability estimates for each generated scene under the same prior observations, making it difficult to rank the predictions during inference time. We introduce U2Diff, a unified diffusion model designed to handle trajectory completion while providing state-wise uncertainty estimates jointly. This uncertainty estimation is achieved by augmenting the simple denoising loss with the negative log-likelihood of the predicted noise and propagating latent space uncertainty to the real state space. Additionally, we incorporate a Rank Neural Network in post-processing to enable error probability estimation for each generated mode, demonstrating a strong correlation with the error relative to ground truth. Our method outperforms the state-of-the-art solutions in trajectory completion and forecasting across four challenging sports datasets (NBA, Basketball-U, Football-U, Soccer-U), highlighting the effectiveness of uncertainty and error probability estimation.
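To illustrate how a denoising loss can be augmented with a negative log-likelihood term to obtain state-wise uncertainty, here is a hedged PyTorch sketch; it assumes the network outputs both a noise estimate and a log-variance head, and the exact weighting is illustrative rather than the paper's.

```python
import torch

def uncertainty_denoising_loss(eps_pred, log_var, eps_true):
    """Simple denoising (MSE) loss augmented with a Gaussian negative
    log-likelihood on the noise residual, giving per-element uncertainty.
    eps_pred, log_var, eps_true: tensors of identical shape."""
    mse = (eps_pred - eps_true).pow(2)
    nll = 0.5 * (log_var + mse / log_var.exp())   # Gaussian NLL up to a constant
    return mse.mean() + nll.mean()

# toy usage: a noise head that is slightly off, with zero-initialized log-variance
eps_true = torch.randn(8, 16, 4)
eps_pred = eps_true + 0.1 * torch.randn_like(eps_true)
log_var = torch.zeros_like(eps_true, requires_grad=True)
print(uncertainty_denoising_loss(eps_pred, log_var, eps_true))
```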
Poster
Greg Heinrich · Mike Ranzinger · Danny Yin · Yao Lu · Jan Kautz · Bryan Catanzaro · Andrew Tao · Pavlo Molchanov
[ ExHall D ]
Abstract
Agglomerative models have recently emerged as a powerful approach to training vision foundation models, leveraging multi-teacher distillation from existing models such as CLIP, DINO, and SAM. This strategy enables the creation of robust models more efficiently, combining the strengths of individual teachers while significantly reducing computational and resource demands. In this paper, we thoroughly analyze state-of-the-art agglomerative models, identifying critical challenges including resolution mode shifts, teacher imbalance, weak initializations, idiosyncratic teacher artifacts, and an excessive number of output tokens. To address these issues, we propose several novel solutions: multi-resolution training, mosaic augmentation, and improved balancing of teacher loss functions. Specifically, in the context of Vision Language Models, we introduce a token compression technique to maintain high-resolution information within a fixed token count. We release our top-performing models, available in multiple scales (-B, -L, and -H), alongside code and pretrained weights, to support further research and development in the community.
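As a rough sketch of balanced multi-teacher feature distillation of the kind described above, the snippet below combines per-teacher cosine losses with explicit balancing weights. The teacher names, shapes, and weights are placeholders, not the paper's configuration.

```python
import torch
import torch.nn.functional as F

def multi_teacher_distill_loss(student_feats, teacher_feats, weights):
    """student_feats: dict teacher_name -> (B, N, D) student projections.
    teacher_feats:  dict teacher_name -> (B, N, D) frozen teacher features.
    weights:        dict teacher_name -> float balancing coefficient."""
    total = 0.0
    for name, t in teacher_feats.items():
        s = student_feats[name]
        total = total + weights[name] * (1 - F.cosine_similarity(s, t, dim=-1)).mean()
    return total

# toy usage with two hypothetical teachers of equal feature width
teachers = {"clip": torch.randn(2, 196, 768), "dino": torch.randn(2, 196, 768)}
students = {k: torch.randn_like(v) for k, v in teachers.items()}
print(multi_teacher_distill_loss(students, teachers, {"clip": 1.0, "dino": 0.5}))
```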
Poster
Kwan-Yee Lin · Stella X. Yu
[ ExHall D ]
Abstract
Despite significant progress in humanoid robotics, research remains fragmented: low-level motor skill learning often disregards the influence of long-horizon goals on current movement and lacks situational awareness. Meanwhile, high-level navigation struggles to accommodate real-world constraints and adapt to the irregularity of local terrains, falling short in last-step feasibility. To bridge these gaps, we present LEGO-H, a universal learning framework that trains humanoid robots to become expert hikers on complex trails by developing and integrating skills across all levels, embracing physical embodiment through both visual perceptual awareness and body dynamics. At the heart of LEGO-H's designs is the harmonization of robots' visual perception, decision-making, and motor skill execution -- grounded in new perspectives on the Hierarchical Reinforcement Learning (HRL) framework and the knowledge transfer process of privileged learning. Our key innovations include: (1) TC-ViTs, a Temporal Vision Transformer variant tailored into HRL, framing local navigation as a sequential hallucination task, softly guiding locomotion policy learning. This design seamlessly grafts locomotion and goal navigation into a unified, end-to-end policy learning framework. (2) Hierarchical Loss Metric for Policy Distillation. To ensure the versatility of motor skills, LEGO-H harnesses the power of privileged learning. However, humanoid robots are highly articulated, where rationality of …
Poster
Jinliang Zheng · Jianxiong Li · Dongxiu Liu · Yinan Zheng · Zhihao Wang · Zhonghong Ou · Yu Liu · Jingjing Liu · Ya-Qin Zhang · Xianyuan Zhan
[ ExHall D ]
Abstract
Training on diverse, internet-scale data is a key factor in the success of recent large foundation models. Yet, using the same recipe for building embodied agents has faced noticeable difficulties. Despite the availability of many crowd-sourced embodied datasets, their action spaces often exhibit significant heterogeneity due to distinct physical embodiment and control interfaces for different robots, causing substantial challenges in developing embodied foundation models using cross-embodiment data. In this paper, we introduce UniAct, a new embodied foundation modeling framework operating in the Universal Action Space. Our learned universal actions capture the generic behaviors across diverse robots by exploiting their shared structural features, and enable enhanced cross-domain data utilization and cross-embodiment generalizations by eliminating the notorious heterogeneity. Moreover, the universal actions can be efficiently translated back to heterogeneous actionable commands by simply adding embodiment-specific details, from which fast adaptation to new robots becomes simple and straightforward. Our 0.5B instantiation of UniAct outperforms 14X larger SOTA embodied foundation models in extensive evaluations on various real-world and simulation robots, showcasing exceptional cross-embodiment control and adaptation capability, highlighting the crucial benefit of adopting universal actions.
Poster
Shibo Zhao · Sifan Zhou · Raphael Blanchard · Yuheng Qiu · Wenshan Wang · Sebastian Scherer
[ ExHall D ]
Abstract
Despite recent advances in deep learning, most existing learning-based IMU odometry methods are trained on specific datasets, lack generalization, and are prone to overfitting, which limits their real-world application. To address these challenges, we present Tartan IMU, a foundation model designed for generalizable, IMU-based state estimation across diverse robotic platforms. Our approach consists of three stages: First, a pre-trained foundation model leverages over 100 hours of multi-platform data to establish general motion knowledge, achieving a 36\% improvement in ATE over specialized models. Second, to adapt to previously unseen tasks, we employ Low-Rank Adaptation (LoRA), allowing positive transfer with only 1.1M trainable parameters. Finally, to support robotics deployment, we introduce online test-time adaptation, which eliminates the boundary between training and testing, allowing the model to continuously "learn as it operates" at 200 FPS in real time.
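To illustrate the parameter-efficient adaptation step, here is a minimal PyTorch sketch of a LoRA-adapted linear layer: the pretrained weight is frozen and only a low-rank update is trained. The rank, scaling, and layer sizes are illustrative, not the values used in Tartan IMU.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update x @ A @ B."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, base.out_features))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A @ self.B) * self.scale

# toy usage: only the low-rank factors are trainable
layer = LoRALinear(nn.Linear(256, 256))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))
```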
Poster
Shengyi Qian · Kaichun Mo · Valts Blukis · David Fouhey · Dieter Fox · Ankit Goyal
[ ExHall D ]
Abstract
Recent works have shown that visual pretraining on egocentric datasets using masked autoencoders (MAE) can improve generalization for downstream robotics tasks. However, these approaches pretrain only on 2D images, while many robotics applications require 3D scene understanding. In this work, we propose 3D-MVP, a novel approach for 3D multi-view pretraining using masked autoencoders. We leverage Robotic View Transformer (RVT), which uses a multi-view transformer to understand the 3D scene and predict gripper pose actions. We split RVT's multi-view transformer into visual encoder and action decoder, and pretrain its visual encoder using masked autoencoding on large-scale 3D datasets such as Objaverse. We evaluate 3D-MVP on a suite of virtual robot manipulation tasks and demonstrate improved performance over baselines. Our results suggest that 3D-aware pretraining is a promising approach to improve sample efficiency and generalization of vision-based robotic manipulation policies. We will release code and pretrained models for 3D-MVP to facilitate future research.
Poster
Haifeng Huang · Xinyi Chen · Yilun Chen · Hao Li · Xiaoshen Han · zehan wang · Tai Wang · Jiangmiao Pang · Zhou Zhao
[ ExHall D ]
Abstract
Recent advancements in robot manipulation have highlighted the potential of intermediate representations for improving policy generalization. In this work, we explore grounding masks as an effective intermediate representation, balancing two key advantages: (1) effective spatial guidance that specifies target objects and placement areas while also conveying information about object shape and size, enabling low-level policies to accurately interpret spatial information, and (2) broad generalization potential driven by large-scale vision-language models pretrained on diverse grounding datasets. We introduce RoboGround, a grounding-aware robotic policy that leverages grounding masks as an intermediate representation to guide policy networks in object manipulation tasks. To further explore and enhance generalization, we propose an automated pipeline for generating large-scale, simulated data featuring a diverse set of objects and instructions. Extensive experiments show the value of our dataset and the effectiveness of grounding masks as intermediate guidance, significantly enhancing the generalization abilities of robot policies.
Poster
Jiaming Zhou · Teli Ma · Kun-Yu Lin · Zifan Wang · Ronghe Qiu · Junwei Liang
[ ExHall D ]
Abstract
Learning generalizable visual representations across different embodied environments is essential for effective robotic manipulation in real-world scenarios. However, the limited scale and diversity of robot demonstration data pose a significant challenge. Recent research has explored leveraging large-scale human activity data for pre-training, but the substantial morphological differences between humans and robots introduce a significant human-robot domain discrepancy, hindering the generalization of these models to downstream manipulation tasks. To overcome this, we propose a novel adaptation paradigm that leverages readily available paired human-robot video data to bridge the domain gap. Our method employs a human-robot contrastive alignment loss to align the semantics of human and robot videos, adapting pre-trained models to the robot domain in a parameter-efficient manner. Experiments on 20 simulated tasks across two different benchmarks and five real-world tasks demonstrate significant improvements. These results span both single-task and language-conditioned multi-task settings, evaluated using two different pre-trained models. Compared to existing pre-trained models, our adaptation method improves the average success rate by over 7% across multiple tasks on both simulated benchmarks and real-world evaluations. We will release the code and models.
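To make the human-robot contrastive alignment idea concrete, here is a hedged sketch of a symmetric InfoNCE loss over paired human and robot clip embeddings; the embedding dimension, batch size, and temperature are assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(human_emb, robot_emb, temperature=0.07):
    """Symmetric InfoNCE over paired human/robot video embeddings (B, D):
    the i-th human clip should match the i-th robot clip and vice versa."""
    h = F.normalize(human_emb, dim=-1)
    r = F.normalize(robot_emb, dim=-1)
    logits = h @ r.t() / temperature
    targets = torch.arange(len(h), device=h.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# toy usage with random embeddings standing in for video encoders
h = torch.randn(32, 512)
r = torch.randn(32, 512)
print(contrastive_alignment_loss(h, r))
```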
Poster
Quanyuan Ruan · Jiabao Lei · Wenhao Yuan · Yanglin Zhang · Dekun Lu · Guiliang Liu · Kui Jia
[ ExHall D ]
Abstract
Differentiable rendering has gained significant attention in the field of robotics, with differentiable robot rendering emerging as an effective paradigm for learning robotic actions from image-space supervision. However, the lack of physical world perception in this approach may lead to potential collisions during action optimization. In this work, we introduce a novel improvement on previous efforts by incorporating physical awareness of collisions through the learning of a neural robotic collision classifier. This enables the optimization of actions that avoid collisions with static, non-interactable environments as well as the robot itself. To facilitate effective gradient optimization with the classifier, we identify the underlying issue and propose leveraging Eikonal regularization to ensure consistent gradients for optimization. Our solution can be seamlessly integrated into existing differentiable robot rendering frameworks, utilizing gradients for optimization and providing a foundation for future applications of differentiable rendering in robotics with improved reliability of interactions with the physical world. Both qualitative and quantitative experiments demonstrate the necessity and effectiveness of our method compared to previous solutions.
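To illustrate the Eikonal regularization idea, here is a minimal PyTorch sketch that encourages the collision classifier's input gradient to have unit norm, which keeps the gradients used for action optimization well-conditioned. The classifier architecture and configuration sampling are placeholders, not the authors' model.

```python
import torch

def eikonal_regularizer(classifier, configs):
    """Encourage ||grad_q f(q)|| to stay close to 1 so gradients used for action
    optimization remain well-conditioned. `classifier` maps joint configurations
    (B, D) to a scalar score; `configs` are sampled configurations."""
    configs = configs.clone().requires_grad_(True)
    scores = classifier(configs).sum()
    grads, = torch.autograd.grad(scores, configs, create_graph=True)
    return ((grads.norm(dim=-1) - 1.0) ** 2).mean()

# toy classifier standing in for the learned collision network (7-DoF arm assumed)
net = torch.nn.Sequential(torch.nn.Linear(7, 64), torch.nn.Softplus(),
                          torch.nn.Linear(64, 1))
q = torch.randn(16, 7)
print(eikonal_regularizer(lambda x: net(x).squeeze(-1), q))
```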
Poster
Yuanqi Yao · Siao Liu · Haoming Song · Delin Qu · Qizhi Chen · Yan Ding · Bin Zhao · Zhigang Wang · Dong Wang · Xuelong Li
[ ExHall D ]
Abstract
Learning a generalist robot that can effectively leverage prior knowledge for continuous skill acquisition remains significantly challenging. Despite the success of experience replay and parameter-efficient methods in maintaining knowledge across skills, naively applying these methods causes a failure to leverage the shared primitives between skills. To tackle these issues, we propose Primitive Prompt Learning (PPL) to achieve lifelong robot manipulation via reusable and extensible primitives. Within our two-stage learning scheme, we first learn a set of primitive prompts to model primitives through a multi-skill pre-training stage, where motion-aware prompts are learned to capture semantic and motion-shared primitives across different skills. Secondly, when acquiring new skills over a lifelong span, new prompts are concatenated and optimized with frozen pretrained prompts, boosting learning via knowledge transfer from old skills to new ones. For evaluation, we construct a large-scale skill dataset and conduct extensive experiments in both simulation and real-world tasks, demonstrating PPL's superior performance over state-of-the-art methods. Code and dataset will be released upon acceptance.
Poster
Yiming Zhong · Qi Jiang · Jingyi Yu · Yuexin Ma
[ ExHall D ]
Abstract
A dexterous hand capable of grasping any object is essential for the development of general-purpose embodied intelligent robots. However, due to the high degree of freedom in dexterous hands and the vast diversity of objects, generating high-quality, usable grasping poses in a robust manner is a significant challenge. In this paper, we introduce DexGrasp Anything, a method that effectively integrates physical constraints into both the training and sampling phases of a diffusion-based generative model, achieving state-of-the-art performance across nearly all open datasets. Additionally, we present a new dexterous grasping dataset containing over 3.4 million diverse grasping poses for more than 15k different objects, demonstrating its potential to advance universal dexterous grasping. The code of our method and our dataset will be publicly released soon.
Poster
Yuxing Long · Jiyao Zhang · Mingjie Pan · Tianshu Wu · Taewhan Kim · Hao Dong
[ ExHall D ]
Abstract
Correct use of electrical appliances has significantly improved human life quality. Unlike simple tools that can be manipulated with common sense, different parts of electrical appliances have specific functions defined by manufacturers. If we want a robot to heat bread in a microwave, we should enable it to review the microwave's manual first. From the manual, it can learn about component functions, interaction methods, and representative task steps for appliances. However, previous manual-related works remain limited to question-answering tasks, while existing manipulation research ignores the manual's important role and fails to comprehend multi-page manuals. In this paper, we propose the first manual-based appliance manipulation benchmark CheckManual. Specifically, we design a large model-assisted human-revised data generation pipeline to create manuals based on CAD appliance models. With these manuals, we establish novel manual-based manipulation challenges, metrics, and simulator environments for model performance evaluation. Furthermore, we propose the first manual-based manipulation planning model ManualPlan to set up a group of baselines for the CheckManual benchmark.
Poster
Sai Kumar Dwivedi · Dimitrije Antić · Shashank Tripathi · Omid Taheri · Cordelia Schmid · Michael J. Black · Dimitrios Tzionas
[ ExHall D ]
Abstract
Estimating the 3D pose and shape of interacting humans and objects from single in-the-wild images is important for mixed reality and robotics. This is challenging due to occlusions, depth ambiguities, and widely varying object shapes. Existing work tackles these challenges by exploiting surface contact points on the body and object and using these to guide 3D reconstruction. Unfortunately, obtaining 3D contact annotations requires either expensive 3D ground truth or time-consuming manual labeling. Consequently, obtaining training data at scale is a challenge. We tackle this by developing a novel model called InteractVLM that harnesses the broad visual knowledge of large Visual-Language Models (VLMs). The problem is, however, that these large models do not directly “understand” 3D human-object contact. To address this, we exploit existing small datasets of 3D human-object interaction to fine-tune large models to understand contact. However, this is non-trivial, as such models reason “only” in 2D, while contact is inherently 3D. Thus, we introduce a novel “Render-Localize-Lift” module that: (1) embeds 3D body and object surfaces in 2D space via multi-view rendering, (2) trains a novel multi-view localization model (MV-Loc) to infer contacts in 2D, and (3) lifts these to 3D. This lets InteractVLM infer 3D contacts for both …
Poster
Yujie Liang · Xiaobin Hu · Boyuan Jiang · Donghao Luo · Xu Peng · Kai WU · Chengming Xu · Wenhui Han · Taisong Jin · Chengjie Wang · Rongrong Ji
[ ExHall D ]
Abstract
Although diffusion-based image virtual try-on has made considerable progress, emerging approaches still struggle to effectively address the issue of hand occlusion (i.e., clothing regions occluded by the hand), leading to a notable degradation of try-on performance. To tackle this issue, which widely exists in real-world scenarios, we propose VTON-HandFit, which leverages the power of hand priors to reconstruct the appearance and structure in hand-occlusion cases. Firstly, we tailor a Handpose Aggregation Net using a ControlNet-based structure to explicitly and adaptively encode the global hand and pose priors. Besides, to fully exploit the hand-related structure and appearance information, we propose a Hand-feature Disentanglement Embedding module to disentangle the hand priors into hand structure-parametric and visual-appearance features, and customize a masked cross-attention for further decoupled feature embedding. Lastly, we customize a hand-canny constraint loss to better learn structure edge knowledge from the hand template of the model image. VTON-HandFit outperforms the baselines in qualitative and quantitative evaluations on the public dataset and our self-collected hand-occlusion Handfit-3K dataset, particularly for arbitrary hand-pose occlusion cases in real-world scenarios. The code and dataset will be available.
Poster
Kaixin Fan · Pengfei Ren · Jingyu Wang · Haifeng Sun · Qi Qi · Zirui Zhuang · Jianxin Liao
[ ExHall D ]
Abstract
3D hand reconstruction is essential in non-contact human-computer interaction applications, but existing methods struggle with low-resolution images, which occur in slightly distant interactive scenes. Leveraging temporal information can mitigate the limitations of individual low-resolution images that lack detailed appearance information, thereby enhancing the robustness and accuracy of hand reconstruction. Existing temporal methods typically use joint features to represent temporal information, avoiding interference from redundant background information. However, joint features excessively disregard the spatial context of visual features, limiting hand reconstruction accuracy. We propose to integrate temporal joint features with visual features to construct a robust low-resolution visual representation. We introduce Triplane Features, a dense representation with 3D spatial awareness, to bridge the gap between the joint features and visual features that are misaligned in terms of representation form and semantics. Triplane Features are obtained by orthogonally projecting the joint features, embedding hand structure information into the 3D spatial context. Furthermore, we compress the spatial information of the three planes into a 2D dense feature through Spatial-Aware Fusion to enhance the visual features. By using enhanced visual features enriched with temporal information for hand reconstruction, our method achieves competitive performance at much lower resolutions compared to state-of-the-art methods operating at high …
Poster
Li Zhang · mingliang xu · Jianan Wang · Qiaojun Yu · Lixin Yang · Yonglu Li · Cewu Lu · RujingWang · Liu Liu
[ ExHall D ]
Abstract
Garments are common in daily life and are important for the embodied intelligence community. Current category-level garment pose tracking works focus on predicting point-wise canonical correspondences and learning a shape deformation in point cloud sequences. In this paper, motivated by the 2D warping space and shape priors, we propose GaPT-DAR, a novel category-level Garments Pose Tracking framework with integrated 2D Deformation And 3D Reconstruction functions, which fully utilizes 3D-2D projection and 2D-3D reconstruction to transform 3D point-wise learning into 2D warping deformation learning. Specifically, GaPT-DAR first builds a Voting-based Project module that learns the optimal 3D-2D projection plane for maintaining the maximum orthogonal entropy during point projection. Next, a Garments Deformation module is designed in 2D space to explicitly model the garment warping procedure with deformation parameters. Finally, we build a Depth Reconstruction module to recover the 2D images into a 3D warp field. We provide extensive experiments on the VR-Folding dataset to evaluate GaPT-DAR, and the results show clear improvements on most metrics compared to the state of the art (i.e., GarmentNets and GarmentTracking). Codes will be made publicly available.
Poster
Dong Li · Wenqi Zhong · Wei Yu · Yingwei Pan · Dingwen Zhang · Ting Yao · Junwei Han · Tao Mei
[ ExHall D ]
Abstract
Video virtual try-on aims to seamlessly dress a subject in a video with a specific garment. The primary challenge involves preserving the visual authenticity of the garment while dynamically adapting to the pose and physique of the subject. While existing methods have predominantly focused on image-based virtual try-on, extending these techniques directly to videos often results in temporal inconsistencies. Most current video virtual try-on approaches alleviate this challenge by incorporating temporal modules, yet still overlook the critical spatiotemporal pose interactions between human and garment. Effective pose interactions in videos should not only consider spatial alignment between human and garment poses in each frame but also account for the temporal dynamics of human poses throughout the entire video. With such motivation, we propose a new framework, namely Dynamic Pose Interaction Diffusion Models (DPIDM), to leverage diffusion models to delve into dynamic pose interactions for video virtual try-on. Technically, DPIDM introduces a skeleton-based pose adapter to integrate synchronized human and garment poses into the denoising network. A hierarchical attention module is then exquisitely designed to model intra-frame human-garment pose interactions and long-term human pose dynamics across frames through pose-aware spatial and temporal attention mechanisms. Moreover, DPIDM capitalizes on a temporal regularized attention …
Poster
Shuhang Chen · Xianliang Huang · Zhizhou Zhong · Jihong Guan · Shuigeng Zhou
[ ExHall D ]
Abstract
3D anthropometric measurements have a variety of applications in industrial design and architecture (e.g., vehicle seating and cockpits), clothing (e.g., military uniforms), ergonomics (e.g., seating), and medicine (e.g., nutrition and diabetes). Therefore, there is a need for systems that can accurately extract human body measurements. Current methods estimate human body measurements from 3D scans, resulting in a heavy data collection burden. Moreover, minor variations in camera angle, distance, and body posture may significantly affect measurement accuracy. In response to these challenges, this paper introduces a focused human body model for accurately extracting anthropometric measurements. Concretely, we design a Bypass Network based on CNN and ResNet architectures, which augments the frozen backbone SMPLer-X with additional feature extraction capabilities. In addition, to boost the efficiency of training a large-scale model, we integrate a dynamic loss function that automatically recalibrates the weights to make the network focus on targeted anthropometric parts. Furthermore, we construct a multimodal body measurement benchmark dataset consisting of depth maps, point clouds, meshes, and corresponding body measurements to support model evaluation and future anthropometric measurement research. Extensive experiments on both open-source and the proposed human body datasets demonstrate the superiority of our approach over existing …
Poster
Jian Wang · Rishabh Dabral · Diogo Luvizon · Zhe Cao · Lingjie Liu · Thabo Beeler · Christian Theobalt
[ ExHall D ]
Abstract
This work focuses on tracking and understanding human motion using consumer wearable devices, such as VR/AR headsets, smart glasses, cellphones, and smartwatches. These devices provide diverse, multi-modal sensor inputs, including egocentric images, and 1-3 sparse IMU sensors in varied combinations. Motion descriptions can also accompany these signals. The diverse input modalities and their intermittent availability pose challenges for consistent motion capture and understanding. In this work, we present Ego4o (o for omni), a new framework for simultaneous human motion capture and understanding from multi-modal egocentric inputs. This method maintains performance with partial inputs while achieving better results when multiple modalities are combined. First, the IMU sensor inputs, the optional egocentric image, and text description of human motion are encoded into the latent space of a motion VQ-VAE. Next, the latent vectors are sent to the VQ-VAE decoder and optimized to track human motion. When motion descriptions are unavailable, the latent vectors can be input into a multi-modal LLM to generate human motion descriptions, which can further enhance motion capture accuracy. Quantitative and qualitative evaluations demonstrate the effectiveness of our method in predicting accurate human motion and high-quality motion descriptions.
Poster
Reyhaneh Hosseininejad · Megh Shukla · Saeed Saadatnejad · Mathieu Salzmann · Alex Alahi
[ ExHall D ]
Abstract
Human pose forecasting is inherently multimodal since multiple future motions exist for an observed pose sequence. However, learning this multimodality is challenging since the task is ill-posed. To address this issue, we propose an alternative paradigm to make the task well-posed. Additionally, while state-of-the-art methods predict multimodality, this is attained through a large volume of predictions obtained by oversampling. However, such an approach glosses over key questions: (1) Can we capture multimodality by efficiently sampling a smaller number of predictions? (2) Subsequently, which of the predicted futures is more likely for an observed pose sequence? We address these questions with MotionMap, a simple yet effective heatmap based representation for multimodality. We extend heatmaps to represent a spatial distribution over the space of all possible motions, where different local maxima correspond to different forecasts for a given observation. Not only can MotionMap capture a variable number of modes per observation, but it also provides confidence measures for different modes. Further, MotionMap captures rare modes that are non-trivial to evaluate yet critical for robustness. Finally, MotionMap allows us to introduce the notion of uncertainty and controllability over the forecasted pose sequence. We support our claims through multiple qualitative and quantitative experiments using …
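As a rough illustration of a heatmap-style multimodal representation, the sketch below renders a few future modes as Gaussian peaks on a 2D grid and reads them back as local maxima; grid size, bandwidth, and threshold are illustrative assumptions, not MotionMap's actual parameterization (which is defined over the space of motions rather than 2D positions).

```python
import numpy as np

def render_motionmap(modes, size=64, sigma=2.0):
    """modes: (K, 2) coordinates in [0, 1)^2 -> (size, size) heatmap with one
    Gaussian peak per mode; peak height can encode confidence."""
    ys, xs = np.mgrid[0:size, 0:size]
    heat = np.zeros((size, size))
    for (mx, my) in modes * size:
        heat = np.maximum(heat, np.exp(-((xs - mx) ** 2 + (ys - my) ** 2) / (2 * sigma ** 2)))
    return heat

def extract_modes(heat, thresh=0.5):
    """Return local maxima above a confidence threshold as (x, y, score)."""
    peaks = []
    for y in range(1, heat.shape[0] - 1):
        for x in range(1, heat.shape[1] - 1):
            patch = heat[y - 1:y + 2, x - 1:x + 2]
            if heat[y, x] >= thresh and heat[y, x] == patch.max():
                peaks.append((x, y, float(heat[y, x])))
    return peaks

# toy usage: two modes in, two local maxima out
heat = render_motionmap(np.array([[0.2, 0.3], [0.7, 0.8]]))
print(extract_modes(heat))
```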
Poster
Bin Ji · Ye Pan · zhimeng Liu · Shuai Tan · Xiaogang Jin · Xiaokang Yang
[ ExHall D ]
Abstract
Numerous studies on real-time motion generation primarily focus on kinematic aspects, often resulting in physically implausible outcomes. In this paper, we present POMP ("Physics-cOnsistent Human Motion Prior through Phase Manifolds"), a novel kinematics-based framework that synthesizes physically consistent motions by leveraging phase manifolds to align motion priors with physics constraints. POMP operates as a frame-by-frame autoregressive model with three core components: a diffusion-based kinematic module, a simulation-based dynamic module, and a phase encoding module. At each timestep, the kinematic module generates an initial target pose, which is subsequently refined by the dynamic module to simulate human-environment interactions. Although the physical simulation ensures adherence to physical laws, it may compromise the kinematic rationality of the posture. Consequently, directly using the simulated result for subsequent frame prediction may lead to cumulative errors. To address this, the phase encoding module performs semantic alignment in the phase manifold. Moreover, we present a pipeline in Unity for generating terrain maps and capturing full-body motion impulses from existing motion capture (MoCap) data. The collected terrain topology and motion impulse data facilitate the training of POMP, enabling it to robustly respond to underlying contact forces and applied dynamics. Extensive evaluations demonstrate the efficacy of POMP across various contexts, …
Poster
Zhanbo Huang · Xiaoming Liu · Yu Kong
[ ExHall D ]
Abstract
In this paper, we propose H-MoRe, a novel pipeline for learning precise human-centric motion representation. Our approach dynamically preserves relevant human motion while filtering out background movement. Notably, unlike previous methods relying on fully supervised learning from synthetic data, H-MoRe learns directly from real-world scenarios in a self-supervised manner, incorporating both human pose and body shape information. Inspired by kinematics, H-MoRe represents absolute and relative movements of each body point in a matrix format that captures nuanced motion details, termed world-local flows. H-MoRe offers refined insights into human motion, which can be integrated seamlessly into various action-related applications. Experimental results demonstrate that H-MoRe brings substantial improvements across various downstream tasks, including gait recognition (CL@R1: +16.01%), action recognition (Acc@1: +8.92%), and video generation (FVD: -67.07%). Additionally, H-MoRe exhibits high inference efficiency (34 fps), making it suitable for most real-time scenarios. Models and code will be released upon publication.
Poster
Mengqing Xue · Yifei Liu · Ling Guo · Shaoli Huang · Changxing Ding
[ ExHall D ]
Abstract
Human-object interaction (HOI) synthesis is crucial for creating immersive and realistic experiences for applications such as virtual reality. Existing methods often rely on simplified object representations, such as the object's centroid or the nearest point to a human, to achieve physically plausible motions. However, these approaches may overlook geometric complexity, resulting in suboptimal interaction fidelity. To address this limitation, we introduce ROG, a novel diffusion-based framework that models the spatiotemporal relationships inherent in HOIs with rich geometric detail. For efficient object representation, we select boundary-focused and fine-detail key points from the object mesh, ensuring a comprehensive depiction of the object's geometry. This representation is used to construct an interactive distance field (IDF), capturing the robust HOI dynamics. Furthermore, we develop a diffusion-based relation model that integrates spatial and temporal attention mechanisms, enabling a better understanding of intricate HOI relationships. This relation model refines the generated motion's IDF, guiding the motion generation process to produce relation-aware and semantically aligned movements. Experimental evaluations demonstrate that ROG significantly outperforms state-of-the-art methods in the realism and semantic accuracy of synthesized HOIs. This paper’s code will be released.
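To give a concrete flavor of an interaction distance field, here is a minimal sketch that computes per-frame pairwise distances between human joints and object key points; the tensor shapes are assumptions, and the actual IDF construction in ROG may differ.

```python
import torch

def interaction_distance_field(joints, keypoints):
    """joints:    (T, J, 3) human joint positions over time.
    keypoints: (T, K, 3) object key points over time (e.g. boundary-focused
    and fine-detail samples from the object mesh).
    Returns (T, J, K) pairwise Euclidean distances per frame."""
    return torch.cdist(joints, keypoints)   # batched pairwise L2 distances

# toy usage with random poses and key points
T, J, K = 60, 22, 128
idf = interaction_distance_field(torch.randn(T, J, 3), torch.randn(T, K, 3))
print(idf.shape)   # torch.Size([60, 22, 128])
```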
Poster
Hua Yu · Weiming Liu · Gui Xu · Yaqing Hou · Yew-Soon Ong · Qiang Zhang
[ ExHall D ]
Abstract
Human motion synthesis aims to generate plausible human motion sequences, which has attracted widespread attention in computer animation. Recent score-based generative models (SGMs) have demonstrated impressive results on this task. However, their training process involves complex curvature trajectories, leading to an unstable training process. In this paper, we propose a Deterministic-to-Stochastic Diverse Latent Feature Mapping (DSDFM) method for human motion synthesis. DSDFM consists of two stages. The first human motion reconstruction stage aims to learn the latent space distribution of human motions. The second diverse motion generation stage aims to build connections between the Gaussian distribution and the latent space distribution of human motions, thereby enhancing the diversity and accuracy of the generated human motions. This stage is achieved by the designed deterministic feature mapping procedure with DerODE and the stochastic diverse output generation procedure with DivSDE. DSDFM is easy to train compared to previous SGM-based methods and can enhance diversity without introducing additional training parameters. Through qualitative and quantitative experiments, DSDFM achieves state-of-the-art results, surpassing the latest methods and validating its superiority in human motion synthesis.
Poster
Nan Jiang · Hongjie Li · Ziye Yuan · Zimo He · Yixin Chen · Tengyu Liu · Yixin Zhu · Siyuan Huang
[ ExHall D ]
Abstract
Most text-guided motion editing methods cannot generate versatile motions as they rely on limited training triplets of original motion, edited motion, and editing instruction, which fail to cover the vast combinations of possible edits. To address this challenge, we introduce MotionCutMix, a training technique that dynamically composes a huge amount of training triplets by blending body part motions based on editing instructions. However, this technique introduces increased randomness and potential body part incoordination in the generated motions. To model such rich distribution, we propose MotionReFit, an auto-regressive diffusion model with a motion coordinator. The auto-regressive strategy reduces the window size to facilitate convergence, while the motion coordinator mitigates the artifacts of motion composition. Our model handles both spatial and temporal edits without leveraging extra motion information or LLMs. We further contribute newly captured and re-annotated datasets for multiple motion editing tasks. Experimental results demonstrate that MotionReFit excels in text-guided motion edits, closely adhering to textual directives. Furthermore, ablation studies reveal that the incorporation of MotionCutMix during training enhances the generalizability of the trained model, and does not significantly hinder training convergence.
Poster
Haonan Han · Xiangzuo Wu · Huan Liao · Zunnan Xu · Zhongyuan Hu · Ronghui Li · Yachao Zhang · Xiu Li
[ ExHall D ]
Abstract
Recently, text-to-motion models open new possibilities for creating realistic human motion with greater efficiency and flexibility. However, aligning motion generation with event-level textual descriptions presents unique challenges due to the complex, nuanced relationship between textual prompts and desired motion outcomes. To address this issue, we introduce AToM, a framework that enhances the alignment between generated motion and text prompts by leveraging reward from GPT-4Vision. AToM comprises three main stages: Firstly, we construct a dataset MotionPrefer that pairs three types of event-level textual prompts with generated motions, which cover the integrity, temporal relationship and the frequency of motion. Secondly, we design a paradigm that utilizes GPT-4Vision for detailed motion annotation, including visual data formatting, task-specific instructions and scoring rules for each sub-task. Finally, we fine-tune an existing text-to-motion model using reinforcement learning guided by this paradigm. Experimental results demonstrate that AToM significantly improves the event-level alignment quality of text-to-motion generation.
Poster
Boeun Kim · Hea In Jeong · JungHoon Sung · Yihua Cheng · Jeongmin Lee · Ju Yong Chang · Sang-Il Choi · YOUNGGEUN CHOI · Saim Shin · Jungho Kim · Hyung Jin Chang
[ ExHall D ]
Abstract
This paper introduces Motion Personalization, a new task that generates personalized motions aligned with text descriptions using several basic motions containing Persona. To support this novel task, we introduce a new large-scale motion dataset called PerMo (PersonaMotion), which captures the unique personas of multiple actors. We also propose a multi-modal finetuning method of a pretrained motion diffusion model called PersonaBooth. PersonaBooth addresses two main challenges: i) a significant distribution gap between the persona-focused PerMo dataset and the pretraining datasets, which lack persona-specific data, and ii) the difficulty of capturing a consistent persona from motions that vary in content (action type). To tackle the dataset distribution gap, we introduce a persona token to accept new persona features and perform multi-modal adaptation for both text and visuals during finetuning. To capture a consistent persona, we incorporate a contrastive learning technique to enhance intra-cohesion among samples with the same persona. Furthermore, we introduce a context-aware fusion mechanism to maximize the integration of persona cues from multiple input motions. PersonaBooth outperforms state-of-the-art motion style transfer methods, establishing a new benchmark for motion personalization.
Poster
Hsin-Ping Huang · Yang Zhou · Jui-Hsien Wang · Difan Liu · Feng Liu · Ming-Hsuan Yang · Zhan Xu
[ ExHall D ]
Abstract
Generating realistic human videos remains a challenging task, with the most effective methods currently relying on a human motion sequence as a control signal. Existing approaches often use existing motion extracted from other videos, which restricts applications to specific motion types and global scene matching. We propose Move-in-2D, a novel approach to generate human motion sequences conditioned on a scene image, allowing for diverse motion that adapts to different scenes. Our approach utilizes a diffusion model that accepts both a scene image and text prompt as inputs, producing a motion sequence tailored to the scene. To train this model, we collect a large-scale video dataset featuring single-human activities, annotating each video with the corresponding human motion as the target output. Experiments demonstrate that our method effectively predicts human motion that aligns with the scene image after projection. Furthermore, we show that the generated motion sequence improves human motion quality in video synthesis tasks.
Poster
longbin ji · Lei Zhong · Pengfei Wei · Changjian Li
[ ExHall D ]
Abstract
Recent advancements in trajectory-guided video generation have achieved notable progress. However, existing models still face challenges in generating object motions with potentially changing 6D poses under large-angle rotations, due to limited 3D understanding. To address this problem, we introduce PoseTraj, an open-domain, Pose-Aware video dragging model for reliable 3D-aligned animations from 2D trajectories. Our method incorporates a novel Two-Stage Pose-Aware Pretraining framework, improving 3D comprehension across diverse trajectories. Specifically, we 1) construct a large-scale synthetic dataset containing 10k videos of objects following rotational trajectories and 2) enhance the model's perception of object pose changes by generating 3D bounding boxes as intermediate supervision signals. Following this, we fine-tune the trajectory-controlling module on open-domain videos, applying an additional camera-disentanglement module to further refine motion accuracy. Experiments on various benchmark scenarios demonstrate that PoseTraj not only excels in 3D Pose-Aligned dragging for rotational scenarios but also outperforms existing baselines in trajectory accuracy and video quality.
Poster
Junhyeong Cho · Kim Youwang · Hunmin Yang · Tae-Hyun Oh
[ ExHall D ]
Abstract
Recent monocular 3D shape reconstruction methods have shown promising zero-shot results on object-segmented images without any occlusions. However, their effectiveness is significantly compromised in real-world settings, due to imperfect object segmentation by off-the-shelf models and the prevalence of occlusions. To address these issues, we propose a unified regression model that integrates segmentation and reconstruction, specifically designed for occlusion-aware 3D shape reconstruction. To facilitate its reconstruction in the wild, we also introduce a scalable data synthesis pipeline that simulates a wide range of variations in objects, occluders, and backgrounds. Training on our synthesized data enables the proposed model to achieve state-of-the-art zero-shot results on real-world images, using significantly fewer model parameters than competing approaches. Our code and data will be made publicly available.
Poster
Yiqing Liang · Abhishek Badki · Hang Su · James Tompkin · Orazio Gallo
[ ExHall D ]
Abstract
Foundation models have shown generalization across datasets for many low-level vision tasks, like depth estimation, but no such model exists for scene flow. Even though scene flow has wide potential use, it is not used in practice because current predictive models do not generalize well. We solve three challenges to fix this problem. First, we create a method that jointly estimates geometry and motion for accurate prediction. Second, we alleviate scene flow data scarcity with a data recipe that affords us 1M annotated training samples across diverse synthetic scenes. Third, we evaluate different parameterizations for scene flow prediction and identify a natural and effective parameterization. Our resulting model outperforms existing methods as well as baselines built on foundation models in terms of 3D end-point error, and shows zero-shot generalization to the casually captured videos from DAVIS and the robotic manipulation scenes from RoboTAP. Overall, this makes scene flow prediction significantly more practical for in-the-wild use.
Poster
Yung-Hao Yang · Zitang Sun · Taiki Fukiage · Shin'ya Nishida
[ ExHall D ]
Abstract
As AI models are increasingly integrated into applications involving human interaction, understanding the alignment between human perception and machine vision has become essential. One example is the estimation of visual motion (optical flow) in dynamic applications such as driving assistance. While there are numerous optical flow datasets and benchmarks with ground truth information, human-perceived flow in natural scenes remains underexplored. We introduce HuPerFlow—a benchmark for human-perceived flow, measured at 2,400 locations across ten optical flow datasets, with ~38,400 response vectors collected through online psychophysical experiments. Our data demonstrate that human-perceived flow aligns with ground truth in spatiotemporally smooth locations while also showing systematic errors influenced by various environmental properties. Additionally, we evaluated several optical flow algorithms against human-perceived flow, uncovering both similarities and unique aspects of human perception in complex natural scenes. HuPerFlow is the first large-scale human-perceived flow benchmark for alignment between computer vision models and human perception, as well as for scientific exploration of human motion perception in natural scenes. The HuPerFlow benchmark will be available online upon acceptance.
Poster
Zihang Lai · Andrea Vedaldi
[ ExHall D ]
Abstract
Temporal consistency is critical in video prediction. Traditional methods, such as temporal attention mechanisms and 3D convolutions, often struggle with significant object movements and fail to capture long-range temporal dependencies in dynamic scenes. To address these limitations, we propose the Tracktention Layer, a novel architectural component that explicitly integrates motion information using point tracks — sequences of corresponding points across frames. By incorporating these motion cues, the Tracktention Layer enhances temporal alignment and effectively handles complex object motions, maintaining consistent feature representations over time. Our approach is computationally efficient and can be seamlessly integrated into existing models, such as Vision Transformers, with minimal modification. Empirical evaluations on standard video estimation benchmarks demonstrate that models augmented with the Tracktention Layer exhibit significantly improved temporal consistency compared to baseline models.
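As a rough sketch of attention along point tracks, the snippet below samples per-frame features at track locations and applies temporal self-attention independently per track. The sampling scheme, shapes, and attention configuration are assumptions for illustration, not the Tracktention Layer's actual design.

```python
import torch
import torch.nn.functional as F

def attend_along_tracks(feats, tracks, attn):
    """feats:  (T, C, H, W) per-frame feature maps.
    tracks: (T, N, 2) point coordinates in [-1, 1] (grid_sample convention).
    attn:   nn.MultiheadAttention with embed_dim=C, batch_first=True.
    Returns (T, N, C): track features after temporal self-attention per track."""
    T, C, H, W = feats.shape
    grid = tracks.view(T, 1, -1, 2)                           # (T, 1, N, 2)
    sampled = F.grid_sample(feats, grid, align_corners=True)  # (T, C, 1, N)
    tok = sampled.squeeze(2).permute(2, 0, 1)                 # (N, T, C): one sequence per track
    out, _ = attn(tok, tok, tok)                              # attention across time
    return out.permute(1, 0, 2)                               # back to (T, N, C)

# toy usage with random features and tracks
T, C, N = 8, 32, 16
attn = torch.nn.MultiheadAttention(embed_dim=C, num_heads=4, batch_first=True)
out = attend_along_tracks(torch.randn(T, C, 24, 24), torch.rand(T, N, 2) * 2 - 1, attn)
print(out.shape)   # torch.Size([8, 16, 32])
```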
Poster
Edward LOO · Tianyu HUANG · Peng Li · Zhiyang Dou · Cheng Lin · Zhiming Cui · Zhen Dong · Sai-Kit Yeung · Wenping Wang · Yuan Liu
[ ExHall D ]
Abstract
Recent developments in monocular depth estimation methods enable high-quality depth estimation of single-view images but fail to estimate consistent video depth across different frames. Recent works address this problem by applying a video diffusion model to generate video depth conditioned on the input video, which is training-expensive and can only produce scale-invariant depth values without camera poses. In this paper, we propose a novel video-depth estimation method called Align3R to estimate temporally consistent depth maps for a dynamic video. Our key idea is to utilize the recent DUSt3R model to align estimated monocular depth maps of different timesteps. First, we fine-tune the DUSt3R model with additional estimated monocular depth as inputs for the dynamic scenes. Then, we apply optimization to reconstruct both depth maps and camera poses. Extensive experiments demonstrate that Align3R estimates consistent video depth and camera poses for a monocular video, outperforming baseline methods.
Poster
Sili Chen · Hengkai Guo · Shengnan Zhu · Feihu Zhang · Zilong Huang · Jiashi Feng · Bingyi Kang
[ ExHall D ]
Abstract
Depth Anything has achieved remarkable success in monocular depth estimation with strong generalization ability. However, it suffers from temporal inconsistency in videos, hindering its practical applications. Various methods have been proposed to alleviate this issue by leveraging video generation models or introducing priors from optical flow and camera poses. Nonetheless, these methods are only applicable to short videos (10 seconds) and require a trade-off between quality and computational efficiency. We propose Video Depth Anything for high-quality, consistent depth estimation in super-long videos (over several minutes) without sacrificing efficiency. We base our model on Depth Anything V2 and replace its head with an efficient spatial-temporal head. We design a straightforward yet effective temporal consistency loss by constraining the temporal depth gradient, eliminating the need for additional geometric priors. The model is trained on a joint dataset of video depth and unlabeled images, similar to Depth Anything V2. Moreover, a novel key-frame-based strategy is developed for long video inference. Experiments show that our model can be applied to arbitrarily long videos without compromising quality, consistency, or generalization ability. Comprehensive evaluations on multiple video benchmarks demonstrate that our approach sets a new state-of-the-art in zero-shot video depth estimation. We offer models of different …
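As an illustration of the kind of temporal consistency loss described above (constraining the temporal depth gradient), here is a minimal PyTorch sketch; the tensor layout and the L1 penalty are assumptions, not the released implementation.

# Hedged sketch: penalize mismatch between predicted and reference temporal depth gradients,
# which discourages flicker without requiring optical flow or camera poses.
import torch

def temporal_gradient_loss(pred_depth, gt_depth):
    """pred_depth, gt_depth: (B, T, H, W) depth sequences."""
    pred_grad = pred_depth[:, 1:] - pred_depth[:, :-1]   # frame-to-frame difference
    gt_grad = gt_depth[:, 1:] - gt_depth[:, :-1]
    return torch.mean(torch.abs(pred_grad - gt_grad))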
Poster
Jiahao Shao · Yuanbo Yang · Hongyu Zhou · Youmin Zhang · Yujun Shen · Vitor Guizilini · Yue Wang · Matteo Poggi · Yiyi Liao
[ ExHall D ]
Abstract
This work addresses the challenge of streamed video depth estimation, which expects not only per-frame accuracy but, more importantly, cross-frame consistency. We argue that the lack of contextual information shared between frames or clips is pivotal in fostering inconsistency. Instead of directly developing a depth estimator from scratch, we reformulate this predictive task into a conditional generation problem to provide contextual information within a clip and across clips. Specifically, we propose a consistent context-aware training and inference strategy for arbitrarily long videos to provide cross-clip context. We sample independent noise levels for each frame within a clip during training, while using a sliding-window strategy that initializes overlapping frames with previously predicted frames without adding noise. Moreover, we design an effective training strategy to provide context within a clip. Extensive experimental results validate our design choices and demonstrate the superiority of our approach, dubbed ChronoDepth.
Poster
Huiwon Jang · Sihyun Yu · Jinwoo Shin · Pieter Abbeel · Younggyo Seo
[ ExHall D ]
Abstract
Efficient tokenization of videos remains a challenge in training vision models that can process long videos. One promising direction is to develop a tokenizer that can encode long video clips, as it would enable the tokenizer to leverage the temporal coherence of videos better for tokenization. However, training existing tokenizers on long videos often incurs a huge training cost as they are trained to reconstruct all the frames at once. In this paper, we introduce CoordTok, a video tokenizer that learns a mapping from coordinate-based representations to the corresponding patches of input videos, inspired by recent advances in 3D generative models. In particular, CoordTok encodes a video into factorized triplane representations and reconstructs patches that correspond to randomly sampled (x,y,t) coordinates. This allows for training large tokenizer models directly on long videos without requiring excessive training resources. Our experiments show that CoordTok can drastically reduce the number of tokens for encoding long video clips. For instance, CoordTok can encode a 128-frame video with 128×128 resolution into 1280 tokens, while baselines need 6144 or 8192 tokens to achieve similar reconstruction quality. We further show that this efficient video tokenization enables memory-efficient training of a diffusion transformer that can generate 128 frames …
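A minimal sketch of coordinate-conditioned decoding from factorized triplane features, in the spirit of the description above; the plane layouts, fusion by summation, and all names are illustrative assumptions.

# Hedged sketch: sample three factorized feature planes at (x, y, t) coordinates and fuse them;
# the fused per-coordinate feature would then be decoded to a patch by an MLP.
import torch
import torch.nn.functional as F

def sample_triplane(planes, coords):
    """planes: dict with feature planes 'xy', 'xt', 'yt', each (B, C, *, *).
    coords: (B, N, 3) normalized (x, y, t) in [-1, 1]."""
    x, y, t = coords[..., 0], coords[..., 1], coords[..., 2]
    def sample(plane, u, v):
        grid = torch.stack([u, v], dim=-1).unsqueeze(2)        # (B, N, 1, 2)
        feat = F.grid_sample(plane, grid, align_corners=True)  # (B, C, N, 1)
        return feat.squeeze(-1).transpose(1, 2)                # (B, N, C)
    return sample(planes['xy'], x, y) + sample(planes['xt'], x, t) + sample(planes['yt'], y, t)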
Poster
Shuwei Shi · Biao Gong · Xi Chen · DanDan Zheng · Shuai Tan · Zizheng Yang · Yuyuan Li · Jingwen He · Kecheng Zheng · Jingdong Chen · Ming Yang · Yinqiang Zheng
[ ExHall D ]
Abstract
Image-to-video (I2V) generation is conditioned on a static image, and has recently been enhanced by motion intensity as an additional control signal. These motion-aware models are appealing for generating diverse motion patterns, yet there is no reliable motion estimator for training such models on large-scale video sets in the wild. Traditional metrics, e.g., SSIM or optical flow, are hard to generalize to arbitrary videos, and it is also very difficult for human annotators to label abstract motion intensity. Furthermore, the motion intensity should reveal both local object motion and global camera movement, which has not been studied before. This paper addresses the challenge with a new motion estimator, capable of measuring the decoupled motion intensities of objects and cameras in video. We leverage contrastive learning on randomly paired videos to distinguish the video with greater motion intensity. Such a paradigm is annotation-friendly and easy to scale up to achieve stable performance on motion estimation. We then present a new I2V model, named MotionStone, developed with the decoupled motion estimator. Experimental results demonstrate the stability of the proposed motion estimator and the state-of-the-art performance of MotionStone on I2V generation. These advantages warrant the decoupled motion …
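The contrastive paradigm described above can be illustrated with a simple pairwise ranking objective; the estimator outputs, the margin, and the use of a margin ranking loss are assumptions rather than the paper's exact formulation.

# Hedged sketch: given two randomly paired videos, train an estimator to score the one with
# greater motion intensity higher.
import torch
import torch.nn.functional as F

def motion_ranking_loss(score_a, score_b, label, margin=0.2):
    """score_a, score_b: (B,) predicted motion intensities for video pair (A, B).
    label: (B,) float tensor, +1 if A has greater motion than B, -1 otherwise."""
    return F.margin_ranking_loss(score_a, score_b, label, margin=margin)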
Poster
Sherwin Bahmani · Ivan Skorokhodov · Guocheng Qian · Aliaksandr Siarohin · Willi Menapace · Andrea Tagliasacchi · David B. Lindell · Sergey Tulyakov
[ ExHall D ]
Abstract
Numerous works have recently integrated 3D camera control into foundational text-to-video models, but the resulting camera control is often imprecise, and video generation quality suffers. In this work, we analyze camera motion from a first principles perspective, uncovering insights that enable precise 3D camera manipulation without compromising synthesis quality. First, we determine that motion induced by camera movements in videos is low-frequency in nature. This motivates us to adjust train and test pose conditioning schedules, accelerating training convergence while improving visual and motion quality. Then, by probing the representations of an unconditional video diffusion transformer, we observe that they implicitly perform camera pose estimation under the hood, and only a sub-portion of their layers contain the camera information. This prompted us to limit the injection of camera conditioning to a subset of the architecture to prevent interference with other video features, leading to a ≈4× reduction in training parameters, improved training speed, and ≈10% higher visual quality. Finally, we complement the typical dataset for camera control learning with a curated dataset of 20k in-the-wild dynamic videos with stationary cameras. This helps the model disambiguate the difference between camera and scene motion, and improves the dynamics of generated pose-conditioned videos. We compound …
Poster
Kaihua Chen · Deva Ramanan · Tarasha Khurana
[ ExHall D ]
Abstract
Object permanence in humans is a fundamental cue that helps in understanding the persistence of objects, even when they are fully occluded in the scene. Present-day methods in object segmentation do not account for this amodal nature of the world, and only work for segmentation of visible or modal objects. Few amodal methods exist; single-image segmentation methods cannot handle high levels of occlusion, which are better inferred using temporal information, and multi-frame methods have focused solely on segmenting rigid objects. To this end, we propose to tackle video amodal segmentation by formulating it as a conditional generation task, thereby capitalizing on the foundational knowledge in video generative models. Our method is simple; we repurpose these models to condition on a sequence of modal mask frames of an object along with contextual depth maps, to learn which object boundary may be occluded and therefore extended, in order to hallucinate the complete extent of an object. This is followed by a content completion stage which is able to inpaint the occluded regions of an object. We benchmark our approach alongside a wide array of state-of-the-art methods on four datasets and show a dramatic improvement of up to 13% for amodal segmentation in an object's occluded region.
Poster
Juan Luis Gonzalez Bello · Xu Yao · Alex Whelan · Kyle Olszewski · Hyeongwoo Kim · Pablo Garrido
[ ExHall D ]
Abstract
We present an implicit video representation for occlusions, appearance, and motion disentanglement from monocular videos, which we refer to as Video Spatiotemporal Splines (VideoSPatS). Unlike previous methods that map time and coordinates to deformation and canonical colors, our VideoSPatS maps input coordinates into Spatial and Color Spline deformation fields D_s and D_c, which disentangle motion and appearance in videos. With spline-based parametrization, our method naturally generates temporally consistent flow and guarantees long-term temporal consistency, which is crucial for convincing video editing. Aided by additional prediction blocks, our VideoSPatS also performs layer separation between the latent video and the selected occluder. By disentangling occlusions, appearance, and motion, our method allows for better spatiotemporal modeling and editing of diverse videos, including in-the-wild talking head videos with challenging occlusions, shadows, and specularities, while maintaining a reasonable canonical space for editing. We also present general video modeling results on the DAVIS and CoDeF datasets, as well as our own talking head video dataset collected from open-source web videos. Extensive ablations show the combination of D_s and D_c under neural splines can overcome motion and appearance ambiguities, paving the way to more advanced video editing models.
Poster
Alexander Pondaven · Aliaksandr Siarohin · Sergey Tulyakov · Philip H.S. Torr · Fabio Pizzati
[ ExHall D ]
Abstract
We propose DiTFlow, a method for transferring the motion of a reference video to a newly synthesized one, designed specifically for Diffusion Transformers (DiT). We first process the reference video with a pre-trained DiT to analyze cross-frame attention maps and extract a patch-wise motion signal called the Attention Motion Flow (AMF). We guide the latent denoising process in an optimization-based, training-free manner by optimizing latents with our AMF loss to generate videos reproducing the motion of the reference one. We also apply our optimization strategy to transformer positional embeddings, granting us a boost in zero-shot motion transfer capabilities. We evaluate DiTFlow against recently published methods, outperforming all across multiple metrics and human evaluation. Our code will be open source.
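A minimal sketch of optimization-based, training-free guidance of the kind described above: latents are updated by gradient steps on a motion-matching loss before denoising continues. The AMF computation is abstracted behind amf_loss, and the optimizer, learning rate, and iteration count are assumptions.

# Hedged sketch: refine latents so that their attention motion flow matches the reference.
import torch

def guide_latents(latents, amf_loss, reference_amf, lr=0.05, n_iters=5):
    latents = latents.detach().requires_grad_(True)
    optimizer = torch.optim.Adam([latents], lr=lr)
    for _ in range(n_iters):
        optimizer.zero_grad()
        loss = amf_loss(latents, reference_amf)  # compare motion signals extracted via the DiT
        loss.backward()
        optimizer.step()
    return latents.detach()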
Poster
Yuchi Wang · Junliang Guo · Xinyi Xie · Tianyu He · Xu Sun · Jiang Bian
[ ExHall D ]
Abstract
Recent advancements in video autoencoders (Video AEs) have significantly improved the quality and efficiency of video generation. In this paper, we propose a novel and compact video autoencoder, VidTwin, that decouples video into two distinct latent spaces: Structure latent vectors, which capture overall content and global movement, and Dynamics latent vectors, which represent fine-grained details and rapid movements. Specifically, our approach leverages an Encoder-Decoder backbone, augmented with two submodules for extracting these latent spaces, respectively. The first submodule employs a Q-Former to extract low-frequency motion trends, followed by downsampling blocks to remove redundant content details. The second averages the latent vectors along the spatial dimension to capture rapid motion. Extensive experiments show that VidTwin achieves a high compression rate of 0.20% with high reconstruction quality (PSNR of 28.14 on the MCL-JCV dataset), and performs efficiently and effectively in downstream generative tasks. Moreover, our model demonstrates explainability and scalability, paving the way for future research in video latent representation and generation.
Poster
Maria Pilligua · Danna Xue · Javier Vazquez-Corral
[ ExHall D ]
Abstract
Decomposing a video into a layer-based representation is crucial for easy video editing for the creative industries, as it enables independent editing of specific layers. Existing video-layer decomposition models rely on implicit neural representations (INRs) trained independently for each video, making the process time-consuming when applied to new videos. Noticing this limitation, we propose a meta-learning strategy to learn a generic video decomposition model to speed up the training on new videos. Our model is based on a hypernetwork architecture which, given a video-encoder embedding, generates the parameters for a compact INR-based neural video decomposition model. Our strategy mitigates the problem of single-video overfitting and, importantly, shortens the convergence of video decomposition on new, unseen videos.
Poster
Yang Hai · Guo Wang · Tan Su · jerett · Yinlin Hu
[ ExHall D ]
Abstract
We present an efficient diffusion-based method for video frame interpolation. Most recent diffusion-based methods still show a large gap from non-diffusion methods in accuracy and efficiency. The key to our method is that, instead of formulating the problem directly as a denoising procedure in the latent space, which is less effective due to the large latent space, we model optical flow explicitly from coarse to fine with hierarchical diffusion models, which have a much smaller search space at each denoising step and can handle complex motions and large displacements. Extensive evaluation on multiple benchmarks demonstrates that our method achieves state-of-the-art accuracy and is 10+ times faster than other diffusion-based methods.
Poster
Ding Ding · Yueming Pan · Ruoyu Feng · Qi Dai · Kai Qiu · Jianmin Bao · Chong Luo · Zhenzhong Chen
[ ExHall D ]
Abstract
In this paper, we present HomoGen, an enhanced video inpainting method based on homography propagation and diffusion models. HomoGen leverages homography registration to propagate contextual pixels as priors for generating missing content in corrupted videos. Unlike previous flow-based propagation methods, which introduce local distortions due to point-to-point optical flows, homography-induced artifacts are typically global structural distortions that preserve semantic integrity. To effectively utilize these priors for generation, we employ a video diffusion model that inherently prioritizes semantic information within the priors over pixel-level details. A content-adaptive control mechanism is proposed to scale and inject the priors into intermediate video latents during iterative denoising. In contrast to existing transformer-based networks that often suffer from artifacts within priors, leading to error accumulation and unrealistic results, our denoising diffusion network can smooth out artifacts and ensure natural output. Extensive experiments demonstrate the effectiveness of the proposed method qualitatively and quantitatively.
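To make the homography-propagation idea concrete, here is a hedged OpenCV sketch that warps a source frame into a target frame and uses the warped pixels only inside the corrupted mask as a generation prior; keypoint matching, blending, and all names are simplifications, not the paper's pipeline.

# Hedged sketch: fill corrupted pixels with context warped from another frame via a homography.
import cv2
import numpy as np

def propagate_prior(src_frame, dst_frame, dst_mask, src_pts, dst_pts):
    """src_pts, dst_pts: matched keypoints (N, 2), float32, between source and target frames.
    dst_mask: uint8 mask of corrupted pixels in the target frame (nonzero = missing)."""
    H, _ = cv2.findHomography(src_pts, dst_pts, cv2.RANSAC, 3.0)
    h, w = dst_frame.shape[:2]
    warped = cv2.warpPerspective(src_frame, H, (w, h))
    prior = dst_frame.copy()
    prior[dst_mask > 0] = warped[dst_mask > 0]  # propagated context, used as a prior only
    return prior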
Poster
Tianwei Yin · Qiang Zhang · Richard Zhang · William Freeman · Fredo Durand · Eli Shechtman · Xun Huang
[ ExHall D ]
Abstract
Current video diffusion models achieve impressive generation quality but struggle in interactive applications due to bidirectional attention dependencies. The generation of a single frame requires the model to process the entire sequence, including the future. We address these limitations by introducing an autoregressive diffusion transformer that is adapted from a pretrained bidirectional video diffusion model. Our key innovations are twofold: First, we extend distribution matching distillation (DMD) to videos, compressing a 50-step denoising process into just 4 steps. Second, we develop an asymmetric distillation approach where a causal student model learns from a bidirectional teacher with privileged future information. This strategy effectively mitigates error accumulation in autoregressive generation, enabling high-quality long-form video synthesis despite training on short clips. Our model achieves a total score of 82.85 on VBench-Long, outperforming all published approaches and, most importantly, uniquely enabling fast streaming inference on a single GPU at 9.4 FPS. Our method also supports streaming video editing, image-to-video, and dynamic prompting in a zero-shot manner. We will release the code based on an open-source model in the future.
Poster
Shuyun Wang · Hu Zhang · Xin Shen · Dadong Wang · Xin Yu
[ ExHall D ]
Abstract
Bitstream-corrupted video recovery aims to fill in realistic video content lost to bitstream corruption during video storage or transmission. Most existing methods typically assume that predefined masks of the corrupted regions are known in advance. However, manually annotating these input masks is laborious and time-consuming, limiting the applicability of existing methods in real-world scenarios. Therefore, we relax this assumption by defining a new blind video recovery setting where the recovery of corrupted regions does not rely on predefined masks. There are two primary challenges in this scenario: (i) without predefined masks, how accurately can a model identify the regions requiring recovery? (ii) how can extensive and irregular content be recovered, especially when large portions of frames are severely degraded or corrupted at large scale? To address these challenges, we introduce a Metadata-Guided Diffusion Model, dubbed M-GDM. To enable the diffusion model to focus on the corrupted regions, we leverage inherent video metadata as a corruption indicator and design a dual-stream metadata encoder. This encoder first processes the motion vectors and frame types of a video separately, and then merges them into a unified metadata representation. The metadata representation interacts with the corrupted latent feature via cross-attention in each diffusion step. Meanwhile, to preserve the intact regions, we propose …
Poster
Qian Wang · Abdelrahman Eldesokey · Mohit Mendiratta · Fangneng Zhan · Adam Kortylewski · Christian Theobalt · Peter Wonka
[ ExHall D ]
Abstract
We introduce the first training-free approach for Video Semantic Segmentation (VSS) based on pre-trained diffusion models. A growing research direction attempts to employ diffusion models to perform downstream vision tasks by exploiting their deep understanding of image semantics. Yet, the majority of these approaches have focused on image-related tasks like semantic segmentation, with less emphasis on video tasks such as VSS. Ideally, diffusion-based image semantic segmentation approaches can be applied to videos in a frame-by-frame manner. However, we find their performance on videos to be subpar due to the absence of any modeling of temporal information inherent in the video data. To this end, we tackle this problem and introduce a framework tailored for VSS based on pre-trained image and video diffusion models. We propose building a scene context model based on the diffusion features, where the model is autoregressively updated to adapt to scene changes. This context model predicts per-frame coarse segmentation maps that are temporally consistent. To refine these maps further, we propose a correspondence-based refinement strategy that aggregates predictions temporally, resulting in more confident predictions. Finally, we introduce a masked modulation approach to upsample the coarse maps to a high-quality full resolution. Experiments show that our proposed …
Poster
Yue-Hua Han · Tai-Ming Huang · Kailung Hua · Jun-Cheng Chen
[ ExHall D ]
Abstract
Generative models have enabled the creation of highly realistic facial-synthetic images, raising significant concerns due to their potential for misuse. While research in Deepfake detection has advanced rapidly, many methods still struggle to generalize to unseen Deepfakes generated by novel synthesis techniques. To address this challenge, we propose a novel side-network-based decoder that extracts spatial and temporal cues based on the CLIP image encoder for generalized video-based Deepfake detection. Additionally, we introduce the Facial Component Guidance (FCG) to enhance the spatial learning generalizability by encouraging the model to focus on key facial regions. The cross-dataset evaluation demonstrates the superior performance of our approach, surpassing state-of-the-art methods on challenging datasets. Extensive experiments further validate the effectiveness of the proposed method in terms of data efficiency, parameter efficiency and model robustness.
Poster
Zhenxuan Fang · Fangfang Wu · Tao Huang · Le Dong · Weisheng Dong · Xin Li · Guangming Shi
[ ExHall D ]
Abstract
Unlike global motion blur, Local Motion Deblurring (LMD) presents a more complex challenge, as it requires precise restoration of blurry regions while preserving the sharpness of the background. Existing LMD methods rely on manually annotated blur masks and often overlook the blur kernel's characteristics, which are crucial for accurate restoration. To address these limitations, we propose a novel parameterized motion kernel modeling approach that defines the motion blur kernel with three key parameters: length, angle, and curvature. We then use networks to estimate these kernel parameters, significantly improving the accuracy of blur kernel estimation. To effectively learn the motion blur representation, we incorporate a shared memory bank that stores blur prior information. Additionally, we introduce a dual-branch deblurring network: one branch leverages Mamba to capture long-range dependencies, while the other uses a mask-guided CNN focused on refining the local blurry regions. By fully utilizing the estimated blur prior information, our approach greatly enhances deblurring outcomes. Experimental results show that our method achieves state-of-the-art performance both quantitatively and visually, with a substantial reduction in computational complexity.
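A hedged NumPy sketch of how a motion-blur kernel could be rendered from the three parameters named above (length, angle, curvature) by rasterizing a circular-arc trajectory; the exact parameterization used in the paper is not specified here, so this is an illustrative guess.

# Hedged sketch: rasterize a blur trajectory whose heading changes linearly along the path.
import numpy as np

def motion_kernel(length, angle, curvature, size=31, n_samples=200):
    k = np.zeros((size, size), dtype=np.float32)
    c = (size - 1) / 2.0
    t = np.linspace(-0.5, 0.5, n_samples)
    theta = angle + curvature * t * length            # straight line when curvature == 0
    xs = c + np.cumsum(np.cos(theta)) * (length / n_samples)
    ys = c + np.cumsum(np.sin(theta)) * (length / n_samples)
    xs -= xs.mean() - c                               # re-center the trajectory in the kernel
    ys -= ys.mean() - c
    for x, y in zip(xs, ys):
        xi, yi = int(round(x)), int(round(y))
        if 0 <= xi < size and 0 <= yi < size:
            k[yi, xi] += 1.0
    return k / max(k.sum(), 1e-8)                     # normalized blur kernel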
Poster
Nicolas Dufour · Vicky Kalogeiton · David Picard · Loic Landrieu
[ ExHall D ]
Abstract
Global visual geolocation predicts where an image was captured on Earth. Since images vary in how precisely they can be localized, this task inherently involves a significant degree of ambiguity. However, existing approaches are deterministic and overlook this aspect. In this paper, we aim to close the gap between traditional geolocalization and modern generative methods. We propose the first generative geolocation approach based on diffusion and Riemannian flow matching, where the denoising process operates directly on the Earth's surface. Our model achieves state-of-the-art performance on three visual geolocation benchmarks: OpenStreetView-5M, YFCC-100M, and iNat21. In addition, we introduce the task of probabilistic visual geolocation, where the model predicts a probability distribution over all possible locations instead of a single point. We introduce new metrics and baselines for this task, demonstrating the advantages of our diffusion-based approach. Codes and models will be made available.
Poster
Shasha Mao · Shiming Lu · Zhaolong Du · Licheng Jiao · Shuiping Gou · Luntian Mou · Xuequan Lu · Lin Xiong · Yimeng Zhang
[ ExHall D ]
Abstract
Synthetic Aperture Radar (SAR) image registration is an essential upstream task in geoscience applications, in which pre-detected keypoints from two images are employed as observed objects to seek matched-point pairs. In general, registration is treated as a typical closed-set classification, which forces each keypoint to be classified into the given classes, ignoring the essential issue that numerous redundant keypoints fall outside the given classes, which unavoidably results in capturing incorrect matched-point pairs. Based on this, we propose a Cross-Rejective Open-set SAR Image Registration (CroR-OSIR) method. In this work, these redundant keypoints are regarded as out-of-distribution (OOD) samples, and we formulate the registration as a special open-set task with two modules: supervised contrastive feature-tuning and cross-rejective open-set recognition (CroR-OSR). Different from traditional open-set recognition, all samples, including OOD samples, are available in the CroR-OSR module. CroR-OSR conducts closed-set classification in the individual open-set domains of the two images, while employing cross-domain rejection during training to exclude OOD samples based on confidence and consistency. Moreover, a new supervised contrastive tuning strategy is incorporated for feature-tuning. In particular, the cross-domain estimation labels obtained by CroR-OSR are fed back to the feature-tuning module to enhance feature discriminability. Experimental results …
Poster
Zichen Tian · Yaoyao Liu · Qianru Sun
[ ExHall D ]
Abstract
Training large foundation models of remote-sensing (RS) images is almost impossible due to the limited and long-tailed data problems. Fine-tuning natural-image pre-trained models on RS images is a straightforward solution. To reduce computational costs and improve performance on tail classes, existing methods apply parameter-efficient fine-tuning (PEFT) techniques, such as LoRA and AdaptFormer. However, we observe that fixed hyperparameters -- such as intra-layer positions, layer depth, and scaling factors -- can considerably hinder PEFT performance, as fine-tuning on RS images proves highly sensitive to these settings. To address this, we propose MetaPEFT, a method incorporating adaptive scalers that dynamically adjust module influence during fine-tuning. MetaPEFT dynamically adjusts three key factors of PEFT on RS images: module insertion, layer selection, and module-wise learning rates, which collectively control the influence of PEFT modules across the network. We conduct extensive experiments on three transfer-learning scenarios and five datasets. The results show that MetaPEFT achieves state-of-the-art performance in cross-spectral adaptation, requiring only a small amount of trainable parameters and improving tail-class accuracy significantly. Our code is available in the supplementary materials for review.
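A minimal PyTorch sketch of a PEFT module with a learnable, module-wise scaler that modulates its influence during fine-tuning, in the spirit of the adaptive scalers described above; the LoRA-style parameterization, initialization, and class name are assumptions.

# Hedged sketch: a LoRA-style adapter whose contribution is gated by a learnable per-module scale.
import torch
import torch.nn as nn

class ScaledLoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8):
        super().__init__()
        self.base = base                         # frozen pre-trained layer
        self.base.requires_grad_(False)
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)           # adapter starts as a no-op
        self.scale = nn.Parameter(torch.zeros(1))  # learnable module-wise influence

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))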
Poster
Jingtao Li · Yingyi Liu · XINYU WANG · Yunning Peng · Chen Sun · Shaoyu Wang · Zhendong Sun · Tian Ke · Xiao Jiang · Tangwei Lu · Anran Zhao · Yanfei Zhong
[ ExHall D ]
Abstract
Advanced interpretation of hyperspectral remote sensing images benefits many precise Earth observation tasks. Recently, visual foundation models have advanced remote sensing interpretation but have concentrated on RGB and multispectral images. Due to the varied hyperspectral channels, existing foundation models would face an image-by-image tuning situation, imposing great pressure on hardware and time resources. In this paper, we propose a tuning-free hyperspectral foundation model called HyperFree, by adapting existing visual prompt engineering. To process varied channel numbers, we design a learned weight dictionary covering the full spectrum from 0.4∼2.5 μm, supporting dynamic construction of the embedding layer. To make the prompt design more tractable, HyperFree can generate multiple semantic-aware masks for one prompt by treating feature distance as semantic similarity. After pre-training HyperFree on constructed large-scale high-resolution hyperspectral images, HyperFree (1 prompt) has shown comparable results with specialized models (5 shots) on 5 tasks and 11 datasets. Code would be accessible at XXXX.
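A hedged sketch of building a patch-embedding layer on the fly from a learned, wavelength-indexed weight dictionary so that inputs with arbitrary channel counts can be embedded; the dictionary resolution, nearest-bin lookup, and class name are assumptions, not the released design.

# Hedged sketch: embedding weights are assembled per input by indexing a wavelength dictionary.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpectralEmbedding(nn.Module):
    def __init__(self, embed_dim=768, patch=16, n_bins=512, lam_min=0.4, lam_max=2.5):
        super().__init__()
        # One conv-weight slice per discretized wavelength bin in [0.4, 2.5] micrometers.
        self.dictionary = nn.Parameter(torch.randn(n_bins, embed_dim, patch, patch) * 0.02)
        self.lam_min, self.lam_max, self.n_bins, self.patch = lam_min, lam_max, n_bins, patch

    def forward(self, x, wavelengths):
        """x: (B, C, H, W) hyperspectral image; wavelengths: (C,) tensor of band centers in micrometers."""
        idx = (wavelengths - self.lam_min) / (self.lam_max - self.lam_min) * (self.n_bins - 1)
        idx = idx.round().long().clamp(0, self.n_bins - 1)
        weight = self.dictionary[idx].permute(1, 0, 2, 3)      # (embed_dim, C, p, p)
        return F.conv2d(x, weight, stride=self.patch)          # (B, embed_dim, H/p, W/p)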
Poster
Jiangwei Ren · Xingyu Jiang · Zizhuo Li · Dingkang Liang · Xin Zhou · Xiang Bai
[ ExHall D ]
Abstract
Image matching for both cross-view and cross-modality plays a critical role in multi-modal perception. Due to the modality gap caused by different imaging systems/styles, the matching task poses great challenges. Existing works try to extract invariant features for specific modalities and train on limited datasets, showing poor generalization. To this end, we present MINIMA, a unified image matching framework for multiple cross-modal cases. Without pursuing fancy modules, our MINIMA aims to enhance universal performance from the perspective of data scaling-up. For this purpose, we propose a simple yet effective data engine that can freely produce a large dataset containing multiple modalities, rich scenarios, and accurate labeling. Specifically, we scale up the modalities from cheap but rich RGB-only matching data by means of generative modules. With this setting, the matching labels and rich diversity of the RGB dataset are well inherited by the generated multimodal data. Benefiting from this, we construct MD-syn, a new comprehensive dataset that fills the data gap for general multi-modal image matching. With MD-syn, we can directly train any advanced matching pipeline on randomly selected modality pairs to obtain cross-modality ability. Extensive experiments on synthetic and real datasets demonstrate that our MINIMA can achieve large enhancement for cross-modal …
Poster
Sungpyo Kim · Jeonghyeok Do · Jaehyup Lee · Munchurl Kim
[ ExHall D ]
Abstract
Conventional methods for PAN-sharpening often struggle to restore fine details due to limitations in leveraging high-frequency information. Moreover, diffusion-based approaches lack sufficient conditioning to fully utilize Panchromatic (PAN) images and low-resolution multispectral (LRMS) inputs effectively. To address these challenges, we propose an uncertainty-aware knowledge distillation diffusion framework with details enhancement for PAN-sharpening, called U-Know-DiffPAN. U-Know-DiffPAN incorporates uncertainty-aware knowledge distillation for effective transfer of feature details from our teacher model to a student one. The teacher model in our U-Know-DiffPAN captures frequency details through frequency-selective attention, facilitating accurate reverse process learning. By conditioning the encoder on compact vector representations of PAN and LRMS and the decoder on wavelet transforms, we enable rich frequency utilization. In this way, the high-capacity teacher model distills frequency-rich features into a lightweight student model aided by an uncertainty map. Using this uncertainty map, the teacher model can guide the student model to focus on difficult image regions for PAN-sharpening. Extensive experiments on diverse datasets demonstrate the robustness and superior performance of our U-Know-DiffPAN over very recent state-of-the-art PAN-sharpening methods. The source code is available at https://github.com/xxx/yyy.
Poster
Xin Di · Long Peng · Peizhe Xia · Wenbo Li · Renjing Pei · Yang Wang · Yang Cao · Zheng-Jun Zha
[ ExHall D ]
Abstract
Burst super-resolution (BurstSR) aims to reconstruct high-resolution images by fusing subpixel details from multiple low-resolution burst frames. The primary challenge lies in effectively extracting useful information while mitigating the impact of high-frequency noise. Most existing methods rely on frame-by-frame fusion, which often struggles to distinguish informative subpixels from noise, leading to suboptimal performance. To address these limitations, we introduce a novel Query Mamba Burst Super-Resolution (QMambaBSR) network. Specifically, we observe that sub-pixels have consistent spatial distribution while noise appears randomly. Considering the entire burst sequence during fusion allows for more reliable extraction of consistent subpixels and better suppression of noise outliers. Based on this, a Query State Space Model (QSSM) is proposed for both inter-frame querying and intra-frame scanning, enabling a more efficient fusion of useful subpixels. Additionally, to overcome the limitations of static upsampling methods that often result in over-smoothing, we propose an Adaptive Upsampling (AdaUp) module that dynamically adjusts the upsampling kernel to suit the characteristics of different burst scenes, achieving superior detail reconstruction. Extensive experiments on four benchmark datasets—spanning both synthetic and real-world images—demonstrate that QMambaBSR outperforms existing state-of-the-art methods. The code will be publicly available.
Poster
Ruiyi Wang · Yushuo Zheng · Zicheng Zhang · Chunyi Li · Shuaicheng Liu · Guangtao Zhai · Xiaohong Liu
[ ExHall D ]
Abstract
Existing real-world image dehazing methods typically attempt to fine-tune pre-trained models or adapt their inference procedures, placing significant reliance on the quality of pre-training data. Although generative diffusion models have shown potential in restoring heavily distorted information, their application in dehazing remains constrained due to extensive sampling steps and fidelity limitations. To address these challenges, we propose a two-stage hazing-dehazing pipeline, which integrates the Realistic Haze Generation Framework (HazeGen) and the Diffusion-based Dehazing Framework (DiffDehaze). Specifically, HazeGen takes advantage of the rich generative diffusion prior of real-world hazy images embedded in the pre-trained text-to-image diffusion model and leverages IRControlNet to realize conditional generation. To further improve haze authenticity and generation diversity, HazeGen utilizes the hybrid training and the blended sampling approaches to generate high-quality training data for DiffDehaze. In order to leverage generative capacity while retaining efficiency, DiffDehaze employs the Accelerated Fidelity-Preserving Sampling Strategy (AccSamp). With a Patch-based Statistical Alignment Operation (AlignOp), DiffDehaze can quickly generate a faithful dehazing estimate in few sampling steps, which can be used to reduce sampling steps and enables a haze density-aware fidelity guidance. Extensive visual comparisons and quantitative evaluations demonstrate the superior dehazing performance and visual quality of our approach over existing methods. The …
Poster
Zeyu Mi · Yu-Bin Yang
[ ExHall D ]
Abstract
Data augmentation (DA) stands out as a powerful technique to enhance the generalization capabilities of deep neural networks across diverse tasks. However, in low-level vision tasks, DA remains rudimentary (i.e., vanilla DA), facing a critical bottleneck due to information loss. In this paper, we introduce a novel Calibrated Attribution Map (CAM) to generate saliency masks, followed by two saliency-based DA methods—ADD and ADD+—designed to address this issue. CAM leverages integrated gradients and incorporates two key innovations: a global feature detector and calibrated integrated gradients. Based on CAM and the proposed methods, we highlight two key insights for low-level vision tasks: (1) increasing pixel diversity, as seen in vanilla DA, can improve performance, and (2) focusing on salient features while minimizing the impact of irrelevant pixels, as seen in saliency-based DA, more effectively enhances model performance. Additionally, we propose two guiding principles for designing saliency-based DA: coarse-grained partitioning and diverse augmentation strategies. Extensive experiments demonstrate the compatibility and consistent, significant performance improvement of our method across various SR tasks and networks.
Poster
Heemin Yang · Jaesung Rim · Seungyong Lee · Seung-Hwan Baek · Sunghyun Cho
[ ExHall D ]
Abstract
In this paper, we present GyroDeblurNet, a novel single image deblurring method that utilizes a gyro sensor to effectively resolve the ill-posedness of image deblurring. The gyro sensor provides valuable information about camera motion that can significantly improve deblurring quality. However, effectively exploiting real-world gyro data is challenging due to significant errors from various sources. To handle these errors, GyroDeblurNet is equipped with two novel neural network blocks: a gyro refinement block and a gyro deblurring block. The gyro refinement block refines the erroneous gyro data using the blur information from the input image. The gyro deblurring block removes blur from the input image using the refined gyro data and further compensates for gyro error by leveraging the blur information from the input image. For training a neural network with erroneous gyro data, we propose a training strategy based on curriculum learning. We also introduce a novel gyro data embedding scheme to represent real-world intricate camera shakes. Finally, we present both synthetic and real-world datasets for training and evaluating gyro-based single image deblurring. Our experiments demonstrate that our approach achieves state-of-the-art deblurring quality by effectively utilizing erroneous gyro data.
Poster
Yidi Liu · Dong Li · Xueyang Fu · Xin Lu · Jie Huang · Zheng-Jun Zha
[ ExHall D ]
Abstract
We introduce UHD-Processor, a unified and robust framework for all-in-one image restoration, which is particularly resource-efficient for Ultra-High-Definition (UHD) images. To address the limitations of traditional all-in-one methods that rely on complex restoration backbones, our strategy employs a frequency-domain decoupling progressive learning technique, motivated by curriculum learning, to incrementally learn restoration mappings from low to high frequencies. This approach incorporates specialized sub-network modules to effectively tackle different frequency bands in a divide-and-conquer manner, significantly enhancing the learning capability of simpler networks. Moreover, to accommodate the high-resolution characteristics of UHD images, we developed a variational autoencoder (VAE)-based framework that reduces computational complexity by modeling a concise latent space. It integrates task-specific degradation awareness in the encoder and frequency selection in the decoder, enhancing task comprehension and generalization. Our unified model is able to handle various degradations such as denoising, deblurring, dehazing, low-light enhancement, etc. Experimental evaluations extensively showcase the effectiveness of our dual-strategy approach, significantly improving UHD image restoration and achieving cutting-edge performance across diverse conditions.
Poster
Yuheng Xu · Shijie Yang · Xin Liu · Jie Liu · Jie Tang · Gangshan Wu
[ ExHall D ]
Abstract
In recent years, the increasing popularity of Hi-DPI screens has driven a rising demand for high-resolution images. However, the limited computational power of edge devices poses a challenge in deploying complex super-resolution neural networks, highlighting the need for efficient methods. While prior works have made significant progress, they have not fully exploited pixel-level information. Moreover, their reliance on fixed sampling patterns limits both accuracy and the ability to capture fine details in low-resolution images. To address these challenges, we introduce two plug-and-play modules designed to capture and leverage pixel information effectively in Look-Up Table (LUT) based super-resolution networks. Our method introduces Automatic Sampling (AutoSample), a flexible LUT sampling approach where sampling weights are dynamically learned during training to adapt to pixel variations and expand the receptive field without added inference cost. We also incorporate Adaptive Residual Learning (AdaRL) to enhance inter-layer connections, enabling detailed information flow and improving the network's ability to reconstruct fine details. Our method achieves significant performance improvements on both MuLUT and SPF-LUT while maintaining similar storage sizes. Specifically, for MuLUT, we achieve a PSNR improvement of approximately +0.20 dB on average across five datasets. For SPF-LUT, with more than a 50% reduction in storage …
Poster
Kangfu Mei · Vishal M. Patel · Mojtaba Sahraee-Ardakan · Hossein Talebi · Peyman Milanfar · Mauricio Delbracio
[ ExHall D ]
Abstract
Single-image super-resolution (SISR) remains challenging due to the inherent difficulty of recovering fine-grained details and preserving perceptual quality from low-resolution inputs. Existing methods often rely on limited image priors, leading to suboptimal results. We propose a novel approach that leverages the rich contextual information available in multiple modalities -- including depth, segmentation, edges, and text prompts -- to learn a powerful generative prior for SISR within a diffusion model framework. We introduce a flexible network architecture that effectively fuses multimodal information, accommodating an arbitrary number of input modalities without requiring significant modifications to the diffusion process. Crucially, we mitigate hallucinations, often introduced by text prompts, by using spatial information from other modalities to guide regional text-based conditioning. Each modality's guidance strength can also be controlled independently, allowing steering outputs toward different directions, such as increasing bokeh through depth or adjusting object prominence via segmentation. Extensive experiments demonstrate that our model surpasses state-of-the-art generative SISR methods, achieving superior visual quality and fidelity.
Poster
Zongsheng Yue · Kang Liao · Chen Change Loy
[ ExHall D ]
Abstract
This study presents a new image super-resolution (SR) technique based on diffusion inversion, aiming at harnessing the rich image priors encapsulated in large pre-trained diffusion models to improve SR performance. We design a Partial noise Prediction strategy to construct an intermediate state of the diffusion model, which serves as the starting sampling point. Central to our approach is a deep noise predictor to estimate the optimal noise maps for the forward diffusion process. Once trained, this noise predictor can be used to initialize the sampling process partially along the diffusion trajectory, generating the desirable high-resolution result. Compared to existing approaches, our method offers a flexible and efficient sampling mechanism that supports an arbitrary number of sampling steps, ranging from one to five. Even with a single sampling step, our method demonstrates superior or comparable performance to recent state-of-the-art approaches. The code and model will be made publicly available.
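A minimal sketch of the "start sampling from an intermediate state" idea: a noise predictor estimates a noise map from the (upsampled) low-resolution input, which is combined with it to form x_t at a chosen intermediate timestep before a handful of denoising steps. The schedule value, function names, and the way the predictor is fed are assumptions, not the released method.

# Hedged sketch: construct an intermediate diffusion state from a predicted noise map.
import math
import torch

@torch.no_grad()
def init_intermediate_state(lr_up, noise_predictor, alpha_bar_t):
    """lr_up: upsampled LR input (image or latent); alpha_bar_t: cumulative alpha (float) at the start step t."""
    eps_hat = noise_predictor(lr_up)  # estimated noise map for the forward process
    x_t = math.sqrt(alpha_bar_t) * lr_up + math.sqrt(1.0 - alpha_bar_t) * eps_hat
    return x_t  # handed to the diffusion sampler for the remaining 1-5 denoising steps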
Poster
Bingliang Zhang · Wenda Chu · Julius Berner · Chenlin Meng · Anima Anandkumar · Yang Song
[ ExHall D ]
Abstract
Diffusion models have recently achieved success in solving Bayesian inverse problems with learned data priors. Current methods build on top of the diffusion sampling process, where each denoising step makes small modifications to samples from the previous step. However, this process struggles to correct errors from earlier sampling steps, leading to worse performance in complicated nonlinear inverse problems, such as phase retrieval. To address this challenge, we propose a new method called Decoupled Annealing Posterior Sampling (DAPS) that relies on a novel noise annealing process. Specifically, we decouple consecutive steps in a diffusion sampling trajectory, allowing them to vary considerably from one another while ensuring their time-marginals anneal to the true posterior as we reduce noise levels. This approach enables the exploration of a larger solution space, improving the success rate for accurate reconstructions. We demonstrate that DAPS significantly improves sample quality and stability across multiple image restoration tasks, particularly in complicated nonlinear inverse problems.
Poster
Marina Alterman · Anat Levin
[ ExHall D ]
Abstract
Transmission matrices, mapping the propagation of light from one end of the tissue to the other, form an important mathematical tool in the analysis of tissue scattering and the design of wavefront shaping systems. To understand the relationship between their content and the volumetric structure of the tissue, we wish to fit them with multi-slice models, composed of a set of planar aberrations spaced throughout the volume. The number of layers used in such a model would largely affect the amount of information compression and the ease with which we can use such layered models in a wavefront-shaping system. This work offers a theoretical study of such multi-layered models. We attempt to understand how many layers are required for a good fit, and how the approximation degrades when a smaller number of such layers is used. We show analytically that transmission matrices can be well fitted with very sparse layers. This leads to optimistic predictions on our ability to use them to design future wavefront shaping systems which can correct tissue aberration over a wide field-of-view.
Poster
linwei dong · Qingnan Fan · Yihong Guo · Zhonghao Wang · Qi Zhang · Jinwei Chen · Yawei Luo · Changqing Zou
[ ExHall D ]
Abstract
Pre-trained text-to-image diffusion models are increasingly applied to the real-world image super-resolution (Real-ISR) task. Given the iterative refinement nature of diffusion models, most existing approaches are computationally expensive. While methods such as SinSR and OSEDiff have emerged to condense inference steps via distillation, their performance in image restoration and detail recovery is not satisfactory. To address this, we propose TSD-SR, a novel distillation framework specifically designed for real-world image super-resolution, aiming to construct an efficient and effective one-step model. We first introduce Target Score Distillation, which leverages the priors of diffusion models and real image references to achieve more realistic image restoration. Secondly, we propose a Distribution-Aware Sampling Module to make detail-oriented gradients more readily accessible, addressing the challenge of recovering fine details. Extensive experiments demonstrate that our TSD-SR achieves superior restoration results (best on most metrics) and the fastest inference speed (e.g., 40 times faster than SeeSR) compared to past Real-ISR approaches based on pre-trained diffusion priors.
Poster
Matthieu Terris · Ulugbek Kamilov · Thomas Moreau
[ ExHall D ]
Abstract
Selecting an appropriate prior to compensate for information loss due to the measurement operator is a fundamental challenge in imaging inverse problems. Implicit priors based on denoising neural networks have become central to widely-used frameworks such as Plug-and-Play (PnP) algorithms. In this work, we introduce Fixed-points of Restoration (FiRe) priors as a new framework for expanding the notion of priors in PnP to general restoration models beyond traditional denoising models. The key insight behind FiRe is that natural images emerge as fixed points of the composition of a degradation operator with the corresponding restoration model. This enables us to derive an explicit formula for our implicit prior by quantifying invariance of images under this composite operation. Adopting this fixed-point perspective, we show how various restoration networks can effectively serve as priors for solving inverse problems. The FiRe framework further enables ensemble-like combinations of multiple restoration models as well as acquisition-informed restoration networks, all within a unified optimization approach. Experimental results validate the effectiveness of FiRe across various inverse problems, establishing a new paradigm for incorporating pretrained restoration models into PnP-like algorithms.
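A hedged sketch of one PnP-style iteration built on the fixed-point view above: natural images should be nearly invariant under degradation followed by restoration, so that composite map supplies the prior step while a gradient step enforces data fidelity. The operators, step sizes, and names are illustrative assumptions, not the paper's algorithm.

# Hedged sketch: alternate a data-fidelity gradient step with a pull toward the fixed point of
# restore(degrade(.)), using a pretrained restoration network as the implicit prior.
import torch

def fire_step(x, y, forward_op, adjoint_op, degrade, restore, step=1.0, lam=0.5):
    """y: measurements; forward_op/adjoint_op: measurement operator A and its adjoint;
    degrade/restore: degradation model and pretrained restoration network."""
    x = x - step * adjoint_op(forward_op(x) - y)     # gradient step on 0.5 * ||A x - y||^2
    with torch.no_grad():
        x = (1 - lam) * x + lam * restore(degrade(x))  # prior step toward the fixed point
    return x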
Poster
Junyuan Deng · Xinyi Wu · Yongxing Yang · Congchao Zhu · Song Wang · Zhenyao Wu
[ ExHall D ]
Abstract
Recently, pre-trained text-to-image (T2I) models have been extensively adopted for real-world image restoration because of their powerful generative prior. However, controlling these large models for image restoration usually requires a large number of high-quality images and immense computational resources for training, which is costly and not privacy-friendly. In this paper, we find that a well-trained large T2I model (i.e., Flux) is able to produce a variety of high-quality images aligned with real-world distributions, offering an unlimited supply of training samples to mitigate the above issue. Specifically, we propose a training data construction pipeline for image restoration, namely FluxGen, which includes unconditional image generation, image selection, and degraded image simulation. A novel lightweight adapter (FluxIR) with squeeze-and-excitation layers is also carefully designed to control the large Diffusion Transformer (DiT)-based T2I model so that reasonable details can be restored. Experiments demonstrate that our proposed method enables the Flux model to adapt effectively to real-world image restoration tasks, achieving superior scores and visual quality on both synthetic and real-world degradation datasets - at only about 8.5% of the training cost compared to current approaches.
Poster
Chong Wang · Lanqing Guo · Zixuan Fu · SIYUAN YANG · Hao Cheng · Alex C. Kot · Bihan Wen
[ ExHall D ]
Abstract
Plug-and-play (PnP) methods offer an iterative strategy for solving image restoration (IR) problems in a zero-shot manner, using a learned discriminative denoiser as the implicit prior. More recently, a sampling-based variant of this approach, which utilizes a pre-trained generative diffusion model, has gained great popularity for solving IR problems through stochastic sampling. IR results using PnP with a pre-trained diffusion model demonstrate distinct advantages compared to those using discriminative denoisers, i.e., improved perceptual quality while sacrificing data fidelity. These unsatisfactory results are due to the lack of integration of the two strategies in IR tasks. In this work, we propose a novel zero-shot IR scheme, dubbed Reconciling Diffusion Model in Dual (RDMD), which leverages only a single pre-trained diffusion model to construct two complementary regularizers. Specifically, the diffusion model in RDMD iteratively performs deterministic denoising and stochastic sampling, aiming to achieve high-fidelity image restoration with appealing perceptual quality. RDMD also allows users to customize the distortion-perception tradeoff with a single hyperparameter, enhancing the adaptability of the restoration process in different practical scenarios. Extensive experiments on several IR tasks demonstrate that our proposed method achieves superior results compared to existing approaches on both the FFHQ and ImageNet datasets. We will release the …
Poster
Xinrui Wang · Lanqing Guo · Xiyu Wang · Siyu Huang · Bihan Wen
[ ExHall D ]
Abstract
Recent advancements in deep learning have yielded promising results for the image shadow removal task. However, most existing methods rely on binary pre-generated shadow masks. The binary nature of such masks could potentially lead to artifacts near the boundary between shadow and non-shadow areas. In view of this, inspired by the physical model of shadow formation, we introduce novel soft shadow masks specifically designed for shadow removal. To achieve such soft masks, we propose a SoftShadow framework by leveraging the prior knowledge of pretrained SAM and integrating physical constraints. Specifically, we jointly tune the SAM and the subsequent shadow removal network using penumbra formation constraint loss, mask reconstruction loss, and shadow removal loss. This framework enables accurate predictions of penumbra (partially shaded) and umbra (fully shaded) areas while simultaneously facilitating end-to-end shadow removal. Through extensive experiments on popular datasets, we found that our SoftShadow framework, which generates soft masks, can better restore boundary artifacts, achieve state-of-the-art performance, and demonstrate superior generalizability.
Poster
Xingyu Qiu · Mengying Yang · Xinghua Ma · Fanding Li · Dong Liang · Gongning Luo · wei wang · Kuanquan Wang · Shuo Li
[ ExHall D ]
Abstract
In image generation, Schrödinger Bridge (SB)-based methods theoretically enhance efficiency and quality compared to diffusion models by finding the least costly path between two distributions. However, they are computationally expensive and time-consuming when applied to complex image data. The reason is that they focus on fitting globally optimal paths in high-dimensional spaces, directly generating images as the next step on the path using complex networks through self-supervised training, which typically results in a gap with the global optimum. Meanwhile, most diffusion models live in the same path subspace generated by weights f_A(t) and f_B(t), as they follow the paradigm x_t = f_A(t)·x_Img + f_B(t)·ϵ. To address the limitations of SB-based methods, this paper proposes for the first time to find local Diffusion Schrödinger Bridges (LDSB) in the diffusion path subspace, which strengthens the connection between the SB problem and diffusion models. Specifically, our method optimizes the diffusion paths using a Kolmogorov-Arnold Network (KAN), which has the advantage of resistance to forgetting and continuous output. Experiments show that our LDSB significantly improves the quality and efficiency of image generation using the same pre-trained denoising network, and the KAN used for optimization is less than 0.1 MB. The FID metric is reduced …
Poster
Yikai Wang · Chenjie Cao · Junqiu Yu · Ke Fan · Xiangyang Xue · Yanwei Fu
[ ExHall D ]
Abstract
Recent advances in image inpainting increasingly use generative models to handle large irregular masks. However, these models can create unrealistic inpainted images due to two main issues: (1) Context Instability: Even with unmasked areas as context, generative models may still generate arbitrary objects in the masked region that don't align with the rest of the image. (2) Hue Inconsistency: Inpainted regions often have color shifts that cause a smeared appearance, reducing image quality. Retraining the generative model could help solve these issues, but it's costly since state-of-the-art latent-based diffusion and rectified flow models require a three-stage training process: training a VAE, training a generative U-Net or transformer, and fine-tuning for inpainting. Instead, this paper proposes a post-processing approach, dubbed ASUKA (Aligned Stable inpainting with UnKnown Areas prior), to improve inpainting models. To address context instability, we leverage a Masked Auto-Encoder (MAE) for reconstruction-based priors. This strengthens context alignment while maintaining the model's generation capabilities. To address hue inconsistency, we propose a specialized VAE decoder that treats latent-to-image decoding as a local harmonization task, significantly reducing color shifts for hue-consistent inpainting. We validate ASUKA on SD 1.5 and FLUX inpainting variants using the Places2 benchmark and MISATO, our proposed diverse collection of …
Poster
Zhe Zhang · Zhenzhong Chen · Shan Liu
[ ExHall D ]
Abstract
Neural lossless image compression methods have recently achieved impressive compression ratios by fitting neural networks to represent data distributions of large datasets. However, these methods often require complex networks to capture intricate data distributions effectively, resulting in high decoding complexity. In this paper, we present a novel approach named Fitted Neural Lossless Image Compression (FNLIC) that enhances efficiency through a two-phase fitting process. For each image, a latent variable model is overfitted to optimize the representation of the individual image's probability distribution, which is inherently simpler than the distribution of an entire dataset and requires less complex neural networks. Additionally, we pre-fit a lightweight autoregressive model on a comprehensive dataset to learn a beneficial prior for overfitted models. To improve coordination between the pre-fitting and overfitting phases, we introduce independent fitting for the pre-fitter and the adaptive prior transformation for the overfitted model. Extensive experimental results on high-resolution datasets show that FNLIC achieves competitive compression ratios compared to both traditional and neural lossless image compression methods, with decoding complexity significantly lower than other neural methods of similar performance. The code will be made publicly available upon publication.
Poster
Jona Ballé · Luca Versari · Emilien Dupont · Hyunjik Kim · Matthias Bauer
[ ExHall D ]
Abstract
Inspired by the success of generative image models, recent work on learned image compression increasingly focuses on better probabilistic models of the natural image distribution, leading to excellent image quality. This, however, comes at the expense of a computational complexity that is several orders of magnitude higher than today's commercial codecs, and thus prohibitive for most practical applications. With this paper, we demonstrate that by focusing on modeling visual perception rather than the data distribution, we can achieve a very good trade-off between visual quality and bit rate similar to "generative" compression models such as HiFiC, while requiring less than 1% of the multiply–accumulate operations (MACs) for decompression. We do this by optimizing C3, an overfitted image codec, for Wasserstein Distortion (WD), and evaluating the image reconstructions with a human rater study. The study also reveals that WD outperforms other perceptual quality metrics such as LPIPS, DISTS, and MS-SSIM, both as an optimization objective and as a predictor of human ratings, achieving over 94% Pearson correlation with Elo scores.
Poster
Xuewen Liu · Zhikai Li · Qingyi Gu
[ ExHall D ]
Abstract
Diffusion models have gradually gained prominence in the field of image synthesis, showcasing remarkable generative capabilities. Nevertheless, the slow inference and complex networks, resulting from redundancy at both temporal and structural levels, hinder their low-latency applications in real-world scenarios. Current acceleration methods for diffusion models focus separately on temporal and structural levels. However, independent optimization at each level to further push the acceleration limits results in significant performance degradation. On the other hand, integrating optimizations at both levels can compound the acceleration effects. Unfortunately, we find that the optimizations at these two levels are not entirely orthogonal. Performing separate optimizations and then simply integrating them results in unsatisfactory performance. To tackle this issue, we propose CacheQuant, a novel training-free paradigm that comprehensively accelerates diffusion models by jointly optimizing model caching and quantization techniques. Specifically, we employ a dynamic programming approach to determine the optimal cache schedule, in which the properties of caching and quantization are carefully considered to minimize errors. Additionally, we propose decoupled error correction to further mitigate the coupled and accumulated errors step by step. Experimental results show that CacheQuant achieves a 5.18× speedup and 4× compression for Stable Diffusion on MS-COCO, with only a 0.02 loss in …
Poster
Qianli Ma · Xuefei Ning · Dongrui Liu · Li Niu · Linfeng Zhang
[ ExHall D ]
Abstract
Diffusion models are trained by learning a sequence of models that reverse each step of noise corruption. Typically, the model parameters are fully shared across multiple timesteps to enhance training efficiency. However, since the denoising tasks differ at each timestep, the gradients computed at different timesteps may conflict, potentially degrading the overall performance of image generation. To solve this issue, this work proposes a Decouple-then-Merge (DeMe) framework, which begins with a pretrained model and finetunes separate models tailored to specific timesteps. We introduce several improved techniques during the finetuning stage to promote effective knowledge sharing while minimizing training interference across timesteps. Finally, after finetuning, these separate models can be merged into a single model in the parameter space, ensuring efficient and practical inference. Experimental results show significant generation quality improvements on 6 benchmarks, including Stable Diffusion on COCO30K, ImageNet1K, PartiPrompts, and DDPM on LSUN Church, LSUN Bedroom, and CIFAR10. Code is included in the supplementary material and will be released on Github.
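As a rough illustration of merging timestep-specialized models in parameter space, the sketch below averages the state dicts of several finetuned copies of one network. Uniform averaging is an assumption made here for simplicity; DeMe's actual merging rule may differ.

```python
# Minimal sketch (assumption: simple uniform averaging) of merging several
# timestep-specialized finetuned copies of one network back into a single
# parameter set. The paper's exact merging rule may differ.
import copy
import torch

def merge_state_dicts(state_dicts, weights=None):
    """Average a list of state_dicts (same architecture) in parameter space."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = copy.deepcopy(state_dicts[0])
    for key in merged:
        merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return merged

# Hypothetical usage with three timestep-specialized experts:
# experts = [model_early.state_dict(), model_mid.state_dict(), model_late.state_dict()]
# single_model.load_state_dict(merge_state_dicts(experts))
```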
Poster
Youyuan Zhang · Zehua Liu · Zenan Li · Zhaoyu Li · James Clark · Xujie Si
[ ExHall D ]
Abstract
In this paper, we consider the conditional generation problem by guiding off-the-shelf unconditional diffusion models with differentiable loss functions in a plug-and-play fashion. While previous research has primarily focused on balancing the unconditional diffusion model and the guided loss through a tuned weight hyperparameter, we propose a novel framework that distinctly decouples these two components. Specifically, we introduce two variables, x and z, to represent the generated samples governed by the unconditional generation model and the guidance function, respectively. This decoupling reformulates conditional generation into two manageable subproblems, unified by the constraint x=z. Leveraging this setup, we develop a new algorithm based on the Alternating Direction Method of Multipliers (ADMM) to adaptively balance these components. Additionally, we establish the equivalence between the diffusion reverse step and the proximal operator of ADMM and provide a detailed convergence analysis of our algorithm under certain mild assumptions. Our experiments demonstrate that our proposed method consistently generates high-quality samples while ensuring strong adherence to the conditioning criteria. It outperforms existing methods across a range of conditional generation tasks, including image generation with various guidance and controllable motion synthesis.
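For reference, the generic scaled-form ADMM iteration for a split problem $\min_{x,z} f(x)+g(z)$ subject to $x=z$ is shown below, with $f$ loosely standing in for the unconditional diffusion term and $g$ for the guidance loss. This is the textbook update, not necessarily the paper's exact algorithm.

```latex
% Generic scaled-form ADMM for  min_{x,z} f(x) + g(z)  s.t.  x = z.
% Here f loosely plays the role of the unconditional diffusion prior term and
% g the differentiable guidance loss; rho > 0 is the penalty parameter.
\begin{aligned}
x^{k+1} &= \arg\min_{x}\; f(x) + \tfrac{\rho}{2}\,\lVert x - z^{k} + u^{k}\rVert_2^2,\\
z^{k+1} &= \arg\min_{z}\; g(z) + \tfrac{\rho}{2}\,\lVert x^{k+1} - z + u^{k}\rVert_2^2,\\
u^{k+1} &= u^{k} + x^{k+1} - z^{k+1}.
\end{aligned}
```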
Poster
Hao Lin · Ke Wu · Jie Li · Jun Li · Wu-Jun Li
[ ExHall D ]
Abstract
Distributed learning is commonly used for training deep learning models, especially large models. In distributed learning, manual parallelism (MP) methods demand considerable human effort and have limited flexibility. Hence, automatic parallelism (AP) methods have recently been proposed for automating the parallel strategy optimization process. Existing AP methods suffer from sub-optimal solutions because they do not jointly optimize the two categories of parallel strategies (i.e., inter-layer parallelism and intra-layer parallelism). In this paper, we propose a novel AP method called UniAP, which unifies inter- and intra-layer automatic parallelism by mixed integer quadratic programming. To the best of our knowledge, UniAP is the first parallel method that can jointly optimize the two categories of parallel strategies to find an optimal solution. Experimental results show that UniAP outperforms state-of-the-art methods by up to 3.80× in throughput and reduces strategy optimization time by up to 107× across five Transformer-based models.
Poster
Mashrur M. Morshed · Vishnu Naresh Boddeti
[ ExHall D ]
Abstract
Many real-world applications of flow-based generative models desire a diverse set of samples covering multiple modes of the target distribution. However, the predominant approach for obtaining diverse sets is not sample-efficient, as it involves independently obtaining many samples from the source distribution and mapping them through the flow until the desired mode coverage is achieved. As an alternative to repeated sampling, we introduce DiverseFlow: a training-free, inference-time approach to improve the diversity of flow models. Our key idea is to employ a determinantal point process to induce a coupling between the samples that drives diversity under a fixed sampling budget. In essence, DiverseFlow enables exploring more variations in a learned flow model with fewer samples. We demonstrate the efficacy of our method for tasks where sample-efficient diversity is desirable, such as text-guided image generation with polysemous words, inverse problems like large-hole inpainting, and class-conditional image synthesis.
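A minimal sketch of a DPP-style diversity score over a batch of samples is given below, assuming an RBF similarity kernel and using the log-determinant as the quantity to maximize; a DPP assigns higher probability to diverse subsets, so pushing this score up spreads the batch apart. DiverseFlow's actual coupling and guidance are more involved.

```python
# Minimal sketch (assumptions: RBF kernel on flattened samples, log-det as the
# diversity score). Maximizing log det(K + eps*I) during sampling encourages a
# batch of samples to be mutually dissimilar.
import torch

def dpp_diversity(samples, bandwidth=1.0, eps=1e-4):
    """samples: (B, D) batch of flattened samples; returns a scalar to maximize."""
    d2 = torch.cdist(samples, samples).pow(2)       # pairwise squared distances
    K = torch.exp(-d2 / (2.0 * bandwidth ** 2))     # RBF similarity kernel
    K = K + eps * torch.eye(len(samples))           # numerical stabilizer
    return torch.logdet(K)                          # high when samples differ

x = torch.randn(4, 3 * 32 * 32, requires_grad=True)  # toy batch
score = dpp_diversity(x)
score.backward()                                      # gradient can steer sampling
```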
Poster
Junhyuk So · Jiwoong Shin · Chaeyeon Jang · Eunhyeok Park
[ ExHall D ]
Abstract
Recently, diffusion models have achieved significant advances in vision, text, and robotics. However, they still face slow generation speeds due to sequential denoising processes. To address this, a parallel sampling method based on Picard iteration was introduced, effectively reducing sequential steps while ensuring exact convergence to the original output. Nonetheless, Picard iteration does not guarantee faster convergence, which can still result in slow generation in practice. In this work, we propose a new parallelization scheme, the Picard Consistency Model (PCM), which significantly reduces the number of generation steps in Picard iteration. Inspired by the consistency model, PCM is directly trained to predict the fixed-point solution, or the final output, at any stage of the convergence trajectory. Additionally, we introduce a new concept called model switching, which addresses PCM’s limitations and ensures exact convergence. Extensive experiments demonstrate that PCM achieves up to a 2.71x speedup over sequential sampling and a 1.77x speedup over Picard iteration across various tasks, including image generation and robotic control.
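For intuition, the sketch below runs Picard iteration on an explicit-Euler discretization of a generic sampling ODE dx/dt = f(x, t): each sweep updates all time points from the previous iterate, so the drift evaluations within a sweep can run in parallel. This illustrates plain Picard iteration only, not the proposed PCM or its model switching.

```python
# Minimal sketch (assumption: explicit-Euler discretization of a sampling ODE
# dx/dt = f(x, t)). Each Picard sweep updates every time point from the previous
# iterate, so the N drift evaluations of one sweep can be computed in parallel.
import numpy as np

def picard_sweeps(f, x_init, ts, n_sweeps=10):
    """f: drift f(x, t); x_init: starting state; ts: increasing time grid."""
    N = len(ts)
    xs = np.tile(x_init, (N, 1)).astype(float)       # initial guess: constant path
    dts = np.diff(ts)
    for _ in range(n_sweeps):
        drifts = np.stack([f(xs[j], ts[j]) for j in range(N - 1)])  # parallelizable
        increments = np.cumsum(drifts * dts[:, None], axis=0)
        xs[1:] = x_init + increments                  # x_i = x_0 + sum_{j<i} f dt
    return xs

# Toy usage: linear drift; the fixed point is the explicit-Euler solution.
xs = picard_sweeps(lambda x, t: -x, x_init=np.ones(2), ts=np.linspace(0, 1, 50))
```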
Poster
David McAllister · Matthew Tancik · Jiaming Song · Angjoo Kanazawa
[ ExHall D ]
Abstract
Large-scale AI model training divides work across thousands of GPUs and then synchronizes gradients across them at each step. This incurs a significant network burden that only centralized, monolithic clusters can support, driving up infrastructure costs and straining power systems. We propose Decentralized Diffusion Models, a scalable framework to distribute diffusion model training across independent clusters or datacenters by eliminating the dependence on a centralized, high-bandwidth networking fabric. Our method trains a set of expert diffusion models over partitions of the dataset, each in full isolation from one another. At inference time, they ensemble through a lightweight router. We show that this ensemble collectively optimizes the same objective as a single model trained over the whole dataset. This means we can divide the training burden among a number of "compute islands," lowering infrastructure costs and improving resilience to localized GPU failures. Decentralized diffusion models empower researchers to take advantage of smaller, more cost-effective and more readily available compute like on-demand GPU nodes rather than central integrated systems. We conduct extensive experiments on ImageNet and LAION Aesthetics, showing that decentralized diffusion models FLOP-for-FLOP outperform standard diffusion models. We finally scale our approach to 24 billion parameters, demonstrating that high-quality diffusion models can …
Poster
Zigeng Chen · Xinyin Ma · Gongfan Fang · Xinchao Wang
[ ExHall D ]
Abstract
In the rapidly advancing field of image generation, *Visual Auto-Regressive* (VAR) modeling has garnered considerable attention for its innovative next-scale prediction approach. This paradigm offers substantial improvements in efficiency, scalability, and zero-shot generalization. Yet, the inherently coarse-to-fine nature of VAR introduces a prolonged token sequence, leading to prohibitive memory consumption and computational redundancies. To overcome these bottlenecks, we propose *Collaborative Decoding* (CoDe), a novel decoding strategy tailored to the VAR framework. CoDe capitalizes on two critical observations: the substantially reduced parameter demands at larger scales and the exclusive generation patterns across different scales. Based on these insights, we partition the multi-scale inference process into a seamless collaboration between a large model and a small model. The large model serves as the 'drafter', specializing in generating low-frequency content at smaller scales, while the smaller model serves as the 'refiner', solely focusing on predicting high-frequency details at larger scales. This collaboration yields remarkable efficiency with minimal impact on quality: CoDe achieves a 1.7x speedup, slashes memory usage by 50%, and preserves image quality with only a negligible FID increase from 1.95 to 1.98. When drafting steps are further decreased, CoDe can achieve an impressive 2.9x acceleration, reaching over 41 images/s at 256x256 …
Poster
Ye Chen · Zhangli Hu · Zhongyin Zhao · Yupeng Zhu · Yue Shi · Yuxuan Xiong · Bingbing Ni
[ ExHall D ]
Abstract
Current parameterized image representations embed visual information along the semantic boundaries and struggle to express the internal detailed texture structures of image components, leading to a lack of content consistency after image editing and driving. To address these challenges, this work proposes a novel parameterized representation based on hierarchical image proxy geometry, utilizing multi-layer hierarchically interrelated proxy geometric control points to embed multi-scale long-range structures and fine-grained texture details. The proposed representation enables smoother and more continuous interpolation during image rendering and ensures high-quality consistency within image components during image editing. Additionally, under the layer-wise representation strategy based on semantic-aware image layer decomposition, we enable decoupled image shape/texture editing of the targets of interest within the image. Extensive experimental results on image vectorization and editing tasks demonstrate that our proposed method achieves high rendering accuracy of general images, including natural images, with a significantly higher image parameter compression ratio, facilitating user-friendly editing of image semantic components.
Poster
Yael Vinker · Tamar Rott Shaham · Kristine Zheng · Alex Zhao · Judith Fan · Antonio Torralba
[ ExHall D ]
Abstract
Sketching serves as a versatile tool for externalizing ideas, enabling rapid exploration and visual communication that spans various disciplines. While artificial systems have driven substantial advances in content creation and human-computer interaction, capturing the dynamic and abstract nature of human sketching remains challenging. In this work, we introduce SketchAgent, a language-driven, sequential sketch generation method that enables users to create, modify, and refine sketches through dynamic, conversational interactions.Our approach requires no training or fine-tuning. Instead, we leverage the sequential nature and rich prior knowledge of off-the-shelf multimodal large language models (LLMs). We present an intuitive sketching language, introduced to the model through in-context examples, enabling it to "draw" using string-based actions. These are processed into vector graphics and then rendered to create a sketch on a pixel canvas, which can be accessed again for further tasks.By drawing stroke by stroke, our agent captures the evolving, dynamic qualities intrinsic to sketching. We demonstrate that SketchAgent can generate sketches from diverse prompts, engage in dialogue-driven drawing, and collaborate meaningfully with human users.
Poster
Xihua Wang · Ruihua Song · Chongxuan Li · Xin Cheng · Boyuan Li · Yihan Wu · Yuyue Wang · Hongteng Xu · Yunfeng Wang
[ ExHall D ]
Abstract
This paper addresses a promising yet underexplored task, Image-to-Sounding-Video (I2SV) generation, which animates a static image and generates synchronized sound simultaneously. Despite advances in video and audio generation models, challenges remain in developing a unified model for generating naturally sounding videos. In this work, we propose a novel approach that leverages two separate pretrained diffusion models and makes vision and audio influence each other during generation, based on the Diffusion Transformer (DiT) architecture. First, the individual video and audio generation models are decomposed into input, output, and expert sub-modules. We propose using a unified joint DiT block in the expert sub-modules to effectively model the interaction between the two modalities, resulting in high-quality I2SV generation. Then, we introduce a joint classifier-free guidance technique to boost performance during joint generation. Finally, we conduct extensive experiments on three popular benchmark datasets, and in both objective and subjective evaluation our method surpasses all baseline methods on almost all metrics. Case studies show that our generated sounding videos are high quality and synchronized between video and audio.
Poster
Feng-Lin Liu · Hongbo Fu · Xintao Wang · Weicai Ye · Pengfei Wan · Di ZHANG · Lin Gao
[ ExHall D ]
Abstract
Video generation and editing conditioned on text prompts or images have undergone significant advancements. However, challenges remain in accurately controlling global layout and geometry details solely by texts, and supporting motion control and local modification through images. In this paper, we aim to achieve sketch-based spatial and motion control for video generation and support fine-grained editing of real or synthetic videos. Based on the DiT video generation model, we propose a memory-efficient control structure with sketch control blocks that predict residual features of skipped DiT blocks. Sketches are drawn on one or two keyframes (at arbitrary time points) for easy interaction. To propagate such temporally sparse sketch conditions across all frames, we propose an inter-frame attention mechanism to analyze the relationship between the keyframes and each video frame. For sketch-based video editing, we design an additional video insertion module that maintains consistency between the newly edited content and the original video's spatial feature and dynamic motion. During inference, we use latent fusion for the accurate preservation of unedited regions. Extensive experiments demonstrate that our SketchVideo achieves superior performance in controllable video generation and editing. We will release our code after acceptance.
Poster
Dingkun Yan · Xinrui Wang · Zhuoru Li · Suguru Saito · Yusuke Iwasawa · Yutaka Matsuo · Jiaxian Guo
[ ExHall D ]
Abstract
Sketch colorization plays an important role in animation and digital illustration production tasks. However, existing methods still face problems: text-guided methods fail to provide accurate color and style references, hint-guided methods still involve manual operation, and image-guided methods are prone to artifacts. To address these limitations, we propose a diffusion-based framework inspired by real-world animation production workflows. Our approach leverages the sketch as the spatial reference and an RGB image as the color guidance, and separately extracts foreground and background information from the reference image with spatial masks. Particularly, we introduce a split cross-attention mechanism with LoRA (Low-Rank Adaptation) modules, trained separately for foreground and background, to control the corresponding embeddings for keys and values in cross-attention. This design allows the diffusion model to integrate information from foreground and background independently, preventing interference and eliminating the need to fine-tune model parameters. During inference, we design switchable inference modes for diverse use scenarios by changing the modules activated in the framework. Extensive qualitative and quantitative experiments, along with user studies, demonstrate our advantages over existing methods in generating high-quality, artifact-free results with geometrically mismatched references. Ablation studies further confirm the effectiveness of each component. Codes and trained models will …
Poster
Junyu Gao · Kunlin Yang · Xuan Yao · Yufan Hu
[ ExHall D ]
Abstract
Recently, text-driven video editing methods that optimize target latent representations have garnered significant attention and demonstrated promising results. However, these methods rely on self-supervised objectives to compute the gradients needed for updating latent representations, which inevitably introduces gradient noise, compromising content generation quality. Additionally, it is challenging to determine the optimal stopping point for the editing process, making it difficult to achieve an optimal solution for the latent representation. To address these issues, we propose a unified gradient-latent purification framework that collects gradient and latent information across different stages to identify effective and concordant update directions. We design a local coordinate system construction method based on feature decomposition, enabling short-term gradients and final-stage latents to be reprojected onto new axes. Then, we employ tailored coefficient regularization terms to effectively aggregate the decomposed information. Additionally, a temporal smoothing axis extension strategy is developed to enhance the temporal coherence of the generated content. Extensive experiments demonstrate that our proposed method outperforms state-of-the-art methods across various editing tasks, delivering superior editing performance. Code is available in the Supplementary Material.
Poster
Zilyu Ye · Zhiyang Chen · Tiancheng Li · Zemin Huang · Weijian Luo · Guo-Jun Qi
[ ExHall D ]
Abstract
Diffusion and flow models have achieved remarkable successes in various applications such as text-to-image generation. However, these models typically rely on the same predetermined denoising schedules during inference for each prompt, which potentially limits the inference efficiency as well as the flexibility when handling different prompts. In this paper, we argue that the optimal noise schedule should adapt to each inference instance, and introduce the Time Prediction Diffusion Model (TPDM) to accomplish this. TPDM employs a plug-and-play Time Prediction Module (TPM) that predicts the next noise level based on current latent features at each denoising step. We train the TPM using reinforcement learning to maximize the final image quality while discounting the number of denoising steps. With such an adaptive scheduler, TPDM not only generates high-quality images that are aligned closely with human preferences but also adjusts the number of denoising steps and time on the fly, enhancing both performance and efficiency. We train TPDMs on multiple diffusion model benchmarks. With Stable Diffusion 3 Medium architecture, TPDM achieves an aesthetic score of 5.44 and a human preference score (HPS) of 29.59, while using 50% fewer denoising steps to achieve better performance. We will release our best model alongside this paper.
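A purely hypothetical sketch of what a plug-and-play next-noise-level predictor could look like is given below: pooled latent features are mapped to a fraction of the current noise level. The module name, architecture, and conditioning are assumptions; the paper's TPM design and its reinforcement-learning training are not shown.

```python
# Hypothetical sketch of a plug-and-play next-noise-level predictor: pooled
# latent features -> a fraction in (0, 1) that scales down the current noise
# level. The real TPM architecture and its RL training are not shown here.
import torch
import torch.nn as nn

class TimePredictionModule(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(channels, 128), nn.SiLU(),
            nn.Linear(128, 1), nn.Sigmoid(),      # fraction of the current sigma
        )

    def forward(self, latent, sigma_t):
        pooled = latent.mean(dim=(2, 3))          # (B, C) global average pool
        frac = self.net(pooled).squeeze(-1)       # (B,), in (0, 1)
        return frac * sigma_t                     # next (smaller) noise level

tpm = TimePredictionModule(channels=4)
sigma_next = tpm(torch.randn(2, 4, 64, 64), sigma_t=torch.tensor([1.0, 0.8]))
```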
Poster
Ravishankar Evani · Deepu Rajan · Shangbo Mao
[ ExHall D ]
Abstract
Texture recognition has recently relied on convolution-, transformer-, and graph-based neural networks. However, many of these methods fail to effectively incorporate the frequency characteristics exhibited by visual and latent texture attributes. In addition, effective orderless representation of textures before mapping from latent to visual texture attributes has not been fully explored. Finally, no loss function has been designed specifically for texture and material recognition tasks. In this study, we introduce the Chebyshev Attention Depth Permutation Texture Network (CAPTN), which uses texture frequency attention mechanisms and convolution operations to generate latent texture attributes. These attributes are then enhanced by permuting the feature space. CAPTN then incorporates a non-linear learnable Chebyshev function to improve the mapping of orderless enhanced latent texture attributes to visual texture attributes. Finally, we propose a Latent Texture Attribute Loss to capture spatial texture characteristics and enforce distributional consistency of orderless latent texture attribute representations. CAPTN allows end-to-end training without the need to fine-tune pre-trained CNN backbones. Experiments show that CAPTN achieves state-of-the-art results on multiple benchmark texture and material datasets.
Poster
Shuhao Zhang · Hui Kang · Yang Liu · Fang Mei · Hongjuan Li
[ ExHall D ]
Abstract
Attention-based arbitrary style transfer methods have gained significant attention recently due to their impressive ability to synthesize style details. However, the point-wise matching within the attention mechanism may overly focus on local patterns and neglect the remarkable global features of style images. Additionally, when processing large images, the quadratic complexity of the attention mechanism brings a high computational load. To alleviate the above problems, we propose the Holistic Style Injector (HSI), a novel attention-style transformation module that delivers the artistic expression of the target style. Specifically, HSI performs stylization based only on a global style representation, which is more in line with the characteristics of style transfer, to avoid generating local disharmonious patterns in stylized images. Moreover, we propose a dual relation learning mechanism inside the HSI to dynamically render images by leveraging semantic similarity in content and style, ensuring the stylized images preserve the original content and improve style fidelity. Note that the proposed HSI achieves linear computational complexity because it establishes feature mapping through element-wise multiplication rather than matrix multiplication. Qualitative and quantitative results demonstrate that our method outperforms state-of-the-art approaches in both effectiveness and efficiency.
Poster
Mingkun Lei · Xue Song · Beier Zhu · Hao Wang · Chi Zhang
[ ExHall D ]
Abstract
Text-driven style transfer aims to merge the style of a reference image with content described by a text prompt. Recent advancements in text-to-image models have improved the nuance of style transformations, yet significant challenges remain, particularly with overfitting to reference styles, limiting stylistic control, and misaligning with textual content. In this paper, we propose three complementary strategies to address these issues. First, we introduce a cross-modal Adaptive Instance Normalization (AdaIN) mechanism for better integration of style and text features, enhancing alignment. Second, we develop a Style-based Classifier-Free Guidance (SCFG) approach that enables selective control over stylistic elements, reducing irrelevant influences. Finally, we incorporate a teacher model during early generation stages to stabilize spatial layouts and mitigate artifacts. Our extensive evaluations demonstrate significant improvements in style transfer quality and alignment with textual prompts. Furthermore, our approach can be integrated into existing style transfer frameworks without fine-tuning.
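For reference, the standard (single-modality) AdaIN operation is recalled below, where $\mu$ and $\sigma$ denote channel-wise mean and standard deviation. The paper's cross-modal variant derives these statistics from combined style and text features, which is not reproduced here.

```latex
% Standard AdaIN: content features x are re-normalized with the channel-wise
% statistics of style features y. The cross-modal variant described in the
% abstract modifies how these statistics are obtained, which is not shown.
\mathrm{AdaIN}(x, y) \;=\; \sigma(y)\,\frac{x - \mu(x)}{\sigma(x)} \;+\; \mu(y)
```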
Poster
Srikar Yellapragada · Alexandros Graikos · Kostas Triaridis · Prateek Prasanna · Rajarsi Gupta · Joel Saltz · Dimitris Samaras
[ ExHall D ]
Abstract
Diffusion models have revolutionized image generation, yet several challenges restrict their application to large-image domains, such as digital pathology and satellite imagery. Given that it is infeasible to directly train a model on 'whole' images from domains with potential gigapixel sizes, diffusion-based generative methods have focused on synthesizing small, fixed-size patches extracted from these images. However, generating small patches has limited applicability since patch-based models fail to capture the global structures and wider context of large images, which can be crucial for synthesizing (semantically) accurate samples. In this paper, to overcome this limitation, we present ZoomLDM, a diffusion model tailored for generating images across multiple scales. Central to our approach is a novel magnification-aware conditioning mechanism that utilizes self-supervised learning (SSL) embeddings and allows the diffusion model to synthesize images at different 'zoom' levels, i.e., fixed-size patches extracted from large images at varying scales. ZoomLDM achieves state-of-the-art image generation quality across all scales, excelling particularly in the data-scarce setting of generating thumbnails of entire large images. The multi-scale nature of ZoomLDM unlocks additional capabilities in large image generation, enabling computationally tractable and globally coherent image synthesis up to 4096×4096 pixels and 4× super-resolution. Additionally, multi-scale features extracted from …
Poster
Jinjin Zhang · qiuyu Huang · Junjie Liu · Xiefan Guo · Di Huang
[ ExHall D ]
Abstract
In this paper, we present Diffusion-4K, a novel framework for direct ultra-high-resolution image synthesis using text-to-image diffusion models. The core advancements include: (1) Aesthetic-4K Benchmark: addressing the absence of a publicly available 4K image synthesis dataset, we construct Aesthetic-4K, a comprehensive benchmark for ultra-high-resolution image generation. We curated a high-quality 4K dataset with carefully selected images and captions generated by GPT-4o. Additionally, we introduce GLCM Score and compression ratio metrics to evaluate fine details, combined with holistic measures such as FID, Aesthetics and CLIPScore for a comprehensive assessment of ultra-high-resolution images. (2) Wavelet-based Fine-tuning: we propose a wavelet-based fine-tuning approach for direct training with photorealistic 4K images, applicable to various latent diffusion models, demonstrating its effectiveness in synthesizing highly detailed 4K images. Consequently, Diffusion-4K achieves impressive performance in high-quality image synthesis and text prompt adherence, especially when powered by modern large-scale diffusion models (e.g., SD3-2B and Flux-12B). Extensive experimental results from our benchmark demonstrate the superiority of Diffusion-4K in ultra-high-resolution image synthesis. The code and dataset will be made publicly available soon.
Poster
Yoonjeon Kim · Soohyun Ryu · Yeonsung Jung · Hyunkoo Lee · Joowon Kim · June Yong Yang · Jaeryong Hwang · Eunho Yang
[ ExHall D ]
Abstract
The development of vision-language and generative models has significantly advanced text-guided image editing, which seeks the preservation of core elements in the source image while implementing modifications based on the target text. However, existing metrics have a context-blindness problem: they indiscriminately apply the same criteria to completely different contexts and bias towards either modification or preservation. Directional CLIP similarity, the only metric that considers both source image and target text, is also biased towards modification aspects and attends to irrelevant editing regions of the image. We propose AugCLIP, a context-aware metric that adaptively coordinates preservation and modification aspects, depending on the specific context of a given source image and target text. This is done by deriving the CLIP representation of an ideally edited image that preserves the source image with necessary modifications to align with the target text. More specifically, using a multi-modal large language model, AugCLIP generates detailed textual descriptions of the source and target, then calculates a modification vector through a hyperplane in CLIP space that separates source and target attributes. Extensive experiments on five benchmark datasets, encompassing a diverse range of editing scenarios, show that AugCLIP aligns remarkably well with human evaluation standards, outperforming existing metrics. The …
Poster
Shanshan Huang · Haoxuan Li · Chunyuan Zheng · Lei Wang · Guorui Liao · Zhili Gong · Huayi Yang · Li Liu
[ ExHall D ]
Abstract
A key challenge for controllable image editing is the fact that visual attributes with semantic meanings are not always independent of each other, resulting in spurious correlations in model training. However, most existing methods ignore this issue, leading to biased causal representation learning and unintended changes to unrelated features in the edited images. This leads us to present a diffusion-based causal representation learning framework called CIDiffuser that employs structural causal models (SCMs) to capture causal representations of visual attributes and address the spurious correlation. The framework first adopts a semantic encoder to decompose the representation into the target part, which includes visual attributes of interest to the user, and the "other" part. We then introduce a direct causal effect learning module to capture the total direct causal effect between the potential outcomes before and after intervening on the visual attributes. In addition, a diffusion-based learning strategy is designed to optimize the representation learning process. Empirical evaluations on two benchmark datasets and one in-house dataset suggest our approach significantly outperforms the state-of-the-art methods, enabling controllable image editing by modifying the learned visual representations.
Poster
Wenhao Gu · Li Gu · Ching Suen · Yang Wang
[ ExHall D ]
Abstract
Recent advancements in handwritten text recognition (HTR) have enabled effective conversion of handwritten text to digital formats. However, achieving robust recognition across diverse writing styles remains challenging. Traditional HTR methods lack writer-specific personalization at test time due to limitations in model architecture and training strategies. Existing attempts to bridge this gap, through gradient-based meta-learning, still require labeled examples and suffer from parameter-inefficient fine-tuning, leading to substantial computational and memory overhead. To overcome these challenges, we propose an efficient framework that formulates personalization as prompt tuning, incorporating an auxiliary image reconstruction task with a self-supervised loss to guide prompt adaptation with unlabeled test-time examples. To ensure the self-supervised loss effectively minimizes text recognition error, we leverage meta-learning to learn the optimal initialization of the prompts. As a result, our method allows the model to efficiently capture unique writing styles by updating less than 1% of its parameters and eliminating the need for time-intensive annotation processes. We validate our approach on the RIMES and IAM Handwriting Database benchmarks, where it consistently outperforms previous state-of-the-art methods with up to 8x speedup. We believe this represents a significant advancement in personalized handwritten text recognition, paving the way for more reliable and practical deployment in …
Poster
Zihao Wang · Yuxiang Wei · Fan Li · Renjing Pei · Hang Xu · Wangmeng Zuo
[ ExHall D ]
Abstract
Recent advances in text-to-image diffusion models have significantly facilitated the generation of high-quality images, but have also raised concerns about the illegal creation of harmful content, such as copyrighted images. Existing concept erasure methods achieve superior results in preventing the production of erased concepts from prompts, but typically perform poorly in preventing undesired editing. To address this issue, we propose an Anti-Editing Concept Erasure (ACE) method, which not only erases the target concept during generation but also filters it out during editing. Specifically, we propose to inject the erasure guidance into both the conditional and the unconditional noise prediction, enabling the model to effectively prevent the creation of erased concepts during both editing and generation. Furthermore, a stochastic correction guidance is introduced during training to address the erosion of unrelated concepts. We conducted erasure editing experiments with representative editing methods (i.e., LEDITS++ and MasaCtrl) to erase IP characters, and the results indicate that our ACE effectively filters out target concepts in both types of edits. Additional experiments on erasing explicit concepts and artistic styles further demonstrate that our ACE performs favorably against state-of-the-art methods. Our code will be publicly available.
Poster
Shoufa Chen · Chongjian GE · Yuqi Zhang · Yida Zhang · Fengda Zhu · Hao Yang · Hongxiang Hao · hui wu · Zhichao Lai · Yifei Hu · Ting-Che Lin · Shilong Zhang · Fu Li · Chuan Li · Xing Wang · Yanghua Peng · Peize Sun · Ping Luo · Yi Jiang · Zehuan Yuan · BINGYUE PENG · Xiaobing Liu
[ ExHall D ]
Abstract
This paper presents our latest advancements, *Goku*, a new family of joint image-and-video generation models based on rectified flow Transformers to achieve industry-grade performance. We present the foundational elements required for high-quality visual generation, including data curation, model design, flow formulation, etc. Key contributions include a meticulous data filtering pipeline that ensures high-quality, fine-grained image and video data curation; and the pioneering use of rectified flow for enhanced interaction among video and image tokens. Goku models achieve superior performance in both qualitative and quantitative assessments. Notably, Goku achieves top scores on major benchmarks: 0.76 on GenEval and 83.65 on DPG-Bench for text-to-image generation, alongside 82.7 on VBench for text-to-video tasks. We hope this report offers valuable insights into joint image-and-video generation models for the research community.
Poster
Weimin Qiu · Jieke Wang · Meng Tang
[ ExHall D ]
Abstract
Diffusion models have achieved unprecedented fidelity and diversity for synthesizing image, video, 3D assets, etc. However, subject mixing is a known and unresolved issue for diffusion-based image synthesis, particularly for synthesizing multiple similar-looking subjects. We propose Self-Cross diffusion guidance to penalize the overlap between cross-attention maps and aggregated self-attention maps. Compared to previous methods based on self-attention or cross-attention alone, our self-cross guidance is more effective in eliminating subject mixing. What's more, our guidance addresses mixing for all relevant patches of a subject beyond the most discriminant one, e.g., beak of a bird. We aggregate self-attention maps of automatically selected patches for a subject to form a region that the whole subject attends to. Our method is training-free and can boost the performance of any transformer-based diffusion model such as Stable Diffusion. We also release a more challenging benchmark with many text prompts of similar-looking subjects and utilize GPT-4o for automatic and reliable evaluation. Extensive qualitative and quantitative results demonstrate the effectiveness of our Self-Cross guidance.
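A minimal sketch of an overlap penalty between one subject's cross-attention map and another subject's aggregated self-attention map is given below, assuming the maps are already extracted, resized to a common grid, and nonnegative. The paper's patch selection, aggregation, and guidance update are not shown.

```python
# Minimal sketch (assumption: attention maps already extracted, resized to the
# same H x W grid, and nonnegative). The penalty measures how much one subject's
# cross-attention overlaps another subject's aggregated self-attention region;
# guidance would push this value down. Exact aggregation/update rules not shown.
import torch

def overlap_penalty(cross_attn_a, self_attn_region_b):
    """Both inputs: (H, W) nonnegative maps. Returns a scalar overlap score."""
    a = cross_attn_a / (cross_attn_a.sum() + 1e-8)
    b = self_attn_region_b / (self_attn_region_b.sum() + 1e-8)
    return torch.minimum(a, b).sum()      # IoU-like overlap of two soft masks

penalty = overlap_penalty(torch.rand(32, 32), torch.rand(32, 32))
```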
Poster
Chao Wang · Hehe Fan · Huichen Yang · Sarvnaz Karimi · Lina Yao · Yi Yang
[ ExHall D ]
Abstract
Diffusion-based Text-to-Image (T2I) models have demonstrated significant potential in image restoration. However, existing models continue to grapple with challenges such as complex training and prompt design. We introduce a new perspective for improving image restoration by injecting knowledge from pretrained vision-language models into current T2I models. We empirically show that the degradation and content representations in BLIP-2 can be linearly separated, providing promising degradation guidance for image restoration. Specifically, the Feature Difference Instruction (FDI) is first extracted by Q-Formers through a simple subtraction operation based on reference image pairs. Then, we propose a multi-scale FDI adapter to decouple the degradation style and corrupted artifacts, and inject the styleflow exclusively into specific blocks through adapter-tuning, thereby preventing noise interference and eschewing the need for cumbersome weight retraining. In this way, we can train various task-specific adapters according to different degradations, achieving rich detail enhancement in the restoration results. Furthermore, the proposed FDI adapters have attractive properties of practical value, such as composability and generalization ability for all-in-one and mixed-degradation restoration. Extensive experiments under various settings demonstrate that our method delivers promising restoration quality across 10 image restoration tasks and a wide range of other applications. Codes will be publicly available.
Poster
Sanghyeon Na · Yonggyu Kim · Hyunjoon Lee
[ ExHall D ]
Abstract
Human image generation is a key focus in image synthesis due to its broad applications. However, generating high-quality human images remains challenging because even slight inaccuracies in anatomy, pose, or fine details can compromise visual realism. To address these challenges, we explore Direct Preference Optimization (DPO), a method that trains models to generate images similar to preferred (winning) images while diverging from non-preferred (losing) ones. Conventional DPO approaches typically employ generated images as winning images, which may limit the model's ability to achieve high levels of realism. To overcome this limitation, we propose an enhanced DPO approach that incorporates high-quality real images as winning images, encouraging the model to produce outputs that resemble those real images rather than generated ones. Specifically, our approach, **HG-DPO** (**H**uman image **G**eneration through **DPO**), employs a novel curriculum learning framework that allows the model to gradually improve toward generating realistic human images, making the training more feasible than attempting the improvement all at once. Furthermore, we demonstrate that HG-DPO effectively adapts to personalized text-to-image tasks, generating high-quality, identity-specific images, which highlights the practical value of our approach.
Poster
Zhendong Wang · Jianmin Bao · Shuyang Gu · Dong Chen · Wengang Zhou · Houqiang Li
[ ExHall D ]
Abstract
In this paper, we present DesignDiffusion, a simple yet effective framework for the novel task of synthesizing design images from textual descriptions. A primary challenge lies in generating accurate and style-consistent textual and visual content. Existing works in a related task of visual text generation often focus on generating text within given specific regions, which limits the creativity of generation models, resulting in style or color inconsistencies between textual and visual elements if applied to design image generation. To address this issue, we propose an end-to-end, one-stage diffusion-based framework that avoids intricate components like position and layout modeling. Specifically, the proposed framework directly synthesizes textual and visual design elements from user prompts. It utilizes a distinctive character embedding derived from the visual text to enhance the input prompt, along with a character localization loss for enhanced supervision during text generation. Furthermore, we employ a self-play Direct Preference Optimization fine-tuning strategy to improve the quality and accuracy of the synthesized visual text. Extensive experiments demonstrate that DesignDiffusion achieves state-of-the-art performance in design image generation.
Poster
Senmao Li · Lei Wang · Kai Wang · Tao Liu · Jiehang Xie · Joost van de Weijer · Fahad Shahbaz Khan · Shiqi Yang · Yaxing Wang · Jian Yang
[ ExHall D ]
Abstract
Text-to-Image (T2I) diffusion models have made remarkable advancements in generative modeling; however, they face a trade-off between inference speed and image quality, posing challenges for efficient deployment. Existing distilled T2I models can generate high-fidelity images with fewer sampling steps, but often struggle with diversity and quality, especially in one-step models. From our analysis, we observe redundant computations in the UNet encoders. Our findings suggest that, for T2I diffusion models, decoders are more adept at capturing richer and more explicit semantic information, while encoders can be effectively shared across decoders from diverse time steps. Based on these observations, we introduce the first Time-independent Unified Encoder (TiUE) for the student model UNet architecture, which is a loop-free image generation approach for distilling T2I diffusion models. Using a one-pass scheme, TiUE shares encoder features across multiple decoder time steps, enabling parallel sampling and significantly reducing inference time complexity. In addition, we incorporate a KL divergence term to regularize noise prediction, which enhances the perceptual realism and diversity of the generated images. Experimental results demonstrate that TiUE outperforms state-of-the-art methods, including LCM, SD-Turbo, and SwiftBrushv2, producing more diverse and realistic results while maintaining the computational efficiency.
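A hypothetical sketch of the compute pattern, running a stand-in encoder once and reusing its features across several decoder time steps, is shown below. The modules and the toy time conditioning are assumptions; TiUE's actual architecture, feature sharing, and KL-regularized training are not shown.

```python
# Hypothetical sketch of reusing one encoder pass across several decoder time
# steps. `encoder` and `decoder` stand in for the two halves of a UNet and are
# assumptions; the toy time conditioning is not the paper's design.
import torch
import torch.nn as nn

encoder = nn.Conv2d(4, 64, 3, padding=1)          # stand-in encoder half
decoder = nn.Conv2d(64, 4, 3, padding=1)          # stand-in decoder half

def shared_encoder_sampling(x_t, timesteps):
    feats = encoder(x_t)                           # computed once ("time-independent")
    outputs = []
    for t in timesteps:
        scale = 1.0 / (1.0 + float(t))             # toy time conditioning
        outputs.append(decoder(feats * scale))     # decoder runs per time step
    return outputs                                  # these calls could run in parallel

outs = shared_encoder_sampling(torch.randn(1, 4, 32, 32), timesteps=[999, 749, 499, 249])
```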
Poster
Boming Miao · Chunxiao Li · Xiaoxiao Wang · Andi Zhang · Rui Sun · Zizhe Wang · Yao Zhu
[ ExHall D ]
Abstract
Diffusion models have achieved impressive success in generating photorealistic images, but challenges remain in ensuring precise semantic alignment with input prompts. Optimizing the initial noisy latent offers a more efficient alternative to modifying model architectures or prompt engineering for improving semantic alignment. A recent approach, InitNo, refines the initial noisy latent by leveraging attention maps; however, these maps capture only limited information, and the effectiveness of InitNo is highly dependent on the initial starting point, as it tends to converge on a local optimum near this point. To this end, this paper proposes leveraging the language comprehension capabilities of large vision-language models (LVLMs) to guide the optimization of the initial noisy latent, and introduces the Noise Diffusion process, which updates the noisy latent to generate semantically faithful images while preserving distribution consistency. Furthermore, we provide a theoretical analysis of the condition under which the update improves semantic faithfulness. Experimental results demonstrate the effectiveness and adaptability of our framework, consistently enhancing semantic alignment across various diffusion models.
Poster
Jian Jin · Zhenbo Yu · Yang Shen · Zhenyong Fu · Jian Yang
[ ExHall D ]
Abstract
Customized text-to-image generation renders user-specified concepts into novel contexts based on textual prompts. Scaling the number of concepts in customized generation meets a broader demand for user creation, whereas existing methods face challenges with generation quality and computational efficiency. In this paper, we propose LaTexBlend, a novel framework for effectively and efficiently scaling multi-concept customized generation. The core idea of LaTexBlend is to represent single concepts and blend multiple concepts within a Latent Textual space, which is positioned after the text encoder and a linear projection. LaTexBlend customizes each concept individually, storing them in a concept bank with a compact representation of latent textual features that captures sufficient concept information to ensure high fidelity. At inference, concepts from the bank can be freely and seamlessly combined in the latent textual space, offering two key merits for multi-concept generation: 1) excellent scalability, and 2) significant reduction of denoising deviation, preserving coherent layouts. Extensive experiments demonstrate that LaTexBlend can flexibly integrate multiple customized concepts with harmonious structures and high subject fidelity, substantially outperforming baselines in both generation quality and computational efficiency. Our code will be publicly available.
Poster
Soobin Um · Jong Chul Ye
[ ExHall D ]
Abstract
We investigate the generation of minority samples using pretrained text-to-image (T2I) latent diffusion models. Minority instances, in the context of T2I generation, can be defined as ones living on low-density regions of *text-conditional* data distributions. They are valuable for various applications of modern T2I generators, such as data augmentation and creative AI. Unfortunately, existing pretrained T2I diffusion models primarily focus on high-density regions, largely due to the influence of guided samplers (like CFG) that are essential for producing high-quality generations. To address this, we present a novel framework to counter the high-density-focus of T2I diffusion models. Specifically, we first develop an online prompt optimization framework that can encourage the emergence of desired properties during inference while preserving semantic contents of user-provided prompts. We subsequently tailor this generic prompt optimizer into a specialized solver that promotes the generation of minority features by incorporating a carefully-crafted likelihood objective. Our comprehensive experiments, conducted across various types of T2I models, demonstrate that our approach significantly enhances the capability to produce high-quality minority instances compared to existing samplers.
Poster
Kyungmin Jo · Jooyeol Yun · Jaegul Choo
[ ExHall D ]
Abstract
While large-scale text-to-image diffusion models enable the generation of high-quality, diverse images from text prompts, these prompts struggle to capture intricate details, such as textures, preventing the user intent from being reflected. This limitation has led to efforts to generate images conditioned on user-provided images, referred to as image prompts. Recent work modifies the self-attention mechanism to impose image conditions in generated images by replacing or concatenating the keys and values from the image prompt. This enables the self-attention layer to work like a cross-attention layer, generally used to incorporate text prompts. In this paper, we identify two common issues in existing methods of modifying self-attention that hinder diffusion models from reflecting the image prompt. By addressing these issues, we propose a novel method that generates images that properly reflect the details of image prompts. First, existing approaches often neglect the importance of image prompts in classifier-free guidance, which directs the model towards the intended conditions and away from undesired ones. Specifically, current methods use image prompts as both desired and undesired conditions, causing conflicting signals. To resolve this, we propose conflict-free guidance by using image prompts only as desired conditions, ensuring that the generated image faithfully reflects the image prompt. In …
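For context, the standard classifier-free guidance combination is recalled below; under a conflict-free reading, an image-prompt condition would enter only the desired branch $c$ and never the undesired branch $c_{\mathrm{neg}}$. This is the generic CFG formula, not the paper's specific scheme.

```latex
% Standard classifier-free guidance: the prediction is pushed toward the desired
% condition c and away from the undesired (e.g., null) condition c_neg, with
% guidance scale s. In a conflict-free setup, an image-prompt condition would
% appear only in c, never in c_neg (illustrative reading, not the paper's scheme).
\hat{\epsilon}_\theta(x_t, c) \;=\; \epsilon_\theta(x_t, c_{\mathrm{neg}})
\;+\; s\,\bigl(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, c_{\mathrm{neg}})\bigr)
```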
Poster
Zijing Hu · Fengda Zhang · Long Chen · Kun Kuang · Jiahui Li · Kaifeng Gao · Jun Xiao · Xin Wang · Wenwu Zhu
[ ExHall D ]
Abstract
Diffusion-based models have achieved remarkable success in text-to-image generation. However, their practical applications are hindered by the misalignment between generated images and corresponding text prompts. To tackle this issue, reinforcement learning (RL) has been considered for diffusion model fine-tuning. Yet, RL's effectiveness is limited by the challenge of sparse reward, where feedback is only available at the end of the generation process. This makes it difficult to identify which actions during the denoising process contribute positively to the final generated image, potentially leading to ineffective or unnecessary denoising policies. To this end, this paper presents a novel RL-based framework that addresses the sparse reward problem when training diffusion models. Our framework, named B2-DiffuRL, employs two strategies: **B**ackward progressive training and **B**ranch-based sampling. For one thing, backward progressive training focuses initially on the final timesteps of the denoising process and gradually extends the training interval to earlier timesteps, easing the learning difficulty associated with sparse rewards. For another, we perform branch-based sampling for each training interval. By comparing the samples within the same branch, we can identify how much the policies of the current training interval contribute to the final image, which helps to learn effective policies instead of unnecessary ones. …
Poster
Lingjie Kong · Kai WU · Chengming Xu · Xiaobin Hu · Wenhui Han · Jinlong Peng · Donghao Luo · Mengtian Li · Jiangning Zhang · Chengjie Wang · Yanwei Fu
[ ExHall D ]
Abstract
Recent advances in diffusion-based text-to-image models have simplified creating high-fidelity images, but preserving the identity (ID) of specific elements, like a personal dog, is still challenging. Object customization, using reference images and textual descriptions, is key to addressing this issue. Current object customization methods are either object-specific, requiring extensive fine-tuning, or object-agnostic, offering zero-shot customization but limited to specialized domains. The primary issue of promoting zero-shot object customization from specific domains to the general domain is to establish a large-scale general ID dataset for model pre-training, which is time-consuming and labor-intensive. In this paper, we propose a novel pipeline to construct a large dataset of general objects and build the Multi-Category ID-Consistent (MC-IDC) dataset, featuring 315k text-image samples across 10k categories. With the help of MC-IDC, we introduce Customizing Anything (CustAny), a zero-shot framework that maintains ID fidelity and supports flexible text editing for general objects. CustAny features three key components: a general ID extraction module, a dual-level ID injection module, and an ID-aware decoupling module, allowing it to customize any object from a single reference image and text prompt. Experiments demonstrate that CustAny outperforms existing methods in both general object customization and specialized domains like human customization and virtual try-on. …
Poster
Yuyang Peng · Shishi Xiao · Keming Wu · Qisheng Liao · Bohan CHEN · Kevin Lin · Danqing Huang · Ji Li · Yuhui Yuan
[ ExHall D ]
Abstract
Recently, state-of-the-art text-to-image generation models, such as Flux and Ideogram 2.0, have made significant progress in sentence-level visual text rendering. In this paper, we focus on the more challenging scenarios of article-level visual text rendering and address a novel task of generating high-quality business content, including infographics and slides, based on user-provided article-level descriptive prompts and ultra-dense layouts. The fundamental challenges are twofold: significantly longer context lengths and the scarcity of high-quality business content data. In contrast to most previous works that focus on a limited number of sub-regions and sentence-level prompts, ensuring precise adherence to ultra-dense layouts with tens or even hundreds of sub-regions in business content is far more challenging. We make two key technical contributions: (i) the construction of a scalable, high-quality business content dataset, i.e., Infographics-650K, equipped with ultra-dense layouts and prompts by implementing a layer-wise retrieval-augmented infographic generation scheme; and (ii) a layout-guided cross-attention scheme, which injects tens of region-wise prompts into a set of cropped region latent spaces according to the ultra-dense layouts, and refines each sub-region flexibly during inference using a layout-conditional CFG. We demonstrate the strong results of our system compared to previous SOTA systems such as Flux and SD3 on …
Poster
Taeyoung Yun · Dinghuai Zhang · Jinkyoo Park · Ling Pan
[ ExHall D ]
Abstract
Recent advances in text-to-image diffusion models have demonstrated impressive image generation capabilities. However, it remains challenging to control the generation process with desired properties (e.g., aesthetic quality, user intention), which can be expressed as black-box reward functions. In this paper, we focus on prompt adaptation, which refines the original prompt into model-preferred prompts to generate desired images. While prior work uses reinforcement learning (RL) to optimize prompts, we observe that applying RL often results in generating similar postfixes and deterministic behaviors. To this end, we introduce **P**rompt **A**daptation with **G**FlowNets (**PAG**), a novel approach that frames prompt adaptation as a probabilistic inference problem. Our key insight is that leveraging Generative Flow Networks (GFlowNets) allows us to shift from reward maximization to sampling from an unnormalized density function, enabling both high-quality and diverse prompt generation. However, we identify that a naive application of GFlowNets suffers from mode collapse and uncovers a previously overlooked phenomenon: the progressive loss of neural plasticity in the model, which is compounded …
Poster
Xiaomin Li · yixuan liu · Takashi Isobe · Xu Jia · Qinpeng Cui · Dong Zhou · Dong Li · You He · Huchuan Lu · Zhongdao Wang · Emad Barsoum
[ ExHall D ]
Abstract
In text-to-image (T2I) generation applications, negative embeddings have proven to be a simple yet effective approach for enhancing generation quality. Typically, these negative embeddings are derived from user-defined negative prompts, which, while being functional, are not necessarily optimal. In this paper, we introduce ReNeg, an end-to-end method designed to learn improved Negative embeddings guided by a Reward model. We employ a reward feedback learning framework and integrate classifier-free guidance (CFG) into the training process, which was previously utilized only during inference, thus enabling the effective learning of negative embeddings. We also propose two strategies for learning both global and per-sample negative embeddings. Extensive experiments show that the learned negative embedding significantly outperforms null-text and handcrafted counterparts, achieving substantial improvements in human preference alignment. Additionally, the negative embedding learned within the same text embedding space exhibits strong generalization capabilities. For example, using the same CLIP text encoder, the negative embedding learned on SD1.5 can be seamlessly transferred to text-to-image or even text-to-video models such as ControlNet, ZeroScope, and VideoCrafter2, resulting in consistent performance improvements across the board. Code and learned negative embeddings will be released.
Poster
Zehuan Huang · Yuanchen Guo · Xingqiao An · Yunhan Yang · Yangguang Li · Zi-Xin Zou · Ding Liang · Xihui Liu · Yan-Pei Cao · Lu Sheng
[ ExHall D ]
Abstract
This paper introduces MIDI, a novel paradigm for compositional 3D scene generation from a single image. Unlike existing methods that rely on reconstruction or retrieval techniques or recent approaches that employ multi-stage object-by-object generation, MIDI extends pre-trained image-to-3D object generation models to multi-instance diffusion models, enabling the simultaneous generation of multiple 3D instances with accurate spatial relationships and high generalizability. At its core, MIDI incorporates a novel multi-instance attention mechanism that effectively captures inter-object interactions and spatial coherence directly within the generation process, without the need for complex multi-step processes. The method utilizes partial object images and global scene context as inputs, directly modeling object completion during 3D generation. During training, we effectively supervise the interactions between 3D instances using a limited amount of scene-level data, while incorporating single-object data for regularization, thereby maintaining the pre-trained generalization ability. MIDI demonstrates state-of-the-art performance in image-to-scene generation, validated through evaluations on synthetic data, real-world scene data, and stylized scene images generated by text-to-image diffusion models.
Poster
Yuchao Gu · Yipin Zhou · Yunfan Ye · Yixin Nie · Licheng Yu · Pingchuan Ma · Kevin Qinghong Lin · Mike Zheng Shou
[ ExHall D ]
Abstract
Natural language often struggles to accurately associate positional and attribute information with multiple instances, which limits current text-based visual generation models to simpler compositions featuring only a few dominant instances. To address this limitation, this work enhances diffusion models by introducing regional instance control, where each instance is governed by a bounding box paired with a free-form caption. Previous methods in this area typically rely on implicit position encoding or explicit attention masks to separate regions of interest (ROIs), resulting in either inaccurate coordinate injection or large computational overhead. Inspired by ROI-Align in object detection, we introduce a complementary operation called ROI-Unpool. Together, ROI-Align and ROI-Unpool enable explicit, efficient, and accurate ROI manipulation on high-resolution feature maps for visual generation. Building on ROI-Unpool, we propose ROICtrl, an adapter for pretrained diffusion models that enables precise regional instance control. ROICtrl is compatible with community-finetuned diffusion models, as well as with existing spatial-based add-ons (e.g., ControlNet, T2I-Adapter) and embedding-based add-ons (e.g., IP-Adapter, ED-LoRA), extending their applications to multi-instance generation. Experiments show that ROICtrl achieves superior performance in regional instance control while significantly reducing computational costs.
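To make the ROI-Align/ROI-Unpool pairing concrete, here is a simplified stand-in: ROI-Align crops and resamples features inside a box, and the paste-back step writes the processed ROI features into the corresponding location on the full map. The bilinear-resize-and-paste operator below is an assumption, not the paper's exact ROI-Unpool.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import roi_align

def roi_unpool_like(full_map, roi_feat, box):
    """Write a processed ROI feature tile back into the full feature map.

    full_map: (1, C, H, W); roi_feat: (1, C, h, w); box: (x1, y1, x2, y2) in feature coords.
    A simplified stand-in for ROI-Unpool using bilinear resizing + in-place paste.
    """
    x1, y1, x2, y2 = [int(round(v)) for v in box]
    out = full_map.clone()
    tile = F.interpolate(roi_feat, size=(y2 - y1, x2 - x1), mode="bilinear", align_corners=False)
    out[:, :, y1:y2, x1:x2] = tile
    return out

# toy round trip: ROI-Align crop -> (pretend to process) -> paste back
feat = torch.randn(1, 16, 64, 64)
box = torch.tensor([[0., 10., 20., 42., 58.]])            # (batch_idx, x1, y1, x2, y2)
crop = roi_align(feat, box, output_size=(8, 8), spatial_scale=1.0, aligned=True)
feat2 = roi_unpool_like(feat, crop, box[0, 1:].tolist())
print(crop.shape, feat2.shape)   # torch.Size([1, 16, 8, 8]) torch.Size([1, 16, 64, 64])
```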
Poster
Hanzhe Hu · Tianwei Yin · Fujun Luan · Yiwei Hu · Hao Tan · Zexiang Xu · Sai Bi · Shubham Tulsiani · Kai Zhang
[ ExHall D ]
Abstract
We present Turbo3D, an ultra-fast text-to-3D system capable of generating high-quality Gaussian splatting assets in under one second. Turbo3D employs a rapid 4-step, 4-view diffusion generator, and an efficient feed-forward Gaussian reconstructor, both operating in latent space. The 4-step, 4-view generator is a student model distilled through a novel Dual-Teacher approach, which encourages the student to learn view consistency from a multi-view teacher and photo-realism from a single-view teacher. By shifting the Gaussian reconstructor's inputs from pixel space to latent space, we eliminate the extra image decoding time and halve the transformer sequence length for maximum efficiency. Our method demonstrates superior 3D generation results compared to previous baselines, while operating in a fraction of their runtime.
Poster
Zhipeng Huang · Shaobin Zhuang · Canmiao Fu · Binxin Yang · Ying Zhang · Chong Sun · Chen Li · Yali Wang · Zhizheng Zhang · Zheng-Jun Zha
[ ExHall D ]
Abstract
Existing multimodal generative models fall short as qualified design copilots, as they often struggle to generate imaginative outputs once instructions are less detailed, or lack the ability to maintain consistency with the provided references. In this work, we introduce ChatGen, a model that unifies multimodal generation and understanding, and promotes their interplay in iterative generation. It can generate diverse results with high creativity for less detailed instructions, and it can progressively refine prior generation results or integrate specific content from references by following the instructions in its chat with users. During this process, it is capable of preserving consistency in the parts that the user is already satisfied with. To this end, we curate a large-scale dataset, extracted from Internet videos, containing rich object dynamics with dynamics descriptions auto-labeled by advanced foundation models. These two types of information are interleaved into a single sequence to enable ChatGen to learn consistency-aware generation, where the specified dynamics are generated while the consistency of unspecified content is preserved in line with the instructions. Besides, we introduce a prompt self-rewriting mechanism to enhance generation diversity. Extensive experiments demonstrate the effectiveness of unifying multimodal understanding and generation in ChatGen and show it achieves state-of-the-art performance across various visual …
Poster
Ronghuan Wu · Wanchao Su · Jing Liao
[ ExHall D ]
Abstract
Scalable Vector Graphics (SVG) has become the de facto standard for vector graphics in digital design, offering resolution independence and precise control over individual elements. Despite their advantages, creating high-quality SVG content remains challenging, as it demands technical expertise with professional editing software and a considerable time investment to craft complex shapes. Recent text-to-SVG generation methods aim to make vector graphics creation more accessible, but they still encounter limitations in shape regularity, generalization ability, and expressiveness. To address these challenges, we introduce Chat2SVG, a hybrid framework that combines the strengths of Large Language Models (LLMs) and image diffusion models for text-to-SVG generation. Our approach first uses an LLM to generate semantically meaningful SVG templates from basic geometric primitives. Guided by image diffusion models, a dual-stage optimization pipeline refines paths in latent space and adjusts point coordinates to enhance geometric complexity. Extensive experiments show that Chat2SVG outperforms existing methods in visual fidelity, path regularity, and semantic alignment. Additionally, our system enables intuitive editing through natural language instructions, making professional vector graphics creation accessible to all users.
Poster
Sohan Patnaik · Rishabh Jain · Balaji Krishnamurthy · Mausoom Sarkar
[ ExHall D ]
Abstract
Visual layouts are essential in graphic design fields such as advertising, posters, and web interfaces. The application of generative models for content-aware layout generation has recently gained traction. However, these models fail to understand the contextual aesthetic requirements of layout design and do not align with human-like preferences, primarily treating it as a prediction task without considering the final rendered output. To overcome these problems, we offer Aesthetic-Aware Preference Alignment (AAPA), a novel technique to train a Multi-modal Large Language Model (MLLM) for layout prediction that uses the MLLM's aesthetic preferences for Direct Preference Optimization over graphic layouts. We propose a data filtering protocol utilizing our layout-quality heuristics for AAPA to ensure training happens on high-quality layouts. Additionally, we introduce a novel evaluation metric that uses another MLLM to compute the win rate of the generated layout against the ground-truth layout based on aesthetics criteria. We also demonstrate the applicability of AAPA for MLLMs of varying scales (1B to 8B parameters) and LLM families (Qwen, Phi, InternLM). By conducting thorough qualitative and quantitative analyses, we verify the efficacy of our approach on two challenging benchmarks - Crello and Webui, showcasing 17% and 16% improvements over current State-of-The-Art methods, thereby highlighting the …
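The sketch below shows the generic Direct Preference Optimization loss that AAPA-style training builds on, applied to log-likelihoods of serialized layouts; the aesthetic-aware pairing, filtering protocol, and MLLM specifics from the abstract are not reproduced here.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard Direct Preference Optimization loss (the general form such preference
    alignment builds on; the paper's aesthetic-aware details are not shown).

    Each argument is the summed token log-likelihood of a serialized layout under the
    policy or the frozen reference model, shape (batch,).
    """
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# toy usage: the "chosen" layout is the aesthetically preferred rendering
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(float(loss))
```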
Poster
Andreas Müller · Denis Lukovnikov · Jonas Thietke · Asja Fischer · Erwin Quiring
[ ExHall D ]
Abstract
Integrating watermarking into the generation process of latent diffusion models (LDMs) simplifies detection and attribution of generated content. Semantic watermarks, such as Tree-Rings and Gaussian Shading, represent a novel class of watermarking techniques that are easy to implement and highly robust against various perturbations. However, our work demonstrates a fundamental security vulnerability of semantic watermarks. We show that attackers can leverage unrelated models, even with different latent spaces and architectures (UNet vs DiT), to perform powerful and realistic forgery attacks. Specifically, we design two watermark forgery attacks. The first imprints a targeted watermark into real images by manipulating the latent representation of an arbitrary image in an unrelated LDM to get closer to the latent representation of a watermarked image. We also show that this technique can be used for watermark removal. The second attack generates new images with the target watermark by inverting a watermarked image and re-generating it with an arbitrary prompt. Both attacks just need a single reference image with the target watermark. Overall, our findings question the applicability of semantic watermarks by revealing that attackers can easily forge or remove these watermarks under realistic conditions.
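A hedged sketch of the first attack's core loop, under the assumption of a differentiable image-to-latent encoder: a clean image is perturbed within a small budget so that its latent moves toward the latent of a watermarked reference image. All names, budgets, and the stand-in encoder are illustrative, not the paper's attack code.

```python
import torch

def imprint_watermark(image, watermarked_latent, encoder, steps=200, lr=0.01, eps=8 / 255):
    """Illustrative forgery loop: nudge a clean image so that an (unrelated) LDM encoder
    maps it close to the latent of a watermarked reference image."""
    delta = torch.zeros_like(image, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        z = encoder((image + delta).clamp(0, 1))
        loss = torch.nn.functional.mse_loss(z, watermarked_latent)
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)       # keep the perturbation imperceptible
    return (image + delta).clamp(0, 1).detach()

# toy usage with a stand-in linear "encoder"
enc = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 64))
img = torch.rand(1, 3, 32, 32)
target_z = enc(torch.rand(1, 3, 32, 32)).detach()   # latent of a hypothetical watermarked image
forged = imprint_watermark(img, target_z, enc, steps=20)
print(forged.shape)
```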
Poster
Feng Zhou · Ruiyang Liu · chen liu · Gaofeng He · Yonglu Li · Xiaogang Jin · Huamin Wang
[ ExHall D ]
Abstract
Sewing patterns, the essential blueprints for fabric cutting and tailoring, act as a crucial bridge between design concepts and producible garments. However, existing uni-modal sewing pattern generation models struggle to effectively encode complex design concepts with a multi-modal nature and correlate them with vectorized sewing patterns that possess precise geometric structures and intricate sewing relations. In this work, we propose a novel sewing pattern generation approach, Design2GarmentCode, based on Large Multimodal Models (LMMs), to generate parametric pattern-making programs from multi-modal design concepts. LMMs offer an intuitive interface for interpreting diverse design inputs, while pattern-making programs serve as well-structured and semantically meaningful representations of sewing patterns, acting as a robust bridge connecting the cross-domain pattern-making knowledge embedded in LMMs with vectorized sewing patterns. Experimental results demonstrate that our method can flexibly handle various complex design expressions such as images, textual descriptions, designer sketches, or their combinations, and convert them into size-precise sewing patterns with correct stitches. Compared to previous methods, our approach significantly enhances training efficiency, generation quality, and authoring flexibility. Our code and data will be publicly available.
Poster
Xinghui Li · Qichao Sun · Pengze Zhang · Fulong Ye · Zhichao Liao · Wanquan Feng · Songtao Zhao · Qian HE
[ ExHall D ]
Abstract
Recent advances in garment-centric image generation from text and image prompts based on diffusion models are impressive. However, existing methods lack support for various combinations of attire and struggle to preserve garment details while maintaining faithfulness to the text prompts, limiting their performance across diverse scenarios. In this paper, we focus on a new task, i.e., Multi-Garment Virtual Dressing, and we propose a novel AnyDressing method for customizing characters conditioned on any combination of garments and any personalized text prompts. AnyDressing comprises two primary networks named GarmentsNet and DressingNet, which are respectively dedicated to extracting detailed clothing features and generating customized images. Specifically, we propose an efficient and scalable module called Garment-Specific Feature Extractor in GarmentsNet to individually encode garment textures in parallel. This design prevents garment confusion while ensuring network efficiency. Meanwhile, we design an adaptive Dressing-Attention mechanism and a novel Instance-Level Garment Localization Learning strategy in DressingNet to accurately inject multi-garment features into their corresponding regions. This approach efficiently integrates multi-garment texture cues into generated images and further enhances text-image consistency. Additionally, we introduce a Garment-Enhanced Texture Learning strategy to improve the fine-grained texture details of garments. Thanks to our well-crafted design, AnyDressing can serve as …
Poster
Junying Wang · Hongyuan Zhang · Yuan Yuan
[ ExHall D ]
Abstract
Recent personalized portrait generation methods, taking a facial image and a textual prompt as inputs, have attracted substantial attention. Although these methods generate high-fidelity portraits, they fail to prevent the generated portraits from being tracked and misused by malicious face recognition systems. To address this issue, this paper proposes a Customized Portrait Generation framework with facial Adversarial attacks (Adv-CPG). Specifically, to achieve facial privacy protection, we devise a lightweight local ID encryptor and an encryption enhancer. They implement progressive double-layer encryption protection by directly injecting the target identity and adding additional identity guidance, respectively. Furthermore, to accomplish fine-grained and customized portrait generation, we develop a multi-modal image customizer capable of generating controllable fine-grained facial features. To the best of our knowledge, Adv-CPG is the first study that introduces facial adversarial attacks into customized portrait generation. Extensive experiments demonstrate the superiority of Adv-CPG, e.g., the average attack success rate of the proposed Adv-CPG is 28.1% and 2.86% higher compared to the SOTA noise-based attack methods and unconstrained attack methods, respectively.
Poster
Fernando Julio Cendra · Kai Han
[ ExHall D ]
Abstract
The inherent ambiguity in the definition of visual concepts poses significant challenges for modern generative models, like the Text-to-Image (T2I) models based on diffusion models, in accurately learning concepts from the input images. Existing methods lack a systematic framework and interpretative mechanisms, hindering reliable extraction of the underlying intrinsic concepts. To address this challenge, we present ICE, short for Intrinsic Concept Extraction, a novel framework to automatically and systematically extract intrinsic concepts from a single image leveraging a T2I model. ICE consists of two pivotal stages. In the first stage, ICE devises an automatic concept localization module that pinpoints relevant text-based concepts and their corresponding masks within a given image. This critical phase not only streamlines concept initialization but also offers precise guidance for the subsequent analysis. The second stage delves deeper into each identified mask, decomposing concepts into intrinsic components, capturing specific visual characteristics and general components representing broader categories. This decomposition facilitates a more granular understanding by further dissecting concepts into detailed intrinsic attributes such as colour and material. Extensive experiments validate that ICE achieves superior performance on intrinsic concept extraction, enabling reliable and flexible application to downstream tasks like personalized image generation, image editing, and so on. …
Poster
Sangwon Jung · Alex Oesterling · Claudio Mayrink Verdun · Sajani Vithana · Taesup Moon · Flavio Calmon
[ ExHall D ]
Abstract
Text-to-image generative models can create vivid, realistic images from textual descriptions. As these models proliferate, they expose new concerns about their ability to represent diverse demographic groups, propagate stereotypes, and efface minority populations. Despite growing attention to the "safe" and "responsible" design of artificial intelligence (AI), there is no established methodology to systematically measure and control representational harms in large image generation models. This paper introduces a novel framework to measure the representation of intersectional groups in images generated by text-to-image generative models. We propose a novel application of the Multi-Group Proportional Representation (MPR) metric to rigorously evaluate representative harms in image generation and develop an algorithm to optimize generative models for this representational metric. MPR evaluates the worst-case deviation of representation statistics across given population groups in images produced by a generative model, allowing for flexible and context-specific measurements based on user requirements. Through experiments, we demonstrate that MPR can effectively measure representation statistics across multiple intersectional groups and, when used as a training objective, can guide models toward a more balanced generation across demographic groups while maintaining generation quality.
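As a rough illustration of the kind of quantity MPR measures, the snippet below computes the worst-case gap between the empirical representation of each intersectional group in a batch of generated images and its target proportion. The paper's formal MPR definition optimizes over a richer function class; this is only a simplified reading with assumed inputs.

```python
import numpy as np

def multi_group_proportional_representation(group_indicators, target_proportions):
    """Simplified MPR-style score: the worst-case absolute gap between the empirical
    representation of each (intersectional) group and its target proportion.

    group_indicators:   (n_images, n_groups) binary membership matrix
    target_proportions: (n_groups,) desired representation statistics
    """
    empirical = group_indicators.mean(axis=0)
    return np.max(np.abs(empirical - np.asarray(target_proportions)))

# toy usage: 3 intersectional groups, 8 generated images
G = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0],
              [0, 1, 0], [0, 0, 1], [1, 0, 0], [1, 0, 0]])
print(multi_group_proportional_representation(G, [1 / 3, 1 / 3, 1 / 3]))
```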
Poster
Logan Frank · Jim Davis
[ ExHall D ]
Abstract
Knowledge distillation (KD) has been a popular and effective method for model compression. One important assumption of KD is that the teacher's original dataset will also be available when training the student. However, in situations such as continual learning and distilling large models trained on company-withheld datasets, having access to the original data may not always be possible. This leads practitioners towards utilizing other sources of supplemental data, which could yield mixed results. One must then ask: "what makes a good dataset for transferring knowledge from teacher to student?" Many would assume that only real in-domain imagery is viable, but is that the only option? In this work, we explore multiple possible surrogate distillation datasets and demonstrate that many different datasets, even unnatural synthetic imagery, can serve as a suitable alternative in KD. From examining these alternative datasets, we identify and present various criteria describing what makes a good dataset for distillation. Source code will be available in the future.
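For context, the sketch below shows a standard knowledge-distillation step on a surrogate batch, which is all that is needed once the teacher's original data is unavailable; the criteria the paper derives for choosing good surrogate data are not encoded here, and the stand-in models are assumptions.

```python
import torch
import torch.nn.functional as F

def kd_step(student, teacher, surrogate_batch, temperature=4.0):
    """One knowledge-distillation step on a surrogate (possibly out-of-domain or
    synthetic) batch: the student matches the teacher's softened predictions, so no
    labels from the teacher's original dataset are needed.
    """
    with torch.no_grad():
        t_logits = teacher(surrogate_batch)
    s_logits = student(surrogate_batch)
    return F.kl_div(F.log_softmax(s_logits / temperature, dim=1),
                    F.softmax(t_logits / temperature, dim=1),
                    reduction="batchmean") * temperature ** 2

# toy usage with stand-in models and random "surrogate images"
teacher = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 10)).eval()
student = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 10))
loss = kd_step(student, teacher, torch.rand(16, 3, 8, 8))
loss.backward()
print(float(loss))
```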
Poster
Koushik Srivatsan · Fahad Shamshad · Muzammal Naseer · Vishal M. Patel · Karthik Nandakumar
[ ExHall D ]
Abstract
The rapid proliferation of large-scale text-to-image diffusion (T2ID) models has raised serious concerns about their potential misuse in generating harmful content. Although numerous methods have been proposed for erasing undesired concepts from T2ID models, they often provide a false sense of security, because concept-erased models (CEMs) can be easily deceived through adversarial attacks to generate the erased concept. Though some robust concept erasure methods based on adversarial training have emerged recently, they compromise on utility (generation quality for benign concepts) to achieve robustness and/or remain vulnerable to advanced embedding-space attacks. These limitations stem from the failure of robust CEMs to thoroughly search for “blind spots” in the embedding space. To bridge this gap, we propose STEREO, a novel two-stage framework that employs adversarial training as a first step rather than the only step for robust concept erasure. In the first stage, STEREO employs adversarial training as a vulnerability identification mechanism to search the embedding space for such blind spots. In the second stage, robustly erase once, STEREO introduces an anchor-concept-based compositional objective to robustly erase the target concept in one go while attempting to minimize the degradation of model utility. We benchmark STEREO against 7 state-of-the-art concept erasure methods, demonstrating its enhanced robustness against whitebox, …
Poster
Xinting Hu · Haoran Wang · Jan Lenssen · Bernt Schiele
[ ExHall D ]
Abstract
We introduce PersonaHOI, a training- and tuning-free framework that fuses a general StableDiffusion model with a personalized face diffusion model to generate identity-consistent human-object interaction (HOI) images. While personalized face diffusion (PFD) models have advanced significantly, they often overfit facial features and fail to produce coherent full-body interactions with objects. To address this issue, PersonaHOI introduces an additional StableDiffusion (SD) branch to follow HOI-driven text descriptions in image generation. By incorporating the proposed cross-attention constraints in the PFD branch, and spatial fusion strategies between the SD and PFD branches at both the latent and residual level, PersonaHOI successfully blends personalized facial details with interactive non-facial regions, ensuring identity preservation and interaction coherence. Experiments, validated by a novel interaction alignment metric, demonstrate the superior realism and scalability of PersonaHOI, establishing a new standard for practical personalized face generation with HOI.
Poster
Junxi Chen · Junhao Dong · Xiaohua Xie
[ ExHall D ]
Abstract
Recently, the Image Prompt Adapter (IP-Adapter) has been increasingly integrated into text-to-image diffusion models (T2I-DMs) to improve controllability. However, in this paper, we reveal that T2I-DMs equipped with the IP-Adapter (T2I-IP-DMs) enable a new jailbreak attack named the hijacking attack. We demonstrate that, by uploading imperceptible image-space adversarial examples (AEs), the adversary can hijack massive benign users to jailbreak an Image Generation Service (IGS) driven by T2I-IP-DMs and mislead the public to discredit the service provider. Worse still, the IP-Adapter's dependency on open-source image encoders reduces the knowledge required to craft AEs. Extensive experiments verify the technical feasibility of the hijacking attack. In light of the revealed threat, we investigate several existing defenses and explore combining the IP-Adapter with adversarially trained models to overcome existing defenses' limitations.
Poster
Won Jun Kim · Hyungjin Chung · Jaemin Kim · Sangmin Lee · Byeongsu Sim · Jong Chul Ye
[ ExHall D ]
Abstract
Gradient-based methods are a prototypical family of "explainability for AI" (XAI) techniques, especially for image-based models. Nonetheless, they have several shortcomings in that they (1) require white-box access to models, (2) are vulnerable to adversarial attacks, and (3) produce attributions that lie off the image manifold, leading to explanations that are not actually faithful to the model and do not align well with human perception. To overcome these challenges, we introduce "Derivative-Free Diffusion Manifold-Constrained Gradients (FreeMCG)", a novel method that serves as a better basis for explainability of a given neural network than the traditional gradient. Specifically, by leveraging ensemble Kalman filters and diffusion models, we derive a derivative-free approximation of the model’s gradient projected onto the data manifold, requiring access only to the model’s outputs (i.e., in a completely black-box setting). We demonstrate the effectiveness of FreeMCG by applying it to both counterfactual generation and feature attribution, which have traditionally been treated as distinct tasks. Through comprehensive evaluation on both tasks - counterfactual explanation and feature attribution - we show that our method yields state-of-the-art results while preserving the essential properties expected of XAI tools.
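A simplified, derivative-free gradient estimate in the spirit of ensemble methods is sketched below: random perturbations of the input are correlated with changes in the black-box output. FreeMCG additionally projects such an estimate onto the data manifold with diffusion models, which is omitted here; what follows is a generic zeroth-order approximation, not the paper's algorithm.

```python
import torch

def ensemble_gradient_estimate(f, x, n_particles=64, sigma=0.05):
    """Correlate Gaussian perturbations of x with the black-box outputs f(x + noise)
    to approximate the gradient without back-propagation (Stein-style estimator)."""
    x = x.detach()
    noise = torch.randn(n_particles, *x.shape) * sigma          # ensemble of perturbations
    ys = torch.stack([f(x + n) for n in noise])                 # black-box evaluations, (N,)
    ys = ys - ys.mean()                                         # variance reduction
    # covariance between perturbations and outputs approximates sigma^2 * grad f
    grad_est = (noise * ys.view(-1, *[1] * x.dim())).mean(dim=0) / sigma ** 2
    return grad_est

# toy check against a known gradient: f(x) = sum(x^2) has gradient 2x
x = torch.tensor([1.0, -2.0, 0.5])
g = ensemble_gradient_estimate(lambda v: (v ** 2).sum(), x, n_particles=4096, sigma=0.05)
print(g, 2 * x)   # the estimate should be close to [2, -4, 1]
```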
Poster
Hanhui Wang · Yihua Zhang · Ruizheng Bai · Yue Zhao · Sijia Liu · Zhengzhong Tu
[ ExHall D ]
Abstract
Recent advancements in diffusion models have made generative image editing more accessible than ever. While these developments allow users to generate creative edits with ease, they also raise significant ethical concerns, particularly regarding malicious edits to human portraits that threaten individuals' privacy and identity security. Existing general-purpose image protection methods primarily focus on generating adversarial perturbations to nullify edit effects. However, these approaches often fail to provide stable protection against diverse editing requests. In this work, we introduce a novel perspective on protecting personal human portraits against malicious editing. Unlike traditional methods aiming to prevent edits from taking effect, our method, FaceLock, optimizes adversarial perturbations to ensure that original biometric information---such as facial features---is either destroyed or substantially altered post-editing, rendering the subject in the edited output biometrically unrecognizable. Our approach innovatively integrates facial recognition and visual perception factors into the perturbation optimization process, ensuring robust protection against a variety of editing attempts. Besides, we shed light on several critical issues with commonly used evaluation metrics in image editing and reveal cheating methods by which they can be easily manipulated, leading to deceptive assessments of protection. Through extensive experiments, we demonstrate that FaceLock significantly outperforms all baselines in defense performance against …
Poster
Yuechen Xie · Jie Song · Huiqiong Wang · Mingli Song
[ ExHall D ]
Abstract
High-quality open-source text-to-image models have significantly lowered the threshold for obtaining photorealistic images, but also face potential risks of misuse. Specifically, suspects may use synthetic data generated by these generative models to train models for specific tasks without permission, especially when lacking real data resources. Protecting these generative models is crucial for the well-being of their owners. In this work, we propose the first method to address this important yet unresolved issue, called Training data Provenance Verification (TrainProVe). The rationale behind TrainProVe is grounded in the principle of the generalization error bound, which suggests that, for two models with the same task, if the distance between their training data distributions is smaller, their generalization ability will be closer. We validate the efficacy of TrainProVe across four text-to-image models (Stable Diffusion v1.4, latent consistency model, PixArt-α, and Stable Cascade). The results show that TrainProVe achieves a verification accuracy of over 99% in determining the provenance of suspicious model training data, surpassing all previous methods. Code will be publicly available soon.
Poster
Haifeng Zhang · Qinghui He · Xiuli Bi · Weisheng Li · Bo Liu · Bin Xiao
[ ExHall D ]
Abstract
The rapid advancement of generative models has significantly improved the quality of generated images. Meanwhile, it challenges information authenticity and credibility. Current generated-image detection methods based on large-scale pre-trained multimodal models have achieved impressive results. Although these models provide abundant features, the authentication task-related features are often submerged. Consequently, authentication task-irrelevant features cause models to learn superficial biases, thereby harming their generalization performance across different model genera (e.g., GANs and Diffusion Models). To this end, we propose VIB-Net, which uses Variational Information Bottlenecks to enforce authentication task-related feature learning. We tested and analyzed the proposed method and existing methods on samples generated by 17 different generative models. Compared to SOTA methods, VIB-Net achieved a 4.62% improvement in mAP and a 9.33% increase in accuracy. Notably, in generalization tests on unseen generative models from different series, VIB-Net improved mAP by 12.48% and accuracy by 23.59% over SOTA methods.
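A minimal variational-information-bottleneck head, as a generic illustration of the mechanism such a detector relies on: features are compressed into a stochastic code with a KL penalty so that task-irrelevant information is squeezed out. The backbone, dimensions, and trade-off weight below are assumptions, not VIB-Net's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VIBHead(nn.Module):
    """Minimal variational-information-bottleneck head: z ~ N(mu, sigma^2) with a KL
    penalty toward N(0, I), followed by a real/generated classifier."""
    def __init__(self, feat_dim=512, z_dim=64, n_classes=2):
        super().__init__()
        self.to_stats = nn.Linear(feat_dim, 2 * z_dim)
        self.classifier = nn.Linear(z_dim, n_classes)

    def forward(self, feats, labels, beta=1e-3):
        mu, log_var = self.to_stats(feats).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()    # reparameterization trick
        kl = 0.5 * (mu ** 2 + log_var.exp() - log_var - 1).sum(dim=-1).mean()
        ce = F.cross_entropy(self.classifier(z), labels)
        return ce + beta * kl

# toy usage: 8 pre-extracted multimodal features, binary real/generated labels
head = VIBHead()
loss = head(torch.randn(8, 512), torch.randint(0, 2, (8,)))
loss.backward()
print(float(loss))
```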
Poster
Qi Bi · Jingjun Yi · Huimin Huang · Hao Zheng · Haolan Zhan · Yawen Huang · Yuexiang Li · Xian Wu · Yefeng Zheng
[ ExHall D ]
Abstract
Night-time scene segmentation is a critical yet challenging task in real-world applications, primarily due to the complicated lighting conditions. However, existing methods lack sufficient generalization ability for unseen night-time scenes with varying illumination. In light of this issue, we focus on investigating generalizable paradigms for night-time scene segmentation and propose an efficient fine-tuning scheme, dubbed NightAdapter, alleviating the domain gap across various scenes. Interestingly, the different properties embedded in day-time and night-time features can be characterized by the bands after discrete sine transformation, which can be categorized into illumination-sensitive and illumination-insensitive bands. Hence, our NightAdapter is powered by two appealing designs: (1) Illumination-Insensitive Band Adaptation, which provides a foundation for understanding the prior and enhances robustness to illumination shifts; (2) Illumination-Sensitive Band Adaptation, which fine-tunes the randomized frequency bands, mitigating the domain gap between day-time and various night-time scenes. As a consequence, illumination-insensitive enhancement improves domain invariance, while illumination-sensitive diminution reduces the domain shift between different scenes. NightAdapter yields significant improvements over the state-of-the-art methods under various day-to-night, night-to-night, and in-domain night segmentation experiments. We will release our code.
Poster
Yongqi Yang · Zhihao Qian · Ye Zhu · Olga Russakovsky · Yu Wu
[ ExHall D ]
Abstract
The boom of Generative AI brings opportunities entangled with risks and concerns. Existing literature emphasizes the generalization capability of deepfake detection on unseen generators, significantly promoting the detector's ability to identify more universal artifacts. In this work, we seek a step toward a universal deepfake detection system with better generalization and robustness. We do so by first scaling up the existing detection task setup from one generator to multiple generators in training, during which we disclose two challenges presented in prior methodological designs and demonstrate the divergence of detectors' performance. Specifically, we reveal that current methods tailored for training on one specific generator either struggle to learn comprehensive artifacts from multiple generators or sacrifice their fitting ability for seen generators (i.e., _In-Domain_ (ID) performance) in exchange for generalization to unseen generators (i.e., _Out-Of-Domain_ (OOD) performance). Moreover, detectors with similar performance diverge as the number of generators is scaled up. To tackle the above challenges, we propose our **D**iscrepancy **D**eepfake **D**etector (**D3**) framework, whose core idea is to deconstruct the universal artifacts from multiple generators by introducing a parallel network branch that takes a distorted image feature as an extra discrepancy signal to supplement its original counterpart. Extensive scaled-up experiments demonstrate the effectiveness of **D3**, achieving 5.3% accuracy …
Poster
Feng Yan · Xiaoheng Jiang · Yang Lu · Jiale Cao · Dong Chen · Mingliang Xu
[ ExHall D ]
Abstract
As an important part of intelligent manufacturing, pixel-level surface defect detection (SDD) aims to locate defect areas through mask prediction. Previous methods adopt the image-independent static convolution to indiscriminately classify per-pixel features for mask prediction, which leads to suboptimal results for some challenging scenes such as weak defects and cluttered backgrounds. In this paper, inspired by query-based methods, we propose a Wavelet and Prototype Augmented Query-based Transformer (WPFormer) for surface defect detection. Specifically, a set of dynamic queries for mask prediction is updated through the dual-domain transformer decoder. Firstly, a Wavelet-enhanced Cross-Attention (WCA) is proposed, which aggregates meaningful high- and low-frequency information of image features in the wavelet domain to refine queries. WCA enhances the representation of high-frequency components by capturing relationships between different frequency components, enabling queries to focus more on defect details. Secondly, a Prototype-guided Cross-Attention (PCA) is proposed to refine queries through meta-prototypes in the spatial domain. The prototypes aggregate semantically meaningful tokens from image features, facilitating queries to aggregate crucial defect information under the cluttered backgrounds. Extensive experiments on three defect detection datasets (i.e., ESDIs-SOD, CrackSeg9k, and ZJU-Leaper) demonstrate that the proposed method achieves state-of-the-art performance in defect detection.
Poster
Zhanqiang Guo · Jiamin Wu · Yonghao Song · Jiahui Bu · Weijian Mai · Qihao Zheng · Wanli Ouyang · Chunfeng Song
[ ExHall D ]
Abstract
Human perception of the visual world is shaped by the stereo processing of 3D information. Understanding how the brain perceives and processes 3D visual stimuli in the real world has been a longstanding endeavor in neuroscience. Towards this goal, we introduce a new neuroscience task: decoding 3D visual perception from EEG signals, a neuroimaging technique that enables real-time monitoring of neural dynamics enriched with complex visual cues. To provide the essential benchmark, we first present EEG-3D, a pioneering dataset featuring multimodal analysis data and extensive EEG recordings from 12 subjects viewing 72 categories of 3D objects rendered in both videos and images. Furthermore, we propose Neuro-3D, a 3D visual decoding framework based on EEG signals. This framework adaptively integrates EEG features derived from static and dynamic stimuli to learn complementary and robust neural representations, which are subsequently utilized to recover both the shape and color of 3D objects through the proposed diffusion-based colored point cloud decoder. To the best of our knowledge, we are the first to explore EEG-based 3D visual decoding. Experiments indicate that Neuro-3D not only reconstructs colored 3D objects with high fidelity, but also learns effective neural representations that enable insightful brain region analysis. The dataset and …
Poster
Sahar Dastani · Ali Bahri · Moslem Yazdanpanah · Mehrdad Noori · David OSOWIECHI · Gustavo Vargas Hakim · Farzad Beizaee · Milad Cheraghalikhani · Arnab Mondal · Herve Lombaert · Christian Desrosiers
[ ExHall D ]
Abstract
State Space Models (SSMs) have recently emerged as an alternative to Vision Transformers (ViTs) due to their unique ability to model global relationships with linear complexity. SSMs are specifically designed to capture spatially proximate relationships of image patches. However, they fail to identify relationships between conceptually related yet non-adjacent patches. This limitation arises from the non-causal nature of image data, which lacks inherent directional relationships. Additionally, current vision-based SSMs are highly sensitive to transformations such as rotation. Their predefined scanning directions depend on the original image orientation, which can cause the model to produce inconsistent patch-processing sequences after rotation. To address these limitations, we introduce Spectral VMamba, a novel approach that effectively captures the global structure within an image by leveraging spectral information derived from the graph Laplacian of image patches. Through spectral decomposition, our approach encodes patch relationships independently of image orientation, achieving rotation invariance with the aid of our Rotational Feature Normalizer (RFN) module. Our experiments on classification tasks show that Spectral VMamba outperforms leading SSM models in vision, such as VMamba, while maintaining invariance to rotations and providing similar runtime efficiency.
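To illustrate why a spectral view can be orientation-independent, the sketch below orders patches by the Fiedler vector of a similarity-graph Laplacian built purely from patch features; rotating the image permutes the patches but leaves pairwise feature similarities, and hence the recovered order, essentially unchanged (up to sign). This is a toy illustration, not the Spectral VMamba architecture.

```python
import torch

def spectral_patch_order(patch_feats, sigma=1.0):
    """Order image patches by the Fiedler vector of a similarity-graph Laplacian."""
    # pairwise feature distances -> symmetric Gaussian adjacency
    d2 = torch.cdist(patch_feats, patch_feats) ** 2
    adj = torch.exp(-d2 / (2 * sigma ** 2))
    deg = torch.diag(adj.sum(dim=1))
    laplacian = deg - adj
    eigvals, eigvecs = torch.linalg.eigh(laplacian)
    fiedler = eigvecs[:, 1]                      # second-smallest eigenvector
    return torch.argsort(fiedler)                # a patch scan order for the SSM

# toy usage: 16 patches with 32-dim features
order = spectral_patch_order(torch.randn(16, 32))
print(order)
```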
Poster
Yihua Cheng · Hengfei Wang · Zhongqun Zhang · Yang Yue · Boeun Kim · Feng Lu · Hyung Jin Chang
[ ExHall D ]
Abstract
3D and 2D gaze estimation share the fundamental objective of capturing eye movements but are traditionally treated as two distinct research domains. In this paper, we introduce a novel cross-task few-shot 2D gaze estimation approach, aiming to adapt a pre-trained 3D gaze estimation network for 2D gaze prediction on unseen devices using only a few training images. This task is highly challenging due to the domain gap between 3D and 2D gaze, unknown screen poses, and limited training data. To address these challenges, we propose a novel framework that bridges the gap between 3D and 2D gaze. Our framework contains a physics-based differentiable projection module with learnable parameters to model screen poses and project 3D gaze into 2D gaze. The framework is fully differentiable and can integrate into existing 3D gaze networks without modifying their original architecture. Additionally, we introduce a dynamic pseudo-labelling strategy for flipped images, which is particularly challenging for 2D labels due to unknown screen poses. To overcome this, we reverse the projection process by converting 2D labels to 3D space, where flipping is performed. Notably, this 3D space is not aligned with the camera coordinate system, so we learn a dynamic transformation matrix to compensate for …
Poster
Toby Perrett · Ahmad Darkhalil · Saptarshi Sinha · Omar Emara · Sam Pollard · Kranti Kumar Parida · Kaiting Liu · Prajwal Gatti · Siddhant Bansal · Kevin Flanagan · Jacob Chalk · Zhifan Zhu · Rhodri Guerrier · Fahd Abdelazim · Bin Zhu · Davide Moltisanti · Michael Wray · Hazel Doughty · Dima Damen
[ ExHall D ]
Abstract
We present a validation dataset of newly-collected kitchen based egocentric videos, manually annotated with highly detailed and interconnected ground-truth labels covering: recipe steps, fine-grained actions, ingredients with nutritional values, moving objects, and audio annotations. Importantly, all annotations are grounded in 3D through digital twinning of the scene, fixtures, object locations, and primed with gaze. Footage is collected from unscripted recordings in diverse home environments, making HD-EPIC the first dataset collected in-the-wild but with detailed annotations matching those in controlled lab environments. We show the potential of our highly-detailed annotations through a challenging VQA benchmark of 26K questions assessing capability to recognise recipes, ingredients, nutrition, fine-grained actions, 3D perception, object motion, and gaze direction. The powerful long-context Gemini Pro only achieves 37.0% on this benchmark, showcasing its difficulty and highlighting shortcomings in current VLMs. We additionally assess action recognition, sound recognition, and long-term video-object segmentation on HD-EPIC. HD-EPIC is 41 hours of video in 9 kitchens with digital twins of 404 kitchen fixtures, capturing 69 recipes, 59K fine-grained actions, 51K audio events, 20K object movements and 37K object masks lifted to 3D. On average, we have 263 annotations per min of our unscripted videos.
Poster
Fan Qi · KunSheng Ma · Changsheng Xu
[ ExHall D ]
Abstract
Recent advancements in latent diffusion models (LDMs) have led to innovative approaches in music generation, allowing for increased flexibility and integration with other modalities. However, existing methods often rely on a two-step process that fails to capture the artistic essence of videos, particularly in the context of complex videos requiring detailed sound effects and diverse instrumentation. In this paper, we propose a novel framework for generating video soundtracks that simultaneously produces music and sound effects tailored to the video content. Our method incorporates a Contrastive Visual-Sound-Music pretraining process that maps these modalities into a unified feature space, enhancing the model's ability to capture intricate audio dynamics. We design Spectrum Divergence Masked Attention for the UNet to differentiate between the unique characteristics of sound effects and music. We utilize Score-guided Noise Iterative Optimization to provide musicians with customizable control during the generation process. Extensive evaluations on the FilmScoreDB and SymMV&HIMV datasets demonstrate that our approach significantly outperforms state-of-the-art baselines in both subjective and objective assessments, highlighting its potential as a robust tool for video soundtrack generation.
Poster
Chao Huang · Ruohan Gao · J. M. F. Tsang · Jan Kurcius · Cagdas Bilen · Chenliang Xu · Anurag Kumar · Sanjeel Parekh
[ ExHall D ]
Abstract
Recent years have seen a significant increase in video content creation and consumption. Crafting engaging content requires the careful curation of both visual and audio elements. While visual cue curation, through techniques like optimal viewpoint selection or post-editing, has been central to media production, its natural counterpart, audio, has not undergone equivalent advancements. This often results in a disconnect between visual and acoustic saliency. To bridge this gap, we introduce a novel task: visually-guided acoustic highlighting, which aims to transform audio to deliver appropriate highlighting effects guided by the accompanying video, ultimately creating a more harmonious audio-visual experience. We propose a flexible, transformer-based multimodal framework to solve this task. To train our model, we also introduce a new dataset--the muddy mix dataset, leveraging the meticulous audio and video crafting found in movies, which provides a form of free supervision. We develop a pseudo-data generation process to simulate poorly mixed audio, mimicking real-world scenarios through a three-step process---separation, adjustment, and remixing. Our approach consistently outperforms several baselines in both quantitative and subjective evaluation. We also systematically study the impact of different types of contextual guidance and difficulty levels of the dataset. Readers are encouraged to see video results in supplements.
Poster
Anna Min · Ziyang Chen · Hang Zhao · Andrew Owens
[ ExHall D ]
Abstract
We present a method for learning binaural sound localization from ego-motion in videos. When the camera moves in a video, the direction of sound sources will change along with it. We train an audio model to predict sound directions that are consistent with visual estimates of camera motion, which we obtain using methods from multi-view geometry. This provides a weak but plentiful form of supervision that we combine with traditional binaural cues. To evaluate this idea, we propose a dataset of real-world audio-visual videos with ego-motion. We show that our model can successfully learn from this real-world data, and that it obtains strong performance on sound localization tasks.
Poster
Abduljalil Radman · Jorma Laaksonen
[ ExHall D ]
Abstract
Referring audio-visual segmentation (Ref-AVS) aims to segment objects within audio-visual scenes using multimodal cues embedded in text expressions. While the Segment Anything Model (SAM) has revolutionized visual segmentation, its applicability to Ref-AVS, where multimodal cues act as novel prompts, remains unexplored. SAM’s limitation to single-frame segmentation also hinders its ability to capture essential temporal context needed for multi-frame audio-visual segmentation. To address this gap, we propose TSAM, a novel extension of SAM designed to leverage multimodal cues for precise segmentation in dynamic audio-visual scenes. TSAM enhances SAM’s image encoder with a temporal modeling branch, enabling spatio-temporal learning and deep multimodal fusion across video frames, while retaining SAM’s pre-trained knowledge. Additionally, TSAM replaces SAM’s user-interactive prompting mechanism with sparse and dense data-driven prompts, enabling more effective integration of audio-visual inputs and reference text expressions. Extensive experiments on the Ref-AVS dataset demonstrate the superiority of our proposed TSAM over state-of-the-art methods, underscoring its effectiveness in accurately segmenting objects in audio-visual scenes guided by text-based multimodal cues and its strong generalization to unseen objects.
Poster
Liang Liu · Shuaiyong Li · Yongqiang Zhu
[ ExHall D ]
Abstract
Audio-visual event localization (AVEL) involves identifying the category and the corresponding temporal boundary of an event that is both audible and visible in unconstrained videos. However, the semantic gap between heterogeneous modalities often leads to audio-visual semantic inconsistency. In this paper, we propose a novel Audio-Visual Semantic Graph Network (AVSGN) to facilitate cross-modal alignment and cross-temporal interaction. Unlike previous methods (e.g., audio-guided, visual-guided, or both), we introduce shared semantic textual labels to bridge the semantic gap between audio and visual modalities. Specifically, we present a cross-modal semantic alignment (CMSA) module to explore the cross-modal complementary relationships across heterogeneous modalities (i.e., visual, audio and text), promoting the convergence of multimodal distributions into a common semantic space. Additionally, in order to capture cross-temporal associations sufficiently, we devise a cross-modal graph interaction (CMGI) module, which disentangles complicated interactions across modalities into three complementary subgraphs. Extensive experiments on the AVE dataset comprehensively demonstrate the superiority and effectiveness of the proposed model in both fully- and weakly-supervised AVE settings.
Poster
Huangbiao Xu · Xiao Ke · Huanqi Wu · Rui Xu · Yuezhou Li · Wenzhong Guo
[ ExHall D ]
Abstract
Long-term sports assessment is a challenging task in video understanding since it requires judging complex movement variations as well as action-music coordination. However, there is no direct correlation between the diverse background music and movements in sporting events. Previous works require larger model parameters to learn potential associations between actions and music. To address this issue, we propose a language-guided audio-visual learning (MLAVL) framework that models audio-action-visual correlations guided by low-cost language modality. In our framework, multidimensional domain-based actions form action knowledge graphs, motivating audio-visual modalities to focus on task-relevant actions. We further design a shared-specific context encoder to integrate deep multimodal semantics, and an audio-visual cross-modal fusion module to evaluate action-music consistency. To match the sport's rules, we then propose a dual-branch prompt-guided grading module to weigh both visual and audio-visual performance. Extensive experiments demonstrate that our approach achieves state-of-the-art on four public long-term sports benchmarks while maintaining low parameters. Our code will be available.
Poster
Shuai Tan · Biao Gong · Yutong Feng · Kecheng Zheng · DanDan Zheng · Shuwei Shi · Yujun Shen · Jingdong Chen · Ming Yang
[ ExHall D ]
Abstract
Text serves as the key control signal in video generation due to its narrative nature. To render text descriptions into video clips, current video diffusion models borrow features from text encoders yet struggle with limited text comprehension. The recent success of large language models (LLMs) showcases the power of decoder-only transformers, which offers three clear benefits for text-to-video (T2V) generation, namely, precise text understanding resulting from the superior scalability, imagination beyond the input text enabled by next token prediction, and flexibility to prioritize user interests through instruction tuning. Nevertheless, the feature distribution gap emerging from the two different text modeling paradigms hinders the direct use of LLMs in established T2V models. This work addresses this challenge with Mimir, an end-to-end training framework featuring a carefully tailored token fuser to harmonize the outputs from text encoders and LLMs. Such a design allows the T2V model to fully leverage learned video priors while capitalizing on the text-related capability of LLMs. Extensive quantitative and qualitative results demonstrate the effectiveness of our approach in generating high-quality videos with excellent text comprehension, especially when processing short captions and managing shifting motions. The code and models will be made publicly available.
Poster
Ziyi Wu · Aliaksandr Siarohin · Willi Menapace · Ivan Skorokhodov · Yuwei Fang · Varnith Chordia · Igor Gilitschenski · Sergey Tulyakov
[ ExHall D ]
Abstract
Real-world videos consist of sequences of events. Generating such sequences with precise temporal control is infeasible with existing video generators that rely on a single paragraph of text as input. When tasked with generating multiple events described using a single prompt, such methods often ignore some of the events or fail to arrange them in the correct order. To address this limitation, we present MinT, a multi-event video generator with temporal control. Our key insight is to bind each event to a specific period in the generated video, which allows the model to focus on one event at a time. To enable time-aware interactions between event captions and video tokens, we design a time-based positional encoding method, dubbed ReRoPE. This encoding helps to guide the cross-attention operation. By fine-tuning a pre-trained video diffusion transformer on temporally grounded data, our approach produces coherent videos with smoothly connected events. For the first time in the literature, our model offers control over the timing of events in generated videos. Extensive experiments demonstrate that MinT outperforms existing open-source models by a large margin. Additional results and details are available on our website in the supplementary material.
Poster
Kun Liu · Qi Liu · Xinchen Liu · Jie Li · Yongdong Zhang · Jiebo Luo · Xiaodong He · Wu Liu
[ ExHall D ]
Abstract
Text-to-video (T2V) generation has made tremendous progress in generating complicated scenes based on texts. However, human-object interaction (HOI) often cannot be precisely generated by current T2V models due to the lack of large-scale videos with accurate captions for HOI. To address this issue, we introduce HOIGen-1M, the first large-scale dataset for HOI Generation, consisting of over one million high-quality videos collected from diverse sources. In particular, to guarantee the high quality of videos, we first design an efficient framework to automatically curate HOI videos using the powerful multimodal large language models (MLLMs), and then the videos are further cleaned by human annotators. Moreover, to obtain accurate textual captions for HOI videos, we design a novel video description method based on a Mixture-of-Multimodal-Experts (MoME) strategy that not only generates expressive captions but also eliminates the hallucination by individual MLLM. Furthermore, due to the lack of an evaluation framework for generated HOI videos, we propose two new metrics to assess the quality of generated videos in a coarse-to-fine manner. Extensive experiments reveal that current T2V models struggle to generate high-quality HOI videos and confirm that our HOIGen-1M dataset is instrumental for improving HOI video generation.
Poster
Duowang Zhu · Xiaohu Huang · Haiyan Huang · Hao Zhou · Zhenfeng Shao
[ ExHall D ]
Abstract
In this paper, we present Change3D, a framework that reconceptualizes the change detection and captioning tasks through video modeling. Recent methods have achieved remarkable success by regarding each pair of bi-temporal images as separate frames. They employ a shared-weight image encoder to extract spatial features and then use a change extractor to capture differences between the two images. However, image feature encoding, being a task-agnostic process, cannot attend to changed regions effectively. Furthermore, different change extractors designed for various change detection and captioning tasks make it difficult to have a unified framework. To tackle these challenges, Change3D regards the bi-temporal images as comprising two frames akin to a tiny video. By integrating learnable perception frames between the bi-temporal images, a video encoder enables the perception frames to interact with the images directly and perceive their differences. Therefore, we can get rid of the intricate change extractors, providing a unified framework for different change detection and captioning tasks. We verify Change3D on multiple tasks, encompassing change detection (including binary change detection, semantic change detection, and building damage assessment) and change captioning, across eight standard benchmarks. Without bells and whistles, this simple yet effective framework can achieve superior performance with an ultra-light …
Poster
Darryl Ho · Samuel Madden
[ ExHall D ]
Abstract
In recent years, large transformer-based video encoder models have greatly advanced state-of-the-art performance on video classification tasks. However, these large models typically process videos by averaging embedding outputs from multiple clips over time to produce fixed-length representations. This approach fails to account for a variety of time-related features, such as variable video durations, chronological order of events, and temporal variance in feature significance. While methods for temporal modeling do exist, they often require significant architectural changes and expensive retraining, making them impractical for off-the-shelf, fine-tuned large encoders. To overcome these limitations, we propose DejaVid, an encoder-agnostic method that enhances model performance without the need for retraining or altering the architecture. Our framework converts a video into a variable-length temporal sequence of embeddings, which we call a multivariate time series (MTS). An MTS naturally preserves temporal order and accommodates variable video durations. We then learn per-timestep, per-feature weights over the encoded MTS frames, allowing us to account for variations in feature importance over time. We introduce a new neural network architecture inspired by traditional time series alignment algorithms for this learning task. Our evaluation demonstrates that DejaVid substantially improves the performance of a state-of-the-art large encoder, achieving leading Top-1 accuracy of …
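A toy version of the central idea, under assumed shapes: clip embeddings from a frozen encoder are kept as a variable-length multivariate time series, and learnable per-timestep, per-feature weights replace plain averaging. DejaVid's alignment-inspired network is considerably richer than this masked weighted pooling.

```python
import torch
import torch.nn as nn

class WeightedMTSPool(nn.Module):
    """Toy stand-in: learn per-timestep, per-feature weights over a padded multivariate
    time series of clip embeddings instead of uniform averaging."""
    def __init__(self, max_steps=32, feat_dim=768, n_classes=400):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(max_steps, feat_dim))
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, mts, lengths):
        # mts: (B, T, D) padded clip embeddings; lengths: (B,) true number of clips
        T = mts.shape[1]
        mask = (torch.arange(T)[None, :] < lengths[:, None]).float().unsqueeze(-1)  # (B, T, 1)
        w = self.weights[:T].unsqueeze(0) * mask
        pooled = (mts * w).sum(dim=1) / w.sum(dim=1).clamp_min(1e-6)
        return self.classifier(pooled)

# toy usage: two videos with 5 and 9 clips of frozen-encoder embeddings
model = WeightedMTSPool()
logits = model(torch.randn(2, 9, 768), torch.tensor([5, 9]))
print(logits.shape)  # torch.Size([2, 400])
```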
Poster
Yang Liu · Qianqian Xu · Peisong Wen · Siran Dai · Qingming Huang
[ ExHall D ]
Abstract
The past decade has witnessed notable achievements in self-supervised learning for video tasks. Recent efforts typically adopt the Masked Video Modeling (MVM) paradigm, leading to significant progress on multiple video tasks. However, two critical challenges remain: 1) Without human annotations, the random temporal sampling introduces uncertainty, increasing the difficulty of model training. 2) Previous MVM methods primarily recover the masked patches in the pixel space, leading to insufficient information compression for downstream tasks. To address these challenges jointly, we propose a self-supervised framework that leverages Temporal Correspondence for video Representation learning (T-CoRe). For challenge 1), we propose a sandwich sampling strategy that selects two auxiliary frames to reduce reconstruction uncertainty in a two-side-squeezing manner. Addressing challenge 2), we introduce an auxiliary branch into a self-distillation architecture to restore representations in the latent space, generating high-level semantic representations enriched with temporal information. Experiments show that T-CoRe consistently achieves superior performance across several downstream tasks, demonstrating its effectiveness for video representation learning. The code is available in the Supplementary Material.
Poster
Rui Qian · Shuangrui Ding · Xiaoyi Dong · Pan Zhang · Yuhang Zang · Yuhang Cao · Dahua Lin · Jiaqi Wang
[ ExHall D ]
Abstract
Active real-time interaction with video LLMs introduces a new paradigm for human-computer interaction, where the model not only understands user intent but also responds while continuously processing streaming video on the fly. Unlike offline video LLMs, which analyze the entire video before answering questions, active real-time interaction requires three capabilities: 1) Perception: real-time video monitoring and interaction capturing. 2) Decision: raising proactive interaction in proper situations. 3) Reaction: continuous interaction with users. However, inherent conflicts exist among the desired capabilities. Decision and Reaction require contrary Perception scales and granularity, and autoregressive decoding blocks real-time Perception and Decision during the Reaction. To unify the conflicting capabilities within a harmonious system, we present Dispider, a solution built on a Disentangled Perception, Decision, and Reaction framework. Dispider features a lightweight Proactive Streaming Video Processing module that tracks the video stream and identifies optimal moments for interaction. Once the interaction is triggered, an asynchronous Precise Interaction module provides detailed responses, while the processing module continues to monitor the video in the meantime. Our disentangled and asynchronous design ensures timely, contextually accurate, and computationally efficient responses, making Dispider ideal for active real-time interaction for long-duration video streams. Experiments prove that Dispider …
Poster
Haitong Liu · Kuofeng Gao · Yang Bai · Jinmin Li · Jinxiao Shan · Tao Dai · Shu-Tao Xia
[ ExHall D ]
Abstract
Recently, video-based large language models (video-based LLMs) have achieved impressive performance across various video comprehension tasks. However, this rapid advancement raises significant privacy and security concerns, particularly regarding the unauthorized use of personal video data in automated annotation by video-based LLMs. These unauthorized annotated video-text pairs can then be used to improve the performance of downstream tasks, such as text-to-video generation. To safeguard personal videos from unauthorized use, we propose two series of protective video watermarks with imperceptible adversarial perturbations, named **Ramblings** and **Mutes**. Concretely, **Ramblings** aim to mislead video-based LLMs into generating inaccurate captions for the original videos, thereby degrading the quality of video annotations through inconsistencies between video content and captions. **Mutes**, on the other hand, are designed to prompt video-based LLMs to produce exceptionally brief captions, lacking descriptive detail. Extensive experiments demonstrate that our video watermarking methods effectively protect video data by significantly reducing video annotation performance across various video-based LLMs, showcasing both stealthiness and robustness in protecting personal video content.
Poster
Zijia Lu · ASM Iftekhar · Gaurav Mittal · Tianjian Meng · Xiawei Wang · Cheng Zhao · Rohith Kukkala · Ehsan Elhamifar · Mei Chen
[ ExHall D ]
Abstract
Long Video Temporal Grounding (LVTG) aims at identifying specific moments within lengthy videos based on user-provided text queries for effective content retrieval. Existing methods divide the video into clips and process each clip via a full-scale expert encoder, which is challenging to scale due to the prohibitive computational cost of processing a large number of clips in long videos. To address this issue, we introduce DeCafNet, an approach employing a "delegate-and-conquer" strategy to achieve computational efficiency without sacrificing grounding performance. DeCafNet introduces a sidekick encoder that performs dense feature extraction over all video clips in a resource-efficient manner, while generating a saliency map to identify the most relevant clips for full processing by the expert encoder. To effectively leverage features from the sidekick and expert encoders that exist at different temporal resolutions, we introduce DeCaf-Grounder, which unifies and refines them via query-aware temporal aggregation and multi-scale temporal refinement for accurate grounding. Experiments on two LVTG benchmark datasets demonstrate that DeCafNet reduces computation by up to 47% while still outperforming existing methods, establishing a new state of the art for LVTG in terms of both efficiency and performance. Code and model will be released upon acceptance.
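A hedged sketch of the delegate-and-conquer idea: a cheap encoder scores every clip and only the top-scoring clips are sent to an expensive encoder. The toy linear encoders, the saliency head, the keep ratio, and the feature dimensions are all assumptions for illustration, not DeCafNet's actual components.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the two encoders; real models would be a lightweight
# network and a full-scale pre-trained video encoder.
sidekick = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 128))
expert = nn.Sequential(nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, 512))
saliency_head = nn.Linear(128, 1)

def delegate_and_conquer(clip_feats: torch.Tensor, keep_ratio: float = 0.25):
    """clip_feats: (N, 256) raw features for N clips of a long video.
    The sidekick scores every clip; only the top-k clips are sent to the
    expensive expert encoder. Ratio and dimensions are illustrative."""
    dense = sidekick(clip_feats)                       # cheap, all clips
    saliency = saliency_head(dense).squeeze(-1)        # (N,)
    k = max(1, int(keep_ratio * clip_feats.shape[0]))
    top_idx = saliency.topk(k).indices
    sparse = expert(clip_feats[top_idx])               # expensive, few clips
    return dense, sparse, top_idx, saliency

dense, sparse, idx, sal = delegate_and_conquer(torch.randn(120, 256))
print(dense.shape, sparse.shape, idx.shape)
```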
Poster
Chan Hur · Jeong-hun Hong · Dong-hun Lee · Dabin Kang · Semin Myeong · Sang-hyo Park · Hyeyoung Park
[ ExHall D ]
Abstract
In recent text-video retrieval, the use of additional captions from vision-language models has shown promising effects on performance. However, existing models using additional captions often struggle to capture the rich semantics, including temporal changes, inherent in the video. In addition, incorrect information caused by generative models can lead to inaccurate retrieval. To address these issues, we propose a new framework, Narrating the Video (NarVid), which strategically leverages the comprehensive information available from frame-level captions, the narration. The proposed NarVid exploits narration in multiple ways: 1) feature enhancement through cross-modal interactions between narration and video, 2) query-aware adaptive filtering to suppress irrelevant or incorrect information, 3) a dual-modal matching score that adds query-video similarity and query-narration similarity, and 4) a hard-negative loss to learn discriminative features from multiple perspectives using the two similarities from different views. Experimental results demonstrate that NarVid achieves state-of-the-art performance on various benchmark datasets. The code will be available at [github]
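A minimal sketch of a dual-modal matching score that combines query-video and query-narration similarity, assuming embeddings from some shared text-video embedding space. The max-pooling over frame-level narration embeddings and the mixing weight alpha are illustrative choices, not NarVid's actual formulation.

```python
import torch
import torch.nn.functional as F

def dual_modal_score(query_emb, video_emb, narration_embs, alpha=0.5):
    """Combine query-video similarity with query-narration similarity.
    narration_embs holds one embedding per frame-level caption; here they
    are simply max-pooled, which is an illustrative simplification."""
    qv = F.cosine_similarity(query_emb, video_emb, dim=-1)
    qn = F.cosine_similarity(query_emb.unsqueeze(0), narration_embs, dim=-1).max()
    return alpha * qv + (1 - alpha) * qn

query = torch.randn(512)
video = torch.randn(512)
narrations = torch.randn(32, 512)   # 32 frame-level captions
print(dual_modal_score(query, video, narrations))
```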
Poster
weixing chen · Yang Liu · Binglin Chen · Jiandong Su · Yongsen Zheng · Liang Lin
[ ExHall D ]
Abstract
Video question grounding (VideoQG) requires models to answer the questions and simultaneously infer the relevant video segments to support the answers. However, existing VideoQG methods usually suffer from spurious cross-modal correlations, leading to a failure to identify the dominant visual scenes that align with the intended question. Moreover, although large models possess extensive prior knowledge and can demonstrate strong performance in a zero-shot setting, issues such as spurious correlations persist, making their application to specific downstream tasks challenging. In this work, we propose a novel causality-aware VideoQG framework named Cross-modal Causality Relation Alignment (CRA) to eliminate spurious correlations and improve the causal consistency between question-answering and video temporal grounding. Our CRA involves three essential components: i) a Gaussian Smoothing Attention Grounding (GSAG) module for estimating the time interval via cross-modal attention, which is de-noised by an adaptive Gaussian filter; ii) Cross-modal Alignment (CA), which enhances the performance of weakly supervised VideoQG by leveraging bidirectional contrastive learning between estimated video segments and QA features; iii) an Explicit Causal Intervention (ECI) module for multimodal deconfounding, which involves front-door intervention for vision and back-door intervention for language. Extensive experiments on two VideoQG datasets demonstrate the superiority of our CRA in discovering visually grounded content and achieving …
Poster
Luca Zanella · Massimiliano Mancini · Willi Menapace · Sergey Tulyakov · Yiming Wang · Elisa Ricci
[ ExHall D ]
Abstract
Recent video-language alignment models are trained on sets of videos, each with an associated positive caption and a negative caption generated by large language models. A problem with this procedure is that negative captions may introduce linguistic biases, i.e., concepts are seen only as negatives and never associated with a video. While a solution would be to collect videos for the negative captions, existing databases lack the fine-grained variations needed to cover all possible negatives. In this work, we study whether synthetic videos can help to overcome this issue. Our preliminary analysis with multiple generators shows that, while promising on some tasks, synthetic videos harm the performance of the model on others. We hypothesize this issue is linked to noise (semantic and visual) in the generated videos and develop a method, SynViTA, that accounts for those. SynViTA dynamically weights the contribution of each synthetic video based on how similar its target caption is w.r.t. the real counterpart. Moreover, a semantic consistency loss makes the model focus on fine-grained differences across captions, rather than differences in video appearance. Experiments show that, on average, SynViTA improves over existing methods on VideoCon test sets and SSv2-Temporal, SSv2-Events, and ATP-Hard benchmarks, being a first …
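A hedged sketch of per-sample weighting for synthetic training videos: each synthetic video's loss is scaled by how close its target caption stays to the real counterpart. The text encoder, the softmax temperature, and the normalization that keeps the average weight near one are assumptions for illustration, not SynViTA's exact weighting scheme.

```python
import torch
import torch.nn.functional as F

def weighted_synthetic_loss(per_sample_loss, syn_caption_embs, real_caption_embs,
                            temperature=0.1):
    """Down-weight synthetic videos whose target caption drifts from its real
    counterpart. Caption embeddings could come from any text encoder; the
    temperature is an arbitrary illustrative choice."""
    sim = F.cosine_similarity(syn_caption_embs, real_caption_embs, dim=-1)   # (B,)
    weights = torch.softmax(sim / temperature, dim=0) * sim.numel()          # mean ~1
    return (weights.detach() * per_sample_loss).mean()

losses = torch.rand(8, requires_grad=True)   # per-sample alignment losses
syn = torch.randn(8, 512)                    # synthetic-video caption embeddings
real = torch.randn(8, 512)                   # real counterpart caption embeddings
print(weighted_synthetic_loss(losses, syn, real))
```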
Poster
Chaoyou Fu · Yuhan Dai · Yongdong Luo · Lei Li · Shuhuai Ren · Renrui Zhang · Zihan Wang · Chenyu Zhou · Yunhang Shen · Mengdan Zhang · Peixian Chen · Yanwei Li · Shaohui Lin · Sirui Zhao · Ke Li · Tong Xu · Xiawu Zheng · Enhong Chen · Caifeng Shan · Ran He · Xing Sun
[ ExHall D ]
Abstract
In the quest for artificial general intelligence, Multi-modal Large Language Models (MLLMs) have emerged as a focal point in recent advancements. However, the predominant focus remains on developing their capabilities in static image understanding. The potential of MLLMs to process sequential visual data is still insufficiently explored, highlighting the lack of a comprehensive, high-quality assessment of their performance. In this paper, we introduce Video-MME, the first-ever full-spectrum, Multi-Modal Evaluation benchmark of MLLMs in Video analysis. Our work distinguishes itself from existing benchmarks through four key features: 1) Diversity in video types, spanning 6 primary visual domains with 30 subfields to ensure broad scenario generalizability; 2) Duration in the temporal dimension, encompassing short-, medium-, and long-term videos ranging from 11 seconds to 1 hour, for robust contextual dynamics; 3) Breadth in data modalities, integrating multi-modal inputs besides video frames, including subtitles and audio, to unveil the all-round capabilities of MLLMs; 4) Quality in annotations, utilizing rigorous manual labeling by expert annotators to facilitate precise and reliable model assessment. With Video-MME, we extensively evaluate various state-of-the-art MLLMs, and reveal that Gemini 1.5 Pro is the best-performing commercial model, significantly outperforming the open-source models with an average accuracy of 75%, compared to 71.9% for …
Poster
Jinhui Yi · Syed Talal Wasim · Yanan Luo · Muzammal Naseer · Jürgen Gall
[ ExHall D ]
Abstract
We present an efficient encoder-free approach for video-language understanding that achieves competitive performance while significantly reducing computational overhead. Current video-language models typically rely on heavyweight image encoders (300M-1.1B parameters) or video encoders (1B-1.4B parameters), creating a substantial computational burden when processing multi-frame videos. Our method introduces a novel Spatio-Temporal Alignment Block (STAB) that directly processes video inputs without requiring pre-trained encoders, while using only 45M parameters for visual processing, at least a 6.5× reduction compared to traditional approaches. The STAB architecture combines Local Spatio-Temporal Encoding for fine-grained feature extraction, efficient spatial downsampling through learned attention, and separate mechanisms for modeling frame-level and video-level relationships. Our model achieves comparable or superior performance to encoder-based approaches for open-ended video question answering on standard benchmarks. The fine-grained video question-answering evaluation demonstrates our model's effectiveness, outperforming the encoder-based approaches Video-ChatGPT and Video-LLaVA in key aspects like correctness and temporal understanding. Extensive ablation studies validate our architectural choices and demonstrate the effectiveness of our spatio-temporal modeling approach while achieving 3-4× faster processing speeds than previous methods.
Poster
Chiara Plizzari · Alessio Tonioni · Yongqin Xian · Achin Kulshrestha · Federico Tombari
[ ExHall D ]
Abstract
Understanding fine-grained temporal dynamics is crucial in egocentric videos, where continuous streams capture frequent, close-up interactions with objects. In this work, we bring to light that current egocentric video question-answering datasets often include questions that can be answered using only a few frames or commonsense reasoning, without necessarily being grounded in the actual video. Our analysis shows that state-of-the-art Multi-Modal Large Language Models (MLLMs) on these benchmarks achieve remarkably high performance using just text or a single frame as input. To address these limitations, we introduce EgoTempo, a dataset specifically designed to evaluate temporal understanding in the egocentric domain. EgoTempo emphasizes tasks that require integrating information across the entire video, ensuring that models need to rely on temporal patterns rather than static cues or pre-existing knowledge. Extensive experiments on EgoTempo show that current MLLMs still fall short in temporal reasoning on egocentric videos, and thus we hope EgoTempo will catalyze new research in the field and inspire models that better capture the complexity of temporal dynamics in egocentric settings. The dataset will be made publicly available upon acceptance.
Poster
Quan Zhang · Jinwei Fang · Rui Yuan · Xi Tang · Yuxin Qi · Ke Zhang · Chun Yuan
[ ExHall D ]
Abstract
Recent breakthroughs in Multimodal Large Language Models (MLLMs) have gained significant recognition within the deep learning community, where the fusion of Video Foundation Models (VFMs) and Large Language Models (LLMs) has proven instrumental in constructing robust video understanding systems, effectively surmounting constraints associated with predefined visual tasks. These sophisticated MLLMs exhibit remarkable proficiency in comprehending videos, swiftly attaining unprecedented performance levels across diverse benchmarks. However, their operation demands substantial memory and computational resources, underscoring the continued importance of traditional models in video comprehension tasks. In this paper, we introduce a novel learning paradigm termed MLLM4WTAL. This paradigm harnesses the potential of MLLMs to offer temporal action key semantics and complete semantic textual cues for conventional Weakly-supervised Temporal Action Localization (WTAL) methods. MLLM4WTAL facilitates the enhancement of WTAL by leveraging MLLM guidance. It achieves this by integrating two distinct modules: Key Semantic Matching (KSM) and Complete Semantic Reconstruction (CSR). These modules work in tandem to effectively address prevalent issues like incomplete and over-complete outcomes common in WTAL methods. Rigorous experiments are conducted to validate the efficacy of our proposed approach in augmenting the performance of various heterogeneous WTAL models.
Poster
Reno Kriz · Kate Sanders · David Etter · Kenton Murray · Cameron Carpenter · Hannah Recknor · Jimena Guallar-Blasco · Alexander Martin · Eugene Yang · Benjamin Van Durme
[ ExHall D ]
Abstract
Efficiently retrieving and synthesizing information from large-scale multimodal collections has become a critical challenge. However, existing video retrieval datasets suffer from scope limitations, primarily focusing on matching descriptive but vague queries with small collections of professionally edited, English-centric videos. To address this gap, we introduce \textbf{MultiVENT 2.0}, a large-scale, multilingual event-centric video retrieval benchmark featuring a collection of more than 218,000 news videos and over 3,900 queries targeting specific world events. These queries specifically target information found in the visual content, audio, embedded text, and text metadata of the videos, requiring systems to leverage all these sources to succeed at the task. Preliminary results show that state-of-the-art vision-language models struggle significantly with this task, and while alternative approaches show promise, they are still insufficient to adequately address this problem. These findings underscore the need for more robust multimodal retrieval systems, as effective video retrieval is a crucial step towards multimodal content understanding and generation tasks.
Poster
Zijia Zhao · Yuqi Huo · Tongtian Yue · Longteng Guo · Haoyu Lu · Bingning Wang · Weipeng Chen · Jing Liu
[ ExHall D ]
Abstract
Most current video MLLMs rely on uniform frame sampling and image-level encoders, resulting in inefficient data processing and limited motion awareness. To address these challenges, we introduce **EMA**, an **E**fficient **M**otion-**A**ware video MLLM that utilizes compressed video structures as inputs. We propose a motion-aware GOP (Group of Pictures) encoder that fuses spatial and motion information within a GOP unit in the compressed video stream, generating compact, informative visual tokens. By integrating fewer but denser RGB frames with more but sparser motion vectors in this native slow-fast input architecture, our approach reduces redundancy and enhances motion representation. Additionally, we introduce MotionBench, a benchmark for evaluating motion understanding across four motion types: linear, curved, rotational, and contact-based. Experimental results show that EMA achieves state-of-the-art performance on both MotionBench and popular video question answering benchmarks, while reducing inference costs. Moreover, EMA demonstrates strong scalability, as evidenced by its competitive performance on long video understanding benchmarks.
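A minimal sketch of the native slow-fast input idea: a few dense RGB-frame tokens are fused with many sparse motion-vector tokens from one GOP. The toy class name, projection layers, token counts, and concatenation-based fusion are assumptions for illustration, not the paper's actual GOP encoder.

```python
import torch
import torch.nn as nn

class ToyGOPEncoder(nn.Module):
    """Fuse a few dense RGB-frame tokens with many sparse motion-vector
    tokens from one GOP into a single compact token sequence. Dimensions,
    token counts, and the concatenation-based fusion are illustrative."""
    def __init__(self, rgb_dim=768, mv_dim=64, out_dim=256):
        super().__init__()
        self.rgb_proj = nn.Linear(rgb_dim, out_dim)
        self.mv_proj = nn.Linear(mv_dim, out_dim)

    def forward(self, rgb_tokens, mv_tokens):
        # rgb_tokens: (B, few, 768)   e.g. 2 key frames per GOP
        # mv_tokens:  (B, many, 64)   e.g. 14 motion-vector maps per GOP
        return torch.cat([self.rgb_proj(rgb_tokens),
                          self.mv_proj(mv_tokens)], dim=1)

enc = ToyGOPEncoder()
tokens = enc(torch.randn(1, 2, 768), torch.randn(1, 14, 64))
print(tokens.shape)   # torch.Size([1, 16, 256])
```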
Poster
Zeyi Huang · Yuyang Ji · Xiaofang Wang · Nikhil Mehta · Tong Xiao · Donghyun Lee · Sigmund VanValkenburgh · Shengxin Zha · Bolin Lai · Licheng Yu · Ning Zhang · Yong Jae Lee · Miao Liu
[ ExHall D ]
Abstract
Long-form video understanding with Large Vision Language Models is challenged by the need to analyze temporally dispersed yet spatially concentrated key moments within limited context windows. In this work, we introduce VideoMindPalace, a new framework inspired by the "Mind Palace", which organizes critical video moments into a topologically structured semantic graph. VideoMindPalace organizes key information through (i) hand-object tracking and interaction, (ii) clustered activity zones representing specific areas of recurring activities, and (iii) environment layout mapping, allowing natural language parsing by LLMs to provide grounded insights on spatio-temporal and 3D context. In addition, we propose the Video MindPalace Benchmark (VMB) to assess human-like reasoning, including spatial localization, temporal reasoning, and layout-aware sequential understanding. Evaluated on VMB and established video QA datasets, including EgoSchema, NExT-QA, IntentQA, and the Active Memories Benchmark, VideoMindPalace demonstrates notable gains in spatio-temporal coherence and human-aligned reasoning, advancing long-form video analysis capabilities in VLMs.
Poster
Jiawei Tan · Hongxing Wang · Junwu Weng · Jiaxin Li · Zhilong Ou · Kang Dang
[ ExHall D ]
Abstract
Video moment retrieval aims to locate specific moments from a video according to the query text. This task presents two main challenges: i) aligning the query and video frames at the feature level, and ii) projecting the query-aligned frame features to the start and end boundaries of the matching interval. Previous work commonly involves all frames in feature alignment, which easily aligns irrelevant frames with the query. Furthermore, it forcibly maps visual features to interval boundaries while ignoring the information gap between them, yielding suboptimal performance. In this study, to reduce distraction from irrelevant frames, we designate the anchor frame as the one with the maximum query-frame relevance measured by an established Vision-Language Model. Via similarity comparison between the anchor frame and the others, we produce a semantically compact segment around the anchor frame, which serves as a guide to align the features of the query and related frames. We observe that such feature alignment makes similarity cohesive between target frames, which enables us to predict the interval boundaries via a single point detection in the 2D semantic similarity space of frames, thus bridging the information gap between frame semantics and temporal boundaries. Experimental results across various datasets demonstrate …
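A hedged sketch of the anchor-frame idea: pick the frame most relevant to the query and grow a compact segment of frames that stay similar to it. The cosine relevance, the greedy expansion, and the similarity threshold are assumptions for illustration, not the paper's actual segment construction or boundary detection.

```python
import numpy as np

def anchor_segment(frame_feats, query_feat, keep_threshold=0.7):
    """frame_feats: (T, D) frame embeddings, query_feat: (D,) text embedding,
    both assumed to live in the same vision-language embedding space. Pick
    the most query-relevant frame as the anchor and grow a compact segment
    of frames that remain similar to it. Threshold is illustrative."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b) + 1e-8)

    relevance = cos(frame_feats, query_feat)          # (T,) query-frame relevance
    anchor = int(relevance.argmax())
    sim_to_anchor = cos(frame_feats, frame_feats[anchor])

    start, end = anchor, anchor
    while start > 0 and sim_to_anchor[start - 1] >= keep_threshold:
        start -= 1
    while end < len(frame_feats) - 1 and sim_to_anchor[end + 1] >= keep_threshold:
        end += 1
    return anchor, (start, end)

feats = np.random.randn(100, 256)
query = np.random.randn(256)
print(anchor_segment(feats, query))
```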
Poster
Yisen Feng · Haoyu Zhang · Meng Liu · Weili Guan · Liqiang Nie
[ ExHall D ]
Abstract
Egocentric video grounding is a crucial task for embodied intelligence applications, distinct from exocentric video moment localization. Existing methods primarily focus on the distributional differences between egocentric and exocentric videos but often neglect key characteristics of egocentric videos and the fine-grained information emphasized by question-type queries. To address these limitations, we propose OSGNet, an Object-Shot enhanced Grounding Network for egocentric video. Specifically, we extract object information from videos to enrich video representation, particularly for objects highlighted in the textual query but not directly captured in the video features. Additionally, we analyze the frequent shot movements inherent to egocentric videos, leveraging these features to extract the wearer's attention information, which enhances the model's ability to perform modality alignment. Experiments conducted on three datasets demonstrate that OSGNet achieves state-of-the-art performance, validating the effectiveness of our approach. Our code will be released and made available in the supplementary material.
Poster
Aditya Chinchure · Sahithya Ravi · Raymond Ng · Vered Shwartz · Boyang Li · Leonid Sigal
[ ExHall D ]
Abstract
The commonsense reasoning capabilities of vision-language models (VLMs), especially in abductive reasoning and defeasible reasoning, remain poorly understood. Most benchmarks focus on typical visual scenarios, making it difficult to discern whether model performance stems from keen perception and reasoning skills, or reliance on pure statistical recall. We argue that by focusing on atypical events in videos, clearer insights can be gained on the core capabilities of VLMs. Explaining and understanding such out-of-distribution events requires models to extend beyond basic pattern recognition and regurgitation of their prior knowledge. To this end, we introduce BlackSwanSuite, a benchmark for evaluating VLMs' ability to reason about unexpected events through abductive and defeasible tasks. Our tasks artificially limit the amount of visual information provided to models while questioning them about hidden unexpected events, or provide new visual information that could change an existing hypothesis about the event. We curate a comprehensive benchmark suite comprising over 3,800 MCQ, 4,900 generative and 6,700 yes/no tasks, spanning 1,655 videos. After extensively evaluating various state-of-the-art VLMs, including GPT-4o and Gemini 1.5 Pro, as well as open-source VLMs such as LLaVA-Video, we find significant performance gaps of up to 32% from humans on these tasks. Our findings reveal key limitations …
Poster
Hesham Syed · Yun Liu · Guolei Sun · Henghui Ding · Jing Yang · Ender Konukoglu · Xue Geng · Xudong Jiang
[ ExHall D ]
Abstract
Video semantic segmentation (VSS) plays a vital role in understanding the temporal evolution of scenes. Traditional methods often segment videos frame-by-frame or in a short temporal window, leading to limited temporal context, redundant computations, and heavy memory requirements. To this end, we introduce a Temporal Video State Space Sharing (TV3S) architecture to leverage Mamba state space models for temporal feature sharing. Our model features a selective gating mechanism that efficiently propagates relevant information across video frames, eliminating the need for a memory-heavy feature pool. By processing spatial patches independently and incorporating a shifted operation, TV3S supports highly parallel computation in both training and inference stages, which reduces the delay in sequential state space processing and improves the scalability for long video sequences. Moreover, TV3S incorporates information from prior frames during inference, achieving long-range temporal coherence and superior adaptability to extended sequences. Evaluations on the VSPW and Cityscapes datasets reveal that our approach outperforms current state-of-the-art methods, establishing a new standard for VSS with consistent results across long video sequences. By achieving a good balance between accuracy and efficiency, TV3S shows a significant advancement in spatiotemporal modeling, paving the way for efficient video analysis. The code will be released.
Poster
Jaewoo Jeong · Seohee Lee · Daehee Park · Giwon Lee · Kuk-Jin Yoon
[ ExHall D ]
Abstract
Pedestrian trajectory forecasting is crucial in various applications such as autonomous driving and mobile robot navigation. Pedestrians' camera-based visual features enable the extraction of additional modalities (human pose, text) that enhance prediction accuracy. We focus on pedestrian motion prediction to fully utilize the rich, dynamic visual features of pedestrians. Indeed, we find that textual descriptions play a crucial role in integrating additional modalities into a unified understanding. However, online extraction of text requires the use of a VLM, which may not be feasible for resource-constrained systems. To address this challenge, we propose a multi-modal knowledge distillation framework: a student model with limited modalities is distilled from a teacher model trained with the full range of modalities. The comprehensive knowledge of a teacher model trained with trajectory, human pose, and text is distilled into a student model using only trajectory or human pose as the sole supplement. We validate our generalizable framework with two state-of-the-art models across three datasets on both ego-view (JRDB, SIT) and BEV-view (ETH/UCY) setups. For the SIT dataset, we utilize a VLM to generate captions to compensate for the lack of text annotations. Distilled student models show consistent improvement in all prediction metrics for both full and instantaneous observations.
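A minimal sketch of multi-modal knowledge distillation in this spirit: a teacher seeing trajectory, pose, and text supervises a student that sees only the trajectory. The toy linear predictors, feature dimensions, MSE losses, and mixing weight alpha are assumptions for illustration, not the paper's models or losses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy teacher/student: the teacher sees trajectory + pose + text features,
# the student only the past trajectory. Dimensions are illustrative.
teacher = nn.Linear(64 + 32 + 32, 24)   # predicts 12 future (x, y) points
student = nn.Linear(64, 24)

def distillation_step(traj, pose, text, future, alpha=0.5):
    with torch.no_grad():
        t_pred = teacher(torch.cat([traj, pose, text], dim=-1))
    s_pred = student(traj)
    task_loss = F.mse_loss(s_pred, future)        # supervised trajectory loss
    distill_loss = F.mse_loss(s_pred, t_pred)     # match the richer teacher
    return task_loss + alpha * distill_loss

loss = distillation_step(torch.randn(16, 64), torch.randn(16, 32),
                         torch.randn(16, 32), torch.randn(16, 24))
loss.backward()
```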
Poster
Mingqiao Ye · Seoung Wug Oh · Lei Ke · Joon-Young Lee
[ ExHall D ]
Abstract
Automatically tracking and segmenting every video entity remains a significant challenge. Despite rapid advancements in video segmentation, even state-of-the-art models like SAM 2 struggle to consistently track all entities across a video—a task we refer to as Video Entity Segmentation. We propose EntitySAM, a framework for zero-shot video entity segmentation. EntitySAM extends SAM 2 by removing the need for explicit prompts, allowing automatic discovery and tracking of all entities, including those appearing in later frames. We incorporate query-based entity discovery and association into SAM 2, inspired by transformer-based object detectors. Specifically, we introduce an entity decoder to facilitate inter-object communication and an automatic prompt generator using learnable object queries. Additionally, we add a semantic encoder to enhance SAM 2's semantic awareness, improving segmentation quality. Trained on image-level mask annotations without category information from the COCO dataset, EntitySAM demonstrates strong generalization on four zero-shot video segmentation tasks: Video Entity, Panoptic, Instance, and Semantic Segmentation. Results on six popular benchmarks show that EntitySAM outperforms previous unified video segmentation methods and strong baselines, setting new standards for zero-shot video segmentation.
Poster
Md Zarif Hossain · AHMED IMTEAJ
[ ExHall D ]
Abstract
Large Vision-Language Models (LVLMs) have emerged as transformative tools in multimodal tasks, seamlessly integrating pretrained vision encoders to align visual and textual modalities. Prior works have highlighted the susceptibility of LVLMs to dual exploits (gradient-based and optimization-based jailbreak attacks), which leverage the expanded attack surface introduced by the image modality. Despite advancements in enhancing robustness, existing methods fall short in their ability to defend against dual exploits while preserving fine-grained semantic details and overall semantic coherence under intense adversarial perturbations. To bridge this gap, we introduce SLADE, a novel unsupervised adversarial fine-tuning scheme that enhances the resilience of CLIP-based vision encoders. SLADE’s dual-level contrastive learning approach balances the granular and the holistic, capturing fine-grained image details without losing sight of high-level semantic coherence. Extensive experiments demonstrate that SLADE-equipped LVLMs set a new benchmark for robustness against dual exploits while preserving fine-grained semantic details of perturbed images. Notably, SLADE achieves these results without compromising the core functionalities of LVLMs, such as instruction following, or requiring the computational overhead (e.g., large batch sizes, momentum encoders) commonly associated with traditional contrastive learning methods. The code is provided in the supplementary material with this submission.
Poster
Alan Lukezic · Jovana Videnović · Matej Kristan
[ ExHall D ]
Abstract
Memory-based trackers are video object segmentation methods that form the target model by concatenating recently tracked frames into a memory buffer and localize the target by attending the current image to the buffered frames. While already achieving top performance on many benchmarks, it was the recent release of SAM2 that placed memory-based trackers into the focus of the visual object tracking community. Nevertheless, modern trackers still struggle in the presence of distractors. We argue that a more sophisticated memory model is required, and propose a new distractor-aware memory model for SAM2 and an introspection-based update strategy that jointly address segmentation accuracy as well as tracking robustness. The resulting tracker is denoted as SAM2.1++. We also propose a new distractor-distilled DiDi dataset to study the distractor problem better. SAM2.1++ outperforms SAM2.1 and related SAM memory extensions on seven benchmarks and sets a solid new state-of-the-art on six of them.
Poster
Snehashis Majhi · Giacomo D'Amicantonio · Antitza Dantcheva · Quan Kong · Lorenzo Garattoni · Gianpiero Francesca · Egor Bondarev · Francois Bremond
[ ExHall D ]
Abstract
Weakly-supervised methods for video anomaly detection (VAD) are conventionally based merely on RGB spatio-temporal features, which continues to limit their reliability in real-world scenarios. This is due to the fact that RGB-features are not sufficiently distinctive in setting apart categories such as shoplifting from visually similar events. Therefore, towards robust complex real-world VAD, it is essential to augment RGB spatio-temporal features by additional modalities. Motivated by this, we introduce the Poly-modal Induced framework for VAD: PI-VAD (or π-VAD), a novel approach that augments RGB representations by five additional modalities. Specifically, the modalities include sensitivity to fine-grained motion (Pose), three dimensional scene and entity representation (Depth), surrounding objects (Panoptic masks), global motion (optical flow), as well as language cues (VLM). Each modality represents an axis of a polygon, streamlined to add salient cues to RGB. π-VAD includes two plug-in modules, namely Pseudo-modality Generation module and Cross Modal Induction module, which generate modality-specific prototypical representation and, thereby, induce multi-modal information into RGB cues. These modules operate by performing anomaly-aware auxiliary tasks and necessitate five modality backbones -- only during training. Notably, π-VAD achieves state-of-the-art accuracy on three prominent VAD datasets encompassing real-world scenarios, without requiring the computational overhead of five modality backbones …
Poster
Kazi Sajeed Mehrab · M. Maruf · Arka Daw · Abhilash Neog · Harish Babu Manogaran · Mridul Khurana · Zhenyang Feng · Bahadir Altintas · Yasin Bakis · Elizabeth Campolongo · Matthew Thompson · Xiaojun Wang · Hilmar Lapp · Tanya Berger-Wolf · Paula Mabee · Henry Bart · Wei-Lun Chao · Wasla Dahdul · Anuj Karpatne
[ ExHall D ]
Abstract
The availability of large datasets of organism images combined with advances in artificial intelligence (AI) has significantly enhanced the study of organisms through images, unveiling biodiversity patterns and macro-evolutionary trends. However, existing machine learning (ML)-ready organism datasets have several limitations. First, these datasets often focus on species classification only, overlooking tasks involving visual traits of organisms. Second, they lack detailed visual trait annotations, like pixel-level segmentation, that are crucial for in-depth biological studies. Third, these datasets predominantly feature organisms in their natural habitats, posing challenges for aquatic species like fish, where underwater images often suffer from poor visual clarity, obscuring critical biological traits. This gap hampers the study of aquatic biodiversity patterns which is necessary for the assessment of climate change impacts, and evolutionary research on aquatic species morphology. To address this, we introduce the Fish-Visual Trait Analysis (Fish-Vista) dataset—a large, annotated collection of about 80K fish images spanning 3000 different species, supporting several challenging and biologically relevant tasks including species classification, trait identification, and trait segmentation. These images have been curated through a sophisticated data processing pipeline applied to a cumulative set of images obtained from various museum collections. Fish-Vista ensures that visual traits of images are clearly visible, …
Poster
Ho-Joong Kim · Yearang Lee · Jung-Ho Hong · Seong-Whan Lee
[ ExHall D ]
Abstract
In this paper, we examine a key limitation in query-based detectors for temporal action detection (TAD), which arises from their direct adaptation of originally designed architectures for object detection. Despite the effectiveness of the existing models, they struggle to fully address the unique challenges of TAD, such as the redundancy in multi-scale features and the limited ability to capture sufficient temporal context. To address these issues, we propose a multi-dilated gated encoder and central-adjacent region integrated decoder for temporal action detection transformer (DiGIT). Our approach replaces the existing encoder that consists of multi-scale deformable attention and feedforward network with our multi-dilated gated encoder. Our proposed encoder reduces the redundant information caused by multi-level features while maintaining the ability to capture fine-grained and long-range temporal information. Furthermore, we introduce a central-adjacent region integrated decoder that leverages a more comprehensive sampling strategy for deformable cross-attention to capture the essential information. Extensive experiments demonstrate that DiGIT achieves state-of-the-art performance on THUMOS14, ActivityNet v1.3, and HACS-Segment.
Poster
Dominick Reilly · Rajatsubhra Chakraborty · Arkaprava Sinha · Manish Kumar Govind · Pu Wang · Francois Bremond · Le Xue · Srijan Das
[ ExHall D ]
Abstract
Current Large Language Vision Models (LLVMs) trained on web videos perform well in general video understanding but struggle with fine-grained details, complex human-object interactions (HOI), and view-invariant representation learning essential for Activities of Daily Living (ADL). This limitation stems from a lack of specialized ADL video instruction-tuning datasets and insufficient modality integration to capture discriminative action representations. To address this, we propose a semi-automated framework for curating ADL datasets, creating ADL-X, a multiview, multimodal RGBS instruction-tuning dataset. Additionally, we introduce LLAVIDAL, an LLVM integrating videos, 3D skeletons, and HOIs to model ADL's complex spatiotemporal relationships. For training LLAVIDAL, a simple joint alignment of all modalities yields suboptimal results; thus, we propose a Multimodal Progressive (MMPro) training strategy, incorporating modalities in stages following a curriculum. We also establish ADL MCQ and video description benchmarks to assess LLVM performance in ADL tasks. Trained on ADL-X, LLAVIDAL achieves state-of-the-art performance across ADL benchmarks. Code and data will be made publicly available at https://llavidal.github.io/llavidal/
Poster
Jianyang Xie · Yitian Zhao · Yanda Meng · He Zhao · Anh Nguyen · Yalin Zheng
[ ExHall D ]
Abstract
Spatial-temporal graph convolutional networks (ST-GCNs) showcase impressive performance in skeleton-based human action recognition (HAR). However, despite the development of numerous models, their recognition performance does not differ significantly after aligning the input settings. With this observation, we hypothesize that ST-GCNs are over-parameterized for HAR, a conjecture subsequently confirmed through experiments employing the lottery ticket hypothesis. Additionally, a novel sparse ST-GCNs generator is proposed, which trains a sparse architecture from a randomly initialized dense network while maintaining comparable performance levels to the dense components. Moreover, we generate multi-level sparsity ST-GCNs by integrating sparse structures at various sparsity levels and demonstrate that the assembled model yields a significant enhancement in HAR performance. Thorough experiments on four datasets, including NTU-RGB+D 60(120), Kinetics-400, and FineGYM, demonstrate that the proposed sparse ST-GCNs can achieve comparable performance to their dense components. Even with 95% fewer parameters, the sparse ST-GCNs exhibit a degradation of <1% in top-1 accuracy. Meanwhile, the multi-level sparsity ST-GCNs, which require only 66% of the parameters of the dense ST-GCNs, demonstrate an improvement of >1% in top-1 accuracy. The code will be released upon acceptance.
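A hedged sketch of the usual starting point for lottery-ticket-style sparsification: global magnitude pruning that keeps only the largest weights of a (toy) network. The 95% sparsity level mirrors the figure quoted above, but the toy two-layer model, the per-tensor thresholding, and the function name are illustrative assumptions rather than the paper's generator.

```python
import torch
import torch.nn as nn

def magnitude_masks(model: nn.Module, sparsity: float = 0.95):
    """Return binary masks that keep only the largest-magnitude weights, a
    common starting point of lottery-ticket experiments. Applying the masks
    during training (weight * mask) yields a sparse sub-network."""
    masks = {}
    for name, param in model.named_parameters():
        if param.dim() < 2:          # skip biases / norm parameters
            continue
        k = int(param.numel() * (1.0 - sparsity))
        threshold = param.abs().flatten().topk(max(k, 1)).values.min()
        masks[name] = (param.abs() >= threshold).float()
    return masks

# Toy stand-in for an ST-GCN: two linear layers over flattened joint features.
toy_gcn = nn.Sequential(nn.Linear(75, 128), nn.ReLU(), nn.Linear(128, 60))
masks = magnitude_masks(toy_gcn, sparsity=0.95)
print({n: float(m.mean()) for n, m in masks.items()})   # ~5% of weights kept
```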
Poster
Yuhao Li · Xinyue Chen · Hongkai Li · Xiaorong Pu · Peng Jin · Yazhou Ren
[ ExHall D ]
Abstract
Sign language is a visual language expressed through complex movements of the upper body. The human skeleton plays a critical role in sign language recognition due to its good separation from the video background. However, mainstream skeleton-based sign language recognition models often overly focus on the natural connections between joints, treating sign language as ordinary human movements, which neglects its linguistic characteristics. We believe that just as letters form words, each sign language gloss can also be decomposed into smaller visual symbols. To fully harness the potential of skeleton data, this paper proposes a novel joint fusion strategy and a visual symbol attention model. Specifically, we first input the complete set of skeletal joints, and after dynamically exchanging joint information, we discard the parts with the weakest connections to other joints, resulting in a fused, simplified skeleton. Then, we group the joints most likely to express the same visual symbol and discuss the joint movements within each group separately. To validate the superiority of our method, we conduct extensive experiments on multiple public benchmark datasets. The results show that, without complex pre-training, we still achieve new state-of-the-art performance.
Poster
Chun Tong Lei · Hon Ming Yam · Zhongliang Guo · Yifei Qian · Chun Pong Lau
[ ExHall D ]
Abstract
Neural networks have revolutionized numerous fields with their exceptional performance, yet they remain susceptible to adversarial attacks through subtle perturbations. While diffusion-based purification methods like DiffPure offer promising defense mechanisms, their computational overhead presents a significant practical limitation. In this paper, we introduce One Step Control Purification (OSCP), a novel defense framework that achieves robust adversarial purification in a single Neural Function Evaluation (NFE) within diffusion models. We propose Gaussian Adversarial Noise Distillation (GAND) as the distillation objective and Controlled Adversarial Purification (CAP) as the inference pipeline, which together make OSCP remarkably efficient while maintaining defense efficacy. Our proposed GAND addresses a fundamental tension between consistency distillation and adversarial perturbation, bridging the gap between natural and adversarial manifolds in the latent space, while remaining computationally efficient through Parameter-Efficient Fine-Tuning (PEFT) methods such as LoRA, eliminating the high computational budget required by full-parameter fine-tuning. CAP guides the purification process with an unlearnable edge-detection operator computed from the input image as an extra prompt, effectively preventing the purified images from deviating from their original appearance when using large purification steps. Our experimental results on ImageNet showcase OSCP's superior performance, achieving a 74.19\% defense success rate with merely 0.1s per purification --- a 100-fold speedup …
Poster
Huu Binh Ta · Duc Nguyen · Quyen Tran · Toan Tran · Tung Pham
[ ExHall D ]
Abstract
In security-sensitive fields, data should be encrypted to protect against unauthorized access and maintain confidentiality throughout processing. However, traditional networks like ViTs and CNNs return different results when processing original data versus its encrypted form, meaning that they require data to be decrypted, posing a security risk by exposing sensitive information. One solution to this issue is to use polynomial networks, including state-of-the-art Multilinear Operator Networks, which return the same outputs given the real data and their encrypted forms under Leveled Fully Homomorphic Encryption. Nevertheless, these models are susceptible to catastrophic forgetting in incremental learning settings. Thus, this paper presents a new low-rank adaptation method combined with the Gradient Projection Memory mechanism to mitigate this issue. Our proposal is compatible with Leveled Fully Homomorphic Encryption while achieving a sharp improvement in performance compared to existing models.
Poster
Zhuowei Li · Tianchen Zhao · Xiang Xu · Zheng Zhang · Zhihua Li · Xuanbai Chen · Qin ZHANG · Alessandro Bergamo · Anil Kumar Jain · Yifan Xing
[ ExHall D ]
Abstract
Developing a face anti-spoofing model that meets the security requirements of clients worldwide is challenging due to the domain gap between training datasets and the diverse end-user test data. Moreover, for security and privacy reasons, it is undesirable for clients to share large amounts of their face data with service providers. In this work, we introduce a novel method where the face anti-spoofing model can be adapted by the client itself to a target domain at test time using only a small sample of data, while keeping model parameters and training data inaccessible to the client. We develop a prototype-based base model and an optimal transport-guided adaptor that enable adaptation either in a light-weight training or training-free setting, without updating the base model's parameters. Moreover, we employ geodesic mixup, an optimal transport-based synthesis method that generates augmented training data along the geodesic path between source prototypes and the target data distribution. This allows training a lightweight classifier to effectively adapt to target-specific characteristics while retaining essential knowledge learned from the source domain. In cross-domain and cross-attack settings, compared with recent methods, our method achieves average improvements of 19.17\% in HTER and 8.58\% in AUC, respectively.
Poster
Gaojian Wang · Feng Lin · Tong Wu · Zhenguang Liu · Zhongjie Ba · Kui Ren
[ ExHall D ]
Abstract
This work asks: with abundant, unlabeled real faces, how to learn a robust and transferable facial representation that boosts various face security tasks with respect to generalization performance? We make the first attempt and propose a self-supervised pretraining framework to learn fundamental representations of real face images, FSFM, that leverages the synergy between masked image modeling (MIM) and instance discrimination (ID). We explore various facial masking strategies for MIM and present a simple yet powerful CRFR-P masking, which explicitly forces the model to capture meaningful intra-region Consistency and challenging inter-region Coherency. Furthermore, we devise the ID network that naturally couples with MIM to establish underlying local-to-global Correspondence via tailored self-distillation. These three learning objectives, namely 3C, empower encoding both local features and global semantics of real faces. After pretraining, a vanilla ViT serves as a universal vision Foundation Model for downstream Face Security tasks: cross-dataset deepfake detection, cross-domain face anti-spoofing, and unseen diffusion facial forgery detection. Extensive experiments on 10 public datasets demonstrate that our model transfers better than supervised pretraining, visual and facial self-supervised learning arts, and even outperforms task-specialized SOTA methods.
Poster
Hangtao Zhang · Yichen Wang · Shihui Yan · Chenyu Zhu · Ziqi Zhou · Linshan Hou · Shengshan Hu · Minghui Li · Yanjun Zhang · Leo Yu Zhang
[ ExHall D ]
Abstract
Object detection models are vulnerable to backdoor attacks, where attackers poison a small subset of training samples by embedding a predefined trigger to manipulate predictions. Detecting poisoned samples (i.e., those containing triggers) at test time can prevent backdoor activation. However, unlike image classification tasks, the unique characteristics of object detection---particularly its output of numerous objects---pose fresh challenges for backdoor detection. The complex attack effects (e.g., "ghost" object emergence or object "vanishing") further render current defenses fundamentally inadequate. To this end, we design TRAnsformation Consistency Evaluation (TRACE), a brand-new method for detecting poisoned samples at test time in object detection. Our journey begins with two intriguing observations: (1) Poisoned samples exhibit significantly more consistent detection results than clean ones across varied backgrounds. (2) Clean samples show higher detection consistency when introduced to different focal information. Based on these phenomena, TRACE applies foreground and background transformations to each test sample, then assesses transformation consistency by calculating the variance in object confidences. TRACE achieves black-box, universal backdoor detection, with extensive experiments showing a 30% improvement in AUROC over state-of-the-art defenses and resistance to adaptive attacks.
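A minimal sketch of a transformation-consistency check in this spirit: run a black-box detector on several background-transformed copies of an image and measure how much the top confidence varies; unusually low variance would be flagged as suspicious. The dummy detector, the additive transforms, and the threshold are stand-ins to keep the sketch executable, not TRACE's actual transformations or scoring.

```python
import numpy as np

def transformation_consistency(detector, image, background_transforms):
    """Run the detector on several transformed copies of a test image and
    measure how much the top-confidence score varies. Poisoned samples are
    reported as unusually *consistent* (low variance); the decision
    threshold would be calibrated on clean data."""
    top_scores = []
    for transform in background_transforms:
        detections = detector(transform(image))          # list of (label, score, box)
        top_scores.append(max((s for _, s, _ in detections), default=0.0))
    return float(np.var(top_scores))

# Dummy detector and transforms, purely to make the sketch executable.
rng = np.random.default_rng(0)
def dummy_detector(img):
    return [("person", float(img.mean() % 1.0), (0, 0, 10, 10))]
transforms = [lambda im, s=s: im + s for s in (0.0, 0.3, 0.7)]

variance = transformation_consistency(dummy_detector, rng.random((32, 32, 3)), transforms)
print("suspicious" if variance < 1e-3 else "clean", variance)
```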
Poster
Tong Bu · Maohua Li · Zhaofei Yu
[ ExHall D ]
Abstract
Spiking Neural Networks (SNNs) have emerged as a promising substitute for Artificial Neural Networks (ANNs) due to their advantages of fast inference and low power consumption. However, the lack of efficient training algorithms has hindered their widespread adoption. Even efficient ANN-SNN conversion methods necessitate quantized training of ANNs to enhance the effectiveness of the conversion, incurring additional training costs. To address these challenges, we propose an efficient ANN-SNN conversion framework with only inference-scale complexity. The conversion framework includes a local threshold balancing algorithm, which enables efficient calculation of the optimal thresholds and fine-grained adjustment of the threshold value by channel-wise scaling. We also introduce an effective delayed evaluation strategy to mitigate the influence of spike propagation delays. We demonstrate the scalability of our framework in typical computer vision tasks: image classification, semantic segmentation, object detection, and video classification. Our algorithm outperforms existing methods, highlighting its practical applicability and efficiency. Moreover, we have evaluated the energy consumption of the converted SNNs, demonstrating their superior low-power advantage compared to conventional ANNs. This approach simplifies the deployment of SNNs by leveraging open-source pre-trained ANN models, enabling fast, low-power inference with negligible performance reduction.
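A hedged sketch of one common ingredient of ANN-SNN conversion: estimating a per-channel firing threshold from a small batch of calibration activations. The percentile, the per-channel granularity, and the function name are illustrative assumptions, not the paper's local threshold balancing algorithm.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def channelwise_thresholds(layer: nn.Conv2d, calib_inputs, percentile=99.9):
    """Estimate one firing threshold per output channel from a small batch
    of calibration data. The percentile and per-channel granularity are
    illustrative choices."""
    acts = torch.relu(layer(calib_inputs))          # (B, C, H, W) post-activation
    flat = acts.permute(1, 0, 2, 3).flatten(1)      # (C, B*H*W)
    return torch.quantile(flat, percentile / 100.0, dim=1)   # (C,) thresholds

conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
thresholds = channelwise_thresholds(conv, torch.randn(8, 3, 32, 32))
print(thresholds.shape)   # torch.Size([16])
```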
Poster
Yufei Guo · Xiaode Liu · Yuanpei Chen · Weihang Peng · Yuhan Zhang · Zhe Ma
[ ExHall D ]
Abstract
Spiking Neural Networks (SNNs) have emerged as a promising energy-efficient alternative to Artificial Neural Networks (ANNs), utilizing event-driven computation and binary spikes for information transfer. Despite their energy efficiency, SNNs face significant challenges in achieving high task accuracy, particularly when integrated with CNN-based architectures. A potential solution is the combination of Transformer models with SNNs. This paper addresses the challenge of adapting the self-attention mechanism of Transformers to the spiking paradigm by introducing a novel approach: Accurate Addition-Only Spiking Self-Attention (A2OS2A). Unlike existing methods that rely exclusively on binary spiking neurons for all components of the self-attention mechanism, our approach incorporates binary, ReLU, and ternary spiking neurons. This hybrid strategy substantially improves accuracy while maintaining non-multiplicative computations. Furthermore, our method eliminates the need for softmax and scaling operations. Extensive experiments demonstrate that the A2OS2A-based Spiking Transformer outperforms existing SNN-based Transformers on both static and neuromorphic datasets, achieving an accuracy of 78.66\% on ImageNet-1K. Our work represents a significant advancement in SNN-based Transformer models, offering a more accurate and efficient solution for real-world applications.
Poster
Chao Yuan · Guiwei Zhang · Changxiao Ma · Tianyi Zhang · Guanglin Niu
[ ExHall D ]
Abstract
Person re-identification (ReID) aims to extract accurate identity representation features. However, during feature extraction, individual samples are inevitably affected by noise (background, occlusions, and model limitations). Considering that features from the same identity follow a normal distribution around identity centers after training, we propose a Training-Free Feature Centralization ReID framework that aggregates features of the same identity to reduce individual sample noise and enhance the stability of the identity representation, while preserving the features' original distribution for subsequent strategies such as re-ranking. Specifically, to obtain samples of the same identity, we introduce two components: Identity-Guided Pedestrian Generation, which leverages identity features to guide the generation process, yielding high-quality images with diverse poses and ensuring identity consistency even in complex scenarios such as infrared and occlusion; and Neighbor Feature Centralization, which explores each sample's potential positive samples from its neighborhood. Experiments demonstrate that our generative model exhibits strong generalization capabilities and maintains high identity consistency. With the Feature Centralization framework, we achieve impressive performance even with an ImageNet pre-trained model without ReID training, reaching mAP/Rank-1 of 52.81/78.92 on Market1501. Moreover, our method sets new state-of-the-art results across standard, cross-modality, and occluded ReID tasks, showcasing strong adaptability.
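A minimal sketch of neighbor-based feature centralization: each gallery feature is replaced by the mean of itself and its k nearest neighbors in cosine space, pulling samples of the same identity toward a common center. The choice of k and the plain cosine neighborhood are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def neighbor_centralization(feats: np.ndarray, k: int = 5) -> np.ndarray:
    """Replace each L2-normalized feature by the mean of itself and its k
    nearest neighbors, reducing individual sample noise. Training-free."""
    feats = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    sims = feats @ feats.T                              # (N, N) cosine similarities
    nn_idx = np.argsort(-sims, axis=1)[:, : k + 1]      # self + k neighbors
    centralized = feats[nn_idx].mean(axis=1)
    return centralized / (np.linalg.norm(centralized, axis=1, keepdims=True) + 1e-8)

gallery = np.random.randn(1000, 768).astype(np.float32)
print(neighbor_centralization(gallery).shape)           # (1000, 768)
```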
Poster
Keqi Chen · vinkle srivastav · Didier MUTTER · Nicolas Padoy
[ ExHall D ]
Abstract
Multi-view person association is a fundamental step towards multi-view analysis of human activities. Although person re-identification features have proven effective, they become unreliable in challenging scenes where persons share similar appearances. Therefore, cross-view geometric constraints are required for a more robust association. However, most existing approaches are either fully-supervised using ground-truth identity labels or require calibrated camera parameters that are hard to obtain. In this work, we investigate the potential of learning from multi-view synchronization, and propose a self-supervised uncalibrated multi-view person association approach, Self-MVA, without using any annotations. Specifically, we propose a self-supervised learning framework, consisting of an encoder-decoder model and a self-supervised pretext task, cross-view image synchronization, which aims to distinguish whether two images from different views are captured at the same time. The model encodes each person's unified geometric features and appearance features for association and decodes the geometric features to predict the 2D positions in the original view. To train the model, we apply Hungarian matching to bridge the gap between instance-wise and image-wise distances, and then utilize synchronization labels for metric learning. To further reduce the solution space, we propose two types of self-supervised linear constraints: multi-view localization and pairwise edge association. Extensive …
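A hedged sketch of the Hungarian-matching step for cross-view association: one-to-one assignment of person features from two synchronized views under a cosine-distance cost. It uses SciPy's linear_sum_assignment; the random features and the cosine cost are illustrative stand-ins for the encoder outputs described above.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_views(feats_view_a: np.ndarray, feats_view_b: np.ndarray):
    """One-to-one association of person features from two synchronized views
    via Hungarian matching on a cosine-distance cost matrix."""
    a = feats_view_a / np.linalg.norm(feats_view_a, axis=1, keepdims=True)
    b = feats_view_b / np.linalg.norm(feats_view_b, axis=1, keepdims=True)
    cost = 1.0 - a @ b.T                      # (Na, Nb) cosine distance
    rows, cols = linear_sum_assignment(cost)  # minimum-cost assignment
    return list(zip(rows.tolist(), cols.tolist()))

# 4 detections in view A matched against 5 detections in view B.
print(match_views(np.random.randn(4, 128), np.random.randn(5, 128)))
```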
Poster
Jiaqi Zhao · Zeyu Ding · Yong Zhou · Hancheng Zhu · Wen-Liang Du · Rui Yao
[ ExHall D ]
Abstract
The diffusion model has been successfully applied to various detection tasks. However, it still faces several challenges when used for oriented object detection: objects that are arbitrarily rotated require the diffusion model to encode their orientation information; uncontrollable random boxes inaccurately locate objects with dense arrangements and extreme aspect ratios; oriented boxes result in the misalignment between them and image features. To overcome these limitations, we propose ReDiffDet, a framework that formulates oriented object detection as a rotation-equivariant denoising diffusion process. First, we represent an oriented box as a 2D Gaussian distribution, forming the basis of the denoising paradigm. The reverse process can be proven to be rotation-equivariant within this representation and model framework. Second, we design a conditional encoder with conditional boxes to prevent boxes from being randomly placed across the entire image. Third, we propose an aligned decoder for alignment between oriented boxes and image features. The extensive experiments demonstrate ReDiffDet achieves promising performance and significantly outperforms the diffusion model baseline.
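A small sketch of one common way to represent an oriented box as a 2D Gaussian: the mean sits at the box center and the covariance comes from the rotated half extents, so rotating the box rotates the Gaussian accordingly. The w/2, h/2 scaling convention is an illustrative assumption, not necessarily the paper's parameterization.

```python
import numpy as np

def obb_to_gaussian(cx, cy, w, h, theta):
    """Represent an oriented box (center, size, angle in radians) as a 2D
    Gaussian: mean at the box center, covariance from the rotated half
    extents (one common convention)."""
    mean = np.array([cx, cy])
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    cov = rot @ np.diag([(w / 2.0) ** 2, (h / 2.0) ** 2]) @ rot.T
    return mean, cov

mean, cov = obb_to_gaussian(100.0, 50.0, 40.0, 10.0, np.pi / 6)
print(mean, "\n", cov)
```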
Poster
Maochen Yang · Zekun Li · Jian Zhang · Lei Qi · Yinghuan Shi
[ ExHall D ]
Abstract
Semi-supervised crowd counting is crucial for addressing the high annotation costs of densely populated scenes. Although several methods based on pseudo-labeling have been proposed, it remains challenging to effectively and accurately utilize unlabeled data. In this paper, we propose a novel framework called \textbf{Taste More Taste Better} (TMTB), which emphasizes both data and model aspects. Firstly, we explore a data augmentation technique well-suited for the crowd counting task. By inpainting the background regions, this technique can effectively enhance data diversity while preserving the fidelity of the entire scene. Secondly, we introduce the Visual State Space Model (VSSM) as the backbone to capture global context information from crowd scenes, which is crucial for extremely crowded, low-light, and adverse weather scenarios. In addition to the traditional regression head for exact prediction, we employ an Anti-Noise classification head to provide coarser but more reliable supervision, since the regression head is sensitive to noise in manual annotations. We conduct extensive experiments on four benchmark datasets and show that our method outperforms state-of-the-art methods by a large margin. The source code is provided in the supplementary material.
Poster
Longtao Jiang · Zhendong Wang · Jianmin Bao · Wengang Zhou · Dongdong Chen · Lei Shi · Dong Chen · Houqiang Li
[ ExHall D ]
Abstract
Object removal has so far been dominated by the mask-and-inpaint paradigm, where the masked region is excluded from the input, leaving models to rely on unmasked areas to inpaint the missing region. However, this approach lacks contextual information for the masked area, often resulting in unstable performance. In this work, we introduce SmartEraser, built with a new removing paradigm called Masked-Region Guidance. This paradigm retains the masked region in the input, using it as guidance for the removal process. It offers several distinct advantages: (a) it guides the model to accurately identify the object to be removed, preventing its regeneration in the output; (b) since the user mask often extends beyond the object itself, it aids in preserving the surrounding context in the final result. Leveraging this new paradigm, we present Syn4Removal, a large-scale object removal dataset, where instance segmentation data is used to copy and paste objects onto images as removal targets, with the original images serving as ground truths. Experimental results demonstrate that our model, SmartEraser, significantly outperforms existing methods, achieving superior performance in object removal, especially in complex scenes with intricate compositions. We will release the code, dataset, and models.
Poster
Jae-Woo KIM · Ue-Hwan Kim
[ ExHall D ]
Abstract
While current state-of-the-art Scene Change Detection (SCD) approaches achieve impressive results on in-domain research data, they become unreliable under unseen environments and different temporal conditions; in-domain performance drops from 77.6\% to 8.0\% in a previously unseen environment and to 4.6\% under a different temporal condition---calling for a generalizable SCD method and benchmark. In this work, we propose the Generalizable Scene Change Detection Framework (GeSCF), which addresses unseen domain performance and temporal consistency---to meet the growing demand for anything SCD. Our method leverages the pre-trained Segment Anything Model (SAM) in a zero-shot manner. For this, we design Initial Pseudo-mask Generation and Geometric-Semantic Mask Matching---seamlessly turning a user-guided prompt and single-image-based segmentation into scene change detection for a pair of inputs without guidance. Furthermore, we define the Generalizable Scene Change Detection (GeSCD) benchmark along with novel metrics and an evaluation protocol to facilitate SCD research in generalizability. In the process, we introduce the ChangeVPR dataset, a collection of challenging image pairs with diverse environmental scenarios---including urban, suburban, and rural settings. Extensive experiments across various datasets demonstrate that GeSCF achieves an average performance gain of 19.2\% on existing SCD datasets and 30.0\% on the ChangeVPR dataset, nearly doubling the prior art performance. We believe our …
Poster
Weixiao Gao · Liangliang Nan · Hugo Ledoux
[ ExHall D ]
Abstract
Semantic segmentation in urban scene analysis has mainly focused on images or point clouds, while textured meshes—offering richer spatial representation—remain underexplored. This paper introduces SUM Parts, the \textbf{first} large-scale dataset for urban textured meshes with part-level semantic labels, covering about 2.5 km$^2$ with 21 classes. The dataset was created using our designed annotation tool, supporting both face- and texture-based annotations with efficient interactive selection. We also provide a comprehensive evaluation of 3D semantic segmentation and interactive annotation methods on this dataset.
Poster
Oliver Hahn · Christoph Reich · Nikita Araslanov · Daniel Cremers · Christian Rupprecht · Stefan Roth
[ ExHall D ]
Abstract
Unsupervised panoptic segmentation aims to partition an image into semantically meaningful regions and distinct object instances without training on manually annotated data. In contrast to prior work on unsupervised panoptic scene understanding, we eliminate the need for object-centric training data, enabling the unsupervised understanding of complex scenes. To that end, we present the first unsupervised panoptic method that directly trains on scene-centric imagery. In particular, we propose an approach to obtain high-resolution panoptic pseudo labels on complex scene-centric data by combining visual representations, depth, and motion cues. Utilizing both pseudo-label training and a panoptic self-training strategy yields a novel approach that accurately predicts panoptic segmentation of complex scenes without requiring any human annotations. Our approach significantly improves panoptic quality, e.g., surpassing the recent state of the art in unsupervised panoptic segmentation on Cityscapes by 9.4 percentage points in PQ.
Poster
Hongyi Zeng · Wenxuan Liu · Tianhua Xia · Jinhui Chen · Ziyun Li · Sai Qian Zhang
[ ExHall D ]
Abstract
Instance segmentation is essential for augmented reality and virtual reality (AR/VR) as it enables precise object recognition and interaction, enhancing the integration of virtual and real-world elements for an immersive experience. However, the high computational overhead of segmentation limits its application on resource-constrained AR/VR devices, causing large processing latency and degrading user experience. In contrast to conventional scenarios, AR/VR users typically focus on only a few regions within their field of view before shifting perspective, allowing segmentation to be concentrated on gaze-specific areas. This insight drives the need for efficient segmentation methods that prioritize processing the instance of interest (IOI), reducing computational load and enhancing real-time performance. In this paper, we present a \textit{foveated instance segmentation} (FovealSeg) framework that leverages real-time user gaze data to perform instance segmentation exclusively on the instance of interest, resulting in substantial computational savings. Evaluation results show that FSNet achieves an IoU of 0.52 on CityScapes and 0.43 on ADE20K, notably outperforming the baseline.
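A minimal sketch of the gaze-driven idea, assuming gaze coordinates are given in pixels and a fixed crop size (both hypothetical; this is not FovealSeg's actual pipeline):

```python
import torch
import torch.nn.functional as F

def foveated_crop(image, gaze_xy, crop_size=256):
    """Hypothetical sketch: crop a gaze-centred window so that segmentation
    only runs on the region containing the instance of interest (IOI).
    image: (C, H, W) tensor; gaze_xy: (x, y) in pixel coordinates."""
    _, H, W = image.shape
    x, y = gaze_xy
    half = crop_size // 2
    # clamp the window so it stays inside the image
    x0 = max(0, min(W - crop_size, x - half))
    y0 = max(0, min(H - crop_size, y - half))
    return image[:, y0:y0 + crop_size, x0:x0 + crop_size]

image = torch.randn(3, 1024, 2048)
crop = foveated_crop(image, gaze_xy=(1500, 400))
# the segmentation network then processes only the crop, e.g. after resizing:
crop_in = F.interpolate(crop.unsqueeze(0), size=(256, 256), mode="bilinear", align_corners=False)
print(crop.shape, crop_in.shape)
```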
Poster
Yushan Zhang · Aljoša Ošep · Laura Leal-Taixe · Tim Meinhardt
[ ExHall D ]
Abstract
Zero-shot 4D segmentation of arbitrary objects in Lidar is of crucial importance for embodied navigation, with applications ranging from streaming perception to semantic mapping and localization. However, the primary challenge in advancing research and developing generalized, versatile methods for spatio-temporal scene understanding in Lidar lies in the scarcity of datasets that provide the necessary diversity and scale of annotations. To overcome these challenges, we propose SAL-4D (Segment Anything in Lidar-4D), a method that utilizes multi-modal sensory robotic setups as a bridge to distill recent developments in Video Object Segmentation (VOS), in conjunction with off-the-shelf vision-language foundation models, to Lidar. We utilize VOS models to pseudo-label tracklets in short video sequences, annotate these tracklets with sequence-level CLIP tokens, and lift them to the 4D Lidar space using calibrated multi-modal sensory setups to distill them into our SAL-4D model. Due to temporally consistent predictions, we outperform prior art in 3D Zero-Shot Lidar Panoptic Segmentation (LPS) by over 5 PQ, and unlock Zero-Shot 4D LPS.
Poster
Markus Karmann · Onay Urfalioglu
[ ExHall D ]
Abstract
Recent progress in interactive point-prompt-based image segmentation has significantly reduced the manual effort required to obtain high-quality semantic labels. State-of-the-art unsupervised methods use self-supervised pre-trained models to obtain pseudo-labels, which are used to train a prompt-based segmentation model. In this paper, we propose a novel unsupervised and training-free approach based solely on the self-attention of Stable Diffusion. We interpret the self-attention tensor as a Markov transition operator, which enables us to iteratively construct a Markov chain. Pixel-wise counting of the number of iterations along the Markov chain required to reach a relative probability threshold yields a Markov-iteration-map, which we simply call a Markov-map. Compared to the raw attention maps, we show that our proposed Markov-map has less noise, sharper semantic boundaries, and more uniform values within semantically similar regions. We integrate the Markov-map in a simple yet effective truncated nearest neighbor framework to obtain interactive point-prompt-based segmentation. Despite being training-free, we experimentally show that our approach yields excellent results in terms of Number of Clicks (NoC), even outperforming state-of-the-art training-based unsupervised methods on most of the datasets.
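A hedged sketch of the Markov-map construction described above, assuming a row-stochastic self-attention matrix and a one-hot start at the prompted pixel (threshold and iteration budget are illustrative, not the paper's exact procedure):

```python
import torch

def markov_map(attn, seed_idx, thresh=0.5, max_iters=64):
    """Hedged sketch: treat a row-stochastic self-attention matrix as a Markov
    transition operator, start from a one-hot distribution at the prompt pixel,
    and record for every pixel the first iteration at which its relative
    probability crosses `thresh`.
    attn: (N, N) self-attention with rows summing to 1; seed_idx: prompt token index."""
    N = attn.shape[0]
    p = torch.zeros(N)
    p[seed_idx] = 1.0
    first_hit = torch.full((N,), float(max_iters))
    for t in range(1, max_iters + 1):
        p = p @ attn                          # one Markov step
        rel = p / (p.max() + 1e-12)           # relative probability
        newly = (rel >= thresh) & (first_hit == max_iters)
        first_hit[newly] = t
    return first_hit                          # lower = reached sooner = semantically closer

# toy example: random attention over a 16x16 token grid
attn = torch.softmax(torch.randn(256, 256), dim=-1)   # row-stochastic
mmap = markov_map(attn, seed_idx=120).reshape(16, 16)
print(mmap.min().item(), mmap.max().item())
```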
Poster
Saad Lahlali · Sandra Kara · Hejer AMMAR · Florian Chabot · Nicolas Granger · Hervé Le Borgne · Quoc Cuong PHAM
[ ExHall D ]
Abstract
Object discovery, which refers to the process of localizing objects without human annotations, has gained significant attention in recent years. Despite the growing interest in this task for 2D images, it remains under-explored in 3D data, where it is typically restricted to localizing a single object. Our work leverages the latest advances in 2D object discovery and proposes a novel framework to bridge the gap between 2D and 3D modalities. Our primary contributions are twofold: (i) we propose DIOD-3D, the first method for multi-object discovery in 3D data, using scene completion as a supporting task to enable dense object discovery from sparse inputs; (ii) we develop xMOD, a cross-modal training framework that integrates both 2D and 3D data, using objective functions tailored to accommodate the sparse nature of 3D data. xMOD uses teacher-student training across the two modalities to reduce confirmation bias by leveraging the domain gap. During inference, the model supports RGB-only, point-cloud-only, and multi-modal inputs. We validate the approach in all three settings, on synthetic photo-realistic and real-world datasets. Notably, our approach improves the F1@50 score over the state of the art by 8.7 points in real-world scenarios, demonstrating the potential of …
Poster
Shengqiong Wu · Hao Fei · Jingkang Yang · Xiangtai Li · Juncheng Li · Hanwang Zhang · Tat-seng Chua
[ ExHall D ]
Abstract
The recently emerged 4D Panoptic Scene Graph (4D-PSG) provides the most advanced representation to date for comprehensively modeling the dynamic 4D visual real world. Unfortunately, current pioneering 4D-PSG research suffers severely from data scarcity, as well as the resulting out-of-vocabulary problems; moreover, the pipeline nature of the benchmark generation method can lead to suboptimal performance. To address these challenges, this paper investigates a novel framework for 4D-PSG generation that leverages rich 2D visual scene annotations to enhance 4D scene learning. First, we introduce a 4D Large Language Model (4D-LLM) integrated with a 3D mask decoder for end-to-end generation of 4D-PSG. A chained SG inference mechanism is further designed to exploit LLMs' open-vocabulary capabilities to infer accurate and comprehensive object and relation labels iteratively. Most importantly, we propose a 2D-to-4D visual scene transfer learning framework, where a spatial-temporal scene transcending strategy effectively transfers dimension-invariant features from abundant 2D SG annotations to 4D scenes, compensating for data scarcity in 4D-PSG. Extensive experiments on the benchmark data demonstrate that we strikingly outperform baseline models by an average of 14.62%, highlighting the effectiveness of our method.
Poster
Jaime Corsetti · Francesco Giuliari · Alice Fasoli · Davide Boscaini · Fabio Poiesi
[ ExHall D ]
Abstract
Understanding functionalities in 3D scenes involves interpreting natural language descriptions to locate functional interactive objects, such as handles and buttons, in a 3D environment. Functionality understanding is highly challenging, as it requires both world knowledge to interpret language and spatial perception to identify fine-grained objects. For example, given a task like ‘turn on the ceiling light,’ an embodied AI agent must infer that it needs to locate the light switch, even though the switch is not explicitly mentioned in the task description. To date, no dedicated methods have been developed for this problem. In this paper, we introduce Fun3DU, the first approach designed for functionality understanding in 3D scenes. Fun3DU uses a language model to parse the task description through Chain-of-Thought reasoning in order to identify the object of interest. The identified object is segmented across multiple views of the captured scene by using a vision and language model. The segmentation results from each view are lifted in 3D and aggregated into the point cloud using geometric information. Fun3DU is training-free, relying entirely on pre-trained models. We evaluate Fun3DU on SceneFun3D, the most recent and only dataset to benchmark this task, which comprises over 3000 task descriptions on 230 scenes. …
Poster
Jialin Zhu · Jiangbei Yue · Feixiang He · He Wang
[ ExHall D ]
Abstract
Recently, 3D Gaussian Splatting (3DGS) has provided a new framework for novel view synthesis and sparked a new wave of research in neural rendering and related applications. As 3DGS is becoming a foundational component of many models, any improvement on 3DGS itself can bring huge benefits. To this end, we aim to improve the fundamental paradigm and formulation of 3DGS. We argue that, as an unnormalized mixture model, it needs to be neither Gaussian nor splatting. We subsequently propose a new mixture model consisting of flexible Student's t distributions, with both positive (splatting) and negative (scooping) densities. We name our model Student Splatting and Scooping, or SSS. While providing better expressivity, SSS also poses new challenges in learning. Therefore, we also propose a new principled sampling approach for optimization. Through exhaustive evaluation and comparison across multiple datasets, settings, and metrics, we demonstrate that SSS outperforms existing methods in terms of quality and parameter efficiency, e.g., achieving matching or better quality with similar numbers of components, and obtaining comparable results while reducing the component number by as much as 82%.
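For intuition, a toy 1D sketch of a signed Student's-t mixture with positive (splatting) and negative (scooping) components; the actual model uses anisotropic 3D components and a dedicated sampling scheme, so all values here are illustrative:

```python
import torch
from torch.distributions import StudentT

def sss_density(x, locs, scales, dfs, signs):
    """Hedged 1D sketch of a signed, unnormalized Student's-t mixture.
    signs = +1 for 'splatting' components, -1 for 'scooping' components."""
    total = torch.zeros_like(x)
    for loc, scale, df, s in zip(locs, scales, dfs, signs):
        comp = StudentT(df, loc=loc, scale=scale)
        total = total + s * comp.log_prob(x).exp()   # signed mixture of densities
    return total

x = torch.linspace(-4, 4, 9)
density = sss_density(
    x,
    locs=[-1.0, 1.0, 0.0],
    scales=[0.5, 0.5, 1.5],
    dfs=[3.0, 3.0, 5.0],
    signs=[+1.0, +1.0, -0.3],   # a mild negative 'scooping' component
)
print(density)
```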
Poster
Jiaxin Shi · Mingyue Xiang · Hao Sun · Yixuan Huang · Zhi Weng
[ ExHall D ]
Abstract
3D Vision Grounding (3DVG) is a fundamental research area that enables agents to perceive and interact with the 3D world. The challenge of the 3DVG task lies in understanding fine-grained semantics and spatial relationships within both the utterance and 3D scene. To address this challenge, we propose a zero-shot neuro-symbolic framework that utilizes a large language model (LLM) as neuro-symbolic functions to ground the object within the 3D Gaussian Splatting (3DGS) representation. By utilizing 3DGS representation, we can dynamically render high-quality 2D images from various viewpoints to enrich the semantic information. Given the complexity of spatial relationships, we construct a relationship graph and chain of semantics that decouple spatial relationships and facilitate step-by-step reasoning within 3DGS representation. Additionally, we employ a grounded-aware self-check mechanism to enable the LLM to reflect on its responses and mitigate the effects of ambiguity in spatial reasoning. We evaluate our method using two publicly available datasets, Nr3D and Sr3D, achieving accuracies of 60.8\% and 91.4\%, respectively. Notably, our method surpasses current state-of-the-art zero-shot methods on the Nr3D dataset. In addition, it outperforms the recent supervised models on the Sr3D dataset.
Poster
Jiangyong Huang · Baoxiong Jia · Yan Wang · Ziyu Zhu · Xiongkun Linghu · Qing Li · Song-Chun Zhu · Siyuan Huang
[ ExHall D ]
Abstract
Existing 3D vision-language (3D-VL) benchmarks fall short in evaluating 3D-VL models, creating a “mist” that obscures rigorous insights into model capabilities and 3D-VL tasks. This mist persists due to three key limitations. First, flawed test data, like ambiguous referential text in the grounding task, can yield incorrect and unreliable test results. Second, oversimplified metrics such as simply averaging accuracy per question answering (QA) pair, cannot reveal true model capability due to their vulnerability to language variations. Third, existing benchmarks isolate the grounding and QA tasks, disregarding the underlying coherence that QA should be based on solid grounding capabilities. To unveil the “mist”, we propose Beacon3D, a benchmark for 3D-VL grounding and QA tasks, delivering a perspective shift in the evaluation of 3D-VL understanding. Beacon3D features (i) high-quality test data with precise and natural language, (ii) object-centric evaluation with multiple tests per object to ensure robustness, and (iii) a novel chain-of-analysis paradigm to address language robustness and model performance coherence across grounding and QA. Our evaluation of state-of-the-art 3D-VL models on Beacon3D reveals that (i) object-centric evaluation elicits true model performance and particularly weak generalization in QA; (ii) grounding-QA coherence remains fragile in current 3D-VL models, and (iii) incorporating large language …
Poster
Qihang Peng · Henry Zheng · Gao Huang
[ ExHall D ]
Abstract
Embodied intelligence requires agents to interact with 3D environments in real time based on language instructions. A foundational task in this domain is ego-centric 3D visual grounding. However, the point clouds rendered from RGB-D images retain a large amount of redundant background data and inherent noise, both of which can interfere with the manifold structure of the target regions. Existing point cloud enhancement methods often require a tedious process to improve the manifold, which is not suitable for real-time tasks. We propose Proxy Transformation, suitable for multimodal tasks, to efficiently improve the point cloud manifold. Our method first leverages Deformable Point Clustering to identify the point cloud sub-manifolds in target regions. Then, we propose a Proxy Attention module that utilizes multimodal proxies to guide point cloud transformation. Built upon Proxy Attention, we design a submanifold transformation generation module where textual information globally guides translation vectors for different submanifolds, optimizing relative spatial relationships of target regions. Simultaneously, image information guides linear transformations within each submanifold, refining the local point cloud manifold of target regions. Extensive experiments demonstrate that Proxy Transformation significantly outperforms all existing methods, achieving an impressive improvement of 7.49% on easy targets and 4.60% on hard targets, while reducing …
Poster
Ronghao Dang · Yuqian Yuan · Wenqi Zhang · Yifei Xin · Boqiang Zhang · Long Li · Liuyi Wang · qinyang zeng · Xin Li · Lidong Bing
[ ExHall D ]
Abstract
The enhancement of generalization in robots by large vision-language models (LVLMs) is increasingly evident. Therefore, the embodied cognitive abilities of LVLMs based on egocentric videos are of great interest. However, current datasets for embodied video question answering lack comprehensive and systematic evaluation frameworks. Critical embodied cognitive issues, such as robotic self-cognition, dynamic scene perception, and hallucination, are rarely addressed. To tackle these challenges, we propose ECBench, a high-quality benchmark designed to systematically evaluate the embodied cognitive abilities of LVLMs. ECBench features a diverse range of scene video sources, open and varied question formats, and 30 dimensions of embodied cognition. To ensure quality, balance, and high visual dependence, ECBench uses class-independent, meticulous human annotation and multi-round question screening strategies. Additionally, we introduce ECEval, a comprehensive evaluation system that ensures the fairness and rationality of the indicators. Utilizing ECBench, we conduct extensive evaluations of proprietary, open-source, and task-specific LVLMs. ECBench is pivotal in advancing the embodied cognitive capabilities of LVLMs, laying a solid foundation for developing reliable core models for embodied agents. All data and code will be open-sourced.
Poster
Filippo Ziliotto · Tommaso Campari · Luciano Serafini · Lamberto Ballan
[ ExHall D ]
Abstract
Large Language Models (LLMs) have demonstrated excellent capabilities in composing various modules together to create programs that can perform complex reasoning tasks on images. In this paper, we propose TANGO, an approach that extends the program composition via LLMs already observed for images, aiming to integrate those capabilities into embodied agents capable of observing and acting in the world. Specifically, by employing a simple PointGoal Navigation model combined with a memory-based exploration policy as a foundational primitive for guiding an agent through the world, we show how a single model can address diverse tasks without additional training. We task an LLM with composing the provided primitives to solve a specific task, using only a few in-context examples in the prompt. We evaluate our approach on three key Embodied AI tasks: Open-Set ObjectGoal Navigation, Multi-Modal Lifelong Navigation, and Open Embodied Question Answering, achieving state-of-the-art results without any specific fine-tuning in challenging zero-shot scenarios.
Poster
Xiangyuan Xue · Zeyu Lu · Di Huang · ZiDong Wang · Wanli Ouyang · Lei Bai
[ ExHall D ]
Abstract
Much previous AI research has focused on developing monolithic models to maximize their intelligence, with the primary goal of enhancing performance on specific tasks. In contrast, this work studies the use of LLM-based agents to design collaborative AI systems autonomously. To explore this problem, we first introduce ComfyBench to evaluate agents' ability to design collaborative AI systems in ComfyUI. ComfyBench is a comprehensive benchmark comprising 200 diverse tasks covering various instruction-following generation challenges, along with detailed annotations for 3,205 nodes and 20 workflows. Based on ComfyBench, we further develop ComfyAgent, a novel framework that empowers LLM-based agents to autonomously design collaborative AI systems by generating workflows. ComfyAgent is based on two core concepts. First, it represents workflows with code, which can be reversibly converted into workflows and executed as collaborative systems by the interpreter. Second, it constructs a multi-agent system that cooperates to learn from existing workflows and generate new workflows for a given task. While experimental results demonstrate that ComfyAgent achieves a resolve rate comparable to o1-preview and significantly surpasses other agents on ComfyBench, ComfyAgent has resolved only 15\% of creative tasks. LLM-based agents still have a long way to go in autonomously designing collaborative AI systems. Progress with …
Poster
Yunzhi Zhang · Zizhang Li · Matt Zhou · Shangzhe Wu · Jiajun Wu
[ ExHall D ]
Abstract
We introduce the Scene Language, a visual scene representation that concisely and precisely describes the structure, semantics, and identity of visual scenes. It represents a scene with three key components: a program that specifies the hierarchical and relational structure of entities in the scene, words in natural language that summarize the semantic class of each entity, and embeddings that capture the visual identity of each entity. This representation can be inferred from pre-trained language models via a training-free inference technique, given text or image inputs. The resulting scene can be rendered into images using traditional, neural, or hybrid graphics renderers. Together, this forms an automated system for high-quality 3D and 4D scene generation. Compared with existing representations like scene graphs, our proposed Scene Language generates complex scenes with higher fidelity, while explicitly modeling the scene structures to enable precise control and editing.
Poster
Yongshuo Zong · Qin ZHANG · DONGSHENG An · Zhihua Li · Xiang Xu · Linghan Xu · Zhuowen Tu · Yifan Xing · Onkar Dabeer
[ ExHall D ]
Abstract
In this paper, we present a simple yet effective workflow for automatically scaling instruction-following data to elicit the pixel-level grounding capabilities of VLMs under complex instructions. We address five critical real-world challenges: hallucination, multi-object scenarios, reasoning, multi-granularity, and part-level reference. By distilling visual-language knowledge from a teacher model, our workflow generates instruction-response pairs that link with existing, abundant pixel-level annotations of the images, minimizing the need for human annotation. We refer to the resulting dataset as Ground-V, which captures extensive object localization knowledge and nuanced pixel-level referring expressions. Experimental results show that models of various architectures trained on Ground-V exhibit substantial improvements across diverse grounding tasks. Specifically, incorporating Ground-V during training directly achieves an average accuracy boost of 4.4% for LISA and 7.9% for PSALM across six benchmarks on the gIoU metric. It also sets new state-of-the-art results on standard benchmarks such as RefCOCO/+/g. Notably, on gRefCOCO, we achieve an N-Acc of 83.3%, exceeding the previous state-of-the-art by more than 20%.
Poster
Artemis Panagopoulou · Honglu Zhou · silvio savarese · Caiming Xiong · Chris Callison-Burch · Mark Yatskar · Juan Carlos Niebles
[ ExHall D ]
Abstract
Programming-based approaches to reasoning tasks have substantially expanded the types of questions models can answer about visual scenes. Yet on benchmark visual reasoning data, even when answering correctly, such models produce incorrect programs 33% of the time. These models are often right for the wrong reasons and risk unexpected failures on new data. Unit tests play a foundational role in ensuring code correctness and could be used to repair such failures. We propose Visual Unit Testing (ViUniT), a framework to improve the reliability of visual programs by automatically generating unit tests. In our framework, a unit test is represented as a novel image and answer meant to verify the logical correctness of a program produced for a given query. Our method leverages a language model to create unit tests in the form of image descriptions and expected answers, and image synthesis to produce the corresponding images. We conduct a comprehensive analysis of what constitutes an effective visual unit test suite, exploring unit test generation, sampling strategies, image generation methods, and varying the number of programs and unit tests. Additionally, we introduce four applications of visual unit tests: best program selection, answer refusal, re-prompting, and unsupervised reward formulations for reinforcement learning. Experiments with two models …
Poster
Lei Li · wei yuancheng · Zhihui Xie · Xuqing Yang · Yifan Song · Peiyi Wang · Chenxin An · Tianyu Liu · Sujian Li · Bill Yuchen Lin · Lingpeng Kong · Qi Liu
[ ExHall D ]
Abstract
Vision-language generative reward models (VL-GenRMs) play a crucial role in aligning and evaluating multimodal AI systems, yet their own evaluation remains under-explored. Current assessment methods primarily rely on AI-annotated preference labels from traditional VL tasks, which can introduce biases and often fail to effectively challenge state-of-the-art models. To address these limitations, we introduce VL-RewardBench, a comprehensive benchmark spanning general multimodal queries, visual hallucination detection, and complex reasoning tasks. Through our AI-assisted annotation pipeline, which combines sample selection with human verification, we curate 1,250 high-quality examples specifically designed to probe model limitations. Comprehensive evaluation across 16 leading large vision-language models demonstrates VL-RewardBench's effectiveness as a challenging testbed, where even GPT-4o achieves only 65.4\% accuracy, and state-of-the-art open-source models such as Qwen2-VL-72B struggle to surpass random guessing. Importantly, performance on VL-RewardBench strongly correlates (Pearson's r > 0.9) with MMMU-Pro accuracy using Best-of-N sampling with VL-GenRMs. Analysis experiments uncover three critical insights for improving VL-GenRMs: (i) models predominantly fail at basic visual perception tasks rather than reasoning tasks; (ii) inference-time scaling benefits vary dramatically by model capacity; and (iii) training VL-GenRMs to learn to judge substantially boosts judgment capability (+14.3\% accuracy for a 7B VL-GenRM). We believe VL-RewardBench along with the experimental insights will become a valuable resource for …
Poster
Xingrui Wang · Wufei Ma · Tiezheng Zhang · Celso M. de Melo · Jieneng Chen · Alan L. Yuille
[ ExHall D ]
Abstract
Although large multimodal models (LMMs) have demonstrated remarkable capabilities in visual scene interpretation and reasoning, their capacity for complex and precise 3-dimensional spatial reasoning remains uncertain. Existing benchmarks focus predominantly on 2D spatial understanding and lack a framework to comprehensively evaluate 6D spatial reasoning across varying complexities. To address this limitation, we present **PulseCheck457**, a scalable and unbiased synthetic dataset designed with **4** key spatial components: multi-object recognition, 2D and 3D spatial relationships, and 3D orientation. **PulseCheck457** supports a cascading evaluation structure, offering **7** question types across **5** difficulty levels that progress from basic single-object recognition to our newly proposed complex 6D spatial reasoning tasks. We evaluated various large multimodal models (LMMs) on **PulseCheck457**, observing a general decline in performance as task complexity increases, particularly in 3D reasoning and 6D spatial tasks. To quantify these challenges, we introduce the Relative Performance Dropping Rate (RPDR), highlighting key weaknesses in 3D reasoning capabilities. Leveraging the unbiased attribute design of our dataset, we also uncover prediction biases across different attributes, with similar patterns observed in real-world image settings.
Poster
Aayush Dhakal · Srikumar Sastry · Subash Khanal · Adeel Ahmad · Eric Xing · Nathan Jacobs
[ ExHall D ]
Abstract
The choice of representation for geographic location significantly impacts the accuracy of models for a broad range of geospatial tasks, including fine-grained species classification, population density estimation, and biome classification. Recent works like SatCLIP and GeoCLIP learn such representations by contrastively aligning geolocation with co-located images. While these methods work exceptionally well, we posit that the current training strategies fail to fully capture the important visual features. We provide an information-theoretic perspective on why the resulting embeddings from these methods discard crucial visual information that is important for many downstream tasks. To solve this problem, we propose a novel retrieval-augmented strategy called RANGE. We build our method on the intuition that the visual features of a location can be estimated by combining the visual features from multiple similar-looking locations. We evaluate our method across a wide variety of tasks. Our results show that RANGE outperforms the existing state-of-the-art models by significant margins on most tasks. We show gains of up to 13.1\% on classification tasks and 0.145 R$^2$ on regression tasks. All our code will be released on GitHub. Our models will be released on HuggingFace.
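A rough sketch of the retrieval intuition, assuming a database of location embeddings with pre-computed visual features (function names, the choice of k, and the softmax weighting are illustrative, not the RANGE formulation):

```python
import torch
import torch.nn.functional as F

def retrieval_augmented_feature(query_loc_emb, db_loc_embs, db_vis_feats, k=5, temp=0.1):
    """Hedged sketch: estimate the visual feature of a query location as a
    similarity-weighted average of visual features stored for the k most
    similar-looking database locations.
    query_loc_emb: (D,), db_loc_embs: (N, D), db_vis_feats: (N, Dv)."""
    q = F.normalize(query_loc_emb, dim=-1)
    db = F.normalize(db_loc_embs, dim=-1)
    sims = db @ q                                    # (N,) cosine similarities
    topk = sims.topk(k)
    weights = torch.softmax(topk.values / temp, dim=0)
    return (weights.unsqueeze(-1) * db_vis_feats[topk.indices]).sum(dim=0)

feat = retrieval_augmented_feature(
    torch.randn(256), torch.randn(10_000, 256), torch.randn(10_000, 512))
print(feat.shape)   # torch.Size([512])
```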
Poster
Jingyuan Yang · Jiawei Feng · Weibin Luo · Dani Lischinski · Daniel Cohen-Or · Hui Huang
[ ExHall D ]
Abstract
Affective Image Manipulation (AIM) seeks to modify user-provided images to evoke specific emotional responses. This task is inherently complex due to its twofold objective: significantly evoking the intended emotion, while preserving the original image composition. Existing AIM methods primarily adjust color and style, often failing to elicit precise and profound emotional shifts. Drawing on psychological insights, we introduce EmoEdit, which extends AIM by incorporating content modifications to enhance emotional impact. Specifically, we first construct EmoEditSet, a large-scale AIM dataset comprising 40,120 paired data through emotion attribution and data construction. To make existing generative models emotion-aware, we design the Emotion adapter and train it using EmoEditSet. We further propose an instruction loss to capture the semantic variations in data pairs. Our method is evaluated both qualitatively and quantitatively, demonstrating superior performance compared to existing state-of-the-art techniques. Additionally, we showcase the portability of our Emotion adapter to other diffusion-based models, enhancing their emotion knowledge with diverse semantics.
Poster
Qu Yang · QingHongYa Shi · Tongxin Wang · Mang Ye
[ ExHall D ]
Abstract
Understanding intention and emotion from social media poses unique challenges due to the inherent uncertainty in multimodal data, where posts often contain incomplete or missing modalities. While this uncertainty reflects real-world scenarios, it remains underexplored within the computer vision community, particularly in conjunction with the intrinsic relationship between emotion and intention. To address these challenges, we introduce the Multimodal IntentioN and Emotion Understanding in the Wild (MINE) dataset, comprising over 20,000 topic-specific social media posts with natural modality variations across text, image, video, and audio. MINE is distinctively constructed to capture both the uncertain nature of multimodal data and the implicit correlations between intentions and emotions, providing extensive annotations for both aspects. To tackle these scenarios, we propose the Bridging Emotion-Intention via Implicit Label Reasoning (BEAR) framework. BEAR consists of two key components: a BEIFormer that leverages emotion-intention correlations, and a Modality Asynchronous Prompt that handles modality uncertainty. Experiments show that BEAR outperforms existing methods in processing uncertain multimodal data while effectively mining emotion-intention relationships for social media content understanding. Dataset and code will be released.
Poster
Size Wu · Sheng Jin · Wenwei Zhang · Lumin Xu · Wentao Liu · Wei Li · Chen Change Loy
[ ExHall D ]
Abstract
Endowing Large Multimodal Models (LMMs) with visual grounding capability can significantly enhance AIs' understanding of the visual world and their interaction with humans. However, existing methods typically fine-tune the parameters of LMMs to learn additional segmentation tokens and overfit grounding and segmentation datasets. Such a design would inevitably cause a catastrophic diminution in the indispensable conversational capability of general AI assistants. In this paper, we comprehensively evaluate state-of-the-art grounding LMMs across a suite of multimodal question-answering benchmarks, observing drastic performance drops that indicate vanishing general knowledge comprehension and weakened instruction following ability. To address this issue, we present F-LMM---grounding \emph{frozen} off-the-shelf LMMs in human-AI conversations---a straightforward yet effective design based on the fact that word-pixel correspondences conducive to visual grounding inherently exist in the attention mechanism of well-trained LMMs. Using only a few trainable CNN layers, we can translate word-pixel attention weights to mask logits, which a SAM-based mask refiner can further optimise. Our F-LMM neither learns special segmentation tokens nor utilises high-quality grounded instruction-tuning data, but achieves competitive performance on referring expression segmentation and panoptic narrative grounding benchmarks while completely preserving LMMs' original conversational ability. Additionally, with instruction-following ability preserved and grounding ability obtained, our F-LMM can be directly …
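A minimal sketch of the frozen-LMM grounding idea: a small trainable CNN maps word-to-patch attention weights to mask logits (layer sizes and head count are assumptions, not the paper's exact design):

```python
import torch
import torch.nn as nn

class AttnToMask(nn.Module):
    """Hedged sketch: keep the LMM frozen and train only a few CNN layers that
    turn word-to-patch attention weights into mask logits."""
    def __init__(self, num_heads=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(num_heads, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 1),
        )

    def forward(self, word_patch_attn):
        # word_patch_attn: (B, num_heads, Hp, Wp) attention from a referred word
        # to the image patch tokens, collected from the frozen LMM
        return self.net(word_patch_attn)          # (B, 1, Hp, Wp) mask logits

attn = torch.rand(2, 32, 24, 24)
mask_logits = AttnToMask()(attn)
print(mask_logits.shape)   # torch.Size([2, 1, 24, 24])
# the logits can then be upsampled and handed to a SAM-based mask refiner
```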
Poster
Rui Qian · Xin Yin · Dejing Dou
[ ExHall D ]
Abstract
Current Large Multimodal Model (LMM)-empowered tasks such as visual grounding and segmentation typically rely on the <SEG> token as a text prompt to jointly optimize the vision-language model (e.g., LLaVA) and the downstream task-specific model (e.g., SAM). However, little research has looked into how this works when mapping the language vocabulary embedding into the corresponding vision codebook space. In this work, we first visualize the similarity maps, a.k.a. pseudo images, obtained by computing the dot-product similarity between the <SEG> token and the image token embeddings derived from the last hidden layer in both the LLaVA and SAM models. Intriguingly, we find that a striking consistency holds in terms of activation responses in the pseudo images, which reveals that what the <SEG> token contributes is the semantic correspondence from image-text pairs. Specifically, the <SEG> token, a placeholder expanded in the text vocabulary, extensively queries individual tokenized image patches to map the semantics of an object from text to the paired image while the Large Language Model (LLM) is being fine-tuned. Based on the above findings, we present READ, which facilitates LMMs' resilient REAsoning capability of where to attenD under the guidance of highly activated points …
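The pseudo-image visualization described above reduces to a dot product between the <SEG> token embedding and the image token embeddings; a minimal sketch (dimensions and the min-max normalization are illustrative):

```python
import torch

def seg_pseudo_image(seg_token, image_tokens, grid_hw):
    """Hedged sketch: the 'pseudo image' is the dot-product similarity between
    the <SEG> token embedding and every image token embedding from the last
    hidden layer, reshaped to the patch grid.
    seg_token: (D,), image_tokens: (N, D), grid_hw: (H, W) with H*W == N."""
    sims = image_tokens @ seg_token                                  # (N,) similarities
    sims = (sims - sims.min()) / (sims.max() - sims.min() + 1e-6)    # normalize for viewing
    return sims.reshape(grid_hw)

pseudo = seg_pseudo_image(torch.randn(4096), torch.randn(576, 4096), (24, 24))
print(pseudo.shape)   # torch.Size([24, 24]); high values mark the referred object
```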
Poster
Yanyuan Chen · Dexuan Xu · Yu Huang · Songkun Zhan · Hanpin Wang · Dongxue Chen · Xueping Wang · Meikang Qiu · Hang Li
[ ExHall D ]
Abstract
Currently, medical vision language models are widely used in medical vision question answering tasks. However, existing models are confronted with two issues: for input, the model only relies on text instructions and lacks direct understanding of visual clues in the image; for output, the model only gives text answers and lacks connection with key areas in the image. To address these issues, we propose a unified medical vision language model MIMO, with visual referring Multimodal Input and pixel grounding Multimodal Output. MIMO can not only combine visual clues and textual instructions to understand complex medical images and semantics, but can also ground medical terminologies in textual output within the image. To overcome the scarcity of relevant data in the medical field, we propose MIMOSeg, a comprehensive medical multimodal dataset including 895K samples. MIMOSeg is constructed from four different perspectives, covering basic instruction following and complex question answering with multimodal input and multimodal output. We conduct experiments on several downstream medical multimodal tasks. Extensive experimental results verify that MIMO can uniquely combine visual referring and pixel grounding capabilities, which are not available in previous models.
Poster
Yuzhong Zhao · Feng Liu · Yue Liu · Mingxiang Liao · Chen GONG · Qixiang Ye · Fang Wan
[ ExHall D ]
Abstract
One important task for multimodal models is to translate referred image regions into human-preferred language descriptions. Existing methods, however, ignore the resolution adaptability needs of different tasks, which hinders them from producing precise language descriptions. In this study, we propose DynRefer, an approach that pursues high-accuracy region-level referring by mimicking the resolution adaptability of human visual cognition. During training, DynRefer stochastically aligns language descriptions of multimodal tasks with images of multiple resolutions, which are constructed by nesting a set of random views around the referred region. This process essentially constructs a set of region representations, from which suitable representations for specific tasks can be matched. During inference, DynRefer performs selective multimodal referring by sampling proper region representations for tasks from the set of views based on image and task priors. This allows the visual information used for referring to better match human preferences, thereby improving the representational adaptability of region-level multimodal models. Experiments show that DynRefer brings mutual improvement upon broad tasks including region-level captioning, open-vocabulary region recognition, and attribute detection. Furthermore, DynRefer achieves state-of-the-art results on multiple region-level multimodal tasks using a single model. Code is enclosed in the supplementary material.
Poster
Zhen Yang · Zhuo Tao · Qi Chen · Yuankai Qi · Liang Li · Anton van den Hengel · Qingming Huang
[ ExHall D ]
Abstract
Knowledge-based visual question answering (KBVQA) separates image interpretation and knowledge retrieval into separate processes, motivated in part by the fact that they are very different tasks. In this paper, we transform the KBVQA into linguistic question-answering tasks so that we can leverage the rich world knowledge and strong reasoning abilities of Large Language Models (LLMs). The caption-then-question approach to KBVQA has been effective but relies on the captioning method to describe the detail required to answer every possible question. We propose instead a Question-Aware Captioner (QACap), which uses the question as guidance to extract correlated visual information from the image and generate a question-related caption. To train such a model, we utilize GPT-4 to build a corresponding high-quality question-aware caption dataset on top of existing KBVQA datasets. Extensive experiments demonstrate that our QACap model and dataset significantly improve KBVQA performance. Our method, QACap, achieves 68.2\% accuracy on the OKVQA validation set, 73.4\% on the direct-answer part of the A-OKVQA validation set, and 74.8\% on the multiple-choice part, all setting new SOTA benchmarks.
Poster
Hang Hua · Qing Liu · Lingzhi Zhang · Jing Shi · Soo Ye Kim · Zhifei Zhang · Yilin Wang · Jianming Zhang · Zhe Lin · Jiebo Luo
[ ExHall D ]
Abstract
The advent of large Vision-Language Models (VLMs) has significantly advanced multimodal tasks, enabling more sophisticated and accurate integration of visual and textual information across various applications, including image and video captioning, visual question answering, and cross-modal retrieval. Despite their superior capabilities, VLMs still struggle with fine-grained compositional image region descriptions. Specifically, they have difficulty recognizing arbitrary segmentation masks as referential inputs, interpreting compositional aspect instructions for referencing, and precisely describing the compositional aspects of a region. However, compositionality—the ability to understand and generate novel combinations of known visual and textual components—is critical for facilitating coherent reasoning and understanding across modalities in VLMs. To address this issue, we propose OpenCompositionCap, a new dataset for multi-grained region compositional image captioning that distinguishes itself from prior works by introducing the new task of compositional aspect-aware regional image captioning. To support this endeavor, we also introduce a new VLM model, FineCaption. The empirical results illustrate the effectiveness of our proposed model compared with other strong VLMs. In addition, we analyze the capabilities of current VLMs in recognizing various visual prompts for compositional region image captioning, highlighting areas for improvement in VLM design and training.
Poster
Yan Li · Yifei Xing · Xiangyuan Lan · Xin Li · Haifeng Chen · Dongmei Jiang
[ ExHall D ]
Abstract
Cross-modal alignment is crucial for multimodal representation fusion due to the inherent heterogeneity between modalities. While Transformer-based methods have shown promising results in modeling inter-modal relationships, their quadratic computational complexity limits their applicability to long-sequence or large-scale data. Although recent Mamba-based approaches achieve linear complexity, their sequential scanning mechanism poses fundamental challenges in comprehensively modeling cross-modal relationships. To address this limitation, we propose AlignMamba, an efficient and effective method for multimodal fusion. Specifically, grounded in Optimal Transport, we introduce a local cross-modal alignment module that explicitly learns token-level correspondences between different modalities. Moreover, we propose a global cross-modal alignment loss based on Maximum Mean Discrepancy to implicitly enforce the consistency between different modal distributions. Finally, the unimodal representations after local and global alignment are passed to the Mamba backbone for further cross-modal interaction and multimodal fusion. Extensive experiments on complete and incomplete multimodal fusion tasks demonstrate the effectiveness and efficiency of the proposed method.
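A hedged sketch of a global distribution-alignment term in the spirit of the MMD loss mentioned above, using a standard RBF-kernel estimator (kernel choice and bandwidth are illustrative):

```python
import torch

def rbf_mmd(x, y, sigma=1.0):
    """Hedged sketch: (biased) squared Maximum Mean Discrepancy with an RBF kernel
    between two sets of modality features x: (N, D) and y: (M, D)."""
    def kernel(a, b):
        d2 = torch.cdist(a, b).pow(2)                # pairwise squared distances
        return torch.exp(-d2 / (2 * sigma ** 2))
    k_xx = kernel(x, x).mean()
    k_yy = kernel(y, y).mean()
    k_xy = kernel(x, y).mean()
    return k_xx + k_yy - 2 * k_xy                    # approaches 0 as the distributions match

vision_feats = torch.randn(128, 256)
text_feats = torch.randn(128, 256)
loss_global = rbf_mmd(vision_feats, text_feats)
print(loss_global.item())
```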
Poster
Yuanmin Tang · Jing Yu · Keke Gai · Jiamin Zhuang · Gang Xiong · Gaopeng Gou · Qi Wu
[ ExHall D ]
Abstract
Zero-Shot Composed Image Retrieval (ZS-CIR) involves diverse tasks with a broad range of visual content manipulation intent across domain, scene, object, and attribute. The key challenge for ZS-CIR tasks is to modify a reference image according to manipulation text to accurately retrieve a target image, especially when the reference image is missing essential target content. In this paper, we propose a novel prediction-based mapping network, named PrediCIR, to adaptively predict the missing target visual content in reference images in the latent space before mapping for accurate ZS-CIR. Specifically, a world view generation module first constructs a source view by omitting certain visual content of a target view, coupled with an action that includes the manipulation intent derived from existing image-caption pairs. Then, a target content prediction module trains a world model as a predictor to adaptively predict the missing visual information guided by user intention in manipulating text at the latent space. The two modules map an image with the predicted relevant information to a pseudo-word token without extra supervision. Our model shows strong generalization ability on six ZS-CIR tasks. It obtains consistent and significant performance boosts ranging from 1.73% to 4.45% over the best methods and achieves new state-of-the-art …
Poster
Bangbang Zhou · Zuan Gao · Zixiao Wang · Boqiang Zhang · Yuxin Wang · Zhineng Chen · Hongtao Xie
[ ExHall D ]
Abstract
Due to the limited scale of multimodal table understanding (MTU) data, model performance is constrained. A straightforward approach is to use multimodal large language models to obtain more samples, but this may cause hallucinations, generate incorrect sample pairs, and incur significant cost. To address these issues, we design a simple yet effective synthesis framework that consists of two independent steps: table image rendering and table question and answer (Q\&A) pair generation. We use table codes (HTML, LaTeX, Markdown) to synthesize images and generate Q\&A pairs with a large language model (LLM). This approach leverages the LLM's high concurrency and low cost to boost annotation efficiency and reduce expenses. By inputting code instead of images, LLMs can directly access the content and structure of the table, reducing hallucinations in table understanding and improving the accuracy of generated Q\&A pairs. Finally, we synthesize a large-scale MTU dataset, SynTab, containing 636K images and 1.8M samples at a cost of under \$200 (USD). We further introduce a generalist tabular multimodal model, SynTab-LLaVA. This model not only effectively extracts local textual content within the table but also enables global modeling of relationships between cells. SynTab-LLaVA achieves SOTA performance on 21 out of 24 in-domain and out-of-domain benchmarks, demonstrating …
Poster
Daiqing Qi · Handong Zhao · Jing Shi · Simon Jenni · Yifei Fan · Franck Dernoncourt · Scott Cohen · Sheng Li
[ ExHall D ]
Abstract
Photographer, curator, and former director of photography at the Museum of Modern Art (MoMA), John Szarkowski remarked in *William Eggleston’s Guide*, “While editing directly from life, photographers have found it too difficult to see simultaneously both the blue and the sky.” Szarkowski insightfully revealed a notable gap between general and aesthetic visual understanding: while the former emphasizes identifying factual elements in an image (the sky), the latter transcends mere object identification, viewing it instead as an aesthetic component—a pure expanse of blue, valued purely as a color block in visual aesthetics. Such distinctions between general visual understanding (detection, localization, etc.) and aesthetic perception (color, lighting, composition, etc.) pose a significant challenge for existing Multimodal Large Language Models (MLLMs) in comprehending image aesthetics, which is increasingly needed in real-world applications, from image recommendation and enhancement to generation. To fundamentally advance the aesthetic understanding of MLLMs, we introduce a novel dataset, PhotoCritique, derived from extensive discussions among professional photographers and enthusiasts, distinguished by its large scale, expertise, and diversity. Additionally, we propose a new model, PhotoEye, an MLLM featuring a language-guided multi-view vision fusion mechanism for understanding image aesthetics from multiple perspectives. Finally, we introduce PhotoBench, a comprehensive and professional benchmark for …
Poster
Jun Chen · Dannong Xu · Junjie Fei · Chun-Mei Feng · Mohamed Elhoseiny
[ ExHall D ]
Abstract
Large multimodal models (LMMs) have achieved impressive progress in vision-language understanding, yet they face limitations in real-world applications requiring complex reasoning over a large number of images. Existing benchmarks for multi-image question answering are limited in scope: each question is paired with only up to 30 images, which does not fully capture the demands of the large-scale retrieval tasks encountered in real-world usage. To reduce these gaps, we introduce two document haystack benchmarks, dubbed DocHaystack and InfoHaystack, designed to evaluate LMM performance on large-scale visual document retrieval and understanding. Additionally, we propose V-RAG, a novel, vision-centric retrieval-augmented generation (RAG) framework that leverages a suite of multimodal vision encoders, each optimized for specific strengths, and a dedicated question-document relevance module. V-RAG sets a new standard, with a 9\% and 11\% improvement in Recall@1 on the challenging DocHaystack-1000 and InfoHaystack-1000 benchmarks, respectively, compared to the previous best baseline models. Additionally, integrating V-RAG with LMMs enables them to operate efficiently across thousands of images, yielding significant improvements on our DocHaystack and InfoHaystack benchmarks. Our code and datasets will be made publicly available.
Poster
Ryota Tanaka · Taichi Iki · Taku Hasegawa · Kyosuke Nishida · Kuniko Saito · Jun Suzuki
[ ExHall D ]
Abstract
We aim to develop a retrieval-augmented generation (RAG) framework capable of answering questions over a corpus of visually-rich documents presented in mixed modalities (e.g., charts, tables) and diverse formats (e.g., PDF, PPTX). In this paper, we present a new RAG framework, VDocRAG, which can directly understand varied documents and modalities in a unified image format to prevent missing information that occurs by parsing documents to obtain text. To improve the performance of VDocRAG, we propose novel self-supervised pre-training tasks that adapt large vision-language models for retrieval by compressing visual information into dense token representations while aligning them with textual content in documents. Furthermore, we introduce OpenDocVQA, the first unified collection of open-domain document visual question answering datasets, encompassing diverse document types and formats. OpenDocVQA provides a comprehensive resource for training and evaluating retrieval and question answering models on visually-rich documents in an open-domain setting. Experiments show that VDocRAG substantially outperforms conventional text-based RAG and has strong generalization capability, highlighting the potential of an effective RAG paradigm for real-world documents.
Poster
Linke Ouyang · Yuan Qu · Hongbin Zhou · Jiawei Zhu · Rui Zhang · Qunshu Lin · Bin Wang · Zhiyuan Zhao · Man Jiang · Xiaomeng Zhao · Jin Shi · Fan Wu · Pei Chu · Minghao Liu · Zhenxiang Li · Chao Xu · Bo Zhang · Botian Shi · Zhongying Tu · Conghui He
[ ExHall D ]
Abstract
Document content extraction is crucial in computer vision, especially for meeting the high-quality data needs of large language models (LLMs) and retrieval-augmented generation (RAG) technologies. However, current document parsing methods suffer from significant limitations in terms of diversity and comprehensive evaluation. To address these challenges, we introduce OmniDocBench, a novel multi-source benchmark designed to advance automated document content extraction. OmniDocBench includes a meticulously curated and annotated high-quality evaluation dataset comprising nine diverse document types, such as academic papers, textbooks, slides, among others. Our benchmark provides a flexible and comprehensive evaluation framework with 19 layout category labels and 14 attribute labels, enabling multi-level assessments across entire datasets, individual modules, or specific data types. Using OmniDocBench, we perform an exhaustive comparative analysis of existing modular pipelines and multimodal end-to-end methods, highlighting their limitations in handling document diversity and ensuring fair evaluation. OmniDocBench establishes a robust, diverse, and fair evaluation standard for the document content extraction field, offering crucial insights for future advancements and fostering the development of document parsing technologies.
Poster
Haoxin Li · Boyang Li
[ ExHall D ]
Abstract
Despite impressive advancements in various multimodal tasks, vision-language models (VLMs) still struggle with compositional understanding due to limited exposure to training samples that contain subtle variations within paired examples. With advances in multimodal generative models, a natural solution is to generate synthetic samples with subtle variations for training VLMs. However, generating and training on such synthetic samples presents two challenges: difficulty in accurately creating precise variations and inconsistency in cross-modal alignment quality. To address these challenges, we propose SVD-GT (Subtle Variation Data Generation and Training), which integrates image feature injection into a text-to-image generative model to enhance the quality of synthetic variations and employs an adaptive margin loss to differentiate samples, which helps filter out potentially incorrect synthetic samples and focuses the learning on informative hard samples. Evaluations on four compositional understanding benchmarks demonstrate that SVD-GT significantly improves the compositionality of VLMs, boosting the average accuracy of CLIP by over 8% across all benchmarks and outperforming state-of-the-art methods by 2% on three benchmarks.
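A minimal sketch of an adaptive-margin objective of the kind described above, where each synthetic hard negative carries its own margin (the hinge form and cosine similarity are assumptions, not SVD-GT's exact loss):

```python
import torch
import torch.nn.functional as F

def adaptive_margin_loss(img_emb, pos_txt_emb, neg_txt_emb, margins):
    """Hedged sketch: each (image, caption, synthetic hard-negative caption) triplet
    gets its own margin, so dubious synthetic negatives (small margin) contribute
    little while reliable, informative ones (large margin) dominate the gradient.
    All embeddings: (B, D); margins: (B,)."""
    img = F.normalize(img_emb, dim=-1)
    pos = F.normalize(pos_txt_emb, dim=-1)
    neg = F.normalize(neg_txt_emb, dim=-1)
    s_pos = (img * pos).sum(-1)                      # cosine similarity to the true caption
    s_neg = (img * neg).sum(-1)                      # similarity to the subtle-variation negative
    return F.relu(s_neg - s_pos + margins).mean()    # hinge with per-sample margin

B, D = 8, 512
loss = adaptive_margin_loss(torch.randn(B, D), torch.randn(B, D),
                            torch.randn(B, D), margins=torch.rand(B) * 0.2)
print(loss.item())
```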
Poster
Gensheng Pei · Tao Chen · Yujia Wang · Xinhao Cai · Xiangbo Shu · Tianfei Zhou · Yazhou Yao
[ ExHall D ]
Abstract
The CLIP model has demonstrated significant advancements in aligning visual and language modalities through large-scale pre-training on image-text pairs, enabling strong zero-shot classification and retrieval capabilities on various domains. However, CLIP's training remains computationally intensive, with high demands on both data processing and memory. To address these challenges, recent masking strategies have emerged, focusing on the selective removal of image patches to improve training efficiency. Although effective, these methods often compromise key semantic information, resulting in suboptimal alignment between visual features and text descriptions. In this work, we present a concise yet effective approach called Patch Generation-to-Selection (CLIP-PGS) to enhance CLIP's training efficiency while preserving critical semantic content. Our method introduces a gradual masking process in which a small set of candidate patches is first pre-selected as potential mask regions. Then, we apply Sobel edge detection across the entire image to generate an edge mask that prioritizes the retention of the primary object areas. Finally, similarity scores between the candidate mask patches and their neighboring patches are computed, with optimal transport normalization refining the selection process to ensure a balanced similarity matrix. Our approach, CLIP-PGS, sets new state-of-the-art results in zero-shot classification and retrieval tasks, achieving superior performance in robustness evaluation and …
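The Sobel-based step above can be illustrated with a short sketch that scores each ViT patch by its average edge magnitude, so edge-dense (object) patches are preferentially retained (patch size and the top-k rule are illustrative, not CLIP-PGS's full pipeline):

```python
import torch
import torch.nn.functional as F

def patch_edge_scores(image, patch=16):
    """Hedged sketch: run Sobel filters over the grayscale image and average the
    gradient magnitude inside each patch; high-scoring patches likely cover
    primary object areas and are kept, low-scoring ones are masking candidates.
    image: (B, 3, H, W) with H, W divisible by `patch`."""
    gray = image.mean(dim=1, keepdim=True)
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)                               # Sobel kernels for x and y
    gx = F.conv2d(gray, kx, padding=1)
    gy = F.conv2d(gray, ky, padding=1)
    mag = torch.sqrt(gx ** 2 + gy ** 2)
    # average edge magnitude per ViT patch
    return F.avg_pool2d(mag, kernel_size=patch).squeeze(1)   # (B, H/patch, W/patch)

scores = patch_edge_scores(torch.rand(2, 3, 224, 224))
print(scores.shape)                                   # torch.Size([2, 14, 14])
keep = scores.flatten(1).topk(k=98, dim=1).indices    # e.g. retain the 50% most edge-dense patches
```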
Poster
Xugong Qin · peng zhang · Jun Jie Ou Yang · Gangyan Zeng · Yubo Li · Yuanyuan Wang · Wanqian Zhang · Pengwen Dai
[ ExHall D ]
Abstract
Scene Text Retrieval (STR) seeks to identify all images containing a given query string. Existing methods typically rely on an explicit Optical Character Recognition (OCR) process of text spotting or localization, which is susceptible to complex pipelines and accumulated errors. To address this, we resort to Contrastive Language-Image Pre-training (CLIP) models, which have demonstrated the capacity to perceive and understand scene text, making it possible to achieve strictly OCR-free STR. From the perspective of parameter-efficient transfer learning, a lightweight visual position adapter is proposed to provide a positional information complement for the visual encoder. Besides, we introduce a visual context dropout technique to improve the alignment of local visual features. A novel, parameter-free cross-attention mechanism transfers the contrastive relationship between images and text to that between tokens and text, producing a rich cross-modal representation which can be utilized for efficient reranking with a linear classifier. The resulting model, CAYN, achieves new state-of-the-art performance on the STR task, with 92.46\%/89.49\%/85.98\% mAP on the SVT/IIIT-STR/TTR datasets at 38.79 FPS on a single GeForce GTX 1080 Ti. Our findings demonstrate that CLIP can serve as a reliable and efficient solution for OCR-free STR, with no more than 0.50M additional parameters required. The …
Poster
Rui Xiao · Sanghwan Kim · Iuliana Georgescu · Zeynep Akata · Stephan Alaniz
[ ExHall D ]
Abstract
CLIP has shown impressive results in aligning images and text at scale. However, its ability to capture detailed visual features remains limited because CLIP matches images and texts at a global level. To address this issue, we propose FLAIR, Fine-grained Language-informed Image Representations, an approach that utilizes long and detailed image descriptions to learn localized image embeddings. By sampling diverse sub-captions that describe fine-grained details about an image, we train our vision-language model to produce not only global embeddings but also text-specific image representations. Our model introduces text-conditioned attention pooling on top of local image tokens to produce fine-grained image representations that excel at retrieving detailed image content. We achieve state-of-the-art performance on both existing multimodal retrieval benchmarks and our newly introduced fine-grained retrieval task, which evaluates vision-language models' ability to retrieve partial image content. Furthermore, our experiments demonstrate the effectiveness of FLAIR, trained on 30M image-text pairs, in capturing fine-grained visual information, including zero-shot semantic segmentation, outperforming models trained on billions of pairs. Code and model checkpoints will be released upon acceptance.
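A minimal sketch of text-conditioned attention pooling: the caption embedding queries the local image tokens via cross-attention (dimensions and the single-query formulation are assumptions, not FLAIR's exact module):

```python
import torch
import torch.nn as nn

class TextConditionedPool(nn.Module):
    """Hedged sketch: a caption embedding acts as the query and attends over local
    image tokens, producing a text-specific image representation rather than a
    single global vector."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text_emb, image_tokens):
        # text_emb: (B, D) caption embedding; image_tokens: (B, N, D) patch tokens
        q = text_emb.unsqueeze(1)                        # (B, 1, D) query
        pooled, _ = self.attn(q, image_tokens, image_tokens)
        return pooled.squeeze(1)                         # (B, D) text-specific image embedding

pool = TextConditionedPool()
out = pool(torch.randn(4, 512), torch.randn(4, 196, 512))
print(out.shape)   # torch.Size([4, 512])
```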
Poster
Yuheng Feng · Changsong Wen · Zelin Peng · Li Jiaye · Siyu Zhu
[ ExHall D ]
Abstract
Contrastive language-image pretraining models like CLIP have shown strong performance in various text-image alignment tasks. However, CLIP’s 77-token input limit and short-text training data restrict its effectiveness in long-text tasks. To address these limitations, we introduce LongD-CLIP, a dual-teacher distillation framework that enhances long-text representation while preventing knowledge forgetting. In our approach, a teacher model fine-tuned on long-text data distills rich representation knowledge into the student model, while the original CLIP model serves as a secondary teacher to help the student retain foundational knowledge. Experimental results show that LongD-CLIP achieves substantial improvements across long-text retrieval, short-text retrieval, and zero-shot image classification tasks. For instance, in the image-to-text retrieval task on the ShareGPT4V test set, LongD-CLIP outperforms Long-CLIP by 2.5%, achieving 98.3%. On the Urban-1k dataset, it shows a 9.2% improvement, reaching 91.9%, which demonstrates its robust generalization ability. Additionally, LongD-CLIP’s text encoder exhibits reduced drift in latent space and improved compatibility with existing generative models, effectively overcoming the 77-token input constraint.
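A dual-teacher distillation objective of the kind described above could be combined as below; the cosine-similarity form and the loss weights are illustrative assumptions, not the paper's exact objective.

```python
# Sketch of a dual-teacher distillation loss: one term pulls the student toward
# a long-text fine-tuned teacher, the other anchors it to the original CLIP
# text encoder to limit forgetting (weights are hypothetical).
import torch
import torch.nn.functional as F

def dual_teacher_loss(student_emb, long_teacher_emb, clip_teacher_emb,
                      w_long=1.0, w_retain=0.5):
    # All inputs: (B, D) text embeddings for the same batch of captions.
    loss_long = 1.0 - F.cosine_similarity(student_emb, long_teacher_emb, dim=-1).mean()
    loss_retain = 1.0 - F.cosine_similarity(student_emb, clip_teacher_emb, dim=-1).mean()
    return w_long * loss_long + w_retain * loss_retain

loss = dual_teacher_loss(torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 512))
```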
Poster
Dahyun Kang · Piotr Bojanowski · Huy V. Vo · Théo Moutakanni · Cijo Jose · Federico Baldassarre · Patrick Labatut · Michael Ramamonjisoa · Maxime Oquab · Timothée Darcet · Hu Xu · Shang-Wen Li · Oriane Simeoni · Marc Szafraniec
[ ExHall D ]
Abstract
Self-supervised visual foundation models produce powerful embeddings that achieve remarkable performance on a wide range of downstream tasks. However, unlike vision-language models such as CLIP, self-supervised visual features are not readily aligned with language, hindering their adoption in open-vocabulary tasks. Our method, named dino.txt, unlocks this new ability for DINOv2, a widely used self-supervised visual encoder. We build upon the LiT training strategy, which trains a text encoder to align with a frozen vision model, but leads to unsatisfactory results on dense tasks. We propose several key ingredients to improve performance on both global and dense tasks, such as concatenating the [CLS] token with the patch average to train the alignment, and curating data using both text and image modalities. With these, we successfully train a CLIP-like model with only a fraction of the computational cost compared to CLIP while achieving state-of-the-art results in zero-shot classification and open-vocabulary semantic segmentation.
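The alignment head hinted at above, concatenating the [CLS] token with the average patch token of a frozen DINOv2 backbone and training it LiT-style against a text encoder, might look roughly like this sketch; the class name and dimensions are assumptions.

```python
# Sketch of a LiT-style alignment head on top of a frozen vision backbone:
# the visual embedding is the [CLS] token concatenated with the average patch
# token, projected to the text embedding dimension.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrozenBackboneAlignmentHead(nn.Module):
    def __init__(self, vis_dim=768, txt_dim=512):
        super().__init__()
        self.proj = nn.Linear(2 * vis_dim, txt_dim)

    def forward(self, backbone_tokens):
        # backbone_tokens: (B, 1 + N, D) output of the frozen vision encoder,
        # with the [CLS] token first.
        cls_token = backbone_tokens[:, 0]
        patch_avg = backbone_tokens[:, 1:].mean(dim=1)
        visual = self.proj(torch.cat([cls_token, patch_avg], dim=-1))
        return F.normalize(visual, dim=-1)   # compared to text embeddings contrastively

head = FrozenBackboneAlignmentHead()
emb = head(torch.randn(4, 257, 768))
```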
Poster
Davide Berasi · Matteo Farina · Massimiliano Mancini · Elisa Ricci · Nicola Strisciuglio
[ ExHall D ]
Abstract
Vision-Language Models (VLMs) learn a shared feature space for text and images, enabling the comparison of inputs of different modalities. While prior works demonstrated that VLMs organize natural language representations into regular structures encoding composite meanings, it remains unclear if compositional patterns also emerge in the visual embedding space. In this work, we investigate compositionality in the image domain, where the analysis of compositional properties is challenged by noise and sparsity of visual data. We propose a framework, called Geodesically Decomposable Embeddings (GDE), that addresses these problems and approximates image representations with geometry-aware compositional structures in the latent space. We demonstrate that visual embeddings of pre-trained VLMs exhibit a compositional arrangement, and evaluate the effectiveness of this property in the tasks of compositional classification and group robustness. GDE achieves stronger performance in compositional classification compared to its counterpart method that assumes linear geometry of the latent space. Notably, it is particularly effective for group robustness, where we achieve higher results than task-specific solutions. Our results indicate that VLMs can automatically develop a human-like form of compositional reasoning in the visual domain, making their underlying processes more interpretable.
Poster
Jiuhai Chen · Jianwei Yang · Haiping Wu · Dianqi Li · Jianfeng Gao · Tianyi Zhou · Bin Xiao
[ ExHall D ]
Abstract
We present Florence-VL, a new family of multimodal large language models (MLLMs) with enriched visual representations produced by Florence-2, a generative vision foundation model. Unlike the widely used CLIP-style vision transformer trained by contrastive learning, Florence-2 can capture different levels and aspects of visual features, which are more versatile and can be adapted to diverse downstream tasks. We propose a novel feature-fusion architecture and an innovative training recipe that effectively integrates Florence-2's visual features into pretrained LLMs, such as Phi 3.5 and Llama 3. In particular, we propose "depth-breadth fusion" (DBFusion) to fuse the visual features extracted from different depths and under multiple prompts. Our model training is composed of end-to-end pretraining of the whole model followed by finetuning of the projection layer and the LLM, on a carefully designed recipe of diverse open-source datasets that include high-quality image captions and instruction-tuning pairs. Our quantitative analysis and visualization of Florence-VL's visual features show its advantages over popular vision encoders on vision-language alignment, where the enriched depth and breadth play important roles. Florence-VL achieves significant improvements over existing state-of-the-art MLLMs across various multi-modal and vision-centric benchmarks covering general VQA, perception, hallucination, OCR, chart, knowledge-intensive understanding, etc. To facilitate future research, our models …
Poster
Chenyu Yang · Xuan Dong · Xizhou Zhu · Weijie Su · Jiahao Wang · Hao Tian · Zhe Chen · Wenhai Wang · Lewei Lu · Jifeng Dai
[ ExHall D ]
Abstract
Large Vision-Language Models (VLMs) have been extended to understand both images and videos. Visual token compression is leveraged to reduce the considerable token length of visual inputs. To meet the needs of different tasks, existing high-performance models usually process images and videos separately with different token compression strategies, limiting the capabilities of combining images and videos. To this end, we extend each image into a "static" video and introduce a unified token compression strategy called Progressive Visual Token Compression (PVC), where the tokens of each frame are progressively encoded and adaptively compressed to supplement the information not extracted from previous frames. Video tokens are efficiently compressed by exploiting the inherent temporal redundancy. Images are repeated as static videos, and the spatial details can be gradually supplemented in multiple frames. PVC unifies the token compression of images and videos. With a limited number of tokens per frame (64 tokens by default), spatial details and temporal changes can still be preserved. Experiments show that our model achieves state-of-the-art performance across various video understanding benchmarks, including long video tasks and fine-grained short video tasks. Meanwhile, our unified token compression strategy incurs no performance loss on image benchmarks, particularly in detail-sensitive tasks.
Poster
Yaqi Zhao · Yuanyang Yin · Lin Li · Mingan Lin · Victor Shea-Jay Huang · Siwei Chen · Weipeng Chen · Baoqun Yin · Zenan Zhou · Wentao Zhang
[ ExHall D ]
Abstract
Does seeing always mean knowing? Large Vision-Language Models (LVLMs) integrate separately pre-trained vision and language components, often using CLIP-ViT as vision backbone. However, these models frequently encounter a core issue of "cognitive misalignment" between the vision encoder (VE) and the large language model (LLM). Specifically, the VE's representation of visual information may not fully align with LLM's cognitive framework, leading to a mismatch where visual features exceed the language model’s interpretive range. To address this, we investigate how variations in VE representations influence LVLM comprehension, especially when the LLM faces VE-Unknown data—images whose ambiguous visual representations challenge the VE’s interpretive precision. Accordingly, we construct a multi-granularity landmark dataset and systematically examine the impact of VE-Known and VE-Unknown data on interpretive abilities. Our results show that VE-Unknown data limits LVLM’s capacity for accurate understanding, while VE-Known data, rich in distinctive features, helps reduce cognitive misalignment. Building on these insights, we propose Entity-Enhanced Cognitive Alignment (EECA), a method that employs multi-granularity supervision to generate visually enriched, well-aligned tokens that not only integrate within the embedding space but also align with the LLM's cognitive framework. This alignment markedly enhances LVLM performance in landmark recognition. Our findings underscore the challenges posed by VE-Unknown data and highlight …
Poster
Luo · Xue Yang · Wenhan Dou · Zhaokai Wang · Jiawen Liu · Jifeng Dai · Yu Qiao · Xizhou Zhu
[ ExHall D ]
Abstract
In this paper, we focus on monolithic Multimodal Large Language Models (MLLMs) that integrate visual encoding and language decoding into a single LLM. In particular, we identify that existing pre-training strategies for monolithic MLLMs often suffer from unstable optimization or catastrophic forgetting. To address this issue, our core idea is to embed a new visual parameter space into a pre-trained LLM, thereby stably learning visual knowledge from noisy data while freezing the LLM. Based on this principle, we present Mono-InternVL, a novel monolithic MLLM that seamlessly integrates a set of visual experts via a multimodal mixture-of-experts structure. Moreover, we propose an innovative pre-training strategy to maximize the visual capability of Mono-InternVL, namely Endogenous Visual Pre-training (EViP). In particular, EViP is designed as a progressive learning process for visual experts, which aims to fully exploit the visual knowledge from noisy data to high-quality data. To validate our approach, we conduct extensive experiments on 16 benchmarks. Experimental results confirm the superior performance of Mono-InternVL over existing monolithic MLLMs on 13 of 16 multimodal benchmarks, e.g., +80 points over Emu3 on OCRBench. Compared to the modular baseline, i.e., InternVL-1.5, Mono-InternVL still retains comparable multimodal performance while reducing up to 67% first token latency. …
Poster
Xubing Ye · Yukang Gan · Yixiao Ge · Xiao-Ping Zhang · Yansong Tang
[ ExHall D ]
Abstract
Large Vision Language Models (LVLMs) have achieved significant success across multi-modal tasks. However, the computational cost of processing long visual tokens can be prohibitively expensive on resource-limited devices. Previous methods have identified redundancy in visual tokens within the Large Language Model (LLM) decoder layers and have mitigated this by pruning tokens using a pre-defined or fixed ratio, thereby reducing computational overhead. Nonetheless, we observe that the impact of pruning ratio varies across different LLM layers and instances (image-prompt pairs). Therefore, it is essential to develop a layer-wise and instance-wise vision token pruning strategy to balance computational cost and model performance effectively. We propose ATP-LLaVA, a novel approach that adaptively determines instance-specific token pruning ratios for each LLM layer. Specifically, we introduce an Adaptive Token Pruning (ATP) module, which computes the importance score and pruning threshold based on input instance adaptively. The ATP module can be seamlessly integrated between any two LLM layers with negligible computational overhead. Additionally, we develop a Spatial Augmented Pruning (SAP) strategy that prunes visual tokens with both token redundancy and spatial modeling perspectives. Our approach reduces the average token count by 75% while maintaining performance, with only a minimal 1.9% degradation across seven widely used benchmarks.
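An instance-adaptive pruning module in the spirit described above could compute a per-token importance score and a per-instance threshold as sketched below; the module structure and names are assumptions rather than the paper's ATP module.

```python
# Sketch of instance-adaptive visual token pruning between two LLM layers:
# a per-token importance score and a per-instance threshold decide which
# visual tokens survive (structure and names are illustrative).
import torch
import torch.nn as nn

class AdaptiveTokenPruner(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score_head = nn.Linear(dim, 1)
        self.threshold_head = nn.Linear(dim, 1)

    def forward(self, visual_tokens):
        # visual_tokens: (B, N, D) hidden states of the visual tokens.
        scores = self.score_head(visual_tokens).squeeze(-1)            # (B, N)
        threshold = self.threshold_head(visual_tokens.mean(dim=1))     # (B, 1)
        keep_mask = scores > threshold                                 # (B, N) boolean
        return keep_mask   # applied inside the decoder to drop masked tokens

pruner = AdaptiveTokenPruner(dim=4096)
mask = pruner(torch.randn(2, 576, 4096))
```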
Poster
Dominik Schnaus · Nikita Araslanov · Daniel Cremers
[ ExHall D ]
Abstract
The platonic representation hypothesis suggests that vision and language embeddings become more homogeneous as model and dataset sizes increase. In particular, pairwise distances within each modality become more similar. This suggests that as foundation models mature, it may become possible to match vision and language embeddings in a fully unsupervised fashion, i.e., without parallel data. We present the first study towards this prospect, and investigate the conformity of existing vision and language foundation models in the context of "blind" matching. First, we formulate unsupervised matching as a quadratic assignment problem and introduce a novel heuristic that outperforms previous solvers. We also develop a technique to find optimal matching problems, for which a non-trivial match is very likely. Second, we conduct an extensive study deploying a range of vision and language models on four datasets. Our analysis reveals that for many problem instances, vision and language representations can indeed be matched without supervision. This finding opens the possibility of exciting applications that embed semantic knowledge into other modalities. As a showcase, we demonstrate a proof-of-concept unsupervised classifier, which achieves non-trivial classification accuracy without any image-text annotation.
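The quadratic assignment view of blind matching can be made concrete with a brute-force toy solver that searches for the permutation aligning the two modalities' pairwise-distance matrices; this is only feasible for tiny instances, and the paper's heuristic solver is not reproduced here.

```python
# Brute-force sketch of "blind" matching as a quadratic assignment problem:
# find the permutation that makes the pairwise-distance matrices of the two
# modalities agree (tiny instances only).
import itertools
import numpy as np

def blind_match_brute_force(dist_vision, dist_text):
    """dist_vision, dist_text: (k, k) pairwise distance matrices of k concepts."""
    k = dist_vision.shape[0]
    best_perm, best_cost = None, np.inf
    for perm in itertools.permutations(range(k)):
        p = np.asarray(perm)
        cost = np.sum((dist_vision - dist_text[np.ix_(p, p)]) ** 2)
        if cost < best_cost:
            best_perm, best_cost = p, cost
    return best_perm, best_cost

# Tiny usage example with random embeddings standing in for the two modalities.
rng = np.random.default_rng(0)
v, t = rng.normal(size=(6, 32)), rng.normal(size=(6, 32))
pairwise = lambda x: np.linalg.norm(x[:, None] - x[None, :], axis=-1)
perm, cost = blind_match_brute_force(pairwise(v), pairwise(t))
```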
Poster
Kun Zhang · Jingyu Li · Zhe Li · S Kevin Zhou
[ ExHall D ]
Abstract
Vision-Language (VL) alignment across image and text modalities is a challenging task due to the inherent semantic ambiguity of data with multiple possible meanings. Existing methods typically solve it by learning multiple sub-representation spaces to encode each input data as a set of embeddings, and constraining diversity between whole subspaces to capture diverse semantics for accurate VL alignment. Despite their promising outcomes, existing methods suffer from two imperfections: 1) specific semantics is mainly expressed by a few local dimensions within each subspace; ignoring this intrinsic property, existing diversity constraints imposed on the whole subspace may impair diverse embedding learning; 2) multiple embeddings are inevitably introduced, sacrificing computational and storage efficiency. In this paper, we propose a simple yet effective Diverse and Hybrid Set-embeddings learning framework (DH-Set), which is distinct from prior work in three aspects. DH-Set 1) devises a novel semantic importance dissecting method to focus on key local dimensions within each subspace; and thereby 2) not only imposes finer-grained diversity constraint to improve the accuracy of diverse embedding learning, 3) but also mixes key dimensions of all subspaces into the single hybrid embedding to boost inference efficiency. Extensive experiments on various benchmarks and model backbones show the superiority of DH-Set …
Poster
Zhangqi Jiang · Junkai Chen · Beier Zhu · Tingjin Luo · Yankun Shen · Xu Yang
[ ExHall D ]
Abstract
Hallucinations in Large Vision-Language Models (LVLMs) significantly undermine their reliability, motivating researchers to explore the causes of hallucination. However, most studies primarily focus on the language aspect rather than the visual. In this paper, we address how LVLMs process visual information and whether this process causes hallucination. Firstly, we use the attention lens to identify the stages at which LVLMs handle visual data, discovering that the middle layers are crucial. Moreover, we find that these layers can be further divided into two stages: "visual information enrichment" and "semantic refinement" which respectively propagate visual data to object tokens and interpret it through text. By analyzing attention patterns during the visual information enrichment stage, we find that real tokens consistently receive higher attention weights than hallucinated ones, serving as a strong indicator of hallucination. Further examination of multi-head attention maps reveals that hallucination tokens often result from heads interacting with inconsistent objects. Based on these insights, we propose a simple inference-time method that adjusts visual attention by integrating information across various heads. Extensive experiments demonstrate that this approach effectively mitigates hallucinations in mainstream LVLMs without additional training costs. Our code will be released at: https://anonymous.4open.science/r/middle_layers_indicating_hallucinations-C45A.
Poster
Yuncheng Guo · Xiaodong Gu
[ ExHall D ]
Abstract
Large-scale pre-trained Vision-Language Models (VLMs) have become essential for transfer learning across diverse tasks. However, adapting these models with limited few-shot data often leads to overfitting, diminishing their performance on new tasks. To tackle this issue, we propose a novel Multi-Modal Representation Learning (MMRL) framework that introduces a shared, learnable, and modality-agnostic representation space. MMRL projects the space tokens to text and image representation tokens, facilitating more effective multi-modal interactions. Unlike previous approaches that solely optimize class token features, MMRL integrates representation tokens at higher layers of the encoders—where dataset-specific features are more prominent—while preserving generalized knowledge in the lower layers. During training, both representation and class features are optimized, with a trainable projection layer applied to the representation tokens, whereas the class token projection layer remains frozen to retain pre-trained knowledge. Furthermore, a regularization term is introduced to align the class features and text features with the zero-shot features from the frozen VLM, thereby safeguarding the model's generalization capacity. For inference, a decoupling strategy is employed, wherein both representation and class features are utilized for base classes, while only the class features, which retain more generalized knowledge, are used for new tasks. Extensive experiments across 15 datasets demonstrate that MMRL …
Poster
Zixuan Hu · Yongxian Wei · Li Shen · Chun Yuan · Dacheng Tao
[ ExHall D ]
Abstract
Large Language Models (LLMs) such as ChatGPT demonstrate strong few-shot adaptability without requiring fine-tuning, making them ideal for data-limited and real-time applications. However, this adaptability has not yet been replicated in current Visual Foundation Models (VFMs), which require explicit fine-tuning with sufficient tuning data. Moreover, the pretraining-finetuning paradigm has led to the surge of numerous task-specific modular components, such as Low-Rank Adaptation (LoRA). For the first time, we explore the potential of reusing diverse pre-tuned LoRAs without accessing their original training data, to achieve tuning-free few-shot adaptation in VFMs. Our framework, LoRA Recycle, distills a meta-LoRA from diverse pre-tuned LoRAs with a meta-learning objective, using surrogate data generated inversely from pre-tuned LoRAs themselves. The VFM, once equipped with the meta-LoRA, is empowered to solve new few-shot tasks in a single forward pass, akin to the in-context learning of LLMs. Additionally, we incorporate a double-efficient mechanism tailored to our framework, significantly accelerating the meta-training process while maintaining or even improving performance. Extensive experiments on various few-shot classification benchmarks across both in- and cross-domain scenarios demonstrate the superiority of our framework.
Poster
Soumya Suvra Ghosal · Souradip Chakraborty · Vaibhav Singh · Tianrui Guan · Mengdi Wang · Ahmad Beirami · Furong Huang · Alvaro Velasquez · Dinesh Manocha · Amrit Singh Bedi
[ ExHall D ]
Abstract
With the widespread deployment of Multimodal Large Language Models (MLLMs) for visual-reasoning tasks, improving their safety has become crucial. Recent research indicates that despite training-time safety alignment, these models remain vulnerable to jailbreak attacks—carefully crafted image-prompt pairs that compel the model to generate harmful content. In this work, we first highlight a critical safety gap, demonstrating that alignment achieved solely through safety training may be insufficient against jailbreak attacks. To address this vulnerability, we propose Immune, an inference-time defense framework that leverages a safe reward model during decoding to defend against jailbreak attacks. Additionally, we provide a rigorous mathematical characterization of Immune, offering provable guarantees against jailbreaks. Extensive evaluations on diverse jailbreak benchmarks using recent MLLMs reveal that Immune effectively enhances model safety while preserving the model's original capabilities. For instance, against text-based jailbreak attacks on LLaVA-1.6, Immune reduces the attack success rate by 57.82% and 16.78% compared to the base MLLM and state-of-the-art defense strategy, respectively.
Poster
Yue Cao · Yun Xing · Jie Zhang · Di Lin · Tianwei Zhang · Ivor Tsang · Yang Liu · Qing Guo
[ ExHall D ]
Abstract
Large vision-language models (LVLMs) have shown remarkable capabilities in interpreting visual content. While existing works demonstrate these models' vulnerability to deliberately placed adversarial texts, such texts are often easily identifiable as anomalous. In this paper, we present the first approach to generate scene-coherent typographic adversarial attacks that mislead advanced LVLMs while maintaining visual naturalness through the capability of the LLM-based agent. Our approach addresses three critical questions: what adversarial text to generate, where to place it within the scene, and how to integrate it seamlessly. We propose a training-free, multi-modal LLM-driven scene-coherent typographic adversarial planning (SceneTAP) that employs a three-stage process: scene understanding, adversarial planning, and seamless integration. The SceneTAP utilizes chain-of-thought reasoning to comprehend the scene, formulate effective adversarial text, strategically plan its placement, and provide detailed instructions for natural integration within the image. This is followed by a scene-coherent TextDiffuser that executes the attack using a local diffusion mechanism. We extend our method to real-world scenarios by printing and placing generated patches in physical environments, demonstrating its practical implications. Extensive experiments show that our scene-coherent adversarial text successfully misleads state-of-the-art LVLMs, including ChatGPT-4o, even after capturing new images of physical setups. Our evaluations demonstrate a significant increase in attack success rates while …
Poster
Zhaoyi Liu · Huan Zhang
[ ExHall D ]
Abstract
Self-supervised learning (SSL) vision encoders learn high-quality image representations and thus have become a vital part of developing the vision modality of large vision-language models (LVLMs). Due to the high cost of training such encoders, pre-trained encoders are widely shared and deployed into many LVLMs, which are security-critical or bear societal significance. Under this practical scenario, we reveal a new backdoor threat in which significant visual hallucinations can be induced in these LVLMs by merely compromising their vision encoders. Because of the sharing and reuse of these encoders, many downstream LVLMs may inherit backdoor behaviors from encoders, leading to widespread backdoors. In this work, we propose BadVision, the first method to exploit this vulnerability in SSL vision encoders for LVLMs with novel trigger optimization and backdoor learning techniques. We evaluate BadVision on two types of SSL encoders and LVLMs across eight benchmarks. We show that BadVision effectively drives the LVLMs to attacker-chosen hallucinations with over 99\% attack success rate, causing a 77.6\% relative visual understanding error while maintaining stealthiness. SoTA backdoor detection methods cannot detect our attack effectively.
Poster
Yuchen Ren · Zhengyu Zhao · Chenhao Lin · Bo Yang · Lu Zhou · Zhe Liu · Chao Shen
[ ExHall D ]
Abstract
Vision Transformers (ViTs) have been widely applied in various computer vision and vision-language tasks. To gain insights into their robustness in practical scenarios, transferable adversarial examples on ViTs have been extensively studied. A typical approach to improving adversarial transferability is by refining the surrogate model. However, existing work on ViTs has restricted their surrogate refinement to backward propagation. In this work, we instead focus on Forward Propagation Refinement (FPR) and specifically refine two key modules of ViTs: attention maps and token embeddings. For attention maps, we propose Attention Map Diversification (AMD), which diversifies certain attention maps and also implicitly imposes beneficial gradient vanishing during backward propagation. For token embeddings, we propose Momentum Token Embedding (MTE), which accumulates historical token embeddings to stabilize the forward updates in both the Attention and MLP blocks. We conduct extensive experiments with adversarial examples transferred from ViTs to various CNNs and ViTs, demonstrating that our FPR outperforms the current best (backward) surrogate refinement method by up to 7.0\% on average. We also validate its superiority against popular defenses and its compatibility with other transfer methods.
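The momentum idea behind MTE, a running average of token embeddings across attack iterations, can be sketched as follows; the class name, momentum value, and the way history mixes with the current tokens are assumptions, and the attention-map diversification part is not shown.

```python
# Sketch of a momentum buffer for token embeddings across attack iterations,
# used to stabilize the surrogate's forward pass (illustrative only).
import torch

class MomentumTokenEmbedding:
    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.running = None

    def __call__(self, tokens):
        # tokens: (B, N, D) token embeddings at some block, current attack step.
        if self.running is None:
            self.running = tokens.detach()
        else:
            self.running = self.momentum * self.running + (1 - self.momentum) * tokens.detach()
        # Keep the gradient path through the current tokens while mixing in history.
        return self.momentum * self.running + (1 - self.momentum) * tokens

mte = MomentumTokenEmbedding()
stabilized = mte(torch.randn(1, 197, 768, requires_grad=True))
```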
Poster
Jenny Schmalfuss · Nadine Chang · Vibashan VS · Maying Shen · Andrés Bruhn · Jose M. Alvarez
[ ExHall D ]
Abstract
Vision language models (VLMs) respond to user-crafted text prompts and visual inputs, and are applied to numerous real-world problems. VLMs integrate visual modalities with large language models (LLMs), which are well known to be prompt-sensitive. Hence, it is crucial to determine whether VLMs inherit this instability to varying prompts. We therefore investigate which prompt variations VLMs are most sensitive to and which VLMs are most agnostic to prompt variations. To this end, we introduce PARC (Prompt Analysis via Reliability and Calibration), a VLM prompt sensitivity analysis framework built on three pillars: (1) plausible prompt variations in both the language and vision domain, (2) a novel model reliability score with built-in guarantees, and (3) a calibration step that enables dataset- and prompt-spanning prompt variation analysis. Regarding prompt variations, experimental results from PARC show that VLMs mirror LLM language prompt sensitivity in the vision domain, and the most destructive variations are those that change the expected answer. Regarding models, the most robust VLMs among the 22 evaluated models come from the InternVL2 family. We further find indications that prompt sensitivity is linked more closely to training data than to model size. Code and datasets will be released.
Poster
Yassir Bendou · Amine Ouasfi · Vincent Gripon · Adnane Boukhayma
[ ExHall D ]
Abstract
The growing popularity of Contrastive Language-Image Pretraining (CLIP) has led to its widespread application in various visual downstream tasks. To enhance CLIP's effectiveness, efficient few-shot adaptation techniques have been widely adopted. Among these approaches, training-free methods, particularly caching methods exemplified by Tip-Adapter, have gained attention for their lightweight adaptation without the need for additional fine-tuning. In this paper, we revisit Tip-Adapter from a kernel perspective, showing that caching methods function as local adapters and are connected to a well-established kernel literature. Leveraging this insight, we offer a theoretical understanding of how these methods operate and suggest multiple avenues for improving upon the Tip-Adapter baseline. Notably, our analysis shows the importance of incorporating global information in local adapters. Therefore, we subsequently propose a global method that learns a proximal regularizer in a reproducing kernel Hilbert space (RKHS) using CLIP as a base learner. Our method, which we call ProKeR (Proximal Kernel ridge Regression), has a closed-form solution and achieves state-of-the-art performance across 11 datasets in the standard few-shot adaptation benchmark.
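One simplified reading of a proximal kernel method with CLIP as the base learner is kernel ridge regression on the residual of the zero-shot classifier, which has a closed form; the RBF kernel and hyperparameters below are assumptions, not the exact ProKeR objective.

```python
# Closed-form sketch: kernel ridge regression on the residual of the CLIP
# zero-shot classifier, keeping the learned correction proximal to the base
# learner (illustrative simplification).
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def fit_residual_krr(shot_feats, shot_labels, shot_zero_shot_logits, lam=1.0):
    # shot_feats: (n, d), shot_labels: (n, C) one-hot, zero-shot logits: (n, C).
    K = rbf_kernel(shot_feats, shot_feats)
    alpha = np.linalg.solve(K + lam * np.eye(len(shot_feats)),
                            shot_labels - shot_zero_shot_logits)
    return alpha

def predict(test_feats, test_zero_shot_logits, shot_feats, alpha):
    # Base-learner prediction plus the kernel-ridge correction.
    return test_zero_shot_logits + rbf_kernel(test_feats, shot_feats) @ alpha
```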
Poster
Maxime Zanella · Clément Fuchs · Christophe De Vleeschouwer · Ismail Ben Ayed
[ ExHall D ]
Abstract
The zero-shot capabilities of Vision-Language Models (VLMs) have been widely leveraged to improve predictive performance. However, previous works on transductive or test-time adaptation (TTA) often make strong assumptions about the data distribution, such as the presence of all classes. Our work challenges these favorable deployment scenarios, and introduces a more realistic evaluation framework, including: (i) a variable number of effective classes for adaptation within a single batch, and (ii) non-i.i.d. batches of test samples in online adaptation settings. We provide comprehensive evaluations, comparisons, and ablation studies that demonstrate how current transductive or TTA methods for VLMs systematically compromise the models’ initial zero-shot robustness across various realistic scenarios, favoring performance gains under advantageous assumptions about the test samples' distributions. Furthermore, we introduce StatA, a versatile method that could handle a wide range of deployment scenarios, including those with a variable number of effective classes at test time. Our approach incorporates a novel regularization term designed specifically for VLMs, which acts as a statistical anchor preserving the initial text-encoder knowledge, particularly in low-data regimes. Code will be made available.
Poster
Dengyang Jiang · Haoyu Wang · Lei Zhang · Wei Wei · Guang Dai · Mengmeng Wang · Jingdong Wang · Yanning Zhang
[ ExHall D ]
Abstract
Pre-training backbone networks on a general annotated dataset (e.g., ImageNet), which comprises numerous manually collected images with category annotations, has proven to be indispensable for enhancing the generalization capacity of downstream visual tasks. However, those manually collected images often exhibit non-trivial bias, which is not only non-transferable across categories or domains, but also inevitably memorized by the backbone, thus causing its generalization capacity to degenerate. To mitigate this problem, we present an \textbf{u}n\textbf{b}iased general annotated dataset \textbf{gen}eration framework (\textbf{ubGen}). Instead of expensive manual collection, we aim at directly generating synthetic unbiased images with category annotations. To achieve this goal, we propose to leverage the advantage of multimodal foundation models (e.g., CLIP) in terms of aligning images with language in an unbiased semantic space. Specifically, we develop a bi-level semantic alignment loss, which not only forces all generated images to be consistent with the semantic distribution of all categories belonging to the target dataset in an adversarial learning manner, but also requires each generated image to match the semantic description of its category name. In addition, we further cast an existing image quality scoring model into a quality assurance loss to preserve the quality of the generated image. By leveraging these …
Poster
Chaoyang Li · Jianyang Qin · Jinhao Cui · Zeyu Liu · Ning Hu · Qing Liao
[ ExHall D ]
Abstract
Multi-task prompt learning has emerged as a promising technique for fine-tuning pre-trained Vision-Language Models (VLMs) to various downstream tasks. However, existing methods ignore challenges caused by spurious correlations and dynamic task relationships, which may reduce the model performance. To tackle these challenges, we propose JSCPT, a novel approach for \textit{Joint Scheduling of Causal Prompts and Tasks} to enhance multi-task prompt learning. Specifically, we first design a \textit{Multi-Task Vision-Language Prompt} (MTVLP) model, which learns task-shared and task-specific vision-language prompts and selects useful prompt features via causal intervention, alleviating spurious correlations. Then, we propose the task-prompt scheduler that models inter-task affinities and assesses the causal effect of prompt features to optimize the multi-task prompt learning process. Finally, we formulate the scheduler and the multi-task prompt learning process as a bi-level optimization problem to optimize prompts and tasks adaptively. In the lower optimization, MTVLP is updated with the scheduled gradient, while in the upper optimization, the scheduler is updated with the implicit gradient. Extensive experiments show the superiority of our proposed JSCPT approach over several baselines in terms of multi-task prompt learning for pre-trained VLMs.
Poster
Hairui Ren · Fan Tang · He Zhao · Zixuan Wang · Dandan Guo · Yi Chang
[ ExHall D ]
Abstract
Fine-tuning vision-language models (VLMs) with large amounts of unlabeled data has recently garnered significant interest. However, a key challenge remains the lack of high-quality pseudo-labeled data. Current pseudo-labeling strategies often struggle with mismatches between semantic and visual information, leading to sub-optimal performance of unsupervised prompt learning (UPL) methods. In this paper, we introduce a simple yet effective approach called \textbf{A}ugmenting D\textbf{i}scriminative \textbf{R}ichness via Diffusions (AiR), which learns richer discriminative representations of each class and thus facilitates classification. Specifically, our approach includes a pseudo-label generation module that leverages high-fidelity synthetic samples to create an auxiliary classifier, which captures richer visual variation, bridging text-image-pair classification to a more robust image-image-pair classification. Additionally, we exploit the diversity of diffusion-based synthetic samples to enhance prompt learning, providing greater information for semantic-visual alignment. Extensive experiments on five public benchmarks, including RESISC45 and Flowers102, and across three learning paradigms (UL, SSL, and TRZSL) demonstrate that AiR achieves substantial and consistent performance improvements over state-of-the-art unsupervised prompt learning methods.
Poster
Xiangyan Qu · Gaopeng Gou · Jiamin Zhuang · Jing Yu · Kun Song · Qihao Wang · Yili Li · Gang Xiong
[ ExHall D ]
Abstract
Vision-language models (VLMs) have made significant progress in image classification by training with large-scale paired image-text data. Their performance largely depends on prompt quality. While recent methods show that visual descriptions generated by large language models (LLMs) enhance the generalization of VLMs, class-specific prompts may be inaccurate or lack discrimination due to the hallucination in LLMs. In this paper, we aim to find visually discriminative prompts for fine-grained categories with minimal supervision and no human-in-the-loop. An evolution-based algorithm is proposed to progressively optimize language prompts from task-specific templates to class-specific descriptions. Unlike optimizing templates, the search space shows an explosion in class-specific candidate prompts. This increases prompt generation costs, iterative times, and the overfitting problem. To this end, we first introduce several simple yet effective edit-based and evolution-based operations to generate diverse candidate prompts with a one-time query of LLMs. Then, two sampling strategies are proposed to find a better initial search point and reduce traversed categories, saving iteration costs. Moreover, we apply a novel fitness score with entropy constraints to mitigate overfitting. In a challenging one-shot image classification setting, our method outperforms existing textual prompt-based methods and improves upon LLM-generated description methods across 13 datasets. Meanwhile, we demonstrate that our …
Poster
Jinpeng Wang · Tianci Luo · Yaohua Zha · Yan Feng · Ruisheng Luo · Bin Chen · Tao Dai · Long Chen · Yaowei Wang · Shu-Tao Xia
[ ExHall D ]
Abstract
Visual In-Context Learning (VICL) enables adaptively solving vision tasks by leveraging pixel demonstrations, mimicking human-like task completion through analogy. Prompt selection is critical in VICL, but current methods assume the existence of a single "ideal" prompt in a pool of candidates, which in practice may not hold true. Multiple suitable prompts may exist, but individually they often fall short, leading to difficulties in selection and the exclusion of useful context. To address this, we propose a new perspective: ***prompt condensation***. Rather than relying on a single prompt, candidate prompts collaborate to efficiently integrate informative contexts without sacrificing resolution. We devise Condenser, a lightweight external plugin that compresses relevant fine-grained context across multiple prompts. Optimized end-to-end with the backbone and an extra pre-alignment objective, Condenser ensures stability and accurate integration of contextual cues. Experiments demonstrate Condenser outperforms state-of-the-art methods across benchmark tasks, showing superior context compression, scalability with more prompts, and enhanced computational efficiency compared to ensemble methods, positioning it as a highly competitive solution for VICL. Code will be open-sourced at https://anonymous.4open.science/r/VICL-Condenser.
Poster
Jung-Ho Hong · Ho-Joong Kim · Kyu-Sung Jeon · Seong-Whan Lee
[ ExHall D ]
Abstract
The feature attribution method reveals the contribution of input variables to the decision-making process to provide an attribution map for explanation. Existing methods grounded on the information bottleneck principle compute information in a specific layer to obtain attributions, compressing the features by injecting noise via a parametric damping ratio. However, the attribution obtained in a specific layer neglects evidence of the decision-making process distributed across layers. In this paper, we introduce a comprehensive information bottleneck (CoIBA), which discovers the relevant information in each targeted layer to explain the decision-making process. Our core idea is applying information bottleneck in multiple targeted layers to estimate the comprehensive information by sharing a parametric damping ratio across the layers. Leveraging this shared ratio complements the over-compressed information to discover the omitted clues of the decision by sharing the relevant information across the targeted layers. We suggest the variational approach to fairly reflect the relevant information of each layer by upper bounding layer-wise information. Therefore, CoIBA guarantees that the discarded activation is unnecessary in every targeted layer to make a decision. The extensive experimental results demonstrate the enhancement in faithfulness of the feature attributions provided by CoIBA.
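A heavily simplified sketch of the shared-damping idea follows: a single learnable ratio gates how much of each targeted layer's activation survives versus being replaced by noise drawn from that layer's statistics. The scalar ratio, module name, and noise model are assumptions, and the variational bound used for training is omitted.

```python
# Sketch of noise injection with a damping ratio shared across several
# targeted layers (illustrative simplification of the shared-ratio idea).
import torch
import torch.nn as nn

class SharedDampingBottleneck(nn.Module):
    def __init__(self):
        super().__init__()
        self.lambda_logit = nn.Parameter(torch.zeros(1))

    def forward(self, layer_activations):
        # layer_activations: list of tensors from the targeted layers.
        lam = torch.sigmoid(self.lambda_logit)
        damped = []
        for feat in layer_activations:
            noise = torch.randn_like(feat) * feat.std() + feat.mean()
            damped.append(lam * feat + (1.0 - lam) * noise)
        return damped

bottleneck = SharedDampingBottleneck()
outs = bottleneck([torch.randn(1, 256, 14, 14), torch.randn(1, 512, 7, 7)])
```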
Poster
Jungsoo Lee · Debasmit Das · Munawar Hayat · Sungha Choi · Kyuwoong Hwang · Fatih Porikli
[ ExHall D ]
Abstract
We propose a novel knowledge distillation approach, CustomKD, that effectively leverages large vision foundation models (LVFMs) to enhance the performance of edge models (e.g., MobileNetV3). Despite recent advancements in LVFMs, such as DINOv2 and CLIP, their potential in knowledge distillation for enhancing edge models remains underexplored. While knowledge distillation is a promising approach for improving the performance of edge models, the discrepancy in model capacities and heterogeneous architectures between LVFMs and edge models poses a significant challenge. Our observation indicates that although utilizing larger backbones (e.g., ViT-S to ViT-L) in teacher models improves their downstream task performance, the knowledge distillation from the large teacher models fails to bring as much performance gain for student models as for teacher models due to the large model discrepancy. Our simple yet effective CustomKD customizes the well-generalized features inherent in LVFMs to a given student model in order to reduce model discrepancies. Specifically, beyond providing well-generalized original knowledge from teachers, CustomKD aligns the features of teachers to those of students, making it easier for students to understand them and to overcome the large model discrepancy. CustomKD significantly improves the performance of edge models in scenarios with unlabeled data such as unsupervised domain adaptation (e.g., OfficeHome and …
Poster
Debora Caldarola · Pietro Cagnasso · Barbara Caputo · Marco Ciccone
[ ExHall D ]
Abstract
Federated learning (FL) enables collaborative model training with privacy preservation. Data heterogeneity across edge devices (clients) can cause models to converge to sharp minima, negatively impacting generalization and robustness. Recent approaches use client-side sharpness-aware minimization (SAM) to encourage flatter minima, but the discrepancy between local and global loss landscapes often undermines their effectiveness, as optimizing for local sharpness does not ensure global flatness. This work introduces FedGloSS (Federated Global Server-side Sharpness), a novel FL approach that prioritizes the optimization of global sharpness on the server, using SAM. To reduce communication overhead, FedGloSS cleverly approximates sharpness using the previous global gradient, eliminating the need for additional client communication. Our extensive evaluations demonstrate that FedGloSS consistently reaches flatter minima and better performance compared to state-of-the-art FL methods across various federated vision benchmarks.
Poster
Shunxin Wang · Raymond Veldhuis · Nicola Strisciuglio
[ ExHall D ]
Abstract
Frequency shortcuts refer to specific frequency patterns that models heavily rely on for correct classification. Previous studies have shown that models trained on small image datasets often exploit such shortcuts, potentially impairing their generalization performance. However, existing methods for identifying frequency shortcuts require expensive computations and become impractical for analyzing models trained on large datasets. In this work, we propose the first approach to more efficiently analyze frequency shortcuts at a larger scale. We show that both CNN and transformer models learn frequency shortcuts on ImageNet. We also expose that frequency shortcut solutions can yield good performance on out-of-distribution (OOD) test sets which largely retain texture information. However, these shortcuts, mostly aligned with texture patterns, hinder model generalization on rendition-based OOD test sets. These observations suggest that current OOD evaluations often overlook the impact of frequency shortcuts on model generalization. Future benchmarks could thus benefit from explicitly assessing and accounting for these shortcuts to build models that generalize across a broader range of OOD scenarios.
Poster
Ningyuan Tang · Minghao Fu · Jianxin Wu
[ ExHall D ]
Abstract
The rapid scaling of large vision pretrained models makes fine-tuning tasks more and more difficult on devices with low computational resources. We explore a new visual adaptation paradigm called separated tuning, which treats large pretrained models as standalone feature extractors that run on powerful cloud servers. Fine-tuning is carried out on devices that possess only low computational resources (slow CPU, no GPU, small memory, etc.). Existing methods that are potentially suitable for our separated tuning paradigm are discussed, but three major drawbacks hinder their application in separated tuning: low adaptation capability, a large adapter network, and, in particular, high information transfer overhead. To address these issues, we propose Minimal Interaction Separated Tuning, or MIST, which reveals that the sum of intermediate features from pretrained models not only has minimal information transfer but also has high adaptation capability. With a lightweight attention-based adaptor network, MIST achieves information transfer efficiency, parameter efficiency, computational and memory efficiency, and at the same time demonstrates competitive results on various visual adaptation benchmarks.
Poster
Krishna Sri Ipsit Mantri · Carola-Bibiane Schönlieb · Bruno Ribeiro · Chaim Baskin · Moshe Eliasof
[ ExHall D ]
Abstract
Pre-trained Vision Transformers now serve as powerful tools for computer vision. Yet, efficiently adapting them for multiple tasks remains a challenge that arises from the need to modify the rich hidden representations encoded by the learned weight matrices, without inducing interference between tasks. Current parameter-efficient methods like LoRA, which apply low-rank updates, force tasks to compete within constrained subspaces, ultimately degrading performance. We introduce DiTASK, a novel Diffeomorphic Multi-Task Fine-Tuning approach that maintains pre-trained representations by preserving weight matrix singular vectors, while enabling task-specific adaptations through neural diffeomorphic transformations of the singular values. By following this approach, DiTASK enables both shared and task-specific feature modulations with minimal added parameters. Our theoretical analysis shows that DiTASK achieves full-rank updates during optimization, preserving the geometric structure of pre-trained features, and establishing a new paradigm for efficient multi-task learning (MTL). Our experiments on PASCAL MTL and NYUD show that DiTASK achieves state-of-the-art performance across four dense prediction tasks, using 75% fewer parameters than existing methods.
Poster
Jian Meng · Ahmed Hasssan · Li Yang · Deliang Fan · Jinwoo Shin · Jae-sun Seo
[ ExHall D ]
Abstract
Learning the visual representation via masked auto-encoder (MAE) training has been proven to be a powerful technique. Transferring the pre-trained vision transformer (ViT) to downstream tasks leads to superior performance compared to conventional task-by-task supervised learning. Recent research works on MAE focus on large-sized vision transformers (>50 million parameters) with outstanding performance. However, improving the generality of the under-parametrized lightweight model has been widely ignored. In practice, downstream applications are commonly intended for resource-constrained platforms, where large-scale ViT cannot easily meet the resource budget. Current lightweight MAE training heavily relies on knowledge distillation with a pre-trained teacher, whereas the root cause behind the poor performance remains under-explored. Motivated by that, this paper first introduces the concept of the "closest neighbor patch" to characterize the local semantics among the input tokens. Our discovery shows that the lightweight model fails to distinguish different local information, leading to aliased understanding and poor accuracy. Motivated by this finding, we propose NoR-MAE, a novel MAE training algorithm for lightweight vision transformers. NoR-MAE elegantly repels the semantic aliasing between patches and their closest neighboring patch (semantic centroid) with negligible training cost overhead. With the ViT-Tiny model, NoR-MAE achieves up to 7.22%/3.64% accuracy improvements on ImageNet-100/ImageNet-1K datasets, as …
Poster
Mengqiao Han · Liyuan Pan · Xiabi Liu
[ ExHall D ]
Abstract
Neural networks derived from the M-P model have excelled in various visual tasks. However, as a simplified simulation version of the brain neural pathway, their structures are locked during training, causing over-fitting and over-parameterization. Although recent models have begun using the biomimetic concept and empirical pruning, they still result in irrational pruning, potentially affecting the accuracy of the model. In this paper, we introduce the Glia unit, composed of oligodendrocytes (Oli) and astrocytes (Ast), to emulate the exact workflow of the mammalian brain, thereby enhancing the biological plausibility of neural functions. Oli selects neurons involved in signal transmission during neural communication and, together with Ast, adaptively optimizes the neural structure. Specifically, we first construct the artificial Glia-Neuron (G-N) model, which is formulated at the instance, group, and interaction levels with adaptive and collaborative mechanisms. Then, we construct GliaNet based on our G-N model, whose structure and connections can be continuously optimized during training. Experiments show that our GliaNet advances the state of the art on multiple tasks while significantly reducing its parameter count.
Poster
Quentin Bouniot · Ievgen Redko · Anton Mallasto · Charlotte Laclau · Oliver Struckmeier · Karol Arndt · Markus Heinonen · Ville Kyrki · Samuel Kaski
[ ExHall D ]
Abstract
In the last decade, we have witnessed the introduction of several novel deep neural network (DNN) architectures exhibiting ever-increasing performance across diverse tasks. Explaining the upward trend of their performance, however, remains difficult as different DNN architectures of comparable depth and width -- common factors associated with their expressive power -- may exhibit drastically different performance even when trained on the same dataset. In this paper, we introduce the concept of the non-linearity signature of DNNs, the first theoretically sound solution for approximately measuring the non-linearity of deep neural networks. Built upon a score derived from closed-form optimal transport mappings, this signature provides a better understanding of the inner workings of a wide range of DNN architectures and learning paradigms, with a particular emphasis on computer vision tasks. We provide extensive experimental results that highlight the practical usefulness of the proposed non-linearity signature and its potential for long-reaching implications.
Poster
Ali Hatamizadeh · Jan Kautz
[ ExHall D ]
Abstract
We propose a novel hybrid Mamba-Transformer backbone, MambaVision, specifically tailored for vision applications. Our core contribution includes redesigning the Mamba formulation to enhance its capability for efficient modeling of visual features. Through a comprehensive ablation study, we demonstrate the feasibility of integrating Vision Transformers (ViT) with Mamba. Our results show that equipping the Mamba architecture with self-attention blocks in the final layers greatly improves its capacity to capture long-range spatial dependencies. Based on these findings, we introduce a family of MambaVision models with a hierarchical architecture to meet various design criteria. For classification on the ImageNet-1K dataset, MambaVision variants achieve state-of-the-art (SOTA) performance in terms of both Top-1 accuracy and throughput. In downstream tasks such as object detection, instance segmentation, and semantic segmentation on MS COCO and ADE20K datasets, MambaVision outperforms comparably sized backbones while demonstrating favorable performance. Code: https://anonymous.4open.science/r/mamba_vision-D073
Poster
Qihang Fan · Huaibo Huang · Ran He
[ ExHall D ]
Abstract
The Softmax attention mechanism in Transformer models is notoriously computationally expensive, particularly due to its quadratic complexity, posing significant challenges in vision applications. In contrast, linear attention provides a far more efficient solution by reducing the complexity to linear levels. However, compared to Softmax attention, linear attention often experiences significant performance degradation. Our experiments indicate that this performance drop is due to the low-rank nature of linear attention's feature map, which hinders its ability to adequately model complex spatial information. In this paper, to break the low-rank dilemma of linear attention, we conduct rank analysis from two perspectives: the kv buffer and the output features. Consequently, we introduce **Rank-Augmented Linear Attention** (RALA), which rivals the performance of Softmax attention while maintaining linear complexity and high efficiency. Based on RALA, we construct the **Rank-Augmented Vision Linear Transformer** (RAVLT). Extensive experiments demonstrate that RAVLT achieves excellent performance across various vision tasks. Specifically, without using any additional labels, data, or supervision during training, RAVLT achieves an **84.4%** Top-1 accuracy on ImageNet-1k with only **26M** parameters and **4.6G** FLOPs. This result significantly surpasses previous linear attention mechanisms, fully illustrating the potential of RALA.
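For reference, the standard linear attention that RALA builds on can be written as below: a positive kernel feature map lets keys and values be folded into a D-by-D buffer, giving complexity linear in the token count. The rank augmentation itself is the paper's contribution and is not reproduced here.

```python
# Baseline linear attention: the kv buffer is (D x D), so the cost grows
# linearly with the number of tokens N (illustrative baseline only).
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    # q, k, v: (B, H, N, D)
    q = F.elu(q) + 1.0                                            # positive feature map
    k = F.elu(k) + 1.0
    kv_buffer = torch.einsum('bhnd,bhne->bhde', k, v)             # (B, H, D, D)
    normalizer = 1.0 / (torch.einsum('bhnd,bhd->bhn', q, k.sum(dim=2)) + eps)
    return torch.einsum('bhnd,bhde,bhn->bhne', q, kv_buffer, normalizer)

out = linear_attention(*[torch.randn(2, 8, 196, 64) for _ in range(3)])
```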
Poster
Dachong Li · Li Li · Zhuangzhuang Chen · Jianqiang Li
[ ExHall D ]
Abstract
Large kernels play a crucial role in enhancing the performance of standard convolutional neural networks (CNNs), enabling CNNs to outperform transformer architectures in computer vision. Scaling up kernel size has significantly contributed to the advancement of CNN models like RepLKNet, SLaK and UniRepLKNet. However, the relationship between kernel size and model performance varies across these works. This implies that large kernel convolution may involve hidden factors that affect model performance. Instead of merely increasing the kernel size, we reassess the role of large convolutions and decompose them into two separate components: extracting features at a certain granularity and fusing features by multiple pathways. In this paper, we make contributions in two aspects. 1) We demonstrate that 3×3 convolutions can replace large convolutions in existing large kernel CNNs to achieve comparable effects. 2) We develop a multi-path long-distance sparse dependency relationship to enhance feature utilization. Specifically, we introduce the Shiftwise (SW) convolution operator, a pure CNN architecture. In a wide range of vision tasks such as classification, segmentation and detection, SW surpasses state-of-the-art transformer and CNN architectures, including SLaK and UniRepLKNet. Code and all the models are available at \url{https://anonymous.4open.science/r/shift-wiseConv-8978}.
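The general decomposition described above, small kernels plus multi-path long-distance sparse dependencies, can be illustrated by summing depthwise 3×3 convolutions applied to spatially shifted copies of the feature map; the shift offsets and module structure are assumptions, not the exact SW operator.

```python
# Illustrative block: several depthwise 3x3 convolutions over shifted copies
# of the input are summed, so small kernels reach long-distance, sparse
# positions (not the exact Shiftwise operator).
import torch
import torch.nn as nn

class ShiftwiseBlock(nn.Module):
    def __init__(self, channels, shifts=((0, 0), (0, 5), (5, 0), (0, -5), (-5, 0))):
        super().__init__()
        self.shifts = shifts
        self.convs = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=channels)
            for _ in shifts
        ])

    def forward(self, x):
        out = torch.zeros_like(x)
        for (dy, dx), conv in zip(self.shifts, self.convs):
            out = out + conv(torch.roll(x, shifts=(dy, dx), dims=(2, 3)))
        return out

block = ShiftwiseBlock(channels=64)
y = block(torch.randn(1, 64, 56, 56))
```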
Poster
Zelin Peng · Yu Huang · Zhengqin Xu · feilong tang · Ming Hu · Xiaokang Yang · Wei Shen
[ ExHall D ]
Abstract
Contextual modeling is crucial for robust visual representation learning, especially in computer vision. Although Transformers have become a leading architecture for vision tasks due to their attention mechanism, the quadratic complexity of full attention operations presents substantial computational challenges. To address this, we introduce Star with Bilinear Mapping (SBM), a Transformer-like architecture that achieves global contextual modeling with linear complexity. SBM employs a bilinear mapping module (BM) with low-rank decomposition strategy and star operations (element-wise multiplication) to efficiently capture global contextual information. Our model demonstrates competitive performance on image classification and semantic segmentation tasks, delivering significant computational efficiency gains compared to traditional attention-based models.
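A linear-complexity block in this spirit can be sketched with two low-rank projections, one pooled into a global context vector and combined via the star operation (element-wise multiplication); the class name, rank, and pooling choice are assumptions rather than the exact SBM design.

```python
# Sketch of a low-rank bilinear block with a star (element-wise product)
# against a pooled global context; cost is linear in the token count N.
import torch
import torch.nn as nn

class StarBilinearBlock(nn.Module):
    def __init__(self, dim, rank=64):
        super().__init__()
        self.proj_a = nn.Linear(dim, rank)
        self.proj_b = nn.Linear(dim, rank)
        self.proj_out = nn.Linear(rank, dim)

    def forward(self, tokens):
        # tokens: (B, N, D)
        context = self.proj_b(tokens).mean(dim=1, keepdim=True)   # (B, 1, r) global context
        mixed = self.proj_a(tokens) * context                     # star operation
        return self.proj_out(mixed)

block = StarBilinearBlock(dim=384)
y = block(torch.randn(2, 196, 384))
```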
Poster
Tommie Kerssies · Niccolò Cavagnero · Alexander Hermans · Narges Norouzi · Giuseppe Averta · Bastian Leibe · Gijs Dubbelman · Daan de Geus
[ ExHall D ]
Abstract
Vision Transformers (ViTs) have shown remarkable performance and scalability across various computer vision tasks. Currently, to apply single-scale ViTs to image segmentation, existing methods adopt a convolutional adapter to generate multi-scale features, a pixel decoder to fuse these features, and a Transformer decoder that leverages them to make predictions. In this paper, we show that the inductive biases introduced by these task-specific components can instead be learned by the ViT itself, given sufficiently large models and extensive pre-training. Leveraging these findings, we introduce the Encoder-only Mask Transformer, which repurposes the plain ViT architecture to conduct image segmentation. Using large models and strong pre-training, EoMT obtains a segmentation performance similar to state-of-the-art models that use task-specific components. At the same time, EoMT is significantly faster than these methods due to its architectural simplicity, e.g., up to 4× faster using ViT-L. Across a range of model sizes, EoMT demonstrates an optimal balance between segmentation performance and inference speed, suggesting that compute resources are better allocated to scaling the ViT itself rather than adding architectural complexity. Code will be released upon acceptance.
Poster
Jiahao He · Keren Fu · Xiaohong Liu · Qijun Zhao
[ ExHall D ]
Abstract
Existing salient object detection (SOD) models primarily resort to convolutional neural networks (CNNs) and Transformers. However, the limited receptive fields of CNNs and quadratic computational complexity of transformers both constrain the performance of current models on discovering attention-grabbing objects. The emerging state space model, namely Mamba, has demonstrated its potential to balance global receptive fields and computational complexity. Therefore, we propose a novel unified framework based on the pure Mamba architecture, dubbed saliency Mamba (Samba), to flexibly handle general SOD tasks, including RGB/RGB-D/RGB-T SOD, video SOD (VSOD), and RGB-D VSOD. Specifically, we rethink Mamba's scanning strategy from the perspective of SOD, and identify the importance of maintaining spatial continuity of salient patches within scanning sequences. Based on this, we propose a saliency-guided Mamba block (SGMB), incorporating a spatial neighboring scanning (SNS) algorithm to preserve spatial continuity of salient patches. Additionally, we propose a context-aware upsampling (CAU) method to promote hierarchical feature alignment and aggregations by modeling contextual dependencies. Experimental results show that our Samba outperforms existing methods across five SOD tasks on 21 datasets with lower computational cost, confirming the superiority of introducing Mamba to the SOD areas. Our code will be made publicly available.
Poster
Pei Geng · Jian Yang · Shanshan Zhang
[ ExHall D ]
Abstract
Human-Object Interaction (HOI) detection aims to predict the <Human, Interaction, Object> triplets, where the core challenge lies in recognizing the interaction of each human-object pair. Despite recent progress thanks to more advanced model architectures, HOI performance remains unsatisfactory. In this work, we first perform some failure analysis and find that the accuracy for the no-interaction category is extremely low, largely hindering the improvement of overall performance. We further look into the error types and find that the mis-classification between no-interaction and with-interaction ones can be handled by human-object relation priors. Specifically, to better distinguish no-interaction from direct interactions, we propose a 3D location prior, which indicates the distance between human and object; as for no-interaction vs. indirect interactions, we propose a gaze area prior, which denotes whether the human can see the object or not. The above two types of human-object relation priors are represented by text and are combined with the original visual features, generating multi-modal cues for interaction recognition. Experimental results on the HICO-DET and V-COCO datasets demonstrate that our proposed human-object relation priors are effective and our method HORP surpasses previous methods under various settings and scenarios. In particular, the usage of our priors significantly enhances the model's recognition ability for the no-interaction …
Poster
Yifei Qian · Zhongliang Guo · Bowen Deng · Chun Tong Lei · Shuai Zhao · Chun Pong Lau · Xiaopeng Hong · Michael Pound
[ ExHall D ]
Abstract
Zero-shot object counting aims to count instances of arbitrary object categories specified by text descriptions. Existing methods typically rely on vision-language models like CLIP, but often exhibit limited sensitivity to text prompts. We present T2ICount, a one-step diffusion-based framework that leverages rich prior knowledge and fine-grained visual understanding from pretrained diffusion models. While one-step denoising ensures efficiency, it leads to weakened text sensitivity. To address this challenge, we propose a Hierarchical Semantic Correction Module that progressively refines text-image feature alignment, and a Representational Regional Coherence Loss that provides reliable supervision signals by leveraging the cross-attention maps extracted from the denoising U-Net. Furthermore, we observe that current benchmarks mainly focus on majority objects in images, potentially masking models' text sensitivity. To address this, we contribute a challenging re-annotated subset of FSC147 for better evaluation of text-guided counting ability. Extensive experiments demonstrate that our method achieves superior performance across different benchmarks. Code will be made publicly available.
Poster
Ziyu Zhao · Xiaoguang Li · Lingjia Shi · Nasrin Imanpour · Song Wang
[ ExHall D ]
Abstract
Open-vocabulary semantic segmentation aims to segment images into distinct semantic regions for both seen and unseen categories at the pixel level. Current methods utilize text embeddings from pre-trained vision-language models like CLIP but struggle with the inherent domain gap between image and text embeddings, even after extensive alignment during training. Additionally, relying solely on deep text-aligned features limits shallow-level feature guidance, which is crucial for detecting small objects and fine details, ultimately reducing segmentation accuracy. To address these limitations, we propose a dual prompting framework, DPSeg, for this task. Our approach combines dual-prompt cost volume generation, a cost volume-guided decoder, and a semantic-guided prompt refinement strategy that leverages our dual prompting scheme to mitigate alignment issues in visual prompt generation. By incorporating visual embeddings from a visual prompt encoder, our approach reduces the domain gap between text and image embeddings while providing multi-level guidance through shallow features. Extensive experiments demonstrate that our method significantly outperforms existing state-of-the-art approaches on multiple public datasets.
Poster
Srinivasa Rao Nandam · Sara Atito · Zhenhua Feng · Josef Kittler · Muhammad Awais
[ ExHall D ]
Abstract
Foundation models like CLIP and ALIGN have transformed few-shot and zero-shot vision applications by fusing visual and textual data, yet the integrative few-shot classification and segmentation (FS-CS) task primarily leverages visual cues, overlooking the potential of textual support. In FS-CS scenarios, ambiguous object boundaries and overlapping classes often hinder model performance, as limited visual data struggles to fully capture high-level semantics. To bridge this gap, we present a novel multi-modal FS-CS framework that integrates textual cues into support data, facilitating enhanced semantic disambiguation and fine-grained segmentation. Our approach first investigates the unique contributions of exclusive text-based support, using only class labels to achieve FS-CS. This strategy alone achieves performance competitive with vision-only methods on FS-CS tasks, underscoring the power of textual cues in few-shot learning. Building on this, we introduce a dual-modal prediction mechanism that synthesizes insights from both textual and visual support sets, yielding robust multi-modal predictions. This integration significantly elevates FS-CS performance, with classification and segmentation improvements of +3.7/6.6% (1-way 1-shot) and +8.0/6.5% (2-way 1-shot) on COCO-20i, and +2.2/3.8% (1-way 1-shot) and +4.3/4.0% (2-way 1-shot) on Pascal-5i. Additionally, in weakly supervised FS-CS settings, our method surpasses visual-only benchmarks using textual support exclusively, further enhanced by our dual-modal predictions. …
Poster
Guoyu Yang · Yuan Wang · Daming Shi · Yanzhong Wang
[ ExHall D ]
Abstract
Recent real-time semantic segmentation models, whether single-branch or multi-branch, achieve good performance and speed. However, their speed is limited by multi-path blocks, and some depend on high-performance teacher models for training. To overcome these issues, we propose Golden Cudgel Network (GCNet). Specifically, GCNet uses vertical multi-convolutions and horizontal multi-paths for training, which are reparameterized into a single convolution for inference, optimizing both performance and speed. This design allows GCNet to self-enlarge during training and self-contract during inference, effectively becoming a "teacher model" without needing external ones. Experimental results show that GCNet outperforms existing state-of-the-art models in terms of performance and speed on the Cityscapes, CamVid, and Pascal VOC 2012 datasets. The code is available at x.
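The reparameterization step this abstract relies on can be illustrated with a toy example: because convolution is linear, parallel branches used during training can be folded into one convolution for inference. This is a generic sketch of that principle under simplifying assumptions (same kernel size, no normalization layers), not GCNet's actual block.

```python
# Minimal sketch of structural reparameterization: two parallel 3x3 convs used
# at training time are folded into a single 3x3 conv for inference.
import torch
import torch.nn as nn

conv_a = nn.Conv2d(16, 16, 3, padding=1)
conv_b = nn.Conv2d(16, 16, 3, padding=1)

# Because convolution is linear, the sum of two branches equals one conv whose
# weights and biases are the element-wise sums of the branch parameters.
fused = nn.Conv2d(16, 16, 3, padding=1)
with torch.no_grad():
    fused.weight.copy_(conv_a.weight + conv_b.weight)
    fused.bias.copy_(conv_a.bias + conv_b.bias)

x = torch.randn(1, 16, 32, 32)
y_train = conv_a(x) + conv_b(x)   # multi-branch "training-time" output
y_infer = fused(x)                # single-conv "inference-time" output
print(torch.allclose(y_train, y_infer, atol=1e-5))  # True, up to float error
```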
Poster
Hyeokjun Kweon · Kuk-Jin Yoon
[ ExHall D ]
Abstract
Instance segmentation traditionally relies on dense pixel-level annotations, making it costly and labor-intensive. To alleviate this burden, weakly supervised instance segmentation utilizes cost-effective weak labels, such as image-level tags, points, and bounding boxes. However, existing approaches typically focus on a single type of weak label, overlooking the cost-efficiency potential of combining multiple types. In this paper, we introduce WISH, a novel heterogeneous framework for weakly supervised instance segmentation that integrates diverse weak label types within a single model. WISH unifies heterogeneous labels by leveraging SAM’s prompt latent space through a multi-stage matching strategy, effectively compensating for the lack of spatial information in class tags. Extensive experiments on Pascal VOC and COCO demonstrate that our framework not only surpasses existing homogeneous weak supervision methods but also achieves superior results in heterogeneous settings with equivalent annotation costs.
Poster
Can Küçüksözen · Yucel Yemez
[ ExHall D ]
Abstract
We propose the Compact Clustering Attention (COCA) layer, an effective building block that introduces a hierarchical strategy for object-centric representation learning while solving the unsupervised object discovery task on single images. COCA is an attention-based clustering module capable of extracting object-centric representations from multi-object scenes, when cascaded into a bottom-up hierarchical network architecture, referred to as COCA-Net. At its core, COCA utilizes a novel clustering algorithm that leverages the physical concept of compactness to highlight distinct object centroids in a scene, providing a spatial inductive bias. Thanks to this strategy, COCA-Net generates high-quality segmentation masks on both the decoder side and, notably, the encoder side of its pipeline. Additionally, COCA-Net is not bound by a predetermined number of object masks that it generates and handles the segmentation of background elements better than its competitors. We demonstrate COCA-Net's segmentation performance on six widely adopted datasets, achieving superior or competitive results against the state-of-the-art models across nine different evaluation metrics.
Poster
Mingfu Liang · Jiahuan Zhou · Xu Zou · Ying Wu
[ ExHall D ]
Abstract
Existing progress in object keypoint estimation primarily benefits from the conventional supervised learning paradigm based on numerous data labeled with pre-defined keypoints. However, these well-trained models can hardly detect the undefined new keypoints in test time, which largely hinders their feasibility for diverse downstream tasks. To handle this, various solutions are explored but still suffer from either limited generalizability or transferability. Therefore, in this paper, we explore a novel keypoint learning paradigm in which we only annotate new keypoints in the new data and incrementally train the model, without retaining any old data, called Incremental object Keypoint Learning (IKL). A two-stage learning scheme is developed as a novel baseline tailored to IKL. In the first Knowledge Association stage, given the data labeled with only new keypoints, an auxiliary KA-Net is trained to automatically associate the old keypoints to these new ones based on their spatial and intrinsic anatomical relations. In the second Mutual Promotion stage, based on a keypoint-oriented spatial distillation loss, we jointly leverage the auxiliary KA-Net and the old model for knowledge consolidation to mutually promote the estimation of all old and new keypoints. Owing to the investigation of the correlations between new and old keypoints, our proposed method …
Poster
Shuo Li · Fang Liu · Zehua Hao · Xinyi Wang · Lingling Li · Xu Liu · Puhua Chen · Wenping Ma
[ ExHall D ]
Abstract
With its powerful visual-language alignment capability, CLIP performs well in zero-shot and few-shot learning tasks. However, we found in experiments that CLIP's logits suffer from serious inter-class confusion in downstream tasks, and this ambiguity between categories severely degrades accuracy. To address this challenge, we propose a novel method called Logits DeConfusion, which effectively learns and eliminates inter-class confusion in logits by combining our Multi-level Adapter Fusion (MAF) module with our Inter-Class Deconfusion (ICD) module. First, MAF extracts features from different levels of the CLIP image encoder and fuses them uniformly to enhance feature representation. Second, ICD learns to eliminate inter-class confusion in logits with a residual structure. Experimental results on multiple benchmarks show that our method can significantly improve the classification performance and alleviate the category confusion problem.
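As an illustration of the residual deconfusion idea, the sketch below adds a learnable, zero-initialized linear correction on top of CLIP's logits. The module name and initialization are assumptions; the actual ICD module likely differs in detail.

```python
# Hypothetical sketch of a residual "deconfusion" head on top of CLIP logits:
# a learnable linear map predicts a per-class correction that is added back to
# the original logits. Purely illustrative of the residual structure.
import torch
import torch.nn as nn

class LogitDeconfusion(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        self.correction = nn.Linear(num_classes, num_classes, bias=False)
        nn.init.zeros_(self.correction.weight)   # zero correction = identity at init

    def forward(self, logits: torch.Tensor) -> torch.Tensor:
        return logits + self.correction(logits)  # residual structure

logits = torch.randn(8, 100)                     # CLIP similarity logits, 100 classes
print(LogitDeconfusion(100)(logits).shape)       # torch.Size([8, 100])
```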
Poster
Luyao Tang · Chaoqi Chen · Yuxuan Yuan · Zeyu Zhang · Yue Huang · Kun Zhang
[ ExHall D ]
Abstract
Although foundation models (FMs) claim to be powerful, their generalization ability significantly decreases when faced with distribution shifts, weak supervision, or malicious attacks in the open world. On the other hand, most domain generalization or adversarial fine-tuning methods are task-related or model-specific, ignoring the universality in practical applications and the transferability between FMs. This paper delves into the problem of generalizing FMs to the out-of-domain data. We propose a novel framework, Object-Concept-Relation Triad (OCRT), that enables FMs to extract sparse, high-level concepts and intricate relational structures from raw visual inputs. The key idea is to bind objects in visual scenes and a set of object-centric representations through unsupervised decoupling and iterative refinement. To be specific, we project the object-centric representations onto a semantic concept space that the model can readily interpret, and estimate their importance to filter out irrelevant elements. Then, a concept-based graph, which has a flexible degree, is constructed to incorporate the set of concepts and their corresponding importance, enabling the extraction of high-order factors from informative concepts and facilitating relational reasoning among these concepts. Extensive experiments demonstrate that OCRT can substantially boost the generalizability and robustness of SAM and CLIP across multiple downstream tasks.
Poster
Lei-Lei Ma · Shuo Xu · Ming-Kun Xie · Lei Wang · Dengdi Sun · Haifeng Zhao
[ ExHall D ]
Abstract
Modeling label correlations has always played a pivotal role in multi-label image classification (MLC), attracting significant attention from researchers. However, recent studies have overemphasized co-occurrence relationships among labels, which risks overfitting to these co-occurrence patterns and results in suboptimal models. To tackle this problem, we advocate for balancing correlative and discriminative relationships among labels to mitigate the risk of overfitting and enhance model performance. To this end, we propose the Multi-Label Visual Prompt Tuning framework, a novel and parameter-efficient method that groups classes into multiple class subsets according to label co-occurrence and mutual exclusivity relationships, and then models them respectively to balance the two relationships. In this work, since each group contains multiple classes, multiple prompt tokens are adopted within the Vision Transformer (ViT) to capture the correlative or discriminative label relationships within each group, and effectively learn correlative or discriminative representations for class subsets. On the other hand, each group contains multiple group-level visual representations that may correspond to multiple classes, and the mixture of experts (MoE) model can cleverly assign them from the group level to the label level, adaptively obtaining label-level representations, which is more conducive to classification. Experiments on multiple benchmark datasets show that our proposed …
Poster
Qiyuan Dai · Hanzhuo Huang · Yu Wu · Sibei Yang
[ ExHall D ]
Abstract
Generalized Category Discovery (GCD) aims to recognize unlabeled images from known and novel classes by distinguishing novel classes from known ones, while also transferring knowledge from another set of labeled images with known classes. Existing GCD methods rely on self-supervised vision transformers such as DINO for representation learning. However, focusing solely on the global representation of the DINO CLS token introduces an inherent trade-off between discriminability and generalization. In this paper, we introduce an adaptive part discovery and learning method, called APL, which generates consistent object parts and their correspondences across different similar images using a set of shared learnable part queries and DINO part priors, without requiring any additional annotations. More importantly, we propose a novel all-min contrastive loss to learn discriminative yet generalizable part representation, which adaptively highlights discriminative object parts to distinguish similar categories for enhanced discriminability while simultaneously sharing other parts to facilitate knowledge transfer for improved generalization. Our APL can easily be incorporated into different GCD frameworks by replacing their CLS token feature with our part representations, showing significant enhancements on fine-grained datasets.
Poster
Xing Xi · Yangyang Huang · Ronghua Luo · Yu Qiu
[ ExHall D ]
Abstract
Open world perception expands traditional closed-set frameworks, which assume a predefined set of known categories, to encompass dynamic real-world environments. Open World Object Detection (OWOD) and Open Vocabulary Object Detection (OVD) are two main research directions, each addressing unique challenges in dynamic environments. However, existing studies often focus on only one of these tasks, leaving the combined challenges of OWOD and OVD largely underexplored. In this paper, we propose a novel detector, OW-OVD, which inherits the zero-shot generalization capability of OVD detectors while incorporating the ability to actively detect unknown objects and progressively optimize performance through incremental learning, as seen in OWOD detectors. To achieve this, we start with a standard OVD detector and adapt it for OWOD tasks. For attribute selection, we propose the Visual Similarity Attribute Selection (VSAS) method, which identifies the most generalizable attributes by computing similarity distributions across annotated and unannotated regions. Additionally, to ensure the diversity of attributes, we incorporate a similarity constraint in the iterative process. Finally, to preserve the standard inference process of OVD, we propose the Hybrid Attribute-Uncertainty Fusion (HAUF) method. This method combines attribute similarity with known class uncertainty to infer the likelihood of an object belonging to an unknown class. …
Poster
Haochen Li · Rui Zhang · Hantao Yao · Xin Zhang · Yifan Hao · Xinkai Song · Shaohui Peng · Yongwei Zhao · Zhao Chen · Yanjun Wu · Ling Li
[ ExHall D ]
Abstract
Domain adaptive object detection (DAOD) aims to generalize detectors trained on an annotated source domain to an unlabelled target domain. Traditional works focus on aligning visual features between domains to extract domain-invariant knowledge, and recent VLM-based DAOD methods leverage semantic information provided by the textual encoder to supplement domain-specific features for each domain. However, they overlook the role of semantic information in guiding the learning of visual features that are beneficial for adaptation. To solve this problem, we propose semantic entropy to quantify the semantic information contained in visual features, and design SEmantic ENtropy guided Domain-aware Attention (SEEN-DA) to adaptively refine visual features with the semantic information of two domains. Semantic entropy reflects the importance of features based on semantic information, which can serve as attention to select discriminative visual features and suppress semantically irrelevant redundant information. Guided by semantic entropy, we introduce domain-aware attention modules into the visual encoder in SEEN-DA. These modules utilize an inter-domain attention branch to extract domain-invariant features and eliminate redundant information, and an intra-domain attention branch to supplement the domain-specific semantic information discriminative on each domain. Comprehensive experiments validate the effectiveness of SEEN-DA, demonstrating significant improvements in cross-domain object detection performance.
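One plausible reading of "semantic entropy" is the entropy of a visual feature's softmax-normalized similarities to the class text embeddings, which is what the hedged sketch below computes; the temperature and exact normalization are assumptions, not the paper's definition.

```python
# Hypothetical sketch of "semantic entropy": for each visual feature, take its
# cosine similarities to text/class embeddings, normalize with softmax, and
# compute the entropy of that distribution. Low entropy = semantically
# discriminative; high entropy = semantically vague. Illustrative only.
import torch
import torch.nn.functional as F

def semantic_entropy(visual: torch.Tensor, text: torch.Tensor, tau: float = 0.07):
    v = F.normalize(visual, dim=-1)            # (N, D) visual features
    t = F.normalize(text, dim=-1)              # (C, D) class text embeddings
    p = F.softmax(v @ t.T / tau, dim=-1)       # (N, C) semantic distribution
    return -(p * p.clamp_min(1e-12).log()).sum(dim=-1)   # (N,) entropy per feature

vis = torch.randn(5, 512)
txt = torch.randn(20, 512)
print(semantic_entropy(vis, txt))
```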
Poster
Zhaohu Xing · Lihao Liu · Yijun Yang · Hongqiu Wang · Tian Ye · Sixiang Chen · Wenxue Li · Guang Liu · Lei Zhu
[ ExHall D ]
Abstract
Mirror detection is a challenging task because a mirror's visual appearance varies depending on the reflected content. Due to limited annotated data, current methods fail to generalize well when detecting diverse mirror scenes. Semi-supervised learning with large-scale unlabeled data can improve generalization capabilities for mirror detection, but these methods often suffer from unreliable pseudo-labels due to distribution differences between labeled and unlabeled data, thereby affecting the learning process. To address this issue, we first collect a large-scale dataset of approximately 0.4 million mirror-related images from the internet, significantly expanding the data scale for mirror detection. To effectively exploit this unlabeled dataset, we propose the first semi-supervised framework (namely an iterative data engine) consisting of four steps: (1) mirror detection model training, (2) pseudo label prediction, (3) dual guidance scoring, and (4) selection of highly reliable pseudo labels. In each iteration of the data engine, we employ a geometric accuracy scoring approach to assess pseudo labels based on multiple segmentation metrics, and design a multi-modal agent-driven semantic scoring approach to enhance the semantic perception of pseudo labels. These two scoring approaches can effectively improve the reliability of pseudo labels by selecting unlabeled samples with higher scores. Our method demonstrates promising performance …
Poster
Beier Zhu · Jiequan Cui · Hanwang Zhang · Chi Zhang
[ ExHall D ]
Abstract
While image-text foundation models have succeeded across diverse downstream tasks, they still face challenges in the presence of spurious correlations between the input and label. To address this issue, we propose a simple three-step approach--Project-Probe-Aggregate (PPA)--that enables parameter-efficient fine-tuning for foundation models without relying on group annotations. Building upon the failure-based debiasing scheme, our method, PPA, improves its two key components: minority sample identification and the robust training algorithm. Specifically, we first train biased classifiers by projecting image features onto the nullspace of class proxies from text encoders. Next, we infer group labels using the biased classifier and probe group targets with prior correction. Finally, we aggregate group weights of each class to produce the debiased classifier. Our theoretical analysis shows that our PPA enhances minority group identification and is Bayes optimal for minimizing the balanced group error, mitigating spurious correlations. Extensive experimental results confirm the effectiveness of our PPA: it outperforms the state-of-the-art in average worst-group accuracy while requiring less than 0.01% tunable parameters without training group labels.
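The projection step can be made concrete with a short sketch: image features are projected onto the nullspace of the text-encoder class proxies, so the class-aligned directions are removed before training the biased classifier. The variable names and the use of a pseudoinverse here are assumptions about that step, not the authors' code.

```python
# Minimal sketch of nullspace projection: remove the class-proxy directions
# from image features so that what remains is class-irrelevant information.
import torch

def nullspace_project(features: torch.Tensor, proxies: torch.Tensor) -> torch.Tensor:
    # features: (N, D) image features; proxies: (C, D) text class proxies, C < D
    d = proxies.shape[1]
    projector = torch.eye(d) - torch.linalg.pinv(proxies) @ proxies   # (D, D)
    return features @ projector.T

feats = torch.randn(32, 512)
prox = torch.randn(10, 512)
z = nullspace_project(feats, prox)
print((z @ prox.T).abs().max())   # ~0: class-proxy directions are removed
```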
Poster
Kai Zhao · zhihao zhuang · Miao Zhang · Chenjuan Guo · Yang Shu · Bin Yang
[ ExHall D ]
Abstract
Model quantization is an effective way to compress deep neural networks and accelerate the inference time on edge devices. Existing quantization methods usually require original data for calibration during the compressing process, which may be inaccessible due to privacy issues. A common way is to generate calibration data to mimic the original data. However, the generators in these methods have the mode collapse problem, making them unable to synthesize diverse data. To solve this problem, we leverage the information from the full-precision model and enhance both inter-class and intra-class diversity for generating better calibration data, by devising a multi-layer feature mixer and normalizing-flow-based attention. Besides, novel regularization losses are proposed to make the generator produce diverse data with more patterns from the perspective of activated feature values and for the quantized model to learn better clip ranges adaptive to our diverse calibration data. Extensive experiments show that our method achieves state-of-the-art quantization results for both Transformer and CNN architectures. In addition, we visualize the generated data to verify that our strategies can effectively handle the mode collapse issue. Our codes are available at https://anonymous.4open.science/r/DFQ-84E6 and will be publicly available.
Poster
Zhou Yang · Mingtao Feng · Tao Huang · Fangfang Wu · Weisheng Dong · Xin Li · Guangming Shi
[ ExHall D ]
Abstract
Recent approaches, such as data augmentation, adversarial training, and transfer learning, have shown potential in addressing the issue of performance degradation caused by distributional shifts. However, they typically demand careful design in terms of data or models and lack awareness of the impact of distributional shifts. In this paper, we observe that classification errors arising from distribution shifts tend to cluster near the true values, suggesting that misclassifications commonly occur in semantically similar, neighboring categories. Furthermore, robust advanced vision foundation models maintain larger inter-class distances while preserving semantic consistency, making them less vulnerable to such shifts. Building on these findings, we propose a new method called GFN (Gain From Neighbors), which uses gradient priors from neighboring classes to perturb input images and incorporates an inter-class distance-weighted loss to improve class separation. This approach encourages the model to learn more resilient features from data prone to errors, enhancing its robustness against shifts in diverse settings. In extensive experiments across various model architectures and benchmark datasets, GFN consistently demonstrated superior performance. For instance, compared to the current state-of-the-art TAPADL method, our approach achieved a higher corruption robustness of 41.4% on ImageNet-C (+2.3%), without requiring additional parameters and using only minimal data.
Poster
HAN SUN · Yunkang Cao · Hao Dong · Olga Fink
[ ExHall D ]
Abstract
Visual anomaly detection (AD) presents significant challenges due to the scarcity of anomalous data samples. While numerous works have been proposed to synthesize anomalous samples, these synthetic anomalies often lack authenticity or require extensive training data, limiting their applicability in real-world scenarios. In this work, we propose Anomaly Anything (AnomalyAny), a novel framework that leverages Stable Diffusion (SD)'s image generation capabilities to generate diverse and realistic unseen anomalies. By conditioning on a single normal sample during test time, AnomalyAny is able to generate unseen anomalies for arbitrary object types with text descriptions. Within AnomalyAny, we propose attention-guided anomaly optimization to direct SD's attention on generating hard anomaly concepts. Additionally, we introduce prompt-guided anomaly refinement, incorporating detailed descriptions to further improve the generation quality. Extensive experiments on MVTec AD and VisA datasets demonstrate AnomalyAny's ability in generating high-quality unseen anomalies and its effectiveness in enhancing downstream AD performance.
Poster
Lei Fan · Dongdong Fan · Zhiguang Hu · Yiwen Ding · Donglin Di · Kai Yi · Maurice Pagnucco · Yang Song
[ ExHall D ]
Abstract
We present MANTA, a visual-text anomaly detection dataset for tiny objects. The visual component comprises over 137.3K images across 38 object categories spanning five typical domains, of which 8.6K images are labeled as anomalous with pixel-level annotations. Each image is captured from five distinct viewpoints to ensure comprehensive object coverage. The text component consists of two subsets: Declarative Knowledge, including 875 words that describe common anomalies across various domains and specific categories, with detailed explanations for <what, why, how>, including causes and visual characteristics; and Constructivist Learning, providing 2K multiple-choice questions with varying levels of difficulty, each paired with images and corresponding answer explanations. We also propose a baseline for visual-text tasks and conduct extensive benchmarking experiments to evaluate advanced methods across different settings, highlighting the challenges and efficacy of our dataset.
Poster
Shilhora Akshay · Niveditha Lakshmi Narasimhan · Jacob George · Vineeth Balasubramanian
[ ExHall D ]
Abstract
Anomaly detection and localization remain pivotal challenges in computer vision, with applications ranging from industrial inspection to medical diagnostics. While current supervised methods offer high precision, they are often impractical due to the scarcity of annotated data and the infrequent occurrence of anomalies. Recent advancements in unsupervised approaches, particularly reconstruction-based methods, have addressed these issues by training models exclusively on normal data, enabling them to identify anomalies during inference. However, these methods frequently rely on auxiliary networks or specialized adaptations, which can limit their robustness and practicality. This work introduces the Latent Anomaly Schrodinger Bridge (LASB), a unified unsupervised anomaly detection model that operates entirely in the latent space without requiring additional networks or custom modifications. LASB transforms anomaly images into normal images by preserving structural integrity across varying anomaly classes, lighting, and pose conditions, making it highly robust and versatile. Unlike previous methods, LASB does not focus solely on reconstructing anomaly features but emphasizes anomaly transformation, achieving smooth anomaly-to-normal image conversions. Our method achieves state-of-the-art performance on both the MVTec-AD and VisA datasets, excelling in detection and localization tasks.
Poster
Yoon Gyo Jung · Jaewoo Park · Jaeho Yoon · Kuan-Chuan Peng · Wonchul Kim · Andrew Beng Jin Teoh · Octavia Camps
[ ExHall D ]
Abstract
We aim to solve unsupervised anomaly detection in a practical challenging environment where the normal dataset is both contaminated with defective regions and its product class distribution is tailed but unknown. We observe that existing models suffer from tail-versus-noise trade-off where if a model is robust against pixel noise, then its performance deteriorates on tail class samples, and vice versa. To mitigate the issue, we handle the tail class and noise samples independently. To this end, we propose TailSampler, a novel class size predictor that estimates the class cardinality of samples based on a symmetric assumption on the class-wise distribution of embedding similarities. TailSampler can be utilized to sample the tail class samples exclusively, allowing to handle them separately. Based on these facets, we build a memory-based anomaly detection model TailedCore, whose memory both well captures tail class information and is noise-robust. We extensively validate the effectiveness of TailedCore on the unsupervised long-tail noisy anomaly detection setting, and show that TailedCore outperforms the state-of-the-art in most settings.
Poster
Shubhang Bhatnagar · Narendra Ahuja
[ ExHall D ]
Abstract
Deep metric learning (DML) involves training a network to learn a semantically meaningful representation space. Many current approaches mine n-tuples of examples and model interactions within each tuple. We present a novel, compositional DML model that, instead of modeling tuples, represents the influence of each example (embedding) by a continuous potential field, and superposes the fields to obtain their combined global potential field. We use attractive/repulsive potential fields to represent interactions among embeddings from images of the same/different classes. Contrary to typical learning methods, where mutual influence of samples is proportional to their distance, we enforce reduction in such influence with distance, leading to a decaying field. We show that such decay helps improve performance on real-world datasets with large intra-class variations and label noise. Like other proxy-based methods, we also use proxies to succinctly represent sub-populations of examples. We evaluate our method on three standard DML benchmarks, Cars-196, CUB-200-2011, and SOP, where it outperforms state-of-the-art baselines.
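To illustrate the decaying-field idea, here is a toy potential-field loss in which each embedding attracts same-class embeddings and repels different-class ones, with influence that decays exponentially with distance. The exponential kernel and hyperparameters are assumptions for illustration; the paper's field construction may differ.

```python
# Toy sketch of a decaying attractive/repulsive potential-field objective.
import torch

def potential_field_loss(emb: torch.Tensor, labels: torch.Tensor, sigma: float = 1.0):
    dist = torch.cdist(emb, emb)                        # pairwise distances (N, N)
    decay = torch.exp(-dist / sigma)                    # influence decays with distance
    same = (labels[:, None] == labels[None, :]).float()
    eye = torch.eye(len(emb), device=emb.device)
    attract = (1.0 - decay) * same * (1 - eye)          # pull same-class embeddings together
    repel = decay * (1.0 - same)                        # push different classes apart
    return (attract + repel).mean()

emb = torch.nn.functional.normalize(torch.randn(16, 64), dim=-1)
labels = torch.randint(0, 4, (16,))
print(potential_field_loss(emb, labels))
```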
Poster
Yanghao Wang · Long Chen
[ ExHall D ]
Abstract
Data Augmentation (DA), i.e., synthesizing faithful and diverse samples to expand the original training set, is a prevalent and effective strategy to improve the performance of various data-scarce tasks. With its powerful image generation ability, diffusion-based DA has shown strong performance gains on different image classification benchmarks. In this paper, we analyze today's diffusion-based DA methods, and argue that they cannot account for both faithfulness and diversity, which are two critical keys for generating high-quality samples and boosting classification performance. To this end, we propose a novel Diffusion-based DA method: Diff-II. Specifically, it consists of three steps: 1) Category concepts learning: Learning concept embeddings for each category. 2) Inversion interpolation: Calculating the inversion for each image, and conducting circle interpolation for two randomly sampled inversions from the same category. 3) Two-stage denoising: Using different prompts to generate synthesized images in a coarse-to-fine manner. Extensive experiments on various data-scarce image classification tasks (e.g., few-shot, long-tailed, and out-of-distribution classification) have demonstrated its effectiveness over state-of-the-art diffusion-based DA methods.
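Assuming the "circle interpolation" between two inversions behaves like spherical interpolation of latents, the sketch below shows the corresponding slerp step; the actual interpolation used by Diff-II may differ.

```python
# Hedged sketch of interpolating two inverted latents of the same category via
# spherical interpolation (slerp); the result would then be denoised.
import torch

def slerp(z0: torch.Tensor, z1: torch.Tensor, alpha: float) -> torch.Tensor:
    z0_flat, z1_flat = z0.flatten(), z1.flatten()
    cos = torch.dot(z0_flat, z1_flat) / (z0_flat.norm() * z1_flat.norm())
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    return (torch.sin((1 - alpha) * theta) * z0 + torch.sin(alpha * theta) * z1) / torch.sin(theta)

z_a = torch.randn(4, 64, 64)          # inversion of image A (same category)
z_b = torch.randn(4, 64, 64)          # inversion of image B
z_new = slerp(z_a, z_b, alpha=0.3)    # interpolated latent to be denoised
print(z_new.shape)
```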
Poster
Shaobo Wang · Yicun Yang · Zhiyuan Liu · Chenghao Sun · Xuming Hu · Conghui He · Linfeng Zhang
[ ExHall D ]
Abstract
Dataset distillation has emerged as a powerful approach for reducing data requirements in deep learning. Among various methods, distribution matching-based approaches stand out for their balance of computational efficiency and strong performance. However, existing distance metrics used in distribution matching often fail to accurately capture distributional differences, leading to unreliable measures of discrepancy. In this paper, we reformulate dataset distillation as a min-max optimization problem and introduce Neural Characteristic Function Discrepancy (NCFD), a comprehensive and theoretically grounded metric for measuring distributional differences. NCFD leverages the Characteristic Function (CF) to encapsulate full distributional information, employing a neural network to optimize the sampling strategy for the CF's frequency arguments, thereby maximizing the discrepancy to enhance distance estimation. Simultaneously, we minimize the difference between real and synthetic data under this optimized NCFD measure. Our approach, termed Neural Characteristic Function Matching (NCFM), inherently aligns the phase and amplitude of neural features in the complex plane for both real and synthetic data, achieving a balance between realism and diversity in synthetic samples. Experiments demonstrate that our method achieves significant performance gains over state-of-the-art methods on both low- and high-resolution datasets. Notably, we achieve a 20.5% accuracy boost on ImageSquawk. Our method also reduces GPU memory …
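A simplified version of the characteristic-function discrepancy can be written in a few lines: compare the empirical characteristic functions of real and synthetic features at a set of sampled frequencies. Here the frequencies are drawn from a fixed Gaussian; in the paper they are proposed by a learned sampling network, which this sketch omits.

```python
# Simplified sketch of a characteristic-function discrepancy between real and
# synthetic features, evaluated at randomly sampled frequency arguments.
import torch

def cf_discrepancy(real: torch.Tensor, syn: torch.Tensor, num_freq: int = 256):
    d = real.shape[1]
    t = torch.randn(num_freq, d)                         # sampled frequency arguments
    def ecf(x):                                          # empirical CF: E[exp(i t.x)]
        proj = x @ t.T                                   # (N, num_freq)
        return torch.complex(torch.cos(proj).mean(0), torch.sin(proj).mean(0))
    return (ecf(real) - ecf(syn)).abs().pow(2).mean()    # mean squared CF gap

real = torch.randn(512, 128)
syn = torch.randn(64, 128)
print(cf_discrepancy(real, syn))
```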
Poster
Wenliang Zhong · Haoyu Tang · Qinghai Zheng · Mingzhu Xu · Yupeng Hu · Weili Guan
[ ExHall D ]
Abstract
The rapid evolution of deep learning and large language models has led to an exponential growth in the demand for training data, prompting the development of Dataset Distillation methods to address the challenges of managing large datasets. Among these, Matching Training Trajectories (MTT) has been a prominent approach, which replicates the training trajectory of an expert network on real data with a synthetic dataset. However, our investigation found that this method suffers from three significant limitations: 1. Instability of expert trajectory generated by Stochastic Gradient Descent (SGD); 2. Low convergence speed of the distillation process; 3. High storage consumption of the expert trajectory. To address these issues, we offer a new perspective on understanding the essence of Dataset Distillation and MTT through a simple transformation of the objective function, and introduce a novel method called Matching Convexified Trajectory (MCT), which aims to provide better guidance for the student trajectory. MCT creates convex combinations of expert trajectories by selecting a few expert models, guiding student networks to converge quickly and stably. This trajectory is not only easier to store, but also enables continuous sampling strategies during the distillation process, ensuring thorough learning and fitting of the entire expert trajectory. The comprehensive …
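The "convexified trajectory" idea reduces, at its core, to taking convex combinations of a few expert checkpoints, as in the toy sketch below; the weighting scheme and checkpoint selection shown here are illustrative assumptions rather than MCT's actual procedure.

```python
# Toy sketch: form a target state as a convex combination of expert checkpoints.
import torch

def convex_combination(checkpoints, weights):
    assert abs(sum(weights) - 1.0) < 1e-6 and all(w >= 0 for w in weights)
    mixed = {}
    for key in checkpoints[0]:
        mixed[key] = sum(w * ckpt[key] for w, ckpt in zip(weights, checkpoints))
    return mixed

# Toy "expert checkpoints" (in practice: torch.load-ed state dicts).
ckpts = [{"fc.weight": torch.randn(8, 4)} for _ in range(3)]
target = convex_combination(ckpts, weights=[0.5, 0.3, 0.2])
print(target["fc.weight"].shape)
```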
Poster
Felipe del Rio · Alain Raymond · Daniel Florea · Rodrigo Toro Icarte · Julio Hurtado · Cristian Buc Calderon · Alvaro Soto
[ ExHall D ]
Abstract
Deep neural networks (DNNs) struggle at systematic generalization (SG). Several studies have evaluated the possibility of promoting SG through the proposal of novel architectures, loss functions or training methodologies. Few studies, however, have focused on the role of training data properties in promoting SG. In this work, we investigate the impact of certain data distributional properties, as inductive biases, on the SG ability of a multi-modal language model. To this end, we study three different properties. First, data diversity, instantiated as an increase in the possible values a latent property in the training distribution may take. Second, burstiness, where we probabilistically restrict the number of possible values of latent factors on particular inputs during training. Third, latent intervention, where a particular latent factor is altered randomly during training. We find that all three factors significantly enhance SG, with diversity contributing an 89% absolute increase in accuracy in the most affected property. Through a series of experiments, we test various hypotheses to understand why these properties promote SG. Finally, we find that Normalized Mutual Information (NMI) between latent attributes in the training distribution is strongly predictive of out-of-distribution generalization. We find that a mechanism by which lower NMI induces SG is …
Poster
Seokju Yun · Seunghye Chae · Dongheon Lee · Youngmin Ro
[ ExHall D ]
Abstract
Domain generalization (DG) aims to adapt a model using one or multiple source domains to ensure robust performance in unseen target domains. Recently, Parameter-Efficient Fine-Tuning (PEFT) of foundation models has shown promising results in the context of DG problem. Nevertheless, existing PEFT methods still struggle to strike a balance between preserving generalizable components of the pre-trained model and learning task-specific features. To gain insights into the distribution of generalizable components, we begin by analyzing the pre-trained weights through the lens of singular value decomposition. Building on these insights, we introduce Singular Value Decomposed Low-Rank Adaptation (SoRA), an approach that selectively tunes minor singular components while keeping the residual parts frozen. SoRA effectively retains the generalization ability of the pre-trained model while efficiently acquiring task-specific skills. Furthermore, we freeze domain-generalizable blocks and employ an annealing weight decay strategy, thereby achieving an optimal balance in the delicate trade-off between generalizability and discriminability. SoRA attains state-of-the-art results on multiple benchmarks that span both domain-generalized semantic segmentation and object detection. In addition, our methods introduce no additional inference overhead or regularization loss, maintain compatibility with any backbone or head, and are designed to be versatile, allowing easy integration into a wide range of …
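A hedged sketch of the SVD-based tuning idea: decompose a pre-trained weight, freeze the major singular components, and make only the minor components trainable. The split point r and the module layout are assumptions, not the released SoRA code.

```python
# Sketch: split a pre-trained weight via SVD into frozen major components and
# trainable minor components.
import torch
import torch.nn as nn

class MinorSVDLinear(nn.Module):
    def __init__(self, weight: torch.Tensor, r: int):
        super().__init__()
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        # Major part (largest singular values) stays frozen as a buffer.
        self.register_buffer("major", U[:, :-r] @ torch.diag(S[:-r]) @ Vh[:-r])
        # Minor part (smallest r singular values) is trainable.
        self.u = nn.Parameter(U[:, -r:] @ torch.diag(S[-r:]))
        self.vh = nn.Parameter(Vh[-r:].clone())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ (self.major + self.u @ self.vh).T

w = torch.randn(256, 256)                 # stand-in for a pre-trained weight
layer = MinorSVDLinear(w, r=8)
print(layer(torch.randn(4, 256)).shape)   # torch.Size([4, 256])
```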
Poster
Hao Zhu · Yifei Zhang · Junhao Dong · Piotr Koniusz
[ ExHall D ]
Abstract
Continual learning requires models to learn tasks sequentially while maintaining a delicate balance between stability (retaining knowledge of previous tasks) and plasticity (adapting to new tasks). A key challenge is preventing interference between tasks - where learning new tasks degrades performance on previously learned ones. Recent approaches have leveraged parameter-efficient fine-tuning (PEFT) methods, which adapt pre-trained models by injecting a small number of learnable parameters. However, existing PEFT-based continual learning methods like InfLoRA face fundamental limitations: they rely on complex optimization procedures to learn orthogonal task-specific spaces, and finding such spaces becomes increasingly difficult as tasks accumulate. We propose a novel bilinear reformulation that fundamentally reimagines task separation through fixed orthogonal bases. Our key insight is that by expanding the parameter space quadratically through two fixed bases, we can achieve "almost orthogonal" task subspaces probabilistically, eliminating the need for explicit interference elimination procedures. We provide theoretical guarantees that this approach reduces the probability of task interference from O((k/d)^2) to O((k/d^2)^2), ensuring reliable task separation without complex optimization. Through extensive experiments on ImageNet-R, CIFAR100, and DomainNet, we validate our theoretical bounds and demonstrate state-of-the-art performance with reduced parameter count.
Poster
Haoyang Li · Liang Wang · Chao Wang · Jing Jiang · Yan Peng · Guodong Long
[ ExHall D ]
Abstract
The Base-New Trade-off (BNT) problem universally exists during the optimization of CLIP-based prompt tuning, where continuous fine-tuning on base (target) classes leads to a simultaneous decrease of generalization ability on new (unseen) classes. Existing approaches attempt to regulate the prompt tuning process to balance BNT by appending constraints. However, imposed on the same target prompt, these constraints fail to fully avert the mutual exclusivity between the optimization directions for base and new classes. As a novel solution to this challenge, we propose the plug-and-play Dual-Prompt Collaboration (DPC) framework, the first to decouple the optimization processes of base and new tasks at the prompt level. Specifically, we clone a learnable parallel prompt based on the backbone prompt, and introduce a variable Weighting-Decoupling framework to independently control the optimization directions of dual prompts specific to base or new tasks, thus avoiding the conflict in generalization. Meanwhile, we propose a Dynamic Hard Negative Optimizer, utilizing dual prompts to construct a more challenging optimization task on base classes for enhancement. For interpretability, we prove the feature channel invariance of the prompt vector during the optimization process, providing theoretical support for the Weighting-Decoupling of DPC. Extensive experiments on multiple backbones demonstrate that DPC can significantly improve …
Poster
Guowei Wang · Changxing Ding
[ ExHall D ]
Abstract
Long-term test-time adaptation (TTA) is a challenging task due to error accumulation. Recent approaches tackle this issue by actively labeling a small proportion of samples in each batch, yet the annotation burden quickly grows as the batch number increases. In this paper, we investigate how to achieve effortless active labeling so that a maximum of one sample is selected for annotation in each batch. First, we annotate the most valuable sample in each batch based on the single-step optimization perspective in the TTA context. In this scenario, the samples that border between the source- and target-domain data distributions are considered the most feasible for the model to learn in one iteration. Then, we introduce an efficient strategy to identify these samples using feature perturbation. Second, we discover that the gradient magnitudes produced by the annotated and unannotated samples have significant variations. Therefore, we propose balancing their impact on model optimization using two dynamic weights. Extensive experiments on the popular ImageNet-C, -R, -K, -A and PACS databases demonstrate that our approach consistently outperforms state-of-the-art methods with significantly lower annotation costs. This paper's code will be released.
Poster
Ye Liu · Meng Yang
[ ExHall D ]
Abstract
Few-shot class-incremental learning (FSCIL) presents a significant challenge in machine learning, requiring models to integrate new classes from limited examples while preserving performance on previously learned classes. Recently, prompt-based CIL approaches leverage ample data to train prompts, effectively mitigating catastrophic forgetting. However, these methods do not account for the semantic features embedded in prompts, exacerbating the plasticity-stability dilemma in few-shot incremental learning. In this paper, we propose a novel and simple framework named SEmantic Complementary Prompt (SEC-Prompt), which learns two sets of semantically complementary prompts based on an adaptive query: discriminative prompts (D-Prompt) and non-discriminative prompts (ND-Prompt). D-Prompt enhances the separation of class-specific feature distributions by strengthening key discriminative features, while ND-Prompt balances non-discriminative information to promote generalization to novel classes. To efficiently learn high-quality knowledge from limited samples, we leverage ND-Prompt for data augmentation to increase sample diversity and introduce Prompt Clustering Loss to prevent noise contamination in D-Prompt, ensuring robust discriminative feature learning and improved generalization. Our experimental results showcase state-of-the-art performance across four benchmark datasets, including CIFAR100, ImageNet-R, and CUB.
Poster
Li-Jun Zhao · Zhen-Duo Chen · Yongxin Wang · Xin Luo · Xin-Shun Xu
[ ExHall D ]
Abstract
Few-Shot Class-Incremental Learning (FSCIL) aims to continuously learn novel classes with limited samples after pre-training on a set of base classes. To avoid catastrophic forgetting and overfitting, most FSCIL methods first train the model on the base classes and then freeze the feature extractor in the incremental sessions. However, the reliance on nearest neighbor classification makes FSCIL prone to the hubness phenomenon, which negatively impacts performance in this dynamic and open scenario. While recent methods attempt to adapt to the dynamic and open nature of FSCIL, they are often limited to biased optimizations to the feature space. In this paper, we pioneer the theoretical analysis of the inherent hubness in FSCIL. To mitigate the negative effects of hubness, we propose a novel Attraction Diminishing and Distributing (D2A) method from the essential perspectives of distance metric and feature space. Extensive experimental results demonstrate that our method can broadly and significantly improve the performance of existing methods.
Poster
Kai Fang · Anqi Zhang · Guangyu Gao · Jianbo Jiao · Chi Harold Liu · Yunchao Wei
[ ExHall D ]
Abstract
Effective Class Incremental Segmentation (CIS) requires simultaneously mitigating catastrophic forgetting and ensuring sufficient plasticity to integrate new classes. The inherent conflict above often leads to a back-and-forth, which turns the objective into finding the balance between the performance of previous (old) and incremental (new) classes. To address this conflict, we introduce a novel approach, Conflict Mitigation via Branched Optimization (CoMBO). Within this approach, we present the Query Conflict Reduction module, designed to explicitly refine queries for new classes through lightweight, class-specific adapters. Moreover, we develop two strategies to further mitigate the conflict following the branched structure, i.e., the Half-Learning Half-Distillation (HDHL) over classification probabilities, and the Importance-Based Knowledge Distillation (IKD) over query features. HDHL selectively engages in learning for classification probabilities of queries that match the ground truth of new classes, while aligning unmatched ones to the corresponding old probabilities, thus ensuring retention of old knowledge while absorbing new classes via learning negative samples. Meanwhile, IKD assesses the importance of queries based on their matching degree to old classes, prioritizing the distillation of important features and allowing less critical features to evolve. Extensive experiments in Class Incremental Panoptic and Semantic Segmentation settings have demonstrated the superior performance of CoMBO. The code is available in the …
Poster
Yanbiao Ma · Wei Dai · Wenke Huang · Jiayi Chen
[ ExHall D ]
Abstract
Data heterogeneity in federated learning, characterized by a significant misalignment between local and global distributions, leads to divergent local optimization directions and hinders global model training. Existing studies mainly focus on optimizing local updates or global aggregation, but these indirect approaches demonstrate instability when handling highly heterogeneous data distributions, especially in scenarios where label skew and domain skew coexist. To address this, we propose a geometry-guided data generation method that centers on simulating the global embedding distribution locally. We first introduce the concept of the geometric shape of an embedding distribution and then address the challenge of obtaining global geometric shapes under privacy constraints. Subsequently, we propose GGEUR, which leverages global geometric shapes to guide the generation of new samples, enabling a closer approximation to the ideal global distribution. In single-domain scenarios, we augment samples based on global geometric shapes to enhance model generalization; in multi-domain scenarios, we further employ class prototypes to simulate the global distribution across domains. Extensive experimental results demonstrate that our method significantly enhances the performance of existing approaches in handling highly heterogeneous data, including scenarios with label skew, domain skew, and their coexistence.
Poster
Sebastian Schmidt · Leonard Schenk · Leo Schwinn · Stephan Günnemann
[ ExHall D ]
Abstract
As the data demand for deep learning models increases, active learning becomes essential to strategically select samples for labeling, which maximizes data efficiency and reduces training costs. Recent work addresses important real-world considerations of active learning, such as handling out-of-distribution (OOD) data and online discovery of novel object categories. However, a combined analysis of these scenarios remains unexplored. To address this gap regarding real-world considerations, we propose a novel scenario, Open-Set Discovery Active Learning (OSDAL), which integrates OOD sample handling and novel category discovery. In contrast to previous methods, we construct a common feature space within a single model that aligns known and novel categories while separating OOD samples. This enables our approach, Joint Out-of-distribution filtering and data Discovery Active learning (Joda), to uniquely address both challenges simultaneously by filtering out OOD data before selecting candidates for labeling. Unlike previous work, Joda does not require auxiliary detection models for filtering or selection and therefore effectively reduces the computational overhead. In extensive experiments on 15 configurations and 3 metrics, Joda consistently achieves the highest or equally high accuracy compared to state-of-the-art competitor approaches in 39 out of 45 cases.
Poster
Ronghang Zhu · Mengxuan Hu · Weiming Zhuang · Lingjuan Lyu · Xiang Yu · Sheng Li
[ ExHall D ]
Abstract
Domain adaptation addresses the challenge where the distribution of target inference data differs from that of the source training data. Recently, data privacy has become a significant constraint, limiting access to the source domain. To mitigate this issue, Source-Free Domain Adaptation (SFDA) methods bypass source domain data by generating source-like data or pseudo-labeling the unlabeled target domain. However, these approaches often lack theoretical grounding. In this work, we provide a theoretical analysis of the SFDA problem, focusing on the general empirical risk of the unlabeled target domain. Our analysis offers a comprehensive understanding of how representativeness, generalization, and variety contribute to controlling the upper bound of target domain empirical risk in SFDA settings. We further explore how to balance this trade-off from three perspectives: sample selection, semantic domain alignment, and a progressive learning framework. These insights inform the design of novel algorithms. Experimental results demonstrate that our proposed method achieves state-of-the-art performance on three benchmark datasets—Office-Home, DomainNet, and VisDA-C—yielding relative improvements of 3.2%, 9.1%, and 7.5%, respectively, over the representative SFDA method, SHOT.
Poster
Junyi Chai · Shenyu Lu · Xiaoqian Wang
[ ExHall D ]
Abstract
Multi-task learning (MTL) is a paradigm that aims to improve the generalization of models by simultaneously learning multiple related tasks, leveraging shared representations and task-specific information to capture complex patterns and to enhance performance on individual tasks. However, existing work has discovered that MTL could possibly harm generalization, and one particular reason is the spurious correlations between tasks, where owing to the knowledge-sharing property, the task-specific predictors are more likely to develop reliance on spurious features. Most existing approaches address this issue through distributional robustness, aiming to maintain consistent performance across different distributions under unknown covariate shifts. Yet, this formulation lacks a theoretical guarantee and can be sensitive to the construction of covariate shift. In this work, we propose a novel perspective, where we seek to directly identify the spurious correlations between tasks. Drawing inspiration from conventional formulations on spurious correlation, for each task, we propose to distinguish its spurious tasks using the difference in correlation coefficients between the empirical distribution and class-wise resampled distributions, thereby capturing the correlations between task labels w.r.t. each class. We prove theoretically the feasibility of such a resampling strategy in characterizing the spurious correlation between tasks. Following the identification of task-specific spurious information, we propose a …
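A simplified sketch of the identification idea as one reading of the abstract: compare the correlation between two tasks' labels on the empirical data with the correlation after class-wise resampling of a reference task, and treat large gaps as a sign of spurious dependence. The resampling scheme, function names, and thresholding details here are assumptions.

```python
# Simplified sketch: correlation gap between the empirical distribution and a
# class-wise resampled (class-balanced) distribution of a reference task.
import torch

def pearson(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    a, b = a.float(), b.float()
    a = a - a.mean()
    b = b - b.mean()
    return (a * b).sum() / (a.norm() * b.norm() + 1e-12)

def correlation_gap(y_ref: torch.Tensor, y_other: torch.Tensor) -> torch.Tensor:
    # Class-wise resampling: draw an equal number of samples per class of y_ref.
    n = min(int((y_ref == c).sum()) for c in y_ref.unique())
    idx = torch.cat([
        torch.nonzero(y_ref == c).flatten()[torch.randperm(int((y_ref == c).sum()))[:n]]
        for c in y_ref.unique()
    ])
    return (pearson(y_ref, y_other) - pearson(y_ref[idx], y_other[idx])).abs()

y1 = torch.randint(0, 2, (1000,))
y2 = y1 ^ (torch.rand(1000) < 0.1).long()   # label strongly tied to y1
print(correlation_gap(y1, y2))
```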
Poster
Na Zheng · Xuemeng Song · Xue Dong · Aashish Nikhil Ghosh · Liqiang Nie · Roger Zimmermann
[ ExHall D ]
Abstract
Recent studies have focused on introducing pre-trained foundation models into semi-supervised learning (SSL) tasks. Nevertheless, these foundation models can exhibit biases toward different classes and tend to generate imbalanced pseudo-labels for SSL. Thus, efforts have been made to introduce the logit adjustment offset to reduce the inherent bias in foundation models for SSL tasks. Despite their success, existing foundation model-based SSL methods face challenges: 1) unreliability in the estimated logit adjustment offset, 2) overlooking the potential of linguistic knowledge in capturing model biases, and 3) failure to fully exploit the unlabeled samples. To address these issues, we propose the Language-Assisted Debiasing and Smoothing framework, namely LADaS, for foundation model-based SSL. It consists of two components: 1) Language-assisted Pseudo-Label Debiasing (LPLD) to reduce biases in foundation models, and 2) Language-aware Pseudo-Label Smoothing (LPLS) to fully exploit low-confidence samples to facilitate SSL training. In particular, LPLD introduces a reliability score to dynamically assess the reliability of the logit adjustment. Additionally, it incorporates a language-oriented preference to reduce model biases using linguistic knowledge derived from pre-trained language models. Finally, LPLS introduces language-aware soft labels and devises a language-aware pseudo-label smoothing loss to guide the learning of unlabeled samples with low-quality pseudo-labels. Extensive experiments demonstrate the superiority …
Poster
Lilin Zhang · Chengpei Wu · Ning Yang
[ ExHall D ]
Abstract
Existing adversarial training (AT) methods often suffer from incomplete perturbation, i.e., not all non-robust features are perturbed during the generation of adversarial examples (AEs), which causes residual correlations between non-robust features and labels to be captured by the target model, i.e., suboptimal learning of robust features. However, fulfilling complete perturbation, i.e., perturbing as many non-robust features as possible, is not easy due to the challenges of unidentifiability of robust/non-robust features and the sparsity of labeled data. To overcome these challenges, we propose a novel solution called Weakly Supervised Contrastive Adversarial Training (WSCAT). WSCAT fulfills complete perturbation for better learning of robust features by blocking the correlations between non-robust features and labels, via complete AE generation over partially labeled data grounded in information theory. The solid theoretical analysis and the extensive experiments conducted on widely adopted benchmarks verify the superiority of WSCAT.
Poster
Qi Chen · Hu Ding
[ ExHall D ]
Abstract
Out-of-distribution (OOD) detection is crucial for machine learning models deployed in open-world environments. However, existing methods often struggle with model over-confidence or rely heavily on empirical energy value estimation, limiting their scalability and generalizability. This paper introduces DEBO (Dual Energy-Based Model for Out-of-distribution Detection), a novel approach that addresses these limitations through an innovative dual classifier architecture and a unified energy-based objective function. DEBO enhances the standard classification framework by integrating a dual-purpose output space within a single classifier. The primary component classifies in-distribution (ID) data conventionally, while the secondary component captures open-world information and estimates uncertainty. Our method overcomes the dependence of traditional energy model-based OOD detection methods on empirical energy estimation while maintaining theoretical guarantees. Theoretical analysis demonstrates that DEBO promotes low energy and high confidence for ID data, while simultaneously inducing higher energy and decreased confidence for OOD samples. Extensive experiments conducted on benchmark datasets reveal that DEBO achieves state-of-the-art OOD detection performance while maintaining comparable classification accuracy on ID data.
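For context, the snippet below shows the standard energy score that energy-based OOD detectors build on, E(x) = -T * logsumexp(f(x)/T); it is not DEBO's dual-classifier objective, only the common baseline quantity whose empirical estimation DEBO is designed to avoid relying on.

```python
# Minimal sketch (not DEBO itself) of the standard energy-based OOD score.
# Lower energy is expected for in-distribution inputs; thresholding flags OOD samples.
import torch

def energy_score(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    # logits: (batch, num_classes) from any classifier head
    return -temperature * torch.logsumexp(logits / temperature, dim=1)

logits_id = torch.tensor([[6.0, 0.5, -1.0]])   # confident in-distribution example
logits_ood = torch.tensor([[0.2, 0.1, 0.0]])   # flat, low-confidence example
print(energy_score(logits_id), energy_score(logits_ood))  # ID energy is lower
```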
Poster
Senyu Hou · Gaoxia Jiang · Jia Zhang · Shangrong Yang · Husheng Guo · Yaqing Guo · Wenjian Wang
[ ExHall D ]
Abstract
In image classification, the label quality of training data critically influences model generalization, especially for deep neural networks (DNNs). Traditionally, learning from noisy labels (LNL) can improve the generalization of DNNs through complex architectures or a series of robust techniques, but its performance improvement is limited by the discriminative paradigm. Unlike these traditional approaches, we address the LNL problem from the perspective of robust label generation, based on diffusion models within the generative paradigm. To extend the diffusion model into a robust classifier that explicitly accommodates more noise knowledge, we propose a Directional Label Diffusion (DLD) model. It disentangles the diffusion process into two paths, i.e., directional diffusion and random diffusion. Specifically, directional diffusion simulates the corruption of true labels into a directed noise distribution, prioritizing the removal of likely noise, whereas random diffusion introduces inherent randomness to support label recovery. This architecture enables DLD to gradually infer labels from an initial random state, interpretably diverging from the specified noise distribution. To adapt the model to diverse noisy environments, we design a low-cost label pre-correction method that automatically supplies more accurate label information to the diffusion model, without requiring manual intervention or additional iterations. In addition, we optimize the paradigm for …
Poster
Yunlu Yan · Huazhu Fu · Yuexiang Li · Jinheng Xie · Jun Ma · Guang Yang · Lei Zhu
[ ExHall D ]
Abstract
Federated Learning (FL) facilitates collaborative learning among multiple clients in a distributed manner while preserving data privacy. However, its performance inevitably degrades with non-Independent and Identically Distributed (non-IID) data. In this paper, we focus on the feature-distribution-skewed FL scenario, a common non-IID situation in real-world applications where data from different clients exhibit varying underlying distributions. This variation leads to feature shift, which is the key issue of this scenario. While previous works have made notable progress, few pay attention to the data itself, i.e., the root of this issue. The primary goal of this paper is to mitigate feature shift from the perspective of data. To this end, we propose a simple yet remarkably effective input-level data augmentation method, namely FedRDN, which randomly injects the statistical information of local distributions from across the entire federation into the client's data. This helps improve the generalization of local feature representations, thereby mitigating feature shift. Moreover, FedRDN is a plug-and-play component that can be seamlessly integrated into the data augmentation flow with only a few lines of code. Extensive experiments on several datasets show that the performance of various representative FL methods can be further improved …
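A minimal sketch of the input-level idea as we read it (not the official FedRDN code): clients share channel-wise mean/std statistics, and during local training a batch is re-normalized toward a randomly drawn client's statistics. The function name and the exact normalization form are our assumptions.

```python
# Minimal sketch (our reading of the idea): re-normalize a local batch toward a
# randomly chosen client's channel-wise statistics from the federation.
import torch

def fedrdn_augment(x, fed_stats, eps=1e-6):
    """x: (B, C, H, W) local batch; fed_stats: list of (mean, std) tensors of shape (C,)."""
    mu_k, sigma_k = fed_stats[torch.randint(len(fed_stats), (1,)).item()]
    mu_local = x.mean(dim=(0, 2, 3))
    sigma_local = x.std(dim=(0, 2, 3)) + eps
    x_norm = (x - mu_local[None, :, None, None]) / sigma_local[None, :, None, None]
    return x_norm * sigma_k[None, :, None, None] + mu_k[None, :, None, None]

# toy usage with two "clients'" statistics
stats = [(torch.zeros(3), torch.ones(3)), (torch.full((3,), 0.5), torch.full((3,), 2.0))]
batch = torch.randn(4, 3, 8, 8)
print(fedrdn_augment(batch, stats).shape)  # torch.Size([4, 3, 8, 8])
```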
Poster
Yasser Khalil · Leo Maxime Brunswic · Soufiane Lamghari · Xu Li · Mahdi Beitollahi · Xi Chen
[ ExHall D ]
Abstract
Federated unlearning (FU) aims to remove a participant’s data contributions from a trained federated learning (FL) model, ensuring privacy and regulatory compliance. Traditional FU methods often depend on auxiliary storage on either the client or server side or require direct access to the data targeted for removal—a dependency that may not be feasible if the data is no longer available. To overcome these limitations, we propose NoT, a novel and efficient FU algorithm based on weight negation (multiplying by -1), which circumvents the need for additional storage and access to the target data. We argue that effective and efficient unlearning can be achieved by perturbing model parameters away from the set of optimal parameters, yet being well-positioned for quick re-optimization. This technique, though seemingly contradictory, is theoretically grounded: we prove that the weight negation perturbation effectively disrupts inter-layer co-adaptation, inducing unlearning while preserving an approximate optimality property, thereby enabling rapid recovery. Experimental results across three datasets and three model architectures demonstrate that NoT significantly outperforms existing baselines in unlearning efficacy as well as in communication and computational efficiency.
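A minimal sketch of the weight-negation step, with the layer selection and everything else assumed by us: parameters of chosen layers are multiplied by -1, after which the model would be briefly re-optimized on the retained data.

```python
# Minimal sketch (assumptions ours) of unlearning-by-weight-negation: multiply the
# parameters of selected layers by -1, perturbing the model away from its optimum
# while keeping it well-positioned for quick re-optimization, as the abstract argues.
import torch
import torch.nn as nn

@torch.no_grad()
def negate_layers(model: nn.Module, layer_names):
    for name, param in model.named_parameters():
        if any(name.startswith(l) for l in layer_names):
            param.mul_(-1.0)

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
negate_layers(model, layer_names=["0"])   # negate the first linear layer only
# ...followed by a brief round of federated fine-tuning on the retained data.
```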
Poster
Ye Li · Yanchao Zhao · chengcheng zhu · Jiale Zhang
[ ExHall D ]
Abstract
Federated Learning (FL), a privacy-preserving decentralized machine learning framework, has been shown to be vulnerable to backdoor attacks. Current research primarily focuses on the Single-Label Backdoor Attack (SBA), wherein adversaries share a consistent target. However, a critical fact is overlooked: adversaries may be non-cooperative, have distinct targets, and operate independently, which constitutes a more practical scenario called the Multi-Label Backdoor Attack (MBA). Unfortunately, prior works are ineffective in the MBA scenario, since non-cooperative attackers exclude each other. In this work, we conduct an in-depth investigation to uncover the inherent constraint behind this exclusion: similar backdoor mappings are constructed for different targets, resulting in conflicts among backdoor functions. To address this limitation, we propose Mirage, the first non-cooperative MBA strategy in FL that allows attackers to inject effective and persistent backdoors into the global model without collusion by constructing in-distribution (ID) backdoor mappings. Specifically, we introduce an adversarial adaptation method to bridge the backdoor features and the target distribution in an ID manner. Additionally, we leverage a constrained optimization method to ensure the ID mapping survives the global training dynamics. Extensive evaluations demonstrate that Mirage outperforms various state-of-the-art attacks and bypasses existing defenses, achieving an average ASR greater than 97% and …
Poster
Dongyoon Yang · Jihu Lee · Yongdai Kim
[ ExHall D ]
Abstract
Robust domain adaptation against adversarial attacks is a critical area of research, addressing the need for models to perform consistently across diverse, challenging domains. In this paper, we derive a new generalization bound for robust risk on a target domain, using a novel divergence measure specifically tailored for robust domain adaptation. Inspired by this generalization bound, we propose a new algorithm named TAROT, which is designed to enhance domain adaptability and robustness. Additionally, we empirically demonstrate that a simple pseudo-labeling approach, when combined with robust pretraining (Robust-PT), establishes a surprisingly strong baseline that surpasses traditional robust domain adaptation algorithms. Through extensive experiments, we illustrate that TAROT not only outperforms state-of-the-art methods in accuracy and robustness but also shows substantial scalability improvements, particularly on the challenging DomainNet benchmark, emphasizing our algorithm's effectiveness and potential for broader applications.
Poster
Hanrong Zhang · Zhenting Wang · Boheng Li · Fulin Lin · Tingxu Han · Mingyu Jin · Chenlu Zhan · Mengnan Du · Hongwei Wang · Shiqing Ma
[ ExHall D ]
Abstract
Self-supervised learning (SSL) models are vulnerable to backdoor attacks. Existing backdoor attacks that are effective in SSL often involve noticeable triggers, such as colored patches or visible noise, which are vulnerable to human inspection. This paper proposes an imperceptible and effective backdoor attack against self-supervised models. We first find that existing imperceptible triggers designed for supervised learning are less effective at compromising self-supervised models. We then find that this ineffectiveness is attributable to the overlap in distributions between the backdoor samples and the augmented samples used in SSL. Building on this insight, we design an attack using optimized triggers that are disentangled from the augmentation transformations used in SSL, while remaining imperceptible to human vision. Experiments on five datasets and six SSL algorithms demonstrate that our attack is highly effective and stealthy. It also shows strong resistance to existing backdoor defenses.
Poster
Aishik Konwer · Zhijian Yang · Erhan Bas · Cao Xiao · Prateek Prasanna · Parminder Bhatia · Taha Kass-Hout
[ ExHall D ]
Abstract
Foundation models such as the Segment Anything Model (SAM) are gaining traction in medical imaging segmentation, supporting multiple downstream tasks. However, such models are supervised in nature, still relying on large annotated datasets or prompts supplied by experts. Conventional techniques for alleviating these limitations, such as active learning, are limited in scope and still necessitate continuous human involvement and complex domain knowledge for label refinement or for establishing reward ground truth. To address these challenges, we propose an enhanced Segment Anything Model (SAM) framework that utilizes annotation-efficient prompts generated in a fully unsupervised fashion, while still capturing essential semantic, location, and shape information through contrastive language-image pretraining and visual question answering. We adopt the direct preference optimization technique to design an optimal policy that enables the model to generate high-fidelity segmentations with simple ratings or rankings provided by a virtual annotator simulating the human annotation process. State-of-the-art performance of our framework in tasks such as lung segmentation, breast tumor segmentation, and organ segmentation across various modalities, including X-ray, ultrasound, and abdominal CT, demonstrates its effectiveness in low-annotation data scenarios.
Poster
Kaisheng Liang · Xuelong Dai · Yanjie Li · Dong Wang · Bin Xiao
[ ExHall D ]
Abstract
Deep neural networks exhibit vulnerability to adversarial examples that can transfer across different models. A particularly challenging problem is developing transferable targeted attacks that can mislead models into predicting specific target classes. While various methods have been proposed to enhance attack transferability, they often incur substantial computational costs while yielding limited improvements. Recent clean feature mixup methods use random clean features to perturb the feature space but lack optimization for disrupting adversarial examples, overlooking the advantages of attack-specific perturbations. In this paper, we propose Feature Tuning Mixup (FTM), a novel method that enhances targeted attack transferability by combining both random and optimized noises in the feature space. FTM introduces learnable feature perturbations and employs an efficient stochastic update strategy for optimization. These learnable perturbations facilitate the generation of more robust adversarial examples with improved transferability. We further demonstrate that attack performance can be enhanced through an ensemble of multiple FTM-perturbed surrogate models. Extensive experiments on the ImageNet-compatible dataset across various models demonstrate that our method achieves significant improvements over state-of-the-art methods while maintaining low computational cost.
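A simplified sketch of the feature-space mixing idea behind FTM, with all names and the exact mixing form assumed by us: an adversarial example's intermediate feature is blended with a randomly drawn clean feature plus a learnable perturbation that is updated stochastically during the attack.

```python
# Minimal sketch (our simplification, not the paper's code) of mixing an adversarial
# example's intermediate feature with a random clean feature and a learnable perturbation.
import torch

def feature_mixup(adv_feat, clean_feats, learnable_delta, alpha=0.1):
    """adv_feat: (C,) feature of the adversarial image at some layer;
    clean_feats: (N, C) bank of clean features; learnable_delta: (C,) optimized noise."""
    idx = torch.randint(clean_feats.size(0), (1,)).item()
    return (1 - alpha) * adv_feat + alpha * clean_feats[idx] + learnable_delta

feat = torch.randn(512)
bank = torch.randn(32, 512)
delta = torch.zeros(512, requires_grad=True)   # updated stochastically during the attack
mixed = feature_mixup(feat, bank, delta)
print(mixed.shape)
```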
Poster
Meilong Xu · Saumya Gupta · Xiaoling Hu · Chen Li · Shahira Abousamra · Dimitris Samaras · Prateek Prasanna · Chao Chen
[ ExHall D ]
Abstract
Accurately modeling multi-class cell topology is crucial in digital pathology, as it provides critical insights into tissue structure and pathology. The synthetic generation of cell topology enables realistic simulations of complex tissue environments, enhances downstream tasks by augmenting training data, aligns more closely with pathologists' domain knowledge, and offers new opportunities for controlling and generalizing the tumor microenvironment. In this paper, we propose a novel approach that integrates topological constraints into a diffusion model to improve the generation of realistic, contextually accurate cell topologies. Our method refines the simulation of cell distributions and interactions, increasing the precision and interpretability of results in downstream tasks such as cell detection and classification. To assess the topological fidelity of generated layouts, we introduce a new metric, Topological Fréchet Distance (TopoFD), which overcomes the limitations of traditional metrics like FID in evaluating topological structure. Experimental results demonstrate the effectiveness of our approach in generating multi-class cell layouts that capture intricate topological relationships.
Poster
Han Liu · Peng Cui · Bingning Wang · Weipeng Chen · Yupeng Zhang · Jun Zhu · Xiaolin Hu
[ ExHall D ]
Abstract
Deep Neural Networks (DNNs) have achieved remarkable success in a variety of tasks, particularly in terms of prediction accuracy. However, in real-world scenarios, especially in safety-critical applications, accuracy alone is insufficient; reliable uncertainty estimates are essential. Modern DNNs, often trained with cross-entropy loss, tend to exhibit overconfidence, especially on ambiguous samples. Many techniques aim to improve uncertainty calibration, yet they often come at the cost of reduced accuracy or increased computational demands. To address this challenge, we propose Differentiated Deep Mutual Learning (Diff-DML), an efficient ensemble approach that simultaneously enhances accuracy and uncertainty calibration. Diff-DML draws inspiration from Deep Mutual Learning (DML) while introducing two strategies to maintain prediction diversity: (1) Differentiated Training Strategy (DTS) and (2) Diversity-Preserving Learning Objective (DPLO). Our theoretical analysis shows that Diff-DML’s diversified learning framework not only leverages ensemble benefits but also avoids the loss of prediction diversity observed in traditional DML setups, which is crucial for improved calibration. Extensive evaluations on various benchmarks confirm the effectiveness of Diff-DML. For instance, on the CIFAR-100 dataset, Diff-DML on ResNet34/50 models achieved substantial improvements over the previous state-of-the-art method, MDCA, with absolute accuracy gains of 1.3%/3.1%, relative ECE reductions of 49.6%/43.8%, and relative classwise-ECE reductions of 7.7%/13.0%.
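For reference, the snippet below shows vanilla Deep Mutual Learning, the starting point Diff-DML builds on; the Differentiated Training Strategy and the Diversity-Preserving Learning Objective are not shown here.

```python
# Minimal sketch of vanilla Deep Mutual Learning (not Diff-DML's differentiated
# variant): each peer is trained with cross-entropy plus a KL term pulling its
# predictions toward the other peer's.
import torch
import torch.nn.functional as F

def dml_loss(logits_a, logits_b, targets, kl_weight=1.0):
    ce = F.cross_entropy(logits_a, targets)
    kl = F.kl_div(F.log_softmax(logits_a, dim=1),
                  F.softmax(logits_b.detach(), dim=1),
                  reduction="batchmean")
    return ce + kl_weight * kl

logits_a, logits_b = torch.randn(4, 10), torch.randn(4, 10)
targets = torch.randint(0, 10, (4,))
print(dml_loss(logits_a, logits_b, targets))
```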
Poster
Ren Wang · Haoliang Sun · Yuxiu Lin · Chuanhui Zuo · Yongshun Gong · Yilong Yin · Wenjia Meng
[ ExHall D ]
Abstract
Multi-view representation learning integrates multiple observable views of an entity into a unified representation to facilitate downstream tasks. Current methods predominantly focus on distinguishing compatible components across views, followed by a single-step parallel fusion process. However, this parallel fusion is static in essence, overlooking potential conflicts among views and compromising representation ability. To address this issue, this paper proposes a novel Sequential fusion framework for Multi-view Representation Learning, termed SeqMvRL. Specifically, we model multi-view fusion as a sequential decision-making problem and construct a pairwise integrator (PI) and a next-view selector (NVS), which represent the environment and agent in reinforcement learning, respectively. PI merges the current fused feature with the selected view, while NVS determines which view to fuse next. By adaptively selecting the next optimal view for fusion based on the current fusion state, SeqMvRL effectively reduces conflicts and enhances unified representation quality. Additionally, a carefully designed reward function encourages the model to prioritize views that enhance the discriminability of the fused features. Experimental results demonstrate that SeqMvRL outperforms parallel fusion approaches in classification and clustering tasks.
Poster
Bowen Zhao · Qianqian Wang · Zhengming Ding · Quanxue Gao
[ ExHall D ]
Abstract
The success of existing deep multi-view graph clustering methods is based on the assumption that node attributes are fully available across all views. However, in practical scenarios, node attributes are frequently missing due to factors such as data privacy concerns or failures in data collection devices. Although some methods have been proposed to address the issue of missing node attributes, they come with the following limitations: i) they are often not tailored specifically for clustering tasks and struggle to handle missing attributes effectively; ii) they tend to ignore the relational dependencies between nodes and their neighboring nodes. This oversight results in unreliable imputations, thereby degrading clustering performance. To address the above issues, we propose an Attribute-Missing Multi-view Graph Clustering (AMMGC) method. Specifically, we first impute missing node attributes by leveraging neighborhood information through the adjacency matrix. Then, to improve consistency, we integrate a dual structure consistency module that aligns graph structures across multiple views, reducing redundancy and retaining key information. Furthermore, we introduce a high-confidence guidance module to improve the reliability of clustering. Extensive experimental results showcase the effectiveness and superiority of our proposed method on multiple benchmark datasets.
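The first imputation step can be pictured with a small sketch (our reading, not the authors' code): missing node attributes are filled with a one-hop average over observed neighbors taken from the adjacency matrix.

```python
# Minimal sketch (assumptions ours) of imputing missing node attributes from
# neighbors via the adjacency matrix: a simple average over observed one-hop neighbors.
import numpy as np

def impute_missing_attributes(A, X, observed_mask):
    """A: (n, n) adjacency; X: (n, d) attributes with zeros where missing;
    observed_mask: (n,) boolean, True if the node's attributes are observed."""
    X_imp = X.copy()
    for i in np.flatnonzero(~observed_mask):
        nbrs = np.flatnonzero(A[i] > 0)
        nbrs = nbrs[observed_mask[nbrs]]          # only aggregate observed neighbors
        if len(nbrs) > 0:
            X_imp[i] = X[nbrs].mean(axis=0)
    return X_imp

A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
X = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 4.0]])
mask = np.array([False, True, True])
print(impute_missing_attributes(A, X, mask))      # node 0 gets the mean of nodes 1 and 2
```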
Poster
Thomas Dagès · Simon Weber · Ya-Wei Eileen Lin · Ronen Talmon · Daniel Cremers · Michael Lindenbaum · Alfred M. Bruckstein · Ron Kimmel
[ ExHall D ]
Abstract
Dimensionality reduction is a fundamental task that aims to simplify complex data by reducing its feature dimensionality while preserving essential patterns, with core applications in data analysis and visualisation. To preserve the underlying data structure, multi-dimensional scaling (MDS) methods focus on preserving pairwise dissimilarities, such as distances. They optimise the embedding to have pairwise distances as close as possible to the data dissimilarities. However, the current standard is limited to embedding data in Riemannian manifolds. Motivated by the lack of asymmetry in the Riemannian metric of the embedding space, this paper extends the MDS problem to a natural asymmetric generalisation of Riemannian manifolds called Finsler manifolds. Inspired by Euclidean spaces, we define a canonical Finsler space for embedding asymmetric data. Due to its simplicity with respect to geodesics, data representation in this space is both intuitive and simple to analyse. We demonstrate that our generalisation benefits from the same theoretical convergence guarantees. We reveal the effectiveness of our Finsler embedding across various types of non-symmetric data, highlighting its value in applications such as data visualisation, dimensionality reduction, directed graph embedding, and link prediction.
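As a purely illustrative example of asymmetric embedding, the sketch below uses a Randers-type norm F(v) = ||v|| + <w, v>, one simple Finsler structure, inside an MDS-style stress; the paper's canonical Finsler space and its optimization may differ.

```python
# Minimal sketch (illustrative only; not necessarily the paper's canonical space):
# MDS-style stress with an asymmetric Randers-type norm, so the "distance" from
# i to j generally differs from the one from j to i.
import numpy as np

def randers_distance(xi, xj, w):
    v = xj - xi
    return np.linalg.norm(v) + np.dot(w, v)       # asymmetric whenever w != 0

def asymmetric_stress(X, D, w):
    """X: (n, k) embedding; D: (n, n) asymmetric dissimilarities; w: (k,) with ||w|| < 1."""
    n = X.shape[0]
    return sum((randers_distance(X[i], X[j], w) - D[i, j]) ** 2
               for i in range(n) for j in range(n) if i != j)

rng = np.random.default_rng(0)
D = rng.random((5, 5)); np.fill_diagonal(D, 0.0)  # toy asymmetric dissimilarities
X = rng.normal(size=(5, 2))
print(asymmetric_stress(X, D, w=np.array([0.3, 0.0])))  # quantity to minimize over X
```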
Poster
Chengxiang Huang · Yake Wei · Zequn Yang · Di Hu
[ ExHall D ]
Abstract
Sensory training at an early age is vital for human development. Inspired by this cognitive phenomenon, we observe that the early training stage is also important for the multimodal learning process, where dataset information is rapidly acquired. We refer to this stage as the prime learning window. However, based on our observations, this prime learning window in multimodal learning is often dominated by information-sufficient modalities, which in turn suppresses the information acquisition of information-insufficient modalities. To address this issue, we propose Information Acquisition Regulation (IAR), a method designed to balance information acquisition among modalities. Specifically, IAR slows down the information acquisition process of information-sufficient modalities during the prime learning window, which in turn promotes information acquisition for information-insufficient modalities. This regulation enables a more balanced learning process and improves the overall performance of the multimodal network. Experiments show that IAR outperforms related multimodal imbalance methods across various datasets, achieving superior model performance.
Poster
Guanzhou Ke · Shengfeng He · Xiao-Li Wang · Bo Wang · Guoqing Chao · Yuanyang Zhang · Yi Xie · HeXing Su
[ ExHall D ]
Abstract
Previous successful approaches to missing modality completion rely on carefully designed fusion techniques and extensive pre-training on complete data, which can limit their generalizability in out-of-domain (OOD) scenarios. In this study, we pose a new challenge: can we develop a missing modality completion model that is both resource-efficient and robust to OOD generalization? To address this, we present a training-free framework for missing modality completion that leverages large multimodal models (LMMs). Our approach, termed the "Knowledge Bridger", is modality-agnostic and integrates generation and ranking of missing modalities. By defining domain-specific priors, our method automatically extracts structured information from available modalities to construct knowledge graphs. These extracted graphs connect the missing modality generation and ranking modules through the LMM, resulting in high-quality imputations of missing modalities. Experimental results across both general and medical domains show that our approach consistently outperforms competing methods, including in OOD generalization. Additionally, our knowledge-driven generation and ranking techniques demonstrate superiority over variants that directly employ LMMs for generation and ranking, offering insights that may be valuable for applications in other domains.
Poster
Max Gutbrod · David Rauber · Danilo Weber Nunes · Christoph Palm
[ ExHall D ]
Abstract
The growing reliance on Artificial Intelligence (AI) in critical domains such as healthcare demands robust mechanisms to ensure the trustworthiness of these systems, especially when faced with unexpected or anomalous inputs. This paper introduces the Open Medical Imaging Benchmarks for Out-Of-Distribution Detection (OpenMIBOOD), a comprehensive framework for evaluating out-of-distribution (OOD) detection methods specifically in medical imaging contexts. OpenMIBOOD includes three benchmarks from diverse medical domains, encompassing 14 datasets divided into covariate-shifted in-distribution, near-OOD, and far-OOD categories. We evaluate 24 post-hoc methods across these benchmarks, providing a standardized reference to advance the development and fair comparison of OOD detection methods. Results reveal that findings from broad-scale OOD benchmarks in natural image domains do not translate to medical applications, underscoring the critical need for such benchmarks in the medical field. By mitigating the risk of exposing AI models to inputs outside their training distribution, OpenMIBOOD aims to support the advancement of reliable and trustworthy AI systems in healthcare. The full repository is available at https://github.com/xxxx/xxx.
Poster
Mariamma Antony · Rajiv Porana · Sahil M. Lathiya · Siva Teja Kakileti · Chiranjib Bhattacharyya
[ ExHall D ]
Abstract
Mobile health (mHealth) has emerged as a transformative solution to enhance healthcare accessibility and affordability, particularly in resource-constrained regions and low-to-middle-income countries. mHealth leverages mobile platforms to improve healthcare accessibility, addressing radiologist shortages in low-resource settings by enabling remote diagnosis and consultation through mobile devices. Mobile phones allow healthcare workers to transmit radiographic images, such as chest X-rays (CXR), to specialists or AI-driven models for interpretation. However, AI-based diagnosis using CXR images shared via apps like WhatsApp suffers from reduced predictability and explainability due to compression artifacts, and there is a lack of datasets to systematically study these challenges. To address this, we introduce CheXwhatsApp, a dataset of 175,029 paired original and WhatsApp-compressed CXR images. We present a benchmarking study which shows the dataset improves prediction stability and explainability of state-of-the-art models by up to 80%, while also enhancing localization performance. CheXwhatsApp is open-sourced to support advancements in mHealth applications for CXR analysis.
Poster
Hanbin Ko · Chang Min Park
[ ExHall D ]
Abstract
The development of large-scale image-text pair datasets has significantly advanced self-supervised learning in Vision-Language Processing (VLP). However, directly applying general-domain architectures such as CLIP to medical data presents challenges, particularly in handling negations and addressing the inherent data imbalance of medical datasets. To address these issues, we propose a novel approach that integrates clinically-enhanced dynamic soft labels and medical graphical alignment, thereby improving clinical comprehension and the applicability of contrastive loss in medical contexts. Furthermore, we introduce negation-based hard negatives to deepen the model’s understanding of the complexities of clinical language. Our approach integrates seamlessly into any medical CLIP training pipeline and achieves state-of-the-art performance across multiple tasks, including zero-shot classification, fine-tuned classification, and report retrieval. To further assess our model’s capacity for clinical language comprehension, we introduce CXR-Align, a benchmark uniquely designed to evaluate the understanding of negation and clinical information within chest X-ray (CXR) datasets. Experimental results demonstrate that our proposed methods are straightforward to implement and generalize effectively across contrastive learning frameworks, enhancing medical VLP capabilities and advancing clinical language understanding in medical imaging.
Poster
Shahad Albastaki · Anabia Sohail · IYYAKUTTI IYAPPAN GANAPATHI · Basit Alawode · Asim Khan · Sajid Javed · Naoufel Werghi · Mohammed Bennamoun · Arif Mahmood
[ ExHall D ]
Abstract
In Computational Pathology (CPath), the introduction of Vision-Language Models (VLMs) has opened new avenues for research, focusing primarily on aligning image-text pairs at a single magnification level. However, this approach might not be sufficient for tasks like cancer subtype classification, tissue phenotyping, and survival analysis due to the limited level of detail that a single-resolution image can provide. Addressing this, we propose a novel multi-resolution paradigm that leverages Whole Slide Images (WSIs) to extract histology patches at multiple resolutions and generate corresponding textual descriptions through an advanced CPath VLM. This method aims to capture a broader range of information and, supported by novel loss functions, enriches feature representations, improves discriminative ability, and enhances generalization across different resolutions. Pre-trained on a comprehensive TCGA dataset with 34 million image-language pairs at various resolutions, our fine-tuned model outperforms State-Of-The-Art (SOTA) counterparts across multiple datasets and tasks, demonstrating its effectiveness in CPath. The code is available on GitHub at xxx.
Poster
Tong Wang · Mingkang Wang · Zhongze Wang · Hongkai Wang · Qi Xu · Fengyu Cong · Hongming Xu
[ ExHall D ]
Abstract
Recently, virtual staining has emerged as a promising alternative poised to revolutionize histological staining by digitally generating stains. However, most existing methods suffer from unrealistic and unreliable staining results. In this paper, we propose the Orthogonal Decoupling Alignment Generative Adversarial Network (ODA-GAN) for unpaired virtual immunohistochemistry (IHC) staining. Our approach is based on the assumption that an image consists of IHC staining-related features, which influence staining distribution and intensity, and staining-unrelated features, such as tissue morphology. Leveraging a pathology foundation model, we first develop a weakly-supervised segmentation pipeline as an alternative to expert annotations. We introduce an Orthogonal MLP (O-MLP) module to project image features into an orthogonal space, decoupling them into staining-related and staining-unrelated components. Additionally, we propose a Dual-stream PatchNCE (DPNCE) loss to resolve contrastive learning contradictions in the staining-related space, thereby enhancing staining accuracy. To further improve realism, we introduce a Multi-layer Domain Alignment (MDA) module to bridge the domain gap between generated and real IHC images. Extensive evaluations on three benchmark datasets show that our ODA-GAN reaches state-of-the-art (SOTA) performance. Our source code is available at ***.
Poster
Yisi Luo · Xile Zhao · Kai Ye · Deyu Meng
[ ExHall D ]
Abstract
Spatial transcriptomics (ST) is an emerging family of technologies that reveal spatial distributions of gene expression within tissues, serving as an important way to uncover biological insights. However, the irregular spatial profiles and variability of genes make it challenging to integrate spatial information with gene expression under a computational framework. Current algorithms mostly utilize spatial graph neural networks to encode spatial information, which may incur increased computational costs and may not be flexible enough to depict complex spatial configurations. In this study, we introduce a concise yet effective representation framework, STINR, for deciphering ST data. STINR leverages an implicit neural representation (INR) to continuously represent ST data, which efficiently characterizes spatial and slice-wise correlations of ST data by inheriting the implicit smoothness of the INR. STINR allows easier integration of multiple slices and multi-omics without any alignment, and serves as a potent tool for various biological tasks stemming from ST data, including gene imputation, gene denoising, spatial domain detection, and cell-type deconvolution. In particular, STINR identifies the thinnest cortex layer in the dorsolateral prefrontal cortex, which previous methods were unable to achieve, and more accurately identifies tumor regions in human squamous cell carcinoma, showcasing its practical value for biological discoveries.
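A minimal sketch of what an INR over ST data can look like (architecture and inputs are our assumptions, not STINR's exact design): an MLP maps continuous spot coordinates, plus a slice index, to a gene-expression vector and is fitted to the observed spots.

```python
# Minimal sketch (assumptions ours) of an implicit neural representation for ST data:
# a coordinate MLP maps (x, y, slice_id) to a gene-expression vector, so spatial
# smoothness is inherited from the network itself.
import torch
import torch.nn as nn

class STINRSketch(nn.Module):
    def __init__(self, n_genes, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),      # input: (x, y, slice_id)
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_genes),
        )

    def forward(self, coords):                    # coords: (n_spots, 3)
        return self.net(coords)

model = STINRSketch(n_genes=2000)
coords = torch.rand(128, 3)                       # normalized spot locations
expr = torch.relu(model(coords))                  # non-negative expression estimates
loss = torch.nn.functional.mse_loss(expr, torch.rand(128, 2000))  # fit to observed spots
```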
Poster
Zheng Zhang · Guanchun Yin · Bo Zhang · Wu Liu · Xiuzhuang Zhou · Wendong Wang
[ ExHall D ]
Abstract
Limited data annotations have made semi-supervised learning (SSL) increasingly popular in medical image analysis. However, the use of pseudo labels in SSL degrades the performance of decoders that heavily rely on high-accuracy annotations. This issue is particularly pronounced in class-imbalanced multi-organ segmentation tasks, where small organs may be under-segmented or even ignored. In this paper, we propose a semantic knowledge complementarity based decoupling framework for accurate multi-organ segmentation in class-imbalanced CT images. The framework decouples the data flow based on the responsibilities of the encoder and decoder during model training so that the model effectively learns semantic features while mitigating the negative impact of unlabeled data on the semantic segmentation task. We then design a semantic knowledge complementarity module that adopts labeled data to guide the generation of pseudo labels and enriches the semantic features of labeled data with unlabeled data, which improves the quality of the generated pseudo labels and the robustness of the overall model. Furthermore, we design a training strategy based on an auxiliary balanced segmentation head to further enhance the segmentation performance of small organs. Extensive experiments on the Synapse and AMOS datasets show that our method significantly outperforms existing state-of-the-art methods.
Poster
Theodore Zhao · Sid Kiblawi · Mu Wei · Ho Hin Lee · J. Samuel Preston · Naoto Usuyama · Hoifung Poon
[ ExHall D ]
Abstract
Detecting and segmenting small objects, such as lung nodules and tumor lesions, remains a critical challenge in image analysis. These objects often occupy less than 0.1% of an image, making traditional transformer architectures inefficient and prone to performance degradation due to redundant attention computations on irrelevant regions. Existing sparse attention mechanisms rely on rigid hierarchical structures, which are poorly suited for detecting small, variable, and uncertain object locations. In this paper, we propose BoltzFormer, a novel transformer-based architecture designed to address these challenges through dynamic sparse attention. BoltzFormer identifies and focuses attention on relevant areas by modeling uncertainty using a Boltzmann distribution with an annealing schedule. Initially, a higher temperature allows broader area sampling in early layers, when object location uncertainty is greatest. As the temperature decreases in later layers, attention becomes more focused, enhancing efficiency and accuracy. BoltzFormer seamlessly integrates into existing transformer architectures via a modular Boltzmann attention sampling mechanism. Comprehensive evaluations on benchmark datasets demonstrate that BoltzFormer significantly improves segmentation performance for small objects while reducing attention computation by an order of magnitude compared to previous state-of-the-art methods.
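A minimal sketch of Boltzmann attention sampling as we read it (not the official BoltzFormer code): region relevance scores define a Boltzmann distribution whose temperature is annealed across layers, and each layer attends only to a sampled subset of regions.

```python
# Minimal sketch (our reading) of Boltzmann attention sampling with an annealed
# temperature: higher temperature early -> broad sampling; lower temperature later
# -> focused sampling.
import torch

def boltzmann_sample_regions(scores, temperature, k):
    """scores: (num_regions,) relevance logits; returns indices of k sampled regions."""
    probs = torch.softmax(scores / temperature, dim=0)
    return torch.multinomial(probs, num_samples=k, replacement=False)

scores = torch.randn(64)
for T in [2.0, 1.0, 0.5, 0.1]:                    # assumed annealing schedule
    idx = boltzmann_sample_regions(scores, T, k=8)
    # ...restrict this layer's attention computation to the sampled regions `idx`
```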
Poster
Rong Qin · Xingyu Liu · Jinglei Shi · Liang Lin · Jufeng Yang
[ ExHall D ]
Abstract
Over the last decade, significant efforts have been dedicated to designing efficient models for the challenge of ultra-high resolution (UHR) semantic segmentation. These models mainly follow the dual-stream architecture and generally fall into three subcategories according to the improvement objectives, i.e., dual-stream ensemble, selective zoom, and complementary learning. However, most of them overly concentrate on crafting complex pipelines to pursue one of the above objectives separately, limiting the model performance in both accuracy and inference consumption. In this paper, we suggest simultaneously achieving these objectives by estimating resolution-biased uncertainties in low resolution stream. Here, the resolution-biased uncertainty refers to the degree of prediction unreliability primarily caused by resolution loss from down-sampling operations. Specifically, we propose a dual-stream UHR segmentation framework, where an estimator is used to assess resolution-biased uncertainties through the entropy map and high-frequency feature residual. The framework also includes a selector, an ensembler, and a complementer to boost the model with obtained estimations. They share the uncertainty estimations as the weights to choose difficult regions as the inputs for UHR stream, perform weighted fusion between distinct streams, and enhance the learning for important pixels, respectively. Experiment results demonstrate that our method achieves a satisfactory balance between accuracy and …
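The entropy-map half of such a resolution-biased uncertainty estimator can be sketched as follows (the high-frequency feature residual term is omitted, and the thresholding rule is our own choice): per-pixel entropy of the low-resolution stream's softmax output marks the difficult regions handed to the UHR stream and weights the fusion.

```python
# Minimal sketch of a per-pixel entropy map from the low-resolution stream's logits;
# the high-frequency feature residual used by the paper is not shown here.
import torch

def entropy_map(logits, eps=1e-8):
    """logits: (B, C, H, W) from the low-resolution stream."""
    p = torch.softmax(logits, dim=1)
    return -(p * (p + eps).log()).sum(dim=1)      # (B, H, W), higher = more uncertain

logits = torch.randn(1, 19, 64, 64)
u = entropy_map(logits)
hard_regions = u > u.flatten(1).quantile(0.9, dim=1)[:, None, None]  # top-10% uncertain pixels
```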
Poster
Yankai Jiang · Peng Zhang · Donglin Yang · Yuan Tian · Hai Lin · Xiaosong Wang
[ ExHall D ]
Abstract
We explore Generalizable Tumor Segmentation, aiming to train a single model for zero-shot tumor segmentation across diverse anatomical regions. Existing methods face limitations related to segmentation quality, scalability, and the range of applicable imaging modalities. In this paper, we uncover the potential of the internal representations within frozen medical foundation diffusion models as highly efficient zero-shot learners for tumor segmentation by introducing a novel framework named DiffuGTS. DiffuGTS creates anomaly-aware open-vocabulary attention maps based on text prompts to enable generalizable anomaly segmentation without being restricted by a predefined training category list. To further improve and refine anomaly segmentation masks, DiffuGTS leverages the diffusion model, transforming pathological regions into high-quality pseudo-healthy counterparts through latent space inpainting, and applies a novel pixel-level and feature-level residual learning approach, resulting in segmentation masks with significantly enhanced quality and generalization. Comprehensive experiments on four datasets and seven tumor categories demonstrate the superior performance of our method, surpassing current state-of-the-art models across multiple zero-shot settings. The codes will be made publicly available.
Poster
Zheyu Zhang · Yayuan Lu · Feipeng Ma · Yueyi Zhang · Huanjing Yue · Xiaoyan Sun
[ ExHall D ]
Abstract
Brain tumor segmentation plays a crucial role in clinical diagnosis, yet the frequent unavailability of certain MRI modalities poses a significant challenge. In this paper, we introduce the Learnable Sorting State Space Model (LS3M), a novel framework designed to maximize the utilization of available modalities for brain tumor segmentation. LS3M excels at efficiently modeling long-range dependencies based on the Mamba design, while incorporating differentiable permutation matrices that reorder input sequences based on modality-specific characteristics. This dynamic reordering ensures that critical spatial inductive biases and long-range semantic correlations inherent in 3D brain MRI are preserved, which is crucial for incomplete multi-modal brain tumor segmentation. Once the input sequences are reordered using the generated permutation matrix, the Series State Space Model (S3M) block models the relationships between them, capturing both local and long-range dependencies. This enables effective representation of intra-modal and inter-modal relationships, significantly improving segmentation accuracy. Additionally, LS3M incorporates a global input strategy, augmented with relative position embeddings, providing richer contextual information and notably enhancing spatial awareness. Extensive experiments on the BraTS2018 and BraTS2020 datasets demonstrate that LS3M outperforms existing methods, offering a robust solution for brain tumor segmentation, particularly in scenarios with missing modalities.
Poster
Yang Yue · Yulin Wang · Haojun Jiang · Pan Liu · Shiji Song · Gao Huang
[ ExHall D ]
Abstract
Echocardiography is essential for cardiovascular disease detection, but it usually suffers from a heavy reliance on experienced sonographers. To address this, the echocardiography probe guidance system, which predicts real-time movement instructions for acquiring standard plane images, has emerged as a promising technique for enabling fully autonomous or AI-assisted echocardiography scanning. However, it poses unique challenges in developing proper machine learning models, which have rarely been explored in existing studies. In particular, an ideal guidance model needs to comprehend both the heart’s structural anatomy and the dynamic changes resulting from probe movements, while integrating historical visual-motion signals into the decision-making process. In response to these issues, this paper presents EchoWorld, a motion-aware world modeling framework for probe guidance that encodes anatomical knowledge and motion-induced visual dynamics, while effectively leveraging past visual-motion sequences to enhance guidance precision. EchoWorld employs a pre-training strategy inspired by world modeling principles, where the model predicts masked anatomical regions and simulates the visual outcomes of probe adjustments. Built upon this pre-trained model, we introduce a motion-aware attention mechanism in the fine-tuning stage that effectively integrates historical visual-motion data, enabling precise and adaptive probe guidance. Trained on more than one million ultrasound images from over 200 routine scans, EchoWorld effectively captures …
Poster
Armeet Singh Jatyani · Jiayun Wang · Aditi Chandrashekar · Zihui Wu · Miguel Liu-Schiaffini · Bahareh Tolooshams · Anima Anandkumar
[ ExHall D ]
Abstract
Compressed Sensing MRI reconstructs images of the body's internal anatomy from undersampled measurements, thereby reducing the scan time—the time subjects need to remain still. Recently, deep neural networks have shown great potential for reconstructing high-fidelity images from highly undersampled measurements in the frequency space. However, one needs to train multiple models for different undersampling patterns and desired output image resolutions, since most networks operate on a fixed discretization. Such approaches are highly impractical in clinical settings, where undersampling patterns and image resolutions are frequently changed to accommodate different real-time imaging and diagnostic requirements. We propose a unified model robust to different measurement undersampling patterns and image resolutions in compressed sensing MRI. Our model is based on neural operators, a discretization-agnostic architecture. Neural operators are employed in both image and measurement space, which capture local and global image features for MRI reconstruction. Empirically, we achieve consistent performance across different undersampling rates and patterns, with an average 11% SSIM and 4 dB PSNR improvement over a state-of-the-art, End-to-End VarNet. For efficiency, our inference speed is also 1,400x faster than diffusion methods. The resolution-agnostic design also enhances zero-shot super-resolution and extended field of view in reconstructed images. Our unified model offers a versatile solution …
Poster
Hastings Greer · Lin Tian · François-Xavier Vialard · Roland Kwitt · Raúl San José Estépar · Marc Niethammer
[ ExHall D ]
Abstract
Image registration estimates spatial correspondences between image pairs. These estimates are typically obtained via numerical optimization or regression by a deep network. A desirable property is that a correspondence estimate (e.g., the true oracle correspondence) for an image pair is maintained under deformations of the input images. Formally, the estimator should be equivariant to a desired class of image transformations. In this work, we present careful analyses of equivariance properties in the context of multi-step deep registration networks. Based on these analyses we 1) introduce the notions of [U,U] equivariance (network equivariance to the same deformations of the input images) and [W,U] equivariance (where input images can undergo different deformations); we 2) show that in a suitable multi-step registration setup it is sufficient for overall [W,U] equivariance if the first step has [W,U] equivariance and all others have [U,U] equivariance; we 3) show that common displacement-predicting networks only exhibit [U,U] equivariance to translations instead of the more powerful [W,U] equivariance; and we 4) show how to achieve multi-step [W,U] equivariance via a coordinate-attention mechanism combined with displacement-predicting networks. Our approach obtains excellent practical performance for 3D abdomen, lung, and brain medical image registration. We match or outperform state-of-the-art (SOTA) registration …
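One way to write the two notions formally is given below; the composition conventions, in particular the order of composition, are ours and may differ from the paper's. Here Φ(I, J) denotes the transform estimated for the image pair (I, J), and U, W are deformations of the image domain.

```latex
% [U,U] equivariance: both images deformed by the same U.
\[
  \Phi(I \circ U,\; J \circ U) \;=\; U^{-1} \circ \Phi(I, J) \circ U
\]
% [W,U] equivariance: the two images deformed independently by W and U.
\[
  \Phi(I \circ W,\; J \circ U) \;=\; W^{-1} \circ \Phi(I, J) \circ U
\]
```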