

Timezone: America/Chicago

Registration Desk: Registration / Badge Pickup Sun 15 Jun 07:30 a.m.  


Oral Session 5A: Generative AI Sun 15 Jun 09:00 a.m.  

Oral
Bingliang Zhang · Wenda Chu · Julius Berner · Chenlin Meng · Anima Anandkumar · Yang Song

[ Karl Dean Ballroom ]

Abstract
Diffusion models have recently achieved success in solving Bayesian inverse problems with learned data priors. Current methods build on top of the diffusion sampling process, where each denoising step makes small modifications to samples from the previous step. However, this process struggles to correct errors from earlier sampling steps, leading to worse performance in complicated nonlinear inverse problems, such as phase retrieval. To address this challenge, we propose a new method called Decoupled Annealing Posterior Sampling (DAPS) that relies on a novel noise annealing process. Specifically, we decouple consecutive steps in a diffusion sampling trajectory, allowing them to vary considerably from one another while ensuring their time-marginals anneal to the true posterior as we reduce noise levels. This approach enables the exploration of a larger solution space, improving the success rate for accurate reconstructions. We demonstrate that DAPS significantly improves sample quality and stability across multiple image restoration tasks, particularly in complicated nonlinear inverse problems.
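For intuition, the following is a minimal toy sketch of the decoupled noise-annealing idea described above, written against a 1-D linear inverse problem with a Gaussian prior. The setup, names, and step sizes are illustrative assumptions, not the authors' implementation.

```python
# Toy sketch of a decoupled-annealing posterior sampler in the spirit of DAPS (assumptions only).
import numpy as np

rng = np.random.default_rng(0)

# Toy inverse problem: y = a * x + measurement noise, with a Gaussian prior on x.
a, sigma_y = 0.5, 0.1
x_true = 1.3
y = a * x_true + sigma_y * rng.normal()

prior_mu, prior_sigma = 0.0, 1.0

def denoise(x_t, sigma_t):
    """Posterior mean E[x | x_t] under the Gaussian prior (stands in for a learned denoiser)."""
    w = prior_sigma**2 / (prior_sigma**2 + sigma_t**2)
    return w * x_t + (1 - w) * prior_mu

def measurement_step(x0_hat, n_steps=100, lr=0.02):
    """Pull the denoised estimate toward measurement consistency via gradient descent."""
    x = x0_hat
    for _ in range(n_steps):
        grad = a * (a * x - y) / sigma_y**2      # data term: 0.5 * (y - a x)^2 / sigma_y^2
        grad += (x - prior_mu) / prior_sigma**2  # prior term keeps the toy posterior proper
        x -= lr * grad
    return x

# Noise levels annealed from high to low.
sigmas = np.geomspace(2.0, 0.01, 20)
x_t = sigmas[0] * rng.normal()                   # start from pure noise

for i, sigma_t in enumerate(sigmas):
    x0_hat = denoise(x_t, sigma_t)               # 1) denoise the current sample
    x0_hat = measurement_step(x0_hat)            # 2) enforce measurement consistency
    next_sigma = sigmas[i + 1] if i + 1 < len(sigmas) else 0.0
    # 3) decoupled re-noising: draw a fresh sample at the next noise level around x0_hat,
    #    so consecutive iterates may differ substantially instead of moving incrementally.
    x_t = x0_hat + next_sigma * rng.normal()

print(f"x_true = {x_true:.3f}, reconstruction = {x_t:.3f}")
```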
Oral
Zhendong Wang · Jianmin Bao · Shuyang Gu · Dong Chen · Wengang Zhou · Houqiang Li

[ Karl Dean Ballroom ]

Abstract
In this paper, we present DesignDiffusion, a simple yet effective framework for the novel task of synthesizing design images from textual descriptions. A primary challenge lies in generating accurate and style-consistent textual and visual content. Existing works in a related task of visual text generation often focus on generating text within given specific regions, which limits the creativity of generation models, resulting in style or color inconsistencies between textual and visual elements if applied to design image generation. To address this issue, we propose an end-to-end, one-stage diffusion-based framework that avoids intricate components like position and layout modeling. Specifically, the proposed framework directly synthesizes textual and visual design elements from user prompts. It utilizes a distinctive character embedding derived from the visual text to enhance the input prompt, along with a character localization loss for enhanced supervision during text generation. Furthermore, we employ a self-play Direct Preference Optimization fine-tuning strategy to improve the quality and accuracy of the synthesized visual text. Extensive experiments demonstrate that DesignDiffusion achieves state-of-the-art performance in design image generation.
Oral
Lingjie Kong · Kai WU · Chengming Xu · Xiaobin Hu · Wenhui Han · Jinlong Peng · Donghao Luo · Mengtian Li · Jiangning Zhang · Chengjie Wang · Yanwei Fu

[ Karl Dean Ballroom ]

Abstract
Recent advances in diffusion-based text-to-image models have simplified creating high-fidelity images, but preserving the identity (ID) of specific elements, like a personal dog, is still challenging. Object customization, using reference images and textual descriptions, is key to addressing this issue. Current object customization methods are either object-specific, requiring extensive fine-tuning, or object-agnostic, offering zero-shot customization but limited to specialized domains. The primary obstacle to extending zero-shot object customization from specific domains to the general domain is establishing a large-scale general ID dataset for model pre-training, which is time-consuming and labor-intensive. In this paper, we propose a novel pipeline to construct a large dataset of general objects and build the Multi-Category ID-Consistent (MC-IDC) dataset, featuring 315k text-image samples across 10k categories. With the help of MC-IDC, we introduce Customizing Anything (CustAny), a zero-shot framework that maintains ID fidelity and supports flexible text editing for general objects. CustAny features three key components: a general ID extraction module, a dual-level ID injection module, and an ID-aware decoupling module, allowing it to customize any object from a single reference image and text prompt. Experiments demonstrate that CustAny outperforms existing methods in both general object customization and specialized domains like human customization and virtual try-on. …
Oral
Soobin Um · Jong Chul Ye

[ Karl Dean Ballroom ]

Abstract
We investigate the generation of minority samples using pretrained text-to-image (T2I) latent diffusion models. Minority instances, in the context of T2I generation, can be defined as ones living on low-density regions of *text-conditional* data distributions. They are valuable for various applications of modern T2I generators, such as data augmentation and creative AI. Unfortunately, existing pretrained T2I diffusion models primarily focus on high-density regions, largely due to the influence of guided samplers (like CFG) that are essential for producing high-quality generations. To address this, we present a novel framework to counter the high-density-focus of T2I diffusion models. Specifically, we first develop an online prompt optimization framework that can encourage the emergence of desired properties during inference while preserving semantic contents of user-provided prompts. We subsequently tailor this generic prompt optimizer into a specialized solver that promotes the generation of minority features by incorporating a carefully-crafted likelihood objective. Our comprehensive experiments, conducted across various types of T2I models, demonstrate that our approach significantly enhances the capability to produce high-quality minority instances compared to existing samplers.
Oral
Andreas Müller · Denis Lukovnikov · Jonas Thietke · Asja Fischer · Erwin Quiring

[ Karl Dean Ballroom ]

Abstract
Integrating watermarking into the generation process of latent diffusion models (LDMs) simplifies detection and attribution of generated content. Semantic watermarks, such as Tree-Rings and Gaussian Shading, represent a novel class of watermarking techniques that are easy to implement and highly robust against various perturbations. However, our work demonstrates a fundamental security vulnerability of semantic watermarks. We show that attackers can leverage unrelated models, even with different latent spaces and architectures (UNet vs DiT), to perform powerful and realistic forgery attacks. Specifically, we design two watermark forgery attacks. The first imprints a targeted watermark into real images by manipulating the latent representation of an arbitrary image in an unrelated LDM to get closer to the latent representation of a watermarked image. We also show that this technique can be used for watermark removal. The second attack generates new images with the target watermark by inverting a watermarked image and re-generating it with an arbitrary prompt. Both attacks just need a single reference image with the target watermark. Overall, our findings question the applicability of semantic watermarks by revealing that attackers can easily forge or remove these watermarks under realistic conditions.

Oral Session 5B: Learning Systems and Medical Applications Sun 15 Jun 09:00 a.m.  

Oral
Hao Lin · Ke Wu · Jie Li · Jun Li · Wu-Jun Li

[ ExHall A2 ]

Abstract
Distributed learning is commonly used for training deep learning models, especially large models. In distributed learning, manual parallelism (MP) methods demand considerable human effort and have limited flexibility. Hence, automatic parallelism (AP) methods have recently been proposed for automating the parallel strategy optimization process. Existing AP methods suffer from sub-optimal solutions because they do not jointly optimize the two categories of parallel strategies (i.e., inter-layer parallelism and intra-layer parallelism). In this paper, we propose a novel AP method called UniAP, which unifies inter- and intra-layer automatic parallelism by mixed integer quadratic programming. To the best of our knowledge, UniAP is the first parallel method that can jointly optimize the two categories of parallel strategies to find an optimal solution. Experimental results show that UniAP outperforms state-of-the-art methods by up to 3.80× in throughput and reduces strategy optimization time by up to 107× across five Transformer-based models.
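As a rough illustration of what jointly optimizing inter-layer and intra-layer parallelism means, the toy brute-force search below picks a pipeline stage assignment and a per-layer parallel degree together rather than in two separate passes. It is not UniAP's mixed integer quadratic program, and all cost numbers are made up.

```python
# Toy joint search over inter-layer (stage assignment) and intra-layer (parallel degree) choices.
from itertools import product

layers = ["embed", "attn", "mlp", "head"]
intra_options = [1, 2, 4]            # hypothetical tensor-parallel degrees per layer
num_stages = 2                       # hypothetical pipeline with two stages

def layer_cost(layer, degree):
    base = {"embed": 4.0, "attn": 8.0, "mlp": 6.0, "head": 2.0}[layer]
    return base / degree + 0.5 * (degree - 1)   # compute shrinks, communication overhead grows

def plan_cost(stage_of, degree_of):
    # Pipeline throughput is limited by the slowest stage; add a small cost per stage boundary.
    stage_time = [0.0] * num_stages
    for layer in layers:
        stage_time[stage_of[layer]] += layer_cost(layer, degree_of[layer])
    boundaries = sum(stage_of[a] != stage_of[b] for a, b in zip(layers, layers[1:]))
    return max(stage_time) + 0.3 * boundaries

best_cost, best_stages, best_degrees = min(
    (plan_cost(dict(zip(layers, stages)), dict(zip(layers, degrees))), stages, degrees)
    for stages in product(range(num_stages), repeat=len(layers))
    for degrees in product(intra_options, repeat=len(layers))
)
print(f"best cost {best_cost:.2f}, stage assignment {best_stages}, parallel degrees {best_degrees}")
```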
Oral
Yanbiao Ma · Wei Dai · Wenke Huang · Jiayi Chen

[ ExHall A2 ]

Abstract
Data heterogeneity in federated learning, characterized by a significant misalignment between local and global distributions, leads to divergent local optimization directions and hinders global model training. Existing studies mainly focus on optimizing local updates or global aggregation, but these indirect approaches demonstrate instability when handling highly heterogeneous data distributions, especially in scenarios where label skew and domain skew coexist. To address this, we propose a geometry-guided data generation method that centers on simulating the global embedding distribution locally. We first introduce the concept of the geometric shape of an embedding distribution and then address the challenge of obtaining global geometric shapes under privacy constraints. Subsequently, we propose GGEUR, which leverages global geometric shapes to guide the generation of new samples, enabling a closer approximation to the ideal global distribution. In single-domain scenarios, we augment samples based on global geometric shapes to enhance model generalization; in multi-domain scenarios, we further employ class prototypes to simulate the global distribution across domains. Extensive experimental results demonstrate that our method significantly enhances the performance of existing approaches in handling highly heterogeneous data, including scenarios with label skew, domain skew, and their coexistence.
Oral
Kai Zhao · zhihao zhuang · Miao Zhang · Chenjuan Guo · Yang Shu · Bin Yang

[ ExHall A2 ]

Abstract
Model quantization is an effective way to compress deep neural networks and accelerate the inference time on edge devices. Existing quantization methods usually require original data for calibration during the compressing process, which may be inaccessible due to privacy issues. A common way is to generate calibration data to mimic the original data. However, the generators in these methods have the mode collapse problem, making them unable to synthesize diverse data. To solve this problem, we leverage the information from the full-precision model and enhance both inter-class and intra-class diversity for generating better calibration data, by devising a multi-layer feature mixer and normalizing-flow-based attention. Besides, novel regularization losses are proposed to make the generator produce diverse data with more patterns from the perspective of activated feature values and for the quantized model to learn better clip ranges adaptive to our diverse calibration data. Extensive experiments show that our method achieves state-of-the-art quantization results for both Transformer and CNN architectures. In addition, we visualize the generated data to verify that our strategies can effectively handle the mode collapse issue. Our codes are available at https://anonymous.4open.science/r/DFQ-84E6 and will be publicly available.
Oral
Meilong Xu · Saumya Gupta · Xiaoling Hu · Chen Li · Shahira Abousamra · Dimitris Samaras · Prateek Prasanna · Chao Chen

[ ExHall A2 ]

Abstract
Accurately modeling multi-class cell topology is crucial in digital pathology, as it provides critical insights into tissue structure and pathology. The synthetic generation of cell topology enables realistic simulations of complex tissue environments, enhances downstream tasks by augmenting training data, aligns more closely with pathologists' domain knowledge, and offers new opportunities for controlling and generalizing the tumor microenvironment. In this paper, we propose a novel approach that integrates topological constraints into a diffusion model to improve the generation of realistic, contextually accurate cell topologies. Our method refines the simulation of cell distributions and interactions, increasing the precision and interpretability of results in downstream tasks such as cell detection and classification. To assess the topological fidelity of generated layouts, we introduce a new metric, Topological Fréchet Distance (TopoFD), which overcomes the limitations of traditional metrics like FID in evaluating topological structure. Experimental results demonstrate the effectiveness of our approach in generating multi-class cell layouts that capture intricate topological relationships.
Oral
Aishik Konwer · Zhijian Yang · Erhan Bas · Cao Xiao · Prateek Prasanna · Parminder Bhatia · Taha Kass-Hout

[ ExHall A2 ]

Abstract
Foundational models such as the Segment Anything Model (SAM) are gaining traction in medical imaging segmentation, supporting multiple downstream tasks. However, such models are supervised in nature, still relying on large annotated datasets or prompts supplied by experts. Conventional techniques such as active learning to alleviate such limitations are limited in scope and still necessitate continuous human involvement and complex domain knowledge for label refinement or establishing reward ground truth. To address these challenges, we propose an enhanced Segment Anything Model (SAM) framework that utilizes annotation-efficient prompts generated in a fully unsupervised fashion, while still capturing essential semantic, location, and shape information through contrastive language-image pretraining and visual question answering. We adopt the direct preference optimization technique to design an optimal policy that enables the model to generate high-fidelity segmentations with simple ratings or rankings provided by a virtual annotator simulating the human annotation process. State-of-the-art performance of our framework in tasks such as lung segmentation, breast tumor segmentation, and organ segmentation across various modalities, including X-ray, ultrasound, and abdominal CT, justifies its effectiveness in low-annotation data scenarios.

Oral Session 5C: Visual and Spatial Computing Sun 15 Jun 09:00 a.m.  

Oral
Junying Wang · Hongyuan Zhang · Yuan Yuan

[ Davidson Ballroom ]

Abstract
Recent personalized portrait generation methods, taking a facial image and a textual prompt as inputs, have attracted substantial attention. Although these methods generate high-fidelity portraits, they fail to prevent the generated portraits from being tracked and misused by malicious face recognition systems. To address this, this paper proposes a Customized Portrait Generation framework with facial Adversarial attacks (Adv-CPG). Specifically, to achieve facial privacy protection, we devise a lightweight local ID encryptor and an encryption enhancer. They implement progressive double-layer encryption protection by directly injecting the target identity and adding additional identity guidance, respectively. Furthermore, to accomplish fine-grained and customized portrait generation, we develop a multi-modal image customizer capable of generating controllable fine-grained facial features. To the best of our knowledge, Adv-CPG is the first study that introduces facial adversarial attacks into customized portrait generation. Extensive experiments demonstrate the superiority of Adv-CPG, e.g., the average attack success rate of the proposed Adv-CPG is 28.1% and 2.86% higher than that of SOTA noise-based attack methods and unconstrained attack methods, respectively.
Oral
Shoichiro Takeda · Yasunori Akagi

[ Davidson Ballroom ]

Abstract
We propose novel fast algorithms for the Gromov–Wasserstein problem (GW) using cyclic symmetry of input data. Such GW with cyclic symmetry naturally appears as an object matching task underlying various real-world computer vision applications, e.g., image registration, point cloud registration, stereo matching, and 3D reconstruction. Gradient-based algorithms have been used to solve GW, and our main idea is to use the following remarkable and non-trivial property: By setting the initial solution to have cyclic symmetry, all intermediate solutions and matrices appearing in the gradient-based algorithms have the same cyclic symmetry until convergence. Based on this property, our gradient-based algorithms restrict the solution space to have cyclic symmetry and update only one of the symmetric parts of solutions and matrices at each iteration, which results in fast computation. Furthermore, the original gradient-based algorithms and ours must solve the Optimal Transport problem (OT) at each iteration, but only in ours does this problem exhibit cyclic symmetry. This cyclic OT can be solved efficiently, and as a result, the total computational time of our algorithms is dramatically faster than the original ones. Experiments showed the effectiveness of our algorithms in synthetic and real-world data with strict and approximate cyclic symmetry, respectively.
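A small numerical check of the kind of structure the abstract exploits: with a circulant (cyclically symmetric) cost matrix and uniform marginals, the entropic OT plan produced by Sinkhorn iterations is itself circulant, so only one of its symmetric parts needs to be stored and updated. This is a hedged illustration of the cyclic OT subproblem, not the authors' full GW algorithm.

```python
# Check that cyclic symmetry of the cost matrix is preserved by entropic OT (Sinkhorn).
import numpy as np

n, eps = 8, 0.1
rng = np.random.default_rng(1)

first_row = rng.random(n)
C = np.array([np.roll(first_row, k) for k in range(n)])   # circulant cost matrix
K = np.exp(-C / eps)
a = b = np.full(n, 1.0 / n)                               # uniform marginals

u = np.ones(n)
for _ in range(500):                                      # standard Sinkhorn updates
    v = b / (K.T @ u)
    u = a / (K @ v)
P = u[:, None] * K * v[None, :]

# Every row of P is a cyclic shift of its first row, so one row suffices to describe the plan.
shifted = np.array([np.roll(P[0], k) for k in range(n)])
print("transport plan is circulant:", np.allclose(P, shifted, atol=1e-10))
```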
Oral
Runfeng Li · Mikhail Okunev · Zixuan Guo · Anh H Duong · Christian Richardt · Matthew O’Toole · James Tompkin

[ Davidson Ballroom ]

Abstract
We present a method to reconstruct dynamic scenes from monocular continuous-wave time-of-flight cameras using raw sensor samples that is as accurate as past methods and is 100× faster. Quickly achieving high-fidelity dynamic 3D reconstruction from a single viewpoint is a significant challenge in computer vision. Recent 3D Gaussian splatting methods often depend on multi-view data to produce satisfactory results and are brittle in their optimizations otherwise. In time-of-flight radiance field reconstruction, the property of interest---depth---is not directly optimized, causing additional challenges. We describe how these problems have a large and underappreciated impact upon the optimization when using a fast primitive-based scene representation like 3D Gaussians. Then, we incorporate two heuristics into our optimization to improve the accuracy of scene geometry for under-constrained time-of-flight Gaussians. Experimental results show that our approach produces accurate reconstructions under constrained sensing conditions, including for fast motions like swinging baseball bats.
Oral
Yiqing Liang · Abhishek Badki · Hang Su · James Tompkin · Orazio Gallo

[ Davidson Ballroom ]

Abstract
Foundation models have shown generalization across datasets for many low-level vision tasks, like depth estimation, but no such model exists for scene flow. Even though scene flow has wide potential use, it is not used in practice because current predictive models do not generalize well. We solve three challenges to fix this problem. First, we create a method that jointly estimates geometry and motion for accurate prediction. Second, we alleviate scene flow data scarcity with a data recipe that affords us 1M annotated training samples across diverse synthetic scenes. Third, we evaluate different parameterizations for scene flow prediction and identify a natural and effective parameterization. Our resulting model outperforms existing methods as well as baselines built on foundation models in terms of 3D end-point error, and shows zero-shot generalization to the casually captured videos from DAVIS and the robotic manipulation scenes from RoboTAP. Overall, this makes scene flow prediction significantly more practical for in-the-wild use.
Oral
Jialin Zhu · Jiangbei Yue · Feixiang He · He Wang

[ Davidson Ballroom ]

Abstract
Recently, 3D Gaussian Splatting (3DGS) has provided a new framework for novel view synthesis, and has sparked a new wave of research in neural rendering and related applications. As 3DGS is becoming a foundational component of many models, any improvement on 3DGS itself can bring huge benefits. To this end, we aim to improve the fundamental paradigm and formulation of 3DGS. We argue that as an unnormalized mixture model, it needs to be neither Gaussian nor splatting. We subsequently propose a new mixture model consisting of flexible Student's t distributions, with both positive (splatting) and negative (scooping) densities. We name our model Student Splatting and Scooping, or SSS. While providing better expressivity, SSS also poses new challenges in learning. Therefore, we also propose a new principled sampling approach for optimization. Through exhaustive evaluation and comparison across multiple datasets, settings, and metrics, we demonstrate that SSS outperforms existing methods in terms of quality and parameter efficiency, e.g., achieving matching or better quality with similar numbers of components, and obtaining comparable results while reducing the component number by as much as 82%.
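To make the positive and negative densities idea concrete, here is a hedged 1-D sketch of a signed mixture of Student's t components; the parameters and the clamping at zero are illustrative assumptions rather than the paper's formulation.

```python
# 1-D signed mixture of Student's t components: positive weights "splat", negative weights "scoop".
import math
import numpy as np

def student_t_pdf(x, df, loc=0.0, scale=1.0):
    z = (x - loc) / scale
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2) * scale)
    return coeff * (1.0 + z**2 / df) ** (-(df + 1) / 2)

components = [
    (+1.0, 2.0, 0.0, 1.0),   # (weight, degrees of freedom, location, scale)
    (+0.6, 4.0, 2.5, 0.7),
    (-0.4, 3.0, 0.8, 0.4),   # negative component carves a dip around x = 0.8
]

x = np.linspace(-4.0, 6.0, 1001)
signed = sum(w * student_t_pdf(x, df, loc, scale) for w, df, loc, scale in components)
density = np.clip(signed, 0.0, None)  # keep the unnormalized density non-negative

i = np.argmin(np.abs(x - 0.8))
print(f"signed value at the scooped point: {signed[i]:.4f}, clipped density: {density[i]:.4f}")
```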

Poster Session 5 Sun 15 Jun 10:30 a.m.  

Poster
Ziqiao Peng · Yanbo Fan · Haoyu Wu · Xuan Wang · Hongyan Liu · Jun He · Zhaoxin Fan

[ ExHall D ]

Abstract
In face-to-face conversations, individuals need to switch between speaking and listening roles seamlessly. Existing 3D talking head generation models focus solely on speaking or listening, neglecting the natural dynamics of interactive conversation, which leads to unnatural interactions and awkward transitions. To address this issue, we propose a new task—multi-round dual-speaker interaction for 3D talking head generation—which requires models to handle and generate both speaking and listening behaviors in continuous conversation. To solve this task, we introduce DualTalk, a novel unified framework that integrates the dynamic behaviors of speakers and listeners to simulate realistic and coherent dialogue interactions. This framework not only synthesizes lifelike talking heads when speaking but also generates continuous and vivid non-verbal feedback when listening, effectively capturing the interplay between the roles. We also create a new dataset featuring 50 hours of multi-round conversations with over 1,000 characters, where participants continuously switch between speaking and listening roles. Extensive experiments demonstrate that our method significantly enhances the naturalness and expressiveness of 3D talking heads in dual-speaker conversations. Code and dataset will be released upon acceptance.
Poster
Lee Chae-Yeon · Hyun-Bin Oh · EunGi Han · Kim Sung-Bin · Suekyeong Nam · Tae-Hyun Oh

[ ExHall D ]

Abstract
Recent advancements in speech-driven 3D talking head generation have achieved impressive progress in lip synchronization. However, existing models still fall short in capturing a perceptual alignment between diverse speech characteristics and lip movements. In this work, we define essential criteria—temporal synchronization, lip readability, and expressiveness—for perceptually accurate lip movements in response to speech signals. We also introduce a speech-mesh synchronized representation that captures the intricate correspondence between speech and facial mesh. We plug in this representation as a perceptual loss to guide lip movements, ensuring they are perceptually aligned with the given speech. Additionally, we utilize this representation as a perceptual metric and introduce two other physically grounded lip synchronization metrics to evaluate these three criteria. Experiments demonstrate that training 3D talking head models with our perceptual loss significantly enhances all three aspects of perceptually accurate lip synchronization. Codes will be released if accepted.
Poster
Dingcheng Zhen · Shunshun Yin · Shiyang Qin · Hou Yi · Ziwei Zhang · Siyuan Liu · Gan Qi · Ming Tao

[ ExHall D ]

Abstract
In this work, we introduce the first autoregressive framework for real-time, audio-driven portrait animation, a.k.a. talking head. Beyond the challenge of lengthy animation times, a critical challenge in realistic talking head generation lies in preserving the natural movement of diverse body parts. To this end, we propose Teller, the first streaming audio-driven portrait animation framework with autoregressive motion generation. Specifically, Teller first decomposes facial and body detail animation into two components: Facial Motion Latent Generation (FMLG) based on an autoregressive transformer, and movement authenticity refinement using an Efficient Temporal Module (ETM). Concretely, FMLG employs a Residual VQ model to map the facial motion latent from the implicit keypoint-based model into discrete motion tokens, which are then temporally sliced with audio embeddings. This enables the AR transformer to learn real-time, stream-based mappings from audio to motion. Furthermore, Teller incorporates ETM to capture finer motion details. This module ensures the physical consistency of body parts and accessories, such as neck muscles and earrings, improving the realism of these movements. Teller is designed to be efficient, surpassing the inference speed of diffusion-based models (Hallo 20.93s vs. Teller 0.92s for one second video generation), and achieves a real-time streaming performance of up to 25 FPS. Extensive experiments …
Poster
Jiahao Cui · Hui Li · Qingkun Su · Hanlin Shang · Kaihui Cheng · Yuqi Ma · Shan Mu · Hang Zhou · Jingdong Wang · Siyu Zhu

[ ExHall D ]

Abstract
Existing methodologies for animating portrait images encounter significant challenges, particularly in addressing non-frontal perspectives, rendering dynamic objects surrounding the portrait, and generating immersive, realistic backgrounds across various scenarios. This paper proposes a novel approach that integrates a diffusion framework with a transformer-based architecture to enhance the realism and dynamism of portrait animations. Our methodology introduces three key innovations. First, we employ speech audio conditioning through cross-attention mechanisms to ensure precise alignment between audio signals and facial dynamics. Second, we incorporate an identity reference network into the diffusion transformer framework, thereby preserving facial identity consistently across video sequences. Third, our approach facilitates long-duration video extrapolation through motion frames, enabling the generation of extended video clips. We validated our method through experiments conducted on benchmark datasets and newly proposed wild datasets, demonstrating substantial improvements over previous methods in generating realistic portraits characterized by diverse orientations within dynamic and immersive scenes.
Poster
Shuyuan Tu · Zhen Xing · Xintong Han · Zhi-Qi Cheng · Qi Dai · Chong Luo · Zuxuan Wu

[ ExHall D ]

Abstract
Current diffusion models for human image animation struggle to ensure identity (ID) consistency. This paper presents StableAnimator, the first end-to-end ID-preserving video diffusion framework, which synthesizes high-quality videos without any post-processing, conditioned on a reference image and a sequence of poses. Building upon a video diffusion model, StableAnimator contains carefully designed modules for both training and inference, striving for identity consistency. In particular, StableAnimator begins by computing image and face embeddings with off-the-shelf extractors, respectively, and the face embeddings are further refined by interacting with image embeddings using a global content-aware Face Encoder. Then, StableAnimator introduces a novel distribution-aware ID Adapter that prevents interference caused by temporal layers while preserving ID via alignment. During inference, we propose a novel Hamilton-Jacobi-Bellman (HJB) equation-based optimization to further enhance the face quality. We demonstrate that solving the HJB equation can be integrated into the diffusion denoising process, and the resulting solution constrains the denoising path and thus benefits ID preservation. Experiments on multiple benchmarks show the effectiveness of StableAnimator both qualitatively and quantitatively.
Poster
Yuan Li · Ziqian Bai · Feitong Tan · Zhaopeng Cui · Sean Fanello · Yinda Zhang

[ ExHall D ]

Abstract
We propose a novel 3D-aware diffusion-based method for generating photorealistic talking head videos directly from a single identity image and explicit control signals (e.g., expressions). Our method generates Multiplane Images (MPIs) that ensure geometric consistency, making them ideal for immersive viewing experiences like binocular videos for VR headsets. Unlike existing methods that often require a separate stage or joint optimization to reconstruct a 3D representation (such as NeRF or 3D Gaussians), our approach directly generates the final output through a single denoising process, eliminating the need for post-processing steps to render novel views efficiently. To effectively learn from monocular videos, we introduce a training mechanism that reconstructs the output MPI randomly in either the target or the reference camera space. This approach enables the model to simultaneously learn sharp image details and underlying 3D information. Extensive experiments demonstrate the effectiveness of our method, which achieves competitive avatar quality and novel-view rendering capabilities, even without explicit 3D reconstruction or high-quality multi-view training data.
Poster
yating wang · Xuan Wang · Ran Yi · Yanbo Fan · Jichen Hu · Jingcheng Zhu · Lizhuang Ma

[ ExHall D ]

Abstract
Recent studies have combined 3D Gaussians and 3D Morphable Models (3DMM) to construct high-quality 3D head avatars. In this line of research, existing methods either fail to capture the dynamic textures or incur significant overhead in terms of runtime speed or storage space. To this end, we propose a novel method that addresses all the aforementioned demands. Specifically, we introduce an expressive and compact representation that encodes texture-related attributes of the 3D Gaussians in the tensorial format. We store the appearance of the neutral expression in static tri-planes, and represent dynamic texture details for different expressions using lightweight 1D feature lines, which are then decoded into opacity offsets relative to the neutral face. We further propose an adaptive truncated opacity penalty and class-balanced sampling to improve generalization across different expressions. Experiments show this design enables accurate capture of dynamic face details while maintaining real-time rendering and significantly reducing storage costs, thus broadening the applicability to more scenarios.
Poster
Di Liu · Teng Deng · Giljoo Nam · Yu Rong · Stanislav Pidhorskyi · Junxuan Li · Jason Saragih · Dimitris N. Metaxas · Chen Cao

[ ExHall D ]

Abstract
Photorealistic 3D head avatar reconstruction faces critical challenges in modeling dynamic face-hair interactions and achieving cross-identity generalization, particularly during expressions and head movements. We present LUCAS, a novel Universal Prior Model (UPM) for codec avatar modeling that disentangles face and hair through a layered representation. Unlike previous UPMs that treat hair as an integral part of the head, our approach separates the modeling of the hairless head and hair into distinct branches. LUCAS is the first to introduce a mesh-based UPM, facilitating real-time rendering on devices. LUCAS can be integrated with Gaussian Splatting to enhance visual fidelity, a feature particularly beneficial for rendering complex hairstyles. Experimental results indicate that LUCAS outperforms existing single-mesh and Gaussian-based avatar models in both quantitative and qualitative assessments, including evaluations on held-out subjects in zero-shot driving scenarios. LUCAS demonstrates superior dynamic performance in managing head pose changes, expression transfer, and hairstyle variations, thereby advancing the state-of-the-art in 3D head avatar reconstruction.
Poster
SooHyun Lee · SeoYeon Kim · HeeKyung Lee · Won-Sik Cheong · Joo Ho Lee

[ ExHall D ]

Abstract
Multi-person avatar reconstruction from sparse multiview videos is challenging. The independent reconstruction of individual avatars often fails to capture the geometric relationships among multiple instances, resulting in inter-penetrations between avatars. While some researchers have resolved this issue using neural volumetric rendering techniques, these approaches suffer from huge computational costs for rendering and training. In this paper, we propose a multi-person avatar reconstruction method that reconstructs 3D avatars while preserving the geometric relations between people. Our 2D Gaussian Splatting (2DGS)-based avatar representation allows us to represent geometrically accurate surfaces of multiple instances that support sharp inside-outside tests. To efficiently influence the occluded instances, we design a differentiable multi-layer alpha blending system compatible with the GS rendering pipeline. We mitigate inter-penetrations among avatars by penalizing segmentation discrepancies and seeing through near-contact regions to reveal penetrating parts. We also utilize monocular priors to enhance quality in less-observed and textureless surfaces. Our proposed method achieves fast reconstruction while maintaining state-of-the-art performance in terms of geometry and rendering quality. We demonstrate the efficiency and effectiveness of our method on a multi-person dataset containing close interactions.
Poster
Lingteng Qiu · Shenhao Zhu · Qi Zuo · Xiaodong Gu · Yuan Dong · Junfei Zhang · Chao Xu · Zhe Li · Weihao Yuan · Liefeng Bo · Guanying Chen · Zilong Dong

[ ExHall D ]

Abstract
Generating animatable human avatars from a single image is essential for various digital human modeling applications. Existing 3D reconstruction methods often struggle to capture fine details in animatable models, while generative approaches for controllable animation, though avoiding explicit 3D modeling, suffer from viewpoint inconsistencies in extreme poses and computational inefficiencies. In this paper, we address these challenges by leveraging the power of generative models to produce detailed multi-view canonical pose images, which help resolve ambiguities in animatable human reconstruction. We then propose a robust method for 3D reconstruction of inconsistent images, enabling real-time rendering during inference. Specifically, we adapt a transformer-based Text-to-Video model to generate multi-view canonical pose images and normal maps, pretraining on a large-scale monocular video dataset to improve generalization. To handle view inconsistencies, we recast the reconstruction problem as a 4D task and introduce an efficient 3D modeling approach using 4D Gaussian Splatting. Experiments demonstrate that our method achieves photorealistic, real-time animation of 3D human avatars from in-the-wild images, showcasing its effectiveness and generalization capability.
Poster
Zhichao Zhai · Guikun Chen · Wenguan Wang · Dong Zheng · Jun Xiao

[ ExHall D ]

Abstract
Decoupling from customized parametric templates represents a crucial step toward the creation of fully flexible, animatable articulated models. While existing template-free methods can achieve high-fidelity reconstruction in observed views, they struggle to recover plausible canonical models, resulting in suboptimal animation quality. This limitation stems from overlooking the fundamental ambiguities in canonical reconstruction, where multiple canonical models could explain the same observed views. This work reveals the entanglement between canonical ambiguities and incorrect skinning, and presents a self-supervised framework that learns both plausible skinning and accurate canonical geometry using only sparse pose data. Our method, TAGA, uses explicit 3D Gaussians as skinning carriers and characterizes ambiguities as "Ambiguous Gaussians" with incorrect skinning weights. TAGA then corrects ambiguous Gaussians in the observation space using anomaly detection. With the corrected ones, we enforce cycle consistency constraints on both geometry and skinning to refine the corresponding Gaussians in the canonical space through a new backward method. Compared to existing state-of-the-art template-free methods, TAGA delivers superior visual fidelity for novel views and poses, while significantly improving training and rendering speeds. Experiments on challenging datasets with limited pose variations further demonstrate the robustness and generality of TAGA. The code will be released.
Poster
Mingze Sun · Junting Dong · Junhao Chen · Yurun Chen · Xinyu Jiang · Shiwei Mao · Puhua Jiang · Jingbo Wang · Bo Dai · Ruqi Huang

[ ExHall D ]

Abstract
Recent advances in generative models have enabled high-quality 3D character reconstruction from multi-modal inputs. However, animating these generated characters remains a challenging task, especially for complex elements like garments and hair, due to the lack of large-scale datasets and effective rigging methods. To address this gap, we curate AnimeRig, a large-scale dataset with detailed skeleton and skinning annotations. Building upon this, we propose DRiVE, a novel framework for generating and rigging 3D human characters with intricate structures. Unlike existing methods, DRiVE utilizes a 3D Gaussian representation, facilitating efficient animation and high-quality rendering. We further introduce GSDiff, a 3D Gaussian-based diffusion module that predicts joint positions as spatial distributions, overcoming the limitations of regression-based approaches. Extensive experiments demonstrate that DRiVE achieves precise rigging results, enabling realistic dynamics for clothing and hair, and surpassing previous methods in both quality and versatility. The code and dataset will be made public for academic use upon acceptance.
Poster
Yifang Men · Yuan Yao · Miaomiao Cui · Liefeng Bo

[ ExHall D ]

Abstract
Character video synthesis aims to produce realistic videos of animatable characters within lifelike scenes. As a fundamental problem in the computer vision and graphics community, 3D works typically require multi-view captures for per-case training, which severely limits their applicability to modeling arbitrary characters in a short time. Recent 2D methods break this limitation via pre-trained diffusion models, but they struggle with flexible controls, pose generality, and scene interaction. To this end, we propose MIMO, a novel framework which can not only synthesize realistic character videos with controllable attributes (i.e., character, motion and scene) provided by simple user inputs, but also simultaneously achieve advanced scalability to arbitrary characters, generality to novel 3D motions, and applicability to interactive real-world scenes in a unified framework. The core idea is to encode the 2D video to compact spatial codes, considering the inherent 3D nature of video occurrence. Concretely, we lift the 2D frame pixels into 3D using monocular depth estimators, and decompose the video clip into three spatial components (i.e., main human, underlying scene, and floating occlusion) in hierarchical layers based on the 3D depth. These components are further encoded to canonical identity code, structured motion code and full scene code, which are utilized …
Poster
Satyajit Tourani · Siddharth Tourani · Arif Mahmood · Muhammad Haris Khan

[ ExHall D ]

Abstract
Unsupervised landmark and head pose estimation is fundamental in fields like biometrics, augmented reality, and emotion recognition, offering accurate spatial data without relying on labeled datasets. It enhances scalability, adaptability, and generalization across diverse settings, where manual labeling is costly. In this work we exploit Stable Diffusion to approach the challenging problem of unsupervised landmark and head pose estimation and make the following contributions. (a) We propose a semantic-aware landmark localization algorithm including a consistent landmark selection technique. (b) To encode landmarks and their holistic configuration, we propose learning image-aware textual embeddings. (c) A novel algorithm for landmark-guided 3D head pose estimation is also proposed. (d) We refine the landmarks using head pose by innovating a 3D-rendering-based augmentation and pose-based batching technique, with the refined landmarks consequently improving the head pose. (e) We report a new state-of-the-art in unsupervised facial landmark estimation across five challenging datasets including AFLW2000, MAFL, Cat-Heads, LS3D and a facial landmark tracking benchmark, 300VW. In unsupervised head pose estimation, we outperform existing methods on BIWI and AFLW2000 by visible margins. Moreover, our method provides a significant training speed-up over the existing best unsupervised landmark detection method.
Poster
Yuxi Mi · Zhizhou Zhong · Yuge Huang · Qiuyang Yuan · Xuan Zhao · Jianqing Xu · Shouhong Ding · ShaoMing Wang · Rizen Guo · Shuigeng Zhou

[ ExHall D ]

Abstract
Identity-preserving face synthesis aims to generate synthetic face images of virtual subjects that can substitute real-world data for training face recognition models. While prior arts strive to create images with consistent identities and diverse styles, they face a trade-off between them. Identifying their limitation of treating style variation as subject-agnostic and observing that real-world persons actually have distinct, subject-specific styles, this paper introduces MorphFace, a diffusion-based face generator. The generator learns fine-grained facial styles, e.g., shape, pose and expression, from the renderings of a 3D morphable model (3DMM). It also learns identities from an off-the-shelf recognition model. To create virtual faces, the generator is conditioned on novel identities of unlabeled synthetic faces, and novel styles that are statistically sampled from a real-world prior distribution. The sampling especially accounts for both intra-subject variation and subject distinctiveness. A context blending strategy is employed to enhance the generator's responsiveness to identity and style conditions. Extensive experiments show that MorphFace outperforms the best prior arts in face recognition efficacy.
Poster
Michelle Guo · Matt Jen-Yuan Chiang · Igor Santesteban · Nikolaos Sarafianos · Hsiaoyu Chen · Oshri Halimi · Aljaž Božič · Shunsuke Saito · Jiajun Wu · Karen Liu · Tuur Stuyck · Egor Larionov

[ ExHall D ]

Abstract
We introduce a novel approach to reconstruct simulation-ready garments with intricate appearance. Despite recent advancements, existing methods often struggle to balance the need for accurate garment reconstruction with the ability to generalize to new poses and body shapes or require large amounts of data to achieve this. In contrast, our method only requires a multi-view capture of a single static frame. We represent garments as hybrid mesh-embedded 3D Gaussian splats (or simply Gaussians), where the Gaussians capture near-field shading and high-frequency details, while the mesh encodes far-field albedo and optimized reflectance parameters. We achieve novel pose generalization by exploiting the mesh from our hybrid approach, enabling physics-based simulation and surface rendering techniques, while also capturing fine details with Gaussians that accurately reconstruct garment details. Our optimized garments can be used for simulating garments on novel poses, and garment relighting.
Poster
Zeqing Wang · Qingyang Ma · Wentao Wan · Haojie Li · Keze Wang · Yonghong Tian

[ ExHall D ]

Abstract
Recent improvements in visual synthesis have significantly enhanced the depiction of generated human photos, which are pivotal due to their wide applicability and demand. Nonetheless, existing text-to-image or text-to-video models often generate low-quality human photos that might differ considerably from real-world body structures, referred to as "abnormal human bodies". Such abnormalities, typically deemed unacceptable, pose considerable challenges in their detection and repair within human photos. These challenges require precise abnormality recognition capabilities, which entail pinpointing both the location and the abnormality type. Intuitively, Visual Language Models (VLMs) that have obtained remarkable performance on various visual tasks are quite suitable for this task. However, their performance on abnormality detection in human photos is quite poor. Hence, it is quite important to highlight this task for the research community. In this paper, we first introduce a simple yet challenging task, i.e., Fine-grained Human-body Abnormality Detection (FHAD), and construct two high-quality datasets for evaluation. Then, we propose a meticulous framework, named HumanCalibrator, which identifies and repairs abnormalities in human body structures while preserving the other content. Experiments indicate that our HumanCalibrator achieves high accuracy in abnormality detection and accomplishes an increase in visual comparisons while preserving the other visual content.
Poster
Nannan Li · Kevin Shih · Bryan A. Plummer

[ ExHall D ]

Abstract
Given an isolated garment image in a canonical product view and a separate image of a person, the virtual try-on task aims to generate a new image of the person wearing the target garment. Prior virtual try-on works face two major challenges in achieving this goal: a) the paired (human, garment) training data has limited availability; b) generating textures on the human that perfectly match those of the prompted garment is difficult, often resulting in distorted text and faded textures. Our work explores ways to tackle these issues through both synthetic data as well as model refinement. We introduce a garment extraction model that generates (human, synthetic garment) pairs from a single image of a clothed individual. The synthetic pairs can then be used to augment the training of virtual try-on. We also propose an Error-Aware Refinement-based Schrödinger Bridge (EARSB) that surgically targets localized generation errors for correcting the output of a base virtual try-on model. To identify likely errors, we propose a weakly-supervised error classifier that localizes regions for refinement, subsequently augmenting the Schrödinger Bridge's noise schedule with its confidence heatmap. Experiments on VITON-HD and DressCode-Upper demonstrate that our synthetic data augmentation enhances the performance of prior work, while EARSB …
Poster
Yuanwei Liu · Hui Wei · Chengyu Jia · Ruqi Xiao · Weijian Ruan · Xingxing Wei · Joey Tianyi Zhou · Zheng Wang

[ ExHall D ]

Abstract
Previous physical adversarial attacks have shown that carefully crafted perturbations can deceive face recognition systems, revealing critical security vulnerabilities. However, these attacks often struggle to impersonate multiple targets and frequently fail to bypass liveness detection. For example, attacks using human-skin masks are challenging to fabricate, inconvenient to swap between users, and often fail liveness detection due to facial occlusions. A projector, however, can generate content-rich light without obstructing the face, making it ideal for non-intrusive attacks. Thus, we propose a novel physical adversarial attack using a projector and explore the superposition of projected and natural light to create adversarial facial images. This approach eliminates the need for physical artifacts on the face, effectively overcoming these limitations. Specifically, our proposed ProjAttacker generates adversarial 3D textures that are projected onto human faces. To ensure physical realizability, we introduce a light reflection function that models complex optical interactions between projected light and human skin, accounting for reflection and diffraction effects. Furthermore, we incorporate camera Image Signal Processing (ISP) simulation to maintain the robustness of adversarial perturbations across real-world diverse imaging conditions. Comprehensive evaluations conducted in both digital and physical scenarios validate the effectiveness of our method. Codes will be publicly available.
Poster
Yu-Cheng Chiu · GUAN-RONG CHEN · Zihao Chen · Yan-Tsung Peng

[ ExHall D ]

Abstract
The primary goal of white balance (WB) for sRGB images is to correct inaccurate color temperatures, making images exhibit natural, neutral colors. While existing WB methods achieve reasonable results, they are limited by the global color adjustments applied during a camera’s post-sRGB processing and the restricted color diversity in current datasets, often leading to suboptimal color correction, particularly in images with pronounced color shifts. To address these limitations, we propose an Auxiliary Bimodal Cross-domain Transformer (ABC-Former) that enhances WB correction by leveraging complementary knowledge from multiple modalities. ABC-Former employs two auxiliary models to extract global color information from CIELab and RGB color histograms, complementing the primary model’s sRGB input processing. We introduce an Interactive Channel Attention (ICA) module to facilitate cross-modality knowledge transfer, integrating calibrated color features into image features for more precise WB results. Experimental evaluations on benchmark WB datasets show that ABC-Former achieves superior performance, outperforming state-of-the-art WB methods.
Poster
Rui Xu · Yuzhen Niu · Yuezhou Li · Huangbiao Xu · Wenxi Liu · Yuzhong Chen

[ ExHall D ]

Abstract
Existing low-light image enhancement (LLIE) and joint LLIE and deblurring (LLIE-deblur) models have made strides in addressing predefined degradations, yet they are often constrained by dynamically coupled degradations. To address these challenges, we introduce a Unified Receptance Weighted Key Value (URWKV) model with multi-state perspective, enabling flexible and effective degradation restoration for low-light images. Specifically, we customize the core URWKV block to perceive and analyze complex degradations by leveraging multiple intra- and inter-stage states. First, inspired by the pupil mechanism in the human visual system, we propose Luminance-adaptive Normalization (LAN) that adjusts normalization parameters based on rich inter-stage states, allowing for adaptive, scene-aware luminance modulation. Second, we aggregate multiple intra-stage states through exponential moving average approach, effectively capturing subtle variations while mitigating information loss inherent in the single-state mechanism. To reduce the degradation effects commonly associated with conventional skip connections, we propose the State-aware Selective Fusion (SSF) module, which dynamically aligns and integrates multi-state features across encoder stages, selectively fusing contextual information. In comparison to state-of-the-art models, our URWKV model achieves superior performance on various benchmarks, while requiring significantly fewer parameters and computational resources.
Poster
Guanzhou Lan · Qianli Ma · YUQI YANG · Zhigang Wang · Dong Wang · Xuelong Li · Bin Zhao

[ ExHall D ]

Abstract
The computational burden of the iterative sampling process remains a major challenge in diffusion-based Low-Light Image Enhancement (LLIE). Current acceleration methods, whether training-based or training-free, often lead to significant performance degradation, highlighting the trade-off between performance and efficiency. In this paper, we identify two primary factors contributing to performance degradation: fitting errors and the inference gap. Our key insight is that fitting errors can be mitigated by linearly extrapolating the incorrect score functions, while the inference gap can be reduced by shifting the Gaussian flow to a reflectance-aware residual space. Based on the above insights, we design the Reflectance-Aware Trajectory Refinement (RATR) module, a simple yet effective module to refine the teacher trajectory using the reflectance component of images. Following this, we introduce Reflectance-aware Diffusion with Distilled Trajectory (ReDDiT), an efficient and flexible distillation framework tailored for LLIE. Our framework achieves comparable performance to previous diffusion-based methods with redundant steps in just 2 steps, while establishing new state-of-the-art (SOTA) results with 8 or 4 steps. Comprehensive experimental evaluations on 10 benchmark datasets validate the effectiveness of our method, consistently outperforming existing SOTA methods.
Poster
Hesong Li · Ziqi Wu · Ruiwen Shao · Tao Zhang · Ying Fu

[ ExHall D ]

Abstract
Scanning Transmission Electron Microscopy (STEM) enables the observation of atomic arrangements at sub-angstrom resolution, allowing for atomically resolved analysis of the physical and chemical properties of materials. However, due to the effects of noise, electron beam damage, sample thickness, etc., obtaining satisfactory atomic-level images is often challenging. Enhancing STEM images can reveal clearer structural details of materials. Nonetheless, existing STEM image enhancement methods usually overlook unique features in the frequency domain, and existing datasets lack realism and generality. To resolve these issues, in this paper, we develop noise calibration, data synthesis, and enhancement methods for STEM images. We first present a STEM noise calibration method, which is used to synthesize more realistic STEM images. The parameters of background noise, scan noise, and pointwise noise are obtained by statistical analysis and fitting of real STEM images containing atoms. Then we use these parameters to develop a more general dataset that considers both regular and random atomic arrangements, and includes both HAADF and BF mode images. Finally, we design a spatial-frequency interactive network for STEM image enhancement, which can explore the information in the frequency domain formed by the periodicity of atomic arrangements. Experimental results show that our data is closer to real STEM …
Poster
Yujie Wang · Praneeth Chakravarthula · Baoquan Chen

[ ExHall D ]

Abstract
Gaussian Splatting techniques have recently enabled high-quality 3D scene reconstruction and real-time novel view synthesis. These approaches, however, are limited by the pinhole camera model and lack support for modeling and rendering defocus effects. Departing from this, we introduce DOF-GS, a new framework that augments Gaussian Splatting with a finite-aperture camera model and explicit, differentiable defocus rendering, enabling it to function as a post-capture control tool. DOF-GS enables dynamic depth-of-field (DOF) adjustment through on-demand post-capture aperture and focal distance control for the first time, to the best of our knowledge. By using multi-view images with moderate defocus blur as input, our framework learns inherent camera characteristics and reconstructs sharp details of the underlying scene, in particular enabling rendering with varying DOF effects after capture and optimization. Additionally, our framework extracts circle-of-confusion cues during optimization to identify in-focus regions in input views, enhancing the reconstructed 3D scene details. Experimental results demonstrate that DOF-GS supports post-capture refocusing, adjustable defocus and high-quality all-in-focus rendering, from multi-view images with uncalibrated defocus blur.
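For context, the circle-of-confusion cue mentioned above comes from the standard thin-lens model; the sketch below computes the blur-circle diameter for points at different depths. The function and parameter names are a generic textbook relation, not the authors' exact camera parameterization.

```python
# Thin-lens circle of confusion: blur-circle diameter for a point at a given depth.
def circle_of_confusion(depth, focus_distance, focal_length, aperture_diameter):
    """Blur-circle diameter (same units as the inputs, here metres) for a point at `depth`."""
    return (aperture_diameter * focal_length * abs(depth - focus_distance)
            / (depth * (focus_distance - focal_length)))

# Example: a 50 mm lens at f/2 (25 mm aperture) focused at 2 m.
for d in (1.0, 2.0, 4.0, 10.0):
    c = circle_of_confusion(d, focus_distance=2.0, focal_length=0.05, aperture_diameter=0.025)
    print(f"depth {d:4.1f} m -> CoC {c * 1000:.2f} mm")
```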
Poster
Jingzhi Li · Zongwei Wu · Eduard Zamfir · Radu Timofte

[ ExHall D ]

Abstract
Accurate 3D object relighting in diverse unseen environments is crucial for realistic virtual object placement. Due to the albedo-lighting ambiguity, existing methods often fall short in producing faithful relights. Without proper constraints, observed training views can be explained by numerous combinations of lighting and material attributes, lacking physical correspondence with the actual environment maps used for relighting. In this work, we present ReCap, treating cross-environment captures as a multi-task target to provide the missing supervision that cuts through the entanglement. Specifically, ReCap jointly optimizes multiple lighting representations that share a common set of material attributes. This naturally harmonizes a coherent set of lighting representations around the mutual material attributes, exploiting commonalities and differences across varied object appearances. Such coherence enables physically sound lighting reconstruction and robust material estimation — both essential for accurate relighting. Together with a streamlined shading function and effective post-processing, ReCap outperforms the leading competitor by 3.4 dB in PSNR on an expanded relighting benchmark.
Poster
Yue Fan · Ningjing Fan · Ivan Skorokhodov · Oleg Voynov · Savva Ignatyev · Evgeny Burnaev · Peter Wonka · Yiqun Wang

[ ExHall D ]

Abstract
We develop a method that recovers the surface, materials, and illumination of a scene from its posed multi-view images. In contrast to prior work, it does not require any additional data and can handle glossy objects or bright lighting. It is a progressive inverse rendering approach, which consists of three stages. In the first stage, we reconstruct the scene radiance and signed distance function (SDF) with a novel regularization strategy for specular reflections. We propose to explain a pixel color using both surface and volume rendering jointly, which allows for handling complex view-dependent lighting effects for surface reconstruction. In the second stage, we distill light visibility and indirect illumination from the learned SDF and radiance field using learnable mapping functions. Finally, we design a method for estimating the ratio of incoming direct light reflected in a specular manner and use it to reconstruct the materials and direct illumination. Experimental results demonstrate that the proposed method outperforms the current state-of-the-art in recovering surfaces, materials, and lighting without relying on any additional data.
Poster
Cheng-De Fan · Chen-Wei Chang · Yi-Ruei Liu · Jie-Ying Lee · Jiun-Long Huang · Yu-Chee Tseng · Yu-Lun Liu

[ ExHall D ]

Abstract
We present SpectroMotion, a novel approach that combines 3D Gaussian Splatting (3DGS) with physically-based rendering (PBR) and deformation fields to reconstruct dynamic specular scenes. Previous methods extending 3DGS to model dynamic scenes have struggled to represent specular surfaces accurately. Our method addresses this limitation by introducing a residual correction technique for accurate surface normal computation during deformation, complemented by a deformable environment map that adapts to time-varying lighting conditions. We implement a coarse-to-fine training strategy significantly enhancing scene geometry and specular color prediction. It is the only existing 3DGS method capable of synthesizing photorealistic real-world dynamic specular scenes, outperforming state-of-the-art methods in rendering complex, dynamic, and specular scenes.
Poster
Xingyu Chen · Zihao Feng · Kun Qian · Xinyu Zhang

[ ExHall D ]

Abstract
Radio frequency (RF) propagation modeling poses unique electromagnetic simulation challenges. While recent neural representations have shown success in visible spectrum rendering, the fundamentally different scales and physics of RF signals require novel modeling paradigms. In this paper, we introduce RFScape, a novel framework that bridges the gap between neural scene representation and RF propagation modeling. Our key insight is that complex RF-object interactions can be captured through object-centric neural representations while preserving the composability of traditional ray tracing. Unlike previous approaches that either rely on crude geometric approximations or require dense spatial sampling of entire scenes, RFScape learns per-object electromagnetic properties and enables flexible scene composition. Through extensive evaluation on real-world RF testbeds, we demonstrate that our approach achieves 13 dB improvement over conventional ray tracing and 5 dB over state-of-the-art neural baselines in modeling accuracy, while requiring only sparse training samples.
Poster
You Wang · Li Fang · Hao Zhu · Fei Hu · Long Ye · Zhan Ma

[ ExHall D ]

Abstract
Neural Radiance Fields (NeRF) have transformed novel view synthesis by modeling scene-specific volumetric representations directly from images. While generalizable NeRF models can generate novel views across unknown scenes by learning latent ray representations, their performance heavily depends on a large number of multi-view observations. However, with limited input views, these methods experience significant degradation in rendering quality. To address this limitation, we propose GoLF-NRT: a Global and Local feature Fusion-based Neural Rendering Transformer. GoLF-NRT enhances generalizable neural rendering from few input views by leveraging a 3D transformer with efficient sparse attention to capture global scene context. In parallel, it integrates local geometric features extracted along the epipolar line, enabling high-quality scene reconstruction from as few as 1 to 3 input views. Furthermore, we introduce an adaptive sampling strategy based on attention weights and kernel regression, improving the accuracy of transformer-based neural rendering. Extensive experiments on public datasets show that GoLF-NRT achieves state-of-the-art performance across varying numbers of input views, highlighting the effectiveness and superiority of our approach. We will open-source our code upon the paper's acceptance.
Poster
Jan Held · Renaud Vandeghen · Abdullah J Hamdi · Anthony Cioppa · Adrien Deliege · Silvio Giancola · Andrea Vedaldi · Bernard Ghanem · Marc Van Droogenbroeck

[ ExHall D ]

Abstract
Recent advances in radiance field reconstruction, such as 3D Gaussian Splatting (3DGS), have achieved high-quality novel view synthesis and fast rendering by representing scenes with compositions of Gaussian primitives. However, 3D Gaussians present several limitations for scene reconstruction. Accurately capturing hard edges is challenging without significantly increasing the number of Gaussians, creating a large memory footprint. Moreover, they struggle to represent flat surfaces, as they are diffused in space. Without hand-crafted regularizers, they tend to disperse irregularly around the actual surface. To circumvent these issues, we introduce a novel method, named 3D Convex Splatting (3DCS), which leverages 3D smooth convexes as primitives for modeling geometrically-meaningful radiance fields from multi-view images. Smooth convex shapes offer greater flexibility than Gaussians, allowing for a better representation of 3D scenes with hard edges and dense volumes using fewer primitives. Powered by our efficient CUDA-based rasterizer, 3DCS achieves superior performance over 3DGS on benchmarks such as Mip-NeRF360, Tanks and Temples, and Deep Blending. Specifically, our method attains an improvement of up to 0.81 in PSNR and 0.026 in LPIPS compared to 3DGS while maintaining high rendering speeds and reducing the number of required primitives. Our results highlight the potential of 3D Convex Splatting to become …
Poster
Stefano Esposito · Anpei Chen · Christian Reiser · Samuel Rota Bulò · Lorenzo Porzi · Katja Schwarz · Christian Richardt · Michael Zollhoefer · Peter Kontschieder · Andreas Geiger

[ ExHall D ]

Abstract
High-quality real-time view synthesis methods are based on volume rendering, splatting, or surface rendering. While surface-based methods generally are the fastest, they cannot faithfully model fuzzy geometry like hair. In turn, alpha-blending techniques excel at representing fuzzy materials but require an unbounded number of samples per ray (P1). Further overheads are induced by empty space skipping in volume rendering (P2) and sorting input primitives in splatting (P3). We present a novel representation for real-time view synthesis where the (P1) number of sampling locations is small and bounded, (P2) sampling locations are efficiently found via rasterization, and (P3) rendering is sorting-free. We achieve this by representing objects as semi-transparent multi-layer meshes, rendered in fixed order. First, we model surface layers as SDF shells with optimal spacing learned during training. Then, we bake them as meshes and fit UV textures. Unlike single-surface methods, our multi-layer representation effectively models fuzzy objects. In contrast to volume-based and splatting-based methods, our approach enables real-time rendering on low-cost smartphones.
Poster
Shu Wang · Yanbo Gao · Shuai Li · Chong Lv · Xun Cai · chuankun Li · Hui Yuan · jinglin zhang

[ ExHall D ]

Abstract
This paper presents MetricGrids, a novel grid-based neural representation that combines elementary metric grids in various metric spaces to approximate complex nonlinear signals. While grid-based representations are widely adopted for their efficiency and scalability, the existing feature grids with linear indexing for continuous-space points can only provide degenerate linear latent space representations, and such representations cannot be adequately compensated to represent complex nonlinear signals by the following compact decoder. To address this problem while keeping the simplicity of a regular grid structure, our approach builds upon the standard grid-based paradigm by constructing multiple elementary metric grids as high-order terms to approximate complex nonlinearities, following the Taylor expansion principle. Furthermore, we enhance model compactness with hash encoding based on different sparsities of the grids to prevent detrimental hash collisions, and a high-order extrapolation decoder to reduce explicit grid storage requirements. Experimental results on both 2D and 3D reconstructions demonstrate the superior fitting and rendering accuracy of the proposed method across diverse signal types, validating its robustness and generalizability.
Poster
Xiangjun Gao · Xiaoyu Li · Yiyu Zhuang · Qi Zhang · Wenbo Hu · Chaopeng Zhang · Yao Yao · Ying Shan · Long Quan

[ ExHall D ]

Abstract
Neural 3D representations, such as Neural Radiance Fields (NeRFs), excel at producing photo-realistic rendering results but lack the flexibility for manipulation and editing, which is crucial for content creation. Previous works have attempted to address this issue by deforming a NeRF in canonical space or manipulating the radiance field based on an explicit mesh. However, manipulating NeRF is not highly controllable and requires a long training and inference time. With the emergence of 3D Gaussian Splatting (3DGS), extremely high-fidelity novel view synthesis can be achieved using an explicit point-based 3D representation with much faster training and rendering speed. However, there is still a lack of effective means to manipulate 3DGS freely while maintaining rendering quality. In this work, we aim to tackle the challenge of achieving manipulable photo-realistic rendering. We propose to utilize a triangular mesh to manipulate 3DGS directly with self-adaptation. This approach reduces the need to design various algorithms for different types of Gaussian manipulation. By utilizing a triangle shape-aware Gaussian binding and adapting method, we can achieve 3DGS manipulation and preserve high-fidelity rendering after manipulation. Our approach is capable of handling large deformations, local manipulations, and even physics simulations while keeping high-quality rendering. Furthermore, we demonstrate that …
Poster
Dana Cohen-Bar · Daniel Cohen-Or · Gal Chechik · Yoni Kasten

[ ExHall D ]

Abstract
As 3D content creation continues to grow, transferring semantic textures between 3D meshes remains a significant challenge in computer graphics. While recent methods leverage text-to-image diffusion models for texturing, they often struggle to preserve the appearance of the source texture during texture transfer. We present TriTex, a novel approach that learns a volumetric texture field from a single textured mesh by mapping semantic features to surface colors. Using an efficient triplane-based architecture, our method enables semantic-aware texture transfer to a novel target mesh. Despite training on just one example, it generalizes effectively to diverse shapes within the same category. Extensive evaluation on our newly created benchmark dataset shows that TriTex achieves superior texture transfer quality and fast inference times compared to existing methods. Our approach advances single-example texture transfer, providing a practical solution for maintaining visual coherence across related 3D models in applications like game development and simulation.
Poster
Armin Shafiee Sarvestani · Sheyang Tang · Zhou Wang

[ ExHall D ]

Abstract
Mesh quality assessment (MQA) models play a critical role in the design, optimization, and evaluation of mesh operation systems in a wide variety of applications. Current MQA models, whether model-based methods using topology-aware features or projection-based approaches working on rendered 2D projections, often fail to capture the intricate interactions between texture and 3D geometry. We introduce HybridMQA, a first-of-its-kind hybrid full-reference colored MQA framework that integrates model-based and projection-based approaches, capturing complex interactions between textural information and 3D structures for enriched quality representations. Our method employs graph learning to extract detailed 3D representations, which are then projected to 2D using a novel feature rendering process that precisely aligns them with colored projections. This enables the exploration of geometry-texture interactions via cross-attention, producing comprehensive mesh quality representations. Extensive experiments demonstrate HybridMQA’s superior performance across diverse datasets, highlighting its ability to effectively leverage geometry-texture interactions for a thorough understanding of mesh quality. Our implementation will be made publicly available.
Poster
Xiang Feng · Chang Yu · Zoubin Bi · Yintong Shang · Feng Gao · Hongzhi Wu · Kun Zhou · Chenfanfu Jiang · Yin Yang

[ ExHall D ]

Abstract
Recent image-to-3D reconstruction models have greatly advanced geometry generation, but they still struggle to faithfully generate realistic appearance. To address this, we introduce ARM, a novel method that reconstructs high-quality 3D meshes and realistic appearance from sparse-view images. The core of ARM lies in decoupling geometry from appearance, processing appearance within the UV texture space. Unlike previous methods, ARM improves texture quality by explicitly back-projecting measurements onto the texture map and processing them in a UV space module with a global receptive field. To resolve ambiguities between material and illumination in input images, ARM introduces a material prior that encodes semantic appearance information, enhancing the robustness of appearance decomposition. Trained on just 8 H100 GPUs, ARM outperforms existing methods both quantitatively and qualitatively.
Poster
Jing Li · Yihang Fu · Falai Chen

[ ExHall D ]

Abstract
Boundary representation (B-rep) of geometric models is a fundamental format in Computer-Aided Design (CAD). However, automatically generating valid and high-quality B-rep models remains challenging due to the complex interdependence between the topology and geometry of the models. Existing methods tend to prioritize geometric representation while giving insufficient attention to topological constraints, making it difficult to maintain structural validity and geometric accuracy. In this paper, we propose DTGBrepGen, a novel topology-geometry decoupled framework for B-rep generation that explicitly addresses both aspects. Our approach first generates valid topological structures through a two-stage process that independently models edge-face and edge-vertex adjacency relationships. Subsequently, we employ Transformer-based diffusion models for sequential geometry generation, progressively generating vertex coordinates, followed by edge geometries and face geometries which are represented as B-splines. Extensive experiments on diverse CAD datasets show that DTGBrepGen significantly outperforms existing methods in both topological validity and geometric accuracy, achieving higher validity rates and producing more diverse and realistic B-reps.
Poster
Yuan Li · Cheng Lin · Yuan Liu · Xiaoxiao Long · Chenxu Zhang · Ningna Wang · Xin Li · Wenping Wang · Xiaohu Guo

[ ExHall D ]

Abstract
The field of diffusion-based 3D generation has experienced tremendous progress in recent times. However, existing 3D generative models often produce overly dense and unstructured meshes, which are in stark contrast to the compact, structured and clear-edged CAD models created by human modelers. We introduce CADDreamer, a novel method for generating CAD objects from a single image. This method proposes a primitive-aware multi-view diffusion model, which perceives both local geometry and high-level structural semantics during the generation process. We encode primitive semantics into the color domain, and enforce the strong priors in pre-trained diffusion models to align with the well-defined primitives. As a result, we can infer multi-view normal maps and semantic maps from a single image, thereby reconstructing a mesh with primitive labels. Correspondingly, we propose a set of fitting and optimization methods to deal with the inevitable noise and distortion in generated primitives, ultimately producing a complete and seamless Boundary Representation (B-rep) of a Computer-Aided Design (CAD) model. Experimental results demonstrate that our method can effectively recover high-quality CAD objects from single-view images. Compared to existing 3D generation methods, the models produced by CADDreamer are compact in representation, clear in structure, sharp in boundaries, and watertight in topology.
Poster
Yiftach Edelstein · Or Patashnik · Dana Cohen-Bar · Lihi Zelnik-Manor

[ ExHall D ]

Abstract
Advancements in text-to-image diffusion models have led to significant progress in fast 3D content creation. One common approach is to generate a set of multi-view images of an object, and then reconstruct it into a 3D model. However, this approach bypasses the use of a native 3D representation of the object and is hence prone to geometric artifacts and limited in controllability and manipulation capabilities. An alternative approach involves native 3D generative models that directly produce 3D representations. These models, however, are typically limited in their resolution, resulting in lower quality 3D objects. In this work, we bridge the quality gap between methods that directly generate 3D representations and ones that reconstruct 3D objects from multi-view images. We introduce a multi-view to multi-view diffusion model called Sharp-It, which takes a 3D consistent set of multi-view images rendered from a low-quality object and enriches its geometric details and texture. The diffusion model operates on the multi-view set in parallel, in the sense that it shares features across the generated views. A high-quality 3D model can then be reconstructed from the enriched multi-view set. By leveraging the advantages of both 2D and 3D approaches, our method offers an efficient and controllable method …
Poster
Jianfeng XIANG · Zelong Lv · Sicheng Xu · Yu Deng · Ruicheng Wang · Bowen Zhang · Dong Chen · Xin Tong · Jiaolong Yang

[ ExHall D ]

Abstract
We introduce a novel 3D generation method for versatile and high-quality 3D asset creation. The cornerstone is a unified Structured LATent (SLAT) representation which allows decoding to different output formats, such as Radiance Fields, 3D Gaussians, and meshes. This is achieved by integrating a sparsely-populated 3D grid with dense multiview visual features extracted from a powerful vision foundation model, comprehensively capturing both structural (geometry) and textural (appearance) information while maintaining flexibility during decoding. We employ rectified flow transformers tailored for SLAT as our 3D generation models and train models with up to 2 billion parameters on a large 3D asset dataset of 500K diverse objects. Our model generates high-quality results with text or image conditions, significantly surpassing existing methods, including recent ones at similar scales. We showcase flexible output format selection and local 3D editing capabilities which were not offered by previous models. Code, model, and data will be released.
Poster
Xingyi Yang · Songhua Liu · Xinchao Wang

[ ExHall D ]

Abstract
The quality of 3D generative modeling has been notably improved by the adoption of 2D diffusion models. Despite this progress, the cumbersome optimization process itself presents a critical efficiency problem. In this paper, we introduce Hash3D, a universal acceleration for 3D score distillation sampling (SDS) that requires no model training. Central to Hash3D is the observation that images rendered from similar camera positions and diffusion time-steps often have redundant feature maps. By hashing and reusing these feature maps across nearby timesteps and camera angles, Hash3D eliminates unnecessary calculations. We implement this through an adaptive grid-based hashing. As a result, it largely speeds up the process of 3D generation. Surprisingly, this feature-sharing mechanism not only makes generation faster but also improves the smoothness and view consistency of the synthesized 3D objects. Our experiments covering 5 text-to-3D and 3 image-to-3D models demonstrate Hash3D's versatility to speed up optimization, enhancing efficiency by 1.54×. Additionally, Hash3D's integration with 3D Gaussian splatting largely speeds up 3D model creation, reducing text-to-3D conversion to about 10 minutes and image-to-3D conversion to 30 seconds.
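The feature-reuse idea lends itself to a very small sketch: cache feature maps in a hash table keyed by quantized camera angles and diffusion timesteps, and reuse the cached entry on nearby queries. The class name, bin sizes, and key layout below are assumptions, not Hash3D's actual interface.

```python
class FeatureCache:
    """Minimal sketch of grid-based hashing for reusing diffusion feature maps
    across nearby camera poses and timesteps (names and bin sizes are assumed)."""

    def __init__(self, angle_bin_deg=5.0, time_bin=50):
        self.angle_bin = angle_bin_deg
        self.time_bin = time_bin
        self.store = {}

    def _key(self, azimuth_deg, elevation_deg, timestep):
        # Quantize the query so nearby views / timesteps share one bucket.
        return (round(azimuth_deg / self.angle_bin),
                round(elevation_deg / self.angle_bin),
                timestep // self.time_bin)

    def get_or_compute(self, azimuth_deg, elevation_deg, timestep, compute_fn):
        key = self._key(azimuth_deg, elevation_deg, timestep)
        if key not in self.store:            # cache miss: run the expensive call
            self.store[key] = compute_fn()
        return self.store[key]               # cache hit: reuse the feature map

# Usage: wrap the expensive feature extraction (placeholder lambda here).
cache = FeatureCache()
feat = cache.get_or_compute(azimuth_deg=31.0, elevation_deg=10.0, timestep=540,
                            compute_fn=lambda: "expensive_feature_map")
```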
Poster
Trong-Tung Nguyen · Quang Nguyen · Khoi Nguyen · Anh Tran · Cuong Pham

[ ExHall D ]

Abstract
Recent advances in text-guided image editing enable users to perform image edits through simple text inputs, leveraging the extensive priors of multi-step diffusion-based text-to-image models. However, these methods often fall short of the speed demands required for real-world and on-device applications due to the costly multi-step inversion and sampling process involved. In response to this, we introduce SwiftEdit, a simple yet highly efficient editing tool that achieves instant text-guided image editing (in 0.23s). The advancement of SwiftEdit lies in its two novel contributions: a one-step inversion framework that enables one-step image reconstruction via inversion, and a mask-guided editing technique with our proposed attention rescaling mechanism to perform localized image editing. Extensive experiments are provided to demonstrate the effectiveness and efficiency of SwiftEdit. In particular, SwiftEdit enables instant text-guided image editing that is dramatically faster than previous multi-step methods (at least 50 times faster) while maintaining competitive editing quality.
Poster
Weiran Guang · Xiaoguang Gu · Mengqi Huang · Zhendong Mao

[ ExHall D ]

Abstract
Interactive drag editing of images is a valuable task that has gained considerable attention for its precision and controllability. However, existing approaches have primarily focused on manipulating the shape or movement of objects in the 2D plane. We propose to extend this drag-based editing task to 3D space. Firstly, we utilize the trajectory of two points to represent the rotational trajectory of the object. Gaussian maps of a circle and a square are centered at these two points, respectively. We use distinct shapes to ensure that symmetric views produce different object representations. Secondly, we introduce a lightweight mapping network to embed the object features into two Gaussian maps to obtain a continuous control condition that guides the model in learning the correspondence between the trajectory and the object. Finally, to overcome the limitations of current 3D object reconstruction datasets, which typically consist of object maps with transparent backgrounds, we affix random backgrounds to them. This modification helps improve the model's ability to ignore background interference when editing real images with complex backgrounds. Experiments demonstrate that our approach successfully achieves object rotation within the drag framework and demonstrates strong generalization to real-world images.
Poster
Yihua Huang · Mingxian Lin · Yangtian Sun · Ziyi Yang · Xiaoyang Lyu · Yan-Pei Cao · Xiaojuan Qi

[ ExHall D ]

Abstract
Recently, Gaussian splatting has emerged as a robust technique for representing 3D scenes, enabling real-time rasterization and high-fidelity rendering. However, Gaussians' inherent radial symmetry and smoothness constraints limit their ability to represent complex shapes, often requiring thousands of primitives to approximate detailed geometry. We introduce Deformable Radial Kernel (DRK), which extends Gaussian splatting into a more general and flexible framework. Through learnable radial bases with adjustable angles and scales, DRK efficiently models diverse shape primitives while enabling precise control over edge sharpness and boundary curvature. Given DRK's planar nature, we further develop accurate ray-primitive intersection computation for depth sorting and introduce efficient kernel culling strategies for improved rasterization efficiency. Extensive experiments demonstrate that DRK outperforms existing methods in both representation efficiency and rendering quality, achieving state-of-the-art performance while dramatically reducing primitive count.
Poster
Hyojun Go · byeongjun park · Jiho Jang · Jin-Young Kim · Soonwoo Kwon · Changick Kim

[ ExHall D ]

Abstract
Text-based generation and editing of 3D scenes hold significant potential for streamlining content creation through intuitive user interactions. While recent advances leverage 3D Gaussian Splatting (3DGS) for high-fidelity and real-time rendering, existing methods are often specialized and task-focused, lacking a unified framework for both generation and editing. In this paper, we introduce SplatFlow, a comprehensive framework that addresses this gap by enabling direct 3DGS generation and editing. SplatFlow comprises two main components: a multi-view rectified flow (RF) model and a Gaussian Splatting Decoder (GSDecoder). The multi-view RF model operates in latent space, generating multi-view images, depths, and camera poses simultaneously, conditioned on text prompts—thus addressing challenges like diverse scene scales and complex camera trajectories in real-world settings. Then, the GSDecoder efficiently translates these latent outputs into 3DGS representations through a feed-forward 3DGS method. Leveraging training-free inversion and inpainting techniques, SplatFlow enables seamless 3DGS editing and supports a broad range of 3D tasks—including object editing, novel view synthesis, and camera pose estimation—within a unified framework without requiring additional complex pipelines. We validate SplatFlow's capabilities on the MVImgNet and DL3DV-7K datasets, demonstrating its versatility and effectiveness in various 3D generation, editing, and inpainting-based tasks.
Poster
Alex Hanson · Allen Tu · Geng Lin · Vasu Singla · Matthias Zwicker · Tom Goldstein

[ ExHall D ]

Abstract
3D Gaussian Splatting (3D-GS) is a recent 3D scene reconstruction technique that enables real-time rendering of novel views by modeling scenes as parametric point clouds of differentiable 3D Gaussians. However, its rendering speed and model size still present bottlenecks, especially in resource-constrained settings. In this paper, we identify and address two key inefficiencies in 3D-GS, achieving substantial improvements in rendering speed, model size, and training time. First, we optimize the rendering pipeline to precisely localize Gaussians in the scene, boosting rendering speed without altering visual fidelity. Second, we introduce a novel pruning technique and integrate it into the training pipeline, significantly reducing model size and training time while further raising rendering speed. Our Speedy-Splat approach combines these techniques to accelerate average rendering speed by a drastic 6.71× across scenes from the Mip-NeRF 360, Tanks & Temples, and Deep Blending datasets with 10.6× fewer primitives than 3D-GS.
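One way to picture the "precise localization" idea is tile assignment for projected Gaussians: a conservative rasterizer bounds each splat by a square derived from the largest covariance eigenvalue, while a tighter per-axis bound touches fewer tiles. The sketch below illustrates that contrast only and is not Speedy-Splat's actual scheme; the tile size and 3-sigma cutoff are assumptions.

```python
import numpy as np

def tiles_touched(mean2d, cov2d, tile=16, n_sigma=3.0, tight=True):
    """Which screen tiles a projected 2D Gaussian overlaps (illustrative sketch)."""
    if tight:
        rx = n_sigma * np.sqrt(cov2d[0, 0])       # per-axis 3-sigma extents
        ry = n_sigma * np.sqrt(cov2d[1, 1])
    else:
        lam_max = np.linalg.eigvalsh(cov2d)[-1]   # isotropic, conservative radius
        rx = ry = n_sigma * np.sqrt(lam_max)
    x0, x1 = mean2d[0] - rx, mean2d[0] + rx
    y0, y1 = mean2d[1] - ry, mean2d[1] + ry
    return [(tx, ty)
            for tx in range(int(x0 // tile), int(x1 // tile) + 1)
            for ty in range(int(y0 // tile), int(y1 // tile) + 1)]

mean = np.array([100.0, 60.0])
cov  = np.array([[400.0, 0.0], [0.0, 9.0]])       # elongated Gaussian
print(len(tiles_touched(mean, cov, tight=False)), "tiles (loose) vs",
      len(tiles_touched(mean, cov, tight=True)), "tiles (tight)")
```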
Poster
Jinguang Tong · Xuesong li · Fahira Afzal Maken · Sundaram Muthu · Lars Petersson · Chuong Nguyen · Hongdong Li

[ ExHall D ]

Abstract
3D modeling of highly reflective objects remains challenging due to strong view-dependent appearances. While previous SDF-based methods can recover high-quality meshes, they are often time-consuming and tend to produce over-smoothed surfaces. In contrast, 3D Gaussian Splatting (3DGS) offers the advantage of high speed and detailed real-time rendering, but extracting surfaces from the Gaussians can be noisy due to the lack of geometric constraints. To bridge the gap between these approaches, we propose a novel reconstruction method called GS-2DGS for reflective objects based on 2D Gaussian Splatting (2DGS). Our approach combines the rapid rendering capabilities of Gaussian Splatting with additional geometric information from a foundation model. Experimental results on synthetic and real datasets demonstrate that our method significantly outperforms Gaussian-based techniques in terms of reconstruction and relighting and achieves performance comparable to SDF-based methods while being an order of magnitude faster.
Poster
Yiyang Shen · Kun Zhou · He Wang · Yin Yang · Tianjia Shao

[ ExHall D ]

Abstract
Recently, single-view 3D generation via Gaussian splatting has emerged and developed quickly. These methods learn 3D Gaussians from 2D RGB images generated by pre-trained multi-view diffusion (MVD) models, and have shown a promising avenue for 3D generation from a single image. Despite the current progress, these methods still suffer from the inconsistency jointly caused by the geometric ambiguity in the 2D images and the lack of structure of 3D Gaussians, leading to distorted and blurry 3D object generation. In this paper, we propose to fix these issues with GS-RGBN, a new RGBN-volume Gaussian Reconstruction Model designed to generate high-fidelity 3D objects from single-view images. Our key insight is that a structured 3D representation can simultaneously mitigate the two aforementioned issues. To this end, we propose a novel hybrid Voxel-Gaussian representation, where a 3D voxel representation contains explicit 3D geometric information, eliminating the geometric ambiguity from 2D images. It also structures Gaussians during learning so that the optimization tends to find better local optima. Our 3D voxel representation is obtained by a fusion module that aligns RGB features and surface normal features, both of which can be estimated from 2D images. Extensive experiments demonstrate the superiority of our methods over prior works in …
Poster
Yifan Liu · Keyu Fan · Weihao Yu · Chenxin Li · Hao Lu · Yixuan Yuan

[ ExHall D ]

Abstract
Recent advances in generalizable 3D Gaussian Splatting have demonstrated promising results in real-time high-fidelity rendering without per-scene optimization, yet existing approaches still struggle to handle unfamiliar visual content during inference on novel scenes due to limited generalizability. To address this challenge, we introduce MonoSplat, a novel framework that leverages rich visual priors from pre-trained monocular depth foundation models for robust Gaussian reconstruction. Our approach consists of two key components working in harmony: a Mono-Multi Feature Adapter that transforms monocular features into cross-view-aware multi-view representations, coupled with an Integrated Gaussian Prediction module that effectively fuses both feature types for precise Gaussian generation. Through the Adapter's lightweight attention mechanism, features are seamlessly aligned and aggregated across views while preserving valuable monocular priors, enabling the Prediction module to generate Gaussian primitives with accurate geometry and appearance. Through extensive experiments on diverse real-world datasets, we convincingly demonstrate that MonoSplat achieves superior reconstruction quality and generalization capability compared to existing methods, while maintaining computational efficiency with minimal trainable parameters. We will make our codes and models publicly available.
Poster
Han Zhou · Wei Dong · Jun Chen

[ ExHall D ]

Abstract
Directly employing 3D Gaussian Splatting (3DGS) on images captured under adverse illumination conditions makes it considerably difficult to achieve high-quality, normally-exposed representations because: (1) the limited Structure from Motion (SfM) points estimated in adverse illumination scenarios fail to capture sufficient scene details; (2) without ground-truth references, the intensive information loss, heavy noise, and color distortion pose substantial challenges for 3DGS to produce high-quality results; (3) combining existing exposure correction methods with 3DGS cannot achieve satisfactory performance because each view is enhanced individually, which leads to illumination inconsistency between enhanced images from different viewpoints. To address these issues, we propose LITA-GS, a novel illumination-agnostic novel view synthesis method based on reference-free 3DGS and physical priors. Firstly, we introduce an illumination-invariant physical prior extraction pipeline. Secondly, based on the extracted robust spatial structure prior, we develop a lighting-agnostic structure rendering strategy, which facilitates the optimization of the scene structure and object appearance. Moreover, a progressive denoising module is introduced to effectively suppress the noise within the light-invariant representation. We adopt an unsupervised strategy for the training of LITA-GS, and extensive experiments demonstrate that LITA-GS surpasses the state-of-the-art (SOTA) NeRF-based method by 1.7 dB in PSNR and 0.09 in SSIM while enjoying faster …
Poster
Zheng Chen · Chenming Wu · Zhelun Shen · Chen Zhao · Weicai Ye · Haocheng Feng · Errui Ding · Song-Hai Zhang

[ ExHall D ]

Abstract
Wide-baseline panoramic images are commonly used in applications such as VR and simulation rendering to reduce network bandwidth and storage requirements. However, synthesizing novel views from these panoramic images in real time remains a significant challenge, especially due to the high resolution and inherent distortions of panoramic imagery. Although existing 3D Gaussian splatting (3DGS) methods can produce photo-realistic views under narrow baselines, they often overfit the training views when dealing with wide-baseline panoramic images due to the difficulty in learning precise geometry from sparse 360-degree views. This paper presents Splatter-360, a novel end-to-end generalizable 3DGS framework specifically designed to handle wide-baseline panoramic images. Unlike previous approaches, Splatter-360 performs multi-view matching directly in the spherical domain by constructing a spherical cost volume through a spherical sweep algorithm, enhancing the network's depth perception and geometry estimation. Additionally, we introduce a 3D-aware bi-projection encoder to mitigate the distortions inherent in panoramic images and integrate cross-view attention to improve feature interactions across multiple viewpoints. This enables robust 3D-aware feature representations and real-time rendering capabilities. Experimental results on the HM3D and Replica datasets demonstrate that Splatter-360 significantly outperforms state-of-the-art NeRF and 3DGS methods (e.g., PanoGRF, MVSplat, DepthSplat, and HiSplat) in both synthesis quality and generalization performance …
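The spherical sweep underlying such a cost volume reduces to a geometric warp: for each radius hypothesis, lift reference-panorama rays to 3D and re-project them into the source panorama. The numpy sketch below shows that warp under assumed equirectangular and y-up conventions; it is a background illustration, not the paper's network code.

```python
import numpy as np

def erp_rays(h, w):
    """Unit ray directions for an equirectangular image (y-up convention assumed)."""
    v, u = np.meshgrid(np.arange(h) + 0.5, np.arange(w) + 0.5, indexing="ij")
    lon = u / w * 2 * np.pi - np.pi
    lat = np.pi / 2 - v / h * np.pi
    return np.stack([np.cos(lat) * np.sin(lon),
                     np.sin(lat),
                     np.cos(lat) * np.cos(lon)], axis=-1)      # (h, w, 3)

def sphere_sweep_coords(h, w, radius, R_ref, t_ref, R_src, t_src):
    """For one radius hypothesis, where each reference pixel lands in the source panorama.

    Cameras map camera-space points X_cam to world as X_w = R @ X_cam + t (assumed).
    Returns fractional (u, v) sampling coordinates in the source image.
    """
    d = erp_rays(h, w)                                   # reference-frame rays
    X_w = (d * radius) @ R_ref.T + t_ref                 # lift to world at this radius
    X_s = (X_w - t_src) @ R_src                          # into the source camera frame
    r = np.linalg.norm(X_s, axis=-1, keepdims=True)
    x, y, z = np.moveaxis(X_s / r, -1, 0)
    lon = np.arctan2(x, z)
    lat = np.arcsin(np.clip(y, -1.0, 1.0))
    u = (lon + np.pi) / (2 * np.pi) * w
    v = (np.pi / 2 - lat) / np.pi * h
    return u, v   # sample source features here and correlate with reference features

# One hypothesis of a sweep over radii, with a source camera shifted 0.5 m along x.
u, v = sphere_sweep_coords(64, 128, radius=2.0,
                           R_ref=np.eye(3), t_ref=np.zeros(3),
                           R_src=np.eye(3), t_src=np.array([0.5, 0.0, 0.0]))
```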
Poster
Hyunwoo Park · Gun Ryu · Wonjun Kim

[ ExHall D ]

Abstract
Recently, 3D Gaussian splatting (3DGS) has gained considerable attention in the field of novel view synthesis due to its fast performance and excellent image quality. However, 3DGS in sparse-view settings (e.g., three-view inputs) often faces the problem of overfitting to the training views, which significantly degrades the visual quality of novel view images. Many existing approaches have tackled this issue by using strong priors, such as 2D generative contextual information and external depth signals. In contrast, this paper introduces a prior-free method, called DropGaussian, with simple changes to 3D Gaussian splatting. Specifically, we randomly remove Gaussians during the training process in a manner similar to dropout, which allows the non-excluded Gaussians to receive larger gradients while improving their visibility. This makes the remaining Gaussians contribute more to the optimization process for rendering with sparse input views. Such a simple operation effectively alleviates the overfitting problem and enhances the quality of novel view synthesis. By simply applying DropGaussian to the original 3DGS framework, we can achieve competitive performance with existing prior-based 3DGS methods in sparse-view settings on benchmark datasets without any additional complexity.
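Because the operation is essentially dropout applied to primitives, it can be sketched in a few lines; the rescaling below follows the usual inverted-dropout convention and may differ from the paper's exact formulation.

```python
import torch

def drop_gaussians(opacity, drop_rate=0.1, training=True):
    """Dropout-style Gaussian removal for one training iteration (illustrative sketch).

    opacity: (N, 1) tensor of per-Gaussian opacities. During training a random
    subset is zeroed out and the survivors are rescaled, so they receive larger
    gradients and contribute more to the sparse-view fit.
    """
    if not training or drop_rate == 0.0:
        return opacity
    keep = (torch.rand(opacity.shape[0], 1, device=opacity.device) > drop_rate).float()
    return opacity * keep / (1.0 - drop_rate)

opacity = torch.sigmoid(torch.randn(1000, 1))
effective = drop_gaussians(opacity, drop_rate=0.2)   # feed this into the rasterizer
```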
Poster
Dian Zheng · Cheng Zhang · Xiao-Ming Wu · Cao Li · Chengfei Lv · Jian-Fang Hu · Wei-Shi Zheng

[ ExHall D ]

Abstract
Generating 360-degree panoramas from a narrow field of view (NFoV) image is a promising computer vision task for Virtual Reality (VR) applications. Existing methods mostly assess the generated panoramas with InceptionNet- or CLIP-based metrics, which tend to measure overall image quality and are not suitable for evaluating distortion. In this work, we first propose a distortion-specific CLIP, named Distort-CLIP, to accurately evaluate panorama distortion, and we discover the "visual cheating" phenomenon in previous works (i.e., tending to improve the visual results by sacrificing distortion accuracy). This phenomenon arises because prior methods employ a single network to learn the distinct panorama distortion and content completion at once, which leads the model to prioritize optimizing the latter. To address the phenomenon, we propose PanoDecouple, a decoupled diffusion model framework, which decouples panorama generation into distortion guidance and content completion, aiming to generate panoramas with both accurate distortion and visual appeal. Specifically, we design a DistortNet for distortion guidance by imposing a panorama-specific distortion prior and a modified condition registration mechanism, and a ContentNet for content completion by imposing perspective image information. Additionally, a distortion correction loss function with Distort-CLIP is introduced to constrain the distortion explicitly. The extensive experiments validate that …
Poster
Yucheng Mao · Boyang Wang · Nilesh Kulkarni · Jeong Joon Park

[ ExHall D ]

Abstract
The computer vision community has developed numerous techniques for digitally restoring true scene information from single-view degraded photographs, an important yet extremely ill-posed task. In this work, we tackle image restoration from a different perspective by jointly denoising multiple photographs of the same scene. Our core hypothesis is that degraded images capturing a shared scene contain complementary information that, when combined, better constrains the restoration problem. To this end, we implement a powerful multi-view diffusion model that jointly generates uncorrupted views by extracting rich information from multi-view relationships. Our experiments show that our multi-view approach outperforms existing single-view image and even video-based methods on image deblurring and super-resolution tasks. Critically, our model is trained to output 3D consistent images, making it a promising tool for applications requiring robust multi-view integration, such as 3D reconstruction or pose estimation.
Poster
Hao Wen · Zehuan Huang · Yaohui Wang · Xinyuan Chen · Lu Sheng

[ ExHall D ]

Abstract
Existing image-to-3D creation methods typically split the task into two separate stages, multi-view image generation and 3D reconstruction, leading to two main limitations: (1) in the multi-view generation stage, the generated multi-view images struggle to preserve 3D consistency; (2) in the 3D reconstruction stage, there is a domain gap between real training data and the generated multi-view inputs seen during inference. To address these issues, we propose Ouroboros3D, an end-to-end trainable framework that integrates multi-view generation and 3D reconstruction into a recursive diffusion process through a feedback mechanism. Our framework operates through iterative cycles, where each cycle consists of a denoising process and a reconstruction step. By incorporating a 3D-aware feedback mechanism, our multi-view generative model uses explicit 3D geometric information (e.g., texture, position) fed back from the reconstruction results of the previous cycle as conditions, thus modeling consistency at the 3D geometric level. Furthermore, through joint training of both the multi-view generative and reconstruction models, we alleviate the domain gap in the reconstruction stage and enable mutual enhancement within the recursive process. Experimental results demonstrate that Ouroboros3D outperforms methods that treat these stages separately and those that combine them only during inference, achieving superior multi-view consistency and producing 3D models with higher geometric realism.
Poster
Wenyuan Zhang · Yixiao Yang · Han Huang · Liang Han · Kanle Shi · Yu-Shen Liu · Zhizhong Han

[ ExHall D ]

Abstract
Monocular depth priors have been widely adopted by neural rendering in multi-view based tasks such as 3D reconstruction and novel view synthesis. However, due to the inconsistent prediction on each view, how to more effectively leverage monocular cues in a multi-view context remains a challenge. Current methods treat the entire estimated depth map indiscriminately, and use it as ground truth supervision, while ignoring the inherent inaccuracy and cross-view inconsistency in monocular priors. To resolve these issues, we propose MonoInstance, a general approach that explores the uncertainty of monocular depths to provide enhanced geometric priors for neural rendering and reconstruction. Our key insight lies in aligning the depths of each segmented instance from multiple views within a common 3D space, thereby casting the uncertainty estimation of monocular depths into a density measure within noisy point clouds. For high-uncertainty areas where depth priors are unreliable, we further introduce a constraint term that encourages the projected instances to align with corresponding instance masks on nearby views. MonoInstance is a versatile strategy which can be seamlessly integrated into various multi-view neural rendering frameworks. Our experimental results demonstrate that MonoInstance significantly improves the performance in both reconstruction and novel view synthesis under various benchmarks.
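The density-based uncertainty measure can be pictured with a small sketch: unproject each view's (already scale-aligned) instance depths into a shared world frame, merge them, and score each point by how crowded its neighborhood is. The radius, weighting, and helper names below are assumptions, not MonoInstance's implementation.

```python
import numpy as np
from scipy.spatial import cKDTree

def unproject(depth, mask, K, cam_to_world):
    """Lift masked pixels of one view's (scale-aligned) depth into world-space points."""
    v, u = np.nonzero(mask)
    z = depth[v, u]
    x = (u + 0.5 - K[0, 2]) / K[0, 0] * z
    y = (v + 0.5 - K[1, 2]) / K[1, 1] * z
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=-1)
    return (pts_cam @ cam_to_world.T)[:, :3]

def depth_uncertainty(points_per_view, radius=0.05):
    """Uncertainty of each view's instance depth as inverse local point density:
    points that many views agree on sit in dense regions of the merged cloud."""
    merged = np.concatenate(points_per_view, axis=0)
    tree = cKDTree(merged)
    scores = []
    for pts in points_per_view:
        counts = tree.query_ball_point(pts, r=radius, return_length=True)
        scores.append(1.0 / np.maximum(np.asarray(counts), 1))  # sparse -> uncertain
    return scores
```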
Poster
Bo Ji · Angela Yao

[ ExHall D ]

Abstract
Standard 3D Gaussian Splatting (3DGS) relies on known or pre-computed camera poses and a sparse point cloud, obtained from structure-from-motion (SfM) preprocessing, to initialize and grow 3D Gaussians. We propose a novel SfM-Free 3DGS (SFGS) method for video input, eliminating the need for known camera poses and SfM preprocessing. Our approach introduces a hierarchical training strategy that trains and merges multiple 3D Gaussian representations -- each optimized for specific scene regions -- into a single, unified 3DGS model representing the entire scene. To compensate for large camera motions, we leverage video frame interpolation models. Additionally, we incorporate multi-source supervision to reduce overfitting and enhance representation. Experimental results reveal that our approach significantly surpasses state-of-the-art SfM-free novel view synthesis methods. On the Tanks and Temples dataset, we improve PSNR by an average of 2.25dB, with a maximum gain of 3.72dB in the best scene. On the CO3D-V2 dataset, we achieve an average PSNR boost of 1.74dB, with a top gain of 3.90dB. Codes will be released upon acceptance.
Poster
Xiangyu Liu · Xiaomei Zhang · Zhiyuan Ma · Xiangyu Zhu · Zhen Lei

[ ExHall D ]

Abstract
Recent advancements in 3D object reconstruction have been remarkable, yet most current 3D models rely heavily on existing 3D datasets. The scarcity of diverse 3D datasets results in limited generalization capabilities of 3D reconstruction models. In this paper, we propose a novel framework for boosting 3D reconstruction with multi-view refinement (MVBoost) by generating pseudo-GT data. The key of MVBoost is combining the advantages of the high accuracy of the multi-view generation model and the consistency of the 3D reconstruction model to create a reliable data source. Specifically, given a single-view input image, we employ a multi-view diffusion model to generate multiple views, followed by a large 3D reconstruction model to produce consistent 3D data. MVBoost then adaptively refines these multi-view images, rendered from the consistent 3D data, to build a large-scale multi-view dataset for training a feed-forward 3D reconstruction model. Additionally, the input view optimization is designed to optimize the corresponding viewpoints based on the user’s input image, ensuring that the most important viewpoint is accurately tailored to the user's needs. Extensive evaluations demonstrate that our method achieves superior reconstruction results and robust generalization compared to prior works.
Poster
Khiem Vuong · Anurag Ghosh · Deva Ramanan · Srinivasa G. Narasimhan · Shubham Tulsiani

[ ExHall D ]

Abstract
We explore the task of geometric reconstruction of images captured from a mixture of ground and aerial views. Current state-of-the-art learning-based approaches fail to handle the extreme viewpoint variation between aerial-ground image pairs. Our hypothesis is that the lack of high-quality, co-registered aerial-ground datasets for training is a key reason for this failure. Such data is difficult to assemble precisely because it is difficult to reconstruct in a scalable way. To overcome this challenge, we propose a scalable framework combining pseudo-synthetic renderings from 3D city-wide meshes (e.g., Google Earth) with real, ground-level crowd-sourced images (e.g., MegaDepth). The pseudo-synthetic data simulates a wide range of aerial viewpoints, while the real, crowd-sourced images help improve visual fidelity for ground-level images where mesh-based renderings lack sufficient detail, effectively bridging the domain gap between real images and pseudo-synthetic renderings. Using this hybrid dataset, we fine-tune several state-of-the-art algorithms and achieve significant improvements on real-world, zero-shot aerial-ground tasks. For example, we observe that baseline DUSt3R localizes fewer than 3% of aerial-ground pairs within 5 degrees of camera rotation error, while fine-tuning with our data raises accuracy to nearly 50%, addressing a major failure point in handling large viewpoint changes. Beyond camera estimation and scene reconstruction, …
Poster
Qihang Zhang · Shuangfei Zhai · Miguel Ángel Bautista · Kevin Miao · Alexander Toshev · Joshua Susskind · Jiatao Gu

[ ExHall D ]

Abstract
Recent advancements in diffusion models have set new benchmarks in image and video generation, enabling realistic visual synthesis across single- and multi-frame contexts. However, these models still struggle with efficiently and explicitly generating 3D-consistent content. To address this, we propose World-consistent Video Diffusion (WVD), a novel framework that incorporates explicit 3D supervision using XYZ images, which encode global 3D coordinates for each image pixel. More specifically, we train a diffusion transformer to learn the joint distribution of RGB and XYZ frames. This approach supports multi-task adaptability via a flexible inpainting strategy. For example, WVD can estimate XYZ frames from ground-truth RGB or generate novel RGB frames using XYZ projections along a specified camera trajectory. In doing so, WVD unifies tasks like single-image-to-3D generation, multi-view stereo, and camera-controlled video generation. Our approach demonstrates competitive performance across multiple benchmarks, providing a scalable solution for 3D-consistent video and image generation with a single pretrained model.
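An "XYZ image" of the kind described above can be produced from a depth map and camera parameters by unprojecting every pixel into world coordinates. The sketch below shows that construction under standard pinhole conventions; the values and shapes are illustrative, not the paper's data pipeline.

```python
import numpy as np

def xyz_image(depth, K, cam_to_world):
    """Turn a depth map into an 'XYZ image': per-pixel global 3D coordinates.

    depth        : (H, W) metric depth along the camera z-axis
    K            : 3x3 intrinsics
    cam_to_world : 4x4 camera-to-world transform
    Returns an (H, W, 3) array that can be handled like an RGB frame.
    """
    h, w = depth.shape
    v, u = np.meshgrid(np.arange(h) + 0.5, np.arange(w) + 0.5, indexing="ij")
    x = (u - K[0, 2]) / K[0, 0] * depth
    y = (v - K[1, 2]) / K[1, 1] * depth
    pts = np.stack([x, y, depth, np.ones_like(depth)], axis=-1)   # (H, W, 4)
    return (pts @ cam_to_world.T)[..., :3]

K = np.array([[500.0, 0, 320.0], [0, 500.0, 240.0], [0, 0, 1.0]])
xyz = xyz_image(np.full((480, 640), 2.0), K, np.eye(4))
print(xyz.shape)   # (480, 640, 3)
```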
Poster
Haosen Yang · Chenhao Zhang · Wenqing Wang · Marco Volino · Adrian Hilton · Li Zhang · Xiatian Zhu

[ ExHall D ]

Abstract
Point management is critical for optimizing 3D Gaussian Splatting models, as point initiation (e.g., via structure from motion) is often distributionally inappropriate. Typically, the Adaptive Density Control (ADC) algorithm is adopted, leveraging view-averaged gradient magnitude thresholding for point densification, opacity thresholding for pruning, and regular all-points opacity reset. We reveal that this strategy is limited in tackling intricate/special image regions (e.g., transparent ones) due to its inability to identify all 3D zones requiring point densification and its lack of an appropriate mechanism to handle ill-conditioned points with negative impacts (e.g., occlusion due to false high opacity). To address these limitations, we propose a Localized Point Management (LPM) strategy, capable of identifying those error-contributing zones in the greatest need of both point addition and geometry calibration. Zone identification is achieved by leveraging the underlying multiview geometry constraints, subject to image rendering errors. We apply point densification in the identified zones and then reset the opacity of the points in front of these regions, creating a new opportunity to correct poorly conditioned points. Serving as a versatile plugin, LPM can be seamlessly integrated into existing static 3D and dynamic 4D Gaussian Splatting models with minimal additional cost. Experimental evaluations validate the efficacy of our LPM in boosting a variety of existing …
Poster
Sebastian Koch · Johanna Wald · Mirco Colosi · Narunas Vaskevicius · Pedro Hermosilla · Federico Tombari · Timo Ropinski

[ ExHall D ]

Abstract
Neural radiance fields are an emerging 3D scene representation and have recently even been extended to learn features for scene understanding by distilling open-vocabulary features from vision-language models. However, current methods primarily focus on object-centric representations, supporting object segmentation or detection, while understanding semantic relationships between objects remains largely unexplored. To address this gap, we propose RelationField, the first method to extract inter-object relationships directly from neural radiance fields. RelationField represents relationships between objects as pairs of rays within a neural radiance field, effectively extending its formulation to include implicit relationship queries. To teach RelationField complex, open-vocabulary relationships, relationship knowledge is distilled from multi-modal LLMs. To evaluate RelationField, we solve open-vocabulary 3D scene graph generation tasks and relationship-guided instance segmentation, achieving state-of-the-art performance in both tasks.
Poster
Weikang Bian · Zhaoyang Huang · Xiaoyu Shi · Yijin Li · Fu-Yun Wang · Hongsheng Li

[ ExHall D ]

Abstract
4D video control is essential in video generation as it enables the use of sophisticated lens techniques, such as multi-camera shooting and dolly zoom, which are currently unsupported by existing methods. Training a video Diffusion Transformer (DiT) directly to control 4D content requires expensive multi-view videos. Inspired by Monocular Dynamic novel View Synthesis (MDVS) that optimizes a 4D representation and renders videos according to different 4D elements, such as camera pose and object motion editing, we bring pseudo 4D Gaussian fields to video generation. Specifically, we propose a novel framework that constructs a pseudo 4D Gaussian field with dense 3D point tracking and renders the Gaussian field for all video frames. Then we finetune a pretrained DiT to generate videos following the guidance of the rendered video, dubbed as GS-DiT. To boost the training of the GS-DiT, we also propose an efficient Dense 3D Point Tracking (D3D-PT) method for the pseudo 4D Gaussian field construction. Our D3D-PT outperforms SpatialTracker, the state-of-the-art sparse 3D point tracking method, in accuracy and accelerates the inference speed by two orders of magnitude. During the inference stage, GS-DiT can generate videos with the same dynamic content while adhering to different camera parameters, addressing a significant …
Poster
Ashish Kumar · A. N. Rajagopalan

[ ExHall D ]

Abstract
Neural Radiance Fields (NeRFs) have made significant advances in rendering novel photorealistic views for both static and dynamic scenes. However, most prior works assume ideal conditions of artifact-free visual inputs i.e., images and videos. In real scenarios, artifacts such as object motion blur, camera motion blur, or lens defocus blur are ubiquitous. Some recent studies have explored novel view synthesis using blurred input frames by examining either camera motion blur, defocus blur, or both. However, these studies are limited to static scenes. In this work, we enable NeRFs to deal with object motion blur whose local nature stems from the interplay between object velocity and camera exposure time. Often, the object motion is unknown and time varying, and this adds to the complexity of scene reconstruction. Sports videos are a prime example of how rapid object motion can significantly degrade video quality for static cameras by introducing motion blur. We present an approach for realizing motion blur-free novel views of dynamic scenes from input videos with object motion blur captured from static cameras spanning multiple poses. We propose a NeRF-based analytical framework that elegantly correlates object three-dimensional (3D) motion across views as well as time to the observed blurry videos. …
Poster
Seungjun Lee · Gim Hee Lee

[ ExHall D ]

Abstract
Reconstructing sharp 3D representations from blurry multi-view images is a long-standing problem in computer vision. Recent works attempt to enhance high-quality novel view synthesis under motion blur by leveraging event-based cameras, benefiting from their high dynamic range and microsecond temporal resolution. However, they often reach sub-optimal visual quality, either restoring inaccurate color or losing fine-grained details. In this paper, we present DiET-GS, a diffusion prior and event stream-assisted motion deblurring 3DGS. Our framework effectively leverages blur-free event streams and a diffusion prior in a two-stage training strategy. Specifically, we introduce a novel framework that constrains 3DGS with the event double integral, achieving both accurate color and well-defined details. Additionally, we propose a simple technique that leverages the diffusion prior to further enhance the edge details. Qualitative and quantitative results on both synthetic and real-world data demonstrate that DiET-GS is capable of producing higher-quality novel views than the existing baselines. The code will be publicly available.
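For context, the event double integral (EDI) relation mentioned above ties a blurry frame to a sharp latent frame through exponentiated cumulative event polarities. The sketch below states that relation in a discretized form and checks it on toy data; the contrast threshold and discretization are assumptions, and this is not the DiET-GS training code.

```python
import numpy as np

def edi_weight(event_polarity_sums, contrast=0.2):
    """Discretized event-double-integral weight (background sketch).

    With per-pixel cumulative signed event counts E_n at N sub-timestamps inside
    the exposure, a blurry frame B and the sharp latent frame L at the reference
    time are related by B = L * mean_n exp(contrast * E_n), so L = B / weight.
    event_polarity_sums: (N, H, W) cumulative event polarities relative to the
    reference timestamp.
    """
    return np.exp(contrast * event_polarity_sums).mean(axis=0)

# Toy usage: recover a sharp latent frame from a synthetically blurred one.
N, H, W = 32, 4, 4
E = np.cumsum(np.random.choice([-1, 0, 1], size=(N, H, W)), axis=0)
L_true = np.random.rand(H, W)
B = L_true * edi_weight(E)           # forward EDI blur model
L_est = B / edi_weight(E)            # inversion recovers the latent frame
assert np.allclose(L_est, L_true)
```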
Poster
Yifan Wang · Peishan Yang · Zhen Xu · Jiaming Sun · Zhanhua Zhang · chen yong · Hujun Bao · Sida Peng · Xiaowei Zhou

[ ExHall D ]

Abstract
This paper addresses the challenge of reconstructing dynamic 3D scenes with complex motions. Some recent works define 3D Gaussian primitives in a canonical space and use deformation fields to map canonical primitives to observation spaces, achieving real-time dynamic view synthesis. However, these methods often struggle to handle scenes with complex motions due to the difficulty of optimizing deformation fields. To overcome this problem, we propose FreeTimeGS, a novel 4D representation that allows Gaussian primitives to appear at arbitrary times and locations. In contrast to canonical Gaussian primitives, our representation possesses strong flexibility, thus improving the ability to model dynamic 3D scenes. In addition, we endow each Gaussian primitive with a motion function, allowing it to move to neighboring regions over time, which reduces temporal redundancy. Experimental results on several datasets show that the rendering quality of our method outperforms recent methods by a large margin. The code will be released for reproducibility.
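One way to read "primitives that appear at arbitrary times and locations with a motion function" is a per-primitive temporal opacity window plus a time-dependent center. The sketch below uses a linear motion function and a Gaussian temporal window purely as an assumed parametrization, which may differ from FreeTimeGS.

```python
import torch

def primitive_at_time(mu, velocity, t0, sigma_t, opacity, t):
    """Evaluate assumed 4D Gaussian parameters at query time t.

    Each primitive is anchored at its own time t0: its center follows a simple
    linear motion function and its opacity is modulated by a temporal Gaussian
    window, so it only contributes near the moments it explains.
    """
    center = mu + velocity * (t - t0)                       # motion function
    weight = torch.exp(-0.5 * ((t - t0) / sigma_t) ** 2)    # temporal window
    return center, opacity * weight

mu       = torch.zeros(1000, 3)
velocity = torch.randn(1000, 3) * 0.01
t0       = torch.rand(1000, 1)
sigma_t  = torch.full((1000, 1), 0.05)
opacity  = torch.sigmoid(torch.randn(1000, 1))
centers, alphas = primitive_at_time(mu, velocity, t0, sigma_t, opacity, t=0.5)
```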
Poster
Hao Li · Sicheng Li · Xiang Gao · AbudouaihatiBatuer · Lu Yu · Yiyi Liao

[ ExHall D ]

Abstract
Immersive video offers a free-viewpoint (6-DoF) viewing experience, potentially playing a key role in future video technology. Recently, 4D Gaussian Splatting has gained attention as an effective approach for immersive video due to its high rendering efficiency and quality, though maintaining quality with manageable storage remains challenging. To address this, we introduce GIFStream, a novel 4D Gaussian representation using a canonical space and a deformation field enhanced with time-dependent feature streams. These feature streams enable complex motion modeling and allow efficient compression by leveraging their motion-awareness and temporal correspondence. Additionally, we incorporate both temporal and spatial compression networks for end-to-end compression. Experimental results show that GIFStream delivers high-quality immersive video at 30 Mbps, with real-time rendering and fast decoding on an RTX 4090.
Poster
Hongchi Xia · Entong Su · Marius Memmel · Arhan Jain · Raymond Yu · Numfor Mbiziwo-Tiapo · Ali Farhadi · Abhishek Gupta · Shenlong Wang · Wei-Chiu Ma

[ ExHall D ]

Abstract
Creating virtual digital replicas from real-world data unlocks significant potential across domains like gaming and robotics. In this paper, we present DRAWER, a novel framework that converts a video of a static indoor scene into a photorealistic and interactive digital environment. Our approach centers on two main contributions: (i) a reconstruction module based on a dual scene representation that reconstructs the scene with fine-grained geometric details, and (ii) an articulation module that identifies articulation types and hinge positions, reconstructs simulatable shapes and appearances, and integrates them into the scene. The resulting virtual environment is photorealistic, interactive, and runs in real time, with compatibility for game engines and robotic simulation platforms. We demonstrate the potential of DRAWER by using it to automatically create an interactive game in Unreal Engine and to enable real-to-sim-to-real transfer for robotics applications. Our paper includes multiple videos; we recommend readers use Adobe Acrobat to view them.
Poster
Shoichiro Takeda · Yasunori Akagi

[ ExHall D ]

Abstract
We propose novel fast algorithms for the Gromov–Wasserstein problem (GW) that exploit cyclic symmetry of the input data. GW with cyclic symmetry naturally appears as an object matching task underlying various real-world computer vision applications, e.g., image registration, point cloud registration, stereo matching, and 3D reconstruction. Gradient-based algorithms have been used to solve GW, and our main idea is to use the following remarkable and non-trivial property: by setting the initial solution to have cyclic symmetry, all intermediate solutions and matrices appearing in the gradient-based algorithms retain the same cyclic symmetry until convergence. Based on this property, our gradient-based algorithms restrict the solution space to cyclically symmetric solutions and update only one of the symmetric parts of the solutions and matrices at each iteration, which results in fast computation. Furthermore, both the original gradient-based algorithms and ours must solve the Optimal Transport problem (OT) at each iteration, but only in ours does this problem exhibit cyclic symmetry. This cyclic OT can be solved efficiently, and as a result, the total computation time of our algorithms is dramatically lower than that of the original ones. Experiments show the effectiveness of our algorithms on synthetic and real-world data with strict and approximate cyclic symmetry, respectively.
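The symmetry-preservation property is easy to verify numerically. The toy check below assumes circulant (cyclically symmetric) cost matrices and uses only the dominant quadratic term of the GW gradient; the matrix and variable names are illustrative, not the authors' implementation.

```python
import numpy as np
from scipy.linalg import circulant

# Toy check of the symmetry-preservation property, assuming the quadratic GW
# gradient term G = -2 * C1 @ T @ C2 and circulant cost matrices.
rng = np.random.default_rng(0)
n = 6
P = np.roll(np.eye(n), 1, axis=0)                      # cyclic shift permutation

C1 = circulant(rng.random(n)); C1 = (C1 + C1.T) / 2    # satisfies P @ C1 @ P.T == C1
C2 = circulant(rng.random(n)); C2 = (C2 + C2.T) / 2

T = np.full((n, n), 1.0 / n**2)                        # uniform coupling is cyclically symmetric
for _ in range(5):
    G = -2.0 * C1 @ T @ C2                             # gradient of the quadratic GW term
    T = T - 0.1 * G                                    # (projection onto couplings omitted)
    assert np.allclose(P @ T @ P.T, T)                 # symmetry holds at every iterate
print("cyclic symmetry preserved across gradient iterations")
```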
Poster
Awais Nizamani · Hamid Laga · Guanjin Wang · Farid Boussaid · Mohammed Bennamoun · Anuj Srivastava

[ ExHall D ]

Abstract
We propose a novel framework for the statistical analysis of genus-zero 4D surfaces, i.e., 3D surfaces that deform and evolve over time. This problem is particularly challenging due to the arbitrary parameterizations of these surfaces and their varying deformation speeds, necessitating effective spatiotemporal registration. Traditionally, 4D surfaces are discretized, in space and time, before computing their spatiotemporal registrations, geodesics, and statistics. However, this approach may result in suboptimal solutions and, as we demonstrate in this paper, is not necessary. In contrast, we treat 4D surfaces as continuous functions in both space and time. We introduce Dynamic Spherical Neural Surfaces (D-SNS), an efficient, smooth, and continuous spatiotemporal representation for genus-0 4D surfaces. We then demonstrate how to perform core 4D shape analysis tasks such as spatiotemporal registration, geodesic computation, and mean 4D shape estimation directly on these continuous representations, without upfront discretization and meshing. By integrating neural representations with classical Riemannian geometry and statistical shape analysis techniques, we provide the building blocks for enabling full functional shape analysis. We demonstrate the efficiency of the framework on 4D human and face datasets.
Poster
Paul Roetzer · Viktoria Ehm · Daniel Cremers · Zorah Lähner · Florian Bernard

[ ExHall D ]

Abstract
In this work we address various shape matching problems that can be cast as finding cyclic paths in a product graph. This includes, for example, 2D-3D shape matching, 3D shape matching, and the matching of a contour to a graph. In this context, matchings are typically obtained as the minimum cost cycle in the product graph. Instead, inspired by related works on model-based image segmentation, we consider minimum ratio cycles, which we combine with the recently introduced conjugate product graph in order to allow for higher-order matching costs. With that, on the one hand we avoid the bias towards matchings that involve fewer/shorter edges, while on the other hand we are able to impose powerful geometric regularisation, e.g. to avoid zig-zagging. In our experiments we demonstrate that this not only leads to improved matching accuracy in most cases, but also to significantly reduced runtimes (up to two orders of magnitude, depending on the setting). Our GPU implementation will be made publicly available upon acceptance.
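For readers unfamiliar with minimum ratio cycles: a cycle minimizing sum(cost)/sum(length) can be found by binary search on the ratio, since a cycle with ratio below lambda exists exactly when the graph reweighted with cost − lambda·length contains a negative cycle. The sketch below shows this classical search on a plain directed graph (assuming positive per-edge lengths); the paper applies the objective to a conjugate product graph with higher-order costs.

```python
def has_negative_cycle(n, edges):
    """Bellman-Ford negative-cycle check on a directed graph with nodes 0..n-1,
    given as (u, v, weight) triples, using a virtual source (all dists start at 0)."""
    dist = [0.0] * n
    for _ in range(n):
        updated = False
        for u, v, w in edges:
            if dist[u] + w < dist[v] - 1e-12:
                dist[v] = dist[u] + w
                updated = True
        if not updated:
            return False
    return True

def min_ratio_cycle(n, edges, iters=50):
    """Binary search for lambda* = min over cycles of sum(cost)/sum(length).
    `edges` are (u, v, cost, length) with positive lengths. Illustrative sketch of
    the minimum-ratio-cycle objective, not the paper's product-graph solver."""
    lo, hi = 0.0, max(c for _, _, c, _ in edges) + 1.0
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        if has_negative_cycle(n, [(u, v, c - lam * l) for u, v, c, l in edges]):
            hi = lam            # a cycle with ratio < lam exists
        else:
            lo = lam
    return 0.5 * (lo + hi)

# toy 3-cycle with total cost 3 and total length 3 -> optimal ratio 1.0
edges = [(0, 1, 1.0, 1.0), (1, 2, 1.0, 1.0), (2, 0, 1.0, 1.0)]
print(round(min_ratio_cycle(3, edges), 3))
```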
Poster
Ryota Maeda · Yunseong Moon · Seung-Hwan Baek

[ ExHall D ]

Abstract
Light-matter interactions modify both the intensity and polarization state of light. Changes in polarization, represented by a Mueller matrix, encode detailed scene information. Existing optical ellipsometers capture Mueller-matrix images; however, they are often limited to static scenes due to long acquisition times. Here, we introduce Event Ellipsometer, a method for acquiring Mueller-matrix images of dynamic scenes. Our imaging system employs fast-rotating quarter-wave plates (QWPs) in front of a light source and an event camera that asynchronously captures intensity changes induced by the rotating QWPs. We develop an ellipsometric-event image formation model, a calibration method, and an ellipsometric-event reconstruction method. We experimentally demonstrate that Event Ellipsometer enables Mueller-matrix imaging at 30fps, extending ellipsometry to dynamic scenes.
Poster
Noah Stier · Alex Rich · Pradeep Sen · Tobias Höllerer

[ ExHall D ]

Abstract
Recent image-based 3D reconstruction methods have achieved excellent quality for indoor scenes using 3D convolutional neural networks. However, they rely on a high-resolution grid in order to achieve detailed output surfaces, which is quite costly in terms of compute time, and it results in large mesh sizes that are more expensive to store, transmit, and render. In this paper we propose a new solution to this problem, using adaptive sampling. By re-formulating the final layers of the network, we are able to analytically bound the local surface complexity, and set the local sample rate accordingly. Our method, AniGrad, achieves an order of magnitude reduction in both surface extraction latency and mesh size, while preserving mesh accuracy and detail.
Poster
Zetong Zhang · Manuel Kaufmann · Lixin Xue · Jie Song · Martin R. Oswald

[ ExHall D ]

Abstract
Creating photorealistic reconstructions of both the scene and the human from a single monocular in-the-wild video is central to perceiving a human-centric 3D world. Recent neural rendering advances have enabled holistic human-scene reconstruction but require pre-calibrated camera and human poses, and days of training time. In this work, we introduce a novel unified framework that simultaneously performs camera tracking, human pose estimation and human-scene reconstruction in an online fashion. 3D Gaussian Splatting is utilized to learn Gaussian primitives for humans and scenes efficiently, and reconstruction-based camera tracking and human pose estimation modules are designed to enable holistic understanding and effective disentanglement of pose and appearance. Specifically, we design a human deformation module to faithfully reconstruct details and enhance generalizability to out-of-distribution poses. To accurately learn the spatial correlation between the human and the scene, we introduce occlusion-aware human silhouette rendering and monocular geometric priors, which further improve reconstruction quality. Experiments on the EMDB and NeuMan datasets demonstrate superior or on-par performance with existing methods in human pose estimation, novel view synthesis and runtime.
Poster
Hongtao Yu · Shaohui Song · Lihu Sun · Wenkai Su · Xiaodong Yang · Chengming Liu

[ ExHall D ]

Abstract
Quad Photodiode (QPD) sensors are an evolution of dual-pixel (DP) sensors, providing four sub-views instead of two. In addition to enhancing auto-focus performance, QPD sensors also enable disparity estimation in both horizontal and vertical directions. However, the characteristics of QPD sensors, including uneven illumination across sub-views and the narrow baseline, render algorithm design difficult. Furthermore, effectively utilizing the two-directional disparity of QPD sensors remains a challenge. The scarcity of QPD disparity datasets also limits the development of learning-based methods. In this work, we address these challenges by first proposing a DPNet for DP disparity estimation. Specifically, we design an illumination-invariant module to reduce the impact of illumination, followed by a coarse-to-fine module to estimate sub-pixel disparity. Building upon the DPNet, we further propose a QuadNet, which integrates the two-directional disparity via an edge-aware fusion module. To facilitate the evaluation of our approaches, we propose the first QPD disparity dataset, QPD2K, comprising 2,100 real-world QPD images and corresponding disparity maps. Experiments demonstrate that our approaches achieve state-of-the-art performance in DP and QPD disparity estimation.
Poster
Songsong Yu · Yuxin Chen · Zhongang Qi · Zeke Xie · Yifan Wang · Lijun Wang · Ying Shan · Huchuan Lu

[ ExHall D ]

Abstract
With the rapid proliferation of 3D devices and the shortage of 3D content, stereo conversion is attracting increasing attention. Recent works introduce pretrained Diffusion Models (DMs) into this task. However, due to the scarcity of large-scale training data and comprehensive benchmarks, the optimal methodologies for employing DMs in stereo conversion and the accurate evaluation of stereo effects remain largely unexplored. In this work, we introduce the Mono2Stereo dataset, providing high-quality training data and a benchmark to support in-depth exploration of stereo conversion. With this dataset, we conduct an empirical study that yields two primary findings. 1) The differences between the left and right views are subtle, yet existing metrics consider overall pixels, failing to concentrate on regions critical to stereo effects. 2) Mainstream methods adopt either a one-stage left-to-right generation or a warp-and-inpaint pipeline, facing the challenges of degraded stereo effect and image distortion, respectively. Based on these findings, we introduce a new evaluation metric, Stereo Intersection-over-Union, which prioritizes disparity and achieves a high correlation with human judgments on stereo effect. Moreover, we propose a strong baseline model, harmonizing the stereo effect and image quality simultaneously, and notably surpassing current mainstream methods. Our code and data will be open-sourced to promote further research in …
Poster
Hualie Jiang · Zhiqiang Lou · Laiyan Ding · Rui Xu · Minglang Tan · jerett · Rui Huang

[ ExHall D ]

Abstract
Stereo matching is a key technique for metric depth estimation in computer vision and robotics. Real-world challenges like occlusion and texture-less regions hinder accurate disparity estimation from binocular matching cues. Recently, monocular relative depth estimation has shown remarkable generalization using vision foundation models. Thus, to facilitate robust stereo matching with monocular depth cues, we incorporate a robust monocular relative depth model into the recurrent stereo-matching framework, building a new framework for depth-foundation-model-based stereo matching, DEFOM-Stereo. In the feature extraction stage, we construct the combined context and matching feature encoder by integrating features from conventional CNNs and DEFOM. In the update stage, we use the depth predicted by DEFOM to initialize the recurrent disparity and introduce a scale update module to refine the disparity at the correct scale. DEFOM-Stereo is verified to have comparable performance on the Scene Flow dataset with state-of-the-art (SOTA) methods and notably shows much stronger zero-shot generalization. Moreover, DEFOM-Stereo achieves SOTA performance on the KITTI 2012, KITTI 2015, Middlebury, and ETH3D leaderboards, ranking 1st on many metrics. The code and models will be made publicly available.
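One common way to place a monocular relative depth prior at the correct scale, in the spirit of the disparity initialization and scale update described above, is a least-squares scale-and-shift fit against available disparity values. The sketch below is an illustrative stand-in, not DEFOM-Stereo's learned module.

```python
import numpy as np

def align_mono_to_disparity(inv_depth_rel, disparity_obs, mask):
    """Fit scale and shift so a relative (affine-invariant) inverse-depth map agrees
    with disparity observations in a least-squares sense. Illustrative stand-in for a
    learned scale update; argument names are assumptions."""
    x = inv_depth_rel[mask].ravel()
    y = disparity_obs[mask].ravel()
    A = np.stack([x, np.ones_like(x)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(A, y, rcond=None)
    return scale * inv_depth_rel + shift

# toy usage with a synthetic ground-truth scale of 2 and shift of 3
rel = np.random.rand(32, 32)
obs = 2.0 * rel + 3.0
init_disp = align_mono_to_disparity(rel, obs, mask=np.ones_like(rel, dtype=bool))
print(np.allclose(init_disp, obs))   # True
```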
Poster
Marwane Hariat · Antoine Manzanera · David Filliat

[ ExHall D ]

Abstract
Monocular depth estimation (MDE) with self-supervised training approaches struggles in low-texture areas, where photometric losses may lead to ambiguous depth predictions. To address this, we propose a novel technique that enhances spatial information by applying a distance transform over pre-semantic contours, augmenting discriminative power in low-texture regions. Our approach jointly estimates pre-semantic contours, depth, and ego-motion. The pre-semantic contours are leveraged to produce new input images, with variance augmented by the distance transform in uniform areas. This approach results in more effective loss functions, enhancing the training process for depth and ego-motion. We demonstrate theoretically that the distance transform is the optimal variance-augmenting technique in this context. Through extensive experiments on KITTI and Cityscapes, our model demonstrates robust performance, surpassing conventional self-supervised methods in MDE.
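A distance transform over a contour mask produces values that vary smoothly even inside uniform regions, which is exactly the extra discriminative signal the abstract describes. A minimal sketch, assuming a binary contour mask and SciPy's Euclidean distance transform:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def contour_distance_channel(contour_mask):
    """Per-pixel Euclidean distance to the nearest (pre-)semantic contour. The result
    varies smoothly across otherwise uniform, low-texture regions and can be appended
    to the input as an extra channel. Illustrative only; the paper derives contours
    jointly with depth and ego-motion."""
    # distance_transform_edt measures distance to the nearest zero, so contours are zeros
    return distance_transform_edt(~contour_mask)

mask = np.zeros((64, 64), dtype=bool)
mask[32, :] = True                      # a single horizontal contour
dist = contour_distance_channel(mask)
print(dist[0, 0], dist[31, 10])         # far from / adjacent to the contour
```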
Poster
Weilong Yan · Ming Li · Li Haipeng · Shuwei Shao · Robby T. Tan

[ ExHall D ]

Abstract
Self-supervised depth estimation from monocular cameras in diverse outdoor conditions, such as daytime, rain, and nighttime, is challenging due to the difficulty of learning universal representations and the severe lack of labeled real-world adverse data. Previous methods either rely on synthetic inputs and pseudo-depth labels or directly apply daytime strategies to adverse conditions, yielding suboptimal results. In this paper, we present the first synthetic-to-real robust depth estimation framework, incorporating motion and structure priors to capture real-world knowledge effectively. In the synthetic adaptation stage, we transfer motion-structure knowledge inside cost volumes for more robust representations, using a frozen daytime model to train a depth estimator under synthetic adverse conditions. In the real adaptation stage, which targets the synthetic-to-real gap, the earlier-trained models identify weather-insensitive regions with a designed consistency-reweighting strategy to emphasize valid pseudo-labels. We further introduce a new regularization that gathers an explicit depth distribution prior to constrain the model on real-world data. Experiments show that our method outperforms the state of the art across diverse conditions in multi-frame and single-frame settings. We achieve improvements of 7.5% in AbsRel and 4.3% in RMSE on average on the nuScenes and Robotcar datasets (daytime, nighttime, rain). In zero-shot evaluation on DrivingStereo (rain, fog), our method generalizes better than previous ones. …
Poster
Zador Pataki · Paul-Edouard Sarlin · Johannes Schönberger · Marc Pollefeys

[ ExHall D ]

Abstract
While Structure-from-Motion (SfM) has seen much progress over the years, state-of-the-art systems are prone to failure when facing extreme viewpoint changes in low-overlap or low-parallax conditions. Because capturing images that avoid both pitfalls is challenging, this severely limits the wider use of SfM, especially by non-expert users. In this paper, we overcome both limitations by augmenting the classical SfM paradigm with monocular depth and normal priors, which can be inferred by deep neural networks with increasing accuracy. Our approach is significantly more robust than existing ones in extreme low- or high-overlap scenarios but retains state-of-the-art performance in easier, nominal conditions thanks to a tight integration of monocular and multi-view constraints. We also show that monocular priors can help reject faulty associations due to symmetries, which is a long-standing problem for SfM. Thanks to principled uncertainty propagation, our approach is robust to errors in the priors, can handle priors inferred by different models with little tuning, and will thus easily benefit from future progress in monocular depth and normal estimation.
Poster
Daniil Sinitsyn · Linus Härenstam-Nielsen · Daniel Cremers

[ ExHall D ]

Abstract
We tackle the problem of automatic calibration of radially distorted cameras in challenging conditions. Accurately determining distortion parameters typically requires either 1) solving the full Structure from Motion (SfM) problem involving camera poses, 3D points, and the distortion parameters, which is only possible if many images with sufficient overlap are provided, or 2) relying heavily on learning-based methods that are comparatively less accurate. In this work, we demonstrate that distortion calibration can be decoupled from 3D reconstruction, maintaining the accuracy of SfM-based methods while avoiding many of the associated complexities. This is achieved by working in projective space, where the geometry is unique up to a homography, which encapsulates all camera parameters except for distortion. Our proposed method, Projective Radial Distortion Averaging, averages multiple distortion estimates in a fully projective framework without creating 3D points or performing full bundle adjustment. By relying on pairwise projective relations, our method supports any feature-matching approach without constructing point tracks across multiple images.
Poster
Charalambos Tzamos · Viktor Kocur · Yaqing Ding · Daniel Barath · Zuzana Berger Haladova · Torsten Sattler · Zuzana Kukelova

[ ExHall D ]

Abstract
We study the challenging problem of estimating the relative pose of three calibrated cameras from four point correspondences. We propose novel efficient solutions to this problem that are based on the simple idea of using four correspondences to estimate an approximate geometry of the first two views. We model this geometry either as an affine or a fully perspective geometry estimated using one additional approximate correspondence. We generate such an approximate correspondence using a very simple and efficient strategy, where the new point is the mean point of three corresponding input points. The new solvers are efficient and easy to implement, since they are based on existing efficient minimal solvers, i.e., the 4-point affine fundamental matrix, the well-known 5-point relative pose solver, and the P3P solver. Extensive experiments on real data show that the proposed solvers, when properly coupled with local optimization, achieve state-of-the-art results, with the novel solver based on approximate mean-point correspondences being more robust and precise than the affine-based solver.
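The mean-point trick is simple enough to state in a few lines: the approximate fifth correspondence is the mean of three of the four input points in each view, after which the five pairs can be handed to a standard 5-point solver. A small sketch with hypothetical array shapes:

```python
import numpy as np

def add_mean_point(pts_view1, pts_view2):
    """Given four point correspondences between two views (shape (4, 2) each), append
    one approximate correspondence formed as the mean of three of the input points in
    each view, as described in the abstract. The resulting five pairs can then be fed
    to a standard 5-point relative pose solver (e.g. cv2.findEssentialMat)."""
    extra1 = pts_view1[:3].mean(axis=0)
    extra2 = pts_view2[:3].mean(axis=0)
    return np.vstack([pts_view1, extra1]), np.vstack([pts_view2, extra2])

p1 = np.random.rand(4, 2)
p2 = np.random.rand(4, 2)
q1, q2 = add_mean_point(p1, p2)
print(q1.shape, q2.shape)   # (5, 2) (5, 2)
```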
Poster
Jianing Yang · Alexander Sax · Kevin Liang · Mikael Henaff · Hao Tang · Ang Cao · Joyce Chai · Franziska Meier · Matt Feiszli

[ ExHall D ]

Abstract
Multi-view 3D reconstruction remains a core challenge in computer vision, particularly in applications requiring accurate and scalable representations across diverse perspectives. Current leading methods such as DUSt3R employ a fundamentally pairwise approach, processing images in pairs and necessitating costly global alignment procedures to reconstruct from multiple views. In this work, we propose Fast 3D Reconstruction (Fast3R), a novel multi-view generalization to DUSt3R that achieves efficient and scalable 3D reconstruction by processing multiple views in parallel. Fast3R's Transformer-based architecture forwards N images in a single pass, bypassing the need for iterative alignment. Through extensive experiments on camera pose estimation and 3D reconstruction, Fast3R demonstrates state-of-the-art performance, with significant improvements in inference speed and reduced error accumulation. These results establish Fast3R as a robust alternative for multi-view applications, offering enhanced scalability without compromising reconstruction accuracy.
Poster
Shangzhan Zhang · Jianyuan Wang · Yinghao Xu · Nan Xue · Christian Rupprecht · Xiaowei Zhou · Yujun Shen · Gordon Wetzstein

[ ExHall D ]

Abstract
We present FLARE, a feed-forward model designed to infer high-quality camera poses and 3D geometry from uncalibrated sparse-view images (i.e., as few as 2-8 inputs), which is a challenging yet practical setting in real-world applications. Our solution features a cascaded learning paradigm with camera pose serving as the critical bridge, recognizing its essential role in mapping 3D structures onto 2D image planes. Concretely, FLARE starts with camera pose estimation, whose results condition the subsequent learning of geometric structure and appearance, optimized through the objectives of geometry reconstruction and novel-view synthesis. Utilizing large-scale public datasets for training, our method delivers state-of-the-art performance in the tasks of pose estimation, geometry reconstruction, and novel view synthesis, while maintaining inference efficiency (i.e., less than 0.5 seconds).
Poster
Runfeng Li · Mikhail Okunev · Zixuan Guo · Anh H Duong · Christian Richardt · Matthew O’Toole · James Tompkin

[ ExHall D ]

Abstract
We present a method to reconstruct dynamic scenes from monocular continuous-wave time-of-flight cameras using raw sensor samples that is as accurate as past methods and is 100× faster. Quickly achieving high-fidelity dynamic 3D reconstruction from a single viewpoint is a significant challenge in computer vision. Recent 3D Gaussian splatting methods often depend on multi-view data to produce satisfactory results and are brittle in their optimizations otherwise. In time-of-flight radiance field reconstruction, the property of interest, depth, is not directly optimized, causing additional challenges. We describe how these problems have a large and underappreciated impact upon the optimization when using a fast primitive-based scene representation like 3D Gaussians. Then, we incorporate two heuristics into our optimization to improve the accuracy of scene geometry for under-constrained time-of-flight Gaussians. Experimental results show that our approach produces accurate reconstructions under constrained sensing conditions, including for fast motions like swinging baseball bats.
Poster
Lea Müller · Hongsuk Choi · Anthony Zhang · Brent Yi · Jitendra Malik · Angjoo Kanazawa

[ ExHall D ]

Abstract
We introduce "Humans and Structure from Motion", a novel approach for reconstructing multiple people within a metric world coordinate system from a sparse set of images capturing a scene. Our method jointly estimates human body pose, shape, camera positions, and scene structure, capturing the spatial relationships among people and their location in the environment. Unlike existing methods that require calibrated setups, our approach operates with minimal constraints by leveraging the strength of both human body priors and data-driven SfM. By leveraging multi-view geometry, our method is the first work that effectively recovers humans and scene structure without assumptions about human-scene contact. We evaluate our approach on two challenging benchmarks, EgoHumans and EgoExo4D, demonstrating significant improvements in human location estimation within the world coordinate frame (3.51m to 1.04m and 2.9m to 0.56m respectively). Notably, our results also reveal that incorporating human data in the classical SfM task improves camera pose estimation (RRA@15: 0.74 to 0.89 in EgoHumans), when multiple humans are used for correspondence. We will release our code and data.
Poster
Kai Luo · Hao Shi · Sheng Wu · Fei Teng · Mengfei Duan · Chang Huang · Yuhang Wang · Kaiwei Wang · Kailun Yang

[ ExHall D ]

Abstract
Panoramic imagery, with its 360° field of view, offers comprehensive information to support Multi-Object Tracking (MOT) in capturing spatial and temporal relationships of surrounding objects. However, most MOT algorithms are tailored for pinhole images with limited views, impairing their effectiveness in panoramic settings. Additionally, panoramic image distortions, such as resolution loss, geometric deformation, and uneven lighting, hinder direct adaptation of existing MOT methods, leading to significant performance degradation. To address these challenges, we propose OmniTrack, an omnidirectional MOT framework that incorporates Tracklet Management to introduce temporal cues, FlexiTrack Instances for object localization and association, and the CircularStatE Module to alleviate image and geometric distortions. This integration enables tracking in large field-of-view scenarios, even under rapid sensor motion. To mitigate the lack of panoramic MOT datasets, we introduce the QuadTrack dataset, a comprehensive panoramic dataset collected by a quadruped robot, featuring diverse challenges such as wide fields of view, intense motion, and complex environments. Extensive experiments on the public JRDB dataset and the newly introduced QuadTrack benchmark demonstrate the state-of-the-art performance of the proposed framework. OmniTrack achieves a HOTA score of 26.92% on JRDB, representing an improvement of 3.43%, and further achieves 23.45% on QuadTrack, surpassing the baseline by 6.81%. The dataset …
Poster
Jintao Zhang · Zimin Xia · Mingyue Dong · Shuhan Shen · Linwei Yue · Xianwei Zheng

[ ExHall D ]

Abstract
This paper proposes a multi-view collaborative matching strategy to address the issue of sparse and broken tracks in Structure-from-Motion. We observe that two-view matching paradigms applied to image set matching often lead to unreliable correspondences when the selected independent image pairs exhibit weak connections, heavy occlusions, and drastic viewpoint changes. This is due to the significant loss of information during 3D-to-2D projection, and two-view images can only provide a very limited perception of the holistic 3D scene. Accordingly, we propose a multi-view collaborative matching network (CoMatcher) that (i) leverages complementary context cues from different views to form a holistic understanding of the 3D scene and (ii) utilizes multi-view consistency constraints to infer a globally optimal solution. Extensive experiments on various complicated scenarios demonstrate the superiority of our multi-view collaborative matching strategy over the mainstream two-view matching paradigm.
Poster
WooJu Lee · Juhye Park · Dasol Hong · Changki Sung · Youngwoo Seo · DongWan Kang · Hyun Myung

[ ExHall D ]

Abstract
Accurate localization is essential for autonomous driving, but GNSS-based methods struggle in challenging environments such as urban canyons. Cross-view pose optimization offers an effective solution by directly estimating vehicle pose using satellite-view images. However, existing methods primarily rely on cross-view features at a given pose, neglecting fine-grained contexts for precision and global contexts for robustness against large initial pose errors. To overcome these limitations, we propose PIDLoc, a novel cross-view pose optimization approach inspired by the proportional-integral-derivative (PID) controller. PIDLoc comprises PID branches to model cross-view feature relationships and a spatially aware pose estimator (SPE) to estimate the pose from these relationships. The PID branches leverage feature differences for local context (P branch), aggregated feature differences for global context (I branch), and gradients of feature differences for precise pose adjustment (D branch) to enhance localization accuracy under large initial pose errors. Integrated with the PID branches, the SPE captures spatial relationships within the PID-branch features for consistent localization. Experimental results demonstrate that PIDLoc achieves state-of-the-art performance in cross-view pose estimation on the KITTI dataset, reducing position error by 37.8% compared with the previous state of the art.
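The P/I/D analogy maps naturally onto feature differences accumulated across refinement iterations. The sketch below is an illustrative reading of that analogy with assumed tensor shapes and a plain dictionary as iteration state; it is not the authors' network.

```python
import numpy as np

def pid_branches(feat_diff, state):
    """PID-style cues from a cross-view feature difference map (illustrative):
    P = current difference (local context), I = running sum of differences across
    iterations (global context), D = change of the difference since the previous
    iteration (fine adjustment). `state` is a dict carried across iterations."""
    p = feat_diff
    state["integral"] = state.get("integral", np.zeros_like(feat_diff)) + feat_diff
    d = feat_diff - state.get("prev", np.zeros_like(feat_diff))
    state["prev"] = feat_diff
    return p, state["integral"], d

state = {}
for step in range(3):                       # pretend pose-refinement iterations
    diff = np.random.randn(8, 8, 16)        # ground-vs-satellite feature difference
    P, I, D = pid_branches(diff, state)
print(P.shape, I.shape, D.shape)
```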
Poster
Shengze Wang · Jiefeng Li · Tianye Li · Ye Yuan · Henry Fuchs · Koki Nagano · Shalini De Mello · Michael Stengel

[ ExHall D ]

Abstract
Single-image human mesh recovery is a challenging task due to the ill-posed nature of simultaneous body shape, pose, and camera estimation. Existing estimators work well on images taken from afar, but they break down as the person moves close to the camera. Moreover, current methods fail to achieve both accurate 3D pose and 2D alignment at the same time. Error is mainly introduced by inaccurate perspective projection heuristically derived from orthographic parameters. To resolve this long-standing challenge, we present our method, BLADE, which accurately recovers perspective parameters from a single image without heuristic assumptions. We start from the inverse relationship between perspective distortion and the person's Z-translation Tz, and we show that Tz can be reliably estimated from the image. We then discuss the important role of Tz for accurate human mesh recovery estimated from close-range images. Finally, we show that, once Tz and the 3D human mesh are estimated, one can accurately recover the focal length and full 3D translation. Extensive experiments on standard benchmarks and real-world close-range images show that our method is the first to accurately recover projection parameters from a single image, and consequently attain state-of-the-art accuracy on 3D pose estimation and 2D alignment for a wide …
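Once Tz (and hence the 3D mesh in camera coordinates) is known, the focal length follows from a linear least-squares fit of the pinhole projection equations. A minimal sketch of that final recovery step, with assumed variable names:

```python
import numpy as np

def recover_focal(points_cam, points_2d, principal_point):
    """Closed-form least-squares focal length from 3D points already expressed in
    camera coordinates (i.e., once Tz is known) and their 2D projections, using
    u = f*X/Z + cx and v = f*Y/Z + cy. Illustrative sketch; names are assumptions."""
    X, Y, Z = points_cam[:, 0], points_cam[:, 1], points_cam[:, 2]
    a = np.concatenate([X / Z, Y / Z])
    b = np.concatenate([points_2d[:, 0] - principal_point[0],
                        points_2d[:, 1] - principal_point[1]])
    return float(a @ b / (a @ a))

# toy usage: synthesize projections with f = 800 and recover it
pts = np.random.rand(50, 3) + np.array([0.0, 0.0, 2.0])
f_true, pp = 800.0, (320.0, 240.0)
uv = np.stack([f_true * pts[:, 0] / pts[:, 2] + pp[0],
               f_true * pts[:, 1] / pts[:, 2] + pp[1]], axis=1)
print(recover_focal(pts, uv, pp))   # ~800
```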
Poster
Wanhua Li · Renping Zhou · Jiawei Zhou · Yingwei Song · Johannes Herter · Minghan Qin · Gao Huang · Hanspeter Pfister

[ ExHall D ]

Abstract
Learning 4D language fields to enable time-sensitive, open-ended language queries in dynamic scenes is essential for many real-world applications. While LangSplat successfully grounds CLIP features into 3D Gaussian representations, achieving precision and efficiency in 3D static scenes, it lacks the ability to handle dynamic 4D fields as CLIP, designed for static image-text tasks, cannot capture temporal dynamics in videos. Real-world environments are inherently dynamic, with object semantics evolving over time. Building a precise 4D language field necessitates obtaining pixel-aligned, object-wise video features, which current vision models struggle to achieve. To address these challenges, we propose 4D LangSplat, which learns 4D language fields to handle time-agnostic or time-sensitive open-vocabulary queries in dynamic scenes efficiently. 4D LangSplat bypasses learning the language field from vision features and instead learns directly from text generated from object-wise video captions via Multimodal Large Language Models (MLLMs). Specifically, we propose a multimodal object-wise video prompting method, consisting of visual and text prompts that guide MLLMs to generate detailed, temporally consistent, high-quality captions for objects throughout a video. These captions are encoded using a Large Language Model into high-quality sentence embeddings, which then serve as pixel-aligned, object-specific feature supervision, facilitating open-vocabulary text queries through shared embedding spaces. Recognizing …
Poster
Gyeongjin Kang · Jisang Yoo · Jihyeon Park · Seungtae Nam · Hyeonsoo Im · Shin sangheon · Sangpil Kim · Eunbyung Park

[ ExHall D ]

Abstract
We propose SelfSplat, a novel 3D Gaussian Splatting model designed to perform pose-free and 3D prior-free generalizable 3D reconstruction from unposed multi-view images. These settings are inherently ill-posed due to the lack of ground-truth data, learned geometric information, and the need to achieve accurate 3D reconstruction without fine-tuning, making it difficult for conventional methods to produce high-quality results. Our model addresses these challenges by effectively integrating explicit 3D representations with self-supervised depth and pose estimation techniques, resulting in reciprocal improvements in both pose accuracy and 3D reconstruction quality. Furthermore, we incorporate a matching-aware pose estimation network and a depth refinement module to enhance geometry consistency across views, ensuring more accurate and stable 3D reconstructions. To demonstrate the performance of our method, we evaluate it on large-scale real-world datasets, including RealEstate10K, ACID, and DL3DV. SelfSplat achieves superior results over previous state-of-the-art methods in both appearance and geometry quality, and also demonstrates strong cross-dataset generalization capabilities. Extensive ablation studies and analysis further validate the effectiveness of our proposed methods.
Poster
Xingyu Liu · Gu Wang · Ruida Zhang · Chenyangguang Zhang · Federico Tombari · Xiangyang Ji

[ ExHall D ]

Abstract
Unseen object pose estimation methods often rely on CAD models or multiple reference views, making the onboarding stage costly. To simplify reference acquisition, we aim to estimate the unseen object's pose through a single unposed RGB-D reference image. While previous works leverage reference images as pose anchors to limit the range of relative pose, our scenario presents significant challenges since the relative transformation could vary across the entire SE(3) space. Moreover, factors like occlusion, sensor noise, and extreme geometry could result in low viewpoint overlap. To address these challenges, we present a novel approach and benchmark, termed UNOPose, for UNseen One-reference-based object Pose estimation. Building upon a coarse-to-fine paradigm, UNOPose constructs an SE(3)-invariant reference frame to standardize object representation despite pose and size variations. To alleviate small overlap across viewpoints, we recalibrate the weight of each correspondence based on its predicted likelihood of being within the overlapping region. Evaluated on our proposed benchmark based on the BOP Challenge, UNOPose demonstrates superior performance, significantly outperforming traditional and learning-based methods in the one-reference setting and remaining competitive with CAD-model-based methods. The code and dataset will be available upon acceptance.
Poster
Junjie Chen · Weilong Chen · Yifan Zuo · Yuming Fang

[ ExHall D ]

Abstract
Category-agnostic pose estimation aims to locate keypoints on query images according to a few annotated support images for arbitrary novel classes. Existing methods generally extract support features via heatmap pooling, and obtain interacted features from support and query via cross-attention. Hence, these works neglect to mine fine-grained and structure-aware (FGSA) features from both support and query images, which are crucial for pixel-level keypoint localization. To this end, we propose a novel yet concise framework, which recurrently mines FGSA features from both support and query images. Specifically, we design a FGSA mining module based on the deformable attention mechanism. On the one hand, we mine fine-grained features by applying deformable attention heads over multi-scale feature maps. On the other hand, we mine structure-aware features by offsetting the reference points of keypoints to their linked keypoints. By means of the above module, we recurrently mine FGSA features from support and query images, and thus obtain better support features and query estimations. In addition, we propose to use mixup keypoints to pad various classes to a unified keypoint number, which provides richer supervision than the zero padding used in existing works. We conduct extensive experiments and in-depth studies on the large-scale MP-100 dataset, and outperform …
Poster
Qingyuan Wang · Rui Song · Jiaojiao Li · Kerui Cheng · David Ferstl · Yinlin Hu

[ ExHall D ]

Abstract
We introduce SCFlow2, a plug-and-play refinement framework for 6D object pose estimation. Most recent 6D object pose methods rely on refinement to obtain accurate results. However, most existing refinements either suffer from noise in establishing correspondences or rely on retraining for novel objects. SCFlow2 is based on the SCFlow model designed for iterative RGB refinement with a shape constraint, but formulates the additional depth as a regularization in the iteration via 3D scene flow for RGBD frames. The key design of SCFlow2 is the introduction of geometry constraints into the training of the recurrent matching network, by combining the rigid-motion embeddings in 3D scene flow with a 3D shape prior of the target. We train the refinement network on a combination of the Objaverse, GSO, and ShapeNet datasets, and demonstrate on BOP datasets with novel objects that the results of most state-of-the-art methods improve significantly after applying our method, without any retraining or fine-tuning.
Poster
Ziqin Huang · Gu Wang · Chenyangguang Zhang · Ruida Zhang · Xiu Li · Xiangyang Ji

[ ExHall D ]

Abstract
Recent advances in RGBD-based category-level object pose estimation have been limited by their reliance on precise depth information, restricting their broader applicability. In response, RGB-based methods have been developed. Among these methods, geometry-guided pose regression that originated from instance-level tasks has demonstrated strong performance. However, we argue that the NOCS map is an inadequate intermediate representation for geometry-guided pose regression method, as its many-to-one correspondence with category-level pose introduces redundant instance-specific information, resulting in suboptimal results. This paper identifies the intra-class variation problem inherent in pose regression based solely on the NOCS map and proposes the Intra-class Variation-Free Consensus (IVFC) map, a novel coordinate representation generated from the category-level consensus model. By leveraging the complementary strengths of the NOCS map and the IVFC map, we introduce GIVEPose, a framework that implements Gradual Intra-class Variation Elimination for category-level object pose estimation. Extensive evaluations on both synthetic and real-world datasets demonstrate that GIVEPose significantly outperforms existing state-of-the-art RGB-based approaches, achieving substantial improvements in category-level object pose estimation.
Poster
Wen-Hsuan Chu · Lei Ke · Jianmeng Liu · Mingxiao Huo · Pavel Tokmakov · Katerina Fragkiadaki

[ ExHall D ]

Abstract
We address the challenging problem of generating a dynamic 4D scene across views and over time from monocular videos. We target in-the-wild multi-object videos with heavy occlusions and propose Robust4DGen, a model that decomposes the scene into object tracks and optimizes a differentiable and deformable set of 3D Gaussians for each. Robust4DGen captures 2D occlusions from a 3D perspective by jointly splatting Gaussians of all objects to compute rendering errors in observed frames. Rather than relying on scene-level view generation models, which struggle to generalize due to the combinatorial complexity of scene views, we keep the Gaussian grouping information and additionally utilize object-centric, view-conditioned generative models for each entity to optimize score distillation objectives from unobserved viewpoints. We achieve this by applying differentiable affine transformations to jointly optimize both global image re-projection and object-centric score distillation objectives within a unified framework. To enable a thorough evaluation of generation and motion accuracy under multi-object occlusions, we annotate MOSE-PTS with accurate 2D point tracks, which is a subset of the challenging MOSE video segmentation benchmark. Through quantitative analysis and human evaluation, we demonstrate that our method generates more realistic 4D multi-object scenes and produces more accurate point tracks across spatial and temporal …
Poster
Guangzhao He · Chen Geng · Shangzhe Wu · Jiajun Wu

[ ExHall D ]

Abstract
The motion of deformable 4D objects lies on a low-dimensional manifold. To better capture this low dimensionality and enable better controllability, traditional methods have devised heuristic-based representations, i.e., rigging, to manipulate dynamic objects intuitively. However, such representations are not scalable due to the need for expert knowledge of specific categories. Instead, we study the automatic exploration of such low-dimensional structures in a purely data-driven manner. Specifically, we design a novel representation that encodes deformable 4D objects into a sparse set of spatially grounded blobs and an instance-aware feature volume to disentangle the pose and instance information of the 3D shape. With such a representation, we can manipulate the pose of 3D objects intuitively by modifying the parameters of the blobs, while preserving the rich instance-specific information. We evaluate the proposed method on a variety of object categories and demonstrate the effectiveness of the proposed framework.
Poster
Zekai Shao · Yufan Hu · Bin Fan · Hongmin Liu

[ ExHall D ]

Abstract
Maintaining stable tracking of objects in domain shift scenarios is crucial for RGB-T tracking, prompting us to explore the use of unlabeled test sample information for effective online model adaptation. However, current Test-Time Adaptation (TTA) methods in RGB-T tracking dramatically change the model's internal parameters during long-term adaptation. At the same time, the gradient computations involved in the optimization process impose a significant computational burden. To address these challenges, we propose a Parameter Update-Recovery Adaptation (PURA) framework based on parameter decomposition. Firstly, our fast parameter update strategy adjusts model parameters using statistical information from test samples without requiring gradient calculations, ensuring consistency between the model and the test data distribution. Secondly, our parameter decomposition recovery employs orthogonal decomposition to identify the principal update direction and recover parameters in this direction, aiding the retention of critical knowledge. Finally, we leverage the information obtained from the decomposition to provide feedback on the momentum during the update phase, ensuring a stable updating process. Experimental results demonstrate that PURA outperforms current state-of-the-art methods across multiple datasets, validating its effectiveness. The code is available in the Supplementary Materials.
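Update recovery along a principal direction can be illustrated with a plain SVD on the accumulated change of a single weight matrix, keeping only the top singular components. This is a sketch of the general idea under assumed names, not PURA's exact recovery rule:

```python
import numpy as np

def recover_principal_update(w_before, w_after, rank=1):
    """Decompose the accumulated parameter change with an SVD and keep only its top
    singular direction(s), discarding the rest so prior knowledge is retained.
    Illustrative stand-in for decomposition-based update recovery."""
    delta = w_after - w_before
    U, S, Vt = np.linalg.svd(delta, full_matrices=False)
    delta_principal = (U[:, :rank] * S[:rank]) @ Vt[:rank]
    return w_before + delta_principal

w0 = np.random.randn(64, 64)
w1 = w0 + 0.1 * np.random.randn(64, 64)          # drifted parameters after online adaptation
w_recovered = recover_principal_update(w0, w1, rank=4)
print(np.linalg.norm(w_recovered - w0) < np.linalg.norm(w1 - w0))   # True: milder update
```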
Poster
Xinyu Xiang · Qinglong Yan · HAO ZHANG · Jiayi Ma

[ ExHall D ]

Abstract
The research on adversarial attacks against trackers primarily concentrates on the RGB modality, whereas the methodology for attacking RGB-T multi-modal trackers has not been explored so far. This work represents an innovative attempt to develop an adaptive cross attack framework via multi-modal response decoupling, generating multi-modal adversarial patches to evade RGB-T trackers. Specifically, a modal-aware adaptive attack strategy is introduced to weaken the modality with high common information contribution alternately and iteratively, achieving the modal decoupling attack. In order to perturb the judgment of the modal balance mechanism in the tracker, we design a modal disturbance loss to increase the distance of the response map of the single-modal adversarial samples in the tracker. Besides, we also propose a novel spatio-temporal joint attack loss to progressively deteriorate the tracker's perception of the target. Moreover, the design of the shared adversarial shape enables the generated multi-modal adversarial patches to be readily deployed in real-world scenarios, effectively reducing the interference of the patch posting process on the shape attack of the infrared adversarial layer. Extensive digital and physical domain experiments demonstrate the effectiveness of our multi-modal adversarial patch attack.
Poster
Ahyun Seo · Minsu Cho

[ ExHall D ]

Abstract
Symmetry is crucial for understanding structural patterns and supports tasks such as object recognition and scene understanding. This paper focuses on rotational symmetry, where objects remain unchanged when rotated around a central axis, requiring the detection of rotation centers and supporting vertices. Traditional methods relied on hand-crafted feature matching for identifying rotation centers and vertices, while recent approaches use convolutional neural networks (CNNs) as segmentation models for rotation center detection. However, 2D-based models struggle to preserve 3D geometric properties due to distortions caused by viewpoint variation. To address this, we propose a rotation symmetry detection model that directly predicts rotation centers and vertices in 3D space, projecting the results back to 2D while maintaining structural consistency. By incorporating a vertex reconstruction stage that enforces 3D geometric priors—such as equal side lengths and interior angles for regular polygons—our model achieves greater robustness and geometric accuracy. Experiments on DENDI dataset show that our approach outperforms previous state-of-the-art methods in rotation center detection and demonstrates the effectiveness of 3D geometric priors through ablation studies on vertex reconstruction.
Poster
Shining Wang · Yunlong Wang · Ruiqi Wu · Bingliang Jiao · Wenxuan Wang · Peng Wang

[ ExHall D ]

Abstract
When discussing the Aerial-Ground Person Re-identification (AGPReID) task, we face the main challenge of the significant appearance variations caused by different viewpoints, making identity matching difficult. To address this issue, previous methods attempt to reduce the differences between viewpoints by critical attributes and decoupling the viewpoints. While these methods can mitigate viewpoint differences to some extent, they still face two main issues: (1) difficulty in handling viewpoint diversity and (2) neglect of the contribution of local features. To effectively address these challenges, we design and implement the Self-Calibrating and Adaptive Prompt (SeCap) method for the AGPReID task. The core of this framework relies on the Prompt Re-calibration Module (PRM), which adaptively re-calibrates prompts based on the input. Combined with the Local Feature Refinement Module (LFRM), SeCap can extract view-invariant features from local features for AGPReID. Meanwhile, given the current scarcity of datasets in the AGPReID field, we further contribute two real-world Large-scale Aerial-Ground Person Re-Identification datasets, LAGPeR and G2APS-ReID. The former is collected and annotated by us independently, covering 4,231 unique identities and containing 63,841 high-quality images; the latter is reconstructed from the person search dataset G2APS. Through extensive experiments on AGPReID datasets, we demonstrate that SeCap is a feasible …
Poster
Eric Hedlin · Munawar Hayat · Fatih Porikli · Kwang Moo Yi · Shweta Mahajan

[ ExHall D ]

Abstract
To efficiently adapt large models or to train generative models of neural representations, Hypernetworks have drawn interest. While hypernetworks work well, training them is cumbersome and often requires ground-truth optimized weights for each sample. However, obtaining each of these weights is a training problem of its own: one needs to train, e.g., adaptation weights or even an entire neural field for hypernetworks to regress to. In this work, we propose a method to train hypernetworks without the need for any per-sample ground truth. Our key idea is to learn a Hypernetwork "Field" and estimate the entire trajectory of network weight training instead of simply its converged state. In other words, we introduce an additional input to the Hypernetwork, the convergence state, which then makes it act as a neural field that models the entire convergence pathway of a task network. A critical benefit of doing so is that the gradient of the estimated weights at any convergence state must then match the gradients of the original task; this constraint alone is sufficient to train the Hypernetwork Field. We demonstrate the effectiveness of our method through the task of personalized image generation and 3D shape reconstruction from images and point clouds, demonstrating …
Poster
Takeshi Noda · Chao Chen · Junsheng Zhou · Weiqi Zhang · Yu-Shen Liu · Zhizhong Han

[ ExHall D ]

Abstract
Inferring signed distance functions (SDFs) from sparse point clouds remains a challenge in surface reconstruction. The key lies in the lack of detailed geometric information in sparse point clouds, which is essential for learning a continuous field. To resolve this issue, we present a novel approach that learns a dynamic deformation network to predict SDFs in an end-to-end manner. To parameterize a continuous surface from sparse points, we propose a bijective surface parameterization (BSP) that learns the global shape from local patches. Specifically, we construct a bijective mapping for sparse points from the parametric domain to 3D local patches, integrating patches into the global surface. Meanwhile, we introduce grid deformation optimization (GDO) into the surface approximation to optimize the deformation of grid points and further refine the parametric surfaces. Experimental results on synthetic and real scanned datasets demonstrate that our method significantly outperforms the current state-of-the-art methods.
Poster
Xinran Yang · Donghao Ji · Yuanqi Li · Junyuan Xie · Jie Guo · Yanwen Guo

[ ExHall D ]

Abstract
Point cloud reconstruction is a critical process in 3D representation and reverse engineering. For CAD models, edges are significant features that play a crucial role in characterizing the geometry of 3D shapes. However, few points are sampled exactly on edges during acquisition, resulting in apparent artifacts for the reconstruction task. Upsampling the point cloud is a direct technical route, but a main challenge is that the upsampled points may not align accurately with the model edges. To overcome this, we develop an integrated framework that estimates edges by joint regression of three geometric features: point-to-edge direction, point-to-edge distance, and point normal. Benefiting from these features, we implement a novel refinement process that moves and produces more points lying accurately on the edges of the model, allowing for high-quality edge-preserving reconstruction. Experiments and comparisons against previous methods demonstrate our method's effectiveness and superiority.
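The refinement step described above amounts to translating each point along its regressed point-to-edge direction by the regressed distance, optionally trusting only points that are already close to an edge. A minimal sketch with hypothetical inputs and threshold:

```python
import numpy as np

def snap_points_to_edges(points, to_edge_dirs, to_edge_dists, max_dist=0.05):
    """Move points onto nearby CAD edges using per-point regressed direction and
    distance (the kind of refinement the abstract describes; the threshold and
    names here are illustrative assumptions)."""
    dirs = to_edge_dirs / np.linalg.norm(to_edge_dirs, axis=1, keepdims=True)
    moved = points + dirs * to_edge_dists[:, None]
    keep = to_edge_dists < max_dist        # only trust points already close to an edge
    return np.where(keep[:, None], moved, points)

pts = np.random.rand(100, 3)
dirs = np.random.randn(100, 3)
dists = np.random.rand(100) * 0.1
print(snap_points_to_edges(pts, dirs, dists).shape)   # (100, 3)
```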
Poster
Lin Bie · Shouan Pan · Siqi Li · Yining Zhao · Yue Gao

[ ExHall D ]

Abstract
Although the fusion of images and LiDAR point clouds is crucial to many applications in computer vision, the relative poses of cameras and LiDAR scanners are often unknown. The general registration pipeline first establishes correspondences and then performs pose estimation based on the generated matches. However, 2D-3D correspondences are inherently challenging to establish due to the large gap between images and LiDAR point clouds. To this end, we build a bridge to alleviate the 2D-3D gap and propose a practical framework to align LiDAR point clouds to virtual points generated from images. In this way, the modality gap is converted into a domain gap between point clouds. Moreover, we propose a virtual-spherical representation and an adaptive distribution sampling module to narrow the domain gap between virtual and LiDAR point clouds. Then, we explore reliable correspondence pattern consistency through a graph-based selection process and improve the correspondence representation with a graph neural network. Experimental results demonstrate that our method outperforms state-of-the-art methods by more than 10.77% and 12.53% on the KITTI Odometry and nuScenes datasets, respectively. The results demonstrate that our method can effectively solve non-synchronized random-frame registration.
Poster
Kang You · Tong Chen · Dandan Ding · M. Salman Asif · Zhan Ma

[ ExHall D ]

Abstract
Despite the substantial advancements demonstrated by learning-based neural models in the LiDAR Point Cloud Compression (LPCC) task, realizing real-time compression, an indispensable criterion for numerous industrial applications, remains a formidable challenge. This paper proposes RENO, the first real-time neural codec for 3D LiDAR point clouds, achieving superior performance with a lightweight model. RENO skips the octree construction and directly builds upon the multiscale sparse tensor representation. Instead of multi-stage inference, RENO devises sparse occupancy codes, which exploit cross-scale correlation and derive voxels' occupancy in a one-shot manner, greatly saving processing time. Experimental results demonstrate that the proposed RENO achieves real-time coding speed, 10 fps at 14-bit depth on a desktop platform (e.g., one RTX 3090 GPU) for both encoding and decoding, while providing 12.25% and 48.34% bit-rate savings compared to G-PCCv23 and Draco, respectively, at similar quality. The RENO model size is merely 1 MB, making it attractive for practical applications. The source code will be made publicly available.
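A sparse occupancy code in this setting can be thought of as one byte per parent voxel, with one bit per child octant at the next finer scale. The sketch below shows that packing for integer voxel coordinates; it illustrates the kind of symbol a multiscale codec can entropy-code in one shot and is not RENO's actual implementation.

```python
import numpy as np

def occupancy_codes(parent_coords, child_coords):
    """Pack the occupancy of each parent voxel's eight children into one 8-bit code
    (one bit per octant). Parent coordinates live at scale s, children at scale s+1
    with coordinates 2*parent + offset. Illustrative sketch only."""
    child_set = {tuple(c) for c in child_coords}
    codes = []
    for p in parent_coords:
        code = 0
        for bit, (dx, dy, dz) in enumerate(np.ndindex(2, 2, 2)):
            if (2 * p[0] + dx, 2 * p[1] + dy, 2 * p[2] + dz) in child_set:
                code |= 1 << bit
        codes.append(code)
    return np.array(codes, dtype=np.uint8)

parents = np.array([[0, 0, 0]])
children = np.array([[0, 0, 0], [1, 1, 1]])     # two of the eight octants occupied
print(occupancy_codes(parents, children))        # [129]
```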
Poster
Changshuo Wang · Shuting He · Xiang Fang · Jiawei Han · Zhonghang Liu · Xin Ning · Weijun Li · Prayag Tiwari

[ ExHall D ]

Abstract
While existing pre-training-based methods have enhanced point cloud model performance, they have not fundamentally resolved the challenge of local structure representation in point clouds. The limited representational capacity of pure point cloud models continues to constrain the potential of cross-modal fusion methods and performance across various tasks. To address this challenge, we propose a Dynamic Acoustic Field Fitting Network (DAF-Net), inspired by physical acoustic principles. Specifically, we represent local point clouds as acoustic fields and introduce a novel Acoustic Field Convolution (AF-Conv), which treats local aggregation as an acoustic energy field modeling problem and captures fine-grained local shape awareness by dividing the local area into near field and far field. Furthermore, drawing inspiration from multi-frequency wave phenomena and dynamic convolution, we develop the Dynamic Acoustic Field Convolution (DAF-Conv) based on AF-Conv. DAF-Conv dynamically generates multiple weights based on local geometric priors, effectively enhancing adaptability to diverse geometric features. Additionally, we design a Global Shape-Aware (GSA) layer incorporating EdgeConv and multi-head attention mechanisms, which combines with DAF-Conv to form the DAF Block. These blocks are then stacked to create a hierarchical DAFNet architecture. Extensive experiments on point cloud classification, part segmentation, and few-shot semantic segmentation demonstrate that DAFNet significantly outperforms existing …
Poster
Xiaoyang Wu · Daniel DeTone · Duncan Frost · TIANWEI SHEN · Chris Xie · Nan Yang · Jakob Engel · Richard Newcombe · Hengshuang Zhao · Julian Straub

[ ExHall D ]

Abstract
In this paper, we question whether we have a reliable self-supervised point cloud model that can be used for diverse 3D tasks via simple linear probing, even with limited data and minimal computation. We find that existing 3D self-supervised learning approaches fall short when evaluated on representation quality through linear probing. We hypothesize that this is due to what we term the geometric shortcut, which causes representations to collapse to low-level spatial features. This challenge is unique to 3D and arises from the sparse nature of point cloud data. We address it through two key strategies: obscuring spatial information and enhancing the reliance on input features, ultimately composing a Sonata of 140k point clouds through self-distillation. Sonata is simple and intuitive, yet its learned representations are strong and reliable: zero-shot visualizations demonstrate semantic grouping, alongside strong spatial reasoning through nearest-neighbor relationships. Sonata demonstrates exceptional parameter and data efficiency, tripling linear probing accuracy (from 21.8% to 72.5%) on ScanNet and nearly doubling performance with only 1% of the data compared to previous approaches. Full fine-tuning further advances SOTA across both 3D indoor and outdoor perception tasks. All code and weights will be made available.
Poster
Qi Zhang · Jibin Peng · Zhao Huang · Wei Feng · Di Lin

[ ExHall D ]

Abstract
The recent progress in semantic point cloud segmentation is attributed to deep networks, which require a large amount of point cloud data for training. However, collecting substantial point-wise annotations of point clouds at affordable cost for end-to-end network training remains an open problem. In this paper, we propose Generative Hard Example Augmentation (GHEA) to produce novel point cloud examples that enrich the data for training the segmentation network. Firstly, GHEA employs a generative network to embed the discrepancy between point clouds into a latent space. From the latent space, we sample multiple discrepancies for reshaping a point cloud into various examples, contributing to the richness of the training data. Secondly, GHEA mixes the reshaped point clouds by respecting their segmentation errors. This mixup allows reshaped point clouds that are difficult to segment to serve as challenging examples for network training. We evaluate the effectiveness of GHEA, which helps popular segmentation networks improve their performance.
Poster
Yuzhou Liu · Lingjie Zhu · Hanqiao Ye · Shangfeng Huang · Xiang Gao · Xianwei Zheng · Shuhan Shen

[ ExHall D ]

Abstract
In this paper, we present BWFormer, a novel Transformer-based model for building wireframe reconstruction from airborne LiDAR point clouds. The problem is solved in a ground-up manner by detecting the building corners in 2D, then lifting and connecting them in 3D space, with additional data augmentation. Due to the 2.5D characteristic of airborne LiDAR point clouds, we simplify the problem by projecting the points onto the ground plane to produce a 2D height map. From the height map, a heat map of pixel-wise corner likelihood is first predicted to locate possible 2D corners. Then, 3D corners are predicted by a Transformer-based network with extra height embedding initialization. This 2D-to-3D corner detection strategy reduces the search space significantly. To recover the topological connections among the corners, edges are finally predicted from geometric and visual cues in the height map with the proposed edge attention mechanism, which extracts holistic features while preserving local details. In addition, due to the limited datasets in the field and the irregularity of the point clouds, a conditional latent diffusion model for LiDAR scanning simulation is utilized for data augmentation. BWFormer surpasses other state-of-the-art methods, especially in reconstruction completeness. We commit to releasing all our code and pre-trained models.
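Projecting 2.5D airborne LiDAR onto a ground-plane height map is a simple rasterization, for example keeping the maximum elevation per grid cell. A minimal sketch with assumed cell size and grid extent:

```python
import numpy as np

def points_to_height_map(points, cell=0.5, grid=(256, 256)):
    """Rasterize 2.5D airborne LiDAR points (N, 3) onto a ground-plane height map by
    keeping the maximum z per grid cell. A standard projection; the cell size and
    grid extent here are illustrative assumptions."""
    H = np.full(grid, -np.inf)
    ij = np.floor(points[:, :2] / cell).astype(int)
    valid = (ij[:, 0] >= 0) & (ij[:, 0] < grid[0]) & (ij[:, 1] >= 0) & (ij[:, 1] < grid[1])
    for (i, j), z in zip(ij[valid], points[valid, 2]):
        H[i, j] = max(H[i, j], z)
    H[np.isinf(H)] = 0.0        # empty cells default to ground level
    return H

pts = np.random.rand(1000, 3) * np.array([100.0, 100.0, 30.0])
print(points_to_height_map(pts).shape)   # (256, 256)
```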
Poster
Justin Lazarow · David Griffiths · Gefen Kohavi · Francisco Crespo · Afshin Dehghan

[ ExHall D ]

Abstract
We consider indoor 3D object detection with respect to a single RGB(-D) frame acquired from a commodity handheld device. We seek to significantly advance the status quo with respect to both data and modeling. First, we establish that existing datasets have significant limitations in the scale, accuracy, and diversity of objects. As a result, we introduce the **Cubify-Anything 1M (CA-1M) dataset**, which exhaustively labels over 400K 3D objects on over 1K highly accurate laser-scanned scenes with near-perfect registration to over 3.5K handheld, egocentric captures. Next, we establish **Cubify Transformer (CuTR)**, a fully Transformer-based 3D object detection baseline which, rather than operating in 3D on point- or voxel-based representations, predicts 3D boxes directly from 2D features derived from RGB(-D) inputs. While this approach lacks any 3D inductive biases, we show that, paired with CA-1M, CuTR outperforms point-based methods on CA-1M - accurately recalling over 62% of objects in 3D - and is significantly more capable of handling the noise and uncertainty present in commodity LiDAR-derived depth maps, while also providing promising RGB-only performance without architecture changes. Furthermore, by pre-training on CA-1M, CuTR can outperform point-based methods on a more diverse variant of SUN RGB-D - supporting the notion that while inductive biases in …
Poster
Mohamed Abdelsamad · Michael Ulrich · Claudius Glaeser · Abhinav Valada

[ ExHall D ]

Abstract
Masked autoencoders (MAE) have shown tremendous potential for self-supervised learning (SSL) in vision and beyond. However, point clouds from LiDARs used in automated driving are particularly challenging for MAEs since large areas of the 3D volume are empty. Consequently, existing work suffers from leaking occupancy information into the decoder and has significant computational complexity, thereby limiting SSL pre-training to only 2D bird's eye view encoders in practice. In this work, we propose the novel neighborhood occupancy MAE (NOMAE) that overcomes the aforementioned challenges by employing masked occupancy reconstruction only in the neighborhood of non-masked voxels. We incorporate voxel masking and occupancy reconstruction at multiple scales with our proposed hierarchical mask generation technique to capture features of objects of different sizes in the point cloud. NOMAE is extremely flexible and can be directly employed for SSL in existing 3D architectures. We perform extensive evaluations on the nuScenes and Waymo Open datasets for the downstream perception tasks of semantic segmentation and 3D object detection, comparing with both discriminative and generative SSL methods. The results demonstrate that NOMAE sets a new state of the art on multiple benchmarks for multiple point cloud perception tasks.
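The idea of reconstructing occupancy only in the neighborhood of non-masked voxels can be sketched as follows. This single-scale NumPy illustration shows how visible voxels and nearby reconstruction targets might be selected; the function name, mask ratio, and neighborhood radius are assumptions, and the paper's hierarchical mask generation is not reproduced.

```python
# Hedged, single-scale sketch of neighborhood-restricted occupancy targets.
import numpy as np

def neighborhood_targets(occ_voxels, mask_ratio=0.6, radius=1, rng=None):
    """occ_voxels: (N, 3) integer coordinates of occupied voxels. Returns the visible
    voxels and the candidate voxels whose occupancy should be reconstructed."""
    rng = rng or np.random.default_rng()
    keep = rng.random(len(occ_voxels)) > mask_ratio
    visible = occ_voxels[keep]
    offsets = np.stack(np.meshgrid(*[np.arange(-radius, radius + 1)] * 3), -1).reshape(-1, 3)
    # candidate targets = all voxels within `radius` of some visible (non-masked) voxel
    candidates = np.unique((visible[:, None, :] + offsets[None, :, :]).reshape(-1, 3), axis=0)
    occupied = {tuple(v) for v in occ_voxels}
    labels = np.array([tuple(c) in occupied for c in candidates], dtype=np.float32)
    return visible, candidates, labels
```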
Poster
Zhenxuan Zeng · Qiao Wu · Xiyu Zhang · Lin Yuanbo Wu · Pei An · Jiaqi Yang · Ji Wang · Peng Wang

[ ExHall D ]

Abstract
In real-world environments, a LiDAR point cloud registration method with robust generalization capabilities (across varying distances and datasets) is crucial for ensuring safety in autonomous driving and other LiDAR-based applications. However, current methods fall short in achieving this level of generalization. To address these limitations, we propose UGP, a pruned framework designed to enhance generalization power for LiDAR point cloud registration. The core insight in UGP is the elimination of cross-attention mechanisms to improve generalization, allowing the network to concentrate on intra-frame feature extraction. Additionally, we introduce a progressive self-attention module to reduce ambiguity in large-scale scenes and integrate Bird’s Eye View (BEV) features to incorporate semantic information about scene elements. Together, these enhancements significantly boost the network’s generalization performance. We validated our approach through various generalization experiments in multiple outdoor scenes. In cross-distance generalization experiments on KITTI and nuScenes, UGP achieved state-of-the-art mean Registration Recall rates of 94.5% and 91.4%, respectively. In cross-dataset generalization from nuScenes to KITTI, UGP achieved a state-of-the-art mean Registration Recall of 90.9%.
Poster
Yingping Liang · Yutao Hu · Wenqi Shao · Ying Fu

[ ExHall D ]

Abstract
Depth completion involves predicting dense depth maps from sparse LiDAR inputs, a critical task for applications such as autonomous driving and robotics. However, sparse depth annotations from sensors limit the availability of dense supervision, which is necessary for learning detailed geometric features. To overcome this limitation, we propose a two-stage knowledge distillation framework that leverages powerful monocular foundation models to provide dense supervision for depth completion. In the first stage, we introduce a pre-training strategy that generates diverse training data from natural images to distill geometric knowledge into depth completion models. Specifically, we simulate LiDAR scans by utilizing monocular depth and mesh reconstruction, thereby creating training data without requiring ground-truth depth. Nonetheless, monocular depth estimation suffers from inherent scale ambiguity in real-world settings. To address this, in the second stage, we employ a scale- and shift-invariant loss (SSI Loss) to learn real-world scales when fine-tuning on real-world datasets. Our two-stage distillation framework enables depth completion models to harness the strengths of monocular foundation models. Experimental results show that models trained with our two-stage distillation framework achieve top-ranked performance on the KITTI benchmark, demonstrating improvements in both quantitative and qualitative metrics.
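A scale- and shift-invariant loss of the kind mentioned above is commonly implemented by solving a per-image least-squares problem for a scale and shift that align the prediction to the target before measuring the error. The sketch below shows one such formulation under that assumption; it is illustrative and may differ from the paper's exact loss.

```python
# Minimal scale- and shift-invariant (SSI) loss sketch with closed-form alignment.
import torch

def ssi_loss(pred, target, mask):
    """pred, target, mask: (B, H, W); mask marks pixels with valid ground truth."""
    pred, target, mask = pred.flatten(1), target.flatten(1), mask.flatten(1).float()
    # Per-image least squares for scale s and shift t: min ||s*pred + t - target||^2 on valid pixels.
    n = mask.sum(dim=1).clamp(min=1)
    sum_p = (mask * pred).sum(dim=1)
    sum_t = (mask * target).sum(dim=1)
    sum_pp = (mask * pred * pred).sum(dim=1)
    sum_pt = (mask * pred * target).sum(dim=1)
    denom = (n * sum_pp - sum_p ** 2).clamp(min=1e-6)
    s = (n * sum_pt - sum_p * sum_t) / denom
    t = (sum_t - s * sum_p) / n
    aligned = s.unsqueeze(1) * pred + t.unsqueeze(1)
    return ((aligned - target).abs() * mask).sum() / mask.sum().clamp(min=1)
```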
Poster
Hou-I Liu · Christine Wu · Jen-Hao Cheng · Wenhao Chai · Shian-yun Wang · Gaowen Liu · Hugo Latapie · Jhih-Ciang Wu · Jenq-Neng Hwang · Hong-Han Shuai · Wen-Huang Cheng

[ ExHall D ]

Abstract
Monocular 3D object detection (Mono3D) holds noteworthy promise for autonomous driving applications owing to the cost-effectiveness and rich visual context of monocular camera sensors. However, depth ambiguity poses a significant challenge, as it requires extracting precise 3D scene geometry from a single image, resulting in suboptimal performance when transferring knowledge from a LiDAR-based teacher model to a camera-based student model. To address this issue, we introduce Monocular Teaching Assistant Knowledge Distillation (MonoTAKD) to enhance 3D perception in Mono3D. Our approach presents a robust camera-based teaching assistant model that effectively bridges the representation gap between different modalities for teacher and student models, addressing the challenge of inaccurate depth estimation. By defining 3D spatial cues as residual features that capture the differences between the teacher and the teaching assistant models, we transfer these cues to the student model, improving its 3D perception capabilities. Experimental results show that our MonoTAKD achieves state-of-the-art performance on the KITTI3D dataset. Additionally, we evaluate the performance on nuScenes and KITTI raw datasets to demonstrate the generalization of our model to multi-view 3D and unsupervised data settings.
Poster
Yunfei Long · Abhinav Kumar · Xiaoming Liu · Daniel Morris

[ ExHall D ]

Abstract
Radar hits reflect from points both on the boundary of and interior to object outlines. This results in a complex distribution of radar hits that depends on factors including object category, size and orientation. Current radar-camera fusion methods implicitly account for this with a black-box neural network. In this paper, we explicitly utilize a radar hit distribution model to assist fusion. First, we build a model to predict radar hit distributions conditioned on object properties obtained from a monocular detector. Second, we use the predicted distribution as a kernel to match actual measured radar points in the neighborhood of the monocular detections, generating matching scores at nearby positions. Finally, a fusion stage combines context with the kernel detector to refine the matching scores. Our method achieves the state-of-the-art radar-camera detection performance on nuScenes. We will release the model and code upon publication.
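The kernel-matching step described above can be pictured as evaluating the predicted hit distribution at each measured radar point near a monocular detection and accumulating the result into a score. The sketch assumes a 2D Gaussian kernel on the ground plane purely for illustration; the paper's predicted distribution and fusion stage are not reproduced.

```python
# Hedged sketch of kernel-based matching between a predicted radar-hit distribution
# and measured radar points (2D Gaussian kernel assumed for illustration only).
import numpy as np

def matching_score(center_xy, cov_xy, radar_xy, search_radius=5.0):
    """center_xy: (2,) monocular detection position on the ground plane;
    cov_xy: (2, 2) predicted hit-distribution covariance; radar_xy: (M, 2) radar hits."""
    diff = radar_xy - center_xy
    nearby = np.linalg.norm(diff, axis=1) < search_radius
    if not nearby.any():
        return 0.0
    inv_cov = np.linalg.inv(cov_xy)
    # Evaluate the (unnormalized) Gaussian kernel at each nearby radar hit and sum.
    maha = np.einsum("md,de,me->m", diff[nearby], inv_cov, diff[nearby])
    return float(np.exp(-0.5 * maha).sum())
```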
Poster
Xingyue Liu · Jiahao Qi · Chen Chen · Kangcheng Bin · Ping Zhong

[ ExHall D ]

Abstract
Cross-modality visible-infrared re-identification (VI-ReID) aims to achieve around-the-clock target matching, benefiting from the strengths of both RGB and infrared (IR) modalities. However, the field is hindered by limited datasets, particularly for vehicle VI-ReID, and by challenges such as modality bias training (MBT), stemming from biased pre-training on ImageNet. To tackle these issues, this paper introduces the UCM-VeID V2 dataset benchmark for vehicle VI-ReID and proposes a new self-supervised pre-training method, Cross-Modality Patch-Mixed Self-supervised Learning (PMSL). The UCM-VeID V2 dataset features a significant increase in data volume, along with enhancements in multiple aspects. PMSL addresses MBT by learning modality-invariant features through Patch-Mixed Image Reconstruction (PMIR) and Modality Discrimination Adversarial Learning (MDAL), and enhances discriminability with Modality-Augmented Contrasting Cluster (MACC). Comprehensive experiments are carried out to validate the effectiveness of the proposed method.
Poster
Yunshuang Yuan · Yan Xia · Daniel Cremers · Monika Sester

[ ExHall D ]

Abstract
Cooperative perception can increase the view field and decrease the occlusion of an ego vehicle, hence improving the perception performance and safety of autonomous driving. Despite the success of previous works on cooperative object detection, they mostly operate on dense Bird's Eye View (BEV) feature maps, which are computationally demanding and can hardly be extended to long-range detection problems. More efficient fully sparse frameworks are rarely explored. In this work, we design a fully sparse framework, SparseAlign, with three key features: an enhanced sparse 3D backbone, a query-based temporal context learning module, and a robust detection head specially tailored for sparse features. Extensive experimental results on both the OPV2V and DairV2X datasets show that our framework, despite its sparsity, outperforms the state of the art with lower communication bandwidth requirements. In addition, experiments on the OPV2Vt and DairV2Xt datasets for time-aligned cooperative object detection also show a significant performance gain compared to the baseline works.
Poster
Luke Chen · Junyao Wang · Trier Mortlock · Pramod Khargonekar · Mohammad Al Faruque

[ ExHall D ]

Abstract
Uncertainty Quantification (UQ) is crucial for ensuring the reliability of machine learning models deployed in real-world autonomous systems. However, existing approaches typically quantify task-level output prediction uncertainty without considering epistemic uncertainty at the multimodal feature fusion level, leading to sub-optimal outcomes. Additionally, popular uncertainty quantification methods, e.g., Bayesian approximations, remain challenging to deploy in practice due to high computational costs in training and inference. In this paper, we propose HyperDUM, a novel deterministic uncertainty method (DUM) that efficiently quantifies feature-level epistemic uncertainty by leveraging hyperdimensional computing. Our method captures channel and spatial uncertainties through channel-wise and patch-wise projection and bundling techniques, respectively. Multimodal sensor features are then adaptively weighted to mitigate uncertainty propagation and improve feature fusion. Our evaluations show that HyperDUM on average outperforms state-of-the-art (SOTA) algorithms by up to 2.01%/1.27% in 3D object detection and by up to 1.29% over baselines in semantic segmentation tasks under various types of uncertainty. Notably, HyperDUM requires 2.36× fewer floating-point operations and up to 38.30× fewer parameters than SOTA methods, providing an efficient solution for real-world autonomous systems.
Poster
Dongxu Wei · Zhiqi Li · Peidong Liu

[ ExHall D ]

Abstract
Prior works employing pixel-based Gaussian representation have demonstrated efficacy in feed-forward sparse-view reconstruction. However, such representation necessitates cross-view overlap for accurate depth estimation, and is challenged by object occlusions and frustum truncations. As a result, these methods require scene-centric data acquisition to maintain cross-view overlap and complete scene visibility to circumvent occlusions and truncations, which limits their applicability to scene-centric reconstruction. In contrast, in autonomous driving scenarios, a more practical paradigm is ego-centric reconstruction, which is characterized by minimal cross-view overlap and frequent occlusions and truncations. The limitations of pixel-based representation thus hinder the utility of prior works in this task. In light of this, this paper conducts an in-depth analysis of different representations, and introduces Omni-Gaussian representation with tailored network design to complement their strengths and mitigate their drawbacks. Experiments show that our method significantly surpasses state-of-the-art methods, pixelSplat and MVSplat, in ego-centric reconstruction, and achieves comparable performance to prior works in scene-centric reconstruction. Furthermore, we extend our method with diffusion models, pioneering feed-forward multi-modal generation of 3D driving scenes.
Poster
David T. Hoffmann · Syed Haseeb Raza · Hanqiu Jiang · Steffen Klingenhoefer · Denis Tananaev · Martin Meinke

[ ExHall D ]

Abstract
Scene flow estimation is a foundational task for many robotics applications, ranging from robust dynamic object detection to automatic labeling and sensor synchronization. Two distinct approaches to the problem have evolved: 1) Supervised and 2) optimization-based methods. While supervised methods are fast during inference and achieve high-quality results, they are limited by the need for large amounts of labeled training data and are susceptible to domain gaps. In contrast, unsupervised test-time optimization methods do not face the problem of domain gaps but usually suffer from substantial runtime or fail to converge to the right solution. Current optimization-based approaches often perform poorly on dynamic objects and mainly predict ego-motion. In this work, we mitigate several limitations of existing optimization-based methods. To this end, we 1) introduce a simple voxel grid-based model that exhibits advantageous characteristics compared to the standard MLP-based formulation and 2) introduce a new multi-frame loss formulation. We combine both contributions in our new method, termed Floxels. On our ego-motion compensated benchmark, based on nuScenes and Argoverse, Floxels achieves state of the art (SOTA) results and performs on par with a recently proposed SOTA supervised method. At the same time compute costs scale significantly more gracefully with point cloud …
Poster
Jingyi Xu · Xieyuanli Chen · Junyi Ma · Jiawei Huang · Jintao Xu · Yue Wang · Ling Pei

[ ExHall D ]

Abstract
The task of occupancy forecasting (OCF) involves utilizing past and present perception data to predict future occupancy states of an autonomous vehicle's surrounding environment, which is critical for downstream tasks such as obstacle avoidance and path planning. Existing 3D OCF approaches struggle to predict plausible spatial details for movable objects and suffer from slow inference speeds because they neglect the bias and uneven distribution of changing occupancy states in both space and time. In this paper, we propose a novel spatiotemporal decoupling vision-based paradigm to explicitly tackle the bias and achieve both effective and efficient 3D OCF. To tackle spatial bias in empty areas, we introduce a novel spatial representation that decouples the conventional dense 3D format into 2D bird’s-eye view (BEV) occupancy with corresponding height values, enabling 3D OCF to be derived solely from 2D predictions and thus enhancing efficiency. To reduce temporal bias on static voxels, we design temporal decoupling to improve end-to-end OCF by temporally associating instances via predicted flows. We develop an efficient multi-head network, EfficientOCF, to achieve 3D OCF with our devised spatiotemporally decoupled representation. A new metric, conditional IoU (C-IoU), is also introduced to provide a robust 3D OCF performance assessment, especially in datasets with missing or incomplete …
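The BEV-plus-height decoupling can be illustrated by collapsing a dense 3D occupancy grid into a 2D occupancy map plus a per-column height value, and then coarsely inverting that mapping. The following NumPy sketch encodes only the top surface of each column, which is a simplifying assumption rather than the paper's exact representation.

```python
# Illustrative sketch of a BEV-plus-height encoding of a dense occupancy volume.
import numpy as np

def occ3d_to_bev_height(occ):
    """occ: (H, W, Z) boolean occupancy. Returns BEV occupancy (H, W) and the index of the
    highest occupied voxel per column (0 where the column is empty)."""
    bev = occ.any(axis=2)
    # argmax over the reversed z-axis finds the top-most occupied voxel in each column
    top = occ.shape[2] - 1 - np.argmax(occ[:, :, ::-1], axis=2)
    height = np.where(bev, top, 0)
    return bev, height

def bev_height_to_occ3d(bev, height, z_bins):
    """Rebuild a coarse 3D volume by filling each occupied column up to its stored height."""
    z = np.arange(z_bins)[None, None, :]
    return bev[:, :, None] & (z <= height[:, :, None])
```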
Poster
Rui Gong · Kim-Hui Yap · Weide Liu · Xulei Yang · Jun Cheng

[ ExHall D ]

Abstract
Online stereo rectification is critical for autonomous vehicles and robots in dynamic environments, where factors such as vibration, temperature fluctuations, and mechanical stress can affect rectification accuracy and severely degrade downstream stereo depth estimation. Current dominant approaches for online stereo rectification involve estimating relative camera poses in real time to derive rectification homographies. However, they do not directly optimize for rectification constraints, which leads to a gap. Additionally, the general-purpose correspondence matchers used in these methods are not trained for stereo rectification, while training of these matchers typically requires ground-truth correspondences which are not available in stereo rectification datasets. To address these limitations, we propose a matching-based stereo rectification framework that is directly optimized for rectification and does not require ground-truth correspondence annotations for training. Our framework incorporates a rectification-constrained estimator and applies multi-level, rectification-specific supervision that trains the matcher network for rectification without relying on ground-truth correspondences. Additionally, we create a new rectification dataset with ground-truth optical flow annotations, eliminating bias from evaluation metrics used in prior work that relied on pretrained keypoint matching or optical flow models. Extensive experiments show that our approach outperforms both state-of-the-art matching-based and matching-free methods in vertical flow metric by 10.7% on the …
Poster
Xiaolu Liu · Ruizi Yang · Song Wang · Wentong Li · Junbo Chen · Jianke Zhu

[ ExHall D ]

Abstract
Reliable high-definition (HD) map construction is crucial for the driving safety of autonomous vehicles. While recent studies demonstrate improved performance, their generalization capability across unfamiliar driving scenes remains unexplored. To tackle this issue, we propose UIGenMap, an uncertainty-instructed structure injection approach for generalizable HD map vectorization, which performs uncertainty resampling over statistical distributions and employs explicit instance features to reduce excessive reliance on training data. Specifically, we introduce a perspective-view (PV) detection branch to obtain explicit structural features, in which an uncertainty-aware decoder is designed to dynamically sample probability distributions, taking into account differences across scenes. With probabilistic embedding and selection, UI2DPrompt is proposed to construct PV learnable prompts. These PV prompts are integrated into the map decoder through a designed hybrid injection to compensate for neglected instance structures. To ensure real-time inference, a lightweight Mimic Query Distillation is designed to learn from the PV prompts, which can serve as an efficient alternative to the PV branch pipeline. Extensive experiments on challenging geographically disjoint (geo-based) data splits demonstrate that our UIGenMap achieves superior performance, with a +5.7 mAP improvement on the nuScenes dataset. Our code will be made publicly available.
Poster
yunlong lin · Zixu Lin · Haoyu Chen · Panwang Pan · Chenxin Li · Sixiang Chen · Kairun Wen · Yeying Jin · Wenbo Li · Xinghao Ding

[ ExHall D ]

Abstract
Vision-centric perception systems for autonomous driving often struggle with unpredictable and coupled weather degradations in the wild. Current solutions are often limited, as they either depend on specific degradation priors or suffer from significant domain gaps. To enable robust and autonomous operation in real-world conditions, we propose JarvisIR, a VLM-powered agent that leverages a VLM (e.g., Llava-Llama3) as a controller to manage multiple expert restoration models. To further enhance system robustness, reduce hallucinations, and improve generalizability in real-world adverse weather, JarvisIR employs a novel two-stage framework consisting of supervised fine-tuning and human feedback alignment. Specifically, to address the lack of paired data in real-world scenarios, the human feedback alignment enables the VLM to be fine-tuned effectively on large-scale real-world data in an unsupervised manner. To support the training and evaluation of JarvisIR, we introduce CleanBench, a comprehensive dataset consisting of high-quality and large-scale instruction-response pairs, including 150K synthetic entries and 80K real entries. Extensive experiments demonstrate that JarvisIR exhibits superior decision-making and restoration capabilities. Compared with existing methods, it achieves a 50% improvement in the average of all perception metrics on CleanBench-Real. Furthermore, it effectively supports high-level tasks, such as semantic segmentation and object detection.
Poster
Jingcheng Ni · Yuxin Guo · Yichen Liu · Rui Chen · Lewei Lu · Zehuan Wu

[ ExHall D ]

Abstract
World models that forecast environmental changes from actions are vital for autonomous driving models with strong generalization. Prevailing driving world models mainly build on pixel-level video prediction models. Although these models can produce high-fidelity video sequences with advanced diffusion-based generators, they are constrained by their predictive duration and overall generalization capabilities. In this paper, we explore solving this problem by combining a pixel-level generation loss with MAE-style feature-level context learning. In particular, we instantiate this target with three key designs: (1) a more scalable Diffusion Transformer (DiT) structure trained with an extra mask construction task; (2) diffusion-related mask tokens that deal with the fuzzy relations between mask reconstruction and the generative diffusion process; (3) an extension of the mask construction task to the spatiotemporal domain by utilizing a row-wise mask for shifted self-attention rather than the masked self-attention in MAE. We then adopt a row-wise cross-view module to align with this mask design. Based on the above improvements, we propose MaskGWM: a Generalizable driving World Model embodied with video Mask reconstruction. Our model contains two variants: MaskGWM-long, focusing on long-horizon prediction, and MaskGWM-mview, dedicated to multi-view generation. Comprehensive experiments on standard benchmarks validate the effectiveness of the proposed method, which include standard validation on the nuScenes dataset, …
Poster
Ze Yang · Jingkang Wang · Haowei Zhang · Sivabalan Manivasagam · Yun Chen · Raquel Urtasun

[ ExHall D ]

Abstract
High-quality 3D assets for traffic participants such as vehicles and motorcycles are critical for multi-sensor simulation, which is required for the safe end-to-end development of autonomy. Building assets from in-the-wild real-world data is key for diversity and realism, but existing neural-rendering-based reconstruction methods are slow and generate assets that can only be rendered close to the original viewpoints of observed actors, restricting usage in simulation. Recent diffusion-based generative models build complete and diverse assets, but perform poorly on in-the-wild driving scenes, where observed actors are captured under sparse and limited fields of view, and are partially occluded. In this work, we propose a 3D latent diffusion model that learns on in-the-wild LiDAR and camera data captured by a sensor platform and generates high-quality 3D assets with complete geometry and appearance. Key to our method is a "reconstruct-then-generate" approach that first leverages occlusion-aware neural rendering trained over multiple scenes to build a high-quality latent space for objects, and then trains a generative diffusion model that operates on the latent space. We show our method outperforms existing reconstruction- and generative-based methods, unlocking diverse and scalable content creation for simulation.
Poster
Mariam Hassan · Sebastian Stapf · Ahmad Rahimi · Pedro M B Rezende · Yasaman Haghighi · David Brüggemann · Isinsu Katircioglu · Lin Zhang · Xiaoran Chen · Suman Saha · Marco Cannici · Elie Aljalbout · Botao Ye · Xi Wang · Aram Davtyan · Mathieu Salzmann · Davide Scaramuzza · Marc Pollefeys · Paolo Favaro · Alex Alahi

[ ExHall D ]

Abstract
World models predict future frames from past observations and actions, making them powerful simulators for ego-vision tasks with complex dynamics, such as autonomous driving. Nonetheless, existing world models for ego-vision mainly focus on the driving domain and the ego-vehicle's actions, limiting the complexity and diversity of the generated scenes. In this work, we propose GEM, a diffusion-based world model with a generalized control strategy. By leveraging ego-trajectories and general image features, GEM not only allows for fine-grained control over the ego-motion, but also enables control over the motion of other objects in the scene and supports scene composition by inserting new objects. GEM is multimodal, capable of generating both videos and future depth sequences, providing rich semantic and spatial output contexts. Although our primary focus remains on the domain of autonomous driving, we explore the adaptability of GEM to other ego-vision domains such as human activity and drone navigation. To evaluate GEM’s controllability, we propose a comprehensive evaluation framework. The results show the effectiveness of GEM in controlling the motion of objects within the scene, with conditional generation outperforming unconditional generation by 68% and 79% on nuScenes and OpenDV respectively.
Poster
Inhwan Bae · Junoh Lee · Hae-Gon Jeon

[ ExHall D ]

Abstract
Modeling and reproducing crowd behaviors are important in various domains including psychology, robotics, transport engineering and virtual environments. Conventional methods have focused on synthesizing momentary scenes, which have difficulty in replicating the continuous nature of real-world crowds. In this paper, we introduce a novel method for automatically generating continuous, realistic crowd trajectories with heterogeneous behaviors and interactions among individuals. We first design a crowd emitter model. To do this, we obtain spatial layouts from single input images, including a segmentation map, appearance map, population density map and population probability, prior to crowd generation. The emitter then continually places individuals on the timeline by assigning independent behavior characteristics such as agents' type, pace, and start/end positions using diffusion models. Next, our crowd simulator produces their long-term locomotions. To simulate diverse actions, it can augment their behaviors based on a Markov chain. As a result, our overall framework populates the scenes with heterogeneous crowd behaviors by alternating between the proposed emitter and simulator. Note that all the components in the proposed framework are user-controllable. Lastly, we propose a benchmark protocol to evaluate the realism and quality of the generated crowds in terms of the scene-level population dynamics and the individual-level trajectory accuracy. …
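Behavior augmentation with a Markov chain, as mentioned above, boils down to sampling a per-timestep behavior label from a transition matrix. The states and probabilities in this sketch are made-up placeholders used only to illustrate the mechanism, not values from the proposed simulator.

```python
# Minimal Markov-chain behavior sampler (states and transition matrix are placeholders).
import numpy as np

STATES = ["walk", "pause", "look_around", "hurry"]
TRANSITIONS = np.array([
    [0.80, 0.10, 0.05, 0.05],   # from walk
    [0.50, 0.30, 0.15, 0.05],   # from pause
    [0.60, 0.20, 0.15, 0.05],   # from look_around
    [0.70, 0.05, 0.05, 0.20],   # from hurry
])

def sample_behaviors(num_steps, start="walk", rng=None):
    """Sample a per-timestep behavior label for one agent."""
    rng = rng or np.random.default_rng()
    idx = STATES.index(start)
    seq = [STATES[idx]]
    for _ in range(num_steps - 1):
        idx = rng.choice(len(STATES), p=TRANSITIONS[idx])
        seq.append(STATES[idx])
    return seq
```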
Poster
Ziying Song · Caiyan Jia · Lin Liu · Hongyu Pan · Yongchang Zhang · Junming Wang · Xingyu Zhang · Shaoqing Xu · Lei Yang · Yadan Luo

[ ExHall D ]

Abstract
End-to-end autonomous driving frameworks enable seamless integration of perception and planning but often rely on one-shot trajectory prediction, which may lead to unstable control and vulnerability to occlusions in single-frame perception. To address this, we propose the Momentum-Aware Driving (MomAD) framework, which introduces trajectory momentum and perception momentum to stabilize and refine trajectory predictions. MomAD comprises two core components: (1) Topological Trajectory Matching (TTM) employs the Hausdorff distance to select the optimal planning query that aligns with prior paths to ensure coherence; (2) Momentum Planning Interactor (MPI) cross-attends the selected planning query with historical queries to expand static and dynamic perception files. This enriched query, in turn, helps regenerate long-horizon trajectories and reduce collision risks. To mitigate noise arising from dynamic environments and detection errors, we introduce robust instance denoising during training, enabling the planning model to focus on critical signals and improve its robustness. To quantify planning stability, we introduce a novel Trajectory Prediction Consistency (TPC) metric. Experiments on the nuScenes dataset demonstrate that MomAD achieves superior long-term consistency (3s) compared to SOTA methods. Furthermore, we curate a Turning-nuScenes validation set to evaluate model performance in challenging turning scenarios, where MomAD reduces the collision rate by 26% and TPC …
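The Hausdorff-distance matching in TTM can be illustrated by scoring each candidate planning trajectory against the prior path and keeping the closest one. The sketch below shows only this selection criterion; the candidate set, the cross-attention interactor, and the TPC metric are not reproduced, and the array shapes are assumptions.

```python
# Hedged sketch of Hausdorff-distance trajectory matching (selection criterion only).
import numpy as np

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two (T, 2) waypoint arrays."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)   # pairwise distances
    return max(d.min(axis=1).max(), d.min(axis=0).max())

def select_planning_query(candidates, prior_path):
    """candidates: (K, T, 2) proposed trajectories; prior_path: (T, 2) previous plan.
    Returns the index of the candidate most consistent with the prior path."""
    scores = [hausdorff(c, prior_path) for c in candidates]
    return int(np.argmin(scores))
```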
Poster
Shihao Wang · Zhiding Yu · Xiaohui Jiang · Shiyi Lan · Min Shi · Nadine Chang · Jan Kautz · Ying Li · Jose M. Alvarez

[ ExHall D ]

Abstract
The advances in vision-language models (VLMs) have led to a growing interest in autonomous driving to leverage their strong reasoning capabilities. However, extending these capabilities from 2D to full 3D understanding is crucial for real-world applications. To address this challenge, we propose OmniDrive, a holistic vision-language dataset that aligns agent models with 3D driving tasks through counterfactual reasoning. This approach enhances decision-making by evaluating potential scenarios and their outcomes, similar to human drivers considering alternative actions. Our counterfactual-based synthetic data annotation process generates large-scale, high-quality datasets, providing denser supervision signals that bridge planning trajectories and language-based reasoning. Further, we explore two advanced OmniDrive-Agent frameworks, namely Omni-L and Omni-Q, to assess the importance of vision-language alignment versus 3D perception, revealing critical insights into designing effective LLM-agents. Significant improvements on the DriveLM Q&A benchmark and nuScenes open-loop planning demonstrate the effectiveness of our dataset and methods.
Poster
Weizhen Wang · Chenda Duan · Zhenghao Peng · Yuxin Liu · Bolei Zhou

[ ExHall D ]

Abstract
Vision Language Models (VLMs) show promise as embodied agents in many mobility applications, yet there is a lack of a generalizable platform for evaluating their spatial reasoning and embodied scene understanding. We introduce MetaVQA, a comprehensive benchmark that assesses and enhances VLMs’ understanding of spatial relationships and embodied dynamics in driving scenes through Visual-Question-Answering (VQA) and closed-loop simulation. MetaVQA collects various question-answer pairs from diverse real-world traffic scenarios through Set-of-Mark prompting and top-down view ground-truth annotations of nuScenes and Waymo datasets to ensure real-world and object-centric instructions. We demonstrate that fine-tuning VLMs on the MetaVQA dataset improves their spatial reasoning and embodied scene understanding in safety-critical simulations. Code and data will be made available.
Poster
Kai Chen · Xiaodong Zhao · Yujie Huang · GuoyuFang · Xiao Song · Ruiping Wang · Ziyuan Wang

[ ExHall D ]

Abstract
The analysis and prediction of agent trajectories are crucial for decision-making processes in intelligent systems, with precise short-term trajectory forecasting being highly significant across a range of applications. Agents and their social interactions have been quantified and modeled by researchers from various perspectives; however, substantial limitations exist in current work due to the inherent high uncertainty of agent intentions and the complex higher-order influences among neighboring groups. SocialMOIF is proposed to tackle these challenges, concentrating on the higher-order intention interactions among neighboring groups while reinforcing the primary role of first-order intention interactions between neighbors and the target agent. This method develops a multi-order intention fusion model to achieve a more comprehensive understanding of both direct and indirect intention information. Within SocialMOIF, a trajectory distribution approximator is designed to guide the trajectories toward values that align more closely with the actual data, thereby enhancing model interpretability. Furthermore, a global trajectory optimizer is introduced to enable more accurate and efficient parallel predictions. With a novel loss function that accounts for distance and direction incorporated during training, experimental results demonstrate that the model outperforms previous state-of-the-art baselines across multiple metrics on both dynamic and static datasets.
Poster
Guillem Font Font · Antonio Rubio · Luis Ferraz · Antonio Agudo

[ ExHall D ]

Abstract
Multi-agent trajectory modeling has primarily focused on forecasting future states, often overlooking broader tasks like trajectory completion, which are crucial for real-world applications such as correcting tracking data. Existing methods also generally predict agents' states without offering any state-wise measure of uncertainty. Moreover, popular multi-modal sampling methods lack any error probability estimates for each generated scene under the same prior observations, making it difficult to rank the predictions during inference time. We introduce U2Diff, a unified diffusion model designed to handle trajectory completion while jointly providing state-wise uncertainty estimates. This uncertainty estimation is achieved by augmenting the simple denoising loss with the negative log-likelihood of the predicted noise and propagating latent space uncertainty to the real state space. Additionally, we incorporate a Rank Neural Network in post-processing to enable error probability estimation for each generated mode, demonstrating a strong correlation with the error relative to ground truth. Our method outperforms the state-of-the-art solutions in trajectory completion and forecasting across four challenging sports datasets (NBA, Basketball-U, Football-U, Soccer-U), highlighting the effectiveness of uncertainty and error probability estimation.
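Augmenting a denoising loss with a negative log-likelihood term, as described above, can be sketched by letting the network output a per-state log-variance alongside the predicted noise and adding a heteroscedastic Gaussian NLL to the usual MSE. The tensor shapes and the weighting factor below are assumptions, not the U2Diff formulation.

```python
# Hedged sketch: denoising MSE plus a heteroscedastic Gaussian NLL for per-state uncertainty.
import torch

def uncertainty_denoising_loss(eps_hat, logvar, eps, w_nll=1.0):
    """eps_hat, logvar, eps: (B, T, D) predicted noise, predicted log-variance, true noise."""
    mse = (eps_hat - eps).pow(2)
    # Negative log-likelihood of the true noise under N(eps_hat, exp(logvar))
    nll = 0.5 * (logvar + mse / logvar.exp())
    return mse.mean() + w_nll * nll.mean()
```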
Poster
Greg Heinrich · Mike Ranzinger · Danny Yin · Yao Lu · Jan Kautz · Bryan Catanzaro · Andrew Tao · Pavlo Molchanov

[ ExHall D ]

Abstract
Agglomerative models have recently emerged as a powerful approach to training vision foundation models, leveraging multi-teacher distillation from existing models such as CLIP, DINO, and SAM. This strategy enables the creation of robust models more efficiently, combining the strengths of individual teachers while significantly reducing computational and resource demands. In this paper, we thoroughly analyze state-of-the-art agglomerative models, identifying critical challenges including resolution mode shifts, teacher imbalance, weak initializations, idiosyncratic teacher artifacts, and an excessive number of output tokens. To address these issues, we propose several novel solutions: multi-resolution training, mosaic augmentation, and improved balancing of teacher loss functions. Specifically, in the context of Vision Language Models, we introduce a token compression technique to maintain high-resolution information within a fixed token count. We release our top-performing models, available in multiple scales (-B, -L, and -H), alongside code and pretrained weights, to support further research and development in the community.
Poster
Kwan-Yee Lin · Stella X. Yu

[ ExHall D ]

Abstract
Despite significant progress in humanoid robotics, research remains fragmented: low-level motor skill learning often disregards the influence of long-horizon goals on current movement and lacks situational awareness, while high-level navigation struggles to accommodate real-world constraints and adapt to the irregularity of local terrains, falling short in last-step feasibility. To bridge these gaps, we present LEGO-H, a universal learning framework that trains humanoid robots to become expert hikers on complex trails by developing and integrating skills across all levels, embracing physical embodiment through both visual perceptual awareness and body dynamics. At the heart of LEGO-H's designs is the harmonization of robots' visual perception, decision-making, and motor skill execution -- grounded in new perspectives on the Hierarchical Reinforcement Learning (HRL) framework and the knowledge transfer process of privileged learning. Our key innovations include: (1) TC-ViTs, a Temporal Vision Transformer variant tailored to HRL, framing local navigation as a sequential hallucination task and softly guiding locomotion policy learning. This design seamlessly grafts locomotion and goal navigation into a unified, end-to-end policy learning framework. (2) Hierarchical Loss Metric for Policy Distillation. To ensure the versatility of motor skills, LEGO-H harnesses the power of privileged learning. However, humanoid robots are highly articulated, where rationality of …
Poster
Jinliang Zheng · Jianxiong Li · Dongxiu Liu · Yinan Zheng · Zhihao Wang · Zhonghong Ou · Yu Liu · Jingjing Liu · Ya-Qin Zhang · Xianyuan Zhan

[ ExHall D ]

Abstract
Training on diverse, internet-scale data is a key factor in the success of recent large foundation models. Yet, using the same recipe for building embodied agents has faced noticeable difficulties. Despite the availability of many crowd-sourced embodied datasets, their action spaces often exhibit significant heterogeneity due to distinct physical embodiment and control interfaces for different robots, causing substantial challenges in developing embodied foundation models using cross-embodiment data. In this paper, we introduce UniAct, a new embodied foundation modeling framework operating in the Universal Action Space. Our learned universal actions capture the generic behaviors across diverse robots by exploiting their shared structural features, and enable enhanced cross-domain data utilization and cross-embodiment generalizations by eliminating the notorious heterogeneity. Moreover, the universal actions can be efficiently translated back to heterogeneous actionable commands by simply adding embodiment-specific details, from which fast adaptation to new robots becomes simple and straightforward. Our 0.5B instantiation of UniAct outperforms 14X larger SOTA embodied foundation models in extensive evaluations on various real-world and simulation robots, showcasing exceptional cross-embodiment control and adaptation capability, highlighting the crucial benefit of adopting universal actions.
Poster
Shibo Zhao · Sifan Zhou · Raphael Blanchard · Yuheng Qiu · Wenshan Wang · Sebastian Scherer

[ ExHall D ]

Abstract
Despite recent advances in deep learning, most existing learning-based IMU odometry methods are trained on specific datasets, lack generalization, and are prone to overfitting, which limits their real-world application. To address these challenges, we present Tartan IMU, a foundation model designed for generalizable, IMU-based state estimation across diverse robotic platforms. Our approach consists of three stages. First, a pre-trained foundation model leverages over 100 hours of multi-platform data to establish general motion knowledge, achieving a 36% improvement in ATE over specialized models. Second, to adapt to previously unseen tasks, we employ Low-Rank Adaptation (LoRA), allowing positive transfer with only 1.1M trainable parameters. Finally, to support robotics deployment, we introduce online test-time adaptation, which eliminates the boundary between training and testing, allowing the model to continuously "learn as it operates" at 200 FPS in real time.
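Low-Rank Adaptation wraps a frozen linear layer with a trainable low-rank update, which is why it adds so few parameters. The sketch below is a generic LoRA layer; the rank and scaling values are placeholders, and nothing here is taken from the Tartan IMU code.

```python
# Generic LoRA sketch: a frozen nn.Linear plus a trainable low-rank residual path.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # keep the pre-trained weights frozen
            p.requires_grad_(False)
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)    # start as a zero perturbation of the base layer
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```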
Poster
Shengyi Qian · Kaichun Mo · Valts Blukis · David Fouhey · Dieter Fox · Ankit Goyal

[ ExHall D ]

Abstract
Recent works have shown that visual pretraining on egocentric datasets using masked autoencoders (MAE) can improve generalization for downstream robotics tasks. However, these approaches pretrain only on 2D images, while many robotics applications require 3D scene understanding. In this work, we propose 3D-MVP, a novel approach for 3D multi-view pretraining using masked autoencoders. We leverage Robotic View Transformer (RVT), which uses a multi-view transformer to understand the 3D scene and predict gripper pose actions. We split RVT's multi-view transformer into visual encoder and action decoder, and pretrain its visual encoder using masked autoencoding on large-scale 3D datasets such as Objaverse. We evaluate 3D-MVP on a suite of virtual robot manipulation tasks and demonstrate improved performance over baselines. Our results suggest that 3D-aware pretraining is a promising approach to improve sample efficiency and generalization of vision-based robotic manipulation policies. We will release code and pretrained models for 3D-MVP to facilitate future research.
Poster
Haifeng Huang · Xinyi Chen · Yilun Chen · Hao Li · Xiaoshen Han · zehan wang · Tai Wang · Jiangmiao Pang · Zhou Zhao

[ ExHall D ]

Abstract
Recent advancements in robot manipulation have highlighted the potential of intermediate representations for improving policy generalization. In this work, we explore grounding masks as an effective intermediate representation, balancing two key advantages: (1) effective spatial guidance that specifies target objects and placement areas while also conveying information about object shape and size, enabling low-level policies to accurately interpret spatial information, and (2) broad generalization potential driven by large-scale vision-language models pretrained on diverse grounding datasets. We introduce RoboGround, a grounding-aware robotic policy that leverages grounding masks as an intermediate representation to guide policy networks in object manipulation tasks. To further explore and enhance generalization, we propose an automated pipeline for generating large-scale simulated data featuring a diverse set of objects and instructions. Extensive experiments show the value of our dataset and the effectiveness of grounding masks as intermediate guidance, significantly enhancing the generalization abilities of robot policies.
Poster
Jiaming Zhou · Teli Ma · Kun-Yu Lin · Zifan Wang · Ronghe Qiu · Junwei Liang

[ ExHall D ]

Abstract
Learning generalizable visual representations across different embodied environments is essential for effective robotic manipulation in real-world scenarios. However, the limited scale and diversity of robot demonstration data pose a significant challenge. Recent research has explored leveraging large-scale human activity data for pre-training, but the substantial morphological differences between humans and robots introduce a significant human-robot domain discrepancy, hindering the generalization of these models to downstream manipulation tasks. To overcome this, we propose a novel adaptation paradigm that leverages readily available paired human-robot video data to bridge the domain gap. Our method employs a human-robot contrastive alignment loss to align the semantics of human and robot videos, adapting pre-trained models to the robot domain in a parameter-efficient manner. Experiments on 20 simulated tasks across two different benchmarks and five real-world tasks demonstrate significant improvements. These results span both single-task and language-conditioned multi-task settings, evaluated using two different pre-trained models. Compared to existing pre-trained models, our adaptation method improves the average success rate by over 7% across multiple tasks on both simulated benchmarks and real-world evaluations. We will release the code and models.
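A paired human-robot contrastive alignment loss can be sketched as a symmetric InfoNCE objective over embeddings of corresponding human and robot clips: matching pairs are pulled together, mismatched pairs pushed apart. The form below (including the temperature) is an assumption about how such a loss is typically written, not the paper's exact objective.

```python
# Hedged sketch of a symmetric InfoNCE-style human-robot alignment loss.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(human_emb, robot_emb, temperature=0.07):
    """human_emb, robot_emb: (B, D) embeddings of paired human/robot video clips."""
    h = F.normalize(human_emb, dim=-1)
    r = F.normalize(robot_emb, dim=-1)
    logits = h @ r.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(h.size(0), device=h.device)
    # matching pairs sit on the diagonal; pull them together, push the rest apart
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```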
Poster
Quanyuan Ruan · Jiabao Lei · Wenhao Yuan · Yanglin Zhang · Dekun Lu · Guiliang Liu · Kui Jia

[ ExHall D ]

Abstract
Differentiable rendering has gained significant attention in the field of robotics, with differentiable robot rendering emerging as an effective paradigm for learning robotic actions from image-space supervision. However, the lack of physical world perception in this approach may lead to potential collisions during action optimization. In this work, we introduce a novel improvement on previous efforts by incorporating physical awareness of collisions through the learning of a neural robotic collision classifier. This enables the optimization of actions that avoid collisions with static, non-interactable environments as well as the robot itself. To facilitate effective gradient optimization with the classifier, we identify the underlying issue and propose leveraging Eikonal regularization to ensure consistent gradients for optimization. Our solution can be seamlessly integrated into existing differentiable robot rendering frameworks, utilizing gradients for optimization and providing a foundation for future applications of differentiable rendering in robotics with improved reliability of interactions with the physical world. Both qualitative and quantitative experiments demonstrate the necessity and effectiveness of our method compared to previous solutions.
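Eikonal regularization, as referenced above, generally penalizes the deviation of a field's gradient norm from one at sampled inputs, which keeps gradients well-behaved for optimization. The sketch shows the generic form for an arbitrary differentiable `field` module; the sampling strategy and how the term attaches to the collision classifier are not specified here.

```python
# Generic Eikonal regularizer sketch: push the gradient norm of a neural field toward 1.
import torch

def eikonal_regularizer(field, x):
    """field: a differentiable module mapping (B, D) points/configurations to (B, 1) values."""
    x = x.clone().requires_grad_(True)
    y = field(x)
    grad = torch.autograd.grad(y.sum(), x, create_graph=True)[0]   # (B, D) input gradients
    return ((grad.norm(dim=-1) - 1.0) ** 2).mean()
```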
Poster
Yuanqi Yao · Siao Liu · Haoming Song · Delin Qu · Qizhi Chen · Yan Ding · Bin Zhao · Zhigang Wang · Dong Wang · Xuelong Li

[ ExHall D ]

Abstract
Learning a generalist robot that can effectively leverage prior knowledge for continuous skill acquisition remains significantly challenging. Despite the success of experience replay and parameter-efficient methods in maintaining knowledge across skills, naively applying these methods fails to leverage the shared primitives between skills. To tackle these issues, we propose Primitive Prompt Learning (PPL) to achieve lifelong robot manipulation via reusable and extensible primitives. Within our two-stage learning scheme, we first learn a set of primitive prompts to model primitives during a multi-skill pre-training stage, where motion-aware prompts are learned to capture semantic and motion primitives shared across different skills. Second, when acquiring new skills over the lifelong span, new prompts are concatenated and optimized with frozen pretrained prompts, boosting learning via knowledge transfer from old skills to new ones. For evaluation, we construct a large-scale skill dataset and conduct extensive experiments in both simulation and real-world tasks, demonstrating PPL's superior performance over state-of-the-art methods. Code and dataset will be released upon acceptance.
Poster
Yiming Zhong · Qi Jiang · Jingyi Yu · Yuexin Ma

[ ExHall D ]

Abstract
A dexterous hand capable of grasping any object is essential for the development of general-purpose embodied intelligent robots. However, due to the high degree of freedom in dexterous hands and the vast diversity of objects, generating high-quality, usable grasping poses in a robust manner is a significant challenge. In this paper, we introduce DexGrasp Anything, a method that effectively integrates physical constraints into both the training and sampling phases of a diffusion-based generative model, achieving state-of-the-art performance across nearly all open datasets. Additionally, we present a new dexterous grasping dataset containing over 3.4 million diverse grasping poses for more than 15k different objects, demonstrating its potential to advance universal dexterous grasping. The code of our method and our dataset will be publicly released soon.
Poster
Yuxing Long · Jiyao Zhang · Mingjie Pan · Tianshu Wu · Taewhan Kim · Hao Dong

[ ExHall D ]

Abstract
Correct use of electrical appliances has significantly improved human life quality. Unlike simple tools that can be manipulated with common sense, different parts of electrical appliances have specific functions defined by manufacturers. If we want a robot to heat bread in a microwave, we should enable it to review the microwave’s manual first. From the manual, it can learn about component functions, interaction methods, and representative task steps for the appliance. However, previous manual-related works remain limited to question-answering tasks, while existing manipulation research ignores the manual's important role and fails to comprehend multi-page manuals. In this paper, we propose CheckManual, the first manual-based appliance manipulation benchmark. Specifically, we design a large-model-assisted, human-revised data generation pipeline to create manuals based on CAD appliance models. With these manuals, we establish novel manual-based manipulation challenges, metrics, and simulator environments for model performance evaluation. Furthermore, we propose the first manual-based manipulation planning model, ManualPlan, to set up a group of baselines for the CheckManual benchmark.
Poster
Sai Kumar Dwivedi · Dimitrije Antić · Shashank Tripathi · Omid Taheri · Cordelia Schmid · Michael J. Black · Dimitrios Tzionas

[ ExHall D ]

Abstract
Estimating the 3D pose and shape of interacting humans and objects from single in-the-wild images is important for mixed reality and robotics. This is challenging due to occlusions, depth ambiguities, and widely varying object shapes. Existing work tackles these challenges by exploiting surface contact points on the body and object and using these to guide 3D reconstruction. Unfortunately, obtaining 3D contact annotations requires either expensive 3D ground truth or time-consuming manual labeling. Consequently, obtaining training data at scale is a challenge. We tackle this by developing a novel model called InteractVLM that harnesses the broad visual knowledge of large Visual-Language Models (VLMs). The problem is, however, that these large models do not directly “understand” 3D human-object contact. To address this, we exploit existing small datasets of 3D human-object interaction to fine-tune large models to understand contact. However, this is non-trivial, as such models reason “only” in 2D, while contact is inherently 3D. Thus, we introduce a novel “Render-Localize-Lift” module that: (1) embeds 3D body and object surfaces in 2D space via multi-view rendering, (2) trains a novel multi-view localization model (MV-Loc) to infer contacts in 2D, and (3) lifts these to 3D. This lets InteractVLM infer 3D contacts for both …
Poster
Yujie Liang · Xiaobin Hu · Boyuan Jiang · Donghao Luo · Xu Peng · Kai WU · Chengming Xu · Wenhui Han · Taisong Jin · Chengjie Wang · Rongrong Ji

[ ExHall D ]

Abstract
Although diffusion-based image virtual try-on has made considerable progress, emerging approaches still struggle to effectively address the issue of hand occlusion (i.e., clothing regions occluded by the hand), leading to a notable degradation of try-on performance. To tackle this issue, which widely exists in real-world scenarios, we propose VTON-HandFit, leveraging the power of hand priors to reconstruct the appearance and structure for hand-occlusion cases. First, we tailor a Handpose Aggregation Net that uses a ControlNet-based structure to explicitly and adaptively encode the global hand and pose priors. Besides, to fully exploit the hand-related structure and appearance information, we propose a Hand-feature Disentanglement Embedding module to disentangle the hand priors into hand structure-parametric and visual-appearance features, and customize a masked cross-attention for further decoupled feature embedding. Lastly, we customize a hand-canny constraint loss to better learn the structural edge knowledge from the hand template of the model image. VTON-HandFit outperforms the baselines in qualitative and quantitative evaluations on the public dataset and our self-collected hand-occlusion Handfit-3K dataset, particularly for arbitrary hand-pose occlusion cases in real-world scenarios. The code and dataset will be available.
Poster
Kaixin Fan · Pengfei Ren · Jingyu Wang · Haifeng Sun · Qi Qi · Zirui Zhuang · Jianxin Liao

[ ExHall D ]

Abstract
3D hand reconstruction is essential in non-contact human-computer interaction applications, but existing methods struggle with low-resolution images, which occur in slightly distant interactive scenes. Leveraging temporal information can mitigate the limitations of individual low-resolution images that lack detailed appearance information, thereby enhancing the robustness and accuracy of hand reconstruction. Existing temporal methods typically use joint features to represent temporal information, avoiding interference from redundant background information. However, joint features excessively disregard the spatial context of visual features, limiting hand reconstruction accuracy. We propose to integrate temporal joint features with visual features to construct a robust low-resolution visual representation. We introduce Triplane Features, a dense representation with 3D spatial awareness, to bridge the gap between the joint features and visual features that are misaligned in terms of representation form and semantics. Triplane Features are obtained by orthogonally projecting the joint features, embedding hand structure information into the 3D spatial context. Furthermore, we compress the spatial information of the three planes into a 2D dense feature through Spatial-Aware Fusion to enhance the visual features. By using enhanced visual features enriched with temporal information for hand reconstruction, our method achieves competitive performance at much lower resolutions compared to state-of-the-art methods operating at high …
Poster
Li Zhang · mingliang xu · Jianan Wang · Qiaojun Yu · Lixin Yang · Yonglu Li · Cewu Lu · RujingWang · Liu Liu

[ ExHall D ]

Abstract
Garments are common in daily life and are important for the embodied intelligence community. Current category-level garment pose tracking works focus on predicting point-wise canonical correspondences and learning a shape deformation in point cloud sequences. In this paper, motivated by the 2D warping space and shape priors, we propose GaPT-DAR, a novel category-level Garment Pose Tracking framework with integrated 2D Deformation And 3D Reconstruction functions, which fully utilizes 3D-2D projection and 2D-3D reconstruction to transform 3D point-wise learning into 2D warping deformation learning. Specifically, GaPT-DAR first builds a Voting-based Project module that learns the optimal 3D-2D projection plane for maintaining the maximum orthogonal entropy during point projection. Next, a Garments Deformation module is designed in 2D space to explicitly model the garment warping procedure with deformation parameters. Finally, we build a Depth Reconstruction module to recover the 2D images into a 3D warp field. We provide extensive experiments on the VR-Folding dataset to evaluate our GaPT-DAR and the results show obvious improvements on most of the metrics compared to the state of the art (i.e., GarmentNets and GarmentTracking). Codes will be made publicly available.
Poster
Dong Li · Wenqi Zhong · Wei Yu · Yingwei Pan · Dingwen Zhang · Ting Yao · Junwei Han · Tao Mei

[ ExHall D ]

Abstract
Video virtual try-on aims to seamlessly dress a subject in a video with a specific garment. The primary challenge involves preserving the visual authenticity of the garment while dynamically adapting to the pose and physique of the subject. While existing methods have predominantly focused on image-based virtual try-on, extending these techniques directly to videos often results in temporal inconsistencies. Most current video virtual try-on approaches alleviate this challenge by incorporating temporal modules, yet still overlook the critical spatiotemporal pose interactions between human and garment. Effective pose interactions in videos should not only consider spatial alignment between human and garment poses in each frame but also account for the temporal dynamics of human poses throughout the entire video. With such motivation, we propose a new framework, namely Dynamic Pose Interaction Diffusion Models (DPIDM), to leverage diffusion models to delve into dynamic pose interactions for video virtual try-on. Technically, DPIDM introduces a skeleton-based pose adapter to integrate synchronized human and garment poses into the denoising network. A hierarchical attention module is then exquisitely designed to model intra-frame human-garment pose interactions and long-term human pose dynamics across frames through pose-aware spatial and temporal attention mechanisms. Moreover, DPIDM capitalizes on a temporal regularized attention …
Poster
Shuhang Chen · Xianliang Huang · Zhizhou Zhong · Jihong Guan · Shuigeng Zhou

[ ExHall D ]

Abstract
3D anthropometric measurements have a variety of applications in industrial design and architecture (e.g. vehicle seating and cockpits), clothing (e.g. military uniforms), ergonomics (e.g. seating), and medicine (e.g. nutrition and diabetes). Therefore, there is a need for systems that can accurately extract human body measurements. Current methods estimate human body measurements from 3D scans, resulting in a heavy data collection burden. Moreover, minor variations in camera angle, distance, and body postures may significantly affect the measurement accuracy. In response to these challenges, this paper introduces a focused human body model for accurately extracting anthropometric measurements. Concretely, we design a Bypass Network based on CNN and ResNet architectures, which augments the frozen backbone SMPLer-X with additional feature extraction capabilities. On the other hand, to boost the efficiency of training a large-scale model, we integrate a dynamic loss function that automatically recalibrates the weights to make the network focus on targeted anthropometric parts. In addition, we construct a multimodal body measurement benchmark dataset consisting of depth, point clouds, mesh and corresponding body measurements to support model evaluation and future anthropometric measurement research. Extensive experiments on both open-source and the proposed human body datasets demonstrate the superiority of our approach over existing …
Poster
Jian Wang · Rishabh Dabral · Diogo Luvizon · Zhe Cao · Lingjie Liu · Thabo Beeler · Christian Theobalt

[ ExHall D ]

Abstract
This work focuses on tracking and understanding human motion using consumer wearable devices, such as VR/AR headsets, smart glasses, cellphones, and smartwatches. These devices provide diverse, multi-modal sensor inputs, including egocentric images, and 1-3 sparse IMU sensors in varied combinations. Motion descriptions can also accompany these signals. The diverse input modalities and their intermittent availability pose challenges for consistent motion capture and understanding. In this work, we present Ego4o (o for omni), a new framework for simultaneous human motion capture and understanding from multi-modal egocentric inputs. This method maintains performance with partial inputs while achieving better results when multiple modalities are combined. First, the IMU sensor inputs, the optional egocentric image, and text description of human motion are encoded into the latent space of a motion VQ-VAE. Next, the latent vectors are sent to the VQ-VAE decoder and optimized to track human motion. When motion descriptions are unavailable, the latent vectors can be input into a multi-modal LLM to generate human motion descriptions, which can further enhance motion capture accuracy. Quantitative and qualitative evaluations demonstrate the effectiveness of our method in predicting accurate human motion and high-quality motion descriptions.
Poster
Reyhaneh Hosseininejad · Megh Shukla · Saeed Saadatnejad · Mathieu Salzmann · Alex Alahi

[ ExHall D ]

Abstract
Human pose forecasting is inherently multimodal since multiple future motions exist for an observed pose sequence. However, learning this multimodality is challenging since the task is ill-posed. To address this issue, we propose an alternative paradigm to make the task well-posed. Additionally, while state-of-the-art methods predict multimodality, this is attained through a large volume of predictions obtained by oversampling. However, such an approach glosses over key questions: (1) Can we capture multimodality by efficiently sampling a smaller number of predictions? (2) Subsequently, which of the predicted futures is more likely for an observed pose sequence? We address these questions with MotionMap, a simple yet effective heatmap based representation for multimodality. We extend heatmaps to represent a spatial distribution over the space of all possible motions, where different local maxima correspond to different forecasts for a given observation. Not only can MotionMap capture a variable number of modes per observation, but it also provides confidence measures for different modes. Further, MotionMap captures rare modes that are non-trivial to evaluate yet critical for robustness. Finally, MotionMap allows us to introduce the notion of uncertainty and controllability over the forecasted pose sequence. We support our claims through multiple qualitative and quantitative experiments using …
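As a rough illustration of reading modes off such a heatmap, the toy sketch below (an assumption-laden example, not the paper's code) treats local maxima of a 2D score map over a discretized motion space as distinct forecasts, each with a confidence.

```python
# Minimal sketch: extract the local maxima of a 2D heatmap over a motion space
# as distinct forecast modes with confidences; names and thresholds are assumed.
import numpy as np
from scipy.ndimage import maximum_filter

def heatmap_modes(heatmap, min_conf=0.1, window=5):
    """heatmap: (H, W) non-negative scores over a discretized motion space."""
    peaks = (heatmap == maximum_filter(heatmap, size=window)) & (heatmap > min_conf)
    ys, xs = np.nonzero(peaks)
    conf = heatmap[ys, xs]
    order = np.argsort(-conf)
    return [(int(ys[i]), int(xs[i]), float(conf[i])) for i in order]

hm = np.zeros((64, 64))
hm[10, 20], hm[40, 50] = 0.9, 0.6      # two synthetic modes
print(heatmap_modes(hm))                # [(10, 20, 0.9), (40, 50, 0.6)]
```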
Poster
Bin Ji · Ye Pan · zhimeng Liu · Shuai Tan · Xiaogang Jin · Xiaokang Yang

[ ExHall D ]

Abstract
Numerous researches on real-time motion generation primarily focus on kinematic aspects, often resulting in physically implausible outcomes. In this paper, we present POMP (P_hysics-cO_nsistent Human M_otion P_rior through Phase Manifolds"), a novel kinematics-based framework that synthesizes physically consistent motions by leveraging phase manifolds to align motion priors with physics constraints. POMP operates as a frame-by-frame autoregressive model with three core components: a diffusion-based kinematic module, a simulation-based dynamic module, and a phase encoding module. At each timestep, the kinematic module generates an initial target pose, which is subsequently refined by the dynamic module to simulate human-environment interactions. Although the physical simulation ensures adherence to physical laws, it may compromise the kinematic rationality of the posture. Consequently, directly using the simulated result for subsequent frame prediction may lead to cumulative errors. To address this, the phase encoding module performs semantic alignment in the phase manifold. Moreover, we present a pipeline in Unity for generating terrain maps and capturing full-body motion impulses from existing motion capture (MoCap) data. The collected terrain topology and motion impulse data facilitate the training of POMP, enabling it to robustly respond to underlying contactforces and applied dynamics. Extensive evaluations demonstrate the efficacy of POMP across various contexts, …
Poster
Zhanbo Huang · Xiaoming Liu · Yu Kong

[ ExHall D ]

Abstract
In this paper, we propose H-MoRe, a novel pipeline for learning precise human-centric motion representation. Our approach dynamically preserves relevant human motion while filtering out background movement. Notably, unlike previous methods relying on fully supervised learning from synthetic data, H-MoRe learns directly from real-world scenarios in a self-supervised manner, incorporating both human pose and body shape information. Inspired by kinematics, H-MoRe represents absolute and relative movements of each body point in a matrix format that captures nuanced motion details, termed world-local flows. H-MoRe offers refined insights into human motion, which can be integrated seamlessly into various action-related applications. Experimental results demonstrate that H-MoRe brings substantial improvements across various downstream tasks, including gait recognition (CL@R1: +16.01%), action recognition (Acc@1: +8.92%), and video generation (FVD: -67.07%). Additionally, H-MoRe exhibits high inference efficiency (34 fps), making it suitable for most real-time scenarios. Models and code will be released upon publication.
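A minimal sketch of the world-local idea, under the assumption that body points are already tracked between two frames: absolute displacements alongside displacements relative to a reference point. All names and shapes are illustrative, not the released model.

```python
# Minimal sketch: "world" (absolute) and "local" (relative to a reference body
# point) displacements of tracked body points between two frames.
import numpy as np

def world_local_flow(points_t, points_t1, ref_idx=0):
    """points_t, points_t1: (P, 2) body-point positions at consecutive frames."""
    world = points_t1 - points_t                 # absolute motion per point
    ref_motion = world[ref_idx]                  # e.g. a root/pelvis point as reference
    local = world - ref_motion                   # motion relative to the reference
    return np.stack([world, local], axis=0)      # (2, P, 2) matrix-style layout

flows = world_local_flow(np.zeros((17, 2)), np.ones((17, 2)))
print(flows.shape)   # (2, 17, 2)
```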
Poster
Mengqing Xue · Yifei Liu · Ling Guo · Shaoli Huang · Changxing Ding

[ ExHall D ]

Abstract
Human-object interaction (HOI) synthesis is crucial for creating immersive and realistic experiences for applications such as virtual reality. Existing methods often rely on simplified object representations, such as the object's centroid or the nearest point to a human, to achieve physically plausible motions. However, these approaches may overlook geometric complexity, resulting in suboptimal interaction fidelity. To address this limitation, we introduce ROG, a novel diffusion-based framework that models the spatiotemporal relationships inherent in HOIs with rich geometric detail. For efficient object representation, we select boundary-focused and fine-detail key points from the object mesh, ensuring a comprehensive depiction of the object's geometry. This representation is used to construct an interactive distance field (IDF), capturing the robust HOI dynamics. Furthermore, we develop a diffusion-based relation model that integrates spatial and temporal attention mechanisms, enabling a better understanding of intricate HOI relationships. This relation model refines the generated motion's IDF, guiding the motion generation process to produce relation-aware and semantically aligned movements. Experimental evaluations demonstrate that ROG significantly outperforms state-of-the-art methods in the realism and semantic accuracy of synthesized HOIs. This paper’s code will be released.
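The interactive distance field can be pictured as pairwise joint-to-keypoint distances over time. The sketch below is one plausible, simplified formulation for illustration, not the paper's exact definition.

```python
# Minimal sketch: an "interactive distance field" as pairwise distances between
# human joints and object key points over a motion clip; shapes are assumed.
import numpy as np

def interaction_distance_field(joints, obj_points):
    """joints: (T, J, 3) human joints; obj_points: (T, K, 3) object key points."""
    diff = joints[:, :, None, :] - obj_points[:, None, :, :]   # (T, J, K, 3)
    return np.linalg.norm(diff, axis=-1)                       # (T, J, K) distances

idf = interaction_distance_field(np.random.rand(30, 22, 3), np.random.rand(30, 64, 3))
print(idf.shape)   # (30, 22, 64)
```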
Poster
Hua Yu · Weiming Liu · Gui Xu · Yaqing Hou · Yew-Soon Ong · Qiang Zhang

[ ExHall D ]

Abstract
Human motion synthesis aims to generate plausible human motion sequences, which has raised widespread attention in computer animation. Recent score-based generative models (SGMs) have demonstrated impressive results on this task. However, their training process involves complex curvature trajectories, leading to unstable training. In this paper, we propose a Deterministic-to-Stochastic Diverse Latent Feature Mapping (DSDFM) method for human motion synthesis. DSDFM consists of two stages. The first human motion reconstruction stage aims to learn the latent space distribution of human motions. The second diverse motion generation stage aims to build connections between the Gaussian distribution and the latent space distribution of human motions, thereby enhancing the diversity and accuracy of the generated human motions. This stage is achieved by the designed deterministic feature mapping procedure with DerODE and the stochastic diverse output generation procedure with DivSDE. DSDFM is easy to train compared to previous SGMs-based methods and can enhance diversity without introducing additional training parameters. Through qualitative and quantitative experiments, DSDFM achieves state-of-the-art results surpassing the latest methods, validating its superiority in human motion synthesis.
Poster
Nan Jiang · Hongjie Li · Ziye Yuan · Zimo He · Yixin Chen · Tengyu Liu · Yixin Zhu · Siyuan Huang

[ ExHall D ]

Abstract
Most text-guided motion editing methods cannot generate versatile motions as they rely on limited training triplets of original motion, edited motion, and editing instruction, which fail to cover the vast combinations of possible edits. To address this challenge, we introduce MotionCutMix, a training technique that dynamically composes a huge amount of training triplets by blending body part motions based on editing instructions. However, this technique introduces increased randomness and potential body part incoordination in the generated motions. To model such rich distribution, we propose MotionReFit, an auto-regressive diffusion model with a motion coordinator. The auto-regressive strategy reduces the window size to facilitate convergence, while the motion coordinator mitigates the artifacts of motion composition. Our model handles both spatial and temporal edits without leveraging extra motion information or LLMs. We further contribute newly captured and re-annotated datasets for multiple motion editing tasks. Experimental results demonstrate that MotionReFit excels in text-guided motion edits, closely adhering to textual directives. Furthermore, ablation studies reveal that the incorporation of MotionCutMix during training enhances the generalizability of the trained model, and does not significantly hinder training convergence.
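A minimal sketch of the body-part blending behind a CutMix-style motion composition, assuming the two motions share a joint layout; the joint indices and shapes are hypothetical, not the paper's data format.

```python
# Minimal sketch: compose a training motion by swapping the joints of an edited
# body part from one sequence into another, keeping the remaining joints.
import numpy as np

def blend_body_parts(motion_a, motion_b, part_joints):
    """motion_a/b: (T, J, D) sequences; part_joints: indices of the edited body part."""
    out = motion_a.copy()
    out[:, part_joints, :] = motion_b[:, part_joints, :]   # e.g. right arm from motion_b
    return out

RIGHT_ARM = [14, 16, 18, 20]                               # illustrative joint indices
mixed = blend_body_parts(np.zeros((60, 22, 3)), np.ones((60, 22, 3)), RIGHT_ARM)
print(mixed[:, RIGHT_ARM].mean(), mixed.mean())            # 1.0 vs. a smaller overall mean
```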
Poster
Haonan Han · Xiangzuo Wu · Huan Liao · Zunnan Xu · Zhongyuan Hu · Ronghui Li · Yachao Zhang · Xiu Li

[ ExHall D ]

Abstract
Recently, text-to-motion models have opened new possibilities for creating realistic human motion with greater efficiency and flexibility. However, aligning motion generation with event-level textual descriptions presents unique challenges due to the complex, nuanced relationship between textual prompts and desired motion outcomes. To address this issue, we introduce AToM, a framework that enhances the alignment between generated motion and text prompts by leveraging reward signals from GPT-4Vision. AToM comprises three main stages: Firstly, we construct a dataset MotionPrefer that pairs three types of event-level textual prompts with generated motions, covering the integrity, temporal relationship, and frequency of motion. Secondly, we design a paradigm that utilizes GPT-4Vision for detailed motion annotation, including visual data formatting, task-specific instructions and scoring rules for each sub-task. Finally, we fine-tune an existing text-to-motion model using reinforcement learning guided by this paradigm. Experimental results demonstrate that AToM significantly improves the event-level alignment quality of text-to-motion generation.
Poster
Boeun Kim · Hea In Jeong · JungHoon Sung · Yihua Cheng · Jeongmin Lee · Ju Yong Chang · Sang-Il Choi · YOUNGGEUN CHOI · Saim Shin · Jungho Kim · Hyung Jin Chang

[ ExHall D ]

Abstract
This paper introduces Motion Personalization, a new task that generates personalized motions aligned with text descriptions using several basic motions containing Persona. To support this novel task, we introduce a new large-scale motion dataset called PerMo (PersonaMotion), which captures the unique personas of multiple actors. We also propose a multi-modal finetuning method of a pretrained motion diffusion model called PersonaBooth. PersonaBooth addresses two main challenges: i) a significant distribution gap between the persona-focused PerMo dataset and the pretraining datasets, which lack persona-specific data, and ii) the difficulty of capturing a consistent persona from motions that vary in content (action type). To tackle the dataset distribution gap, we introduce a persona token to accept new persona features and perform multi-modal adaptation for both text and visuals during finetuning. To capture a consistent persona, we incorporate a contrastive learning technique to enhance intra-cohesion among samples with the same persona. Furthermore, we introduce a context-aware fusion mechanism to maximize the integration of persona cues from multiple input motions. PersonaBooth outperforms state-of-the-art motion style transfer methods, establishing a new benchmark for motion personalization.
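The intra-cohesion objective can be approximated with a standard supervised-contrastive loss over persona labels. The sketch below shows that generic form only and is not claimed to match PersonaBooth's exact loss; tensor shapes and the temperature are assumptions.

```python
# Minimal sketch: supervised-contrastive loss that pulls together embeddings
# sharing a persona label and pushes apart the rest.
import torch
import torch.nn.functional as F

def persona_contrastive_loss(feats, persona_ids, temperature=0.1):
    """feats: (N, D) embeddings; persona_ids: (N,) integer persona labels."""
    feats = F.normalize(feats, dim=1)
    sim = feats @ feats.t() / temperature                          # (N, N) similarities
    n = feats.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=feats.device)
    pos_mask = (persona_ids[:, None] == persona_ids[None, :]) & ~self_mask
    sim = sim.masked_fill(self_mask, float('-inf'))                # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_log_prob = torch.where(pos_mask, log_prob, torch.zeros_like(log_prob))
    loss = -pos_log_prob.sum(1) / pos_mask.sum(1).clamp(min=1)
    return loss[pos_mask.any(1)].mean()                            # skip anchors with no positive

feats = torch.randn(8, 128)
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
print(persona_contrastive_loss(feats, labels).item())
```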
Poster
Hsin-Ping Huang · Yang Zhou · Jui-Hsien Wang · Difan Liu · Feng Liu · Ming-Hsuan Yang · Zhan Xu

[ ExHall D ]

Abstract
Generating realistic human videos remains a challenging task, with the most effective methods currently relying on a human motion sequence as a control signal. Existing approaches often use existing motion extracted from other videos, which restricts applications to specific motion types and global scene matching. We propose Move-in-2D, a novel approach to generate human motion sequences conditioned on a scene image, allowing for diverse motion that adapts to different scenes. Our approach utilizes a diffusion model that accepts both a scene image and text prompt as inputs, producing a motion sequence tailored to the scene. To train this model, we collect a large-scale video dataset featuring single-human activities, annotating each video with the corresponding human motion as the target output. Experiments demonstrate that our method effectively predicts human motion that aligns with the scene image after projection. Furthermore, we show that the generated motion sequence improves human motion quality in video synthesis tasks.
Poster
longbin ji · Lei Zhong · Pengfei Wei · Changjian Li

[ ExHall D ]

Abstract
Recent advancements in trajectory-guided video generation have achieved notable progress. However, existing models still face challenges in generating object motions with potentially changing 6D poses under large-angle rotations, due to limited 3D understanding. To address this problem, we introduce PoseTraj, an open-domain, Pose-Aware video dragging model for reliable 3D-aligned animations from 2D trajectories. Our method incorporates a novel Two-Stage Pose-Aware Pretraining framework, improving 3D comprehension across diverse trajectories. Specifically, we 1) construct a large-scale synthetic dataset containing 10k videos of objects following rotational trajectories and 2) enhance the model's perception of object pose changes by generating 3D bounding boxes as intermediate supervision signals. Following this, we fine-tune the trajectory-controlling module on open-domain videos, applying an additional camera-disentanglement module to further refine motion accuracy. Experiments on various benchmark scenarios demonstrate that PoseTraj not only excels in 3D Pose-Aligned dragging for rotational scenarios but also outperforms existing baselines in trajectory accuracy and video quality.
Poster
Junhyeong Cho · Kim Youwang · Hunmin Yang · Tae-Hyun Oh

[ ExHall D ]

Abstract
Recent monocular 3D shape reconstruction methods have shown promising zero-shot results on object-segmented images without any occlusions. However, their effectiveness is significantly compromised in real-world settings, due to imperfect object segmentation by off-the-shelf models and the prevalence of occlusions. To address these issues, we propose a unified regression model that integrates segmentation and reconstruction, specifically designed for occlusion-aware 3D shape reconstruction. To facilitate its reconstruction in the wild, we also introduce a scalable data synthesis pipeline that simulates a wide range of variations in objects, occluders, and backgrounds. Training on our synthesized data enables the proposed model to achieve state-of-the-art zero-shot results on real-world images, using significantly fewer model parameters than competing approaches. Our code and data would be publicly available.
Poster
Yiqing Liang · Abhishek Badki · Hang Su · James Tompkin · Orazio Gallo

[ ExHall D ]

Abstract
Foundation models have shown generalization across datasets for many low-level vision tasks, like depth estimation, but no such model exists for scene flow. Even though scene flow has wide potential use, it is not used in practice because current predictive models do not generalize well. We solve three challenges to fix this problem. First, we create a method that jointly estimates geometry and motion for accurate prediction. Second, we alleviate scene flow data scarcity with a data recipe that affords us 1M annotated training samples across diverse synthetic scenes. Third, we evaluate different parameterizations for scene flow prediction and identify a natural and effective parameterization. Our resulting model outperforms existing methods as well as baselines built on foundation models in terms of 3D end-point error, and shows zero-shot generalization to the casually captured videos from DAVIS and the robotic manipulation scenes from RoboTAP. Overall, this makes scene flow prediction significantly more practical for in-the-wild use.
Poster
Yung-Hao Yang · Zitang Sun · Taiki Fukiage · Shin'ya Nishida

[ ExHall D ]

Abstract
As AI models are increasingly integrated into applications involving human interaction, understanding the alignment between human perception and machine vision has become essential. One example is the estimation of visual motion (optical flow) in dynamic applications such as driving assistance. While there are numerous optical flow datasets and benchmarks with ground truth information, human-perceived flow in natural scenes remains underexplored. We introduce HuPerFlow—a benchmark for human-perceived flow, measured at 2,400 locations across ten optical flow datasets, with ~38,400 response vectors collected through online psychophysical experiments. Our data demonstrate that human-perceived flow aligns with ground truth in spatiotemporally smooth locations while also showing systematic errors influenced by various environmental properties. Additionally, we evaluated several optical flow algorithms against human-perceived flow, uncovering both similarities and unique aspects of human perception in complex natural scenes. HuPerFlow is the first large-scale human-perceived flow benchmark for alignment between computer vision models and human perception, as well as for scientific exploration of human motion perception in natural scenes. The HuPerFlow benchmark will be available online upon acceptance.
Poster
Zihang Lai · Andrea Vedaldi

[ ExHall D ]

Abstract
Temporal consistency is critical in video prediction. Traditional methods, such as temporal attention mechanisms and 3D convolutions, often struggle with significant object movements and fail to capture long-range temporal dependencies in dynamic scenes. To address these limitations, we propose the Tracktention Layer, a novel architectural component that explicitly integrates motion information using point tracks — sequences of corresponding points across frames. By incorporating these motion cues, the Tracktention Layer enhances temporal alignment and effectively handles complex object motions, maintaining consistent feature representations over time. Our approach is computationally efficient and can be seamlessly integrated into existing models, such as Vision Transformers, with minimal modification. Empirical evaluations on standard video estimation benchmarks demonstrate that models augmented with the Tracktention Layer exhibit significantly improved temporal consistency compared to baseline models.
Poster
Edward LOO · Tianyu HUANG · Peng Li · Zhiyang Dou · Cheng Lin · Zhiming Cui · Zhen Dong · Sai-Kit Yeung · Wenping Wang · Yuan Liu

[ ExHall D ]

Abstract
Recent developments in monocular depth estimation methods enable high-quality depth estimation of single-view images but fail to estimate consistent video depth across different frames. Recent works address this problem by applying a video diffusion model to generate video depth conditioned on the input video, which is training-expensive and can only produce scale-invariant depth values without camera poses. In this paper, we propose a novel video-depth estimation method called Align3R to estimate temporally consistent depth maps for a dynamic video. Our key idea is to utilize the recent DUSt3R model to align estimated monocular depth maps of different timesteps. First, we fine-tune the DUSt3R model with additional estimated monocular depth as inputs for the dynamic scenes. Then, we apply optimization to reconstruct both depth maps and camera poses. Extensive experiments demonstrate that Align3R estimates consistent video depth and camera poses for a monocular video with superior performance compared to baseline methods.
Poster
Sili Chen · Hengkai Guo · Shengnan Zhu · Feihu Zhang · Zilong Huang · Jiashi Feng · Bingyi Kang

[ ExHall D ]

Abstract
Depth Anything has achieved remarkable success in monocular depth estimation with strong generalization ability. However, it suffers from temporal inconsistency in videos, hindering its practical applications. Various methods have been proposed to alleviate this issue by leveraging video generation models or introducing priors from optical flow and camera poses. Nonetheless, these methods are only applicable to short videos (10 seconds) and require a trade-off between quality and computational efficiency. We propose Video Depth Anything for high-quality, consistent depth estimation in super-long videos (over several minutes) without sacrificing efficiency. We base our model on Depth Anything V2 and replace its head with an efficient spatial-temporal head. We design a straightforward yet effective temporal consistency loss by constraining the temporal depth gradient, eliminating the need for additional geometric priors. The model is trained on a joint dataset of video depth and unlabeled images, similar to Depth Anything V2. Moreover, a novel key-frame-based strategy is developed for long video inference. Experiments show that our model can be applied to arbitrarily long videos without compromising quality, consistency, or generalization ability. Comprehensive evaluations on multiple video benchmarks demonstrate that our approach sets a new state-of-the-art in zero-shot video depth estimation. We offer models of different …
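The temporal consistency loss can be as simple as matching temporal depth gradients between prediction and reference. The sketch below shows one assumed form of such a constraint; the paper's exact formulation may differ.

```python
# Minimal sketch: penalize the difference between predicted and reference
# temporal depth gradients across consecutive frames.
import torch

def temporal_gradient_loss(pred_depth, ref_depth):
    """pred_depth, ref_depth: (T, H, W) depth sequences for one video clip."""
    pred_grad = pred_depth[1:] - pred_depth[:-1]   # temporal depth gradient
    ref_grad = ref_depth[1:] - ref_depth[:-1]
    return (pred_grad - ref_grad).abs().mean()

loss = temporal_gradient_loss(torch.rand(16, 64, 64), torch.rand(16, 64, 64))
print(loss.item())
```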
Poster
Jiahao Shao · Yuanbo Yang · Hongyu Zhou · Youmin Zhang · Yujun Shen · Vitor Guizilini · Yue Wang · Matteo Poggi · Yiyi Liao

[ ExHall D ]

Abstract
This work addresses the challenge of streamed video depth estimation, which expects not only per-frame accuracy but, more importantly, cross-frame consistency. We argue that the lack of contextual information shared between frames or clips is pivotal in fostering inconsistency. Instead of directly developing a depth estimator from scratch, we reformulate this predictive task into a conditional generation problem to provide contextual information within a clip and across clips. Specifically, we propose a consistent context-aware training and inference strategy for arbitrarily long videos to provide cross-clip context. We sample independent noise levels for each frame within a clip during training while using a sliding window strategy and initializing overlapping frames with previously predicted frames without adding noise. Moreover, we design an effective training strategy to provide context within a clip. Extensive experimental results validate our design choices and demonstrate the superiority of our approach, dubbed ChronoDepth.
Poster
Huiwon Jang · Sihyun Yu · Jinwoo Shin · Pieter Abbeel · Younggyo Seo

[ ExHall D ]

Abstract
Efficient tokenization of videos remains a challenge in training vision models that can process long videos. One promising direction is to develop a tokenizer that can encode long video clips, as it would enable the tokenizer to leverage the temporal coherence of videos better for tokenization. However, training existing tokenizers on long videos often incurs a huge training cost as they are trained to reconstruct all the frames at once. In this paper, we introduce CoordTok, a video tokenizer that learns a mapping from coordinate-based representations to the corresponding patches of input videos, inspired by recent advances in 3D generative models. In particular, CoordTok encodes a video into factorized triplane representations and reconstructs patches that correspond to randomly sampled (x,y,t) coordinates. This allows for training large tokenizer models directly on long videos without requiring excessive training resources. Our experiments show that CoordTok can drastically reduce the number of tokens for encoding long video clips. For instance, CoordTok can encode a 128-frame video with 128×128 resolution into 1280 tokens, while baselines need 6144 or 8192 tokens to achieve similar reconstruction quality. We further show that this efficient video tokenization enables memory-efficient training of a diffusion transformer that can generate 128 frames …
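The coordinate-based triplane readout can be illustrated with bilinear sampling of three factorized planes at (x, y, t) locations. The snippet below is a generic sketch of that pattern with assumed shapes, not CoordTok's implementation.

```python
# Minimal sketch: sample factorized triplane features at (x, y, t) coordinates
# with bilinear interpolation; the result would feed a patch decoder.
import torch
import torch.nn.functional as F

def sample_triplane(plane_xy, plane_xt, plane_yt, coords):
    """plane_*: (1, C, H, W) feature planes; coords: (N, 3) of (x, y, t) in [-1, 1]."""
    def sample(plane, uv):
        grid = uv.view(1, -1, 1, 2)                                  # (1, N, 1, 2)
        out = F.grid_sample(plane, grid, mode='bilinear', align_corners=True)
        return out.squeeze(0).squeeze(-1).t()                        # (N, C)
    x, y, t = coords[:, 0:1], coords[:, 1:2], coords[:, 2:3]
    return (sample(plane_xy, torch.cat([x, y], dim=1)) +
            sample(plane_xt, torch.cat([x, t], dim=1)) +
            sample(plane_yt, torch.cat([y, t], dim=1)))              # (N, C)

C = 64
feat = sample_triplane(torch.randn(1, C, 16, 16), torch.randn(1, C, 16, 16),
                       torch.randn(1, C, 16, 16), torch.rand(100, 3) * 2 - 1)
print(feat.shape)   # torch.Size([100, 64])
```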
Poster
Shuwei Shi · Biao Gong · Xi Chen · DanDan Zheng · Shuai Tan · Zizheng Yang · Yuyuan Li · Jingwen He · Kecheng Zheng · Jingdong Chen · Ming Yang · Yinqiang Zheng

[ ExHall D ]

Abstract
The image-to-video (I2V) generation is conditioned on the static image, which has been enhanced recently by the motion intensity as an additional control signal. These motion-aware models are appealing to generate diverse motion patterns, yet there lacks a reliable motion estimator for training such models on large-scale video sets in the wild. Traditional metrics, e.g., SSIM or optical flow, are hard to generalize to arbitrary videos, and it is also very difficult for human annotators to label abstract motion intensity. Furthermore, the motion intensity shall reveal both local object motion and global camera movement, which has not been studied before. This paper addresses the challenge with a new motion estimator, capable of measuring the decoupled motion intensities of objects and cameras in video. We leverage contrastive learning on randomly paired videos and distinguish the video with greater motion intensity. Such a paradigm is friendly for annotation and easy to scale up to achieve stable performance on motion estimation. We then present a new I2V model, named MotionStone, developed with the decoupled motion estimator. Experimental results demonstrate the stability of the proposed motion estimator and the state-of-the-art performance of MotionStone on I2V generation. These advantages warrant the decoupled motion …
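The contrastive training signal for such a motion estimator can be approximated with a pairwise ranking loss that scores the video with greater motion higher. A minimal sketch with assumed tensor shapes, not the paper's exact objective:

```python
# Minimal sketch: pairwise ranking loss for a motion-intensity estimator on
# randomly paired videos whose relative motion ordering is known.
import torch
import torch.nn as nn

ranking_loss = nn.MarginRankingLoss(margin=0.2)

def intensity_ranking_loss(score_a, score_b, a_has_more_motion):
    """score_a, score_b: (B,) predicted intensities; a_has_more_motion: (B,) bool."""
    target = torch.where(a_has_more_motion,
                         torch.ones_like(score_a), -torch.ones_like(score_a))
    return ranking_loss(score_a, score_b, target)

loss = intensity_ranking_loss(torch.rand(4), torch.rand(4),
                              torch.tensor([True, False, True, True]))
print(loss.item())
```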
Poster
Sherwin Bahmani · Ivan Skorokhodov · Guocheng Qian · Aliaksandr Siarohin · Willi Menapace · Andrea Tagliasacchi · David B. Lindell · Sergey Tulyakov

[ ExHall D ]

Abstract
Numerous works have recently integrated 3D camera control into foundational text-to-video models, but the resulting camera control is often imprecise, and video generation quality suffers. In this work, we analyze camera motion from a first principles perspective, uncovering insights that enable precise 3D camera manipulation without compromising synthesis quality. First, we determine that motion induced by camera movements in videos is low-frequency in nature. This motivates us to adjust train and test pose conditioning schedules, accelerating training convergence while improving visual and motion quality. Then, by probing the representations of an unconditional video diffusion transformer, we observe that they implicitly perform camera pose estimation under the hood, and only a sub-portion of their layers contain the camera information. This motivated us to limit the injection of camera conditioning to a subset of the architecture to prevent interference with other video features, leading to a 4× reduction in training parameters, improved training speed, and 10% higher visual quality. Finally, we complement the typical dataset for camera control learning with a curated dataset of 20k in-the-wild dynamic videos with stationary cameras. This helps the model disambiguate the difference between camera and scene motion, and improves the dynamics of generated pose-conditioned videos. We compound …
Poster
Kaihua Chen · Deva Ramanan · Tarasha Khurana

[ ExHall D ]

Abstract
Object permanence in humans is a fundamental cue that helps in understanding persistence of objects, even when they are fully occluded in the scene. Present-day methods in object segmentation do not account for this amodal nature of the world, and only work for segmentation of visible or modal objects. Few amodal methods exist; single-image segmentation methods cannot handle high levels of occlusion, which is better inferred using temporal information, and multi-frame methods have focused solely on segmenting rigid objects. To this end, we propose to tackle video amodal segmentation by formulating it as a conditional generation task, thereby capitalizing on the foundational knowledge in video generative models. Our method is simple; we repurpose these models to condition on a sequence of modal mask frames of an object along with contextual depth maps, to learn which object boundary may be occluded and therefore, extended to hallucinate the complete extent of an object. This is followed by a content completion stage which is able to inpaint the occluded regions of an object. We benchmark our approach alongside a wide array of state-of-the-art methods on four datasets and show a dramatic improvement of up to 13% for amodal segmentation in an object's occluded region.
Poster
Juan Luis Gonzalez Bello · Xu Yao · Alex Whelan · Kyle Olszewski · Hyeongwoo Kim · Pablo Garrido

[ ExHall D ]

Abstract
We present an implicit video representation for occlusions, appearance, and motion disentanglement from monocular videos, which we refer to as Video Spatiotemporal Splines (VideoSPatS). Unlike previous methods that map time and coordinates to deformation and canonical colors, our VideoSPatS maps input coordinates into Spatial and Color Spline deformation fields Ds and Dc, which disentangle motion and appearance in videos. With spline-based parametrization, our method naturally generates temporally consistent flow and guarantees long-term temporal consistency, which is crucial for convincing video editing. Aided by additional prediction blocks, our VideoSPatS also performs layer separation between the latent video and the selected occluder. By disentangling occlusions, appearance, and motion, our method allows for better spatiotemporal modeling and editing of diverse videos, including in-the-wild talking head videos with challenging occlusions, shadows, and specularities while maintaining a reasonable canonical space for editing. We also present general video modeling results on the DAVIS and CoDeF datasets, as well as our own talking head video dataset collected from open-source web videos. Extensive ablations show the combination of Ds and Dc under neural splines can overcome motion and appearance ambiguities, paving the way to more advanced video editing models.
Poster
Alexander Pondaven · Aliaksandr Siarohin · Sergey Tulyakov · Philip H.S. Torr · Fabio Pizzati

[ ExHall D ]

Abstract
We propose DiTFlow, a method for transferring the motion of a reference video to a newly synthesized one, designed specifically for Diffusion Transformers (DiT). We first process the reference video with a pre-trained DiT to analyze cross-frame attention maps and extract a patch-wise motion signal called the Attention Motion Flow (AMF). We guide the latent denoising process in an optimization-based, training-free, manner by optimizing latents with our AMF loss to generate videos reproducing the motion of the reference one. We also apply our optimization strategy to transformer positional embeddings, granting us a boost in zero-shot motion transfer capabilities. We evaluate DiTFlow against recently published methods, outperforming all across multiple metrics and human evaluation. Our code will be open source.
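One way to picture an attention-derived motion flow is to take, for each patch in frame t, the attention-weighted average position of frame t+1 patches and subtract the patch's own position. The sketch below illustrates that construction under assumed shapes; it is not the AMF definition verbatim.

```python
# Minimal sketch: derive a patch-wise motion field from a cross-frame attention
# map as the attention-weighted expected displacement of each patch.
import torch

def attention_motion_flow(attn, grid_h, grid_w):
    """attn: (N, N) attention from frame-t patches (rows) to frame-t+1 patches
    (cols), rows assumed to sum to 1; N = grid_h * grid_w."""
    ys, xs = torch.meshgrid(torch.arange(grid_h), torch.arange(grid_w), indexing='ij')
    pos = torch.stack([xs.flatten(), ys.flatten()], dim=1).float()  # (N, 2) patch centers
    expected_pos = attn @ pos                                        # weighted target positions
    return (expected_pos - pos).view(grid_h, grid_w, 2)              # (H, W, 2) per-patch flow

attn = torch.softmax(torch.randn(64, 64), dim=1)
print(attention_motion_flow(attn, 8, 8).shape)   # torch.Size([8, 8, 2])
```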
Poster
Yuchi Wang · Junliang Guo · Xinyi Xie · Tianyu He · Xu Sun · Jiang Bian

[ ExHall D ]

Abstract
Recent advancements in video autoencoders (Video AEs) have significantly improved the quality and efficiency of video generation. In this paper, we propose a novel and compact video autoencoder, VidTwin, that decouples video into two distinct latent spaces: Structure latent vectors, which capture overall content and global movement, and Dynamics latent vectors, which represent fine-grained details and rapid movements. Specifically, our approach leverages an Encoder-Decoder backbone, augmented with two submodules for extracting these latent spaces, respectively. The first submodule employs a Q-Former to extract low-frequency motion trends, followed by downsampling blocks to remove redundant content details. The second averages the latent vectors along the spatial dimension to capture rapid motion. Extensive experiments show that VidTwin achieves a high compression rate of 0.20% with high reconstruction quality (PSNR of 28.14 on the MCL-JCV dataset), and performs efficiently and effectively in downstream generative tasks. Moreover, our model demonstrates explainability and scalability, paving the way for future research in video latent representation and generation.
Poster
Maria Pilligua · Danna Xue · Javier Vazquez-Corral

[ ExHall D ]

Abstract
Decomposing a video into a layer-based representation is crucial for easy video editing for the creative industries, as it enables independent editing of specific layers. Existing video-layer decomposition models rely on implicit neural representations (INRs) trained independently for each video, making the process time-consuming when applied to new videos. Noticing this limitation, we propose a meta-learning strategy to learn a generic video decomposition model to speed up the training on new videos. Our model is based on a hypernetwork architecture which, given a video-encoder embedding, generates the parameters for a compact INR-based neural video decomposition model. Our strategy mitigates the problem of single-video overfitting and, importantly, shortens the convergence of video decomposition on new, unseen videos.
Poster
Yang Hai · Guo Wang · Tan Su · jerett · Yinlin Hu

[ ExHall D ]

Abstract
We present an efficient diffusion-based method for video frame interpolation. Most recent diffusion-based methods still have a large gap from non-diffusion methods in accuracy and efficiency. The key to our method is that, instead of formulating the problem directly as a denoising procedure in the latent space, which is less effective due to the large latent space, we model optical flow explicitly from coarse to fine with hierarchical diffusion models, which have a much smaller search space in each denoising step and can handle complex motions and large displacements. Extensive evaluation on multiple benchmarks demonstrates that our method achieves state-of-the-art accuracy and is 10+ times faster than other diffusion-based methods.
Poster
Ding Ding · Yueming Pan · Ruoyu Feng · Qi Dai · Kai Qiu · Jianmin Bao · Chong Luo · Zhenzhong Chen

[ ExHall D ]

Abstract
In this paper, we present HomoGen, an enhanced video inpainting method based on homography propagation and diffusion models. HomoGen leverages homography registration to propagate contextual pixels as priors for generating missing content in corrupted videos. Unlike previous flow-based propagation methods, which introduce local distortions due to point-to-point optical flows, homography-induced artifacts are typically global structural distortions that preserve semantic integrity. To effectively utilize these priors for generation, we employ a video diffusion model that inherently prioritizes semantic information within the priors over pixel-level details. A content-adaptive control mechanism is proposed to scale and inject the priors into intermediate video latents during iterative denoising. In contrast to existing transformer-based networks that often suffer from artifacts within priors, leading to error accumulation and unrealistic results, our denoising diffusion network can smooth out artifacts and ensure natural output. Extensive experiments demonstrate the effectiveness of the proposed method qualitatively and quantitatively.
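Homography-based propagation itself can be sketched with standard OpenCV calls: estimate a homography between a reference frame and the corrupted frame, warp the reference, and copy warped pixels into the masked region as a generation prior. This is a generic sketch, not the paper's pipeline; function and variable names are assumptions.

```python
# Minimal sketch: warp a reference frame to the corrupted frame via a homography
# and fill the masked region with the warped pixels as a prior.
import cv2
import numpy as np

def propagate_prior(ref_frame, cur_frame, mask):
    """ref_frame, cur_frame: (H, W, 3) uint8 images; mask: (H, W) uint8, 255 = missing."""
    g_ref = cv2.cvtColor(ref_frame, cv2.COLOR_BGR2GRAY)
    g_cur = cv2.cvtColor(cur_frame, cv2.COLOR_BGR2GRAY)
    orb = cv2.ORB_create(1000)
    k1, d1 = orb.detectAndCompute(g_ref, None)
    k2, d2 = orb.detectAndCompute(g_cur, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)
    # At least 4 matches are required to estimate a homography.
    src = np.float32([k1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([k2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    warped = cv2.warpPerspective(ref_frame, H, (cur_frame.shape[1], cur_frame.shape[0]))
    prior = cur_frame.copy()
    prior[mask > 0] = warped[mask > 0]    # fill only the corrupted region
    return prior
```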
Poster
Tianwei Yin · Qiang Zhang · Richard Zhang · William Freeman · Fredo Durand · Eli Shechtman · Xun Huang

[ ExHall D ]

Abstract
Current video diffusion models achieve impressive generation quality but struggle in interactive applications due to bidirectional attention dependencies. The generation of a single frame requires the model to process the entire sequence, including the future. We address these limitations by introducing an autoregressive diffusion transformer that is adapted from a pretrained bidirectional video diffusion model. Our key innovations are twofold: First, we extend distribution matching distillation (DMD) to videos, compressing a 50-step denoising process into just 4 steps. Second, we develop an asymmetric distillation approach where a causal student model learns from a bidirectional teacher with privileged future information. This strategy effectively mitigates error accumulation in autoregressive generation, enabling high-quality long-form video synthesis despite training on short clips. Our model achieves a total score of 82.85 on VBench-Long, outperforming all published approaches and, most importantly, uniquely enabling fast streaming inference on a single GPU at 9.4 FPS. Our method also supports streaming video editing, image-to-video, and dynamic prompting in a zero-shot manner. We will release the code based on an open-source model in the future.
Poster
Shuyun Wang · Hu Zhang · Xin Shen · Dadong Wang · Xin Yu

[ ExHall D ]

Abstract
Bitstream-corrupted video recovery aims to fill in realistic video content lost to bitstream corruption during video storage or transmission. Most existing methods typically assume that the predefined masks of the corrupted regions are known in advance. However, manually annotating these input masks is laborious and time-consuming, limiting the applicability of existing methods in real-world scenarios. Therefore, we expect to relax this assumption by defining a new blind video recovery setting where the recovery of corrupted regions does not rely on predefined masks. There are two primary challenges in this scenario: (i) without predefined masks, how accurately can a model identify the regions requiring recovery? (ii) how to recover extensive and irregular contents, especially when large portions of frames are severely degraded or large-scale corrupted? To address these challenges, we introduce a Metadata-Guided Diffusion Model, dubbed M-GDM. To enable a diffusion model to focus on the corrupted regions, we leverage inherent video metadata as a corruption indicator and design a dual-stream metadata encoder. This encoder first processes the motion vectors and frame types of a video separately, and then merges them into a unified metadata representation. The metadata representation will interact with the corrupted latent feature via cross-attention in each diffusion step. Meanwhile, to preserve the intact regions, we propose …
Poster
Qian Wang · Abdelrahman Eldesokey · Mohit Mendiratta · Fangneng Zhan · Adam Kortylewski · Christian Theobalt · Peter Wonka

[ ExHall D ]

Abstract
We introduce the first training-free approach for Video Semantic Segmentation (VSS) based on pre-trained diffusion models. A growing research direction attempts to employ diffusion models to perform downstream vision tasks by exploiting their deep understanding of image semantics. Yet, the majority of these approaches have focused on image-related tasks like semantic segmentation, with less emphasis on video tasks such as VSS. Ideally, diffusion-based image semantic segmentation approaches can be applied to videos in a frame-by-frame manner. However, we find their performance on videos to be subpar due to the absence of any modeling of temporal information inherent in the video data. To this end, we tackle this problem and introduce a framework tailored for VSS based on pre-trained image and video diffusion models. We propose building a scene context model based on the diffusion features, where the model is autoregressively updated to adapt to scene changes. This context model predicts per-frame coarse segmentation maps that are temporally consistent. To refine these maps further, we propose a correspondence-based refinement strategy that aggregates predictions temporally, resulting in more confident predictions. Finally, we introduce a masked modulation approach to upsample the coarse maps to a high-quality full resolution. Experiments show that our proposed …
Poster
Yue-Hua Han · Tai-Ming Huang · Kailung Hua · Jun-Cheng Chen

[ ExHall D ]

Abstract
Generative models have enabled the creation of highly realistic facial-synthetic images, raising significant concerns due to their potential for misuse. While research in Deepfake detection has advanced rapidly, many methods still struggle to generalize to unseen Deepfakes generated by novel synthesis techniques. To address this challenge, we propose a novel side-network-based decoder that extracts spatial and temporal cues based on the CLIP image encoder for generalized video-based Deepfake detection. Additionally, we introduce the Facial Component Guidance (FCG) to enhance the spatial learning generalizability by encouraging the model to focus on key facial regions. The cross-dataset evaluation demonstrates the superior performance of our approach, surpassing state-of-the-art methods on challenging datasets. Extensive experiments further validate the effectiveness of the proposed method in terms of data efficiency, parameter efficiency and model robustness.
Poster
Zhenxuan Fang · Fangfang Wu · Tao Huang · Le Dong · Weisheng Dong · Xin Li · Guangming Shi

[ ExHall D ]

Abstract
Unlike global motion blur, Local Motion Deblurring (LMD) presents a more complex challenge, as it requires precise restoration of blurry regions while preserving the sharpness of the background. Existing LMD methods rely on manually annotated blur masks and often overlook the blur kernel's characteristics, which are crucial for accurate restoration. To address these limitations, we propose a novel parameterized motion kernel modeling approach that defines the motion blur kernel with three key parameters: length, angle, and curvature. We then use networks to estimate these kernel parameters, significantly improving the accuracy of blur kernel estimation. To effectively learn the motion blur representation, we incorporate a shared memory bank that stores blur prior information. Additionally, we introduce a dual-branch deblurring network: one branch leverages Mamba to capture long-range dependencies, while the other uses a mask-guided CNN focused on refining the local blurry regions. By fully utilizing the estimated blur prior information, our approach greatly enhances deblurring outcomes. Experimental results show that our method achieves state-of-the-art performance both quantitatively and visually, with a substantial reduction in computational complexity.
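A length-angle-curvature kernel can be rasterized by walking along a trajectory whose direction rotates with arc length. The sketch below is one plausible parameterization for illustration only, not the paper's exact kernel model.

```python
# Minimal sketch: rasterize a motion-blur kernel from length, angle and curvature
# by sampling points along a curved trajectory and normalizing the result.
import numpy as np

def motion_blur_kernel(length, angle, curvature, size=31, samples=200):
    """length in pixels, angle in radians, curvature = direction change per unit length."""
    k = np.zeros((size, size), dtype=np.float32)
    c = (size - 1) / 2.0
    ts = np.linspace(-length / 2.0, length / 2.0, samples)
    thetas = angle + curvature * ts                 # direction rotates with arc length
    xs = c + np.cumsum(np.cos(thetas)) * (length / samples)
    ys = c + np.cumsum(np.sin(thetas)) * (length / samples)
    xs -= xs.mean() - c                             # recenter the trajectory on the kernel
    ys -= ys.mean() - c
    for x, y in zip(xs, ys):
        xi, yi = int(round(x)), int(round(y))
        if 0 <= xi < size and 0 <= yi < size:
            k[yi, xi] += 1.0
    return k / max(k.sum(), 1e-8)

kernel = motion_blur_kernel(length=15, angle=0.4, curvature=0.05)
print(kernel.shape, kernel.sum())   # (31, 31) and a sum of ~1.0
```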
Poster
Nicolas Dufour · Vicky Kalogeiton · David Picard · Loic Landrieu

[ ExHall D ]

Abstract
Global visual geolocation predicts where an image was captured on Earth. Since images vary in how precisely they can be localized, this task inherently involves a significant degree of ambiguity. However, existing approaches are deterministic and overlook this aspect. In this paper, we aim to close the gap between traditional geolocalization and modern generative methods. We propose the first generative geolocation approach based on diffusion and Riemannian flow matching, where the denoising process operates directly on the Earth's surface. Our model achieves state-of-the-art performance on three visual geolocation benchmarks: OpenStreetView-5M, YFCC-100M, and iNat21. In addition, we introduce the task of probabilistic visual geolocation, where the model predicts a probability distribution over all possible locations instead of a single point. We introduce new metrics and baselines for this task, demonstrating the advantages of our diffusion-based approach. Codes and models will be made available.
Poster
Shasha Mao · Shiming Lu · Zhaolong Du · Licheng Jiao · Shuiping Gou · Luntian Mou · Xuequan Lu · Lin Xiong · Yimeng Zhang

[ ExHall D ]

Abstract
Synthetic Aperture Radar (SAR) image registration is an essential upstream task in geoscience applications, in which pre-detected keypoints from two images are employed as observed objects to seek matched-point pairs. In general, the registration is regarded as a typical closed-set classification, which forces each keypoint to be classified into the given classes, while ignoring an essential issue: numerous redundant keypoints fall outside the given classes, which unavoidably results in capturing incorrect matched-point pairs. Based on this, we propose a Cross-Rejective Open-set SAR Image Registration (CroR-OSIR) method. In this work, these redundant keypoints are regarded as out-of-distribution (OOD) samples, and we formulate the registration as a special open-set task with two modules: supervised contrastive feature-tuning and cross-rejective open-set recognition (CroR-OSR). Different from traditional open-set recognition, all samples including OOD samples are available in the CroR-OSR module. CroR-OSR conducts the closed-set classifications in individual open-set domains from two images, meanwhile employing cross-domain rejection during training to exclude these OOD samples based on confidence and consistency. Moreover, a new supervised contrastive tuning strategy is incorporated for feature-tuning. In particular, the cross-domain estimation labels obtained by CroR-OSR are fed back to the feature-tuning module to enhance feature discriminability. Experimental results …
Poster
Zichen Tian · Yaoyao Liu · Qianru Sun

[ ExHall D ]

Abstract
Training large foundation models of remote-sensing (RS) images is almost impossible due to the limited and long-tailed data problems. Fine-tuning natural image pre-trained models on RS images is a straightforward solution. To reduce computational costs and improve performance on tail classes, existing methods apply parameter-efficient fine-tuning (PEFT) techniques, such as LoRA and AdaptFormer. However, we observe that fixed hyperparameters, such as intra-layer positions, layer depth, and scaling factors, can considerably hinder PEFT performance, as fine-tuning on RS images proves highly sensitive to these settings. To address this, we propose MetaPEFT, a method incorporating adaptive scalers that dynamically adjust module influence during fine-tuning. MetaPEFT dynamically adjusts three key factors of PEFT on RS images: module insertion, layer selection, and module-wise learning rates, which collectively control the influence of PEFT modules across the network. We conduct extensive experiments on three transfer-learning scenarios and five datasets. The results show that MetaPEFT achieves state-of-the-art performance in cross-spectral adaptation, requiring only a small number of trainable parameters and improving tail-class accuracy significantly. Our code is available in the supplementary materials for review.
Poster
Jingtao Li · Yingyi Liu · XINYU WANG · Yunning Peng · Chen Sun · Shaoyu Wang · Zhendong Sun · Tian Ke · Xiao Jiang · Tangwei Lu · Anran Zhao · Yanfei Zhong

[ ExHall D ]

Abstract
Advanced interpretation of hyperspectral remote sensing images benefits many precise Earth observation tasks. Recently, visual foundation models have promoted remote sensing interpretation but concentrate on RGB and multispectral images. Due to the varied hyperspectral channels, existing foundation models would face an image-by-image tuning situation, imposing great pressure on hardware and time resources. In this paper, we propose a tuning-free hyperspectral foundation model called HyperFree, by adapting existing visual prompt engineering. To process varied channel numbers, we design a learned weight dictionary covering the full spectrum from 0.4–2.5 μm, supporting dynamic construction of the embedding layer. To make the prompt design more tractable, HyperFree can generate multiple semantic-aware masks for one prompt by treating feature distance as semantic similarity. After pre-training HyperFree on constructed large-scale high-resolution hyperspectral images, HyperFree (1 prompt) has shown comparable results with specialized models (5 shots) on 5 tasks and 11 datasets. Code would be accessible at XXXX.
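The wavelength-indexed dictionary idea can be sketched as a lookup that assembles a per-channel embedding on the fly from band-center wavelengths, so inputs with any number of spectral channels can be embedded without retraining. The module below is an assumed, simplified mechanism, not HyperFree's code.

```python
# Minimal sketch: build a 1x1-convolution-like embedding whose per-channel
# weights are looked up from a wavelength-indexed dictionary.
import torch
import torch.nn as nn

class WavelengthDictionaryEmbed(nn.Module):
    def __init__(self, embed_dim=64, n_bins=210, wl_min=0.4, wl_max=2.5):
        super().__init__()
        self.register_buffer('bin_centers', torch.linspace(wl_min, wl_max, n_bins))
        self.dictionary = nn.Parameter(torch.randn(n_bins, embed_dim) * 0.02)

    def forward(self, image, wavelengths):
        """image: (B, C, H, W); wavelengths: (C,) band centers in micrometres."""
        # Nearest dictionary entry per input channel.
        idx = torch.argmin((wavelengths[:, None] - self.bin_centers[None, :]).abs(), dim=1)
        weight = self.dictionary[idx]                     # (C, D) per-channel weights
        return torch.einsum('bchw,cd->bdhw', image, weight)

embed = WavelengthDictionaryEmbed()
out = embed(torch.randn(2, 120, 32, 32), torch.linspace(0.45, 2.4, 120))
print(out.shape)   # torch.Size([2, 64, 32, 32])
```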
Poster
Jiangwei Ren · Xingyu Jiang · Zizhuo Li · Dingkang Liang · Xin Zhou · Xiang Bai

[ ExHall D ]

Abstract
Image matching for both cross-view and cross-modality plays a critical role in multi-modal perception. Due to the modality gap caused by different imaging systems/styles, the matching task poses great challenges. Existing works try to extract invariant features for specific modalities and train on limited datasets, showing poor generalization. To this end, we present MINIMA, a unified image matching framework for multiple cross-modal cases. Without pursuing fancy modules, our MINIMA aims to enhance universal performance from the perspective of data scaling-up. For this purpose, we propose a simple yet effective data engine that can freely produce a large dataset containing multiple modalities, rich scenarios, and accurate labeling. Specifically, we scale up the modalities from cheap but rich RGB-only matching data by means of generative modules. With this setting, the matching labels and rich diversity of the RGB dataset are well inherited by the generated multimodal data. Benefiting from this, we construct MD-syn, a new comprehensive dataset that fills the data gap for general multi-modal image matching. With MD-syn, we can directly train any advanced matching pipeline on randomly selected modality pairs to obtain cross-modality ability. Extensive experiments on synthetic and real datasets demonstrate that our MINIMA can achieve large enhancement for cross-modal …
Poster
Sungpyo Kim · Jeonghyeok Do · Jaehyup Lee · Munchurl Kim

[ ExHall D ]

Abstract
Conventional methods for PAN-sharpening often struggle to restore fine details due to limitations in leveraging high-frequency information. Moreover, diffusion-based approaches lack sufficient conditioning to fully utilize Panchromatic (PAN) images and low-resolution multispectral (LRMS) inputs effectively. To address these challenges, we propose an uncertainty-aware knowledge distillation diffusion framework with details enhancement for PAN-sharpening, called U-Know-DiffPAN. The U-Know-DiffPAN incorporates uncertainty-aware knowledge distillation for effective transfer of feature details from our teacher model to a student one. The teacher model in our U-Know-DiffPAN captures frequency details through frequency-selective attention, facilitating accurate reverse process learning. By conditioning the encoder on compact vector representations of PAN and LRMS and the decoder on Wavelet transforms, we enable rich frequency utilization. Thus, the high-capacity teacher model distills frequency-rich features into a lightweight student model aided by an uncertainty map. From this, the teacher model can guide the student model to focus on difficult image regions for PAN-sharpening via the usage of the uncertainty map. Extensive experiments on diverse datasets demonstrate the robustness and superior performance of our U-Know-DiffPAN over very recent state-of-the-art PAN-sharpening methods. The source code is available at https://github.com/xxx/yyy.
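Uncertainty-aware distillation can be sketched as a feature-matching loss whose per-pixel weight comes from the uncertainty map, so difficult regions contribute more. The snippet below is a generic formulation; the paper's exact weighting may differ.

```python
# Minimal sketch: distill teacher features into a student with an
# uncertainty-weighted feature-matching loss.
import torch

def uncertainty_weighted_distill(student_feat, teacher_feat, uncertainty):
    """student_feat, teacher_feat: (B, C, H, W); uncertainty: (B, 1, H, W) in [0, 1]."""
    weight = 1.0 + uncertainty                     # emphasize difficult regions
    return (weight * (student_feat - teacher_feat.detach()) ** 2).mean()

loss = uncertainty_weighted_distill(torch.randn(2, 32, 16, 16),
                                    torch.randn(2, 32, 16, 16),
                                    torch.rand(2, 1, 16, 16))
print(loss.item())
```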
Poster
Xin Di · Long Peng · Peizhe Xia · Wenbo Li · Renjing Pei · Yang Wang · Yang Cao · Zheng-Jun Zha

[ ExHall D ]

Abstract
Burst super-resolution (BurstSR) aims to reconstruct high-resolution images by fusing subpixel details from multiple low-resolution burst frames. The primary challenge lies in effectively extracting useful information while mitigating the impact of high-frequency noise. Most existing methods rely on frame-by-frame fusion, which often struggles to distinguish informative subpixels from noise, leading to suboptimal performance. To address these limitations, we introduce a novel Query Mamba Burst Super-Resolution (QMambaBSR) network. Specifically, we observe that sub-pixels have consistent spatial distribution while noise appears randomly. Considering the entire burst sequence during fusion allows for more reliable extraction of consistent subpixels and better suppression of noise outliers. Based on this, a Query State Space Model (QSSM) is proposed for both inter-frame querying and intra-frame scanning, enabling a more efficient fusion of useful subpixels. Additionally, to overcome the limitations of static upsampling methods that often result in over-smoothing, we propose an Adaptive Upsampling (AdaUp) module that dynamically adjusts the upsampling kernel to suit the characteristics of different burst scenes, achieving superior detail reconstruction. Extensive experiments on four benchmark datasets—spanning both synthetic and real-world images—demonstrate that QMambaBSR outperforms existing state-of-the-art methods. The code will be publicly available.
Poster
Ruiyi Wang · Yushuo Zheng · Zicheng Zhang · Chunyi Li · Shuaicheng Liu · Guangtao Zhai · Xiaohong Liu

[ ExHall D ]

Abstract
Existing real-world image dehazing methods typically attempt to fine-tune pre-trained models or adapt their inference procedures, placing significant reliance on the quality of pre-training data. Although generative diffusion models have shown potential in restoring heavily distorted information, their application in dehazing remains constrained due to extensive sampling steps and fidelity limitations. To address these challenges, we propose a two-stage hazing-dehazing pipeline, which integrates the Realistic Haze Generation Framework (HazeGen) and the Diffusion-based Dehazing Framework (DiffDehaze). Specifically, HazeGen takes advantage of the rich generative diffusion prior of real-world hazy images embedded in the pre-trained text-to-image diffusion model and leverages IRControlNet to realize conditional generation. To further improve haze authenticity and generation diversity, HazeGen utilizes the hybrid training and the blended sampling approaches to generate high-quality training data for DiffDehaze. In order to leverage generative capacity while retaining efficiency, DiffDehaze employs the Accelerated Fidelity-Preserving Sampling Strategy (AccSamp). With a Patch-based Statistical Alignment Operation (AlignOp), DiffDehaze can quickly generate a faithful dehazing estimate in few sampling steps, which can be used to reduce sampling steps and enables a haze density-aware fidelity guidance. Extensive visual comparisons and quantitative evaluations demonstrate the superior dehazing performance and visual quality of our approach over existing methods. The …
Poster
Zeyu Mi · Yu-Bin Yang

[ ExHall D ]

Abstract
Data augmentation (DA) stands out as a powerful technique to enhance the generalization capabilities of deep neural networks across diverse tasks. However, in low-level vision tasks, DA remains rudimentary (i.e., vanilla DA), facing a critical bottleneck due to information loss. In this paper, we introduce a novel Calibrated Attribution Map (CAM) to generate saliency masks, followed by two saliency-based DA methods—ADD and ADD+—designed to address this issue. CAM leverages integrated gradients and incorporates two key innovations: a global feature detector and calibrated integrated gradients. Based on CAM and the proposed methods, we highlight two key insights for low-level vision tasks: (1) increasing pixel diversity, as seen in vanilla DA, can improve performance, and (2) focusing on salient features while minimizing the impact of irrelevant pixels, as seen in saliency-based DA, more effectively enhances model performance. Additionally, we propose two guiding principles for designing saliency-based DA: coarse-grained partitioning and diverse augmentation strategies. Extensive experiments demonstrate the compatibility and consistent, significant performance improvement of our method across various SR tasks and networks.
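A minimal sketch of saliency-based data augmentation in the spirit of the saliency-masked methods described above, assuming a simple keep-salient, augment-the-rest mixing rule (hypothetical, not the paper's exact ADD/ADD+ formulation):

```python
# Minimal sketch (assumption): use a saliency mask to keep salient pixels intact while
# applying an augmentation only to non-salient regions.
import torch

def saliency_masked_augment(image: torch.Tensor, saliency: torch.Tensor,
                            augment, threshold: float = 0.5) -> torch.Tensor:
    """image: (B, C, H, W); saliency: (B, 1, H, W) in [0, 1]; `augment` is any
    tensor-to-tensor augmentation (e.g. noise injection or color jitter)."""
    mask = (saliency > threshold).float()          # 1 on salient pixels
    augmented = augment(image)
    return mask * image + (1.0 - mask) * augmented

# usage with a toy augmentation that adds Gaussian noise
img = torch.rand(2, 3, 64, 64)
sal = torch.rand(2, 1, 64, 64)
out = saliency_masked_augment(img, sal, lambda x: x + 0.1 * torch.randn_like(x))
```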
Poster
Heemin Yang · Jaesung Rim · Seungyong Lee · Seung-Hwan Baek · Sunghyun Cho

[ ExHall D ]

Abstract
In this paper, we present GyroDeblurNet, a novel single image deblurring method that utilizes a gyro sensor to effectively resolve the ill-posedness of image deblurring. The gyro sensor provides valuable information about camera motion that can significantly improve deblurring quality. However, effectively exploiting real-world gyro data is challenging due to significant errors from various sources. To handle these errors, GyroDeblurNet is equipped with two novel neural network blocks: a gyro refinement block and a gyro deblurring block. The gyro refinement block refines the erroneous gyro data using the blur information from the input image. The gyro deblurring block removes blur from the input image using the refined gyro data and further compensates for gyro error by leveraging the blur information from the input image. For training a neural network with erroneous gyro data, we propose a training strategy based on curriculum learning. We also introduce a novel gyro data embedding scheme to represent real-world intricate camera shakes. Finally, we present both synthetic and real-world datasets for training and evaluating gyro-based single image deblurring. Our experiments demonstrate that our approach achieves state-of-the-art deblurring quality by effectively utilizing erroneous gyro data.
Poster
Yidi Liu · Dong Li · Xueyang Fu · Xin Lu · Jie Huang · Zheng-Jun Zha

[ ExHall D ]

Abstract
We introduce UHD-Processor, a unified and robust framework for all-in-one image restoration, which is particularly resource-efficient for Ultra-High-Definition (UHD) images. To address the limitations of traditional all-in-one methods that rely on complex restoration backbones, our strategy employs a frequency domain decoupling progressive learning technique, motivated by curriculum learning, to incrementally learn restoration mappings from low to high frequencies. This approach incorporates specialized sub-network modules to effectively tackle different frequency bands in a divide-and-conquer manner, significantly enhancing the learning capability of simpler networks. Moreover, to accommodate the high-resolution characteristics of UHD images, we developed a variational autoencoder (VAE)-based framework that reduces computational complexity by modeling a concise latent space. It integrates task-specific degradation awareness in the encoder and frequency selection in the decoder, enhancing task comprehension and generalization. Our unified model is able to handle various restoration tasks, such as denoising, deblurring, dehazing, low-light enhancement, etc. Experimental evaluations extensively showcase the effectiveness of our dual-strategy approach, significantly improving UHD image restoration and achieving cutting-edge performance across diverse conditions.
Poster
Yuheng Xu · Shijie Yang · Xin Liu · Jie Liu · Jie Tang · Gangshan Wu

[ ExHall D ]

Abstract
In recent years, the increasing popularity of high-DPI screens has driven a rising demand for high-resolution images. However, the limited computational power of edge devices poses a challenge in deploying complex super-resolution neural networks, highlighting the need for efficient methods. While prior works have made significant progress, they have not fully exploited pixel-level information. Moreover, their reliance on fixed sampling patterns limits both accuracy and the ability to capture fine details in low-resolution images. To address these challenges, we introduce two plug-and-play modules designed to capture and leverage pixel information effectively in Look-Up Table (LUT) based super-resolution networks. Our method introduces Automatic Sampling (AutoSample), a flexible LUT sampling approach where sampling weights are dynamically learned during training to adapt to pixel variations and expand the receptive field without added inference cost. We also incorporate Adaptive Residual Learning (AdaRL) to enhance inter-layer connections, enabling detailed information flow and improving the network’s ability to reconstruct fine details. Our method achieves significant performance improvements on both MuLUT and SPF-LUT while maintaining similar storage sizes. Specifically, for MuLUT, we achieve a PSNR improvement of approximately +0.20 dB on average across five datasets. For SPF-LUT, with more than a 50% reduction in storage …
Poster
Kangfu Mei · Vishal M. Patel · Mojtaba Sahraee-Ardakan · Hossein Talebi · Peyman Milanfar · Mauricio Delbracio

[ ExHall D ]

Abstract
Single-image super-resolution (SISR) remains challenging due to the inherent difficulty of recovering fine-grained details and preserving perceptual quality from low-resolution inputs. Existing methods often rely on limited image priors, leading to suboptimal results. We propose a novel approach that leverages the rich contextual information available in multiple modalities -- including depth, segmentation, edges, and text prompts -- to learn a powerful generative prior for SISR within a diffusion model framework. We introduce a flexible network architecture that effectively fuses multimodal information, accommodating an arbitrary number of input modalities without requiring significant modifications to the diffusion process. Crucially, we mitigate hallucinations, often introduced by text prompts, by using spatial information from other modalities to guide regional text-based conditioning. Each modality's guidance strength can also be controlled independently, allowing outputs to be steered in different directions, such as increasing bokeh through depth or adjusting object prominence via segmentation. Extensive experiments demonstrate that our model surpasses state-of-the-art generative SISR methods, achieving superior visual quality and fidelity.
Poster
Zongsheng Yue · Kang Liao · Chen Change Loy

[ ExHall D ]

Abstract
This study presents a new image super-resolution (SR) technique based on diffusion inversion, aiming at harnessing the rich image priors encapsulated in large pre-trained diffusion models to improve SR performance. We design a Partial noise Prediction strategy to construct an intermediate state of the diffusion model, which serves as the starting sampling point. Central to our approach is a deep noise predictor to estimate the optimal noise maps for the forward diffusion process. Once trained, this noise predictor can be used to initialize the sampling process partially along the diffusion trajectory, generating the desirable high-resolution result. Compared to existing approaches, our method offers a flexible and efficient sampling mechanism that supports an arbitrary number of sampling steps, ranging from one to five. Even with a single sampling step, our method demonstrates superior or comparable performance to recent state-of-the-art approaches. The code and model will be made publicly available.
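A minimal sketch of starting diffusion sampling from an intermediate state, with hypothetical `predict_noise` and `reverse_step` callables standing in for the paper's noise predictor and denoiser:

```python
# Minimal sketch (assumptions, not the paper's model): forward-diffuse the upsampled input
# to an intermediate step using a predicted noise map, then run only the tail of the chain.
import torch

@torch.no_grad()
def partial_diffusion_sr(lr_up, predict_noise, reverse_step, alphas_cumprod, start_step):
    """lr_up: (B, C, H, W) upsampled low-resolution input; callables are assumed APIs."""
    noise = predict_noise(lr_up)
    a_bar = alphas_cumprod[start_step]
    x = a_bar.sqrt() * lr_up + (1.0 - a_bar).sqrt() * noise   # intermediate diffusion state
    for t in range(start_step, -1, -1):                       # only the last few reverse steps
        x = reverse_step(x, t)
    return x

# usage with dummy components (5-step toy schedule)
alphas = torch.linspace(0.9, 0.1, 5).cumprod(dim=0)
x_sr = partial_diffusion_sr(torch.rand(1, 3, 32, 32),
                            predict_noise=lambda x: torch.randn_like(x),
                            reverse_step=lambda x, t: x,       # identity stand-in
                            alphas_cumprod=alphas, start_step=2)
```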
Poster
Bingliang Zhang · Wenda Chu · Julius Berner · Chenlin Meng · Anima Anandkumar · Yang Song

[ ExHall D ]

Abstract
Diffusion models have recently achieved success in solving Bayesian inverse problems with learned data priors. Current methods build on top of the diffusion sampling process, where each denoising step makes small modifications to samples from the previous step. However, this process struggles to correct errors from earlier sampling steps, leading to worse performance in complicated nonlinear inverse problems, such as phase retrieval. To address this challenge, we propose a new method called Decoupled Annealing Posterior Sampling (DAPS) that relies on a novel noise annealing process. Specifically, we decouple consecutive steps in a diffusion sampling trajectory, allowing them to vary considerably from one another while ensuring their time-marginals anneal to the true posterior as we reduce noise levels. This approach enables the exploration of a larger solution space, improving the success rate for accurate reconstructions. We demonstrate that DAPS significantly improves sample quality and stability across multiple image restoration tasks, particularly in complicated nonlinear inverse problems.
Poster
Marina Alterman · Anat Levin

[ ExHall D ]

Abstract
Transmission matrices, mapping the propagation of light from one end of the tissue to the other, form an important mathematical tool in the analysis of tissue scattering and the design of wavefront shaping systems. To understand the relationship between their content and the volumetric structure of the tissue, we wish to fit them with multi-slice models, composed of a set of planar aberrations spaced throughout the volume. The number of layers used in such a model largely affects the amount of information compression and the ease with which we can use such layered models in a wavefront-shaping system. This work offers a theoretical study of such multi-layered models. We attempt to understand how many layers are required for a good fit, and how the approximation degrades when fewer layers are used. We show analytically that transmission matrices can be well fitted with very sparse layers. This leads to optimistic predictions on our ability to use them to design future wavefront shaping systems that can correct tissue aberration over a wide field-of-view.
Poster
linwei dong · Qingnan Fan · Yihong Guo · Zhonghao Wang · Qi Zhang · Jinwei Chen · Yawei Luo · Changqing Zou

[ ExHall D ]

Abstract
Pre-trained text-to-image diffusion models are increasingly applied to the real-world image super-resolution (Real-ISR) task. Given the iterative refinement nature of diffusion models, most existing approaches are computationally expensive. While methods such as SinSR and OSEDiff have emerged to condense inference steps via distillation, their performance in image restoration and detail recovery remains unsatisfactory. To address this, we propose TSD-SR, a novel distillation framework specifically designed for real-world image super-resolution, aiming to construct an efficient and effective one-step model. We first introduce Target Score Distillation, which leverages the priors of diffusion models and real image references to achieve more realistic image restoration. Secondly, we propose a Distribution-Aware Sampling Module to make detail-oriented gradients more readily accessible, addressing the challenge of recovering fine details. Extensive experiments demonstrate that our TSD-SR achieves superior restoration results (best on most metrics) and the fastest inference speed (e.g., 40 times faster than SeeSR) compared to past Real-ISR approaches based on pre-trained diffusion priors.
Poster
Matthieu Terris · Ulugbek Kamilov · Thomas Moreau

[ ExHall D ]

Abstract
Selecting an appropriate prior to compensate for information loss due to the measurement operator is a fundamental challenge in imaging inverse problems. Implicit priors based on denoising neural networks have become central to widely-used frameworks such as Plug-and-Play (PnP) algorithms. In this work, we introduce Fixed-points of Restoration (FiRe) priors as a new framework for expanding the notion of priors in PnP to general restoration models beyond traditional denoising models. The key insight behind FiRe is that natural images emerge as fixed points of the composition of a degradation operator with the corresponding restoration model. This enables us to derive an explicit formula for our implicit prior by quantifying invariance of images under this composite operation. Adopting this fixed-point perspective, we show how various restoration networks can effectively serve as priors for solving inverse problems. The FiRe framework further enables ensemble-like combinations of multiple restoration models as well as acquisition-informed restoration networks, all within a unified optimization approach. Experimental results validate the effectiveness of FiRe across various inverse problems, establishing a new paradigm for incorporating pretrained restoration models into PnP-like algorithms.
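A minimal sketch of the fixed-point prior idea described above, assuming toy `degrade` and `restore` operators (not the paper's exact formulation of the FiRe prior):

```python
# Minimal sketch (assumption): measure how far an image is from being a fixed point of
# "degrade then restore", and use that as an implicit prior term in an inverse problem.
import torch

def fire_prior(x: torch.Tensor, degrade, restore) -> torch.Tensor:
    """Small when x is (approximately) invariant under restore(degrade(x))."""
    return ((x - restore(degrade(x))) ** 2).mean()

# usage with toy operators: blur-like degradation and an identity "restoration" stand-in
x = torch.rand(1, 3, 32, 32, requires_grad=True)
blur = lambda img: torch.nn.functional.avg_pool2d(img, 3, stride=1, padding=1)
prior = fire_prior(x, degrade=blur, restore=lambda img: img)
prior.backward()
```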
Poster
Junyuan Deng · Xinyi Wu · Yongxing Yang · Congchao Zhu · Song Wang · Zhenyao Wu

[ ExHall D ]

Abstract
Recently, pre-trained text-to-image (T2I) models have been extensively adopted for real-world image restoration because of their powerful generative prior. However, controlling these large models for image restoration usually requires a large number of high-quality images and immense computational resources for training, which is costly and not privacy-friendly. In this paper, we find that a well-trained large T2I model (i.e., Flux) is able to produce a variety of high-quality images aligned with real-world distributions, offering an unlimited supply of training samples to mitigate the above issue. Specifically, we propose a training data construction pipeline for image restoration, namely FluxGen, which includes unconditional image generation, image selection, and degraded image simulation. A novel lightweight adapter (FluxIR) with squeeze-and-excitation layers is also carefully designed to control the large Diffusion Transformer (DiT)-based T2I model so that reasonable details can be restored. Experiments demonstrate that our proposed method enables the Flux model to adapt effectively to real-world image restoration tasks, achieving superior scores and visual quality on both synthetic and real-world degradation datasets, at only about 8.5% of the training cost compared to current approaches.
Poster
Chong Wang · Lanqing Guo · Zixuan Fu · SIYUAN YANG · Hao Cheng · Alex C. Kot · Bihan Wen

[ ExHall D ]

Abstract
Plug-and-play (PnP) methods offer an iterative strategy for solving image restoration (IR) problems in a zero-shot manner, using a learned discriminative denoiser as the implicit prior. More recently, a sampling-based variant of this approach, which utilizes a pre-trained generative diffusion model, has gained great popularity for solving IR problems through stochastic sampling. The IR results using PnP with a pre-trained diffusion model demonstrate distinct advantages compared to those using discriminative denoisers, i.e., improved perceptual quality at the cost of data fidelity. These unsatisfactory trade-offs stem from the lack of integration between the two strategies in IR tasks. In this work, we propose a novel zero-shot IR scheme, dubbed Reconciling Diffusion Model in Dual (RDMD), which leverages only a single pre-trained diffusion model to construct two complementary regularizers. Specifically, the diffusion model in RDMD iteratively performs deterministic denoising and stochastic sampling, aiming to achieve high-fidelity image restoration with appealing perceptual quality. RDMD also allows users to customize the distortion-perception tradeoff with a single hyperparameter, enhancing the adaptability of the restoration process in different practical scenarios. Extensive experiments on several IR tasks demonstrate that our proposed method achieves superior results compared to existing approaches on both the FFHQ and ImageNet datasets. We will release the …
Poster
Xinrui Wang · Lanqing Guo · Xiyu Wang · Siyu Huang · Bihan Wen

[ ExHall D ]

Abstract
Recent advancements in deep learning have yielded promising results for the image shadow removal task. However, most existing methods rely on binary pre-generated shadow masks. The binary nature of such masks could potentially lead to artifacts near the boundary between shadow and non-shadow areas. In view of this, inspired by the physical model of shadow formation, we introduce novel soft shadow masks specifically designed for shadow removal. To achieve such soft masks, we propose a SoftShadow framework by leveraging the prior knowledge of pretrained SAM and integrating physical constraints. Specifically, we jointly tune the SAM and the subsequent shadow removal network using a penumbra formation constraint loss, a mask reconstruction loss, and a shadow removal loss. This framework enables accurate predictions of penumbra (partially shaded) and umbra (fully shaded) areas while simultaneously facilitating end-to-end shadow removal. Through extensive experiments on popular datasets, we find that our SoftShadow framework, which generates soft masks, can better mitigate boundary artifacts, achieve state-of-the-art performance, and demonstrate superior generalizability.
Poster
Xingyu Qiu · Mengying Yang · Xinghua Ma · Fanding Li · Dong Liang · Gongning Luo · wei wang · Kuanquan Wang · Shuo Li

[ ExHall D ]

Abstract
In image generation, Schrödinger Bridge (SB)-based methods theoretically enhance efficiency and quality compared to diffusion models by finding the least costly path between two distributions. However, they are computationally expensive and time-consuming when applied to complex image data. The reason is that they focus on fitting globally optimal paths in high-dimensional spaces, directly generating images as the next step on the path using complex networks trained in a self-supervised manner, which typically leaves a gap to the global optimum. Meanwhile, most diffusion models lie in the same path subspace generated by the weights $f_A(t)$ and $f_B(t)$, as they follow the paradigm $x_t = f_A(t)\,x_{\mathrm{Img}} + f_B(t)\,\epsilon$. To address the limitations of SB-based methods, this paper proposes for the first time to find local Diffusion Schrödinger Bridges (LDSB) in the diffusion path subspace, which strengthens the connection between the SB problem and diffusion models. Specifically, our method optimizes the diffusion paths using a Kolmogorov-Arnold Network (KAN), which has the advantages of resistance to forgetting and continuous output. Experiments show that LDSB significantly improves the quality and efficiency of image generation using the same pre-trained denoising network, and the KAN used for path optimization is smaller than 0.1 MB. The FID metric is reduced …
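A minimal sketch of the shared diffusion-path paradigm quoted above, instantiated with the common variance-preserving weights as an assumption (the paper instead optimizes the path itself with a KAN):

```python
# Minimal sketch: the path paradigm x_t = f_A(t) * x_img + f_B(t) * eps, instantiated with
# the common variance-preserving choice f_A = sqrt(alpha_bar_t), f_B = sqrt(1 - alpha_bar_t).
import torch

def diffusion_path_state(x_img: torch.Tensor, t: int, alphas_cumprod: torch.Tensor) -> torch.Tensor:
    a_bar = alphas_cumprod[t]
    eps = torch.randn_like(x_img)
    f_a, f_b = a_bar.sqrt(), (1.0 - a_bar).sqrt()
    return f_a * x_img + f_b * eps

alphas_cumprod = torch.linspace(0.9999, 0.98, 1000).cumprod(dim=0)
x_t = diffusion_path_state(torch.rand(1, 3, 32, 32), t=500, alphas_cumprod=alphas_cumprod)
```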
Poster
Yikai Wang · Chenjie Cao · Junqiu Yu · Ke Fan · Xiangyang Xue · Yanwei Fu

[ ExHall D ]

Abstract
Recent advances in image inpainting increasingly use generative models to handle large irregular masks. However, these models can create unrealistic inpainted images due to two main issues: (1) Context Instability: Even with unmasked areas as context, generative models may still generate arbitrary objects in the masked region that don’t align with the rest of the image. (2) Hue Inconsistency: Inpainted regions often have color shifts that cause a smeared appearance, reducing image quality. Retraining the generative model could help solve these issues, but it’s costly since state-of-the-art latent-based diffusion and rectified flow models require a three-stage training process: training a VAE, training a generative U-Net or transformer, and fine-tuning for inpainting. Instead, this paper proposes a post-processing approach, dubbed ASUKA (Aligned Stable inpainting with UnKnown Areas prior), to improve inpainting models. To address context instability, we leverage a Masked Auto-Encoder (MAE) for reconstruction-based priors. This strengthens context alignment while maintaining the model's generation capabilities. To address hue inconsistency, we propose a specialized VAE decoder that treats latent-to-image decoding as a local harmonization task, significantly reducing color shifts for hue-consistent inpainting. We validate ASUKA on SD 1.5 and FLUX inpainting variants using the Places2 benchmark and MISATO, our proposed diverse collection of …
Poster
Zhe Zhang · Zhenzhong Chen · Shan Liu

[ ExHall D ]

Abstract
Neural lossless image compression methods have recently achieved impressive compression ratios by fitting neural networks to represent data distributions of large datasets. However, these methods often require complex networks to capture intricate data distributions effectively, resulting in high decoding complexity. In this paper, we present a novel approach named Fitted Neural Lossless Image Compression (FNLIC) that enhances efficiency through a two-phase fitting process. For each image, a latent variable model is overfitted to optimize the representation of the individual image's probability distribution, which is inherently simpler than the distribution of an entire dataset and requires less complex neural networks. Additionally, we pre-fit a lightweight autoregressive model on a comprehensive dataset to learn a beneficial prior for overfitted models. To improve coordination between the pre-fitting and overfitting phases, we introduce independent fitting for the pre-fitter and the adaptive prior transformation for the overfitted model. Extensive experimental results on high-resolution datasets show that FNLIC achieves competitive compression ratios compared to both traditional and neural lossless image compression methods, with decoding complexity significantly lower than other neural methods of similar performance. The code will be made publicly available upon publication.
Poster
Jona Ballé · Luca Versari · Emilien Dupont · Hyunjik Kim · Matthias Bauer

[ ExHall D ]

Abstract
Inspired by the success of generative image models, recent work on learned image compression increasingly focuses on better probabilistic models of the natural image distribution, leading to excellent image quality. This, however, comes at the expense of a computational complexity that is several orders of magnitude higher than today's commercial codecs, and thus prohibitive for most practical applications. With this paper, we demonstrate that by focusing on modeling visual perception rather than the data distribution, we can achieve a very good trade-off between visual quality and bit rate similar to "generative" compression models such as HiFiC, while requiring less than 1% of the multiply–accumulate operations (MACs) for decompression. We do this by optimizing C3, an overfitted image codec, for Wasserstein Distortion (WD), and evaluating the image reconstructions with a human rater study. The study also reveals that WD outperforms other perceptual quality metrics such as LPIPS, DISTS, and MS-SSIM, both as an optimization objective and as a predictor of human ratings, achieving over 94% Pearson correlation with Elo scores.
Poster
Xuewen Liu · Zhikai Li · Qingyi Gu

[ ExHall D ]

Abstract
Diffusion models have gradually gained prominence in the field of image synthesis, showcasing remarkable generative capabilities. Nevertheless, the slow inference and complex networks, resulting from redundancy at both temporal and structural levels, hinder their low-latency applications in real-world scenarios. Current acceleration methods for diffusion models focus separately on temporal and structural levels. However, independent optimization at each level to further push the acceleration limits results in significant performance degradation. On the other hand, integrating optimizations at both levels can compound the acceleration effects. Unfortunately, we find that the optimizations at these two levels are not entirely orthogonal. Performing separate optimizations and then simply integrating them results in unsatisfactory performance. To tackle this issue, we propose CacheQuant, a novel training-free paradigm that comprehensively accelerates diffusion models by jointly optimizing model caching and quantization techniques. Specifically, we employ a dynamic programming approach to determine the optimal cache schedule, in which the properties of caching and quantization are carefully considered to minimize errors. Additionally, we propose decoupled error correction to further mitigate the coupled and accumulated errors step by step. Experimental results show that CacheQuant achieves a 5.18× speedup and 4× compression for Stable Diffusion on MS-COCO, with only a 0.02 loss in …
Poster
Qianli Ma · Xuefei Ning · Dongrui Liu · Li Niu · Linfeng Zhang

[ ExHall D ]

Abstract
Diffusion models are trained by learning a sequence of models that reverse each step of noise corruption. Typically, the model parameters are fully shared across multiple timesteps to enhance training efficiency. However, since the denoising tasks differ at each timestep, the gradients computed at different timesteps may conflict, potentially degrading the overall performance of image generation. To solve this issue, this work proposes a Decouple-then-Merge (DeMe) framework, which begins with a pretrained model and finetunes separate models tailored to specific timesteps. We introduce several improved techniques during the finetuning stage to promote effective knowledge sharing while minimizing training interference across timesteps. Finally, after finetuning, these separate models can be merged into a single model in the parameter space, ensuring efficient and practical inference. Experimental results show significant generation quality improvements on 6 benchmarks, including Stable Diffusion on COCO30K, ImageNet1K, and PartiPrompts, and DDPM on LSUN Church, LSUN Bedroom, and CIFAR10. Code is included in the supplementary material and will be released on Github.
Poster
Youyuan Zhang · Zehua Liu · Zenan Li · Zhaoyu Li · James Clark · Xujie Si

[ ExHall D ]

Abstract
In this paper, we consider the conditional generation problem by guiding off-the-shelf unconditional diffusion models with differentiable loss functions in a plug-and-play fashion. While previous research has primarily focused on balancing the unconditional diffusion model and the guided loss through a tuned weight hyperparameter, we propose a novel framework that distinctly decouples these two components. Specifically, we introduce two variables x and z to represent the generated samples governed by the unconditional generation model and the guidance function, respectively. This decoupling reformulates conditional generation into two manageable subproblems, unified by the constraint x = z. Leveraging this setup, we develop a new algorithm based on the Alternating Direction Method of Multipliers (ADMM) to adaptively balance these components. Additionally, we establish the equivalence between the diffusion reverse step and the proximal operator of ADMM and provide a detailed convergence analysis of our algorithm under certain mild assumptions. Our experiments demonstrate that our proposed method consistently generates high-quality samples while ensuring strong adherence to the conditioning criteria. It outperforms existing methods across a range of conditional generation tasks, including image generation with various guidance and controllable motion synthesis.
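A minimal sketch of ADMM-style decoupled guidance with scaled dual updates, using hypothetical `generate_step` and `guidance_loss` callables; the paper's exact x-update via the diffusion reverse step is only approximated here:

```python
# Minimal sketch (assumptions, not the paper's exact updates): alternate a generative
# x-update, a guidance z-update, and a dual update that enforces the constraint x = z.
import torch

def admm_guided_generation(x0, generate_step, guidance_loss, rho=1.0, lr=0.1, iters=50):
    x, z = x0.clone(), x0.clone()
    u = torch.zeros_like(x0)                       # scaled dual variable
    for _ in range(iters):
        x = generate_step(z - u)                   # x-update: proximal step via the generator
        z = z.detach().requires_grad_(True)        # z-update: one gradient step on the guidance
        obj = guidance_loss(z) + 0.5 * rho * ((x.detach() - z + u) ** 2).sum()
        grad, = torch.autograd.grad(obj, z)
        z = (z - lr * grad).detach()
        u = u + x.detach() - z                     # dual update
    return x

# toy usage: identity "generator", guidance pulls the sample mean toward zero
x = admm_guided_generation(torch.randn(1, 3, 8, 8),
                           generate_step=lambda v: v,
                           guidance_loss=lambda z: z.mean() ** 2)
```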
Poster
Hao Lin · Ke Wu · Jie Li · Jun Li · Wu-Jun Li

[ ExHall D ]

Abstract
Distributed learning is commonly used for training deep learning models, especially large models. In distributed learning, manual parallelism (MP) methods demand considerable human effort and have limited flexibility. Hence, automatic parallelism (AP) methods have recently been proposed for automating the parallel strategy optimization process. Existing AP methods suffer from sub-optimal solutions because they do not jointly optimize the two categories of parallel strategies (i.e., inter-layer parallelism and intra-layer parallelism). In this paper, we propose a novel AP method called UniAP, which unifies inter- and intra-layer automatic parallelism by mixed integer quadratic programming. To the best of our knowledge, UniAP is the first parallel method that can jointly optimize the two categories of parallel strategies to find an optimal solution. Experimental results show that UniAP outperforms state-of-the-art methods by up to 3.80× in throughput and reduces strategy optimization time by up to 107× across five Transformer-based models.
Poster
Mashrur M. Morshed · Vishnu Naresh Boddeti

[ ExHall D ]

Abstract
Many real-world applications of flow-based generative models desire a diverse set of samples covering multiple modes of the target distribution. However, the predominant approach for obtaining diverse sets is not sample-efficient, as it involves independently obtaining many samples from the source distribution and mapping them through the flow until the desired mode coverage is achieved. As an alternative to repeated sampling, we introduce DiverseFlow: a training-free, inference-time approach to improve the diversity of flow models. Our key idea is to employ a determinantal point process to induce a coupling between the samples that drives diversity under a fixed sampling budget. In essence, DiverseFlow enables exploring more variations in a learned flow model with fewer samples. We demonstrate the efficacy of our method for tasks where sample-efficient diversity is desirable, such as text-guided image generation with polysemous words, inverse problems like large-hole inpainting, and class-conditional image synthesis.
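A minimal sketch of a determinantal point process style diversity score over a batch of samples, assuming an RBF similarity kernel (the paper's coupling and kernel choice may differ):

```python
# Minimal sketch (assumption): diversity as the log-determinant of an RBF kernel over
# per-sample features; its gradient can be used to push samples apart.
import torch

def dpp_diversity(features: torch.Tensor, bandwidth: float = 1.0) -> torch.Tensor:
    """features: (N, D), one feature vector per sample in the batch."""
    d2 = torch.cdist(features, features) ** 2
    kernel = torch.exp(-d2 / (2.0 * bandwidth ** 2))
    kernel = kernel + 1e-4 * torch.eye(features.shape[0])   # numerical stabilizer
    return torch.logdet(kernel)                              # larger = more diverse

feats = torch.randn(4, 16, requires_grad=True)
score = dpp_diversity(feats)
score.backward()          # gradient ascent on this score spreads the samples out
```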
Poster
Junhyuk So · Jiwoong Shin · Chaeyeon Jang · Eunhyeok Park

[ ExHall D ]

Abstract
Recently, diffusion models have achieved significant advances in vision, text, and robotics. However, they still face slow generation speeds due to sequential denoising processes. To address this, a parallel sampling method based on Picard iteration was introduced, effectively reducing sequential steps while ensuring exact convergence to the original output. Nonetheless, Picard iteration does not guarantee faster convergence, which can still result in slow generation in practice. In this work, we propose a new parallelization scheme, the Picard Consistency Model (PCM), which significantly reduces the number of generation steps in Picard iteration. Inspired by the consistency model, PCM is directly trained to predict the fixed-point solution, or the final output, at any stage of the convergence trajectory. Additionally, we introduce a new concept called model switching, which addresses PCM’s limitations and ensures exact convergence. Extensive experiments demonstrate that PCM achieves up to a 2.71x speedup over sequential sampling and a 1.77x speedup over Picard iteration across various tasks, including image generation and robotic control.
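A minimal sketch of plain Picard (fixed-point) iteration for parallel sampling, with a hypothetical `step_fn`; PCM itself additionally learns to predict the fixed point directly, which is not shown here:

```python
# Minimal sketch (assumption): update all trajectory points in parallel from the previous
# trajectory estimate until the trajectory stops changing.
import torch

@torch.no_grad()
def picard_sampling(x_T, step_fn, num_steps: int, iters: int = 20, tol: float = 1e-4):
    traj = [x_T.clone() for _ in range(num_steps + 1)]     # traj[k] approximates the state after k steps
    for _ in range(iters):
        new_traj = [x_T.clone()]
        for k in range(num_steps):
            new_traj.append(step_fn(traj[k], k))           # every update reads only the *old* trajectory
        delta = max((a - b).abs().max().item() for a, b in zip(new_traj, traj))
        traj = new_traj
        if delta < tol:
            break
    return traj[-1]

x = picard_sampling(torch.randn(1, 3, 8, 8), step_fn=lambda x, k: 0.9 * x, num_steps=10)
```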
Poster
David McAllister · Matthew Tancik · Jiaming Song · Angjoo Kanazawa

[ ExHall D ]

Abstract
Large-scale AI model training divides work across thousands of GPUs and then synchronizes gradients across them at each step. This incurs a significant network burden that only centralized, monolithic clusters can support, driving up infrastructure costs and straining power systems. We propose Decentralized Diffusion Models, a scalable framework to distribute diffusion model training across independent clusters or datacenters by eliminating the dependence on a centralized, high-bandwidth networking fabric. Our method trains a set of expert diffusion models over partitions of the dataset, each in full isolation from one another. At inference time, they ensemble through a lightweight router. We show that this ensemble collectively optimizes the same objective as a single model trained over the whole dataset. This means we can divide the training burden among a number of "compute islands," lowering infrastructure costs and improving resilience to localized GPU failures. Decentralized diffusion models empower researchers to take advantage of smaller, more cost-effective and more readily available compute like on-demand GPU nodes rather than central integrated systems. We conduct extensive experiments on ImageNet and LAION Aesthetics, showing that decentralized diffusion models FLOP-for-FLOP outperform standard diffusion models. We finally scale our approach to 24 billion parameters, demonstrating that high-quality diffusion models can …
Poster
Zigeng Chen · Xinyin Ma · Gongfan Fang · Xinchao Wang

[ ExHall D ]

Abstract
In the rapidly advancing field of image generation, *Visual Auto-Regressive* (VAR) modeling has garnered considerable attention for its innovative next-scale prediction approach. This paradigm offers substantial improvements in efficiency, scalability, and zero-shot generalization. Yet, the inherently coarse-to-fine nature of VAR introduces a prolonged token sequence, leading to prohibitive memory consumption and computational redundancies. To overcome these bottlenecks, we propose *Collaborative Decoding* (CoDe), a novel decoding strategy tailored to the VAR framework. CoDe capitalizes on two critical observations: the substantially reduced parameter demands at larger scales and the exclusive generation patterns across different scales. Based on these insights, we partition the multi-scale inference process into a seamless collaboration between a large model and a small model. The large model serves as the 'drafter', specializing in generating low-frequency content at smaller scales, while the smaller model serves as the 'refiner', solely focusing on predicting high-frequency details at larger scales. This collaboration yields remarkable efficiency with minimal impact on quality: CoDe achieves a 1.7x speedup, slashes memory usage by 50%, and preserves image quality with only a negligible FID increase from 1.95 to 1.98. When drafting steps are further decreased, CoDe can achieve an impressive 2.9x acceleration, reaching over 41 images/s at 256x256 …
Poster
Ye Chen · Zhangli Hu · Zhongyin Zhao · Yupeng Zhu · Yue Shi · Yuxuan Xiong · Bingbing Ni

[ ExHall D ]

Abstract
Current parameterized image representations embed visual information along the semantic boundaries and struggle to express the internal detailed texture structures of image components, leading to a lack of content consistency after image editing and driving. To address these challenges, this work proposes a novel parameterized representation based on hierarchical image proxy geometry, utilizing multi-layer hierarchically interrelated proxy geometric control points to embed multi-scale long-range structures and fine-grained texture details. The proposed representation enables smoother and more continuous interpolation during image rendering and ensures high-quality consistency within image components during image editing. Additionally, under the layer-wise representation strategy based on semantic-aware image layer decomposition, we enable decoupled image shape/texture editing of the targets of interest within the image. Extensive experimental results on image vectorization and editing tasks demonstrate that our proposed method achieves high rendering accuracy of general images, including natural images, with a significantly higher image parameter compression ratio, facilitating user-friendly editing of image semantic components.
Poster
Yael Vinker · Tamar Rott Shaham · Kristine Zheng · Alex Zhao · Judith Fan · Antonio Torralba

[ ExHall D ]

Abstract
Sketching serves as a versatile tool for externalizing ideas, enabling rapid exploration and visual communication that spans various disciplines. While artificial systems have driven substantial advances in content creation and human-computer interaction, capturing the dynamic and abstract nature of human sketching remains challenging. In this work, we introduce SketchAgent, a language-driven, sequential sketch generation method that enables users to create, modify, and refine sketches through dynamic, conversational interactions. Our approach requires no training or fine-tuning. Instead, we leverage the sequential nature and rich prior knowledge of off-the-shelf multimodal large language models (LLMs). We present an intuitive sketching language, introduced to the model through in-context examples, enabling it to "draw" using string-based actions. These are processed into vector graphics and then rendered to create a sketch on a pixel canvas, which can be accessed again for further tasks. By drawing stroke by stroke, our agent captures the evolving, dynamic qualities intrinsic to sketching. We demonstrate that SketchAgent can generate sketches from diverse prompts, engage in dialogue-driven drawing, and collaborate meaningfully with human users.
Poster
Xihua Wang · Ruihua Song · Chongxuan Li · Xin Cheng · Boyuan Li · Yihan Wu · Yuyue Wang · Hongteng Xu · Yunfeng Wang

[ ExHall D ]

Abstract
This paper addresses a promising yet underexplored task, Image-to-Sounding-Video (I2SV) generation, which animates a static image and generates synchronized sound simultaneously. Despite advances in video and audio generation models, challenges remain in developing a unified model for generating naturally sounding videos. In this work, we propose a novel approach that leverages two separate pretrained diffusion models and makes vision and audio influence each other during generation based on the Diffusion Transformer (DiT) architecture. First, the individual video and audio generation models are decomposed into input, output, and expert sub-modules. We propose using a unified joint DiT block in the expert sub-modules to effectively model the interaction between the two modalities, resulting in high-quality I2SV generation. Then, we introduce a joint classifier-free guidance technique to boost the performance during joint generation. Finally, we conduct extensive experiments on three popular benchmark datasets, and in both objective and subjective evaluations our method surpasses all the baseline methods on almost all metrics. Case studies show that our generated sounding videos are of high quality, with video and audio well synchronized.
Poster
Feng-Lin Liu · Hongbo Fu · Xintao Wang · Weicai Ye · Pengfei Wan · Di ZHANG · Lin Gao

[ ExHall D ]

Abstract
Video generation and editing conditioned on text prompts or images have undergone significant advancements. However, challenges remain in accurately controlling global layout and geometry details solely by texts, and supporting motion control and local modification through images. In this paper, we aim to achieve sketch-based spatial and motion control for video generation and support fine-grained editing of real or synthetic videos. Based on the DiT video generation model, we propose a memory-efficient control structure with sketch control blocks that predict residual features of skipped DiT blocks. Sketches are drawn on one or two keyframes (at arbitrary time points) for easy interaction. To propagate such temporally sparse sketch conditions across all frames, we propose an inter-frame attention mechanism to analyze the relationship between the keyframes and each video frame. For sketch-based video editing, we design an additional video insertion module that maintains consistency between the newly edited content and the original video's spatial feature and dynamic motion. During inference, we use latent fusion for the accurate preservation of unedited regions. Extensive experiments demonstrate that our SketchVideo achieves superior performance in controllable video generation and editing. We will release our code after acceptance.
Poster
Dingkun Yan · Xinrui Wang · Zhuoru Li · Suguru Saito · Yusuke Iwasawa · Yutaka Matsuo · Jiaxian Guo

[ ExHall D ]

Abstract
Sketch colorization plays an important role in animation and digital illustration production tasks. However, existing methods still face problems: text-guided methods fail to provide accurate color and style references, hint-guided methods still involve manual operation, and image-guided methods are prone to artifacts. To address these limitations, we propose a diffusion-based framework inspired by real-world animation production workflows. Our approach leverages the sketch as the spatial reference and an RGB image as the color guidance, and separately extracts foreground and background information from the reference image with spatial masks. In particular, we introduce a split cross-attention mechanism with LoRA (Low-Rank Adaptation) modules trained separately for the foreground and background to control the corresponding key and value embeddings in cross-attention. This design allows the diffusion model to integrate information from the foreground and background independently, preventing interference and eliminating the need to fine-tune model parameters. During inference, we design switchable inference modes for diverse use scenarios by changing the modules activated in the framework. Extensive qualitative and quantitative experiments, along with user studies, demonstrate our advantages over existing methods in generating high-quality, artifact-free results with geometrically mismatched references. Ablation studies further confirm the effectiveness of each component. Codes and trained models will …
Poster
Junyu Gao · Kunlin Yang · Xuan Yao · Yufan Hu

[ ExHall D ]

Abstract
Recently, text-driven video editing methods that optimize target latent representations have garnered significant attention and demonstrated promising results. However, these methods rely on self-supervised objectives to compute the gradients needed for updating latent representations, which inevitably introduces gradient noise, compromising content generation quality. Additionally, it is challenging to determine the optimal stopping point for the editing process, making it difficult to achieve an optimal solution for the latent representation. To address these issues, we propose a unified gradient-latent purification framework that collects gradient and latent information across different stages to identify effective and concordant update directions. We design a local coordinate system construction method based on feature decomposition, enabling short-term gradients and final-stage latents to be reprojected onto new axes. Then, we employ tailored coefficient regularization terms to effectively aggregate the decomposed information. Additionally, a temporal smoothing axis extension strategy is developed to enhance the temporal coherence of the generated content. Extensive experiments demonstrate that our proposed method outperforms state-of-the-art methods across various editing tasks, delivering superior editing performance. Code is available in the Supplementary Material.
Poster
Zilyu Ye · Zhiyang Chen · Tiancheng Li · Zemin Huang · Weijian Luo · Guo-Jun Qi

[ ExHall D ]

Abstract
Diffusion and flow models have achieved remarkable successes in various applications such as text-to-image generation. However, these models typically rely on the same predetermined denoising schedules during inference for each prompt, which potentially limits the inference efficiency as well as the flexibility when handling different prompts. In this paper, we argue that the optimal noise schedule should adapt to each inference instance, and introduce the Time Prediction Diffusion Model (TPDM) to accomplish this. TPDM employs a plug-and-play Time Prediction Module (TPM) that predicts the next noise level based on current latent features at each denoising step. We train the TPM using reinforcement learning to maximize the final image quality while discounting the number of denoising steps. With such an adaptive scheduler, TPDM not only generates high-quality images that are aligned closely with human preferences but also adjusts the number of denoising steps and time on the fly, enhancing both performance and efficiency. We train TPDMs on multiple diffusion model benchmarks. With the Stable Diffusion 3 Medium architecture, TPDM achieves an aesthetic score of 5.44 and a human preference score (HPS) of 29.59, while using 50% fewer denoising steps to achieve better performance. We will release our best model alongside this paper.
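A minimal sketch of a time prediction module that maps the current latent to the next noise level, with an assumed pooling-plus-MLP design (the paper's TPM architecture and reinforcement-learning training are not shown):

```python
# Minimal sketch (assumption, not the paper's TPM): a tiny module that looks at the current
# latent and predicts the next noise level, so the denoising schedule adapts per sample.
import torch
import torch.nn as nn

class TimePredictionModule(nn.Module):
    def __init__(self, channels: int = 4, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(channels, hidden), nn.SiLU(), nn.Linear(hidden, 1))

    def forward(self, latent: torch.Tensor, t_current: torch.Tensor) -> torch.Tensor:
        pooled = latent.mean(dim=(2, 3))                    # (B, C) global summary of the latent
        frac = torch.sigmoid(self.net(pooled)).squeeze(-1)  # fraction of the remaining time to keep
        return frac * t_current                             # next noise level in (0, t_current)

tpm = TimePredictionModule()
t_next = tpm(torch.randn(2, 4, 32, 32), t_current=torch.tensor([0.8, 0.8]))
```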
Poster
Ravishankar Evani · Deepu Rajan · Shangbo Mao

[ ExHall D ]

Abstract
Texture recognition has recently relied on neural networks that are convolution-, transformer-, and graph-based. However, many of these methods fail to effectively incorporate the frequency characteristics exhibited by visual and latent texture attributes. In addition, effective orderless representation of textures before mapping from latent to visual texture attributes has not been fully explored. Finally, there is no loss function that has been designed specifically for texture and material recognition tasks. In this study, we introduce the Chebyshev Attention Depth Permutation Texture Network (CAPTN), which uses texture frequency attention mechanisms and convolution operations to generate latent texture attributes. These attributes are then enhanced by permuting the feature space. CAPTN then incorporates a non-linear learnable Chebyshev function to improve the mapping of orderless enhanced latent texture attributes to visual texture attributes. Finally, we propose a Latent Texture Attribute Loss to capture spatial texture characteristics and enforce distributional consistency of orderless latent texture attribute representations. CAPTN allows end-to-end training without the need to fine-tune pre-trained CNN backbones. Experiments show that CAPTN achieves state-of-the-art results on multiple benchmark texture and material datasets.
Poster
Shuhao Zhang · Hui Kang · Yang Liu · Fang Mei · Hongjuan Li

[ ExHall D ]

Abstract
Attention-based arbitrary style transfer methods have gained significant attention recently due to their impressive ability to synthesize style details. However, the point-wise matching within the attention mechanism may overly focus on local patterns and thus neglect the global features of style images. Additionally, when processing large images, the quadratic complexity of the attention mechanism brings a high computational load. To alleviate the above problems, we propose the Holistic Style Injector (HSI), a novel attention-style transformation module that delivers the artistic expression of the target style. Specifically, HSI performs stylization based only on a global style representation, which is more in line with the characteristics of style transfer, to avoid generating locally disharmonious patterns in stylized images. Moreover, we propose a dual relation learning mechanism inside the HSI to dynamically render images by leveraging semantic similarity in content and style, ensuring the stylized images preserve the original content and improve style fidelity. Note that the proposed HSI achieves linear computational complexity because it establishes feature mapping through element-wise multiplication rather than matrix multiplication. Qualitative and quantitative results demonstrate that our method outperforms state-of-the-art approaches in both effectiveness and efficiency.
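A minimal sketch of element-wise global style injection with linear complexity, assuming a mean-pooled style descriptor and learned scale/shift (not the paper's HSI module):

```python
# Minimal sketch (assumption): summarize the style image globally, then modulate the
# content features with an element-wise scale and shift, which costs O(pixels) instead of
# the O(pixels^2) of attention.
import torch
import torch.nn as nn

class GlobalStyleInjector(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.to_scale = nn.Linear(channels, channels)
        self.to_shift = nn.Linear(channels, channels)

    def forward(self, content: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        """content, style: (B, C, H, W)."""
        g = style.mean(dim=(2, 3))                           # (B, C) global style descriptor
        scale = self.to_scale(g)[:, :, None, None]
        shift = self.to_shift(g)[:, :, None, None]
        return content * (1.0 + scale) + shift               # element-wise stylization

out = GlobalStyleInjector(64)(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
```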
Poster
Mingkun Lei · Xue Song · Beier Zhu · Hao Wang · Chi Zhang

[ ExHall D ]

Abstract
Text-driven style transfer aims to merge the style of a reference image with content described by a text prompt. Recent advancements in text-to-image models have improved the nuance of style transformations, yet significant challenges remain, particularly overfitting to reference styles, limited stylistic control, and misalignment with textual content. In this paper, we propose three complementary strategies to address these issues. First, we introduce a cross-modal Adaptive Instance Normalization (AdaIN) mechanism for better integration of style and text features, enhancing alignment. Second, we develop a Style-based Classifier-Free Guidance (SCFG) approach that enables selective control over stylistic elements, reducing irrelevant influences. Finally, we incorporate a teacher model during early generation stages to stabilize spatial layouts and mitigate artifacts. Our extensive evaluations demonstrate significant improvements in style transfer quality and alignment with textual prompts. Furthermore, our approach can be integrated into existing style transfer frameworks without fine-tuning.
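A minimal sketch of classic AdaIN, which the cross-modal mechanism above extends; using channel-wise mean and standard deviation is the standard assumption here:

```python
# Minimal sketch: classic Adaptive Instance Normalization. The content feature is
# renormalized to the style feature's channel-wise mean and standard deviation.
import torch

def adain(content: torch.Tensor, style: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """content, style: (B, C, H, W)."""
    c_mu, c_std = content.mean((2, 3), keepdim=True), content.std((2, 3), keepdim=True)
    s_mu, s_std = style.mean((2, 3), keepdim=True), style.std((2, 3), keepdim=True)
    return s_std * (content - c_mu) / (c_std + eps) + s_mu

stylized = adain(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
```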
Poster
Srikar Yellapragada · Alexandros Graikos · Kostas Triaridis · Prateek Prasanna · Rajarsi Gupta · Joel Saltz · Dimitris Samaras

[ ExHall D ]

Abstract
Diffusion models have revolutionized image generation, yet several challenges restrict their application to large-image domains, such as digital pathology and satellite imagery. Given that it is infeasible to directly train a model on 'whole' images from domains with potential gigapixel sizes, diffusion-based generative methods have focused on synthesizing small, fixed-size patches extracted from these images. However, generating small patches has limited applicability since patch-based models fail to capture the global structures and wider context of large images, which can be crucial for synthesizing (semantically) accurate samples. In this paper, to overcome this limitation, we present ZoomLDM, a diffusion model tailored for generating images across multiple scales. Central to our approach is a novel magnification-aware conditioning mechanism that utilizes self-supervised learning (SSL) embeddings and allows the diffusion model to synthesize images at different 'zoom' levels, i.e., fixed-size patches extracted from large images at varying scales. ZoomLDM achieves state-of-the-art image generation quality across all scales, excelling particularly in the data-scarce setting of generating thumbnails of entire large images. The multi-scale nature of ZoomLDM unlocks additional capabilities in large image generation, enabling computationally tractable and globally coherent image synthesis up to 4096×4096 pixels and 4× super-resolution. Additionally, multi-scale features extracted from …
Poster
Jinjin Zhang · qiuyu Huang · Junjie Liu · Xiefan Guo · Di Huang

[ ExHall D ]

Abstract
In this paper, we present Diffusion-4K, a novel framework for direct ultra-high-resolution image synthesis using text-to-image diffusion models. The core advancements include: (1) Aesthetic-4K Benchmark: addressing the absence of a publicly available 4K image synthesis dataset, we construct Aesthetic-4K, a comprehensive benchmark for ultra-high-resolution image generation. We curated a high-quality 4K dataset with carefully selected images and captions generated by GPT-4o. Additionally, we introduce GLCM Score and compression ratio metrics to evaluate fine details, combined with holistic measures such as FID, Aesthetics and CLIPScore for a comprehensive assessment of ultra-high-resolution images. (2) Wavelet-based Fine-tuning: we propose a wavelet-based fine-tuning approach for direct training with photorealistic 4K images, applicable to various latent diffusion models, demonstrating its effectiveness in synthesizing highly detailed 4K images. Consequently, Diffusion-4K achieves impressive performance in high-quality image synthesis and text prompt adherence, especially when powered by modern large-scale diffusion models (e.g., SD3-2B and Flux-12B). Extensive experimental results from our benchmark demonstrate the superiority of Diffusion-4K in ultra-high-resolution image synthesis. The code and dataset will be made publicly available soon.
Poster
Yoonjeon Kim · Soohyun Ryu · Yeonsung Jung · Hyunkoo Lee · Joowon Kim · June Yong Yang · Jaeryong Hwang · Eunho Yang

[ ExHall D ]

Abstract
The development of vision-language and generative models has significantly advanced text-guided image editing, which seeks to preserve core elements of the source image while implementing modifications based on the target text. However, existing metrics suffer from a context-blindness problem: they indiscriminately apply the same criteria to completely different contexts and are biased toward either modification or preservation. Directional CLIP similarity, the only metric that considers both the source image and the target text, is also biased toward modification aspects and attends to irrelevant editing regions of the image. We propose AugCLIP, a context-aware metric that adaptively coordinates preservation and modification aspects, depending on the specific context of a given source image and target text. This is done by deriving the CLIP representation of an ideally edited image that preserves the source image with the necessary modifications to align with the target text. More specifically, using a multi-modal large language model, AugCLIP generates detailed textual descriptions of the source and target, then calculates a modification vector through a hyperplane in CLIP space that separates source and target attributes. Extensive experiments on five benchmark datasets, encompassing a diverse range of editing scenarios, show that AugCLIP aligns remarkably well with human evaluation standards, outperforming existing metrics. The …
Poster
Shanshan Huang · Haoxuan Li · Chunyuan Zheng · Lei Wang · Guorui Liao · Zhili Gong · Huayi Yang · Li Liu

[ ExHall D ]

Abstract
A key challenge for controllable image editing is the fact that visual attributes with semantic meanings are not always independent of each other, resulting in spurious correlations in model training. However, most existing methods ignore this issue, leading to biased causal representation learning and unintended changes to unrelated features in the edited images. This leads us to present a diffusion-based causal representation learning framework called CIDiffuser that employs structural causal models (SCMs) to capture causal representations of visual attributes and thereby address the spurious correlations. The framework first adopts a semantic encoder to decompose the representation into the target part, which includes visual attributes of interest to the user, and the "other" part. We then introduce a direct causal effect learning module to capture the total direct causal effect between the potential outcomes before and after intervening on the visual attributes. In addition, a diffusion-based learning strategy is designed to optimize the representation learning process. Empirical evaluations on two benchmark datasets and one in-house dataset suggest our approach significantly outperforms state-of-the-art methods, enabling controllable image editing by modifying the learned visual representations.
Poster
Wenhao Gu · Li Gu · Ching Suen · Yang Wang

[ ExHall D ]

Abstract
Recent advancements in handwritten text recognition (HTR) have enabled effective conversion of handwritten text to digital formats. However, achieving robust recognition across diverse writing styles remains challenging. Traditional HTR methods lack writer-specific personalization at test time due to limitations in model architecture and training strategies. Existing attempts to bridge this gap through gradient-based meta-learning still require labeled examples and suffer from parameter-inefficient fine-tuning, leading to substantial computational and memory overhead. To overcome these challenges, we propose an efficient framework that formulates personalization as prompt tuning, incorporating an auxiliary image reconstruction task with a self-supervised loss to guide prompt adaptation using unlabeled test-time examples. To ensure the self-supervised loss effectively minimizes text recognition error, we leverage meta-learning to learn the optimal initialization of the prompts. As a result, our method allows the model to efficiently capture unique writing styles by updating less than 1% of its parameters and eliminating the need for time-intensive annotation processes. We validate our approach on the RIMES and IAM Handwriting Database benchmarks, where it consistently outperforms previous state-of-the-art methods with up to an 8x speedup. We believe this represents a significant advancement in personalized handwritten text recognition, paving the way for more reliable and practical deployment in …
Poster
Zihao Wang · Yuxiang Wei · Fan Li · Renjing Pei · Hang Xu · Wangmeng Zuo

[ ExHall D ]

Abstract
Recent advances in text-to-image diffusion models have significantly facilitated the generation of high-quality images, but they also raise concerns about the illegal creation of harmful content, such as copyrighted images. Existing concept erasure methods achieve superior results in preventing the production of erased concepts from prompts, but typically perform poorly in preventing undesired editing. To address this issue, we propose an Anti-Editing Concept Erasure (ACE) method, which not only erases the target concept during generation but also filters it out during editing. Specifically, we propose to inject the erasure guidance into both the conditional and the unconditional noise prediction, enabling the model to effectively prevent the creation of erased concepts during both editing and generation. Furthermore, a stochastic correction guidance is introduced during training to address the erosion of unrelated concepts. We conducted erasure editing experiments with representative editing methods (i.e., LEDITS++ and MasaCtrl) to erase IP characters, and the results indicate that our ACE effectively filters out target concepts in both types of edits. Additional experiments on erasing explicit concepts and artistic styles further demonstrate that our ACE performs favorably against state-of-the-art methods. Our code will be publicly available.
Poster
Shoufa Chen · Chongjian GE · Yuqi Zhang · Yida Zhang · Fengda Zhu · Hao Yang · Hongxiang Hao · hui wu · Zhichao Lai · Yifei Hu · Ting-Che Lin · Shilong Zhang · Fu Li · Chuan Li · Xing Wang · Yanghua Peng · Peize Sun · Ping Luo · Yi Jiang · Zehuan Yuan · BINGYUE PENG · Xiaobing Liu

[ ExHall D ]

Abstract
This paper presents our latest advancements, Goku, a new family of joint image-and-video generation models based on rectified flow Transformers that achieve industry-grade performance. We present the foundational elements required for high-quality visual generation, including data curation, model design, flow formulation, etc. Key contributions include a meticulous data filtering pipeline that ensures high-quality, fine-grained image and video data curation, and the pioneering use of rectified flow for enhanced interaction among video and image tokens. Goku models achieve superior performance in both qualitative and quantitative assessments. Notably, Goku achieves top scores on major benchmarks: 0.76 on GenEval and 83.65 on DPG-Bench for text-to-image generation, alongside 82.7 on VBench for text-to-video tasks. We hope this report offers valuable insights into joint image-and-video generation models for the research community.
Poster
Weimin Qiu · Jieke Wang · Meng Tang

[ ExHall D ]

Abstract
Diffusion models have achieved unprecedented fidelity and diversity for synthesizing images, videos, 3D assets, etc. However, subject mixing is a known and unresolved issue for diffusion-based image synthesis, particularly for synthesizing multiple similar-looking subjects. We propose Self-Cross diffusion guidance to penalize the overlap between cross-attention maps and aggregated self-attention maps. Compared to previous methods based on self-attention or cross-attention alone, our Self-Cross guidance is more effective in eliminating subject mixing. Moreover, our guidance addresses mixing for all relevant patches of a subject beyond the most discriminative one, e.g., the beak of a bird. We aggregate self-attention maps of automatically selected patches for a subject to form a region that the whole subject attends to. Our method is training-free and can boost the performance of any transformer-based diffusion model such as Stable Diffusion. We also release a more challenging benchmark with many text prompts of similar-looking subjects and utilize GPT-4o for automatic and reliable evaluation. Extensive qualitative and quantitative results demonstrate the effectiveness of our Self-Cross guidance.
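The overlap penalty sketched in the abstract can be illustrated in a few lines, assuming per-token cross-attention maps and per-patch self-attention maps are already extracted from a diffusion backbone. The tensor shapes, token groupings, and top-k aggregation rule below are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of a self-cross overlap penalty between subjects.
import torch

def self_cross_overlap_penalty(cross_attn, self_attn, subject_tokens, top_k=5):
    """
    cross_attn:     (tokens, H, W) cross-attention maps for the text tokens.
    self_attn:      (H*W, H, W) self-attention map of every patch over all patches.
    subject_tokens: list of token-index lists, one list per subject.
    Returns a scalar that grows when different subjects' aggregated regions overlap.
    """
    subject_regions = []
    for tokens in subject_tokens:
        ca = cross_attn[tokens].mean(dim=0)          # region the subject's tokens attend to
        idx = ca.flatten().topk(top_k).indices       # most-attended patches
        # Aggregate those patches' self-attention so the region covers the whole
        # subject, not just its most discriminative patch.
        sa = self_attn[idx].mean(dim=0)
        subject_regions.append(sa / (sa.sum() + 1e-8))
    penalty = cross_attn.new_zeros(())
    for i in range(len(subject_regions)):
        for j in range(i + 1, len(subject_regions)):
            # Pairwise overlap between normalized regions.
            penalty = penalty + torch.minimum(subject_regions[i], subject_regions[j]).sum()
    return penalty

# Toy usage: 2 subjects on a 16x16 latent grid with random attention maps.
cross = torch.rand(10, 16, 16)
self_a = torch.softmax(torch.rand(256, 256), dim=-1).reshape(256, 16, 16)
print(self_cross_overlap_penalty(cross, self_a, [[1, 2], [5, 6]]))
```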
Poster
Chao Wang · Hehe Fan · Huichen Yang · Sarvnaz Karimi · Lina Yao · Yi Yang

[ ExHall D ]

Abstract
Diffusion-based Text-to-Image (T2I) models have demonstrated significant potential in image restoration. However, existing models continue to grapple with challenges such as complex training and prompt design. We introduce a new perspective for improving image restoration by injecting knowledge from pretrained vision-language models into current T2I models. We empirically show that the degradation and content representations in BLIP-2 can be linearly separated, providing promising degradation guidance for image restoration. Specifically, the Feature Difference Instruction (FDI) is first extracted by Q-Formers through a simple subtraction operation based on reference image pairs. Then, we propose a multi-scale FDI adapter to decouple the degradation style and corrupted artifacts, and inject the styleflow exclusively into specific blocks through adapter tuning, thereby preventing noise interference and eschewing the need for cumbersome weight retraining. In this way, we can train various task-specific adapters for different degradations, achieving rich detail enhancement in the restoration results. Furthermore, the proposed FDI adapters have attractive properties of practical value, such as composability and generalization ability for all-in-one and mixed-degradation restoration. Extensive experiments under various settings demonstrate that our method achieves promising restoration quality across 10 image restoration tasks and a wide range of other applications. Codes will be publicly available.
Poster
Sanghyeon Na · Yonggyu Kim · Hyunjoon Lee

[ ExHall D ]

Abstract
Human image generation is a key focus in image synthesis due to its broad applications. However, generating high-quality human images remains challenging because even slight inaccuracies in anatomy, pose, or fine details can compromise visual realism. To address these challenges, we explore Direct Preference Optimization (DPO), a method that trains models to generate images similar to preferred (winning) images while diverging from non-preferred (losing) ones. Conventional DPO approaches typically employ generated images as winning images, which may limit the model's ability to achieve high levels of realism. To overcome this limitation, we propose an enhanced DPO approach that incorporates high-quality real images as winning images, encouraging the model to produce outputs that resemble those real images rather than generated ones. Specifically, our approach, HG-DPO (Human image Generation through DPO), employs a novel curriculum learning framework that allows the model to gradually improve toward generating realistic human images, making the training more feasible than attempting the improvement all at once. Furthermore, we demonstrate that HG-DPO effectively adapts to personalized text-to-image tasks, generating high-quality, identity-specific images, which highlights the practical value of our approach.
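For readers unfamiliar with DPO-style objectives for diffusion models, the sketch below shows a generic Diffusion-DPO-like loss in which the winning sample is a real photograph and the losing sample is a generation, mirroring the preference setup described above. Using per-sample denoising errors as implicit log-likelihoods is a common simplification in diffusion DPO variants; this is not the exact HG-DPO objective.

```python
# Minimal sketch of a diffusion preference loss with real images as winners.
import torch
import torch.nn.functional as F

def dpo_diffusion_loss(err_win, err_lose, ref_err_win, ref_err_lose, beta=500.0):
    """Each argument is a per-sample denoising MSE (lower = more likely).
    err_*     come from the model being fine-tuned,
    ref_err_* come from a frozen reference copy of the model."""
    win_term = -(err_win - ref_err_win)    # implicit log-likelihood ratio, winner (real image)
    lose_term = -(err_lose - ref_err_lose) # implicit log-likelihood ratio, loser (generation)
    return -F.logsigmoid(beta * (win_term - lose_term)).mean()

# Toy usage with random per-sample errors for a batch of 4 preference pairs.
e = lambda: torch.rand(4, requires_grad=True)
print(dpo_diffusion_loss(e(), e(), torch.rand(4), torch.rand(4)))
```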
Poster
Zhendong Wang · Jianmin Bao · Shuyang Gu · Dong Chen · Wengang Zhou · Houqiang Li

[ ExHall D ]

Abstract
In this paper, we present DesignDiffusion, a simple yet effective framework for the novel task of synthesizing design images from textual descriptions. A primary challenge lies in generating accurate and style-consistent textual and visual content. Existing works in a related task of visual text generation often focus on generating text within given specific regions, which limits the creativity of generation models, resulting in style or color inconsistencies between textual and visual elements if applied to design image generation. To address this issue, we propose an end-to-end, one-stage diffusion-based framework that avoids intricate components like position and layout modeling. Specifically, the proposed framework directly synthesizes textual and visual design elements from user prompts. It utilizes a distinctive character embedding derived from the visual text to enhance the input prompt, along with a character localization loss for enhanced supervision during text generation. Furthermore, we employ a self-play Direct Preference Optimization fine-tuning strategy to improve the quality and accuracy of the synthesized visual text. Extensive experiments demonstrate that DesignDiffusion achieves state-of-the-art performance in design image generation.
Poster
Senmao Li · Lei Wang · Kai Wang · Tao Liu · Jiehang Xie · Joost van de Weijer · Fahad Shahbaz Khan · Shiqi Yang · Yaxing Wang · Jian Yang

[ ExHall D ]

Abstract
Text-to-Image (T2I) diffusion models have made remarkable advancements in generative modeling; however, they face a trade-off between inference speed and image quality, posing challenges for efficient deployment. Existing distilled T2I models can generate high-fidelity images with fewer sampling steps, but often struggle with diversity and quality, especially in one-step models. From our analysis, we observe redundant computations in the UNet encoders. Our findings suggest that, for T2I diffusion models, decoders are more adept at capturing richer and more explicit semantic information, while encoders can be effectively shared across decoders from diverse time steps. Based on these observations, we introduce the first Time-independent Unified Encoder (TiUE) for the student model UNet architecture, which is a loop-free image generation approach for distilling T2I diffusion models. Using a one-pass scheme, TiUE shares encoder features across multiple decoder time steps, enabling parallel sampling and significantly reducing inference time complexity. In addition, we incorporate a KL divergence term to regularize noise prediction, which enhances the perceptual realism and diversity of the generated images. Experimental results demonstrate that TiUE outperforms state-of-the-art methods, including LCM, SD-Turbo, and SwiftBrushv2, producing more diverse and realistic results while maintaining computational efficiency.
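The time-independent shared-encoder idea can be illustrated with a toy module: the encoder runs once and its features are reused by the decoder at several time steps, which is what enables loop-free, parallel evaluation. ToyEncoder and ToyDecoder below are placeholders, not the TiUE architecture.

```python
# Minimal sketch: run the encoder once, reuse its features across decoder time steps.
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(4, 32, 3, padding=1)
    def forward(self, x):
        return self.conv(x)

class ToyDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(32 + 1, 4, 3, padding=1)
    def forward(self, enc_feat, t):
        # Broadcast a normalized timestep channel onto the shared encoder features.
        t_map = torch.full_like(enc_feat[:, :1], float(t) / 1000.0)
        return self.conv(torch.cat([enc_feat, t_map], dim=1))

encoder, decoder = ToyEncoder(), ToyDecoder()
x_t = torch.randn(1, 4, 64, 64)
enc = encoder(x_t)                                  # computed once
# Decoder evaluated at several time steps while reusing the same encoder features.
eps_preds = [decoder(enc, t) for t in (999, 749, 499, 249)]
print(torch.stack(eps_preds).shape)
```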
Poster
Boming Miao · Chunxiao Li · Xiaoxiao Wang · Andi Zhang · Rui Sun · Zizhe Wang · Yao Zhu

[ ExHall D ]

Abstract
Diffusion models have achieved impressive success in generating photorealistic images, but challenges remain in ensuring precise semantic alignment with input prompts. Optimizing the initial noisy latent offers a more efficient alternative to modifying model architectures or prompt engineering for improving semantic alignment. A recent approach, InitNo, refines the initial noisy latent by leveraging attention maps; however, these maps capture only limited information, and the effectiveness of InitNo is highly dependent on the initial starting point, as it tends to converge on a local optimum near this point. To this end, this paper proposes leveraging the language comprehension capabilities of large vision-language models (LVLMs) to guide the optimization of the initial noisy latent, and introduces the Noise Diffusion process, which updates the noisy latent to generate semantically faithful images while preserving distribution consistency. Furthermore, we provide a theoretical analysis of the conditions under which the update improves semantic faithfulness. Experimental results demonstrate the effectiveness and adaptability of our framework, consistently enhancing semantic alignment across various diffusion models.
Poster
Jian Jin · Zhenbo Yu · Yang Shen · Zhenyong Fu · Jian Yang

[ ExHall D ]

Abstract
Customized text-to-image generation renders user-specified concepts into novel contexts based on textual prompts. Scaling the number of concepts in customized generation meets a broader demand for user creation, whereas existing methods face challenges with generation quality and computational efficiency. In this paper, we propose LaTexBlend, a novel framework for effectively and efficiently scaling multi-concept customized generation. The core idea of LaTexBlend is to represent single concepts and blend multiple concepts within a Latent Textual space, which is positioned after the text encoder and a linear projection. LaTexBlend customizes each concept individually, storing them in a concept bank with a compact representation of latent textual features that captures sufficient concept information to ensure high fidelity. At inference, concepts from the bank can be freely and seamlessly combined in the latent textual space, offering two key merits for multi-concept generation: 1) excellent scalability, and 2) significant reduction of denoising deviation, preserving coherent layouts. Extensive experiments demonstrate that LaTexBlend can flexibly integrate multiple customized concepts with harmonious structures and high subject fidelity, substantially outperforming baselines in both generation quality and computational efficiency. Our code will be publicly available.
Poster
Soobin Um · Jong Chul Ye

[ ExHall D ]

Abstract
We investigate the generation of minority samples using pretrained text-to-image (T2I) latent diffusion models. Minority instances, in the context of T2I generation, can be defined as ones living on low-density regions of *text-conditional* data distributions. They are valuable for various applications of modern T2I generators, such as data augmentation and creative AI. Unfortunately, existing pretrained T2I diffusion models primarily focus on high-density regions, largely due to the influence of guided samplers (like CFG) that are essential for producing high-quality generations. To address this, we present a novel framework to counter the high-density-focus of T2I diffusion models. Specifically, we first develop an online prompt optimization framework that can encourage the emergence of desired properties during inference while preserving semantic contents of user-provided prompts. We subsequently tailor this generic prompt optimizer into a specialized solver that promotes the generation of minority features by incorporating a carefully-crafted likelihood objective. Our comprehensive experiments, conducted across various types of T2I models, demonstrate that our approach significantly enhances the capability to produce high-quality minority instances compared to existing samplers.
Poster
Kyungmin Jo · Jooyeol Yun · Jaegul Choo

[ ExHall D ]

Abstract
While large-scale text-to-image diffusion models enable the generation of high-quality, diverse images from text prompts, these prompts struggle to capture intricate details, such as textures, preventing the user intent from being reflected. This limitation has led to efforts to generate images conditioned on user-provided images, referred to as image prompts. Recent work modifies the self-attention mechanism to impose image conditions on generated images by replacing or concatenating the keys and values from the image prompt. This enables the self-attention layer to work like a cross-attention layer, which is generally used to incorporate text prompts. In this paper, we identify two common issues in existing methods of modifying self-attention that hinder diffusion models from reflecting the image prompt. By addressing these issues, we propose a novel method that generates images that properly reflect the details of image prompts. First, existing approaches often neglect the importance of image prompts in classifier-free guidance, which directs the model towards the intended conditions and away from undesirable ones. Specifically, current methods use image prompts as both desired and undesired conditions, causing conflicting signals. To resolve this, we propose conflict-free guidance by using image prompts only as desired conditions, ensuring that the generated image faithfully reflects the image prompt. In …
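A minimal sketch of the conflict-free guidance rule described above: the image prompt appears only in the desired (conditional) branch of classifier-free guidance and never in the unconditional branch, so the sampler is never pushed away from it. The denoiser interface and guidance scale below are hypothetical placeholders.

```python
# Minimal sketch: image prompt only in the desired branch of CFG.
import torch

def conflict_free_cfg(denoiser, x_t, t, text_cond, image_cond, null_text, scale=7.5):
    # Desired branch: text prompt and image prompt together.
    eps_cond = denoiser(x_t, t, text=text_cond, image=image_cond)
    # Undesired branch: null text only; crucially, no image prompt here.
    eps_uncond = denoiser(x_t, t, text=null_text, image=None)
    return eps_uncond + scale * (eps_cond - eps_uncond)

# Toy denoiser for illustration only (checks whether an image prompt is present).
denoiser = lambda x, t, text=None, image=None: x * 0.1 + (0.0 if image is None else 0.01)
x = torch.randn(1, 4, 64, 64)
print(conflict_free_cfg(denoiser, x, 10, "a cat", object(), "").shape)
```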
Poster
Zijing Hu · Fengda Zhang · Long Chen · Kun Kuang · Jiahui Li · Kaifeng Gao · Jun Xiao · Xin Wang · Wenwu Zhu

[ ExHall D ]

Abstract
Diffusion-based models have achieved remarkable success in text-to-image generation. However, their practical applications are hindered by the misalignment between generated images and corresponding text prompts. To tackle this issue, reinforcement learning (RL) has been considered for diffusion model fine-tuning. Yet, RL's effectiveness is limited by the challenge of sparse reward, where feedback is only available at the end of the generation process. This makes it difficult to identify which actions during the denoising process contribute positively to the final generated image, potentially leading to ineffective or unnecessary denoising policies. To this end, this paper presents a novel RL-based framework that addresses the sparse reward problem when training diffusion models. Our framework, named B2-DiffuRL, employs two strategies: Backward progressive training and Branch-based sampling. For one thing, backward progressive training focuses initially on the final timesteps of the denoising process and gradually extends the training interval to earlier timesteps, easing the learning difficulty associated with sparse rewards. For another, we perform branch-based sampling for each training interval. By comparing the samples within the same branch, we can identify how much the policies of the current training interval contribute to the final image, which helps to learn effective policies instead of unnecessary ones. …
Poster
Lingjie Kong · Kai WU · Chengming Xu · Xiaobin Hu · Wenhui Han · Jinlong Peng · Donghao Luo · Mengtian Li · Jiangning Zhang · Chengjie Wang · Yanwei Fu

[ ExHall D ]

Abstract
Recent advances in diffusion-based text-to-image models have simplified creating high-fidelity images, but preserving the identity (ID) of specific elements, like a personal dog, is still challenging.Object customization, using reference images and textual descriptions, is key to addressing this issue. Current object customization methods are either object-specific, requiring extensive fine-tuning, or object-agnostic, offering zero-shot customization but limited to specialized domains. The primary issue of promoting zero-shot object customization from specific domains to the general domain is to establish a large-scale general ID dataset for model pre-training, which is time-consuming and labor-intensive. In this paper, we propose a novel pipeline to construct a large dataset of general objects and build the Multi-Category ID-Consistent (MC-IDC) dataset, featuring 315k text-image samples across 10k categories. With the help of MC-IDC, we introduce Customizing Anything (CustAny), a zero-shot framework that maintains ID fidelity and supports flexible text editing for general objects. CustAny features three key components: a general ID extraction module, a dual-level ID injection module, and an ID-aware decoupling module, allowing it to customize any object from a single reference image and text prompt. Experiments demonstrate that CustAny outperforms existing methods in both general object customization and specialized domains like human customization and virtual try-on. …
Poster
Yuyang Peng · Shishi Xiao · Keming Wu · Qisheng Liao · Bohan CHEN · Kevin Lin · Danqing Huang · Ji Li · Yuhui Yuan

[ ExHall D ]

Abstract
Recently, state-of-the-art text-to-image generation models, such as Flux and Ideogram 2.0, have made significant progress in sentence-level visual text rendering. In this paper, we focus on the more challenging scenario of article-level visual text rendering and address a novel task of generating high-quality business content, including infographics and slides, based on user-provided article-level descriptive prompts and ultra-dense layouts. The fundamental challenges are twofold: significantly longer context lengths and the scarcity of high-quality business content data. In contrast to most previous works that focus on a limited number of sub-regions and sentence-level prompts, ensuring precise adherence to ultra-dense layouts with tens or even hundreds of sub-regions in business content is far more challenging. We make two key technical contributions: (i) the construction of a scalable, high-quality business content dataset, i.e., Infographics-650K, equipped with ultra-dense layouts and prompts, by implementing a layer-wise retrieval-augmented infographic generation scheme; and (ii) a layout-guided cross-attention scheme, which injects tens of region-wise prompts into a set of cropped region latent spaces according to the ultra-dense layouts and refines each sub-region flexibly during inference using a layout-conditional CFG. We demonstrate the strong results of our system compared to previous SOTA systems such as Flux and SD3 on …
Poster
Taeyoung Yun · Dinghuai Zhang · Jinkyoo Park · Ling Pan

[ ExHall D ]

Abstract
Recent advances in text-to-image diffusion models have demonstrated impressive image generation capabilities. However, it remains challenging to control the generation process with desired properties (e.g., aesthetic quality, user intention), which can be expressed as black-box reward functions. In this paper, we focus on prompt adaptation, which refines the original prompt into model-preferred prompts to generate desired images. While prior work uses reinforcement learning (RL) to optimize prompts, we observe that applying RL often results in generating similar postfixes and deterministic behaviors. To this end, we introduce Prompt Adaptation with GFlowNets (PAG), a novel approach that frames prompt adaptation as a probabilistic inference problem. Our key insight is that leveraging Generative Flow Networks (GFlowNets) allows us to shift from reward maximization to sampling from an unnormalized density function, enabling both high-quality and diverse prompt generation. However, we identify that a naive application of GFlowNets suffers from mode collapse and uncovers a previously overlooked phenomenon: the progressive loss of neural plasticity in the model, which is compounded …
Poster
Xiaomin Li · yixuan liu · Takashi Isobe · Xu Jia · Qinpeng Cui · Dong Zhou · Dong Li · You He · Huchuan Lu · Zhongdao Wang · Emad Barsoum

[ ExHall D ]

Abstract
In text-to-image (T2I) generation applications, negative embeddings have proven to be a simple yet effective approach for enhancing generation quality. Typically, these negative embeddings are derived from user-defined negative prompts, which, while functional, are not necessarily optimal. In this paper, we introduce ReNeg, an end-to-end method designed to learn improved Negative embeddings guided by a Reward model. We employ a reward feedback learning framework and integrate classifier-free guidance (CFG) into the training process, which was previously utilized only during inference, thus enabling the effective learning of negative embeddings. We also propose two strategies for learning both global and per-sample negative embeddings. Extensive experiments show that the learned negative embedding significantly outperforms null-text and handcrafted counterparts, achieving substantial improvements in human preference alignment. Additionally, the negative embedding learned within the same text embedding space exhibits strong generalization capabilities. For example, using the same CLIP text encoder, the negative embedding learned on SD1.5 can be seamlessly transferred to text-to-image or even text-to-video models such as ControlNet, ZeroScope, and VideoCrafter2, resulting in consistent performance improvements across the board. Code and learned negative embeddings will be released.
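A minimal sketch of the core training idea, learning a negative embedding by back-propagating a reward signal through classifier-free guidance, is shown below with toy placeholders for the denoiser, reward model, and text features. It is meant only to convey where the learned negative embedding enters the CFG formula, not to reproduce ReNeg's training recipe.

```python
# Minimal sketch: a learnable global negative embedding optimized through CFG.
import torch

emb_dim, seq_len = 768, 77
neg_embedding = torch.nn.Parameter(torch.zeros(1, seq_len, emb_dim))   # the learned negative embedding
optimizer = torch.optim.AdamW([neg_embedding], lr=1e-3)

def denoiser(x, t, cond):                      # toy epsilon predictor (placeholder)
    return x * 0.1 + cond.mean() * 0.01

def reward(x):                                 # toy differentiable reward (placeholder)
    return -x.pow(2).mean()

prompt_embedding = torch.randn(1, seq_len, emb_dim)   # from a frozen text encoder (toy)
for step in range(100):
    x_t = torch.randn(1, 4, 64, 64)
    t = torch.randint(0, 1000, (1,))
    # CFG with the learned negative embedding in place of the null prompt.
    eps = denoiser(x_t, t, neg_embedding) + 7.5 * (
        denoiser(x_t, t, prompt_embedding) - denoiser(x_t, t, neg_embedding))
    x0_pred = x_t - eps                        # crude one-step estimate of the clean sample
    loss = -reward(x0_pred)                    # maximize the reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```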
Poster
Zehuan Huang · Yuanchen Guo · Xingqiao An · Yunhan Yang · Yangguang Li · Zi-Xin Zou · Ding Liang · Xihui Liu · Yan-Pei Cao · Lu Sheng

[ ExHall D ]

Abstract
This paper introduces MIDI, a novel paradigm for compositional 3D scene generation from a single image. Unlike existing methods that rely on reconstruction or retrieval techniques or recent approaches that employ multi-stage object-by-object generation, MIDI extends pre-trained image-to-3D object generation models to multi-instance diffusion models, enabling the simultaneous generation of multiple 3D instances with accurate spatial relationships and high generalizability. At its core, MIDI incorporates a novel multi-instance attention mechanism that effectively captures inter-object interactions and spatial coherence directly within the generation process, without the need for complex multi-step processes. The method utilizes partial object images and global scene context as inputs, directly modeling object completion during 3D generation. During training, we effectively supervise the interactions between 3D instances using a limited amount of scene-level data, while incorporating single-object data for regularization, thereby maintaining the pre-trained generalization ability. MIDI demonstrates state-of-the-art performance in image-to-scene generation, validated through evaluations on synthetic data, real-world scene data, and stylized scene images generated by text-to-image diffusion models.
Poster
Yuchao Gu · Yipin Zhou · Yunfan Ye · Yixin Nie · Licheng Yu · Pingchuan Ma · Kevin Qinghong Lin · Mike Zheng Shou

[ ExHall D ]

Abstract
Natural language often struggles to accurately associate positional and attribute information with multiple instances, which limits current text-based visual generation models to simpler compositions featuring only a few dominant instances. To address this limitation, this work enhances diffusion models by introducing regional instance control, where each instance is governed by a bounding box paired with a free-form caption. Previous methods in this area typically rely on implicit position encoding or explicit attention masks to separate regions of interest (ROIs), resulting in either inaccurate coordinate injection or large computational overhead. Inspired by ROI-Align in object detection, we introduce a complementary operation called ROI-Unpool. Together, ROI-Align and ROI-Unpool enable explicit, efficient, and accurate ROI manipulation on high-resolution feature maps for visual generation. Building on ROI-Unpool, we propose ROICtrl, an adapter for pretrained diffusion models that enables precise regional instance control. ROICtrl is compatible with community-finetuned diffusion models, as well as with existing spatial-based add-ons (e.g., ControlNet, T2I-Adapter) and embedding-based add-ons (e.g., IP-Adapter, ED-LoRA), extending their applications to multi-instance generation. Experiments show that ROICtrl achieves superior performance in regional instance control while significantly reducing computational costs.
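A hedged sketch of what an ROI-Unpool-style operation does, pasting a fixed-size instance feature back into its bounding box on a full-resolution feature map, is given below using plain bilinear resizing. The paper's operator handles sub-pixel alignment more carefully; this only shows the inverse-of-ROI-Align intuition.

```python
# Minimal sketch: paste an instance feature back into its box on a feature map.
import torch
import torch.nn.functional as F

def roi_unpool(canvas, roi_feat, box):
    """canvas:   (C, H, W) full-resolution feature map (left unmodified; a copy is returned);
    roi_feat: (C, h, w) fixed-size instance feature (e.g. from an ROI-Aligned crop);
    box:      (x0, y0, x1, y1) bounding box in feature-map pixels."""
    C, H, W = canvas.shape
    x0, y0, x1, y1 = [int(round(v)) for v in box]
    x0, y0 = max(x0, 0), max(y0, 0)
    x1, y1 = min(x1, W), min(y1, H)
    out = canvas.clone()
    resized = F.interpolate(roi_feat[None], size=(y1 - y0, x1 - x0),
                            mode="bilinear", align_corners=False)[0]
    out[:, y0:y1, x0:x1] = resized               # write the instance back into its region
    return out

canvas = torch.zeros(16, 64, 64)
instance = torch.randn(16, 7, 7)
print(roi_unpool(canvas, instance, (10, 20, 34, 52)).shape)
```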
Poster
Hanzhe Hu · Tianwei Yin · Fujun Luan · Yiwei Hu · Hao Tan · Zexiang Xu · Sai Bi · Shubham Tulsiani · Kai Zhang

[ ExHall D ]

Abstract
We present Turbo3D, an ultra-fast text-to-3D system capable of generating high-quality Gaussian splatting assets in under one second. Turbo3D employs a rapid 4-step, 4-view diffusion generator, and an efficient feed-forward Gaussian reconstructor, both operating in latent space. The 4-step, 4-view generator is a student model distilled through a novel Dual-Teacher approach, which encourages the student to learn view consistency from a multi-view teacher and photo-realism from a single-view teacher. By shifting the Gaussian reconstructor's inputs from pixel space to latent space, we eliminate the extra image decoding time and halve the transformer sequence length for maximum efficiency. Our method demonstrates superior 3D generation results compared to previous baselines, while operating in a fraction of their runtime.
Poster
Zhipeng Huang · Shaobin Zhuang · Canmiao Fu · Binxin Yang · Ying Zhang · Chong Sun · Chen Li · Yali Wang · Zhizheng Zhang · Zheng-Jun Zha

[ ExHall D ]

Abstract
Existing multimodal generative models fall short as qualified design copilots, as they often struggle to generate imaginative outputs once instructions are less detailed, or lack the ability to maintain consistency with the provided references. In this work, we introduce ChatGen, a model that unifies multimodal generation and understanding and promotes their interplay in iterative generation. It can generate diverse results with high creativity for less detailed instructions, and it can progressively refine prior generation results or integrate specific content from references following the instructions in its chat with users. During this process, it is capable of preserving consistency in the parts that the user is already satisfied with. To this end, we curate a large-scale dataset, extracted from Internet videos, containing rich object dynamics and dynamics descriptions auto-labeled by advanced foundation models. These two types of information are interleaved into a single sequence to enable ChatGen to learn consistency-aware generation, where the specified dynamics are generated while the consistency of unspecified content is preserved in line with the instructions. Besides, we introduce a prompt self-rewriting mechanism to enhance generation diversity. Extensive experiments demonstrate the effectiveness of unifying multimodal understanding and generation in ChatGen and show it achieves state-of-the-art performance across various visual …
Poster
Ronghuan Wu · Wanchao Su · Jing Liao

[ ExHall D ]

Abstract
Scalable Vector Graphics (SVG) has become the de facto standard for vector graphics in digital design, offering resolution independence and precise control over individual elements. Despite their advantages, creating high-quality SVG content remains challenging, as it demands technical expertise with professional editing software and a considerable time investment to craft complex shapes. Recent text-to-SVG generation methods aim to make vector graphics creation more accessible, but they still encounter limitations in shape regularity, generalization ability, and expressiveness. To address these challenges, we introduce Chat2SVG, a hybrid framework that combines the strengths of Large Language Models (LLMs) and image diffusion models for text-to-SVG generation. Our approach first uses an LLM to generate semantically meaningful SVG templates from basic geometric primitives. Guided by image diffusion models, a dual-stage optimization pipeline refines paths in latent space and adjusts point coordinates to enhance geometric complexity. Extensive experiments show that Chat2SVG outperforms existing methods in visual fidelity, path regularity, and semantic alignment. Additionally, our system enables intuitive editing through natural language instructions, making professional vector graphics creation accessible to all users.
Poster
Sohan Patnaik · Rishabh Jain · Balaji Krishnamurthy · Mausoom Sarkar

[ ExHall D ]

Abstract
Visual layouts are essential in graphic design fields such as advertising, posters, and web interfaces. The application of generative models for content-aware layout generation has recently gained traction. However, these models fail to understand the contextual aesthetic requirements of layout design and do not align with human-like preferences, primarily treating it as a prediction task without considering the final rendered output. To overcome these problems, we offer Aesthetic-Aware Preference Alignment (AAPA), a novel technique to train a Multi-modal Large Language Model (MLLM) for layout prediction that uses the MLLM's aesthetic preferences for Direct Preference Optimization over graphic layouts. We propose a data filtering protocol utilizing our layout-quality heuristics for AAPA to ensure training happens on high-quality layouts. Additionally, we introduce a novel evaluation metric that uses another MLLM to compute the win rate of the generated layout against the ground-truth layout based on aesthetics criteria. We also demonstrate the applicability of AAPA for MLLMs of varying scales (1B to 8B parameters) and LLM families (Qwen, Phi, InternLM). By conducting thorough qualitative and quantitative analyses, we verify the efficacy of our approach on two challenging benchmarks - Crello and Webui, showcasing 17% and 16% improvements over current State-of-The-Art methods, thereby highlighting the …
Poster
Andreas Müller · Denis Lukovnikov · Jonas Thietke · Asja Fischer · Erwin Quiring

[ ExHall D ]

Abstract
Integrating watermarking into the generation process of latent diffusion models (LDMs) simplifies detection and attribution of generated content. Semantic watermarks, such as Tree-Rings and Gaussian Shading, represent a novel class of watermarking techniques that are easy to implement and highly robust against various perturbations. However, our work demonstrates a fundamental security vulnerability of semantic watermarks. We show that attackers can leverage unrelated models, even with different latent spaces and architectures (UNet vs DiT), to perform powerful and realistic forgery attacks. Specifically, we design two watermark forgery attacks. The first imprints a targeted watermark into real images by manipulating the latent representation of an arbitrary image in an unrelated LDM to get closer to the latent representation of a watermarked image. We also show that this technique can be used for watermark removal. The second attack generates new images with the target watermark by inverting a watermarked image and re-generating it with an arbitrary prompt. Both attacks just need a single reference image with the target watermark. Overall, our findings question the applicability of semantic watermarks by revealing that attackers can easily forge or remove these watermarks under realistic conditions.
Poster
Feng Zhou · Ruiyang Liu · chen liu · Gaofeng He · Yonglu Li · Xiaogang Jin · Huamin Wang

[ ExHall D ]

Abstract
Sewing patterns, the essential blueprints for fabric cutting and tailoring, act as a crucial bridge between design concepts and producible garments. However, existing uni-modal sewing pattern generation models struggle to effectively encode complex design concepts with a multi-modal nature and correlate them with vectorized sewing patterns that possess precise geometric structures and intricate sewing relations. In this work, we propose a novel sewing pattern generation approach, Design2GarmentCode, based on Large Multimodal Models (LMMs), to generate parametric pattern-making programs from multi-modal design concepts. LMMs offer an intuitive interface for interpreting diverse design inputs, while pattern-making programs can serve as well-structured and semantically meaningful representations of sewing patterns, acting as a robust bridge connecting the cross-domain pattern-making knowledge embedded in LMMs with vectorized sewing patterns. Experimental results demonstrate that our method can flexibly handle various complex design expressions such as images, textual descriptions, designer sketches, or their combinations, and convert them into size-precise sewing patterns with correct stitches. Compared to previous methods, our approach significantly enhances training efficiency, generation quality, and authoring flexibility. Our code and data will be publicly available.
Poster
Xinghui Li · Qichao Sun · Pengze Zhang · Fulong Ye · Zhichao Liao · Wanquan Feng · Songtao Zhao · Qian HE

[ ExHall D ]

Abstract
Recent advances in garment-centric image generation from text and image prompts based on diffusion models are impressive. However, existing methods lack support for various combinations of attire and struggle to preserve garment details while maintaining faithfulness to the text prompts, limiting their performance across diverse scenarios. In this paper, we focus on a new task, i.e., Multi-Garment Virtual Dressing, and we propose a novel AnyDressing method for customizing characters conditioned on any combination of garments and any personalized text prompts. AnyDressing comprises two primary networks, GarmentsNet and DressingNet, which are respectively dedicated to extracting detailed clothing features and generating customized images. Specifically, we propose an efficient and scalable module called Garment-Specific Feature Extractor in GarmentsNet to individually encode garment textures in parallel. This design prevents garment confusion while ensuring network efficiency. Meanwhile, we design an adaptive Dressing-Attention mechanism and a novel Instance-Level Garment Localization Learning strategy in DressingNet to accurately inject multi-garment features into their corresponding regions. This approach efficiently integrates multi-garment texture cues into generated images and further enhances text-image consistency. Additionally, we introduce a Garment-Enhanced Texture Learning strategy to improve the fine-grained texture details of garments. Thanks to our well-crafted design, AnyDressing can serve as …
Poster
Junying Wang · Hongyuan Zhang · Yuan Yuan

[ ExHall D ]

Abstract
Recent personalized portrait generation methods, taking a facial image and a textual prompt as inputs, have attracted substantial attention. Although these methods generate high-fidelity portraits, they fail to prevent the generated portraits from being tracked and misused by malicious face recognition systems. To address this, this paper proposes a Customized Portrait Generation framework with facial Adversarial attacks (Adv-CPG). Specifically, to achieve facial privacy protection, we devise a lightweight local ID encryptor and an encryption enhancer. They implement progressive double-layer encryption protection by directly injecting the target identity and adding additional identity guidance, respectively. Furthermore, to accomplish fine-grained and customized portrait generation, we develop a multi-modal image customizer capable of generating controllable fine-grained facial features. To the best of our knowledge, Adv-CPG is the first study that introduces facial adversarial attacks into customized portrait generation. Extensive experiments demonstrate the superiority of Adv-CPG, e.g., the average attack success rate of the proposed Adv-CPG is 28.1% and 2.86% higher compared to the SOTA noise-based attack methods and unconstrained attack methods, respectively.
Poster
Fernando Julio Cendra · Kai Han

[ ExHall D ]

Abstract
The inherent ambiguity in the definition of visual concepts poses significant challenges for modern generative models, like the Text-to-Image (T2I) models based on diffusion models, in accurately learning concepts from the input images. Existing methods lack a systematic framework and interpretative mechanisms, hindering reliable extraction of the underlying intrinsic concepts. To address this challenge, we present ICE, short for Intrinsic Concept Extraction, a novel framework to automatically and systematically extract intrinsic concepts from a single image leveraging a T2I model. ICE consists of two pivotal stages. In the first stage, ICE devises an automatic concept localization module that pinpoints relevant text-based concepts and their corresponding masks within a given image. This critical phase not only streamlines concept initialization but also offers precise guidance for the subsequent analysis. The second stage delves deeper into each identified mask, decomposing concepts into intrinsic components, capturing specific visual characteristics and general components representing broader categories. This decomposition facilitates a more granular understanding by further dissecting concepts into detailed intrinsic attributes such as colour and material. Extensive experiments validate that ICE achieves superior performance on intrinsic concept extraction, enabling reliable and flexible application to downstream tasks like personalized image generation, image editing, and so on. …
Poster
Sangwon Jung · Alex Oesterling · Claudio Mayrink Verdun · Sajani Vithana · Taesup Moon · Flavio Calmon

[ ExHall D ]

Abstract
Text-to-image generative models can create vivid, realistic images from textual descriptions. As these models proliferate, they expose new concerns about their ability to represent diverse demographic groups, propagate stereotypes, and efface minority populations. Despite growing attention to the "safe" and "responsible" design of artificial intelligence (AI), there is no established methodology to systematically measure and control representational harms in large image generation models. This paper introduces a novel framework to measure the representation of intersectional groups in images generated by text-to-image generative models. We propose a novel application of the Multi-Group Proportional Representation (MPR) metric to rigorously evaluate representative harms in image generation and develop an algorithm to optimize generative models for this representational metric. MPR evaluates the worst-case deviation of representation statistics across given population groups in images produced by a generative model, allowing for flexible and context-specific measurements based on user requirements. Through experiments, we demonstrate that MPR can effectively measure representation statistics across multiple intersectional groups and, when used as a training objective, can guide models toward a more balanced generation across demographic groups while maintaining generation quality.
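A deliberately simplified illustration of the representation-gap idea behind the metric described above: compare group proportions in a batch of generated images against target proportions and report the worst-case deviation. The full MPR metric is defined over richer, intersectional group statistics; only the worst-case-deviation skeleton is shown, and the group labels are assumed to come from some external classifier.

```python
# Simplified sketch: worst-case deviation of group proportions from targets.
from collections import Counter

def worst_case_representation_gap(group_labels, target_proportions):
    """group_labels: predicted group for each generated image;
    target_proportions: dict mapping group -> desired fraction."""
    n = len(group_labels)
    counts = Counter(group_labels)
    return max(abs(counts.get(g, 0) / n - p) for g, p in target_proportions.items())

labels = ["A", "A", "B", "A", "C", "B", "A", "A"]       # toy classifier outputs
targets = {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}          # desired balanced representation
print(worst_case_representation_gap(labels, targets))
```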
Poster
Logan Frank · Jim Davis

[ ExHall D ]

Abstract
Knowledge distillation (KD) has been a popular and effective method for model compression. One important assumption of KD is that the teacher's original dataset will also be available when training the student. However, in situations such as continual learning and distilling large models trained on company-withheld datasets, having access to the original data may not always be possible. This leads practitioners towards utilizing other sources of supplemental data, which could yield mixed results. One must then ask: "what makes a good dataset for transferring knowledge from teacher to student?" Many would assume that only real in-domain imagery is viable, but is that the only option? In this work, we explore multiple possible surrogate distillation datasets and demonstrate that many different datasets, even unnatural synthetic imagery, can serve as a suitable alternative in KD. From examining these alternative datasets, we identify and present various criteria describing what makes a good dataset for distillation. Source code will be available in the future.
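For context, the standard distillation step the abstract builds on needs no ground-truth labels, which is precisely what makes arbitrary surrogate imagery usable in place of the teacher's original data. The models and surrogate batch below are toy placeholders.

```python
# Minimal sketch of a knowledge-distillation step on surrogate (label-free) data.
import torch
import torch.nn.functional as F

def kd_step(student, teacher, images, optimizer, T=4.0):
    teacher.eval()
    with torch.no_grad():
        t_logits = teacher(images)
    s_logits = student(images)
    # Soft-target KL divergence; no labels needed, so any image source works.
    loss = F.kl_div(F.log_softmax(s_logits / T, dim=1),
                    F.softmax(t_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with random "surrogate" images and small linear models.
student = torch.nn.Linear(3 * 32 * 32, 10)
teacher = torch.nn.Linear(3 * 32 * 32, 10)
opt = torch.optim.SGD(student.parameters(), lr=0.1)
surrogate_batch = torch.randn(8, 3 * 32 * 32)
print(kd_step(student, teacher, surrogate_batch, opt))
```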
Poster
Koushik Srivatsan · Fahad Shamshad · Muzammal Naseer · Vishal M. Patel · Karthik Nandakumar

[ ExHall D ]

Abstract
The rapid proliferation of large-scale text-to-image diffusion (T2ID) models has raised serious concerns about their potential misuse in generating harmful content. Although numerous methods have been proposed for erasing undesired concepts from T2ID models, they often provide a false sense of security, because concept-erased models (CEMs) can be easily deceived through adversarial attacks into generating the erased concept. Though some robust concept erasure methods based on adversarial training have emerged recently, they compromise on utility (generation quality for benign concepts) to achieve robustness and/or remain vulnerable to advanced embedding-space attacks. These limitations stem from the failure of robust CEMs to search thoroughly for "blind spots" in the embedding space. To bridge this gap, we propose STEREO, a novel two-stage framework that employs adversarial training as a first step rather than the only step for robust concept erasure. In the first stage, STEREO employs adversarial training as a vulnerability identification mechanism to search the embedding space thoroughly. In the second, robustly-erase-once stage, STEREO introduces an anchor-concept-based compositional objective to robustly erase the target concept in one go while attempting to minimize the degradation of model utility. We benchmark STEREO against 7 state-of-the-art concept erasure methods, demonstrating its enhanced robustness against whitebox, …
Poster
Xinting Hu · Haoran Wang · Jan Lenssen · Bernt Schiele

[ ExHall D ]

Abstract
We introduce PersonaHOI, a training- and tuning-free framework that fuses a general StableDiffusion model with a personalized face diffusion model to generate identity-consistent human-object interaction (HOI) images. While personalized face diffusion (PFD) models have advanced significantly, they often overfit facial features and fail to produce coherent full-body interactions with objects. To address this issue, PersonaHOI introduces an additional StableDiffusion (SD) branch to follow HOI-driven text descriptions in image generation. By incorporating proposed cross-attention constraints in the PFD branch, and spatial fusion strategies between SD and PFD branches at both the latent and residual level, PersonaHOI successfully blends personalized facial details with interactive non-facial regions, ensuring identity preservation and interaction coherence. Experiments, validated by a novel interaction alignment metric, demonstrate the superior realism and scalability of PersonaHOI, establishing a new standard for practical personalized face with HOI generation.
Poster
Junxi Chen · Junhao Dong · Xiaohua Xie

[ ExHall D ]

Abstract
Recently, the Image Prompt Adapter (IP-Adapter) has been increasingly integrated into text-to-image diffusion models (T2I-DMs) to improve controllability. However, in this paper, we reveal that T2I-DMs equipped with the IP-Adapter (T2I-IP-DMs) enable a new jailbreak attack named the hijacking attack. We demonstrate that, by uploading imperceptible image-space adversarial examples (AEs), the adversary can hijack massive benign users to jailbreak an Image Generation Service (IGS) driven by T2I-IP-DMs and mislead the public to discredit the service provider. Worse still, the IP-Adapter's dependency on open-source image encoders reduces the knowledge required to craft AEs. Extensive experiments verify the technical feasibility of the hijacking attack. In light of the revealed threat, we investigate several existing defenses and explore combining the IP-Adapter with adversarially trained models to overcome existing defenses' limitations.
Poster
Won Jun Kim · Hyungjin Chung · Jaemin Kim · Sangmin Lee · Byeongsu Sim · Jong Chul Ye

[ ExHall D ]

Abstract
Gradient-based methods are a prototypical family of "explainability for AI" (XAI) techniques, especially for image-based models. Nonetheless, they have several shortcomings in that they (1) require white-box access to models, (2) are vulnerable to adversarial attacks, and (3) produce attributions that lie off the image manifold, leading to explanations that are not actually faithful to the model and do not align well with human perception. To overcome these challenges, we introduce Derivative-Free Diffusion Manifold-Constrained Gradients (FreeMCG), a novel method that serves as a better basis for explaining a given neural network than the traditional gradient. Specifically, by leveraging ensemble Kalman filters and diffusion models, we derive a derivative-free approximation of the model's gradient projected onto the data manifold, requiring access only to the model's outputs (i.e., in a completely black-box setting). We demonstrate the effectiveness of FreeMCG by applying it to both counterfactual generation and feature attribution, which have traditionally been treated as distinct tasks. Through comprehensive evaluation on both tasks - counterfactual explanation and feature attribution - we show that our method yields state-of-the-art results while preserving the essential properties expected of XAI tools.
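A heavily simplified sketch of a derivative-free, ensemble-based gradient estimate: the sample cross-covariance between input perturbations and output changes plays the role of a gradient while requiring only black-box access to the model's outputs. FreeMCG additionally constrains this estimate to the data manifold with a diffusion model, which is omitted here; the function and scales below are illustrative assumptions.

```python
# Simplified sketch: black-box gradient estimate from an ensemble of perturbed inputs.
import torch

def ensemble_gradient(f, x, n_particles=32, noise_scale=0.05):
    """f: black-box scalar-output model; x: (D,) input. Returns a (D,) estimate."""
    particles = x + noise_scale * torch.randn(n_particles, x.numel())
    outputs = torch.stack([f(p) for p in particles])             # (N,) model outputs only
    dx = particles - particles.mean(dim=0, keepdim=True)         # (N, D) ensemble deviations
    dy = outputs - outputs.mean()                                # (N,) output deviations
    # Cross-covariance divided by the perturbation variance approximates the gradient.
    return (dx * dy[:, None]).mean(dim=0) / (noise_scale ** 2)

# Sanity check against a known linear function f(x) = x @ a, whose gradient is a.
a = torch.tensor([1.0, -2.0, 0.5])
print(ensemble_gradient(lambda x: x @ a, torch.zeros(3), n_particles=2000))
```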
Poster
Hanhui Wang · Yihua Zhang · Ruizheng Bai · Yue Zhao · Sijia Liu · Zhengzhong Tu

[ ExHall D ]

Abstract
Recent advancements in diffusion models have made generative image editing more accessible than ever. While these developments allow users to generate creative edits with ease, they also raise significant ethical concerns, particularly regarding malicious edits to human portraits that threaten individuals' privacy and identity security. Existing general-purpose image protection methods primarily focus on generating adversarial perturbations to nullify edit effects. However, these approaches often exhibit instability in protecting against diverse editing requests. In this work, we introduce a novel perspective on personal human portrait protection against malicious editing. Unlike traditional methods aiming to prevent edits from taking effect, our method, FaceLock, optimizes adversarial perturbations to ensure that original biometric information, such as facial features, is either destroyed or substantially altered post-editing, rendering the subject in the edited output biometrically unrecognizable. Our approach innovatively integrates facial recognition and visual perception factors into the perturbation optimization process, ensuring robust protection against a variety of editing attempts. Besides, we shed light on several critical issues with commonly used evaluation metrics in image editing and reveal cheating methods by which they can be easily manipulated, leading to deceptive assessments of protection. Through extensive experiments, we demonstrate that FaceLock significantly outperforms all baselines in defense performance against …
Poster
Yuechen Xie · Jie Song · Huiqiong Wang · Mingli Song

[ ExHall D ]

Abstract
High-quality open-source text-to-image models have significantly lowered the threshold for obtaining photorealistic images, but they also face potential risks of misuse. Specifically, suspects may use synthetic data generated by these generative models to train models for specific tasks without permission, especially when they lack real data resources. Protecting these generative models is crucial for the well-being of their owners. In this work, we propose the first method for this important yet unresolved issue, called Training data Provenance Verification (TrainProVe). The rationale behind TrainProVe is grounded in the principle of the generalization error bound, which suggests that, for two models with the same task, if the distance between their training data distributions is smaller, their generalization abilities will be closer. We validate the efficacy of TrainProVe across four text-to-image models (Stable Diffusion v1.4, latent consistency model, PixArt-α, and Stable Cascade). The results show that TrainProVe achieves a verification accuracy of over 99% in determining the provenance of suspicious model training data, surpassing all previous methods. Code will be publicly available soon.
Poster
Haifeng Zhang · Qinghui He · Xiuli Bi · Weisheng Li · Bo Liu · Bin Xiao

[ ExHall D ]

Abstract
The rapid advancement of generative models has significantly improved the quality of generated images. Meanwhile, it poses challenges to information authenticity and credibility. Current generated-image detection methods based on large-scale pre-trained multimodal models have achieved impressive results. Although these models provide abundant features, the authentication-task-related features are often submerged. Consequently, the authentication-task-irrelevant features cause models to learn superficial biases, thereby harming their generalization performance across different model genera (e.g., GANs and Diffusion Models). To this end, we propose VIB-Net, which uses Variational Information Bottlenecks to enforce the learning of authentication-task-related features. We tested and analyzed the proposed method and existing methods on samples generated by 17 different generative models. Compared to SOTA methods, VIB-Net achieved a 4.62% improvement in mAP and a 9.33% increase in accuracy. Notably, in generalization tests on unseen generative models from different series, VIB-Net improved mAP by 12.48% and accuracy by 23.59% over SOTA methods.
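A generic variational-information-bottleneck head of the kind the abstract refers to is sketched below: the encoder outputs a Gaussian posterior over a latent code, and a KL term to a standard normal prior pressures the code to keep only task-relevant information. This is a textbook VIB objective with made-up dimensions, not VIB-Net itself.

```python
# Minimal sketch of a variational-information-bottleneck classification head.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VIBHead(nn.Module):
    def __init__(self, feat_dim=512, z_dim=64, num_classes=2, beta=1e-3):
        super().__init__()
        self.mu = nn.Linear(feat_dim, z_dim)
        self.logvar = nn.Linear(feat_dim, z_dim)
        self.classifier = nn.Linear(z_dim, num_classes)
        self.beta = beta

    def forward(self, feats, labels):
        mu, logvar = self.mu(feats), self.logvar(feats)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        logits = self.classifier(z)
        ce = F.cross_entropy(logits, labels)
        # KL(q(z|x) || N(0, I)), averaged over the batch; the bottleneck term.
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=1).mean()
        return ce + self.beta * kl, logits

# Toy usage on features from any frozen backbone.
head = VIBHead()
loss, logits = head(torch.randn(8, 512), torch.randint(0, 2, (8,)))
loss.backward()
print(loss.item())
```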
Poster
Qi Bi · Jingjun Yi · Huimin Huang · Hao Zheng · Haolan Zhan · Yawen Huang · Yuexiang Li · Xian Wu · Yefeng Zheng

[ ExHall D ]

Abstract
Night-time scene segmentation is a critical yet challenging task in real-world applications, primarily due to complicated lighting conditions. However, existing methods lack sufficient generalization ability to unseen night-time scenes with varying illumination. In light of this issue, we focus on investigating generalizable paradigms for night-time scene segmentation and propose an efficient fine-tuning scheme, dubbed NightAdapter, alleviating the domain gap across various scenes. Interestingly, the different properties embedded in day-time and night-time features can be characterized by the bands obtained after a discrete sine transformation, which can be categorized into illumination-sensitive and illumination-insensitive bands. Hence, our NightAdapter is powered by two appealing designs: (1) Illumination-Insensitive Band Adaptation, which provides a foundation for understanding the prior and enhances robustness to illumination shifts; (2) Illumination-Sensitive Band Adaptation, which fine-tunes the randomized frequency bands and mitigates the domain gap between day-time and various night-time scenes. As a consequence, illumination-insensitive enhancement improves domain invariance, while illumination-sensitive diminution reduces the domain shift between different scenes. NightAdapter yields significant improvements over state-of-the-art methods under various day-to-night, night-to-night, and in-domain night segmentation experiments. We will release our code.
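As a rough illustration of the band decomposition the abstract builds on, the sketch below splits a feature map into low- and high-frequency bands with a 2D discrete sine transform. The cutoff and any mapping of bands to "illumination-sensitive" versus "insensitive" are illustrative assumptions, not the paper's adaptation scheme.

```python
# Minimal sketch: low/high frequency band split of a feature map via a 2D DST.
import numpy as np
from scipy.fft import dstn, idstn

def dst_band_split(feature, cutoff=0.25):
    """feature: (H, W) array. Returns (low_band, high_band) spatial reconstructions."""
    H, W = feature.shape
    coeffs = dstn(feature, type=2, norm="ortho")
    mask = np.zeros_like(coeffs)
    mask[: int(H * cutoff), : int(W * cutoff)] = 1.0   # keep the low-frequency corner
    low = idstn(coeffs * mask, type=2, norm="ortho")
    high = idstn(coeffs * (1.0 - mask), type=2, norm="ortho")
    return low, high

feat = np.random.rand(32, 32)
low, high = dst_band_split(feat)
print(np.allclose(low + high, feat))   # by linearity, the two bands sum back to the input
```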
Poster
Yongqi Yang · Zhihao Qian · Ye Zhu · Olga Russakovsky · Yu Wu

[ ExHall D ]

Abstract
The boom of Generative AI brings opportunities entangled with risks and concerns. Existing literature emphasizes the generalization capability of deepfake detection on unseen generators, significantly promoting the detector's ability to identify more universal artifacts. In this work, we seek a step toward a universal deepfake detection system with better generalization and robustness. We do so by first scaling up the existing detection task setup from one generator to multiple generators in training, during which we disclose two challenges presented in prior methodological designs and demonstrate the divergence of detectors' performance. Specifically, we reveal that current methods tailored for training on one specific generator either struggle to learn comprehensive artifacts from multiple generators or sacrifice their fitting ability for seen generators (i.e., In-Domain (ID) performance) in exchange for generalization to unseen generators (i.e., Out-Of-Domain (OOD) performance). Moreover, detectors with similar performance diverge as the number of generators scales up. To tackle the above challenges, we propose our Discrepancy Deepfake Detector (D3) framework, whose core idea is to deconstruct the universal artifacts from multiple generators by introducing a parallel network branch that takes a distorted image feature as an extra discrepancy signal to supplement its original counterpart. Extensive scaled-up experiments demonstrate the effectiveness of D3, achieving 5.3% accuracy …
Poster
Feng Yan · Xiaoheng Jiang · Yang Lu · Jiale Cao · Dong Chen · Mingliang Xu

[ ExHall D ]

Abstract
As an important part of intelligent manufacturing, pixel-level surface defect detection (SDD) aims to locate defect areas through mask prediction. Previous methods adopt the image-independent static convolution to indiscriminately classify per-pixel features for mask prediction, which leads to suboptimal results for some challenging scenes such as weak defects and cluttered backgrounds. In this paper, inspired by query-based methods, we propose a Wavelet and Prototype Augmented Query-based Transformer (WPFormer) for surface defect detection. Specifically, a set of dynamic queries for mask prediction is updated through the dual-domain transformer decoder. Firstly, a Wavelet-enhanced Cross-Attention (WCA) is proposed, which aggregates meaningful high- and low-frequency information of image features in the wavelet domain to refine queries. WCA enhances the representation of high-frequency components by capturing relationships between different frequency components, enabling queries to focus more on defect details. Secondly, a Prototype-guided Cross-Attention (PCA) is proposed to refine queries through meta-prototypes in the spatial domain. The prototypes aggregate semantically meaningful tokens from image features, facilitating queries to aggregate crucial defect information under the cluttered backgrounds. Extensive experiments on three defect detection datasets (i.e., ESDIs-SOD, CrackSeg9k, and ZJU-Leaper) demonstrate that the proposed method achieves state-of-the-art performance in defect detection.
Poster
Zhanqiang Guo · Jiamin Wu · Yonghao Song · Jiahui Bu · Weijian Mai · Qihao Zheng · Wanli Ouyang · Chunfeng Song

[ ExHall D ]

Abstract
Human's perception of the visual world is shaped by the stereo processing of 3D information. Understanding how the brain perceives and processes 3D visual stimuli in the real world has been a longstanding endeavor in neuroscience. Towards this goal, we introduce a new neuroscience task: decoding 3D visual perception from EEG signals, a neuroimaging technique that enables real-time monitoring of neural dynamics enriched with complex visual cues. To provide the essential benchmark, we first present EEG-3D, a pioneering dataset featuring multimodal analysis data and extensive EEG recordings from 12 subjects viewing 72 categories of 3D objects rendered in both videos and images. Furthermore, we propose Neuro-3D, a 3D visual decoding framework based on EEG signals. This framework adaptively integrates EEG features derived from static and dynamic stimuli to learn complementary and robust neural representations, which are subsequently utilized to recover both the shape and color of 3D objects through the proposed diffusion-based colored point cloud decoder. To the best of our knowledge, we are the first to explore EEG-based 3D visual decoding. Experiments indicate that Neuro-3D not only reconstructs colored 3D objects with high fidelity, but also learns effective neural representations that enable insightful brain region analysis. The dataset and …
Poster
Sahar Dastani · Ali Bahri · Moslem Yazdanpanah · Mehrdad Noori · David OSOWIECHI · Gustavo Vargas Hakim · Farzad Beizaee · Milad Cheraghalikhani · Arnab Mondal · Herve Lombaert · Christian Desrosiers

[ ExHall D ]

Abstract
State Space Models (SSMs) have recently emerged as an alternative to Vision Transformers (ViTs) due to their unique ability of modeling global relationships with linear complexity. SSMs are specifically designed to capture spatially proximate relationships of image patches. However, they fail to identify relationships between conceptually related yet non-adjacent patches. This limitation arises from the non-causal nature of image data, which lacks inherent directional relationships. Additionally, current vision-based SSMs are highly sensitive to transformations such as rotation. Their predefined scanning directions depend on the original image orientation, which can cause the model to produce inconsistent patch-processing sequences after rotation. To address these limitations, we introduce Spectral VMamba, a novel approach that effectively captures the global structure within an image by leveraging spectral information derived from the graph Laplacian of image patches. Through spectral decomposition, our approach encodes patch relationships independently of image orientation, achieving rotation invariance with the aid of our Rotational Feature Normalizer (RFN) module. Our experiments on classification tasks show that Spectral VMamba outperforms the leading SSM models in vision, such as VMamba, while maintaining invariance to rotations and providing similar runtime efficiency.
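A minimal sketch of the underlying idea, assuming a dense Gaussian affinity graph over patch embeddings: ordering patches by the Fiedler vector of the graph Laplacian gives a content-driven scan that does not depend on the raster layout of the image. The construction below is illustrative, not the paper's exact pipeline.

```python
# Illustrative spectral ordering of patches; not the paper's implementation.
import numpy as np

def spectral_patch_order(patch_feats, sigma=1.0):
    """patch_feats: (N, D) patch embeddings -> permutation of patch indices."""
    # Dense Gaussian affinity between patches (content-based, orientation-free).
    d2 = ((patch_feats[:, None, :] - patch_feats[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    D = np.diag(W.sum(1))
    L = D - W                                # unnormalized graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)
    fiedler = eigvecs[:, 1]                  # eigenvector of 2nd-smallest eigenvalue
    return np.argsort(fiedler)               # scan patches along the spectral embedding

patches = np.random.randn(64, 32)            # 64 patches, 32-dim embeddings
order = spectral_patch_order(patches)
print(order[:8])
```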
Poster
Yihua Cheng · Hengfei Wang · Zhongqun Zhang · Yang Yue · Boeun Kim · Feng Lu · Hyung Jin Chang

[ ExHall D ]

Abstract
3D and 2D gaze estimation share the fundamental objective of capturing eye movements but are traditionally treated as two distinct research domains. In this paper, we introduce a novel cross-task few-shot 2D gaze estimation approach, aiming to adapt a pre-trained 3D gaze estimation network for 2D gaze prediction on unseen devices using only a few training images. This task is highly challenging due to the domain gap between 3D and 2D gaze, unknown screen poses, and limited training data. To address these challenges, we propose a novel framework that bridges the gap between 3D and 2D gaze. Our framework contains a physics-based differentiable projection module with learnable parameters to model screen poses and project 3D gaze into 2D gaze. The framework is fully differentiable and can be integrated into existing 3D gaze networks without modifying their original architecture. Additionally, we introduce a dynamic pseudo-labelling strategy for flipped images, a step that is particularly challenging for 2D labels due to unknown screen poses. To overcome this, we reverse the projection process by converting 2D labels to 3D space, where flipping is performed. Notably, this 3D space is not aligned with the camera coordinate system, so we learn a dynamic transformation matrix to compensate for …
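The differentiable projection idea can be sketched roughly as follows, assuming a hypothetical parameterization of the unknown screen pose (axis-angle rotation, translation, and a metres-to-pixels scale): the 3D gaze ray is intersected with the learnable screen plane and expressed in screen coordinates. This is not the paper's implementation.

```python
# Hedged sketch of a differentiable 3D-gaze -> 2D-screen projection.
import torch
import torch.nn as nn

class DifferentiableScreenProjection(nn.Module):
    """Hypothetical parameterization of an unknown screen pose."""
    def __init__(self):
        super().__init__()
        self.rotvec = nn.Parameter(torch.zeros(3))                    # screen rotation (axis-angle)
        self.origin = nn.Parameter(torch.tensor([0.0, 0.0, 0.5]))     # screen origin (metres)
        self.px_per_m = nn.Parameter(torch.tensor([3800.0, 3800.0]))  # metres -> pixels

    def rotation(self):
        # Rodrigues' formula for a differentiable rotation matrix.
        theta = self.rotvec.norm() + 1e-8
        k = self.rotvec / theta
        zero = torch.zeros((), dtype=k.dtype)
        K = torch.stack([
            torch.stack([zero, -k[2], k[1]]),
            torch.stack([k[2], zero, -k[0]]),
            torch.stack([-k[1], k[0], zero]),
        ])
        return torch.eye(3) + torch.sin(theta) * K + (1 - torch.cos(theta)) * (K @ K)

    def forward(self, eye_pos, gaze_dir):
        # eye_pos, gaze_dir: (B, 3) in camera coordinates.
        R = self.rotation()
        normal = R[:, 2]                                       # screen plane normal
        t = ((self.origin - eye_pos) @ normal) / (gaze_dir @ normal)
        hit = eye_pos + t.unsqueeze(-1) * gaze_dir             # ray-plane intersection
        local = (hit - self.origin) @ R                        # express hit in screen axes
        return local[:, :2] * self.px_per_m                    # 2D gaze in pixels

proj = DifferentiableScreenProjection()
gaze2d = proj(torch.zeros(4, 3), torch.randn(4, 3) + torch.tensor([0.0, 0.0, 1.0]))
print(gaze2d.shape)    # torch.Size([4, 2])
```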
Poster
Toby Perrett · Ahmad Darkhalil · Saptarshi Sinha · Omar Emara · Sam Pollard · Kranti Kumar Parida · Kaiting Liu · Prajwal Gatti · Siddhant Bansal · Kevin Flanagan · Jacob Chalk · Zhifan Zhu · Rhodri Guerrier · Fahd Abdelazim · Bin Zhu · Davide Moltisanti · Michael Wray · Hazel Doughty · Dima Damen

[ ExHall D ]

Abstract
We present a validation dataset of newly collected, kitchen-based egocentric videos, manually annotated with highly detailed and interconnected ground-truth labels covering: recipe steps, fine-grained actions, ingredients with nutritional values, moving objects, and audio annotations. Importantly, all annotations are grounded in 3D through digital twinning of the scene, fixtures, and object locations, and primed with gaze. Footage is collected from unscripted recordings in diverse home environments, making HD-EPIC the first dataset collected in-the-wild but with detailed annotations matching those in controlled lab environments. We show the potential of our highly detailed annotations through a challenging VQA benchmark of 26K questions assessing capability to recognise recipes, ingredients, nutrition, fine-grained actions, 3D perception, object motion, and gaze direction. The powerful long-context Gemini Pro only achieves 37.0% on this benchmark, showcasing its difficulty and highlighting shortcomings in current VLMs. We additionally assess action recognition, sound recognition, and long-term video-object segmentation on HD-EPIC. HD-EPIC is 41 hours of video in 9 kitchens with digital twins of 404 kitchen fixtures, capturing 69 recipes, 59K fine-grained actions, 51K audio events, 20K object movements and 37K object masks lifted to 3D. On average, we have 263 annotations per minute of our unscripted videos.
Poster
Fan Qi · KunSheng Ma · Changsheng Xu

[ ExHall D ]

Abstract
Recent advancements in latent diffusion models (LDMs) have led to innovative approaches in music generation, allowing for increased flexibility and integration with other modalities. However, existing methods often rely on a two-step process that fails to capture the artistic essence of videos, particularly in the context of complex videos requiring detailed sound effects and diverse instrumentation. In this paper, we propose a novel framework for generating video soundtracks that simultaneously produces music and sound effects tailored to the video content. Our method incorporates a Contrastive Visual-Sound-Music pretraining process that maps these modalities into a unified feature space, enhancing the model's ability to capture intricate audio dynamics. We design a Spectrum Divergence Masked Attention for the UNet to differentiate between the unique characteristics of sound effects and music. We utilize Score-guided Noise Iterative Optimization to provide musicians with customizable control during the generation process. Extensive evaluations on the FilmScoreDB and SymMV&HIMV datasets demonstrate that our approach significantly outperforms state-of-the-art baselines in both subjective and objective assessments, highlighting its potential as a robust tool for video soundtrack generation.
Poster
Chao Huang · Ruohan Gao · J. M. F. Tsang · Jan Kurcius · Cagdas Bilen · Chenliang Xu · Anurag Kumar · Sanjeel Parekh

[ ExHall D ]

Abstract
Recent years have seen a significant increase in video content creation and consumption. Crafting engaging content requires the careful curation of both visual and audio elements. While visual cue curation, through techniques like optimal viewpoint selection or post-editing, has been central to media production, its natural counterpart, audio, has not undergone equivalent advancements. This often results in a disconnect between visual and acoustic saliency. To bridge this gap, we introduce a novel task: visually-guided acoustic highlighting, which aims to transform audio to deliver appropriate highlighting effects guided by the accompanying video, ultimately creating a more harmonious audio-visual experience. We propose a flexible, transformer-based multimodal framework to solve this task. To train our model, we also introduce a new dataset--the muddy mix dataset, leveraging the meticulous audio and video crafting found in movies, which provides a form of free supervision. We develop a pseudo-data generation process to simulate poorly mixed audio, mimicking real-world scenarios through a three-step process---separation, adjustment, and remixing. Our approach consistently outperforms several baselines in both quantitative and subjective evaluation. We also systematically study the impact of different types of contextual guidance and difficulty levels of the dataset. Readers are encouraged to see video results in supplements.
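A hedged sketch of the separation, adjustment, and remixing idea on toy waveforms is shown below; real pseudo-data generation would start from a source-separation model, which is omitted here, and the gain ranges are illustrative assumptions.

```python
# Toy illustration of simulating a poorly mixed soundtrack from separated stems.
import numpy as np

rng = np.random.default_rng(0)

def make_poor_mix(stems, min_gain=0.1, max_gain=2.0):
    """stems: list of (T,) waveforms assumed to be already-separated sources."""
    gains = rng.uniform(min_gain, max_gain, size=len(stems))    # adjustment
    mix = sum(g * s for g, s in zip(gains, stems))              # remixing
    return mix / (np.abs(mix).max() + 1e-8)                     # avoid clipping

speech = rng.standard_normal(16000)     # 1 s of toy "speech" at 16 kHz
effects = rng.standard_normal(16000)    # toy "effects" stem
print(make_poor_mix([speech, effects]).shape)   # (16000,)
```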
Poster
Anna Min · Ziyang Chen · Hang Zhao · Andrew Owens

[ ExHall D ]

Abstract
We present a method for learning binaural sound localization from ego-motion in videos. When the camera moves in a video, the direction of sound sources will change along with it. We train an audio model to predict sound directions that are consistent with visual estimates of camera motion, which we obtain using methods from multi-view geometry. This provides a weak but plentiful form of supervision that we combine with traditional binaural cues. To evaluate this idea, we propose a dataset of real-world audio-visual videos with ego-motion. We show that our model can successfully learn from this real-world data, and that it obtains strong performance on sound localization tasks.
Poster
Abduljalil Radman · Jorma Laaksonen

[ ExHall D ]

Abstract
Referring audio-visual segmentation (Ref-AVS) aims to segment objects within audio-visual scenes using multimodal cues embedded in text expressions. While the Segment Anything Model (SAM) has revolutionized visual segmentation, its applicability to Ref-AVS, where multimodal cues act as novel prompts, remains unexplored. SAM’s limitation to single-frame segmentation also hinders its ability to capture essential temporal context needed for multi-frame audio-visual segmentation. To address this gap, we propose TSAM, a novel extension of SAM designed to leverage multimodal cues for precise segmentation in dynamic audio-visual scenes. TSAM enhances SAM’s image encoder with a temporal modeling branch, enabling spatio-temporal learning and deep multimodal fusion across video frames, while retaining SAM’s pre-trained knowledge. Additionally, TSAM replaces SAM’s user-interactive prompting mechanism with sparse and dense data-driven prompts, enabling more effective integration of audio-visual inputs and reference text expressions. Extensive experiments on the Ref-AVS dataset demonstrate the superiority of our proposed TSAM over state-of-the-art methods, underscoring its effectiveness in accurately segmenting objects in audio-visual scenes guided by text-based multimodal cues and its strong generalization to unseen objects.
Poster
Liang Liu · Shuaiyong Li · Yongqiang Zhu

[ ExHall D ]

Abstract
Audio-visual event localization (AVEL) involves identifying the category and the corresponding temporal boundary of an event that is both audible and visible in unconstrained videos. However, the semantic gap between heterogeneous modalities often leads to audio-visual semantic inconsistency. In this paper, we propose a novel Audio-Visual Semantic Graph Network (AVSGN) to facilitate cross-modal alignment and cross-temporal interaction. Unlike previous methods (e.g., audio-guided, visual-guided, or both), we introduce shared semantic textual labels to bridge the semantic gap between audio and visual modalities. Specifically, we present a cross-modal semantic alignment (CMSA) module to explore the cross-modal complementary relationships across heterogeneous modalities (i.e., visual, audio and text), promoting the convergence of multimodal distributions into a common semantic space. Additionally, in order to capture cross-temporal associations sufficiently, we devise a cross-modal graph interaction (CMGI) module, which disentangles complicated interactions across modalities into three complementary subgraphs. Extensive experiments on the AVE dataset comprehensively demonstrate the superiority and effectiveness of the proposed model in both fully- and weakly-supervised AVE settings.
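As a rough illustration of aligning heterogeneous modalities in a common semantic space, the sketch below projects audio, visual, and label-text features into a shared space and pulls both audio and visual embeddings toward the text embedding with a symmetric InfoNCE-style loss; projection sizes and the temperature are assumptions, not the paper's CMSA module.

```python
# Illustrative cross-modal alignment via a shared text-anchored contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

proj_a, proj_v, proj_t = nn.Linear(128, 64), nn.Linear(512, 64), nn.Linear(300, 64)
audio, visual, text = torch.randn(16, 128), torch.randn(16, 512), torch.randn(16, 300)

za, zv, zt = proj_a(audio), proj_v(visual), proj_t(text)
loss = info_nce(za, zt) + info_nce(zv, zt)   # pull both modalities toward the label text
loss.backward()
print(float(loss))
```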
Poster
Huangbiao Xu · Xiao Ke · Huanqi Wu · Rui Xu · Yuezhou Li · Wenzhong Guo

[ ExHall D ]

Abstract
Long-term sports assessment is a challenging task in video understanding since it requires judging complex movement variations as well as action-music coordination. However, there is no direct correlation between the diverse background music and movements in sporting events. Previous works require larger model parameters to learn potential associations between actions and music. To address this issue, we propose a language-guided audio-visual learning (MLAVL) framework that models audio-action-visual correlations guided by the low-cost language modality. In our framework, multidimensional domain-based actions form action knowledge graphs, motivating audio-visual modalities to focus on task-relevant actions. We further design a shared-specific context encoder to integrate deep multimodal semantics, and an audio-visual cross-modal fusion module to evaluate action-music consistency. To match sport-specific rules, we then propose a dual-branch prompt-guided grading module that weighs both visual and audio-visual performance. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on four public long-term sports benchmarks while maintaining a low parameter count. Our code will be available.
Poster
Shuai Tan · Biao Gong · Yutong Feng · Kecheng Zheng · DanDan Zheng · Shuwei Shi · Yujun Shen · Jingdong Chen · Ming Yang

[ ExHall D ]

Abstract
Text serves as the key control signal in video generation due to its narrative nature. To render text descriptions into video clips, current video diffusion models borrow features from text encoders yet struggle with limited text comprehension. The recent success of large language models (LLMs) showcases the power of decoder-only transformers, which offers three clear benefits for text-to-video (T2V) generation, namely, precise text understanding resulting from the superior scalability, imagination beyond the input text enabled by next token prediction, and flexibility to prioritize user interests through instruction tuning. Nevertheless, the feature distribution gap emerging from the two different text modeling paradigms hinders the direct use of LLMs in established T2V models. This work addresses this challenge with Mimir, an end-to-end training framework featuring a carefully tailored token fuser to harmonize the outputs from text encoders and LLMs. Such a design allows the T2V model to fully leverage learned video priors while capitalizing on the text-related capability of LLMs. Extensive quantitative and qualitative results demonstrate the effectiveness of our approach in generating high-quality videos with excellent text comprehension, especially when processing short captions and managing shifting motions. The code and models will be made publicly available.
Poster
Ziyi Wu · Aliaksandr Siarohin · Willi Menapace · Ivan Skorokhodov · Yuwei Fang · Varnith Chordia · Igor Gilitschenski · Sergey Tulyakov

[ ExHall D ]

Abstract
Real-world videos consist of sequences of events. Generating such sequences with precise temporal control is infeasible with existing video generators that rely on a single paragraph of text as input. When tasked with generating multiple events described using a single prompt, such methods often ignore some of the events or fail to arrange them in the correct order. To address this limitation, we present MinT, a multi-event video generator with temporal control. Our key insight is to bind each event to a specific period in the generated video, which allows the model to focus on one event at a time. To enable time-aware interactions between event captions and video tokens, we design a time-based positional encoding method, dubbed ReRoPE. This encoding helps to guide the cross-attention operation. By fine-tuning a pre-trained video diffusion transformer on temporally grounded data, our approach produces coherent videos with smoothly connected events. For the first time in the literature, our model offers control over the timing of events in generated videos. Extensive experiments demonstrate that MinT outperforms existing open-source models by a large margin. Additional results and details are available on our website in the supplementary material.
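The abstract does not specify ReRoPE's exact form; the sketch below only illustrates the general idea of a time-based rotary encoding, where video tokens are rotated by their frame times and event-caption tokens by the midpoints of their assigned intervals, so that attention scores become sensitive to temporal offsets.

```python
# Hedged sketch of a time-based rotary positional encoding; the real ReRoPE may differ.
import torch

def rotary_by_time(x, t, base=10000.0):
    """x: (..., D) features with D even; t: (...,) timestamps in [0, 1]."""
    d = x.shape[-1] // 2
    freqs = base ** (-torch.arange(d, dtype=x.dtype) / d)   # (d,)
    angles = t[..., None] * freqs                            # (..., d)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :d], x[..., d:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

video_q = torch.randn(4, 16, 64)                         # 16 video tokens
frame_t = torch.linspace(0, 1, 16).expand(4, 16)         # frame timestamps
event_k = torch.randn(4, 3, 64)                          # 3 event-caption tokens
event_t = torch.tensor([[0.1, 0.5, 0.9]]).expand(4, 3)   # event interval midpoints
scores = rotary_by_time(video_q, frame_t) @ rotary_by_time(event_k, event_t).transpose(1, 2)
print(scores.shape)                                      # torch.Size([4, 16, 3])
```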
Poster
Kun Liu · Qi Liu · Xinchen Liu · Jie Li · Yongdong Zhang · Jiebo Luo · Xiaodong He · Wu Liu

[ ExHall D ]

Abstract
Text-to-video (T2V) generation has made tremendous progress in generating complicated scenes based on texts. However, human-object interaction (HOI) often cannot be precisely generated by current T2V models due to the lack of large-scale videos with accurate captions for HOI. To address this issue, we introduce HOIGen-1M, the first large-scale dataset for HOI Generation, consisting of over one million high-quality videos collected from diverse sources. In particular, to guarantee the high quality of videos, we first design an efficient framework to automatically curate HOI videos using powerful multimodal large language models (MLLMs), and then the videos are further cleaned by human annotators. Moreover, to obtain accurate textual captions for HOI videos, we design a novel video description method based on a Mixture-of-Multimodal-Experts (MoME) strategy that not only generates expressive captions but also eliminates hallucinations from any individual MLLM. Furthermore, due to the lack of an evaluation framework for generated HOI videos, we propose two new metrics to assess the quality of generated videos in a coarse-to-fine manner. Extensive experiments reveal that current T2V models struggle to generate high-quality HOI videos and confirm that our HOIGen-1M dataset is instrumental for improving HOI video generation.
Poster
Duowang Zhu · Xiaohu Huang · Haiyan Huang · Hao Zhou · Zhenfeng Shao

[ ExHall D ]

Abstract
In this paper, we present Change3D, a framework that reconceptualizes the change detection and captioning tasks through video modeling. Recent methods have achieved remarkable success by regarding each pair of bi-temporal images as separate frames. They employ a shared-weight image encoder to extract spatial features and then use a change extractor to capture differences between the two images. However, image feature encoding, being a task-agnostic process, cannot attend to changed regions effectively. Furthermore, different change extractors designed for various change detection and captioning tasks make it difficult to have a unified framework. To tackle these challenges, Change3D regards the bi-temporal images as comprising two frames akin to a tiny video. By integrating learnable perception frames between the bi-temporal images, a video encoder enables the perception frames to interact with the images directly and perceive their differences. Therefore, we can get rid of the intricate change extractors, providing a unified framework for different change detection and captioning tasks. We verify Change3D on multiple tasks, encompassing change detection (including binary change detection, semantic change detection, and building damage assessment) and change captioning, across eight standard benchmarks. Without bells and whistles, this simple yet effective framework can achieve superior performance with an ultra-light …
Poster
Darryl Ho · Samuel Madden

[ ExHall D ]

Abstract
In recent years, large transformer-based video encoder models have greatly advanced state-of-the-art performance on video classification tasks. However, these large models typically process videos by averaging embedding outputs from multiple clips over time to produce fixed-length representations. This approach fails to account for a variety of time-related features, such as variable video durations, chronological order of events, and temporal variance in feature significance. While methods for temporal modeling do exist, they often require significant architectural changes and expensive retraining, making them impractical for off-the-shelf, fine-tuned large encoders. To overcome these limitations, we propose DejaVid, an encoder-agnostic method that enhances model performance without the need for retraining or altering the architecture. Our framework converts a video into a variable-length temporal sequence of embeddings, which we call a multivariate time series (MTS). An MTS naturally preserves temporal order and accommodates variable video durations. We then learn per-timestep, per-feature weights over the encoded MTS frames, allowing us to account for variations in feature importance over time. We introduce a new neural network architecture inspired by traditional time series alignment algorithms for this learning task. Our evaluation demonstrates that DejaVid substantially improves the performance of a state-of-the-art large encoder, achieving leading Top-1 accuracy of …
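A minimal sketch of the core idea, assuming a frozen clip encoder: the video becomes a variable-length multivariate time series of embeddings, and learned per-timestep, per-feature weights re-weight it before pooling. The pooling and classification head below are illustrative stand-ins; DejaVid's alignment-inspired architecture is more involved.

```python
# Illustrative per-timestep, per-feature weighting of a clip-embedding time series.
import torch
import torch.nn as nn

class WeightedMTSPool(nn.Module):
    def __init__(self, max_steps=64, dim=768, n_classes=400):
        super().__init__()
        self.logit_w = nn.Parameter(torch.zeros(max_steps, dim))  # per-step, per-feature
        self.head = nn.Linear(dim, n_classes)

    def forward(self, mts):
        # mts: (B, T, D) sequence of frozen-encoder clip embeddings, T <= max_steps.
        T = mts.shape[1]
        w = torch.sigmoid(self.logit_w[:T])                        # (T, D) weights in (0, 1)
        pooled = (mts * w).sum(1) / w.sum(0).clamp(min=1e-6)       # weighted average over time
        return self.head(pooled)

clips = torch.randn(2, 37, 768)   # a video of 37 clips from an off-the-shelf encoder
logits = WeightedMTSPool()(clips)
print(logits.shape)               # torch.Size([2, 400])
```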
Poster
Yang Liu · Qianqian Xu · Peisong Wen · Siran Dai · Qingming Huang

[ ExHall D ]

Abstract
The past decade has witnessed notable achievements in self-supervised learning for video tasks. Recent efforts typically adopt the Masked Video Modeling (MVM) paradigm, leading to significant progress on multiple video tasks. However, two critical challenges remain: 1) Without human annotations, the random temporal sampling introduces uncertainty, increasing the difficulty of model training. 2) Previous MVM methods primarily recover the masked patches in the pixel space, leading to insufficient information compression for downstream tasks. To address these challenges jointly, we propose a self-supervised framework that leverages Temporal Correspondence for video Representation learning (T-CoRe). For challenge 1), we propose a sandwich sampling strategy that selects two auxiliary frames to reduce reconstruction uncertainty in a two-side-squeezing manner. Addressing challenge 2), we introduce an auxiliary branch into a self-distillation architecture to restore representations in the latent space, generating high-level semantic representations enriched with temporal information. Experiments show that T-CoRe consistently achieves superior performance across several downstream tasks, demonstrating its effectiveness for video representation learning. The code is available in the Supplementary Material.
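A hedged sketch of what a sandwich sampling strategy could look like: a target frame is drawn, then one auxiliary frame is drawn from a bounded window before it and one after it, squeezing the reconstruction target from both sides. The window size and uniform distributions are assumptions.

```python
# Illustrative "sandwich" frame sampling; not the paper's exact strategy.
import random

def sandwich_sample(num_frames, max_gap=8):
    target = random.randrange(max_gap, num_frames - max_gap)
    left = random.randrange(target - max_gap, target)            # earlier auxiliary frame
    right = random.randrange(target + 1, target + max_gap + 1)   # later auxiliary frame
    return left, target, right

print(sandwich_sample(64))   # e.g. (21, 25, 30)
```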
Poster
Rui Qian · Shuangrui Ding · Xiaoyi Dong · Pan Zhang · Yuhang Zang · Yuhang Cao · Dahua Lin · Jiaqi Wang

[ ExHall D ]

Abstract
Active real-time interaction with video LLMs introduces a new paradigm for human-computer interaction, where the model not only understands user intent but also responds while continuously processing streaming video on the fly. Unlike offline video LLMs, which analyze the entire video before answering questions, active real-time interaction requires three capabilities: 1) Perception: real-time video monitoring and interaction capturing; 2) Decision: raising proactive interactions in appropriate situations; and 3) Reaction: continuous interaction with users. However, inherent conflicts exist among these desired capabilities: Decision and Reaction demand Perception at contrary scales and granularities, and autoregressive decoding blocks real-time Perception and Decision during the Reaction. To unify the conflicting capabilities within a harmonious system, we present Dispider, a solution built on a Disentangled Perception, Decision, and Reaction framework. Dispider features a lightweight Proactive Streaming Video Processing module that tracks the video stream and identifies optimal moments for interaction. Once an interaction is triggered, an asynchronous Precise Interaction module provides detailed responses, while the processing module continues to monitor the video in the meantime. Our disentangled and asynchronous design ensures timely, contextually accurate, and computationally efficient responses, making Dispider ideal for active real-time interaction with long-duration video streams. Experiments prove that Dispider …
Poster
Haitong Liu · Kuofeng Gao · Yang Bai · Jinmin Li · Jinxiao Shan · Tao Dai · Shu-Tao Xia

[ ExHall D ]

Abstract
Recently, video-based large language models (video-based LLMs) have achieved impressive performance across various video comprehension tasks. However, this rapid advancement raises significant privacy and security concerns, particularly regarding the unauthorized use of personal video data in automated annotation by video-based LLMs. These unauthorized annotated video-text pairs can then be used to improve the performance of downstream tasks, such as text-to-video generation. To safeguard personal videos from unauthorized use, we propose two series of protective video watermarks with imperceptible adversarial perturbations, named **Ramblings** and **Mutes**. Concretely, **Ramblings** aim to mislead video-based LLMs into generating inaccurate captions for the original videos, thereby degrading the quality of video annotations through inconsistencies between video content and captions. **Mutes**, on the other hand, are designed to prompt video-based LLMs to produce exceptionally brief captions, lacking descriptive detail. Extensive experiments demonstrate that our video watermarking methods effectively protect video data by significantly reducing video annotation performance across various video-based LLMs, showcasing both stealthiness and robustness in protecting personal video content.
Poster
Zijia Lu · ASM Iftekhar · Gaurav Mittal · Tianjian Meng · Xiawei Wang · Cheng Zhao · Rohith Kukkala · Ehsan Elhamifar · Mei Chen

[ ExHall D ]

Abstract
Long Video Temporal Grounding (LVTG) aims at identifying specific moments within lengthy videos based on user-provided text queries for effective content retrieval. Existing methods divide the video into clips and process each clip with a full-scale expert encoder, an approach that is difficult to scale due to the prohibitive computational cost of processing a large number of clips in long videos. To address this issue, we introduce DeCafNet, an approach employing a "delegate-and-conquer" strategy to achieve computational efficiency without sacrificing grounding performance. DeCafNet introduces a sidekick encoder that performs dense feature extraction over all video clips in a resource-efficient manner, while generating a saliency map to identify the most relevant clips for full processing by the expert encoder. To effectively leverage features from sidekick and expert encoders that exist at different temporal resolutions, we introduce DeCaf-Grounder, which unifies and refines them via query-aware temporal aggregation and multi-scale temporal refinement for accurate grounding. Experiments on two LVTG benchmark datasets demonstrate that DeCafNet reduces computation by up to 47% while still outperforming existing methods, establishing a new state-of-the-art for LVTG in terms of both efficiency and performance. Code and model will be released upon acceptance.
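The delegate-and-conquer routing can be illustrated with a minimal sketch, assuming stand-in encoders: a cheap sidekick scores every clip against the query, and only the top-scoring fraction of clips is forwarded to the expensive expert. Module names and the top-k rule are illustrative.

```python
# Illustrative delegate-and-conquer clip routing; not the DeCafNet architecture.
import torch
import torch.nn as nn

class DelegateAndConquer(nn.Module):
    def __init__(self, dim=256, keep_ratio=0.25):
        super().__init__()
        self.sidekick = nn.Linear(dim, dim)   # cheap encoder, runs on all clips
        self.expert = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.keep_ratio = keep_ratio

    def forward(self, clip_feats, query_feat):
        # clip_feats: (N_clips, dim); query_feat: (dim,)
        dense = self.sidekick(clip_feats)                        # dense sidekick features
        saliency = dense @ query_feat                            # relevance of each clip
        k = max(1, int(self.keep_ratio * clip_feats.shape[0]))
        top_idx = saliency.topk(k).indices                       # most relevant clips only
        refined = self.expert(clip_feats[top_idx])               # expensive expert pass
        return dense, refined, top_idx

clips, query = torch.randn(120, 256), torch.randn(256)
dense, refined, idx = DelegateAndConquer()(clips, query)
print(dense.shape, refined.shape, idx.shape)
```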
Poster
Chan Hur · Jeong-hun Hong · Dong-hun Lee · Dabin Kang · Semin Myeong · Sang-hyo Park · Hyeyoung Park

[ ExHall D ]

Abstract
In recent text-video retrieval, the use of additional captions from vision-language models has shown promising effects on performance. However, existing models that use additional captions often struggle to capture the rich semantics, including temporal changes, inherent in the video. In addition, incorrect information produced by generative models can lead to inaccurate retrieval. To address these issues, we propose a new framework, Narrating the Video (NarVid), which strategically leverages the comprehensive information available from frame-level captions, the narration. The proposed NarVid exploits narration in multiple ways: 1) feature enhancement through cross-modal interactions between narration and video, 2) query-aware adaptive filtering to suppress irrelevant or incorrect information, 3) dual-modal matching score by adding query-video similarity and query-narration similarity, and 4) hard-negative loss to learn discriminative features from multiple perspectives using the two similarities from different views. Experimental results demonstrate that NarVid achieves state-of-the-art performance on various benchmark datasets. The code will be available at [github]
Poster
weixing chen · Yang Liu · Binglin Chen · Jiandong Su · Yongsen Zheng · Liang Lin

[ ExHall D ]

Abstract
Video question grounding (VideoQG) requires models to answer the questions and simultaneously infer the relevant video segments to support the answers. However, existing VideoQG methods usually suffer from spurious cross-modal correlations, leading to a failure to identify the dominant visual scenes that align with the intended question. Moreover, although large models possess extensive prior knowledge and can demonstrate strong performance in a zero-shot setting, issues such as spurious correlations persist, making their application to specific downstream tasks challenging. In this work, we propose a novel causality-aware VideoQG framework named Cross-modal Causality Relation Alignment (CRA), to eliminate spurious correlations and improve the causal consistency between question-answering and video temporal grounding. Our CRA involves three essential components: i) Gaussian Smoothing Attention Grounding (GSAG) module for estimating the time interval via cross-modal attention, which is de-noised by an adaptive Gaussian filter. ii) Cross-modal Alignment (CA) enhances the performance of weakly supervised VideoQG by leveraging bidirectional contrastive learning between estimated video segments and QA features. iii) Explicit Causal Intervention (ECI) module for multimodal deconfounding, which involves front-door intervention for vision and back-door intervention for language. Extensive experiments on two VideoQG datasets demonstrate the superiority of our CRA in discovering visually grounded content and achieving …
Poster
Luca Zanella · Massimiliano Mancini · Willi Menapace · Sergey Tulyakov · Yiming Wang · Elisa Ricci

[ ExHall D ]

Abstract
Recent video-language alignment models are trained on sets of videos, each with an associated positive caption and a negative caption generated by large language models. A problem with this procedure is that negative captions may introduce linguistic biases, i.e., concepts are seen only as negatives and never associated with a video. While a solution would be to collect videos for the negative captions, existing databases lack the fine-grained variations needed to cover all possible negatives. In this work, we study whether synthetic videos can help to overcome this issue. Our preliminary analysis with multiple generators shows that, while promising on some tasks, synthetic videos harm the performance of the model on others. We hypothesize this issue is linked to noise (semantic and visual) in the generated videos and develop a method, SynViTA, that accounts for those. SynViTA dynamically weights the contribution of each synthetic video based on how similar its target caption is w.r.t. the real counterpart. Moreover, a semantic consistency loss makes the model focus on fine-grained differences across captions, rather than differences in video appearance. Experiments show that, on average, SynViTA improves over existing methods on VideoCon test sets and SSv2-Temporal, SSv2-Events, and ATP-Hard benchmarks, being a first …
Poster
Chaoyou Fu · Yuhan Dai · Yongdong Luo · Lei Li · Shuhuai Ren · Renrui Zhang · Zihan Wang · Chenyu Zhou · Yunhang Shen · Mengdan Zhang · Peixian Chen · Yanwei Li · Shaohui Lin · Sirui Zhao · Ke Li · Tong Xu · Xiawu Zheng · Enhong Chen · Caifeng Shan · Ran He · Xing Sun

[ ExHall D ]

Abstract
In the quest for artificial general intelligence, Multi-modal Large Language Models (MLLMs) have emerged as a focal point in recent advancements. However, the predominant focus remains on developing their capabilities in static image understanding. The potential of MLLMs to process sequential visual data is still insufficiently explored, highlighting the lack of a comprehensive, high-quality assessment of their performance. In this paper, we introduce Video-MME, the first-ever full-spectrum, Multi-Modal Evaluation benchmark of MLLMs in Video analysis. Our work distinguishes from existing benchmarks through four key features: 1) Diversity in video types, spanning 6 primary visual domains with 30 subfields to ensure broad scenario generalizability; 2) Duration in temporal dimension, encompassing short-, medium-, and long-term videos, ranging from 11 seconds to 1 hour, for robust contextual dynamics; 3) Breadth in data modalities, integrating multi-modal inputs besides video frames, including subtitles and audios, to unveil the all-round capabilities of MLLMs; 4) Quality in annotations, utilizing rigorous manual labeling by expert annotators to facilitate precise and reliable model assessment. With Video-MME, we extensively evaluate various state-of-the-art MLLMs, and reveal that Gemini 1.5 Pro is the best-performing commercial model, significantly outperforming the open-source models with an average accuracy of 75%, compared to 71.9% for …
Poster
Jinhui Yi · Syed Talal Wasim · Yanan Luo · Muzammal Naseer · Jürgen Gall

[ ExHall D ]

Abstract
We present an efficient encoder-free approach for video-language understanding that achieves competitive performance while significantly reducing computational overhead. Current video-language models typically rely on heavyweight image encoders (300M-1.1B parameters) or video encoders (1B-1.4B parameters), creating a substantial computational burden when processing multi-frame videos. Our method introduces a novel Spatio-Temporal Alignment Block (STAB) that directly processes video inputs without requiring pre-trained encoders while using only 45M parameters for visual processing - at least a 6.5× reduction compared to traditional approaches. The STAB architecture combines Local Spatio-Temporal Encoding for fine-grained feature extraction, efficient spatial downsampling through learned attention and separate mechanisms for modeling frame-level and video-level relationships. Our model achieves comparable or superior performance to encoder-based approaches for open-ended video question answering on standard benchmarks. The fine-grained video question-answering evaluation demonstrates our model's effectiveness, outperforming the encoder-based approaches Video-ChatGPT and Video-LLaVA in key aspects like correctness and temporal understanding. Extensive ablation studies validate our architectural choices and demonstrate the effectiveness of our spatio-temporal modeling approach while achieving 3-4× faster processing speeds than previous methods.
Poster
Chiara Plizzari · Alessio Tonioni · Yongqin Xian · Achin Kulshrestha · Federico Tombari

[ ExHall D ]

Abstract
Understanding fine-grained temporal dynamics is crucial in egocentric videos, where continuous streams capture frequent, close-up interactions with objects. In this work, we bring to light that current egocentric video question-answering datasets often include questions that can be answered using only a few frames or commonsense reasoning, without being necessarily grounded in the actual video. Our analysis shows that state-of-the-art Multi-Modal Large Language Models (MLLMs) on these benchmarks achieve remarkably high performance using just text or a single frame as input. To address these limitations, we introduce EgoTempo, a dataset specifically designed to evaluate temporal understanding in the egocentric domain. EgoTempo emphasizes tasks that require integrating information across the entire video, ensuring that models would need to rely on temporal patterns rather than static cues or pre-existing knowledge. Extensive experiments on EgoTempo show that current MLLMs still fall short in temporal reasoning on egocentric videos, and thus we hope EgoTempo will catalyze new research in the field and inspire models that better capture the complexity of temporal dynamics in egocentric settings. The dataset will be made publicly available upon acceptance.
Poster
Quan Zhang · Jinwei Fang · Rui Yuan · Xi Tang · Yuxin Qi · Ke Zhang · Chun Yuan

[ ExHall D ]

Abstract
Recent breakthroughs in Multimodal Large Language Models (MLLMs) have gained significant recognition within the deep learning community, where the fusion of Video Foundation Models (VFMs) and Large Language Models (LLMs) has proven instrumental in constructing robust video understanding systems, effectively surmounting constraints associated with predefined visual tasks. These sophisticated MLLMs exhibit remarkable proficiency in comprehending videos, swiftly attaining unprecedented performance levels across diverse benchmarks. However, their operation demands substantial memory and computational resources, underscoring the continued importance of traditional models in video comprehension tasks. In this paper, we introduce a novel learning paradigm termed MLLM4WTAL. This paradigm harnesses the potential of MLLMs to offer key temporal action semantics and complete semantic textual cues for conventional Weakly-supervised Temporal Action Localization (WTAL) methods. MLLM4WTAL facilitates the enhancement of WTAL by leveraging MLLM guidance. It achieves this by integrating two distinct modules: Key Semantic Matching (KSM) and Complete Semantic Reconstruction (CSR). These modules work in tandem to effectively address prevalent issues like incomplete and over-complete outcomes common in WTAL methods. Rigorous experiments are conducted to validate the efficacy of our proposed approach in augmenting the performance of various heterogeneous WTAL models.
Poster
Reno Kriz · Kate Sanders · David Etter · Kenton Murray · Cameron Carpenter · Hannah Recknor · Jimena Guallar-Blasco · Alexander Martin · Eugene Yang · Benjamin Van Durme

[ ExHall D ]

Abstract
Efficiently retrieving and synthesizing information from large-scale multimodal collections has become a critical challenge. However, existing video retrieval datasets suffer from scope limitations, primarily focusing on matching descriptive but vague queries with small collections of professionally edited, English-centric videos. To address this gap, we introduce **MultiVENT 2.0**, a large-scale, multilingual event-centric video retrieval benchmark featuring a collection of more than 218,000 news videos and over 3,900 queries targeting specific world events. These queries specifically target information found in the visual content, audio, embedded text, and text metadata of the videos, requiring systems to leverage all these sources to succeed at the task. Preliminary results show that state-of-the-art vision-language models struggle significantly with this task, and while alternative approaches show promise, they are still insufficient to adequately address this problem. These findings underscore the need for more robust multimodal retrieval systems, as effective video retrieval is a crucial step towards multimodal content understanding and generation tasks.
Poster
Zijia Zhao · Yuqi Huo · Tongtian Yue · Longteng Guo · Haoyu Lu · Bingning Wang · Weipeng Chen · Jing Liu

[ ExHall D ]

Abstract
Most current video MLLMs rely on uniform frame sampling and image-level encoders, resulting in inefficient data processing and limited motion awareness. To address these challenges, we introduce **EMA**, an **E**fficient **M**otion-**A**ware video MLLM that utilizes compressed video structures as inputs. We propose a motion-aware GOP (Group of Pictures) encoder that fuses spatial and motion information within a GOP unit in the compressed video stream, generating compact, informative visual tokens. By integrating fewer but denser RGB frames with more but sparser motion vectors in this native slow-fast input architecture, our approach reduces redundancy and enhances motion representation. Additionally, we introduce MotionBench, a benchmark for evaluating motion understanding across four motion types: linear, curved, rotational, and contact-based. Experimental results show that EMA achieves state-of-the-art performance on both MotionBench and popular video question answering benchmarks, while reducing inference costs. Moreover, EMA demonstrates strong scalability, as evidenced by its competitive performance on long video understanding benchmarks.
Poster
Zeyi Huang · Yuyang Ji · Xiaofang Wang · Nikhil Mehta · Tong Xiao · Donghyun Lee · Sigmund VanValkenburgh · Shengxin Zha · Bolin Lai · Licheng Yu · Ning Zhang · Yong Jae Lee · Miao Liu

[ ExHall D ]

Abstract
Long-form video understanding with Large Vision Language Models is challenged by the need to analyze temporally dispersed yet spatially concentrated key moments within limited context windows. In this work, we introduce VideoMindPalace, a new framework inspired by the "Mind Palace", which organizes critical video moments into a topologically structured semantic graph. VideoMindPalace organizes key information through (i) hand-object tracking and interaction, (ii) clustered activity zones representing specific areas of recurring activities, and (iii) environment layout mapping, allowing natural language parsing by LLMs to provide grounded insights on spatio-temporal and 3D context. In addition, we propose the Video MindPalace Benchmark (VMB), to assess human-like reasoning, including spatial localization, temporal reasoning, and layout-aware sequential understanding. Evaluated on VMB and established video QA datasets, including EgoSchema, NExT-QA, IntentQA, and the Active Memories Benchmark, VideoMindPalace demonstrates notable gains in spatio-temporal coherence and human-aligned reasoning, advancing long-form video analysis capabilities in VLMs.
Poster
Jiawei Tan · Hongxing Wang · Junwu Weng · Jiaxin Li · Zhilong Ou · Kang Dang

[ ExHall D ]

Abstract
Video moment retrieval aims to locate specific moments from a video according to the query text. This task presents two main challenges: i) aligning the query and video frames at the feature level, and ii) projecting the query-aligned frame features to the start and end boundaries of the matching interval. Previous work commonly involves all frames in feature alignment, which easily aligns irrelevant frames with the query. Furthermore, it forcibly maps visual features to interval boundaries while ignoring the information gap between them, yielding suboptimal performance. In this study, to reduce distraction from irrelevant frames, we designate an anchor frame as the frame with the maximum query-frame relevance measured by an established Vision-Language Model. Via similarity comparison between the anchor frame and the others, we produce a semantically compact segment around the anchor frame, which serves as a guide to align features of the query and related frames. We observe that such a feature alignment makes similarity cohesive between target frames, which enables us to predict the interval boundaries by a single point detection in the 2D semantic similarity space of frames, thus bridging the information gap between frame semantics and temporal boundaries. Experimental results across various datasets demonstrate …
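A minimal sketch of the anchor-frame idea, assuming L2-normalized CLIP-style frame and query embeddings: the most query-relevant frame becomes the anchor, and the segment is grown around it while neighbouring frames remain similar to the anchor. The similarity threshold and growth rule are assumptions.

```python
# Illustrative anchor-frame selection and segment growth; not the paper's exact method.
import numpy as np

def anchor_segment(frame_feats, query_feat, tau=0.8):
    # frame_feats: (T, D) L2-normalized frame embeddings; query_feat: (D,) normalized.
    rel = frame_feats @ query_feat                  # query-frame relevance
    anchor = int(rel.argmax())
    sim_to_anchor = frame_feats @ frame_feats[anchor]
    start = anchor
    while start > 0 and sim_to_anchor[start - 1] >= tau:
        start -= 1
    end = anchor
    while end < len(rel) - 1 and sim_to_anchor[end + 1] >= tau:
        end += 1
    return anchor, (start, end)

T, D = 100, 512
frames = np.random.randn(T, D)
frames /= np.linalg.norm(frames, axis=1, keepdims=True)
query = np.random.randn(D)
query /= np.linalg.norm(query)
print(anchor_segment(frames, query))
```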
Poster
Yisen Feng · Haoyu Zhang · Meng Liu · Weili Guan · Liqiang Nie

[ ExHall D ]

Abstract
Egocentric video grounding is a crucial task for embodied intelligence applications, distinct from exocentric video moment localization. Existing methods primarily focus on the distributional differences between egocentric and exocentric videos but often neglect key characteristics of egocentric videos and the fine-grained information emphasized by question-type queries. To address these limitations, we propose OSGNet, an Object-Shot enhanced Grounding Network for egocentric video. Specifically, we extract object information from videos to enrich video representation, particularly for objects highlighted in the textual query but not directly captured in the video features. Additionally, we analyze the frequent shot movements inherent to egocentric videos, leveraging these features to extract the wearer's attention information, which enhances the model's ability to perform modality alignment. Experiments conducted on three datasets demonstrate that OSGNet achieves state-of-the-art performance, validating the effectiveness of our approach. Our code will be released and is provided in the supplementary material.
Poster
Aditya Chinchure · Sahithya Ravi · Raymond Ng · Vered Shwartz · Boyang Li · Leonid Sigal

[ ExHall D ]

Abstract
The commonsense reasoning capabilities of vision-language models (VLMs), especially in abductive reasoning and defeasible reasoning, remain poorly understood. Most benchmarks focus on typical visual scenarios, making it difficult to discern whether model performance stems from keen perception and reasoning skills, or reliance on pure statistical recall. We argue that by focusing on atypical events in videos, clearer insights can be gained on the core capabilities of VLMs. Explaining and understanding such out-of-distribution events requires models to extend beyond basic pattern recognition and regurgitation of their prior knowledge. To this end, we introduce BlackSwanSuite, a benchmark for evaluating VLMs' ability to reason about unexpected events through abductive and defeasible tasks. Our tasks artificially limit the amount of visual information provided to models while questioning them about hidden unexpected events, or provide new visual information that could change an existing hypothesis about the event. We curate a comprehensive benchmark suite comprising over 3,800 MCQ, 4,900 generative and 6,700 yes/no tasks, spanning 1,655 videos. After extensively evaluating various state-of-the-art VLMs, including GPT-4o and Gemini 1.5 Pro, as well as open-source VLMs such as LLaVA-Video, we find significant performance gaps of up to 32% from humans on these tasks. Our findings reveal key limitations …
Poster
Hesham Syed · Yun Liu · Guolei Sun · Henghui Ding · Jing Yang · Ender Konukoglu · Xue Geng · Xudong Jiang

[ ExHall D ]

Abstract
Video semantic segmentation (VSS) plays a vital role in understanding the temporal evolution of scenes. Traditional methods often segment videos frame-by-frame or in a short temporal window, leading to limited temporal context, redundant computations, and heavy memory requirements. To this end, we introduce a Temporal Video State Space Sharing (TV3S) architecture to leverage Mamba state space models for temporal feature sharing. Our model features a selective gating mechanism that efficiently propagates relevant information across video frames, eliminating the need for a memory-heavy feature pool. By processing spatial patches independently and incorporating a shifted operation, TV3S supports highly parallel computation in both training and inference stages, which reduces the delay in sequential state space processing and improves the scalability for long video sequences. Moreover, TV3S incorporates information from prior frames during inference, achieving long-range temporal coherence and superior adaptability to extended sequences. Evaluations on the VSPW and Cityscapes datasets reveal that our approach outperforms current state-of-the-art methods, establishing a new standard for VSS with consistent results across long video sequences. By achieving a good balance between accuracy and efficiency, TV3S shows a significant advancement in spatiotemporal modeling, paving the way for efficient video analysis. The code will be released.
Poster
Jaewoo Jeong · Seohee Lee · Daehee Park · Giwon Lee · Kuk-Jin Yoon

[ ExHall D ]

Abstract
Pedestrian trajectory forecasting is crucial in various applications such as autonomous driving and mobile robot navigation. Camera-based visual features of pedestrians enable the extraction of additional modalities (human pose, text), which enhance prediction accuracy. We focus on pedestrian motion prediction to fully utilize the rich, dynamic visual features of pedestrians. Indeed, we find that textual descriptions play a crucial role in integrating additional modalities into a unified understanding. However, online extraction of text requires the use of a VLM, which may not be feasible for resource-constrained systems. To address this challenge, we propose a multi-modal knowledge distillation framework: a student model with limited modalities is distilled from a teacher model trained with the full range of modalities. The comprehensive knowledge of a teacher model trained with trajectory, human pose, and text is distilled into a student model using only trajectory or human pose as a sole supplement. We validate our generalizable framework with two state-of-the-art models across three datasets on both ego-view (JRDB, SIT) and BEV-view (ETH/UCY) setups. For the SIT dataset, we utilize a VLM to generate captions to compensate for the lack of text annotations. Distilled student models show consistent improvement in all prediction metrics for both full and instantaneous observations.
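A minimal sketch of the distillation setup, with stand-in encoders: a teacher (treated as frozen here) consumes trajectory, pose, and text features, while the student sees only the trajectory feature and is pulled toward the teacher's representation in addition to the forecasting loss. Dimensions and the loss weight are assumptions.

```python
# Illustrative multi-modal teacher -> limited-modality student distillation.
import torch
import torch.nn as nn

dim, horizon = 128, 12

teacher = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))  # full modalities
student = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))      # trajectory only
head = nn.Linear(dim, horizon * 2)                     # predicts future (x, y) offsets

traj_f, pose_f, text_f = torch.randn(8, dim), torch.randn(8, dim), torch.randn(8, dim)
gt_future = torch.randn(8, horizon * 2)

with torch.no_grad():                                  # teacher treated as frozen
    t_repr = teacher(torch.cat([traj_f, pose_f, text_f], dim=-1))
s_repr = student(traj_f)                               # limited-modality student

loss = nn.functional.mse_loss(head(s_repr), gt_future) \
     + 0.5 * nn.functional.mse_loss(s_repr, t_repr)    # feature-level distillation term
loss.backward()
print(float(loss))
```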
Poster
Mingqiao Ye · Seoung Wug Oh · Lei Ke · Joon-Young Lee

[ ExHall D ]

Abstract
Automatically tracking and segmenting every video entity remains a significant challenge. Despite rapid advancements in video segmentation, even state-of-the-art models like SAM 2 struggle to consistently track all entities across a video—a task we refer to as Video Entity Segmentation. We propose EntitySAM, a framework for zero-shot video entity segmentation. EntitySAM extends SAM 2 by removing the need for explicit prompts, allowing automatic discovery and tracking of all entities, including those appearing in later frames. We incorporate query-based entity discovery and association into SAM 2, inspired by transformer-based object detectors. Specifically, we introduce an entity decoder to facilitate inter-object communication and an automatic prompt generator using learnable object queries. Additionally, we add a semantic encoder to enhance SAM 2's semantic awareness, improving segmentation quality. Trained on image-level mask annotations without category information from the COCO dataset, EntitySAM demonstrates strong generalization on four zero-shot video segmentation tasks: Video Entity, Panoptic, Instance, and Semantic Segmentation. Results on six popular benchmarks show that EntitySAM outperforms previous unified video segmentation methods and strong baselines, setting new standards for zero-shot video segmentation.
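As a rough illustration of replacing interactive prompts with learnable object queries, the sketch below lets a small transformer decoder attend a fixed set of queries to dense image features and maps each query to a prompt embedding; the shapes and the prompt interface are assumptions and do not reflect SAM 2's real API.

```python
# Illustrative automatic prompt generation from learnable object queries.
import torch
import torch.nn as nn

class AutoPromptGenerator(nn.Module):
    def __init__(self, dim=256, n_queries=32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.to_prompt = nn.Linear(dim, dim)

    def forward(self, image_feats):
        # image_feats: (B, H*W, dim) flattened encoder features.
        B = image_feats.shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        q = self.decoder(q, image_feats)           # entity discovery via cross-attention
        return self.to_prompt(q)                   # one prompt embedding per candidate entity

feats = torch.randn(2, 64 * 64, 256)
prompts = AutoPromptGenerator()(feats)
print(prompts.shape)    # torch.Size([2, 32, 256])
```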
Poster
Md Zarif Hossain · AHMED IMTEAJ

[ ExHall D ]

Abstract
Large Vision-Language Models (LVLMs) have emerged as transformative tools in multimodal tasks, seamlessly integrating pretrained vision encoders to align visual and textual modalities. Prior works have highlighted the susceptibility of LVLMs to dual exploits (gradient-based and optimization-based jailbreak attacks), which leverage the expanded attack surface introduced by the image modality. Despite advancements in enhancing robustness, existing methods fall short in their ability to defend against dual exploits while preserving fine-grained semantic details and overall semantic coherence under intense adversarial perturbations. To bridge this gap, we introduce SLADE, a novel unsupervised adversarial fine-tuning scheme that enhances the resilience of CLIP-based vision encoders. SLADE’s dual-level contrastive learning approach balances the granular and the holistic, capturing fine-grained image details without losing sight of high-level semantic coherence. Extensive experiments demonstrate that SLADE-equipped LVLMs set a new benchmark for robustness against dual exploits while preserving fine-grained semantic details of perturbed images. Notably, SLADE achieves these results without compromising the core functionalities of LVLMs, such as instruction following, or requiring the computational overhead (e.g., large batch sizes, momentum encoders) commonly associated with traditional contrastive learning methods. The code is provided in the supplementary material with this submission.
Poster
Alan Lukezic · Jovana Videnović · Matej Kristan

[ ExHall D ]

Abstract
Memory-based trackers are video object segmentation methods that form the target model by concatenating recently tracked frames into a memory buffer and localize the target by attending the current image to the buffered frames. While already achieving top performance on many benchmarks, it was the recent release of SAM2 that placed memory-based trackers into focus of the visual object tracking community. Nevertheless, modern trackers still struggle in the presence of distractors. We argue that a more sophisticated memory model is required, and propose a new distractor-aware memory model for SAM2 and an introspection-based update strategy that jointly addresses the segmentation accuracy as well as tracking robustness. The resulting tracker is denoted as SAM2.1++. We also propose a new distractor-distilled DiDi dataset to study the distractor problem better. SAM2.1++ outperforms SAM2.1 and related SAM memory extensions on seven benchmarks and sets a solid new state-of-the-art on six of them.
Poster
Snehashis Majhi · Giacomo D'Amicantonio · Antitza Dantcheva · Quan Kong · Lorenzo Garattoni · Gianpiero Francesca · Egor Bondarev · Francois Bremond

[ ExHall D ]

Abstract
Weakly-supervised methods for video anomaly detection (VAD) are conventionally based merely on RGB spatio-temporal features, which continues to limit their reliability in real-world scenarios. This is due to the fact that RGB-features are not sufficiently distinctive in setting apart categories such as shoplifting from visually similar events. Therefore, towards robust complex real-world VAD, it is essential to augment RGB spatio-temporal features by additional modalities. Motivated by this, we introduce the Poly-modal Induced framework for VAD: PI-VAD (or π-VAD), a novel approach that augments RGB representations by five additional modalities. Specifically, the modalities include sensitivity to fine-grained motion (Pose), three dimensional scene and entity representation (Depth), surrounding objects (Panoptic masks), global motion (optical flow), as well as language cues (VLM). Each modality represents an axis of a polygon, streamlined to add salient cues to RGB. π-VAD includes two plug-in modules, namely Pseudo-modality Generation module and Cross Modal Induction module, which generate modality-specific prototypical representation and, thereby, induce multi-modal information into RGB cues. These modules operate by performing anomaly-aware auxiliary tasks and necessitate five modality backbones -- only during training. Notably, π-VAD achieves state-of-the-art accuracy on three prominent VAD datasets encompassing real-world scenarios, without requiring the computational overhead of five modality backbones …
Poster
Kazi Sajeed Mehrab · M. Maruf · Arka Daw · Abhilash Neog · Harish Babu Manogaran · Mridul Khurana · Zhenyang Feng · Bahadir Altintas · Yasin Bakis · Elizabeth Campolongo · Matthew Thompson · Xiaojun Wang · Hilmar Lapp · Tanya Berger-Wolf · Paula Mabee · Henry Bart · Wei-Lun Chao · Wasla Dahdul · Anuj Karpatne

[ ExHall D ]

Abstract
The availability of large datasets of organism images combined with advances in artificial intelligence (AI) has significantly enhanced the study of organisms through images, unveiling biodiversity patterns and macro-evolutionary trends. However, existing machine learning (ML)-ready organism datasets have several limitations. First, these datasets often focus on species classification only, overlooking tasks involving visual traits of organisms. Second, they lack detailed visual trait annotations, like pixel-level segmentation, that are crucial for in-depth biological studies. Third, these datasets predominantly feature organisms in their natural habitats, posing challenges for aquatic species like fish, where underwater images often suffer from poor visual clarity, obscuring critical biological traits. This gap hampers the study of aquatic biodiversity patterns, which is necessary for assessing climate change impacts, and hinders evolutionary research on aquatic species morphology. To address this, we introduce the Fish-Visual Trait Analysis (Fish-Vista) dataset—a large, annotated collection of about 80K fish images spanning 3000 different species, supporting several challenging and biologically relevant tasks including species classification, trait identification, and trait segmentation. These images have been curated through a sophisticated data processing pipeline applied to a cumulative set of images obtained from various museum collections. Fish-Vista ensures that visual traits of images are clearly visible, …
Poster
Ho-Joong Kim · Yearang Lee · Jung-Ho Hong · Seong-Whan Lee

[ ExHall D ]

Abstract
In this paper, we examine a key limitation in query-based detectors for temporal action detection (TAD), which arises from their direct adaptation of architectures originally designed for object detection. Despite the effectiveness of the existing models, they struggle to fully address the unique challenges of TAD, such as the redundancy in multi-scale features and the limited ability to capture sufficient temporal context. To address these issues, we propose a multi-dilated gated encoder and central-adjacent region integrated decoder for temporal action detection transformer (DiGIT). Our approach replaces the existing encoder that consists of multi-scale deformable attention and feedforward network with our multi-dilated gated encoder. Our proposed encoder reduces the redundant information caused by multi-level features while maintaining the ability to capture fine-grained and long-range temporal information. Furthermore, we introduce a central-adjacent region integrated decoder that leverages a more comprehensive sampling strategy for deformable cross-attention to capture the essential information. Extensive experiments demonstrate that DiGIT achieves state-of-the-art performance on THUMOS14, ActivityNet v1.3, and HACS-Segment.
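A hedged sketch of a multi-dilated gated block for 1D temporal features is given below: parallel dilated convolutions cover different temporal ranges and a learned gate mixes them per position. The actual DiGIT encoder likely differs in structure and normalization.

```python
# Illustrative multi-dilated gated block over a temporal feature sequence.
import torch
import torch.nn as nn

class MultiDilatedGatedBlock(nn.Module):
    def __init__(self, dim=256, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3, padding=d, dilation=d) for d in dilations
        )
        self.gate = nn.Conv1d(dim, len(dilations), kernel_size=1)

    def forward(self, x):
        # x: (B, dim, T) temporal feature sequence.
        outs = torch.stack([b(x) for b in self.branches], dim=1)    # (B, K, dim, T)
        weights = torch.softmax(self.gate(x), dim=1).unsqueeze(2)   # (B, K, 1, T)
        return x + (outs * weights).sum(1)                          # gated residual fusion

feats = torch.randn(2, 256, 128)               # 128 temporal positions
print(MultiDilatedGatedBlock()(feats).shape)   # torch.Size([2, 256, 128])
```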
Poster
Dominick Reilly · Rajatsubhra Chakraborty · Arkaprava Sinha · Manish Kumar Govind · Pu Wang · Francois Bremond · Le Xue · Srijan Das

[ ExHall D ]

Abstract
Current Large Language Vision Models (LLVMs) trained on web videos perform well in general video understanding but struggle with fine-grained details, complex human-object interactions (HOI), and the view-invariant representation learning essential for Activities of Daily Living (ADL). This limitation stems from a lack of specialized ADL video instruction-tuning datasets and insufficient modality integration to capture discriminative action representations. To address this, we propose a semi-automated framework for curating ADL datasets, creating ADL-X, a multiview, multimodal RGBS instruction-tuning dataset. Additionally, we introduce LLAVIDAL, an LLVM integrating videos, 3D skeletons, and HOIs to model ADL's complex spatiotemporal relationships. For training LLAVIDAL, a simple joint alignment of all modalities yields suboptimal results; thus, we propose a Multimodal Progressive (MMPro) training strategy, incorporating modalities in stages following a curriculum. We also establish ADL MCQ and video description benchmarks to assess LLVM performance in ADL tasks. Trained on ADL-X, LLAVIDAL achieves state-of-the-art performance across ADL benchmarks. Code and data will be made publicly available at https://llavidal.github.io/llavidal/
Poster
Jianyang Xie · Yitian Zhao · Yanda Meng · He Zhao · Anh Nguyen · Yalin Zheng

[ ExHall D ]

Abstract
Spatial-temporal graph convolutional networks (ST-GCNs) showcase impressive performance in skeleton-based human action recognition (HAR). However, despite the development of numerous models, their recognition performance does not differ significantly after aligning the input settings. With this observation, we hypothesize that ST-GCNs are over-parameterized for HAR, a conjecture subsequently confirmed through experiments employing the lottery ticket hypothesis. Additionally, a novel sparse ST-GCNs generator is proposed, which trains a sparse architecture from a randomly initialized dense network while maintaining comparable performance levels to the dense components. Moreover, we generate multi-level sparsity ST-GCNs by integrating sparse structures at various sparsity levels and demonstrate that the assembled model yields a significant enhancement in HAR performance. Thorough experiments on four datasets, including NTU-RGB+D 60(120), Kinetics-400, and FineGYM, demonstrate that the proposed sparse ST-GCNs can achieve comparable performance to their dense components. Even with 95% fewer parameters, the sparse ST-GCNs exhibit a degradation of <1% in top-1 accuracy. Meanwhile, the multi-level sparsity ST-GCNs, which require only 66% of the parameters of the dense ST-GCNs, demonstrate an improvement of >1% in top-1 accuracy. The code will be released upon acceptance.
Poster
Yuhao Li · Xinyue Chen · Hongkai Li · Xiaorong Pu · Peng Jin · Yazhou Ren

[ ExHall D ]

Abstract
Sign language is a visual language expressed through complex movements of the upper body. The human skeleton plays a critical role in sign language recognition due to its good separation from the video background. However, mainstream skeleton-based sign language recognition models often overly focus on the natural connections between joints, treating sign language as ordinary human movements, which neglects its linguistic characteristics. We believe that just as letters form words, each sign language gloss can also be decomposed into smaller visual symbols. To fully harness the potential of skeleton data, this paper proposes a novel joint fusion strategy and a visual symbol attention model. Specifically, we first input the complete set of skeletal joints, and after dynamically exchanging joint information, we discard the parts with the weakest connections to other joints, resulting in a fused, simplified skeleton. Then, we group the joints most likely to express the same visual symbol and discuss the joint movements within each group separately. To validate the superiority of our method, we conduct extensive experiments on multiple public benchmark datasets. The results show that, without complex pre-training, we still achieve new state-of-the-art performance.
Poster
Chun Tong Lei · Hon Ming Yam · Zhongliang Guo · Yifei Qian · Chun Pong Lau

[ ExHall D ]

Abstract
Neural networks have revolutionized numerous fields with their exceptional performance, yet they remain susceptible to adversarial attacks through subtle perturbations. While diffusion-based purification methods like DiffPure offer promising defense mechanisms, their computational overhead presents a significant practical limitation. In this paper, we introduce One Step Control Purification (OSCP), a novel defense framework that achieves robust adversarial purification in a single Neural Function Evaluation (NFE) within diffusion models. We propose Gaussian Adversarial Noise Distillation (GAND) as the distillation objective and Controlled Adversarial Purification (CAP) as the inference pipeline, which together make OSCP remarkably efficient while maintaining defense efficacy. Our proposed GAND addresses a fundamental tension between consistency distillation and adversarial perturbation, bridging the gap between natural and adversarial manifolds in the latent space, while remaining computationally efficient through Parameter-Efficient Fine-Tuning (PEFT) methods such as LoRA, avoiding the high computational budget of full-parameter fine-tuning. CAP guides the purification process with an unlearnable edge detection operator computed from the input image as an extra prompt, effectively preventing the purified images from deviating from their original appearance when using large purification steps. Our experimental results on ImageNet showcase OSCP's superior performance, achieving a 74.19% defense success rate with merely 0.1s per purification --- a 100-fold speedup …
Poster
Huu Binh Ta · Duc Nguyen · Quyen Tran · Toan Tran · Tung Pham

[ ExHall D ]

Abstract
In security-sensitive fields, data should be encrypted to protect against unauthorized access and maintain confidentiality throughout processing. However, traditional networks like ViTs and CNNs return different results when processing original data versus its encrypted form, meaning that they require data to be decrypted, posing a security risk by exposing sensitive information. One solution to this issue is using polynomial networks, including the state-of-the-art Multilinear Operator Networks, which return the same outputs given real data and their encrypted forms under Leveled Fully Homomorphic Encryption. Nevertheless, these models are susceptible to catastrophic forgetting in incremental learning settings. This paper therefore presents a new low-rank adaptation method combined with the Gradient Projection Memory mechanism to mitigate this issue. Our proposal is compatible with Leveled Fully Homomorphic Encryption while achieving a sharp improvement in performance compared to existing models.
Poster
Zhuowei Li · Tianchen Zhao · Xiang Xu · Zheng Zhang · Zhihua Li · Xuanbai Chen · Qin ZHANG · Alessandro Bergamo · Anil Kumar Jain · Yifan Xing

[ ExHall D ]

Abstract
Developing a face anti-spoofing model that meets the security requirements of clients worldwide is challenging due to the domain gap between training datasets and the diverse end-user test data. Moreover, for security and privacy reasons, it is undesirable for clients to share large amounts of their face data with service providers. In this work, we introduce a novel method where the face anti-spoofing model can be adapted by the client itself to a target domain at test time using only a small sample of data, while keeping model parameters and training data inaccessible to the client. We develop a prototype-based base model and an optimal transport-guided adaptor that enable adaptation in either a lightweight training or training-free setting, without updating the base model's parameters. Moreover, we employ geodesic mixup, an optimal transport-based synthesis method that generates augmented training data along the geodesic path between source prototypes and the target data distribution. This allows training a lightweight classifier to effectively adapt to target-specific characteristics while retaining essential knowledge learned from the source domain. In cross-domain and cross-attack settings, our method achieves average improvements of 19.17% in HTER and 8.58% in AUC over recent methods.
Poster
Gaojian Wang · Feng Lin · Tong Wu · Zhenguang Liu · Zhongjie Ba · Kui Ren

[ ExHall D ]

Abstract
This work asks: with abundant, unlabeled real faces, how can we learn a robust and transferable facial representation that boosts various face security tasks in terms of generalization performance? We make the first attempt and propose a self-supervised pretraining framework, FSFM, that learns fundamental representations of real face images by leveraging the synergy between masked image modeling (MIM) and instance discrimination (ID). We explore various facial masking strategies for MIM and present a simple yet powerful CRFR-P masking, which explicitly forces the model to capture meaningful intra-region Consistency and challenging inter-region Coherency. Furthermore, we devise an ID network that naturally couples with MIM to establish underlying local-to-global Correspondence via tailored self-distillation. These three learning objectives, namely 3C, empower encoding of both local features and global semantics of real faces. After pretraining, a vanilla ViT serves as a universal vision Foundation Model for downstream Face Security tasks: cross-dataset deepfake detection, cross-domain face anti-spoofing, and unseen diffusion facial forgery detection. Extensive experiments on 10 public datasets demonstrate that our model transfers better than supervised pretraining and prior visual and facial self-supervised learning methods, and even outperforms task-specialized SOTA methods.
Poster
Hangtao Zhang · Yichen Wang · Shihui Yan · Chenyu Zhu · Ziqi Zhou · Linshan Hou · Shengshan Hu · Minghui Li · Yanjun Zhang · Leo Yu Zhang

[ ExHall D ]

Abstract
Object detection models are vulnerable to backdoor attacks, where attackers poison a small subset of training samples by embedding a predefined trigger to manipulate predictions. Detecting poisoned samples (i.e., those containing triggers) at test time can prevent backdoor activation. However, unlike image classification tasks, the unique characteristics of object detection---particularly its output of numerous objects---pose fresh challenges for backdoor detection. The complex attack effects (e.g., "ghost" object emergence or "vanishing" objects) further render current defenses fundamentally inadequate. To this end, we design TRAnsformation Consistency Evaluation (TRACE), a brand-new method for detecting poisoned samples at test time in object detection. Our journey begins with two intriguing observations: (1) poisoned samples exhibit significantly more consistent detection results than clean ones across varied backgrounds; (2) clean samples show higher detection consistency when introduced to different focal information. Based on these phenomena, TRACE applies foreground and background transformations to each test sample, then assesses transformation consistency by calculating the variance in object confidences. TRACE achieves black-box, universal backdoor detection, with extensive experiments showing a 30% improvement in AUROC over state-of-the-art defenses and resistance to adaptive attacks.
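As a rough, hypothetical sketch of the transformation-consistency idea above (not the authors' implementation): apply a set of background/foreground transformations to a test image, collect the detector's object confidences for each transformed copy, and flag samples whose confidences vary suspiciously little. The `detector` and `transforms` callables are assumptions for illustration.

```python
import numpy as np

def consistency_score(image, detector, transforms):
    """Variance of mean detection confidence across transformed copies.
    A very small variance suggests detections driven by a trigger rather
    than by scene content (cf. observation (1) in the abstract)."""
    confidences = []
    for transform in transforms:
        dets = detector(transform(image))          # assumed: list of per-object confidences
        confidences.append(np.mean(dets) if len(dets) > 0 else 0.0)
    return np.var(confidences)                     # low value -> flag as possibly poisoned
```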
Poster
Tong Bu · Maohua Li · Zhaofei Yu

[ ExHall D ]

Abstract
Spiking Neural Networks (SNNs) have emerged as a promising substitute for Artificial Neural Networks (ANNs) due to their advantages of fast inference and low power consumption. However, the lack of efficient training algorithms has hindered their widespread adoption. Even efficient ANN-SNN conversion methods necessitate quantized training of ANNs to enhance the effectiveness of the conversion, incurring additional training costs. To address these challenges, we propose an efficient ANN-SNN conversion framework with only inference scale complexity. The conversion framework includes a local threshold balancing algorithm, which enables efficient calculation of the optimal thresholds and fine-grained adjustment of the threshold value by channel-wise scaling. We also introduce an effective delayed evaluation strategy to mitigate the influence of the spike propagation delays. We demonstrate the scalability of our framework in typical computer vision tasks: image classification, semantic segmentation, object detection, and video classification. Our algorithm outperforms existing methods, highlighting its practical applicability and efficiency. Moreover, we have evaluated the energy consumption of the converted SNNs, demonstrating their superior low-power advantage compared to conventional ANNs. This approach simplifies the deployment of SNNs by leveraging open-source pre-trained ANN models, enabling fast, low-power inference with negligible performance reduction.
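For intuition only, here is a minimal sketch of the kind of channel-wise threshold selection a local threshold balancing step might perform, assuming thresholds are set from activation statistics of a small calibration batch; the percentile statistic and the function name are illustrative assumptions, not the paper's algorithm.

```python
import torch

@torch.no_grad()
def channelwise_thresholds(relu_outputs: torch.Tensor, q: float = 0.999) -> torch.Tensor:
    # relu_outputs: (N, C, H, W) activations collected from a small calibration set
    per_channel = relu_outputs.permute(1, 0, 2, 3).reshape(relu_outputs.shape[1], -1)
    return torch.quantile(per_channel, q, dim=1)   # one firing threshold per channel
```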
Poster
Yufei Guo · Xiaode Liu · Yuanpei Chen · Weihang Peng · Yuhan Zhang · Zhe Ma

[ ExHall D ]

Abstract
Spiking Neural Networks have emerged as a promising energy-efficient alternative to Artificial Neural Networks, utilizing event-driven computation and binary spikes for information transfer. Despite their energy efficiency, SNNs face significant challenges in achieving high task accuracy, particularly when integrated with CNN-based architectures. A potential solution is the combination of Transformer models with SNNs. This paper addresses the challenge of adapting the self-attention mechanism of Transformers to the spiking paradigm by introducing a novel approach: Accurate Addition-Only Spiking Self-Attention (A2OS2A). Unlike existing methods that rely exclusively on binary spiking neurons for all components of the self-attention mechanism, our approach incorporates binary, ReLU, and ternary spiking neurons. This hybrid strategy substantially improves accuracy while maintaining non-multiplicative computations. Furthermore, our method eliminates the need for softmax and scaling operations. Extensive experiments demonstrate that the A2OS2A-based Spiking Transformer outperforms existing SNN-based Transformers on both static and neuromorphic datasets, achieving an accuracy of 78.66% on ImageNet-1K. Our work represents a significant advancement in SNN-based Transformer models, offering a more accurate and efficient solution for real-world applications.
Poster
Chao Yuan · Guiwei Zhang · Changxiao Ma · Tianyi Zhang · Guanglin Niu

[ ExHall D ]

Abstract
Person re-identification (ReID) aims to extract accurate identity representation features. However, during feature extraction, individual samples are inevitably affected by noise (background, occlusions, and model limitations). Considering that features from the same identity follow a normal distribution around identity centers after training, we propose a Training-Free Feature Centralization ReID framework that aggregates same-identity features to reduce individual sample noise and enhance the stability of identity representation, while preserving the features' original distribution for subsequent strategies such as re-ranking. Specifically, to obtain samples of the same identity, we introduce two components. Identity-Guided Pedestrian Generation: by leveraging identity features to guide the generation process, we obtain high-quality images with diverse poses, ensuring identity consistency even in complex scenarios such as infrared and occlusion. Neighbor Feature Centralization: it explores each sample's potential positive samples from its neighborhood. Experiments demonstrate that our generative model exhibits strong generalization capabilities and maintains high identity consistency. With the Feature Centralization framework, we achieve impressive performance even with an ImageNet pre-trained model without ReID training, reaching mAP/Rank-1 of 52.81/78.92 on Market1501. Moreover, our method sets new state-of-the-art results across standard, cross-modality, and occluded ReID tasks, showcasing strong adaptability.
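A minimal sketch of the neighbor-centralization intuition, assuming L2-normalized gallery features; the neighborhood size and the simple mean aggregation are illustrative choices, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def centralize_features(features: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Average each sample's feature with its k nearest neighbors to suppress
    per-sample noise while staying in the original feature space."""
    feats = F.normalize(features, dim=1)
    sim = feats @ feats.t()                      # cosine similarity between all samples
    _, idx = sim.topk(k + 1, dim=1)              # each sample plus its k nearest neighbors
    centralized = feats[idx].mean(dim=1)
    return F.normalize(centralized, dim=1)
```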
Poster
Keqi Chen · vinkle srivastav · Didier MUTTER · Nicolas Padoy

[ ExHall D ]

Abstract
Multi-view person association is a fundamental step towards multi-view analysis of human activities. Although the person re-identification features have been proven effective, they become unreliable in challenging scenes where persons share similar appearances. Therefore, cross-view geometric constraints are required for a more robust association. However, most existing approaches are either fully-supervised using ground-truth identity labels or require calibrated camera parameters that are hard to obtain. In this work, we investigate the potential of learning from multi-view synchronization, and propose a self-supervised uncalibrated multi-view person association approach, Self-MVA, without using any annotations. Specifically, we propose a self-supervised learning framework, consisting of an encoder-decoder model and a self-supervised pretext task, cross-view image synchronization, which aims to distinguish whether two images from different views are captured at the same time. The model encodes each person's unified geometric features and appearance features for association and decodes the geometric features to predict the 2d positions in the original view. To train the model, we apply Hungarian matching to bridge the gap between instance-wise and image-wise distances, and then utilize synchronization labels for metric learning. To further reduce the solution space, we propose two types of self-supervised linear constraints: multi-view localization and pairwise edge association. Extensive …
Poster
Jiaqi Zhao · Zeyu Ding · Yong Zhou · Hancheng Zhu · Wen-Liang Du · Rui Yao

[ ExHall D ]

Abstract
The diffusion model has been successfully applied to various detection tasks. However, it still faces several challenges when used for oriented object detection: objects that are arbitrarily rotated require the diffusion model to encode their orientation information; uncontrollable random boxes inaccurately locate objects with dense arrangements and extreme aspect ratios; and oriented boxes result in misalignment between the boxes and image features. To overcome these limitations, we propose ReDiffDet, a framework that formulates oriented object detection as a rotation-equivariant denoising diffusion process. First, we represent an oriented box as a 2D Gaussian distribution, forming the basis of the denoising paradigm. The reverse process can be proven to be rotation-equivariant within this representation and model framework. Second, we design a conditional encoder with conditional boxes to prevent boxes from being randomly placed across the entire image. Third, we propose an aligned decoder for alignment between oriented boxes and image features. Extensive experiments demonstrate that ReDiffDet achieves promising performance and significantly outperforms the diffusion model baseline.
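For readers unfamiliar with the Gaussian box representation, the following hypothetical sketch converts an oriented box (center, size, angle) into the mean and covariance of a 2D Gaussian; the half-extent scaling is a common convention assumed here, not necessarily the paper's exact parameterization.

```python
import numpy as np

def oriented_box_to_gaussian(cx, cy, w, h, theta):
    """Mean = box center; covariance = rotated diagonal matrix of squared half-extents."""
    mean = np.array([cx, cy])
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    cov = R @ np.diag([(w / 2.0) ** 2, (h / 2.0) ** 2]) @ R.T
    return mean, cov
```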
Poster
Maochen Yang · Zekun Li · Jian Zhang · Lei Qi · Yinghuan Shi

[ ExHall D ]

Abstract
Semi-supervised crowd counting is crucial for addressing the high annotation costs of densely populated scenes. Although several methods based on pseudo-labeling have been proposed, it remains challenging to effectively and accurately utilize unlabeled data. In this paper, we propose a novel framework called Taste More Taste Better (TMTB), which emphasizes both data and model aspects. Firstly, we explore a data augmentation technique well-suited for the crowd counting task. By inpainting the background regions, this technique can effectively enhance data diversity while preserving the fidelity of the entire scenes. Secondly, we introduce the Visual State Space Model (VSSM) as backbone to capture the global context information from crowd scenes, which is crucial for extremely crowded, low-light, and adverse weather scenarios. In addition to the traditional regression head for exact prediction, we employ an Anti-Noise classification head to provide less exact but more accurate supervision, since the regression head is sensitive to noise in manual annotations. We conduct extensive experiments on four benchmark datasets and show that our method outperforms state-of-the-art methods by a large margin. The source code is provided in the supplementary material.
Poster
Longtao Jiang · Zhendong Wang · Jianmin Bao · Wengang Zhou · Dongdong Chen · Lei Shi · Dong Chen · Houqiang Li

[ ExHall D ]

Abstract
Object removal has so far been dominated by the mask-and-inpaint paradigm, where the masked region is excluded from the input, leaving models to rely on unmasked areas to inpaint the missing region. However, this approach lacks contextual information for the masked area, often resulting in unstable performance. In this work, we introduce SmartEraser, built with a new removal paradigm called Masked-Region Guidance. This paradigm retains the masked region in the input, using it as guidance for the removal process. It offers several distinct advantages: (a) it guides the model to accurately identify the object to be removed, preventing its regeneration in the output; (b) since the user mask often extends beyond the object itself, it aids in preserving the surrounding context in the final result. Leveraging this new paradigm, we present Syn4Removal, a large-scale object removal dataset, where instance segmentation data is used to copy and paste objects onto images as removal targets, with the original images serving as ground truths. Experimental results demonstrate that our model, SmartEraser, significantly outperforms existing methods, achieving superior performance in object removal, especially in complex scenes with intricate compositions. We will release the code, dataset, and models.
Poster
Jae-Woo KIM · Ue-Hwan Kim

[ ExHall D ]

Abstract
While current state-of-the-art Scene Change Detection (SCD) approaches achieve impressive results on the research data they are trained on, they become unreliable under unseen environments and different temporal conditions; in-domain performance drops from 77.6% to 8.0% in a previously unseen environment and to 4.6% under a different temporal condition---calling for a generalizable SCD method and benchmark. In this work, we propose the Generalizable Scene Change Detection Framework (GeSCF), which addresses unseen-domain performance and temporal consistency---to meet the growing demand for anything SCD. Our method leverages the pre-trained Segment Anything Model (SAM) in a zero-shot manner. For this, we design Initial Pseudo-mask Generation and Geometric-Semantic Mask Matching---seamlessly turning user-guided prompt and single-image based segmentation into scene change detection for a pair of inputs without guidance. Furthermore, we define the Generalizable Scene Change Detection (GeSCD) benchmark along with novel metrics and an evaluation protocol to facilitate SCD research in generalizability. In the process, we introduce the ChangeVPR dataset, a collection of challenging image pairs with diverse environmental scenarios---including urban, suburban, and rural settings. Extensive experiments across various datasets demonstrate that GeSCF achieves an average performance gain of 19.2% on existing SCD datasets and 30.0% on the ChangeVPR dataset, nearly doubling the prior art performance. We believe our …
Poster
Weixiao Gao · Liangliang Nan · Hugo Ledoux

[ ExHall D ]

Abstract
Semantic segmentation in urban scene analysis has mainly focused on images or point clouds, while textured meshes—offering richer spatial representation—remain underexplored. This paper introduces SUM Parts, the first large-scale dataset for urban textured meshes with part-level semantic labels, covering about 2.5 km² with 21 classes. The dataset was created using our designed annotation tool, supporting both face and texture-based annotations with efficient interactive selection. We also provide a comprehensive evaluation of 3D semantic segmentation and interactive annotation methods on this dataset.
Poster
Oliver Hahn · Christoph Reich · Nikita Araslanov · Daniel Cremers · Christian Rupprecht · Stefan Roth

[ ExHall D ]

Abstract
Unsupervised panoptic segmentation aims to partition an image into semantically meaningful regions and distinct object instances without training on manually annotated data. In contrast to prior work on unsupervised panoptic scene understanding, we eliminate the need for object-centric training data, enabling the unsupervised understanding of complex scenes. To that end, we present the first unsupervised panoptic method that directly trains on scene-centric imagery. In particular, we propose an approach to obtain high-resolution panoptic pseudo labels on complex scene-centric data by combining visual representations, depth, and motion cues. Utilizing both pseudo-label training and a panoptic self-training strategy yields a novel approach that accurately predicts panoptic segmentation of complex scenes without requiring any human annotations. Our approach significantly improves panoptic quality, e.g., surpassing the recent state of the art in unsupervised panoptic segmentation on Cityscapes by 9.4 percentage points in PQ.
Poster
Hongyi Zeng · Wenxuan Liu · Tianhua Xia · Jinhui Chen · Ziyun Li · Sai Qian Zhang

[ ExHall D ]

Abstract
Instance segmentation is essential for augmented reality and virtual reality (AR/VR) as it enables precise object recognition and interaction, enhancing the integration of virtual and real-world elements for an immersive experience. However, the high computational overhead of segmentation limits its application on resource-constrained AR/VR devices, causing large processing latency and degrading user experience. In contrast to conventional scenarios, AR/VR users typically focus on only a few regions within their field of view before shifting perspective, allowing segmentation to be concentrated on gaze-specific areas. This insight drives the need for efficient segmentation methods that prioritize processing the instance of interest (IOI), reducing computational load and enhancing real-time performance. In this paper, we present a foveated instance segmentation (FovealSeg) framework that leverages real-time user gaze data to perform instance segmentation exclusively on the instance of interest, resulting in substantial computational savings. Evaluation results show that FSNet achieves an IoU of 0.52 on CityScapes and 0.43 on ADE20K, notably outperforming the baseline.
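Purely as an illustration of gaze-driven processing (the concrete FovealSeg/FSNet architecture is not specified in this abstract), the sketch below crops a fixed window around the gaze point and runs a segmentation model only on that crop; `seg_model`, the window size, and the zero-filled full-frame mask are assumptions.

```python
import numpy as np

def foveated_segment(image: np.ndarray, gaze_xy, seg_model, window: int = 256):
    h, w = image.shape[:2]
    gx, gy = gaze_xy
    x0, y0 = max(0, gx - window // 2), max(0, gy - window // 2)
    x1, y1 = min(w, x0 + window), min(h, y0 + window)
    crop_mask = seg_model(image[y0:y1, x0:x1])   # segment only the gazed region
    full_mask = np.zeros((h, w), dtype=crop_mask.dtype)
    full_mask[y0:y1, x0:x1] = crop_mask          # everything outside the fovea is skipped
    return full_mask
```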
Poster
Yushan Zhang · Aljoša Ošep · Laura Leal-Taixe · Tim Meinhardt

[ ExHall D ]

Abstract
Zero-shot 4D segmentation of arbitrary objects in Lidar is of crucial importance for embodied navigation, with applications ranging from streaming perception to semantic mapping and localization. However, the primary challenge in advancing research and developing generalized, versatile methods for spatio-temporal scene understanding in Lidar lies in the scarcity of datasets that provide the necessary diversity and scale of annotations. To overcome these challenges, we propose SAL-4D (Segment Anything in Lidar-4D), a method that utilizes multi-modal sensory robotic setups as a bridge to distill recent developments in Video Object Segmentation (VOS), in conjunction with off-the-shelf Vision-Language foundation models, to Lidar. We utilize VOS models to pseudo-label tracklets in short video sequences, annotate these tracklets with sequence-level CLIP tokens, and lift them to the 4D Lidar space using calibrated multi-modal sensory setups to distill them into our SAL-4D model. Due to temporally consistent predictions, we outperform prior art in 3D Zero-Shot Lidar Panoptic Segmentation (LPS) by over 5 PQ, and unlock Zero-Shot 4D LPS.
Poster
Markus Karmann · Onay Urfalioglu

[ ExHall D ]

Abstract
Recent progress in interactive point prompt based image segmentation makes it possible to significantly reduce the manual effort needed to obtain high-quality semantic labels. State-of-the-art unsupervised methods use self-supervised pre-trained models to obtain pseudo-labels which are used in training a prompt-based segmentation model. In this paper, we propose a novel unsupervised and training-free approach based solely on the self-attention of Stable Diffusion. We interpret the self-attention tensor as a Markov transition operator, which enables us to iteratively construct a Markov chain. Pixel-wise counting of the number of iterations along the Markov chain required to reach a relative probability threshold yields a Markov-iteration-map, which we simply call a Markov-map. Compared to the raw attention maps, we show that our proposed Markov-map has less noise, sharper semantic boundaries and more uniform values within semantically similar regions. We integrate the Markov-map in a simple yet effective truncated nearest neighbor framework to obtain interactive point prompt based segmentation. Despite being training-free, we experimentally show that our approach yields excellent results in terms of Number of Clicks (NoC), even outperforming state-of-the-art training-based unsupervised methods on most of the datasets.
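A hedged sketch of the Markov-map construction described above: treat a row-stochastic self-attention matrix as a transition operator, start from a one-hot distribution at the prompt pixel, and record for every pixel the first iteration at which its probability exceeds a relative threshold. The threshold rule and stopping criterion here are assumptions for illustration, not the authors' exact formulation.

```python
import numpy as np

def markov_map(attn: np.ndarray, prompt_idx: int, rel_thresh: float = 0.1, max_iter: int = 64):
    """attn: (N, N) row-stochastic self-attention over N = H*W pixels/tokens."""
    n = attn.shape[0]
    p = np.zeros(n)
    p[prompt_idx] = 1.0
    first_hit = np.full(n, max_iter, dtype=int)      # iteration at which each pixel is "reached"
    for it in range(1, max_iter + 1):
        p = p @ attn                                  # one Markov step
        reached = (p >= rel_thresh * p.max()) & (first_hit == max_iter)
        first_hit[reached] = it
    return first_hit                                  # reshape to (H, W) to view the Markov-map
```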
Poster
Saad Lahlali · Sandra Kara · Hejer AMMAR · Florian Chabot · Nicolas Granger · Hervé Le Borgne · Quoc Cuong PHAM

[ ExHall D ]

Abstract
Object discovery, which refers to the process of localizing objects without human annotations, has gained significant attention in recent years. Despite the growing interest in this task for 2D images, it remains under-explored in 3D data, where it is typically restricted to localizing a single object. Our work leverages the latest advances in 2D object discovery and proposes a novel framework to bridge the gap between the 2D and 3D modalities. Our primary contributions are twofold: (i) we propose DIOD-3D, the first method for multi-object discovery in 3D data, using scene completion as a supporting task to enable dense object discovery from sparse inputs; (ii) we develop xMOD, a cross-modal training framework that integrates both 2D and 3D data, using objective functions tailored to accommodate the sparse nature of 3D data. xMOD uses teacher-student training across the two modalities to reduce confirmation bias by leveraging the domain gap. During inference, the model supports RGB-only, point-cloud-only and multi-modal inputs. We validate the approach in all three settings, on synthetic photo-realistic and real-world datasets. Notably, our approach improves the F1@50 score by 8.7 points over the state of the art in real-world scenarios, demonstrating the potential of …
Poster
Shengqiong Wu · Hao Fei · Jingkang Yang · Xiangtai Li · Juncheng Li · Hanwang Zhang · Tat-seng Chua

[ ExHall D ]

Abstract
The recently emerged 4D Panoptic Scene Graph (4D-PSG) provides an advanced representation for comprehensively modeling the dynamic 4D visual real world. Unfortunately, current pioneering 4D-PSG research suffers severely from data scarcity and the resulting out-of-vocabulary problems; moreover, the pipeline nature of the benchmark generation method can lead to suboptimal performance. To address these challenges, this paper investigates a novel framework for 4D-PSG generation that leverages rich 2D visual scene annotations to enhance 4D scene learning. First, we introduce a 4D Large Language Model (4D-LLM) integrated with a 3D mask decoder for end-to-end generation of 4D-PSG. A chained SG inference mechanism is further designed to exploit LLMs' open-vocabulary capabilities to infer accurate and comprehensive object and relation labels iteratively. Most importantly, we propose a 2D-to-4D visual scene transfer learning framework, where a spatial-temporal scene transcending strategy effectively transfers dimension-invariant features from abundant 2D SG annotations to 4D scenes, effectively compensating for data scarcity in 4D-PSG. Extensive experiments on the benchmark data demonstrate that we strikingly outperform baseline models by an average of 14.62%, highlighting the effectiveness of our method.
Poster
Jaime Corsetti · Francesco Giuliari · Alice Fasoli · Davide Boscaini · Fabio Poiesi

[ ExHall D ]

Abstract
Understanding functionalities in 3D scenes involves interpreting natural language descriptions to locate functional interactive objects, such as handles and buttons, in a 3D environment. Functionality understanding is highly challenging, as it requires both world knowledge to interpret language and spatial perception to identify fine-grained objects. For example, given a task like ‘turn on the ceiling light,’ an embodied AI agent must infer that it needs to locate the light switch, even though the switch is not explicitly mentioned in the task description. To date, no dedicated methods have been developed for this problem. In this paper, we introduce Fun3DU, the first approach designed for functionality understanding in 3D scenes. Fun3DU uses a language model to parse the task description through Chain-of-Thought reasoning in order to identify the object of interest. The identified object is segmented across multiple views of the captured scene by using a vision and language model. The segmentation results from each view are lifted in 3D and aggregated into the point cloud using geometric information. Fun3DU is training-free, relying entirely on pre-trained models. We evaluate Fun3DU on SceneFun3D, the most recent and only dataset to benchmark this task, which comprises over 3000 task descriptions on 230 scenes. …
Poster
Jialin Zhu · Jiangbei Yue · Feixiang He · He Wang

[ ExHall D ]

Abstract
Recently, 3D Gaussian Splatting (3DGS) has provided a new framework for novel view synthesis and sparked a new wave of research in neural rendering and related applications. As 3DGS is becoming a foundational component of many models, any improvement on 3DGS itself can bring huge benefits. To this end, we aim to improve the fundamental paradigm and formulation of 3DGS. We argue that, as an unnormalized mixture model, it needs to be neither Gaussian nor splatting. We subsequently propose a new mixture model consisting of flexible Student's t distributions, with both positive (splatting) and negative (scooping) densities. We name our model Student Splatting and Scooping, or SSS. While providing better expressivity, SSS also poses new challenges in learning. Therefore, we also propose a new principled sampling approach for optimization. Through exhaustive evaluation and comparison across multiple datasets, settings, and metrics, we demonstrate that SSS outperforms existing methods in terms of quality and parameter efficiency, e.g., achieving matching or better quality with similar numbers of components, and obtaining comparable results while reducing the component number by as much as 82%.
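To make the "positive and negative densities" idea concrete, here is a minimal 1D illustration (not the paper's 3D renderer) of an unnormalized mixture of Student's t components with signed weights; all parameter values are made up.

```python
import numpy as np
from scipy.stats import t as student_t

def sss_density(x, components):
    """components: iterable of (weight, df, loc, scale); weights may be negative
    ("scooping"), so the mixture is unnormalized and can carve density away."""
    total = np.zeros_like(x, dtype=float)
    for weight, df, loc, scale in components:
        total += weight * student_t.pdf(x, df, loc=loc, scale=scale)
    return total

x = np.linspace(-5.0, 5.0, 501)
density = sss_density(x, [(1.0, 3.0, -1.0, 1.0),    # positive (splatting) component
                          (-0.4, 3.0, 1.0, 0.5)])   # negative (scooping) component
```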
Poster
Jiaxin Shi · Mingyue Xiang · Hao Sun · Yixuan Huang · Zhi Weng

[ ExHall D ]

Abstract
3D Vision Grounding (3DVG) is a fundamental research area that enables agents to perceive and interact with the 3D world. The challenge of the 3DVG task lies in understanding fine-grained semantics and spatial relationships within both the utterance and 3D scene. To address this challenge, we propose a zero-shot neuro-symbolic framework that utilizes a large language model (LLM) as neuro-symbolic functions to ground the object within the 3D Gaussian Splatting (3DGS) representation. By utilizing 3DGS representation, we can dynamically render high-quality 2D images from various viewpoints to enrich the semantic information. Given the complexity of spatial relationships, we construct a relationship graph and chain of semantics that decouple spatial relationships and facilitate step-by-step reasoning within 3DGS representation. Additionally, we employ a grounded-aware self-check mechanism to enable the LLM to reflect on its responses and mitigate the effects of ambiguity in spatial reasoning. We evaluate our method using two publicly available datasets, Nr3D and Sr3D, achieving accuracies of 60.8% and 91.4%, respectively. Notably, our method surpasses current state-of-the-art zero-shot methods on the Nr3D dataset. In addition, it outperforms the recent supervised models on the Sr3D dataset.
Poster
Jiangyong Huang · Baoxiong Jia · Yan Wang · Ziyu Zhu · Xiongkun Linghu · Qing Li · Song-Chun Zhu · Siyuan Huang

[ ExHall D ]

Abstract
Existing 3D vision-language (3D-VL) benchmarks fall short in evaluating 3D-VL models, creating a “mist” that obscures rigorous insights into model capabilities and 3D-VL tasks. This mist persists due to three key limitations. First, flawed test data, like ambiguous referential text in the grounding task, can yield incorrect and unreliable test results. Second, oversimplified metrics such as simply averaging accuracy per question answering (QA) pair, cannot reveal true model capability due to their vulnerability to language variations. Third, existing benchmarks isolate the grounding and QA tasks, disregarding the underlying coherence that QA should be based on solid grounding capabilities. To unveil the “mist”, we propose Beacon3D, a benchmark for 3D-VL grounding and QA tasks, delivering a perspective shift in the evaluation of 3D-VL understanding. Beacon3D features (i) high-quality test data with precise and natural language, (ii) object-centric evaluation with multiple tests per object to ensure robustness, and (iii) a novel chain-of-analysis paradigm to address language robustness and model performance coherence across grounding and QA. Our evaluation of state-of-the-art 3D-VL models on Beacon3D reveals that (i) object-centric evaluation elicits true model performance and particularly weak generalization in QA; (ii) grounding-QA coherence remains fragile in current 3D-VL models, and (iii) incorporating large language …
Poster
Qihang Peng · Henry Zheng · Gao Huang

[ ExHall D ]

Abstract
Embodied intelligence requires agents to interact with 3D environments in real time based on language instructions. A foundational task in this domain is ego-centric 3D visual grounding. However, the point clouds rendered from RGB-D images retain a large amount of redundant background data and inherent noise, both of which can interfere with the manifold structure of the target regions. Existing point cloud enhancement methods often require a tedious process to improve the manifold, which is not suitable for real-time tasks. We propose Proxy Transformation, suitable for multimodal tasks, to efficiently improve the point cloud manifold. Our method first leverages Deformable Point Clustering to identify the point cloud sub-manifolds in target regions. Then, we propose a Proxy Attention module that utilizes multimodal proxies to guide point cloud transformation. Built upon Proxy Attention, we design a submanifold transformation generation module where textual information globally guides translation vectors for different submanifolds, optimizing the relative spatial relationships of target regions. Simultaneously, image information guides linear transformations within each submanifold, refining the local point cloud manifold of target regions. Extensive experiments demonstrate that Proxy Transformation significantly outperforms all existing methods, achieving an impressive improvement of 7.49% on easy targets and 4.60% on hard targets, while reducing …
Poster
Ronghao Dang · Yuqian Yuan · Wenqi Zhang · Yifei Xin · Boqiang Zhang · Long Li · Liuyi Wang · qinyang zeng · Xin Li · Lidong Bing

[ ExHall D ]

Abstract
The enhancement of generalization in robots by large vision-language models (LVLMs) is increasingly evident. Therefore, the embodied cognitive abilities of LVLMs based on egocentric videos are of great interest. However, current datasets for embodied video question answering lack comprehensive and systematic evaluation frameworks. Critical embodied cognitive issues, such as robotic self-cognition, dynamic scene perception, and hallucination, are rarely addressed. To tackle these challenges, we propose ECBench, a high-quality benchmark designed to systematically evaluate the embodied cognitive abilities of LVLMs. ECBench features a diverse range of scene video sources, open and varied question formats, and 30 dimensions of embodied cognition. To ensure quality, balance, and high visual dependence, ECBench uses class-independent meticulous human annotation and multi-round question screening strategies. Additionally, we introduce ECEval, a comprehensive evaluation system that ensures the fairness and rationality of the indicators. Utilizing ECBench, we conduct extensive evaluations of proprietary, open-source, and task-specific LVLMs. ECBench is pivotal in advancing the embodied cognitive capabilities of LVLMs, laying a solid foundation for developing reliable core models for embodied agents. All data and code will be open-sourced.
Poster
Filippo Ziliotto · Tommaso Campari · Luciano Serafini · Lamberto Ballan

[ ExHall D ]

Abstract
Large Language Models (LLMs) have demonstrated excellent capabilities in composing various modules together to create programs that can perform complex reasoning tasks on images. In this paper, we propose TANGO, an approach that extends the program composition via LLMs already observed for images, aiming to integrate those capabilities into embodied agents capable of observing and acting in the world. Specifically, by employing a simple PointGoal Navigation model combined with a memory-based exploration policy as a foundational primitive for guiding an agent through the world, we show how a single model can address diverse tasks without additional training. We task an LLM with composing the provided primitives to solve a specific task, using only a few in-context examples in the prompt. We evaluate our approach on three key Embodied AI tasks: Open-Set ObjectGoal Navigation, Multi-Modal Lifelong Navigation, and Open Embodied Question Answering, achieving state-of-the-art results without any specific fine-tuning in challenging zero-shot scenarios.
Poster
Xiangyuan Xue · Zeyu Lu · Di Huang · ZiDong Wang · Wanli Ouyang · Lei Bai

[ ExHall D ]

Abstract
Much previous AI research has focused on developing monolithic models to maximize their intelligence, with the primary goal of enhancing performance on specific tasks. In contrast, this work attempts to study using LLM-based agents to design collaborative AI systems autonomously. To explore this problem, we first introduce ComfyBench to evaluate agents' ability to design collaborative AI systems in ComfyUI. ComfyBench is a comprehensive benchmark comprising 200 diverse tasks covering various instruction-following generation challenges, along with detailed annotations for 3,205 nodes and 20 workflows. Based on ComfyBench, we further develop ComfyAgent, a novel framework that empowers LLM-based agents to autonomously design collaborative AI systems by generating workflows. ComfyAgent is based on two core concepts. First, it represents workflows with code, which can be reversibly converted into workflows and executed as collaborative systems by the interpreter. Second, it constructs a multi-agent system that cooperates to learn from existing workflows and generate new workflows for a given task. While experimental results demonstrate that ComfyAgent achieves a resolve rate comparable to o1-preview and significantly surpasses other agents on ComfyBench, ComfyAgent has resolved only 15% of creative tasks. LLM-based agents still have a long way to go in autonomously designing collaborative AI systems. Progress with …
Poster
Yunzhi Zhang · Zizhang Li · Matt Zhou · Shangzhe Wu · Jiajun Wu

[ ExHall D ]

Abstract
We introduce the Scene Language, a visual scene representation that concisely and precisely describes the structure, semantics, and identity of visual scenes. It represents a scene with three key components: a program that specifies the hierarchical and relational structure of entities in the scene, words in natural language that summarize the semantic class of each entity, and embeddings that capture the visual identity of each entity. This representation can be inferred from pre-trained language models via a training-free inference technique, given text or image inputs. The resulting scene can be rendered into images using traditional, neural, or hybrid graphics renderers. Together, this forms an automated system for high-quality 3D and 4D scene generation. Compared with existing representations like scene graphs, our proposed Scene Language generates complex scenes with higher fidelity, while explicitly modeling the scene structures to enable precise control and editing.
Poster
Yongshuo Zong · Qin ZHANG · DONGSHENG An · Zhihua Li · Xiang Xu · Linghan Xu · Zhuowen Tu · Yifan Xing · Onkar Dabeer

[ ExHall D ]

Abstract
In this paper, we present a simple yet effective workflow for automatically scaling instruction-following data to elicit the pixel-level grounding capabilities of VLMs under complex instructions. We address five critical real-world challenges: hallucination, multi-object scenarios, reasoning, multi-granularity, and part-level reference. By distilling visual-language knowledge from a teacher model, our workflow generates instruction-response pairs that link with existing, abundant pixel-level annotations of the images, minimizing the need for human annotation. We refer to the resulting dataset as Ground-V, which captures extensive object localization knowledge and nuanced pixel-level referring expressions. Experimental results show that models of various architectures trained on Ground-V exhibit substantial improvements across diverse grounding tasks. Specifically, incorporating Ground-V during training directly achieves an average accuracy boost of 4.4% for LISA and 7.9% for PSALM across six benchmarks on the gIoU metric. It also sets new state-of-the-art results on standard benchmarks such as RefCOCO/+/g. Notably, on gRefCOCO, we achieve an N-Acc of 83.3%, exceeding the previous state-of-the-art by more than 20%.
Poster
Artemis Panagopoulou · Honglu Zhou · silvio savarese · Caiming Xiong · Chris Callison-Burch · Mark Yatskar · Juan Carlos Niebles

[ ExHall D ]

Abstract
Programming-based approaches to reasoning tasks have substantially expanded the types of questions models can answer about visual scenes. Yet on benchmark visual reasoning data, when answering correctly, such models produce incorrect programs 33% of the time. These models are often right for the wrong reasons and risk unexpected failures on new data. Unit tests play a foundational role in ensuring code correctness and could be used to repair such failures. We propose Visual Unit Testing (ViUniT), a framework to improve the reliability of visual programs by automatically generating unit tests. In our framework, a unit test is represented as a novel image and answer meant to verify the logical correctness of a program produced for a given query. Our method leverages a language model to create unit tests in the form of image descriptions and expected answers, and image synthesis to produce corresponding images. We conduct a comprehensive analysis of what constitutes an effective visual unit test suite, exploring unit test generation, sampling strategies, image generation methods, and varying the number of programs and unit tests. Additionally, we introduce four applications of visual unit tests: best program selection, answer refusal, re-prompting, and unsupervised reward formulations for reinforcement learning. Experiments with two models …
Poster
Lei Li · wei yuancheng · Zhihui Xie · Xuqing Yang · Yifan Song · Peiyi Wang · Chenxin An · Tianyu Liu · Sujian Li · Bill Yuchen Lin · Lingpeng Kong · Qi Liu

[ ExHall D ]

Abstract
Vision-language generative reward models (VL-GenRMs) play a crucial role in aligning and evaluating multimodal AI systems, yet their own evaluation remains under-explored. Current assessment methods primarily rely on AI-annotated preference labels from traditional VL tasks, which can introduce biases and often fail to effectively challenge state-of-the-art models. To address these limitations, we introduce VL-RewardBench, a comprehensive benchmark spanning general multimodal queries, visual hallucination detection, and complex reasoning tasks. Through our AI-assisted annotation pipeline combining sample selection with human verification, we curate 1,250 high-quality examples specifically designed to probe model limitations. Comprehensive evaluation across 16 leading large vision-language models demonstrates VL-RewardBench's effectiveness as a challenging testbed, where even GPT-4o achieves only 65.4% accuracy, and state-of-the-art open-source models such as Qwen2-VL-72B struggle to surpass random guessing. Importantly, performance on VL-RewardBench strongly correlates (Pearson's r > 0.9) with MMMU-Pro accuracy using Best-of-N sampling with VL-GenRMs. Analysis experiments uncover three critical insights for improving VL-GenRMs: (i) models predominantly fail at basic visual perception tasks rather than reasoning tasks; (ii) inference-time scaling benefits vary dramatically by model capacity; and (iii) training VL-GenRMs to learn to judge substantially boosts judgment capability (+14.3% accuracy for a 7B VL-GenRM). We believe VL-RewardBench along with the experimental insights will become a valuable resource for …
Poster
Xingrui Wang · Wufei Ma · Tiezheng Zhang · Celso M. de Melo · Jieneng Chen · Alan L. Yuille

[ ExHall D ]

Abstract
Although large multimodal models (LMMs) have demonstrated remarkable capabilities in visual scene interpretation and reasoning, their capacity for complex and precise 3-dimensional spatial reasoning remains uncertain. Existing benchmarks focus predominantly on 2D spatial understanding and lack a framework to comprehensively evaluate 6D spatial reasoning across varying complexities. To address this limitation, we present PulseCheck457, a scalable and unbiased synthetic dataset designed with 4 key spatial components: multi-object recognition, 2D and 3D spatial relationships, and 3D orientation. PulseCheck457 supports a cascading evaluation structure, offering 7 question types across 5 difficulty levels that progress from basic single-object recognition to our newly proposed complex 6D spatial reasoning tasks. We evaluated various large multimodal models (LMMs) on PulseCheck457, observing a general decline in performance as task complexity increases, particularly in 3D reasoning and 6D spatial tasks. To quantify these challenges, we introduce the Relative Performance Dropping Rate (RPDR), highlighting key weaknesses in 3D reasoning capabilities. Leveraging the unbiased attribute design of our dataset, we also uncover prediction biases across different attributes, with similar patterns observed in real-world image settings.
Poster
Aayush Dhakal · Srikumar Sastry · Subash Khanal · Adeel Ahmad · Eric Xing · Nathan Jacobs

[ ExHall D ]

Abstract
The choice of representation for geographic location significantly impacts the accuracy of models for a broad range of geospatial tasks, including fine-grained species classification, population density estimation, and biome classification. Recent works like SatCLIP and GeoCLIP learn such representations by contrastively aligning geolocation with co-located images. While these methods work exceptionally well, in this paper, we posit that the current training strategies fail to fully capture the important visual features. We provide an information-theoretic perspective on why the resulting embeddings from these methods discard crucial visual information that is important for many downstream tasks. To solve this problem, we propose a novel retrieval-augmented strategy called RANGE. We build our method on the intuition that the visual features of a location can be estimated by combining the visual features from multiple similar-looking locations. We evaluate our method across a wide variety of tasks. Our results show that RANGE outperforms the existing state-of-the-art models with significant margins in most tasks. We show gains of up to 13.1% on classification tasks and 0.145 R² on regression tasks. All our code will be released on GitHub. Our models will be released on HuggingFace.
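As a loose, hypothetical sketch of the retrieval-augmented intuition stated above (estimating a location's visual features from similar-looking reference locations), the snippet below retrieves the most similar entries from an assumed database of location embeddings paired with visual features and blends them with similarity weights; the retrieval key, the softmax weighting, and the temperature are illustrative assumptions.

```python
import numpy as np

def retrieve_visual_features(query_emb, db_loc_embs, db_vis_feats, k=16, temp=0.1):
    """query_emb: (D,); db_loc_embs: (N, D); db_vis_feats: (N, F)."""
    sims = db_loc_embs @ query_emb                 # similarity to each reference location
    top = np.argsort(-sims)[:k]                    # k most similar-looking locations
    weights = np.exp(sims[top] / temp)
    weights /= weights.sum()
    return weights @ db_vis_feats[top]             # similarity-weighted visual estimate
```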
Poster
Jingyuan Yang · Jiawei Feng · Weibin Luo · Dani Lischinski · Daniel Cohen-Or · Hui Huang

[ ExHall D ]

Abstract
Affective Image Manipulation (AIM) seeks to modify user-provided images to evoke specific emotional responses. This task is inherently complex due to its twofold objective: significantly evoking the intended emotion, while preserving the original image composition. Existing AIM methods primarily adjust color and style, often failing to elicit precise and profound emotional shifts. Drawing on psychological insights, we introduce EmoEdit, which extends AIM by incorporating content modifications to enhance emotional impact. Specifically, we first construct EmoEditSet, a large-scale AIM dataset comprising 40,120 paired data through emotion attribution and data construction. To make existing generative models emotion-aware, we design the Emotion adapter and train it using EmoEditSet. We further propose an instruction loss to capture the semantic variations in data pairs. Our method is evaluated both qualitatively and quantitatively, demonstrating superior performance compared to existing state-of-the-art techniques. Additionally, we showcase the portability of our Emotion adapter to other diffusion-based models, enhancing their emotion knowledge with diverse semantics.
Poster
Qu Yang · QingHongYa Shi · Tongxin Wang · Mang Ye

[ ExHall D ]

Abstract
Understanding intention and emotion from social media poses unique challenges due to the inherent uncertainty in multimodal data, where posts often contain incomplete or missing modalities. While this uncertainty reflects real-world scenarios, it remains underexplored within the computer vision community, particularly in conjunction with the intrinsic relationship between emotion and intention. To address these challenges, we introduce the Multimodal IntentioN and Emotion Understanding in the Wild (MINE) dataset, comprising over 20,000 topic-specific social media posts with natural modality variations across text, image, video, and audio. MINE is distinctively constructed to capture both the uncertain nature of multimodal data and the implicit correlations between intentions and emotions, providing extensive annotations for both aspects. To tackle these scenarios, we propose the Bridging Emotion-Intention via Implicit Label Reasoning (BEAR) framework. BEAR consists of two key components: a BEIFormer that leverages emotion-intention correlations, and a Modality Asynchronous Prompt that handles modality uncertainty. Experiments show that BEAR outperforms existing methods in processing uncertain multimodal data while effectively mining emotion-intention relationships for social media content understanding. Dataset and code will be released.
Poster
Size Wu · Sheng Jin · Wenwei Zhang · Lumin Xu · Wentao Liu · Wei Li · Chen Change Loy

[ ExHall D ]

Abstract
Endowing Large Multimodal Models (LMMs) with visual grounding capability can significantly enhance AIs' understanding of the visual world and their interaction with humans. However, existing methods typically fine-tune the parameters of LMMs to learn additional segmentation tokens and overfit grounding and segmentation datasets. Such a design would inevitably cause a catastrophic diminution in the indispensable conversational capability of general AI assistants. In this paper, we comprehensively evaluate state-of-the-art grounding LMMs across a suite of multimodal question-answering benchmarks, observing drastic performance drops that indicate vanishing general knowledge comprehension and weakened instruction following ability. To address this issue, we present F-LMM---grounding frozen off-the-shelf LMMs in human-AI conversations---a straightforward yet effective design based on the fact that word-pixel correspondences conducive to visual grounding inherently exist in the attention mechanism of well-trained LMMs. Using only a few trainable CNN layers, we can translate word-pixel attention weights to mask logits, which a SAM-based mask refiner can further optimise. Our F-LMM neither learns special segmentation tokens nor utilises high-quality grounded instruction-tuning data, but achieves competitive performance on referring expression segmentation and panoptic narrative grounding benchmarks while completely preserving LMMs' original conversational ability. Additionally, with instruction-following ability preserved and grounding ability obtained, our F-LMM can be directly …
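The following is a hedged sketch (layer sizes and names are illustrative, not the paper's configuration) of how a few trainable CNN layers could map word-to-patch attention maps from a frozen LMM into mask logits that a SAM-style refiner then polishes.

```python
import torch
import torch.nn as nn

class AttnToMaskHead(nn.Module):
    """Turns stacked word-to-image-patch attention maps into per-pixel mask logits."""
    def __init__(self, n_attn_maps: int, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(n_attn_maps, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, kernel_size=1),     # one logit per spatial location
        )

    def forward(self, attn_maps: torch.Tensor) -> torch.Tensor:
        # attn_maps: (B, n_attn_maps, H, W), gathered from the frozen LMM's attention
        return self.net(attn_maps)                   # (B, 1, H, W) coarse mask logits
```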
Poster
Rui Qian · Xin Yin · Dejing Dou

[ ExHall D ]

Abstract
Current Large Multimodal Model (LMM)-empowered tasks such as visual grounding and segmentation typically rely on the <SEG> token as a text prompt to jointly optimize the vision-language model (e.g., LLaVA) and the downstream task-specific model (e.g., SAM). However, little research has looked into how this works when mapping the language vocabulary embedding into the corresponding vision codebook space. In this work, we first visualize the similarity maps, a.k.a. pseudo images, which are obtained by computing the dot-product similarity between the <SEG> token and the image token embeddings derived from the last hidden layer in both the LLaVA and SAM models. Intriguingly, we find that a striking consistency holds in terms of activation responses in the pseudo images, which reveals that what the <SEG> token contributes is the semantic correspondence from image-text pairs. Specifically, the <SEG> token, a placeholder expanded in the text vocabulary, extensively queries individual tokenized image patches to map the semantics of an object from the text to the paired image while the Large Language Model (LLM) is being fine-tuned. Building on these findings, we present READ, which facilitates LMMs' resilient REAsoning capability of where to attenD under the guidance of highly activated points …
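The pseudo-image computation described above is simple enough to sketch directly. The following is a minimal illustration, not the READ code; the function name, grid size, and hidden dimension are assumptions.

```python
# Minimal sketch: a "pseudo image" as the dot-product similarity between a <SEG>
# token embedding and the image token embeddings from an LMM's last hidden layer.
import torch

def pseudo_image(seg_token: torch.Tensor, image_tokens: torch.Tensor, grid_hw):
    """seg_token: (d,), image_tokens: (N, d) with N == H*W, grid_hw: (H, W)."""
    sims = image_tokens @ seg_token                                 # (N,) dot products
    sims = (sims - sims.min()) / (sims.max() - sims.min() + 1e-6)   # normalize to [0, 1]
    return sims.view(*grid_hw)                                      # (H, W) activation map

# Example with random features: a 24x24 patch grid and 4096-dim hidden states.
H, W, d = 24, 24, 4096
pmap = pseudo_image(torch.randn(d), torch.randn(H * W, d), (H, W))
# Highly activated entries of `pmap` mark where the <SEG> token attends in the image.
top_patches = torch.topk(pmap.flatten(), k=5).indices
```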
Poster
Yanyuan Chen · Dexuan Xu · Yu Huang · Songkun Zhan · Hanpin Wang · Dongxue Chen · Xueping Wang · Meikang Qiu · Hang Li

[ ExHall D ]

Abstract
Currently, medical vision language models are widely used in medical vision question answering tasks. However, existing models are confronted with two issues: for input, the model only relies on text instructions and lacks direct understanding of visual clues in the image; for output, the model only gives text answers and lacks connection with key areas in the image. To address these issues, we propose a unified medical vision language model MIMO, with visual referring Multimodal Input and pixel grounding Multimodal Output. MIMO can not only combine visual clues and textual instructions to understand complex medical images and semantics, but can also ground medical terminologies in textual output within the image. To overcome the scarcity of relevant data in the medical field, we propose MIMOSeg, a comprehensive medical multimodal dataset including 895K samples. MIMOSeg is constructed from four different perspectives, covering basic instruction following and complex question answering with multimodal input and multimodal output. We conduct experiments on several downstream medical multimodal tasks. Extensive experimental results verify that MIMO can uniquely combine visual referring and pixel grounding capabilities, which are not available in previous models.
Poster
Yuzhong Zhao · Feng Liu · Yue Liu · Mingxiang Liao · Chen GONG · Qixiang Ye · Fang Wan

[ ExHall D ]

Abstract
One important task of multimodal models is to translate referred image regions into human-preferred language descriptions. Existing methods, however, ignore the resolution adaptability needs of different tasks, which hinders them from finding precise language descriptions. In this study, we propose DynRefer, an approach that pursues high-accuracy region-level referring by mimicking the resolution adaptability of human visual cognition. During training, DynRefer stochastically aligns language descriptions of multimodal tasks with images of multiple resolutions, which are constructed by nesting a set of random views around the referred region. This process essentially constructs a set of region representations, where suitable representations for specific tasks can be matched. During inference, DynRefer performs selective multimodal referring by sampling proper region representations for tasks from the set of views based on image and task priors. This allows the visual information for referring to better match human preferences, thereby improving the representational adaptability of region-level multimodal models. Experiments show that DynRefer brings mutual improvement across broad tasks including region-level captioning, open-vocabulary region recognition and attribute detection. Furthermore, DynRefer achieves state-of-the-art results on multiple region-level multimodal tasks using a single model. Code is enclosed in the supplementary material.
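The nested random views around a referred region can be pictured with a short sketch. This is an assumption about one data-construction step, not the DynRefer implementation; the expansion range and number of views are illustrative.

```python
# Minimal sketch: build nested random views around a referred region by expanding its
# bounding box with random ratios (shared center), then cropping the image per view.
import random
from PIL import Image

def nested_views(image: Image.Image, box, num_views: int = 3, max_expand: float = 3.0):
    """box = (x0, y0, x1, y1) of the referred region; returns crops from tight to loose."""
    W, H = image.size
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    w, h = x1 - x0, y1 - y0
    ratios = sorted(random.uniform(1.0, max_expand) for _ in range(num_views))
    views = []
    for r in ratios:  # increasing ratio with a shared center -> each view nests the previous
        nw, nh = w * r, h * r
        nx0, ny0 = max(0, cx - nw / 2), max(0, cy - nh / 2)
        nx1, ny1 = min(W, cx + nw / 2), min(H, cy + nh / 2)
        views.append(image.crop((int(nx0), int(ny0), int(nx1), int(ny1))))
    return views
```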
Poster
Zhen Yang · Zhuo Tao · Qi Chen · Yuankai Qi · Liang Li · Anton van den Hengel · Qingming Huang

[ ExHall D ]

Abstract
Knowledge-based visual question answering (KBVQA) separates image interpretation and knowledge retrieval into separate processes, motivated in part by the fact that they are very different tasks. In this paper, we transform KBVQA into a linguistic question-answering task so that we can leverage the rich world knowledge and strong reasoning abilities of Large Language Models (LLMs). The caption-then-question approach to KBVQA has been effective but relies on the captioning method to describe the detail required to answer every possible question. We propose instead a Question-Aware Captioner (QACap), which uses the question as guidance to extract correlated visual information from the image and generate a question-related caption. To train such a model, we utilize GPT-4 to build a corresponding high-quality question-aware caption dataset on top of existing KBVQA datasets. Extensive experiments demonstrate that our QACap model and dataset significantly improve KBVQA performance. Our method, QACap, achieves 68.2% accuracy on the OKVQA validation set, 73.4% on the direct-answer part of the A-OKVQA validation set, and 74.8% on the multiple-choice part, all setting new state-of-the-art results.
Poster
Hang Hua · Qing Liu · Lingzhi Zhang · Jing Shi · Soo Ye Kim · Zhifei Zhang · Yilin Wang · Jianming Zhang · Zhe Lin · Jiebo Luo

[ ExHall D ]

Abstract
The advent of large Vision-Language Models (VLMs) has significantly advanced multimodal tasks, enabling more sophisticated and accurate integration of visual and textual information across various applications, including image and video captioning, visual question answering, and cross-modal retrieval. Despite their superior capabilities, VLMs still struggle with fine-grained compositional image region descriptions. Specifically, they have difficulty recognizing arbitrary segmentation masks as referential inputs, interpreting compositional aspect instructions for referencing, and precisely describing the compositional aspects of a region. However, compositionality—the ability to understand and generate novel combinations of known visual and textual components—is critical for facilitating coherent reasoning and understanding across modalities in VLMs. To address this issue, we propose OpenCompositionCap, a new dataset for multi-grained region compositional image captioning that distinguishes itself from prior works by introducing the new task of compositional aspect-aware regional image captioning. To support this endeavor, we also introduce a new VLM model, FineCaption. The empirical results illustrate the effectiveness of our proposed model compared with other strong VLMs. In addition, we analyze the capabilities of current VLMs in recognizing various visual prompts for compositional region image captioning, highlighting areas for improvement in VLM design and training.
Poster
Yan Li · Yifei Xing · Xiangyuan Lan · Xin Li · Haifeng Chen · Dongmei Jiang

[ ExHall D ]

Abstract
Cross-modal alignment is crucial for multimodal representation fusion due to the inherent heterogeneity between modalities. While Transformer-based methods have shown promising results in modeling inter-modal relationships, their quadratic computational complexity limits their applicability to long-sequence or large-scale data. Although recent Mamba-based approaches achieve linear complexity, their sequential scanning mechanism poses fundamental challenges in comprehensively modeling cross-modal relationships. To address this limitation, we propose AlignMamba, an efficient and effective method for multimodal fusion. Specifically, grounded in Optimal Transport, we introduce a local cross-modal alignment module that explicitly learns token-level correspondences between different modalities. Moreover, we propose a global cross-modal alignment loss based on Maximum Mean Discrepancy to implicitly enforce the consistency between different modal distributions. Finally, the unimodal representations after local and global alignment are passed to the Mamba backbone for further cross-modal interaction and multimodal fusion. Extensive experiments on complete and incomplete multimodal fusion tasks demonstrate the effectiveness and efficiency of the proposed method.
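The global alignment term described above, a Maximum Mean Discrepancy between modality distributions, has a standard form that a short sketch can make concrete. This is an assumed RBF-kernel version, not the AlignMamba code; the kernel bandwidth and weighting are illustrative.

```python
# Minimal sketch: an RBF-kernel Maximum Mean Discrepancy (MMD) loss between two
# modality feature distributions, usable as a global cross-modal alignment term.
import torch

def mmd_rbf(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """x: (n, d) features of one modality, y: (m, d) features of the other."""
    def k(a, b):
        d2 = torch.cdist(a, b).pow(2)          # pairwise squared Euclidean distances
        return torch.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

# Usage (hypothetical): add lambda_mmd * mmd_rbf(vision_feats, text_feats) to the loss.
loss = mmd_rbf(torch.randn(32, 256), torch.randn(32, 256))
```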
Poster
Yuanmin Tang · Jing Yu · Keke Gai · Jiamin Zhuang · Gang Xiong · Gaopeng Gou · Qi Wu

[ ExHall D ]

Abstract
Zero-Shot Composed Image Retrieval (ZS-CIR) involves diverse tasks with a broad range of visual content manipulation intent across domain, scene, object, and attribute. The key challenge for ZS-CIR tasks is to modify a reference image according to manipulation text to accurately retrieve a target image, especially when the reference image is missing essential target content. In this paper, we propose a novel prediction-based mapping network, named PrediCIR, to adaptively predict the missing target visual content in reference images in the latent space before mapping for accurate ZS-CIR. Specifically, a world view generation module first constructs a source view by omitting certain visual content of a target view, coupled with an action that includes the manipulation intent derived from existing image-caption pairs. Then, a target content prediction module trains a world model as a predictor to adaptively predict the missing visual information guided by user intention in manipulating text at the latent space. The two modules map an image with the predicted relevant information to a pseudo-word token without extra supervision. Our model shows strong generalization ability on six ZS-CIR tasks. It obtains consistent and significant performance boosts ranging from 1.73% to 4.45% over the best methods and achieves new state-of-the-art …
Poster
Bangbang Zhou · Zuan Gao · Zixiao Wang · Boqiang Zhang · Yuxin Wang · Zhineng Chen · Hongtao Xie

[ ExHall D ]

Abstract
Due to the limited scale of multimodal table understanding (MTU) data, model performance is constrained. A straightforward approach is to use multimodal large language models to obtain more samples, but this may cause hallucinations, generate incorrect sample pairs, and incur significant cost. To address these issues, we design a simple yet effective synthesis framework that consists of two independent steps: table image rendering and table question-and-answer (Q&A) pair generation. We use table codes (HTML, LaTeX, Markdown) to synthesize images and generate Q&A pairs with a large language model (LLM). This approach leverages the LLM's high concurrency and low cost to boost annotation efficiency and reduce expenses. By inputting code instead of images, LLMs can directly access the content and structure of the table, reducing hallucinations in table understanding and improving the accuracy of generated Q&A pairs. Finally, we synthesize a large-scale MTU dataset, SynTab, containing 636K images and 1.8M samples at a total cost of under 200 US dollars. We further introduce a generalist tabular multimodal model, SynTab-LLaVA. This model not only effectively extracts local textual content within the table but also enables global modeling of relationships between cells. SynTab-LLaVA achieves SOTA performance on 21 out of 24 in-domain and out-of-domain benchmarks, demonstrating …
Poster
Daiqing Qi · Handong Zhao · Jing Shi · Simon Jenni · Yifei Fan · Franck Dernoncourt · Scott Cohen · Sheng Li

[ ExHall D ]

Abstract
Photographer, curator, and former director of photography at the Museum of Modern Art (MoMA), John Szarkowski remarked in *William Eggleston’s Guide*, “While editing directly from life, photographers have found it too difficult to see simultaneously both the blue and the sky.” Szarkowski insightfully revealed a notable gap between general and aesthetic visual understanding: while the former emphasizes identifying factual elements in an image (the sky), the latter transcends mere object identification, viewing it instead as an aesthetic component—a pure expanse of blue, valued purely as a color block in visual aesthetics. Such distinctions between general visual understanding (detection, localization, etc.) and aesthetic perception (color, lighting, composition, etc.) pose a significant challenge for existing Multimodal Large Language Models (MLLMs) in comprehending image aesthetics, which is increasingly needed in real-world applications, from image recommendation and enhancement to generation. To fundamentally advance the aesthetic understanding of MLLMs, we introduce a novel dataset, PhotoCritique, derived from extensive discussions among professional photographers and enthusiasts, distinguished by its large scale, expertise, and diversity. Additionally, we propose a new model, PhotoEye, an MLLM featuring a language-guided multi-view vision fusion mechanism for understanding image aesthetics from multiple perspectives. Finally, we introduce PhotoBench, a comprehensive and professional benchmark for …
Poster
Jun Chen · Dannong Xu · Junjie Fei · Chun-Mei Feng · Mohamed Elhoseiny

[ ExHall D ]

Abstract
Large multimodal models (LMMs) have achieved impressive progress in vision-language understanding, yet they face limitations in real-world applications requiring complex reasoning over a large number of images. Existing benchmarks for multi-image question-answering are limited in scope: each question is paired with only up to 30 images, which does not fully capture the demands of large-scale retrieval tasks encountered in real-world usage. To close these gaps, we introduce two document haystack benchmarks, dubbed DocHaystack and InfoHaystack, designed to evaluate LMM performance on large-scale visual document retrieval and understanding. Additionally, we propose V-RAG, a novel, vision-centric retrieval-augmented generation (RAG) framework that leverages a suite of multimodal vision encoders, each optimized for specific strengths, and a dedicated question-document relevance module. V-RAG sets a new standard, with a 9% and 11% improvement in Recall@1 on the challenging DocHaystack-1000 and InfoHaystack-1000 benchmarks, respectively, compared to the previous best baseline models. Additionally, integrating V-RAG with LMMs enables them to efficiently operate across thousands of images, yielding significant improvements on our DocHaystack and InfoHaystack benchmarks. Our code and datasets will be made publicly available.
Poster
Ryota Tanaka · Taichi Iki · Taku Hasegawa · Kyosuke Nishida · Kuniko Saito · Jun Suzuki

[ ExHall D ]

Abstract
We aim to develop a retrieval-augmented generation (RAG) framework capable of answering questions over a corpus of visually-rich documents presented in mixed modalities (e.g., charts, tables) and diverse formats (e.g., PDF, PPTX). In this paper, we present a new RAG framework, VDocRAG, which can directly understand varied documents and modalities in a unified image format to prevent missing information that occurs by parsing documents to obtain text. To improve the performance of VDocRAG, we propose novel self-supervised pre-training tasks that adapt large vision-language models for retrieval by compressing visual information into dense token representations while aligning them with textual content in documents. Furthermore, we introduce OpenDocVQA, the first unified collection of open-domain document visual question answering datasets, encompassing diverse document types and formats. OpenDocVQA provides a comprehensive resource for training and evaluating retrieval and question answering models on visually-rich documents in an open-domain setting. Experiments show that VDocRAG substantially outperforms conventional text-based RAG and has strong generalization capability, highlighting the potential of an effective RAG paradigm for real-world documents.
Poster
Linke Ouyang · Yuan Qu · Hongbin Zhou · Jiawei Zhu · Rui Zhang · Qunshu Lin · Bin Wang · Zhiyuan Zhao · Man Jiang · Xiaomeng Zhao · Jin Shi · Fan Wu · Pei Chu · Minghao Liu · Zhenxiang Li · Chao Xu · Bo Zhang · Botian Shi · Zhongying Tu · Conghui He

[ ExHall D ]

Abstract
Document content extraction is crucial in computer vision, especially for meeting the high-quality data needs of large language models (LLMs) and retrieval-augmented generation (RAG) technologies. However, current document parsing methods suffer from significant limitations in terms of diversity and comprehensive evaluation. To address these challenges, we introduce OmniDocBench, a novel multi-source benchmark designed to advance automated document content extraction. OmniDocBench includes a meticulously curated and annotated high-quality evaluation dataset comprising nine diverse document types, such as academic papers, textbooks, slides, among others. Our benchmark provides a flexible and comprehensive evaluation framework with 19 layout category labels and 14 attribute labels, enabling multi-level assessments across entire datasets, individual modules, or specific data types. Using OmniDocBench, we perform an exhaustive comparative analysis of existing modular pipelines and multimodal end-to-end methods, highlighting their limitations in handling document diversity and ensuring fair evaluation. OmniDocBench establishes a robust, diverse, and fair evaluation standard for the document content extraction field, offering crucial insights for future advancements and fostering the development of document parsing technologies.
Poster
Haoxin Li · Boyang Li

[ ExHall D ]

Abstract
Despite impressive advancements in various multimodal tasks, vision-language models (VLMs) still struggle with compositional understanding due to limited exposure to training samples that contain subtle variations within paired examples. With advances in multimodal generative models, a natural solution is to generate synthetic samples with subtle variations for training VLMs. However, generating and training on synthetic samples with subtle variations presents two challenges: difficulty in accurately creating precise variations and inconsistency in cross-modal alignment quality. To address these challenges, we propose SVD-GT (Subtle Variation Data Generation and Training), which integrates image feature injection into a text-to-image generative model to enhance the quality of synthetic variations and employs an adaptive margin loss to differentiate samples using adaptive margins, which help filter out potentially incorrect synthetic samples and focus the learning on informative hard samples. Evaluations on four compositional understanding benchmarks demonstrate that SVD-GT significantly improves the compositionality of VLMs, boosting the average accuracy of CLIP by over 8% across all benchmarks and outperforming state-of-the-art methods by 2% on three benchmarks.
Poster
Gensheng Pei · Tao Chen · Yujia Wang · Xinhao Cai · Xiangbo Shu · Tianfei Zhou · Yazhou Yao

[ ExHall D ]

Abstract
The CLIP model has demonstrated significant advancements in aligning visual and language modalities through large-scale pre-training on image-text pairs, enabling strong zero-shot classification and retrieval capabilities on various domains. However, CLIP’s training remains computationally intensive, with high demands on both data processing and memory. To address these challenges, recent masking strategies have emerged, focusing on the selective removal of image patches to improve training efficiency. Although effective, these methods often compromise key semantic information, resulting in suboptimal alignment between visual features and text descriptions. In this work, we present a concise yet effective approach called Patch Generation-to-Selection (CLIP-PGS) to enhance CLIP’s training efficiency while preserving critical semantic content. Our method introduces a gradual masking process in which a small set of candidate patches is first pre-selected as potential mask regions. Then, we apply Sobel edge detection across the entire image to generate an edge mask that prioritizes the retention of the primary object areas. Finally, similarity scores between the candidate mask patches and their neighboring patches are computed, with optimal transport normalization refining the selection process to ensure a balanced similarity matrix. Our approach, CLIP-PGS, sets new state-of-the-art results in zero-shot classification and retrieval tasks, achieving superior performance in robustness evaluation and …
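The Sobel-based edge mask can be sketched as a per-patch edge-energy score that prioritizes retaining patches over the primary object areas. This is a minimal illustration under assumed patch sizes, not the CLIP-PGS code.

```python
# Minimal sketch: Sobel edge magnitude over a grayscale image, pooled per patch, so
# patches with strong object edges score high and are prioritized for retention.
import numpy as np
from scipy import ndimage

def patch_edge_scores(gray: np.ndarray, patch: int = 16) -> np.ndarray:
    """gray: (H, W) grayscale image; returns (H//patch, W//patch) edge-energy scores."""
    g = gray.astype(np.float32)
    gx = ndimage.sobel(g, axis=1)            # horizontal gradient
    gy = ndimage.sobel(g, axis=0)            # vertical gradient
    mag = np.hypot(gx, gy)                   # edge magnitude
    H, W = mag.shape
    Hp, Wp = H // patch, W // patch
    mag = mag[: Hp * patch, : Wp * patch].reshape(Hp, patch, Wp, patch)
    return mag.mean(axis=(1, 3))             # high score = keep; low score = masking candidate
```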
Poster
Xugong Qin · peng zhang · Jun Jie Ou Yang · Gangyan Zeng · Yubo Li · Yuanyuan Wang · Wanqian Zhang · Pengwen Dai

[ ExHall D ]

Abstract
Scene Text Retrieval (STR) seeks to identify all images containing a given query string. Existing methods typically rely on an explicit Optical Character Recognition (OCR) process of text spotting or localization, which is susceptible to complex pipelines and accumulated errors. To address this, we resort to the Contrastive Language-Image Pre-training (CLIP) models, which have demonstrated the capacity to perceive and understand scene text, making it possible to achieve strictly OCR-free STR. From the perspective of parameter-efficient transfer learning, a lightweight visual position adapter is proposed to provide a positional information complement for the visual encoder. Besides, we introduce a visual context dropout technique to improve the alignment of local visual features. A novel, parameter-free cross-attention mechanism transfers the contrastive relationship between images and text to that between tokens and text, producing a rich cross-modal representation, which can be utilized for efficient reranking with a linear classifier. The resulting model, CAYN, achieves new state-of-the-art performance on the STR task, with 92.46%/89.49%/85.98% mAP on the SVT/IIIT-STR/TTR datasets at 38.79 FPS on a single GeForce GTX 1080 Ti. Our findings demonstrate that CLIP can serve as a reliable and efficient solution for OCR-free STR, with no more than 0.50M additional parameters required. The …
Poster
Rui Xiao · Sanghwan Kim · Iuliana Georgescu · Zeynep Akata · Stephan Alaniz

[ ExHall D ]

Abstract
CLIP has shown impressive results in aligning images and text at scale. However, its ability to capture detailed visual features remains limited because CLIP matches images and texts at a global level. To address this issue, we propose FLAIR, Fine-grained Language-informed Image Representations, an approach that utilizes long and detailed image descriptions to learn localized image embeddings. By sampling diverse sub-captions that describe fine-grained details about an image, we train our vision-language model to produce not only global embeddings but also text-specific image representations. Our model introduces text-conditioned attention pooling on top of local image tokens to produce fine-grained image representations that excel at retrieving detailed image content. We achieve state-of-the-art performance on both existing multimodal retrieval benchmarks and our newly introduced fine-grained retrieval task, which evaluates vision-language models' ability to retrieve partial image content. Furthermore, our experiments demonstrate the effectiveness of FLAIR trained on 30M image-text pairs in capturing fine-grained visual information, including zero-shot semantic segmentation, outperforming models trained on billions of pairs. Code and model checkpoints will be released upon acceptance.
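Text-conditioned attention pooling over local image tokens admits a compact sketch: the (sub-)caption embedding acts as the query and the local tokens provide keys and values. This is an assumed single-head form with hypothetical projection matrices, not the released FLAIR model.

```python
# Minimal sketch: pool local image tokens into a text-specific image embedding by
# attending with the caption embedding as the query.
import torch
import torch.nn.functional as F

def text_conditioned_pool(text_emb, image_tokens, w_q, w_k, w_v):
    """text_emb: (B, d), image_tokens: (B, N, d); w_q/w_k/w_v: (d, d) projections."""
    q = text_emb @ w_q                        # (B, d) query from the sub-caption
    k = image_tokens @ w_k                    # (B, N, d) keys from local tokens
    v = image_tokens @ w_v                    # (B, N, d) values
    scores = (k @ q.unsqueeze(-1)).squeeze(-1) / q.shape[-1] ** 0.5   # (B, N)
    attn = F.softmax(scores, dim=-1)
    return (attn.unsqueeze(-1) * v).sum(dim=1)   # (B, d) text-specific image representation
```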
Poster
Yuheng Feng · Changsong Wen · Zelin Peng · Li jiaye · Siyu Zhu

[ ExHall D ]

Abstract
Contrastive language-image pretraining models like CLIP have shown strong performance in various text-image alignment tasks. However, CLIP’s 77-token input limit and short-text training data restrict its effectiveness in long-text tasks. To address these limitations, we introduce LongD-CLIP, a dual-teacher distillation framework that enhances long-text representation while preventing knowledge forgetting. In our approach, a teacher model fine-tuned on long-text data distills rich representation knowledge into the student model, while the original CLIP model serves as a secondary teacher to help the student retain foundational knowledge. Experimental results show that LongD-CLIP achieves substantial improvements across long-text retrieval, short-text retrieval, and zero-shot image classification tasks. For instance, in the image-to-text retrieval task on the ShareGPT4V test set, LongD-CLIP outperforms Long-CLIP by 2.5%, achieving 98.3%. On the Urban-1k dataset, it shows a 9.2% improvement, reaching 91.9%, which demonstrates its robust generalization ability. Additionally, LongD-CLIP’s text encoder exhibits reduced drift in latent space and improved compatibility with existing generative models, effectively overcoming the 77-token input constraint.
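The dual-teacher objective can be summarized with a short sketch: the student follows the long-text teacher for new representation knowledge while staying close to the original CLIP teacher to avoid forgetting. The cosine-distance form and weighting below are assumptions, not the LongD-CLIP loss.

```python
# Minimal sketch: dual-teacher distillation on text embeddings with a retention term
# toward the original CLIP space.
import torch
import torch.nn.functional as F

def dual_teacher_loss(student_t, long_teacher_t, clip_teacher_t, alpha: float = 0.7):
    """All inputs: (B, d) text embeddings for the same batch of captions."""
    s = F.normalize(student_t, dim=-1)
    t_long = F.normalize(long_teacher_t, dim=-1)
    t_clip = F.normalize(clip_teacher_t, dim=-1)
    distill = 1 - (s * t_long).sum(-1).mean()   # follow the long-text teacher
    retain = 1 - (s * t_clip).sum(-1).mean()    # stay close to original CLIP knowledge
    return alpha * distill + (1 - alpha) * retain
```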
Poster
Dahyun Kang · Piotr Bojanowski · Huy V. Vo · Théo Moutakanni · Cijo Jose · Federico Baldassarre · Patrick Labatut · Michael Ramamonjisoa · Maxime Oquab · Timothée Darcet · Hu Xu · Shang-Wen Li · Oriane Simeoni · Marc Szafraniec

[ ExHall D ]

Abstract
Self-supervised visual foundation models produce powerful embeddings that achieve remarkable performance on a wide range of downstream tasks. However, unlike vision-language models such as CLIP, self-supervised visual features are not readily aligned with language, hindering their adoption in open-vocabulary tasks. Our method, named dtxt, unlocks this new ability for DINOv2, a widely used self-supervised visual encoder. We build upon the LiT training strategy, which trains a text encoder to align with a frozen vision model, but leads to unsatisfactory results on dense tasks. We propose several key ingredients to improve performance on both global and dense tasks, such as concatenating the [CLS] token with the patch average to train the alignment and curating data using both text and image modalities. With these, we successfully train a CLIP-like model with only a fraction of the computational cost compared to CLIP while achieving state-of-the-art results in zero-shot classification and open-vocabulary semantic segmentation.
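The [CLS]-plus-patch-average target is a one-liner worth spelling out. The sketch below only illustrates that concatenation step under assumed shapes; it is not the training code.

```python
# Minimal sketch: build the vision-side feature for alignment by concatenating the
# [CLS] token with the average of the patch tokens from a frozen vision encoder.
import torch

def vision_feature(cls_token: torch.Tensor, patch_tokens: torch.Tensor) -> torch.Tensor:
    """cls_token: (B, d), patch_tokens: (B, N, d)."""
    patch_avg = patch_tokens.mean(dim=1)              # (B, d) dense summary of the patches
    return torch.cat([cls_token, patch_avg], dim=-1)  # (B, 2d) target for text alignment
```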
Poster
Davide Berasi · Matteo Farina · Massimiliano Mancini · Elisa Ricci · Nicola Strisciuglio

[ ExHall D ]

Abstract
Vision-Language Models (VLMs) learn a shared feature space for text and images, enabling the comparison of inputs of different modalities. While prior works demonstrated that VLMs organize natural language representations into regular structures encoding composite meanings, it remains unclear if compositional patterns also emerge in the visual embedding space. In this work, we investigate compositionality in the image domain, where the analysis of compositional properties is challenged by noise and sparsity of visual data. We propose a framework, called Geodesically Decomposable Embeddings (GDE), that addresses these problems and approximates image representations with geometry-aware compositional structures in the latent space. We demonstrate that visual embeddings of pre-trained VLMs exhibit a compositional arrangement, and evaluate the effectiveness of this property in the tasks of compositional classification and group robustness. GDE achieves stronger performance in compositional classification compared to its counterpart method that assumes linear geometry of the latent space. Notably, it is particularly effective for group robustness, where we achieve higher results than task-specific solutions. Our results indicate that VLMs can automatically develop a human-like form of compositional reasoning in the visual domain, making their underlying processes more interpretable.
Poster
Jiuhai Chen · Jianwei Yang · Haiping Wu · Dianqi Li · Jianfeng Gao · Tianyi Zhou · Bin Xiao

[ ExHall D ]

Abstract
We present Florence-VL, a new family of multimodal large language models (MLLMs) with enriched visual representations produced by Florence-2, a generative vision foundation model. Unlike the widely used CLIP-style vision transformer trained by contrastive learning, Florence-2 can capture different levels and aspects of visual features, which are more versatile to be adapted to diverse downstream tasks. We propose a novel feature-fusion architecture and an innovative training recipe that effectively integrates Florence-2's visual features into pretrained LLMs, such as Phi 3.5 and LLama 3. In particular, we propose "depth-breadth fusion" (DBFusion) to fuse the visual features extracted from different depths and under multiple prompts. Our model training is composed of end-to-end pretraining of the whole model followed by finetuning of the projection layer and the LLM, on a carefully designed recipe of diverse open-source datasets that include high-quality image captions and instruction-tuning pairs. Our quantitative analysis and visualization of Florence-VL's visual features show its advantages over popular vision encoders on vision-language alignment, where the enriched depth and breadth play important roles. Florence-VL achieves significant improvements over existing state-of-the-art MLLMs across various multi-modal and vision-centric benchmarks covering general VQA, perception, hallucination, OCR, Chart, knowledge-intensive understanding, etc. To facilitate future research, our models …
Poster
Chenyu Yang · Xuan Dong · Xizhou Zhu · Weijie Su · Jiahao Wang · Hao Tian · Zhe Chen · Wenhai Wang · Lewei Lu · Jifeng Dai

[ ExHall D ]

Abstract
Large Vision-Language Models (VLMs) have been extended to understand both images and videos. Visual token compression is leveraged to reduce the considerable token length of visual inputs. To meet the needs of different tasks, existing high-performance models usually process images and videos separately with different token compression strategies, limiting the capabilities of combining images and videos. To this end, we extend each image into a "static" video and introduce a unified token compression strategy called Progressive Visual Token Compression (PVC), where the tokens of each frame are progressively encoded and adaptively compressed to supplement the information not extracted from previous frames. Video tokens are efficiently compressed by exploiting the inherent temporal redundancy. Images are repeated as static videos, and the spatial details can be gradually supplemented in multiple frames. PVC unifies token compression for images and videos. With a limited number of tokens per frame (64 tokens by default), spatial details and temporal changes can still be preserved. Experiments show that our model achieves state-of-the-art performance across various video understanding benchmarks, including long video tasks and fine-grained short video tasks. Meanwhile, our unified token compression strategy incurs no performance loss on image benchmarks, particularly in detail-sensitive tasks.
Poster
Yaqi Zhao · Yuanyang Yin · Lin Li · Mingan Lin · Victor Shea-Jay Huang · Siwei Chen · Weipeng Chen · Baoqun Yin · Zenan Zhou · Wentao Zhang

[ ExHall D ]

Abstract
Does seeing always mean knowing? Large Vision-Language Models (LVLMs) integrate separately pre-trained vision and language components, often using CLIP-ViT as the vision backbone. However, these models frequently encounter a core issue of "cognitive misalignment" between the vision encoder (VE) and the large language model (LLM). Specifically, the VE's representation of visual information may not fully align with the LLM's cognitive framework, leading to a mismatch where visual features exceed the language model’s interpretive range. To address this, we investigate how variations in VE representations influence LVLM comprehension, especially when the LLM faces VE-Unknown data—images whose ambiguous visual representations challenge the VE’s interpretive precision. Accordingly, we construct a multi-granularity landmark dataset and systematically examine the impact of VE-Known and VE-Unknown data on interpretive abilities. Our results show that VE-Unknown data limits LVLM’s capacity for accurate understanding, while VE-Known data, rich in distinctive features, helps reduce cognitive misalignment. Building on these insights, we propose Entity-Enhanced Cognitive Alignment (EECA), a method that employs multi-granularity supervision to generate visually enriched, well-aligned tokens that not only integrate within the embedding space but also align with the LLM's cognitive framework. This alignment markedly enhances LVLM performance in landmark recognition. Our findings underscore the challenges posed by VE-Unknown data and highlight …
Poster
Luo · Xue Yang · Wenhan Dou · Zhaokai Wang · Jiawen Liu · Jifeng Dai · Yu Qiao · Xizhou Zhu

[ ExHall D ]

Abstract
In this paper, we focus on monolithic Multimodal Large Language Models (MLLMs) that integrate visual encoding and language decoding into a single LLM. In particular, we identify that existing pre-training strategies for monolithic MLLMs often suffer from unstable optimization or catastrophic forgetting. To address this issue, our core idea is to embed a new visual parameter space into a pre-trained LLM, thereby stably learning visual knowledge from noisy data while freezing the LLM. Based on this principle, we present Mono-InternVL, a novel monolithic MLLM that seamlessly integrates a set of visual experts via a multimodal mixture-of-experts structure. Moreover, we propose an innovative pre-training strategy to maximize the visual capability of Mono-InternVL, namely Endogenous Visual Pre-training (EViP). In particular, EViP is designed as a progressive learning process for visual experts, which aims to fully exploit the visual knowledge from noisy data to high-quality data. To validate our approach, we conduct extensive experiments on 16 benchmarks. Experimental results confirm the superior performance of Mono-InternVL than existing monolithic MLLMs on 13 of 16 multimodal benchmarks, e.g., +80 points over Emu3 on OCRBench. Compared to the modular baseline, i.e., InternVL-1.5, Mono-InternVL still retains comparable multimodal performance while reducing up to 67% first token latency. …
Poster
Xubing Ye · Yukang Gan · Yixiao Ge · Xiao-Ping Zhang · Yansong Tang

[ ExHall D ]

Abstract
Large Vision Language Models (LVLMs) have achieved significant success across multi-modal tasks. However, the computational cost of processing long visual tokens can be prohibitively expensive on resource-limited devices. Previous methods have identified redundancy in visual tokens within the Large Language Model (LLM) decoder layers and have mitigated this by pruning tokens using a pre-defined or fixed ratio, thereby reducing computational overhead. Nonetheless, we observe that the impact of pruning ratio varies across different LLM layers and instances (image-prompt pairs). Therefore, it is essential to develop a layer-wise and instance-wise vision token pruning strategy to balance computational cost and model performance effectively. We propose ATP-LLaVA, a novel approach that adaptively determines instance-specific token pruning ratios for each LLM layer. Specifically, we introduce an Adaptive Token Pruning (ATP) module, which computes the importance score and pruning threshold based on input instance adaptively. The ATP module can be seamlessly integrated between any two LLM layers with negligible computational overhead. Additionally, we develop a Spatial Augmented Pruning (SAP) strategy that prunes visual tokens with both token redundancy and spatial modeling perspectives. Our approach reduces the average token count by 75% while maintaining performance, with only a minimal 1.9% degradation across seven widely used benchmarks.
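The layer-wise, instance-wise pruning decision boils down to thresholding visual tokens by an importance score, with the threshold itself predicted per instance and per layer. The sketch below only illustrates that thresholding step under an assumed quantile-style threshold; it is not the ATP module.

```python
# Minimal sketch: prune visual tokens whose importance score falls below an
# instance/layer-adaptive threshold.
import torch

def prune_visual_tokens(tokens: torch.Tensor, scores: torch.Tensor, keep_quantile: float = 0.75):
    """tokens: (N, d) visual tokens at one LLM layer, scores: (N,) importance scores,
    keep_quantile: adaptive threshold predicted per instance and layer (assumed form)."""
    thresh = torch.quantile(scores, keep_quantile)
    keep = scores >= thresh
    return tokens[keep], keep

# Example: keep roughly the top 25% of 576 visual tokens for this instance at this layer.
kept_tokens, keep_mask = prune_visual_tokens(torch.randn(576, 1024), torch.rand(576), 0.75)
```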
Poster
Dominik Schnaus · Nikita Araslanov · Daniel Cremers

[ ExHall D ]

Abstract
The platonic representation hypothesis suggests that vision and language embeddings become more homogeneous as model and dataset sizes increase. In particular, pairwise distances within each modality become more similar. This suggests that as foundation models mature, it may become possible to match vision and language embeddings in a fully unsupervised fashion, i.e., without parallel data. We present the first study towards this prospect, and investigate conformity of existing vision and language foundation models in the context of "blind" matching. First, we formulate unsupervised matching as a quadratic assignment problem and introduce a novel heuristic that outperforms previous solvers. We also develop a technique to find optimal matching problems, for which a non-trivial match is very likely. Second, we conduct an extensive study deploying a range of vision and language models on four datasets. Our analysis reveals that for many problem instances, vision and language representations can be indeed matched without supervision. This finding opens possibility for exciting applications embedding semantic knowledge into other modalities. As a showcase, we demonstrate a proof-of-concept unsupervised classifier, which achieves non-trivial classification accuracy without any image-text annotation.
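The quadratic-assignment formulation of blind matching can be sketched with an off-the-shelf approximate solver over intra-modal distance matrices. This is a simplified stand-in, not the paper's novel heuristic; the use of SciPy's FAQ solver and the maximize option are assumptions about one reasonable setup.

```python
# Minimal sketch: match vision and language embeddings without parallel data by
# aligning their intra-modal pairwise distance structures via a QAP solver.
import numpy as np
from scipy.optimize import quadratic_assignment
from scipy.spatial.distance import cdist

def blind_match(vision_embs: np.ndarray, text_embs: np.ndarray) -> np.ndarray:
    """vision_embs: (n, dv), text_embs: (n, dt); returns an index permutation."""
    A = cdist(vision_embs, vision_embs)   # pairwise distances within the vision modality
    B = cdist(text_embs, text_embs)       # pairwise distances within the language modality
    # Maximizing trace(A P B P^T) minimizes the Frobenius mismatch between A and P B P^T.
    res = quadratic_assignment(A, B, method="faq", options={"maximize": True})
    return res.col_ind                    # vision item i is matched to text item col_ind[i]
```

If the platonic hypothesis holds and the two distance structures are similar, the recovered permutation is non-trivially better than chance, which is what enables the unsupervised classifier showcased above.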
Poster
Kun Zhang · Jingyu Li · Zhe Li · S Kevin Zhou

[ ExHall D ]

Abstract
Vision-Language (VL) alignment across image and text modalities is a challenging task due to the inherent semantic ambiguity of data with multiple possible meanings. Existing methods typically solve it by learning multiple sub-representation spaces to encode each input data as a set of embeddings, and constraining diversity between whole subspaces to capture diverse semantics for accurate VL alignment. Despite their promising outcomes, existing methods suffer from two imperfections: 1) actually, specific semantics is mainly expressed by some local dimensions within the subspace. Ignoring this intrinsic property, existing diversity constraints imposed on the whole subspace may impair diverse embedding learning; 2) multiple embeddings are inevitably introduced, sacrificing computational and storage efficiency. In this paper, we propose a simple yet effective Diverse and Hybrid Set-embeddings learning framework (DH-Set), which is distinct from prior work in three aspects. DH-Set 1) devises a novel semantic importance dissecting method to focus on key local dimensions within each subspace; and thereby 2) not only imposes finer-grained diversity constraint to improve the accuracy of diverse embedding learning, 3) but also mixes key dimensions of all subspaces into the single hybrid embedding to boost inference efficiency. Extensive experiments on various benchmarks and model backbones show the superiority of DH-Set …
Poster
Zhangqi Jiang · Junkai Chen · Beier Zhu · Tingjin Luo · Yankun Shen · Xu Yang

[ ExHall D ]

Abstract
Hallucinations in Large Vision-Language Models (LVLMs) significantly undermine their reliability, motivating researchers to explore the causes of hallucination. However, most studies primarily focus on the language aspect rather than the visual. In this paper, we address how LVLMs process visual information and whether this process causes hallucination. Firstly, we use the attention lens to identify the stages at which LVLMs handle visual data, discovering that the middle layers are crucial. Moreover, we find that these layers can be further divided into two stages: "visual information enrichment" and "semantic refinement" which respectively propagate visual data to object tokens and interpret it through text. By analyzing attention patterns during the visual information enrichment stage, we find that real tokens consistently receive higher attention weights than hallucinated ones, serving as a strong indicator of hallucination. Further examination of multi-head attention maps reveals that hallucination tokens often result from heads interacting with inconsistent objects. Based on these insights, we propose a simple inference-time method that adjusts visual attention by integrating information across various heads. Extensive experiments demonstrate that this approach effectively mitigates hallucinations in mainstream LVLMs without additional training costs. Our code will be released at: https://anonymous.4open.science/r/middle_layers_indicating_hallucinations-C45A.
Poster
Yuncheng Guo · Xiaodong Gu

[ ExHall D ]

Abstract
Large-scale pre-trained Vision-Language Models (VLMs) have become essential for transfer learning across diverse tasks. However, adapting these models with limited few-shot data often leads to overfitting, diminishing their performance on new tasks. To tackle this issue, we propose a novel Multi-Modal Representation Learning (MMRL) framework that introduces a shared, learnable, and modality-agnostic representation space. MMRL projects the space tokens to text and image representation tokens, facilitating more effective multi-modal interactions. Unlike previous approaches that solely optimize class token features, MMRL integrates representation tokens at higher layers of the encoders—where dataset-specific features are more prominent—while preserving generalized knowledge in the lower layers. During training, both representation and class features are optimized, with a trainable projection layer applied to the representation tokens, whereas the class token projection layer remains frozen to retain pre-trained knowledge. Furthermore, a regularization term is introduced to align the class features and text features with the zero-shot features from the frozen VLM, thereby safeguarding the model's generalization capacity. For inference, a decoupling strategy is employed, wherein both representation and class features are utilized for base classes, while only the class features, which retain more generalized knowledge, are used for new tasks. Extensive experiments across 15 datasets demonstrate that MMRL …
Poster
Zixuan Hu · Yongxian Wei · Li Shen · Chun Yuan · Dacheng Tao

[ ExHall D ]

Abstract
Large Language Models (LLMs) such as ChatGPT demonstrate strong few-shot adaptability without requiring fine-tuning, positioning them as ideal for data-limited and real-time applications. However, this adaptability has not yet been replicated in current Visual Foundation Models (VFMs), which require explicit fine-tuning with sufficient tuning data. Besides, the pretraining-finetuning paradigm has led to the surge of numerous task-specific modular components, such as Low-Rank Adaptation (LoRA). For the first time, we explore the potential of reusing diverse pre-tuned LoRAs without accessing their original training data, to achieve tuning-free few-shot adaptation in VFMs. Our framework, LoRA Recycle, distills a meta-LoRA from diverse pre-tuned LoRAs with a meta-learning objective, using surrogate data generated inversely from pre-tuned LoRAs themselves. The VFM, once equipped with the meta-LoRA, is empowered to solve new few-shot tasks in a single forward pass, akin to the in-context learning of LLMs. Additionally, we incorporate a double-efficient mechanism tailored to our framework, significantly accelerating the meta-training process while maintaining or even improving performance. Extensive experiments across various few-shot classification benchmarks across both in- and cross-domain scenarios demonstrate the superiority of our framework.
Poster
Soumya Suvra Ghosal · Souradip Chakraborty · Vaibhav Singh · Tianrui Guan · Mengdi Wang · Ahmad Beirami · Furong Huang · Alvaro Velasquez · Dinesh Manocha · Amrit Singh Bedi

[ ExHall D ]

Abstract
With the widespread deployment of Multimodal Large Language Models (MLLMs) for visual-reasoning tasks, improving their safety has become crucial. Recent research indicates that despite training-time safety alignment, these models remain vulnerable to jailbreak attacks—carefully crafted image-prompt pairs that compel the model to generate harmful content. In this work, we first highlight a critical safety gap, demonstrating that alignment achieved solely through safety training may be insufficient against jailbreak attacks. To address this vulnerability, we propose Immune, an inference-time defense framework that leverages a safe reward model during decoding to defend against jailbreak attacks. Additionally, we provide a rigorous mathematical characterization of Immune, offering provable guarantees against jailbreaks. Extensive evaluations on diverse jailbreak benchmarks using recent MLLMs reveal that Immune effectively enhances model safety while preserving the model's original capabilities. For instance, against text-based jailbreak attacks on LLaVA-1.6, Immune reduces the attack success rate by 57.82% and 16.78% compared to the base MLLM and state-of-the-art defense strategy, respectively.
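The inference-time idea, steering decoding with a safety reward model, can be pictured with a simplified sketch. This is not the Immune algorithm or its guarantees; the candidate re-ranking scheme, the `reward_fn` interface, and the weighting are all hypothetical.

```python
# Minimal sketch: re-rank the top-k candidate next tokens at each decoding step using
# a safety reward model, shifting generation away from harmful continuations.
import torch

def safe_decode_step(lm_logits: torch.Tensor, reward_fn, prefix_ids, k: int = 10, beta: float = 2.0):
    """lm_logits: (V,) next-token logits from the MLLM; reward_fn is a hypothetical
    callable returning a scalar safety reward for a candidate token sequence."""
    topk = torch.topk(lm_logits, k)
    scores = []
    for logit, tok in zip(topk.values, topk.indices):
        r = reward_fn(prefix_ids + [int(tok)])   # safety reward of the extended prefix
        scores.append(float(logit) + beta * r)   # reward-shifted decoding objective
    best = int(torch.tensor(scores).argmax())
    return int(topk.indices[best])               # chosen next token id
```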
Poster
Yue Cao · Yun Xing · Jie Zhang · Di Lin · Tianwei Zhang · Ivor Tsang · Yang Liu · Qing Guo

[ ExHall D ]

Abstract
Large vision-language models (LVLMs) have shown remarkable capabilities in interpreting visual content. While existing works demonstrate these models' vulnerability to deliberately placed adversarial texts, such texts are often easily identifiable as anomalous. In this paper, we present the first approach to generate scene-coherent typographic adversarial attacks that mislead advanced LVLMs while maintaining visual naturalness through the capability of the LLM-based agent. Our approach addresses three critical questions: what adversarial text to generate, where to place it within the scene, and how to integrate it seamlessly. We propose a training-free, multi-modal LLM-driven scene-coherent typographic adversarial planning (SceneTAP) that employs a three-stage process: scene understanding, adversarial planning, and seamless integration. The SceneTAP utilizes chain-of-thought reasoning to comprehend the scene, formulate effective adversarial text, strategically plan its placement, and provide detailed instructions for natural integration within the image. This is followed by a scene-coherent TextDiffuser that executes the attack using a local diffusion mechanism. We extend our method to real-world scenarios by printing and placing generated patches in physical environments, demonstrating its practical implications. Extensive experiments show that our scene-coherent adversarial text successfully misleads state-of-the-art LVLMs, including ChatGPT-4o, even after capturing new images of physical setups. Our evaluations demonstrate a significant increase in attack success rates while …
Poster
Zhaoyi Liu · Huan Zhang

[ ExHall D ]

Abstract
Self-supervised learning (SSL) vision encoders learn high-quality image representations and thus have become a vital part of developing the vision modality of large vision language models (LVLMs). Due to the high cost of training such encoders, pre-trained encoders are widely shared and deployed into many LVLMs, which are security-critical or bear societal significance. Under this practical scenario, we reveal a new backdoor threat that significant visual hallucinations can be induced into these LVLMs by merely compromising vision encoders. Because of the sharing and reuse of these encoders, many downstream LVLMs may inherit backdoor behaviors from encoders, leading to widespread backdoors. In this work, we propose BadVision, the first method to exploit this vulnerability in SSL vision encoders for LVLMs with novel trigger optimization and backdoor learning techniques. We evaluate BadVision on two types of SSL encoders and LVLMs across eight benchmarks. We show that BadVision effectively drives the LVLMs to attacker-chosen hallucination with over 99% attack success rate, causing a 77.6% relative visual understanding error while maintaining stealthiness. SoTA backdoor detection methods cannot detect our attack effectively.
Poster
Yuchen Ren · Zhengyu Zhao · Chenhao Lin · Bo Yang · Lu Zhou · Zhe Liu · Chao Shen

[ ExHall D ]

Abstract
Vision Transformers (ViTs) have been widely applied in various computer vision and vision-language tasks. To gain insights into their robustness in practical scenarios, transferable adversarial examples on ViTs have been extensively studied. A typical approach to improving adversarial transferability is by refining the surrogate model. However, existing work on ViTs has restricted their surrogate refinement to backward propagation. In this work, we instead focus on Forward Propagation Refinement (FPR) and specifically refine two key modules of ViTs: attention maps and token embeddings. For attention maps, we propose Attention Map Diversification (AMD), which diversifies certain attention maps and also implicitly imposes beneficial gradient vanishing during backward propagation. For token embeddings, we propose Momentum Token Embedding (MTE), which accumulates historical token embeddings to stabilize the forward updates in both the Attention and MLP blocks. We conduct extensive experiments with adversarial examples transferred from ViTs to various CNNs and ViTs, demonstrating that our FPR outperforms the current best (backward) surrogate refinement method by up to 7.0% on average. We also validate its superiority against popular defenses and its compatibility with other transfer methods.
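The Momentum Token Embedding idea reduces to an exponential moving average over token embeddings during the forward pass. The sketch below is an assumed form of that accumulation, not the FPR code; the momentum coefficient is illustrative.

```python
# Minimal sketch: momentum accumulation of token embeddings to stabilize forward
# updates across Attention/MLP blocks when crafting transferable adversarial examples.
import torch

def momentum_update(history: torch.Tensor, current: torch.Tensor, mu: float = 0.9) -> torch.Tensor:
    """history, current: (N, d) token embeddings at a given block."""
    return mu * history + (1 - mu) * current

# Hypothetical usage inside the surrogate ViT's forward pass:
#   tokens = momentum_update(token_history[layer], tokens)
#   token_history[layer] = tokens.detach()
```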
Poster
Jenny Schmalfuss · Nadine Chang · Vibashan VS · Maying Shen · Andrés Bruhn · Jose M. Alvarez

[ ExHall D ]

Abstract
Vision language models (VLMs) respond to user-crafted text prompts and visual inputs, and are applied to numerous real-world problems. VLMs integrate visual modalities with large language models (LLMs), which are well known to be prompt-sensitive. Hence, it is crucial to determine whether VLMs inherit this sensitivity to varying prompts. We therefore investigate which prompt variations VLMs are most sensitive to and which VLMs are most agnostic to prompt variations. To this end, we introduce PARC (Prompt Analysis via Reliability and Calibration), a VLM prompt sensitivity analysis framework built on three pillars: (1) plausible prompt variations in both the language and vision domain, (2) a novel model reliability score with built-in guarantees, and (3) a calibration step that enables dataset- and prompt-spanning prompt variation analysis. Regarding prompt variations, experimental results from PARC show that VLMs mirror LLM language prompt sensitivity in the vision domain, and the most destructive variations are those that change the expected answer. Regarding models, the outstandingly robust VLMs among the 22 evaluated models come from the InternVL2 family. We further find indications that prompt sensitivity is linked more closely to training data than to model size. Code and datasets will be released.
Poster
Yassir Bendou · Amine Ouasfi · Vincent Gripon · Adnane Boukhayma

[ ExHall D ]

Abstract
The growing popularity of Contrastive Language-Image Pretraining (CLIP) has led to its widespread application in various visual downstream tasks. To enhance CLIP's effectiveness, efficient few-shot adaptation techniques have been widely adopted. Among these approaches, training-free methods, particularly caching methods exemplified by Tip-Adapter, have gained attention for their lightweight adaptation without the need for additional fine-tuning. In this paper, we revisit Tip-Adapter from a kernel perspective, showing that caching methods function as local adapters and are connected to well-established kernel literature. Leveraging this insight, we offer a theoretical understanding of how these methods operate and suggest multiple avenues for improving upon the Tip-Adapter baseline. Notably, our analysis shows the importance of incorporating global information in local adapters. Therefore, we subsequently propose a global method that learns a proximal regularizer in a reproducing kernel Hilbert space (RKHS) using CLIP as a base learner. Our method, that we call ProKeR (Proximal Kernel ridge Regression), has a closed form solution and achieves state-of-the-art performance across 11 datasets in the standard few-shot adaptation benchmark.
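As background for the closed-form claim, standard kernel ridge regression over few-shot features already admits a one-line solution. The sketch below shows that textbook closed form only; it is not the full ProKeR method (which additionally regularizes toward the CLIP zero-shot base learner), and the RBF kernel and hyperparameters are assumptions.

```python
# Minimal sketch: closed-form kernel ridge regression over few-shot support features,
# producing class scores for query images.
import torch

def krr_fit_predict(feats, labels_onehot, queries, lam: float = 1.0, sigma: float = 1.0):
    """feats: (n, d) support features, labels_onehot: (n, C), queries: (m, d)."""
    def rbf(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    K = rbf(feats, feats)                                   # (n, n) Gram matrix
    alpha = torch.linalg.solve(K + lam * torch.eye(len(feats)), labels_onehot)  # closed form
    return rbf(queries, feats) @ alpha                      # (m, C) query class scores
```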
Poster
Maxime Zanella · Clément Fuchs · Christophe De Vleeschouwer · Ismail Ben Ayed

[ ExHall D ]

Abstract
The zero-shot capabilities of Vision-Language Models (VLMs) have been widely leveraged to improve predictive performance. However, previous works on transductive or test-time adaptation (TTA) often make strong assumptions about the data distribution, such as the presence of all classes. Our work challenges these favorable deployment scenarios, and introduces a more realistic evaluation framework, including: (i) a variable number of effective classes for adaptation within a single batch, and (ii) non-i.i.d. batches of test samples in online adaptation settings. We provide comprehensive evaluations, comparisons, and ablation studies that demonstrate how current transductive or TTA methods for VLMs systematically compromise the models’ initial zero-shot robustness across various realistic scenarios, favoring performance gains under advantageous assumptions about the test samples' distributions. Furthermore, we introduce StatA, a versatile method that could handle a wide range of deployment scenarios, including those with a variable number of effective classes at test time. Our approach incorporates a novel regularization term designed specifically for VLMs, which acts as a statistical anchor preserving the initial text-encoder knowledge, particularly in low-data regimes. Code will be made available.
Poster
Dengyang Jiang · Haoyu Wang · Lei Zhang · Wei Wei · Guang Dai · Mengmeng Wang · Jingdong Wang · Yanning Zhang

[ ExHall D ]

Abstract
Pre-training backbone networks on a general annotated dataset (e.g., ImageNet) that comprises numerous manually collected images with category annotations, have proven to be indispensable for enhancing the generalization capacity of downstream visual tasks. However, those manually collected images often exhibit non-trivial bias, which is not only non-transferable across either categories or domains, but also inevitably memorized by the backbone, thus causing its generalization capacity degeneration. To mitigate this problem, we present an unbiased general annotated dataset generation framework (ubGen). Instead of expensive manual collection, we aim at directly generating synthetic unbiased images with category annotations. To achieve this goal, we propose to leverage the advantage of multimodal foundation model (e.g., CLIP), in terms of aligning images with language in an unbiased semantic space. Specifically, we develop a bi-level semantic alignment loss, which not only forces all generated images to be consistent with the semantic distribution of all categories belonging to the target dataset in an adversarial learning manner, but also requires each generated image to match the semantic description of its category name. In addition, we further cast an existing image quality scoring model into a quality assurance loss to preserve the quality of the generated image. By leveraging these …
Poster
Chaoyang Li · Jianyang Qin · Jinhao Cui · Zeyu Liu · Ning Hu · Qing Liao

[ ExHall D ]

Abstract
Multi-task prompt learning has emerged as a promising technique for fine-tuning pre-trained Vision-Language Models (VLMs) to various downstream tasks. However, existing methods ignore challenges caused by spurious correlations and dynamic task relationships, which may reduce the model performance. To tackle these challenges, we propose JSCPT, a novel approach for Joint Scheduling of Causal Prompts and Tasks to enhance multi-task prompt learning. Specifically, we first design a Multi-Task Vision-Language Prompt (MTVLP) model, which learns task-shared and task-specific vision-language prompts and selects useful prompt features via causal intervention, alleviating spurious correlations. Then, we propose the task-prompt scheduler that models inter-task affinities and assesses the causal effect of prompt features to optimize the multi-task prompt learning process. Finally, we formulate the scheduler and the multi-task prompt learning process as a bi-level optimization problem to optimize prompts and tasks adaptively. In the lower optimization, MTVLP is updated with the scheduled gradient, while in the upper optimization, the scheduler is updated with the implicit gradient. Extensive experiments show the superiority of our proposed JSCPT approach over several baselines in terms of multi-task prompt learning for pre-trained VLMs.
Poster
Hairui Ren · Fan Tang · He Zhao · Zixuan Wang · Dandan Guo · Yi Chang

[ ExHall D ]

Abstract
Fine-tuning vision-language models (VLMs) with large amounts of unlabeled data has recently garnered significant interest. However, a key challenge remains the lack of high-quality pseudo-labeled data. Current pseudo-labeling strategies often struggle with mismatches between semantic and visual information, leading to sub-optimal performance of unsupervised prompt learning (UPL) methods. In this paper, we introduce a simple yet effective approach called Augmenting Discriminative Richness via Diffusions (AiR), aimed at learning a richer, more discriminative way to represent each class comprehensively and thus facilitate classification. Specifically, our approach includes a pseudo-label generation module that leverages high-fidelity synthetic samples to create an auxiliary classifier, which captures richer visual variation, bridging text-image-pair classification to a more robust image-image-pair classification. Additionally, we exploit the diversity of diffusion-based synthetic samples to enhance prompt learning, providing greater information for semantic-visual alignment. Extensive experiments on five public benchmarks, including RESISC45 and Flowers102, and across three learning paradigms (UL, SSL, and TRZSL) demonstrate that AiR achieves substantial and consistent performance improvements over state-of-the-art unsupervised prompt learning methods.
Poster
Xiangyan Qu · Gaopeng Gou · Jiamin Zhuang · Jing Yu · Kun Song · Qihao Wang · Yili Li · Gang Xiong

[ ExHall D ]

Abstract
Vision-language models (VLMs) have made significant progress in image classification by training with large-scale paired image-text data. Their performance largely depends on prompt quality. While recent methods show that visual descriptions generated by large language models (LLMs) enhance the generalization of VLMs, class-specific prompts may be inaccurate or lack discrimination due to hallucination in LLMs. In this paper, we aim to find visually discriminative prompts for fine-grained categories with minimal supervision and no human-in-the-loop. An evolution-based algorithm is proposed to progressively optimize language prompts from task-specific templates to class-specific descriptions. Unlike optimizing templates, the search space of class-specific candidate prompts explodes, which increases prompt generation cost, iteration count, and the risk of overfitting. To this end, we first introduce several simple yet effective edit-based and evolution-based operations to generate diverse candidate prompts through a one-time query of LLMs. Then, two sampling strategies are proposed to find a better initial search point and reduce the number of traversed categories, saving iteration costs. Moreover, we apply a novel fitness score with entropy constraints to mitigate overfitting. In a challenging one-shot image classification setting, our method outperforms existing textual prompt-based methods and improves LLM-generated description methods across 13 datasets. Meanwhile, we demonstrate that our …
Poster
Jinpeng Wang · Tianci Luo · Yaohua Zha · Yan Feng · Ruisheng Luo · Bin Chen · Tao Dai · Long Chen · Yaowei Wang · Shu-Tao Xia

[ ExHall D ]

Abstract
Visual In-Context Learning (VICL) enables adaptively solving vision tasks by leveraging pixel demonstrations, mimicking human-like task completion through analogy. Prompt selection is critical in VICL, but current methods assume the existence of a single "ideal" prompt in a pool of candidates, which in practice may not hold true. Multiple suitable prompts may exist, but individually they often fall short, leading to difficulties in selection and the exclusion of useful context. To address this, we propose a new perspective: ***prompt condensation***. Rather than relying on a single prompt, candidate prompts collaborate to efficiently integrate informative contexts without sacrificing resolution. We devise Condenser, a lightweight external plugin that compresses relevant fine-grained context across multiple prompts. Optimized end-to-end with the backbone and an extra pre-alignment objective, Condenser ensures stability and accurate integration of contextual cues. Experiments demonstrate Condenser outperforms state-of-the-art methods across benchmark tasks, showing superior context compression, scalability with more prompts, and enhanced computational efficiency compared to ensemble methods, positioning it as a highly competitive solution for VICL. Code will be open-sourced at https://anonymous.4open.science/r/VICL-Condenser.
Poster
Jung-Ho Hong · Ho-Joong Kim · Kyu-Sung Jeon · Seong-Whan Lee

[ ExHall D ]

Abstract
The feature attribution method reveals the contribution of input variables to the decision-making process to provide an attribution map for explanation. Existing methods grounded on the information bottleneck principle compute information in a specific layer to obtain attributions, compressing the features by injecting noise via a parametric damping ratio. However, the attribution obtained in a specific layer neglects evidence of the decision-making process distributed across layers. In this paper, we introduce a comprehensive information bottleneck (CoIBA), which discovers the relevant information in each targeted layer to explain the decision-making process. Our core idea is applying information bottleneck in multiple targeted layers to estimate the comprehensive information by sharing a parametric damping ratio across the layers. Leveraging this shared ratio complements the over-compressed information to discover the omitted clues of the decision by sharing the relevant information across the targeted layers. We suggest the variational approach to fairly reflect the relevant information of each layer by upper bounding layer-wise information. Therefore, CoIBA guarantees that the discarded activation is unnecessary in every targeted layer to make a decision. The extensive experimental results demonstrate the enhancement in faithfulness of the feature attributions provided by CoIBA.
Poster
Jungsoo Lee · Debasmit Das · Munawar Hayat · Sungha Choi · Kyuwoong Hwang · Fatih Porikli

[ ExHall D ]

Abstract
We propose a novel knowledge distillation approach, CustomKD, that effectively leverages large vision foundation models (LVFMs) to enhance the performance of edge models (e.g., MobileNetV3). Despite recent advancements in LVFMs, such as DINOv2 and CLIP, their potential in knowledge distillation for enhancing edge models remains underexplored. While knowledge distillation is a promising approach for improving the performance of edge models, the discrepancy in model capacities and heterogeneous architectures between LVFMs and edge models poses a significant challenge. Our observation indicates that although utilizing larger backbones (e.g., ViT-S to ViT-L) in teacher models improves their downstream task performance, knowledge distillation from the large teacher models fails to bring as much performance gain for student models as for teacher models due to the large model discrepancy. Our simple yet effective CustomKD customizes the well-generalized features inherent in LVFMs to a given student model in order to reduce model discrepancies. Specifically, beyond providing well-generalized original knowledge from teachers, CustomKD aligns the features of teachers to those of students, making it easy for students to understand and overcome the large model discrepancy overall. CustomKD significantly improves the performance of edge models in scenarios with unlabeled data such as unsupervised domain adaptation (e.g., OfficeHome and …
Poster
Debora Caldarola · Pietro Cagnasso · Barbara Caputo · Marco Ciccone

[ ExHall D ]

Abstract
Federated learning (FL) enables collaborative model training with privacy preservation. Data heterogeneity across edge devices (clients) can cause models to converge to sharp minima, negatively impacting generalization and robustness. Recent approaches use client-side sharpness-aware minimization (SAM) to encourage flatter minima, but the discrepancy between local and global loss landscapes often undermines their effectiveness, as optimizing for local sharpness does not ensure global flatness. This work introduces FedGloSS (Federated Global Server-side Sharpness), a novel FL approach that prioritizes the optimization of global sharpness on the server, using SAM. To reduce communication overhead, FedGloSS cleverly approximates sharpness using the previous global gradient, eliminating the need for additional client communication. Our extensive evaluations demonstrate that FedGloSS consistently reaches flatter minima and better performance compared to state-of-the-art FL methods across various federated vision benchmarks.
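The communication-saving trick mentioned above, reusing the previous round's aggregated pseudo-gradient as the sharpness-ascent direction, can be sketched server-side in a few lines. This is a hedged illustration under our own simplifications (plain FedAvg aggregation, a dictionary of tensors as the "model", hypothetical function names); it is not the authors' FedGloSS implementation.

```python
# Sketch of server-side sharpness-aware aggregation where the ascent direction comes from
# the previous round's pseudo-gradient instead of an extra communication round.
import torch

@torch.no_grad()
def sam_perturb(global_params, prev_pseudo_grad, rho=0.05):
    """Return perturbed weights w + rho * g / ||g|| to broadcast to clients (ascent step)."""
    if prev_pseudo_grad is None:
        return {k: v.clone() for k, v in global_params.items()}
    norm = torch.sqrt(sum((g ** 2).sum() for g in prev_pseudo_grad.values())) + 1e-12
    return {k: v + rho * prev_pseudo_grad[k] / norm for k, v in global_params.items()}

@torch.no_grad()
def sam_aggregate(global_params, client_deltas, lr=1.0):
    """Apply the FedAvg of client deltas (computed at the perturbed point) to the
    *unperturbed* global weights, mirroring the SAM descent step; also return the
    pseudo-gradient reused as next round's ascent direction."""
    new_params, pseudo_grad = {}, {}
    for k, v in global_params.items():
        delta = torch.stack([d[k] for d in client_deltas]).mean(dim=0)
        new_params[k] = v + lr * delta
        pseudo_grad[k] = -delta          # negative average delta acts as a global gradient
    return new_params, pseudo_grad

# Toy usage with a single-tensor "model".
w = {"fc.weight": torch.zeros(4, 4)}
deltas = [{"fc.weight": torch.randn(4, 4) * 0.01} for _ in range(5)]
broadcast = sam_perturb(w, None)         # what clients would receive this round
w, g = sam_aggregate(w, deltas)          # g feeds sam_perturb next round
```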
Poster
Shunxin Wang · Raymond Veldhuis · Nicola Strisciuglio

[ ExHall D ]

Abstract
Frequency shortcuts refer to specific frequency patterns that models heavily rely on for correct classification. Previous studies have shown that models trained on small image datasets often exploit such shortcuts, potentially impairing their generalization performance. However, existing methods for identifying frequency shortcuts require expensive computations and become impractical for analyzing models trained on large datasets. In this work, we propose the first approach to more efficiently analyze frequency shortcuts at a larger scale. We show that both CNN and transformer models learn frequency shortcuts on ImageNet. We also expose that frequency shortcut solutions can yield good performance on out-of-distribution (OOD) test sets which largely retain texture information. However, these shortcuts, mostly aligned with texture patterns, hinder model generalization on rendition-based OOD test sets. These observations suggest that current OOD evaluations often overlook the impact of frequency shortcuts on model generalization. Future benchmarks could thus benefit from explicitly assessing and accounting for these shortcuts to build models that generalize across a broader range of OOD scenarios.
Poster
Ningyuan Tang · Minghao Fu · Jianxin Wu

[ ExHall D ]

Abstract
The rapid scaling of large vision pretrained models makes fine-tuning increasingly difficult on devices with low computational resources. We explore a new visual adaptation paradigm called separated tuning, which treats large pretrained models as standalone feature extractors that run on powerful cloud servers, while fine-tuning is carried out on devices with only limited computational resources (slow CPU, no GPU, small memory, etc.). We discuss existing methods that are potentially suitable for this separated tuning paradigm, but three major drawbacks hinder their application: low adaptation capability, large adapter networks, and, in particular, high information transfer overhead. To address these issues, we propose Minimal Interaction Separated Tuning, or MIST, which reveals that the sum of intermediate features from pretrained models not only requires minimal information transfer but also offers high adaptation capability. With a lightweight attention-based adaptor network, MIST achieves information transfer efficiency, parameter efficiency, and computational and memory efficiency, while demonstrating competitive results on various visual adaptation benchmarks.
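To make the data flow concrete, here is a toy sketch of the separated-tuning setup the abstract describes: a frozen cloud-side backbone returns only the sum of its intermediate features, and a small attention-based adaptor plus head are trained device-side. The backbone below is a stand-in MLP stack of our own, not a real pretrained ViT, and all shape and class choices are hypothetical.

```python
# Illustrative separated-tuning sketch: only a single summed feature vector per example
# crosses the cloud/device boundary.
import torch
import torch.nn as nn

class CloudBackbone(nn.Module):                 # frozen, runs server-side
    def __init__(self, dim=256, depth=6):
        super().__init__()
        self.blocks = nn.ModuleList(nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(depth))

    @torch.no_grad()
    def forward(self, x):
        summed = torch.zeros_like(x)
        for blk in self.blocks:
            x = blk(x)
            summed = summed + x                 # sum of intermediate features is all that is sent
        return summed

class DeviceAdaptor(nn.Module):                 # tiny, trains on-device
    def __init__(self, dim=256, num_classes=10):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, summed_feat):
        tokens = summed_feat.unsqueeze(1)                       # (B, 1, dim)
        out, _ = self.attn(tokens, tokens, tokens)
        return self.head(out.squeeze(1))

cloud, device_side = CloudBackbone(), DeviceAdaptor()
feats = cloud(torch.randn(8, 256))              # transferred once per example
logits = device_side(feats)
print(logits.shape)                             # torch.Size([8, 10])
```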
Poster
Krishna Sri Ipsit Mantri · Carola-Bibiane Schönlieb · Bruno Ribeiro · Chaim Baskin · Moshe Eliasof

[ ExHall D ]

Abstract
Pre-trained Vision Transformers now serve as powerful tools for computer vision. Yet, efficiently adapting them for multiple tasks remains a challenge that arises from the need to modify the rich hidden representations encoded by the learned weight matrices, without inducing interference between tasks. Current parameter-efficient methods like LoRA, which apply low-rank updates, force tasks to compete within constrained subspaces, ultimately degrading performance. We introduce DiTASK, a novel Diffeomorphic Multi-Task Fine-Tuning approach that maintains pre-trained representations by preserving weight matrix singular vectors, while enabling task-specific adaptations through neural diffeomorphic transformations of the singular values. By following this approach, DiTASK enables both shared and task-specific feature modulations with minimal added parameters. Our theoretical analysis shows that DiTASK achieves full-rank updates during optimization, preserving the geometric structure of pre-trained features, and establishing a new paradigm for efficient multi-task learning (MTL). Our experiments on PASCAL MTL and NYUD show that DiTASK achieves state-of-the-art performance across four dense prediction tasks, using 75% fewer parameters than existing methods.
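The core mechanism, freezing the pre-trained singular vectors and learning only a transformation of the singular values, can be sketched directly from an SVD. In the illustration below, a plain zero-initialized MLP acting on the spectrum stands in for the paper's neural diffeomorphic transformation; the class name and sizes are our own.

```python
# Singular-value-only adaptation sketch: keep U and V from the pre-trained weight fixed,
# learn a small remapping of the singular values (illustrative, not the DiTASK code).
import torch
import torch.nn as nn

class SingularValueAdapter(nn.Module):
    def __init__(self, weight: torch.Tensor, hidden: int = 32):
        super().__init__()
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        # Frozen factors from the pre-trained weight.
        self.register_buffer("U", U)
        self.register_buffer("S", S)
        self.register_buffer("Vh", Vh)
        # Task-specific module acting on the spectrum only (very few parameters).
        self.mod = nn.Sequential(nn.Linear(1, hidden), nn.Tanh(), nn.Linear(hidden, 1))
        nn.init.zeros_(self.mod[-1].weight)
        nn.init.zeros_(self.mod[-1].bias)       # identity mapping at initialization

    def forward(self, x):
        # exp keeps the adapted singular values positive; at init the weight equals W0.
        scale = torch.exp(self.mod(self.S.unsqueeze(-1))).squeeze(-1)
        W = self.U @ torch.diag(self.S * scale) @ self.Vh   # full-rank update, same singular vectors
        return x @ W.T

# Toy usage: adapt a frozen 64x64 "pre-trained" weight for a new task.
W0 = torch.randn(64, 64)
layer = SingularValueAdapter(W0)
print(layer(torch.randn(8, 64)).shape)          # torch.Size([8, 64])
```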
Poster
Jian Meng · Ahmed Hasssan · Li Yang · Deliang Fan · Jinwoo Shin · Jae-sun Seo

[ ExHall D ]

Abstract
Learning visual representations via masked auto-encoder (MAE) training has been proven to be a powerful technique. Transferring the pre-trained vision transformer (ViT) to downstream tasks leads to superior performance compared to conventional task-by-task supervised learning. Recent research on MAE focuses on large vision transformers (>50 million parameters) with outstanding performance. However, improving the generality of under-parameterized lightweight models has been widely ignored. In practice, downstream applications are commonly intended for resource-constrained platforms, where large-scale ViTs cannot easily meet the resource budget. Current lightweight MAE training heavily relies on knowledge distillation with a pre-trained teacher, whereas the root cause behind the poor performance remains under-explored. Motivated by that, this paper first introduces the concept of the "closest neighbor patch" to characterize the local semantics among the input tokens. Our analysis shows that lightweight models fail to distinguish different local information, leading to aliased understanding and poor accuracy. Motivated by this finding, we propose NoR-MAE, a novel MAE training algorithm for lightweight vision transformers. NoR-MAE elegantly repels the semantic aliasing between patches and their closest neighboring patch (semantic centroid) with negligible training cost overhead. With the ViT-Tiny model, NoR-MAE achieves up to 7.22%/3.64% accuracy improvements on the ImageNet-100/ImageNet-1K datasets, as …
Poster
Mengqiao Han · Liyuan Pan · Xiabi Liu

[ ExHall D ]

Abstract
Neural networks derived from the M-P model have excelled in various visual tasks. However, as simplified simulations of the brain's neural pathways, their structures are locked during training, causing over-fitting and over-parameterization. Although recent models have begun using biomimetic concepts and empirical pruning, they still result in irrational pruning, potentially affecting the accuracy of the model. In this paper, we introduce the Glia unit, composed of oligodendrocytes (Oli) and astrocytes (Ast), to emulate the exact workflow of the mammalian brain, thereby enhancing the biological plausibility of neural functions. Oli selects neurons involved in signal transmission during neural communication and, together with Ast, adaptively optimizes the neural structure. Specifically, we first construct the artificial Glia-Neuron (G-N) model, which is formulated at the instance, group, and interaction levels with adaptive and collaborative mechanisms. Then, we construct GliaNet based on our G-N model, whose structure and connections can be continuously optimized during training. Experiments show that our GliaNet advances the state of the art on multiple tasks while significantly reducing its parameter count.
Poster
Quentin Bouniot · Ievgen Redko · Anton Mallasto · Charlotte Laclau · Oliver Struckmeier · Karol Arndt · Markus Heinonen · Ville Kyrki · Samuel Kaski

[ ExHall D ]

Abstract
In the last decade, we have witnessed the introduction of several novel deep neural network (DNN) architectures exhibiting ever-increasing performance across diverse tasks. Explaining the upward trend of their performance, however, remains difficult as different DNN architectures of comparable depth and width -- common factors associated with their expressive power -- may exhibit a drastically different performance even when trained on the same dataset. In this paper, we introduce the concept of the non-linearity signature of a DNN, the first theoretically sound solution for approximately measuring the non-linearity of deep neural networks. Built upon a score derived from closed-form optimal transport mappings, this signature provides a better understanding of the inner workings of a wide range of DNN architectures and learning paradigms, with a particular emphasis on computer vision tasks. We provide extensive experimental results that highlight the practical usefulness of the proposed non-linearity signature and its potential for far-reaching implications.
Poster
Ali Hatamizadeh · Jan Kautz

[ ExHall D ]

Abstract
We propose a novel hybrid Mamba-Transformer backbone, MambaVision, specifically tailored for vision applications. Our core contribution includes redesigning the Mamba formulation to enhance its capability for efficient modeling of visual features. Through a comprehensive ablation study, we demonstrate the feasibility of integrating Vision Transformers (ViT) with Mamba. Our results show that equipping the Mamba architecture with self-attention blocks in the final layers greatly improves its capacity to capture long-range spatial dependencies. Based on these findings, we introduce a family of MambaVision models with a hierarchical architecture to meet various design criteria. For classification on the ImageNet-1K dataset, MambaVision variants achieve state-of-the-art (SOTA) performance in terms of both Top-1 accuracy and throughput. In downstream tasks such as object detection, instance segmentation, and semantic segmentation on MS COCO and ADE20K datasets, MambaVision outperforms comparably sized backbones while demonstrating favorable performance. Code: https://anonymous.4open.science/r/mamba_vision-D073
Poster
Qihang Fan · Huaibo Huang · Ran He

[ ExHall D ]

Abstract
The Softmax attention mechanism in Transformer models is notoriously computationally expensive, particularly due to its quadratic complexity, posing significant challenges in vision applications. In contrast, linear attention provides a far more efficient solution by reducing the complexity to linear levels. However, compared to Softmax attention, linear attention often experiences significant performance degradation. Our experiments indicate that this performance drop is due to the low-rank nature of linear attention's feature map, which hinders its ability to adequately model complex spatial information. In this paper, to break the low-rank dilemma of linear attention, we conduct rank analysis from two perspectives: the kv buffer and the output features. Consequently, we introduce **Rank-Augmented Linear Attention** (RALA), which rivals the performance of Softmax attention while maintaining linear complexity and high efficiency. Based on RALA, we construct the **Rank-Augmented Vision Linear Transformer** (RAVLT). Extensive experiments demonstrate that RAVLT achieves excellent performance across various vision tasks. Specifically, without using any additional labels, data, or supervision during training, RAVLT achieves an **84.4%** Top-1 accuracy on ImageNet-1k with only **26M** parameters and **4.6G** FLOPs. This result significantly surpasses previous linear attention mechanisms, fully illustrating the potential of RALA.
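For context, the generic kernelized linear attention the abstract builds on replaces softmax(QK^T)V with phi(Q)(phi(K)^T V), so the N x N attention map never materializes and the whole token interaction is funneled through a d x d "kv buffer" whose rank is at most d, which is exactly the low-rank bottleneck the paper targets. Below is a minimal sketch with the common elu(x)+1 feature map; it is the background formulation, not the RALA operator itself.

```python
# Generic kernelized linear attention: O(N d^2) instead of O(N^2 d).
import torch

def linear_attention(q, k, v, eps=1e-6):
    """q, k, v: (batch, n_tokens, dim). Uses elu(x)+1 as the positive feature map."""
    phi_q = torch.nn.functional.elu(q) + 1.0
    phi_k = torch.nn.functional.elu(k) + 1.0
    kv = torch.einsum("bnd,bne->bde", phi_k, v)            # (d, d) "kv buffer", rank <= d
    z = torch.einsum("bnd,bd->bn", phi_q, phi_k.sum(dim=1)) + eps
    return torch.einsum("bnd,bde->bne", phi_q, kv) / z.unsqueeze(-1)

q = k = v = torch.randn(2, 196, 64)
print(linear_attention(q, k, v).shape)   # torch.Size([2, 196, 64])
```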
Poster
Dachong Li · li li · zhuangzhuang chen · Jianqiang Li

[ ExHall D ]

Abstract
Large kernels play a crucial role in enhancing the performance of standard convolutional neural networks (CNNs), enabling CNNs to outperform transformer architectures in computer vision. Scaling up kernel size has significantly contributed to the advancement of CNN models like RepLKNet, SLaK and UniRepLKNet. However, the relationship between kernel size and model performance varies across these works, implying that large kernel convolutions may involve hidden factors that affect model performance. Instead of merely increasing the kernel size, we reassess the role of large convolutions and decompose them into two separate components: extracting features at a certain granularity and fusing features by multiple pathways. In this paper, we contribute from two aspects. 1) We demonstrate that 3×3 convolutions can replace large convolutions in existing large kernel CNNs to achieve comparable effects. 2) We develop a multi-path long-distance sparse dependency relationship to enhance feature utilization. Specifically, we introduce the Shiftwise (SW) convolution operator, a pure CNN architecture. In a wide range of vision tasks such as classification, segmentation and detection, SW surpasses state-of-the-art transformers and CNN architectures, including SLaK and UniRepLKNet. Code and all models are available at https://anonymous.4open.science/r/shift-wiseConv-8978.
Poster
Zelin Peng · Yu Huang · Zhengqin Xu · feilong tang · Ming Hu · Xiaokang Yang · Wei Shen

[ ExHall D ]

Abstract
Contextual modeling is crucial for robust visual representation learning, especially in computer vision. Although Transformers have become a leading architecture for vision tasks due to their attention mechanism, the quadratic complexity of full attention operations presents substantial computational challenges. To address this, we introduce Star with Bilinear Mapping (SBM), a Transformer-like architecture that achieves global contextual modeling with linear complexity. SBM employs a bilinear mapping module (BM) with low-rank decomposition strategy and star operations (element-wise multiplication) to efficiently capture global contextual information. Our model demonstrates competitive performance on image classification and semantic segmentation tasks, delivering significant computational efficiency gains compared to traditional attention-based models.
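The two ingredients named above, a low-rank (bilinear) mapping and a "star" operation, i.e., element-wise multiplication of two linear branches, can be combined in a toy block as shown below. This is an illustrative composition of ours, not the actual SBM architecture; layer names and the rank are arbitrary.

```python
# Toy block pairing a "star" interaction with a low-rank global-context path, all in
# linear complexity in the number of tokens (illustrative sketch only).
import torch
import torch.nn as nn

class StarLowRankBlock(nn.Module):
    def __init__(self, dim=64, rank=16):
        super().__init__()
        self.branch_a = nn.Linear(dim, dim)
        self.branch_b = nn.Linear(dim, dim)
        # Low-rank decomposition of a dim x dim mapping: dim -> rank -> dim.
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)

    def forward(self, x):                                # x: (batch, n_tokens, dim)
        star = self.branch_a(x) * self.branch_b(x)       # element-wise "star" interaction
        context = self.up(self.down(star.mean(dim=1)))   # global context via the low-rank map
        return x + star + context.unsqueeze(1)           # broadcast context back to all tokens

block = StarLowRankBlock()
print(block(torch.randn(2, 196, 64)).shape)   # torch.Size([2, 196, 64])
```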
Poster
Tommie Kerssies · Niccolò Cavagnero · Alexander Hermans · Narges Norouzi · Giuseppe Averta · Bastian Leibe · Gijs Dubbelman · Daan de Geus

[ ExHall D ]

Abstract
Vision Transformers (ViTs) have shown remarkable performance and scalability across various computer vision tasks. Currently, to apply single-scale ViTs to image segmentation, existing methods adopt a convolutional adapter to generate multi-scale features, a pixel decoder to fuse these features, and a Transformer decoder that leverages them to make predictions. In this paper, we show that the inductive biases introduced by these task-specific components can instead be learned by the ViT itself, given sufficiently large models and extensive pre-training. Leveraging these findings, we introduce the Encoder-only Mask Transformer, which repurposes the plain ViT architecture to conduct image segmentation. Using large models and strong pre-training, EoMT obtains a segmentation performance similar to state-of-the-art models that use task-specific components. At the same time, EoMT is significantly faster than these methods due to its architectural simplicity, e.g., up to 4× faster using ViT-L. Across a range of model sizes, EoMT demonstrates an optimal balance between segmentation performance and inference speed, suggesting that compute resources are better allocated to scaling the ViT itself rather than adding architectural complexity. Code will be released upon acceptance.
Poster
Jiahao He · Keren Fu · Xiaohong Liu · Qijun Zhao

[ ExHall D ]

Abstract
Existing salient object detection (SOD) models primarily resort to convolutional neural networks (CNNs) and Transformers. However, the limited receptive fields of CNNs and quadratic computational complexity of transformers both constrain the performance of current models on discovering attention-grabbing objects. The emerging state space model, namely Mamba, has demonstrated its potential to balance global receptive fields and computational complexity. Therefore, we propose a novel unified framework based on the pure Mamba architecture, dubbed saliency Mamba (Samba), to flexibly handle general SOD tasks, including RGB/RGB-D/RGB-T SOD, video SOD (VSOD), and RGB-D VSOD. Specifically, we rethink Mamba's scanning strategy from the perspective of SOD, and identify the importance of maintaining spatial continuity of salient patches within scanning sequences. Based on this, we propose a saliency-guided Mamba block (SGMB), incorporating a spatial neighboring scanning (SNS) algorithm to preserve spatial continuity of salient patches. Additionally, we propose a context-aware upsampling (CAU) method to promote hierarchical feature alignment and aggregations by modeling contextual dependencies. Experimental results show that our Samba outperforms existing methods across five SOD tasks on 21 datasets with lower computational cost, confirming the superiority of introducing Mamba to the SOD areas. Our code will be made publicly available.
Poster
Pei Geng · Jian Yang · Shanshan Zhang

[ ExHall D ]

Abstract
Human-Object Interaction (HOI) detection aims to predict the <Human, Interaction, Object> triplets, where the core challenge lies in recognizing the interaction of each human-object pair. Despite recent progress thanks to more advanced model architectures, HOI performance remains unsatisfactory. In this work, we first perform some failure analysis and find that the accuracy for the no-interaction category is extremely low, largely hindering the improvement of overall performance. We further look into the error types and find the mis-classification between no-interaction and with-interaction ones can be handled by human-object relation priors. Specifically, to better distinguish no-interaction from direct interactions, we propose a 3D location prior, which indicates the distance between human and object; as for no-interaction vs. indirect interactions, we propose a gaze area prior, which denotes whether the human can see the object or not. The above two types of human-object relation priors are represented by text and are combined with the original visual features, generating multi-modal cues for interaction recognition. Experimental results on the HICO-DET and V-COCO datasets demonstrate that our proposed human-object relation priors are effective and our method HORP surpasses previous methods under various settings and scenarios. In particular, the usage of our priors significantly enhances the model's recognition ability for the no-interaction …
Poster
Yifei Qian · Zhongliang Guo · Bowen Deng · Chun Tong Lei · Shuai Zhao · Chun Pong Lau · Xiaopeng Hong · Michael Pound

[ ExHall D ]

Abstract
Zero-shot object counting aims to count instances of arbitrary object categories specified by text descriptions. Existing methods typically rely on vision-language models like CLIP, but often exhibit limited sensitivity to text prompts. We present T2ICount, a one-step diffusion-based framework that leverages rich prior knowledge and fine-grained visual understanding from pretrained diffusion models. While one-step denoising ensures efficiency, it leads to weakened text sensitivity. To address this challenge, we propose a Hierarchical Semantic Correction Module that progressively refines text-image feature alignment, and a Representational Regional Coherence Loss that provides reliable supervision signals by leveraging the cross-attention maps extracted from the denoising U-Net. Furthermore, we observe that current benchmarks mainly focus on majority objects in images, potentially masking models' text sensitivity. To address this, we contribute a challenging re-annotated subset of FSC147 for better evaluation of text-guided counting ability. Extensive experiments demonstrate that our method achieves superior performance across different benchmarks. Code will be made publicly available.
Poster
Ziyu Zhao · Xiaoguang Li · Lingjia Shi · Nasrin Imanpour · Song Wang

[ ExHall D ]

Abstract
Open-vocabulary semantic segmentation aims to segment images into distinct semantic regions for both seen and unseen categories at the pixel level. Current methods utilize text embeddings from pre-trained vision-language models like CLIP but struggle with the inherent domain gap between image and text embeddings, even after extensive alignment during training. Additionally, relying solely on deep text-aligned features limits shallow-level feature guidance, which is crucial for detecting small objects and fine details, ultimately reducing segmentation accuracy. To address these limitations, we propose a dual prompting framework, DPSeg, for this task. Our approach combines dual-prompt cost volume generation, a cost volume-guided decoder, and a semantic-guided prompt refinement strategy that leverages our dual prompting scheme to mitigate alignment issues in visual prompt generation. By incorporating visual embeddings from a visual prompt encoder, our approach reduces the domain gap between text and image embeddings while providing multi-level guidance through shallow features. Extensive experiments demonstrate that our method significantly outperforms existing state-of-the-art approaches on multiple public datasets.
Poster
Srinivasa Rao Nandam · Sara Atito · Zhenhua Feng · Josef Kittler · Muhammad Awais

[ ExHall D ]

Abstract
Foundation models like CLIP and ALIGN have transformed few-shot and zero-shot vision applications by fusing visual and textual data, yet the integrative few-shot classification and segmentation (FS-CS) task primarily leverages visual cues, overlooking the potential of textual support. In FS-CS scenarios, ambiguous object boundaries and overlapping classes often hinder model performance, as limited visual data struggles to fully capture high-level semantics. To bridge this gap, we present a novel multi-modal FS-CS framework that integrates textual cues into support data, facilitating enhanced semantic disambiguation and fine-grained segmentation. Our approach first investigates the unique contributions of exclusive text-based support, using only class labels to achieve FS-CS. This strategy alone achieves performance competitive with vision-only methods on FS-CS tasks, underscoring the power of textual cues in few-shot learning. Building on this, we introduce a dual-modal prediction mechanism that synthesizes insights from both textual and visual support sets, yielding robust multi-modal predictions. This integration significantly elevates FS-CS performance, with classification and segmentation improvements of +3.7/6.6% (1-way 1-shot) and +8.0/6.5% (2-way 1-shot) on COCO-20i, and +2.2/3.8% (1-way 1-shot) and +4.3/4.0% (2-way 1-shot) on Pascal-5i. Additionally, in weakly supervised FS-CS settings, our method surpasses visual-only benchmarks using textual support exclusively, further enhanced by our dual-modal predictions. …
Poster
Guoyu Yang · Yuan Wang · Daming Shi · Yanzhong Wang

[ ExHall D ]

Abstract
Recent real-time semantic segmentation models, whether single-branch or multi-branch, achieve good performance and speed. However, their speed is limited by multi-path blocks, and some depend on high-performance teacher models for training. To overcome these issues, we propose Golden Cudgel Network (GCNet). Specifically, GCNet uses vertical multi-convolutions and horizontal multi-paths for training, which are reparameterized into a single convolution for inference, optimizing both performance and speed. This design allows GCNet to self-enlarge during training and self-contract during inference, effectively becoming a "teacher model" without needing external ones. Experimental results show that GCNet outperforms existing state-of-the-art models in terms of performance and speed on the Cityscapes, CamVid, and Pascal VOC 2012 datasets. The code is available at x.
Poster
Hyeokjun Kweon · Kuk-Jin Yoon

[ ExHall D ]

Abstract
Instance segmentation traditionally relies on dense pixel-level annotations, making it costly and labor-intensive. To alleviate this burden, weakly supervised instance segmentation utilizes cost-effective weak labels, such as image-level tags, points, and bounding boxes. However, existing approaches typically focus on a single type of weak label, overlooking the cost-efficiency potential of combining multiple types. In this paper, we introduce WISH, a novel heterogeneous framework for weakly supervised instance segmentation that integrates diverse weak label types within a single model. WISH unifies heterogeneous labels by leveraging SAM’s prompt latent space through a multi-stage matching strategy, effectively compensating for the lack of spatial information in class tags. Extensive experiments on Pascal VOC and COCO demonstrate that our framework not only surpasses existing homogeneous weak supervision methods but also achieves superior results in heterogeneous settings with equivalent annotation costs.
Poster
Can Küçüksözen · Yucel Yemez

[ ExHall D ]

Abstract
We propose the Compact Clustering Attention (COCA) layer, an effective building block that introduces a hierarchical strategy for object-centric representation learning while solving the unsupervised object discovery task on single images. COCA is an attention-based clustering module capable of extracting object-centric representations from multi-object scenes, when cascaded into a bottom-up hierarchical network architecture, referred to as COCA-Net. At its core, COCA utilizes a novel clustering algorithm that leverages the physical concept of compactness to highlight distinct object centroids in a scene, providing a spatial inductive bias. Thanks to this strategy, COCA-Net generates high-quality segmentation masks on both the decoder side and, notably, the encoder side of its pipeline. Additionally, COCA-Net is not bound by a predetermined number of object masks that it generates and handles the segmentation of background elements better than its competitors. We demonstrate COCA-Net's segmentation performance on six widely adopted datasets, achieving superior or competitive results against the state-of-the-art models across nine different evaluation metrics.
Poster
Mingfu Liang · Jiahuan Zhou · Xu Zou · Ying Wu

[ ExHall D ]

Abstract
Existing progress in object keypoint estimation primarily benefits from the conventional supervised learning paradigm based on numerous data labeled with pre-defined keypoints. However, these well-trained models can hardly detect undefined new keypoints at test time, which largely hinders their feasibility for diverse downstream tasks. To handle this, various solutions have been explored but still suffer from either limited generalizability or transferability. Therefore, in this paper, we explore a novel keypoint learning paradigm in which we only annotate new keypoints in the new data and incrementally train the model, without retaining any old data, called Incremental object Keypoint Learning (IKL). A two-stage learning scheme is developed as a novel baseline tailored to IKL. In the first Knowledge Association stage, given the data labeled with only new keypoints, an auxiliary KA-Net is trained to automatically associate the old keypoints to these new ones based on their spatial and intrinsic anatomical relations. In the second Mutual Promotion stage, based on a keypoint-oriented spatial distillation loss, we jointly leverage the auxiliary KA-Net and the old model for knowledge consolidation to mutually promote the estimation of all old and new keypoints. Owing to the investigation of the correlations between new and old keypoints, our proposed method …
Poster
Shuo Li · Fang Liu · Zehua Hao · Xinyi Wang · Lingling Li · Xu Liu · Puhua Chen · Wenping Ma

[ ExHall D ]

Abstract
With its powerful visual-language alignment capability, CLIP performs well in zero-shot and few-shot learning tasks. However, we found in experiments that CLIP's logits suffer from serious inter-class confusion problems in downstream tasks, and the ambiguity between categories seriously affects the accuracy. To address this challenge, we propose a novel method called Logits DeConfusion, which effectively learns and eliminates inter-class confusion in logits by combining our Multi-level Adapter Fusion (MAF) module with our Inter-Class Deconfusion (ICD) module. First, MAF extracts features from different levels of the CLIP image encoder and fuses them uniformly to enhance feature representation. Second, ICD learnably eliminates inter-class confusion in logits with a residual structure. Experimental results on multiple benchmarks show that our method can significantly improve the classification performance and alleviate the category confusion problem.
Poster
Luyao Tang · Chaoqi Chen · Yuxuan Yuan · Zeyu Zhang · Yue Huang · Kun Zhang

[ ExHall D ]

Abstract
Although foundation models (FMs) claim to be powerful, their generalization ability significantly decreases when faced with distribution shifts, weak supervision, or malicious attacks in the open world. On the other hand, most domain generalization or adversarial fine-tuning methods are task-related or model-specific, ignoring the universality in practical applications and the transferability between FMs. This paper delves into the problem of generalizing FMs to the out-of-domain data. We propose a novel framework, Object-Concept-Relation Triad (OCRT), that enables FMs to extract sparse, high-level concepts and intricate relational structures from raw visual inputs. The key idea is to bind objects in visual scenes and a set of object-centric representations through unsupervised decoupling and iterative refinement. To be specific, we project the object-centric representations onto a semantic concept space that the model can readily interpret, and estimate their importance to filter out irrelevant elements. Then, a concept-based graph, which has a flexible degree, is constructed to incorporate the set of concepts and their corresponding importance, enabling the extraction of high-order factors from informative concepts and facilitating relational reasoning among these concepts. Extensive experiments demonstrate that OCRT can substantially boost the generalizability and robustness of SAM and CLIP across multiple downstream tasks.
Poster
Lei-Lei Ma · Shuo Xu · Ming-Kun Xie · Lei Wang · Dengdi Sun · Haifeng Zhao

[ ExHall D ]

Abstract
Modeling label correlations has always played a pivotal role in multi-label image classification (MLC), attracting significant attention from researchers. However, recent studies have overemphasized co-occurrence relationships among labels, which can lead to overfitting on these co-occurrence patterns, resulting in suboptimal models. To tackle this problem, we advocate balancing correlative and discriminative relationships among labels to mitigate the risk of overfitting and enhance model performance. To this end, we propose the Multi-Label Visual Prompt Tuning framework, a novel and parameter-efficient method that groups classes into multiple class subsets according to label co-occurrence and mutual exclusivity relationships, and then models them respectively to balance the two relationships. Since each group contains multiple classes, multiple prompt tokens are adopted within the Vision Transformer (ViT) to capture the correlative or discriminative label relationships within each group and to effectively learn correlative or discriminative representations for class subsets. On the other hand, each group contains multiple group-level visual representations that may correspond to multiple classes, and a mixture-of-experts (MoE) model can assign them from the group level to the label level, adaptively obtaining label-level representations that are more conducive to classification. Experiments on multiple benchmark datasets show that our proposed …
Poster
Qiyuan Dai · Hanzhuo Huang · Yu Wu · Sibei Yang

[ ExHall D ]

Abstract
Generalized Category Discovery (GCD) aims to recognize unlabeled images from known and novel classes by distinguishing novel classes from known ones, while also transferring knowledge from another set of labeled images with known classes. Existing GCD methods rely on self-supervised vision transformers such as DINO for representation learning. However, focusing solely on the global representation of the DINO CLS token introduces an inherent trade-off between discriminability and generalization. In this paper, we introduce an adaptive part discovery and learning method, called APL, which generates consistent object parts and their correspondences across different similar images using a set of shared learnable part queries and DINO part priors, without requiring any additional annotations. More importantly, we propose a novel all-min contrastive loss to learn discriminative yet generalizable part representation, which adaptively highlights discriminative object parts to distinguish similar categories for enhanced discriminability while simultaneously sharing other parts to facilitate knowledge transfer for improved generalization. Our APL can easily be incorporated into different GCD frameworks by replacing their CLS token feature with our part representations, showing significant enhancements on fine-grained datasets.
Poster
Xing Xi · Yangyang Huang · Ronghua Luo · Yu Qiu

[ ExHall D ]

Abstract
Open world perception expands traditional closed-set frameworks, which assume a predefined set of known categories, to encompass dynamic real-world environments. Open World Object Detection (OWOD) and Open Vocabulary Object Detection (OVD) are two main research directions, each addressing unique challenges in dynamic environments. However, existing studies often focus on only one of these tasks, leaving the combined challenges of OWOD and OVD largely underexplored. In this paper, we propose a novel detector, OW-OVD, which inherits the zero-shot generalization capability of OVD detectors while incorporating the ability to actively detect unknown objects and progressively optimize performance through incremental learning, as seen in OWOD detectors. To achieve this, we start with a standard OVD detector and adapt it for OWOD tasks. For attribute selection, we propose the Visual Similarity Attribute Selection (VSAS) method, which identifies the most generalizable attributes by computing similarity distributions across annotated and unannotated regions. Additionally, to ensure the diversity of attributes, we incorporate a similarity constraint in the iterative process. Finally, to preserve the standard inference process of OVD, we propose the Hybrid Attribute-Uncertainty Fusion (HAUF) method. This method combines attribute similarity with known class uncertainty to infer the likelihood of an object belonging to an unknown class. …
Poster
Haochen Li · Rui Zhang · Hantao Yao · Xin Zhang · Yifan Hao · Xinkai Song · Shaohui Peng · Yongwei Zhao · Zhao Chen · Yanjun Wu · Ling Li

[ ExHall D ]

Abstract
Domain adaptive object detection (DAOD) aims to generalize detectors trained on an annotated source domain to an unlabelled target domain. Traditional works focus on aligning visual features between domains to extract domain-invariant knowledge, and recent VLM-based DAOD methods leverage semantic information provided by the textual encoder to supplement domain-specific features for each domain. However, they overlook the role of semantic information in guiding the learning of visual features that are beneficial for adaptation. To solve this problem, we propose semantic entropy to quantify the semantic information contained in visual features, and design SEmantic ENtropy guided Domain-aware Attention (SEEN-DA) to adaptively refine visual features with the semantic information of the two domains. Semantic entropy reflects the importance of features based on semantic information, and can serve as attention to select discriminative visual features and suppress semantically irrelevant redundant information. Guided by semantic entropy, we introduce domain-aware attention modules into the visual encoder in SEEN-DA, which use an inter-domain attention branch to extract domain-invariant features and eliminate redundant information, and an intra-domain attention branch to supplement the domain-specific semantic information discriminative on each domain. Comprehensive experiments validate the effectiveness of SEEN-DA, demonstrating significant improvements in cross-domain object detection performance.
Poster
Zhaohu Xing · Lihao Liu · Yijun Yang · Hongqiu Wang · Tian Ye · Sixiang Chen · Wenxue Li · Guang Liu · Lei Zhu

[ ExHall D ]

Abstract
Mirror detection is a challenging task because a mirror's visual appearance varies depending on the reflected content. Due to limited annotated data, current methods fail to generalize well to diverse mirror scenes. Semi-supervised learning with large-scale unlabeled data can improve generalization on mirror detection, but such methods often suffer from unreliable pseudo-labels due to distribution differences between labeled and unlabeled data, thereby affecting the learning process. To address this issue, we first collect a large-scale dataset of approximately 0.4 million mirror-related images from the internet, significantly expanding the data scale for mirror detection. To effectively exploit this unlabeled dataset, we propose the first semi-supervised framework (namely an iterative data engine) consisting of four steps: (1) mirror detection model training, (2) pseudo-label prediction, (3) dual guidance scoring, and (4) selection of highly reliable pseudo-labels. In each iteration of the data engine, we employ a geometric accuracy scoring approach to assess pseudo-labels based on multiple segmentation metrics, and design a multi-modal agent-driven semantic scoring approach to enhance the semantic perception of pseudo-labels. These two scoring approaches can effectively improve the reliability of pseudo-labels by selecting unlabeled samples with higher scores. Our method demonstrates promising performance …
Poster
Beier Zhu · Jiequan Cui · Hanwang Zhang · Chi Zhang

[ ExHall D ]

Abstract
While image-text foundation models have succeeded across diverse downstream tasks, they still face challenges in the presence of spurious correlations between the input and label. To address this issue, we propose a simple three-step approach--Project-Probe-Aggregate (PPA)--that enables parameter-efficient fine-tuning of foundation models without relying on group annotations. Building upon the failure-based debiasing scheme, our method, PPA, improves its two key components: minority sample identification and the robust training algorithm. Specifically, we first train biased classifiers by projecting image features onto the nullspace of class proxies from text encoders. Next, we infer group labels using the biased classifier and probe group targets with prior correction. Finally, we aggregate group weights of each class to produce the debiased classifier. Our theoretical analysis shows that PPA enhances minority group identification and is Bayes optimal for minimizing the balanced group error, mitigating spurious correlations. Extensive experimental results confirm the effectiveness of PPA: it improves average worst-group accuracy over the state of the art while requiring less than 0.01% tunable parameters and no training group labels.
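The first step, training a biased classifier on features projected onto the nullspace of the text-encoder class proxies, amounts to applying the projector P = I - T^T (T T^T)^{-1} T. Below is a small sketch with random tensors standing in for CLIP features; it is our illustration, not the authors' code.

```python
# Nullspace projection of image features against class (text) proxies.
import torch

def nullspace_projector(class_proxies: torch.Tensor) -> torch.Tensor:
    """class_proxies: (num_classes, dim) text embeddings. Returns a (dim, dim) projector P
    with P @ t = 0 for every proxy t, i.e. P = I - T^T (T T^T)^{-1} T."""
    T = class_proxies
    gram_inv = torch.linalg.pinv(T @ T.T)
    return torch.eye(T.shape[1]) - T.T @ gram_inv @ T

num_classes, dim = 10, 512
text_proxies = torch.nn.functional.normalize(torch.randn(num_classes, dim), dim=-1)
image_feats = torch.nn.functional.normalize(torch.randn(256, dim), dim=-1)

P = nullspace_projector(text_proxies)
projected = image_feats @ P.T                  # class-proxy directions removed
# Sanity check: projected features are orthogonal to every class proxy.
print(torch.allclose(projected @ text_proxies.T, torch.zeros(256, num_classes), atol=1e-4))
# A lightweight linear probe trained on `projected` would serve as the biased classifier.
```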
Poster
Kai Zhao · zhihao zhuang · Miao Zhang · Chenjuan Guo · Yang Shu · Bin Yang

[ ExHall D ]

Abstract
Model quantization is an effective way to compress deep neural networks and accelerate inference on edge devices. Existing quantization methods usually require original data for calibration during the compression process, which may be inaccessible due to privacy issues. A common way is to generate calibration data to mimic the original data. However, the generators in these methods have the mode collapse problem, making them unable to synthesize diverse data. To solve this problem, we leverage the information from the full-precision model and enhance both inter-class and intra-class diversity for generating better calibration data, by devising a multi-layer feature mixer and normalizing-flow-based attention. Besides, novel regularization losses are proposed to make the generator produce diverse data with more patterns from the perspective of activated feature values, and to let the quantized model learn better clip ranges adaptive to our diverse calibration data. Extensive experiments show that our method achieves state-of-the-art quantization results for both Transformer and CNN architectures. In addition, we visualize the generated data to verify that our strategies can effectively handle the mode collapse issue. Our code is available at https://anonymous.4open.science/r/DFQ-84E6 and will be made publicly available.
Poster
Zhou Yang · Mingtao Feng · Tao Huang · Fangfang Wu · Weisheng Dong · Xin Li · Guangming Shi

[ ExHall D ]

Abstract
Recent approaches, such as data augmentation, adversarial training, and transfer learning, have shown potential in addressing the issue of performance degradation caused by distributional shifts. However, they typically demand careful design in terms of data or models and lack awareness of the impact of distributional shifts. In this paper, we observe that classification errors arising from distribution shifts tend to cluster near the true values, suggesting that misclassifications commonly occur in semantically similar, neighboring categories. Furthermore, robust advanced vision foundation models maintain larger inter-class distances while preserving semantic consistency, making them less vulnerable to such shifts. Building on these findings, we propose a new method called GFN (Gain From Neighbors), which uses gradient priors from neighboring classes to perturb input images and incorporates an inter-class distance-weighted loss to improve class separation. This approach encourages the model to learn more resilient features from data prone to errors, enhancing its robustness against shifts in diverse settings. In extensive experiments across various model architectures and benchmark datasets, GFN consistently demonstrates superior performance. For instance, compared to the current state-of-the-art TAPADL method, our approach achieves a higher corruption robustness of 41.4% on ImageNet-C (+2.3%), without requiring additional parameters and using only minimal data.
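One plausible reading of "perturb input images with gradient priors from neighboring classes" is a small signed-gradient step toward each image's runner-up class, after which training proceeds on the perturbed batch. The sketch below implements only that reading; it omits the inter-class distance-weighted loss and should not be taken as the authors' GFN recipe.

```python
# Hedged sketch: nudge each image toward its most confusable (runner-up) class, then train
# on the perturbed inputs. Model, shapes, and epsilon are toy placeholders.
import torch
import torch.nn.functional as F

def neighbor_perturb(model, images, labels, eps=2 / 255):
    images = images.clone().requires_grad_(True)
    logits = model(images)
    # Runner-up class = nearest semantic neighbor according to the current model.
    runner_up = logits.scatter(1, labels.unsqueeze(1), float("-inf")).argmax(dim=1)
    loss_towards_neighbor = F.cross_entropy(logits, runner_up)
    grad, = torch.autograd.grad(loss_towards_neighbor, images)
    # Step *down* the neighbor loss, i.e. toward the neighboring class decision region.
    return (images - eps * grad.sign()).detach().clamp(0, 1)

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x, y = torch.rand(8, 3, 32, 32), torch.randint(0, 10, (8,))
x_adv = neighbor_perturb(model, x, y)
loss = F.cross_entropy(model(x_adv), y)        # train on error-prone, perturbed inputs
loss.backward()
```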
Poster
HAN SUN · Yunkang Cao · Hao Dong · Olga Fink

[ ExHall D ]

Abstract
Visual anomaly detection (AD) presents significant challenges due to the scarcity of anomalous data samples. While numerous works have been proposed to synthesize anomalous samples, these synthetic anomalies often lack authenticity or require extensive training data, limiting their applicability in real-world scenarios. In this work, we propose Anomaly Anything (AnomalyAny), a novel framework that leverages Stable Diffusion (SD)'s image generation capabilities to generate diverse and realistic unseen anomalies. By conditioning on a single normal sample during test time, AnomalyAny is able to generate unseen anomalies for arbitrary object types with text descriptions. Within AnomalyAny, we propose attention-guided anomaly optimization to direct SD’s attention on generating hard anomaly concepts. Additionally, we introduce prompt-guided anomaly refinement, incorporating detailed descriptions to further improve the generation quality. Extensive experiments on MVTec AD and VisA datasets demonstrate AnomalyAny's ability in generating high-quality unseen anomalies and its effectiveness in enhancing downstream AD performance.
Poster
Lei Fan · Dongdong Fan · Zhiguang Hu · Yiwen Ding · Donglin Di · Kai Yi · Maurice Pagnucco · Yang Song

[ ExHall D ]

Abstract
We present MANTA, a visual-text anomaly detection dataset for tiny objects. The visual component comprises over 137.3K images across 38 object categories spanning five typical domains, of which 8.6K images are labeled as anomalous with pixel-level annotations. Each image is captured from five distinct viewpoints to ensure comprehensive object coverage. The text component consists of two subsets: Declarative Knowledge, including 875 words that describe common anomalies across various domains and specific categories, with detailed explanations for <what, why, how>, including causes and visual characteristics; and Constructivist Learning, providing 2K multiple-choice questions with varying levels of difficulty, each paired with images and corresponding answer explanations. We also propose a baseline for visual-text tasks and conduct extensive benchmarking experiments to evaluate advanced methods across different settings, highlighting the challenges and efficacy of our dataset.
Poster
Shilhora Akshay · Niveditha Lakshmi Narasimhan · Jacob George · Vineeth Balasubramanian

[ ExHall D ]

Abstract
Anomaly detection and localization remain pivotal challenges in computer vision, with applications ranging from industrial inspection to medical diagnostics. While current supervised methods offer high precision, they are often impractical due to the scarcity of annotated data and the infrequent occurrence of anomalies. Recent advancements in unsupervised approaches, particularly reconstruction-based methods, have addressed these issues by training models exclusively on normal data, enabling them to identify anomalies during inference. However, these methods frequently rely on auxiliary networks or specialized adaptations, which can limit their robustness and practicality. This work introduces the Latent Anomaly Schrodinger Bridge (LASB), a unified unsupervised anomaly detection model that operates entirely in the latent space without requiring additional networks or custom modifications. LASB transforms anomaly images into normal images by preserving structural integrity across varying anomaly classes, lighting, and pose conditions, making it highly robust and versatile. Unlike previous methods, LASB does not focus solely on reconstructing anomaly features but emphasizes anomaly transformation, achieving smooth anomaly-to-normal image conversions. Our method achieves state-of-the-art performance on both the MVTec-AD and VisA datasets, excelling in detection and localization tasks.
Poster
Yoon Gyo Jung · Jaewoo Park · Jaeho Yoon · Kuan-Chuan Peng · Wonchul Kim · Andrew Beng Jin Teoh · Octavia Camps

[ ExHall D ]

Abstract
We aim to solve unsupervised anomaly detection in a practical challenging environment where the normal dataset is both contaminated with defective regions and its product class distribution is tailed but unknown. We observe that existing models suffer from tail-versus-noise trade-off where if a model is robust against pixel noise, then its performance deteriorates on tail class samples, and vice versa. To mitigate the issue, we handle the tail class and noise samples independently. To this end, we propose TailSampler, a novel class size predictor that estimates the class cardinality of samples based on a symmetric assumption on the class-wise distribution of embedding similarities. TailSampler can be utilized to sample the tail class samples exclusively, allowing to handle them separately. Based on these facets, we build a memory-based anomaly detection model TailedCore, whose memory both well captures tail class information and is noise-robust. We extensively validate the effectiveness of TailedCore on the unsupervised long-tail noisy anomaly detection setting, and show that TailedCore outperforms the state-of-the-art in most settings.
Poster
Shubhang Bhatnagar · Narendra Ahuja

[ ExHall D ]

Abstract
Deep metric learning (DML) involves training a network to learn a semantically meaningful representation space. Many current approaches mine n-tuples of examples and model interactions within each tuple. We present a novel, compositional DML model that, instead of operating on tuples, represents the influence of each example (embedding) by a continuous potential field and superposes the fields to obtain their combined global potential field. We use attractive/repulsive potential fields to represent interactions among embeddings from images of the same/different classes. Contrary to typical learning methods, where the mutual influence of samples is proportional to their distance, we enforce a reduction of such influence with distance, leading to a decaying field. We show that such decay helps improve performance on real-world datasets with large intra-class variations and label noise. Like other proxy-based methods, we also use proxies to succinctly represent sub-populations of examples. We evaluate our method on three standard DML benchmarks (Cars-196, CUB-200-2011, and SOP), where it outperforms state-of-the-art baselines.
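A toy instantiation of the decaying-field idea: every embedding exerts an attractive (same-class) or repulsive (different-class) influence on the others, and a Gaussian factor makes that influence fade with distance rather than grow with it. The weighting choices below are ours and purely illustrative; the paper's exact field and proxy formulation may differ.

```python
# Illustrative decaying potential-field loss for metric learning.
import torch

def potential_field_loss(emb, labels, sigma=1.0):
    """emb: (batch, dim) embeddings; labels: (batch,) integer class ids."""
    diff = emb.unsqueeze(1) - emb.unsqueeze(0)                    # (B, B, dim)
    dist = torch.sqrt((diff ** 2).sum(-1) + 1e-12)                # pairwise distances (eps avoids NaN grads)
    same = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    off_diag = 1.0 - torch.eye(len(labels), device=emb.device)
    decay = torch.exp(-dist ** 2 / (2 * sigma ** 2)) * off_diag   # influence fades with distance
    attract = same * decay * dist ** 2                            # pull nearby same-class samples together
    repel = (1 - same) * decay * torch.clamp(2.0 - dist, min=0) ** 2   # push nearby negatives apart
    return (attract + repel).sum() / (len(labels) ** 2)

x = torch.randn(32, 128, requires_grad=True)
emb = torch.nn.functional.normalize(x, dim=-1)
labels = torch.randint(0, 8, (32,))
loss = potential_field_loss(emb, labels)
loss.backward()
print(float(loss))
```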
Poster
Yanghao Wang · Long Chen

[ ExHall D ]

Abstract
Data Augmentation (DA), i.e., synthesizing faithful and diverse samples to expand the original training set, is a prevalent and effective strategy to improve the performance of various data-scarce tasks. With their powerful image generation ability, diffusion-based DA methods have shown strong performance gains on different image classification benchmarks. In this paper, we analyze today's diffusion-based DA methods, and argue that they cannot account for both faithfulness and diversity, two critical keys for generating high-quality samples and boosting classification performance. To this end, we propose a novel Diffusion-based DA method: Diff-II. Specifically, it consists of three steps: 1) Category concept learning: learning concept embeddings for each category. 2) Inversion interpolation: calculating the inversion for each image, and conducting circle interpolation for two randomly sampled inversions from the same category. 3) Two-stage denoising: using different prompts to generate synthesized images in a coarse-to-fine manner. Extensive experiments on various data-scarce image classification tasks (e.g., few-shot, long-tailed, and out-of-distribution classification) have demonstrated its effectiveness over state-of-the-art diffusion-based DA methods.
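Step 2 above interpolates between two diffusion inversions of the same category. One natural way to realize a "circle interpolation" is spherical interpolation (slerp) between the two latents, sketched below with random tensors standing in for real DDIM inversions; this is our reading, not the released Diff-II code.

```python
# Spherical interpolation between two inversion latents of the same category.
import torch

def slerp(z0: torch.Tensor, z1: torch.Tensor, alpha: float) -> torch.Tensor:
    """Spherical interpolation between two latents, alpha in [0, 1]."""
    z0_flat, z1_flat = z0.flatten(), z1.flatten()
    cos_theta = torch.clamp(
        torch.dot(z0_flat, z1_flat) / (z0_flat.norm() * z1_flat.norm()), -1.0, 1.0
    )
    theta = torch.arccos(cos_theta)
    if theta < 1e-4:                           # nearly parallel: fall back to linear interpolation
        return (1 - alpha) * z0 + alpha * z1
    return (torch.sin((1 - alpha) * theta) * z0 + torch.sin(alpha * theta) * z1) / torch.sin(theta)

# Two stand-in inversions of 64x64 latent maps from the same category.
inv_a, inv_b = torch.randn(4, 64, 64), torch.randn(4, 64, 64)
mixed = slerp(inv_a, inv_b, alpha=0.3)         # would then be fed to the two-stage denoiser
print(mixed.shape)                              # torch.Size([4, 64, 64])
```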
Poster
Shaobo Wang · Yicun Yang · Zhiyuan Liu · Chenghao Sun · Xuming Hu · Conghui He · Linfeng Zhang

[ ExHall D ]

Abstract
Dataset distillation has emerged as a powerful approach for reducing data requirements in deep learning. Among various methods, distribution matching-based approaches stand out for their balance of computational efficiency and strong performance. However, existing distance metrics used in distribution matching often fail to accurately capture distributional differences, leading to unreliable measures of discrepancy. In this paper, we reformulate dataset distillation as a minimax optimization problem and introduce Neural Characteristic Function Discrepancy (NCFD), a comprehensive and theoretically grounded metric for measuring distributional differences. NCFD leverages the Characteristic Function (CF) to encapsulate full distributional information, employing a neural network to optimize the sampling strategy for the CF's frequency arguments, thereby maximizing the discrepancy to enhance distance estimation. Simultaneously, we minimize the difference between real and synthetic data under this optimized NCFD measure. Our approach, termed Neural Characteristic Function Matching (NCFM), inherently aligns the phase and amplitude of neural features in the complex plane for both real and synthetic data, achieving a balance between realism and diversity in synthetic samples. Experiments demonstrate that our method achieves significant performance gains over state-of-the-art methods on both low- and high-resolution datasets. Notably, we achieve a 20.5% accuracy boost on ImageSquawk. Our method also reduces GPU memory …
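For intuition, the characteristic function of a distribution is phi(t) = E[exp(i <t, x>)]. The sketch below compares real and synthetic samples through their empirical characteristic functions at fixed random frequencies; in NCFM the frequencies are produced by a learned sampling network and the measure is optimized adversarially, which this illustration omits.

import numpy as np

def empirical_cf(x, freqs):
    """x: (N, D) samples; freqs: (M, D) frequency arguments t. Returns (M,) complex values."""
    return np.exp(1j * x @ freqs.T).mean(axis=0)      # estimate of E[exp(i <t, x>)]

def cf_discrepancy(real, synth, freqs):
    diff = empirical_cf(real, freqs) - empirical_cf(synth, freqs)
    return np.mean(np.abs(diff) ** 2)                 # mean squared CF difference

rng = np.random.default_rng(0)
real = rng.normal(size=(512, 64))
synth = rng.normal(loc=0.3, size=(64, 64))
freqs = rng.normal(size=(256, 64))                    # fixed draws stand in for the learned sampler
print(cf_discrepancy(real, synth, freqs))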
Poster
Wenliang Zhong · Haoyu Tang · Qinghai Zheng · Mingzhu Xu · Yupeng Hu · Weili Guan

[ ExHall D ]

Abstract
The rapid evolution of deep learning and large language models has led to an exponential growth in the demand for training data, prompting the development of Dataset Distillation methods to address the challenges of managing large datasets. Among these, Matching Training Trajectories (MTT) has been a prominent approach, which replicates the training trajectory of an expert network on real data with a synthetic dataset. However, our investigation found that this method suffers from three significant limitations: 1. Instability of expert trajectory generated by Stochastic Gradient Descent (SGD); 2. Low convergence speed of the distillation process; 3. High storage consumption of the expert trajectory. To address these issues, we offer a new perspective on understanding the essence of Dataset Distillation and MTT through a simple transformation of the objective function, and introduce a novel method called Matching Convexified Trajectory (MCT), which aims to provide better guidance for the student trajectory. MCT creates convex combinations of expert trajectories by selecting a few expert models, guiding student networks to converge quickly and stably. This trajectory is not only easier to store, but also enables continuous sampling strategies during the distillation process, ensuring thorough learning and fitting of the entire expert trajectory. The comprehensive …
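The core object in MCT is a convex combination of a few expert checkpoints. A minimal PyTorch sketch of that combination step is given below; how the experts are selected and how the combined trajectory supervises the student are omitted, and the coefficients are placeholders.

import torch

def convex_combine(state_dicts, coeffs):
    """state_dicts: expert checkpoints of one architecture; coeffs: weights on the simplex."""
    coeffs = torch.tensor(coeffs, dtype=torch.float32)
    assert (coeffs >= 0).all() and torch.isclose(coeffs.sum(), torch.tensor(1.0))
    combined = {}
    for name in state_dicts[0]:
        combined[name] = sum(c * sd[name].float() for c, sd in zip(coeffs, state_dicts))
    return combined

model = torch.nn.Linear(8, 2)
experts = [torch.nn.Linear(8, 2).state_dict() for _ in range(3)]   # stand-ins for expert checkpoints
model.load_state_dict(convex_combine(experts, [0.5, 0.3, 0.2]))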
Poster
Felipe del Rio · Alain Raymond · Daniel Florea · Rodrigo Toro Icarte · Julio Hurtado · Cristian Buc Calderon · Alvaro Soto

[ ExHall D ]

Abstract
Deep neural networks (DNNs) struggle at systematic generalization (SG). Several studies have evaluated the possibility of promoting SG through novel architectures, loss functions, or training methodologies. Few studies, however, have focused on the role of training data properties in promoting SG. In this work, we investigate the impact of certain data distributional properties, as inductive biases, on the SG ability of a multi-modal language model. To this end, we study three different properties. First, data diversity, instantiated as an increase in the possible values a latent property in the training distribution may take. Second, burstiness, where we probabilistically restrict the number of possible values of latent factors on particular inputs during training. Third, latent intervention, where a particular latent factor is altered randomly during training. We find that all three factors significantly enhance SG, with diversity contributing an 89% absolute increase in accuracy for the most affected property. Through a series of experiments, we test various hypotheses to understand why these properties promote SG. Finally, we find that Normalized Mutual Information (NMI) between latent attributes in the training distribution is strongly predictive of out-of-distribution generalization. We find that a mechanism by which lower NMI induces SG is …
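The NMI quantity mentioned at the end can be computed directly for discrete latent factors; the sketch below uses scikit-learn on synthetic factors (the factor names are illustrative).

import numpy as np
from sklearn.metrics import normalized_mutual_info_score

rng = np.random.default_rng(0)
shape = rng.integers(0, 5, size=10_000)                            # latent factor 1
color_independent = rng.integers(0, 5, size=10_000)                # unrelated factor
color_correlated = (shape + rng.integers(0, 2, size=10_000)) % 5   # factor tied to shape

print(normalized_mutual_info_score(shape, color_independent))      # close to 0
print(normalized_mutual_info_score(shape, color_correlated))       # clearly above 0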
Poster
Seokju Yun · Seunghye Chae · Dongheon Lee · Youngmin Ro

[ ExHall D ]

Abstract
Domain generalization (DG) aims to adapt a model using one or multiple source domains to ensure robust performance in unseen target domains. Recently, Parameter-Efficient Fine-Tuning (PEFT) of foundation models has shown promising results for the DG problem. Nevertheless, existing PEFT methods still struggle to strike a balance between preserving the generalizable components of the pre-trained model and learning task-specific features. To gain insights into the distribution of generalizable components, we begin by analyzing the pre-trained weights through the lens of singular value decomposition. Building on these insights, we introduce Singular Value Decomposed Low-Rank Adaptation (SoRA), an approach that selectively tunes minor singular components while keeping the residual parts frozen. SoRA effectively retains the generalization ability of the pre-trained model while efficiently acquiring task-specific skills. Furthermore, we freeze domain-generalizable blocks and employ an annealing weight decay strategy, thereby achieving an optimal balance in the delicate trade-off between generalizability and discriminability. SoRA attains state-of-the-art results on multiple benchmarks spanning domain-generalized semantic segmentation and object detection. In addition, our method introduces no additional inference overhead or regularization loss, maintains compatibility with any backbone or head, and is designed to be versatile, allowing easy integration into a wide range of …
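The decomposition step behind SoRA can be pictured as follows: split the pre-trained weight by SVD, freeze the dominant singular directions, and expose only the minor singular components for tuning. The rank threshold r_major and the choice to tune only the minor singular values are illustrative assumptions in this PyTorch sketch.

import torch

def split_by_singular_values(W, r_major):
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    W_major = U[:, :r_major] @ torch.diag(S[:r_major]) @ Vh[:r_major]    # kept frozen
    minor = {
        "U": U[:, r_major:].clone(),                                     # frozen directions
        "S": torch.nn.Parameter(S[r_major:].clone()),                    # trainable minor components
        "Vh": Vh[r_major:].clone(),
    }
    return W_major, minor

def recompose(W_major, minor):
    return W_major + minor["U"] @ torch.diag(minor["S"]) @ minor["Vh"]

W = torch.randn(768, 768)
W_major, minor = split_by_singular_values(W, r_major=700)
print((W - recompose(W_major, minor)).abs().max())   # near zero before any tuning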
Poster
Hao Zhu · Yifei Zhang · Junhao Dong · Piotr Koniusz

[ ExHall D ]

Abstract
Continual learning requires models to learn tasks sequentially while maintaining a delicate balance between stability (retaining knowledge of previous tasks) and plasticity (adapting to new tasks). A key challenge is preventing interference between tasks, where learning new tasks degrades performance on previously learned ones. Recent approaches have leveraged parameter-efficient fine-tuning (PEFT) methods, which adapt pre-trained models by injecting a small number of learnable parameters. However, existing PEFT-based continual learning methods like InfLoRA face fundamental limitations: they rely on complex optimization procedures to learn orthogonal task-specific spaces, and finding such spaces becomes increasingly difficult as tasks accumulate. We propose a novel bilinear reformulation that fundamentally reimagines task separation through fixed orthogonal bases. Our key insight is that by expanding the parameter space quadratically through two fixed bases, we can achieve "almost orthogonal" task subspaces probabilistically, eliminating the need for explicit interference-elimination procedures. We provide theoretical guarantees that this approach reduces the probability of task interference from $\mathcal{O}((k/d)^2)$ to $\mathcal{O}((k/d^2)^2)$, ensuring reliable task separation without complex optimization. Through extensive experiments on ImageNet-R, CIFAR100, and DomainNet, we validate our theoretical bounds and demonstrate state-of-the-art performance with a reduced parameter count.
Poster
Haoyang Li · Liang Wang · Chao Wang · Jing Jiang · Yan Peng · Guodong Long

[ ExHall D ]

Abstract
The Base-New Trade-off (BNT) problem universally exists during the optimization of CLIP-based prompt tuning, where continuous fine-tuning on base (target) classes leads to a simultaneous decrease of generalization ability on new (unseen) classes. Existing approaches attempt to regulate the prompt tuning process to balance the BNT by appending constraints. However, imposed on the same target prompt, these constraints fail to fully avert the mutual exclusivity between the optimization directions for base and new classes. As a novel solution to this challenge, we propose the plug-and-play Dual-Prompt Collaboration (DPC) framework, the first to decouple the optimization processes of the base and new tasks at the prompt level. Specifically, we clone a learnable parallel prompt based on the backbone prompt, and introduce a variable Weighting-Decoupling framework to independently control the optimization directions of the dual prompts specific to base or new tasks, thus avoiding conflicts in generalization. Meanwhile, we propose a Dynamic Hard Negative Optimizer, which utilizes the dual prompts to construct a more challenging optimization task on base classes for enhancement. For interpretability, we prove the feature channel invariance of the prompt vector during the optimization process, providing theoretical support for the Weighting-Decoupling of DPC. Extensive experiments on multiple backbones demonstrate that DPC can significantly improve …
Poster
Guowei Wang · Changxing Ding

[ ExHall D ]

Abstract
Long-term test-time adaptation (TTA) is a challenging task due to error accumulation. Recent approaches tackle this issue by actively labeling a small proportion of samples in each batch, yet the annotation burden quickly grows as the batch number increases. In this paper, we investigate how to achieve effortless active labeling so that at most one sample is selected for annotation in each batch. First, we annotate the most valuable sample in each batch from a single-step optimization perspective in the TTA context. In this scenario, the samples lying on the border between the source- and target-domain data distributions are considered the most feasible for the model to learn in one iteration. Then, we introduce an efficient strategy to identify these samples using feature perturbation. Second, we discover that the gradient magnitudes produced by the annotated and unannotated samples differ significantly. Therefore, we propose balancing their impact on model optimization using two dynamic weights. Extensive experiments on the popular ImageNet-C, -R, -K, -A and PACS databases demonstrate that our approach consistently outperforms state-of-the-art methods with significantly lower annotation costs. This paper's code will be released.
Poster
Ye Liu · Meng Yang

[ ExHall D ]

Abstract
Few-shot class-incremental learning (FSCIL) presents a significant challenge in machine learning, requiring models to integrate new classes from limited examples while preserving performance on previously learned classes. Recently, prompt-based CIL approaches have leveraged ample data to train prompts, effectively mitigating catastrophic forgetting. However, these methods do not account for the semantic features embedded in prompts, exacerbating the plasticity-stability dilemma in few-shot incremental learning. In this paper, we propose a novel and simple framework named SEmantic Complementary Prompt (SEC-Prompt), which learns two sets of semantically complementary prompts based on an adaptive query: discriminative prompts (D-Prompt) and non-discriminative prompts (ND-Prompt). D-Prompt enhances the separation of class-specific feature distributions by strengthening key discriminative features, while ND-Prompt balances non-discriminative information to promote generalization to novel classes. To efficiently learn high-quality knowledge from limited samples, we leverage ND-Prompt for data augmentation to increase sample diversity and introduce a Prompt Clustering Loss to prevent noise contamination in D-Prompt, ensuring robust discriminative feature learning and improved generalization. Our experimental results showcase state-of-the-art performance across four benchmark datasets, including CIFAR100, ImageNet-R, and CUB.
Poster
Li-Jun Zhao · Zhen-Duo Chen · Yongxin Wang · Xin Luo · Xin-Shun Xu

[ ExHall D ]

Abstract
Few-Shot Class-Incremental Learning (FSCIL) aims to continuously learn novel classes with limited samples after pre-training on a set of base classes. To avoid catastrophic forgetting and overfitting, most FSCIL methods first train the model on the base classes and then freeze the feature extractor in the incremental sessions. However, the reliance on nearest-neighbor classification makes FSCIL prone to the hubness phenomenon, which negatively impacts performance in this dynamic and open scenario. While recent methods attempt to adapt to the dynamic and open nature of FSCIL, they are often limited to biased optimizations of the feature space. In this paper, we pioneer the theoretical analysis of the inherent hubness in FSCIL. To mitigate the negative effects of hubness, we propose a novel Attraction Diminishing and Distributing (D2A) method from the essential perspectives of the distance metric and the feature space. Extensive experimental results demonstrate that our method can broadly and significantly improve the performance of existing methods.
Poster
Kai Fang · Anqi Zhang · Guangyu Gao · Jianbo Jiao · Chi Harold Liu · Yunchao Wei

[ ExHall D ]

Abstract
Effective Class Incremental Segmentation (CIS) requires simultaneously mitigating catastrophic forgetting and ensuring sufficient plasticity to integrate new classes. The inherent conflict above often leads to a back-and-forth, which turns the objective into finding the balance between the performance of previous (old) and incremental (new) classes. To address this conflict, we introduce a novel approach, Conflict Mitigation via Branched Optimization (CoMBO). Within this approach, we present the Query Conflict Reduction module, designed to explicitly refine queries for new classes through lightweight, class-specific adapters. Moreover, we develop two strategies to further mitigate the conflict following the branched structure, i.e., the Half-Learning Half-Distillation (HDHL) over classification probabilities, and the Importance-Based Knowledge Distillation (IKD) over query features. HDHL selectively engages in learning for classification probabilities of queries that match the ground truth of new classes, while aligning unmatched ones to the corresponding old probabilities, thus ensuring retention of old knowledge while absorbing new classes via learning negative samples. Meanwhile, IKD assesses the importance of queries based on their matching degree to old classes, prioritizing the distillation of important features and allowing less critical features to evolve. Extensive experiments in Class Incremental Panoptic and Semantic Segmentation settings have demonstrated the superior performance of CoMBO. The code is available in the …
Poster
Yanbiao Ma · Wei Dai · Wenke Huang · Jiayi Chen

[ ExHall D ]

Abstract
Data heterogeneity in federated learning, characterized by a significant misalignment between local and global distributions, leads to divergent local optimization directions and hinders global model training. Existing studies mainly focus on optimizing local updates or global aggregation, but these indirect approaches demonstrate instability when handling highly heterogeneous data distributions, especially in scenarios where label skew and domain skew coexist. To address this, we propose a geometry-guided data generation method that centers on simulating the global embedding distribution locally. We first introduce the concept of the geometric shape of an embedding distribution and then address the challenge of obtaining global geometric shapes under privacy constraints. Subsequently, we propose GGEUR, which leverages global geometric shapes to guide the generation of new samples, enabling a closer approximation to the ideal global distribution. In single-domain scenarios, we augment samples based on global geometric shapes to enhance model generalization; in multi-domain scenarios, we further employ class prototypes to simulate the global distribution across domains. Extensive experimental results demonstrate that our method significantly enhances the performance of existing approaches in handling highly heterogeneous data, including scenarios with label skew, domain skew, and their coexistence.
Poster
Sebastian Schmidt · Leonard Schenk · Leo Schwinn · Stephan Günnemann

[ ExHall D ]

Abstract
As the data demand for deep learning models increases, active learning becomes essential to strategically select samples for labeling, which maximizes data efficiency and reduces training costs. Recent work addresses important real-world considerations of active learning, such as handling out-of-distribution (OOD) data and online discovery of novel object categories. However, a combined analysis of these scenarios remains unexplored. To address this gap, we propose a novel scenario, Open-Set Discovery Active Learning (OSDAL), which integrates OOD sample handling and novel category discovery. In contrast to previous methods, we construct a common feature space within a single model that aligns known and novel categories while separating OOD samples. This enables our approach, Joint Out-of-distribution filtering and data Discovery Active learning (Joda), to uniquely address both challenges simultaneously by filtering out OOD data before selecting candidates for labeling. Unlike previous work, Joda does not require auxiliary detection models for filtering or selection and therefore effectively reduces the computational overhead. In extensive experiments on 15 configurations and 3 metrics, Joda consistently achieves the highest or equally high accuracy compared with state-of-the-art competitor approaches in 39 out of 45 cases.
Poster
Ronghang Zhu · Mengxuan Hu · Weiming Zhuang · Lingjuan Lyu · Xiang Yu · Sheng Li

[ ExHall D ]

Abstract
Domain adaptation addresses the challenge where the distribution of target inference data differs from that of the source training data. Recently, data privacy has become a significant constraint, limiting access to the source domain. To mitigate this issue, Source-Free Domain Adaptation (SFDA) methods bypass source domain data by generating source-like data or pseudo-labeling the unlabeled target domain. However, these approaches often lack theoretical grounding. In this work, we provide a theoretical analysis of the SFDA problem, focusing on the general empirical risk of the unlabeled target domain. Our analysis offers a comprehensive understanding of how representativeness, generalization, and variety contribute to controlling the upper bound of target domain empirical risk in SFDA settings. We further explore how to balance this trade-off from three perspectives: sample selection, semantic domain alignment, and a progressive learning framework. These insights inform the design of novel algorithms. Experimental results demonstrate that our proposed method achieves state-of-the-art performance on three benchmark datasets—Office-Home, DomainNet, and VisDA-C—yielding relative improvements of 3.2%, 9.1%, and 7.5%, respectively, over the representative SFDA method, SHOT.
Poster
Junyi Chai · Shenyu Lu · Xiaoqian Wang

[ ExHall D ]

Abstract
Multi-task learning (MTL) is a paradigm that aims to improve the generalization of models by simultaneously learning multiple related tasks, leveraging shared representations and task-specific information to capture complex patterns and to enhance performance on individual tasks. However, existing work has discovered that MTL can harm generalization, and one particular reason is the spurious correlations between tasks: owing to the knowledge-sharing property, the task-specific predictors are more likely to develop reliance on spurious features. Most existing approaches address this issue through distributional robustness, aiming to maintain consistent performance across different distributions under unknown covariate shifts. Yet, this formulation lacks theoretical guarantees and can be sensitive to the construction of the covariate shift. In this work, we propose a novel perspective, where we seek to directly identify the spurious correlations between tasks. Drawing inspiration from conventional formulations of spurious correlation, for each task, we propose to distinguish its spurious tasks using the difference in correlation coefficients between the empirical distribution and class-wise resampled distributions, thereby capturing the correlations between task labels w.r.t. each class. We prove theoretically the feasibility of such a resampling strategy in characterizing the spurious correlation between tasks. Following the identification of task-specific spurious information, we propose a …
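One plausible reading of the resampling test is sketched below: compare the correlation of two tasks' labels on the empirical data with the correlation after class-wise resampling of one task, and treat a large gap as a sign of a class-dependent, potentially spurious link. The balancing scheme is this sketch's assumption, not the paper's exact procedure.

import numpy as np

def pearson(a, b):
    return np.corrcoef(a, b)[0, 1]

def class_resampled_corr(y_ref, y_other, rng):
    """Correlation after resampling so both classes of y_ref are equally represented."""
    idx0 = np.flatnonzero(y_ref == 0)
    idx1 = np.flatnonzero(y_ref == 1)
    n = min(len(idx0), len(idx1))
    idx = np.concatenate([rng.choice(idx0, n, replace=False),
                          rng.choice(idx1, n, replace=False)])
    return pearson(y_ref[idx], y_other[idx])

rng = np.random.default_rng(0)
y_a = rng.integers(0, 2, size=5000)
y_b = np.where(rng.random(5000) < 0.8, y_a, 1 - y_a)        # task label correlated with y_a

gap = abs(pearson(y_a, y_b) - class_resampled_corr(y_a, y_b, rng))
print(gap)   # a large gap would flag a class-dependent (potentially spurious) correlation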
Poster
Na Zheng · Xuemeng Song · Xue Dong · Aashish Nikhil Ghosh · Liqiang Nie · Roger Zimmermann

[ ExHall D ]

Abstract
Recent studies have focused on introducing pre-trained foundation models into semi-supervised learning (SSL) tasks. Nevertheless, these foundation models can exhibit biases toward different classes and tend to generate imbalanced pseudo-labels for SSL. Thus, efforts have been made to introduce a logit adjustment offset to reduce the inherent bias of foundation models in SSL tasks. Despite their success, existing foundation model-based SSL methods face three challenges: 1) unreliability of the estimated logit adjustment offset, 2) overlooking the potential of linguistic knowledge in capturing model biases, and 3) failure to fully exploit the unlabeled samples. To address these issues, we propose a Language-Assisted Debiasing and Smoothing framework, namely LADaS, for foundation model-based SSL. It consists of two components: 1) Language-assisted Pseudo-Label Debiasing (LPLD) to reduce biases in foundation models, and 2) Language-aware Pseudo-Label Smoothing (LPLS) to fully exploit low-confidence samples to facilitate SSL training. In particular, LPLD introduces a reliability score to dynamically assess the reliability of the logit adjustment. Additionally, it incorporates a language-oriented preference to reduce model biases using linguistic knowledge derived from pre-trained language models. Finally, LPLS introduces language-aware soft labels and devises a language-aware pseudo-label smoothing loss to guide the learning of unlabeled samples with low-quality pseudo-labels. Extensive experiments demonstrate the superiority …
Poster
Lilin Zhang · Chengpei Wu · Ning Yang

[ ExHall D ]

Abstract
Existing adversarial training (AT) methods often suffer from incomplete perturbation, i.e., not all non-robust features are perturbed during the generation of adversarial examples (AEs). As a result, correlations between the remaining non-robust features and the labels are still captured by the target model, leading to suboptimal learning of robust features. However, fulfilling complete perturbation, i.e., perturbing as many non-robust features as possible, is not easy due to the unidentifiability of robust/non-robust features and the sparsity of labeled data. To overcome these challenges, we propose a novel solution called Weakly Supervised Contrastive Adversarial Training (WSCAT). WSCAT fulfills complete perturbation for better learning of robust features by blocking the correlations between non-robust features and labels, via complete AE generation over partially labeled data grounded in information theory. Solid theoretical analysis and extensive experiments conducted on widely adopted benchmarks verify the superiority of WSCAT.
Poster
Qi Chen · Hu Ding

[ ExHall D ]

Abstract
Out-of-distribution (OOD) detection is crucial for machine learning models deployed in open-world environments. However, existing methods often struggle with model over-confidence or rely heavily on empirical energy value estimation, limiting their scalability and generalizability. This paper introduces DEBO (Dual Energy-Based Model for Out-of-distribution Detection), a novel approach that addresses these limitations through an innovative dual classifier architecture and a unified energy-based objective function. DEBO enhances the standard classification framework by integrating a dual-purpose output space within a single classifier. The primary component classifies in-distribution (ID) data conventionally, while the secondary component captures open-world information and estimates uncertainty. Our method overcomes the dependence of traditional energy model-based OOD detection methods on empirical energy estimation while maintaining theoretical guarantees. Theoretical analysis demonstrates that DEBO promotes low energy and high confidence for ID data, while simultaneously inducing higher energy and decreased confidence for OOD samples. Extensive experiments conducted on benchmark datasets reveal that DEBO achieves state-of-the-art OOD detection performance while maintaining comparable classification accuracy on ID data.
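For background, energy-model-based OOD detection typically scores a sample by the energy of its logits, E(x) = -T log sum_c exp(f_c(x)/T); lower energy indicates in-distribution. The sketch below shows that standard score, not DEBO's dual-classifier objective.

import torch

def energy_score(logits, temperature=1.0):
    """Lower (more negative) energy suggests in-distribution data."""
    return -temperature * torch.logsumexp(logits / temperature, dim=1)

logits_id = torch.tensor([[8.0, 0.5, 0.2], [7.1, 1.0, 0.3]])    # confident, ID-like logits
logits_ood = torch.tensor([[0.4, 0.5, 0.3], [0.2, 0.1, 0.4]])   # flat, OOD-like logits
print(energy_score(logits_id))    # low energies
print(energy_score(logits_ood))   # noticeably higher energies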
Poster
Senyu Hou · Gaoxia Jiang · Jia Zhang · Shangrong Yang · Husheng Guo · Yaqing Guo · Wenjian Wang

[ ExHall D ]

Abstract
In image classification, the label quality of training data critically influences model generalization, especially for deep neural networks (DNNs). Traditionally, learning from noisy labels (LNL) can improve the generalization of DNNs through complex architectures or a series of robust techniques, but its performance improvement is limited by the discriminative paradigm. Unlike traditional approaches, we resolve LNL problems from the perspective of robust label generation, based on diffusion models within the generative paradigm. To expand the diffusion model into a robust classifier that explicitly accommodates more noise knowledge, we propose a Directional Label Diffusion (DLD) model. It disentangles the diffusion process into two paths, i.e., directional diffusion and random diffusion. Specifically, directional diffusion simulates the corruption of true labels into a directed noise distribution, prioritizing the removal of likely noise, whereas random diffusion introduces inherent randomness to support label recovery. This architecture enables DLD to gradually infer labels from an initial random state, interpretably diverging from the specified noise distribution. To adapt the model to diverse noisy environments, we design a low-cost label pre-correction method that automatically supplies more accurate label information to the diffusion model, without requiring manual intervention or additional iterations. In addition, we optimize the paradigm for …
Poster
Yunlu Yan · Huazhu Fu · Yuexiang Li · Jinheng Xie · Jun Ma · Guang Yang · Lei Zhu

[ ExHall D ]

Abstract
Federated Learning (FL) facilitates collaborative learning among multiple clients in a distributed manner while preserving privacy. However, its performance inevitably degrades with non-Independent and Identically Distributed (non-IID) data. In this paper, we focus on the feature-distribution-skewed FL scenario, a common non-IID situation in real-world applications where data from different clients exhibit varying underlying distributions. This variation leads to feature shift, which is a key issue of this scenario. While previous works have made notable progress, few pay attention to the data itself, i.e., the root of this issue. The primary goal of this paper is to mitigate feature shift from the perspective of data. To this end, we propose a simple yet remarkably effective input-level data augmentation method, namely FedRDN, which randomly injects the statistical information of local distributions from the entire federation into the client's data. This helps improve the generalization of local feature representations, thereby mitigating feature shift. Moreover, FedRDN is a plug-and-play component that can be seamlessly integrated into the data augmentation flow with only a few lines of code. Extensive experiments on several datasets show that the performance of various representative FL methods can be further improved …
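A minimal sketch of the input-level injection idea, assuming the statistics shared across the federation are per-channel means and standard deviations (an assumption of this illustration): re-normalize a local image with a randomly drawn client's statistics.

import random
import torch

def inject_statistics(image, federation_stats, eps=1e-6):
    """image: (C, H, W); federation_stats: list of (mean, std) tensors of shape (C,)."""
    mean_k, std_k = random.choice(federation_stats)        # statistics from a random client
    flat = image.view(image.size(0), -1)
    mu = flat.mean(dim=1).view(-1, 1, 1)
    sigma = flat.std(dim=1).view(-1, 1, 1) + eps
    return (image - mu) / sigma * std_k.view(-1, 1, 1) + mean_k.view(-1, 1, 1)

stats = [(torch.rand(3), torch.rand(3) + 0.5) for _ in range(4)]   # one (mean, std) pair per client
augmented = inject_statistics(torch.rand(3, 224, 224), stats)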
Poster
Yasser Khalil · Leo Maxime Brunswic · Soufiane Lamghari · Xu Li · Mahdi Beitollahi · Xi Chen

[ ExHall D ]

Abstract
Federated unlearning (FU) aims to remove a participant’s data contributions from a trained federated learning (FL) model, ensuring privacy and regulatory compliance. Traditional FU methods often depend on auxiliary storage on either the client or server side or require direct access to the data targeted for removal—a dependency that may not be feasible if the data is no longer available. To overcome these limitations, we propose NoT, a novel and efficient FU algorithm based on weight negation (multiplying by -1), which circumvents the need for additional storage and access to the target data. We argue that effective and efficient unlearning can be achieved by perturbing model parameters away from the set of optimal parameters, yet being well-positioned for quick re-optimization. This technique, though seemingly contradictory, is theoretically grounded: we prove that the weight negation perturbation effectively disrupts inter-layer co-adaptation, inducing unlearning while preserving an approximate optimality property, thereby enabling rapid recovery. Experimental results across three datasets and three model architectures demonstrate that NoT significantly outperforms existing baselines in unlearning efficacy as well as in communication and computational efficiency.
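The core perturbation in NoT is simply negating weights. The PyTorch sketch below negates the weights of one chosen layer in place; which layers are negated, and how training then resumes, are design choices not specified here.

import torch

model = torch.nn.Sequential(
    torch.nn.Linear(32, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 10),
)

layers_to_negate = {"0.weight"}                      # illustrative choice of layer
with torch.no_grad():
    for name, param in model.named_parameters():
        if name in layers_to_negate:
            param.mul_(-1.0)                         # weight negation: multiply by -1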
Poster
Ye Li · Yanchao Zhao · chengcheng zhu · Jiale Zhang

[ ExHall D ]

Abstract
Federated Learning (FL), a privacy-preserving decentralized machine learning framework, has been shown to be vulnerable to backdoor attacks. Current research primarily focuses on the Single-Label Backdoor Attack (SBA), wherein adversaries share a consistent target. However, a critical fact is overlooked: adversaries may be non-cooperative, have distinct targets, and operate independently, which constitutes a more practical scenario called the Multi-Label Backdoor Attack (MBA). Unfortunately, prior works are ineffective in the MBA scenario since non-cooperative attackers exclude each other. In this work, we conduct an in-depth investigation to uncover the inherent constraint behind this exclusion: similar backdoor mappings are constructed for different targets, resulting in conflicts among backdoor functions. To address this limitation, we propose Mirage, the first non-cooperative MBA strategy in FL that allows attackers to inject effective and persistent backdoors into the global model without collusion by constructing in-distribution (ID) backdoor mappings. Specifically, we introduce an adversarial adaptation method to bridge the backdoor features and the target distribution in an ID manner. Additionally, we further leverage a constrained optimization method to ensure the ID mapping survives the global training dynamics. Extensive evaluations demonstrate that Mirage outperforms various state-of-the-art attacks and bypasses existing defenses, achieving an average ASR greater than 97% and …
Poster
Dongyoon Yang · Jihu Lee · Yongdai Kim

[ ExHall D ]

Abstract
Robust domain adaptation against adversarial attacks is a critical area of research, addressing the need for models to perform consistently across diverse, challenging domains. In this paper, we derive a new generalization bound for the robust risk on a target domain, using a novel divergence measure specifically tailored for robust domain adaptation. Inspired by this generalization bound, we propose a new algorithm named TAROT, which is designed to enhance both domain adaptability and robustness. Additionally, we empirically demonstrate that a simple pseudo-labeling approach, when combined with robust pretraining (Robust-PT), establishes a surprisingly strong baseline that surpasses traditional robust domain adaptation algorithms. Through extensive experiments, we illustrate that TAROT not only outperforms state-of-the-art methods in accuracy and robustness but also shows substantial scalability improvements. These improvements are particularly pronounced on the challenging DomainNet benchmark, emphasizing our algorithm's effectiveness and potential for broader applications.
Poster
Hanrong Zhang · Zhenting Wang · Boheng Li · Fulin Lin · Tingxu Han · Mingyu Jin · Chenlu Zhan · Mengnan Du · Hongwei Wang · Shiqing Ma

[ ExHall D ]

Abstract
Self-supervised learning (SSL) models are vulnerable to backdoor attacks. Existing backdoor attacks that are effective in SSL often involve noticeable triggers, like colored patches or visible noise, which are vulnerable to human inspection. This paper proposes an imperceptible and effective backdoor attack against self-supervised models. We first find that existing imperceptible triggers designed for supervised learning are less effective at compromising self-supervised models. We then identify that this ineffectiveness is attributable to the overlap between the distributions of the backdoor samples and the augmented samples used in SSL. Building on this insight, we design an attack using optimized triggers that are disentangled from the augmentation transformations in SSL, while remaining imperceptible to human vision. Experiments on five datasets and six SSL algorithms demonstrate that our attack is highly effective and stealthy. It also has strong resistance to existing backdoor defenses.
Poster
Aishik Konwer · Zhijian Yang · Erhan Bas · Cao Xiao · Prateek Prasanna · Parminder Bhatia · Taha Kass-Hout

[ ExHall D ]

Abstract
Foundational models such as the Segment Anything Model (SAM) are gaining traction in medical imaging segmentation, supporting multiple downstream tasks. However, such models are supervised in nature, still relying on large annotated datasets or prompts supplied by experts. Conventional techniques such as active learning to alleviate such limitations are limited in scope and still necessitate continuous human involvement and complex domain knowledge for label refinement or establishing reward ground truth. To address these challenges, we propose an enhanced Segment Anything Model (SAM) framework that utilizes annotation-efficient prompts generated in a fully unsupervised fashion, while still capturing essential semantic, location, and shape information through contrastive language-image pretraining and visual question answering. We adopt the direct preference optimization technique to design an optimal policy that enables the model to generate high-fidelity segmentations with simple ratings or rankings provided by a virtual annotator simulating the human annotation process. State-of-the-art performance of our framework in tasks such as lung segmentation, breast tumor segmentation, and organ segmentation across various modalities, including X-ray, ultrasound, and abdominal CT, justifies its effectiveness in low-annotation data scenarios.
Poster
Kaisheng Liang · Xuelong Dai · Yanjie Li · Dong Wang · Bin Xiao

[ ExHall D ]

Abstract
Deep neural networks exhibit vulnerability to adversarial examples that can transfer across different models. A particularly challenging problem is developing transferable targeted attacks that can mislead models into predicting specific target classes. While various methods have been proposed to enhance attack transferability, they often incur substantial computational costs while yielding limited improvements. Recent clean feature mixup methods use random clean features to perturb the feature space but lack optimization for disrupting adversarial examples, overlooking the advantages of attack-specific perturbations. In this paper, we propose Feature Tuning Mixup (FTM), a novel method that enhances targeted attack transferability by combining both random and optimized noises in the feature space. FTM introduces learnable feature perturbations and employs an efficient stochastic update strategy for optimization. These learnable perturbations facilitate the generation of more robust adversarial examples with improved transferability. We further demonstrate that attack performance can be enhanced through an ensemble of multiple FTM-perturbed surrogate models. Extensive experiments on the ImageNet-compatible dataset across various models demonstrate that our method achieves significant improvements over state-of-the-art methods while maintaining low computational cost.
Poster
Meilong Xu · Saumya Gupta · Xiaoling Hu · Chen Li · Shahira Abousamra · Dimitris Samaras · Prateek Prasanna · Chao Chen

[ ExHall D ]

Abstract
Accurately modeling multi-class cell topology is crucial in digital pathology, as it provides critical insights into tissue structure and pathology. The synthetic generation of cell topology enables realistic simulations of complex tissue environments, enhances downstream tasks by augmenting training data, aligns more closely with pathologists' domain knowledge, and offers new opportunities for controlling and generalizing the tumor microenvironment. In this paper, we propose a novel approach that integrates topological constraints into a diffusion model to improve the generation of realistic, contextually accurate cell topologies. Our method refines the simulation of cell distributions and interactions, increasing the precision and interpretability of results in downstream tasks such as cell detection and classification. To assess the topological fidelity of generated layouts, we introduce a new metric, Topological Fréchet Distance (TopoFD), which overcomes the limitations of traditional metrics like FID in evaluating topological structure. Experimental results demonstrate the effectiveness of our approach in generating multi-class cell layouts that capture intricate topological relationships.
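TopoFD is not specified in detail here, but a Frechet-style distance between two sets of feature vectors follows the familiar formula ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2}) for fitted Gaussians; the sketch below applies it to placeholder descriptors, since the paper's topological features are not reproduced.

import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_a, feats_b):
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):
        covmean = covmean.real                   # drop tiny imaginary parts from numerics
    return np.sum((mu_a - mu_b) ** 2) + np.trace(cov_a + cov_b - 2 * covmean)

rng = np.random.default_rng(0)
real_feats = rng.normal(size=(200, 16))          # stand-ins for topological descriptors
gen_feats = rng.normal(loc=0.2, size=(200, 16))
print(frechet_distance(real_feats, gen_feats))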
Poster
Han Liu · Peng Cui · Bingning Wang · Weipeng Chen · Yupeng Zhang · Jun Zhu · Xiaolin Hu

[ ExHall D ]

Abstract
Deep Neural Networks (DNNs) have achieved remarkable success in a variety of tasks, particularly in terms of prediction accuracy. However, in real-world scenarios, especially in safety-critical applications, accuracy alone is insufficient; reliable uncertainty estimates are essential. Modern DNNs, often trained with cross-entropy loss, tend to exhibit overconfidence, especially on ambiguous samples. Many techniques aim to improve uncertainty calibration, yet they often come at the cost of reduced accuracy or increased computational demands. To address this challenge, we propose Differentiated Deep Mutual Learning (Diff-DML), an efficient ensemble approach that simultaneously enhances accuracy and uncertainty calibration. Diff-DML draws inspiration from Deep Mutual Learning (DML) while introducing two strategies to maintain prediction diversity: (1) a Differentiated Training Strategy (DTS) and (2) a Diversity-Preserving Learning Objective (DPLO). Our theoretical analysis shows that Diff-DML's diversified learning framework not only leverages ensemble benefits but also avoids the loss of prediction diversity observed in traditional DML setups, which is crucial for improved calibration. Extensive evaluations on various benchmarks confirm the effectiveness of Diff-DML. For instance, on the CIFAR-100 dataset, Diff-DML on ResNet34/50 models achieved substantial improvements over the previous state-of-the-art method, MDCA, with absolute accuracy gains of 1.3%/3.1%, relative ECE reductions of 49.6%/43.8%, and relative classwise-ECE reductions of 7.7%/13.0%.
Poster
Ren Wang · Haoliang Sun · Yuxiu Lin · Chuanhui Zuo · Yongshun Gong · Yilong Yin · Wenjia Meng

[ ExHall D ]

Abstract
Multi-view representation learning integrates multiple observable views of an entity into a unified representation to facilitate downstream tasks. Current methods predominantly focus on distinguishing compatible components across views, followed by a single-step parallel fusion process. However, this parallel fusion is static in essence, overlooking potential conflicts among views and compromising representation ability. To address this issue, this paper proposes a novel Sequential fusion framework for Multi-view Representation Learning, termed SeqMvRL. Specifically, we model multi-view fusion as a sequential decision-making problem and construct a pairwise integrator (PI) and a next-view selector (NVS), which represent the environment and agent in reinforcement learning, respectively. PI merges the current fused feature with the selected view, while NVS is introduced to determine which view to fuse subsequently. By adaptively selecting the next optimal view for fusion based on the current fusion state, SeqMvRL thereby effectively reduces conflicts and enhances unified representation quality. Additionally, an elaborate novel reward function encourages the model to prioritize views that enhance the discriminability of the fused features. Experimental results demonstrate that SeqMvRL outperforms parallel fusion approaches in classification and clustering tasks.
Poster
Bowen Zhao · Qianqian Wang · Zhengming Ding · Quanxue Gao

[ ExHall D ]

Abstract
The success of existing deep multi-view graph clustering methods is based on the assumption that node attributes are fully available across all views. However, in practical scenarios, node attributes are frequently missing due to factors such as data privacy concerns or failures in data collection devices. Although some methods have been proposed to address the issue of missing node attributes, they come with the following limitations: i) Existing methods are often not tailored specifically for clustering tasks and struggle to address missing attributes effectively. ii) They tend to ignore the relational dependencies between nodes and their neighboring nodes. This oversight results in unreliable imputations, thereby degrading clustering performance. To address the above issues, we propose an Attribute-Missing Multi-view Graph Clustering (AMMGC) method. Specifically, we first impute missing node attributes by leveraging neighborhood information through an adjacency matrix. Then, to improve consistency, we integrate a dual structure consistency module that aligns graph structures across multiple views, reducing redundancy and retaining key information. Furthermore, we introduce a high-confidence guidance module to improve the reliability of clustering. Extensive experiment results showcase the effectiveness and superiority of our proposed method on multiple benchmark datasets.
Poster
Thomas Dagès · Simon Weber · Ya-Wei Eileen Lin · Ronen Talmon · Daniel Cremers · Michael Lindenbaum · Alfred M. Bruckstein · Ron Kimmel

[ ExHall D ]

Abstract
Dimensionality reduction is a fundamental task that aims to simplify complex data by reducing its feature dimensionality while preserving essential patterns, with core applications in data analysis and visualisation. To preserve the underlying data structure, multi-dimensional scaling (MDS) methods focus on preserving pairwise dissimilarities, such as distances. They optimise the embedding to have pairwise distances as close as possible to the data dissimilarities. However, the current standard is limited to embedding data in Riemannian manifolds. Motivated by the lack of asymmetry in the Riemannian metric of the embedding space, this paper extends the MDS problem to a natural asymmetric generalisation of Riemannian manifolds called Finsler manifolds. Inspired by Euclidean spaces, we define a canonical Finsler space for embedding asymmetric data. Due to its simplicity with respect to geodesics, data representation in this space is both intuitive and simple to analyse. We demonstrate that our generalisation benefits from the same theoretical convergence guarantees. We reveal the effectiveness of our Finsler embedding across various types of non-symmetric data, highlighting its value in applications such as data visualisation, dimensionality reduction, directed graph embedding, and link prediction.
Poster
Chengxiang Huang · Yake Wei · Zequn Yang · Di Hu

[ ExHall D ]

Abstract
Sensory training during the early ages is vital for human development. Inspired by this cognitive phenomenon, we observe that the early training stage is also important for the multimodal learning process, where dataset information is rapidly acquired. We refer to this stage as the prime learning window. However, based on our observation, this prime learning window in multimodal learning is often dominated by information-sufficient modalities, which in turn suppresses the information acquisition of information-insufficient modalities. To address this issue, we propose Information Acquisition Regulation (IAR), a method designed to balance information acquisition among modalities. Specifically, IAR slows down the information acquisition process of information-sufficient modalities during the prime learning window, which could promote information acquisition of information-insufficient modalities. This regulation enables a more balanced learning process and improves the overall performance of the multimodal network. Experiments show that IAR outperforms related multimodal imbalanced methods across various datasets, achieving superior model performance.
Poster
Guanzhou Ke · Shengfeng He · Xiao-Li Wang · Bo Wang · Guoqing Chao · Yuanyang Zhang · Yi Xie · HeXing Su

[ ExHall D ]

Abstract
Previous successful approaches to missing modality completion rely on carefully designed fusion techniques and extensive pre-training on complete data, which can limit their generalizability in out-of-domain (OOD) scenarios. In this study, we pose a new challenge: can we develop a missing modality completion model that is both resource-efficient and robust to OOD generalization? To address this, we present a training-free framework for missing modality completion that leverages large multimodal models (LMMs). Our approach, termed the "Knowledge Bridger", is modality-agnostic and integrates generation and ranking of missing modalities. By defining domain-specific priors, our method automatically extracts structured information from available modalities to construct knowledge graphs. These extracted graphs connect the missing modality generation and ranking modules through the LMM, resulting in high-quality imputations of missing modalities. Experimental results across both general and medical domains show that our approach consistently outperforms competing methods, including in OOD generalization. Additionally, our knowledge-driven generation and ranking techniques demonstrate superiority over variants that directly employ LMMs for generation and ranking, offering insights that may be valuable for applications in other domains.
Poster
Max Gutbrod · David Rauber · Danilo Weber Nunes · Christoph Palm

[ ExHall D ]

Abstract
The growing reliance on Artificial Intelligence (AI) in critical domains such as healthcare demands robust mechanisms to ensure the trustworthiness of these systems, especially when faced with unexpected or anomalous inputs. This paper introduces the Open Medical Imaging Benchmarks for Out-Of-Distribution Detection (OpenMIBOOD), a comprehensive framework for evaluating out-of-distribution (OOD) detection methods specifically in medical imaging contexts. OpenMIBOOD includes three benchmarks from diverse medical domains, encompassing 14 datasets divided into covariate-shifted in-distribution, near-OOD, and far-OOD categories. We evaluate 24 post-hoc methods across these benchmarks, providing a standardized reference to advance the development and fair comparison of OOD detection methods. Results reveal that findings from broad-scale OOD benchmarks in natural image domains do not translate to medical applications, underscoring the critical need for such benchmarks in the medical field. By mitigating the risk of exposing AI models to inputs outside their training distribution, OpenMIBOOD aims to support the advancement of reliable and trustworthy AI systems in healthcare. The full repository is available at https://github.com/xxxx/xxx.
Poster
Mariamma Antony · Rajiv Porana · Sahil M. Lathiya · Siva Teja Kakileti · Chiranjib Bhattacharyya

[ ExHall D ]

Abstract
Mobile health (mHealth) has emerged as a transformative solution to enhance healthcare accessibility and affordability, particularly in resource-constrained regions and low-to-middle-income countries. mHealth leverages mobile platforms to improve healthcare accessibility, addressing radiologist shortages in low-resource settings by enabling remote diagnosis and consultation through mobile devices. Mobile phones allow healthcare workers to transmit radiographic images, such as chest X-rays (CXR), to specialists or AI-driven models for interpretation. However, AI-based diagnosis using CXR images shared via apps like WhatsApp suffers from reduced predictability and explainability due to compression artifacts, and there is a lack of datasets to systematically study these challenges. To address this, we introduce CheXwhatsApp, a dataset of 175,029 paired original and WhatsApp-compressed CXR images. We present a benchmarking study showing that the dataset improves the prediction stability and explainability of state-of-the-art models by up to 80%, while also enhancing localization performance. CheXwhatsApp is open-sourced to support advancements in mHealth applications for CXR analysis.
Poster
Hanbin Ko · Chang Min Park

[ ExHall D ]

Abstract
The development of large-scale image-text pair datasets has significantly advanced self-supervised learning in Vision-Language Processing (VLP). However, directly applying general-domain architectures such as CLIP to medical data presents challenges, particularly in handling negations and addressing the inherent data imbalance of medical datasets. To address these issues, we propose a novel approach that integrates clinically-enhanced dynamic soft labels and medical graphical alignment, thereby improving clinical comprehension and improving the applicability of contrastive loss in medical contexts. Furthermore, we introduce negation-based hard negatives to deepen the model’s understanding of the complexities of clinical language. Our approach integrates seamlessly into any medical CLIP training pipeline and achieves state-of-the-art performance across multiple tasks, including zero-shot, fine-tuned classification and report retrieval. To further assess our model’s capacity for clinical language comprehension, we introduce CXR-Align, a benchmark uniquely designed to evaluate the understanding of negation and clinical information within chest X-ray (CXR) datasets. Experimental results demonstrate that our proposed methods are straightforward to implement and generalize effectively across contrastive learning frameworks, enhancing medical VLP capabilities and advancing clinical language understanding in medical imaging.
Poster
Shahad Albastaki · Anabia Sohail · IYYAKUTTI IYAPPAN GANAPATHI · Basit Alawode · Asim Khan · Sajid Javed · Naoufel Werghi · Mohammed Bennamoun · Arif Mahmood

[ ExHall D ]

Abstract
In Computational Pathology (CPath), the introduction of Vision-Language Models (VLMs) has opened new avenues for research, focusing primarily on aligning image-text pairs at a single magnification level. However, this approach might not be sufficient for tasks like cancer subtype classification, tissue phenotyping, and survival analysis due to the limited level of detail that a single-resolution image can provide. Addressing this, we propose a novel multi-resolution paradigm leveraging Whole Slide Images (WSIs) to extract histology patches at multiple resolutions and generate corresponding textual descriptions through advanced CPath VLM. This method aims to capture a broader range of information, supported by novel loss functions, enriches feature representation, improves discriminative ability, and enhances generalization across different resolutions. Pre-trained on a comprehensive TCGA dataset with 34 million image-language pairs at various resolutions, our fine-tuned model outperforms State-Of-The-Art (SOTA) counterparts across multiple datasets and tasks, demonstrating its effectiveness in CPath. The code is available on GitHub at xxx.
Poster
Tong Wang · Mingkang Wang · Zhongze Wang · Hongkai Wang · Qi Xu · Fengyu Cong · Hongming Xu

[ ExHall D ]

Abstract
Recently, virtual staining has emerged as a promising alternative to revolutionize histological staining by digitally generating stains. However, most existing methods suffer from the curse of staining unreality and unreliability. In this paper, we propose the Orthogonal Decoupling Alignment Generative Adversarial Network (ODA-GAN) for unpaired virtual immunohistochemistry (IHC) staining. Our approach is based on the assumption that an image consists of IHC staining-related features, which influence staining distribution and intensity, and staining-unrelated features, such as tissue morphology. Leveraging a pathology foundation model, we first develop a weakly-supervised segmentation pipeline as an alternative to expert annotations. We introduce an Orthogonal MLP (O-MLP) module to project image features into an orthogonal space, decoupling them into staining-related and unrelated components. Additionally, we propose a Dual-stream PatchNCE (DPNCE) loss to resolve contrastive learning contradictions in the staining-related space, thereby enhancing staining accuracy. To further improve realism, we introduce a Multi-layer Domain Alignment (MDA) module to bridge the domain gap between generated and real IHC images. Extensive evaluations on three benchmark datasets show that our ODA-GAN reaches state-of-the-art (SOTA) performance. Our source code is available at ***.
Poster
Yisi Luo · Xile Zhao · Kai Ye · Deyu Meng

[ ExHall D ]

Abstract
Spatial transcriptomics (ST) comprises emerging technologies that reveal the spatial distributions of gene expression within tissues, serving as an important way to uncover biological insights. However, the irregular spatial profiles and variability of genes make it challenging to integrate spatial information with gene expression under a computational framework. Current algorithms mostly utilize spatial graph neural networks to encode spatial information, which may incur increased computational costs and may not be flexible enough to depict complex spatial configurations. In this study, we introduce a concise yet effective representation framework, STINR, for deciphering ST data. STINR leverages an implicit neural representation (INR) to continuously represent ST data, which efficiently characterizes spatial and slice-wise correlations of ST data by inheriting the implicit smoothness of INR. STINR allows easier integration of multiple slices and multi-omics without any alignment, and serves as a potent tool for various biological tasks stemming from ST data, including gene imputation, gene denoising, spatial domain detection, and cell-type deconvolution. In particular, STINR identifies the thinnest cortex layer in the dorsolateral prefrontal cortex, which previous methods were unable to achieve, and more accurately identifies tumor regions in human squamous cell carcinoma, showcasing its practical value for biological discoveries.
Poster
Zheng Zhang · Guanchun Yin · Bo Zhang · Wu Liu · Xiuzhuang Zhou · Wendong Wang

[ ExHall D ]

Abstract
The limited availability of data annotations has made semi-supervised learning (SSL) increasingly popular in medical image analysis. However, the use of pseudo labels in SSL degrades the performance of decoders that heavily rely on high-accuracy annotations. This issue is particularly pronounced in class-imbalanced multi-organ segmentation tasks, where small organs may be under-segmented or even ignored. In this paper, we propose a semantic knowledge complementarity based decoupling framework for accurate multi-organ segmentation in class-imbalanced CT images. The framework decouples the data flow based on the responsibilities of the encoder and decoder during model training so that the model effectively learns semantic features while mitigating the negative impact of unlabeled data on the semantic segmentation task. Then, we design a semantic knowledge complementarity module that adopts labeled data to guide the generation of pseudo labels and enriches the semantic features of labeled data with unlabeled data, which improves the quality of the generated pseudo labels and the robustness of the overall model. Furthermore, we also design an auxiliary balanced segmentation head based training strategy to further enhance the segmentation performance on small organs. Extensive experiments on the Synapse and AMOS datasets show that our method significantly outperforms existing state-of-the-art methods.
Poster
Theodore Zhao · Sid Kiblawi · Mu Wei · Ho Hin Lee · J. Samuel Preston · Naoto Usuyama · Hoifung Poon

[ ExHall D ]

Abstract
Detecting and segmenting small objects, such as lung nodules and tumor lesions, remains a critical challenge in image analysis. These objects often occupy less than 0.1% of an image, making traditional transformer architectures inefficient and prone to performance degradation due to redundant attention computations on irrelevant regions. Existing sparse attention mechanisms rely on rigid hierarchical structures, which are poorly suited for detecting small, variable, and uncertain object locations. In this paper, we propose BoltzFormer, a novel transformer-based architecture designed to address these challenges through dynamic sparse attention. BoltzFormer identifies and focuses attention on relevant areas by modeling uncertainty using a Boltzmann distribution with an annealing schedule. Initially, a higher temperature allows broader area sampling in early layers, when object location uncertainty is greatest. As the temperature decreases in later layers, attention becomes more focused, enhancing efficiency and accuracy. BoltzFormer seamlessly integrates into existing transformer architectures via a modular Boltzmann attention sampling mechanism. Comprehensive evaluations on benchmark datasets demonstrate that BoltzFormer significantly improves segmentation performance for small objects while reducing attention computation by an order of magnitude compared to previous state-of-the-art methods.
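The core sampling idea can be pictured in a few lines: patch scores become a Boltzmann distribution whose temperature is annealed with depth, so early layers explore and late layers concentrate attention. The function below is a hedged illustration; the schedule, signature, and default values are assumptions rather than BoltzFormer's actual code.

```python
# Illustrative sketch of Boltzmann attention sampling with an annealing
# schedule: patch relevance scores are turned into a Boltzmann distribution
# whose temperature decreases with layer depth, so early layers sample
# broadly and later layers focus. Names and defaults are assumptions.
import torch

def boltzmann_sample(scores: torch.Tensor, layer: int, n_layers: int,
                     t_max: float = 2.0, t_min: float = 0.1,
                     k: int = 64) -> torch.Tensor:
    """scores: (B, N) patch relevance; returns (B, k) sampled patch indices."""
    # Linear annealing of the temperature from t_max (layer 0) to t_min.
    t = t_max + (t_min - t_max) * layer / max(n_layers - 1, 1)
    probs = torch.softmax(scores / t, dim=-1)          # Boltzmann distribution
    return torch.multinomial(probs, num_samples=k)     # sparse patch subset

scores = torch.randn(2, 4096)                          # toy patch scores
early = boltzmann_sample(scores, layer=0, n_layers=12) # broad sampling
late = boltzmann_sample(scores, layer=11, n_layers=12) # focused sampling
```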
Poster
Rong Qin · Xingyu Liu · Jinglei Shi · Liang Lin · Jufeng Yang

[ ExHall D ]

Abstract
Over the last decade, significant efforts have been dedicated to designing efficient models for the challenge of ultra-high resolution (UHR) semantic segmentation. These models mainly follow the dual-stream architecture and generally fall into three subcategories according to the improvement objectives, i.e., dual-stream ensemble, selective zoom, and complementary learning. However, most of them overly concentrate on crafting complex pipelines to pursue one of the above objectives separately, limiting model performance in both accuracy and inference cost. In this paper, we suggest simultaneously achieving these objectives by estimating resolution-biased uncertainties in the low-resolution stream. Here, the resolution-biased uncertainty refers to the degree of prediction unreliability primarily caused by resolution loss from down-sampling operations. Specifically, we propose a dual-stream UHR segmentation framework, where an estimator is used to assess resolution-biased uncertainties through the entropy map and high-frequency feature residual. The framework also includes a selector, an ensembler, and a complementer to boost the model with the obtained estimations. They share the uncertainty estimations as the weights to choose difficult regions as the inputs for the UHR stream, perform weighted fusion between distinct streams, and enhance the learning for important pixels, respectively. Experimental results demonstrate that our method achieves a satisfactory balance between accuracy and …
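As a rough illustration of the entropy-based part of the resolution-biased uncertainty estimate, the snippet below computes a normalized per-pixel entropy map from the low-resolution logits and upsamples it to UHR size; the high-frequency feature residual term is omitted and all names are illustrative.

```python
# Minimal sketch of a resolution-biased uncertainty estimate from the
# low-resolution stream: per-pixel entropy of the softmax prediction,
# upsampled to full resolution and reused as a weight map. This is a
# simplified illustration, not the paper's estimator.
import torch
import torch.nn.functional as F

def entropy_uncertainty(logits_lr: torch.Tensor, uhr_size) -> torch.Tensor:
    """logits_lr: (B, C, h, w) low-res logits -> (B, 1, H, W) uncertainty."""
    p = torch.softmax(logits_lr, dim=1)
    entropy = -(p * torch.log(p.clamp_min(1e-8))).sum(dim=1, keepdim=True)
    entropy = entropy / torch.log(torch.tensor(float(logits_lr.shape[1])))
    return F.interpolate(entropy, size=uhr_size, mode="bilinear",
                         align_corners=False)

logits = torch.randn(1, 19, 128, 128)              # toy low-res prediction
weights = entropy_uncertainty(logits, (2048, 2048))
# `weights` could then select hard regions for the UHR stream, weight the
# fusion of the two streams, and re-weight the per-pixel training loss.
```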
Poster
Yankai Jiang · Peng Zhang · Donglin Yang · Yuan Tian · Hai Lin · Xiaosong Wang

[ ExHall D ]

Abstract
We explore Generalizable Tumor Segmentation, aiming to train a single model for zero-shot tumor segmentation across diverse anatomical regions. Existing methods face limitations related to segmentation quality, scalability, and the range of applicable imaging modalities. In this paper, we uncover the potential of the internal representations within frozen medical foundation diffusion models as highly efficient zero-shot learners for tumor segmentation by introducing a novel framework named DiffuGTS. DiffuGTS creates anomaly-aware open-vocabulary attention maps based on text prompts to enable generalizable anomaly segmentation without being restricted by a predefined training category list. To further improve and refine anomaly segmentation masks, DiffuGTS leverages the diffusion model, transforming pathological regions into high-quality pseudo-healthy counterparts through latent space inpainting, and applies a novel pixel-level and feature-level residual learning approach, resulting in segmentation masks with significantly enhanced quality and generalization. Comprehensive experiments on four datasets and seven tumor categories demonstrate the superior performance of our method, surpassing current state-of-the-art models across multiple zero-shot settings. The codes will be made publicly available.
Poster
Zheyu Zhang · Yayuan Lu · Feipeng Ma · Yueyi Zhang · Huanjing Yue · Xiaoyan Sun

[ ExHall D ]

Abstract
Brain tumor segmentation plays a crucial role in clinical diagnosis, yet the frequent unavailability of certain MRI modalities poses a significant challenge. In this paper, we introduce the Learnable Sorting State Space Model (LS3M), a novel framework designed to maximize the utilization of available modalities for brain tumor segmentation. LS3M excels at efficiently modeling long-range dependencies based on the Mamba design, while incorporating differentiable permutation matrices that reorder input sequences based on modality-specific characteristics. This dynamic reordering ensures that critical spatial inductive biases and long-range semantic correlations inherent in 3D brain MRI are preserved, which is crucial for incomplete multi-modal brain tumor segmentation. Once the input sequences are reordered using the generated permutation matrix, the Series State Space Model (S3M) block models the relationships between them, capturing both local and long-range dependencies. This enables effective representation of intra-modal and inter-modal relationships, significantly improving segmentation accuracy. Additionally, LS3M incorporates a global input strategy, augmented with relative position embeddings, providing richer contextual information and notably enhancing spatial awareness. Extensive experiments on the BraTS2018 and BraTS2020 datasets demonstrate that LS3M outperforms existing methods, offering a robust solution for brain tumor segmentation, particularly in scenarios with missing modalities.
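One common way to realize "differentiable permutation matrices" is a Sinkhorn relaxation of a learned score matrix, sketched below. This is an assumption about the general technique, not LS3M's actual construction, and the shapes are arbitrary.

```python
# Sketch of a differentiable (soft) permutation for reordering token
# sequences, via Sinkhorn normalization of a learned score matrix. The
# Sinkhorn relaxation is my assumption of one standard way to build a
# learnable permutation; LS3M's construction may differ.
import torch

def soft_permutation(scores: torch.Tensor, n_iters: int = 20,
                     tau: float = 0.1) -> torch.Tensor:
    """scores: (N, N) -> approximately doubly-stochastic soft permutation."""
    log_p = scores / tau
    for _ in range(n_iters):
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)  # rows
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True)  # cols
    return log_p.exp()

tokens = torch.randn(256, 64)                  # toy modality-specific tokens
scores = torch.randn(256, 256, requires_grad=True)
P = soft_permutation(scores)                   # (256, 256), differentiable
reordered = P @ tokens                         # reordered sequence for the next block
```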
Poster
Yang Yue · Yulin Wang · Haojun Jiang · Pan Liu · Shiji Song · Gao Huang

[ ExHall D ]

Abstract
Echocardiography is essential for cardiovascular disease detection, but it usually suffers from a heavy reliance on experienced sonographers. To address this, the echocardiography probe guidance system, which predicts real-time movement instructions for acquiring standard plane images, has emerged as a promising technique for enabling fully autonomous or AI-assisted echocardiography scanning. However, it poses unique challenges in developing proper machine learning models, which have rarely been explored in existing studies. In particular, an ideal guidance model needs to comprehend both the heart’s structural anatomy and the dynamic changes resulting from probe movements, while integrating historical visual-motion signals into the decision-making process. In response to these issues, this paper presents EchoWorld, a motion-aware world modeling framework for probe guidance that encodes anatomical knowledge and motion-induced visual dynamics, while effectively leveraging past visual-motion sequences to enhance guidance precision. EchoWorld employs a pre-training strategy inspired by world modeling principles, where the model predicts masked anatomical regions and simulates the visual outcomes of probe adjustments. Built upon this pre-trained model, we introduce a motion-aware attention mechanism in the fine-tuning stage that effectively integrates historical visual-motion data, enabling precise and adaptive probe guidance. Trained on more than one million ultrasound images from over 200 routine scans, EchoWorld effectively captures …
Poster
Armeet Singh Jatyani · Jiayun Wang · Aditi Chandrashekar · Zihui Wu · Miguel Liu-Schiaffini · Bahareh Tolooshams · Anima Anandkumar

[ ExHall D ]

Abstract
Compressed Sensing MRI reconstructs images of the body's internal anatomy from undersampled measurements, thereby reducing the scan time—the time subjects need to remain still. Recently, deep neural networks have shown great potential for reconstructing high-fidelity images from highly undersampled measurements in the frequency space. However, one needs to train multiple models for different undersampling patterns and desired output image resolutions, since most networks operate on a fixed discretization. Such approaches are highly impractical in clinical settings, where undersampling patterns and image resolutions are frequently changed to accommodate different real-time imaging and diagnostic requirements. We propose a unified model robust to different measurement undersampling patterns and image resolutions in compressed sensing MRI. Our model is based on neural operators, a discretization-agnostic architecture. Neural operators are employed in both image and measurement space, which capture local and global image features for MRI reconstruction. Empirically, we achieve consistent performance across different undersampling rates and patterns, with an average 11% SSIM and 4 dB PSNR improvement over a state-of-the-art, End-to-End VarNet. For efficiency, our inference speed is also 1,400x faster than diffusion methods. The resolution-agnostic design also enhances zero-shot super-resolution and extended field of view in reconstructed images. Our unified model offers a versatile solution …
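The discretization-agnostic property of neural operators can be illustrated with a generic Fourier layer: the weights act on a fixed number of Fourier modes, so the same layer applies unchanged at any input resolution. The code below is a standard FNO-style sketch and is not claimed to match the paper's architecture; all names are illustrative.

```python
# Illustrative Fourier-layer sketch of a discretization-agnostic operator:
# convolution is performed by re-weighting a fixed set of low Fourier modes,
# so identical weights work at any spatial resolution.
import torch
import torch.nn as nn

class SpectralConv2d(nn.Module):
    def __init__(self, channels: int, modes: int = 16):
        super().__init__()
        self.modes = modes
        scale = 1.0 / channels
        self.w = nn.Parameter(
            scale * torch.randn(channels, channels, modes, modes,
                                dtype=torch.cfloat))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, H, W = x.shape
        x_ft = torch.fft.rfft2(x)                       # (B, C, H, W//2+1)
        out_ft = torch.zeros(B, C, H, W // 2 + 1, dtype=torch.cfloat,
                             device=x.device)
        m = self.modes
        # Keep and re-weight only the lowest m x m modes (a simplification).
        out_ft[:, :, :m, :m] = torch.einsum(
            "bixy,ioxy->boxy", x_ft[:, :, :m, :m], self.w)
        return torch.fft.irfft2(out_ft, s=(H, W))       # back to image space

layer = SpectralConv2d(channels=8)
print(layer(torch.randn(1, 8, 64, 64)).shape)   # works at 64x64 ...
print(layer(torch.randn(1, 8, 96, 96)).shape)   # ... and at 96x96 unchanged
```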
Poster
Hastings Greer · Lin Tian · François-Xavier Vialard · Roland Kwitt · Raúl San José Estépar · Marc Niethammer

[ ExHall D ]

Abstract
Image registration estimates spatial correspondences between image pairs. These estimates are typically obtained via numerical optimization or regression by a deep network. A desirable property is that a correspondence estimate (e.g., the true oracle correspondence) for an image pair is maintained under deformations of the input images. Formally, the estimator should be equivariant to a desired class of image transformations. In this work, we present careful analyses of equivariance properties in the context of multi-step deep registration networks. Based on these analyses we 1) introduce the notions of [U,U] equivariance (network equivariance to the same deformations of the input images) and [W,U] equivariance (where input images can undergo different deformations); we 2) show that in a suitable multi-step registration setup it is sufficient for overall [W,U] equivariance if the first step has [W,U] equivariance and all others have [U,U] equivariance; we 3) show that common displacement-predicting networks only exhibit [U,U] equivariance to translations instead of the more powerful [W,U] equivariance; and we 4) show how to achieve multi-step [W,U] equivariance via a coordinate-attention mechanism combined with displacement-predicting networks. Our approach obtains excellent practical performance for 3D abdomen, lung, and brain medical image registration. We match or outperform state-of-the-art (SOTA) registration …
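One plausible way to formalize the two notions of equivariance discussed above is given below; the notation (the estimator Φ, the image pair I^A and I^B, and the composition conventions) is my own reading, and the paper's exact conventions for which image the map warps may differ.

```latex
% Let $\Phi$ map an image pair $(I^A, I^B)$ to a spatial transform
% $\varphi^{AB} = \Phi(I^A, I^B)$ aligning $I^A$ to $I^B$ (assumed convention).
\begin{align*}
  [U,U]\ \text{equivariance:}\quad
    &\Phi(I^A \circ U,\; I^B \circ U) = U^{-1} \circ \Phi(I^A, I^B) \circ U,\\
  [W,U]\ \text{equivariance:}\quad
    &\Phi(I^A \circ W,\; I^B \circ U) = U^{-1} \circ \Phi(I^A, I^B) \circ W.
\end{align*}
```

Under this reading, [U,U] is simply the special case W = U, while [W,U] lets the two inputs deform independently, which is why the abstract calls it the more powerful property and only requires it of the first registration step.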

Oral Session 6A: 3D from Single or Multi-View Sensors Sun 15 Jun 01:00 p.m.  

Oral
Jay Zhangjie Wu · Alex Zhang · Haithem Turki · Xuanchi Ren · Jun Gao · Mike Zheng Shou · Sanja Fidler · Žan Gojčič · Huan Ling

[ Karl Dean Ballroom ]

Abstract
Neural Radiance Fields and 3D Gaussian Splatting have revolutionized 3D reconstruction and novel-view synthesis tasks. However, achieving photorealistic rendering from extreme novel viewpoints remains challenging, as artifacts persist across representations. In this work, we introduce Difix3D+, a novel pipeline designed to enhance 3D reconstruction and novel-view synthesis through single-step diffusion models. At the core of our approach is Difix, a single-step image diffusion model trained to enhance and remove artifacts in rendered novel views caused by underconstrained regions of the 3D representation. Difix serves two critical roles in our pipeline. First, it is used during the reconstruction phase to clean up pseudo-training views that are rendered from the reconstruction and then distilled back into 3D. This greatly enhances underconstrained regions and improves the overall 3D representation quality. More importantly, Difix also acts as a neural enhancer during inference, effectively removing residual artifacts arising from imperfect 3D supervision and the limited capacity of current reconstruction models. Difix3D+ is a general solution, a single model compatible with both NeRF and 3DGS representations, and it achieves an average 2x improvement in FID score over baselines while maintaining 3D consistency.
Oral
Qi Wu · Janick Martinez Esturo · Ashkan Mirzaei · Nicolas Moënne-Loccoz · Žan Gojčič

[ Karl Dean Ballroom ]

Abstract
3D Gaussian Splatting (3DGS) has shown great potential for efficient reconstruction and high-fidelity real-time rendering of complex scenes on consumer hardware. However, due to its rasterization-based formulation, 3DGS is constrained to ideal pinhole cameras and lacks support for secondary lighting effects. Recent methods address these limitations by tracing volumetric particles instead; however, this comes at the cost of significantly slower rendering speeds. In this work, we propose 3D Gaussian Unscented Transform (3DGUT), replacing the EWA splatting formulation in 3DGS with the Unscented Transform that approximates the particles through sigma points, which can be projected exactly under any nonlinear projection function. This modification enables trivial support of distorted cameras with time-dependent effects such as rolling shutter, while retaining the efficiency of rasterization. Additionally, we align our rendering formulation with that of tracing-based methods, enabling secondary ray tracing required to represent phenomena such as reflections and refraction within the same 3D representation.
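For intuition, the unscented transform replaces a linearized projection of a Gaussian with sigma points pushed through an arbitrary nonlinear camera model, after which the projected mean and covariance are re-estimated. The numerical sketch below uses a toy pinhole projection and arbitrary parameter choices for illustration only; it is not 3DGUT's implementation.

```python
# Small numerical sketch of the unscented transform as a projection step:
# a 3D Gaussian is represented by 2n+1 sigma points, each is pushed through
# a nonlinear projection, and the 2D mean/covariance are re-estimated.
import torch

def unscented_project(mean, cov, project, kappa: float = 0.0):
    """mean: (3,), cov: (3,3), project: R^3 -> R^2. Returns 2D mean/cov."""
    n = mean.shape[0]
    L = torch.linalg.cholesky((n + kappa) * cov)
    sigma = [mean] + [mean + L[:, i] for i in range(n)] \
                   + [mean - L[:, i] for i in range(n)]   # 2n+1 sigma points
    w = torch.full((2 * n + 1,), 1.0 / (2 * (n + kappa)))
    w[0] = kappa / (n + kappa)
    y = torch.stack([project(s) for s in sigma])           # (2n+1, 2)
    mu = (w[:, None] * y).sum(dim=0)
    d = y - mu
    cov2d = (w[:, None, None] * d[:, :, None] * d[:, None, :]).sum(dim=0)
    return mu, cov2d

pinhole = lambda p: p[:2] / p[2]                  # toy nonlinear projection
mu2d, cov2d = unscented_project(torch.tensor([0.1, -0.2, 4.0]),
                                0.05 * torch.eye(3), pinhole)
```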
Oral
Xinyi Zhang · Naiqi Li · Angela Dai

[ Karl Dean Ballroom ]

Abstract
While remarkable success has been achieved through diffusion-based 3D generative models for shapes, 4D generative modeling remains challenging due to the complexity of object deformations over time. We propose DNF, a new 4D representation for unconditional generative modeling that efficiently models deformable shapes with disentangled shape and motion while capturing high-fidelity details in the deforming objects. To achieve this, we propose a dictionary learning approach to disentangle 4D motion from shape as neural fields. Both shape and motion are represented as learned latent spaces, where each deformable shape is represented by its shape and motion global latent codes, shape-specific coefficient vectors, and shared dictionary information. This captures both shape-specific detail and global shared information in the learned dictionary. Our dictionary-based representation well balances fidelity, contiguity and compression -- combined with a transformer-based diffusion model, our method is able to generate effective, high-fidelity 4D animations.
Oral
Rundi Wu · Ruiqi Gao · Ben Poole · Alex Trevithick · Changxi Zheng · Jonathan T. Barron · Aleksander Holynski

[ Karl Dean Ballroom ]

Abstract
We present CAT4D, a method for creating 4D (dynamic 3D) scenes from monocular video. CAT4D leverages a multi-view video diffusion model trained on a diverse combination of datasets to enable novel view synthesis at any specified camera poses and timestamps. Combined with a novel sampling approach, this model can transform a single monocular video into a multi-view video, enabling robust 4D reconstruction via optimization of a deformable 3D Gaussian representation. We demonstrate competitive performance on novel view synthesis and dynamic scene reconstruction benchmarks, and highlight the creative capabilities for 4D scene generation from real or generated videos.
Oral
Ruofan Liang · Žan Gojčič · Huan Ling · Jacob Munkberg · Jon Hasselgren · Chih-Hao Lin · Jun Gao · Alexander Keller · Nandita Vijaykumar · Sanja Fidler · Zian Wang

[ Karl Dean Ballroom ]

Abstract
Understanding and modeling lighting effects are fundamental tasks in computer vision and graphics. Classic physically-based rendering (PBR) accurately simulates the light transport, but relies on precise scene representations--explicit 3D geometry, high-quality material properties, and lighting conditions--that are often impractical to obtain in real-world scenarios. Therefore, we introduce Diffusion Renderer, a neural approach that addresses the dual problem of inverse and forward rendering within a holistic framework. Leveraging powerful video diffusion model priors, the inverse rendering model accurately estimates G-buffers from real-world videos, providing an interface for image editing tasks, and training data for the rendering model. Conversely, our rendering model generates photorealistic images from G-buffers without explicit light transport simulation. Specifically, we first train a video diffusion model for inverse rendering on synthetic data, which generalizes well to real-world videos and allows us to auto-label diverse real-world videos. We then co-train our rendering model using both synthetic and auto-labeled real-world data. Experiments demonstrate that Diffusion Renderer effectively approximates inverse and forward rendering, consistently outperforming the state-of-the-art. Our model enables practical applications from a single video input—including relighting, material editing, and realistic object insertion.

Oral Session 6B: Scene Understanding, Image Editing and Multimodal Learning Sun 15 Jun 01:00 p.m.  

Oral
Minhyeok Lee · Suhwan Cho · Jungho Lee · Sunghun Yang · Heeseung Choi · Ig-Jae Kim · Sangyoun Lee

[ ExHall A2 ]

Abstract
Open-vocabulary semantic segmentation aims to assign pixel-level labels to images across an unlimited range of classes. Traditional methods address this by sequentially connecting a powerful mask proposal generator, such as the Segment Anything Model (SAM), with a pre-trained vision-language model like CLIP. However, these two-stage approaches often suffer from high computational costs and memory inefficiencies. In this paper, we propose ESC-Net, a novel one-stage open-vocabulary segmentation model that leverages the SAM decoder blocks for class-agnostic segmentation within an efficient inference framework. By embedding pseudo prompts generated from image-text correlations into SAM’s promptable segmentation framework, ESC-Net achieves refined spatial aggregation for accurate mask predictions. Additionally, a Vision-Language Fusion (VLF) module enhances the final mask prediction through image and text guidance. ESC-Net achieves superior performance on standard benchmarks, including ADE20K, PASCAL-VOC, and PASCAL-Context, outperforming prior methods in both efficiency and accuracy. Comprehensive ablation studies further demonstrate its robustness across challenging conditions.
Oral
Yue Gao · Hong-Xing Yu · Bo Zhu · Jiajun Wu

[ ExHall A2 ]

Abstract
We study reconstructing and predicting 3D fluid appearance and velocity from a single video. Current methods require multi-view videos for fluid reconstruction. We present FluidNexus, a novel framework that bridges video generation and physics simulation to tackle this task. Our key insight is to synthesize multiple novel-view videos as references for reconstruction. FluidNexus consists of two key components: (1) a novel-view video synthesizer that combines frame-wise view synthesis with video diffusion refinement for generating realistic videos, and (2) a physics-integrated particle representation coupling differentiable simulation and rendering to simultaneously facilitate 3D fluid reconstruction and prediction. To evaluate our approach, we collect two new real-world fluid datasets featuring textured backgrounds and object interactions. Our method enables dynamic novel view synthesis, future prediction, and interaction simulation from a single fluid video. We will release code and datasets.
Oral
Chen Geng · Yunzhi Zhang · Shangzhe Wu · Jiajun Wu

[ ExHall A2 ]

Abstract
We study the problem of generating temporal object intrinsics—temporally evolving sequences of object geometry, reflectance, and texture, such as a blooming rose—from pre-trained 2D foundation models. Unlike conventional 3D modeling and animation techniques that require extensive manual effort and expertise, we introduce a method that generates such assets with signals distilled from pretrained 2D diffusion models. To ensure the temporal consistency of object intrinsics, we propose Neural Templates for temporal-state-guided distillation, derived automatically from image features from self-supervised learning. Our method can generate high-quality temporal object intrinsics for several natural phenomena and enable the sampling and controllable rendering of these dynamic objects from any viewpoint, under any environmental lighting conditions, at any time of their lifespan.
Oral
Shangquan Sun · Wenqi Ren · Juxiang Zhou · Shu Wang · Jianhou Gan · Xiaochun Cao

[ ExHall A2 ]

Abstract
Significant progress has been made in video restoration under rainy conditions over the past decade, largely propelled by advancements in deep learning. Nevertheless, existing methods that depend on paired data struggle to generalize effectively to real-world scenarios, primarily due to the disparity between synthetic and authentic rain effects. To address these limitations, we propose a dual-branch spatio-temporal state-space model to enhance rain streak removal in video sequences. Specifically, we design spatial and temporal state-space model layers to extract spatial features and incorporate temporal dependencies across frames, respectively. To improve multi-frame feature fusion, we derive a dynamic stacking filter, which adaptively approximates statistical filters for superior pixel-wise feature refinement. Moreover, we integrate a median stacking loss to enable semi-supervised learning by generating pseudo-clean patches based on the sparsity prior of rain. To further explore the capacity of deraining models in supporting other vision-based tasks in rainy environments, we introduce a novel real-world benchmark focused on object detection and tracking in rainy conditions. Our method is extensively evaluated across multiple benchmarks containing numerous synthetic and real-world rainy videos, consistently demonstrating its superiority in quantitative metrics, visual quality, efficiency, and its utility for downstream tasks. Our code will be made publicly available.
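The median stacking idea can be pictured as follows: because rain streaks are sparse in time, the temporal median of (aligned) neighboring frames approximates a clean frame and can supervise predictions on unlabeled real videos. The snippet below is a hedged sketch under that reading; it skips frame alignment and all names are made up for illustration.

```python
# Sketch of a median-stacking pseudo-label for semi-supervised deraining:
# the temporal median of a short clip is used as a pseudo-clean target,
# exploiting the sparsity of rain over time. Alignment is omitted.
import torch
import torch.nn.functional as F

def median_stacking_loss(pred: torch.Tensor, frames: torch.Tensor) -> torch.Tensor:
    """pred: (B, C, H, W) derained center frame; frames: (B, T, C, H, W)."""
    pseudo_clean = frames.median(dim=1).values      # rain is sparse over T
    return F.l1_loss(pred, pseudo_clean)

frames = torch.rand(2, 7, 3, 64, 64)                # toy rainy clip
pred = torch.rand(2, 3, 64, 64, requires_grad=True)
loss = median_stacking_loss(pred, frames)
loss.backward()
```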
Oral
Qifan Yu · Wei Chow · Zhongqi Yue · Kaihang Pan · Yang Wu · Xiaoyang Wan · Juncheng Li · Siliang Tang · Hanwang Zhang · Yueting Zhuang

[ ExHall A2 ]

Abstract
Instruction-based image editing aims to modify specific image elements with natural language instructions. However, current models in this domain often struggle to accurately execute complex user instructions, as they are trained on low-quality data with limited editing types. We present AnyEdit, a comprehensive multi-modal instruction editing dataset, comprising 2.5 million high-quality editing pairs spanning over 20 editing types and five domains. We ensure the diversity and quality of the AnyEdit collection through three aspects: initial data diversity, adaptive editing process, and automated selection of editing results. Using the dataset, we further train a novel AnyEdit Stable Diffusion with task-aware routing and learnable task embedding for unified image editing. Comprehensive experiments on three benchmark datasets show that AnyEdit consistently boosts the performance of diffusion-based editing models. This presents prospects for developing instruction-driven image editing models that support human creativity. The code is available at https://anonymous.4open.science/r/AnyEdit-C53B.
Oral
Kaihang Pan · w l · Zhongqi Yue · Tenglong Ao · Liyu Jia · Wei Zhao · Juncheng Li · Siliang Tang · Hanwang Zhang

[ ExHall A2 ]

Abstract
Recent endeavors in Multimodal Large Language Models (MLLMs) aim to unify visual comprehension and generation by combining LLMs and diffusion models, the state-of-the-art in each task, respectively. Existing approaches rely on spatial visual tokens, where image patches are encoded and arranged according to a spatial order (e.g., raster scan). However, we show that spatial tokens lack the recursive structure inherent to languages, hence forming an impossible language for an LLM to master. In this paper, we build a proper visual language by leveraging diffusion timesteps to learn discrete, recursive visual tokens. Our proposed tokens recursively compensate for the progressive attribute loss in noisy images as timesteps increase, enabling the diffusion model to reconstruct the original image at any timestep. This approach allows us to effectively integrate the strengths of LLMs in autoregressive reasoning and diffusion models in precise image generation, achieving seamless multimodal comprehension and generation within a unified framework. Extensive experiments show that we achieve a new SOTA for multimodal comprehension and generation simultaneously compared with other MLLMs.

Oral Session 6C: Video, Action, and Language Sun 15 Jun 01:00 p.m.  

Oral
feilong tang · Chengzhi Liu · Zhongxing Xu · Ming Hu · Zile Huang · Haochen Xue · Ziyang Chen · Zelin Peng · Zhiwei Yang · Sijin Zhou · Wenxue Li · Yulong Li · Wenxuan Song · Shiyan Su · Wei Feng · Jionglong Su · Mingquan Lin · Yifan Peng · Xuelian Cheng · Imran Razzak · Zongyuan Ge

[ Davidson Ballroom ]

Abstract
Recent advancements in multimodal large language models (MLLMs) have significantly improved performance in visual question answering. However, they often suffer from hallucinations. In this work, hallucinations are categorized into two main types: initial hallucinations and snowball hallucinations. We argue that adequate contextual information can be extracted directly from the token interaction process. Inspired by causal inference in decoding strategy, we propose to leverage causal masks to establish information propagation between multimodal tokens. The hypothesis is that insufficient interaction between those tokens may lead the model to rely on outlier tokens, overlooking dense and rich contextual cues. Therefore, we propose to intervene in the propagation process by tackling outlier tokens to enhance in-context inference. With this goal, we present FarSight, a versatile plug-and-play decoding strategy to reduce attention interference from outlier tokens merely by optimizing the causal mask. The heart of our method is effective token propagation. We design an attention register structure within the upper triangular matrix of the causal mask, dynamically allocating attention registers to capture the attention diverted to outlier tokens. Moreover, a positional awareness encoding method with a diminishing masking rate is proposed, allowing the model to attend to further preceding tokens, especially for video sequence tasks. With extensive experiments, …
Oral
Yan Shu · Zheng Liu · Peitian Zhang · Minghao Qin · Junjie Zhou · Zhengyang Liang · Tiejun Huang · Bo Zhao

[ Davidson Ballroom ]

Abstract
Long video understanding poses a significant challenge for current Multi-modal Large Language Models (MLLMs). Notably, the MLLMs are constrained by their limited context lengths and the substantial costs of processing long videos. Although several existing methods attempt to reduce visual tokens, their strategies encounter a severe bottleneck, restricting MLLMs' ability to perceive fine-grained visual details. In this work, we propose Video-XL, a novel approach that leverages MLLMs' inherent key-value (KV) sparsification capacity to condense the visual input. Specifically, we introduce a new special token, the Visual Summarization Token (VST), for each interval of the video, which summarizes the visual information within the interval as its associated KV. The VST module is trained by instruction fine-tuning, where two optimizing strategies are offered. 1. Curriculum learning, where VST learns to perform small (easy) and large (hard) compression progressively. 2. Composite data curation, which integrates single-image, multi-image, and synthetic data to overcome the scarcity of long-video instruction data. The compression quality is further improved by dynamic compression, which customizes compression granularity based on the information density of different video intervals. Video-XL's effectiveness is verified from three aspects. First, it achieves a superior long-video understanding capability, outperforming state-of-the-art models of comparable sizes across multiple popular …
Oral
Jian Liang · Wenke Huang · Guancheng Wan · Qu Yang · Mang Ye

[ Davidson Ballroom ]

Abstract
While Multimodal Large Language Models (MLLMs) excel at generalizing across modalities and tasks, effectively adapting them to specific downstream tasks while simultaneously retaining both general and specialized knowledge remains challenging. Although Low-Rank Adaptation (LoRA) is widely used to efficiently acquire specialized knowledge in MLLMs, it introduces substantial harmful redundancy during visual instruction tuning, which exacerbates the forgetting of general knowledge and degrades downstream task performance. To address this issue, we propose LoRASculpt to eliminate harmful redundant parameters, thereby harmonizing general and specialized knowledge. Specifically, under theoretical guarantees, we introduce sparse updates into LoRA to discard redundant parameters effectively. Furthermore, we propose a Conflict Mitigation Regularizer to refine the update trajectory of LoRA, mitigating knowledge conflicts with the pretrained weights. Extensive experimental results demonstrate that even at a very high degree of sparsity (5%), our method simultaneously enhances generalization and downstream task performance. This confirms that our approach effectively mitigates the catastrophic forgetting issue and further promotes knowledge harmonization in MLLMs.
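A minimal sketch of the general idea of sparsifying a LoRA update is shown below: keep only the largest-magnitude entries of the merged delta before adding it to the frozen weight. The magnitude criterion and the density value are arbitrary illustrative choices, not LoRASculpt's exact procedure or its theoretical guarantees.

```python
# Sketch of sparsifying a LoRA update by magnitude: only the largest entries
# of the merged delta W = B @ A are kept, discarding redundant parameters
# before merging into the pretrained weight. Selection rule and density are
# illustrative assumptions.
import torch

def sparse_lora_delta(A: torch.Tensor, B: torch.Tensor,
                      density: float = 0.05) -> torch.Tensor:
    """A: (r, in), B: (out, r). Returns a sparsified delta of shape (out, in)."""
    delta = B @ A
    k = max(1, int(density * delta.numel()))
    threshold = delta.abs().flatten().kthvalue(delta.numel() - k + 1).values
    return delta * (delta.abs() >= threshold)   # keep only top-k entries

W0 = torch.randn(512, 512)                 # frozen pretrained weight
A, B = torch.randn(8, 512), torch.randn(512, 8)
W_adapted = W0 + sparse_lora_delta(A, B)   # merged weight with sparse update
```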
Oral
Songhao Han · Wei Huang · Hairong Shi · Le Zhuo · Xiu Su · Shifeng Zhang · Xu Zhou · Xiaojuan Qi · Yue Liao · Si Liu

[ Davidson Ballroom ]

Abstract
The advancement of Large Vision Language Models (LVLMs) has significantly improved multimodal understanding, yet challenges remain in video reasoning tasks due to the scarcity of high-quality, large-scale datasets. Existing video question-answering (VideoQA) datasets often rely on costly manual annotations with insufficient granularity or automatic construction methods with redundant frame-by-frame analysis, limiting their scalability and effectiveness for complex reasoning. To address these challenges, we introduce VideoEspresso, a novel dataset that features VideoQA pairs preserving essential spatial details and temporal coherence, along with multimodal annotations of intermediate reasoning steps. Our construction pipeline employs a semantic-aware method to reduce redundancy, followed by generating QA pairs using GPT-4o. We further develop video Chain-of-Thought (CoT) annotations to enrich reasoning processes, guiding GPT-4o in extracting logical relationships from QA pairs and video content. To exploit the potential of high-quality VideoQA pairs, we propose a Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a two-stage instruction fine-tuned reasoning LVLM. This framework adaptively selects core frames and performs CoT reasoning using multimodal evidence. Evaluated on our proposed benchmark with 14 tasks against 9 popular LVLMs, our method outperforms existing baselines on most tasks, demonstrating superior video reasoning capabilities.
Oral
Lan Wang · Yujia Chen · Wen-Sheng Chu · Vishnu Naresh Boddeti · Du Tran

[ Davidson Ballroom ]

Abstract
Long video understanding presents challenges due to the inherent high computational complexity and redundant temporal information. An effective representation for long videos must process such redundancy efficiently while preserving essential contents for downstream tasks. This paper introduces **SE**mantic **A**ttention **L**earning (SEAL), a novel unified representation for long videos. To reduce computational complexity, long videos are decomposed into three distinct types of semantic entities: scenes, objects, and actions, allowing models to operate on a handful of entities rather than a large number of frames or pixels. To further address redundancy, we propose an attention learning module that balances token relevance with diversity formulated as a subset selection optimization problem. Our representation is versatile, enabling applications across various long video understanding tasks. Extensive experiments show that SEAL significantly outperforms state-of-the-art methods in video question answering and temporal grounding tasks and benchmarks including LVBench, MovieChat-1K, and Ego4D.
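The relevance-versus-diversity subset selection can be approximated by a greedy maximal-marginal-relevance loop over entity tokens, sketched below. Treating the optimization this way is my simplification of the abstract's formulation, and all names and values are illustrative.

```python
# Sketch of selecting a small, relevant-yet-diverse subset of entity tokens,
# phrased as greedy maximal-marginal-relevance (MMR): each step picks the
# token that is most relevant to the query while least redundant with the
# tokens already chosen.
import torch
import torch.nn.functional as F

def select_tokens(tokens: torch.Tensor, query: torch.Tensor,
                  k: int = 32, lam: float = 0.7) -> list[int]:
    """tokens: (N, D), query: (D,). Returns indices of k selected tokens."""
    t = F.normalize(tokens, dim=-1)
    relevance = t @ F.normalize(query, dim=-1)          # (N,)
    selected: list[int] = []
    for _ in range(k):
        if selected:
            redundancy = (t @ t[selected].T).max(dim=1).values
        else:
            redundancy = torch.zeros_like(relevance)
        score = lam * relevance - (1 - lam) * redundancy
        score[selected] = float("-inf")                  # no repeats
        selected.append(int(score.argmax()))
    return selected

tokens = torch.randn(500, 256)                  # toy scene/object/action tokens
idx = select_tokens(tokens, query=torch.randn(256))
```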
Oral
Boseung Jeong · Jicheol Park · Sungyeon Kim · Suha Kwak

[ Davidson Ballroom ]

Abstract
Video-text retrieval, the task of retrieving videos based on a textual query or vice versa, is of paramount importance for video understanding and multimodal information retrieval. Recent methods in this area rely primarily on visual and textual features and often ignore audio, although it helps enhance overall comprehension of video content. Moreover, traditional models that incorporate audio blindly utilize the audio input regardless of whether it is useful or not, resulting in suboptimal video representation. To address these limitations, we propose a novel video-text retrieval framework, Audio-guided VIdeo representation learning with GATEd attention (AVIGATE), that effectively leverages audio cues through a gated attention mechanism that selectively filters out uninformative audio signals. In addition, we propose an adaptive margin-based contrastive loss to deal with the inherently unclear positive-negative relationship between video and text, which facilitates learning better video-text alignment. Our extensive experiments demonstrate that AVIGATE achieves state-of-the-art performance on all the public benchmarks.
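A hedged sketch of gated audio-visual fusion is given below: audio attends into the video tokens and a learned gate decides how much of that signal to admit, so uninformative audio can be suppressed. Dimensions and the exact gating form are assumptions, not AVIGATE's design.

```python
# Sketch of gated attention for audio-guided video representations: audio
# features are attended from the video tokens, and a learned scalar gate
# modulates how much audio-conditioned signal is added back.
import torch
import torch.nn as nn

class GatedAudioFusion(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                  nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, video: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        """video: (B, Tv, D), audio: (B, Ta, D) -> fused (B, Tv, D)."""
        audio_ctx, _ = self.attn(video, audio, audio)        # audio-conditioned
        g = self.gate(torch.cat([video.mean(1), audio.mean(1)], dim=-1))  # (B, 1)
        return video + g.unsqueeze(1) * audio_ctx            # gate filters audio

fusion = GatedAudioFusion()
out = fusion(torch.randn(4, 12, 512), torch.randn(4, 32, 512))
```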




Poster Session 6 Sun 15 Jun 04:00 p.m.  

Poster
Haotian Wang · Yuzhe Weng · Yueyan Li · Zilu Guo · Jun Du · Shutong Niu · Jiefeng Ma · Shan He · Wu Xiaoyan · Qiming Hu · Bing Yin · Cong Liu · Qingfeng Liu

[ ExHall D ]

Abstract
Diffusion models have revolutionized the field of talking head generation, yet still face challenges in expressiveness, controllability, and stability in long-time generation. In this research, we propose an EmotiveTalk framework to address these issues. Firstly, to realize better control over the generation of lip movement and facial expression, a Vision-guided Audio Information Decoupling (V-AID) approach is designed to generate audio-based decoupled representations aligned with lip movements and expression. Specifically, to achieve alignment between audio and facial expression representation spaces, we present a Diffusion-based Co-speech Temporal Expansion (Di-CTE) module within V-AID to generate expression-related representations under multi-source emotion condition constraints. Then we propose a well-designed Emotional Talking Head Diffusion (ETHD) backbone to efficiently generate highly expressive talking head videos, which contains an Expression Decoupling Injection (EDI) module to automatically decouple the expressions from reference portraits while integrating the target expression information, achieving more expressive generation performance. Experimental results show that EmotiveTalk can generate expressive talking head videos, ensuring the promised controllability of emotions and stability during long-time generation, yielding state-of-the-art performance compared to existing methods.
Poster
Huaize Liu · WenZhang Sun · Donglin Di · Shibo Sun · Jiahui Yang · Hujun Bao · Changqing Zou

[ ExHall D ]

Abstract
The generation of talking avatars has achieved significant advancements in precise audio synchronization. However, crafting lifelike talking head videos requires capturing a broad spectrum of emotions and subtle facial expressions. Current methods face fundamental challenges: a) the absence of frameworks for modeling single basic emotional expressions, which restricts the generation of complex emotions such as compound emotions; b) the lack of comprehensive datasets rich in human emotional expressions, which limits the potential of models. To address these challenges, we propose the following innovations: 1) the Mixture of Emotion Experts (MoEE) model, which decouples six fundamental emotions to enable the precise synthesis of both singular and compound emotional states; 2) the DH-FaceEmoVid-150 dataset, specifically curated to include six prevalent human emotional expressions as well as four types of compound emotions, thereby expanding the training potential of emotion-driven models; 3) an emotion-to-latents module that leverages multimodal inputs, aligning diverse control signals—such as audio, text, and labels—to enhance audio-driven emotion control. Through extensive quantitative and qualitative evaluations, we demonstrate that the MoEE framework, in conjunction with the DH-FaceEmoVid-150 dataset, excels in generating complex emotional expressions and nuanced facial details, setting a new benchmark in the field. These datasets will be publicly released.
Poster
Shuling Zhao · Fa-Ting Hong · Xiaoshui Huang · Dan Xu

[ ExHall D ]

Abstract
Talking head video generation aims to generate a realistic talking head video that preserves the person’s identity from a source image and the motion from a driving video. Despite the promising progress made in the field, it remains a challenging and critical problem to generate videos with accurate poses and fine-grained facial details simultaneously. Essentially, facial motion is often highly complex to model precisely, and the one-shot source face image cannot provide sufficient appearance guidance during generation due to dynamic pose changes. To tackle the problem, we propose to jointly learn motion and appearance codebooks and perform multi-scale codebook compensation to effectively refine both the facial motion conditions and appearance features for talking face image decoding. Specifically, the designed multi-scale motion and appearance codebooks are learned simultaneously in a unified framework to store representative global facial motion flow and appearance patterns. Then, we present a novel multi-scale motion and appearance compensation module, which utilizes a transformer-based codebook retrieval strategy to query complementary information from the two codebooks for joint motion and appearance compensation. The entire process produces motion flows of greater flexibility and appearance features with fewer distortions across different scales, resulting in a high-quality talking head video generation framework. Extensive experiments …
Poster
Yukang Lin · Hokit Fung · Jianjin Xu · Zeping Ren · Adela S.M. Lau · Guosheng Yin · Xiu Li

[ ExHall D ]

Abstract
Recent portrait animation methods have made significant strides in generating realistic lip synchronization. However, they often lack explicit control over head movements and facial expressions, and cannot produce videos from multiple viewpoints, resulting in less controllable and expressive animations. Moreover, text-guided portrait animation remains underexplored, despite its user-friendly nature. In this paper, we present a novel two-stage text-guided framework, MVPortrait, to generate expressive multi-view portrait animations that faithfully capture the described motion and emotion. MVPortrait is the first to introduce FLAME as an intermediate representation, effectively embedding facial movements, expressions, and view transformations within its parameter space. In the first stage, we separately train the FLAME motion and emotion diffusion models based on text input. In the second stage, we train a multi-view video generation model conditioned on a reference portrait image and multi-view FLAME rendering sequences from the first stage. Experimental results exhibit that MVPortrait outperforms existing methods in terms of motion and emotion control, as well as view consistency. Furthermore, by leveraging FLAME as a bridge, MVPortrait becomes the first controllable portrait animation framework that is compatible with text, speech, and video as driving signals.
Poster
Fa-Ting Hong · Zhan Xu · Haiyang Liu · Qinjie Lin · Luchuan Song · ZHIXIN SHU · Yang Zhou · Duygu Ceylan · Dan Xu

[ ExHall D ]

Abstract
Diffusion-based human animation aims to animate a human character based on a source human image as well as driving signals such as a sequence of poses. Leveraging the generative capacity of diffusion models, existing approaches are able to generate high-fidelity poses, but struggle with significant viewpoint changes, especially in zoom-in/zoom-out scenarios where camera-character distance varies. This limits applications such as cinematic shot type planning or camera control. We propose a pose-correlated reference selection diffusion network, supporting substantial viewpoint variations in human animation. Our key idea is to enable the network to utilize multiple reference images as input, since significant viewpoint changes often lead to missing appearance details on the human body. To eliminate the computational cost, we first introduce a novel pose correlation module to compute similarities between non-aligned target and source poses, and then propose an adaptive reference selection strategy, utilizing the attention map to identify key regions for animation generation. To train our model, we curated a large dataset from public TED talks featuring varied shots of the same character, helping the model learn synthesis for different perspectives. Our experimental results show that with the same number of reference images, our model performs favorably compared to the …
Poster
Yuming Gu · Phong Tran · Yujian Zheng · Hongyi Xu · Heyuan Li · Adilbek Karmanov · Hao Li

[ ExHall D ]

Abstract
Generating high-quality 360-degree views of human heads from single-view images is essential for enabling accessible immersive telepresence applications and scalable personalized content creation. While cutting-edge methods for full head generation are limited to modeling realistic human heads, the latest diffusion-based approaches for style-omniscient head synthesis can produce only frontal views and struggle with view consistency, preventing their conversion into true 3D models for rendering from arbitrary angles. We introduce a novel approach that generates fully consistent 360-degree head views, accommodating human, stylized, and anthropomorphic forms, including accessories like glasses and hats. Our method builds on the DiffPortrait3D framework, incorporating a custom ControlNet for back-of-head detail generation and a dual appearance module to ensure global front-back consistency. By training on continuous view sequences and integrating a back reference image, our approach achieves robust, locally continuous view synthesis. Our model can be used to produce high-quality neural radiance fields (NeRFs) for real-time, free-viewpoint rendering, outperforming state-of-the-art methods in object synthesis and 360-degree head generation for very challenging input portraits.
Poster
Cong Wang · Di Kang · Heyi Sun · SHENHAN QIAN · Zixuan Wang · Linchao Bao · Song-Hai Zhang

[ ExHall D ]

Abstract
Creating high-fidelity head avatars from multi-view videos is essential for many AR/VR applications. However, current methods often struggle to achieve high-quality renderings across all head components (e.g., skin vs. hair) due to the limitations of using one single representation for elements with varying characteristics. In this paper, we introduce a Hybrid Mesh-Gaussian Head Avatar (MeGA) that models different head components with more suitable representations. Specifically, we employ an enhanced FLAME mesh for the facial representation and predict a UV displacement map to provide per-vertex offsets for improved personalized geometric details. To achieve photorealistic rendering, we use deferred neural rendering to obtain facial colors and decompose neural textures into three meaningful parts. For hair modeling, we first build a static canonical hair using 3D Gaussian Splatting. A rigid transformation and an MLP-based deformation field are further applied to handle complex dynamic expressions. Combined with our occlusion-aware blending, MeGA generates higher-fidelity renderings for the whole head and naturally supports diverse downstream tasks. Experiments on the NeRSemble dataset validate the effectiveness of our designs, outperforming previous state-of-the-art methods and enabling versatile editing capabilities, including hairstyle alteration and texture editing.
Poster
Dongbin Zhang · Yunfei Liu · Lijian Lin · Ye Zhu · Kangjie Chen · Minghan Qin · Yu Li · Haoqian Wang

[ ExHall D ]

Abstract
Reconstructing animatable and high-quality 3D head avatars from monocular videos, especially with realistic relighting, is a valuable task. However, the limited information from single-view input, combined with the complex head poses and facial movements, makes this challenging. Previous methods achieve real-time performance by combining 3D Gaussian Splatting with a parametric head model, but the resulting head quality suffers from inaccurate face tracking and limited expressiveness of the deformation model. These methods also fail to produce realistic effects under novel lighting conditions. To address these issues, we propose HRAvatar, a 3DGS-based method that reconstructs high-fidelity, relightable 3D head avatars. HRAvatar reduces tracking errors through end-to-end optimization and better captures individual facial deformations using learnable blendshapes and learnable linear blend skinning. Additionally, it decomposes head appearance into several physical properties and incorporates physically-based shading to account for environmental lighting. Extensive experiments demonstrate that HRAvatar not only reconstructs superior-quality heads but also achieves realistic visual effects under varying lighting conditions.
Poster
Youyi Zhan · Tianjia Shao · Yin Yang · Kun Zhou

[ ExHall D ]

Abstract
Many works have succeeded in reconstructing Gaussian human avatars from multi-view videos. However, they either struggle to capture pose-dependent appearance details with a single MLP, or rely on a computationally intensive neural network to reconstruct high-fidelity appearance but with rendering performance degraded to non-real-time. We propose a novel Gaussian human avatar representation that can reconstruct high-fidelity pose-dependent appearance with details and meanwhile can be rendered in real time. Our Gaussian avatar is empowered by spatially distributed MLPs which are explicitly located at different positions on the human body. The parameters stored in each Gaussian are obtained by interpolating from the outputs of its nearby MLPs based on their distances. To avoid undesired smooth Gaussian property changing during interpolation, for each Gaussian we define a set of Gaussian offset basis, and a linear combination of the basis represents the Gaussian property offsets relative to the neutral properties. Then we propose to let the MLPs output a set of coefficients corresponding to the basis. In this way, although the Gaussian coefficients are derived from interpolation and change smoothly, the Gaussian offset basis is learned freely without constraints. The smoothly varying coefficients combined with the freely learned basis can still produce distinctly different Gaussian property offsets, allowing …
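The interpolation scheme can be sketched as follows: each Gaussian blends the coefficient vectors predicted by its nearby MLPs (here by inverse distance, an assumption) and applies the blended coefficients to its own offset basis, so smooth coefficients can still yield sharply varying offsets. Shapes and names below are illustrative, not the paper's.

```python
# Sketch of interpolating Gaussian property offsets from spatially
# distributed MLPs: blend per-MLP coefficients by inverse distance, then
# contract the blended coefficients with each Gaussian's own offset basis.
import torch

def gaussian_offsets(coeff_per_mlp: torch.Tensor, dists: torch.Tensor,
                     basis: torch.Tensor) -> torch.Tensor:
    """
    coeff_per_mlp: (G, M, K) coefficients output by the M nearest MLPs
    dists:         (G, M)    distances from each Gaussian to those MLPs
    basis:         (G, K, P) per-Gaussian offset basis over P properties
    returns        (G, P)    property offsets relative to neutral values
    """
    w = 1.0 / dists.clamp_min(1e-6)
    w = w / w.sum(dim=1, keepdim=True)                      # inverse-distance weights
    coeff = (w.unsqueeze(-1) * coeff_per_mlp).sum(dim=1)    # (G, K), varies smoothly
    return torch.einsum("gk,gkp->gp", coeff, basis)         # offsets can still vary sharply

offsets = gaussian_offsets(torch.randn(1000, 4, 16),
                           torch.rand(1000, 4) + 0.1,
                           torch.randn(1000, 16, 8))
```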
Poster
Yiyu Zhuang · Jiaxi Lv · Hao Wen · Qing Shuai · Ailing Zeng · Hao Zhu · Shifeng Chen · Yujiu Yang · Xun Cao · Wei Liu

[ ExHall D ]

Abstract
Creating a high-fidelity, animatable 3D full-body avatar from a single image is a challenging task due to the diverse appearance and poses of humans and the limited availability of high-quality training data. To achieve fast and high-quality human reconstruction, this work rethinks the task from the perspectives of dataset, model, and representation. First, we introduce a large-scale HUman GEnerated training dataset, HuGe100K, consisting of 100K diverse, photorealistic human images, each with corresponding 24-view frames in a static or dynamic pose, generated via a pose-controllable image-to-video model. Next, leveraging the diversity in views, poses, and appearances within HuGe100K, we develop a scalable feed-forward transformer model to predict a 3D human Gaussian representation in a uniform space from a given human image. This model is trained to disentangle human pose, shape, clothing geometry, and texture. Accordingly, the estimated Gaussians can be animated robustly without post-processing. We conduct comprehensive experiments to validate the effectiveness of the proposed dataset and method. Our model demonstrates the generalizable ability to efficiently reconstruct photorealistic humans in under 1 second using a single GPU. Additionally, it seamlessly supports various applications, including animation, shape, and texture editing tasks.
Poster
Chen Geng · Yunzhi Zhang · Shangzhe Wu · Jiajun Wu

[ ExHall D ]

Abstract
We study the problem of generating temporal object intrinsics—temporally evolving sequences of object geometry, reflectance, and texture, such as a blooming rose—from pre-trained 2D foundation models. Unlike conventional 3D modeling and animation techniques that require extensive manual effort and expertise, we introduce a method that generates such assets with signals distilled from pretrained 2D diffusion models. To ensure the temporal consistency of object intrinsics, we propose Neural Templates for temporal-state-guided distillation, derived automatically from image features from self-supervised learning. Our method can generate high-quality temporal object intrinsics for several natural phenomena and enable the sampling and controllable rendering of these dynamic objects from any viewpoint, under any environmental lighting conditions, at any time of their lifespan.
Poster
Xinyi Zhang · Naiqi Li · Angela Dai

[ ExHall D ]

Abstract
While remarkable success has been achieved through diffusion-based 3D generative models for shapes, 4D generative modeling remains challenging due to the complexity of object deformations over time. We propose DNF, a new 4D representation for unconditional generative modeling that efficiently models deformable shapes with disentangled shape and motion while capturing high-fidelity details in the deforming objects. To achieve this, we propose a dictionary learning approach to disentangle 4D motion from shape as neural fields. Both shape and motion are represented as learned latent spaces, where each deformable shape is represented by its shape and motion global latent codes, shape-specific coefficient vectors, and shared dictionary information. This captures both shape-specific detail and global shared information in the learned dictionary. Our dictionary-based representation well balances fidelity, contiguity and compression -- combined with a transformer-based diffusion model, our method is able to generate effective, high-fidelity 4D animations.
Poster
Xueting Li · Ye Yuan · Shalini De Mello · Miles Macklin · Jonathan Leaf · Gilles Daviet · Jan Kautz · Umar Iqbal

[ ExHall D ]

Abstract
We introduce SimAvatar, a framework designed to generate simulation-ready clothed 3D human avatars from a text prompt. Current text-driven human avatar generation methods either model hair, clothing and human body using a unified geometry or produce hair and garments that are not easily adaptable for simulation within existing graphics pipelines. The primary challenge lies in representing the hair and garment geometry in a way that allows leveraging established prior knowledge from foundational image diffusion models (e.g., Stable Diffusion) while being simulation-ready using either physics or neural simulators. To address this task, we propose a two-stage framework that combines the flexibility of 3D Gaussians with simulation-ready hair strands and garment meshes. Specifically, we first leverage two text-conditioned diffusion models to generate garment meshes and hair strands from the given text prompt. To leverage prior knowledge from foundational diffusion models, we attach 3D Gaussians to the body mesh, garment mesh, as well as hair strands and learn the avatar appearance through optimization. To drive the avatar given a pose sequence, we first apply physics simulators onto the garment meshes and hair strands. We then transfer the motion onto 3D Gaussians through a carefully designed mechanism for different body parts. As a result, our …
Poster
Hui En Pang · Shuai Liu · Zhongang Cai · Lei Yang · Tianwei Zhang · Ziwei Liu

[ ExHall D ]

Abstract
We present Disco4D, a novel Gaussian Splatting framework for 4D human generation and animation from a single image. Different from existing methods, Disco4D distinctively disentangles clothing (with Gaussian models) from the human body (with the SMPL-X model), significantly enhancing the generation details and flexibility. It has the following technical innovations. 1) Disco4D learns to efficiently fit the clothing Gaussians over the SMPL-X Gaussians. 2) It adopts diffusion models to enhance the 3D generation process, e.g., modeling occluded parts not visible in the input image. 3) It learns an identity encoding for each clothing Gaussian to facilitate the separation and extraction of clothing assets. Furthermore, Disco4D naturally supports 4D human animation with vivid dynamics. Extensive experiments demonstrate the superiority of Disco4D on 4D human generation and animation tasks.
Poster
Yuze He · Yanning Zhou · Wang Zhao · Zhongkai Wu · Kaiwen Xiao · Yang Wei · Yong-Jin Liu · Xiao Han

[ ExHall D ]

Abstract
We present StdGEN, an innovative pipeline for generating semantically decomposed high-quality 3D characters from single images, enabling broad applications in virtual reality, gaming, and filmmaking, etc. Unlike previous methods which struggle with limited decomposability, unsatisfactory quality, and long optimization times, StdGEN features decomposability, effectiveness and efficiency; i.e., it generates intricately detailed 3D characters with separated semantic components such as the body, clothes, and hair, in three minutes. At the core of StdGEN is our proposed Semantic-aware Large Reconstruction Model (S-LRM), a transformer-based generalizable model that jointly reconstructs geometry, color and semantics from multi-view images in a feed-forward manner. A differentiable multi-layer semantic surface extraction scheme is introduced to acquire meshes from hybrid implicit fields reconstructed by our S-LRM. Additionally, a specialized efficient multi-view diffusion model and an iterative multi-layer surface refinement module are integrated into the pipeline to facilitate high-quality, decomposable 3D character generation. Extensive experiments demonstrate our state-of-the-art performance in 3D anime character generation, surpassing existing baselines by a significant margin in geometry, texture and decomposability. StdGEN offers ready-to-use semantic-decomposed 3D characters and enables flexible customization for a wide range of applications.
Poster
Philipp Flotho · Moritz Piening · Anna Kukleva · Gabriele Steidl

[ ExHall D ]

Abstract
Facial analysis is a key component in a wide range of applications such as security, autonomous driving, entertainment, and healthcare. Despite the availability of various facial RGB datasets, the thermal modality, which plays a crucial role in life sciences, medicine, and biometrics, has been largely overlooked. To address this gap, we introduce the T-FAKE dataset, a new large-scale synthetic thermal dataset with sparse and dense landmarks. To facilitate the creation of the dataset, we propose a novel RGB2Thermal loss function, which enables the domain-adaptive transfer of thermal style to RGB faces. By utilizing the Wasserstein distance between thermal and RGB patches and the statistical analysis of clinical temperature distributions on faces, we ensure that the generated thermal images closely resemble real samples. Using RGB2Thermal style transfer based on our RGB2Thermal loss function, we create the large-scale synthetic thermal T-FAKE dataset. Leveraging our novel T-FAKE dataset, probabilistic landmark prediction, and label adaptation networks, we demonstrate significant improvements in landmark detection methods on thermal images across different landmark conventions. Our models show excellent performance with both sparse 70-point landmarks and dense 478-point landmark annotations.
Poster
Jianlong Jin · Chenglong Zhao · Ruixin Zhang · Sheng Shang · Jianqing Xu · Jingyun Zhang · ShaoMing Wang · Yang Zhao · Shouhong Ding · Wei Jia · Yunsheng Wu

[ ExHall D ]

Abstract
Palmprint recognition is significantly limited by the lack of large-scale publicly available datasets. Previous methods have adopted Bézier curves to simulate the palm creases, which then serve as input for conditional GANs to generate realistic palmprints. However, without employing real data fine-tuning, the performance of the recognition model trained on these synthetic datasets would drastically decline, indicating a large gap between generated and real palmprints. This is primarily due to the utilization of an inaccurate palm crease representation and challenges in balancing intra-class variation with identity consistency. To address this, we introduce a polynomial-based palm crease representation that provides a new palm crease generation mechanism more closely aligned with the real distribution. We also propose a palm-crease-conditioned diffusion model with a novel intra-class variation control method. By applying our proposed K-step noise-sharing sampling, we are able to synthesize palmprint datasets with large intra-class variation and high identity consistency. Experimental results show that, for the first time, recognition models trained solely on our synthetic datasets, without any fine-tuning, outperform those trained on real datasets. Furthermore, our approach achieves superior recognition performance as the number of generated identities increases. Our code and pre-trained models will be released.
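As a rough illustration of how a K-step noise-sharing scheme could trade identity consistency against intra-class variation, the toy sampler below shares the injected noise across all samples of one identity for the first K reverse steps and draws it independently afterwards. The update rule, the noise schedule, and the `denoise` stub are placeholders for illustration only, not the paper's crease-conditioned model or its exact sampling procedure.

```python
import numpy as np

def k_step_noise_sharing_sampling(denoise, x_T, K, T, sigmas, rng):
    """Toy annealed sampler for a batch of samples of one identity:
    for the first K reverse steps the injected noise is shared across the
    batch (pushes samples toward one identity); afterwards each sample gets
    independent noise (injects intra-class variation)."""
    x = x_T.copy()
    for step, t in enumerate(range(T - 1, 0, -1)):
        x0_hat = denoise(x, t)                       # placeholder network call
        if step < K:                                 # shared noise -> consistency
            eps = np.broadcast_to(rng.standard_normal(x.shape[1:]), x.shape)
        else:                                        # independent noise -> variation
            eps = rng.standard_normal(x.shape)
        x = x0_hat + sigmas[t - 1] * eps             # simplistic re-noising update
    return x

# toy usage: 4 samples of the same identity, a stub "denoiser"
rng = np.random.default_rng(0)
T, K, B = 50, 10, 4
sigmas = np.linspace(0.0, 1.0, T)                    # toy decreasing-noise schedule
denoise = lambda x, t: 0.9 * x                       # stand-in for the real model
samples = k_step_noise_sharing_sampling(denoise, rng.standard_normal((B, 32, 32)), K, T, sigmas, rng)
```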
Poster
Hanzhang Tu · Zhanfeng Liao · Boyao Zhou · Shunyuan Zheng · Xilong Zhou · Liuxin ZHANG · QianYing Wang · Yebin Liu

[ ExHall D ]

Abstract
We present an efficient approach for generalizable clothed human digitalization. Unlike previous methods that necessitate subject-wise optimization or disregard watertight geometry, the proposed method is dedicated to reconstructing complete human shape and Gaussian Splatting from sparse-view RGB input. We extract a fine-grained mesh by combining implicit occupancy field regression with explicit disparity estimation between views. The reconstructed high-quality geometry allows us to easily anchor Gaussian primitives according to surface normal and texture, enabling 6-DoF photorealistic novel view synthesis. Further, we introduce a simple yet effective algorithm to split Gaussian primitives in high-frequency areas to enhance the visual quality. Without the assistance of templates like SMPL, our method can tackle loose clothing like dresses and costumes. To achieve generalization capability across datasets, we train our reconstruction pipeline on a large amount of human scan data. Our method outperforms recent methods in terms of novel view synthesis while maintaining high efficiency, enabling the potential of deployment in real-time applications.
Poster
Zijian He · Yuwei Ning · Yipeng Qin · Guangrun Wang · Sibei Yang · Liang Lin · Guanbin Li

[ ExHall D ]

Abstract
Virtual Try-On (VTON) is a transformative technology in e-commerce and fashion design, enabling realistic digital visualization of clothing on individuals. In this work, we propose VTON 360, a novel 3D VTON method that addresses the open challenge of achieving high-fidelity VTON that supports any-view rendering. Specifically, we leverage the equivalence between a 3D model and its rendered multi-view 2D images, and reformulate 3D VTON as an extension of 2D VTON that ensures 3D consistent results across multiple views. To achieve this, we extend 2D VTON models to include multi-view garments and clothing-agnostic human body images as input, and propose several novel techniques to enhance them, including: i) a pseudo-3D pose representation using normal maps derived from the SMPL-X 3D human model, ii) a multi-view spatial attention mechanism that models the correlations between features from different viewing angles, and iii) a multi-view CLIP embedding that enhances the garment CLIP features used in 2D VTON with camera information. Extensive experiments on large-scale real datasets and clothing images from e-commerce platforms demonstrate the effectiveness of our approach.
Poster
Xuanpu Zhang · Dan Song · pengxin zhan · Tianyu Chang · Jianhao Zeng · Qing-Guo Chen · Weihua Luo · An-An Liu

[ ExHall D ]

Abstract
Image-based virtual try-on is an increasingly popular and important task that generates realistic try-on images of a specific person. Recent methods model virtual try-on as a mask-based inpainting task, which requires masking the person image and results in a significant loss of spatial information. In particular, for in-the-wild try-on scenarios with complex poses and occlusions, mask-based methods often introduce noticeable artifacts. Our research found that a mask-free approach can fully leverage spatial and lighting information from the original person image, enabling high-quality virtual try-on. Consequently, we propose a novel training paradigm for a mask-free try-on diffusion model. We ensure the model's mask-free try-on capability by creating high-quality pseudo-data and further enhance its handling of complex spatial information through effective in-the-wild data augmentation. In addition, a try-on localization loss is designed to concentrate on the try-on area while suppressing garment features in non-try-on areas, ensuring precise rendering of garments and preservation of the foreground and background. Finally, we introduce BooW-VTON, a mask-free virtual try-on diffusion model that delivers SOTA try-on quality without parsing cost. Extensive qualitative and quantitative experiments demonstrate superior performance in wild scenarios with such low-demand inputs.
Poster
Daisheng Jin · Jiangbei Hu · Baixin Xu · Yuxin Dai · Chen Qian · Ying He

[ ExHall D ]

Abstract
In this study, we introduce a novel two-stage technique for decomposing and reconstructing facial features from sparse-view images, a task made challenging by the unique geometry and complex skin reflectance of each individual. To synthesize 3D facial models more realistically, we endeavor to decouple key facial attributes from the RGB color, including geometry, diffuse reflectance, and specular reflectance. Specifically, we design a Sparse-view Face Decomposition Model (SFDM): 1) In the first stage, we create a general facial template from a wide array of individual faces, encapsulating essential geometric and reflectance characteristics. 2) Guided by this template, we refine a specific facial model for each individual in the second stage, considering the interaction between geometry and reflectance, as well as the effects of subsurface scattering on the skin. With these advances, our method can reconstruct high-quality facial representations from as few as three images. The comprehensive evaluation and comparison reveal that our approach outperforms existing methods by effectively disentangling geometric and reflectance components, significantly enhancing the quality of synthesized novel views, and paving the way for applications in facial relighting and reflectance editing. The code will be made available to the public.
Poster
Wenjun Wei · Yanlin Qian · Huaian Chen · Junkang Dai · Yi Jin

[ ExHall D ]

Abstract
Traditional auto white balance (AWB) algorithms typically assume a single global illuminant source, which leads to color distortions in multi-illuminant scenes. While recent neural network-based methods have shown excellent accuracy in such scenarios, their high parameter count and computational demands limit their practicality for real-time video applications. The Fast Fourier Color Constancy (FFCC) algorithm was proposed for single-illuminant-source scenes, predicting a global illuminant source with high efficiency. However, it cannot be directly applied to multi-illuminant scenarios unless specifically modified. To address this, we propose Integral Fast Fourier Color Constancy (IFFCC), an extension of FFCC tailored for multi-illuminant scenes. IFFCC leverages the proposed integral UV histogram to accelerate histogram computations across all possible regions in Cartesian space and parallelizes Fourier-based convolution operations, resulting in a spatially-smooth illumination map. This approach enables high-accuracy, real-time AWB in multi-illuminant scenes. Extensive experiments show that IFFCC achieves accuracy that is on par with or surpasses that of pixel-level neural networks, while reducing the parameter count by over 400× and running 20–100× faster than network-based approaches.
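The integral UV histogram can be pictured as a summed-area table built per chroma bin, so the histogram of any axis-aligned region is read out with four lookups per bin, independent of region size. The sketch below assumes normalized (u, v) chroma coordinates in [0, 1) and is only an illustration of the data structure, not the paper's implementation.

```python
import numpy as np

def integral_uv_histogram(u, v, n_bins=32):
    """Build an integral (summed-area) histogram over (u, v) chroma values so
    the UV histogram of any axis-aligned image region costs O(n_bins) lookups,
    independent of the region's area."""
    h, w = u.shape
    ub = np.clip((u * n_bins).astype(int), 0, n_bins - 1)
    vb = np.clip((v * n_bins).astype(int), 0, n_bins - 1)
    bins = ub * n_bins + vb                          # flattened bin index per pixel
    one_hot = np.zeros((h, w, n_bins * n_bins), dtype=np.float32)
    one_hot[np.arange(h)[:, None], np.arange(w)[None, :], bins] = 1.0
    sat = one_hot.cumsum(axis=0).cumsum(axis=1)      # summed-area table per bin
    return np.pad(sat, ((1, 0), (1, 0), (0, 0)))     # zero row/col simplifies lookups

def region_histogram(sat, y0, x0, y1, x1):
    """UV histogram of the region [y0:y1, x0:x1) via four SAT lookups per bin."""
    return sat[y1, x1] - sat[y0, x1] - sat[y1, x0] + sat[y0, x0]

# toy usage: random normalized chroma maps, then a constant-time region query
rng = np.random.default_rng(1)
u, v = rng.random((64, 64)), rng.random((64, 64))
sat = integral_uv_histogram(u, v)
hist = region_histogram(sat, 8, 8, 40, 40)           # histogram of a 32x32 region
```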
Poster
Hao Zhao · Mingjia Li · Qiming Hu · Xiaojie Guo

[ ExHall D ]

Abstract
Recent deep-learning-based approaches to single-image reflection removal have shown promising advances, primarily for two reasons: 1) the utilization of recognition-pretrained features as inputs, and 2) the design of dual-stream interaction networks. However, according to the Information Bottleneck principle, high-level semantic clues tend to be compressed or discarded during layer-by-layer propagation. Additionally, interactions in dual-stream networks follow a fixed pattern across different layers, limiting overall performance. To address these limitations, we propose a novel architecture called Reversible Decoupling Network (RDNet), which employs a reversible encoder to secure valuable information while flexibly decoupling transmission- and reflection-relevant features during the forward pass. Furthermore, we customize a transmission-rate-aware prompt generator to dynamically calibrate features, further boosting performance. Extensive experiments demonstrate the superiority of RDNet over existing SOTA methods on five widely-adopted benchmark datasets. Our code will be made publicly available.
Poster
Shouhang Zhu · Chenglin Li · Yuankun Jiang · Li Wei · Nuowen Kan · Ziyang Zheng · Wenrui Dai · Junni Zou · Hongkai Xiong

[ ExHall D ]

Abstract
Autofocus is a crucial component of modern digital cameras. While recent learning-based methods achieve state-of-the-art in-focus prediction accuracy, they unfortunately ignore the potential focus hunting phenomenon of back-and-forth lens movement in the multi-step focusing procedure. To address this, in this paper, we propose an expert-regularized deep reinforcement learning (DRL)-based approach for autofocus, which is able to utilize the sequential information of the lens movement trajectory to both enhance the multi-step in-focus prediction accuracy and reduce the chance of focus hunting. Our method generally follows an actor-critic framework. To accelerate DRL training with higher sample efficiency, we initialize the policy with a pre-trained single-step prediction network, which is further improved by changing its output from an absolute in-focus position distribution to a relative lens movement distribution, establishing a better mapping between input images and lens movement. To further stabilize DRL training and reduce focus hunting in the resulting lens movement trajectory, we generate offline trajectories based on prior knowledge to avoid focus hunting, which are then leveraged as an offline dataset of expert trajectories to regularize the actor network's training. Empirical evaluations show that our method outperforms learning-based methods on public benchmarks, with higher …
Poster
Jiayin Zhao · Zhenqi Fu · Tao Yu · Hui Qiao

[ ExHall D ]

Abstract
Light field microscopy (LFM) has gained significant attention due to its ability to capture snapshot-based, large-scale 3D fluorescence images. However, current LFM reconstruction algorithms are highly sensitive to sensor noise and lack robustness when applied to experimental data. To address these challenges, this paper presents an unsupervised view-to-view LFM 3D reconstruction framework, named V2V3D. Unlike existing methods that directly use all views for reconstruction, V2V3D divides the views into two subsets, with each subset generating corresponding volumes and working together to effectively remove sensor noise. To enhance the recovery of high-frequency details, we propose a novel wave-optics-based feature alignment technique, which transforms the point spread function, used for forward propagation in wave optics, into convolution kernels specifically designed for feature alignment. Moreover, we introduce an LFM dataset generated using two-photon excitation, including both the light field images and the corresponding 3D intensity volumes. Extensive experiments demonstrate that our unsupervised approach achieves high computational efficiency and outperforms the other state-of-the-art methods. These advancements position V2V3D as a promising solution for 3D imaging under challenging conditions. Our code and dataset will be made publicly available.
Poster
Liao Shen · Tianqi Liu · Huiqiang Sun · Jiaqi Li · Zhiguo Cao · Wei Li · Chen Change Loy

[ ExHall D ]

Abstract
Recent advances in 3D Gaussian Splatting (3D-GS) have shown remarkable success in representing 3D scenes and generating high-quality, novel views in real-time. However, 3D-GS and its variants assume that input images are captured based on pinhole imaging and are fully in focus. This assumption limits their applicability, as real-world images often feature shallow depth-of-field (DoF). In this paper, we introduce DoF-Gaussian, a controllable depth-of-field method for 3D-GS. We develop a lens-based imaging model based on geometric optics principles to control DoF effects. To ensure accurate scene geometry, we incorporate depth priors adjusted per scene, and we apply defocus-to-focus adaptation to minimize the gap in the circle of confusion. We also introduce a synthetic dataset to assess refocusing capabilities and the model’s ability to learn precise lens parameters. Our framework is customizable and supports various interactive applications. Extensive experiments confirm the effectiveness of our method. Code and the dataset will be made publicly available.
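For reference, the standard thin-lens circle-of-confusion relation that lens-based DoF models of this kind build on can be written in a few lines; the parameter names below are illustrative, and the paper's exact lens parameterization may differ.

```python
import numpy as np

def circle_of_confusion(depth, focus_dist, focal_len, aperture_diam):
    """Thin-lens circle-of-confusion diameter (same units as focal_len) for
    points at `depth`: zero exactly at the focus distance, growing with the
    distance from the focal plane.  Larger CoC means stronger defocus blur."""
    return aperture_diam * np.abs(depth - focus_dist) / depth \
           * focal_len / (focus_dist - focal_len)

# example: a 50 mm lens focused at 2 m with a 25 mm entrance pupil (f/2)
depths = np.array([0.5, 1.0, 2.0, 4.0, 10.0]) * 1000.0   # depths in mm
print(np.round(circle_of_confusion(depths, 2000.0, 50.0, 25.0), 3))
```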
Poster
Ziteng Cui · Xuangeng Chu · Tatsuya Harada

[ ExHall D ]

Abstract
Capturing high-quality photographs across diverse real-world lighting conditions is challenging, as both natural lighting (e.g., low-light) and camera exposure settings (e.g., exposure time) strongly influence image quality. This difficulty intensifies in multi-view scenarios, where each viewpoint can have distinct lighting and image signal processor (ISP) settings, causing photometric inconsistencies between views. These lighting degradations and view variations pose significant challenges to both NeRF- and 3D Gaussian Splatting (3DGS)-based novel view synthesis (NVS) frameworks. To address this, we introduce Luminance-GS, a novel approach to achieve high-quality novel view synthesis results under diverse and challenging lighting conditions using 3DGS. By adopting per-view color space mapping and view-adaptive curve adjustments, Luminance-GS achieves state-of-the-art (SOTA) results across various lighting conditions—including low-light, overexposure, and varying exposure—without altering the original 3DGS explicit representation. Compared to previous NeRF- and 3DGS-based baselines, Luminance-GS provides real-time rendering speed with improved reconstruction quality. We will release the source code.
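A minimal sketch of what a per-view adaptive curve adjustment might look like: each training view gets its own monotonic tone curve applied to the rendered colors. The piecewise-linear parameterization below is an assumption made for illustration, not the authors' design.

```python
import numpy as np

def apply_view_curve(rgb, curve_params):
    """Apply a per-view monotonic tone curve to rendered colors in [0, 1].
    The curve is piecewise linear with len(curve_params) segments; monotonicity
    is enforced by building knot heights from normalized softplus increments."""
    inc = np.log1p(np.exp(curve_params))              # softplus -> positive increments
    knots = np.concatenate([[0.0], np.cumsum(inc)])
    knots = knots / knots[-1]                         # curve maps [0, 1] -> [0, 1]
    xs = np.linspace(0.0, 1.0, len(curve_params) + 1)
    return np.interp(rgb, xs, knots)                  # elementwise, keeps rgb's shape

# toy usage: one learnable curve per training view (random init here)
rng = np.random.default_rng(0)
curves = rng.normal(size=(10, 8))                     # 10 views, 8 curve segments each
rendered = rng.random((4, 4, 3))                      # rendered colors for view 3
adjusted = apply_view_curve(rendered, curves[3])
```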
Poster
Qi Wu · Janick Martinez Esturo · Ashkan Mirzaei · Nicolas Moënne-Loccoz · Žan Gojčič

[ ExHall D ]

Abstract
3D Gaussian Splatting (3DGS) has shown great potential for efficient reconstruction and high-fidelity real-time rendering of complex scenes on consumer hardware. However, due to its rasterization-based formulation, 3DGS is constrained to ideal pinhole cameras and lacks support for secondary lighting effects. Recent methods address these limitations by tracing volumetric particles instead; however, this comes at the cost of significantly slower rendering speeds. In this work, we propose 3D Gaussian Unscented Transform (3DGUT), replacing the EWA splatting formulation in 3DGS with the Unscented Transform that approximates the particles through sigma points, which can be projected exactly under any nonlinear projection function. This modification enables trivial support of distorted cameras with time-dependent effects such as rolling shutter, while retaining the efficiency of rasterization. Additionally, we align our rendering formulation with that of tracing-based methods, enabling secondary ray tracing required to represent phenomena such as reflections and refraction within the same 3D representation.
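The core idea, approximating each particle by sigma points that are pushed exactly through an arbitrary nonlinear projection, can be sketched in a few lines. The weights, the kappa setting, and the toy distortion model below are generic unscented-transform choices for illustration, not the paper's exact formulation.

```python
import numpy as np

def unscented_project(mu, cov, project, kappa=0.0):
    """Project a 3D Gaussian (mu, cov) through an arbitrary nonlinear camera
    model `project: R^3 -> R^2` with the unscented transform: deterministic
    sigma points are projected exactly, then the 2D mean and covariance are
    re-estimated from them (no Jacobian/EWA linearization needed)."""
    n = mu.shape[0]                                   # n = 3
    L = np.linalg.cholesky((n + kappa) * cov)
    sigma_pts = np.vstack([mu, mu + L.T, mu - L.T])   # 2n + 1 sigma points
    w = np.full(2 * n + 1, 1.0 / (2 * (n + kappa)))
    w[0] = kappa / (n + kappa)
    proj = np.array([project(p) for p in sigma_pts])  # exact nonlinear projection
    mean2d = w @ proj
    diff = proj - mean2d
    return mean2d, (w[:, None] * diff).T @ diff       # projected mean, covariance

# toy nonlinear camera: perspective division plus a radial distortion term
def distorted_projection(p, f=500.0, k1=0.1):
    x, y = p[0] / p[2], p[1] / p[2]
    return f * np.array([x, y]) * (1.0 + k1 * (x * x + y * y))

mu = np.array([0.3, -0.2, 4.0])
cov = np.diag([0.02, 0.02, 0.05])
mean2d, cov2d = unscented_project(mu, cov, distorted_projection)
```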
Poster
Ruofan Liang · Žan Gojčič · Huan Ling · Jacob Munkberg · Jon Hasselgren · Chih-Hao Lin · Jun Gao · Alexander Keller · Nandita Vijaykumar · Sanja Fidler · Zian Wang

[ ExHall D ]

Abstract
Understanding and modeling lighting effects are fundamental tasks in computer vision and graphics. Classic physically-based rendering (PBR) accurately simulates light transport, but relies on precise scene representations--explicit 3D geometry, high-quality material properties, and lighting conditions--that are often impractical to obtain in real-world scenarios. Therefore, we introduce Diffusion Renderer, a neural approach that addresses the dual problem of inverse and forward rendering within a holistic framework. Leveraging powerful video diffusion model priors, the inverse rendering model accurately estimates G-buffers from real-world videos, providing an interface for image editing tasks, and training data for the rendering model. Conversely, our rendering model generates photorealistic images from G-buffers without explicit light transport simulation. Specifically, we first train a video diffusion model for inverse rendering on synthetic data, which generalizes well to real-world videos and allows us to auto-label diverse real-world videos. We then co-train our rendering model using both synthetic and auto-labeled real-world data. Experiments demonstrate that Diffusion Renderer effectively approximates inverse and forward rendering, consistently outperforming the state-of-the-art. Our model enables practical applications from a single video input—including relighting, material editing, and realistic object insertion.
Poster
Youjia Zhang · Anpei Chen · Yumin Wan · Zikai Song · Junqing Yu · Yawei Luo · Wei Yang

[ ExHall D ]

Abstract
In this paper, we introduce Ref-GS, a novel approach for directional light factorization in 2D Gaussian splatting, which enables photorealistic view-dependent appearance rendering and precise geometry recovery. Ref-GS builds upon the deferred rendering of Gaussian splatting and applies directional encoding to the deferred-rendered surface, effectively reducing the ambiguity between orientation and viewing angle. Next, we introduce a spherical mip-grid to capture varying levels of surface roughness, enabling roughness-aware Gaussian shading. Additionally, we propose a simple yet efficient geometry-lighting factorization that connects geometry and lighting via the vector outer product, significantly reducing renderer overhead when integrating volumetric attributes. Our method achieves superior photorealistic rendering for a range of open-world scenes while also accurately recovering geometry.
Poster
Chenhao Li · Taishi Ono · Takeshi Uemori · Sho Nitta · Hajime Mihara · Alexander Gatto · Hajime Nagahara · Yusuke Moriuchi

[ ExHall D ]

Abstract
Recent inverse rendering methods have greatly improved shape, material, and illumination reconstruction by utilizing polarization cues. However, existing methods only support dielectrics, ignoring conductors that are found everywhere in daily life. Since conductors and dielectrics have different reflection properties, applying previous methods to conductors leads to obvious errors. In addition, conductors are glossy, which may cause strong specular reflections that are hard to reconstruct. To solve the above issues, we propose NeISF++, an inverse rendering pipeline that supports conductors and dielectrics. The key ingredient for our proposal is a general pBRDF that describes both conductors and dielectrics. As for the strong specular reflection problem, we propose a novel geometry initialization method using DoLP images. This physical cue is invariant to intensities and thus robust to strong specular reflections. Experimental results on our synthetic and real datasets show that our method surpasses the existing polarized inverse rendering methods for geometry and material decomposition as well as downstream tasks like relighting.
Poster
Yue Gao · Hong-Xing Yu · Bo Zhu · Jiajun Wu

[ ExHall D ]

Abstract
We study reconstructing and predicting 3D fluid appearance and velocity from a single video. Current methods require multi-view videos for fluid reconstruction. We present FluidNexus, a novel framework that bridges video generation and physics simulation to tackle this task. Our key insight is to synthesize multiple novel-view videos as references for reconstruction. FluidNexus consists of two key components: (1) a novel-view video synthesizer that combines frame-wise view synthesis with video diffusion refinement for generating realistic videos, and (2) a physics-integrated particle representation coupling differentiable simulation and rendering to simultaneously facilitate 3D fluid reconstruction and prediction. To evaluate our approach, we collect two new real-world fluid datasets featuring textured backgrounds and object interactions. Our method enables dynamic novel view synthesis, future prediction, and interaction simulation from a single fluid video. We will release code and datasets.
Poster
ZhiFei Chen · Tianshuo Xu · Wenhang Ge · Leyi Wu · Dongyu Yan · Jing He · Luozhou Wang · Lu Zeng · Shunsi Zhang · Ying-Cong Chen

[ ExHall D ]

Abstract
Rendering and inverse rendering are pivotal tasks in both computer vision and graphics. The rendering equation is the core of the two tasks, as an ideal conditional distribution transfer function from intrinsic properties to RGB images. Despite the promising results of existing rendering methods, they merely approximate the ideal estimation for a specific scene and come with a high computational cost. Additionally, the inverse conditional distribution transfer is intractable due to the inherent ambiguity. To address these challenges, we propose a data-driven method that jointly models rendering and inverse rendering as two conditional generation tasks within a single diffusion framework. Inspired by UniDiffuser, we utilize two distinct time schedules to model both tasks, and with a tailored dual streaming module, we achieve cross-conditioning of two pre-trained diffusion models. This unified approach, named Uni-Renderer, allows the two processes to facilitate each other through a cycle-consistent constraint, mitigating ambiguity by enforcing consistency between intrinsic properties and rendered images. Combined with a meticulously prepared dataset, our method effectively decomposes intrinsic properties and demonstrates a strong capability to recognize changes during rendering. We will open-source our training and inference code to the public, fostering further research and development in this area.
Poster
Zexin He · Tengfei Wang · Xin Huang · Xingang Pan · Ziwei Liu

[ ExHall D ]

Abstract
Recovering the geometry and materials of objects from a single image is challenging due to its under-constrained nature. In this paper, we present Neural LightRig, a novel framework that boosts intrinsic estimation by leveraging auxiliary multi-lighting conditions from 2D diffusion priors. Specifically, 1) we first leverage illumination priors from large-scale diffusion models to build our multi-light diffusion model on a synthetic relighting dataset with dedicated designs. This diffusion model generates multiple consistent images, each illuminated by point light sources in different directions. 2) By using these varied lighting images to reduce estimation uncertainty, we train a large G-buffer model with a U-Net backbone to accurately predict surface normals and materials. Extensive experiments validate that our approach significantly outperforms state-of-the-art methods, enabling accurate surface normal and PBR material estimation with vivid relighting effects. Our code and dataset will be made publicly available.
Poster
Jiahui Fan · Fujun Luan · Jian Yang · Milos Hasan · Beibei Wang

[ ExHall D ]

Abstract
3D Gaussian Splatting (3DGS) has shown impressive results for the novel view synthesis task, where lighting is assumed to be fixed. However, creating relightable 3D assets, especially for objects with ill-defined shapes (fur, fabric, etc.), remains a challenging task. The decomposition between light, geometry, and material is ambiguous, especially if either smooth surface assumptions or surface-based analytical shading models do not apply. We propose Relightable Neural Gaussians (RNG), a novel 3DGS-based framework that enables the relighting of objects with either hard surfaces or soft boundaries, while avoiding assumptions on the shading model. We condition the radiance at each point on both view and light directions. We also introduce a shadow cue, as well as a depth refinement network to improve shadow accuracy. Finally, we propose a hybrid forward-deferred fitting strategy to balance geometry and appearance quality. Our method achieves significantly faster training (1.3 hours) and rendering (60 frames per second) compared to a prior method based on neural radiance fields and produces higher-quality shadows than a concurrent 3DGS-based method.
Poster
Bruno Galerne · Jianling WANG · Lara Raad · Jean-michel Morel

[ ExHall D ]

Abstract
Applying style transfer to a full 3D environment is a challenging task that has seen many developments since the advent of neural rendering. 3D Gaussian splatting (3DGS) has recently pushed further many limits of neural rendering in terms of training speed and reconstruction quality. This work introduces SGSST: Scaling Gaussian Splatting Style Transfer, an optimization-based method to apply style transfer to pretrained 3DGS scenes. We demonstrate that a new multiscale loss based on global neural statistics, which we name SOS for Simultaneously Optimized Scales, enables style transfer to ultra-high resolution 3D scenes. Not only does SGSST pioneer 3D scene style transfer at such high image resolutions, it also produces superior visual quality as assessed by thorough qualitative, quantitative and perceptual comparisons.
Poster
Chuhao Chen · Zhiyang Dou · Chen Wang · Yiming Huang · Anjun Chen · Qiao Feng · Jiatao Gu · Lingjie Liu

[ ExHall D ]

Abstract
Faithfully reconstructing textured shapes and physical properties from videos presents an intriguing yet challenging problem. Significant efforts have been dedicated to advancing system identification in this area. Previous methods often rely on heavy optimization pipelines with a differentiable simulator and renderer to estimate physical parameters. However, these approaches frequently necessitate extensive hyperparameter tuning for each scene and involve a costly optimization process, which limits both their practicality and generalizability. In this work, we propose a novel, generalizable video-based framework for recovering geometry and physical properties through a mesh-free reduced simulation based on Linear Blend Skinning (LBS), offering high computational efficiency and versatile representation capability. Specifically, our framework first reconstructs the observed configuration of the physical system from video using a feed-forward neural network trained to capture physical world knowledge. A lightweight optimization pipeline then refines the estimated appearance, geometry, and physical properties to closely align with video observations within just a few minutes. Additionally, after reconstruction, our framework enables high-quality, mesh-free simulation with high efficiency. Extensive experiments demonstrate that our method achieves superior accuracy and efficiency in reconstructing geometry and physical properties from video data. Our code and models will be publicly available upon acceptance.
Poster
Xin Huang · Tengfei Wang · Ziwei Liu · Qing Wang

[ ExHall D ]

Abstract
We present Material Anything, a fully-automated, unified diffusion framework designed to generate physically-based materials for 3D objects. Unlike existing methods that rely on complex pipelines or case-specific optimizations, Material Anything offers a robust, end-to-end solution adaptable to objects under diverse lighting conditions. Our approach leverages a pre-trained image diffusion model, enhanced with a triple-head architecture and rendering loss to improve stability and material quality. Additionally, we introduce confidence masks as a dynamic switcher within the diffusion model, enabling it to effectively handle both textured and texture-less objects across varying lighting conditions. By employing a progressive material generation strategy guided by these confidence masks, along with a UV-space material refiner, our method ensures consistent, UV-ready material outputs. Extensive experiments demonstrate our approach outperforms existing methods across a wide range of object categories and lighting conditions.
Poster
Jialun Liu · Jinbo Wu · Xiaobo Gao · JiaKui Hu · Bojun Xiong · Xing Liu · Chen Zhao · Hongbin Pei · Haocheng Feng · Yingying Li · Errui Ding · Jingdong Wang

[ ExHall D ]

Abstract
This paper introduces TexGarment, an efficient method for synthesizing high-quality, 3D-consistent garment textures in UV space. Traditional approaches based on 2D-to-3D mapping often suffer from 3D inconsistency, while methods learning from limited 3D data lack sufficient texture diversity. These limitations are particularly problematic in garment texture generation, where high demands exist for both detail and variety. To address these challenges, TexGarment leverages a pre-trained text-to-image diffusion Transformer model with robust generalization capabilities, introducing structural information to guide the model in generating 3D-consistent garment textures in a single inference step. Specifically, we utilize the 2D UV position map to guide the layout during the UV texture generation process, ensuring a coherent texture arrangement and enhancing it by integrating global 3D structural information from the mesh surface point cloud. This combined guidance effectively aligns 3D structural integrity with 2D layout. Our method efficiently generates high-quality, diverse UV textures in a single inference step while maintaining 3D consistency. Experimental results validate the effectiveness of TexGarment, achieving state-of-the-art performance in 3D garment texture generation.
Poster
Zhaoxi Chen · Jiaxiang Tang · Yuhao Dong · Ziang Cao · Fangzhou Hong · Yushi Lan · Tengfei Wang · Haozhe Xie · Tong Wu · Shunsuke Saito · Liang Pan · Dahua Lin · Ziwei Liu

[ ExHall D ]

Abstract
The increasing demand for high-quality 3D assets across various industries necessitates efficient and automated 3D content creation. Despite recent advancements in 3D generative models, existing methods still face challenges with optimization speed, geometric fidelity, and the lack of assets for physically based rendering (PBR). In this paper, we introduce 3DTopia-XL, a scalable native 3D generative model designed to overcome these limitations. 3DTopia-XL leverages a novel primitive-based 3D representation, PrimX, which encodes detailed shape, albedo, and material field into a compact tensorial format, facilitating the modeling of high-resolution geometry with PBR assets. On top of the novel representation, we propose a generative framework based on Diffusion Transformer (DiT), which comprises 1) Primitive Patch Compression and 2) Latent Primitive Diffusion. 3DTopia-XL learns to generate high-quality 3D assets from textual or visual inputs. Extensive qualitative and quantitative experiments are conducted to demonstrate that 3DTopia-XL significantly outperforms existing methods in generating high-quality 3D assets with fine-grained textures and materials, efficiently bridging the quality gap between generative models and real-world applications.
Poster
Hao Guo · Xiaoshui Huang · Hao jiacheng · Yunpeng Bai · Hongping Gan · Yilei Shi

[ ExHall D ]

Abstract
Despite advancements in Computer-Aided Design (CAD) generation, direct generation of complex Boundary Representation (B-rep) CAD models remains challenging. This difficulty arises from the parametric nature of B-rep data, complicating the encoding and generation of its geometric and topological information. To address this, we introduce BrepGiff, a lightweight generation approach for high-quality and complex B-rep models based on 3D Graph Diffusion. First, we convert B-rep models into a 3D graph representation. Specifically, BrepGiff extracts and integrates topological and geometric features to construct a 3D graph whose nodes correspond to face centroids in 3D space, preserving adjacency and degree information. Geometric features are derived by sampling points in the UV domain and extracting face and edge features. Then, BrepGiff applies a Graph Attention Network (GAT) to enforce topological constraints from local to global during the degree-guided diffusion process. With the 3D graph representation and efficient diffusion process, our method significantly reduces the computational cost and improves the quality, thus achieving lightweight generation of complex models. Experiments show that BrepGiff can generate complex B-rep models (>100 faces) using only 2 RTX4090 GPUs, achieving state-of-the-art performance in B-rep generation.
Poster
Xinyu Gao · Ziyi Yang · Bingchen Gong · Xiaoguang Han · Sipeng Yang · Xiaogang Jin

[ ExHall D ]

Abstract
Using parts of existing models to rebuild new models, commonly termed example-based modeling, is a classical methodology in the realm of computer graphics. Previous works mostly focus on shape composition, making them very hard to use for realistic composition of 3D objects captured from real-world scenes. This leads to combining multiple NeRFs into a single 3D scene to achieve seamless appearance blending. However, the current SeamlessNeRF method struggles to achieve interactive editing and harmonious stitching for real-world scenes due to its gradient-based strategy and grid-based representation. To this end, we present an example-based modeling method that combines multiple Gaussian fields in a point-based representation using sample-guided synthesis. Specifically, for composition, we create a GUI to segment and transform multiple fields in real time, easily obtaining a semantically meaningful composition of models represented by 3D Gaussian Splatting (3DGS). For texture blending, due to the discrete and irregular nature of 3DGS, straightforwardly applying gradient propagation as in SeamlessNeRF is not supported. Thus, a novel sampling-based cloning method is proposed to harmonize the blending while preserving the original rich texture and content. Our workflow consists of three steps: 1) real-time segmentation and transformation of a Gaussian model using a well-tailored GUI, 2) KNN …
Poster
Stefan Lionar · Jiabin Liang · Gim Hee Lee

[ ExHall D ]

Abstract
We introduce TreeMeshGPT, an autoregressive Transformer designed to generate high-quality artistic meshes aligned with input point clouds. Instead of the conventional next-token prediction in autoregressive Transformer, we propose a novel Autoregressive Tree Sequencing where the next input token is retrieved from a dynamically growing tree structure that is built upon the triangle adjacency of faces within the mesh. Our sequencing enables the mesh to extend locally from the last generated triangular face at each step, and therefore reduces training difficulty and improves mesh quality. Our approach represents each triangular face with two tokens, achieving a compression rate of approximately 22% compared to the naive face tokenization. Due to this efficient tokenization technique, we push the boundary of artistic mesh generation to the face limit of 5,500 triangles with a strong point cloud condition of 2,048 tokens, surpassing previous methods. Furthermore, our method generates mesh with strong normal orientation constraints, minimizing flipped normals commonly encountered in previous methods. Our experiments show that TreeMeshGPT enhances the mesh generation quality with refined details and normal orientation consistency.
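A toy sketch of tree-structured sequencing over triangle adjacency: order the faces by a DFS over the face-adjacency graph so that, where possible, each emitted face shares an edge with the previously emitted one. The two-token-per-face encoding and the exact growth rule of the paper are not reproduced here; this only illustrates the "grow locally from the last generated triangle" idea.

```python
from collections import defaultdict

def tree_sequence_faces(faces):
    """Order triangle faces via DFS over edge adjacency so the mesh sequence
    extends locally from recently emitted faces instead of an arbitrary order.
    `faces` is a list of (i, j, k) vertex-index triples."""
    edge_to_faces = defaultdict(list)
    for fi, (a, b, c) in enumerate(faces):
        for e in ((a, b), (b, c), (c, a)):
            edge_to_faces[tuple(sorted(e))].append(fi)
    visited, order, stack = set(), [], [0]
    while stack:
        fi = stack.pop()
        if fi in visited:
            continue
        visited.add(fi)
        order.append(fi)
        a, b, c = faces[fi]
        for e in ((a, b), (b, c), (c, a)):           # push edge-adjacent neighbors
            for nb in edge_to_faces[tuple(sorted(e))]:
                if nb not in visited:
                    stack.append(nb)
    order += [fi for fi in range(len(faces)) if fi not in visited]  # other components
    return order

# toy usage: a small fan of triangles around vertex 0
print(tree_sequence_faces([(0, 1, 2), (0, 2, 3), (0, 3, 4), (0, 4, 1)]))
```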
Poster
Yuezhi Yang · Qimin Chen · Vladimir G. Kim · Siddhartha Chaudhuri · Qixing Huang · Zhiqin Chen

[ ExHall D ]

Abstract
We introduce the first method for generating Vector Displacement Maps (VDMs): parameterized, detailed geometric stamps commonly used in 3D modeling. Given a single input image, our method first generates multi-view normal maps and then reconstructs a VDM from the normals via a novel reconstruction pipeline. We also propose an efficient algorithm for extracting VDMs from 3D objects, and present the first academic VDM dataset. Compared to existing 3D generative models focusing on complete shapes, we focus on generating parts that can be seamlessly attached to shape surfaces. The method gives artists rich control over adding geometric details to a 3D shape. Experiments demonstrate that our approach outperforms existing baselines. Generating VDMs offers additional benefits, such as using 2D image editing to customize and refine 3D details.
Poster
Kai He · Chin-Hsuan Wu · Igor Gilitschenski

[ ExHall D ]

Abstract
Recent advances in 3D representations, such as Neural Radiance Fields and 3D Gaussian Splatting, have greatly improved realistic scene modeling and novel-view synthesis. However, achieving controllable and consistent editing in dynamic 3D scenes remains a significant challenge. Previous work is largely constrained by its editing backbones, resulting in inconsistent edits and limited controllability. In our work, we introduce a novel framework that first fine-tunes the InstructPix2Pix model, followed by a two-stage optimization of the scene based on deformable 3D Gaussians. Our fine-tuning enables the model to "learn" the editing ability from a single edited reference image, transforming the complex task of dynamic scene editing into a simple 2D image editing process. By directly learning editing regions and styles from the reference, our approach enables consistent and precise local edits without the need for tracking desired editing regions, effectively addressing key challenges in dynamic scene editing. Then, our two-stage optimization progressively edits the trained dynamic scene, using a designed edited image buffer to accelerate convergence and improve temporal consistency. Compared to state-of-the-art methods, our approach offers more flexible and controllable local scene editing, achieving high-quality and consistent results.
Poster
Jiamin WU · Kenkun Liu · Han Gao · Xiaoke Jiang · Yuan Yao · Lei Zhang

[ ExHall D ]

Abstract
Recently, Gaussian splatting has demonstrated significant success in novel view synthesis. Current methods often regress Gaussians with pixel or point cloud correspondence, linking each Gaussian with a pixel or a 3D point. This leads to the redundancy of Gaussians being used to overfit the correspondence rather than the objects represented by the 3D Gaussians themselves, consequently wasting resources and lacking accurate geometries or textures. In this paper, we introduce LeanGaussian, a novel approach that treats each query in a deformable Transformer as one 3D Gaussian ellipsoid, breaking the pixel or point cloud correspondence constraints. We leverage a deformable decoder to iteratively refine the Gaussians layer by layer with the image features as keys and values. Notably, the center of each 3D Gaussian is defined as a 3D reference point, which is then projected onto the image for deformable attention in 2D space. On both the ShapeNet SRN dataset (category level) and the Google Scanned Objects dataset (open-category level, trained with the Objaverse dataset), our approach outperforms prior methods by approximately 6.1%, achieving PSNRs of 25.44 and 22.36, respectively. Additionally, our method achieves a 3D reconstruction speed of 7.2 FPS and a rendering speed of 500 FPS.
Poster
Guofeng Feng · Siyan Chen · Rong Fu · Zimu Liao · Yi Wang · Tao Liu · Boni Hu · Linning Xu · PeiZhilin · Hengjie Li · Xiuhong Li · Ninghui Sun · Xingcheng Zhang · Bo Dai

[ ExHall D ]

Abstract
Recently, the remarkable progress in 3D Gaussian Splatting (3DGS) has demonstrated huge potential over traditional rendering techniques, attracting significant attention from both industry and academia. Due to the presence of numerous anisotropic Gaussian representations in large-scale and high-resolution scenes, real-time rendering with 3DGS remains a challenging problem and is also rarely studied. We propose FlashGS, an open-source CUDA library with Python bindings, with comprehensive algorithm design and optimizations, encompassing redundancy elimination, adaptive scheduling, and efficient pipelining. We first eliminate substantial redundant tasks through precise Gaussian intersection tests, considering the essence of the 3DGS rasterizer. During task partitioning, we introduce an adaptive scheduling strategy that accounts for variations in the size and shape of Gaussians. We also design a multi-stage pipeline strategy for color computations in rendering, further accelerating the process. An extensive evaluation of FlashGS has been conducted across a diverse spectrum of synthetic and real-world 3D scenes, covering a variety of scene sizes up to a 2.7 km² cityscape and resolutions up to 4K. FlashGS renders up to 30.53× faster than 3DGS (7.2× faster on average), at a minimum of 125.9 FPS, achieving state-of-the-art performance.
Poster
Peihao Wang · Yuehao Wang · Dilin Wang · Sreyas Mohan · Zhiwen Fan · Lemeng Wu · Ruisi Cai · Yu-Ying Yeh · Zhangyang Wang · Qiang Liu · Rakesh Ranjan

[ ExHall D ]

Abstract
3D Gaussian Splatting (3DGS) has emerged as a powerful technique for real-time, high-resolution novel view synthesis. By representing scenes as a mixture of Gaussian primitives, 3DGS leverages GPU rasterization pipelines for efficient rendering and reconstruction. To optimize scene coverage and capture fine details, 3DGS employs a densification algorithm to generate additional points. However, this process often leads to redundant point clouds, resulting in excessive memory usage, slower performance, and substantial storage demands--posing significant challenges for deployment on resource-constrained devices. To address this limitation, we propose a theoretical framework that demystifies and improves density control in 3DGS. Our analysis reveals that splitting is crucial for escaping saddle points. Through an optimization-theoretic approach, we establish the necessary conditions for densification, determine the minimal number of offspring Gaussians, identify the optimal parameter update direction, and provide an analytical solution for normalizing offspring opacity. Building on these insights, we introduce SteepGS, incorporating steepest density control, a principled strategy that minimizes loss while maintaining a compact point cloud. SteepGS achieves a 50% reduction in Gaussian points without compromising rendering quality, significantly enhancing both efficiency and scalability.
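For context, the generic 3DGS-style densify-by-split step that this analysis refines can be sketched as follows: a Gaussian whose accumulated positional gradient is large is replaced by two offspring along its principal axis. The gradient threshold, shrink factor, and halved-opacity rule below are common heuristics used for illustration, not the paper's derived conditions or its analytical opacity solution.

```python
import numpy as np

def split_gaussians(means, covs, opacities, pos_grads, grad_thresh=2e-4, shrink=1.6):
    """Generic densify-by-split: Gaussians whose accumulated positional gradient
    exceeds a threshold are replaced by two offspring placed along the principal
    axis of their covariance, with shrunken covariance and (heuristically) halved
    opacity.  Shapes: means (N,3), covs (N,3,3), opacities (N,), pos_grads (N,3)."""
    keep = np.linalg.norm(pos_grads, axis=1) <= grad_thresh
    new_m, new_c, new_o = [means[keep]], [covs[keep]], [opacities[keep]]
    for i in np.where(~keep)[0]:
        evals, evecs = np.linalg.eigh(covs[i])
        axis = evecs[:, -1] * np.sqrt(evals[-1])      # principal std-dev direction
        child_cov = covs[i] / (shrink ** 2)
        for sign in (+1.0, -1.0):                     # two offspring per split
            new_m.append((means[i] + sign * axis)[None])
            new_c.append(child_cov[None])
            new_o.append(np.array([opacities[i] * 0.5]))
    return np.concatenate(new_m), np.concatenate(new_c), np.concatenate(new_o)

# toy usage on a handful of isotropic Gaussians
rng = np.random.default_rng(0)
m, c, o = rng.normal(size=(5, 3)), np.tile(np.eye(3) * 1e-2, (5, 1, 1)), np.full(5, 0.8)
m2, c2, o2 = split_gaussians(m, c, o, rng.normal(scale=1e-4, size=(5, 3)))
```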
Poster
Yangming Zhang · Wenqi Jia · Wei Niu · Miao Yin

[ ExHall D ]

Abstract
3D Gaussian Splatting (3DGS) has emerged as a mainstream for novel view synthesis, leveraging continuous aggregations of Gaussian functions to model scene geometry. However, 3DGS suffers from substantial memory requirements to store the large amount of Gaussians, hindering its efficiency and practicality. To address this challenge, we introduce GaussianSpa, an optimization-based simplification framework for compact and high-quality 3DGS. Specifically, we formulate the simplification objective as a constrained optimization problem associated with the 3DGS training. Correspondingly, we propose an efficient "optimizing-sparsifying" solution for the formulated problem, alternately solving two independent sub-problems and gradually imposing substantial sparsity onto the Gaussians in the 3DGS training process. We conduct quantitative and qualitative evaluations on various datasets, demonstrating the superiority of GaussianSpa over existing state-of-the-art approaches. Notably, GaussianSpa achieves an average PSNR improvement of 0.9 dB on the real-world Deep Blending dataset with 10× fewer Gaussians compared to the vanilla 3DGS.
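A toy sketch of the alternating "optimizing-sparsifying" idea: ordinary training steps are interleaved with a projection of the opacity vector onto a sparsity budget, here a simple top-k hard threshold. The projection rule, the schedule, and the stand-in "training step" are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def sparsify_opacities(opacities, keep_ratio):
    """Hard-project opacities onto a sparsity budget: keep the largest
    `keep_ratio` fraction and zero the rest (zeroed Gaussians can be pruned)."""
    k = max(1, int(len(opacities) * keep_ratio))
    thresh = np.partition(opacities, -k)[-k]
    return np.where(opacities >= thresh, opacities, 0.0)

# toy alternation: many "optimizing" steps, then one "sparsifying" projection
rng = np.random.default_rng(0)
opac = rng.random(10_000)
for it in range(1, 3001):
    opac = np.clip(opac + 1e-3 * rng.normal(size=opac.shape), 0.0, 1.0)  # stand-in for a training step
    if it % 500 == 0:
        opac = sparsify_opacities(opac, keep_ratio=0.5)
print((opac > 0).mean())   # fraction of surviving Gaussians after training
```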
Poster
Seungtae Nam · Xiangyu Sun · Gyeongjin Kang · Younggeun Lee · Seungjun Oh · Eunbyung Park

[ ExHall D ]

Abstract
Generalized feed-forward Gaussian models have shown remarkable progress in sparse-view 3D reconstruction, leveraging prior knowledge learned from large multi-view datasets. However, these models often struggle to represent high-frequency details due to the limited number of generated Gaussians. While the densification strategy used in per-scene 3D Gaussian splatting (3D-GS) optimization can be extended and applied to the feed-forward models, it may not be ideally suited for generalized settings. In this paper, we present Generative Densification, an efficient and generalizable densification strategy that can selectively generate fine Gaussians for high-fidelity 3D reconstruction. Unlike the 3D-GS densification strategy, we densify the feature representations from the feed-forward models rather than the raw Gaussians, making use of the prior knowledge embedded in the features for enhanced generalization. Experimental results demonstrate the effectiveness of our approach, achieving the state-of-the-art rendering quality in both object-level and scene-level reconstruction, with noticeable improvements in representing fine details.
Poster
Zhihao Shi · Dong Huo · Yuhongze Zhou · Yan Min · Juwei Lu · Xinxin Zuo

[ ExHall D ]

Abstract
Current 3D inpainting and object removal methods are largely limited to front-facing scenes, facing substantial challenges when applied to diverse, "unconstrained" scenes where the camera orientation and trajectory are unrestricted. To bridge this gap, we introduce a novel approach that produces inpainted 3D scenes with consistent visual quality and coherent underlying geometry across both front-facing and unconstrained scenes. Specifically, we propose a robust 3D inpainting pipeline that incorporates geometric priors and a multi-view refinement network trained via test-time adaptation, building on a pre-trained image inpainting model. Additionally, we develop a novel inpainting mask detection technique to derive targeted inpainting masks from object masks, boosting the performance in handling unconstrained scenes. To validate the efficacy of our approach, we create a challenging and diverse benchmark that spans a wide range of scenes. Comprehensive experiments demonstrate that our proposed method substantially outperforms existing state-of-the-art approaches.
Poster
Sheng-Yu Huang · Zi-Ting Chou · Yu-Chiang Frank Wang

[ ExHall D ]

Abstract
When performing 3D inpainting using novel-view rendering methods like Neural Radiance Fields (NeRF) or 3D Gaussian Splatting (3DGS), achieving texture and geometry consistency across camera views has been a challenge. In this paper, we propose a framework of 3D Gaussian Inpainting with Depth-Guided Cross-View Consistency (3DGIC) for cross-view consistent 3D inpainting. Guided by the rendered depth information from each training view, our 3DGIC exploits background pixels visible across different views for updating the inpainting mask, allowing us to refine the 3DGS for inpainting purposes. Through extensive experiments on benchmark datasets, we confirm that our 3DGIC outperforms current state-of-the-art 3D inpainting methods quantitatively and qualitatively.
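The depth-guided cross-view step rests on standard depth reprojection: back-project a pixel with its rendered depth, transform it into another view, and project it again to test whether the corresponding background is visible there. A minimal sketch (ignoring occlusion checks) follows; the names and the toy camera setup are illustrative, not the paper's code.

```python
import numpy as np

def reproject(u, v, depth, K, T_a2b):
    """Warp pixel (u, v) of view A, with rendered depth, into view B:
    back-project with the depth, apply the 4x4 relative transform T_a2b,
    project with intrinsics K.  Pixels of A's inpainting mask that land on
    visible background in B can be filled from B instead of hallucinated."""
    p_cam_a = depth * (np.linalg.inv(K) @ np.array([u, v, 1.0]))
    p_cam_b = T_a2b[:3, :3] @ p_cam_a + T_a2b[:3, 3]
    pix = K @ p_cam_b
    return pix[:2] / pix[2], p_cam_b[2]              # pixel in B and its depth there

# toy usage: small sideways baseline between the two views
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
T = np.eye(4)
T[0, 3] = 0.2
uv_b, z_b = reproject(100.0, 150.0, 2.5, K, T)
```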
Poster
Rundi Wu · Ruiqi Gao · Ben Poole · Alex Trevithick · Changxi Zheng · Jonathan T. Barron · Aleksander Holynski

[ ExHall D ]

Abstract
We present CAT4D, a method for creating 4D (dynamic 3D) scenes from monocular video. CAT4D leverages a multi-view video diffusion model trained on a diverse combination of datasets to enable novel view synthesis at any specified camera poses and timestamps. Combined with a novel sampling approach, this model can transform a single monocular video into a multi-view video, enabling robust 4D reconstruction via optimization of a deformable 3D Gaussian representation. We demonstrate competitive performance on novel view synthesis and dynamic scene reconstruction benchmarks, and highlight the creative capabilities for 4D scene generation from real or generated videos.
Poster
Xinpeng Liu · Zeyi Huang · Fumio Okura · Yasuyuki Matsushita

[ ExHall D ]

Abstract
Novel view synthesis has demonstrated impressive progress recently, with 3D Gaussian splatting (3DGS) offering efficient training time and photorealistic real-time rendering. However, reliance on Cartesian coordinates limits 3DGS's performance on distant objects, which is important for reconstructing unbounded outdoor environments. We found that, despite its ultimate simplicity, using homogeneous coordinates, a concept from projective geometry, in the 3DGS pipeline remarkably improves the rendering accuracy of distant objects. We therefore propose Homogeneous Gaussian Splatting (HoGS), which incorporates homogeneous coordinates into the 3DGS framework, providing a unified representation for enhancing near and distant objects. HoGS effectively manages both expansive spatial positions and scales, particularly in unbounded outdoor environments, by adopting projective geometry principles. Experiments show that HoGS significantly enhances accuracy in reconstructing distant objects while maintaining high-quality rendering of nearby objects, along with fast training speed and real-time rendering capability. Our implementation will be released upon acceptance.
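The benefit of homogeneous coordinates for distant content can be seen with a few lines of textbook projective geometry: a point stored as (x, y, z, w) recedes to infinity along a fixed direction as w shrinks, while its projection stays well-defined and numerically tame. This is only the underlying concept, not the paper's full pipeline.

```python
import numpy as np

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

def to_cartesian(h):
    """Homogeneous (x, y, z, w) -> Cartesian; as w -> 0 the point recedes to
    infinity along the fixed direction (x, y, z)."""
    return h[:3] / h[3]

def project_pixel(h):
    """Pinhole projection of a homogeneous point: only the ray direction
    matters, so distant content needs no huge Cartesian coordinates."""
    x = K @ h[:3]
    return x[:2] / x[2]

near = np.array([1.0, 0.5, 5.0, 1.0])       # 5 units away
far = np.array([1.0, 0.5, 5.0, 1e-4])       # same direction, 50,000 units away
print(to_cartesian(far))                    # a huge Cartesian position ...
print(project_pixel(near), project_pixel(far))  # ... yet the same, stably computed pixel
```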
Poster
Zilong Huang · Jun He · Junyan Ye · Lihan Jiang · Weijia Li · Yiping Chen · Ting Han

[ ExHall D ]

Abstract
The reconstruction of immersive and realistic 3D scenes holds significant practical importance in various fields of computer vision and computer graphics. Typically, immersive and realistic scenes should be free from obstructions by dynamic objects, maintain global texture consistency, and allow for unrestricted exploration. The current mainstream methods for image-driven scene construction involve iteratively refining the initial image using a moving virtual camera to generate the scene. However, previous methods struggle with visual discontinuities due to global texture inconsistencies under varying camera poses, and they frequently exhibit scene voids caused by foreground-background occlusions. To this end, we propose a novel layered 3D scene reconstruction framework from panoramic images, named Scene4U. Specifically, Scene4U integrates an open-vocabulary segmentation model with a large language model to decompose a real panorama into multiple layers. Then, we employ a layered repair module based on a diffusion model to restore occluded regions using visual cues and depth information, generating a hierarchical representation of the scene. The multi-layer panorama is then initialized as a 3D Gaussian Splatting representation, followed by layered optimization, which ultimately produces an immersive 3D scene with semantic and structural consistency that supports free exploration. Our Scene4U outperforms state-of-the-art methods, improving by 24.24% in LPIPS and …
Poster
Xiaoqian Ruan · Pei Yu · Dian Jia · Hyeonjeong Park · Peixi Xiong · Wei Tang

[ ExHall D ]

Abstract
Reconstructing the 3D shape of an object from a single-view image is a fundamental task in computer vision. Recent advances in differentiable rendering have enabled 3D reconstruction from image collections using only 2D annotations. However, these methods mainly focus on whole-object reconstruction and overlook object partonomy, which is essential for intelligent agents interacting with physical environments. This paper aims at learning partonomic 3D reconstruction from collections of images with only 2D annotations. Our goal is not only to reconstruct the shape of an object from a single-view image but also to decompose the shape into meaningful semantic parts. To handle the expanded solution space and frequent part occlusions in single-view images, we introduce a novel approach that represents, parses, and learns the structural compositionality of 3D objects. This approach comprises: (1) a compact and expressive compositional representation of object geometry, achieved through disentangled modeling of large shape variations, constituent parts, and detailed part deformations as multi-granularity neural fields; (2) a part transformer that recovers precise partonomic geometry and handles occlusions, through effective part-to-pixel grounding and part-to-part relational modeling; and (3) a self-supervised method that jointly learns the compositional representation and part transformer, by bridging object shape and parts, image synthesis, …
Poster
Jay Zhangjie Wu · Alex Zhang · Haithem Turki · Xuanchi Ren · Jun Gao · Mike Zheng Shou · Sanja Fidler · Žan Gojčič · Huan Ling

[ ExHall D ]

Abstract
Neural Radiance Fields and 3D Gaussian Splatting have revolutionized 3D reconstruction and novel-view synthesis tasks. However, achieving photorealistic rendering from extreme novel viewpoints remains challenging, as artifacts persist across representations. In this work, we introduce Difix3D+, a novel pipeline designed to enhance 3D reconstruction and novel-view synthesis through single-step diffusion models. At the core of our approach is Difix, a single-step image diffusion model trained to enhance and remove artifacts in rendered novel views caused by underconstrained regions of the 3D representation. Difix serves two critical roles in our pipeline. First, it is used during the reconstruction phase to clean up pseudo-training views that are rendered from the reconstruction and then distilled back into 3D. This greatly enhances underconstrained regions and improves the overall 3D representation quality. More importantly, Difix also acts as a neural enhancer during inference, effectively removing residual artifacts arising from imperfect 3D supervision and the limited capacity of current reconstruction models. Difix3D+ is a general solution, a single model compatible with both NeRF and 3DGS representations, and it achieves an average 2x improvement in FID score over baselines while maintaining 3D consistency.
Poster
Hanyang Kong · Xingyi Yang · Xinchao Wang

[ ExHall D ]

Abstract
Novel view synthesis from limited observations remains a significant challenge due to the lack of information in under-sampled regions, often resulting in noticeable artifacts. We introduce Generative Sparse-view Gaussian Splatting (GS-GS), a general pipeline designed to enhance the rendering quality of 3D/4D Gaussian Splatting (GS) when training views are sparse. Our method generates unseen views using generative models, specifically leveraging pre-trained image diffusion models to iteratively refine view consistency and hallucinate additional images at pseudo views. This approach improves 3D/4D scene reconstruction by explicitly enforcing semantic correspondences during the generation of unseen views, thereby enhancing geometric consistency—unlike purely generative methods that often fail to maintain view consistency. Extensive evaluations on various 3D/4D datasets—including Blender, LLFF, Mip-NeRF360, and Neural 3D Video—demonstrate that our GS-GS outperforms existing state-of-the-art methods in rendering quality without sacrificing efficiency.
Poster
Noam Elata · Bahjat Kawar · Yaron Ostrovsky-Berman · Miriam Farber · Ron Sokolovsky

[ ExHall D ]

Abstract
Synthesizing a novel view from a single input image is a challenging task. Traditionally, this task was approached by estimating scene depth, warping, and inpainting, with machine learning models enabling parts of the pipeline. More recently, generative models are being increasingly employed in novel view synthesis (NVS), often encompassing the entire end-to-end system. In this work, we adapt a modern diffusion model architecture for end-to-end NVS in the pixel space, substantially outperforming previous state-of-the-art (SOTA) techniques. We explore different ways to encode geometric information into the network. Our experiments show that while these methods may enhance performance, their impact is minor compared to utilizing improved generative models. Moreover, we introduce a novel NVS training scheme that utilizes single-view datasets, capitalizing on their relative abundance compared to their multi-view counterparts. This leads to improved generalization capabilities to scenes with out-of-domain content. We plan to publish code and model weights upon acceptance.
Poster
Ruijie Lu · Yixin Chen · Junfeng Ni · Baoxiong Jia · Yu Liu · Diwen Wan · Gang Zeng · Siyuan Huang

[ ExHall D ]

Abstract
Repurposing pre-trained diffusion models has proven effective for novel view synthesis (NVS). However, these methods are mostly limited to a single object; directly applying such methods to compositional multi-object scenarios yields inferior results, especially incorrect object placement and inconsistent shape and appearance under novel views. How to enhance and systematically evaluate the cross-view consistency of such models remains under-explored. To address this issue, we propose MOVIS to enhance the structural awareness of the view-conditioned diffusion model for multi-object NVS in terms of model inputs, auxiliary tasks, and training strategy. First, we inject structure-aware features, including depth and object mask, into the denoising U-Net to enhance the model's comprehension of object instances and their spatial relationships. Second, we introduce an auxiliary task requiring the model to simultaneously predict novel view object masks, further improving the model's capability in differentiating and placing objects. Finally, we conduct an in-depth analysis of the diffusion sampling process and carefully devise a structure-guided timestep sampling scheduler during training, which balances the learning of global object placement and fine-grained detail recovery. To systematically evaluate the plausibility of synthesized images, we propose to assess cross-view consistency and novel view object placement alongside …
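As a rough illustration of what a timestep sampling scheduler that trades off global placement against detail recovery could look like, the hypothetical sketch below biases sampled diffusion timesteps from large (noisy, layout-dominated) toward small (detail-dominated) as training progresses. The Beta parameterization and its shape values are assumptions for illustration only, not the structure-guided schedule derived in the paper.

```python
import numpy as np

def sample_timestep(progress, T=1000, rng=np.random.default_rng()):
    """Sample a diffusion timestep biased by training progress in [0, 1]:
    early training favors large timesteps (global object placement),
    late training favors small timesteps (fine-grained detail).
    Hypothetical scheduler; shape parameters are illustrative assumptions."""
    a = 1.0 + 4.0 * (1.0 - progress)   # pushes mass toward t ~ T early on
    b = 1.0 + 4.0 * progress           # pushes mass toward t ~ 0 late in training
    return int(rng.beta(a, b) * (T - 1))

rng = np.random.default_rng(0)
print([sample_timestep(p, rng=rng) for p in (0.0, 0.5, 1.0)])
```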
Poster
Youngkyoon Jang · Eduardo Pérez-Pellitero

[ ExHall D ]

Abstract
We propose Covisibility Map-based Gaussian Splatting (CoMapGS), designed to recover underrepresented sparse regions in sparse novel view synthesis. CoMapGS addresses both high- and low-uncertainty regions by constructing covisibility maps, enhancing initial point clouds, and applying uncertainty-aware weighted supervision with a proximity classifier. Our contributions are threefold: (1) CoMapGS reframes novel view synthesis by leveraging covisibility maps as a core component to address region-specific uncertainty levels; (2) Enhanced initial point clouds for both low- and high-uncertainty regions compensate for sparse COLMAP-derived point clouds, improving reconstruction quality and benefiting few-shot 3DGS methods; (3) Adaptive supervision with covisibility-score-based weighting and proximity classification achieves consistent performance gains across scenes with various sparsity scores derived from covisibility maps. Experimental results demonstrate that CoMapGS outperforms state-of-the-art methods on datasets including Mip-NeRF 360 and LLFF.
Poster
Lihan Jiang · Kerui Ren · Mulin Yu · Linning Xu · Junting Dong · Tao Lu · Feng Zhao · Dahua Lin · Bo Dai

[ ExHall D ]

Abstract
Seamless integration of both aerial and street view images remains a significant challenge in neural scene reconstruction and rendering. Existing methods predominantly focus on a single domain, limiting their applications in immersive environments, which demand extensive free view exploration with large view changes both horizontally and vertically. We introduce Horizon-GS, a novel approach built upon Gaussian Splatting techniques that tackles unified reconstruction and rendering for aerial and street views. Our method addresses the key challenges of combining these perspectives with a new training strategy, overcoming viewpoint discrepancies to generate high-fidelity scenes. We also curated a high-quality aerial-to-ground view dataset encompassing both synthetic and real-world scenes to advance further research. Experiments across diverse urban scene datasets confirm the effectiveness of our method.
Poster
Yulong Zheng · Zicheng Jiang · Shengfeng He · Yandu Sun · Junyu Dong · Huaidong Zhang · Yong Du

[ ExHall D ]

Abstract
Neural Radiance Field (NeRF) and 3D Gaussian Splatting (3DGS) have noticeably advanced photo-realistic novel view synthesis using images from densely spaced camera viewpoints. However, these methods struggle in few-shot scenarios due to limited supervision. In this paper, we present NexusGS, a 3DGS-based approach that enhances novel view synthesis from sparse-view images by directly embedding depth information into point clouds, without relying on complex manual regularizations. Exploiting the inherent epipolar geometry of 3DGS, our method introduces a novel point cloud densification strategy that initializes 3DGS with a dense point cloud, reducing randomness in point placement while preventing over-smoothing and overfitting. Specifically, NexusGS comprises three key steps: Epipolar Depth Nexus, Flow-Resilient Depth Blending, and Flow-Filtered Depth Pruning. These steps leverage optical flow and camera poses to compute accurate depth maps, while mitigating the inaccuracies often associated with optical flow. By incorporating epipolar depth priors, NexusGS ensures reliable dense point cloud coverage and supports stable 3DGS training under sparse-view conditions. Experiments demonstrate that NexusGS significantly enhances depth accuracy and rendering quality, surpassing state-of-the-art methods by a considerable margin. Furthermore, we validate the superiority of our generated point clouds by substantially boosting the performance of competing methods.
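The idea of turning optical-flow correspondences plus known camera poses into per-pixel depth reduces, at its simplest, to two-view triangulation. The sketch below is a generic DLT triangulation of a pixel against its flow-displaced match, assuming shared intrinsics and a known relative pose; it is an illustration of the underlying geometry, not the NexusGS implementation, and the function name and toy calibration values are made up.

```python
import numpy as np

def triangulate_from_flow(uv1, uv2, K, R, t):
    """Triangulate a 3D point (in camera-1 coordinates) from a flow
    correspondence uv1 -> uv2 and a relative pose (R, t) that maps
    camera-1 coordinates into camera 2. Hypothetical helper, not the
    paper's epipolar depth module."""
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])   # projection of view 1
    P2 = K @ np.hstack([R, t.reshape(3, 1)])            # projection of view 2
    A = np.stack([
        uv1[0] * P1[2] - P1[0],
        uv1[1] * P1[2] - P1[1],
        uv2[0] * P2[2] - P2[0],
        uv2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)                         # homogeneous least squares
    X = Vt[-1]
    return X[:3] / X[3]                                 # depth is the z component

# toy check: a point at 4 m depth observed under a small horizontal baseline
K = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
R, t = np.eye(3), np.array([0.1, 0.0, 0.0])
X_true = np.array([0.2, -0.1, 4.0])
uv1 = (K @ X_true)[:2] / X_true[2]
uv2_h = K @ (R @ X_true + t)
uv2 = uv2_h[:2] / uv2_h[2]
print(triangulate_from_flow(uv1, uv2, K, R, t))         # ~ [0.2, -0.1, 4.0]
```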
Poster
Yutao Tang · Yuxiang Guo · Deming Li · Cheng Peng

[ ExHall D ]

Abstract
Recent efforts in Gaussian-Splat-based Novel View Synthesis can achieve photorealistic rendering; however, such capability is limited in sparse-view scenarios due to sparse initialization and overfitting floaters. Recent progress in depth estimation and alignment can provide a dense point cloud from few views; however, the resulting pose accuracy is suboptimal. In this work, we present SPARS3R, which combines the advantages of accurate pose estimation from Structure-from-Motion and dense point clouds from depth estimation. To this end, SPARS3R first performs a Global Fusion Alignment process that maps a prior dense point cloud to a sparse point cloud from Structure-from-Motion based on triangulated correspondences. RANSAC is applied during this process to distinguish inliers and outliers. SPARS3R then performs a second, Semantic Outlier Alignment step, which extracts semantically coherent regions around the outliers and performs local alignment in these regions. Along with several improvements in the evaluation process, we demonstrate that SPARS3R can achieve photorealistic rendering with sparse images and significantly outperforms existing approaches.
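For intuition, a global alignment of this kind can be pictured as a robust similarity-transform fit between the dense prior point cloud and the sparse SfM cloud: sample minimal 3-point correspondences, fit a scaled rigid transform, keep the consensus set, and refit on the inliers. The sketch below implements that generic recipe with a standard Umeyama solver inside a RANSAC loop; thresholds, names, and the toy data are illustrative assumptions rather than SPARS3R's actual procedure.

```python
import numpy as np

def umeyama(src, dst):
    """Similarity transform (s, R, t) minimizing ||s R src + t - dst||^2."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                         # keep R a proper rotation
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / ((xs ** 2).sum() / len(src))
    t = mu_d - s * R @ mu_s
    return s, R, t

def ransac_similarity(src, dst, iters=500, thresh=0.05, seed=0):
    """Robust similarity fit over 3-point samples, then refit on inliers."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(src), size=3, replace=False)
        s, R, t = umeyama(src[idx], dst[idx])
        resid = np.linalg.norm(src @ (s * R).T + t - dst, axis=1)
        inliers = resid < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    s, R, t = umeyama(src[best_inliers], dst[best_inliers])
    return (s, R, t), best_inliers

# toy data: a known similarity transform plus a block of simulated outliers
rng = np.random.default_rng(1)
src = rng.normal(size=(200, 3))
R_true, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(R_true) < 0:
    R_true[:, 0] *= -1
dst = 2.0 * src @ R_true.T + np.array([1.0, 2.0, 3.0])
dst[:20] += rng.normal(scale=5.0, size=(20, 3))
(s, R, t), inl = ransac_similarity(src, dst)
print(round(float(s), 3), int(inl.sum()))      # ~ 2.0 and ~180 inliers
```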
Poster
Shangjin Zhai · Zhichao Ye · Jialin Liu · Weijian Xie · Jiaqi Hu · Zhen Peng · Hua Xue · Danpeng Chen · Xiaomeng Wang · Lei Yang · Nan Wang · Haomin Liu · Guofeng Zhang

[ ExHall D ]

Abstract
Recent advances in large reconstruction and generative models have significantly improved scene reconstruction and novel view generation. However, due to compute limitations, each inference with these large models is confined to a small area, making long-range consistent scene generation challenging. To address this, we propose StarGen, a novel framework that employs a pre-trained video diffusion model in an autoregressive manner for long-range scene generation. Each video clip generation is conditioned on the 3D warping of spatially adjacent images and the temporally overlapping image from the last generated clip, improving spatiotemporal consistency in long-range scene generation with precise pose control. The spatiotemporal condition is compatible with various input conditions facilitating diverse tasks, including sparse view interpolation, perpetual view generation, and layout-conditioned city generation. Quantitative and qualitative evaluations demonstrate StarGen's superior scalability, fidelity, and pose accuracy compared to state-of-the-art methods.
Poster
Mingzhi Pei · Xu Cao · Xiangyi Wang · Heng Guo · Zhanyu Ma

[ ExHall D ]

Abstract
Multi-view 3D reconstruction for reflective and textureless surfaces remains a challenging problem. Both camera pose calibration and shape reconstruction fail due to insufficient or unreliable visual features across views. To address these issues, we present PMNI (Pose-free Multiview Normal Integration), a novel neural surface reconstruction method that leverages surface normal maps instead of RGB images to incorporate rich geometric information. By enforcing geometric constraints from surface normals and multiview shape consistency within a neural signed distance function (SDF) optimization framework, PMNI robustly recovers both camera poses and high-fidelity surface geometry simultaneously. Experimental results on synthetic and real-world datasets show that our method achieves state-of-the-art performance in the reconstruction of reflective surfaces, even without reliable initial camera poses.
Poster
Bingbing Hu · Yanyan Li · rui xie · Bo Xu · Haoye Dong · Junfeng Yao · Gim Hee Lee

[ ExHall D ]

Abstract
Capturing the temporal evolution of Gaussian properties such as position, rotation, and scale is a challenging task due to the vast number of time-varying parameters and the limited photometric data available, which generally results in convergence issues, making it difficult to find an optimal solution. While feeding all inputs into an end-to-end neural network can effectively model complex temporal dynamics, this approach lacks explicit supervision and struggles to generate high-quality transformation fields. On the other hand, using time-conditioned polynomial functions to model Gaussian trajectories and orientations provides a more explicit and interpretable solution, but requires significant handcrafted effort and lacks generalizability across diverse scenes. To overcome these limitations, this paper introduces a novel approach based on a learnable infinite Taylor Formula to model the temporal evolution of Gaussians. This method offers both the flexibility of an implicit network-based approach and the interpretability of explicit polynomial functions, allowing for more robust and generalizable modeling of Gaussian dynamics across various dynamic scenes. Extensive experiments on the dynamic novel view rendering task are conducted on public datasets, demonstrating that the proposed method achieves state-of-the-art performance in this domain.
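To make the "learnable Taylor" idea concrete, the minimal PyTorch sketch below models each Gaussian's position as a truncated Taylor polynomial in time whose per-Gaussian derivative coefficients are learnable parameters. This is only a simplified illustration under that assumption (fixed truncation order, positions only), not the paper's learnable infinite Taylor formulation; the class and attribute names are hypothetical.

```python
import torch
import torch.nn as nn

class TaylorTrajectory(nn.Module):
    """Order-K truncated Taylor expansion of per-Gaussian positions around t=0.
    Minimal sketch of the idea; the paper's model is not truncated like this."""
    def __init__(self, num_gaussians, order=4):
        super().__init__()
        self.p0 = nn.Parameter(torch.zeros(num_gaussians, 3))            # position at t=0
        self.coeffs = nn.Parameter(torch.zeros(order, num_gaussians, 3))  # learnable k-th derivatives

    def forward(self, t):
        # t: scalar tensor; returns (num_gaussians, 3) positions at time t
        pos = self.p0.clone()
        fact = 1.0
        for k, ck in enumerate(self.coeffs, start=1):
            fact *= k                                   # running factorial k!
            pos = pos + ck * (t ** k) / fact
        return pos

traj = TaylorTrajectory(num_gaussians=1000, order=4)
print(traj(torch.tensor(0.5)).shape)                    # torch.Size([1000, 3])
```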
Poster
Joohyun Kwon · Hanbyel Cho · Junmo Kim

[ ExHall D ]

Abstract
Recent 4D dynamic scene editing methods require editing thousands of 2D images used for dynamic scene synthesis and updating the entire scene with additional training loops, resulting in several hours of processing to edit a single dynamic scene. Therefore, these methods are not scalable with respect to the temporal dimension of the dynamic scene (i.e., the number of timesteps). In this work, we propose an efficient dynamic scene editing method that is more scalable in terms of temporal dimension. To achieve computational efficiency, we leverage a 4D Gaussian representation that models a 4D dynamic scene by combining static 3D Gaussians with a Hexplane-based deformation field, which handles dynamic information. We then perform editing solely on the static 3D Gaussians, which is the minimal but sufficient component required for visual editing. To resolve the misalignment between the edited 3D Gaussians and the deformation field potentially resulting from the editing process, we additionally conduct a refinement stage using a score distillation mechanism. Extensive editing results demonstrate that our method is efficient, reducing editing time by more than half compared to existing methods, while achieving high editing quality that better follows user instructions.
Poster
Jongmin Park · Minh-Quan Viet Bui · Juan Luis Gonzalez Bello · Jaeho Moon · Jihyong Oh · Munchurl Kim

[ ExHall D ]

Abstract
Synthesizing novel views from in-the-wild monocular videos is challenging due to scene dynamics and the lack of multi-view cues. To address this, we propose SplineGS, a COLMAP-free dynamic 3D Gaussian Splatting (3DGS) framework for high-quality reconstruction and fast rendering from monocular videos. At its core is a novel Motion-Adaptive Spline (MAS) method, which represents continuous dynamic 3D Gaussian trajectories using cubic Hermite splines with a small number of control points. For MAS, we introduce a Motion-Adaptive Control points Pruning (MACP) method to model the deformation of each dynamic 3D Gaussian across varying motions, progressively pruning control points while maintaining dynamic modeling integrity. Additionally, we present a joint optimization strategy for camera parameter estimation and 3D Gaussian attributes, leveraging photometric and geometric consistency. This eliminates the need for Structure-from-Motion preprocessing and enhances SplineGS’s robustness in real-world conditions. Experiments show that SplineGS significantly outperforms state-of-the-art methods in novel view synthesis quality for dynamic scenes from monocular videos, achieving thousands of times faster rendering speeds.
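As a refresher on the spline machinery underlying trajectory models of this kind, the snippet below evaluates a cubic Hermite curve through a handful of control points, with Catmull-Rom-style tangents from finite differences and uniform time spacing assumed. It illustrates the generic math only, not the Motion-Adaptive Spline parameterization or its control-point pruning; the function name and toy control points are placeholders.

```python
import numpy as np

def hermite_eval(ctrl, t):
    """Evaluate a cubic Hermite spline through control points ctrl (N, 3) at
    normalized time t in [0, 1]. Tangents come from central differences
    (Catmull-Rom style), assuming uniform time spacing."""
    n = len(ctrl)
    m = np.gradient(ctrl, axis=0)          # per-control-point tangents
    x = t * (n - 1)
    i = min(int(np.floor(x)), n - 2)       # segment index
    u = x - i                              # local parameter in [0, 1]
    h00 = 2*u**3 - 3*u**2 + 1
    h10 = u**3 - 2*u**2 + u
    h01 = -2*u**3 + 3*u**2
    h11 = u**3 - u**2
    return h00*ctrl[i] + h10*m[i] + h01*ctrl[i+1] + h11*m[i+1]

ctrl = np.array([[0, 0, 0], [1, 0.5, 0], [2, 0, 0], [3, -0.5, 0]], dtype=float)
print(hermite_eval(ctrl, 0.4))             # a point on the smooth trajectory
```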
Poster
Toshiya Yura · Ashkan Mirzaei · Igor Gilitschenski

[ ExHall D ]

Abstract
We introduce a method for using event camera data in novel view synthesis via Gaussian Splatting. Event cameras offer exceptional temporal resolution and a high dynamic range. Leveraging these capabilities allows us to effectively address the novel view synthesis challenge in the presence of fast camera motion. For initialization of the optimization process, our approach uses prior knowledge encoded in an event-to-video model. We also use spline interpolation for obtaining high quality poses along the event camera trajectory. This enhances the reconstruction quality from fast-moving cameras while overcoming the computational limitations traditionally associated with event-based Neural Radiance Field (NeRF) methods. Our experimental evaluation demonstrates that our results achieve higher visual fidelity and better performance than existing event-based NeRF approaches while being an order of magnitude faster to render.
Poster
Jianping Jiang · Weiye Xiao · Zhengyu Lin · Huaizhong Zhang · Tianxiang Ren · Yang Gao · Zhiqian Lin · Zhongang Cai · Lei Yang · Ziwei Liu

[ ExHall D ]

Abstract
Human beings are social animals. How to equip 3D autonomous characters with similar social intelligence that can perceive, understand and interact with humans remains an open yet fundamental problem. In this paper, we introduce SOLAMI, the first end-to-end Social vision-Language-Action (VLA) Modeling framework for Immersive interaction with 3D autonomous characters. Specifically, SOLAMI builds 3D autonomous characters from three aspects: 1) Social VLA Architecture: We propose a unified social VLA framework to generate multimodal response (speech and motion) based on the user's multimodal input to drive the character for social interaction. 2) Interactive Multimodal Data: We present SynMSI, a synthetic multimodal social interaction dataset generated by an automatic pipeline using only existing motion datasets to address the issue of data scarcity. 3) Immersive VR Interface: We develop a VR interface that enables users to immersively interact with these characters driven by various architectures. Extensive quantitative experiments and user studies demonstrate that our framework leads to more precise and natural character responses (in both speech and motion) that align with user expectations with lower latency.
Poster
Aleksei Zhuravlev · Zorah Lähner · Vladislav Golyanik

[ ExHall D ]

Abstract
Estimating correspondences between pairs of deformable shapes remains challenging. Despite substantial progress, existing methods lack broad generalization capabilities and require domain-specific training data. To address these limitations, we propose a fundamentally new approach to shape correspondence based on denoising diffusion models. In our method, a diffusion model learns to directly predict the functional map, i.e. a low-dimensional representation for a point-wise map between shapes. We use a large dataset of synthetic human meshes for training and apply two steps to reduce the number of functional maps that need to be learned. First, maps refer to a template rather than to shape pairs. Second, a functional map is defined in the basis of eigenvectors of the Laplacian, which is not unique due to sign ambiguity. We, hence, introduce an unsupervised approach to select a specific basis by correcting the signs of eigenvectors based on surface features. Our approach achieves superior performance on standard human datasets, meshes with anisotropic connectivity, and non-isometric humanoid shapes compared to existing descriptor-based and large-scale shape deformation methods. We will release the source code and the datasets for reproducibility and research purposes.
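The sign ambiguity mentioned above comes from the fact that any Laplacian eigenvector remains an eigenvector after flipping its sign, so a functional-map basis is only defined up to per-column flips. A common way to canonicalize is to fix each column's sign against a reference per-vertex quantity, as in the simplified sketch below; this stand-in uses a single scalar feature and a plain inner product, which is an assumption for illustration rather than the paper's unsupervised sign-correction procedure.

```python
import numpy as np

def canonicalize_signs(evecs, feature, areas=None):
    """Flip each eigenvector so its (optionally area-weighted) inner product
    with a per-vertex surface feature is non-negative, removing the sign
    ambiguity. evecs: (V, k) basis, feature: (V,) scalar per vertex."""
    w = np.ones(len(feature)) if areas is None else areas
    proj = evecs.T @ (w * feature)            # signed correlation per basis vector
    signs = np.where(proj >= 0, 1.0, -1.0)
    return evecs * signs

# the canonicalized basis is invariant to arbitrary per-column sign flips
rng = np.random.default_rng(0)
V, k = 500, 8
evecs = np.linalg.qr(rng.normal(size=(V, k)))[0]     # stand-in orthonormal basis
feature = rng.normal(size=V)                         # stand-in surface feature
flipped = evecs * rng.choice([-1.0, 1.0], size=k)
print(np.allclose(canonicalize_signs(evecs, feature),
                  canonicalize_signs(flipped, feature)))   # True
```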
Poster
Ziyuan Qu · Zihao Zou · Vivek Boominathan · Praneeth Chakravarthula · Adithya Pediredla

[ ExHall D ]

Abstract
Event cameras, which feature pixels that independently respond to changes in brightness, are becoming increasingly popular in high-speed applications due to their lower latency, reduced bandwidth requirements, and enhanced dynamic range compared to traditional frame-based cameras. Numerous imaging and vision techniques have leveraged event cameras for high-speed scene understanding by capturing high-framerate, high-dynamic range videos, primarily utilizing the temporal advantages inherent to event cameras. Additionally, imaging and vision techniques have utilized the light field---a complementary dimension to temporal information---for enhanced scene understanding. In this work, we propose "Event Fields", a new approach that utilizes innovative optical designs for event cameras to capture light fields at high speed. We develop the underlying mathematical framework for Event Fields and introduce two foundational frameworks to capture them practically: spatial multiplexing to capture temporal derivatives and temporal multiplexing to capture angular derivatives. To realize these, we design two complementary optical setups---one using a kaleidoscope for spatial multiplexing and another using a galvanometer for temporal multiplexing. We evaluate the performance of both designs using a custom-built simulator and real hardware prototypes, showcasing their distinct benefits. Our event fields unlock the full advantages of typical light fields—like post-capture refocusing and depth estimation—now supercharged for high-speed and …
Poster
Hidenobu Matsuki · Gwangbin Bae · Andrew J. Davison

[ ExHall D ]

Abstract
We propose the first tracking and mapping approach for a single RGB-D camera capable of non-rigid surface reconstruction via differentiable rendering. We perform 4D scene capture from an online stream by joint optimization of geometry, appearance, dynamics, and camera ego-motion. Although the natural environment contains complex non-rigid motions, non-rigid SLAM has remained difficult; even with 2.5D sensor measurements, it is still ill-posed due to the high dimensionality of the optimization problem. Our novel SLAM method based on Gaussian surface primitives allows accurate 3D reconstruction and real-time rendering without any template, using a warp-field represented by a multi-layer perceptron (MLP) and regularization terms to enable spatio-temporal reconstruction. A challenge in non-rigid SLAM research is the lack of publicly available datasets with reliable ground truth and standardized evaluation protocols. To address this, we introduce a novel synthetic dataset of everyday objects featuring diverse motions, leveraging availability of large-scale objects and advancements in animation modeling.
Poster
Jian Huang · Chengrui Dong · Xuanhua Chen · Peidong Liu

[ ExHall D ]

Abstract
Implicit neural representations and explicit 3D Gaussian Splatting (3D-GS) for novel view synthesis have recently achieved remarkable progress with frame-based cameras (e.g., RGB and RGB-D cameras). Compared to frame-based cameras, a novel type of bio-inspired visual sensor, the event camera, has demonstrated advantages in high temporal resolution, high dynamic range, low power consumption and low latency, which make it favored for many robotic applications. In this work, we present IncEventGS, an incremental 3D Gaussian Splatting reconstruction algorithm with a single event camera, without the assumption of known camera poses. To recover the 3D scene representation incrementally, we exploit the tracking and mapping paradigm of conventional SLAM pipelines for IncEventGS. Given the incoming event stream, the tracker first estimates an initial camera motion based on the prior reconstructed 3D-GS scene representation. The mapper then jointly refines both the 3D scene representation and camera motion based on the previously estimated motion trajectory from the tracker. The experimental results demonstrate that IncEventGS delivers superior performance compared to prior NeRF-based methods and other related baselines, even though we do not have ground-truth camera poses. Furthermore, our method can also deliver better performance compared to state-of-the-art event visual odometry methods in terms of camera motion …
Poster
Zhiqiang Yan · Zhengxue Wang · Kun Wang · Jun Li · Jian Yang

[ ExHall D ]

Abstract
In this paper, we introduce the Selective Image Guided Network (SigNet), a novel degradation-aware framework that transforms depth completion into depth enhancement for the first time. Moving beyond direct completion using convolutional neural networks (CNNs), SigNet initially densifies sparse depth data through non-CNN densification tools to obtain coarse yet dense depth. This approach eliminates the mismatch and ambiguity caused by direct convolution over irregularly sampled sparse data. Subsequently, SigNet redefines completion as enhancement, establishing a self-supervised degradation bridge between the coarse depth and the targeted dense depth for effective RGB-D fusion. To achieve this, SigNet leverages the implicit degradation to adaptively select high-frequency components (e.g., edges) of RGB data to compensate for the coarse depth. This degradation is further integrated into a multi-modal conditional Mamba, dynamically generating the state coefficients to enable efficient global high-frequency information interaction. We conduct extensive experiments on the NYUv2, DIML, SUN RGBD, and TOFDC datasets, demonstrating the state-of-the-art (SOTA) performance of SigNet.
Poster
Nikhil Behari · Aaron Young · Siddharth Somasundaram · Tzofi Klinghoffer · Akshat Dave · Ramesh Raskar

[ ExHall D ]

Abstract
3D surface reconstruction is essential across applications of virtual reality, robotics, and mobile scanning. However, RGB-based reconstruction often fails in low-texture, low-light, and low-albedo scenes. Handheld LiDARs, now common on mobile devices, aim to address these challenges by capturing depth information from time-of-flight measurements of a coarse grid of projected dots. Yet, these sparse LiDARs struggle with scene coverage on limited input views, leaving large gaps in depth information. In this work, we propose using an alternative class of "blurred" LiDAR that emits a diffuse flash, greatly improving scene coverage but introducing spatial ambiguity from mixed time-of-flight measurements across a wide field of view. To handle these ambiguities, we propose leveraging the complementary strengths of diffuse LiDAR with RGB. We introduce a Gaussian surfel-based rendering framework with a scene-adaptive loss function that dynamically balances RGB and diffuse LiDAR signals. We demonstrate that, surprisingly, diffuse LiDAR can outperform traditional sparse LiDAR, enabling robust 3D scanning with accurate color and geometry estimation in challenging environments.
Poster
Junjie Luo · John Mamish · Alan Fu · Thomas Concannon · Josiah Hester · Emma Alexander · Qi Guo

[ ExHall D ]

Abstract
Depth cameras promise to revolutionize mobile systems, but their size and power consumption limit their adoption. In this work we introduce Focal Split, the first handheld depth-from-differential-defocus (DfDD) camera with fully onboard power and compute. Unlike active illumination systems like LiDAR, we avoid the power consumption associated with light sources, and our use of differential defocus sidesteps the energy-intensive computation associated with passive triangulation methods like multi-view stereo and traditional depth-from-defocus. We extend DfDD theory around a portable, handheld opto-mechanical design that is robust thanks to its snapshot depth images. Our camera shows that a depth-from-defocus system can feasibly be operated in real time on resource-constrained systems, with a battery life of 2 hours. Focal Split is DIY friendly: we include a guide to building the depth sensor using off-the-shelf optics, circuits, and mechanics with 3D-printed housing for under $500.
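For readers unfamiliar with DfDD, the core signal is the difference between two images captured at slightly different focus settings: under a thin-lens model with a small focus change, that difference is approximately a depth-dependent factor times the image Laplacian. The sketch below computes that per-pixel ratio as a depth cue; turning it into metric depth requires a per-device calibration curve, which is omitted, and none of this reflects Focal Split's actual onboard processing.

```python
import numpy as np

def dfdd_depth_cue(I_a, I_b, eps=1e-6):
    """Per-pixel DfDD depth cue. Assumption (thin lens, small focus change):
    I_a - I_b is roughly a depth-dependent factor times the Laplacian of the
    image, so their ratio isolates that factor. A calibrated curve (not shown)
    would map the cue to metric depth. Illustrative sketch only."""
    I_mean = 0.5 * (I_a + I_b)
    # 5-point finite-difference Laplacian (periodic borders, for simplicity)
    lap = (-4.0 * I_mean
           + np.roll(I_mean, 1, axis=0) + np.roll(I_mean, -1, axis=0)
           + np.roll(I_mean, 1, axis=1) + np.roll(I_mean, -1, axis=1))
    cue = np.full_like(I_mean, np.nan)
    valid = np.abs(lap) > eps              # low-texture pixels are unreliable
    cue[valid] = (I_a - I_b)[valid] / lap[valid]
    return cue, valid

rng = np.random.default_rng(0)
I_a, I_b = rng.random((240, 320)), rng.random((240, 320))
cue, valid = dfdd_depth_cue(I_a, I_b)
print(cue.shape, float(valid.mean()))
```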
Poster
Mehdi Zayene · Albias Havolli · Jannik Endres · Charles Corbière · Alexandre Ben Ahmed Kontouli · Salim Cherkaoui · Alex Alahi

[ ExHall D ]

Abstract
Despite considerable progress in stereo depth estimation, omnidirectional imaging remains underexplored, mainly due to the lack of appropriate data. We introduce Helvipad, a real-world dataset for omnidirectional stereo depth estimation, consisting of 40K frames from video sequences across diverse environments, including crowded indoor and outdoor scenes with diverse lighting conditions. Collected using two 360° cameras in a top-bottom setup and a LiDAR sensor, the dataset includes accurate depth and disparity labels by projecting 3D point clouds onto equirectangular images. Additionally, we provide an augmented training set with a significantly increased label density by using depth completion. We benchmark leading stereo depth estimation models for both standard and omnidirectional images. Results show that while recent stereo methods perform decently, a significant challenge persists in accurately estimating depth in omnidirectional imaging. To address this, we introduce necessary adaptations to stereo models, achieving improved performance.
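Generating depth and disparity labels by projecting LiDAR returns onto 360° images comes down to converting each 3D point into longitude and latitude and then into equirectangular pixel coordinates. The helper below sketches that standard projection under an assumed axis convention (x right, y down, z forward); the dataset's exact convention, top-bottom stereo geometry, and disparity computation are not reproduced here.

```python
import numpy as np

def project_equirectangular(points, width, height):
    """Project 3D points (N, 3) in the camera frame onto an equirectangular
    image, returning pixel coordinates (u, v) and range. Assumed convention:
    x right, y down, z forward; longitude from +z toward +x, latitude from
    the horizontal plane toward -y (up)."""
    x, y, z = points.T
    r = np.linalg.norm(points, axis=1)
    lon = np.arctan2(x, z)                                          # [-pi, pi)
    lat = np.arcsin(np.clip(-y / np.maximum(r, 1e-9), -1.0, 1.0))   # [-pi/2, pi/2]
    u = (lon / (2 * np.pi) + 0.5) * width
    v = (0.5 - lat / np.pi) * height
    return np.stack([u, v], axis=1), r

pts = np.array([[0.0, 0.0, 5.0],     # straight ahead -> image center
                [5.0, 0.0, 0.0]])    # 90 degrees to the right -> 3/4 of the width
uv, dist = project_equirectangular(pts, width=1920, height=960)
print(uv.round(1), dist)
```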
Poster
Pratheba Selvaraju · Victoria Abrevaya · Timo Bolkart · Rick Akkerman · Tianyu Ding · Faezeh Amjadi · Ilya Zharkov

[ ExHall D ]

Abstract
Reconstructing 3D face models from a single image is an inherently ill-posed problem, which becomes even more challenging in the presence of occlusions. In addition to fewer available observations, occlusions introduce an extra source of ambiguity, where multiple reconstructions can be equally valid. Despite the ubiquity of the problem, very few methods address its multi-hypothesis nature. In this paper we introduce OFER, a novel approach for single-image 3D face reconstruction that can generate plausible, diverse, and expressive 3D faces, even under strong occlusions. Specifically, we train two diffusion models to generate the shape and expression coefficients of a face parametric model, conditioned on the input image. This approach captures the multi-modal nature of the problem, generating a distribution of solutions as output. Although this addresses the ambiguity problem, the challenge remains to pick the best matching shape to ensure consistency across diverse expressions. To achieve this, we propose a novel ranking mechanism that sorts the outputs of the shape diffusion network based on the predicted shape accuracy scores to select the best match. We evaluate our method using standard benchmarks and introduce CO-545, a new protocol and dataset designed to assess the accuracy of expressive faces under occlusion. Our results …
Poster
Yuliang Guo · Sparsh Garg · S. Mahdi H. Miangoleh · Xinyu Huang · Liu Ren

[ ExHall D ]

Abstract
Accurate metric depth estimation from monocular cameras is essential for applications such as autonomous driving, AR/VR, and robotics. While recent depth estimation methods demonstrate strong zero-shot generalization, achieving accurate metric depth across diverse camera types—particularly those with large fields of view (FoV) like fisheye and 360 cameras—remains challenging. This paper introduces Depth Any Camera (DAC), a novel zero-shot metric depth estimation framework that extends a perspective-trained model to handle varying FoVs effectively. Notably, DAC is trained exclusively on perspective images, yet it generalizes seamlessly to fisheye and 360 cameras without requiring specialized training. DAC leverages Equi-Rectangular Projection (ERP) as a unified image representation, enabling consistent processing of images with diverse FoVs. Key components include an efficient Image-to-ERP patch conversion for online ERP-space augmentation, a FoV alignment operation to support effective training across a broad range of FoVs, and multi-resolution data augmentation to address resolution discrepancies between training and testing. DAC achieves state-of-the-art zero-shot metric depth estimation, improving δ1 accuracy by up to 50% on multiple indoor fisheye and 360 datasets, demonstrating robust generalization across camera types while relying only on perspective training data.
Poster
Marvin Anas Hahn · Kathlén Kohn · Orlando Marigliano · Tomas Pajdla

[ ExHall D ]

Abstract
Rolling shutter (RS) cameras dominate consumer and smartphone markets. Several methods for computing the absolute pose of RS cameras have appeared in the last 20 years, but the relative pose problem has not been fully solved yet. We provide a unified theory for the important class of order-one rolling shutter (RS1) cameras. These cameras generalize the perspective projection to RS cameras, projecting a generic space point to exactly one image point via a rational map. We introduce a new back-projection RS camera model, characterize RS1 cameras, construct explicit parameterizations of such cameras, and determine the image of a space line. We classify all minimal problems for solving the relative camera pose problem with linear RS1 cameras and discover new practical cases. Finally, we show how the theory can be used to explain RS models previously used for absolute pose computation.
Poster
Daniel Safari

[ ExHall D ]

Abstract
Research on bundle adjustment has focused on photo collections where each image is accompanied by its own set of camera parameters. However, real-world applications overwhelmingly call for shared intrinsics bundle adjustment (SI-BA) where camera parameters are shared across multiple images. Utilizing overlooked optimization opportunities specific to SI-BA, most notably matrix-free computation, we present a solver that is eight times faster than alternatives while consuming a tenth of the memory. Additionally, we examine reasons for BA instability under single-precision computation and propose minimal mitigations.
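The "matrix-free computation" highlighted above usually means never materializing the Jacobian or the normal matrix: each Gauss-Newton step is solved with conjugate gradient driven purely by Jacobian-vector and transposed Jacobian-vector products. The sketch below shows that pattern, with an explicit toy Jacobian standing in for a bundle-adjustment Jacobian; it is a generic illustration of the idea, not the paper's solver, and all names are hypothetical.

```python
import numpy as np

def cg_normal_equations(jvp, vjp, residual, x0, iters=200, tol=1e-10):
    """Matrix-free Gauss-Newton step: solve (J^T J) dx = -J^T r by conjugate
    gradient, using only products with J (jvp) and J^T (vjp)."""
    b = -vjp(residual)
    x = x0.copy()
    r = b - vjp(jvp(x))
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Ap = vjp(jvp(p))
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# toy check: an explicit Jacobian stands in for the (never-formed) BA Jacobian
rng = np.random.default_rng(0)
J = rng.normal(size=(200, 30))
residual = rng.normal(size=200)
dx = cg_normal_equations(lambda v: J @ v, lambda v: J.T @ v,
                         residual, x0=np.zeros(30))
print(np.allclose(J.T @ J @ dx, -J.T @ residual, atol=1e-6))   # True
```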
Poster
Jiachen Liu · Rui Yu · Sili Chen · Sharon X. Huang · Hengkai Guo

[ ExHall D ]

Abstract
3D plane reconstruction from a single image is a crucial yet challenging topic in 3D computer vision. Previous state-of-the-art (SOTA) methods have focused on training their system on a single dataset from either indoor or outdoor domain, limiting their generalizability across diverse testing data. In this work, we introduce a novel framework dubbed ZeroPlane, a Transformer-based model targeting zero-shot 3D plane detection and reconstruction from a single image, over diverse domains and environments. To enable data-driven models across multiple domains, we have curated a large-scale (over 14 datasets and 560,000 images), high-resolution, densely-annotated planar benchmark from various indoor and outdoor scenes. To address the challenge of achieving desirable planar geometry on multi-dataset training, we propose to disentangle the representation of plane normal and offset, and employ an exemplar-guided, classification-then-regression paradigm to learn the plane normal and offset, respectively. Additionally, we employ advanced backbones as image encoder, and present an effective pixel-geometry-enhanced plane embedding module to further facilitate planar reconstruction. Extensive experiments across multiple zero-shot evaluation datasets have demonstrated that our approach significantly outperforms previous methods on both reconstruction accuracy and generalizability, especially over in-the-wild data. We will release all of the labeled data, code and models upon the acceptance of this paper.
Poster
Pengju Sun · Banglei Guan · Zhenbao Yu · Yang Shang · Qifeng Yu · Daniel Barath

[ ExHall D ]

Abstract
Affine correspondences have received significant attention due to their benefits in tasks like image matching and pose estimation. Existing methods for extracting affine correspondences still have many limitations in terms of performance; thus, exploring a new paradigm is crucial. In this paper, we present a new pipeline designed for extracting accurate affine correspondences by integrating dense matching and geometric constraints. Specifically, a novel extraction framework is introduced, with the aid of dense matching and a novel keypoint scale and orientation estimator. For this purpose, we propose loss functions based on geometric constraints, which can effectively improve accuracy by supervising neural networks to learn feature geometry. The experimental results show that the accuracy and robustness of our method outperform the existing ones in image matching tasks. To further demonstrate the effectiveness of the proposed method, we applied it to relative pose estimation. Affine correspondences extracted by our method lead to more accurate poses than the baselines on a range of real-world datasets. The source code will be made public.
Poster
Jianping Wu

[ ExHall D ]

Abstract
DiskVPS is a novel vanishing point (VP) detection scheme that detects VPs with extreme efficiency via a Hough Transform (HT) over an image-plane-mapped disk space. DiskVPS differs from state-of-the-art (SOTA) algorithms that use Gaussian Sphere (GS)-based VP detection models, in which camera parameters are required and edge pairs cast votes. The DiskVPS approach has two fundamental advantages in comparison to other VP detection schemes: 1) the potential to achieve substantially higher accuracy at significantly faster processing speed by using individual edges rather than more error-prone and less efficient edge pairs as voters, and 2) the applicability of VP detection to all image types without the need for calibration, as no camera parameters are involved in the algorithm. In a comparative experimental study, we demonstrate that DiskVPS significantly outperforms the SOTA in detection accuracy and processing speed on real-world images.
Poster
Zhiwei Huang · Hailin Yu · Yichun Shentu · Jin Yuan · Guofeng Zhang

[ ExHall D ]

Abstract
This paper presents a novel camera relocalization method, STDLoc, which leverages feature Gaussians as the scene representation. STDLoc is a full relocalization pipeline that can achieve accurate relocalization without relying on any pose prior. Unlike previous coarse-to-fine localization methods that require image retrieval first and then feature matching, we propose a novel sparse-to-dense localization paradigm. Based on this scene representation, we introduce a novel matching-oriented Gaussian sampling strategy and a scene-specific detector to achieve efficient and robust initial pose estimation. Furthermore, based on the initial localization results, we align the query feature map to the Gaussian feature field by dense feature matching to enable accurate localization. The experiments on indoor and outdoor datasets show that STDLoc outperforms current state-of-the-art localization methods in terms of localization accuracy and recall. The code will be released after the paper is accepted.
Poster
Thibaut Loiseau · Guillaume Bourmaud

[ ExHall D ]

Abstract
Camera pose estimation is crucial for many computer vision applications, yet existing benchmarks offer limited insight into method limitations across different geometric challenges. We introduce RUBIK, a novel benchmark that systematically evaluates image matching methods across well-defined geometric difficulty levels. Using three complementary criteria - overlap, scale ratio, and viewpoint angle - we organize 16.5K image pairs from nuScenes into 33 difficulty levels. Our comprehensive evaluation of 14 methods reveals that while recent detector-free approaches achieve the best performance (>47% success rate), they come with significant computational overhead compared to detector-based methods (150-600ms vs. 40-70ms). Even the best performing method succeeds on only 54.8% of the pairs, highlighting substantial room for improvement, particularly in challenging scenarios combining low overlap, large scale differences, and extreme viewpoint changes. The benchmark will be made publicly available.
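The difficulty stratification can be pictured as binning every image pair along the three criteria and combining the bins into levels. The snippet below shows such a binning with placeholder thresholds; RUBIK's actual bin edges and the mapping from bin triples to its 33 levels are not specified here, so the values shown are assumptions for illustration only.

```python
import numpy as np

def difficulty_bins(overlap, scale_ratio, view_angle_deg,
                    overlap_edges=(0.1, 0.3, 0.5, 0.7),
                    scale_edges=(1.5, 2.5, 4.0),
                    angle_edges=(15, 30, 45)):
    """Assign each image pair an (overlap, scale, viewpoint) bin triple.
    Bin edges are illustrative placeholders, not the benchmark's thresholds."""
    o = np.digitize(overlap, overlap_edges)
    s = np.digitize(scale_ratio, scale_edges)
    a = np.digitize(view_angle_deg, angle_edges)
    return np.stack([o, s, a], axis=-1)

pairs = dict(overlap=np.array([0.65, 0.15]),
             scale_ratio=np.array([1.2, 3.0]),
             view_angle_deg=np.array([10.0, 50.0]))
print(difficulty_bins(**pairs))   # easy pair -> [3 0 0], hard pair -> [1 2 3]
```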
Poster
Fei Xue · Sven Elflein · Laura Leal-Taixe · Qunjie Zhou

[ ExHall D ]

Abstract
Establishing correspondences across images is a fundamental challenge in computer vision, underpinning tasks like Structure-from-Motion, image editing, and point tracking. Traditional methods are often specialized for specific correspondence types (geometric, semantic, or temporal), whereas humans naturally identify alignments across these domains. Inspired by this flexibility, we propose MATCHA, a unified feature model designed to “rule them all”, establishing robust correspondences across diverse matching tasks. Building on insights that diffusion model features can encode multiple correspondence types, MATCHA augments this capacity by dynamically fusing high-level semantic and low-level geometric features through an attention-based module, creating expressive, versatile, and robust features. Additionally, MATCHA integrates object-level features from DINOv2 to further boost generalization, enabling a single feature capable of “matching anything.” Extensive experiments validate that MATCHA consistently surpasses state-of-the-art methods across geometric, semantic, and temporal tasks, setting a new foundation for a unified approach for the fundamental correspondence problem in computer vision. To the best of our knowledge, MATCHA is the first approach that is able to effectively tackle diverse matching tasks with a single unified feature.
Poster
Junwei Zheng · Ruiping Liu · Yufan Chen · Zhenfang Chen · Kailun Yang · Jiaming Zhang · Rainer Stiefelhagen

[ ExHall D ]

Abstract
Absolute Pose Regression (APR) predicts 6D camera poses but lacks the adaptability to unknown environments without retraining, while Relative Pose Regression (RPR) generalizes better yet requires a large image retrieval database. To address this dilemma, we introduce a new task, Scene-agnostic Pose Regression (SPR), which can achieve accurate pose regression in a flexible way while eliminating the need for retraining or databases. To benchmark SPR, we created a large-scale dataset, 360SPR, with over 200K photorealistic panoramas, 3.6M pinhole images and camera poses in 270 scenes at 3 different sensor heights. Furthermore, an initial SPR-Mamba model is proposed to address SPR in a dual-branch manner. While the local branch focuses on the poses between consecutive adjacent frames, the global branch is designed for the pose between the query and origin frame. Extensive experiments and studies demonstrate the effectiveness of our SPR task, dataset, and methods. In unknown 360SPR scenes, our method outperforms APR (27.45m/47.01°) and RPR (11.92m/21.27°), achieving a significant reduction of error to 3.85m/3.97°. The dataset and code will be made publicly available.
Poster
Xinyue Zhang · Zijia Dai · Wanting Xu · Laurent Kneip

[ ExHall D ]

Abstract
While automatically generated polynomial elimination templates have sparked great progress in the field of 3D computer vision, there remain many problems for which the degree of the constraints or the number of unknowns leads to intractability. In recent years, homotopy continuation has been introduced as a plausible alternative. However, the method currently depends on expensive parallel tracking of all possible solutions in the complex domain, or a classification network for starting problem-solution pairs trained over a limited set of real-world examples. Our innovation lies in a novel approach to finding solution-problem pairs, where we only need to predict a rough initial solution, with the corresponding problem generated by an online simulator. Subsequently, homotopy continuation is applied to track that single solution back to the original problem. We apply this elegant combination to generalized camera resectioning, and also introduce a new solution to the challenging generalized relative pose and scale problem. As demonstrated, the proposed method successfully compensates for the raw error committed by the regressor alone, and leads to state-of-the-art efficiency and success rates.
Poster
Shujuan Li · Yu-Shen Liu · Zhizhong Han

[ ExHall D ]

Abstract
Reconstructing open surfaces from multi-view images is vital in digitalizing complex objects in daily life. A widely used strategy is to learn unsigned distance functions (UDFs) by checking if their appearance conforms to the image observations through neural rendering. However, it is still hard to learn the continuous and implicit UDF representations through 3D Gaussian Splatting (3DGS) due to the discrete and explicit scene representations, i.e., 3D Gaussians. To resolve this issue, we propose a novel approach to bridge the gap between 3D Gaussians and UDFs. Our key idea is to overfit thin and flat 2D Gaussian planes on surfaces, and then, leverage the self-supervision and gradient-based inference to supervise unsigned distances in areas both near and far from surfaces. To this end, we introduce novel constraints and strategies to constrain the learning of 2D Gaussians to pursue more stable optimization and more reliable self-supervision, addressing the challenges brought by the complicated gradient field on or near the zero level set of UDFs. We report numerical and visual comparisons with the state-of-the-art on widely used benchmarks and real data to show our advantages in terms of accuracy, efficiency, completeness, and sharpness of reconstructed open surfaces with boundaries.
Poster
Miroslav Purkrábek · Jiri Matas

[ ExHall D ]

Abstract
Current Human Pose Estimation methods have achieved significant improvements. However, state-of-the-art models ignore out-of-image keypoints and use uncalibrated heatmaps as keypoint location representation. To address these limitations, we propose ProbPose, which predicts for each keypoint: a calibrated probability of keypoint presence at each location in the activation window, the probability of being outside of it, and its predicted visibility. To address the lack of evaluation protocols for out-of-image keypoints, we introduce the CropCOCO dataset and the Extended OKS (Ex-OKS) metric, which extends OKS to out-of-image points. Tested on COCO, CropCOCO, and OCHuman, ProbPose shows significant gains in out-of-image keypoint localization while also improving in-image localization through data augmentation. Additionally, the model improves robustness along the edges of the bounding box and offers better flexibility in keypoint evaluation. The code and models will be released on the project website for research purposes.
Poster
Yunze Man · Yichen Sheng · Jianming Zhang · Liangyan Gui · Yu-Xiong Wang

[ ExHall D ]

Abstract
Recent advancements in 3D object reconstruction from single images have primarily focused on improving the accuracy of object shapes. Yet, these techniques often fail to accurately capture the inter-relation between the object, ground, and camera. As a result, the reconstructed objects often appear floating or tilted when placed on flat surfaces. This limitation significantly affects 3D-aware image editing applications like shadow rendering and object pose manipulation. To address this issue, we introduce ORG (Object Reconstruction with Ground), a novel task aimed at reconstructing 3D object geometry in conjunction with the ground surface. Our method uses two compact pixel-level representations to depict the relationship between camera, object, and ground. Experiments show that the proposed ORG model can effectively reconstruct object-ground geometry on unseen data, significantly enhancing the quality of shadow generation and pose manipulation compared to conventional single-image 3D reconstruction techniques.
Poster
Guo Junfu · Yu Xin · Gaoyi Liu · Kai Xu · Ligang Liu · Ruizhen Hu

[ ExHall D ]

Abstract
We tackle the challenge of concurrent part-level reconstruction with RGB appearance and estimation of motion parameters for building digital twins of articulated objects using the 3D Gaussian Splatting (3D-GS) method. With two distinct sets of multi-view imagery, each depicting an object in separate static articulation configurations, we reconstruct the articulated object in 3D Gaussian representations with both appearance and geometry information at the same time. Our approach decouples multiple highly interdependent parameters through a multi-step optimization process, thereby achieving a stable optimization procedure and high-quality outcomes. We introduce ArticulatedGS, a self-supervised, comprehensive framework that autonomously learns to model shapes and appearances at the part level and synchronizes the optimization of motion parameters, all without reliance on 3D supervision, motion cues, or semantic labels. Our experimental results demonstrate that, among comparable methodologies, our approach has achieved optimal outcomes in terms of part segmentation accuracy, motion estimation accuracy, and visual quality.
Poster
Weihang Li · Hongli XU · Junwen Huang · HyunJun Jung · Kuan-Ting Yu · Nassir Navab · Benjamin Busam

[ ExHall D ]

Abstract
A key challenge in model-free category-level pose estimation is the extraction of contextual object features that generalize across varying instances within a specific category. Recent approaches leverage foundational features to capture semantic and geometry cues from data. However, these approaches fail under partial visibility. We overcome this with a first-complete-then-aggregate strategy for feature extraction utilizing class priors. In this paper, we present GCE-Pose, a method that enhances pose estimation for novel instances by integrating category-level global context prior. GCE-Pose first performs semantic shape reconstruction with a proposed Semantic Shape Reconstruction (SSR) module. Given an unseen partial RGB-D object instance, our SSR module reconstructs the instance's global geometry and semantics by deforming category-specific 3D semantic prototypes through a learned deep Linear Shape Model. We then introduce a Global Context Enhanced (GCE) feature fusion module that effectively fuses features from partial RGB-D observations and the reconstructed global context. Extensive experiments validate the impact of our global context prior and the effectiveness of the GCE fusion module, demonstrating that GCE-Pose significantly outperforms existing methods on challenging real-world datasets HouseCat6D and NOCS-REAL275.
Poster
Yuanbo Xiangli · Ruojin Cai · Hanyu Chen · Jeffrey Byrne · Noah Snavely

[ ExHall D ]

Abstract
Accurate 3D reconstruction is frequently hindered by visual aliasing, where visually similar but distinct surfaces (aka, doppelgangers), are incorrectly matched. These spurious matches distort the structure-from-motion (SfM) process, leading to misplaced model elements and reduced accuracy. Prior efforts addressed this with CNN classifiers trained on curated datasets, but these approaches struggle to generalize across diverse real-world scenes and can require extensive parameter tuning. In this work, we present Doppelgangers++, a method to enhance doppelganger detection and improve 3D reconstruction accuracy. Our contributions include a diversified training dataset that incorporates geo-tagged images from everyday scenes to expand robustness beyond landmark-based datasets. We further propose a Transformer-based classifier that leverages 3D-aware features from the MASt3R model, achieving superior precision and recall across both in-domain and out-of-domain tests. Doppelgangers++ integrates seamlessly into standard SfM and MASt3R-SfM pipelines, offering efficiency and adaptability across varied scenes. To evaluate SfM accuracy, we introduce an automated, geotag-based method for validating reconstructed models, eliminating the need for manual inspection. Through extensive experiments, we demonstrate that Doppelgangers++ significantly enhances pairwise visual disambiguation and improves 3D reconstruction quality in complex and diverse scenarios.
Poster
Mengjie Xu · Yitao Zhu · Haotian Jiang · Jiaming Li · Zhenrong Shen · Sheng Wang · Haolin Huang · Xinyu Wang · Han Zhang · Qing Yang · Qian Wang

[ ExHall D ]

Abstract
Multi-view object tracking (MVOT) offers promising solutions to challenges such as occlusion and target loss, which are common in traditional single-view tracking. However, progress has been limited by the lack of comprehensive multi-view datasets and effective cross-view integration methods. To overcome these limitations, we compiled a Multi-View object Tracking (MVTrack) dataset of 234K high-quality annotated frames featuring 27 distinct objects across various scenes. In conjunction with this dataset, we introduce a novel MVOT method, Multi-View Integration Tracker (MITracker), to efficiently integrate multi-view object features and provide stable tracking outcomes. MITracker can track any object in video frames of arbitrary length from arbitrary viewpoints. The key advancements of our method over traditional single-view approaches come from two aspects: (1) MITracker transforms 2D image features into a 3D feature volume and compresses it into a bird’s eye view (BEV) plane, facilitating inter-view information fusion; (2) we propose an attention mechanism that leverages geometric information from fused 3D feature volume to refine the tracking results at each view. MITracker outperforms existing methods on the MVTrack and GMTD datasets, achieving state-of-the-art performance.
Poster
Friedhelm Hamann · Daniel Gehrig · Filbert Febryanto · Kostas Daniilidis · Guillermo Gallego

[ ExHall D ]

Abstract
Tracking any point (TAP) recently shifted the motion estimation paradigm from focusing on individual salient points with local templates to tracking arbitrary points with global image contexts. However, while research has mostly focused on driving the accuracy of models in nominal settings, addressing scenarios with difficult lighting conditions and high-speed motions remains out of reach due to the limitations of the sensor. This work addresses this challenge with the first event camera-based TAP method. It leverages the high temporal resolution and high dynamic range of event cameras for robust high-speed tracking, and the global contexts in TAP methods to handle asynchronous and sparse event measurements. We further extend the TAP framework to handle event feature variations induced by motion - thereby addressing an open challenge in purely event-based tracking - with a novel feature alignment loss which ensures the learning of motion-robust features. Our method is trained with data from a new data generation pipeline and systematically ablated across all design decisions. Our method shows strong cross-dataset generalization and performs 135% better on the average Jaccard metric than the baselines. Moreover, on an established feature tracking benchmark, it achieves a 19% improvement over the previous best event-only method and even …
Poster
Hoonhee Cho · Jae-Young Kang · Youngho Kim · Kuk-Jin Yoon

[ ExHall D ]

Abstract
Detecting 3D objects in point clouds plays a crucial role in autonomous driving systems. Recently, advanced multi-modal methods incorporating camera information have achieved notable performance. For a safe and effective autonomous driving system, algorithms that excel not only in accuracy but also in speed and low latency are essential. However, existing algorithms fail to meet these requirements due to the latency and bandwidth limitations of fixed frame rate sensors, e.g., LiDAR and camera. To address this limitation, we introduce asynchronous event cameras into 3D object detection for the first time. We leverage their high temporal resolution and low bandwidth to enable high-speed 3D object detection. Our method enables detection even during inter-frame intervals when synchronized data is unavailable, by retrieving previous 3D information through the event camera. Furthermore, we introduce the first event-based 3D object detection dataset, DSEC-3DOD, which includes ground-truth 3D bounding boxes at 100 FPS, establishing the first benchmark for event-based 3D detectors. Our code and dataset will be publicly available.
Poster
Zechuan Li · Hongshan Yu · Yihao Ding · Jinhao Qiao · Basim Azam · Naveed Akhtar

[ ExHall D ]

Abstract
We propose GO-N3RDet, a scene-geometry optimized multi-view 3D object detector enhanced by neural radiance fields (NeRF). The key to accurate 3D object detection is in effective voxel representation. However, due to occlusion and lack of 3D information, constructing 3D features from multi-view 2D images is challenging. Addressing that, we introduce a unique 3D positional information embedded voxel optimization mechanism to fuse multi-view features. To prioritize neural field reconstruction in object regions, we also devise a double importance sampling scheme for the NeRF branch of our detector. We additionally propose an opacity optimization module for precise voxel opacity prediction by enforcing multi-view consistency constraints. Moreover, to further improve voxel density consistency across multiple perspectives, we incorporate ray distance as a weighting factor to minimize cumulative ray errors. Our unique modules synergetically form an end-to-end neural model that establishes new state-of-the-art in NeRF-based multi-view 3D detection, verified with extensive experiments on ScanNet and ARKITScenes. Our code and models will be made public after acceptance.
Poster
Shin-Fang Chng · Hemanth Saratchandran · Simon Lucey

[ ExHall D ]

Abstract
Neural fields encode continuous multidimensional signals as neural networks, enabling diverse applications in computer vision, robotics, and geometry. While Adam is effective for stochastic optimization, it often requires long training times. To address this, we explore alternative optimization techniques to accelerate training without sacrificing accuracy. Traditional second-order methods like L-BFGS are unsuitable for stochastic settings. We propose a theoretical framework for training neural fields with curvature-aware diagonal preconditioners, demonstrating their effectiveness across tasks such as image reconstruction, shape modeling, and Neural Radiance Fields (NeRF).
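As a toy picture of curvature-aware diagonal preconditioning, the sketch below fits a two-parameter sinusoid by gradient descent while rescaling each gradient coordinate by the diagonal of the Gauss-Newton matrix (a Jacobi preconditioner). This is a generic illustration of the idea on a least-squares problem, not the preconditioners derived in the paper, and the model, step size, and data are made up.

```python
import numpy as np

# Fit y = a * sin(b * x) with gradient steps rescaled by diag(J^T J), the
# Jacobi approximation of the Gauss-Newton curvature. Illustrative only.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0 * np.pi, 200)
y = 1.7 * np.sin(2.3 * x) + 0.01 * rng.normal(size=x.size)

theta = np.array([1.0, 2.2])                       # initial guess for (a, b)
for _ in range(500):
    a, b = theta
    r = a * np.sin(b * x) - y                      # residuals
    J = np.stack([np.sin(b * x),                   # d r / d a
                  a * x * np.cos(b * x)], axis=1)  # d r / d b
    g = J.T @ r                                    # gradient of 0.5 * ||r||^2
    curv = (J ** 2).sum(axis=0)                    # diag(J^T J): per-parameter curvature
    theta = theta - 0.5 * g / (curv + 1e-8)        # curvature-scaled (preconditioned) step
print(theta.round(2))                              # close to (1.7, 2.3)
```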
Poster
Chenhui Shi · Fulin Tang · Ning An · Yihong Wu

[ ExHall D ]

Abstract
We propose 3D-SLNR, a new and ultra-lightweight neural representation with outstanding performance for large-scale 3D mapping. The representation defines a global signed distance function (SDF) in near-surface space based on a set of band-limited local SDFs anchored at support points sampled from point clouds. These SDFs are parameterized only by a tiny multi-layer perceptron (MLP) with no latent features, and the state of each SDF is modulated by three learnable geometric properties: position, rotation, and scaling, which make the representation adapt to complex geometries. Then, we develop a novel parallel algorithm tailored for this unordered representation to efficiently detect local SDFs where each sampled point is located, allowing for real-time updates of local SDF states during training. Additionally, a prune-and-expand strategy is introduced to enhance adaptability further. The synergy of our low-parameter model and its adaptive capabilities results in an extremely compact representation with excellent expressiveness. Extensive experiments demonstrate that our method achieves state-of-the-art reconstruction performance with less than 1/5 of the memory footprint compared with previous advanced methods.
Poster
Guangshun Wei · Yuan Feng · Long Ma · Chen Wang · Yuanfeng Zhou · Changjian Li

[ ExHall D ]

Abstract
This paper presents PCDreamer, a novel method for point cloud completion. Traditional methods typically extract features from partial point clouds to predict missing regions, but the large solution space often leads to unsatisfactory results. More recent approaches have started to use images as extra guidance, effectively improving performance, but obtaining paired data of images and partial point clouds is challenging in practice. To overcome these limitations, we harness the relatively view-consistent multi-view diffusion priors within large models, to generate novel views of the desired shape. The resulting image set encodes both global and local shape cues, which is especially beneficial for shape completion. To fully exploit the priors, we have designed a shape fusion module for producing an initial complete shape from multi-modality input (i.e., images and point clouds), and a follow-up shape consolidation module to obtain the final complete shape by discarding unreliable points introduced by the inconsistency from diffusion priors. Extensive experimental results demonstrate our superior performance, especially in recovering fine details. The code, model, and datasets will be made publicly available upon publication.
Poster
Zikuan Li · Honghua Chen · Yuecheng Wang · Sibo Wu · Mingqiang Wei · Jun Wang

[ ExHall D ]

Abstract
Extracting geometric edges from unstructured point clouds remains a significant challenge, particularly in thin-walled structures that are commonly found in everyday objects. Traditional geometric methods and recent learning-based approaches frequently struggle with these structures, as both rely heavily on sufficient contextual information from local point neighborhoods. However, 3D measurement data of thin-walled structures often lack the accurate, dense, and regular neighborhood sampling required for reliable edge extraction, resulting in degraded performance. In this work, we introduce STAR-Edge, a novel approach designed for detecting and refining edge points in thin-walled structures. Our method leverages a unique representation—the local spherical curve—to create structure-aware neighborhoods that emphasize co-planar points while reducing interference from close-by, non-co-planar surfaces. This representation is transformed into a rotation-invariant descriptor, which, combined with a lightweight multi-layer perceptron, enables robust edge point classification even in the presence of noise and sparse or irregular sampling. In addition, we use the local spherical curve representation to estimate more precise normals and introduce an optimization function that projects initially identified edge points exactly onto the true edges. Experiments conducted on the ABC dataset and thin-walled structure-specific datasets demonstrate that STAR-Edge outperforms existing edge detection methods, showcasing better robustness under various challenging conditions. The source code …
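As a rough intuition for the local spherical curve neighborhood described above, the toy snippet below (my illustration, not the paper's code; the function name and parameters are assumptions) projects a point's k nearest neighbors onto the unit sphere centered at that point, which is the kind of direction-only, structure-aware neighborhood such a descriptor could be built from.

```python
# Toy sketch: project the k nearest neighbours of a query point onto the
# unit sphere around it (assumed interpretation of the abstract).
import numpy as np

def local_spherical_neighborhood(points, idx, k=16):
    center = points[idx]
    d = np.linalg.norm(points - center, axis=1)
    nn = np.argsort(d)[1:k + 1]                 # skip the point itself
    offsets = points[nn] - center
    directions = offsets / np.linalg.norm(offsets, axis=1, keepdims=True)
    return directions                           # unit vectors on the sphere

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = rng.normal(size=(200, 3))
    sphere_pts = local_spherical_neighborhood(pts, idx=0, k=8)
    print(sphere_pts.shape, np.allclose(np.linalg.norm(sphere_pts, axis=1), 1.0))
```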
Poster
Zhangquan Chen · Puhua Jiang · Ruqi Huang

[ ExHall D ]

Abstract
In this paper, we present DV-Matcher, a novel learning-based framework for estimating dense correspondences between non-rigidly deformable point clouds. Learning directly from unstructured point clouds without meshing or manual labelling, our framework delivers high-quality dense correspondences, which is of significant practical utility in point cloud processing. Our key contributions are two-fold: First, we propose a scheme to inject prior knowledge from pre-trained vision models into geometric feature learning, which effectively complements the local nature of geometric features with global and semantic information; Second, we propose a novel deformation-based module to promote the extrinsic alignment induced by the learned correspondences, which effectively enhances the feature learning. Experimental results show that our method achieves state-of-the-art results in matching non-rigid point clouds in both near-isometric and heterogeneous shape collections, as well as on more realistic partial and noisy data.
Poster
Ruiqi Zhang · Hao Zhu · Jingyi Zhao · Qi Zhang · Xun Cao · Zhan Ma

[ ExHall D ]

Abstract
3D classification with point cloud input is a fundamental problem in 3D vision. However, due to the discrete nature and the insufficient material description of point cloud representations, there are ambiguities in distinguishing wire-like and flat surfaces, as well as transparent or reflective objects. To address these issues, we propose Gaussian Splatting (GS) point cloud-based 3D classification. We find that the scale and rotation coefficients in the GS point cloud help characterize surface types. Specifically, wire-like surfaces consist of multiple slender Gaussian ellipsoids, while flat surfaces are composed of a few flat Gaussian ellipsoids. Additionally, the opacity in the GS point cloud represents the transparency characteristics of objects. As a result, ambiguities in point cloud-based 3D classification can be mitigated by utilizing the GS point cloud as input. To verify the effectiveness of GS point cloud input, we construct the first real-world GS point cloud dataset in the community, which includes 20 categories with 200 objects in each category. Experiments not only validate the superiority of GS point cloud input, especially in distinguishing ambiguous objects, but also demonstrate the generalization ability across different classification methods.
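A minimal sketch of the input layout this kind of classification implies (an assumption on my part, not the dataset's actual format): per-point features that concatenate position with the Gaussian scale, rotation, and opacity so a standard point-based classifier can exploit them.

```python
# Assumed feature layout for a GS point cloud classifier input (illustrative only).
import numpy as np

def gs_point_features(xyz, scales, quats, opacity):
    """xyz: (N,3), scales: (N,3), quats: (N,4), opacity: (N,1) -> (N,11)."""
    quats = quats / np.linalg.norm(quats, axis=1, keepdims=True)
    # log-scales make slender vs. flat ellipsoids easier to separate
    return np.concatenate([xyz, np.log(scales + 1e-8), quats, opacity], axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 1024
    feats = gs_point_features(
        rng.normal(size=(n, 3)),
        rng.uniform(0.01, 0.5, size=(n, 3)),
        rng.normal(size=(n, 4)),
        rng.uniform(size=(n, 1)),
    )
    print(feats.shape)  # (1024, 11)
```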
Poster
Changfeng Ma · Ran Bi · Jie Guo · Chongjun Wang · Yanwen Guo

[ ExHall D ]

Abstract
Current learning-based methods predict NeRF or 3D Gaussians from point clouds to achieve photo-realistic rendering but still depend on categorical priors, dense point clouds, or additional refinements. Hence, we introduce a novel point cloud rendering method by predicting 2D Gaussians from point clouds. Our method incorporates two identical modules with an entire-patch architecture, enabling the network to generalize to multiple datasets. The module normalizes and initializes the Gaussians utilizing the point cloud information, including normals, colors and distances. Then, splitting decoders are employed to refine the initial Gaussians by duplicating them and predicting more accurate results, making our methodology effectively accommodate sparse point clouds as well. Once trained, our approach exhibits direct generalization to point clouds across different categories. The predicted Gaussians are employed directly for rendering without additional refinement on the rendered images, retaining the benefits of 2D Gaussians. We conduct extensive experiments on various datasets, and the results demonstrate the superiority and generalization of our method, which achieves SOTA performance.
Poster
Jinfeng Xu · Xianzhi Li · Yuan Tang · Xu Han · Qiao Yu · yixue Hao · Long Hu · Min Chen

[ ExHall D ]

Abstract
Recent advancements in deep learning have greatly enhanced 3D object recognition, but most models are limited to closed-set scenarios, unable to handle unknown samples in real-world applications. Open-set recognition (OSR) addresses this limitation by enabling models to both classify known classes and identify novel classes. However, current OSR methods rely on global features to differentiate known and unknown classes, treating the entire object uniformly and overlooking the varying semantic importance of its different parts. To address this gap, we propose Salience-Aware Structured Separation (SASep), which includes (i) a tunable semantic decomposition (TSD) module to semantically decompose objects into important and unimportant parts, (ii) a geometric synthesis strategy (GSS) to generate pseudo-unknown objects by combining these unimportant parts, and (iii) a synth-aided margin separation (SMS) module to enhance feature-level separation by expanding the feature distributions between classes. Together, these components improve both geometric and feature representations, enhancing the model’s ability to effectively distinguish known and unknown classes. Experimental results show that SASep achieves superior performance in 3D OSR, outperforming existing state-of-the-art methods. We shall release our code and models upon publication of this work.
Poster
Xinjie Wang · Yifan Zhang · Ting Liu · Xinpu Liu · Ke Xu · Jianwei Wan · Yulan Guo · Hanyun Wang

[ ExHall D ]

Abstract
Efficient Point Cloud Geometry Compression (PCGC) with lower bits per point (BPP) and higher peak signal-to-noise ratio (PSNR) is essential for the transmission of large-scale 3D data. Although octree-based entropy models can reduce BPP without introducing geometry distortion, existing CNN-based models struggle with limited receptive fields to capture long-range dependencies, while Transformer-based architectures often neglect fine-grained details due to their reliance on global self-attention. In this paper, we propose a Transformer-efficient occupancy prediction network, termed TopNet, to overcome these challenges by developing several novel components: Locally-enhanced Context Encoding (LeCE) for enhancing the translation-invariance of the octree nodes, Adaptive-Length Sliding Window Attention (AL-SWA) for capturing both global and local dependencies while adaptively adjusting attention weights based on the input window length, Spatial-Gated-enhanced Channel Mixer (SG-CM) for efficient feature aggregation from ancestors and siblings, and Latent-guided Node Occupancy Predictor (LNOP) for improving prediction accuracy of spatially adjacent octree nodes. Comprehensive experiments across both indoor and outdoor point cloud datasets demonstrate that our TopNet achieves state-of-the-art performance with fewer parameters, further advancing the reduction-efficiency boundaries of PCGC.
Poster
Qiang Li · Jian Ruan · Fanghao Wu · Yuchi Chen · Zhihua Wei · Wen Shen

[ ExHall D ]

Abstract
Recently, many self-supervised pre-training methods have been proposed to improve the performance of deep neural networks (DNNs) for 3D point clouds processing. However, the common mechanism underlying the effectiveness of different pre-training methods remains unclear. In this paper, we use game-theoretic interactions as a unified approach to explore the common mechanism of pre-training methods. Specifically, we decompose the output score of a DNN into the sum of numerous effects of interactions, with each interaction representing a distinct 3D substructure of the input point cloud. Based on the decomposed interactions, we draw the following conclusions. (1) The common mechanism across different pre-training methods is that they enhance the strength of high-order interactions encoded by DNNs, which represent complex and global 3D structures, while reducing the strength of low-order interactions, which represent simple and local 3D structures. (2) Sufficient pre-training and adequate fine-tuning data for downstream tasks further reinforce the mechanism described above. (3) Pre-training methods carry a potential risk of reducing the transferability of features encoded by DNNs. Inspired by the observed common mechanism, we propose a new method to directly enhance the strength of high-order interactions and reduce the strength of low-order interactions encoded by DNNs, improving performance without the …
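For readers unfamiliar with this decomposition, the snippet below computes Harsanyi-style interaction effects exactly for a toy set function with three input parts, following the standard formula I(S) = sum over T ⊆ S of (-1)^{|S|-|T|} v(T). This is only an illustrative sketch of the general idea, not the paper's estimation procedure, which must approximate such sums for real point clouds.

```python
# Exact interaction effects for a tiny toy model (illustration only).
from itertools import chain, combinations

def subsets(s):
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

def interaction_effects(v, variables):
    """v(S) -> model output with only subset S kept; returns I(S) for all S."""
    effects = {}
    for S in subsets(variables):
        I = 0.0
        for T in subsets(S):
            sign = (-1) ** (len(S) - len(T))
            I += sign * v(frozenset(T))
        effects[frozenset(S)] = I
    return effects

if __name__ == "__main__":
    # Toy set function: an AND-like effect between parts 0 and 1 plus a solo term.
    def v(S):
        return 2.0 * (0 in S and 1 in S) + 0.5 * (2 in S)

    for S, I in interaction_effects(v, (0, 1, 2)).items():
        if abs(I) > 1e-9:
            print(sorted(S), I)
```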
Poster
Wentao Qu · Jing Wang · Yongshun Gong · Xiaoshui Huang · Liang Xiao

[ ExHall D ]

Abstract
Existing conditional Denoising Diffusion Probabilistic Models (DDPMs) with a Noise-Conditional Framework (NCF) remain challenging for 3D scene understanding tasks, as the complex geometric details in scenes increase the difficulty of fitting the gradients of the data distribution (the scores) from semantic labels. This also results in longer training and inference time for DDPMs compared to non-DDPMs. From a different perspective, we delve deeply into the model paradigm dominated by the Conditional Network. In this paper, we propose an end-to-end robust semantic Segmentation Network based on a Conditional-Noise Framework (CNF) of DDPMs, named CDSegNet. Specifically, CDSegNet models the Noise Network (NN) as a learnable noise-feature generator. This enables the Conditional Network (CN) to understand 3D scene semantics under multi-level feature perturbations, enhancing the generalization in unseen scenes. Meanwhile, benefiting from the noise system of DDPMs, CDSegNet exhibits strong noise and sparsity robustness in experiments. Moreover, thanks to CNF, CDSegNet can generate the semantic labels in a single-step inference like non-DDPMs, due to avoiding directly fitting the scores from semantic labels in the dominant network of CDSegNet. On public indoor and outdoor benchmarks, CDSegNet significantly outperforms existing methods, achieving state-of-the-art performance.
Poster
Sifan Zhou · Zhihang Yuan · Dawei Yang · Ziyu Zhao · Jian Qian · Xing Hu

[ ExHall D ]

Abstract
Real-time and high-performance 3D object detection plays a critical role in autonomous driving and robotics. Recent pillar-based 3D object detectors have gained significant attention due to their compact representation and low computational overhead, making them suitable for onboard deployment and quantization. However, existing pillar-based detectors still suffer from information loss along the height dimension and large numerical distribution differences during pillar feature encoding (PFE), which severely limits their performance and quantization potential. To address the above issue, we first unveil the importance of different input information during PFE and identify the height dimension as a key factor in enhancing 3D detection performance. Motivated by this observation, we propose a height-aware pillar feature encoder, called PillarHist. Specifically, PillarHist computes a histogram of the discrete distribution of points at different heights within one pillar. This simple yet effective design greatly preserves the information along the height dimension while significantly reducing the computation overhead of the PFE. Meanwhile, PillarHist also constrains the arithmetic distribution of the PFE input to a stable range, making it quantization-friendly. Notably, PillarHist operates exclusively within the PFE stage to enhance performance, enabling seamless integration into existing pillar-based methods without introducing complex operations. Extensive experiments show the effectiveness of PillarHist in terms of both efficiency …
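A rough sketch of the histogram idea (my interpretation of the abstract, not the released code; pillar size, height range, and bin count are assumed values): each pillar is summarized by a normalized histogram of its point heights, which keeps height information in a fixed-size, bounded form.

```python
# Assumed illustration of per-pillar height histograms (not the PillarHist code).
import numpy as np

def pillar_height_histograms(points, pillar_size=0.32, z_range=(-3.0, 1.0), bins=16):
    """points: (N, 3) xyz -> dict {(ix, iy): (bins,) normalised histogram}."""
    ix = np.floor(points[:, 0] / pillar_size).astype(int)
    iy = np.floor(points[:, 1] / pillar_size).astype(int)
    hists = {}
    for key in set(zip(ix.tolist(), iy.tolist())):
        mask = (ix == key[0]) & (iy == key[1])
        h, _ = np.histogram(points[mask, 2], bins=bins, range=z_range)
        hists[key] = h / max(h.sum(), 1)   # bounded in [0, 1]: quantization-friendly
    return hists

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = rng.uniform([-5, -5, -3], [5, 5, 1], size=(2000, 3))
    hists = pillar_height_histograms(pts)
    print(len(hists), next(iter(hists.values())).shape)
```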
Poster
Yante Li · Hanwen Qi · Haoyu Chen · Liang Xinlian · Guoying Zhao

[ ExHall D ]

Abstract
In environmental protection, tree monitoring plays an essential role in maintaining and improving ecosystem health. However, precise monitoring is challenging because existing datasets fail to capture continuous fine-grained changes in trees due to low-resolution images and high acquisition costs. In this paper, we introduce UAVTC, a large-scale, long-term, high-resolution dataset collected using UAVs equipped with cameras, specifically designed to detect individual Tree Changes (TCs). UAVTC includes rich annotations and statistics based on biological knowledge, offering a fine-grained view for tree monitoring. To address environmental influences and effectively model the hierarchical diversity of physiological TCs, we propose a novel Hyperbolic Siamese Network (HSN) for TC detection, enabling compact and hierarchical representations of dynamic tree changes. Extensive experiments show that HSN can effectively capture complex hierarchical changes and provide a robust solution for fine-grained TC detection. In addition, HSN generalizes well to the cross-domain face anti-spoofing task, highlighting its broader significance in AI. We believe our work, combining ecological insights and interdisciplinary expertise, will benefit the community by offering a new benchmark and innovative AI technologies. Source code and dataset will be made available.
Poster
Dušan Malić · Christian Fruhwirth-Reisinger · Samuel Schulter · Horst Possegger

[ ExHall D ]

Abstract
LiDAR-based 3D detectors need large datasets for training, yet they struggle to generalize to novel domains. Domain Generalization (DG) aims to mitigate this by training detectors that are invariant to such domain shifts. Current DG approaches exclusively rely on global geometric features (point cloud Cartesian coordinates) as input features. Over-reliance on these global geometric features can, however, cause 3D detectors to prioritize object location and absolute position, resulting in poor cross-domain performance. To mitigate this, we propose to exploit explicit local point cloud structure for DG, in particular by encoding point cloud neighborhoods with Gaussian blobs, GBlobs. Our proposed formulation is highly efficient and requires no additional parameters. Without any bells and whistles, simply by integrating GBlobs in existing detectors, we beat the current state-of-the-art in challenging single-source DG benchmarks by over 21 mAP (Waymo->KITTI), 13 mAP (KITTI->Waymo), and 12 mAP (nuScenes->KITTI), without sacrificing in-domain performance. Additionally, GBlobs demonstrate exceptional performance in multi-source DG, surpassing the current state-of-the-art by 17, 12, and 5 mAP on Waymo, KITTI, and ONCE, respectively.
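To illustrate the kind of local encoding described above, the following toy function (not the official GBlobs implementation; k and the feature layout are assumptions) summarizes each point's k-nearest-neighbor neighborhood by its mean offset and covariance, i.e., a Gaussian blob expressed in local rather than absolute coordinates.

```python
# Toy Gaussian-blob neighbourhood features (assumed illustration of the idea).
import numpy as np

def gaussian_blob_features(points, k=16):
    """points: (N, 3) -> (N, 12) features: mean offset (3) + covariance (9)."""
    feats = []
    for p in points:
        d = np.linalg.norm(points - p, axis=1)
        nn = np.argsort(d)[1:k + 1]
        offsets = points[nn] - p                 # local, not absolute, coordinates
        mu = offsets.mean(axis=0)
        cov = np.cov(offsets, rowvar=False)
        feats.append(np.concatenate([mu, cov.ravel()]))
    return np.asarray(feats)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = rng.normal(size=(256, 3))
    print(gaussian_blob_features(pts, k=8).shape)  # (256, 12)
```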
Poster
Xiang Xu · Lingdong Kong · hui shuai · Liang Pan · Ziwei Liu · Qingshan Liu

[ ExHall D ]

Abstract
LiDAR data pretraining offers a promising approach to leveraging large-scale, readily available datasets for enhanced data utilization. However, existing methods predominantly focus on sparse voxel representation, overlooking the complementary attributes provided by other LiDAR representations. In this work, we propose LiMoE, a framework that integrates the Mixture of Experts (MoE) paradigm into LiDAR data representation learning to synergistically combine multiple representations, such as range images, sparse voxels, and raw points. Our approach consists of three stages: i) Image-to-LiDAR Pretraining, which transfers prior knowledge from images to point clouds across different representations; ii) Contrastive Mixture Learning (CML), which uses MoE to adaptively activate relevant attributes from each representation and distills these mixed features into a unified 3D network; iii) Semantic Mixture Supervision (SMS), which combines semantic logits from multiple representations to boost downstream segmentation performance. Extensive experiments across 11 large-scale LiDAR datasets demonstrate our effectiveness and superiority. The code will be made publicly accessible.
Poster
Chuandong Liu · Xingxing Weng · Shuguo Jiang · Pengcheng Li · Lei Yu · Gui-Song Xia

[ ExHall D ]

Abstract
This paper explores scene affinity (AIScene), namely intra-scene consistency and inter-scene correlation, for semi-supervised LiDAR semantic segmentation in driving scenes. Adopting teacher-student training, AIScene employs a teacher network to generate pseudo-labeled scenes from unlabeled data, which then supervise the student network's learning. Unlike most methods that include all points in pseudo-labeled scenes for forward propagation but only pseudo-labeled points for backpropagation, AIScene removes points without pseudo-labels, ensuring consistency in both forward and backward propagation within the scene. This simple point erasure strategy effectively prevents unsupervised, semantically ambiguous points (excluded in backpropagation) from affecting the learning of pseudo-labeled points. Moreover, AIScene incorporates patch-based data augmentation, mixing multiple scenes at both scene and instance levels. Compared to existing augmentation techniques that typically perform scene-level mixing between two scenes, our method enhances the semantic diversity of labeled (or pseudo-labeled) scenes, thereby improving the semi-supervised performance of segmentation models. Experiments show that AIScene outperforms previous methods on two popular benchmarks across four settings, achieving notable improvements of 1.9% and 5.3% in the most challenging 1% labeled-data setting.
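The point erasure step lends itself to a very small sketch (assumed tensor shapes and ignore index, not the authors' code): points whose teacher pseudo-label is the ignore value are dropped before the student's forward pass, so forward and backward propagation see the same point set.

```python
# Minimal sketch of point erasure before the student forward pass (assumptions noted).
import numpy as np

IGNORE_LABEL = -1  # assumed ignore index

def erase_unlabeled(points, pseudo_labels):
    """points: (N, C), pseudo_labels: (N,) -> filtered copies of both."""
    keep = pseudo_labels != IGNORE_LABEL
    return points[keep], pseudo_labels[keep]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = rng.normal(size=(10, 4))                       # xyz + intensity
    labels = rng.choice([IGNORE_LABEL, 0, 1, 2], size=10)
    kept_pts, kept_labels = erase_unlabeled(pts, labels)
    print(kept_pts.shape, kept_labels)
```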
Poster
Xun Huang · Jinlong Wang · Qiming Xia · Siheng Chen · Bisheng Yang · Xin Li · Cheng Wang · Chenglu Wen

[ ExHall D ]

Abstract
Current Vehicle-to-Everything (V2X) systems have significantly enhanced 3D object detection using LiDAR and camera data. However, they face performance degradation in adverse weather. Weather-robust 4D radar, with Doppler velocity and additional geometric information, offers a promising solution to this challenge. To this end, we present V2X-R, the first simulated V2X dataset incorporating LiDAR, camera, and 4D radar modalities. V2X-R contains 12,079 scenarios with 37,727 frames of LiDAR and 4D radar point clouds, 150,908 images, and 170,859 annotated 3D vehicle bounding boxes. Subsequently, we propose a novel cooperative LiDAR-4D radar fusion pipeline for 3D object detection and implement it with multiple fusion strategies. To achieve weather-robust detection, we additionally propose a Multi-modal Denoising Diffusion (MDD) module in our fusion pipeline. MDD utilizes the weather-robust 4D radar feature as a condition to guide the diffusion model in denoising noisy LiDAR features. Experiments show that our LiDAR-4D radar fusion pipeline demonstrates superior performance on the V2X-R dataset. Moreover, our MDD module further improves the foggy/snowy performance of the basic fusion model by up to 5.73%/6.70% while barely disrupting normal performance. The dataset and code will be publicly available.
Poster
Jinhyung Park · Navyata Sanghvi · Hiroki Adachi · Yoshihisa Shibata · Shawn Hunt · Shinya Tanaka · Hironobu Fujiyioshi · Kris Kitani

[ ExHall D ]

Abstract
While recent advancements in camera-based 3D object detection demonstrate remarkable performance, they require thousands or even millions of human-annotated frames. This requirement significantly inhibits their deployment in various locations and sensor configurations. To address this gap, we propose a performant semi-supervised framework that leverages unlabeled RGB-only driving sequences - data easily collected with cost-effective RGB cameras - to significantly improve temporal, camera-only 3D detectors. We observe that the standard semi-supervised pseudo-labeling paradigm under-performs in this temporal, camera-only setting due to poor 3D localization of pseudo-labels. To address this, we train a single 3D detector to handle RGB sequences both forwards and backwards in time, then ensemble both its forwards and backwards pseudo-labels for semi-supervised learning. We further improve the pseudo-label quality by leveraging 3D object tracking to in-fill missing detections and by eschewing simple confidence thresholding in favor of using the auxiliary 2D detection head to filter 3D predictions. Finally, to enable the backbone to learn directly from the unlabeled data itself, we introduce an object-query conditioned masked reconstruction objective. Our framework demonstrates remarkable performance improvement on large-scale autonomous driving datasets nuScenes and nuPlan.
Poster
ziteng xue · Mingzhe Guo · Heng Fan · Shihui Zhang · Zhipeng Zhang

[ ExHall D ]

Abstract
Camera-only multi-view 3D object detection in autonomous driving has witnessed encouraging developments in recent years, largely attributed to the revolution of fundamental architectures in modeling bird's eye view (BEV). Despite the growing overall average performance, we contend that the exploration of more specific and challenging corner cases hasn't received adequate attention. In this work, we delve into a specific yet critical issue for safe autonomous driving: occlusion. To alleviate this challenge, we draw inspiration from the human amodal perception system, which is proven to have the capacity for mentally reconstructing the complete semantic concept of occluded objects with prior knowledge. More specifically, we introduce auxiliary visual and language prototypes, akin to human prior knowledge, to enhance the diminished object features caused by occlusion. Inspired by Siamese object tracking, we fuse the information from these prototypes with the baseline model through an efficient depth-wise correlation, thereby enhancing the quality of object-related features and guiding the learning of 3D object queries, especially for partially occluded ones. Furthermore, we propose the random pixel drop to mimic occlusion and the multi-modal contrastive loss to align visual features of different occlusion levels to a unified space during training. Our inspiration originates from addressing occlusion, however, …
Poster
Hermann Blum · Alessandro Mercurio · Joshua O'Reilly · Tim Engelbracht · Mihai Dusmanu · Marc Pollefeys · Zuria Bauer

[ ExHall D ]

Abstract
Accurate localization plays a pivotal role in the autonomy of systems operating in unfamiliar environments, particularly when interaction with humans is expected. High-accuracy visual localization systems encompass various components, such as feature extractors, matchers, and pose estimation methods. This complexity translates to the necessity of robust evaluation settings and pipelines. However, existing datasets and benchmarks primarily focus on single-agent scenarios, overlooking the critical issue of cross-device localization. Different agents with different sensors will show their own specific strengths and weaknesses, and the data they have available varies substantially. This work addresses this gap by enhancing an existing augmented reality visual localization benchmark with data from legged robots, and evaluating human-robot, cross-device mapping and localization. Our contributions extend beyond device diversity and include high environment variability, spanning ten distinct locations ranging from disaster sites to art exhibitions. Each scene in our dataset features recordings from robot agents, hand-held and head-mounted devices, and high-accuracy ground truth LiDAR scanners, resulting in a comprehensive multi-agent dataset and benchmark. This work represents a significant advancement in the field of visual localization benchmarking, with key insights into the performance of cross-device localization methods across diverse settings.
Poster
Tomas Soucek · Prajwal Gatti · Michael Wray · Ivan Laptev · Dima Damen · Josef Sivic

[ ExHall D ]

Abstract
The goal of this work is to generate step-by-step visual instructions in the form of a sequence of images, given an input image that provides the scene context and the sequence of textual instructions. This is a challenging problem as it requires generating multi-step image sequences to achieve a complex goal while being grounded in a specific environment. Part of the challenge stems from the lack of large-scale training data for this problem. The contribution of this work is thus three-fold. First, we introduce an automatic approach for collecting large step-by-step visual instruction training data from instructional videos. We apply this approach to one million videos and create a large-scale, high-quality dataset of 0.6M sequences of image-text pairs. Second, we develop and train ShowHowTo, a video diffusion model capable of generating step-by-step visual instructions consistent with the provided input image. Third, we evaluate the generated image sequences across three dimensions of accuracy (step, scene, and task) and show our model achieves state-of-the-art results on all of them. Our code, dataset, and trained models will be publicly available.
Poster
Haisheng Su · Feixiang Song · CONG MA · Wei Wu · Junchi Yan

[ ExHall D ]

Abstract
Reliable embodied perception from an egocentric perspective is challenging yet essential for the autonomous navigation of intelligent mobile agents. With the growing demand for social robotics, near-field scene understanding has become an important research topic in egocentric perceptual tasks related to navigation in both crowded and unstructured environments. Due to the complexity of environmental conditions and the difficulty of perceiving surrounding obstacles owing to truncation and occlusion, perception capability in these circumstances remains limited. To further enhance the intelligence of mobile robots, in this paper, we set up an egocentric multi-sensor data collection platform based on 3 main types of sensors (Camera, LiDAR and Fisheye), which supports flexible sensor configurations to enable a dynamic field of view from the ego perspective, capturing either near or farther areas. Meanwhile, a large-scale multimodal dataset is constructed, named RoboSense, to facilitate egocentric robot perception. Specifically, RoboSense contains more than 133K synchronized data with 1.4M 3D bounding boxes and IDs annotated in the full 360° view, forming 216K trajectories across 7.6K temporal sequences. It has 270× and 18× as many annotations of surrounding obstacles within near ranges as previous datasets collected for autonomous driving scenarios such as KITTI and nuScenes. Moreover, we define a novel …
Poster
Christopher Diehl · Quinlan Sykora · Ben Agro · Thomas Gilles · Sergio Casas · Raquel Urtasun

[ ExHall D ]

Abstract
We present DIO, a flexible world model that can estimate the scene occupancy-flow from a sparse set of LiDAR observations, and decompose it into individual instances. DIO can not only complete instance shapes at the present time, but also forecast their occupancy-flow evolution over a future horizon. Thanks to its flexible prompt representation, DIO can take instance prompts from off-the-shelf models like 3D detectors, achieving state-of-the-art performance in the task of 4D semantic occupancy completion and forecasting on the Argoverse 2 dataset. Moreover, our world model can easily and effectively be transferred to downstream tasks like LiDAR point cloud forecasting, ranking first compared to all baselines in the Argoverse 4D occupancy forecasting challenge.
Poster
Jonas Kälble · Sascha Wirges · Maxim Tatarchenko · Eddy Ilg

[ ExHall D ]

Abstract
We present EvOcc, a novel evidential semantic occupancy mapping framework. It consists of two parts: (1) an evidential approach for calculating the ground-truth 3D semantic occupancy maps from noisy LiDAR measurements, and (2) a method for training image-based occupancy estimation models through a new loss formulation. In contrast to state-of-the-art semantic occupancy maps, our approach explicitly models the uncertainty introduced by unobserved spaces or contradicting measurements and we show that using it results in significantly stronger models. Evaluated as ray-based mIoU, our evidential semantic occupancy mapping approach improves over the baselines by at least 15.8 for the ground truth and 5.5 for the trained model. Overall, we make a significant contribution towards more detailed and uncertainty-aware 3D environment understanding and safe operation in autonomous driving.
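As a toy picture of an evidential occupancy update (a deliberate simplification of my own, not the paper's formulation), each voxel can accumulate hit evidence from ray endpoints and miss evidence from pass-throughs, with unobserved voxels retaining high uncertainty rather than a hard free/occupied label.

```python
# Simplified evidential voxel update (illustration only; not the EvOcc method).
import numpy as np

def evidential_voxel(hits, misses, prior_weight=2.0):
    """Return (belief_occupied, belief_free, uncertainty) per voxel."""
    total = hits + misses + prior_weight
    return hits / total, misses / total, prior_weight / total

if __name__ == "__main__":
    hits = np.array([10, 0, 0])     # occupied, free, never observed
    misses = np.array([1, 20, 0])
    b_occ, b_free, u = evidential_voxel(hits, misses)
    print(np.round(b_occ, 2), np.round(b_free, 2), np.round(u, 2))
```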
Poster
Yuanhui Huang · Amonnut Thammatadatrakoon · Wenzhao Zheng · Yunpeng Zhang · Dalong Du · Jiwen Lu

[ ExHall D ]

Abstract
3D semantic occupancy prediction has garnered attention as an important task for the robustness of vision-centric autonomous driving, which predicts fine-grained geometry and semantics of the surrounding scene. Most existing methods leverage dense grid-based scene representations, overlooking the spatial sparsity of driving scenes, which leads to computational redundancy. Although 3D semantic Gaussians serve as an object-centric sparse alternative, most of the Gaussians still describe the empty region with low efficiency. To address this, we propose a probabilistic Gaussian superposition model which interprets each Gaussian as a probability distribution of its neighborhood being occupied and conforms to probabilistic multiplication to derive the overall geometry. Furthermore, we adopt the exact Gaussian mixture model for semantics calculation to avoid unnecessary overlapping of Gaussians. To effectively initialize Gaussians in non-empty regions, we design a distribution-based initialization module which learns the pixel-aligned occupancy distribution instead of the depth of surfaces. We conduct extensive experiments on the nuScenes and KITTI-360 datasets and our GaussianFormer-2 achieves state-of-the-art performance with high efficiency.
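The probabilistic multiplication mentioned above can be sketched in a few lines (my own toy version with assumed Gaussian parameters, not the GaussianFormer-2 code): each Gaussian contributes a probability that a query point is occupied, and the overall occupancy is one minus the product of the complements.

```python
# Toy probabilistic superposition of Gaussian occupancy probabilities.
import numpy as np

def gaussian_occupancy_prob(x, mean, cov, alpha):
    diff = x - mean
    m = diff @ np.linalg.inv(cov) @ diff       # squared Mahalanobis distance
    return alpha * np.exp(-0.5 * m)

def superposed_occupancy(x, means, covs, alphas):
    p_empty = 1.0
    for mean, cov, alpha in zip(means, covs, alphas):
        p_empty *= 1.0 - gaussian_occupancy_prob(x, mean, cov, alpha)
    return 1.0 - p_empty

if __name__ == "__main__":
    means = [np.zeros(3), np.array([0.5, 0.0, 0.0])]
    covs = [np.eye(3) * 0.1, np.eye(3) * 0.2]
    alphas = [0.9, 0.7]
    print(superposed_occupancy(np.array([0.2, 0.0, 0.0]), means, covs, alphas))
```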
Poster
Su Sun · Cheng Zhao · Zhuoyang Sun · Yingjie Chen · Mei Chen

[ ExHall D ]

Abstract
Most existing Dynamic Gaussian Splatting methods for complex dynamic urban scenarios rely on accurate object-level supervision from expensive manual labeling, limiting their scalability in real-world applications. In this paper, we introduce SplatFlow, a Self-Supervised Dynamic Gaussian Splatting within Neural Motion Flow Fields (NMFF) to learn 4D space-time representations without requiring tracked 3D bounding boxes, enabling accurate dynamic scene reconstruction and novel view RGB/depth/flow synthesis. SplatFlow designs a unified framework to seamlessly integrate time-dependent 4D Gaussian representation within NMFF, where NMFF is a set of implicit functions to model temporal motions of both LiDAR points and Gaussians as continuous motion flow fields. Leveraging NMFF, SplatFlow effectively decomposes static background and dynamic objects, representing them with 3D and 4D Gaussian primitives, respectively. NMFF also models the status correspondences of each 4D Gaussian across time, which aggregates temporal features to enhance cross-view consistency of dynamic components. SplatFlow further improves dynamic scene identification by distilling features from 2D foundational models into 4D space-time representation. Comprehensive evaluations conducted on the Waymo Open Dataset and KITTI Dataset validate SplatFlow's state-of-the-art (SOTA) performance for both image reconstruction and novel view synthesis in dynamic urban scenarios. The code and model will be released upon the paper's acceptance.
Poster
Hongbin Lin · Zilu Guo · Yifan Zhang · Shuaicheng Niu · Yafeng Li · Ruimao Zhang · Shuguang Cui · Zhen Li

[ ExHall D ]

Abstract
In autonomous driving, vision-centric 3D detection aims to identify 3D objects from images. However, high data collection costs and diverse real-world scenarios limit the scale of training data. Once distribution shifts occur between training and test data, existing methods often suffer from performance degradation, known as Out-of-Distribution (OOD) problems. To address this, controllable Text-to-Image (T2I) diffusion offers a potential solution for training data enhancement, which is required to generate diverse OOD scenarios with precise 3D object geometry. Nevertheless, existing controllable T2I approaches are restricted by the limited scale of training data or struggle to preserve all annotated 3D objects. In this paper, we present DriveGEN, a method designed to improve the robustness of 3D detectors in Driving via Training-Free Controllable Text-to-Image Diffusion Generation. Without extra diffusion model training, DriveGEN consistently preserves objects with precise 3D geometry across diverse OOD generations. It consists of two stages: 1) Self-Prototype Extraction: we empirically find that self-attention features are semantic-aware but tend to be relatively coarse for 3D objects. Thus, we extract precise object features via layouts to capture 3D object geometry, termed self-prototypes. 2) Prototype-Guided Diffusion: To preserve objects across various OOD scenarios, we perform semantic-aware feature alignment and shallow feature alignment during denoising. …
Poster
Halil İbrahim Öztürk · Muhammet Esat Kalfaoglu · Ozsel Kilinc

[ ExHall D ]

Abstract
Accurate and efficient lane detection in 3D space is essential for autonomous driving systems, where robust generalization is the foremost requirement for 3D lane detection algorithms. Considering the extensive variation in lane structures worldwide, achieving high generalization capacity is particularly challenging, as algorithms must accurately identify a wide variety of lane patterns. Traditional top-down approaches rely heavily on learning lane characteristics from training datasets, often struggling with lanes exhibiting previously unseen attributes. To address this generalization limitation, we propose a method that detects keypoints of lanes and subsequently predicts sequential connections between them to construct complete 3D lanes. Each keypoint is essential for maintaining lane continuity, and we predict multiple proposals per keypoint by allowing adjacent grids to predict the same keypoint using an offset mechanism. PointNMS is employed to eliminate overlapping proposal keypoints, reducing redundancy in the estimated BEV graph and minimizing computational overhead from connection estimations. Our model surpasses previous state-of-the-art methods on both the Apollo and OpenLane datasets, demonstrating superior F1 scores and a strong generalization capacity when models trained on OpenLane are evaluated on the Apollo dataset, compared to prior approaches.
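A simple stand-in for the PointNMS step (an assumed greedy radius-based implementation, not necessarily the one used in the paper) keeps the highest-scoring keypoint proposal and suppresses any other proposal within a fixed radius.

```python
# Greedy radius-based NMS over BEV keypoint proposals (assumed illustration).
import numpy as np

def point_nms(points, scores, radius=0.5):
    """points: (N, 2) BEV keypoints, scores: (N,) -> indices of kept points."""
    order = np.argsort(-scores)
    kept = []
    for i in order:
        if all(np.linalg.norm(points[i] - points[j]) > radius for j in kept):
            kept.append(i)
    return np.array(kept)

if __name__ == "__main__":
    pts = np.array([[0.0, 0.0], [0.1, 0.0], [2.0, 1.0]])
    scores = np.array([0.9, 0.8, 0.7])
    print(point_nms(pts, scores))  # the second proposal is suppressed -> [0, 2]
```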
Poster
Yichong Lu · Yichi Cai · Shangzhan Zhang · Hongyu Zhou · Haoji Hu · Huimin Yu · Andreas Geiger · Yiyi Liao

[ ExHall D ]

Abstract
Photorealistic 3D vehicle models with high controllability are essential for autonomous driving simulation and data augmentation. While handcrafted CAD models provide flexible controllability, free CAD libraries often lack the high-quality materials necessary for photorealistic rendering. Conversely, reconstructed 3D models offer high-fidelity rendering but lack controllability. In this work, we introduce UrbanCAD, a framework that pushes the frontier of the photorealism-controllability trade-off by generating highly controllable and photorealistic 3D vehicle digital twins from a single urban image and a collection of free 3D CAD models and handcrafted materials. These digital twins enable realistic 360-degree rendering, vehicle insertion, material transfer, relighting, and component manipulation such as opening doors and rolling down windows, supporting the construction of long-tail scenarios. To achieve this, we propose a novel pipeline that operates in a retrieval-optimization manner, adapting to observational data while preserving flexible controllability and fine-grained handcrafted details. Furthermore, given multi-view background perspective and fisheye images, we approximate environment lighting using fisheye images and reconstruct the background with 3DGS, enabling the photorealistic insertion of optimized CAD models into rendered novel view backgrounds. Experimental results demonstrate that UrbanCAD outperforms baselines based on reconstruction and retrieval in terms of photorealism. Additionally, we show that various perception models maintain …
Poster
Tianyi Yan · Dongming Wu · Wencheng Han · Junpeng Jiang · xia zhou · Kun Zhan · Cheng-Zhong Xu · Jianbing Shen

[ ExHall D ]

Abstract
Autonomous driving evaluation requires simulation environments that closely replicate actual road conditions, including real-world sensory data and responsive feedback loops. However, many existing simulations need to predict waypoints along fixed routes on public datasets or synthetic photorealistic data, i.e., open-loop simulation, which usually lacks the ability to assess dynamic decision-making. While recent efforts in closed-loop simulation offer feedback-driven environments, they cannot process visual sensor inputs or produce outputs that differ from real-world data. To address these challenges, we propose DrivingSphere, a realistic and closed-loop simulation framework. Its core idea is to build a 4D world representation and generate real-life and controllable driving scenarios. Specifically, our framework includes a Dynamic Environment Composition module that constructs a detailed 4D driving world in an occupancy format equipped with static backgrounds and dynamic objects, and a Visual Scene Synthesis module that transforms this data into high-fidelity, multi-view video outputs, ensuring spatial and temporal consistency. By providing a dynamic and realistic simulation environment, DrivingSphere enables comprehensive testing and validation of autonomous driving algorithms, ultimately advancing the development of more reliable autonomous cars. The benchmark will be publicly released.
Poster
Haohong Lin · Xin Huang · Tung Phan-Minh · David S Hayden · Huan Zhang · DING ZHAO · Siddhartha Srinivasa · Eric M. Wolff · Hongge Chen

[ ExHall D ]

Abstract
Simulation is critical for safety evaluation in autonomous driving, particularly in capturing complex interactive behaviors. However, generating realistic and controllable traffic scenarios in long-tail situations remains a significant challenge. Existing generative models suffer from the conflicting objective between user-defined controllability and realism constraints, which is amplified in safety-critical contexts. In this work, we introduce the Causal Compositional Diffusion Model (CCDiff), a structure-guided diffusion framework to address these challenges. We first formulate the learning of controllable and realistic closed-loop simulation as a constrained optimization problem. Then, CCDiff maximizes controllability while adhering to realism by automatically identifying and injecting causal structures directly into the diffusion process, providing structured guidance to enhance both realism and controllability. Through rigorous evaluations on benchmark datasets and in a closed-loop simulator, CCDiff demonstrates substantial gains over state-of-the-art approaches in generating realistic and user-preferred trajectories. Our results show CCDiff’s effectiveness in extracting and leveraging causal structures, showing improved closed-loop performance based on key metrics such as collision rate, off-road rate, FDE, and comfort.
Poster
Wayne Wu · Honglin He · Chaoyuan Zhang · Jack He · Seth Z. Zhao · Ran Gong · Quanyi Li · Bolei Zhou

[ ExHall D ]

Abstract
Micromobility, which utilizes lightweight devices moving in urban public spaces - such as delivery robots and electric wheelchairs - emerges as a promising alternative to vehicular mobility. Current micromobility depends mostly on human manual operation (in-person or remote control), which raises safety and efficiency concerns when navigating busy urban environments full of obstacles and pedestrians. Assisting humans with AI agents in maneuvering micromobility devices presents a viable solution for enhancing safety and efficiency. In this work, we present a scalable urban simulation solution to advance autonomous micromobility. First, we build URBAN-SIM -- a high-performance robot learning platform for large-scale training of embodied agents in interactive urban scenes. URBAN-SIM contains three critical modules: Hierarchical Urban Generation pipeline, Interactive Dynamics Generation strategy, and Asynchronous Scene Sampling scheme, to improve the diversity, realism, and efficiency of robot learning in simulation. Then, we propose URBAN-BENCH -- a suite of essential tasks and benchmarks to gauge various capabilities of the AI agents in achieving autonomous micromobility. URBAN-BENCH includes eight tasks based on three core skills of the agents: Urban Locomotion, Urban Navigation, and Urban Traverse. We evaluate four robots with heterogeneous embodiments, such as the wheeled and legged robots, across these tasks. Experiments on diverse …
Poster
Kaouther Messaoud · Matthieu Cord · Alex Alahi

[ ExHall D ]

Abstract
Existing vehicle trajectory prediction models struggle with generalizability, prediction uncertainties, and handling complex interactions. This is often due to limitations such as complex architectures customized for a specific dataset and inefficient multimodal handling. We propose Perceiver with Register queries (PerReg+), a novel trajectory prediction framework that introduces: (1) Dual-Level Representation Learning via Self-Distillation (SD) and Masked Reconstruction (MR), capturing global context and fine-grained details. Additionally, our approach of reconstructing segment-level trajectories and lane segments from masked inputs with query drop enables effective use of contextual information and improves generalization; (2) Enhanced Multimodality using register-based queries and pretraining, eliminating the need for clustering and suppression; and (3) Adaptive Prompt Tuning during fine-tuning, freezing the main architecture and optimizing a small number of prompts for efficient adaptation. PerReg+ sets a new state-of-the-art performance on nuScenes, Argoverse 2, and Waymo Open Motion Dataset (WOMD). Remarkably, our pretrained model reduces the error by 6.8% on smaller datasets, and multi-dataset training enhances generalization. In cross-domain tests, PerReg+ reduces B-FDE by 11.8% compared to its non-pretrained variant.
Poster
Deepti Hegde · Rajeev Yasarla · Hong Cai · Shizhong Han · Apratim Bhattacharyya · Shweta Mahajan · Litian Liu · Risheek Garrepalli · Vishal M. Patel · Fatih Porikli

[ ExHall D ]

Abstract
Autonomous driving demands safe motion planning, especially in critical "long-tail" scenarios. Recent end-to-end autonomous driving systems leverage large language models (LLMs) as planners to improve generalizability to rare events. However, using LLMs at test time introduces high computational costs. To address this, we propose DiMA, an end-to-end autonomous driving system that maintains the efficiency of an LLM-free (or vision-based) planner while leveraging the world knowledge of an LLM. DiMA distills the information from a multi-modal LLM to a vision-based end-to-end planner through a set of specially designed surrogate tasks. Under a joint training strategy, a scene encoder common to both networks produces structured representations that are semantically grounded as well as aligned to the final planning objective. Notably, the LLM is optional at inference, enabling robust planning without compromising on efficiency. Training with DiMA results in a 37% reduction in the L2 trajectory error and an 80% reduction in the collision rate of the vision-based planner, as well as a 44% trajectory error reduction in long-tail scenarios. DiMA also achieves state-of-the-art performance on the nuScenes planning benchmark.
Poster
Mingfei Han · Liang Ma · Kamila Zhumakhanova · Ekaterina Radionova · Jingyi Zhang · Xiaojun Chang · Xiaodan Liang · Ivan Laptev

[ ExHall D ]

Abstract
Vision-and-Language Navigation (VLN) suffers from the limited diversity and scale of training data, primarily constrained by the manual curation of existing simulators. To address this, we introduce RoomTour3D, a video-instruction dataset derived from web-based room tour videos that capture real-world indoor spaces and human walking demonstrations. Unlike existing VLN datasets, RoomTour3D leverages the scale and diversity of online videos to generate open-ended human walking trajectories and open-world navigable instructions. To compensate for the lack of navigation data in online videos, we perform 3D reconstruction and obtain 3D trajectories of walking paths augmented with additional information on the room types, object locations and 3D shape of surrounding scenes. Our dataset includes 100K open-ended description-enriched trajectories with 200K instructions, and 17K action-enriched trajectories from 1847 room tour environments. We demonstrate experimentally that RoomTour3D enables significant improvements across multiple VLN tasks including CVDN, SOON, R2R, and REVERIE. Moreover, RoomTour3D facilitates the development of trainable zero-shot VLN agents, showcasing the potential and challenges of advancing towards open-world navigation.
Poster
Nedko Savov · Naser Kazemi · Mohammad Mahdi · Danda Paudel · Xi Wang · Luc Van Gool

[ ExHall D ]

Abstract
Modern world models require costly and time-consuming collection of large video datasets with action demonstrations by people or by environment-specific agents. To simplify training, we focus on using many virtual environments for inexpensive, automatically collected interaction data. Genie, a recent multi-environment world model, demonstrates generalization abilities on many environments with shared behavior. Unfortunately, training their model requires expensive demonstrations. Therefore, we propose a training framework that merely uses a random agent in virtual environments. While the model trained in this manner exhibits good controls, it is limited by the random exploration possibilities. To address this limitation, we propose AutoExplore Agent - an exploration agent which relies entirely on the uncertainty of the world model, delivering diverse data from which it can learn best. Our agent is fully independent of environment-specific rewards, and thus adapts easily to new environments. With this approach, the pretrained multi-environment model can quickly adapt to new environments, achieving a video fidelity improvement of up to 6.7 PSNR and a controllability improvement of up to 1.3 ΔPSNR. In order to obtain large-scale interaction datasets automatically for pretraining, we group environments with similar behavior and controls. To this end, we annotate the behavior and controls of 975 virtual environments - a dataset …
Poster
Chenjie Hao · Weyl Lu · Yifan Xu · Yubei Chen

[ ExHall D ]

Abstract
An embodied system must not only model the patterns of the external world but also understand its own motion dynamics. A motion dynamics model is essential for efficient skill acquisition and effective planning. In this work, we introduce Neural Motion Simulator (MoSim), a world model that predicts the physical future state of an embodied system based on current observations and actions. MoSim achieves state-of-the-art performance in physical state prediction and also provides competitive performance across a range of downstream tasks. This model enables embodied systems to perform long-horizon predictions, facilitating efficient skill acquisition in imagined environments and even enabling zero-shot reinforcement learning. Furthermore, MoSim can transform any model-free reinforcement learning (RL) algorithm into a model-based approach, effectively decoupling physical environment modeling from RL algorithm development. This separation allows for independent advancements in RL algorithms and world modeling, significantly improving sample efficiency and enhancing generalization capabilities. Our findings highlight that modeling world models for motion dynamics is a promising direction for developing more versatile and capable embodied systems.
Poster
Yuxuan Wang · Aming Wu · Muli Yang · Yukuan Min · Yihang Zhu · Cheng Deng

[ ExHall D ]

Abstract
This paper addresses the Weakly Supervised Affordance Grounding (WSAG) task, which aims to train a model to identify affordance regions using human-object interaction images and egocentric images without the need for costly pixel-level annotations. Most existing methods consider the affordance regions to be isolated and directly employ class activation maps to conduct localization, ignoring the relation with other object components and weakening the performance. For example, a cup’s handle is combined with its body to achieve the pouring ability. Obviously, capturing the region relations is beneficial for improving the localization accuracy of affordance regions. To this end, we first explore exploiting hypergraphs to discover these relations and propose a Reasoning Mamba (R-Mamba) framework. We first extract feature embeddings from exocentric and egocentric images to construct hypergraphs consisting of multiple vertices and hyperedges, which capture the in-context local region relationships between different visual components. Subsequently, we design a Hypergraph-guided State Space (HSS) block to reorganize these local relationships from a global perspective. Through this mechanism, the model can leverage the captured relationships to improve the localization accuracy of affordance regions. Extensive experiments and visualization analyses demonstrate the superiority of our method.
Poster
Jiong Lin · Lechen Zhang · Kwansoo Lee · Jialong Ning · Judah A Goldfeder · Hod Lipson

[ ExHall D ]

Abstract
Robot description models are essential for simulation and control, yet their creation often requires significant manual effort. To streamline this modeling process, we introduce AutoURDF, an unsupervised approach for constructing description files for unseen robots from point cloud frames. Our method leverages a cluster-based point cloud registration model that tracks the 6-DoF transformations of point clusters. Through analyzing cluster movements, we hierarchically address the following challenges: (1) moving part segmentation, (2) body topology inference, and (3) joint parameter estimation. The complete pipeline produces robot description files that are fully compatible with existing simulators. We validate our method across a variety of robots, using both synthetic and real-world scan data. Results indicate that our approach outperforms previous methods in registration and body topology estimation accuracy, offering a scalable solution for automated robot modeling.
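The per-cluster 6-DoF tracking that AutoURDF relies on can be illustrated with a standard Kabsch/Procrustes alignment (a generic sketch, not the paper's registration model): given corresponding cluster points in two frames, estimate the rigid rotation and translation.

```python
# Standard rigid alignment of one point cluster between two frames (illustration).
import numpy as np

def rigid_transform(src, dst):
    """src, dst: (N, 3) corresponding points -> (R, t) with dst ≈ src @ R.T + t."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                 # fix reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = mu_d - R @ mu_s
    return R, t

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cluster = rng.normal(size=(50, 3))
    theta = 0.3
    R_true = np.array([[np.cos(theta), -np.sin(theta), 0],
                       [np.sin(theta),  np.cos(theta), 0],
                       [0, 0, 1]])
    moved = cluster @ R_true.T + np.array([0.1, -0.2, 0.05])
    R, t = rigid_transform(cluster, moved)
    print(np.allclose(R, R_true, atol=1e-6), np.round(t, 3))
```

Chaining such per-cluster transforms over time is one way to expose which clusters move together (candidate links) and how they move relative to each other (candidate joints).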
Poster
Xiaoqi Li · Lingyun Xu · Mingxu Zhang · Jiaming Liu · Yan Shen · Iaroslav Ponomarenko · Jiahui Xu · Liang Heng · Siyuan Huang · Shanghang Zhang · Hao Dong

[ ExHall D ]

Abstract
In robotic manipulation, task goals can be conveyed through various modalities, such as language, goal images, and goal videos. However, natural language can be ambiguous, while images or videos may offer overly detailed specifications. To address these challenges, we propose a novel approach using comprehensive multi-modal prompts that explicitly convey both low-level actions and high-level planning in a simple manner. Specifically, for each key-frame in the task sequence, our method allows for manual or automatic generation of simple and expressive 2D visual prompts overlaid on RGB images. These prompts represent the required task goals, such as the end-effector pose and the desired movement direction after contact. We develop a training strategy that enables the model to interpret these visual-language prompts and predict the corresponding contact poses and movement directions in SE(3) space. Furthermore, by sequentially executing all key-frame steps, the model can complete long-horizon tasks. This approach not only helps the model explicitly understand the task objectives but also enhances its robustness on unseen tasks by providing easily interpretable prompts. We evaluate our method in both simulated and real-world environments, demonstrating its robust manipulation capabilities.
Poster
Yao Mu · Tianxing Chen · Zanxin Chen · ShijiaPeng · Zhiqian Lan · Zeyu Gao · Zhixuan Liang · Qiaojun Yu · Yude Zou · Mingkun Xu · Lunkai Lin · Zhiqiang Xie · Mingyu Ding · Ping Luo

[ ExHall D ]

Abstract
In the rapidly advancing field of robotics, dual-arm coordination and complex object manipulation are essential capabilities for developing advanced autonomous systems. However, the scarcity of diverse, high-quality demonstration data and real-world-aligned evaluation benchmarks severely limits such development. To address this, we introduce RoboTwin, a generative digital twin framework that uses 3D generative foundation models and large language models to produce diverse expert datasets and provide a real-world-aligned evaluation platform for dual-arm robotic tasks. Specifically, RoboTwin creates varied digital twins of objects from single 2D images, generating realistic and interactive scenarios. It also introduces a spatial relation-aware code generation framework that combines object annotations with large language models to break down tasks, determine spatial constraints, and generate precise robotic movement code. Our framework offers a comprehensive benchmark with both simulated and real-world data, enabling standardized evaluation and better alignment between simulated training and real-world performance. We validated our approach using the open-source COBOT Magic Robot platform. Policies pre-trained on RoboTwin-generated data and fine-tuned with limited real-world samples improve the success rate by over 70% for single-arm tasks and over 40% for dual-arm tasks compared to models trained solely on real-world data. This significant improvement demonstrates RoboTwin's potential to enhance the development …
Poster
Hanzhi Chen · Boyang Sun · Anran Zhang · Marc Pollefeys · Stefan Leutenegger

[ ExHall D ]

Abstract
Future robots are envisioned as versatile systems capable of performing a variety of household tasks. The big question remains, how can we bridge the embodiment gap while minimizing physical robot learning, which fundamentally does not scale well. We argue that learning from in-the-wild human videos offers a promising solution for robotic manipulation tasks, as vast amounts of relevant data already exist on the internet. In this work, we present VidBot, a framework enabling zero-shot robotic manipulation using learned 3D affordance from in-the-wild monocular RGB-only human videos. VidBot leverages a pipeline to extract explicit representations from them, namely 3D hand trajectories from videos, combining a depth foundation model with structure-from-motion techniques to reconstruct temporally consistent, metric-scale 3D affordance representations agnostic to embodiments. We introduce a coarse-to-fine affordance learning model that first identifies coarse actions from the pixel space and then generates fine-grained interaction trajectories with a diffusion model, conditioned on coarse actions and guided by test-time constraints for context-aware interaction planning, enabling substantial generalization to novel scenes and embodiments. Extensive experiments demonstrate the efficacy of VidBot, which significantly outperforms counterparts across 13 manipulation tasks in zero-shot settings and can be seamlessly deployed across robot systems in real-world environments. VidBot paves the …
Poster
Yitang Li · Mingxian Lin · Zhuo Lin · Yipeng Deng · Yue Cao · Li Yi

[ ExHall D ]

Abstract
Existing motion generation methods based on mocap data are often limited by data quality and coverage. In this work, we propose a framework that generates diverse, physically feasible full-body human reaching and grasping motions using only brief walking mocap data. Our framework builds on the observation that walking data captures valuable movement patterns transferable across tasks, while advanced kinematic methods can generate diverse grasping poses, which can then be interpolated into motions to serve as task-specific guidance. Our approach incorporates an active data generation strategy to maximize the utility of the generated motions, along with a local feature alignment mechanism that transfers natural movement patterns from walking data to enhance both the success rate and naturalness of the synthesized motions. By combining the fidelity and stability of natural walking with the flexibility and generalizability of task-specific generated data, our method demonstrates strong performance and robust adaptability in diverse scenes and with unseen objects.
Poster
Hongxiang Zhao · Xingchen Liu · Mutian Xu · Yiming Hao · Weikai Chen · Xiaoguang Han

[ ExHall D ]

Abstract
We address key limitations in existing datasets and models for task-oriented hand-object interaction video generation, a critical approach for generating video demonstrations for robotic imitation learning. Current datasets, such as Ego4D, often suffer from inconsistent view perspectives and misaligned interactions, leading to reduced video quality and limiting their applicability for precise imitation learning tasks. Towards this end, we introduce Roger - a pioneering large-scale dataset of 103,856 ego-centric hand-object interaction videos. Each video is meticulously aligned with language instructions and recorded from a consistent camera viewpoint to ensure interaction clarity. By fine-tuning a Video Diffusion Model (VDM) on Roger, we achieve realistic object interactions, though we observed occasional inconsistencies in hand grasping postures. To enhance realism, we introduce a three-stage pose-refinement pipeline that improves hand posture accuracy in generated videos. Our curated dataset, coupled with the specialized pose-refinement framework, provides notable performance gains in generating high-quality, task-oriented hand-object interaction videos, enabling superior and generalizable robotic manipulation. The Roger dataset will be made publicly available upon publication to foster further advancements in the field.
Poster
Wanyue Zhang · Rishabh Dabral · Vladislav Golyanik · Vasileios Choutas · Eduardo Alvarado · Thabo Beeler · Marc Habermann · Christian Theobalt

[ ExHall D ]

Abstract
We present BimArt, a novel generative approach for synthesizing 3D bimanual hand interactions with articulated objects. Unlike prior works, we do not rely on a reference grasp, a coarse hand trajectory, or separate modes for grasping and articulating. To achieve this, we first generate distance-based contact maps conditioned on the object trajectory with an articulation-aware feature representation, revealing rich bimanual patterns for manipulation. The learned contact prior is then used to guide our hand motion generator, producing diverse and realistic bimanual motions for object movement and articulation. Our work offers key insights into feature representation and contact prior for articulated objects, demonstrating their effectiveness in taming the complex, high-dimensional space of bimanual hand-object interactions. Through comprehensive quantitative experiments, we demonstrate a clear step towards simplified and high-quality hand-object animations that excel over the state-of-the-art in motion quality and diversity.
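As a rough illustration of the distance-based contact maps mentioned above, the sketch below assigns each object surface point a soft contact value derived from its distance to the nearest hand vertex. The exponential falloff, the sigma value, and the array shapes are illustrative assumptions, not BimArt's actual formulation.

```python
# Minimal sketch of a distance-based contact map (illustrative, not the paper's exact design).
import numpy as np

def distance_contact_map(hand_verts: np.ndarray,
                         object_points: np.ndarray,
                         sigma: float = 0.02) -> np.ndarray:
    """Return a per-object-point contact value in [0, 1].

    hand_verts:    (N, 3) hand mesh vertices in metres.
    object_points: (M, 3) points sampled on the articulated object surface.
    """
    # Pairwise distances between every object point and every hand vertex.
    diff = object_points[:, None, :] - hand_verts[None, :, :]   # (M, N, 3)
    dists = np.linalg.norm(diff, axis=-1)                       # (M, N)
    # Distance to the closest hand vertex for each object point.
    min_dist = dists.min(axis=1)                                # (M,)
    # Soft contact value: 1 when touching, decaying with distance.
    return np.exp(-min_dist / sigma)

# Toy usage with random geometry.
hand = np.random.rand(778, 3) * 0.1          # roughly MANO-sized point set
obj = np.random.rand(2048, 3) * 0.1
contact = distance_contact_map(hand, obj)
print(contact.shape, contact.min(), contact.max())
```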
Poster
Zhenrong Wang · Qi Zheng · Sihan Ma · Maosheng Ye · Yibing Zhan · Dongjiang Li

[ ExHall D ]

Abstract
Human-object interaction (HOI) reconstruction has garnered significant attention due to its diverse applications and the success of capturing human meshes. Existing HOI reconstruction methods often rely on explicitly modeling interactions between humans and objects. However, this approach creates an inherent conflict between 3D mesh reconstruction, which emphasizes global structure, and fine-grained contact reconstruction, which focuses on local details. To address the limitations of explicit modeling, we propose the End-to-End HOI Reconstruction Transformer with Graph-based Encoding (HOI-TG). It implicitly learns the interaction between humans and objects by leveraging self-attention mechanisms. Within the transformer architecture, we devise graph residual blocks to aggregate the topology among vertices of different spatial structures. This dual focus effectively balances global and local representations. Without bells and whistles, HOI-TG achieves state-of-the-art performance on the BEHAVE and InterCap datasets. Particularly on the challenging InterCap dataset, our method improves the reconstruction results for human and object meshes by 8.9% and 8.6%, respectively.
Poster
Zhengdi Yu · Stefanos Zafeiriou · Tolga Birdal

[ ExHall D ]

Abstract
We propose Dyn-HaMR, to the best of our knowledge, the first approach to reconstruct 4D global hand motion from monocular videos recorded by dynamic cameras in the wild. Reconstructing accurate 3D hand meshes from monocular videos is a crucial task for understanding human behaviour, with significant applications in augmented and virtual reality (AR/VR). However, existing methods for monocular hand reconstruction typically rely on a weak perspective camera model, which simulates hand motion within a limited camera frustum. As a result, these approaches struggle to recover the full 3D global trajectory and often produce noisy or incorrect depth estimations, particularly when the video is captured by dynamic or moving cameras, which is common in egocentric scenarios. Dyn-HaMR consists of a multi-stage, multi-objective optimization pipeline that factors in (i) simultaneous localization and mapping (SLAM) to robustly estimate relative camera motion, (ii) an interacting-hand prior for generative infilling and to refine the interaction dynamics, ensuring plausible recovery under (self-)occlusions, and (iii) hierarchical initialization through a combination of state-of-the-art hand tracking methods.
Poster
Yiming Zhao · Taein Kwon · Paul Streli · Marc Pollefeys · Christian Holz

[ ExHall D ]

Abstract
Estimating touch contact and pressure in egocentric vision is a central task for downstream applications in Augmented Reality, Virtual Reality, as well as many robotic applications, because it provides precise physical insights into hand-object interaction and object manipulation. However, existing contact pressure datasets lack egocentric views and hand poses, which are essential for accurate estimation during in-situ operation, both for AR/VR interaction and robotic manipulation. In this paper, we introduce a novel dataset of touch contact and pressure interaction from an egocentric perspective, complemented with hand pose meshes and fine-grained pressure intensities for each contact. The hand poses in our dataset are optimized using our proposed multi-view sequence-based method that processes footage from our capture rig of 8 accurately calibrated RGBD cameras. The dataset comprises 5.0 hours of touch contact and pressure interaction from 21 participants captured by a moving egocentric camera and 7 stationary Kinect cameras, which provide RGB images and depth maps at 30 Hz. In addition, we provide baselines for estimating pressure with different modalities, which will enable future developments and benchmarking on the dataset. Overall, we demonstrate that pressure and hand poses are complementary, which supports our intention to better facilitate the physical understanding of hand-object interactions in AR/VR …
Poster
Ziyu Wu · Yufan Xiong · Mengting Niu · Fangting Xie · Quan Wan · Qijun Ying · Boyan Liu · Xiaohui Cai

[ ExHall D ]

Abstract
Long-term in-bed monitoring benefits automatic and real-time health management within healthcare, and the advancement of human shape reconstruction technologies further enhances the representation and visualization of users' activity patterns. However, existing technologies are primarily based on visual cues, facing serious challenges in non-line-of-sight and privacy-sensitive in-bed scenes. Pressure-sensing bedsheets offer a promising solution for real-time motion reconstruction. Yet, limited exploration in model designs and data has hindered its further development. To tackle these issues, we propose a general framework that bridges gaps in data annotation and model design. Firstly, we introduce SMPLify-IB, an optimization method that overcomes the depth ambiguity issue in top-view scenarios through gravity constraints, enabling the generation of high-quality 3D human shape annotations for in-bed datasets. Then we present PI-HMR, a temporal-based human shape estimator to regress meshes from pressure sequences. By integrating multi-scale feature fusion with high-pressure distribution and spatial position priors, PI-HMR outperforms SOTA methods with a 17.01mm decrease in Mean Per-Joint Error. This work provides a whole tool-chain to support the development of in-bed monitoring with pressure contact sensing.
Poster
Jae-Ho Choi · Soheil Hor · Shubo Yang · Amin Arbabian

[ ExHall D ]

Abstract
One of the main challenges in reliable camera-based 3D pose estimation for walking subjects is to deal with self-occlusions, especially when using low-resolution cameras or at longer distances. In recent years, millimeter-wave (mmWave) radar has emerged as a promising alternative, offering inherent resilience to the effect of occlusions and distance variations. However, mmWave-based human walking pose estimation (HWPE) is still in the nascent development stages, primarily due to its unique set of practical challenges, including the dependence of observed radar signal quality on the subject’s motion direction. This paper introduces the first comprehensive study comparing mmWave radar to camera systems for HWPE, highlighting its utility for distance-agnostic and occlusion-resilient pose estimation. Building upon mmWave’s unique advantages, we address its intrinsic directionality issue through a new approach—the synergetic integration of multi-modal, multi-view mmWave signals, achieving robust HWPE against variations in both distance and walking direction. Extensive experiments on a newly curated dataset not only demonstrate the superior potential of mmWave technology over traditional camera-based HWPE systems, but also validate the effectiveness of our approach in overcoming the core limitations of mmWave HWPE.
Poster
Shenghao Ren · Yi Lu · Jiayi Huang · Jiayi Zhao · He Zhang · Tao Yu · Qiu Shen · Xun Cao

[ ExHall D ]

Abstract
Existing human Motion Capture (MoCap) methods mostly focus on visual similarity while neglecting physical plausibility. As a result, downstream tasks such as driving virtual humans in 3D scenes or humanoid robots in the real world suffer from temporal issues such as drift and jitter, spatial problems like sliding and penetration, and poor global trajectory accuracy. In this paper, we revisit human MoCap from the perspective of interaction between the human body and the physical world by exploring the role of pressure. Firstly, we construct a large-scale Human Motion capture dataset with Pressure, RGB and Optical sensors (named MotionPRO), which comprises 70 volunteers performing 400 types of motion. Secondly, we examine both the necessity and effectiveness of the pressure signal through two challenging tasks: (1) pose and trajectory estimation based solely on pressure: We propose a network that incorporates a small-kernel decoder and a long-short-term attention module, and prove that pressure can provide accurate global trajectories and plausible lower-body poses. (2) pose and trajectory estimation by fusing pressure and RGB: We impose constraints on orthographic similarity along the camera axis and whole-body contact along the vertical axis to enhance the cross-attention strategy for fusing pressure and RGB feature maps. Experiments demonstrate …
Poster
Yinghao Wu · Shihui Guo · Yipeng Qin

[ ExHall D ]

Abstract
While data augmentation (DA) has been extensively studied in computer vision, its application to Inertial Measurement Unit (IMU) signals remains largely unexplored, despite IMUs' growing importance in human motion analysis. In this paper, we present the first systematic study of IMU-specific data augmentation, beginning with a comprehensive analysis that identifies three fundamental properties of IMU signals: their time-series nature, inherent multimodality (rotation and acceleration) and motion-consistency characteristics. Through this analysis, we demonstrate the limitations of applying conventional time-series augmentation techniques to IMU data. We then introduce Motion-Drift Augmentation (MODA), a novel technique that simulates the natural displacement of body-worn IMUs during motion. We evaluate our approach across five diverse datasets and five deep learning settings, including i) fully-supervised, ii) semi-supervised, iii) domain adaptation, iv) domain generalization and v) few-shot learning for both Human Action Recognition (HAR) and Human Pose Estimation (HPE) tasks. Experimental results show that our proposed MODA consistently outperforms existing augmentation methods, with semi-supervised learning performance approaching state-of-the-art fully-supervised methods.
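To make the idea of simulating the natural displacement of body-worn IMUs concrete, the sketch below applies a slowly varying rotational drift to synthetic accelerometer and gyroscope streams. The random-walk drift model, its magnitude, and the Rodrigues rotation are illustrative assumptions rather than MODA's published recipe.

```python
# Illustrative drift-style augmentation for IMU signals (not MODA's exact formulation).
import numpy as np

def rotvec_to_matrix(rotvec: np.ndarray) -> np.ndarray:
    """Rodrigues formula: (3,) rotation vector -> (3, 3) rotation matrix."""
    theta = np.linalg.norm(rotvec)
    if theta < 1e-8:
        return np.eye(3)
    k = rotvec / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def drift_augment(acc: np.ndarray, gyro: np.ndarray, max_drift_deg: float = 10.0):
    """Apply a slowly varying rotational drift to (T, 3) accelerometer/gyroscope signals."""
    T = acc.shape[0]
    # Random walk in rotation-vector space; dividing by sqrt(T) keeps the
    # typical end-of-sequence drift around max_drift_deg per axis.
    steps = np.random.randn(T, 3) * np.deg2rad(max_drift_deg) / np.sqrt(T)
    drift = np.cumsum(steps, axis=0)
    acc_out, gyro_out = np.empty_like(acc), np.empty_like(gyro)
    for t in range(T):
        R = rotvec_to_matrix(drift[t])
        acc_out[t] = R @ acc[t]
        gyro_out[t] = R @ gyro[t]
    return acc_out, gyro_out

acc, gyro = np.random.randn(200, 3), np.random.randn(200, 3)
acc_aug, gyro_aug = drift_augment(acc, gyro)
print(acc_aug.shape, gyro_aug.shape)
```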
Poster
Xinpeng Liu · Junxuan Liang · Chenshuo Zhang · Zixuan Cai · Cewu Lu · Yonglu Li

[ ExHall D ]

Abstract
Analyses of human motion kinematics have achieved tremendous advances. However, the production mechanism, known as human dynamics, remains underexplored. In this paper, we aim to push data-driven human dynamics understanding forward. We identify a major obstacle to this as the heterogeneity of existing human motion understanding efforts. Specifically, heterogeneity exists not only in the diverse kinematics representations and hierarchical dynamics representations but also in the data from different domains, namely biomechanics and reinforcement learning. With an in-depth analysis of this heterogeneity, we propose to emphasize the underlying homogeneity: all of these representations describe the same human motion, though from different perspectives. Given this, we propose Homogeneous Dynamics Space (HDyS) as a fundamental space for human dynamics by aggregating heterogeneous data and training a homogeneous latent space with inspiration from the inverse-forward dynamics procedure. Leveraging the heterogeneous representations and datasets, HDyS achieves decent mapping between human kinematics and dynamics. We demonstrate the feasibility of HDyS with extensive experiments and applications. Our code will be made publicly available.
Poster
Wei-Jin Huang · Yuan-Ming Li · Zhi-Wei Xia · Yu-Ming Tang · Kun-Yu Lin · Jian-Fang Hu · Wei-Shi Zheng

[ ExHall D ]

Abstract
Error detection in procedural activities is essential for consistent and correct outcomes in AR-assisted and robotic systems. Some existing methods can only detect errors in action labels, while others can only detect errors by comparing the actual action with static prototypes. Prototype-based methods overlook situations where more than one action is valid following a sequence of executed actions. This leads to two issues: not only can the model not effectively detect errors using static prototypes when the inference environment or action execution distribution differs from training, but the model may also use the wrong prototypes to detect errors if the ongoing action's label is not the same as the predicted one. To address this problem, we propose an Adaptive Multiple Normal Action Representation (AMNAR) framework. AMNAR predicts all valid next actions and reconstructs their corresponding normal action representations, which are compared against the ongoing action to detect errors. Extensive experiments demonstrate that AMNAR achieves state-of-the-art performance, highlighting the effectiveness of AMNAR and the importance of modeling multiple valid next actions in error detection.
Poster
Yiheng Li · RuiBing Hou · Hong Chang · Shiguang Shan · Xilin Chen

[ ExHall D ]

Abstract
Human pose plays a crucial role in the digital age. While recent works have achieved impressive progress in understanding and generating human poses, they often support only a single modality of control signals and operate in isolation, limiting their application in real-world scenarios. This paper presents UniPose, a framework employing Large Language Models (LLMs) to comprehend, generate, and edit human poses across various modalities, including images, text, and 3D SMPL poses. Specifically, we apply a pose tokenizer to convert 3D poses into discrete pose tokens, enabling seamless integration into the LLM within a unified vocabulary. To further enhance fine-grained pose perception, we equip UniPose with a mixture of visual encoders, including a pose-specific visual encoder. Benefiting from a unified learning strategy, UniPose effectively transfers knowledge across different pose-relevant tasks, adapts to unseen tasks, and exhibits extended capabilities. This work serves as the first attempt at building a general-purpose framework for pose comprehension, generation, and editing. Extensive experiments highlight UniPose's competitive and even superior performance across various pose-relevant tasks.
Poster
Jiaqi Chen · Xiaoye Zhu · Yue Wang · Tianyang Liu · Xinhui Chen · Ying Chen · Chak Tou Leong · Yifei Ke · Joseph Liu · Yiwen Yuan · Julian McAuley · Li-jia Li

[ ExHall D ]

Abstract
We propose a symbolic generative task description language and inference engine, capable of representing arbitrary multimodal tasks as symbolic flows. The inference engine maps natural language instructions to symbolic flows, eliminating the need for task-specific training. Conventional generative models rely heavily on large-scale training and implicit neural representation to learn cross-modal mappings, which demands extensive computational resources and restricts expandability. In this paper, we propose an explicit symbolic task descriptive language, comprising three types of primitives: functions, parameters, and topological logic. Using a pre-trained language model to infer symbolic workflows in a training-free manner, our framework successfully performs over 12 multimodal generative tasks based on user instructions, demonstrating enhanced efficiency and flexibility. Extensive experiments demonstrate that our approach can generate multimodal content competitive with, and often surpassing, that of previous state-of-the-art unified models, while offering robust interruptibility and editability. We believe that symbolic task representations are capable of cost-effectively expanding the boundaries of generative AI capabilities. All code and results are available in the Supplementary Materials.
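The sketch below shows one way the three primitive types (functions, parameters, and topological links) could be encoded and executed as a symbolic flow; the node structure, the toy function registry, and the executor are hypothetical and not the paper's actual language or engine.

```python
# Hypothetical, minimal symbolic-flow data structure and executor (illustrative only).
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str                                     # function primitive, e.g. "t2i"
    params: dict = field(default_factory=dict)    # parameter primitives
    inputs: list = field(default_factory=list)    # topological links (by node name)

REGISTRY = {                                      # toy stand-ins for generative modules
    "t2i": lambda params, inputs: f"image({params['prompt']})",
    "upscale": lambda params, inputs: f"upscaled({inputs[0]}, x{params['factor']})",
}

def run_flow(nodes: list) -> dict:
    """Execute nodes in order, resolving inputs by name (assumes a valid topological order)."""
    results = {}
    for node in nodes:
        resolved = [results[i] for i in node.inputs]
        results[node.name] = REGISTRY[node.name](node.params, resolved)
    return results

flow = [
    Node("t2i", {"prompt": "a red bicycle"}),
    Node("upscale", {"factor": 4}, inputs=["t2i"]),
]
print(run_flow(flow)["upscale"])
```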
Poster
Zhengyuan Li · Kai Cheng · Anindita Ghosh · Uttaran Bhattacharya · Liangyan Gui · Aniket Bera

[ ExHall D ]

Abstract
Text-based 3D human motion editing is a critical yet challenging task in computer vision and graphics. While training-free approaches have been explored, the recent release of the MotionFix dataset, which includes source-text-motion triplets, has opened new avenues for training, yielding promising results. However, existing methods struggle with precise control, often resulting in misalignment between motion semantics and language instructions. In this paper, we introduce MotionDiT, an advanced Diffusion-Transformer-based motion editing model that effectively incorporates editing features both as layer-wise control signals and as input prefixes. To enhance the model's semantic understanding, we also propose a novel auxiliary task, motion similarity prediction, which fosters the learning of semantically meaningful representations. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in both editing alignment and fidelity.
Poster
Kwan Yun · Seokhyeon Hong · Chaelin Kim · Junyong Noh

[ ExHall D ]

Abstract
Despite recent advancements in learning-based motion in-betweening, a key limitation has been overlooked: the requirement for character-specific datasets. In this work, we introduce AnyMoLe, a novel method that addresses this limitation by leveraging video diffusion models to generate motion in-between frames for arbitrary characters without external data. Our approach employs a two-stage frame generation process to enhance contextual understanding. Furthermore, to bridge the domain gap between real-world and rendered character animations, we introduce ICAdapt, a fine-tuning technique for video diffusion models. Additionally, we propose a "motion-video mimicking" optimization technique, enabling seamless motion generation for characters with arbitrary joint structures using 2D and 3D-aware features. AnyMoLe significantly reduces data dependency while generating smooth and realistic transitions, making it applicable to a wide range of motion in-betweening tasks.
Poster
Bizhu Wu · Jinheng Xie · Keming Shen · Zhe Kong · Jianfeng Ren · Ruibin Bai · Rong Qu · Linlin Shen

[ ExHall D ]

Abstract
Recent motion-aware large language models have demonstrated promising potential in unifying motion comprehension and generation. However, existing studies often focus on coarse-grained motion-text modeling, limiting their ability to handle fine-grained motion-relevant tasks. To overcome this limitation, we pioneer MG-MotionLLM, a unified motion-language model for multi-granular motion comprehension and generation. We further introduce a comprehensive multi-granularity training scheme by incorporating a set of novel auxiliary tasks, such as localizing the temporal boundaries of motion segments from detailed text and generating detailed motion captions, to facilitate mutual reinforcement for motion-text modeling across various levels of granularity. Extensive experiments show that our MG-MotionLLM achieves superior performance on classical text-to-motion and motion-to-text tasks, and exhibits potential in novel fine-grained motion comprehension and editing tasks. Dataset and code will be released upon paper acceptance.
Poster
Zichong Meng · Yiming Xie · Xiaogang Peng · Zeyu Han · Huaizu Jiang

[ ExHall D ]

Abstract
Since 2023, Vector Quantization (VQ)-based discrete generation methods have rapidly dominated human motion generation, primarily surpassing diffusion-based continuous generation methods in standard performance metrics. However, VQ-based methods have inherent limitations. Representing continuous motion data as limited discrete tokens leads to inevitable information loss, reduces the diversity of generated motions, and restricts their ability to function effectively as motion priors or generation guidance. In contrast, the continuous-space generation nature of diffusion-based methods makes them well-suited to address these limitations, with even greater potential for model scalability. In this work, we systematically investigate why current VQ-based methods perform well and explore the limitations of existing diffusion-based methods from the perspective of motion data representation and distribution. Drawing on these insights, we preserve the inherent strengths of a diffusion-based human motion generation model and gradually optimize it with inspiration from VQ-based approaches. Our approach introduces a human motion diffusion model that performs bidirectional masked autoregression, optimized with a reformed data representation and distribution. Additionally, we also propose more robust evaluation methods to fairly assess methods based on different paradigms. Extensive experiments on benchmark human motion generation datasets demonstrate that our method surpasses previous methods and achieves state-of-the-art performance.
Poster
Shunlin Lu · Jingbo Wang · Zeyu Lu · Ling-Hao Chen · Wenxun Dai · Junting Dong · Zhiyang Dou · Bo Dai · Ruimao Zhang

[ ExHall D ]

Abstract
The scaling law has been validated in various domains, such as natural language processing (NLP) and massive computer vision tasks; however, its application to motion generation remains largely unexplored. In this paper, we introduce a scalable motion generation framework that includes the motion tokenizer Motion FSQ-VAE and a text-prefix autoregressive transformer. Through comprehensive experiments, we observe the scaling behavior of this system. For the first time, we confirm the existence of scaling laws within the context of motion generation. Specifically, our results demonstrate that the normalized test loss of our prefix autoregressive models adheres to a logarithmic law in relation to compute budgets. Furthermore, we also confirm power laws relating non-vocabulary parameters, vocabulary parameters, and data tokens, respectively, to compute budgets. Leveraging the scaling law, we predict the optimal transformer size, vocabulary size, and data requirements for a compute budget of 1e18. The test loss of the system, when trained with the optimal model size, vocabulary size, and required data, aligns precisely with the predicted test loss.
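As a toy illustration of fitting such a logarithmic law, the snippet below regresses a synthetic normalized test loss against log-compute; the (compute, loss) values and the exact functional form are made up for demonstration and do not reproduce the paper's measurements.

```python
# Toy fit of loss = a + b * log10(compute); synthetic numbers, illustrative only.
import numpy as np

compute = np.array([1e15, 1e16, 1e17, 1e18, 1e19])   # toy compute budgets
loss = np.array([3.10, 2.85, 2.61, 2.38, 2.12])      # toy normalized test losses

# np.polyfit returns [slope, intercept] for a degree-1 fit.
b, a = np.polyfit(np.log10(compute), loss, deg=1)
print(f"fitted law: loss ~ {a:.2f} + {b:.3f} * log10(C)")

# Extrapolate to a larger budget under this toy fit.
print("predicted loss at C = 1e20:", a + b * np.log10(1e20))
```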
Poster
Ruopeng Gao · Ji Qi · Limin Wang

[ ExHall D ]

Abstract
Multi-Object Tracking has been a long-standing challenge in video understanding. A natural and intuitive approach is to split this task into two parts: object detection and association. Most mainstream methods employ meticulously crafted heuristic techniques to maintain trajectory information and compute cost matrices for object matching. Although these methods can achieve notable tracking performance, they commonly encounter issues in complex scenarios, thereby often requiring a series of elaborate handcrafted modifications. We believe that manually assumed priors limit the method's adaptability and flexibility, preventing it from directly learning optimal tracking capabilities from domain-specific data. Therefore, we propose a new perspective that treats Multiple Object Tracking as an in-context ID Prediction task, transforming the aforementioned object association into an end-to-end trainable task. Based on this, we propose a straightforward method termed MOTIP. Without using tailored or sophisticated architectures, our method achieves state-of-the-art results across multiple benchmarks by solely leveraging object-level features as tracking cues. The simplicity and impressive results of MOTIP leave substantial room for future advancements, thereby making it a promising baseline for subsequent research.
Poster
Libo Long · Xiao Hu · Jochen Lang

[ ExHall D ]

Abstract
Recent methods have made significant progress in optical flow estimation. However, the evaluation of these methods mainly focuses on improved accuracy in benchmarks and often overlooks the analysis of the robustness or behavior of the networks, which could be important in safety-critical scenarios such as autonomous driving. In this paper, we propose a novel method for robustness evaluation by modifying data from original benchmarks. Unlike previous benchmarks that focus on complex scenes, we propose to modify key objects from the original images in order to analyze the sensitivity to these changes observed in the output. Our aim is to identify common failure cases of state-of-the-art (SOTA) methods to evaluate their robustness and understand their behaviors. We show that optical flow methods are more sensitive to shape changes than to texture changes, and that they tend to “remember” objects seen during training and may “ignore” the motion of unseen objects. Our experimental results and findings provide a more in-depth understanding of the behavior of recent optical flow methods, suggesting the need for more careful design, especially in safety-critical scenarios. The code and data will be made available.
Poster
Hanyu Zhou · Haonan Wang · Haoyue Liu · Yuxing Duan · Yi Chang · Luxin Yan

[ ExHall D ]

Abstract
High-dynamic scene optical flow is a challenging task, which suffers from spatial blur and temporally discontinuous motion due to large displacements in frame imaging, thus degrading the spatiotemporal features of optical flow. Existing methods typically introduce an event camera to directly fuse the spatiotemporal features of the two modalities. However, this direct fusion is ineffective, since there exists a large gap due to the heterogeneous data representation between frame and event modalities. To address this issue, we explore a common latent space as an intermediate bridge to mitigate the modality gap. In this work, we propose a novel common spatiotemporal fusion between frame and event modalities for high-dynamic scene optical flow, including visual boundary localization and motion correlation fusion. Specifically, in visual boundary localization, we observe that frame and event modalities share similar spatiotemporal gradients, whose similarity distribution is consistent with the extracted boundary distribution. This motivates us to design the common spatiotemporal gradient to constrain the reference boundary localization. In motion correlation fusion, we discover that the frame-based motion possesses spatially dense but temporally discontinuous correlation, while the event-based motion has spatially sparse but temporally continuous correlation. This inspires us to use the reference boundary to guide the complementary motion …
Poster
Qiyao Gao · Peiqi Duan · Hanyue Lou · Minggui Teng · Ziqi Cai · Xu Chen · Boxin Shi

[ ExHall D ]

Abstract
This paper addresses the challenge that current event-based video reconstruction methods cannot produce static background information. Recent research has uncovered the potential of event cameras in capturing static scenes. Nonetheless, image quality deteriorates due to noise interference and detail loss, failing to provide reliable background information. We propose a two-stage reconstruction strategy to address these challenges and reconstruct static scene images comparable to those captured by frame cameras. Building on this, we introduce the URSEE framework, the first unified framework designed for reconstructing motion videos with static backgrounds. This framework includes a parallel channel that can simultaneously process static and dynamic events, and a network module designed to reconstruct videos encompassing both static and dynamic scenes in an end-to-end manner. We also collect a real-captured dataset for static reconstruction, containing both indoor and outdoor scenes. Comparison results indicate that the proposed approach achieves state-of-the-art reconstruction results on both synthetic and real data.
Poster
Alejandro Castañeda Garcia · Jan Warchocki · Jan van Gemert · Daan Brinks · Nergis Tomen

[ ExHall D ]

Abstract
Extracting physical dynamical system parameters from recorded observations is key in natural science. Current methods for automatic parameter estimation from video train supervised deep networks on large datasets. Such datasets require labels, which are difficult to acquire. While some unsupervised techniques that depend on frame prediction exist, they suffer from long training times and initialization instabilities, consider only motion-based dynamical systems, and are evaluated mainly on synthetic data. In this work, we propose an unsupervised method to estimate the physical parameters of known, continuous governing equations from single videos, suitable for different dynamical systems beyond motion and robust to initialization. Moreover, we remove the need for frame prediction by implementing a KL-divergence-based loss function in the latent space, which avoids convergence to trivial solutions and reduces model size and compute. We first evaluate our model on synthetic data, as commonly done. We then take the field closer to reality by recording our own real-world dataset of 75 videos covering five different types of dynamical systems to evaluate our method and others. Our method compares favorably to others. We will release all data and code.
Poster
Gene Chou · Kai Zhang · Sai Bi · Hao Tan · Zexiang Xu · Fujun Luan · Bharath Hariharan · Noah Snavely

[ ExHall D ]

Abstract
We address the problem of generating videos from unposed internet photos. A handful of input images serve as keyframes, and our model interpolates between them to simulate a path moving between the cameras. Given random images, a model’s ability to capture underlying geometry, recognize scene identity, and relate frames in terms of camera position and orientation reflects a fundamental understanding of 3D structure and scene layout. However, existing video models such as Luma Dream Machine fail at this task. We design a self-supervised method that takes advantage of the consistency of videos and variability of multiview internet photos to train a scalable, 3D-aware video model without any 3D annotations such as camera parameters. We validate that our method outperforms commercial models in terms of geometric and appearance consistency. We also show our model benefits applications that enable camera control, such as 3D Gaussian Splatting. Our results suggest that we can scale up scene-level 3D learning using only 2D data such as videos and multiview internet photos.
Poster
Guojun Lei · Chi Wang · Rong Zhang · Yikai Wang · Hong Li · Weiwei Xu

[ ExHall D ]

Abstract
We propose a unified approach for video-controlled generation, enabling text-based guidance and manual annotations to control the generation of videos, similar to camera direction guidance. Specifically, we designed a two-stage algorithm. In the first stage, we convert all control information into frame-by-frame motion flows. In the second stage, we use these motion flows as guidance to control the final video generation. Additionally, to reduce instability in the generated videos caused by large motion variations (such as those from camera movement, object motion, or manual inputs), which can result in flickering or the intermittent disappearance of objects, we transform the temporal feature computation in the video model into frequency-domain feature computation. This is because frequency-domain signals better capture the essential characteristics of an image, and by ensuring consistency in the video's frequency-domain features, we can enhance temporal coherence and reduce flickering in the final generated video.
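A minimal sketch of shifting temporal feature computation into the frequency domain is given below: features are transformed with an FFT along the time axis, reweighted, and transformed back. The low-pass reweighting is an illustrative stand-in for whatever frequency-domain operation the authors actually apply, and the tensor layout is assumed.

```python
# Sketch of frequency-domain processing of temporal video features (illustrative only).
import torch

def temporal_frequency_filter(feats: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """feats: (B, T, C, H, W) video features; returns features of the same shape."""
    B, T, C, H, W = feats.shape
    spec = torch.fft.rfft(feats, dim=1)          # complex spectrum over time, (B, T//2+1, C, H, W)
    # Example frequency-domain operation: damp the highest temporal frequencies,
    # which tends to improve frame-to-frame consistency in this toy setup.
    n_freq = spec.shape[1]
    keep = max(1, int(n_freq * keep_ratio))
    weights = torch.ones(n_freq, device=feats.device)
    weights[keep:] = 0.2
    spec = spec * weights.view(1, n_freq, 1, 1, 1)
    return torch.fft.irfft(spec, n=T, dim=1)     # back to the time domain

x = torch.randn(2, 16, 8, 32, 32)
print(temporal_frequency_filter(x).shape)        # torch.Size([2, 16, 8, 32, 32])
```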
Poster
Zhongwei Zhang · Fuchen Long · Zhaofan Qiu · Yingwei Pan · Wu Liu · Ting Yao · Tao Mei

[ ExHall D ]

Abstract
Animating images with interactive motion control has garnered popularity for image-to-video (I2V) generation. Modern approaches typically regard the Gaussian filtered trajectory as the sole motion control signal. Nevertheless, the flow approximation via a Gaussian kernel limits the controllability of fine-grained movement, and commonly fails to disentangle object and camera motion. To alleviate these issues, we present MotionPro, a new recipe of region-wise motion controller that novelly leverages region-wise trajectory and motion mask to regulate fine-grained motion synthesis and identify the exact target motion category (i.e., object or camera motion), respectively. Technically, MotionPro first estimates the flow maps on each training video via a tracking model, and then samples the region-wise trajectories from multiple local regions to simulate the inference scenario. Instead of approximating flow distributions generally using a large Gaussian kernel, our region-wise trajectory provides more precise control by directly employing trajectories in local regions and thus manages to characterize fine-grained movement. A motion mask is simultaneously derived from the predicted flow maps to present holistic motion dynamics. To pursue natural motion control, MotionPro further strengthens video denoising with additional conditions of region-wise trajectory and motion mask in a feature modulation manner. More remarkably, we meticulously construct a benchmark, i.e., MC-Bench, with 1.1K user-annotated …
Poster
Tianyi Zhu · Dongwei Ren · Qilong Wang · Xiaohe Wu · Wangmeng Zuo

[ ExHall D ]

Abstract
Generative inbetweening aims to generate intermediate frame sequences by utilizing two key frames as input. Although remarkable progress has been made in video generation models, generative inbetweening still faces challenges in maintaining temporal stability due to the ambiguous interpolation path between two key frames. This issue becomes particularly severe when there is a large motion gap between input frames. In this paper, we propose a straightforward yet highly effective Frame-wise Conditions-driven Video Generation (FCVG) method that significantly enhances the temporal stability of interpolated video frames. Specifically, our FCVG provides an explicit condition for each frame, making it much easier to identify the interpolation path between two input frames and thus ensuring temporally stable production of visually plausible video frames. To achieve this, we suggest extracting matched lines from two input frames that can then be easily interpolated frame by frame, serving as frame-wise conditions seamlessly integrated into existing video generation models. In extensive evaluations covering diverse scenarios such as natural landscapes, complex human poses, camera movements and animations, existing methods often exhibit incoherent transitions across frames. In contrast, our FCVG demonstrates the capability to generate temporally stable videos using both linear and non-linear interpolation curves. The source code will be …
Poster
Jiangtong Tan · Hu Yu · Jie Huang · Jie Xiao · Feng Zhao

[ ExHall D ]

Abstract
Long video generation involves generating extended videos using models trained on short videos, suffering from distribution shifts due to varying frame counts. It necessitates the use of local information from the original short frames to enhance visual and motion quality, and global information from the entire long frames to ensure appearance consistency. Existing training-free methods struggle to effectively integrate the benefits of both, as appearance and motion in videos are closely coupled, leading to inconsistency and poor quality. In this paper, we reveal that global and local information can be precisely decoupled into consistent appearance and motion intensity information by applying Principal Component Analysis (PCA), allowing for refined complementary integration of global consistency and local quality. With this insight, we propose FreePCA, a training-free long video generation paradigm based on PCA that simultaneously achieves high consistency and quality. Concretely, we decouple consistent appearance and motion intensity features by measuring cosine similarity in the principal component space. Critically, we progressively integrate these features to preserve original quality and ensure smooth transitions, while further enhancing consistency by reusing the mean statistics of the initial noise. Experiments demonstrate that FreePCA can be applied to various video diffusion models without requiring training, leading to …
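The sketch below illustrates the general idea of decoupling two feature sets in a principal-component space via cosine similarity and recombining them; the SVD-based PCA, the similarity threshold, and the blending rule are assumptions for illustration, not FreePCA's exact procedure.

```python
# Illustrative PCA-space blending of "global" and "local" features (not the paper's exact method).
import numpy as np

def pca_blend(global_feats: np.ndarray, local_feats: np.ndarray,
              sim_thresh: float = 0.8) -> np.ndarray:
    """global_feats, local_feats: (N, D) feature matrices over the same tokens."""
    mean = local_feats.mean(axis=0, keepdims=True)
    # Principal directions of the local features via SVD.
    _, _, Vt = np.linalg.svd(local_feats - mean, full_matrices=False)   # (K, D)
    g_proj = (global_feats - mean) @ Vt.T        # coordinates in PC space, (N, K)
    l_proj = (local_feats - mean) @ Vt.T
    # Per-component cosine similarity between the two projections.
    num = (g_proj * l_proj).sum(axis=0)
    den = np.linalg.norm(g_proj, axis=0) * np.linalg.norm(l_proj, axis=0) + 1e-8
    cos = num / den
    # Components the two sources agree on -> take from the global branch (consistency);
    # the remaining components -> keep the local branch (quality / motion detail).
    blended = np.where(cos[None, :] > sim_thresh, g_proj, l_proj)
    return blended @ Vt + mean                   # back to feature space

g = np.random.randn(64, 128)
l = g + 0.1 * np.random.randn(64, 128)
print(pca_blend(g, l).shape)
```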
Poster
Qingtao Yu · Jaskirat Singh · Zhaoyuan Yang · Peter Henry Tu · Jing Zhang · Richard Hartley · Hongdong Li · Dylan Campbell

[ ExHall D ]

Abstract
Diffusion models indirectly estimate the probability density over a data space, which can be used to study its structure. In this work, we show that geodesics can be computed in diffusion latent space, where the norm induced by the spatially-varying inner product is inversely proportional to the probability density. In this formulation, a path that traverses a high density (that is, probable) region of image latent space is shorter than the equivalent path through a low density region. We present algorithms for solving the associated initial and boundary value problems and show how to compute the probability density along the path and the geodesic distance between two points. Using these techniques, we analyze how closely video clips approximate geodesics in a pre-trained image diffusion space. Finally, we demonstrate how these techniques can be applied to training-free image sequence interpolation and extrapolation, given a pre-trained image diffusion model.
Poster
Alper Kayabasi · Anil Kumar Vadathya · Guha Balakrishnan · Vishwanath Saragadam

[ ExHall D ]

Abstract
We propose a new continuous video modeling framework based on implicit neural representations (INRs) called ActINR. At the core of our approach is the observation that INRs can be considered as a learnable dictionary, with the shapes of the basis functions governed by the weights of the INR, and their locations governed by the biases. Given compact non-linear activation functions, we hypothesize that an INR's biases are suitable to capture motion across images, and facilitate compact representations for video sequences. Using these observations, we design ActINR to share INR weights across frames of a video sequence, while using unique biases for each frame. We further model the biases as the output of a separate INR conditioned on time index to promote smoothness. By training the video INR and this bias INR together, we demonstrate unique capabilities, including 10x video slow motion, 4x spatial super resolution along with 2x slow motion, denoising, and video inpainting. ActINR performs remarkably well across numerous video processing tasks (often achieving more than 6dB improvement), setting a new standard for continuous modeling of videos.
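A minimal sketch of an INR that shares its weights across frames while drawing per-frame biases from a small time-conditioned network is shown below; the layer sizes, sine activation, and bias-network design are illustrative choices rather than the exact ActINR architecture.

```python
# Sketch: shared-weight INR with per-frame biases produced by a time-conditioned "bias INR".
import torch
import torch.nn as nn

class SharedWeightINR(nn.Module):
    def __init__(self, hidden: int = 64, layers: int = 3):
        super().__init__()
        dims = [2] + [hidden] * layers + [3]               # (x, y) coordinates -> RGB
        self.weights = nn.ParameterList(
            [nn.Parameter(0.1 * torch.randn(o, i)) for i, o in zip(dims[:-1], dims[1:])]
        )
        self.splits = dims[1:]                             # bias sizes per layer
        # Bias INR: maps a scalar time index to all per-layer biases of the main INR.
        self.bias_net = nn.Sequential(nn.Linear(1, 64), nn.ReLU(),
                                      nn.Linear(64, sum(self.splits)))

    def forward(self, coords: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        """coords: (N, 2) pixel coordinates, t: (1,) normalized frame index."""
        biases = torch.split(self.bias_net(t.view(1, 1)).squeeze(0), self.splits)
        h = coords
        for i, (W, b) in enumerate(zip(self.weights, biases)):
            h = torch.nn.functional.linear(h, W, b)
            if i < len(self.weights) - 1:
                h = torch.sin(30.0 * h)                    # SIREN-style activation
        return h

model = SharedWeightINR()
rgb = model(torch.rand(1024, 2) * 2 - 1, torch.tensor([0.25]))
print(rgb.shape)                                           # torch.Size([1024, 3])
```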
Poster
Eunjin Kim · HYEONJIN KIM · Kyong Hwan Jin · Jaejun Yoo

[ ExHall D ]

Abstract
Enhancing low-resolution, low-frame-rate videos to high-resolution, high-frame-rate quality is essential for a seamless user experience, motivating advancements in Continuous Spatial-Temporal Video Super Resolution (C-STVSR). While prior methods employ Implicit Neural Representation (INR) for continuous encoding, they often struggle to capture the complexity of video data, relying on simple coordinate concatenation and a pre-trained optical flow network for motion representation. Interestingly, we find that adding position encoding, contrary to common observations, does not improve, and can even degrade, performance. This issue becomes particularly pronounced when combined with pre-trained optical flow networks, which can limit the model’s flexibility. To address these issues, we propose BF-STVSR, a C-STVSR framework with two key modules tailored to better represent the spatial and temporal characteristics of video: 1) a B-spline Mapper for smooth temporal interpolation, and 2) a Fourier Mapper for capturing dominant spatial frequencies. Our approach achieves state-of-the-art PSNR and SSIM performance, showing enhanced spatial details and natural temporal consistency. Our code will be available soon.
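For intuition, the sketch below shows a standard random Fourier feature mapping of spatial coordinates, the kind of building block a Fourier Mapper could rely on; the Gaussian frequency sampling and scale are generic Fourier-feature choices, not the paper's specific design.

```python
# Generic random Fourier feature mapping for coordinates (illustrative building block).
import torch

class FourierMapper(torch.nn.Module):
    def __init__(self, in_dim: int = 2, n_freqs: int = 64, scale: float = 10.0):
        super().__init__()
        # Random frequencies are fixed after initialization (not trained here).
        self.register_buffer("B", torch.randn(in_dim, n_freqs) * scale)

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        """coords: (N, in_dim) normalized coordinates -> (N, 2 * n_freqs) features."""
        proj = 2 * torch.pi * coords @ self.B
        return torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)

mapper = FourierMapper()
feats = mapper(torch.rand(1024, 2))
print(feats.shape)                               # torch.Size([1024, 128])
```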
Poster
Chun Zhang · Heming Sun · Jiro Katto

[ ExHall D ]

Abstract
Learned Video Compression (LVC) aims to reduce redundancy in sequential data through deep learning approaches. Recent advances have significantly boosted LVC performance by shifting compression operations to the feature domain, often combining a Motion Estimation and Motion Compensation module (MEMC) with CNN-based context extraction. However, reliance on motions and convolution-driven context models limits generalizability and global perception. To address these issues, we propose a Feature-level Attention (FLA) module within a Transformer-based framework that perceives the full frame explicitly, thus bypassing confined motion signatures. FLA accomplishes global perception by converting high-level local patch embeddings to one-dimensional batch-wise vectors and replacing traditional attention weights with a global context matrix. Within this design, a dense overlapping patcher (DP) is introduced to retain local features before embedding projection. Furthermore, a Transformer-CNN mixed encoder is applied to alleviate the spatial feature bottleneck without expanding the latent size. Experiments demonstrate excellent generalizability with universally efficient redundancy reduction in different scenarios. Extensive tests on four video compression datasets show that our method achieves state-of-the-art Rate-Distortion performance compared to existing LVC methods and traditional codecs. A down-scaled version of our model reduces computation overhead by a large margin while maintaining strong performance.
Poster
Lei Ke · Haohang Xu · Xuefei Ning · Yu Li · Jiajun Li · Haoling Li · Yuxuan Lin · Dongsheng Jiang · Yujiu Yang · Linfeng Zhang

[ ExHall D ]

Abstract
Diffusion models have achieved significant progress in both image and video generation while still suffering from huge computation costs. As an effective solution, flow matching aims to reflow the diffusion process of diffusion models into a straight line for few-step and even one-step generation. However, in this paper, we suggest that the original training pipeline of flow matching is not optimal and introduce two techniques to improve it. Firstly, we introduce progressive reflow, which progressively reflows the diffusion models in local timesteps until the whole diffusion process is reflowed, reducing the difficulty of flow matching. Second, we introduce aligned v-prediction, which highlights the importance of direction matching in flow matching over magnitude matching. Our experimental results on SDv1.5 demonstrate that our method achieves an FID of 10.70 on the MSCOCO2014 validation set with only 4 sampling steps, close to our teacher model (32 DDIM steps, FID = 10.05). Our codes will be released at Github.
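To illustrate the emphasis on direction over magnitude, the sketch below combines a cosine-similarity term with a lightly weighted norm term in a v-prediction loss; this particular split and its weights are one plausible reading of aligned v-prediction, not the paper's exact objective.

```python
# Illustrative v-prediction loss weighting direction matching above magnitude matching.
import torch
import torch.nn.functional as F

def aligned_v_loss(v_pred: torch.Tensor, v_target: torch.Tensor,
                   dir_weight: float = 1.0, mag_weight: float = 0.1) -> torch.Tensor:
    """v_pred, v_target: (B, C, H, W) predicted and target velocities."""
    b = v_pred.shape[0]
    p = v_pred.reshape(b, -1)
    t = v_target.reshape(b, -1)
    # Direction term: 1 - cosine similarity between predicted and target velocity.
    direction = 1.0 - F.cosine_similarity(p, t, dim=1)
    # Magnitude term: difference of norms, kept with a small weight.
    magnitude = (p.norm(dim=1) - t.norm(dim=1)).abs()
    return (dir_weight * direction + mag_weight * magnitude).mean()

loss = aligned_v_loss(torch.randn(4, 4, 32, 32), torch.randn(4, 4, 32, 32))
print(loss.item())
```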
Poster
Yudong Mao · Hao Luo · Zhiwei Zhong · Peilin CHEN · Zhijiang Zhang · Shiqi Wang

[ ExHall D ]

Abstract
Unlike modern native digital videos, the restoration of old films requires addressing specific degradations inherent to analog sources. However, existing specialized methods still fall short compared to general video restoration techniques. In this work, we propose a new baseline to re-examine the challenges in old film restoration. First, we develop an improved Mamba-based framework, dubbed MambaOFR, which can dynamically adjust the degradation removal patterns by generating degradation-aware prompts to tackle the complex and composite degradations present in old films. Second, we introduce a flow-guided mask deformable alignment module to mitigate the propagation of structured defect features in the temporal domain. Third, we introduce the first benchmark dataset that includes both synthetic and real-world old film clips. Extensive experiments show that the proposed method achieves state-of-the-art performance, outperforming existing advanced approaches in old film restoration. The implementation and model will be released.
Poster
Rohit Kundu · Hao Xiong · Vishal Mohanty · Athula Balachandran · Amit K. Roy-Chowdhury

[ ExHall D ]

Abstract
Existing DeepFake detection techniques primarily focus on facial manipulations, such as face-swapping or lip-syncing. However, advancements in text-to-video (T2V) and image-to-video (I2V) generative models now allow fully AI-generated synthetic content and seamless background alterations, challenging face-centric detection methods and demanding more versatile approaches. To address this, we introduce the Universal Network for Identifying Tampered and Engineered videos (UNITE) model, which, unlike traditional detectors, captures full-frame manipulations. UNITE extends detection capabilities to scenarios without faces, non-human subjects, and complex background modifications. It leverages a transformer-based architecture that processes domain-agnostic features extracted from videos via the SigLIP-So400M foundation model. Given limited datasets encompassing both facial/background alterations and T2V/I2V content, we integrate task-irrelevant data alongside standard DeepFake datasets in training. We further mitigate the model’s tendency to over-focus on faces by incorporating an attention-diversity (AD) loss, which promotes diverse spatial attention across video frames. Combining AD loss with cross-entropy improves detection performance across varied contexts. Comparative evaluations demonstrate that UNITE outperforms state-of-the-art detectors on datasets (in cross-data settings) featuring face/background manipulations and fully synthetic T2V/I2V videos, showcasing its adaptability and generalizable detection capabilities.
Poster
Duosheng Chen · Shihao Zhou · Jinshan Pan · Jinglei Shi · Lishen Qu · Jufeng Yang

[ ExHall D ]

Abstract
Effectively leveraging motion information is crucial for the image deblurring task. Existing methods typically build deep-learning models to restore a clean image by estimating blur patterns over the entire movement. This suggests that the blur caused by rotational motion components is processed together with the translational one. Exploring the movement without separation leads to limited performance for complex motion deblurring, especially rotational motion. In this paper, we propose Motion Decomposition Transformer (MDT), a transformer-based architecture augmented with polarized modules for deblurring via motion vector decomposition. MDT consists of a Motion Decomposition Module (MDM) for extracting hybrid rotation and translation features, and a Radial Stripe Attention Solver (RSAS) for sharp image reconstruction with enhanced rotational information. Specifically, the MDM uses a deformable Cartesian convolutional branch to capture translational motion, complemented by a polar-system branch to capture rotational motion. The RSAS employs radial stripe windows and angular relative positional encoding in the polar system to enhance rotational information. This design preserves translational details while keeping computational costs lower than a dual-coordinate design. Experimental results on 6 image deblurring datasets show that MDT outperforms state-of-the-art methods, particularly in handling blur caused by complex motions with significant rotational components.
Poster
Siwei Tu · Ben Fei · Weidong Yang · Fenghua Ling · Hao Chen · Zili Liu · Kun Chen · Hang Fan · Wanli Ouyang · Lei Bai

[ ExHall D ]

Abstract
Accurate acquisition of surface meteorological conditions at arbitrary locations holds significant importance for weather forecasting and climate simulation. Due to the fact that meteorological states derived from satellite observations are often provided in the form of low-resolution grid fields, the direct application of spatial interpolation to obtain meteorological states for specific locations often results in significant discrepancies when compared to actual observations. Existing downscaling methods for acquiring meteorological state information at higher resolutions commonly overlook the correlation with satellite observations. To bridge the gap, we propose Satellite-observations Guided Diffusion Model (SGD), a conditional diffusion model pre-trained on ERA5 reanalysis data with satellite observations (GridSat) as conditions, which is employed for sampling downscaled meteorological states through a zero-shot guided sampling strategy and patch-based methods. During the training process, we propose to fuse the information from GridSat satellite observations into ERA5 maps via the attention mechanism, enabling SGD to generate atmospheric states that align more accurately with actual conditions. In the sampling, we employed optimizable convolutional kernels to simulate the upscale process, thereby generating high-resolution ERA5 maps using low-resolution ERA5 maps as well as observations from weather stations as guidance. Moreover, our devised patch-based method promotes SGD to generate meteorological states at …
Poster
Zhuoran Du · Shaodi You · Cheng Cheng · Shikui Wei

[ ExHall D ]

Abstract
Hyperspectral image (HSI) densely samples the world in both the space and frequency domain and therefore is more distinctive than RGB images. Usually, HSI needs to be calibrated to minimize the impact of various illumination conditions. The traditional way to calibrate HSI utilizes a physical reference, which involves manual operations, occlusions, and/or limits camera mobility. These limitations inspire this paper to automatically calibrate HSIs using a learning-based method. Towards this goal, a large-scale HSI calibration dataset is created, which has 765 high-quality HSI pairs covering diversified natural scenes and illuminations. The dataset is further expanded to 7650 pairs by combining with 10 different physically measured illuminations. A spectral illumination transformer (SIT) together with an illumination attention module is proposed. Extensive benchmarks demonstrate the SoTA performance of the proposed SIT. The benchmarks also indicate that low-light conditions are more challenging than normal conditions. The dataset and codes are anonymously available online: https://anonymous.4open.science/r/Automatic-spectral-calibration-of-HSI-0C5A
Poster
Dabing Yu · Zheng Gao

[ ExHall D ]

Abstract
Capitalizing on the strength of self-attention in capturing non-local features, Transformer architectures have exhibited remarkable performance in single hyperspectral image restoration. For hyperspectral images, each pixel lies in a hyperspectral image cube with a large spectral dimension and two spatial dimensions. Although uni-dimensional self-attention, such as channel self-attention or spatial self-attention, builds long-range dependencies in the spectral or spatial dimension, it lacks more comprehensive interactions across dimensions. To tackle this drawback, we propose VolFormer, a volumetric self-attention embedded Transformer network for single hyperspectral image restoration. Specifically, we propose volumetric self-attention (VolSA), which extends the interaction from a 2D plane to a 3D cube. VolSA can simultaneously model token interactions in the 3D cube, mining the potential correlations within the hyperspectral image cube. An attention decomposition form is proposed to reduce the computational burden of modeling volumetric information. In practical terms, VolSA adopts dual similarity matrices in the spatial and channel dimensions to implicitly model 3D context information while reducing the complexity from cubic to quadratic. Additionally, we introduce an explicit spectral location prior to enhance the proposed self-attention. This property allows the target token to perceive global spectral information while simultaneously assigning different levels of attention to tokens at varying wavelength bands. …
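The sketch below shows one way volumetric attention could be decomposed into a channel-channel and a spatial-spatial similarity matrix applied in sequence, keeping the cost quadratic in the two dimensions; the ordering and scaling are illustrative assumptions, not VolFormer's exact decomposition.

```python
# Illustrative decomposition of volumetric attention into two 2D similarity matrices.
import torch
import torch.nn.functional as F

def decomposed_volumetric_attention(x: torch.Tensor) -> torch.Tensor:
    """x: (B, C, N) hyperspectral features with C bands and N spatial tokens."""
    B, C, N = x.shape
    # Channel-channel similarity (B, C, C) and spatial-spatial similarity (B, N, N).
    chan_sim = F.softmax(torch.bmm(x, x.transpose(1, 2)) / N ** 0.5, dim=-1)
    spat_sim = F.softmax(torch.bmm(x.transpose(1, 2), x) / C ** 0.5, dim=-1)
    # Applying both similarities mixes information across the 3D cube implicitly,
    # at quadratic rather than cubic cost in (C, N).
    return torch.bmm(torch.bmm(chan_sim, x), spat_sim)

x = torch.randn(2, 31, 64 * 64)
print(decomposed_volumetric_attention(x).shape)   # torch.Size([2, 31, 4096])
```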
Poster
Chunyang Cheng · Tianyang Xu · Zhenhua Feng · Xiaojun Wu · Zhangyong Tang · Hui Li · Zhang Zeyang · Sara Atito · Muhammad Awais · Josef Kittler

[ ExHall D ]

Abstract
Advanced image fusion methods mostly prioritise high-level tasks, where task interaction struggles with semantic gaps, requiring complex bridging mechanisms. In contrast, we propose to leverage low-level vision tasks from digital photography fusion, allowing for effective feature interaction through pixel-level supervision. This new paradigm provides strong guidance for unsupervised multimodal fusion without relying on abstract semantics, enhancing task-shared feature learning for broader applicability. Owing to the hybrid image features and enhanced universal representations, the proposed GIFNet supports diverse fusion tasks, achieving high performance across both seen and unseen scenarios with a single model. Uniquely, experimental results reveal that our framework also supports single-modality enhancement, offering superior flexibility for practical applications. Our code will be released.
Poster
Xin Lu · Jie Xiao · Yurui Zhu · Xueyang Fu

[ ExHall D ]

Abstract
All-in-one models for adverse weather removal aim to process various degraded images using a single set of parameters, making them ideal for real-world scenarios. However, they encounter two main challenges: catastrophic forgetting and limited degradation awareness. The former causes the model to lose knowledge of previously learned scenarios, reducing its overall effectiveness, while the latter hampers the model’s ability to accurately identify and respond to specific types of degradation, limiting its performance across diverse adverse weather conditions. To address these issues, we introduce the Incremental Learning Adverse Weather Removal (ILAWR) framework, which uses a novel degradation-aware distillation strategy for continuous weather removal. Specifically, we first design a degradation-aware module that utilizes Fourier priors to capture a broad range of degradation features, effectively mitigating catastrophic forgetting in low-level visual tasks. Then, we implement multilateral distillation, which combines knowledge from multiple teacher models using an importance-guided aggregation approach. This enables the model to balance adaptation to new degradation types with the preservation of background details. Extensive experiments on both synthetic and real-world datasets confirm that ILAWR outperforms existing models across multiple benchmarks, proving its effectiveness in continuous adverse weather removal.
Poster
Hang Guo · Yong Guo · Yaohua Zha · Yulun Zhang · Wenbo Li · Tao Dai · Shu-Tao Xia · Yawei Li

[ ExHall D ]

Abstract
The Mamba-based image restoration backbones have recently demonstrated significant potential in balancing global reception and computational efficiency. However, the inherent causal modeling limitation of Mamba, where each token depends solely on its predecessors in the scanned sequence, restricts the full utilization of pixels across the image and thus presents new challenges in image restoration. In this work, we propose MambaIRv2, which equips Mamba with a non-causal modeling ability similar to ViTs to reach an attentive state space restoration model. Specifically, the proposed attentive state-space equation allows attending beyond the scanned sequence and facilitates image unfolding with just a single scan. Moreover, we further introduce a semantic-guided neighboring mechanism to encourage interaction between distant but similar pixels. Extensive experiments show that our MambaIRv2 outperforms SRFormer by up to 0.35dB PSNR on lightweight SR with 9.3% fewer parameters, and surpasses HAT on classic SR by up to 0.29dB.
Poster
Kun Zhou · Xinyu Lin · Jiangbo Lu

[ ExHall D ]

Abstract
Recently, Mamba-based frameworks have achieved substantial advancements across diverse computer vision and NLP tasks, particularly in their capacity for reasoning over long-range information with linear complexity. However, the fixed 2D-to-1D scanning pattern overlooks the local structures of an image, limiting its effectiveness in aggregating 2D spatial information. While stacking additional Mamba layers can partially address this issue, it increases the parameter count and constrains real-time application. In this work, we reconsider the locally optimal scanning path in Mamba, enhancing the rigid and uniform 1D scan through local shortest-path theory, thus creating a structure-aware Mamba suited for lightweight single-image super-resolution. Specifically, we draw inspiration from the Traveling Salesman Problem (TSP) to establish a locally optimal scanning path for improved use of 2D structural information. Here, local patch aggregation occurs in a content-adaptive manner with minimal propagation cost. TSP-Mamba demonstrates substantial improvements over existing Mamba-based and Transformer-based architectures. For example, TSP-Mamba surpasses MambaIR by up to 0.7dB in lightweight SISR, with comparable parameters and only slightly higher computational demands (1-2 GFLOPs for 720P images).
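To make the TSP framing concrete, here is a minimal Python sketch of a content-adaptive scan order built with a greedy nearest-neighbour heuristic (a cheap TSP approximation) over per-pixel features of a local patch; the feature distance and function names are assumptions, not the paper's actual solver.

```python
import numpy as np

def greedy_scan_order(feats: np.ndarray) -> np.ndarray:
    """Approximate a shortest scan path over the pixels of one local patch.

    feats: (N, C) array of per-pixel features (N = patch_h * patch_w).
    Returns a permutation of [0, N) visiting each pixel exactly once.
    """
    n = feats.shape[0]
    visited = np.zeros(n, dtype=bool)
    order = [0]                      # start the path at the first pixel
    visited[0] = True
    for _ in range(n - 1):
        cur = order[-1]
        # feature-space distance from the current pixel to all others
        d = np.linalg.norm(feats - feats[cur], axis=1)
        d[visited] = np.inf          # never revisit a pixel
        nxt = int(np.argmin(d))
        order.append(nxt)
        visited[nxt] = True
    return np.array(order)

# toy usage: a 4x4 patch with 8-dim features
patch_feats = np.random.randn(16, 8)
order = greedy_scan_order(patch_feats)
tokens_in_scan_order = patch_feats[order]   # fed to the 1D state-space model
```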
Poster
Shangquan Sun · Wenqi Ren · Juxiang Zhou · Shu Wang · Jianhou Gan · Xiaochun Cao

[ ExHall D ]

Abstract
Significant progress has been made in video restoration under rainy conditions over the past decade, largely propelled by advancements in deep learning. Nevertheless, existing methods that depend on paired data struggle to generalize effectively to real-world scenarios, primarily due to the disparity between synthetic and authentic rain effects. To address these limitations, we propose a dual-branch spatio-temporal state-space model to enhance rain streak removal in video sequences. Specifically, we design spatial and temporal state-space model layers to extract spatial features and incorporate temporal dependencies across frames, respectively. To improve multi-frame feature fusion, we derive a dynamic stacking filter, which adaptively approximates statistical filters for superior pixel-wise feature refinement. Moreover, we integrate a median stacking loss to enable semi-supervised learning by generating pseudo-clean patches based on the sparsity prior of rain. To further explore the capacity of deraining models in supporting other vision-based tasks in rainy environments, we introduce a novel real-world benchmark focused on object detection and tracking in rainy conditions. Our method is extensively evaluated across multiple benchmarks containing numerous synthetic and real-world rainy videos, consistently demonstrating its superiority in quantitative metrics, visual quality, efficiency, and its utility for downstream tasks. Our code will be made publicly available.
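The median stacking idea can be sketched directly: because rain streaks are sparse and rarely occlude the same pixel in every frame, a temporal median over aligned frames approximates the clean background and can serve as a pseudo-label. A minimal sketch, assuming the frames are already aligned and using an L1 penalty; names are illustrative.

```python
import torch
import torch.nn.functional as F

def median_pseudo_clean(frames: torch.Tensor) -> torch.Tensor:
    """frames: (T, C, H, W) stack of aligned rainy frames.
    Rain streaks are sparse, so the temporal median approximates the clean background."""
    return frames.median(dim=0).values

def median_stacking_loss(pred: torch.Tensor, frames: torch.Tensor) -> torch.Tensor:
    """pred: (C, H, W) derained output for the centre frame."""
    pseudo_clean = median_pseudo_clean(frames)
    return F.l1_loss(pred, pseudo_clean)

# usage with a random 5-frame clip
clip = torch.rand(5, 3, 64, 64)
pred = torch.rand(3, 64, 64)
loss = median_stacking_loss(pred, clip)
```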
Poster
Sudarshan Rajagopalan · Nithin Gopalakrishnan Nair · Jay Paranjape · Vishal M. Patel

[ ExHall D ]

Abstract
Deep learning–based models for All-In-One image Restoration (AIOR) have achieved significant advancements in recent years. However, their practical applicability is limited by poor generalization to samples outside the training distribution. This limitation arises primarily from insufficient diversity in degradation variations and scenes within existing datasets, resulting in inadequate representations of real-world scenarios. Additionally, capturing large-scale real-world paired data for degradations such as haze, low-light, and raindrops is often cumbersome and sometimes infeasible. In this paper, we leverage the generative capabilities of latent diffusion models to synthesize high-quality degraded images from their clean counterparts. Specifically, we introduce GenDeg, a degradation and intensity-aware conditional diffusion model, capable of producing diverse degradation patterns on clean images. Using GenDeg, we synthesize over 550k samples across six degradation types: haze, rain, snow, motion blur, low-light, and raindrops. These generated samples are integrated with existing datasets to form the GenDS dataset, comprising over 750k samples. Our experiments reveal that image restoration models trained on the GenDS dataset exhibit significant improvements in out-of-distribution performance compared to when trained solely on existing datasets. Furthermore, we provide comprehensive analyses of the implications of diffusion model-based synthetic degradations for AIOR. The code will be made publicly available.
Poster
Brayan Monroy · Jorge Bacca · Julián Tachella

[ ExHall D ]

Abstract
Recorrupted-to-Recorrupted (R2R) has emerged as a methodology for training deep networks for image restoration in a self-supervised manner from noisy measurement data alone, demonstrating equivalence in expectation to the supervised squared loss in the case of Gaussian noise. However, its effectiveness with non-Gaussian noise remains unexplored. In this paper, we propose Generalized R2R (GR2R), extending the R2R framework to a broader class of noise distributions, covering additive noise such as log-Rayleigh as well as the natural exponential family, including Poisson, Gamma, and Binomial distributions, which play a key role in many applications such as low-photon imaging and synthetic aperture radar. We show that the GR2R loss is an unbiased estimator of the supervised loss and that the popular Stein's unbiased risk estimator can be seen as a special case. A series of experiments with Gaussian, Poisson, and Gamma noise validate GR2R's performance, showing its effectiveness compared to other self-supervised methods.
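For reference, the Gaussian special case that GR2R generalizes can be written in a few lines: a noisy image is recorrupted into an input/target pair whose squared loss matches the supervised loss in expectation. A minimal sketch, assuming a known noise level sigma and a recorruption strength alpha; variable names are illustrative.

```python
import torch

def r2r_pair(y: torch.Tensor, sigma: float, alpha: float = 0.5):
    """Recorrupted-to-Recorrupted pair for Gaussian noise.

    y     : noisy measurement, y = x + n with n ~ N(0, sigma^2 I)
    alpha : recorruption strength (a hyper-parameter)
    Returns (y_in, y_tgt): network input and training target.
    """
    z = sigma * torch.randn_like(y)      # fresh Gaussian noise
    y_in = y + alpha * z                 # further-corrupted input
    y_tgt = y - z / alpha                # statistically independent target
    return y_in, y_tgt

def r2r_loss(denoiser, y, sigma, alpha=0.5):
    y_in, y_tgt = r2r_pair(y, sigma, alpha)
    return ((denoiser(y_in) - y_tgt) ** 2).mean()

# usage: any denoiser module and a batch of noisy images
# loss = r2r_loss(model, noisy_batch, sigma=0.1)
```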
Poster
Xiangpeng Tian · Xiangyu Liao · Xiao Liu · Meng Li · Chao Ren

[ ExHall D ]

Abstract
All-in-one image restoration aims to recover clear images from various degradation types and levels with a unified model. Nonetheless, the significant variations among degradation types present challenges for training a universal model, often resulting in task interference, where the gradient update directions of different tasks may diverge due to shared parameters. To address this issue, motivated by the routing strategy, we propose DFPIR, a novel all-in-one image restorer that introduces Degradation-aware Feature Perturbations (DFP) to adjust the feature space to align with the unified parameter space. In this paper, the feature perturbations primarily include channel-wise perturbations and attention-wise perturbations. Specifically, channel-wise perturbations are implemented by shuffling the channels in high-dimensional space guided by degradation types, while attention-wise perturbations are achieved through selective masking in the attention space. To achieve these goals, we propose a Degradation-Guided Perturbation Block (DGPB) to implement these two functions, positioned between the encoding and decoding stages of the encoder-decoder architecture. Extensive experimental results demonstrate that DFPIR achieves state-of-the-art performance on several all-in-one image restoration tasks including image denoising, image dehazing, image deraining, motion deblurring, and low-light image enhancement. All the source code and trained models will be made available to the public.
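A minimal sketch of a degradation-guided channel-wise perturbation: the degradation type seeds a fixed channel permutation that is applied to the features. The seeding scheme and where such a block would sit in the network are assumptions for illustration only.

```python
import torch

def degradation_channel_shuffle(feat: torch.Tensor, degradation_id: int) -> torch.Tensor:
    """feat: (B, C, H, W) features; degradation_id: integer task label
    (e.g. 0 = noise, 1 = haze, 2 = rain, ...).
    Each degradation type gets its own fixed, reproducible channel permutation."""
    C = feat.shape[1]
    g = torch.Generator().manual_seed(degradation_id)   # deterministic per task
    perm = torch.randperm(C, generator=g)
    return feat[:, perm]

# usage
x = torch.randn(2, 64, 32, 32)
x_rain = degradation_channel_shuffle(x, degradation_id=2)
```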
Poster
Guanglu Dong · Xiangyu Liao · Mingyang Li · Guihuan Guo · Chao Ren

[ ExHall D ]

Abstract
Generative Adversarial Networks (GANs) have been widely applied to image super-resolution (SR) to enhance the perceptual quality. However, most existing GAN-based SR methods typically perform coarse-grained discrimination directly on images and ignore the semantic information of images, making it challenging for the super-resolution network (SRN) to learn fine-grained and semantic-related texture details. To alleviate this issue, we propose a semantic feature discrimination method, SFD, for perceptual SR. Specifically, we first design a feature discriminator (Feat-D), to discriminate the pixel-wise middle semantic features from CLIP, aligning the feature distributions of SR images with those of high-quality images. Additionally, we propose a text-guided discrimination method (TG-D) by introducing learnable prompt pairs (LPP) in an adversarial manner to perform discrimination on the more abstract output feature of CLIP, further enhancing the discriminative ability of our method. With both Feat-D and TG-D, our SFD can effectively distinguish between the semantic feature distributions of low-quality and high-quality images, encouraging SRN to generate more realistic and semantic-relevant textures. Furthermore, based on the trained Feat-D and LPP, we propose a novel opinion-unaware no-reference image quality assessment (OU NR-IQA) method, SFD-IQA, greatly improving OU NR-IQA performance without any additional targeted training. Extensive experiments on classical SISR, real-world …
Poster
Junyang Chen · Jinshan Pan · Jiangxin Dong

[ ExHall D ]

Abstract
Faithful image super-resolution (SR) not only needs to recover images that appear realistic, similar to image generation tasks, but also requires that the restored images maintain fidelity and structural consistency with the input. To this end, we propose a simple and effective method, named FaithDiff, to fully harness the impressive power of latent diffusion models (LDMs) for faithful image SR. In contrast to existing diffusion-based SR methods that freeze the diffusion model pre-trained on high-quality images, we propose to unleash the diffusion prior to identify useful information and recover faithful structures. As there exists a significant gap between the features of degraded inputs and the noisy latent from the diffusion model, we then develop an effective alignment module to explore useful features from degraded inputs to align well with the diffusion process. Considering the indispensable roles and interplay of the encoder and diffusion model in LDMs, we jointly fine-tune them in a unified optimization framework, facilitating the encoder to extract useful features that coincide with the diffusion process. Extensive experimental results demonstrate that FaithDiff outperforms state-of-the-art methods, providing high-quality and faithful SR results.
Poster
Zhu Liu · Zijun Wang · Jinyuan Liu · Fanqi Meng · Long Ma · Risheng Liu

[ ExHall D ]

Abstract
Thermal imaging is often compromised by dynamic, complex degradations caused by hardware limitations and unpredictable environmental factors. The scarcity of high-quality infrared data, coupled with the challenges of dynamic, intricate degradations, makes it difficult to recover details using existing methods. In this paper, we introduce thermal degradation simulation integrated into the training process via a mini-max optimization, by modeling these degraded factors as adversarial attacks on thermal images. The simulation is dynamic to maximize objective functions, thus capturing a broad spectrum of degraded data distributions. This approach enables training with limited data, thereby improving model performance. Additionally, we introduce a dual-interaction network that combines the benefits of spiking neural networks with scale transformation to capture degraded features with sharp spike signal intensities. This architecture ensures compact model parameters while preserving efficient feature representation. Extensive experiments demonstrate that our method not only achieves superior visual quality under diverse single and composite degradations, but also delivers a significant reduction in processing time when trained on only fifty clear images, outperforming existing techniques in efficiency and accuracy.
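The inner maximization of the mini-max formulation can be illustrated with a single gradient-ascent (FGSM-style) step that perturbs a clean thermal image in the direction that most increases the restoration loss; the paper's degradation parameterization is richer, so treat this as an assumption-laden sketch.

```python
import torch

def adversarial_degradation_step(model, clean, loss_fn, eps=4 / 255):
    """One inner-maximization step: perturb clean images in the direction that
    maximizes the restoration loss, simulating a worst-case degradation for
    the subsequent (outer) training step."""
    x = clean.clone().requires_grad_(True)
    loss = loss_fn(model(x), clean)
    grad = torch.autograd.grad(loss, x)[0]
    return (clean + eps * grad.sign()).clamp(0, 1).detach()

# toy usage with a placeholder restoration model
model = torch.nn.Conv2d(1, 1, 3, padding=1)
clean = torch.rand(4, 1, 64, 64)
degraded = adversarial_degradation_step(model, clean, torch.nn.functional.mse_loss)
```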
Poster
Bin Chen · Gehui Li · Rongyuan Wu · Xindong Zhang · Jie Chen · Jian Zhang · Lei Zhang

[ ExHall D ]

Abstract
Real-world image super-resolution (Real-ISR) aims to reconstruct high-resolution images from low-resolution inputs degraded by complex, unknown processes. While many Stable Diffusion (SD)-based Real-ISR methods have achieved remarkable success, their slow, multi-step inference hinders practical deployment. Recent SD-based one-step networks like OSEDiff and S3Diff alleviate this issue but still incur high computational costs due to their reliance on large pretrained SD models. This paper proposes a novel Real-ISR method, **AdcSR**, by distilling the one-step diffusion network OSEDiff into a streamlined diffusion-GAN model under our **A**dversarial **D**iffusion **C**ompression (**ADC**) framework. We meticulously examine the modules of OSEDiff, categorizing them into two types: **(1) Removable** (VAE encoder, prompt extractor, text encoder, *etc.*) and **(2) Prunable** (denoising UNet and VAE decoder). Since direct removal and pruning can degrade the model's generation capability, we pretrain our pruned VAE decoder to restore its ability to decode images and employ adversarial distillation to compensate for performance loss. This ADC-based diffusion-GAN hybrid design effectively reduces complexity by 73% in inference time, 78% in computation, and 74% in parameters, while preserving the model’s generation capability. Experiments show that our proposed AdcSR achieves competitive recovery quality on both synthetic and real-world datasets, offering up to 9.3× speedup over previous one-step …
Poster
Xiaoling Zhou · Zhemg Lee · Wei Ye · Rui Xie · Wenbo Zhang · Guanju Peng · Zongze Li · Shikun Zhang

[ ExHall D ]

Abstract
Image denoising poses a significant challenge in image processing, aiming to remove noise and artifacts from input images. However, current denoising algorithms implemented on electronic chips frequently encounter latency issues and demand substantial computational resources. In this paper, we introduce an all-optical Nonlinear Diffractive Denoising Deep Network (N3DNet) for image denoising at the speed of light. Initially, we incorporate an image encoding and pre-denoising module into the Diffractive Deep Neural Network and integrate a nonlinear activation function, termed the phase exponential linear function, after each diffractive layer, thereby boosting the network's nonlinear modeling and denoising capabilities. Subsequently, we devise a new reinforcement learning algorithm called regularization-assisted deep Q-network to optimize N3DNet. Finally, leveraging 3D printing techniques, we fabricate N3DNet using the trained parameters and construct a physical experimental system for real-world applications. A new benchmark dataset, termed MIDD, is constructed for mode image denoising, comprising 120K pairs of noisy/noise-free images captured from real fiber communication systems across various transmission lengths. Through extensive simulation and real experiments, we validate that N3DNet outperforms both traditional and deep learning-based denoising approaches across various datasets. Remarkably, its processing speed is nearly 3,800 times faster than electronic chip-based methods.
Poster
Bohan Xiao · PEIYONG WANG · Qisheng He · Ming Dong

[ ExHall D ]

Abstract
In this paper, we propose a novel deterministic model based on the Brownian Bridge framework, leveraging Stochastic Differential Equations (SDEs). Our approach is designed to address the limitations of stochasticity present in existing Bridge models. Specifically, we introduce a method where two neural networks are employed: one for predicting the score function and the other for estimating the noise. By doing so, our model ensures a deterministic outcome, which is crucial for tasks requiring consistency and precision, such as super-resolution and medical image reconstruction. Our key contributions are as follows
Poster
Jiawan Li · Fei Zhou · Zhipeng Zhong · Jiongzhi Lin · Guoping Qiu

[ ExHall D ]

Abstract
Hundreds of millions of people routinely take photos using their smartphones as point-and-shoot (PAS) cameras, yet very few have the photography skills to compose a good shot of a scene. While traditional PAS cameras have built-in functions to ensure a photo is well focused and has the right brightness, they cannot tell users how to compose the best shot of a scene. In this paper, we present a first-of-its-kind smart point-and-shoot (SPAS) system to help users take good photos. Our SPAS helps users compose a good shot of a scene by automatically guiding them to adjust the camera pose live on the scene. We first constructed a large dataset containing 320K images with camera pose information from 4000 scenes. We then developed an innovative CLIP-based Composition Quality Assessment (CCQA) model to assign pseudo labels to these images. The CCQA introduces a unique learnable text embedding technique to learn continuous word embeddings capable of discerning subtle visual quality differences in the range covered by five levels of quality description words {bad, poor, fair, good, perfect}. Finally, we developed a camera pose adjustment model (CPAM) which first …
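The learnable quality-word idea can be sketched as five trainable embeddings whose softmax-weighted levels yield an expected quality score for a (frozen) CLIP image feature; the temperature, dimensionality, and score range are assumptions, and the real CCQA model is more elaborate.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QualityWordHead(nn.Module):
    """Learnable embeddings for {bad, poor, fair, good, perfect}; the score is the
    expectation of the five quality levels under a softmax over cosine similarity
    with a CLIP image feature. A sketch, not the paper's CCQA model."""
    def __init__(self, dim=512):
        super().__init__()
        self.word_emb = nn.Parameter(torch.randn(5, dim) * 0.02)
        self.register_buffer("levels", torch.linspace(0.0, 1.0, 5))  # bad ... perfect

    def forward(self, img_feat):                        # img_feat: (B, dim) from CLIP
        img = F.normalize(img_feat, dim=-1)
        txt = F.normalize(self.word_emb, dim=-1)
        probs = (img @ txt.t() / 0.07).softmax(dim=-1)  # (B, 5)
        return probs @ self.levels                      # expected quality in [0, 1]

# usage with a placeholder CLIP image feature
head = QualityWordHead()
score = head(torch.randn(2, 512))
```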
Poster
Tianyu Wang · Jianming Zhang · Haitian Zheng · Zhihong Ding · Scott Cohen · Zhe Lin · Wei Xiong · Chi-Wing Fu · Luis Figueroa · Soo Ye Kim

[ ExHall D ]

Abstract
Shadows are often underconsidered or even ignored in image editing applications, limiting the realism of the edited results. In this paper, we introduce MetaShadow, a three-in-one versatile framework that enables detection, removal, and controllable synthesis of shadows in natural images in an object-centered fashion. MetaShadow combines the strengths of two cooperative components: Shadow Analyzer, for object-centered shadow detection and removal, and Shadow Synthesizer, for reference-based controllable shadow synthesis. Notably, we optimize the learning of the intermediate features from Shadow Analyzer to guide Shadow Synthesizer to generate more realistic shadows that blend seamlessly with the scene. Extensive evaluations on multiple shadow benchmark datasets show significant improvements of MetaShadow over the existing state-of-the-art methods on object-centered shadow detection, removal, and synthesis. MetaShadow excels in supporting image editing tasks such as object removal, relocation, and insertion, pushing the boundaries of object-centered image editing.
Poster
Jing Wu · Trung Le · Munawar Hayat · Mehrtash Harandi

[ ExHall D ]

Abstract
Diffusion models are highly effective at generating high-quality images but pose risks, such as the unintentional generation of NSFW (not safe for work) content. Although various techniques have been proposed to mitigate unwanted influences in diffusion models while preserving overall performance, achieving a balance between these goals remains challenging. In this work, we introduce EraseDiff, an algorithm designed to preserve the utility of the diffusion model on retained data while removing the unwanted information associated with the data to be forgotten. Our approach formulates this task as a constrained optimization problem using the value function, resulting in a natural first-order algorithm for solving the optimization problem. By altering the generative process to deviate away from the ground-truth denoising trajectory, we update parameters for preservation while controlling constraint reduction to ensure effective erasure, striking an optimal trade-off. Extensive experiments and thorough comparisons with state-of-the-art algorithms demonstrate that EraseDiff effectively preserves the model's utility, efficacy, and efficiency.
Poster
Yixing Zhu · Qing Zhang · Yitong Wang · Yongwei Nie · Wei-Shi Zheng

[ ExHall D ]

Abstract
This paper presents EntityErasure, a novel diffusion-based method that can effectively erase entities without inducing unwanted sundries. To this end, we propose to address this problem by dividing it into amodal entity segmentation and completion, such that the region to inpaint takes only entities in the non-inpainting area as reference, avoiding the possibility of generating unpredictable sundries. Moreover, we propose two novel metrics for assessing the quality of object erasure based on entity segmentation, which are shown to be more effective than existing metrics. Experimental results demonstrate that our approach significantly outperforms other state-of-the-art object erasure methods.
Poster
Ji Woo Hong · Tri Ton · Trung X. Pham · Gwanhyeong Koo · Sunjae Yoon · Chang D. Yoo

[ ExHall D ]

Abstract
This paper introduces ITA-MDT, the Image-Timestep-Adaptive Masked Diffusion Transformer Framework for Image-Based Virtual Try-On (IVTON), designed to overcome the limitations of previous approaches by leveraging the Masked Diffusion Transformer (MDT) for improved handling of both global garment context and fine-grained details. The IVTON task involves seamlessly superimposing a garment from one image onto a person in another, creating a realistic depiction of the person wearing the specified garment. Unlike conventional diffusion-based virtual try-on models that depend on large pre-trained U-Net architectures, ITA-MDT leverages a lightweight, scalable transformer-based denoising diffusion model with a mask latent modeling scheme, achieving competitive results while reducing computational overhead. A key component of ITA-MDT is the Image-Timestep Adaptive Feature Aggregator (ITAFA), a dynamic feature aggregator that combines all of the features from the image encoder into a unified feature of the same size, guided by diffusion timestep and garment image complexity. This enables adaptive weighting of features, allowing the model to emphasize either global information or fine-grained details based on the requirements of the denoising stage. Additionally, the Salient Region Extractor (SRE) module is presented to identify complex regions of the garment to provide high-resolution local information to the denoising model as an additional condition alongside the …
Poster
Matheus Souza · Yidan Zheng · Kaizhang Kang · Yogeshwar Nath Mishra · Qiang Fu · Wolfgang Heidrich

[ ExHall D ]

Abstract
Digital imaging systems have traditionally relied on brute-force measurement and processing of pixels arranged on regular grids. In contrast, the human visual system performs significant data reduction from the large number of photoreceptors to the optic nerve, effectively encoding visual information into a low-bandwidth latent space representation optimized for brain processing. Inspired by this, we propose a similar approach to advance artificial vision systems. Latent Space Imaging introduces a new paradigm that combines optics and software to encode image information directly into the semantically rich latent space of a generative model. This approach substantially reduces bandwidth and memory demands during image capture and enables a range of downstream tasks focused on the latent space. We validate this principle through an initial hardware prototype based on a single-pixel camera. By implementing an amplitude modulation scheme that encodes into the generative model's latent space, we achieve compression ratios ranging from 1:100 to 1:1000 during imaging, and up to 1:16384 for downstream applications. This approach leverages the model's intrinsic linear boundaries, demonstrating the potential of latent space imaging for highly efficient imaging hardware, adaptable future applications in high-speed imaging, and task-specific cameras with significantly reduced hardware complexity.
Poster
Lei Chen · Yuan Meng · Chen Tang · Xinzhu Ma · Jingyan Jiang · Xin Wang · Zhi Wang · Wenwu Zhu

[ ExHall D ]

Abstract
Recent advancements in diffusion models, particularly the architectural transformation from UNet-based models to Diffusion Transformers (DiTs), significantly improve the quality and scalability of image and video generation. However, despite their impressive capabilities, the substantial computational costs of these large-scale models pose significant challenges for real-world deployment. Post-Training Quantization (PTQ) emerges as a promising solution, enabling model compression and accelerated inference for pretrained models, without the costly retraining. However, research on DiT quantization remains sparse, and existing PTQ frameworks, primarily designed for traditional diffusion models, tend to suffer from biased quantization, leading to notable performance degradation. In this work, we identify that DiTs typically exhibit significant spatial variance in both weights and activations, along with temporal variance in activations. To address these issues, we propose Q-DiT, a novel approach that seamlessly integrates two key techniques: automatic quantization granularity allocation to handle the significant variance of weights and activations across input channels, and sample-wise dynamic activation quantization to adaptively capture activation changes across both timesteps and samples. Extensive experiments conducted on ImageNet and VBench demonstrate the effectiveness of the proposed Q-DiT. Specifically, when quantizing DiT-XL/2 to W6A8 on ImageNet (256×256), Q-DiT achieves a remarkable reduction in FID by 1.09 compared …
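A minimal sketch of per-group (input-channel) fake quantization, the kind of finer-grained granularity the abstract refers to; the bit-width, group size, and asymmetric scheme here are assumptions, and Q-DiT's automatic allocation and dynamic activation quantization are not reproduced.

```python
import torch

def quantize_groupwise(w: torch.Tensor, n_bits: int = 6, group_size: int = 64):
    """Fake-quantize a weight matrix w of shape (out_features, in_features)
    with one (scale, zero-point) pair per contiguous group of input channels."""
    qmax = 2 ** n_bits - 1
    out_f, in_f = w.shape
    assert in_f % group_size == 0
    wg = w.reshape(out_f, in_f // group_size, group_size)
    w_min = wg.amin(dim=-1, keepdim=True)
    w_max = wg.amax(dim=-1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-8) / qmax
    zero = (-w_min / scale).round()
    q = torch.clamp((wg / scale).round() + zero, 0, qmax)
    deq = (q - zero) * scale                     # dequantized ("fake-quant") weights
    return deq.reshape(out_f, in_f)

# usage: quantize a DiT linear layer's weight to 6 bits
w = torch.randn(1152, 1152)
w_q = quantize_groupwise(w, n_bits=6, group_size=64)
print((w - w_q).abs().mean())
```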
Poster
Sotiris Anagnostidis · Gregor Bachmann · Yeongmin Kim · Jonas Kohler · Markos Georgopoulos · Artsiom Sanakoyeu · Yuming Du · Albert Pumarola · Ali Thabet · Edgar Schoenfeld

[ ExHall D ]

Abstract
Despite their remarkable performance, modern Diffusion Transformers (DiTs) are hindered by substantial resource requirements during inference, stemming from the fixed and large amount of compute needed for each denoising step. In this work, we revisit the conventional static paradigm that allocates a fixed compute budget per denoising iteration and propose a dynamic strategy instead. Our simple and sample-efficient framework enables pre-trained DiT models to be converted into flexible ones --- dubbed FlexiDiT --- allowing them to process inputs at varying compute budgets. We demonstrate how a single flexible model can generate images without any drop in quality, while reducing the required FLOPs by more than 40% compared to their static counterparts, for both class-conditioned and text-conditioned image generation. Our method is general and agnostic to input and conditioning modalities. We show how our approach can be readily extended for video generation, where FlexiDiT models generate samples with up to 75% less compute without compromising performance.
Poster
Vishal Purohit · Matthew Repasky · Jianfeng Lu · Qiang Qiu · Yao Xie · Xiuyuan Cheng

[ ExHall D ]

Abstract
Posterior sampling in high-dimensional spaces using generative models holds significant promise for various applications, including but not limited to inverse problems and guided generation tasks. Generating diverse posterior samples remains expensive, as existing methods require restarting the entire generative process for each new sample. In this work, we propose a posterior sampling approach that simulates Langevin dynamics in the noise space of a pre-trained generative model. By exploiting the mapping between the noise and data spaces which can be provided by distilled flows or consistency models, our method enables seamless exploration of the posterior without the need to re-run the full sampling chain, drastically reducing computational overhead. Theoretically, we prove a guarantee for the proposed noise-space Langevin dynamics to approximate the posterior, assuming that the generative model sufficiently approximates the prior distribution. Our framework is experimentally validated on image restoration tasks involving noisy linear and nonlinear forward operators applied to LSUN-Bedroom (256 x 256) and ImageNet (64 x 64) datasets. The results demonstrate that our approach generates high-fidelity samples with enhanced semantic diversity even under a limited number of function evaluations, offering superior efficiency and performance compared to existing diffusion-based posterior sampling techniques.
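A minimal sketch of Langevin dynamics run in the noise space of a one-step generator G (standing in for a distilled flow or consistency model), for a linear measurement y = A(x) + noise; the step size, operator, and placeholder G are assumptions.

```python
import torch

def noise_space_langevin(G, A, y, z0, sigma_y=0.05, step=1e-3, n_steps=200):
    """Posterior sampling by Langevin dynamics on z, where x = G(z) and the
    prior on z is standard Gaussian.
    G : one-step generator mapping noise z -> image x
    A : linear forward operator (callable), y : measurement."""
    z = z0.clone().requires_grad_(True)
    for _ in range(n_steps):
        x = G(z)
        # Gaussian measurement log-likelihood + standard normal prior on z
        log_prob = -((A(x) - y) ** 2).sum() / (2 * sigma_y ** 2) - 0.5 * (z ** 2).sum()
        grad = torch.autograd.grad(log_prob, z)[0]
        with torch.no_grad():
            z = z + step * grad + (2 * step) ** 0.5 * torch.randn_like(z)
        z.requires_grad_(True)
    return G(z).detach()

# toy usage with placeholder stand-ins for G and A (assumptions, not the paper's models)
G = lambda z: torch.tanh(z)                 # placeholder generator
A = lambda x: x[..., ::2]                   # placeholder downsampling operator
y = torch.randn(1, 3, 32, 16)
x_hat = noise_space_langevin(G, A, y, z0=torch.randn(1, 3, 32, 32))
```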
Poster
Wenxin Su · Song Tang · Xiaofeng Liu · Xiaojing Yi · Mao Ye · Chunxiao Zu · Jiahao Li · Xiatian Zhu

[ ExHall D ]

Abstract
Domain shift (the difference between source and target domains) poses a significant challenge in clinical applications, e.g., Diabetic Retinopathy (DR) grading. Despite considering certain clinical requirements, like source data privacy, conventional transfer methods are predominantly model-centered and often struggle to prevent model-targeted attacks. In this paper, we address a challenging Online Model-aGnostic Domain Adaptation (OMG-DA) setting, driven by the demands of clinical environments. This setting is characterized by the absence of the model and the flow of target data. To tackle the new challenge, we propose a novel approach, Generative Unadversarial ExampleS (GUES), which enables adaptation from a data-centric perspective. Specifically, we first theoretically reformulate conventional perturbation optimization in a generative way—learning a perturbation generation function with a latent input variable. During model instantiation, we leverage a Variational AutoEncoder to express this function. The encoder with the reparameterization trick predicts the latent input, whilst the decoder is responsible for the generation. Furthermore, the saliency map is selected as the pseudo-perturbation label, because it not only captures potential lesions but also theoretically provides an upper bound on the function input, enabling the identification of the latent variable. Extensive comparative experiments on DR benchmarks with both frozen pre-trained models and trainable models demonstrate …
Poster
Johannes Schusterbauer · Ming Gui · Frank Fundel · Björn Ommer

[ ExHall D ]

Abstract
Recent advancements in diffusion models have established new benchmarks in both generative tasks and downstream applications. In contrast, flow matching models have shown promising improvements in performance but have not been as extensively explored, particularly due to the difficulty of inheriting knowledge from a pretrained diffusion prior foundation model. In this work, we propose a novel method to bridge the gap between pretrained diffusion models and flow matching models by aligning their trajectories and matching their objectives. Our approach mathematically formalizes this alignment and enables the efficient transfer of knowledge from diffusion priors to flow matching models. We demonstrate that our method outperforms traditional diffusion and flow matching finetuning, achieving competitive results across a variety of tasks.
Poster
Hao Chen · Ze Wang · Xiang Li · Ximeng Sun · Fangyi Chen · Jiang Liu · Jindong Wang · Bhiksha Raj · Zicheng Liu · Emad Barsoum

[ ExHall D ]

Abstract
Efficient image tokenization with high compression ratios remains a critical challenge for training generative models. We present SoftVQ-VAE, a continuous image tokenizer that leverages soft categorical posteriors to aggregate multiple codewords into each latent token, substantially increasing the representation capacity of the latent space. When applied to Transformer-based architectures, our approach compresses 256×256 and 512×512 images using only 32 or 64 1-dimensional tokens. Not only does SoftVQ-VAE show consistent and high-quality reconstruction, more importantly, it also achieves state-of-the-art and significantly faster image generation results across different denoising-based generative models. Remarkably, SoftVQ-VAE improves inference throughput by up to 18x for generating 256×256 images and 55x for 512×512 images while achieving competitive FID scores of 1.78 and 2.21 for SiT-XL. It also improves the training efficiency of the generative models by reducing the number of training iterations by 2.3x while maintaining comparable performance. With its fully-differentiable design and semantic-rich latent space, our experiments demonstrate that SoftVQ-VAE achieves efficient tokenization without compromising generation quality, paving the way for more efficient generative models. Code and model will be released.
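The core soft-aggregation step can be sketched as a softmax over (negative) token-to-codeword distances followed by a probability-weighted mix of codewords; the temperature and tensor shapes below are assumptions.

```python
import torch
import torch.nn.functional as F

def soft_quantize(z_e: torch.Tensor, codebook: torch.Tensor, tau: float = 0.07):
    """z_e      : (B, N, D) continuous encoder tokens
    codebook : (K, D) learnable codewords
    Returns soft tokens (B, N, D) as a probability-weighted mix of codewords."""
    # squared Euclidean distance between every token and every codeword, (B, N, K)
    d = (z_e.pow(2).sum(-1, keepdim=True)
         - 2 * z_e @ codebook.t()
         + codebook.pow(2).sum(-1))
    probs = F.softmax(-d / tau, dim=-1)          # soft categorical posterior
    z_q = probs @ codebook                        # aggregate codewords, (B, N, D)
    return z_q, probs

# usage: 32 one-dimensional tokens, 4096 codewords
z_e = torch.randn(2, 32, 64)
codebook = torch.nn.Parameter(torch.randn(4096, 64))
z_q, probs = soft_quantize(z_e, codebook)
```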
Poster
YONGWEI CHEN · Yushi Lan · Shangchen Zhou · Tengfei Wang · Xingang Pan

[ ExHall D ]

Abstract
Autoregressive models have demonstrated remarkable success across various fields, from large language models (LLMs) to large multimodal models (LMMs) and 2D content generation, moving closer to artificial general intelligence (AGI). Despite these advances, applying autoregressive approaches to 3D object generation and understanding remains largely unexplored. This paper introduces Scale AutoRegressive 3D (SAR3D), a novel framework that leverages a multi-scale 3D vector-quantized variational autoencoder (VQVAE) to tokenize 3D objects for efficient autoregressive generation and detailed understanding. By predicting the next scale in a multi-scale latent representation instead of the next single token, SAR3D reduces generation time significantly, achieving fast 3D object generation in just 0.82 seconds on an A6000 GPU. Additionally, given the tokens enriched with hierarchical 3D-aware information, we finetune a pretrained LLM on them, enabling multimodal comprehension of 3D content. Our experiments show that SAR3D surpasses current 3D generation methods in both speed and quality and allows LLMs to interpret and caption 3D models comprehensively.
Poster
Aneeshan Sain · Subhajit Maity · Pinaki Nath Chowdhury · Subhadeep Koley · Ayan Kumar Bhunia · Yi-Zhe Song

[ ExHall D ]

Abstract
As sketch research has collectively matured over time, its adaptation for mass commercialisation emerges on the immediate horizon. Despite an already mature research endeavour for photos, there is no research on efficient inference specifically designed for sketch data. In this paper, we first demonstrate that existing state-of-the-art efficient lightweight models designed for photos do not work on sketches. We then propose two sketch-specific components which work in a plug-and-play manner on any efficient photo network to adapt it to sketch data. We specifically chose fine-grained sketch-based image retrieval (FG-SBIR) as a demonstrator, as it is the most recognised sketch problem with immediate commercial value. Technically speaking, we first propose a cross-modal knowledge distillation network to transfer existing efficient photo networks to be compatible with sketch, which brings down the number of FLOPs and model parameters by 97.96% and 84.89%, respectively. We then exploit the abstract trait of sketch to introduce an RL-based canvas selector that dynamically adjusts to the abstraction level, which further cuts the number of FLOPs by two thirds. The end result is an overall reduction of 99.37% in FLOPs (from 40.18G to 0.254G) when compared with a full network, while retaining the accuracy (33.03% vs 32.77%) -- …
Poster
Hmrishav Bandyopadhyay · Yi-Zhe Song

[ ExHall D ]

Abstract
Sketch animations offer a powerful medium for visual storytelling, from simple flip-book doodles to professional studio productions. While traditional animation requires teams of skilled artists to draw key frames and in-between frames, existing automation attempts still demand significant artistic effort through precise motion paths or keyframe specification. We present FlipSketch, a system that brings back the magic of flip-book animation -- just draw your idea and describe how you want it to move! Our approach harnesses motion priors from text-to-video diffusion models, adapting them to generate sketch animations through three key innovations: (i) fine-tuning for sketch-style frame generation, (ii) a reference frame mechanism that preserves visual integrity of input sketch through noise refinement, and (iii) a dual-attention composition that enables fluid motion without losing visual consistency. Unlike constrained vector animations, our raster frames support dynamic sketch transformations, capturing the expressive freedom of traditional animation. The result is an intuitive system that makes sketch animation as simple as doodling and describing, while maintaining the artistic essence of hand-drawn animation.
Poster
Ozgur Kara · Krishna Kumar Singh · Feng Liu · Duygu Ceylan · James Rehg · Tobias Hinz

[ ExHall D ]

Abstract
Current diffusion-based text-to-video methods are limited to producing short video clips of a single shot and lack the capability to generate multi-shot videos with discrete transitions where the same character performs distinct activities across the same or different backgrounds. To address this limitation we propose a framework that includes a dataset collection pipeline and architectural extensions to video diffusion models to enable text-to-multi-shot video generation. Our approach enables generation of multi-shot videos as a single video with full attention across all frames of all shots, ensuring character and background consistency, and allows users to control the number, duration, and content of shots through shot-specific conditioning. This is achieved by incorporating a transition token into the text-to-video model to control at which frames a new shot begins and a local attention masking strategy which controls the transition token's effect and allows shot-specific prompting. To obtain training data we propose a novel data collection pipeline to construct a multi-shot video dataset from existing single-shot video datasets. Extensive experiments demonstrate that fine-tuning a pre-trained text-to-video model for a few thousand iterations is enough for the model to subsequently be able to generate multi-shot videos with shot-specific control, outperforming the baselines.
Poster
Qifan Yu · Wei Chow · Zhongqi Yue · Kaihang Pan · Yang Wu · Xiaoyang Wan · Juncheng Li · Siliang Tang · Hanwang Zhang · Yueting Zhuang

[ ExHall D ]

Abstract
Instruction-based image editing aims to modify specific image elements with natural language instructions. However, current models in this domain often struggle to accurately execute complex user instructions, as they are trained on low-quality data with limited editing types. We present AnyEdit, a comprehensive multi-modal instruction editing dataset, comprising 2.5 million high-quality editing pairs spanning over 20 editing types and five domains. We ensure the diversity and quality of the AnyEdit collection through three aspects: initial data diversity, adaptive editing process, and automated selection of editing results. Using the dataset, we further train a novel AnyEdit Stable Diffusion with task-aware routing and learnable task embedding for unified image editing. Comprehensive experiments on three benchmark datasets show that AnyEdit consistently boosts the performance of diffusion-based editing models. This presents prospects for developing instruction-driven image editing models that support human creativity. The code is available in \url{https://anonymous.4open.science/r/AnyEdit-C53B}.
Poster
Shuchen Weng · Haojie Zheng · Peixuan Zhang · Yuchen Hong · Han Jiang · Si Li · Boxin Shi

[ ExHall D ]

Abstract
We introduce VIRES, a video instance repainting method with sketch and text guidance, enabling video instance repainting, replacement, generation, and removal. Existing approaches struggle with temporal consistency and accurate alignment with the provided sketch sequence. VIRES leverages the generative priors of text-to-video models to maintain temporal consistency and produce visually pleasing results. We propose the Sequential ControlNet with the standardized self-scaling, which effectively extracts structure layouts and adaptively captures high-contrast sketch details. We further augment the diffusion transformer backbone with the sketch attention to interpret and inject fine-grained sketch semantics. A sketch-aware encoder ensures that repainted results are aligned with the provided sketch sequence. Additionally, we contribute the VireSet, a dataset with detailed annotations tailored for training and evaluating video instance editing methods. Experimental results demonstrate the effectiveness of VIRES, which outperforms state-of-the-art methods in visual quality, temporal consistency, condition alignment, and human ratings.
Poster
Yixuan Zhu · Haolin Wang · Shilin Ma · Wenliang Zhao · Yansong Tang · Lei Chen · Jie Zhou

[ ExHall D ]

Abstract
Recent advancements in diffusion frameworks have significantly enhanced video editing, achieving high fidelity and strong alignment with textual prompts. However, conventional approaches using image diffusion models fall short in handling video dynamics, particularly for challenging temporal edits like motion adjustments. While current video diffusion models produce high-quality results, adapting them for efficient editing remains difficult due to the heavy computational demands that prevent the direct application of previous image editing techniques. To overcome these limitations, we introduce FADE—a training-free yet highly effective video editing approach that fully leverages the inherent priors from pre-trained video diffusion models via frequency-aware factorization. Rather than simply using these models, we first analyze the attention patterns within the video model to reveal how video priors are distributed across different components. Building on these insights, we propose a factorization strategy to optimize each component’s specialized role. Furthermore, we devise spectrum-guided modulation to refine the sampling trajectory with frequency domain cues, preventing information leakage and supporting efficient, versatile edits while preserving the basic spatial and temporal structure. Extensive experiments on real-world videos demonstrate that our method consistently delivers high-quality, realistic and temporally coherent editing results both qualitatively and quantitatively.
Poster
Tongda Xu · Jiahao Li · Bin Li · Yan Wang · Ya-Qin Zhang · Yan Lu

[ ExHall D ]

Abstract
Recently, perceptual image compression has achieved significant advancements, delivering high visual quality at low bitrates for natural images. However, existing methods often produce noticeable artifacts when compressing text in screen content. To tackle this challenge, we propose versatile perceptual screen image compression with diffusion rendering (PICD), a codec that works well for both screen and natural images. More specifically, we propose a compression framework that encodes the text and image separately, and renders them into one image using a diffusion model. For this diffusion rendering, we integrate conditional information into diffusion models at three distinct levels: 1) Domain level: we fine-tune the base diffusion model using text content prompts with screen content. 2) Adaptor level: we develop an efficient adaptor to control the diffusion model using compressed image and text as input. 3) Instance level: we apply instance-wise guidance to further enhance the decoding process. Empirically, our PICD surpasses existing perceptual codecs in terms of both text accuracy and perceptual quality. Additionally, without text conditions, our approach serves effectively as a perceptual codec for natural images.
Poster
Ka Chun SHUM · Binh-Son Hua · Thanh Nguyen · Sai-Kit Yeung

[ ExHall D ]

Abstract
Diffusion models have shown great promise in synthesizing visually appealing images. However, it remains challenging to condition the synthesis at a fine-grained level, for instance, synthesizing image pixels following some generic color pattern. Existing image synthesis methods often produce contents that fall outside the desired pixel conditions. To address this, we introduce a novel color alignment algorithm that confines the generative process in diffusion models within a given color pattern. Specifically, we project diffusion terms, either imagery samples or latent representations, into a conditional color space to align with the input color distribution. This strategy simplifies the prediction in diffusion models within a color manifold while still allowing plausible structures in generated contents, thus enabling the generation of diverse contents that comply with the target color pattern. Experimental results demonstrate our state-of-the-art performance in conditioning and controlling of color pixels, while maintaining on-par generation quality and diversity in comparison with regular diffusion models.
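The paper's conditional color-space projection is more involved, but the underlying idea, re-projecting the current sample onto target color statistics during sampling, can be sketched with simple per-channel mean/std matching; treat this as an assumption-level simplification.

```python
import torch

def align_color_stats(x: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
    """Project sample x onto the per-channel colour statistics of ref.
    x, ref: (B, C, H, W). Returns x with ref's channel-wise mean and std."""
    mu_x = x.mean(dim=(2, 3), keepdim=True)
    std_x = x.std(dim=(2, 3), keepdim=True).clamp(min=1e-6)
    mu_r = ref.mean(dim=(2, 3), keepdim=True)
    std_r = ref.std(dim=(2, 3), keepdim=True)
    return (x - mu_x) / std_x * std_r + mu_r

# usage inside a sampling loop (x0_pred is the model's clean estimate at step t)
x0_pred = torch.rand(1, 3, 64, 64)
color_pattern = torch.rand(1, 3, 64, 64)
x0_aligned = align_color_stats(x0_pred, color_pattern)
```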
Poster
Nam Anh Dinh · Itai Lang · Hyunwoo Kim · Oded Stein · Rana Hanocka

[ ExHall D ]

Abstract
In this work, we present Geometry in Style, a new method for identity-preserving mesh stylization. Existing techniques either adhere to the original shape through overly restrictive deformations such as bump maps or significantly modify the input shape using expressive deformations that may introduce artifacts or alter the identity of the source shape. In contrast, we represent a deformation of a triangle mesh as a target normal vector for each vertex neighborhood. The deformations we recover from target normals are expressive enough to enable detailed stylizations and at the same time restrictive enough to preserve the shape's identity. We achieve such deformations using our novel differentiable As-Rigid-As-Possible (dARAP) layer, a neural-network-ready adaptation of the classical ARAP algorithm which we use to solve for per-vertex rotations and deformed vertices. As a differentiable layer, dARAP is paired with a visual loss from a text-to-image model to drive deformations toward style prompts, altogether giving us Geometry in Style.
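For context, the classical ARAP local step that dARAP adapts into a differentiable layer fits a best rotation per vertex neighbourhood from rest and deformed edge vectors via SVD. A minimal sketch with uniform edge weights (cotangent weights in practice); the data layout is an assumption.

```python
import torch

def arap_local_rotations(rest, deformed, neighbors):
    """Per-vertex best-fit rotations of the classical ARAP local step.
    rest, deformed : (V, 3) vertex positions; neighbors[i]: index list of vertex i's ring."""
    V = rest.shape[0]
    R = torch.empty(V, 3, 3)
    for i in range(V):
        j = torch.tensor(neighbors[i])
        e = (rest[j] - rest[i]).T            # (3, k) rest-pose edge vectors
        e_def = (deformed[j] - deformed[i]).T
        S = e @ e_def.T                      # 3x3 covariance of edge vectors
        U, _, Vh = torch.linalg.svd(S)
        Ri = Vh.T @ U.T                      # rotation mapping rest -> deformed edges
        if torch.det(Ri) < 0:                # fix reflections
            Vh = Vh.clone()
            Vh[-1] *= -1
            Ri = Vh.T @ U.T
        R[i] = Ri
    return R

# toy usage on a tetrahedron under a rigid 90-degree rotation about the z-axis
rest = torch.tensor([[0., 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]])
deformed = rest @ torch.tensor([[0., -1, 0], [1, 0, 0], [0, 0, 1]])
nbrs = [[1, 2, 3], [0, 2, 3], [0, 1, 3], [0, 1, 2]]
R = arap_local_rotations(rest, deformed, nbrs)
```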
Poster
Hongda Liu · Longguang Wang · Ye Zhang · Ziru YU · Yulan Guo

[ ExHall D ]

Abstract
Global effective receptive field plays a crucial role for image style transfer (ST) to obtain high-quality stylized results. However, existing ST backbones (e.g., CNNs and Transformers) suffer from huge computational complexity to achieve global receptive fields. Recently, the State Space Model (SSM), especially the improved variant Mamba, has shown great potential for long-range dependency modeling with linear complexity, which offers an approach to resolving the above dilemma. In this paper, we develop a Mamba-based style transfer framework, termed SaMam. Specifically, a Mamba encoder is designed to efficiently extract content and style information. In addition, a style-aware Mamba decoder is developed to flexibly adapt to various styles. Moreover, to address the problems of local pixel forgetting, channel redundancy and spatial discontinuity of existing SSMs, we introduce both local enhancement and zigzag scan. Qualitative and quantitative results demonstrate that our SaMam outperforms state-of-the-art methods in terms of both accuracy and efficiency.
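The zigzag scan can be illustrated in a few lines: traverse the grid row by row, reversing every other row so consecutive tokens remain spatial neighbours, unlike plain raster flattening; this is a generic sketch rather than SaMam's exact scan.

```python
import torch

def zigzag_indices(h: int, w: int) -> torch.Tensor:
    """Return flattened indices that traverse an h x w grid row by row,
    reversing every other row so consecutive tokens stay spatial neighbours."""
    idx = torch.arange(h * w).reshape(h, w)
    idx[1::2] = idx[1::2].flip(-1)       # reverse odd rows
    return idx.reshape(-1)

# usage: reorder (B, C, H, W) features into a zigzag token sequence
x = torch.randn(1, 8, 4, 4)
order = zigzag_indices(4, 4)
tokens = x.flatten(2)[:, :, order].transpose(1, 2)   # (B, H*W, C)
```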
Poster
Pengcheng Xu · Boyuan Jiang · Xiaobin Hu · Donghao Luo · Qingdong He · Jiangning Zhang · Chengjie Wang · Yunsheng Wu · Charles Ling · Boyu Wang

[ ExHall D ]

Abstract
Leveraging the large generative prior of the flow transformer for tuning-free image editing requires authentic inversion to project the image into the model's domain and a flexible invariance control mechanism to preserve non-target contents. However, the prevailing diffusion inversion performs poorly in flow-based models, and the invariance control cannot reconcile diverse rigid and non-rigid editing tasks. To address these, we systematically analyze the inversion and invariance control based on the flow transformer. Specifically, we unveil that the Euler inversion shares a similar structure to DDIM yet is more susceptible to the approximation error. Thus, we propose a two-stage inversion to first refine the velocity estimation and then compensate for the leftover error, which stays close to the model prior and benefits editing. Meanwhile, we propose the invariance control that manipulates the text features within the adaptive layer normalization, connecting the changes in the text prompt to image semantics. This mechanism can simultaneously preserve the non-target contents while allowing rigid and non-rigid manipulation, enabling a wide range of editing types. Experiments on various scenarios demonstrate that our framework achieves flexible and accurate editing, unlocking the potential of the flow transformer for versatile image editing.
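As a baseline for the discussion, plain Euler inversion of a flow-matching model integrates the learned velocity field from the image (t = 0) toward noise (t = 1); each step incurs the approximation error that the paper's two-stage inversion refines and compensates. The schedule convention and the placeholder velocity below are assumptions.

```python
import torch

def euler_invert(x0: torch.Tensor, velocity, n_steps: int = 50) -> torch.Tensor:
    """Invert an image x0 to its latent noise by integrating the flow ODE
    dx/dt = v(x, t) forward from t=0 (image) to t=1 (noise) with Euler steps.
    Each step has an O(dt^2) local error, which accumulates over the trajectory."""
    x = x0.clone()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        x = x + dt * velocity(x, t)
    return x

# toy usage with a placeholder velocity field (an assumption, not a real model)
velocity = lambda x, t: -x
z = euler_invert(torch.rand(1, 3, 32, 32), velocity)
```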
Poster
Toan Nguyen · Kien Do · Duc Kieu · Thin Nguyen

[ ExHall D ]

Abstract
We introduce a theoretical framework for diffusion-based image editing by formulating it as a reverse-time bridge modeling problem. This approach modifies the backward process of a pretrained diffusion model to construct a bridge that converges to an implicit distribution associated with the editing target at time 0. Building on this framework, we propose h-Edit, a novel editing method that utilizes Doob's h-transform and Langevin Monte Carlo to decompose the update of an intermediate edited sample into two components: a "reconstruction" term and an "editing" term. This decomposition provides flexibility, allowing the reconstruction term to be computed via existing inversion techniques and enabling the combination of multiple editing terms to handle complex editing tasks. To our knowledge, h-Edit is the first training-free method capable of performing simultaneous text-guided and reward-model-based editing. Extensive experiments, both quantitative and qualitative, show that h-Edit outperforms state-of-the-art baselines in terms of editing effectiveness and faithfulness.
Poster
Jinqi Luo · Tianjiao Ding · Kwan Ho Ryan Chan · Hancheng Min · Chris Callison-Burch · Rene Vidal

[ ExHall D ]

Abstract
Diffusion models are widely used for image editing tasks. Existing editing methods often design a representation manipulation procedure (e.g., Cat → Dog, Sketch → Painting) by curating an edit direction in the text embedding or score space. However, such a procedure faces a key challenge: overestimating the edit strength harms visual consistency while underestimating it fails the editing task. Notably, each source image may require a different editing strength, and it is costly to search for an appropriate strength via trial-and-error. To address this challenge, we propose Concept Lancet (CoLan), a zero-shot plug-and-play framework for principled representation manipulation in diffusion-based image editing. At inference time, we decompose the source input in the latent (text embedding or diffusion score) space as a sparse linear combination of the representations of the collected visual concepts and phrases. This allows us to accurately estimate the presence of concepts in each image, which informs the edit. Based on the editing task (replace, add, or remove), we perform a customized concept transplant process to impose the corresponding editing direction. To sufficiently model the concept space, we curate a conceptual representation dataset, CoLan-150K, which contains diverse descriptions and scenarios of visual concepts and phrases for the latent …
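The sparse decomposition step can be sketched with a few ISTA iterations: solve a lasso problem whose dictionary columns are concept representations and whose solution estimates how strongly each concept is present; the sparsity weight and dictionary here are assumptions.

```python
import numpy as np

def sparse_decompose(x: np.ndarray, D: np.ndarray, lam: float = 0.1,
                     n_iter: int = 500) -> np.ndarray:
    """Solve min_a 0.5 * ||x - D a||^2 + lam * ||a||_1 with ISTA.
    x : (d,) latent of the source image (text embedding or score)
    D : (d, K) dictionary whose columns are concept representations."""
    L = np.linalg.norm(D, ord=2) ** 2          # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)
        z = a - grad / L
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft threshold
    return a

# toy usage: 64-dim latent, 20 concept directions
rng = np.random.default_rng(0)
D = rng.standard_normal((64, 20))
x = D[:, 3] * 2.0 + 0.01 * rng.standard_normal(64)   # mostly concept #3
coeffs = sparse_decompose(x, D)
print(np.argmax(np.abs(coeffs)))                      # likely 3
```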
Poster
Sherry X. Chen · Misha Sra · Pradeep Sen

[ ExHall D ]

Abstract
Although natural language instructions offer an intuitive way to guide automated image editing, deep-learning models often struggle to achieve high-quality results, largely due to challenges in creating large, high-quality training datasets. Previous work has typically relied on text-to-image (T2I) generative models to produce pairs of original and edited images that simulate the input/output of an instruction-guided image-editing model. However, these image pairs often fail to align with the specified edit instructions due to the limitations of T2I models, which negatively impacts models trained on such datasets. To address this, we present Instruct-CLIP, a self-supervised method that learns the semantic changes between original and edited images to refine and better align the instructions in existing datasets. Furthermore, we adapt Instruct-CLIP to handle noisy latent images and diffusion timesteps so that it can be used to train latent diffusion models (LDMs) and efficiently enforce alignment between the edit instruction and the image changes in latent space at any step of the diffusion pipeline. We use Instruct-CLIP to correct the InstructPix2Pix (IP2P) dataset and get over 60K refined samples we then use to fine-tune the IP2P model, guided by our novel Instruct-CLIP-based loss function. The resulting model produces better edits that are more …
Poster
Tong Wang · Ting Liu · Xiaochao Qu · WU CHENGJING · Luoqi Liu · Xiaolin Hu

[ ExHall D ]

Abstract
Scene text editing, a subfield of image editing, requires modifying texts in images while preserving style consistency and visual coherence with the surrounding environment. While diffusion-based methods have shown promise in text generation, they still struggle to produce high-quality results. These methods often generate distorted or unrecognizable characters, particularly when dealing with complex characters like Chinese. In such systems, characters are composed of intricate stroke patterns and spatial relationships that must be precisely maintained. We present GlyphMastero, a specialized glyph encoder designed to guide the latent diffusion model for generating texts with stroke-level precision. Our key insight is that existing methods, despite using pretrained OCR models for feature extraction, fail to capture the hierarchical nature of text structures - from individual strokes to stroke-level interactions to overall character-level structure. To address this, our glyph encoder explicitly models and captures the cross-level interactions between local-level individual characters and global-level text lines through our novel glyph attention module. Meanwhile, our model implements a feature pyramid network to fuse the multi-scale OCR backbone features at the global-level. Through these cross-level and multi-scale fusions, we obtain more detailed glyph-aware guidance, enabling precise control over the scene text generation process. Our method achieves an 18.02% …
Poster
Bin Xia · Yuechen Zhang · Jingyao Li · Chengyao Wang · Yitong Wang · Xinglong Wu · Bei Yu · Jiaya Jia

[ ExHall D ]

Abstract
Currently, the success of large language models (LLMs) illustrates that a unified multitasking approach can significantly enhance model usability, streamline deployment, and foster synergistic benefits across different tasks. However, in computer vision, while text-to-image (T2I) models have significantly improved generation quality through scaling up, their framework design did not initially consider how to unify with downstream tasks, such as various types of editing. To address this, we introduce DreamOmni, a unified model for image generation and editing. We begin by analyzing existing frameworks and the requirements of downstream tasks, proposing a unified framework that integrates both T2I models and various editing tasks. Furthermore, another key challenge is the efficient creation of high-quality editing data, particularly for instruction-based and drag-based editing. To this end, we develop a synthetic data pipeline using sticker-like elements to synthesize accurate, high-quality datasets efficiently, which enables editing data scaling up for unified model training. For training, DreamOmni jointly trains T2I generation and downstream tasks. T2I training enhances the model's understanding of specific concepts and improves generation quality, while editing training helps the model grasp the nuances of the editing task. This collaboration significantly boosts editing performance. Extensive experiments confirm the effectiveness of DreamOmni. The code and …
Poster
Muhammad Shaheryar · Jong Taek Lee · Soon Ki Jung

[ ExHall D ]

Abstract
Recent advances in diffusion models have positioned them as powerful generative frameworks for high-resolution image synthesis across diverse domains. The emerging "h-space" within these models, defined by bottleneck activations in the denoiser, offers promising pathways for semantic image editing similar to GAN latent spaces. However, as demand grows for content erasure and concept removal, privacy concerns highlight the need for identity disentanglement in the latent space of diffusion models. The high-dimensional latent space poses challenges for identity removal, as traversing with random or orthogonal directions often leads to semantically unvalidated regions, resulting in unrealistic outputs. To address these issues, we propose Black Hole-Driven Identity Absorption (BIA) within the latent space of diffusion models for the erasure of any identity. BIA uses a "black hole" metaphor, where the latent region representing a specified identity acts as an attractor, drawing in nearby latent points of surrounding identities to "wrap" the black hole. Instead of relying on random traversals for optimization, BIA employs an identity absorption mechanism, attracting and wrapping nearby validated latent points associated with other identities to achieve a vanishing effect for the specified identity. Our method effectively prevents the generation of a specified identity while preserving other attributes, as validated by improved scores …
Poster
Yibin Wang · Weizhong Zhang · honghui xu · Cheng Jin

[ ExHall D ]

Abstract
Scene text synthesis involves rendering specified texts onto arbitrary images. Current methods typically formulate this task in an end-to-end manner but lack effective character-level guidance during training. Besides, their text encoders, pre-trained on a single font type, struggle to adapt to the diverse font styles encountered in practical applications. Consequently, these methods suffer from character distortion, repetition, and absence, particularly in polystylistic scenarios. To this end, this paper proposes DreamText for high-fidelity scene text synthesis. Our key idea is to reconstruct the diffusion training process, introducing more refined guidance tailored to this task, to expose and rectify the model's attention at the character level and strengthen its learning of text regions. This transformation poses a hybrid optimization challenge, involving both discrete and continuous variables. To effectively tackle this challenge, we employ a heuristic alternate optimization strategy. Meanwhile, we jointly train the text encoder and generator to comprehensively learn and utilize the diverse fonts present in the training dataset. This joint training is seamlessly integrated into the alternate optimization process, fostering a synergistic relationship between learning character embedding and re-estimating character attention. Specifically, in each step, we first encode potential character-generated position information from cross-attention maps into latent character masks. These masks are then utilized to update …
Poster
Yasamin Medghalchi · Moein Heidari · Clayton Allard · Leonid Sigal · Ilker Hacihaliloglu

[ ExHall D ]

Abstract
Deep neural networks (DNNs) offer significant promise for improving breast cancer diagnosis in medical imaging. However, these models are highly susceptible to adversarial attacks—small, imperceptible changes that can mislead classifiers—raising critical concerns about their reliability and security. Traditional attack methods typically either require substantial extra data for malicious model pre-training, or involve a fixed norm perturbation budget, which does not align with human perception of these alterations. In medical imaging, however, this is often infeasible due to the limited availability of datasets. Building on recent advancements in learnable prompts, we propose Prompt2Perturb (P2P), a novel language-guided semantic attack method capable of generating meaningful perturbations driven by text instructions. During the prompt learning phase, our approach leverages learnable prompts within the text encoder to create subtle, yet impactful, perturbations that remain imperceptible while guiding the model towards targeted outcomes. In contrast to current prompt learning-based approaches, our P2P stands out by directly updating text embeddings, avoiding the need to retrain diffusion models or rely on large pre-trained models, which is typically infeasible in the medical domain. Further, we leverage the finding that optimizing only the early reverse diffusion steps boosts efficiency while ensuring that the generated adversarial examples incorporate subtle low-frequency noise, thus preserving …
Poster
Andrew Z Wang · Songwei Ge · Tero Karras · Ming-Yu Liu · Yogesh Balaji

[ ExHall D ]

Abstract
Both text-to-image generation and large language models (LLMs) have made significant advancements. However, many text-to-image models still employ the somewhat outdated T5 and CLIP as their text encoders. In this work, we investigate the effectiveness of using modern decoder-only LLMs as text encoders for text-to-image diffusion models. We build a standardized training and evaluation pipeline that allows us to isolate and evaluate the effect of different text embeddings. We train a total of 22 text-to-image models with 12 different text encoders to analyze the critical aspects of LLMs that could impact text-to-image generation, including the approaches to extract embeddings, different LLM variants, and model sizes. Our experiments reveal that the de facto way of using last-layer embeddings as conditioning leads to inferior performance. Instead, we explore embeddings from various layers and find that using layer-normalized averaging across all layers significantly improves alignment with complex prompts. LLMs with this conditioning outperform the baseline T5 model, showing enhanced performance in advanced visio-linguistic reasoning skills.
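The "layer-normalized averaging" finding lends itself to a short sketch: extract hidden states from every layer of a decoder-only LM, normalize each layer, and average them into conditioning tokens. The snippet below uses Hugging Face transformers with gpt2 purely as a stand-in model; it is an illustration of the idea, not the paper's pipeline.

```python
# Illustrative sketch of "layer-normalized averaging" conditioning: take hidden
# states from every layer of a decoder-only LM, normalize each layer, then
# average across layers (the model name below is just an example).
import torch
from transformers import AutoModel, AutoTokenizer

@torch.no_grad()
def layer_averaged_embedding(prompt, model_name="gpt2"):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name, output_hidden_states=True).eval()
    inputs = tok(prompt, return_tensors="pt")
    hidden = model(**inputs).hidden_states          # tuple: (num_layers + 1) x (1, T, D)
    stacked = torch.stack(hidden, dim=0)            # (L+1, 1, T, D)
    # normalize each layer independently over the feature dimension
    normed = torch.nn.functional.layer_norm(stacked, stacked.shape[-1:])
    return normed.mean(dim=0)                       # (1, T, D) conditioning tokens

if __name__ == "__main__":
    emb = layer_averaged_embedding("a watercolor fox reading a newspaper")
    print(emb.shape)
```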
Poster
Bingda Tang · Sayak Paul · Boyang Zheng · Saining Xie

[ ExHall D ]

Abstract
Recent advances in text-to-image synthesis have delivered impressive results, yet existing approaches still struggle to align with complex prompts. While decoder-only Large Language Models (LLMs) excel at handling such intricate texts, their integration with text-to-image generative models remains unsatisfactory. The rise of Diffusion Transformers (DiTs) presents a promising path forward via the deep fusion with LLMs. In this work, we explore this deep fusion for text-to-image synthesis by replacing the text stream Transformer in the MM-DiT model with an LLM, establishing shared self-attention between the LLM and DiT models. This design better aligns with the training objective and inference nature of both autoregressive and diffusion models, bridging the gap between the two paradigms. We empirically examine the design spaces of this approach and demonstrate its effectiveness through extensive experiments. We hope the positive evidence will kindle interest in this approach and inspire reflection on the pursuit of utilizing LLMs for text-to-image synthesis.
Poster
Vikash Sehwag · Xianghao Kong · Jingtao Li · Michael Spranger · Lingjuan Lyu

[ ExHall D ]

Abstract
As scaling laws in generative AI push performance, they simultaneously concentrate the development of these models among actors with large computational resources. With a focus on text-to-image (T2I) generative models, we aim to unlock this bottleneck by demonstrating very low-cost training of large-scale T2I diffusion transformer models. As the computational cost of transformers increases with the number of patches in each image, we propose randomly masking up to 75% of the image patches during training. We propose a deferred masking strategy that preprocesses all patches using a patch-mixer before masking, thus significantly reducing the performance degradation with masking, making it superior to model downscaling in reducing computational cost. We also incorporate the latest improvements in transformer architecture, such as the use of mixture-of-experts layers, to improve performance and further identify the critical benefit of using synthetic images in micro-budget training. Finally, using only 37M publicly available real and synthetic images, we train a 1.16 billion parameter sparse transformer at an economical cost of only 1,890 USD and achieve a 12.7 FID in zero-shot generation on the COCO dataset. Notably, our model achieves competitive performance across both automated and human-centric evaluations, as well as high-quality generations, while incurring 118× lower costs than Stable …
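The deferred-masking recipe can be sketched in a few lines: all patch tokens pass through a cheap patch-mixer first, and only afterwards are roughly 75% of them dropped before the expensive backbone. The code below is a hedged approximation; the mixer depth, dimensions, and masking details are assumptions, not the released implementation.

```python
# Rough sketch (assumptions, not the released code): every patch first goes
# through a cheap "patch-mixer", and only then are ~75% of patches randomly
# dropped before the expensive diffusion-transformer backbone sees them.
import torch
import torch.nn as nn

class PatchMixer(nn.Module):
    def __init__(self, dim=512, depth=2, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 2, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, x):                    # x: (B, N, D) patch tokens
        return self.blocks(x)

def defer_mask(tokens, keep_ratio=0.25):
    """Randomly keep `keep_ratio` of the (already mixed) patch tokens."""
    B, N, _ = tokens.shape
    n_keep = max(1, int(N * keep_ratio))
    scores = torch.rand(B, N, device=tokens.device)
    keep_idx = scores.argsort(dim=1)[:, :n_keep]               # (B, n_keep)
    batch_idx = torch.arange(B, device=tokens.device)[:, None]
    return tokens[batch_idx, keep_idx], keep_idx

if __name__ == "__main__":
    tokens = torch.randn(4, 256, 512)        # 256 patches per image
    mixed = PatchMixer()(tokens)             # all patches see the mixer
    kept, idx = defer_mask(mixed, 0.25)      # backbone only sees 25% of them
    print(kept.shape)                        # torch.Size([4, 64, 512])
```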
Poster
Jiyeon Han · Dahee Kwon · Gayoung Lee · Junho Kim · Jaesik Choi

[ ExHall D ]

Abstract
Recent text-to-image generative models, particularly Stable Diffusion and its distilled variants, have achieved impressive fidelity and strong text-image alignment. However, their creative generation capacity remains limited, as simply adding the term "creative" to prompts often fails to yield genuinely creative results. In this paper, we introduce C3 (Creative Concept Catalyst), a training-free approach designed to enhance creativity in Stable Diffusion-based models. C3 selectively amplifies features during the denoising process to foster more creative outputs. We offer practical guidelines for choosing amplification factors based on two main aspects of creativity. C3 allows user-friendly creativity control in image generation and is the first study to enhance creativity in diffusion models without extensive computational costs. We demonstrate its effectiveness across various Stable Diffusion-based models. Source codes will be publicly available.
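The core mechanism, amplifying selected features during denoising, can be illustrated with a forward hook that scales a block's output by a user-chosen factor. The sketch below is hypothetical: the stand-in block, the hook placement, and the factor of 1.5 are illustrative choices, not C3's actual selection rules.

```python
# A minimal, hypothetical sketch of feature amplification during denoising:
# a forward hook scales the output of a chosen denoiser block by a user-picked
# factor; block selection and factor ranges here are illustrative only.
import torch
import torch.nn as nn

def amplify(module: nn.Module, factor: float):
    """Register a hook that multiplies the module's output features by `factor`."""
    def hook(_mod, _inp, out):
        return out * factor if torch.is_tensor(out) else out
    return module.register_forward_hook(hook)

if __name__ == "__main__":
    # stand-in for one block of a denoiser
    block = nn.Sequential(nn.Conv2d(4, 4, 3, padding=1), nn.SiLU())
    handle = amplify(block, factor=1.5)      # "creativity knob" > 1 amplifies features
    x = torch.randn(1, 4, 32, 32)
    y = block(x)
    handle.remove()                          # restore normal behaviour afterwards
    print(y.shape)
```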
Poster
JungWoo Chae · Jiyoon Kim · Jaewoong Choi · Kyungyul Kim · Sangheum Hwang

[ ExHall D ]

Abstract
Personalizing diffusion models using limited data presents significant challenges, including overfitting, loss of prior knowledge, and degradation of text alignment. Overfitting leads to shifts in the noise prediction distribution, disrupting the denoising trajectory and causing the model to lose semantic coherence. In this paper, we propose Adaptive Personalized Training (APT), a novel framework that mitigates overfitting by employing adaptive training strategies and stabilizing the model's internal representations during fine-tuning. APT consists of three key components: (1) Adaptive Training Adjustment, which introduces an overfitting indicator to detect the degree of overfitting at each time step bin and applies adaptive data augmentation and adaptive loss weighting based on this indicator; (2) Representation Stabilization, which regularizes the mean and variance of intermediate feature maps to prevent excessive shifts in noise prediction; and (3) Attention Alignment for Prior Knowledge Preservation, which aligns the cross-attention maps of the fine-tuned model with those of the pretrained model to maintain prior knowledge and semantic coherence. Through extensive experiments, we demonstrate that APT effectively mitigates overfitting, preserves prior knowledge, and outperforms existing methods in generating high-quality, diverse images with limited reference data.
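Component (2), representation stabilization, suggests a simple regularizer: penalize drift in the per-channel statistics of intermediate feature maps relative to the frozen pretrained model. The sketch below is an assumption-laden approximation of that idea, not the authors' exact loss; the layer choice and weighting are left open.

```python
# Hedged sketch of the "representation stabilization" idea: penalize drift in
# the per-channel mean and variance of intermediate feature maps relative to
# the frozen pretrained model (exact layers and weights are assumptions).
import torch

def stat_drift_loss(feats_ft, feats_pre):
    """feats_*: lists of matched feature maps (B, C, H, W) from both models."""
    loss = 0.0
    for f, g in zip(feats_ft, feats_pre):
        mu_f, mu_g = f.mean(dim=(0, 2, 3)), g.mean(dim=(0, 2, 3))
        var_f, var_g = f.var(dim=(0, 2, 3)), g.var(dim=(0, 2, 3))
        loss = loss + (mu_f - mu_g).pow(2).mean() + (var_f - var_g).pow(2).mean()
    return loss

if __name__ == "__main__":
    ft = [torch.randn(2, 8, 16, 16)]    # features from the fine-tuned model
    pre = [torch.randn(2, 8, 16, 16)]   # features from the frozen pretrained model
    print(stat_drift_loss(ft, pre))
```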
Poster
Yunhong Lu · Qichao Wang · Hengyuan Cao · Xierui Wang · Xiaoyin Xu · Min Zhang

[ ExHall D ]

Abstract
Without using explicit reward, direct preference optimization (DPO) employs paired human preference data to fine-tune generative models, a method that has garnered considerable attention in large language models (LLMs). However, exploration of aligning text-to-image (T2I) diffusion models with human preferences remains limited. In comparison to supervised fine-tuning, existing methods that align diffusion models suffer from low training efficiency and subpar generation quality due to the long Markov chain process and the intractability of the reverse process. To address these limitations, we introduce DDIM-InPO, an efficient method for direct preference alignment of diffusion models. Our approach conceptualizes the diffusion model as a single-step generative model, allowing us to fine-tune the outputs of specific latent variables selectively. In order to accomplish this objective, we first assign implicit rewards to any latent variable directly via a reparameterization technique. Then we construct an inversion technique to estimate appropriate latent variables for preference optimization. This modification process enables the diffusion model to only fine-tune the outputs of latent variables that have a strong correlation with the preference dataset. Experimental results indicate that our DDIM-InPO achieves state-of-the-art performance with just 400 steps of fine-tuning, surpassing all preference aligning baselines for T2I diffusion models in human preference evaluation …
Poster
Yuning Qiu · Andong Wang · Chao Li · Haonan Huang · Guoxu Zhou · Qibin Zhao

[ ExHall D ]

Abstract
Recent text-to-image (T2I) diffusion models have demonstrated remarkable capabilities in visual synthesis, yet their performance heavily relies on the quality of input prompts. However, optimizing discrete prompts remains challenging because the discrete nature of tokens prevents the direct application of gradient descent, and the search space of possible token combinations is vast. As a result, existing approaches either suffer from quantization errors when employing continuous optimization techniques or become trapped in local optima due to coordinate-wise greedy search. In this paper, we propose STEPS, a novel Sequential probability Tensor Estimation approach for hard Prompt Search. Our method reformulates discrete prompt optimization as a sequential probability tensor estimation problem, leveraging the inherent low-rank characteristics to address the curse of dimensionality. To further improve the computational efficiency, we develop a memory-bounded sampling approach that shrinks the sequential probability estimate without the iteration-step dependency while preserving sequential optimization dynamics. Extensive experiments on various public datasets demonstrate that our method consistently outperforms existing approaches in T2I generation, cross-model prompt transferability, and harmful prompt optimization, validating the effectiveness of the proposed framework.
Poster
Eduard Poesina · Adriana Valentina Costache · Adrian-Gabriel Chifu · Josiane Mothe · Radu Tudor Ionescu

[ ExHall D ]

Abstract
Text-to-image generation has recently emerged as a viable alternative to text-to-image retrieval, driven by the visually impressive results of generative diffusion models. Although query performance prediction is an active research topic in information retrieval, to the best of our knowledge, there is no prior study that analyzes the difficulty of queries (referred to as prompts) in text-to-image generation, based on human judgments. To this end, we introduce the first dataset of prompts which are manually annotated in terms of image generation performance. Additionally, we extend these evaluations to text-to-image retrieval by collecting manual annotations that represent retrieval performance. We thus establish the first joint benchmark for prompt and query performance prediction (PQPP) across both tasks, comprising over 10K queries. Our benchmark enables (i) the comparative assessment of prompt/query difficulty in both image generation and image retrieval, and (ii) the evaluation of prompt/query performance predictors addressing both generation and retrieval. We evaluate several pre- and post-generation/retrieval performance predictors, thus providing competitive baselines for future research. Our benchmark and code are publicly available at https://anonymous.4open.science/r/PQPP-D332.
Poster
Renrui Zhang · Chengzhuo Tong · Zhizheng Zhao · Ziyu Guo · Haoquan Zhang · Manyuan Zhang · Jiaming Liu · Peng Gao · Hongsheng Li

[ ExHall D ]

Abstract
Chain-of-Thought (CoT) reasoning has been extensively explored in large models to tackle complex understanding tasks. However, it still remains an open question whether such strategies can be applied to verifying and reinforcing image generation scenarios. In this paper, we provide the first comprehensive investigation into the potential of CoT reasoning to enhance autoregressive image generation. We focus on three techniques: scaling test-time computation for verification, aligning model preferences with Direct Preference Optimization (DPO), and integrating these techniques for complementary effects. Our results demonstrate that these approaches can be effectively adapted and combined to significantly improve image generation performance. Furthermore, given the pivotal role of reward models in our findings, we propose the Potential Assessment Reward Model (PARM) specialized for autoregressive image generation. PARM adaptively assesses each generation step through a potential assessment mechanism, merging the strengths of existing reward models. Using our investigated reasoning strategies, we enhance a baseline model, Show-o, to achieve superior results, with a significant +24% improvement on the GenEval benchmark, surpassing Stable Diffusion 3 by +15%. We hope our study provides unique insights and paves a new path for integrating CoT reasoning with autoregressive image generation.
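Scaling test-time computation for verification reduces, in its simplest form, to best-of-N sampling under a reward model: draw several candidates and keep the highest-scoring one. The sketch below illustrates that loop with placeholder generator and reward callables; it is not PARM or Show-o.

```python
# Sketch of test-time verification (best-of-N) with a reward model; the
# generator and reward model here are toy placeholders.
import torch

def best_of_n(generate, reward, prompt, n=8):
    """generate(prompt) -> image tensor; reward(prompt, image) -> scalar score."""
    candidates = [generate(prompt) for _ in range(n)]
    scores = torch.tensor([float(reward(prompt, img)) for img in candidates])
    return candidates[scores.argmax().item()], scores

if __name__ == "__main__":
    # toy stand-ins: "images" are random tensors, the reward prefers bright ones
    gen = lambda p: torch.rand(3, 64, 64)
    rew = lambda p, img: img.mean()
    best, scores = best_of_n(gen, rew, "a red cube on a blue sphere", n=4)
    print(scores, best.shape)
```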
Poster
Krishnakant Singh · Simone Schaub-Meyer · Stefan Roth

[ ExHall D ]

Abstract
Object-centric learning aims to decompose an input image into a set of meaningful object files (slots). These latent object representations enable a variety of downstream tasks. Yet, object-centric learning struggles on real-world datasets, which contain multiple objects of complex textures and shapes in natural everyday scenes. To address this, we introduce Guided Latent Slot Diffusion (GLASS), a novel slot attention model that learns in the space of generated images and uses semantic and instance guidance modules to learn better slot embeddings for various downstream tasks. Our experiments show that GLASS surpasses state-of-the-art slot attention methods by a wide margin on tasks such as (zero-shot) object discovery and conditional image generation for real-world scenes. Moreover, GLASS enables the first application of slot attention to compositional generation of complex, realistic scenes.
Poster
Jianzong Wu · Chao Tang · Jingbo Wang · Yanhong Zeng · Xiangtai Li · Yunhai Tong

[ ExHall D ]

Abstract
Story visualization, the task of creating visual narratives from textual descriptions, has seen progress with text-to-image generation models. However, these models often lack effective control over character appearances and interactions, particularly in multi-character scenes. To address these limitations, we propose a new task: customized manga generation, and introduce DiffSensei, an innovative framework specifically designed for generating manga with dynamic multi-character control. DiffSensei integrates a diffusion-based image generator with a multimodal large language model (MLLM) that acts as a text-compatible identity adapter. Our approach employs masked cross-attention to seamlessly incorporate character features, enabling precise layout control without direct pixel transfer. Additionally, the MLLM-based adapter adjusts character features to align with panel-specific text cues, allowing flexible adjustments in character expressions, poses, and actions. We also introduce MangaZero, a large-scale dataset tailored to this task, containing 43,264 manga pages and 427,147 annotated panels, supporting the visualization of varied character interactions and movements across sequential frames. Extensive experiments demonstrate that DiffSensei outperforms existing models, marking a significant advancement in manga generation by enabling text-adaptable character customization. The code, model, and dataset will be open-sourced to the community.
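Masked cross-attention for character injection can be sketched as image tokens attending to per-character feature tokens under a layout mask that restricts each region to its assigned character. The PyTorch snippet below is an illustrative approximation; the mask construction and dimensions are assumptions, not DiffSensei's implementation.

```python
# Illustrative sketch (not the authors' implementation) of masked cross-attention:
# image tokens attend to per-character feature tokens, and a layout mask keeps
# each spatial region attending only to the character assigned to it.
import torch
import torch.nn as nn

class MaskedCrossAttention(nn.Module):
    def __init__(self, dim=320, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_tokens, char_tokens, layout_mask):
        # img_tokens:  (B, N_img, D)   latent image tokens
        # char_tokens: (B, N_char, D)  identity features, one per character
        # layout_mask: (B, N_img, N_char) True where attention is NOT allowed
        mask = layout_mask.repeat_interleave(self.attn.num_heads, dim=0)
        out, _ = self.attn(img_tokens, char_tokens, char_tokens, attn_mask=mask)
        return img_tokens + out

if __name__ == "__main__":
    B, N_img, N_char, D = 1, 64, 2, 320
    mask = torch.zeros(B, N_img, N_char, dtype=torch.bool)
    mask[:, :32, 1] = True                   # left half may not look at character 2
    mask[:, 32:, 0] = True                   # right half may not look at character 1
    mca = MaskedCrossAttention(D)
    y = mca(torch.randn(B, N_img, D), torch.randn(B, N_char, D), mask)
    print(y.shape)                           # torch.Size([1, 64, 320])
```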
Poster
Haoyu Chen · Xiaojie Xu · Wenbo Li · Jingjing Ren · Tian Ye · Songhua Liu · Ying-Cong Chen · Lei Zhu · Xinchao Wang

[ ExHall D ]

Abstract
Poster design is a critical medium for visual communication. Prior work has explored automatic poster design using deep learning techniques, but these approaches lack text accuracy, user customization, and aesthetic appeal, limiting their applicability in artistic domains such as movies and exhibitions, where both clear content delivery and visual impact are essential. To address these limitations, we present POSTA: a modular framework powered by diffusion models and multimodal large language models (MLLMs) for customized artistic poster generation. The framework consists of three modules. Background Diffusion creates a themed background based on user input. Design MLLM then generates layout and typography elements that align with and complement the background style. Finally, to enhance the poster's aesthetic appeal, ArtText Diffusion applies additional stylization to key text elements. The final result is a visually cohesive and appealing poster, with a fully modular process that allows for complete customization. To train our models, we develop the PosterArt dataset, comprising high-quality artistic posters annotated with layout, typography, and pixel-level stylized text segmentation. Our comprehensive experimental analysis demonstrates POSTA’s exceptional controllability and design diversity, outperforming existing models in both text accuracy and aesthetic quality.
Poster
Zhaoxing Gan · Mengtian Li · Ruhua Chen · Zhongxia JI · Sichen Guo · Huanling Hu · Guangnan Ye · Zuo Hu

[ ExHall D ]

Abstract
In this work, we introduce StageDesigner, the first comprehensive framework for artistic stage generation using large language models (LLMs) combined with layout-controlled diffusion models. Given the professional requirements of stage scenography, StageDesigner simulates the workflows of seasoned artists to generate immersive 3D stage scenes. Specifically, our approach is divided into three primary modules: Script Analysis, which extracts thematic and spatial cues from input scripts; Foreground Generation, which constructs and arranges essential 3D objects; and Background Generation, which produces a harmonious background aligned with the narrative atmosphere and maintains spatial coherence by managing occlusions between foreground and background elements. Furthermore, we introduce the StagePro-V1 dataset, a dedicated dataset with 276 unique stage scenes spanning different historical styles and annotated with scripts, images, and detailed 3D layouts, specifically tailored for this task. Finally, evaluations using both standard and newly proposed metrics, along with extensive user studies, demonstrate the effectiveness of StageDesigner, showcasing its ability to produce visually and thematically cohesive stages that meet both artistic and spatial coherence standards.
Poster
Aditya Ganeshan · Thibault Groueix · Paul Guerrero · Radomir Mech · Matthew Fisher · Daniel Ritchie

[ ExHall D ]

Abstract
Pattern images are everywhere in the digital and physical worlds, and tools to edit them are valuable. But editing pattern images is tricky: desired edits are often *programmatic*: structure-aware edits that alter the underlying program which generates the pattern. One could attempt to infer this underlying program, but current methods for doing so struggle with complex images and produce unorganized programs that make editing tedious. In this work, we introduce a novel approach to perform programmatic edits on pattern images. By using a *pattern analogy* (a pair of simple patterns that demonstrates the intended edit) and a learning-based generative model to execute these edits, our method allows users to intuitively edit patterns. To enable this paradigm, we introduce **SplitWeave**, a domain-specific language that, combined with a framework for sampling synthetic pattern analogies, enables the creation of a large, high-quality synthetic training dataset. We also present **TriFuser**, a Latent Diffusion Model (LDM) designed to overcome critical issues that arise when naively deploying LDMs to this task. Extensive experiments on real-world, artist-sourced patterns reveal that our method faithfully performs the demonstrated edit while also generalizing to related pattern styles beyond its training distribution.
Poster
Shanshan Huang · Haoxuan Li · Chunyuan Zheng · Mingyuan Ge · WeiGao · Lei Wang · Li Liu

[ ExHall D ]

Abstract
Fashion image editing is a valuable tool for designers to convey their creative ideas by visualizing design concepts. With the recent advances in text editing methods, significant progress has been made in fashion image editing. However, they face two key challenges: spurious correlations in training data often induce changes in other regions when editing a given concept, and these models typically lack the ability to edit multiple concepts simultaneously. To address the above challenges, we propose a novel Text-driven Fashion Image ediTing framework called T-FIT that mitigates the impact of spurious correlations by integrating counterfactual reasoning with compositional concept learning, enabling precise compositional multi-concept fashion image editing that relies solely on text descriptions. Specifically, T-FIT includes three key components: (i) a counterfactual abduction module, which learns an exogenous variable of the source image by a denoising U-Net model; (ii) a concept learning module, which identifies concepts in fashion image editing, such as clothing types and colors, and projects a target concept into the space spanned by a series of textual prompts; and (iii) a concept composition module, which enables simultaneous adjustments of multiple concepts by aggregating each concept's direction vector obtained from the concept learning module. Extensive experiments demonstrate that our method can efficiently achieve state-of-the-art performance on …
Poster
Yisol Choi · Sangkyung Kwak · Sihyun Yu · Hyungwon Choi · Jinwoo Shin

[ ExHall D ]

Abstract
We present BootControl, a novel framework based on text-to-image diffusion models for controllable human image generation with multiple reference garments. Here, the main bottleneck is data acquisition for training: collecting a large-scale dataset of high-quality reference garment images per human subject is quite challenging, i.e., ideally, one needs to manually gather every single garment photograph worn by each human. To address this, we propose a data generation pipeline to construct a large synthetic dataset, consisting of human and multiple-garment pairs, by introducing a model to extract any reference garment images from each human image. To ensure data quality, we also propose a filtering strategy to remove undesirable generated data based on measuring perceptual similarities between the garment presented in the human image and the extracted garment. Finally, by utilizing the constructed synthetic dataset, we train a diffusion model having two parallel denoising paths that use multiple garment images as conditions to generate human images while preserving their fine-grained details. We further show the wide applicability of our framework by adapting it to different types of reference-based generation in the fashion domain, including virtual try-on, and controllable human image generation with other conditions, e.g., pose, face, etc.
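The filtering step can be approximated by comparing deep features of the garment region in the human image against the extracted garment and keeping only sufficiently similar pairs. The sketch below uses a pretrained ResNet-18 embedding with cosine similarity as a stand-in perceptual metric; the backbone, threshold, and preprocessing are assumptions, not the paper's choices.

```python
# Rough sketch of similarity-based filtering for synthetic pairs: keep an
# extracted garment only if its deep-feature similarity to the garment region
# of the source human image exceeds a threshold. ResNet-18 features + cosine
# similarity are a stand-in for a perceptual metric, not the paper's metric.
import torch
import torch.nn.functional as F
from torchvision.models import resnet18, ResNet18_Weights

weights = ResNet18_Weights.DEFAULT
backbone = resnet18(weights=weights).eval()
backbone.fc = torch.nn.Identity()            # use pooled features as the embedding
preprocess = weights.transforms()

@torch.no_grad()
def keep_pair(garment_crop, extracted_garment, threshold=0.8):
    """Both inputs: (3, H, W) float tensors in [0, 1]."""
    a = backbone(preprocess(garment_crop).unsqueeze(0))
    b = backbone(preprocess(extracted_garment).unsqueeze(0))
    sim = F.cosine_similarity(a, b).item()
    return sim >= threshold, sim

if __name__ == "__main__":
    x = torch.rand(3, 224, 224)
    keep, sim = keep_pair(x, (x + 0.01 * torch.rand_like(x)).clamp(0, 1))
    print(keep, round(sim, 3))
```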
Poster
Zengqun Zhao · Ziquan Liu · Yu Cao · Shaogang Gong · Ioannis Patras

[ ExHall D ]

Abstract
Recent advances in generative models have sparked research on improving model fairness with AI-generated data. However, existing methods often face limitations in the diversity and quality of synthetic data, leading to compromised fairness and overall model accuracy. Moreover, many approaches rely on the availability of demographic group labels, which are often costly to annotate. This paper proposes AIM-Fair, aiming to overcome these limitations and harness the potential of cutting-edge generative models in promoting algorithmic fairness. We investigate a fine-tuning paradigm starting from a biased model initially trained on real-world data without demographic annotations. This model is then fine-tuned using unbiased synthetic data generated by a state-of-the-art diffusion model to improve its fairness. Two key challenges are identified in this fine-tuning paradigm, 1) the low quality of synthetic data, which can still happen even with advanced generative models, and 2) the domain and bias gap between real and synthetic data. To address the limitation of synthetic data quality, we propose Contextual Synthetic Data Generation (CSDG) to generate data using a text-to-image diffusion model (T2I) with prompts generated by a context-aware LLM, ensuring both data diversity and control of bias in synthetic data. To resolve domain and bias shifts, we introduce a …
Poster
Yuan Wang · Ouxiang Li · Tingting Mu · Yanbin Hao · Kuien Liu · Xiang Wang · Xiangnan He

[ ExHall D ]

Abstract
The success of text-to-image generation enabled by diffusion models has imposed an urgent need to erase unwanted concepts, e.g., copyrighted, offensive, and unsafe ones, from the pre-trained models in a precise, timely, and low-cost manner. The twofold demand of concept erasure requires a precise removal of the target concept during generation (i.e., erasure efficacy) and a minimal impact on non-target content generation (i.e., prior preservation). Existing methods are either computationally costly or face challenges in maintaining an effective balance between erasure efficacy and prior preservation. To improve, we propose a precise, fast, and low-cost concept erasure method, called Adaptive Value Decomposer (AdaVD), which is training-free. This method is grounded in a classical linear algebraic orthogonal complement operation, implemented in the value space of each cross-attention layer within the UNet of diffusion models. An effective shift factor is designed to adaptively navigate the erasure strength, enhancing prior preservation without sacrificing erasure efficacy. Extensive experimental results show that the proposed AdaVD is effective at both single and multiple concept erasure, showing a 2- to 10-fold improvement in prior preservation as compared to the second best, meanwhile achieving the best or near best erasure efficacy, when comparing with both training-based and training-free state …
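The value-space orthogonal-complement operation has a compact form: subtract from each cross-attention value vector its component along the target concept's value direction, scaled by a shift factor. The snippet below is a minimal sketch of that algebra; how AdaVD obtains the concept direction and adapts the shift factor is not reproduced here.

```python
# Hedged sketch of value-space erasure by orthogonal complement: remove the
# component of each value vector that lies along the target concept's value
# direction; `shift` is a stand-in for the paper's adaptive shift factor.
import torch
import torch.nn.functional as F

def erase_in_value_space(values, concept_value, shift=1.0):
    """values: (N, D) cross-attention value vectors; concept_value: (D,)."""
    d = F.normalize(concept_value, dim=0)            # unit direction of the concept
    coeff = values @ d                                # (N,) projection coefficients
    return values - shift * coeff.unsqueeze(1) * d    # subtract the (scaled) projection

if __name__ == "__main__":
    v = torch.randn(5, 64)
    c = torch.randn(64)
    v_erased = erase_in_value_space(v, c, shift=1.0)
    # with shift = 1.0 the result is orthogonal to the concept direction
    print((v_erased @ F.normalize(c, dim=0)).abs().max())
```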
Poster
Jie Ren · Kangrui Chen · Yingqian Cui · Shenglai Zeng · Hui Liu · Yue Xing · Jiliang Tang · Lingjuan Lyu

[ ExHall D ]

Abstract
Text-to-image (T2I) diffusion models have shown exceptional capabilities in generating images that closely correspond to textual prompts. However, the advancement of T2I diffusion models presents significant risks, as the models could be exploited for malicious purposes, such as generating images with violence or nudity, or creating unauthorized portraits of public figures in inappropriate contexts. To mitigate these risks, concept removal methods have been proposed. These methods aim to modify diffusion models to prevent the generation of malicious and unwanted concepts. Despite these efforts, existing research faces several challenges: (1) a lack of consistent comparisons on a comprehensive dataset, (2) ineffective prompts in harmful and nudity concepts, (3) overlooked evaluation of the ability to generate the benign part within prompts containing malicious concepts. To address these gaps, we propose to benchmark the concept removal methods by introducing a new dataset, Six-CD, along with a novel evaluation metric. In this benchmark, we conduct a thorough evaluation of concept removals, with the experimental observations and discussions offering valuable insights in the field.
Poster
Huayang Huang · Xiangye Jin · Jiaxu Miao · Yu Wu

[ ExHall D ]

Abstract
The proliferation of text-to-image diffusion models (T2I DMs) has led to an increased presence of AI-generated images in daily life. However, biased T2I models can generate content with specific tendencies, potentially influencing people's perceptions. Intentional exploitation of these biases risks conveying misleading information to the public. Current research on bias primarily addresses explicit biases with recognizable visual patterns, such as skin color and gender. This paper introduces a novel form of implicit bias that lacks explicit visual features but can manifest in diverse ways across various semantic contexts. This subtle and versatile nature makes this bias challenging to detect, easy to propagate, and adaptable to a wide range of scenarios. We further propose an implicit bias injection attack framework (IBI-Attacks) against T2I diffusion models by precomputing a general bias direction in the prompt embedding space and adaptively adjusting it based on different inputs. Our attack module can be seamlessly integrated into pre-trained diffusion models in a plug-and-play manner without direct manipulation of user input or model retraining. Extensive experiments validate the effectiveness of our scheme in introducing bias through subtle and diverse modifications while preserving the original semantics. The strong concealment and transferability of our attack across various scenarios further underscore …
Poster
Zebin You · Xinyu Zhang · Hanzhong Guo · Jingdong Wang · Chongxuan Li

[ ExHall D ]

Abstract
The ultimate goal of generative models is to perfectly capture the data distribution. For image generation, common metrics of visual quality (e.g., FID) and the perceived truthfulness of generated images seem to suggest that we are nearing this goal. However, through distribution classification tasks, we reveal that, from the perspective of neural network-based classifiers, even advanced diffusion models are still far from this goal. Specifically, classifiers are able to consistently and effortlessly distinguish real images from generated ones across various settings. Moreover, we uncover an intriguing discrepancy: classifiers can easily differentiate between diffusion models with comparable performance (e.g., U-ViT-H vs. DiT-XL), but struggle to distinguish between models within the same family but of different scales (e.g., EDM2-XS vs. EDM2-XXL). Our methodology carries several important implications. First, it naturally serves as a diagnostic tool for diffusion models by analyzing specific features of generated data. Second, it sheds light on the model autophagy disorder and offers insights into the use of generated data: augmenting real data with generated data is more effective than replacing it.
Poster
Namhyuk Ahn · KiYoon Yoo · Wonhyuk Ahn · Daesik Kim · Seung-Hun Nam

[ ExHall D ]

Abstract
Recent advancements in diffusion models revolutionize image generation but pose risks of misuse, such as replicating artworks or generating deepfakes. Existing image protection methods, though effective, struggle to balance protection efficacy, invisibility, and latency, thus limiting practical use. We introduce perturbation pre-training to reduce latency and propose a mixture-of-perturbations approach that dynamically adapts to input images to minimize performance degradation. Our novel training strategy computes protection loss across multiple VAE feature spaces, while adaptive targeted protection at inference enhances robustness and invisibility. Experiments show comparable protection performance with improved invisibility and drastically reduced inference time. The code and demo are available at
Poster
Huan Teng · Yuhui Quan · Chengyu Wang · Jun Huang · Hui Ji

[ ExHall D ]

Abstract
Diffusion models, especially Denoising Diffusion Probabilistic Models (DDPMs) and their variants, are prevalent tools in generative AI, making the protection of their Intellectual Property (IP) rights increasingly important. Most existing methods on IP right protection for DDPMs are invasive, e.g., watermarking methods, which alter model parameters, raise concerns about performance degradation, and require extra computational resources for retraining or fine-tuning. In this paper, we propose the first non-invasive fingerprinting scheme for DDPMs, requiring no parameter changes or fine-tuning, and ensuring that the generation quality of DDPMs remains intact. We introduce a discriminative and robust fingerprint latent space, based on the well-designed crossing route of samples that span the performance border zone of DDPMs, with only black-box access required for the diffusion denoiser in the ownership verification stage. Extensive experiments demonstrate that our fingerprinting approach enjoys both robustness against commonly seen attacks and distinctiveness across various DDPMs, providing an alternative for protecting DDPMs' IP rights without compromising their performance or integrity.
Poster
Haoyue Bai · Yiyou Sun · Wei Cheng · Haifeng Chen

[ ExHall D ]

Abstract
The recent proliferation of photorealistic images created by generative models has sparked both excitement and concern, as these images are increasingly indistinguishable from real ones to the human eye. While offering new creative and commercial possibilities, the potential for misuse, such as in misinformation and fraud, highlights the need for effective detection methods. Current detection approaches often rely on access to model weights or require extensive collections of real image datasets, limiting their scalability and practical application in real-world scenarios. In this work, we introduce a novel black-box detection framework that requires only API access, sidestepping the need for model weights or large auxiliary datasets. Our approach leverages a corrupt-and-recover strategy: by masking part of an image and assessing the model’s ability to reconstruct it, we measure the likelihood that the image was generated by the model itself. For black-box models that do not support masked-image inputs, we incorporate a cost-efficient surrogate model trained to align with the target model’s distribution, enhancing detection capability. Our framework demonstrates strong performance, outperforming baseline methods by 4.31% in mean average precision across eight diffusion model variant datasets.
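The corrupt-and-recover strategy can be sketched as: mask part of the image, ask the (surrogate) model to fill it back in, and score how well the fill matches the original, with better recovery suggesting the image came from that model. Below is a hedged illustration; `inpaint` is a placeholder for an API or surrogate-model call, and the center-crop mask and MSE score are simplifying assumptions.

```python
# Hedged sketch of the corrupt-and-recover test: mask part of the image, ask a
# (surrogate) generator to fill it back in, and score how well the fill matches
# the original. `inpaint` is a placeholder for an API or surrogate-model call.
import torch

def corrupt_and_recover_score(image, inpaint, mask_frac=0.25):
    """image: (3, H, W) in [0, 1]; inpaint(masked_image, mask) -> (3, H, W)."""
    _, H, W = image.shape
    mh, mw = int(H * mask_frac), int(W * mask_frac)
    top, left = (H - mh) // 2, (W - mw) // 2
    mask = torch.zeros(1, H, W)
    mask[:, top:top + mh, left:left + mw] = 1.0
    masked = image * (1 - mask)
    recon = inpaint(masked, mask)
    err = ((recon - image) * mask).pow(2).sum() / mask.sum()
    return -err.item()        # higher score = better recovery = more likely "own" image

if __name__ == "__main__":
    img = torch.rand(3, 64, 64)
    cheat = lambda masked, mask: img          # a perfect inpainter for its own images
    print(corrupt_and_recover_score(img, cheat))
```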
Poster
Zhenglin Huang · Jinwei Hu · Yiwei He · Xiangtai Li · Xiaowei Huang · Bei Peng · Xingyu Zhao · Baoyuan Wu · Guangliang Cheng

[ ExHall D ]

Abstract
The rapid advancement of generative models in creating highly realistic images poses substantial risks for misinformation dissemination. For instance, a synthetic image, when shared on social media, can mislead extensive audiences and erode trust in digital content, resulting in severe repercussions. Despite some progress, academia has not yet created a large and diversified deepfake detection dataset for social media, nor has it devised an effective solution to address this issue. In this paper, we introduce the Social media Image Detection dataSet (SID-Set), which offers three key advantages: (1) extensive volume, featuring 300K AI-generated/tampered and authentic images with comprehensive annotations, (2) broad diversity, encompassing fully synthetic and tampered images across various classes, and (3) elevated realism, with images that are predominantly indistinguishable from genuine ones through mere visual inspection. Furthermore, leveraging the exceptional capabilities of large multimodal models, we propose a new image deepfake detection, localization, and explanation framework, named SIDA (Social media Image Detection, localization, and explanation Assistant). SIDA not only discerns the authenticity of images, but also delineates tampered regions through mask prediction and provides textual explanations of the model's judgment criteria. Compared with state-of-the-art deepfake detection models on SID-Set and other benchmarks, extensive experiments demonstrate that SIDA achieves …
Poster
Anqi Liang · Ciprian Adrian Corneanu · Qianli Feng · Giorgio Giannone · Aleix Martinez

[ ExHall D ]

Abstract
Evaluation of synthetic images is important for both model development and selection. An ideal evaluation should be specific, accurate and aligned with human perception. This paper addresses the problem of evaluating realism of objects in synthetic images. Although methods have been proposed to evaluate holistic realism, there are no methods tailored towards object-centric realism evaluation. In this work, we define a new standard for assessing object-centric realism that follows a shape-texture breakdown and propose the first object-centric realism evaluation dataset for synthetic images. The dataset contains images generated from state-of-the-art image generative models and is richly annotated at object level across a diverse set of object categories. We then design and train the OLIP model, a dedicated architecture that considerably outperforms any existing baseline on object-centric realism evaluation.
Poster
Reese Kneeland · Paul Scotti · Ghislain St-Yves · Jesse L Breedlove · Kendrick N Kay · Thomas Naselaris

[ ExHall D ]

Abstract
We release NSD-Imagery, a benchmark dataset of human fMRI activity paired with mental images, to complement the existing Natural Scenes Dataset (NSD), a large-scale dataset of fMRI activity paired with seen images that enabled unprecedented improvements in fMRI-to-image reconstruction efforts. Recent models trained on NSD have been evaluated only on seen image reconstruction. Using NSD-Imagery, it is possible to assess how well these models perform on mental image reconstruction. This is a challenging generalization requirement because mental images are encoded in human brain activity with relatively lower signal-to-noise and spatial resolution; however, generalization from seen to mental imagery is critical for real-world applications in medical domains and brain-computer interfaces, where the desired information is always internally generated. We provide benchmarks for a suite of recent NSD-trained open-source visual decoding models (MindEye1, MindEye2, Brain Diffuser, iCNN, Takagi et al.) on NSD-Imagery, and show that the performance of decoding methods on mental images is largely decoupled from performance on vision tasks. We further demonstrate that architectural choices significantly impact cross-decoding performance: models employing simple linear decoding architectures and multimodal feature decoding generalize better to mental imagery, while complex architectures tend to overfit training data recorded exclusively from vision. Our findings indicate that …
Poster
Nikola Zubic · Davide Scaramuzza

[ ExHall D ]

Abstract
State Space Models (SSMs) are powerful tools for modeling sequential data in computer vision and time series analysis domains. However, traditional SSMs are limited by fixed, one-dimensional sequential processing, which restricts their ability to model non-local interactions in high-dimensional data. While methods like Mamba and VMamba introduce selective and flexible scanning strategies, they rely on predetermined paths, which fails to efficiently capture complex dependencies. We introduce Graph-Generating State Space Models (GG-SSMs), a novel framework that overcomes these limitations by dynamically constructing graphs based on feature relationships. Using Chazelle's Minimum Spanning Tree algorithm, GG-SSMs adapt to the inherent data structure, enabling robust feature propagation across dynamically generated graphs and efficiently modeling complex dependencies. We validate GG-SSMs on 11 diverse datasets, including event-based eye-tracking, ImageNet classification, optical flow estimation, and six time series datasets. GG-SSMs achieve state-of-the-art performance across all tasks, surpassing existing methods by significant margins. Specifically, GG-SSM attains a top-1 accuracy of 84.9% on ImageNet, outperforming prior SSMs by 1%, reducing the KITTI-15 error rate to 2.77%, and improving eye-tracking detection rates by up to 0.33% with fewer parameters. These results demonstrate that dynamic scanning based on feature relationships significantly improves SSMs' representational power and efficiency, offering a versatile tool …
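The graph-construction step can be illustrated by building a k-nearest-neighbor distance graph over token features and extracting its minimum spanning tree. The sketch below uses SciPy's standard MST routine and scikit-learn's k-NN graph purely as stand-ins for the faster algorithm and feature pipeline referenced in the abstract.

```python
# Simplified sketch of the graph construction step: build a k-NN distance graph
# over token features and take its minimum spanning tree. SciPy's MST routine is
# a stand-in for the faster algorithm referenced in the abstract.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from sklearn.neighbors import kneighbors_graph

def feature_mst(tokens, k=8):
    """tokens: (N, D) array of token features -> sparse MST over the k-NN graph."""
    knn = kneighbors_graph(tokens, n_neighbors=k, mode="distance")  # (N, N) sparse
    return minimum_spanning_tree(knn)        # sparse matrix of spanning-tree edges

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(196, 64))      # e.g. 14x14 patch tokens
    mst = feature_mst(tokens)
    print(mst.nnz, "edges in the spanning tree")
```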
Poster
Fiona Ryan · Ajay Bati · Sangmin Lee · Daniel Bolya · Judy Hoffman · James Rehg

[ ExHall D ]

Abstract
We address the problem of gaze target estimation, which aims to predict where a person is looking in a scene. Predicting a person’s gaze target requires reasoning both about the person’s appearance and the contents of the scene. Prior works have developed increasingly complex, hand-crafted pipelines for gaze target estimation that carefully fuse features from separate scene encoders, head encoders, and auxiliary models for signals like depth and pose. Motivated by the success of general-purpose feature extractors on a variety of visual tasks, we propose Gaze-LLE, a novel transformer framework that streamlines gaze target estimation by leveraging features from a frozen DINOv2 encoder. We extract a single feature representation for the scene, and apply a person-specific positional prompt to decode gaze with a lightweight module. We demonstrate state-of-the-art performance across several gaze benchmarks and provide extensive analysis to validate our design choices.
Poster
Jingkang Yang · Shuai Liu · Hongming Guo · Yuhao Dong · Xiamengwei Zhang · Sicheng Zhang · Pengyun Wang · Zitang Zhou · Binzhu Xie · Ziyue Wang · Bei Ouyang · Zhengyu Lin · Marco Cominelli · Zhongang Cai · Bo Li · Yuanhan Zhang · Peiyuan Zhang · Fangzhou Hong · Joerg Widmer · Francesco Gringoli · Lei Yang · Ziwei Liu

[ ExHall D ]

Abstract
We introduce **EgoLife**, a project to develop an egocentric life assistant that accompanies and enhances personal efficiency through AI-powered wearable glasses. To lay the foundation for this assistant, we conducted a comprehensive data collection study where six participants lived together for one week, continuously recording their daily activities—including discussions, shopping, cooking, socializing, and entertainment—using AI glasses for multimodal egocentric video capture, along with synchronized third-person-view video references. This effort resulted in the **EgoLife Dataset**, a comprehensive 300-hour egocentric, interpersonal, multiview, and multimodal daily life dataset with intensive annotation. Leveraging this dataset, we introduce EgoLifeQA, a suite of long-context, life-oriented question-answering tasks designed to provide meaningful assistance in daily life by addressing practical questions such as recalling past relevant events, monitoring health habits, and offering personalized recommendations.To address the key technical challenges of **1)** developing robust visual-audio models for egocentric data, **2)** enabling accurate identity recognition, and **3)** facilitating long-context question answering over extensive temporal information, we introduce **EgoButler**, an integrated system comprising **EgoGPT** and **EgoRAG**. EgoGPT is a vision-language model trained on egocentric datasets, achieving state-of-the-art performance on egocentric video understanding. EgoRAG is a retrieval-based component that supports answering ultra-long-context questions. Our experimental studies verify their working mechanisms and reveal …
Poster
Ho Kei Cheng · Masato Ishii · Akio Hayakawa · Takashi Shibuya · Alexander G. Schwing · Yuki Mitsufuji

[ ExHall D ]

Abstract
We propose to synthesize high-quality and synchronized audio, given video and optional text conditions, using a novel multimodal joint training framework (**MMAudio**). In contrast to single-modality training conditioned on (limited) video data only, MMAudio is jointly trained with larger-scale, readily available text-audio data to learn to generate semantically aligned high-quality audio samples.Additionally, we improve audio-visual synchrony with a conditional synchronization module that aligns video conditions with audio latents at the frame level. Trained with a flow matching objective, MMAudio achieves new video-to-audio state-of-the-art among public models in terms of audio quality, semantic alignment, and audio-visual synchronization, while having a low inference time (1.23s to generate an 8s clip) and just 157M parameters. MMAudio also achieves surprisingly competitive performance in text-to-audio generation, showing that joint training does not hinder single-modality performance. Code and models will be made available.
Poster
Shentong Mo · Yibing Song

[ ExHall D ]

Abstract
Coordinated audio generation based on video inputs typically requires a strict audio-visual (AV) alignment, where both semantics and rhythmics of the generated audio segments shall correspond to those in the video frames. Previous studies leverage a two-stage design where the AV encoders are firstly aligned via contrastive learning, then the encoded video representations guide the audio generation process. We observe that both contrastive learning and global video guidance are effective in aligning overall AV semantics while limiting temporally rhythmic synchronization. In this work, we propose Foley-Flow to first align unimodal AV encoders via masked modeling training, where the masked audio segments are recovered under the guidance of the corresponding video segments. After training, the AV encoders which are separately pretrained using only unimodal data are aligned with semantic and rhythmic consistency. Then, we develop a dynamic conditional flow for the final audio generation. Built upon the efficient velocity flow generation framework, our dynamic conditional flow utilizes temporally varying video features as the dynamic condition to guide corresponding audio segment generations. To this end, we extract coherent semantic and rhythmic representations during masked AV alignment, and use this representation of video segments to guide audio generation temporally. Our audio results are …
Poster
Chen Liu · Peike Li · Liying Yang · Dadong Wang · Lincheng Li · Xin Yu

[ ExHall D ]

Abstract
Accurately localizing audible objects based on audio-visual cues is the core objective of audio-visual segmentation. Most previous methods emphasize spatial or temporal multi-modal modeling, yet overlook challenges from ambiguous audio-visual correspondences—such as nearby visually similar but acoustically different objects and frequent shifts in objects' sounding status. Consequently, they may struggle to reliably correlate audio and visual cues, leading to over- or under-segmentation. To address these limitations, we propose a novel framework with two primary components: an audio-guided modality alignment (AMA) module and an uncertainty estimation (UE) module. Instead of indiscriminately correlating audio-visual cues through a global attention mechanism, AMA performs audio-visual interactions within multiple groups and consolidates group features into compact representations based on their responsiveness to audio cues, effectively directing the model’s attention to audio-relevant areas. Leveraging contrastive learning, AMA further distinguishes sounding regions from silent areas by treating features with strong audio responses as positive samples and weaker responses as negatives. Additionally, UE integrates spatial and temporal information to identify high-uncertainty regions caused by frequent changes in sound state, reducing prediction errors by lowering confidence in these areas. Experimental results demonstrate that our approach achieves superior accuracy compared to existing state-of-the-art methods, particularly in challenging scenarios where traditional …
Poster
Boseung Jeong · Jicheol Park · Sungyeon Kim · Suha Kwak

[ ExHall D ]

Abstract
Video-text retrieval, the task of retrieving videos based on a textual query or vice versa, is of paramount importance for video understanding and multimodal information retrieval. Recent methods in this area rely primarily on visual and textual features and often ignore audio, although it helps enhance overall comprehension of video content. Moreover, traditional models that incorporate audio blindly utilize the audio input regardless of whether it is useful or not, resulting in suboptimal video representation. To address these limitations, we propose a novel video-text retrieval framework, Audio-guided VIdeo representation learning with GATEd attention (AVIGATE), that effectively leverages audio cues through a gated attention mechanism that selectively filters out uninformative audio signals. In addition, we propose an adaptive margin-based contrastive loss to deal with the inherently unclear positive-negative relationship between video and text, which facilitates learning better video-text alignment. Our extensive experiments demonstrate that AVIGATE achieves state-of-the-art performance on all the public benchmarks.
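Gated audio attention can be sketched as video tokens cross-attending to audio tokens, with a learned sigmoid gate, driven by the audio itself, scaling how much of that context is injected so uninformative audio can be suppressed. The module below is an illustrative approximation, not AVIGATE's architecture.

```python
# Illustrative sketch of gated audio attention: video tokens attend to audio
# tokens, and a learned gate (driven by the audio itself) scales how much of
# that audio context is injected, so uninformative audio can be shut off.
import torch
import torch.nn as nn

class GatedAudioAttention(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                  nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, video_tokens, audio_tokens):
        # video_tokens: (B, Nv, D), audio_tokens: (B, Na, D)
        ctx, _ = self.attn(video_tokens, audio_tokens, audio_tokens)
        g = self.gate(audio_tokens.mean(dim=1))        # (B, 1): ~0 means ignore audio
        return video_tokens + g.unsqueeze(1) * ctx

if __name__ == "__main__":
    m = GatedAudioAttention()
    v, a = torch.randn(2, 16, 512), torch.randn(2, 32, 512)
    print(m(v, a).shape)                               # torch.Size([2, 16, 512])
```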
Poster
Yuji Wang · Haoran Xu · Yong Liu · Jiaze Li · Yansong Tang

[ ExHall D ]

Abstract
Reference Audio-Visual Segmentation (Ref-AVS) aims to provide a pixel-wise scene understanding in Language-aided Audio-Visual Scenes (LAVS). This task requires the model to continuously segment objects referred to by text and audio from a video. Previous dual-modality methods always fail due to the lack of a third modality and the existing triple-modality method struggles with spatio-temporal consistency, leading to the target shift of different frames. In this work, we introduce a novel framework, termed SAM2-LOVE, which integrates textual, audio, and visual representations into a learnable token to prompt and align SAM2 for achieving Ref-AVS in the LAVS. Technically, our approach includes a multimodal fusion module aimed at improving multimodal understanding of SAM2, as well as token propagation and accumulation strategies designed to enhance spatio-temporal consistency without forgetting historical information. We conducted extensive experiments to demonstrate that SAM2-LOVE outperforms the SOTA by 8.5% in J&F on the Ref-AVS benchmark and showcase the simplicity and effectiveness of the components. Our code will be available soon.
Poster
Sihong Huang · Jiaxin Wu · Xiaoyong Wei · Yi Cai · Dongmei Jiang · Yaowei Wang

[ ExHall D ]

Abstract
Understanding human behavior and the environmental information in the egocentric video is very challenging due to the invisibility of some actions (e.g., laughing and sneezing) and the local nature of the first-person view. Leveraging the corresponding exocentric video to provide global context has shown promising results. However, existing visual-to-visual and visual-to-textual Ego-Exo video alignment methods struggle with the problem that the overlap between views for the same activity may not be visual. To address this, we propose using sound as a bridge, as audio is often consistent across Ego-Exo videos. However, direct audio-to-audio alignment lacks context. Thus, we introduce two context-aware sound modules: one aligns audio with vision via a visual-audio cross-attention module, and another aligns text with sound closed captions generated by an LLM. Experimental results on two Ego-Exo video association benchmarks show that each of the two proposed modules improves upon the state-of-the-art methods. Moreover, the proposed sound-aware egocentric or exocentric representation boosts the performance of downstream tasks, such as action recognition of exocentric videos and scene recognition of egocentric videos. The code and models can be accessed at https://github.com/open_upon_acceptance.
Poster
Songhao Han · Wei Huang · Hairong Shi · Le Zhuo · Xiu Su · Shifeng Zhang · Xu Zhou · Xiaojuan Qi · Yue Liao · Si Liu

[ ExHall D ]

Abstract
The advancement of Large Vision Language Models (LVLMs) has significantly improved multimodal understanding, yet challenges remain in video reasoning tasks due to the scarcity of high-quality, large-scale datasets. Existing video question-answering (VideoQA) datasets often rely on costly manual annotations with insufficient granularity or automatic construction methods with redundant frame-by-frame analysis, limiting their scalability and effectiveness for complex reasoning. To address these challenges, we introduce VideoEspresso, a novel dataset that features VideoQA pairs preserving essential spatial details and temporal coherence, along with multimodal annotations of intermediate reasoning steps. Our construction pipeline employs a semantic-aware method to reduce redundancy, followed by generating QA pairs using GPT-4o. We further develop video Chain-of-Thought (CoT) annotations to enrich reasoning processes, guiding GPT-4o in extracting logical relationships from QA pairs and video content. To exploit the potential of high-quality VideoQA pairs, we propose a Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a two-stage instruction fine-tuned reasoning LVLM. This framework adaptively selects core frames and performs CoT reasoning using multimodal evidence. Evaluated on our proposed benchmark with 14 tasks against 9 popular LVLMs, our method outperforms existing baselines on most tasks, demonstrating superior video reasoning capabilities.
Poster
Yulu Pan · Ce Zhang · Gedas Bertasius

[ ExHall D ]

Abstract
We present BASKET, a large-scale basketball video dataset for fine-grained skill estimation. BASKET contains more than 4,400 hours of video capturing 32,232 basketball players from all over the world. Compared to prior skill estimation datasets, our dataset includes a massive number of skilled participants with unprecedented diversity in terms of gender, age, skill level, geographical location, etc. BASKET includes 20 fine-grained basketball skills, challenging modern video recognition models to capture the intricate nuances of player skill through in-depth video analysis. Given a long highlight video (8-10 minutes) of a particular player, the model needs to predict the skill level (e.g., excellent, good, average, fair, poor) for each of the 20 basketball skills. Our empirical analysis reveals that the current state-of-the-art video models struggle with this task, significantly lagging behind the human baseline. We believe that BASKET could be a useful resource for developing new video models with advanced long-range, fine-grained recognition capabilities. In addition, we hope that our dataset will be useful for domain-specific applications such as fair basketball scouting, personalized player development, and many others. We will release the dataset upon the acceptance of the paper.
Poster
Lan Wang · Yujia Chen · Wen-Sheng Chu · Vishnu Naresh Boddeti · Du Tran

[ ExHall D ]

Abstract
Long video understanding presents challenges due to the inherent high computational complexity and redundant temporal information. An effective representation for long videos must process such redundancy efficiently while preserving essential content for downstream tasks. This paper introduces **SE**mantic **A**ttention **L**earning (SEAL), a novel unified representation for long videos. To reduce computational complexity, long videos are decomposed into three distinct types of semantic entities: scenes, objects, and actions, allowing models to operate on a handful of entities rather than a large number of frames or pixels. To further address redundancy, we propose an attention learning module that balances token relevance with diversity, formulated as a subset selection optimization problem. Our representation is versatile, enabling applications across various long video understanding tasks. Extensive experiments show that SEAL significantly outperforms state-of-the-art methods in video question answering and temporal grounding tasks on benchmarks including LVBench, MovieChat-1K, and Ego4D.
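The attention learning module is described as balancing token relevance with diversity via subset selection, but the exact objective is not given. The sketch below uses a greedy maximal-marginal-relevance style selection as a stand-in for that idea; the scoring function and trade-off weight are assumptions.

```python
# Greedy relevance-vs-diversity selection (a stand-in objective, not SEAL's exact formulation).
import numpy as np

def select_entities(features, query, k=32, lam=0.7):
    """features: (N, D) entity/token features; query: (D,) task embedding. Returns k indices."""
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    q = query / (np.linalg.norm(query) + 1e-8)
    relevance = f @ q
    selected, candidates = [], list(range(len(f)))
    while candidates and len(selected) < k:
        if selected:
            redundancy = np.max(f[candidates] @ f[selected].T, axis=1)  # similarity to chosen set
        else:
            redundancy = np.zeros(len(candidates))
        scores = lam * relevance[candidates] - (1 - lam) * redundancy   # relevance vs. diversity
        best = candidates[int(np.argmax(scores))]
        selected.append(best)
        candidates.remove(best)
    return selected
```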
Poster
Lehan Yang · Lu Qi · Xiangtai Li · Sheng Li · Varun Jampani · Ming-Hsuan Yang

[ ExHall D ]

Abstract
We present a unified network for simultaneously generating videos and their corresponding entity segmentation and depth maps from text prompts. We utilize colormaps to represent entity masks and depth maps, tightly integrating dense prediction with RGB video generation. Introducing dense prediction information improves the consistency and motion smoothness of video generation without increasing computational costs. Incorporating learnable task embeddings brings multiple dense prediction tasks into a single model, enhancing flexibility and further boosting performance. We further propose a large-scale dense prediction video dataset, Panda-Dense, addressing the issue that existing datasets do not concurrently provide captions, videos, segmentation, and depth maps. Comprehensive experiments demonstrate the high efficiency of our method, surpassing the state of the art in terms of video quality, consistency, and motion smoothness. All source code and models will be made publicly available.
Poster
Tiehan Fan · Kepan Nan · Rui Xie · Penghao Zhou · Zhenheng Yang · Chaoyou Fu · Xiang Li · Jian Yang · Ying Tai

[ ExHall D ]

Abstract
Text-to-video generation has evolved rapidly in recent years, delivering remarkable results. Training typically relies on video-caption paired data, which plays a crucial role in enhancing generation performance. However, current video captions often suffer from insufficient detail, hallucinations, and imprecise motion depiction, affecting the fidelity and consistency of generated videos. In this work, we propose a novel instance-aware structured caption framework, termed InstanceCap, to achieve instance-level and fine-grained video captioning for the first time. Based on this scheme, we design an auxiliary model cluster to convert the original video into instances and thereby enhance instance fidelity. Video instances are further used to refine dense prompts into structured phrases, achieving concise yet precise descriptions. Furthermore, a 22K InstanceVid dataset is curated for training, and an enhancement pipeline tailored to the InstanceCap structure is proposed for inference. Experimental results demonstrate that our proposed InstanceCap significantly outperforms previous models, ensuring high fidelity between captions and videos while reducing hallucinations.
Poster
Weijia Wu · Mingyu Liu · Zeyu Zhu · Haoen Feng · Xi Xia · Wen Wang · Kevin Qinghong Lin · Chunhua Shen · Mike Zheng Shou

[ ExHall D ]

Abstract
Recent advancements in video generation models, such as Stable Video Diffusion, have shown promising results, but these works primarily focus on short videos, often limited to a single scene and lacking a rich storyline. These models struggle with generating long videos that involve multiple scenes, coherent narratives, and consistent characters. Furthermore, there is currently no publicly accessible dataset specifically designed for analyzing, evaluating, and training models for long video generation. In this paper, we present MovieBench: A Hierarchical Movie-Level Dataset for Long Video Generation, which addresses these challenges by providing unique contributions: (1) character consistency across scenes, (2) long videos with rich and coherent storylines, and (3) multi-scene narratives. MovieBench features three distinct levels of annotation: the movie level, which provides a broad overview of the film; the scene level, offering a mid-level understanding of the narrative; and the shot level, which emphasizes specific moments with detailed descriptions.
Poster
feilong tang · Chengzhi Liu · Zhongxing Xu · Ming Hu · Zile Huang · Haochen Xue · Ziyang Chen · Zelin Peng · Zhiwei Yang · Sijin Zhou · Wenxue Li · Yulong Li · Wenxuan Song · Shiyan Su · Wei Feng · Jionglong Su · Mingquan Lin · Yifan Peng · Xuelian Cheng · Imran Razzak · Zongyuan Ge

[ ExHall D ]

Abstract
Recent advancements in multimodal large language models (MLLMs) have significantly improved performance in visual question answering. However, they often suffer from hallucinations. In this work, hallucinations are categorized into two main types: initial hallucinations and snowball hallucinations. We argue that adequate contextual information can be extracted directly from the token interaction process. Inspired by causal inference in decoding strategies, we propose to leverage causal masks to establish information propagation between multimodal tokens. The hypothesis is that insufficient interaction between those tokens may lead the model to rely on outlier tokens, overlooking dense and rich contextual cues. Therefore, we propose to intervene in the propagation process by tackling outlier tokens to enhance in-context inference. With this goal, we present FarSight, a versatile plug-and-play decoding strategy that reduces attention interference from outlier tokens merely by optimizing the causal mask. The heart of our method is effective token propagation. We design an attention register structure within the upper triangular matrix of the causal mask, dynamically capturing the attention diverted to outlier tokens. Moreover, a positional awareness encoding method with a diminishing masking rate is proposed, allowing the model to attend to farther preceding tokens, especially for video sequence tasks. With extensive experiments, …
Poster
chenkai zhang · Yiming Lei · Zeming Liu · Haitao Leng · Shaoguo Liu · Tingting Gao · Qingjie Liu · Yunhong Wang

[ ExHall D ]

Abstract
With the rapid development of Multi-modal Large Language Models (MLLMs), an increasing number of benchmarks have been established to evaluate the video understanding capabilities of these models. However, these benchmarks focus solely on standalone videos and assess only “visual elements” in videos, such as human actions and object states. In reality, contemporary videos often encompass complex and continuous narratives, typically presented as a series. To address this challenge, we propose SeriesBench, a benchmark consisting of 105 carefully curated narrative-driven series, covering 28 specialized tasks that require deep narrative understanding to solve. Specifically, we first select a diverse set of drama series spanning various genres. Then, we introduce a novel long-span narrative annotation method, combined with a full-information transformation approach to convert manual annotations into diverse task formats. To further enhance the model's capacity for detailed analysis of plot structures and character relationships within series, we propose a novel narrative reasoning framework, PC-DCoT. Extensive results on SeriesBench indicate that existing MLLMs still face significant challenges in understanding narrative-driven series, while PC-DCoT enables these MLLMs to achieve performance improvements. Overall, our SeriesBench and PC-DCoT highlight the critical necessity of advancing model capabilities for understanding narrative-driven series, guiding future MLLM development.
Poster
Mor Shpigel Nacson · Aviad Aberdam · Roy Ganz · Elad Ben Avraham · Alona Golts · Yair Kittenplon · Shai Mazor · Ron Litman

[ ExHall D ]

Abstract
Vision-Language Models (VLMs) excel in diverse visual tasks but face challenges in document understanding, which requires fine-grained text processing. While typical visual tasks perform well with low-resolution inputs, reading-intensive applications demand high-resolution inputs, resulting in significant computational overhead. Using OCR-extracted text in VLM prompts partially addresses this issue but underperforms its full-resolution counterpart, as it lacks the complete visual context needed for optimal performance. We introduce DocVLM, a method that integrates an OCR-based modality into VLMs to enhance document processing while preserving original weights. Our approach employs an OCR encoder to capture textual content and layout, compressing these into a compact set of learned queries incorporated into the VLM. Comprehensive evaluations across leading VLMs show that DocVLM significantly reduces reliance on high-resolution images for document understanding. In limited-token regimes (448×448), DocVLM with 64 learned queries improves DocVQA results from 56.0% to 86.6% when integrated with InternVL2 and from 84.4% to 91.2% with Qwen2-VL. In LLaVA-OneVision, DocVLM achieves improved results while using 80% fewer image tokens. The reduced token usage allows processing multiple pages effectively, showing impressive zero-shot results on DUDE and state-of-the-art performance on MP-DocVQA, highlighting DocVLM’s potential for applications requiring both high performance and efficiency.
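The abstract states that OCR content and layout are compressed into a compact set of learned queries, but the exact module is not described. A Perceiver-style resampler, sketched below under that assumption, is one standard way to realize this: 64 learned queries cross-attend to the OCR encoder's outputs, and the 64 resulting tokens are appended to the VLM input.

```python
# Sketch of compressing variable-length OCR features into 64 learned queries via
# cross-attention (assumed design; the class name and dimensions are illustrative).
import torch
import torch.nn as nn

class OCRQueryCompressor(nn.Module):  # hypothetical name
    def __init__(self, dim=1024, num_queries=64, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)  # learned queries
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, ocr_feats):
        # ocr_feats: (B, N_ocr, D) encoded OCR words + layout; output: (B, num_queries, D)
        q = self.queries.unsqueeze(0).expand(ocr_feats.size(0), -1, -1)
        out, _ = self.attn(q, ocr_feats, ocr_feats)
        return out  # compact OCR tokens to concatenate with the VLM's visual/text tokens

compressed = OCRQueryCompressor()(torch.randn(2, 700, 1024))  # (2, 64, 1024)
```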
Poster
Sagnik Majumder · Tushar Nagarajan · Ziad Al-Halah · Reina Pradhan · Kristen Grauman

[ ExHall D ]

Abstract
Given a multi-view video, which viewpoint is most informative for a human observer? Existing methods rely on heuristics or expensive “best-view” supervision to answer this question, limiting their applicability. We propose a weakly supervised approach that leverages language accompanying an instructional multi-view video as a means to recover its most informative viewpoint(s). Our key hypothesis is that the more accurately an individual view can predict a view-agnostic text summary, the more informative it is. To put this into action, we propose a framework that uses the relative accuracy of view-dependent caption predictions as a proxy for best-view pseudo-labels. Then, those pseudo-labels are used to train a view selector, together with an auxiliary camera pose predictor that enhances view-sensitivity. During inference, our model takes as input only a multi-view video—no language or camera poses—and returns the best viewpoint to watch at each timestep. On two challenging datasets comprising diverse multi-camera setups and how-to activities, our model consistently outperforms state-of-the-art baselines, both with quantitative metrics and human evaluation.
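The pseudo-labeling step can be stated very compactly: the view whose caption prediction best matches the view-agnostic summary becomes the best-view pseudo-label for that timestep. A minimal sketch, with illustrative loss values:

```python
# Sketch: derive best-view pseudo-labels from view-dependent caption prediction quality.
# caption_losses[v, t] = captioning loss of view v at timestep t; lower loss means the
# view predicts the view-agnostic summary better, i.e., it is more informative.
import numpy as np

def best_view_pseudo_labels(caption_losses):
    """caption_losses: (num_views, T) array. Returns (T,) array of best-view indices."""
    return np.argmin(caption_losses, axis=0)

losses = np.array([[2.1, 1.7, 3.0],   # view 0
                   [1.4, 2.2, 2.5],   # view 1
                   [2.8, 2.0, 1.1]])  # view 2
print(best_view_pseudo_labels(losses))  # [1 0 2]
# These pseudo-labels supervise a view selector that sees only the video at inference time.
```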
Poster
Zhongwei Ren · Yunchao Wei · Xun Guo · Yao Zhao · Bingyi Kang · Jiashi Feng · Xiaojie Jin

[ ExHall D ]

Abstract
This work explores whether a deep generative model can learn complex knowledge solely from visual input, in contrast to the prevalent focus on text-based models like large language models (LLMs). We develop an autoregressive video generation model, Visioner, trained exclusively on raw video data, and test its knowledge acquisition abilities in video-based Go and robotic control environments. Our experiments reveal two key findings: (1) video-only training provides sufficient information for learning extensive knowledge, and (2) the compactness of visual representations significantly enhances learning efficiency. To improve both the efficiency and efficacy of knowledge learning, we introduce the Latent Dynamics Model (LDM). Remarkably, Visioner reaches a 5-dan professional level in the Video-GoBench with just a 300-million-parameter model, without relying on search algorithms or reward mechanisms typical in reinforcement learning. This study opens new avenues for knowledge acquisition from visual data, with all code, data, and models to be open-sourced for further research.
Poster
Chris Dongjoo Kim · Jihwan Moon · Sangwoo Moon · Heeseung Yun · Sihaeng Lee · Aniruddha Kembhavi · Soonyoung Lee · Gunhee Kim · Sangho Lee · Christopher Clark

[ ExHall D ]

Abstract
The rapid growth of video-text data presents challenges in storage and computation during training. Online learning, which processes streaming data in real time, offers a promising solution to these issues while also allowing swift adaptation in scenarios demanding real-time responsiveness. One strategy to enhance the efficiency and effectiveness of learning is to identify and prioritize data that improves performance on target downstream tasks. We propose the Relevance and Specificity-based online filtering framework (ReSpec), which selects data based on four criteria: (i) modality alignment for clean data, (ii) task relevance for target-focused data, (iii) specificity for informative and detailed data, and (iv) efficiency for low-latency processing. Relevance is determined by the probabilistic alignment of incoming data with downstream tasks, while specificity employs the distance to a root embedding, representing the least specific data, as an efficient proxy for informativeness. By establishing reference points from target task data, ReSpec filters incoming data in real time, eliminating the need for extensive storage and compute. Evaluated on the large-scale datasets WebVid2M and VideoCC3M, ReSpec attains state-of-the-art performance on five zero-shot video retrieval tasks, using as little as 5% of the data while incurring minimal compute.
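The four criteria translate naturally into a per-sample filtering decision. The sketch below is illustrative only: the thresholds, the cosine-similarity scoring, and the way relevance is aggregated over task references are assumptions rather than ReSpec's exact rules.

```python
# Illustrative online filtering decision in the spirit of ReSpec (not its exact scoring).
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def keep_sample(video_emb, text_emb, task_refs, root_emb,
                align_thr=0.3, rel_thr=0.25, spec_thr=0.5):
    aligned = cosine(video_emb, text_emb) > align_thr                 # (i) modality alignment
    relevant = max(cosine(text_emb, r) for r in task_refs) > rel_thr  # (ii) task relevance
    specific = np.linalg.norm(text_emb - root_emb) > spec_thr         # (iii) far from the "least specific" root
    return aligned and relevant and specific                          # (iv) every check is cheap -> low latency
```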
Poster
Nina Shvetsova · Arsha Nagrani · Bernt Schiele · Hilde Kuehne · Christian Rupprecht

[ ExHall D ]

Abstract
We propose a new "Unbiased through Textual Description (UTD)" video benchmark based on unbiased subsets of existing video classification and retrieval datasets to enable a more robust assessment of video understanding capabilities. Namely, we tackle the problem that current video benchmarks may suffer from different representation biases, e.g., object bias or single-frame bias, where mere recognition of objects or utilization of only a single frame is sufficient for correct prediction. We leverage VLMs and LLMs to analyze and debias video benchmarks from such representation biases. Specifically, we generate frame-wise textual descriptions of videos, filter them for specific information (e.g., only objects), and leverage them to examine representation biases across three dimensions: 1) concept bias — determining if a specific concept (e.g., objects) alone suffices for prediction; 2) temporal bias — assessing if temporal information contributes to prediction; and 3) common sense vs. dataset bias — evaluating whether zero-shot reasoning or dataset correlations contribute to prediction. Since our new toolkit allows us to analyze representation biases at scale without additional human annotation, we conduct a systematic and comprehensive analysis of representation biases in 12 popular video classification and retrieval datasets and create new object-debiased test splits for these datasets. Moreover, we benchmark 33 …
Poster
Dahun Kim · AJ Piergiovanni · Ganesh Satish Mallya · Anelia Angelova

[ ExHall D ]

Abstract
We introduce a benchmark and learning framework for advancing video-text compositionality understanding, aimed at enhancing vision-language models (VLMs) in fine-grained temporal alignment. Unlike existing benchmarks focused on static image-text compositionality or isolated single-event videos, our benchmark focuses on fine-grained video-text alignment in continuous multi-event videos. Leveraging video-text datasets with temporally localized event captions (e.g., ActivityNet-Captions, YouCook2), we create challenging negative samples with subtle temporal disruptions such as reordering, action word replacements, partial captioning, and combined disruptions that comprehensively test models’ compositional sensitivity across extended, cohesive video-text sequences. To enhance model performance, we propose a hierarchical pairwise preference loss that strengthens alignment with temporally accurate pairs and progressively reduces similarity for increasingly disrupted pairs, encouraging fine-grained compositional alignment. To mitigate the limited availability of densely annotated video data, we introduce a pretraining strategy that concatenates short video-caption pairs to simulate multi-event sequences, facilitating effective compositional learning. We evaluate large multimodal models (LMMs) on our benchmark, identifying both strengths and areas for improvement in video-text compositionality. Our work provides a comprehensive framework for assessing and advancing model capabilities in achieving fine-grained, temporally coherent video-text alignment.
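One plausible reading of the hierarchical pairwise preference loss is a chain of margin constraints: a video should be more similar to its correct caption than to a mildly disrupted one, and more similar to a mildly disrupted caption than to a heavily disrupted one. The sketch below implements that reading; the margin value and the two-level hierarchy are assumptions.

```python
# Margin-based sketch of a hierarchical pairwise preference loss (one plausible
# instantiation of the idea described above, not necessarily the paper's exact formula).
import torch
import torch.nn.functional as F

def hierarchical_preference_loss(sim_pos, sim_mild, sim_severe, margin=0.1):
    """Each argument: (B,) similarities between a video and a caption variant."""
    l1 = F.relu(margin + sim_mild - sim_pos)     # correct pair preferred over mild disruption
    l2 = F.relu(margin + sim_severe - sim_mild)  # mild disruption preferred over severe disruption
    return (l1 + l2).mean()

loss = hierarchical_preference_loss(torch.tensor([0.8]), torch.tensor([0.6]), torch.tensor([0.3]))
```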
Poster
Shyamal Buch · Arsha Nagrani · Anurag Arnab · Cordelia Schmid

[ ExHall D ]

Abstract
Video-language models have shown promise for addressing a range of multimodal tasks for video understanding, such as video question-answering. However, the inherent computational challenges of processing long video data and increasing model sizes have led to standard approaches that are limited by the number of frames they can process. In this work, we propose the Flexible Frame Selector (FFS), a learnable policy model with a new flexible selection operation, that helps alleviate input context restrictions by enabling video-language models to focus on the most informative frames for the downstream multimodal task, without adding undue processing cost. Our method differentiates from prior work due to its learnability, efficiency, and flexibility. We verify the efficacy of our method on standard video-question answering and reasoning benchmarks, and observe that our model can improve base video-language model accuracy while reducing the number of downstream processed frames.
Poster
Joya Chen · Yiqi Lin · Ziyun Zeng · Wei Li · Zejun Ma · Mike Zheng Shou

[ ExHall D ]

Abstract
Recent video large language models (Video LLMs) often depend on costly human annotations or proprietary APIs (e.g., GPT-4) to produce training data, which limits their training at scale. In this paper, we explore large-scale training for Video LLMs with cheap automatic speech recognition (ASR) transcripts. Specifically, we propose a novel streaming training approach that densely interleaves the ASR words and video frames according to their timestamps. Compared to previous studies in vision-language representation with ASR, our method enables the model to learn temporally fine-grained vision-language correlations. To support this, we introduce a series of data processing techniques on YouTube videos and closed captions (CC), resulting in 30M pre-training data samples and 1.5M for instruction tuning. Benefiting from our training paradigm, the trained model is powerful in streaming applications and can naturally support real-time video commentary. We also introduce a new benchmark focused on sports commentary and event understanding, a domain where live performance is critical. Experiments show that our model outperforms state-of-the-art models in both accuracy and latency. Additionally, our model achieves state-of-the-art or competitive results on several mainstream benchmarks, demonstrating its broad generalizability. We will release the codes, datasets, and models to facilitate further research.
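The core data layout, densely interleaving ASR words and frame tokens by timestamp, can be sketched directly (token names below are illustrative):

```python
# Sketch: merge timestamped frames and ASR words into one stream ordered by time,
# so each word lands next to the frames it temporally overlaps.
def interleave(frames, asr_words):
    """frames: list of (timestamp, frame_token); asr_words: list of (timestamp, word)."""
    stream = [(t, ("frame", f)) for t, f in frames] + [(t, ("word", w)) for t, w in asr_words]
    stream.sort(key=lambda x: x[0])
    return [item for _, item in stream]

frames = [(0.0, "<f0>"), (0.5, "<f1>"), (1.0, "<f2>")]
asr = [(0.2, "the"), (0.4, "player"), (0.9, "shoots")]
print(interleave(frames, asr))
# [('frame', '<f0>'), ('word', 'the'), ('word', 'player'),
#  ('frame', '<f1>'), ('word', 'shoots'), ('frame', '<f2>')]
```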
Poster
Md Mohaiminul Islam · Tushar Nagarajan · Huiyu Wang · Gedas Bertasius · Lorenzo Torresani

[ ExHall D ]

Abstract
Video Question Answering (VQA) in long videos poses the key challenge of extracting relevant information and modeling long-range dependencies from many redundant frames. The self-attention mechanism provides a general solution for sequence modeling, but it has a prohibitive cost when applied to a massive number of spatiotemporal tokens in long videos. To lower the computational cost, most prior methods rely on compression strategies, such as reducing the input length via sparse frame sampling or compressing the output sequence passed to the large language model (LLM) via space-time pooling. However, these naive approaches over-represent redundant information and often miss salient events or fast-occurring space-time patterns. In this work, we introduce \model, an efficient state-space model to handle long-form videos. Our model leverages the selective scan algorithm to learn to effectively select critical information from high-dimensional video and transform it into a token sequence that is orders of magnitude smaller for efficient LLM processing. Extensive experiments demonstrate that \model\ achieves state-of-the-art accuracy on multiple long-form VQA benchmarks, including EgoSchema, NextQA, TempCompass, and MVBench.
Poster
Yangliu Hu · Zikai Song · Na Feng · Yawei Luo · Junqing Yu · Yi-Ping Phoebe Chen · Wei Yang

[ ExHall D ]

Abstract
Video-based Large Language Models (Video-LLMs) have witnessed substantial advancements in recent years, propelled by progress in multi-modal LLMs. Although these models are proficient at providing overall descriptions of videos, they struggle with fine-grained understanding, particularly in aspects such as visual dynamics and queries about video details. To tackle these shortcomings, we find that fine-tuning Video-LLMs on self-supervised fragment tasks greatly improves their fine-grained video understanding abilities. Hence we propose two key contributions: (1) Self-Supervised Fragment Fine-Tuning (SF2T), a novel, effortless fine-tuning method that employs the rich inherent characteristics of videos for training while unlocking more fine-grained understanding in Video-LLMs. Moreover, it relieves researchers from labor-intensive annotation and circumvents the limitations of natural language, which often fails to capture the complex spatiotemporal variations in videos; (2) a novel benchmark dataset, namely FineVidBench, for rigorously assessing Video-LLMs' performance at both the scene and fragment levels, offering a comprehensive evaluation of their capabilities. We assessed multiple models and validated the effectiveness of SF2T on them. Experimental results reveal that our approach improves their ability to capture and interpret spatiotemporal details.
Poster
Xi Tang · Jihao Qiu · Lingxi Xie · Yunjie Tian · Jianbin Jiao · Qixiang Ye

[ ExHall D ]

Abstract
Multimodal large language models (MLLMs) have enabled open-world visual understanding by injecting visual input as extra tokens into large language models (LLMs) as contexts. However, when the visual input changes from a single image to a long video, the above paradigm encounters difficulty because the vast amount of video tokens has significantly exceeded the maximal capacity of MLLMs. Therefore, existing video-based MLLMs are mostly established upon sampling a small portion of tokens from input data, which can cause key information to be lost and thus produce incorrect answers. This paper presents a simple yet effective algorithm named Adaptive Keyframe Sampling (AKS). It inserts a plug-and-play module known as keyframe selection, which aims to maximize the useful information with a fixed number of video tokens. We formulate keyframe selection as an optimization involving (1) the relevance between the keyframes and the prompt, and (2) the coverage of the keyframes over the video, and present an adaptive algorithm to approximate the best solution. Experiments on two long video understanding benchmarks validate that AKS improves video QA accuracy (beyond strong baselines) upon selecting informative keyframes. Our study reveals the importance of information pre-filtering in video-based MLLMs. Our code and models will be open-sourced.
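AKS formulates keyframe selection as jointly maximizing prompt relevance and coverage over the video. The paper's adaptive algorithm is approximated here by a deliberately simple bin-and-pick stand-in: split the timeline into as many bins as the frame budget (coverage) and take the most relevant frame in each bin (relevance).

```python
# Simplified stand-in for relevance + coverage keyframe selection (AKS's actual adaptive
# algorithm is more involved; this sketch only illustrates the two criteria).
import numpy as np

def select_keyframes(relevance, budget):
    """relevance: (T,) frame-vs-prompt scores (e.g., CLIP similarity); budget: #keyframes."""
    bins = np.array_split(np.arange(len(relevance)), budget)              # coverage: one slot per segment
    return [int(b[np.argmax(relevance[b])]) for b in bins if len(b) > 0]  # relevance: best frame per slot

keyframes = select_keyframes(np.random.rand(1000), budget=32)  # 32 frames spread over the video
```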
Poster
Haoxing Chen · Zizheng Huang · Yan Hong · YANSHUO WANG · Zhongcai Lyu · Zhuoer Xu · Jun Lan · Zhangxuan Gu

[ ExHall D ]

Abstract
Pre-trained vision-language models provide a robust foundation for efficient transfer learning across various downstream tasks. In the field of video action recognition, mainstream approaches often introduce additional parameter modules to capture temporal information. While the increased model capacity brought by these additional parameters helps better fit the video-specific inductive biases, existing methods require learning a large number of parameters and are prone to catastrophic forgetting of the original generalizable knowledge. In this paper, we propose a simple yet effective Multi-modal Spatio-Temporal Adapter (MSTA) to improve the alignment between representations in the text and vision branches, achieving a balance between general knowledge and task-specific knowledge. Furthermore, to mitigate over-fitting and enhance generalizability, we introduce a spatio-temporal description-guided consistency constraint. This constraint involves feeding template inputs (i.e., "a video of {cls}") into the trainable language branch, while LLM-generated spatio-temporal descriptions are input into the pre-trained language branch, enforcing consistency between the outputs of the two branches. This mechanism prevents over-fitting to downstream tasks and improves the distinguishability of the trainable branch within the spatio-temporal semantic space. We evaluate the effectiveness of our approach across four tasks: zero-shot transfer, few-shot learning, base-to-novel generalization, and fully-supervised learning. Compared to many state-of-the-art methods, our MSTA …
Poster
shaoyu liu · Jianing Li · guanghui zhao · Yunjian Zhang · Xin Meng · Fei Richard Yu · Xiangyang Ji · Ming Li

[ ExHall D ]

Abstract
Event cameras record visual information as asynchronous pixel change streams, excelling at scene perception under unsatisfactory lighting or high-dynamic conditions. Existing multimodal large language models (MLLMs) concentrate on natural RGB images, failing in scenarios where event data fits better. In this paper, we introduce EventGPT, the first MLLM for event stream understanding, to the best of our knowledge, marking a pioneering attempt to integrate large language models (LLMs) with event stream comprehension. Our EventGPT comprises an event encoder, followed by a spatio-temporal aggregator, a linear projector, an event-language adapter, and an LLM. First, RGB image-text pairs generated by GPT are leveraged to warm up the linear projector, following LLaVA, as the gap between the natural image and language modalities is relatively small. Second, we construct a synthetic yet large dataset, N-ImageNet-Chat, consisting of event frames and corresponding texts to enable the use of the spatio-temporal aggregator and to train the event-language adapter, thereby aligning event features more closely with the language space. Finally, we gather an instruction dataset, Event-Chat, which contains extensive real-world data to fine-tune the entire model, further enhancing its generalization ability. We construct a comprehensive evaluation benchmark, and extensive experiments demonstrate that EventGPT outperforms previous state-of-the-art MLLMs …
Poster
Trong-Thuan Nguyen · Pha Nguyen · Jackson Cothren · Alper Yilmaz · Khoa Luu

[ ExHall D ]

Abstract
Multimodal LLMs have advanced vision-language tasks but still struggle with understanding video scenes. To bridge this gap, Video Scene Graph Generation (VidSGG) has emerged to capture multi-object relationships across video frames. However, prior methods rely on pairwise connections, limiting their ability to handle complex multi-object interactions and reasoning. To this end, we propose Multimodal LLMs on a Scene HyperGraph (HyperGLM), promoting reasoning about multi-way interactions and higher-order relationships. Our approach uniquely integrates entity scene graphs, which capture spatial relationships between objects, with a procedural graph that models their causal transitions, forming a unified HyperGraph. Significantly, HyperGLM enables reasoning by injecting this unified HyperGraph into LLMs. Additionally, we introduce a new Video Scene Graph Reasoning (VSGR) dataset featuring 1.9M frames from third-person, egocentric, and drone views and supporting five tasks: Scene Graph Generation, Scene Graph Anticipation, Video Question Answering, Video Captioning, and Relation Reasoning. Empirically, HyperGLM consistently outperforms state-of-the-art methods across the five tasks, effectively modeling and reasoning about complex relationships in diverse video scenes.
Poster
Mu Chen · Liulei Li · Wenguan Wang · Yi Yang

[ ExHall D ]

Abstract
Top-leading solutions for Video Scene Graph Generation (VSGG) typically adopt an offline pipeline. Though demonstrating promising performance, they remain unable to handle real-time video streams and consume large amounts of GPU memory. Moreover, these approaches fall short in temporal reasoning, merely aggregating frame-level predictions over a temporal context. In response, we introduce DiffVsgg, an online VSGG solution that frames this task as an iterative scene graph update problem. Drawing inspiration from Latent Diffusion Models (LDMs), which generate images by denoising a latent feature embedding, we unify the decoding of three tasks, object classification, bounding box regression, and graph generation, using one shared feature embedding. Then, given an embedding containing unified features of object pairs, we conduct step-wise denoising on it within LDMs, so as to deliver a clean embedding that clearly indicates the relationships between objects. This embedding then serves as the input to task-specific heads for object classification, scene graph generation, etc. DiffVsgg further facilitates continuous temporal reasoning, where predictions for subsequent frames leverage results from past frames as the conditional inputs of LDMs to guide the reverse diffusion process for current frames. Extensive experiments on three setups of Action Genome demonstrate the superiority of DiffVsgg. Our code shall be released.
Poster
Weitao Feng · Hang Zhou · Jing Liao · Li Cheng · Wenbo Zhou

[ ExHall D ]

Abstract
We present a novel approach for indoor scene synthesis, which learns to arrange decomposed cuboid primitives to represent 3D objects within a scene. Unlike conventional methods that use bounding boxes to determine the placement and scale of 3D objects, our approach leverages cuboids as a straightforward yet highly effective alternative for modeling objects. This allows for compact scene generation while minimizing object intersections. Our approach, coined CASAGPT for Cuboid Arrangement and Scene Assembly, employs an autoregressive model to sequentially arrange cuboids, producing physically plausible scenes. By applying rejection sampling during the fine-tuning stage to filter out scenes with object collisions, our model further reduces intersections and enhances scene quality. Additionally, we introduce a refined dataset, 3DFRONT-NC, which eliminates significant noise present in the original dataset, 3D-FRONT. Extensive experiments on the 3D-FRONT dataset as well as our dataset demonstrate that our approach consistently outperforms the state-of-the-art methods, enhancing the realism of generated scenes, and providing a promising direction for 3D scene synthesis.
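The rejection-sampling step is easy to illustrate: sample a scene, test all cuboid pairs for intersection, and discard colliding scenes. The sketch below treats cuboids as axis-aligned for simplicity and uses a hypothetical `sample_scene` callable in place of the autoregressive sampler.

```python
# Sketch of collision-based rejection sampling (axis-aligned simplification; illustrative only).
import numpy as np

def cuboids_collide(a, b):
    """a, b: cuboids as (center, size) with 3-element center and full-extent size."""
    ca, sa = np.asarray(a[0]), np.asarray(a[1])
    cb, sb = np.asarray(b[0]), np.asarray(b[1])
    return bool(np.all(np.abs(ca - cb) < (sa + sb) / 2))  # overlap on every axis

def scene_is_valid(cuboids):
    return not any(cuboids_collide(cuboids[i], cuboids[j])
                   for i in range(len(cuboids)) for j in range(i + 1, len(cuboids)))

def rejection_sample(sample_scene, max_tries=50):
    for _ in range(max_tries):
        scene = sample_scene()          # hypothetical stand-in for the autoregressive model
        if scene_is_valid(scene):
            return scene
    return None  # during fine-tuning, only collision-free scenes would be kept
```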
Poster
Sitong Gong · Yunzhi Zhuge · Lu Zhang · Zongxin Yang · Pingping Zhang · Huchuan Lu

[ ExHall D ]

Abstract
Existing methods for Video Reasoning Segmentation rely heavily on a single special token to represent the object in the keyframe or the entire video, inadequately capturing spatial complexity and inter-frame motion. To overcome these challenges, we propose VRS-HQ, an end-to-end video reasoning segmentation approach that leverages Multimodal Large Language Models (MLLMs) to inject rich spatiotemporal features into hierarchical tokens. Our key innovations include a Temporal Dynamic Aggregation (TDA) and a Token-driven Keyframe Selection (TKS). Specifically, we design frame-level <SEG> and temporal-level <TAK> tokens that utilize MLLM’s autoregressive learning to effectively capture both local and global information. Subsequently, we apply a similarity-based weighted fusion and frame selection strategy, then utilize SAM2 to perform keyframe segmentation and propagation. To enhance keyframe localization accuracy, the TKS filters keyframes based on SAM2’s occlusion scores during inference. VRS-HQ achieves state-of-the-art performance on ReVOS, surpassing VISA by 5.9%/12.5%/9.1% in J&F scores across the three subsets. These results highlight the strong temporal reasoning and segmentation capabilities of our method. Code and model weights will be made publicly available.
Poster
Zixuan Chen · Jiaxin Li · Junxuan Liang · Liming Tan · Yejie Guo · Cewu Lu · Yonglu Li

[ ExHall D ]

Abstract
Intelligent robots need to interact with diverse objects across various environments. The appearance and state of objects frequently undergo complex transformations depending on the object properties, e.g., phase transitions. However, in the vision community, segmenting dynamic objects with phase transitions is overlooked. In light of this, we introduce the concept of phase in segmentation, which categorizes real-world objects based on their visual characteristics and potential morphological and appearance changes. Then, we present a new benchmark, Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation (M3-VOS), to verify the ability of models to understand object phases, which consists of 479 high-resolution videos spanning over 10 distinct everyday scenarios. It provides dense instance mask annotations that capture both object phases and their transitions. We evaluate state-of-the-art methods on M3-VOS, yielding several key insights. Notably, current appearance-based approaches show significant room for improvement when handling objects with phase transitions. The inherent changes in disorder suggest that the predictive performance of the forward entropy-increasing process can be improved through a reverse entropy-reducing process. These findings lead us to propose ReVOS, a new plug-and-play model that improves performance through reversal refinement. Our data and code will be publicly available.
Poster
Fei Li · Wenxuan Liu · Jingjing Chen · Ruixu Zhang · Yuran Wang · Xian Zhong · Zheng Wang

[ ExHall D ]

Abstract
Open Vocabulary Video Anomaly Detection (OVVAD) aims to detect and categorize both base and novel anomalies. However, there are two specific challenges related to novel anomalies that remain unexplored by existing methods. The first challenge is detection ambiguity, where the model struggles to assign accurate anomaly scores to unfamiliar anomalies. The second challenge is categorization confusion, where novel anomalies are often miscategorized as visually similar base instances. To address the aforementioned challenges, we investigate supportive information from multiple sources, aiming to reduce detection ambiguity by leveraging multiple levels of visual data with matching textual information. Additionally, we propose introducing relationships between labels to guide the encoding of new labels, thereby enhancing the alignment between novel videos and their corresponding labels, which helps reduce categorization confusion. Our resulting Anomize framework effectively addresses these challenges, achieving superior performance on UCF-Crime and XD-Violence datasets, demonstrating its strength in OVVAD.
Poster
Chen Tang · Xinzhu Ma · Encheng Su · Xiufeng Song · Xiaohong Liu · Wei-Hong Li · Lei Bai · Wanli Ouyang · Xiangyu Yue

[ ExHall D ]

Abstract
Traditional spatiotemporal models generally rely on task-specific architectures, which limit their generalizability and scalability across diverse tasks due to domain-specific design requirements. In this paper, we introduce UniSTD, a unified Transformer-based framework for spatiotemporal modeling, which is inspired by advances in recent foundation models with the two-stage pretraining-then-adaptation paradigm. Specifically, our work demonstrates that task-agnostic pretraining on 2D vision and vision-text datasets can build a generalizable model foundation for spatiotemporal learning, followed by specialized joint training on spatiotemporal datasets to enhance task-specific adaptability. To improve the learning capabilities across domains, our framework employs a rank-adaptive mixture-of-experts adaptation, using fractional interpolation to relax the discrete variables so that they can be optimized in continuous space. Additionally, we introduce a temporal module to incorporate temporal dynamics explicitly. We evaluate our approach on a large-scale dataset covering 10 tasks across 4 disciplines, demonstrating that a unified spatiotemporal model can achieve scalable, cross-task learning and support up to 10 tasks simultaneously within one model while reducing training costs in multi-domain applications. Our code and dataset will be released soon.
Poster
Xiaoyong Chen · Yong Guo · Jiaming Liang · Sitong Zhuang · Runhao Zeng · Xiping Hu

[ ExHall D ]

Abstract
Temporal action detection (TAD) aims to identify and localize action instances in untrimmed videos, which is essential for various video understanding tasks. However, recent improvements in model performance, driven by larger feature extractors and datasets, have led to increased computational demands. This presents a challenge for applications like autonomous driving and robotics, which rely on limited computational resources. While existing channel pruning methods can compress these models, reducing the number of channels often hinders the parallelization efficiency of GPUs, due to the inefficient multiplication of small matrices. Instead of pruning channels, we propose a **Progressive Block Drop** method that reduces model depth while retaining layer width. In this way, we still use large matrices for computation but reduce the number of multiplications. Our approach iteratively removes redundant blocks in two steps: first, we drop blocks with minimal impact on model performance; and second, we employ a parameter-efficient cross-depth alignment technique, fine-tuning the pruned model to restore model accuracy. Our method achieves a 25% reduction in computational overhead on two TAD benchmarks (THUMOS14 and ActivityNet-1.3) while achieving lossless compression. More critically, we empirically show that our method is orthogonal to channel pruning methods and can be combined with them to yield …
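The two-step loop described above can be sketched abstractly as follows; the evaluation criterion and the cross-depth alignment fine-tuning are richer in the paper, so `evaluate` and `finetune` are hypothetical stand-ins.

```python
# Illustrative progressive block-drop loop (not the paper's exact procedure).
def progressive_block_drop(blocks, evaluate, finetune, target_drops):
    """blocks: ordered list of residual blocks; evaluate(blocks) -> accuracy;
    finetune(blocks) -> fine-tuned blocks (e.g., parameter-efficient alignment)."""
    for _ in range(target_drops):
        base = evaluate(blocks)
        # Step 1: drop the block whose removal hurts accuracy the least.
        impact = [base - evaluate(blocks[:i] + blocks[i + 1:]) for i in range(len(blocks))]
        i_min = min(range(len(blocks)), key=lambda i: impact[i])
        blocks = blocks[:i_min] + blocks[i_min + 1:]
        # Step 2: recover accuracy by fine-tuning before the next drop.
        blocks = finetune(blocks)
    return blocks
```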
Poster
Yuting Zhang · Hao Lu · Qingyong Hu · Yin Wang · Kaishen Yuan · Xin Liu · Kaishun Wu

[ ExHall D ]

Abstract
Periodic or quasi-periodic phenomena reveal intrinsic characteristics in various natural processes, such as weather patterns, movement behaviors, traffic flows, and biological signals. Given that these phenomena span multiple modalities, the capabilities of Multimodal Large Language Models (MLLMs) offer promising potential to effectively capture and understand their complex nature. However, current MLLMs struggle with periodic tasks due to limitations in: 1) lack of temporal modelling and 2) conflict between short and long periods. This paper introduces Period-LLM, a multimodal large language model designed to enhance the performance of periodic tasks across various modalities, and constructs a benchmark of varying difficulty for evaluating the cross-modal periodic capabilities of large models. Specifically, we adopt an "Easy to Hard Generalization" paradigm, starting with relatively simple text-based tasks and progressing to more complex visual and multimodal tasks, ensuring that the model gradually builds robust periodic reasoning capabilities. Additionally, we propose a Resisting Logical Oblivion optimization strategy to maintain periodic reasoning abilities during semantic alignment. Extensive experiments demonstrate the superiority of the proposed Period-LLM over existing MLLMs in periodic tasks. The code will be available on GitHub.
Poster
Hongda Liu · Yunfan Liu · Min Ren · Hao Wang · Yunlong Wang · Zhenan Sun

[ ExHall D ]

Abstract
In skeleton-based action recognition, a key challenge is distinguishing between actions with similar trajectories of joints due to the lack of image-level details in skeletal representations. Recognizing that the differentiation of similar actions relies on subtle motion details in specific body parts, we direct our approach to focus on the fine-grained motion of local skeleton components. To this end, we introduce ProtoGCN, a Graph Convolutional Network (GCN)-based model that breaks down the dynamics of entire skeleton sequences into a combination of learnable prototypes representing core motion patterns of action units. By contrasting the reconstruction of prototypes, ProtoGCN can effectively identify and enhance the discriminative representation of similar actions. Without bells and whistles, ProtoGCN achieves state-of-the-art performance on multiple benchmark datasets, including NTU RGB+D, NTU RGB+D 120, Kinetics-Skeleton, and FineGYM, which demonstrates the effectiveness of the proposed method. The source code is enclosed in the supplementary material and will be released upon acceptance.
Poster
Utkarsh Mall · Cheng Perng Phoo · Mia Chiquier · Bharath Hariharan · Kavita Bala · Carl Vondrick

[ ExHall D ]

Abstract
Visual data is used in numerous different scientific workflows ranging from remote sensing to ecology. As the amount of observation data increases, the challenge is not just to make accurate predictions but also to understand the underlying mechanisms for those predictions. Good interpretation is important in scientific workflows, as it allows for better decision-making by providing insights into the data. This paper introduces an automatic way of obtaining such interpretable-by-design models, by learning programs that interleave neural networks. We propose DiSciPLE (Discovering Scientific Programs using LLMs and Evolution), an evolutionary algorithm that leverages common sense and prior knowledge of large language models (LLMs) to create Python programs explaining visual data. Additionally, we propose two improvements: a program critic and a program simplifier to improve our method further to synthesize good programs. On three different real-world problems, DiSciPLE learns state-of-the-art programs on novel tasks with no prior literature. For example, we can learn programs with 35% lower error than the closest non-interpretable baseline for population density estimation.
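An LLM-driven evolutionary search of this kind typically alternates between proposing candidate programs, scoring them, and keeping the fittest. The skeleton below only illustrates that loop; `llm_propose`, `fitness`, `critic`, and `simplify` are hypothetical stand-ins for the components the abstract names (LLM-based proposal, evaluation on data, program critic, and program simplifier).

```python
# Skeleton of an LLM-driven evolutionary program search in the spirit of DiSciPLE
# (illustrative structure only; all callables are user-supplied stand-ins).
import random

def evolve(llm_propose, fitness, critic, simplify, pop_size=20, generations=10):
    population = [llm_propose(parents=[], feedback=None) for _ in range(pop_size)]  # seed programs
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        survivors = ranked[: pop_size // 2]                    # keep the fittest programs
        children = []
        while len(children) < pop_size - len(survivors):
            parents = random.sample(survivors, k=min(2, len(survivors)))
            child = llm_propose(parents=parents, feedback=critic(parents))  # LLM mutation/crossover
            children.append(simplify(child))                   # prune needless program complexity
        population = survivors + children
    return max(population, key=fitness)
```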
Poster
Gaozheng Pei · Shaojie Lyu · Gong Chen · Ke Ma · Qianqian Xu · Yingfei Sun · Qingming Huang

[ ExHall D ]

Abstract
Existing diffusion-based purification methods aim to disrupt adversarial perturbations by introducing a certain amount of noise through a forward diffusion process, followed by a reverse process to recover clean examples. However, this approach is fundamentally flawed: the uniform operation of the forward process across all pixels compromises normal pixels while attempting to combat adversarial perturbations, resulting in the target model producing incorrect predictions. Simply relying on low-intensity noise is insufficient for effective defense. To address this critical issue, we implement a heterogeneous purification strategy grounded in the interpretability of neural networks. Our method decisively applies higher-intensity noise to specific pixels that the target model focuses on, while the remaining pixels are subjected to only low-intensity noise. This requirement motivates us to redesign the sampling process of the diffusion model, allowing for the effective removal of varying noise levels. Furthermore, to enable evaluation against strong adaptive attacks, our proposed method sharply reduces time cost and memory usage through single-step resampling. The empirical evidence from extensive experiments across three datasets demonstrates that our method outperforms most current adversarial training and purification techniques by a substantial margin. Code is available at https://anonymous.4open.science/r/Purification-35BE-0829.
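The key operation is spatially non-uniform forward noise: pixels the target model attends to receive stronger noise than the rest. A minimal sketch of that idea (the saliency source, noise levels, and per-pixel scaling are assumptions):

```python
# Sketch: inject stronger Gaussian noise on salient pixels and weaker noise elsewhere.
# `saliency` would come from an interpretability method (e.g., a class-activation map);
# here it is a placeholder array in [0, 1].
import numpy as np

def heterogeneous_noise(image, saliency, sigma_low=0.05, sigma_high=0.3):
    """image: (H, W, C) in [0, 1]; saliency: (H, W) highlighting model-focused pixels."""
    sigma = sigma_low + (sigma_high - sigma_low) * saliency[..., None]  # per-pixel noise level
    return np.clip(image + sigma * np.random.randn(*image.shape), 0.0, 1.0)

img = np.random.rand(224, 224, 3)
sal = np.random.rand(224, 224)            # placeholder saliency map
noised = heterogeneous_noise(img, sal)    # the reverse (denoising) process then removes this noise
```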
Poster
Xiaofan Bai · Shixin Li · Xiaojing Ma · Bin Benjamin Zhu · Dongmei Zhang · Linchen Yu

[ ExHall D ]

Abstract
Cloud-based AI systems offer significant benefits but also introduce vulnerabilities, making deep neural network (DNN) models susceptible to malicious tampering. This tampering may involve harmful behavior injection or resource reduction, compromising model integrity and performance. To detect model tampering, hard-label fingerprinting techniques generate sensitive samples to probe and reveal tampering. Existing fingerprinting methods are mainly based on **gradient-defined sensitivity** or the **decision boundary**, with the latter showing clearly superior detection performance. However, existing decision-boundary-based fingerprinting methods remain conceptual, lacking a theoretical explanation for why samples near the decision boundary are more sensitive to tampering. Moreover, all existing fingerprinting methods either suffer from insufficient sensitivity or incur high computational costs. In this paper, we provide the first theoretical justification for why samples near the decision boundary are more sensitive to tampering-induced shifts than those far away. Based on this, we further propose **Steep-Decision-Boundary Fingerprinting (SDBF)**, a novel lightweight approach for hard-label tampering detection. SDBF places fingerprint samples near the **steep decision boundary**, where the outputs of samples are inherently highly sensitive to tampering. We also design a **Max Boundary Coverage Strategy (MBCS)**, which enhances sample diversity over the decision boundary. Theoretical analysis and extensive experimental results show that SDBF outperforms existing SOTA hard-label …
Poster
Ziang Li · Hongguang Zhang · Juan Wang · Meihui Chen · Hongxin Hu · Wenzhe Yi · Xiaoyang Xu · Mengda Yang · Chenjun Ma

[ ExHall D ]

Abstract
Model Inversion Attacks (MIAs) aim to reconstruct private training data from models, leading to privacy leakage, particularly in facial recognition systems. Although many studies have enhanced the effectiveness of white-box MIAs, less attention has been paid to improving efficiency and utility under limited attacker capabilities. Existing black-box MIAs necessitate an impractical number of queries, incurring significant overhead. Therefore, we analyze the limitations of existing MIAs and introduce Surrogate Model-based Inversion with Long-tailed Enhancement (SMILE), a high-resolution oriented and query-efficient MIA for the black-box setting. We begin by analyzing the initialization of MIAs from a data distribution perspective and propose a long-tailed surrogate training method to obtain high-quality initial points. We then enhance the attack's effectiveness by employing the gradient-free black-box optimization algorithm selected by NGOpt. Our experiments show that SMILE outperforms existing state-of-the-art black-box MIAs while requiring only about 5% of the query overhead.
Poster
Meng Pang · Wenjun Zhang · Nanrun Zhou · Shengbo Chen · Hong Rao

[ ExHall D ]

Abstract
Face normalization aims to enhance the robustness and effectiveness of face recognition systems by mitigating intra-personal variations in expressions, poses, occlusions, illuminations, and domains. Existing methods face limitations in handling multiple variations and adapting to cross-domain scenarios. To address these challenges, we propose a novel Unified Multi-Domain Face Normalization Network (UMFN) model, which can process face images with various types of facial variations from different domains, and reconstruct frontal, neutral-expression facial prototypes in the target domain. As an unsupervised domain adaptation model, UMFN facilitates concurrent training on multiple datasets across domains and demonstrates strong prototype reconstruction capabilities. Notably, UMFN serves as a joint prototype and feature learning framework, enabling the simultaneous extraction of domain-agnostic identity features through a decoupling mapping network and a feature domain classifier for adversarial training. Moreover, we design an efficient Heterogeneous Face Recognition (HFR) network that fuses domain-agnostic and identity-discriminative features for HFR, and introduce contrastive learning to enhance identity recognition accuracy. Empirical studies on diverse cross-domain face datasets validate the effectiveness of our proposed method.
Poster
Zeqi Zhu · Ibrahim Batuhan Akkaya · Luc Waeijen · Egor Bondarev · Arash Pourtaherian · Orlando Moreira

[ ExHall D ]

Abstract
Deep Neural Networks (DNNs) are accurate but compute-intensive, leading to substantial energy consumption during inference. Exploiting temporal redundancy through Δ-Σ convolution in video processing has proven to greatly enhance computation efficiency. However, temporal Δ-Σ DNNs typically require substantial memory for storing neuron states to compute inter-frame differences, hindering their on-chip deployment. To mitigate this memory cost, directly compressing the states can disrupt the linearity of temporal Δ-Σ convolution, causing accumulated errors in long-term Δ-Σ processing. Thus, we propose MEET, an optimization framework for MEmory-Efficient Temporal Δ-Σ DNNs. MEET transfers the state compression challenge to a well-established weight compression problem by trading fewer activations for more weights and introduces a co-design of network architecture and suppression method to optimize for mixed spatial-temporal execution. Evaluations on three vision applications demonstrate a reduction of 5.1–13.3× in total memory compared to the most computation-efficient temporal DNNs, while preserving the computation efficiency and model accuracy in long-term Δ-Σ processing. MEET facilitates the deployment of temporal Δ-Σ DNNs within on-chip memory of embedded event-driven platforms, empowering low-power edge processing.
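Temporal Δ-Σ convolution rests on the linearity of convolution: conv(x_t) = conv(x_{t-1}) + conv(x_t - x_{t-1}), so only the (often sparse) frame difference needs to be convolved while an accumulator holds the previous output. The cached previous input and output are exactly the neuron states whose memory cost MEET targets. A minimal single-layer sketch:

```python
# Minimal Δ-Σ convolution sketch: convolve only the inter-frame delta and accumulate.
# The cached prev_x / prev_y are the "neuron states" whose memory MEET aims to reduce.
import torch
import torch.nn.functional as F

class DeltaSigmaConv:
    def __init__(self, weight):
        self.weight = weight   # (C_out, C_in, 3, 3) convolution kernel
        self.prev_x = None     # state: previous input frame
        self.prev_y = None     # state: previous output

    def __call__(self, x):     # x: (1, C_in, H, W) current frame
        if self.prev_x is None:
            y = F.conv2d(x, self.weight, padding=1)                    # first frame: full conv
        else:
            delta = x - self.prev_x                                    # sparse for static regions
            y = self.prev_y + F.conv2d(delta, self.weight, padding=1)  # Σ-accumulate the Δ result
        self.prev_x, self.prev_y = x, y
        return y

conv = DeltaSigmaConv(torch.randn(16, 3, 3, 3))
y0 = conv(torch.randn(1, 3, 64, 64))   # frame 0
y1 = conv(torch.randn(1, 3, 64, 64))   # frame 1, computed from the delta
```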
Poster
Xiao Wang · Yu Jin · Wentao Wu · Wei Zhang · Lin Zhu · Bo Jiang · Yonghong Tian

[ ExHall D ]

Abstract
Object detection in event streams has emerged as a cutting-edge research area, demonstrating superior performance in low-light conditions, scenarios with motion blur, and rapid movements. Current detectors leverage spiking neural networks, Transformers, or convolutional neural networks as their core architectures, each with its own set of limitations including restricted performance, high computational overhead, or limited local receptive fields. This paper introduces a novel MoE (Mixture of Experts) heat conduction-based object detection algorithm that strikingly balances accuracy and computational efficiency. Initially, we employ a stem network for event data embedding, followed by processing through our innovative MoE-HCO blocks. Each block integrates various expert modules to mimic heat conduction within event streams. Subsequently, an IoU-based query selection module is utilized for efficient token extraction, which is then channeled into a detection head for the final object detection process. Furthermore, we are pleased to introduce EvDET200K, a novel benchmark dataset for event-based object detection. Captured with a high-definition Prophesee EVK4-HD event camera, this dataset encompasses 10 distinct categories, 200,000 bounding boxes, and 10,054 samples, each spanning 2 to 5 seconds. We also provide comprehensive results from over 15 state-of-the-art detectors, offering a solid foundation for future research and comparison.
Poster
Yi-Xing Peng · Yu-Ming Tang · Kun-Yu Lin · Qize Yang · Jingke Meng · Xihan Wei · Wei-Shi Zheng

[ ExHall D ]

Abstract
Person re-identification (ReID) aims to associate images of individuals from different camera views despite cross-view variations. Like other surveillance technologies, ReID faces serious privacy challenges, particularly the potential for unauthorized tracking. Although various tasks (e.g., face recognition) have developed machine unlearning techniques to address privacy concerns, such approaches have not yet been explored within the ReID field. In this work, we pioneer the exploration of the person de-reidentification (De-ReID) problem and present its inherent challenges. In the context of ReID, De-ReID is to unlearn the knowledge about accurately matching specific persons so that these "unlearned persons" cannot be re-identified across cameras as a privacy guarantee. The primary challenge is to achieve the unlearning without degrading the identity-discriminative feature embeddings to ensure the model's utility. To address this, we formulate a De-ReID framework that utilizes a labeled dataset of unlearned persons for unlearning and an unlabeled dataset of other persons for knowledge preservation. Instead of unlearning based on (pseudo) identity labels, we introduce a variation-guided identity shift mechanism that unlearns the specific persons by fitting the variations in their images, irrespective of their identity, while preserving ReID ability on other persons by overcoming the variations in other images. As a result, the …
Poster
Yu Mao · Jun Wang · Nan Guan · Chun Jason Xue

[ ExHall D ]

Abstract
Whole-Slide Images (WSIs) have revolutionized medical analysis by presenting high-resolution images of the whole tissue slide. Despite avoiding the physical storage of the slides, WSIs require considerable data volume, which makes the storage and maintenance of WSI records costly and unsustainable. To this end, this work presents the first investigation of lossless compression of WSI images. Interestingly, we find that most existing compression methods fail to compress WSI images effectively. Furthermore, our analysis reveals that the failure of existing compressors is mainly due to information irregularity in WSI images. To resolve this issue, we develop a simple yet effective lossless compressor called WISE, specifically designed for WSI images. WISE employs a hierarchical encoding strategy to extract effective bits, reducing the entropy of the image, and then adopts a dictionary-based method to handle the irregular frequency patterns. Through extensive experiments, we show that WISE can effectively compress gigapixel WSI images by a factor of 36 on average and up to 136.
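The abstract gives only the two ingredients (hierarchical bit extraction, then a dictionary-based coder), so the sketch below is a generic illustration of that recipe rather than WISE itself: split each 8-bit pixel into high and low nibbles and feed each stream to zlib (an off-the-shelf LZ77/dictionary-based coder). Reconstruction is exact, so the pipeline is lossless.

```python
# Generic two-level lossless sketch (a stand-in for the recipe, not WISE's actual algorithm).
import numpy as np
import zlib

def compress_tile(tile):
    """tile: (H, W) uint8 array, e.g., one channel of a WSI tile."""
    high = (tile >> 4).astype(np.uint8)   # high nibble: smoother, lower-entropy bits
    low = (tile & 0x0F).astype(np.uint8)  # low nibble: noisier bits
    return zlib.compress(high.tobytes(), 9), zlib.compress(low.tobytes(), 9)

def decompress_tile(blobs, shape):
    high = np.frombuffer(zlib.decompress(blobs[0]), dtype=np.uint8).reshape(shape)
    low = np.frombuffer(zlib.decompress(blobs[1]), dtype=np.uint8).reshape(shape)
    return (high << 4) | low              # exact reconstruction

tile = np.random.randint(0, 256, (512, 512), dtype=np.uint8)
assert np.array_equal(decompress_tile(compress_tile(tile), tile.shape), tile)
```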
Poster
Runmin Jiang · Jackson Daggett · Shriya Pingulkar · Yizhou Zhao · Priyanshu Dhingra · Daniel Brown · Qifeng Wu · Xiangrui Zeng · Xingjian Li · Min Xu

[ ExHall D ]

Abstract
Subtomogram alignment is a critical task in cryo-electron tomography (cryo-ET) analysis, essential for achieving high-resolution reconstructions of macromolecular complexes. However, learning effective positional representations remains challenging due to limited labels and the high noise levels inherent in cryo-ET data. In this work, we address this challenge by proposing a self-supervised learning approach that leverages intrinsic geometric transformations as implicit supervisory signals, enabling robust representation learning despite data scarcity. We introduce BOE-ViT, the first Vision Transformer (ViT) framework for 3D subtomogram alignment. Recognizing that traditional ViTs lack equivariance and are therefore suboptimal for orientation estimation, we enhance the model with two innovative modules that introduce equivariance: 1) the Polyshift module for improved shift estimation and 2) Multi-Axis Rotation Encoding (MARE) for enhanced rotation estimation. Experimental results demonstrate that BOE-ViT significantly outperforms state-of-the-art methods. Notably, on the SNR 0.01 dataset, our approach achieves a 77.3% reduction in rotation estimation error and a 62.5% reduction in translation estimation error, effectively overcoming the challenges in cryo-ET subtomogram alignment.
Poster
Wei Lin · Chenyang ZHAO · Antoni B. Chan

[ ExHall D ]

Abstract
Point detection has been developed to locate pedestrians in crowded scenes by training a counter through a point-to-point (P2P) supervision scheme. Despite its excellent localization and counting performance, training a point-based counter still faces challenges concerning annotation labor: hundreds to thousands of points are required to annotate a single sample containing a dense crowd. In this paper, we integrate point-based methods into a semi-supervised counting framework based on pseudo-labeling, enabling the training of a counter with only a few annotated samples supplemented by a large volume of pseudo-labeled data. However, during implementation, the training process encounters issues as the confidence for pseudo-labels fails to propagate to background pixels via the P2P. To tackle this challenge, we devise a point-specific activation map (PSAM) to visually interpret the phenomena occurring during the ill-posed training. Observations from the PSAM suggest that the feature map is excessively activated by the loss for unlabeled data, causing the decoder to misinterpret these over-activations as pedestrians. To mitigate this issue, we propose a point-to-region (P2R) matching scheme to substitute P2P, which segments out local regions rather than detects a point corresponding to a pedestrian for supervision. Consequently, pixels in the local region can share the same confidence …
Poster
Shijia Zhao · Qiming Xia · Xusheng Guo · Pufan Zou · Maoji Zheng · Hai Wu · Chenglu Wen · Cheng Wang

[ ExHall D ]

Abstract
Recently, sparsely-supervised 3D object detection has gained great attention, achieving performance close to fully-supervised 3D detectors while requiring only a few annotated instances. Nevertheless, these methods struggle when accurate labels are extremely scarce. In this paper, we propose a boosting strategy, termed SP3D, explicitly utilizing the cross-modal semantic prompts generated from Large Multimodal Models (LMMs) to boost the 3D detector with robust feature discrimination capability under sparse annotation settings. Specifically, we first develop a Confident Points Semantic Transfer (CPST) module that generates accurate cross-modal semantic prompts through boundary-constrained center cluster selection. Based on these accurate semantic prompts, which we treat as seed points, we introduce a Dynamic Cluster Pseudo-label Generation (DCPG) module to yield pseudo-supervision signals from the geometric shape of multi-scale neighbor points. Additionally, we design a Distribution Shape score (DS score) that chooses high-quality supervision signals for the initial training of the 3D detector. Experiments on the KITTI dataset and Waymo Open Dataset (WOD) have validated that SP3D can enhance the performance of sparsely supervised detectors by a large margin under meager labeling conditions. Moreover, we verified SP3D in the zero-shot setting, where its performance exceeded that of the state-of-the-art methods. The code will be made publicly available.
Poster
Wei-En Tai · Yu-Lin Shih · Cheng Sun · Yu-Chiang Frank Wang · Hwann-Tzong Chen

[ ExHall D ]

Abstract
Amodal instance segmentation, which aims to detect and segment both visible and invisible parts of objects in images, plays a crucial role in various applications including autonomous driving, robotic manipulation, and scene understanding. While existing methods require training both front-end detectors and mask decoders jointly, this approach lacks flexibility and fails to leverage the strengths of pre-existing modal detectors. To address this limitation, we propose SAMEO, a novel framework that adapts the Segment Anything Model (SAM) as a versatile mask decoder capable of interfacing with various front-end detectors to enable mask prediction even for partially occluded objects. Acknowledging the constraints of limited amodal segmentation datasets, we introduce Amodal-LVIS, a large-scale synthetic dataset comprising 300K images derived from the modal LVIS and LVVIS datasets. This dataset significantly expands the training data available for amodal segmentation research. Our experimental results demonstrate that our approach, when trained on the newly extended dataset, including Amodal-LVIS, achieves remarkable zero-shot performance on both COCOA-cls and D2SA benchmarks, highlighting its potential for generalization to unseen scenarios.
Poster
Weiguang Zhao · Rui Zhang · Qiufeng Wang · Guangliang Cheng · Kaizhu Huang

[ ExHall D ]

Abstract
3D semantic segmentation plays a fundamental and crucial role in understanding 3D scenes. While contemporary state-of-the-art techniques predominantly concentrate on elevating the overall performance of 3D semantic segmentation based on general metrics (e.g., mIoU, mAcc, and oAcc), they unfortunately leave the exploration of challenging regions for segmentation mostly neglected. In this paper, we revisit 3D semantic segmentation through a more granular lens, shedding light on subtle complexities that are typically overshadowed by broader performance metrics. Concretely, we delineate 3D semantic segmentation errors into four comprehensive categories, along with evaluation metrics tailored to each. Building upon this categorical framework, we introduce an innovative 3D semantic segmentation network called BFANet that incorporates detailed analysis of semantic boundary features. First, we design the boundary-semantic module to decouple point cloud features into semantic and boundary features, and fuse their query queue to enhance semantic features with attention. Second, we introduce a more concise and accelerated boundary pseudo-label calculation algorithm, which is 3.9 times faster than the state-of-the-art, offering compatibility with data augmentation and enabling efficient computation in training. Extensive experiments on benchmark data indicate the superiority of our BFANet model, confirming the significance of emphasizing the four uniquely designed metrics. In …
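One common way to derive boundary pseudo-labels for a labeled point cloud is to mark a point as a boundary point when any of its nearest neighbours carries a different semantic label. The sketch below uses that generic recipe (not BFANet's accelerated algorithm); the neighbour count k and the k-d tree backend are assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def boundary_pseudo_labels(xyz: np.ndarray, labels: np.ndarray, k: int = 8) -> np.ndarray:
    """Mark a point as a boundary point if any of its k nearest neighbours carries
    a different semantic label. xyz: (N, 3) coordinates, labels: (N,) class ids."""
    tree = cKDTree(xyz)
    _, idx = tree.query(xyz, k=k + 1)        # the first neighbour is the point itself
    neighbour_labels = labels[idx[:, 1:]]    # (N, k)
    return (neighbour_labels != labels[:, None]).any(axis=1)

# xyz = np.random.rand(100_000, 3).astype(np.float32)
# labels = np.random.randint(0, 20, size=100_000)
# is_boundary = boundary_pseudo_labels(xyz, labels)
```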
Poster
Hui Liu · Chen Jia · Fan Shi · Xu Cheng · Shengyong Chen

[ ExHall D ]

Abstract
Pixel-level segmentation of structural cracks across various scenarios remains a considerable challenge. Current methods struggle to effectively model crack morphology and texture, and to balance segmentation quality with low computational resource usage. To overcome these limitations, we propose a lightweight Structure-Aware Vision Mamba Network (SCSegamba), capable of generating high-quality pixel-level segmentation maps by leveraging both the morphological information and texture cues of crack pixels with minimal computational cost. Specifically, we developed a Structure-Aware Visual State Space module (SAVSS), which incorporates a lightweight Gated Bottleneck Convolution (GBC) and a Structure-Aware Scanning Strategy (SASS). The key insight of GBC lies in its effectiveness in modeling the morphological information of cracks, while the SASS enhances the perception of crack topology and texture by strengthening the continuity of semantic information between crack pixels. Experiments on crack benchmark datasets demonstrate that our method outperforms other state-of-the-art (SOTA) methods, achieving the highest performance with only 2.8M parameters. On the multi-scenario dataset, our method reached 0.8390 in F1 score and 0.8479 in mIoU.
Poster
Zihan Lin · Zilei Wang · Xu Wang

[ ExHall D ]

Abstract
Despite the significant progress in continual image segmentation, existing approaches still struggle to balance stability and plasticity. Additionally, they are specialized to specific tasks and models, which hinders their extension to more general settings. In this work, we present CUE, a novel Continual Universal sEgmentation pipeline that not only inherently tackles the stability-plasticity dilemma, but unifies any segmentation across tasks and models as well. Our key insight is that any segmentation task can be reformulated as an understanding-then-refinement paradigm, inspired by the human visual perception system, which first performs high-level semantic understanding and then focuses on low-level vision cues. We claim three desiderata for this design: Continuity, by inherently avoiding the stability-plasticity dilemma via exploiting the natural differences between high-level and low-level knowledge. Generality, by unifying and simplifying the landscape towards various segmentation tasks. Efficiency, as an interesting by-product, by significantly reducing the research effort. Our resulting model, built upon this pipeline with complementary expert models, shows significant improvements over previous state-of-the-art methods across various segmentation tasks and datasets. We believe that our work is a significant step towards making continual segmentation more universal and practicable.
Poster
Tanner Schmidt · Richard Newcombe

[ ExHall D ]

Abstract
This paper presents Segment This Thing (STT), a new efficient image segmentation model designed to produce a single segment given a single point prompt. Instead of following prior work and increasing efficiency by decreasing model size, we gain efficiency by foveating input images. Given an image and a point prompt, we extract a crop centered on the prompt and apply a novel variable-resolution patch tokenization in which patches are downsampled at a rate that increases with increased distance from the prompt. This approach yields far fewer image tokens than uniform patch tokenization. As a result we can drastically reduce the computational cost of segmentation without reducing model size. Furthermore, the foveation focuses the model on the region of interest, a potentially useful inductive bias. We show that our Segment This Thing model is more efficient than prior work while remaining competitive on segmentation benchmarks. It can easily run at interactive frame rates on consumer hardware and is thus a promising tool for augmented reality or robotics applications.
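The foveation idea (higher downsampling further from the prompt) can be approximated with concentric crops that are all resized to the same resolution before patchification. The sketch below is such an approximation under assumed crop sizes and patch size; it is not the paper's variable-resolution tokenizer.

```python
import torch
import torch.nn.functional as F

def foveated_tokens(image, prompt_xy, base=128, levels=3, patch=16):
    """Toy foveation: take concentric crops of size base, 2*base, 4*base, ... centred
    on the point prompt, resize each crop to base x base, and patchify. Peripheral
    pixels are therefore represented at progressively lower resolution, keeping the
    token count far below uniform patchification."""
    _, _, H, W = image.shape
    x, y = prompt_xy
    tokens = []
    for lvl in range(levels):
        half = (base << lvl) // 2
        x0, x1 = max(0, x - half), min(W, x + half)
        y0, y1 = max(0, y - half), min(H, y + half)
        crop = image[:, :, y0:y1, x0:x1]
        crop = F.interpolate(crop, size=(base, base), mode="bilinear", align_corners=False)
        p = F.unfold(crop, kernel_size=patch, stride=patch).transpose(1, 2)  # (B, n_patches, C*patch*patch)
        tokens.append(p)
    return torch.cat(tokens, dim=1)

# A 1024x1024 image yields 3 * (128 // 16) ** 2 = 192 tokens instead of 4096 uniform 16x16 patches.
tokens = foveated_tokens(torch.randn(1, 3, 1024, 1024), prompt_xy=(512, 512))
```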
Poster
Jiyong Rao · Brian Nlong Zhao · Yu Wang

[ ExHall D ]

Abstract
Multi-species animal pose estimation has emerged as a challenging yet critical task, hindered by substantial visual diversity and uncertainty. This paper tackles the problem through efficient prompt learning for Vision-Language Pretrained (VLP) models, e.g., CLIP, aiming to resolve the cross-species generalization problem. At the core of the solution lie prompt design, probabilistic prompt modeling, and cross-modal adaptation, which enable prompts to compensate for cross-modal information and effectively overcome large data variance under unbalanced data distributions. To this end, we propose a novel probabilistic prompting approach to fully explore textual descriptions, which alleviates the diversity issues caused by the long-tail property and increases the adaptability of prompts on unseen category instances. Specifically, we first introduce a set of learnable prompts and propose a diversity loss to maintain distinctiveness among prompts, thus representing diverse image attributes. Diverse textual probabilistic representations are sampled and used as the guidance for pose estimation. Subsequently, we explore three different cross-modal fusion strategies at the spatial level to alleviate the adverse impacts of visual uncertainty. Extensive experiments on multi-species animal pose benchmarks show that our method achieves state-of-the-art performance under both supervised and zero-shot settings.
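A diversity loss over a set of learnable prompts is commonly implemented as a penalty on their pairwise cosine similarity; the sketch below shows that generic form. The exact loss used in the paper may differ, and all names here are assumptions.

```python
import torch
import torch.nn.functional as F

def prompt_diversity_loss(prompts: torch.Tensor) -> torch.Tensor:
    """Penalize pairwise cosine similarity among learnable prompt embeddings
    (num_prompts x dim) so that each prompt captures a distinct attribute."""
    p = F.normalize(prompts, dim=-1)
    sim = p @ p.t()                                     # (P, P) cosine similarities
    off_diag = sim - torch.eye(sim.size(0), device=sim.device)
    return off_diag.clamp(min=0).mean()                 # push off-diagonal similarity toward zero

prompts = torch.nn.Parameter(torch.randn(8, 512))
prompt_diversity_loss(prompts).backward()
```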
Poster
Wenhuan Huang · Yi JI · guiqian zhu · Ying Li · chunping Liu

[ ExHall D ]

Abstract
In scene graph generation (SGG), the accurate prediction of unseen triples is essential for its effectiveness in downstream vision-language tasks. We hypothesize that the predicates of unseen triples can be viewed as transformations of seen predicates in feature space, and that the essence of the zero-shot task is to bridge the gap caused by this transformation. Traditional models, however, have difficulty addressing this challenge, which we attribute to their inability to model predicate equivariance. To overcome this limitation, we introduce a novel framework based on capsule networks (CAPSGG). We propose a Three-Stream Pipeline that generates modality-specific representations for predicates, while building low-level predicate capsules of these modalities. These capsules are then aggregated into high-level predicate capsules using a Routing Capsule Layer. In addition, we introduce GroupLoss to aggregate capsules with the same predicate label into groups. This replaces the global loss with the intra-group loss, effectively balancing the learning of predicate-invariant and equivariant features, while mitigating the impact of the severe long-tail distribution of the predicate categories. Our extensive experiments demonstrate the notable superiority of our approach over state-of-the-art methods, with zero-shot metrics on the SGCls task surpassing T-CAR [21] by up to 132.26%. Our code will be available …
Poster
Yun Chang · Leonor Fermoselle · Duy Ta · Bernadette Bucher · Luca Carlone · Jiuguang Wang

[ ExHall D ]

Abstract
While recent work in scene reconstruction and understanding has made strides in grounding natural language to physical 3D environments, it is still challenging to ground abstract, high-level instructions to a 3D scene. High-level instructions might not explicitly invoke semantic elements in the scene, and even the process of breaking a high-level task into a set of more concrete subtasks (a process called hierarchical task analysis) is environment-dependent. In this work, we propose ASHiTA, the first framework that generates a task hierarchy grounded to a 3D scene graph by breaking down high-level tasks into grounded subtasks. ASHiTA alternates LLM-assisted hierarchical task analysis (to generate the task breakdown) with task-driven scene graph construction to generate a suitable representation of the environment. Our experiments show that ASHiTA performs significantly better than LLM baselines in breaking down high-level tasks into environment-dependent subtasks and is additionally able to achieve grounding performance comparable to state-of-the-art methods.
Poster
Fan-Yun Sun · Weiyu Liu · Siyi Gu · Dylan Lim · Goutam Bhat · Federico Tombari · Manling Li · Nick Haber · Jiajun Wu

[ ExHall D ]

Abstract
Open-universe 3D layout generation arranges unlabeled 3D assets conditioned on language instruction. Large language models (LLMs) struggle to generate physically plausible 3D scenes and to adhere to input instructions, particularly in dense scenes. We introduce LayoutVLM, a framework and scene layout representation that exploits the semantic knowledge of Vision-Language Models (VLMs) and supports differentiable optimization to ensure physical plausibility. LayoutVLM employs VLMs to generate two mutually reinforcing representations from visually marked images, and a self-consistent decoding process to improve the VLMs' spatial planning. Our experiments show that LayoutVLM addresses the limitations of existing LLM and constraint-based approaches, producing physically plausible 3D layouts better aligned with the semantic intent of input language instructions. We also demonstrate that fine-tuning VLMs with the proposed scene layout representation extracted from existing scene datasets can improve performance.
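Differentiable layout optimization for physical plausibility can be sketched, in a simplified 2D setting, as gradient descent on object positions with an overlap penalty plus a fidelity term toward the VLM-proposed placement. The objective, weights, and 2D simplification below are assumptions for illustration, not LayoutVLM's actual formulation.

```python
import torch

def optimize_layout(init_xy, sizes, steps=200, lr=0.05, margin=0.05):
    """Toy differentiable layout refinement in 2D: stay close to the VLM-proposed
    positions (fidelity term) while pushing overlapping boxes apart (plausibility term)."""
    xy = torch.nn.Parameter(init_xy.clone())
    opt = torch.optim.Adam([xy], lr=lr)
    half = sizes / 2
    for _ in range(steps):
        opt.zero_grad()
        delta = (xy[:, None, :] - xy[None, :, :]).abs()           # (N, N, 2) centre distances
        min_gap = half[:, None, :] + half[None, :, :] + margin
        overlap = (min_gap - delta).clamp(min=0).prod(dim=-1)     # soft pairwise overlap area
        overlap = overlap - torch.diag(torch.diagonal(overlap))   # ignore self-pairs
        loss = overlap.sum() + 0.1 * (xy - init_xy).pow(2).sum()
        loss.backward()
        opt.step()
    return xy.detach()

# boxes = optimize_layout(init_xy=torch.rand(6, 2), sizes=torch.full((6, 2), 0.3))
```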
Poster
Jinchang Zhang · Guoyu Lu

[ ExHall D ]

Abstract
Depth estimation is a core problem in robotic perception and vision tasks, but 3D reconstruction from a single image presents inherent uncertainties. With the development of deep learning, current methods primarily rely on inter-image relationships to train supervised models, often overlooking the intrinsic information provided by the camera itself. From the perspective of embodied intelligence, perception and understanding are not only based on external data inputs but are also closely linked to the physical environment in which the model is embedded. Following this concept, we propose a method that embeds the camera model and its physical characteristics into a deep learning model to compute Embodied Scene Depth through interactions with road environments. This approach leverages the intrinsic properties of the camera and provides robust depth priors without the need for additional equipment. By combining Embodied Scene Depth with RGB image features, the model gains a comprehensive perspective of both geometric and visual details. Additionally, we incorporate text descriptions containing environmental content and depth information as another dimension of embodied intelligence, embedding them as scale priors for scene understanding, thus enriching the model's perception of the scene. This integration of image and language — two inherently ambiguous modalities — leverages their complementary strengths …
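For a forward-facing camera over a flat road, the camera model alone already yields a closed-form depth prior for ground pixels, which conveys the flavor of an embodied, camera-derived depth prior. The flat-ground assumption and the illustrative KITTI-like numbers below are not the paper's full method.

```python
import numpy as np

def ground_plane_depth(height, width, fy, cy, cam_height):
    """Closed-form depth prior for a flat road seen by a forward-facing camera:
    a road pixel on image row v (below the principal point) back-projects to a
    point at depth Z = fy * cam_height / (v - cy); rows at or above the horizon
    receive an infinite prior."""
    v = np.arange(height, dtype=np.float32)
    depth_per_row = np.where(v > cy, fy * cam_height / np.maximum(v - cy, 1e-6), np.inf)
    return np.tile(depth_per_row[:, None], (1, width))            # (H, W) prior

# Illustrative KITTI-like values: focal ~721.5 px, principal row ~187, camera ~1.65 m above the road.
prior = ground_plane_depth(height=376, width=1241, fy=721.5, cy=187.0, cam_height=1.65)
```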
Poster
Zhiyuan Huang · Ziming Cheng · Junting Pan · Zhaohui Hou · Mingjie Zhan

[ ExHall D ]

Abstract
Graphical User Interface (GUI) agents show amazing abilities in assisting human-computer interaction, automating users' navigation on digital devices. An ideal GUI agent is expected to achieve high accuracy, low latency, and compatibility across different GUI platforms. Recent vision-based approaches have shown promise by leveraging advanced Vision Language Models (VLMs). While they generally meet the requirements of compatibility and low latency, these vision-based GUI agents tend to have low accuracy due to their limitations in element grounding. To address this issue, we propose SpiritSight, a vision-based, end-to-end GUI agent that excels in GUI navigation tasks across various GUI platforms. First, we create a multi-level, large-scale, high-quality GUI dataset called GUI-Lasagne using scalable methods, empowering SpiritSight with robust GUI understanding and grounding capabilities. Second, we introduce the Universal Block Parsing (UBP) method to resolve the ambiguity problem in dynamic high-resolution visual inputs, further enhancing SpiritSight's ability to ground GUI objects. Through these efforts, the SpiritSight agent outperforms other advanced methods on diverse GUI benchmarks, demonstrating its superior capability and compatibility in GUI navigation tasks. The models and code will be made available upon publication.
Poster
Jianing Yang · Xuweiyi Chen · Nikhil Madaan · Madhavan Iyengar · Shengyi Qian · David Fouhey · Joyce Chai

[ ExHall D ]

Abstract
The integration of language and 3D perception is crucial for embodied agents and robots that comprehend and interact with the physical world. While large language models (LLMs) have demonstrated impressive language understanding and generation capabilities, their adaptation to 3D environments (3D-LLMs) remains in its early stages. A primary challenge is a lack of large-scale datasets with dense grounding between language and 3D scenes. We introduce 3D-GRAND, a pioneering large-scale dataset comprising 40,087 household scenes paired with 6.2 million densely-grounded scene-language instructions. Our results show that instruction tuning with 3D-GRAND significantly enhances grounding capabilities and reduces hallucinations in 3D-LLMs. As part of our contributions, we propose a comprehensive benchmark 3D-POPE to systematically evaluate hallucination in 3D-LLMs, enabling fair comparisons of models. Our experiments highlight a scaling effect between dataset size and 3D-LLM performance, emphasizing the importance of large-scale 3D-text datasets for embodied AI research. Our results demonstrate early signals for effective sim-to-real transfer, indicating that models trained on large synthetic data can perform well on real-world 3D scans. Through 3D-GRAND and 3D-POPE, we aim to equip the embodied AI community with resources and insights to lead to more reliable and better-grounded 3D-LLMs.
Poster
Lizheng Zu · Lin Lin · Song Fu · Na Zhao · Pan Zhou

[ ExHall D ]

Abstract
Embodied agents based on large language models (LLMs) face significant challenges in collaborative tasks, requiring effective communication and a reasonable division of labor to ensure efficient and correct task completion. Previous approaches with simple communication patterns yield erroneous or incoherent agent actions, which can introduce additional risks. To address these problems, we propose Cooperative Tree Search (CoTS), a framework designed to significantly improve collaborative planning and task execution efficiency among embodied agents. CoTS guides multiple agents to discuss long-term strategic plans within a modified Monte Carlo tree, searching along LLM-driven reward functions to provide a more thoughtful and promising approach to cooperation. Another key feature of our method is the introduction of a plan evaluation module, which not only prevents agent action confusion caused by frequent plan updates but also ensures plan updates when the current plan becomes unsuitable. Experimental results show that the proposed method performs excellently in planning, communication, and collaboration in embodied environments (CWAH and TDW-MAT), efficiently completing long-term, complex tasks and significantly outperforming existing methods.
Poster
Aniket Rajiv Didolkar · Andrii Zadaianchuk · Rabiul Awal · Maximilian Seitzer · Efstratios Gavves · Aishwarya Agrawal

[ ExHall D ]

Abstract
Object-centric representation learning aims to decompose visual scenes into fixed-size vectors called "slots" or "object files", where each slot captures a distinct object. Current state-of-the-art object-centric models have shown remarkable success in object discovery in diverse domains, including complex real-world scenes. However, these models suffer from a key limitation: they lack controllability. Specifically, current object-centric models learn representations based on their preconceived understanding of objects and parts, without allowing user input to guide which objects are represented. Introducing controllability into object-centric models could unlock a range of useful capabilities, such as the ability to extract instance-specific representations from a scene. In this work, we propose a novel approach for user-directed control over slot representations by conditioning slots on language descriptions. The proposed ConTRoLlable Object-centric representation learning approach, which we term CTRL-O, achieves targeted object-language binding in complex real-world scenes without requiring mask supervision. Next, we apply these controllable slot representations to two downstream vision-language tasks: text-to-image generation and visual question answering. We find that the proposed approach enables instance-specific text-to-image generation and also achieves strong performance on visual question answering.
Poster
Haoran Xu · Peixi Peng · Guang Tan · Yiqian Chang · Luntong Li · Yonghong Tian

[ ExHall D ]

Abstract
Vision-based Reinforcement Learning (VRL) attempts to establish associations between visual inputs and optimal actions through interactions with the environment. Given the high-dimensional and complex nature of visual data, it becomes essential to learn policies upon high-quality state representations. To this end, existing VRL methods primarily rely on interaction-collected data, combined with self-supervised auxiliary tasks. However, two key challenges remain: limited data samples and a lack of task-relevant semantic constraints. To tackle this, we propose DGC, a method that distills guidance from Visual Language Models (VLMs) alongside self-supervised learning into a compact VRL agent. Notably, we leverage the state representation capabilities of VLMs, rather than their decision-making abilities. Within DGC, a novel prompting-reasoning pipeline is designed to convert historical observations and actions into usable supervision signals, enabling semantic understanding within the compact visual encoder. By leveraging these distilled semantic representations, the VRL agent achieves significant improvements in sample efficiency. Extensive experiments on the Carla benchmark demonstrate our state-of-the-art performance. The source code is available in the supplementary material.
Poster
Byung-Kwan Lee · Ryo Hachiuma · Yu-Chiang Frank Wang · Yong Man Ro · Yueh-Hua Wu

[ ExHall D ]

Abstract
The recent surge in high-quality visual instruction tuning samples from closed-source vision-language models (VLMs) such as GPT-4V has accelerated the release of open-source VLMs across various model sizes. However, scaling VLMs to improve performance using larger models brings significant computational challenges, especially for deployment on resource-constrained devices like mobile platforms and robots. To address this, we propose VLsI: Verbalized Layers-to-Interactions, a new VLM family in 2B and 7B model sizes, which prioritizes efficiency without compromising accuracy. VLsI leverages a unique, layer-wise distillation process, introducing intermediate "verbalizers" that map features from each layer to natural language space, allowing smaller VLMs to flexibly align with the reasoning processes of larger VLMs. This approach mitigates the training instability often encountered in output imitation and goes beyond typical final-layer tuning by aligning the small VLMs’ layer-wise progression with that of the large ones. We validate VLsI across ten challenging vision-language benchmarks, achieving notable performance gains (11.0% for 2B and 17.4% for 7B) over GPT-4V without the need for model scaling, merging, or architectural changes.
Poster
Han Xiao · yina xie · Guanxin tan · Yinghao Chen · Rui Hu · Ke Wang · Aojun Zhou · Hao Li · Hao Shao · Xudong LU · Peng Gao · Yafei Wen · Xiaoxin Chen · Shuai Ren · Hongsheng Li

[ ExHall D ]

Abstract
Visual Document Understanding has become essential with the increase of text-rich visual content. This field poses significant challenges due to the need for effective integration of visual perception and textual comprehension, particularly across diverse document types with complex layouts. Moreover, existing fine-tuning datasets for this domain often fall short of providing the detailed contextual information needed for robust understanding, leading to hallucinations and limited comprehension of spatial relationships among visual elements. To address these challenges, we propose an innovative pipeline that utilizes adaptive generation of markup languages, such as Markdown, JSON, HTML, and TikZ, to build highly structured document representations and deliver contextually grounded responses. We introduce two fine-grained structured datasets: DocMark-Pile, comprising approximately 3.8M pretraining data pairs for document parsing, and DocMark-Instruct, featuring 624k fine-tuning data annotations for grounded instruction following. Extensive experiments demonstrate that our proposed model significantly outperforms existing state-of-the-art MLLMs across a range of visual document understanding benchmarks, facilitating advanced reasoning and comprehension capabilities in complex visual scenarios.
Poster
Yiqi Zhu · Ziyue Wang · Can Zhang · Peng Li · Yang Liu

[ ExHall D ]

Abstract
Vision-Language Models (VLMs) have recently witnessed significant progress in visual comprehension. As the permitted length of image context grows, VLMs can now comprehend a broader range of views and spaces. Current benchmarks provide insightful analysis of VLMs in tasks involving complex visual instruction following, multi-image understanding, and spatial reasoning. However, they usually focus on spatially irrelevant images or discrete images captured from varied viewpoints. The compositional characteristic of images captured from a static viewpoint remains underestimated. We term this characteristic Continuous Space Perception. When observing a scene from a static viewpoint while shifting orientations, it produces a series of spatially continuous images, enabling the reconstruction of the entire space. In this paper, we present CoSpace, a multi-image visual understanding benchmark designed to assess the Continuous Space perception ability of VLMs. CoSpace contains 2,918 images and 1,626 question-answer pairs, covering seven types of tasks. We conduct evaluation across 16 proprietary and open-source VLMs. Results reveal pitfalls in the continuous space perception ability of most evaluated models, including proprietary ones. Interestingly, we find that the main discrepancy between open-source and proprietary models lies not in accuracy but in the consistency of responses. We believe that enhancing …
Poster
Yuhui Zhang · Yuchang Su · Yiming Liu · Xiaohan Wang · James Burgess · Elaine Sui · Chenyu Wang · Josiah Aklilu · Alejandro Lozano · Anjiang Wei · Ludwig Schmidt · Serena Yeung

[ ExHall D ]

Abstract
The rapid development of vision language models (VLMs) demands rigorous and reliable evaluation. However, current visual question answering (VQA) benchmarks often depend on open-ended questions, making accurate evaluation difficult due to the variability in natural language responses. To address this, we introduce AutoConverter, an agentic framework that automatically converts these open-ended questions into multiple-choice format, enabling objective evaluation while reducing the costly question creation process. Our experiments demonstrate that AutoConverter can generate correct and challenging multiple-choice questions, with VLMs demonstrating consistently similar or lower accuracy on these questions compared to human-created ones. Using AutoConverter, we construct VMCBench, a benchmark created by transforming 20 existing VQA datasets into a unified multiple-choice format, totaling 9,018 questions. We comprehensively evaluate 28 state-of-the-art VLMs on VMCBench, setting a new standard for scalable, consistent, and reproducible VLM evaluation.
Poster
Xuli Shen · Hua Cai · Weilin Shen · Qing Xu · Dingding Yu · Weifeng Ge · Xiangyang Xue

[ ExHall D ]

Abstract
With the explosion of human-machine interaction, emotion recognition has reignited attention. Previous works focus on improving visual feature fusion and reasoning from multiple image levels. Although it is non-trivial to deduce a person's emotion by integrating multi-level features (head, body, and context), the emotion recognition results at each level usually differ from one another, which creates inconsistency in the prevailing feature alignment methods and decreases recognition performance. In this work, we propose a multi-level image feature refinement method for emotion recognition (CocoER) to mitigate the impact of conflicting results from multi-level recognition. First, we leverage cross-level attention to improve visual feature consistency between hierarchically cropped head, body, and context windows. Then, vocabulary-informed alignment is incorporated into the recognition framework to produce pseudo-labels and guide hierarchical visual feature refinement. To effectively fuse multi-level features, we elaborate a competition process that eliminates irrelevant image-level predictions and a coordination process that enhances features across all levels. Extensive experiments are conducted on two popular datasets, and our method achieves state-of-the-art performance with multi-level interpretation results.
Poster
Jian Liang · Wenke Huang · Guancheng Wan · Qu Yang · Mang Ye

[ ExHall D ]

Abstract
While Multimodal Large Language Models (MLLMs) excel at generalizing across modalities and tasks, effectively adapting them to specific downstream tasks while simultaneously retaining both general and specialized knowledge remains challenging. Although Low-Rank Adaptation (LoRA) is widely used to efficiently acquire specialized knowledge in MLLMs, it introduces substantial harmful redundancy during visual instruction tuning, which exacerbates the forgetting of general knowledge and degrades downstream task performance. To address this issue, we propose LoRASculpt to eliminate harmful redundant parameters, thereby harmonizing general and specialized knowledge. Specifically, under theoretical guarantees, we introduce sparse updates into LoRA to discard redundant parameters effectively. Furthermore, we propose a Conflict Mitigation Regularizer to refine the update trajectory of LoRA, mitigating knowledge conflicts with the pretrained weights. Extensive experimental results demonstrate that even at a very high degree of sparsity ( 5%), our method simultaneously enhances generalization and downstream task performance. This confirms that our approach effectively mitigates the catastrophic forgetting issue and further promotes knowledge harmonization in MLLMs.
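Sparse LoRA updates can be illustrated by magnitude-pruning the merged low-rank delta so that only a small fraction of its entries survive; the sketch below shows that simple variant. The top-k magnitude criterion and the 5% keep ratio are assumptions, not LoRASculpt's theoretically grounded procedure or its regularizer.

```python
import torch

def sparsify_lora_delta(A: torch.Tensor, B: torch.Tensor, keep_ratio: float = 0.05) -> torch.Tensor:
    """Keep only the largest-magnitude entries of the merged LoRA update dW = B @ A,
    discarding the redundant remainder."""
    delta = B @ A                                                  # (out_dim, in_dim)
    k = max(1, int(keep_ratio * delta.numel()))
    threshold = delta.abs().flatten().kthvalue(delta.numel() - k + 1).values
    return delta * (delta.abs() >= threshold)

# rank-8 LoRA for a 4096 x 4096 projection, keeping ~5% of the update entries
A, B = torch.randn(8, 4096), torch.randn(4096, 8) * 0.01
sparse_delta = sparsify_lora_delta(A, B, keep_ratio=0.05)
```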
Poster
Wuyou Xia · Guoli Jia · Sicheng Zhao · Jufeng Yang

[ ExHall D ]

Abstract
Multimodal sentiment analysis has attracted extensive research attention as increasing users share images and texts to express their emotions and opinions on social media. Collecting large amounts of labeled sentiment data is an expensive and challenging task due to the high cost of labeling and unavoidable label ambiguity. Semi-supervised learning (SSL) is explored to utilize the extensive unlabeled data to alleviate the demand for annotation. However, different from typical multimodal tasks, the inconsistent sentiment between image and text leads to the sub-optimal performance of SSL algorithms. To address this issue, we propose SCDR, the first semi-supervised image-text sentiment recognition framework. To better utilize the discriminative features of each modality, we decouple features into common and private parts and then use the private features to train unimodal classifiers for enhanced modality-specific sentiment representation. Considering the complex relation between modalities, we devise a modal selection-based attention module that adaptively assesses the dominant sentiment modality at the sample level to guide the fusion of multimodal representations. Furthermore, to prevent the model predictions from overly relying on common features under the guidance of multimodal labels, we design a pseudo-label filtering strategy based on the matching degree of prediction and dominant modality. Extensive experiments and …
Poster
Kumail Alhamoud · Shaden Alshammari · Yonglong Tian · Guohao Li · Philip H.S. Torr · Yoon Kim · Marzyeh Ghassemi

[ ExHall D ]

Abstract
Many practical vision-language applications require models that understand negation, e.g., when using natural language to retrieve images which contain certain objects but not others. Despite advancements in vision-language models (VLMs) through large-scale training, their ability to comprehend negation remains underexplored. This study addresses the question: how well do current VLMs understand negation? We introduce NegBench, a new benchmark designed to evaluate negation understanding across 18 task variations and 79k examples spanning image, video, and medical datasets. The benchmark consists of two core tasks designed to evaluate negation understanding in diverse multimodal settings: Retrieval with Negation and Multiple Choice Questions with Negated Captions. Our evaluation reveals that modern VLMs struggle significantly with negation, often performing at chance level. To address these shortcomings, we explore a data-centric approach wherein we finetune CLIP models on large-scale synthetic datasets containing millions of negated captions. We show that this approach can result in a 10% increase in recall on negated queries and a 40% boost in accuracy on multiple-choice questions with negated captions.
Poster
Yuanhao Zou · Zhaozheng Yin

[ ExHall D ]

Abstract
Medical Visual Question Answering (Med-VQA) is a challenging task that requires a deep understanding of both medical images and textual questions. Although recent works leveraging Medical Vision-Language Pre-training (Med-VLP) have shown strong performance on the Med-VQA task, there is still no unified solution for modality alignment, and the issue of hard negatives remains under-explored. Additionally, commonly used knowledge fusion techniques for Med-VQA may introduce irrelevant information. In this work, we propose a framework to address these challenges through three key contributions: (1) a unified solution for heterogeneous modality alignment across multiple levels, modalities, views, and stages, leveraging methods such as contrastive learning and optimal transport theory; (2) a hard negative mining method that employs soft labels for multi-modality alignment and enforces hard negative pair discrimination; and (3) a Gated Cross-Attention Module for Med-VQA that integrates the answer vocabulary as prior knowledge and selects relevant information from it. Our framework outperforms the previous state-of-the-art on widely used Med-VQA datasets such as RAD-VQA, SLAKE, PathVQA and VQA-2019. The code will be publicly available.
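A gated cross-attention block that attends from fused image-question tokens to an answer-vocabulary memory, with a sigmoid gate deciding how much retrieved knowledge to admit, might look like the sketch below. Module layout, dimensions, and names are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Cross-attention from fused image-question tokens to an answer-vocabulary
    memory, with a learned sigmoid gate controlling how much retrieved knowledge to admit."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_feats, vocab_embeds):
        # query_feats: (B, Lq, D) fused image-question tokens
        # vocab_embeds: (B, Lv, D) embeddings of candidate answer words
        retrieved, _ = self.attn(query_feats, vocab_embeds, vocab_embeds)
        gate = torch.sigmoid(self.gate(query_feats))       # per-token, per-channel gate
        return self.norm(query_feats + gate * retrieved)   # admit only relevant knowledge

# out = GatedCrossAttention(dim=768)(torch.randn(2, 32, 768), torch.randn(2, 3000, 768))
```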
Poster
Ting Liu · Siyuan Li

[ ExHall D ]

Abstract
Recent advances in zero-shot referring image segmentation (RIS), driven by models such as the Segment Anything Model (SAM) and CLIP, have made substantial progress in aligning visual and textual information. Despite these successes, the extraction of precise and high-quality mask region representations remains a critical challenge, limiting the full potential of RIS tasks. In this paper, we introduce a training-free, hybrid global-local feature extraction approach that integrates detailed mask-specific features with contextual information from the surrounding area, enhancing mask region representation. To further strengthen alignment between mask regions and referring expressions, we propose a spatial guidance augmentation strategy that improves spatial coherence, which is essential for accurately localizing described areas. By incorporating multiple spatial cues, this approach facilitates more robust and precise referring segmentation. Extensive experiments on standard RIS benchmarks demonstrate that our method significantly outperforms existing zero-shot referring segmentation models, achieving substantial performance gains. We believe our approach advances RIS tasks and establishes a versatile framework for region-text alignment, offering broader implications for cross-modal understanding and interaction. The code will be publicly available.
Poster
Kaihang Pan · w l · Zhongqi Yue · Tenglong Ao · Liyu Jia · Wei Zhao · Juncheng Li · Siliang Tang · Hanwang Zhang

[ ExHall D ]

Abstract
Recent endeavors in Multimodal Large Language Models (MLLMs) aim to unify visual comprehension and generation by combining LLMs and diffusion models, the state-of-the-art in each task, respectively. Existing approaches rely on spatial visual tokens, where image patches are encoded and arranged according to a spatial order (e.g., raster scan). However, we show that spatial tokens lack the recursive structure inherent to languages, and hence form an impossible language for LLMs to master. In this paper, we build a proper visual language by leveraging diffusion timesteps to learn discrete, recursive visual tokens. Our proposed tokens recursively compensate for the progressive attribute loss in noisy images as timesteps increase, enabling the diffusion model to reconstruct the original image at any timestep. This approach allows us to effectively integrate the strengths of LLMs in autoregressive reasoning and diffusion models in precise image generation, achieving seamless multimodal comprehension and generation within a unified framework. Extensive experiments show that we achieve a new SOTA for multimodal comprehension and generation simultaneously compared with other MLLMs.
Poster
bo zhou · Liulei Li · Yujia Wang · 刘华峰 Liu · Yazhou Yao · Wenguan Wang

[ ExHall D ]

Abstract
We present UNIALIGN, a unified model to align an arbitrary number of modalities (e.g., image, text, audio, 3D point cloud, etc.) through one encoder and a single training phase. Existing solutions typically employ distinct encoders for each modality, resulting in increased parameters as the number of modalities grows. In contrast, UNIALIGN proposes a modality-aware adaptation of the powerful mixture-of-experts (MoE) scheme and further integrates it with Low-Rank Adaptation (LoRA), efficiently scaling the encoder to accommodate inputs in diverse modalities while maintaining a fixed computational overhead. Moreover, prior work often requires separate training for each extended modality. This leads to task-specific models and further hinders communication between modalities. To address this, we propose a soft modality binding strategy that aligns all modalities using unpaired data samples across datasets. Two additional training objectives are introduced to distill knowledge from well-aligned anchor modalities and prior multimodal models, elevating UNIALIGN into a high-performance multimodal foundation model. Experiments on 11 benchmarks across 6 different modalities demonstrate that UNIALIGN achieves comparable performance to SOTA approaches, while using merely 7.8M trainable parameters and maintaining an identical model with the same weights across all tasks. Our code shall be released.
Poster
zehan wang · Sashuai zhou · Shaoxuan He · Haifeng Huang · Lihe Yang · Ziang Zhang · Xize Cheng · Shengpeng Ji · Tao Jin · Hengshuang Zhao · Zhou Zhao

[ ExHall D ]

Abstract
Contrastive Language-Image Pre-training (CLIP) learns robust visual models through language supervision, making it a crucial visual encoding technique for various applications. However, CLIP struggles with comprehending spatial concepts in images, potentially restricting the spatial intelligence of CLIP-based AI systems. In this work, we propose SpatialCLIP, an enhanced version of CLIP with better spatial understanding capabilities. To capture the intricate 3D spatial relationships in images, we improve both "visual model" and "language supervision" of CLIP. Specifically, we design 3D-inspired ViT to replace the standard ViT in CLIP. By lifting 2D image tokens into 3D space and incorporating design insights from point cloud networks, our visual model gains greater potential for spatial perception. Meanwhile, captions with accurate and detailed spatial information are very rare. To explore better language supervision for spatial understanding, we re-caption images and perturb their spatial phrases as negative descriptions, which compels the visual model to seek spatial cues to distinguish these hard negative captions. With the enhanced visual model, we introduce SpatialLLaVA, following the same LLaVA-1.5 training protocol, to investigate the importance of visual representations for MLLM's spatial intelligence. Furthermore, we create SpatialBench, a benchmark specifically designed to evaluate CLIP and MLLM in spatial reasoning. SpatialCLIP and SpatialLLaVA …
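Perturbing spatial phrases into hard negative captions can be as simple as swapping antonymous spatial terms; the toy sketch below does so with a fixed swap table. The word list and regex-based replacement are illustrative assumptions, not the paper's captioning or perturbation pipeline.

```python
import re

SPATIAL_SWAPS = {
    "left": "right", "right": "left",
    "above": "below", "below": "above",
    "in front of": "behind", "behind": "in front of",
}
_PATTERN = re.compile(r"\b(" + "|".join(sorted(map(re.escape, SPATIAL_SWAPS), key=len, reverse=True)) + r")\b")

def perturb_spatial_phrases(caption: str) -> str:
    """Build a hard negative caption by flipping its spatial phrases, so a model must
    rely on spatial cues to reject it."""
    return _PATTERN.sub(lambda m: SPATIAL_SWAPS[m.group(0)], caption)

print(perturb_spatial_phrases("a mug to the left of the laptop, in front of the monitor"))
# -> "a mug to the right of the laptop, behind the monitor"
```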
Poster
Andre Ye · Sebastin Santy · Jena D. Hwang · Amy X Zhang · Ranjay Krishna

[ ExHall D ]

Abstract
Most vision-language models today are primarily trained on English image-text pairs, with non-English pairs often filtered out. Evidence from cross-cultural psychology suggests that this approach will bias models against perceptual modes exhibited by people who speak other (non-English) languages. We investigate semantic and expressive variation in image captions across different languages; we analyze both human-annotated datasets and model-produced captions. By analyzing captions across seven languages (English, French, German, Russian, Chinese, Japanese, Korean) in high-quality image captioning datasets (Crossmodal and Visual Genome), we find that multilingual caption sets tend to provide richer visual descriptions than monolingual (including English-only) ones; multilingual sets contain 46.0% more objects, 66.1% more relationships, and 66.8% more attributes. We observe the same results with multilingual captions produced by LLaVA and the Google Vertex API: for example, compared to monolingual captions, they cover 21.9% more objects, 18.8% more relations, and 20.1% more attributes. These suggest that, across a large number of samples, different languages bias people and models to focus on different visual concepts. Finally, we show that models trained on image-text data in one language perform distinctly better on that language's test set. Our work points towards the potential value of training vision models on multilingual data sources to widen the range/variation of descriptive information those models are …
Poster
Quanxing Zha · Xin Liu · Shu-Juan Peng · Yiu-ming Cheung · Xing Xu · Nannan Wang

[ ExHall D ]

Abstract
Can we accurately identify the true correspondences from multimodal datasets containing mismatched data pairs? Existing methods primarily emphasize similarity matching between the representations of objects across modalities, potentially neglecting the crucial relation consistency within modalities, which is particularly important for distinguishing true from false correspondences. Such an omission often runs the risk of misidentifying negatives as positives, thus leading to unanticipated performance degradation. To address this problem, we propose a general Relation Consistency learning framework, namely ReCon, to accurately discriminate the true correspondences among multimodal data and thus effectively mitigate the adverse impact caused by mismatches. Specifically, ReCon leverages a novel relation consistency learning scheme to ensure dual alignment: the cross-modal relation consistency between different modalities and the intra-modal relation consistency within each modality. Thanks to such dual constraints on relations, ReCon significantly enhances its effectiveness for true correspondence discrimination and therefore reliably filters out mismatched pairs to mitigate the risk of erroneous supervision. Extensive experiments on three widely used benchmark datasets, including Flickr30K, MS-COCO, and Conceptual Captions, are conducted to demonstrate the effectiveness and superiority of ReCon compared with other SOTAs. The code is available at: https://anonymous.4open.science/r/ReCon-NCL.
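The intra-modal relation consistency idea can be illustrated by comparing, for each (image, caption) pair in a batch, the image's similarity pattern to the other images with the caption's similarity pattern to the other captions; mismatched pairs tend to produce inconsistent patterns. This is a simplified score under assumed inputs, not ReCon's full dual-alignment objective.

```python
import torch
import torch.nn.functional as F

def relation_consistency_scores(img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
    """For each (image, caption) pair in a batch, compare how the image relates to the
    other images with how its caption relates to the other captions; a true
    correspondence should induce similar relation patterns in both modalities."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    rel_img = img @ img.t()                                  # (B, B) intra-modal relations
    rel_txt = txt @ txt.t()
    return F.cosine_similarity(rel_img, rel_txt, dim=-1)     # (B,) per-pair consistency

# Pairs whose score falls below a threshold can be down-weighted as likely mismatches.
scores = relation_consistency_scores(torch.randn(64, 512), torch.randn(64, 512))
```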
Poster
Yan Shu · Zheng Liu · Peitian Zhang · Minghao Qin · Junjie Zhou · Zhengyang Liang · Tiejun Huang · Bo Zhao

[ ExHall D ]

Abstract
Long video understanding poses a significant challenge for current Multi-modal Large Language Models (MLLMs). Notably, MLLMs are constrained by their limited context lengths and the substantial cost of processing long videos. Although several existing methods attempt to reduce visual tokens, their strategies encounter a severe bottleneck, restricting MLLMs' ability to perceive fine-grained visual details. In this work, we propose Video-XL, a novel approach that leverages MLLMs' inherent key-value (KV) sparsification capacity to condense the visual input. Specifically, we introduce a new special token, the Visual Summarization Token (VST), for each interval of the video, which summarizes the visual information within the interval as its associated KV. The VST module is trained by instruction fine-tuning, where two optimizing strategies are offered: 1) curriculum learning, where the VST learns to perform small (easy) and large (hard) compression progressively; and 2) composite data curation, which integrates single-image, multi-image, and synthetic data to overcome the scarcity of long-video instruction data. The compression quality is further improved by dynamic compression, which customizes compression granularity based on the information density of different video intervals. Video-XL's effectiveness is verified from three aspects. First, it achieves superior long-video understanding capability, outperforming state-of-the-art models of comparable sizes across multiple popular …
Poster
Lan Wang · Wei Ao · Vishnu Naresh Boddeti · Ser-Nam Lim

[ ExHall D ]

Abstract
Composed Image Retrieval (CIR) is a vision-language task utilizing queries comprising images and textual descriptions to achieve precise image retrieval. This task seeks to find images that are visually similar to a reference image and incorporate specific changes or features described textually (visual delta). CIR enables a more flexible and user-specific retrieval by bridging visual data with verbal instructions. This paper introduces a novel generative method that augments Composed Image Retrieval by Composed Image Generation (CIG) to provide pseudo-target images. CIG utilizes a textual inversion network to map reference images into semantic word space, which generates pseudo-target images in combination with textual descriptions. These images serve as additional visual information, significantly improving the accuracy and relevance of retrieved images when integrated into existing retrieval frameworks. Experiments conducted across multiple CIR datasets and several baseline methods demonstrate improvements in retrieval performance, which shows the potential of our approach as an effective add-on for existing composed image retrieval.
Poster
Yuhao Wang · Yongfeng Lv · Pingping Zhang · Huchuan Lu

[ ExHall D ]

Abstract
Multi-modal object Re-IDentification (ReID) aims to retrieve specific objects by utilizing complementary information from various modalities. However, existing methods focus on fusing heterogeneous visual features, neglecting the potential benefits of text-based semantic information. To address this issue, we first construct three text-enhanced multi-modal object ReID benchmarks. Specifically, we propose a standardized multi-modal caption generation pipeline for structured and concise text annotations with Multi-modal Large Language Models (MLLMs). Additionally, current methods often directly aggregate multi-modal features without selecting representative local features, leading to redundancy and high complexity. To address the above issues, we introduce IDEA, a novel feature learning framework comprising the Inverted Multi-modal Feature Extractor (IMFE) and Cooperative Deformable Aggregation (CDA). The IMFE utilizes Modal Prefixes and an InverseNet to integrate multi-modal information with semantic guidance from inverted text. The CDA adaptively generates sampling positions, enabling the model to focus on the interplay between global features and discriminative local features. With the constructed benchmarks and the proposed modules, our framework can generate more robust multi-modal features under complex scenarios. Extensive experiments on three multi-modal object ReID benchmarks demonstrate the effectiveness of our proposed method.
Poster
Ziwei Wang · Weizhi Chen · Leyang Yang · Sheng Zhou · Shengchu Zhao · Hanbei Zhan · Jiongchao Jin · Liangcheng Li · Zirui Shao · Jiajun Bu

[ ExHall D ]

Abstract
Graphical user interfaces (GUIs) have become integral to modern society, making it crucial for human-centric systems to understand them. The rapid development of multi-modal large language models (MLLMs) in recent years has revealed their significant potential in GUI understanding. However, unlike natural images or documents, GUIs comprise artificially designed graphical elements arranged to convey specific semantic meanings. Current MLLMs, although already proficient in processing graphical and textual components, face hurdles in GUI understanding due to the lack of explicit spatial structure modeling. Moreover, obtaining high-quality spatial structure data is challenging due to privacy issues and noisy environments. To tackle these challenges, this paper presents MP-GUI, a specially designed MLLM for GUI understanding. MP-GUI features three precisely specialized perceivers to extract graphical, textual, and spatial modalities from GUIs, with a spatial-structure-enhancing strategy, adaptively combined via a fusion gate to meet the distinct requirements of different GUI interpretation tasks. To cope with the scarcity of high-quality data, we also introduce a pipeline for automatically collecting spatial information. Our extensive experiments demonstrate that MP-GUI achieves impressive results on numerous GUI understanding tasks even with a limited amount of generated data.
Poster
Hao Guo · Xugong Qin · Jun Jie Ou Yang · peng zhang · Gangyan Zeng · Yubo Li · Hailun Lin

[ ExHall D ]

Abstract
Document image retrieval (DIR) aims to retrieve document images from a gallery according to a given query. Existing DIR methods are primarily based on image queries that retrieve documents within the same coarse semantic category, e.g., newspapers or receipts. However, these methods struggle to effectively retrieve document images in real-world scenarios when using fine-grained semantics from text queries. To bridge this gap, this paper introduces a new benchmark for Natural Language-based Document Image Retrieval (NL-DIR), along with corresponding evaluation metrics. In this work, natural language descriptions serve as semantically rich queries for the DIR task. The NL-DIR dataset contains 41K authentic document images, each paired with five high-quality, fine-grained semantic queries generated and evaluated through large language models in conjunction with manual verification. We propose a two-stage retrieval method for DIR that enhances retrieval performance while optimizing both time and space efficiency. Furthermore, we perform zero-shot and fine-tuning evaluations of existing contrastive vision-language models and OCR-free visual document understanding (VDU) models on this dataset. The datasets and codes will be publicly available to facilitate research in the VDU community.
Poster
Yuhao Cui · Xinxing Zu · Wenhua Zhang · Zhongzhou Zhao · Jinyang Gao

[ ExHall D ]

Abstract
Leveraging Large Language Models (LLMs) for text representation has achieved significant success, but the exploration of using Multimodal LLMs (MLLMs) for multimodal representation remains limited. Previous MLLM-based representation studies have primarily focused on unifying the embedding space while neglecting the importance of multimodal alignment. As a result, their cross-modal retrieval performance falls markedly behind that of the CLIP series models. To address this, in our work, we 1) construct DeKon5M, a contrastive learning dataset enriched with dense multimodal knowledge, which efficiently enhances multimodal alignment capabilities in representation tasks. 2) design a framework for training unified representation on MLLMs. Building upon this unified representation framework and the dense knowledge dataset DeKon5M, we developed the dense knowledge representation model DeKR on Qwen2VL. Through extensive quantitative and qualitative experiments, our results demonstrate that DeKR not only aligns text, image, video, and text-image combinations within a unified embedding space but also achieves cross-modal retrieval performance comparable to SoTA CLIP series models. This fully validates the effectiveness of our approach and provides new insights for multimodal representation research.
Poster
Ziyang Zhang · Yang Yu · Yucheng Chen · Xulei Yang · Si Yong Yeo

[ ExHall D ]

Abstract
Despite significant progress in Vision-Language Pre-training (VLP), existing VLP approaches predominantly emphasize feature extraction and cross-modal comprehension, with limited attention to generating or transforming visual content. This misalignment constrains the model's ability to synthesize coherent and novel visual representations from textual prompts, thereby reducing the effectiveness of multi-modal learning. In this work, we propose MedUnifier, a unified vision-language pre-training framework tailored for medical data. MedUnifier seamlessly integrates text-grounded image generation capabilities with multi-modal learning strategies, including image-text contrastive alignment, image-text matching, and image-grounded text generation. Unlike traditional methods that rely on continuous visual representations, our approach employs visual vector quantization, which not only facilitates a more cohesive learning strategy for cross-modal understanding but also enhances multi-modal generation quality by effectively leveraging discrete representations. Our framework's effectiveness is evidenced by experiments on established benchmarks, including uni-modal tasks (supervised fine-tuning), cross-modal tasks (image-text retrieval and zero-shot image classification), and multi-modal tasks (medical report generation, image synthesis), where it achieves state-of-the-art performance across various tasks. It also offers a highly adaptable tool designed for a broad spectrum of language and vision tasks in healthcare, marking an advancement toward the development of a genuinely generalizable AI model for medical contexts.
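Visual vector quantization is typically realized with a codebook lookup plus a straight-through gradient, as in the minimal sketch below; the codebook size, commitment weight, and tensor layout are assumptions and not necessarily MedUnifier's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Minimal vector quantization: snap each visual feature to its nearest codebook
    entry and pass gradients through with the straight-through estimator."""
    def __init__(self, num_codes: int = 1024, dim: int = 256, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta

    def forward(self, z):                                    # z: (B, N, D) continuous features
        dist = torch.cdist(z, self.codebook.weight[None].expand(z.size(0), -1, -1))
        codes = dist.argmin(dim=-1)                          # (B, N) discrete token ids
        z_q = self.codebook(codes)
        loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        z_q = z + (z_q - z).detach()                         # straight-through gradient
        return z_q, codes, loss

# z_q, ids, vq_loss = VectorQuantizer()(torch.randn(2, 196, 256))
```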
Poster
w l · Qingsong Wang · Yueying Feng · Shulei Wang · Tao Jin · Zhou Zhao · Fei Wu · Chang Yao · Jingyuan Chen

[ ExHall D ]

Abstract
Large language models (LLMs) have significantly enhanced cross-modal understanding capabilities by integrating visual encoders with textual embeddings, giving rise to multimodal large language models (MLLMs). However, these models struggle with non-natural images such as geometric figures and charts, particularly in fields like education and finance. Despite efforts to collect datasets and fine-tune the MLLMs, the gap with natural image understanding is still evident, and the cost of collecting large and diverse non-natural image datasets is high. To address this, we analyzed the limitations of the transformer-based vision encoders (ViT) within existing MLLMs from a frequency perspective. Studies have shown that ViT models are less effective at capturing high-frequency information, impairing their ability to capture elements like points, lines, and angles in non-natural images. In response, we introduce FM-ViT, a frequency-modulated vision encoder that utilizes Fourier decomposition to extract high- and low-frequency components from self-attention features and re-weight them during tuning to non-natural images. In addition, we combine the features of CNN models with FM-ViT and propose EDGE, an MLLM with enhanced graphical encoders tailored for understanding non-natural images. Extensive experiments have confirmed the effectiveness of our FM-ViT and EDGE in 4 types of comprehension tasks (classification, retrieval, captioning, and question answering) on …
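Fourier-based re-weighting of a feature map can be sketched by taking a 2D FFT, masking out the high-frequency band, and amplifying it before the inverse transform. The cutoff, gain, and the place where such a step is applied are assumptions for illustration, not FM-ViT's exact modulation.

```python
import torch

def frequency_reweight(feat: torch.Tensor, cutoff: float = 0.25, high_gain: float = 2.0):
    """Split a feature map into low/high spatial-frequency bands with a 2D FFT and
    amplify the high-frequency band that plain ViT features tend to under-represent
    (points, thin lines, angles in charts and geometric figures)."""
    B, C, H, W = feat.shape
    spec = torch.fft.fft2(feat, norm="ortho")
    fy = torch.fft.fftfreq(H, device=feat.device).abs().view(1, 1, H, 1)
    fx = torch.fft.fftfreq(W, device=feat.device).abs().view(1, 1, 1, W)
    high_mask = ((fy > cutoff) | (fx > cutoff)).float()
    spec = spec * (1.0 + (high_gain - 1.0) * high_mask)      # boost high frequencies only
    return torch.fft.ifft2(spec, norm="ortho").real

# modulated = frequency_reweight(torch.randn(2, 768, 14, 14))
```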
Poster
Hao Li · Changyao TIAN · Jie Shao · Xizhou Zhu · Zhaokai Wang · Jinguo Zhu · Wenhan Dou · Xiaogang Wang · Hongsheng Li · Lewei Lu · Jifeng Dai

[ ExHall D ]

Abstract
The remarkable success of Large Language Models (LLMs) has extended to the multimodal domain, achieving outstanding performance in image understanding and generation. Recent efforts to develop unified Multimodal Large Language Models (MLLMs) that integrate these capabilities have shown promising results. However, existing approaches often involve complex designs in model architecture or training pipeline, increasing the difficulty of model training and scaling. In this paper, we propose SynerGen-VL, a simple yet powerful encoder-free MLLM capable of both image understanding and generation. To address challenges identified in existing encoder-free unified MLLMs, we introduce the token folding mechanism and the vision-expert-based progressive alignment pretraining strategy, effectively supporting high-resolution image understanding while reducing training complexity. After being trained on large-scale mixed image-text data with a unified next-token prediction objective, SynerGen-VL achieves or surpasses the performance of existing encoder-free unified MLLMs with comparable or smaller parameter sizes, and narrows the gap with task-specific state-of-the-art models, highlighting a promising path toward future unified MLLMs. Our code and models shall be released.
Poster
Shaoan Xie · Lingjing Kong · Yujia Zheng · Yu Yao · Zeyu Tang · Eric P. Xing · Guangyi Chen · Kun Zhang

[ ExHall D ]

Abstract
Contrastive Language-Image Pre-training (CLIP) (Radford et al., 2021) has emerged as a pivotal model in computer vision and multimodal learning, achieving state-of-the-art performance at aligning visual and textual representations through contrastive learning. However, CLIP struggles with potential information misalignment in many image-text datasets and suffers from entangled representations. On the one hand, short captions for a single image in datasets like MSCOCO may describe disjoint regions of the image, leaving the model uncertain about which visual features to retain or disregard. On the other hand, directly aligning long captions with images can lead to the retention of entangled details, preventing the model from learning disentangled, atomic concepts, ultimately limiting its generalization on certain downstream tasks involving short prompts. In this paper, we establish theoretical conditions that enable flexible alignment between textual and visual representations across varying levels of granularity. Specifically, our framework ensures that a model can not only preserve cross-modal semantic information in its entirety but also disentangle visual representations to capture fine-grained textual concepts. Building on this foundation, we introduce a novel approach that identifies and aligns the most relevant visual and textual representations in a modular manner. Superior performance across various tasks demonstrates its capability to handle information misalignment and supports our …
Poster
Haicheng Wang · Chen Ju · Weixiong Lin · Mengting Chen · Shuai Xiao · Yixuan Huang · Chang Liu · mingshuai Yao · Jinsong Lan · Ying Chen · Qingwen Liu · Yanfeng Wang

[ ExHall D ]

Abstract
In the rapidly evolving field of vision-language models (VLMs), contrastive language-image pre-training (CLIP) has made significant strides, becoming the foundation for various downstream tasks. However, by relying on a one-to-one (image, text) contrastive paradigm to learn alignment from large-scale, messy web data, CLIP faces a serious myopic dilemma, resulting in biases towards monotonous short texts and shallow visual expressivity. To overcome these issues, this paper advances CLIP into a novel holistic paradigm by updating both the data diversity and the alignment optimization. To obtain diverse data at low cost, we use image-to-text captioning to generate multiple texts for each image, from multiple perspectives, granularities, and hierarchies. Two gadgets are proposed to encourage textual diversity. To match such (image, multi-texts) pairs, we modify the CLIP image encoder into a multi-branch design and propose a multi-to-multi contrastive optimization for image-text part-to-part matching. As a result, diverse visual embeddings are learned for each image, bringing good interpretability and generalization. Extensive experiments and ablations across over ten benchmarks, including image-text retrieval, open-vocabulary classification, and dense visual tasks, indicate that our holistic CLIP significantly outperforms the existing myopic CLIP. Code for holistic CLIP will be released upon publication, to further promote the prosperity of VLMs.
Poster
Fang Liu · Yuhao Liu · Ke Xu · Shuquan Ye · Gerhard Hancke · Rynson W.H. Lau

[ ExHall D ]

Abstract
Salient Object Ranking (SOR) aims to study human attention shifts across different objects in the scene. It is a challenging task, as it requires comprehension of the relations among the salient objects in the scene. However, existing works often overlook such relations or model them implicitly. In this work, we observe that when Large Vision-Language Models (LVLMs) describe a scene, they usually focus on the most salient object first, and then discuss the relations as they move on to the next (less salient) one. Based on this observation, we propose a novel Language-Guided Salient Object Ranking approach (named LG-SOR), which utilizes the internal knowledge within the LVLM-generated language descriptions, i.e., semantic relation cues and the implicit entity order cues, to facilitate saliency ranking. Specifically, we first propose a novel Text-Guided Visual Modulation (TGVM) module to incorporate semantic information in the description for saliency ranking. TGVM controls the flow of linguistic information to the visual features, suppresses noisy background image features, and enables propagation of useful textual features. We then propose a novel Text-Aware Visual Reasoning (TAVR) module to enhance model reasoning in object ranking, by explicitly learning a multimodal graph based on the entity and relation cues derived from the …
Poster
Runhui Huang · Xinpeng Ding · Chunwei Wang · Jianhua Han · Yulong Liu · Hengshuang Zhao · Hang Xu · Lu Hou · Wei Zhang · Xiaodan Liang

[ ExHall D ]

Abstract
High-resolution image inputs allow Large Vision-Language Models (LVLMs) to capture finer visual details, improving comprehension. However, the increased training and computational costs associated with such inputs pose significant challenges. A common approach to mitigate these costs involves slicing the input into uniform patches using sliding windows, each aligned with the vision encoder's input size. While efficient, this method fragments the input, disrupting the continuity of context, which negatively impacts cross-patch perception tasks. To address these limitations, we propose HiRes-LLaVA, a novel framework designed to efficiently process high-resolution inputs of any size without altering the original contextual and geometric information. HiRes-LLaVA introduces two key components: (i) a SliceRestore Adapter (SRA) that reconstructs sliced patches into their original form, enabling efficient extraction of both global and local features through down-up-sampling and convolutional layers, and (ii) a Self-Mining Sampler (SMS) that compresses visual tokens based on internal relationships, preserving original context and positional information while reducing training overhead. To assess the ability to handle context fragmentation, we construct a new benchmark, EntityGrid-QA, consisting of edge-related tasks. Extensive experiments demonstrate the superiority of HiRes-LLaVA on both existing public benchmarks and EntityGrid-QA. For example, with SRA, our method achieves a performance improvement of ∼ 12% …
Poster
Yuchu Jiang · Jiale Fu · chenduo hao · Xinting Hu · Yingzhe Peng · Xin Geng · Xu Yang

[ ExHall D ]

Abstract
Recently, In-context Learning (ICL) has become a significant inference paradigm in Large Multimodal Models (LMMs), utilizing a few in-context demonstrations (ICDs) to prompt LMMs for new tasks. However, the synergistic effects in multimodal data increase the sensitivity of ICL performance to the configurations of ICDs, stimulating the need for a more stable and general mapping function. Mathematically, in Transformer-based models, ICDs act as "shift vectors" added to the hidden states of query tokens. Inspired by this, we introduce Mimic In-Context Learning (MimIC) to learn stable and generalizable shift effects from ICDs. Specifically, compared with some previous shift vector-based methods, MimIC more strictly approximates the shift effects by integrating lightweight learnable modules into LMMs with four key enhancements: 1) inserting shift vectors after attention layers, 2) assigning a shift vector to each attention head, 3) making shift magnitude query-dependent, and 4) employing a layer-wise alignment loss. Extensive experiments on two LMMs (Idefics-9b and Idefics2-8b-base) across three multimodal tasks (VQAv2, OK-VQA, Captioning) demonstrate that MimIC outperforms existing shift vector-based methods. The code is available at https://anonymous.4open.science/r/MimIC/.
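To make the shift-vector idea concrete, here is a minimal PyTorch sketch, assuming a hypothetical `PerHeadShift` module and toy tensor shapes, of adding a learnable, query-dependent shift per attention head after an attention layer; it illustrates the general mechanism rather than MimIC's exact formulation.

```python
# Minimal sketch (not the authors' code) of a query-dependent, per-head shift
# added after an attention layer, in the spirit of shift-vector methods.
import torch
import torch.nn as nn

class PerHeadShift(nn.Module):
    """Adds a learnable shift vector per attention head, scaled by a
    query-dependent magnitude (hypothetical module name)."""
    def __init__(self, num_heads: int, head_dim: int):
        super().__init__()
        self.shift = nn.Parameter(torch.zeros(num_heads, head_dim))  # one vector per head
        self.gate = nn.Linear(head_dim, 1)                           # query-dependent magnitude

    def forward(self, attn_out: torch.Tensor) -> torch.Tensor:
        # attn_out: (batch, num_heads, seq_len, head_dim)
        alpha = torch.sigmoid(self.gate(attn_out))                   # (B, H, L, 1)
        return attn_out + alpha * self.shift[None, :, None, :]

x = torch.randn(2, 8, 16, 64)          # toy attention output
shifted = PerHeadShift(8, 64)(x)
print(shifted.shape)                    # torch.Size([2, 8, 16, 64])
```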
Poster
Xubing Ye · Yukang Gan · Xiaoke Huang · Yixiao Ge · Yansong Tang

[ ExHall D ]

Abstract
Vision-Language Models (VLMs) have achieved remarkable success in various multi-modal tasks, but they are often bottlenecked by the limited context window and the high computational cost of processing high-resolution image inputs and videos. Vision compression can alleviate this problem by reducing the vision token count. Previous approaches compress vision tokens with external modules and force LLMs to understand the compressed ones, leading to visual information loss. However, the LLMs' own paradigm for understanding vision tokens is not fully utilised in the compression learning process. We propose VoCo-LLaMA, the first approach to compress vision tokens using LLMs. By introducing Vision Compression tokens during the vision instruction tuning phase and leveraging attention distillation, our method distills how LLMs comprehend vision tokens into their processing of VoCo tokens. VoCo-LLaMA facilitates effective vision compression and improves computational efficiency during the inference stage. Specifically, our method can achieve a 576× compression rate while maintaining 83.7% performance. Furthermore, through continuous training using time-series compressed token sequences of video frames, VoCo-LLaMA demonstrates the ability to understand temporal correlations, outperforming previous methods on popular video question-answering benchmarks. Our approach presents a promising way to unlock the full potential of VLMs' contextual window, enabling more scalable multi-modal applications.
Poster
Mayug Maniparambil · Raiymbek Akshulakov · YASSER ABDELAZIZ DAHOU DJILALI · Sanath Narayan · Ankit Singh · Noel O'Connor

[ ExHall D ]

Abstract
Recent contrastive multimodal vision-language models like CLIP have demonstrated robust open-world semantic understanding, becoming the standard image backbones for vision-language applications. However, recent findings suggest high semantic similarity between well-trained unimodal encoders, which raises a key question: are semantically similar embedding spaces separated only by simple projection transformations? To validate this, we propose a novel framework that aligns vision and language using frozen unimodal encoders. It involves selecting semantically similar encoders in the latent space, curating a concept-rich dataset of image-caption pairs, and training simple MLP projectors. We evaluated our approach on various tasks involving both strong unimodal vision encoders (0-shot localization) and language encoders (multi-lingual, long context) and show that simple projectors retain unimodal capabilities in the joint embedding space. Furthermore, our best model, utilizing DINOv2 and the All-Roberta-Large text encoder, achieves 76% accuracy on ImageNet with a 20-fold reduction in data and a 65-fold reduction in compute requirements compared to multimodal alignment where models are trained from scratch. The proposed framework enhances the accessibility of multimodal model development while enabling flexible adaptation across diverse scenarios. Code and curated datasets will be released soon.
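As an illustration of the projector-only alignment described above, the following is a minimal sketch, assuming hypothetical `MLPProjector` and `clip_style_loss` names and frozen-encoder features supplied as plain tensors; it is not the authors' training code.

```python
# Minimal sketch: train small MLP projectors on top of frozen unimodal
# features with a symmetric InfoNCE-style loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLPProjector(nn.Module):
    def __init__(self, in_dim, out_dim, hidden=2048):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, out_dim))
    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(len(img_emb), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

img_proj, txt_proj = MLPProjector(1024, 512), MLPProjector(768, 512)
img_feat, txt_feat = torch.randn(32, 1024), torch.randn(32, 768)  # frozen-encoder outputs
loss = clip_style_loss(img_proj(img_feat), txt_proj(txt_feat))
loss.backward()   # only the projector parameters receive gradients
```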
Poster
Sudong Wang · Yunjian Zhang · Yao Zhu · Jianing Li · Zizhe Wang · Yanwei Liu · Xiangyang Ji

[ ExHall D ]

Abstract
Large Vision-Language Models (LVLMs) are gradually becoming the foundation for many artificial intelligence applications. However, understanding their internal working mechanisms has continued to puzzle researchers, which in turn limits further enhancement of their capabilities. In this paper, we investigate how multimodal knowledge evolves and eventually induces natural language in LVLMs. We design a series of novel strategies for analyzing internal knowledge within LVLMs, and delve into the evolution of multimodal knowledge at three levels: single token probabilities, token probability distributions, and feature encodings. In this process, we identify two key nodes in knowledge evolution, the critical layers and the mutation layers, dividing the evolution process into three stages: rapid evolution, stabilization, and mutation. Our research is the first to reveal the trajectory of knowledge evolution in LVLMs, providing a fresh perspective for understanding their underlying mechanisms.
Poster
Shiyu Zhao · Zhenting Wang · Felix Juefei-Xu · Xide Xia · Miao Liu · Xiaofang Wang · Mingfu Liang · Ning Zhang · Dimitris N. Metaxas · Licheng Yu

[ ExHall D ]

Abstract
Prevailing Multimodal Large Language Models (MLLMs) encode the input image(s) as vision tokens and feed them into the language backbone, similar to how Large Language Models (LLMs) process text tokens. However, the number of vision tokens increases quadratically with the image resolution, leading to huge computational costs. In this paper, we consider improving MLLM efficiency in two scenarios: (I) reducing computational cost without degrading performance, and (II) improving performance within a given budget. We start with our main finding that the ranking of vision tokens sorted by attention scores is similar in every layer except the first. Based on this, we assume that the number of essential top vision tokens does not increase along layers. Accordingly, for Scenario I, we propose a greedy search algorithm (G-Search) to find the least number of vision tokens to keep at each layer, from shallow to deep. Interestingly, G-Search is able to reach the optimal reduction strategy under our assumption. For Scenario II, based on the reduction strategy from G-Search, we design a parametric sigmoid function (P-Sigmoid) to guide the reduction at each layer of the MLLM, whose parameters are optimized by Bayesian Optimization. Extensive experiments demonstrate that our …
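The sketch below illustrates the general idea of a parametric sigmoid that maps layer depth to a vision-token keep ratio; the function name and parameters (`k`, `midpoint`, `lo`, `hi`) are hypothetical, and in the paper the actual parameters are found via Bayesian Optimization.

```python
# Minimal sketch of a P-Sigmoid-style keep-ratio schedule across layers.
import torch

def p_sigmoid_keep_ratio(layer_idx, num_layers, k=10.0, midpoint=0.5,
                         lo=0.1, hi=1.0):
    """Monotonically decreasing keep ratio from `hi` (shallow) to `lo` (deep)."""
    t = torch.tensor(layer_idx / max(num_layers - 1, 1))
    s = torch.sigmoid(k * (t - midpoint))          # 0 -> 1 across depth
    return float(hi - (hi - lo) * s)

num_layers, num_vision_tokens = 32, 576
for layer in range(0, num_layers, 8):
    keep = int(num_vision_tokens * p_sigmoid_keep_ratio(layer, num_layers))
    print(f"layer {layer:2d}: keep {keep} vision tokens")
```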
Poster
ziang yan · Zhilin Li · Yinan He · Chenting Wang · Kunchang Li · Xinhao Li · Xiangyu Zeng · Zilei Wang · Yali Wang · Yu Qiao · Limin Wang · Yi Wang

[ ExHall D ]

Abstract
Current multimodal large language models (MLLMs) struggle with fine-grained or precise understanding of visuals, even though they provide comprehensive perception and reasoning across a spectrum of vision applications. Recent studies either develop tool-use pipelines or unify specific visual tasks into the autoregressive framework, often at the expense of overall multimodal performance. To address this issue and enhance MLLMs with visual tasks in a scalable fashion, we propose Task Preference Optimization (TPO), a novel method that utilizes differentiable task preferences derived from typical fine-grained visual tasks. TPO introduces learnable task tokens that establish connections between multiple task-specific heads and the MLLM. By leveraging rich visual labels during training, TPO significantly enhances the MLLM's multimodal capabilities and task-specific performance. Through multi-task co-training within TPO, we observe synergistic benefits that elevate individual task performance beyond what is achievable through single-task training. Our instantiation of this approach with VideoChat and LLaVA demonstrates an overall 14.6% improvement in multimodal performance compared to baseline models. Additionally, MLLM-TPO demonstrates robust zero-shot capabilities across various tasks, performing comparably to state-of-the-art supervised models.
Poster
Eunkyu Park · Minyeong Kim · Gunhee Kim

[ ExHall D ]

Abstract
Hallucinations pose a significant challenge to the reliability of large vision-language models, making their detection essential for ensuring accuracy in critical applications. Current detection methods often rely on computationally intensive models, leading to high latency and resource demands. Their definitive outcomes also fail to account for real-world scenarios where the line between hallucinated and truthful information is unclear. To address these issues, we propose HalLoc, a dataset designed for efficient, probabilistic hallucination detection. It features 150K token-level annotated samples, including hallucination types, across Visual Question Answering (VQA), instruction-following, and image captioning tasks. This dataset facilitates the development of models that detect hallucinations with graded confidence, enabling more informed user interactions. Additionally, we introduce a baseline model trained on HalLoc, offering low-overhead, concurrent hallucination detection during generation. The model can be seamlessly integrated into existing VLMs, improving reliability while preserving efficiency. The prospect of a robust plug-and-play hallucination detection module opens new avenues for enhancing the trustworthiness of vision-language models in real-world applications.
Poster
Wei Suo · Lijun Zhang · Mengyang Sun · Lin Yuanbo Wu · Peng Wang · Yanning Zhang

[ ExHall D ]

Abstract
Large Vision-Language Models (LVLMs) have achieved impressive performance in visual content understanding and multi-modal reasoning. Unfortunately, these large models suffer from serious hallucination problems and tend to generate fabricated responses. Recently, several Contrastive Decoding (CD) strategies have been proposed to alleviate hallucination by introducing disturbed inputs. Although great progress has been made, these CD strategies mostly apply a one-size-fits-all approach across all input conditions. In this paper, we revisit this process through extensive experiments. The results show that the causes of hallucination are hybrid and that each generative step faces a unique hallucination challenge. Leveraging these insights, we introduce a simple yet effective Octopus-like framework that enables the model to adaptively identify hallucination types and create a dynamic CD workflow. Our Octopus framework not only outperforms existing methods across four benchmarks but also demonstrates excellent deployability and expansibility. Our code will be released.
Poster
Wenbin An · Feng Tian · Sicong Leng · Jiahao Nie · Haonan Lin · QianYing Wang · Ping Chen · Xiaoqin Zhang · Shijian Lu

[ ExHall D ]

Abstract
Despite great success across various multimodal tasks, Large Vision-Language Models (LVLMs) often encounter object hallucinations, with generated textual responses being inconsistent with the actual objects in images. We examine different LVLMs and pinpoint that one root cause of object hallucinations lies in deficient attention to discriminative image features. Specifically, LVLMs often predominantly attend to prompt-irrelevant global features instead of prompt-relevant local features, undermining their visual grounding capacity and leading to object hallucinations. We propose Assembly of Global and Local Attention (AGLA), a training-free and plug-and-play approach that mitigates hallucinations by assembling global features for response generation and local features for visual discrimination simultaneously. Specifically, we introduce an image-prompt matching scheme that captures prompt-relevant local features from images, leading to an augmented view of the input image in which prompt-relevant content is highlighted while irrelevant distractions are suppressed. Hallucinations can thus be mitigated with a calibrated logit distribution derived from the generative global features of the original image and the discriminative local features of the augmented image. Extensive experiments show the superiority of AGLA in LVLM hallucination mitigation, demonstrating its wide applicability across both discriminative and generative tasks. Our data and code will be released.
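A minimal sketch of the calibrated-decoding idea, assuming the model has already produced logits for the original image and for a prompt-focused augmented view; the combination rule and the `alpha` weight here are illustrative, not AGLA's exact calibration.

```python
# Minimal sketch: combine next-token logits from two views of the image.
import torch
import torch.nn.functional as F

def calibrated_next_token_logits(logits_global: torch.Tensor,
                                 logits_local: torch.Tensor,
                                 alpha: float = 1.0) -> torch.Tensor:
    # logits_global: logits conditioned on the original image (generative view)
    # logits_local:  logits conditioned on the augmented, prompt-relevant view
    return logits_global + alpha * logits_local

vocab = 32000
g, l = torch.randn(1, vocab), torch.randn(1, vocab)
probs = F.softmax(calibrated_next_token_logits(g, l), dim=-1)
next_token = probs.argmax(dim=-1)
```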
Poster
Zenghui Yuan · Jiawen Shi · Pan Zhou · Neil Zhenqiang Gong · Lichao Sun

[ ExHall D ]

Abstract
Multi-modal large language models (MLLMs) extend large language models (LLMs) to process multi-modal information, enabling them to generate responses to image-text inputs. MLLMs have been incorporated into diverse multi-modal applications, such as autonomous driving and medical diagnosis, via plug-and-play without fine-tuning. This deployment paradigm increases the vulnerability of MLLMs to backdoor attacks. However, existing backdoor attacks against MLLMs achieve limited effectiveness and stealthiness. In this work, we propose BadToken, the first token-level backdoor attack to MLLMs. BadToken introduces two novel backdoor behaviors: Token-substitution and Token-addition, which enable flexible and stealthy attacks by making token-level modifications to the original output for backdoored inputs. We formulate a general optimization problem that considers the two backdoor behaviors to maximize the attack effectiveness. We evaluate BadToken on two open-source MLLMs and various tasks. Our results show that our attack maintains the model's utility while achieving high attack success rates and stealthiness. We also show the real-world threats of BadToken in two scenarios, i.e., autonomous driving and medical diagnosis. Furthermore, we consider defenses including fine-tuning and input purification. Our results highlight the threat of our attack.
Poster
Joonhyun Jeong · Seyun Bae · Yeonsung Jung · Jaeryong Hwang · Eunho Yang

[ ExHall D ]

Abstract
Despite the remarkable versatility of Large Language Models (LLMs) and Multimodal LLMs (MLLMs) in generalizing across both language and vision tasks, LLMs and MLLMs have shown vulnerability to jailbreaking, generating textual outputs that undermine safety, ethical, and bias standards when exposed to harmful or sensitive inputs. With the recent advancement of safety alignment via preference tuning from human feedback, LLMs and MLLMs have been equipped with safety guardrails to yield safe, ethical, and fair responses to harmful inputs. However, despite the significance of safety alignment, research on its vulnerabilities remains largely underexplored. In this paper, we investigate the unexplored vulnerability of safety alignment, examining its ability to consistently provide safety guarantees for out-of-distribution (OOD)-ifying harmful inputs that may fall outside the aligned data distribution. Our key observation is that OOD-ifying the vanilla harmful inputs substantially increases the uncertainty of the model in discerning the malicious intent within the input, leading to a higher chance of being jailbroken. Exploiting this vulnerability, we propose JOOD, a new Jailbreak framework via OOD-ifying inputs beyond the safety alignment. We explore various off-the-shelf visual and textual transformation techniques for OOD-ifying the harmful inputs. Notably, we observe that even simple mixing-based techniques such as image mixup prove highly effective …
Poster
Han Wang · Gang Wang · Huan Zhang

[ ExHall D ]

Abstract
Vision Language Models (VLMs) can produce unintended and harmful content when exposed to adversarial attacks, particularly because their vision capabilities create new vulnerabilities. Existing defenses, such as input preprocessing, adversarial training, and response evaluation-based methods, are often impractical for real-world deployment due to their high costs. To address this challenge, we propose ASTRA, an efficient and effective defense that adaptively steers models away from adversarial feature directions to resist VLM attacks. Our key procedures involve finding transferable steering vectors representing the direction of harmful responses and applying adaptive activation steering to remove these directions at inference time. To create effective steering vectors, we randomly ablate the visual tokens from the adversarial images and identify those most strongly associated with jailbreaks. These tokens are then used to construct steering vectors. During inference, we perform an adaptive steering method that involves the projection between the steering vectors and calibrated activations, resulting in little performance drop on benign inputs while strongly avoiding harmful outputs under adversarial inputs. Extensive experiments across multiple models and baselines demonstrate our state-of-the-art performance and high efficiency in mitigating jailbreak risks. Additionally, ASTRA exhibits good transferability, defending against both unseen attacks at design time (i.e., structured-based attacks) and adversarial images …
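A minimal sketch of activation steering by projection, with hypothetical shapes and a fixed `strength`; the actual method derives steering vectors from ablated visual tokens and calibrates the projection adaptively per input.

```python
# Minimal sketch: remove the component of an activation along a steering direction.
import torch
import torch.nn.functional as F

def steer_away(activation: torch.Tensor, steering_vec: torch.Tensor,
               strength: float = 1.0) -> torch.Tensor:
    """Subtract the component of `activation` along the (unit) steering vector."""
    v = F.normalize(steering_vec, dim=-1)
    proj = (activation * v).sum(dim=-1, keepdim=True) * v   # projection onto v
    return activation - strength * proj

h = torch.randn(4, 4096)     # hidden states at some layer (toy shapes)
v = torch.randn(4096)        # steering vector (e.g., a "harmful response" direction)
h_steered = steer_away(h, v)
```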
Poster
Lijun Sheng · Jian Liang · Zilei Wang · Ran He

[ ExHall D ]

Abstract
Vision-language models (VLMs), such as CLIP, have gained significant popularity as foundation models, with numerous fine-tuning methods developed to enhance performance on downstream tasks. However, due to their inherent vulnerability and the common practice of selecting from a limited set of open-source models, VLMs suffer from a higher risk of adversarial attacks than traditional visual models. Existing defense techniques typically rely on adversarial fine-tuning during training, which requires labeled data and is often difficult to generalize across tasks. To address these limitations, we propose robust test-time prompt tuning (R-TPT), which mitigates the impact of adversarial attacks during the inference stage. We first reformulate the classic marginal entropy objective by eliminating the term that introduces conflicts under adversarial conditions, retaining only the pointwise entropy minimization. Furthermore, we introduce a plug-and-play reliability-based weighted ensembling strategy, which aggregates useful information from reliable augmented views to strengthen the defense. R-TPT enhances defense against adversarial attacks without requiring labeled training data while offering high flexibility for inference tasks. Extensive experiments on widely used benchmarks with various attacks demonstrate the effectiveness of R-TPT. The code is available in supplementary materials.
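To illustrate the test-time objective and the ensembling step described above, here is a minimal sketch assuming a batch of per-view class logits is already available; the reliability weighting shown (a softmax of negative entropy) is an illustrative stand-in, not necessarily the paper's exact scheme.

```python
# Minimal sketch: pointwise entropy minimization over augmented views plus a
# reliability-weighted ensemble of their predictions.
import torch
import torch.nn.functional as F

def pointwise_entropy(logits: torch.Tensor) -> torch.Tensor:
    p = F.softmax(logits, dim=-1)
    return -(p * p.clamp_min(1e-8).log()).sum(dim=-1)        # (num_views,)

def reliability_weighted_probs(logits: torch.Tensor) -> torch.Tensor:
    ent = pointwise_entropy(logits)
    w = F.softmax(-ent, dim=0)                                # low entropy -> high weight
    return (w[:, None] * F.softmax(logits, dim=-1)).sum(dim=0)

views_logits = torch.randn(16, 1000, requires_grad=True)      # logits for 16 augmented views
loss = pointwise_entropy(views_logits).mean()                  # test-time objective
loss.backward()          # in a real setup, gradients would flow into the prompt parameters
final_probs = reliability_weighted_probs(views_logits.detach())
```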
Poster
Jinhong Deng · Yuhang Yang · Wen Li · Lixin Duan

[ ExHall D ]

Abstract
While vision-language models like CLIP have shown remarkable success in open-vocabulary tasks, their application is currently confined to image-level tasks, and they still struggle with dense predictions. Recent works often attribute this deficiency in dense predictions to the self-attention layers in the final block, and have achieved commendable results by modifying the original query-key attention to self-correlation attention (e.g., query-query and key-key attention). However, these methods overlook the properties of cross-correlation (query-key) attention, which captures rich spatial correspondence. In this paper, we reveal that the cross-correlation of the self-attention in CLIP's non-final layers also exhibits localization properties. Therefore, we propose the Residual Cross-correlation Self-attention (RCS) module, which leverages the cross-correlation self-attention from intermediate layers to remold the attention in the final block. The RCS module effectively reorganizes spatial information, unleashing the localization potential within CLIP for dense vision-language inference. Furthermore, to enhance the focus on regions of the same categories and local consistency, we propose the Semantic Feedback Refinement (SFR) module, which utilizes semantic segmentation maps to further adjust the attention scores. By integrating these two strategies, our method, termed ResCLIP, can be easily incorporated into existing approaches as a plug-and-play module, significantly boosting their performance in dense vision-language inference. …
Poster
Jeonghyeon Kim · Sangheum Hwang

[ ExHall D ]

Abstract
Prior research on out-of-distribution detection (OoDD) has primarily focused on single-modality models. Recently, with the advent of large-scale pretrained vision-language models such as CLIP, OoDD methods utilizing such multi-modal representations through zero-shot and prompt learning strategies have emerged. However, these methods typically involve either freezing the pretrained weights or only partially tuning them, which can be suboptimal for downstream datasets. In this paper, we highlight that multi-modal fine-tuning (MMFT) can achieve notable OoDD performance. Despite some recent works demonstrating the impact of fine-tuning methods for OoDD, there remains significant potential for performance improvement. We investigate the limitation of naive fine-tuning methods, examining why they fail to fully leverage the pretrained knowledge. Our empirical analysis suggests that this issue could stem from the modality gap within in-distribution (ID) embeddings. To address this, we propose a training objective that enhances cross-modal alignment by regularizing the distances between image and text embeddings of ID data. This adjustment helps in better utilizing pretrained textual information by aligning similar semantics from different modalities (i.e., text and image) more closely in the hyperspherical representation space. We theoretically demonstrate that the proposed regularization corresponds to the maximum likelihood estimation of an energy-based model on a hypersphere. Utilizing …
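A minimal sketch, under the assumption that matched image/text embeddings of ID data are available as tensors, of a regularizer that shrinks the modality gap by pulling normalized embeddings together on the unit hypersphere; the exact objective in the paper may differ.

```python
# Minimal sketch: cross-modal alignment regularizer added to a fine-tuning loss.
import torch
import torch.nn.functional as F

def cross_modal_alignment_reg(img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    # squared Euclidean distance on the hypersphere: 2 - 2*cos_sim for matched pairs
    return (2.0 - 2.0 * (img * txt).sum(dim=-1)).mean()

img_emb, txt_emb = torch.randn(64, 512), torch.randn(64, 512)
reg = cross_modal_alignment_reg(img_emb, txt_emb)   # add lambda * reg to the MMFT objective
```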
Poster
Matteo Farina · Massimiliano Mancini · Giovanni Iacca · Elisa Ricci

[ ExHall D ]

Abstract
An old-school recipe for training a classifier is to (i) learn a good feature extractor and (ii) optimize a linear layer atop. When only a handful of samples are available per category, as in Few-Shot Adaptation (FSA), data are insufficient to fit a large number of parameters, rendering the above impractical. This is especially true with large pre-trained Vision-Language Models (VLMs), which motivated successful research at the intersection of Parameter-Efficient Fine-tuning (PEFT) and FSA. In this work, we start by analyzing the learning dynamics of PEFT techniques when trained on few-shot data from only a subset of categories, referred to as the “base” classes. We show that such dynamics naturally splits into two distinct phases: (i) task-level feature extraction and (ii) specialization to the available concepts. To accommodate this dynamic, we then depart from prompt- or adapter-based methods and tackle FSA differently. Specifically, given a fixed computational budget, we split it to (i) learn a task-specific feature extractor via PEFT and (ii) train a linear classifier on top. We call this scheme Two-Stage Few-Shot Adaptation (2SFS). Differently from established methods, our scheme enables a novel form of selective inference at a category level, i.e., at test time, only novel categories …
Poster
Lihua Zhou · Mao Ye · Shuaifeng Li · Nianxin Li · Xiatian Zhu · Lei Deng · Hongbin Liu · Zhen Lei

[ ExHall D ]

Abstract
Test-time adaptation with pre-trained vision-language models, such as CLIP, aims to adapt the model to new, potentially out-of-distribution test data. Existing methods calculate the similarity between the visual embedding and learnable class embeddings, which are initialized by text embeddings, for zero-shot image classification. In this work, we first analyze this process based on Bayes' theorem, and observe that the core factors influencing the final prediction are the likelihood and the prior. However, existing methods essentially focus on adapting class embeddings to adapt the likelihood, while often ignoring the importance of the prior. To address this gap, we propose a novel approach, Bayesian Class Adaptation (BCA), which, in addition to continuously updating class embeddings to adapt the likelihood, also uses the posterior of incoming samples to continuously update the prior for each class embedding. This dual updating mechanism allows the model to better adapt to distribution shifts and achieve higher prediction accuracy. Our method not only surpasses existing approaches in terms of performance metrics but also maintains superior inference rates and memory usage, making it highly efficient and practical for real-world applications.
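A minimal sketch of the likelihood-times-prior view described above, with an illustrative exponential-moving-average prior update; the class name and update rule are assumptions, not the paper's exact procedure.

```python
# Minimal sketch: CLIP-style likelihood combined with a continuously updated class prior.
import torch
import torch.nn.functional as F

class BayesianClassAdapter:
    def __init__(self, num_classes: int, momentum: float = 0.99):
        self.prior = torch.full((num_classes,), 1.0 / num_classes)
        self.momentum = momentum

    def predict(self, image_emb: torch.Tensor, class_emb: torch.Tensor) -> torch.Tensor:
        sims = F.normalize(image_emb, dim=-1) @ F.normalize(class_emb, dim=-1).t()
        likelihood = F.softmax(100.0 * sims, dim=-1)              # CLIP-like logit scale
        posterior = likelihood * self.prior                        # Bayes rule (unnormalized)
        posterior = posterior / posterior.sum(dim=-1, keepdim=True)
        # running update of the prior from incoming test samples
        self.prior = self.momentum * self.prior + (1 - self.momentum) * posterior.mean(dim=0)
        return posterior

adapter = BayesianClassAdapter(num_classes=10)
post = adapter.predict(torch.randn(4, 512), torch.randn(10, 512))
```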
Poster
Seung Hyun Lee · Jijun jiang · Yiran Xu · Zhuofang Li · Junjie Ke · Yinxiao Li · Junfeng He · Steven Hickson · Katie Datsenko · Sangpil Kim · Ming-Hsuan Yang · Irfan Essa · Feng Yang

[ ExHall D ]

Abstract
The goal of image cropping is to identify visually appealing crops in an image. Conventional methods are trained on specific datasets and fail to adapt to new requirements. Recent breakthroughs in large vision-language models (VLMs) enable visual in-context learning without explicit training. However, downstream tasks with VLMs remain underexplored. In this paper, we propose an effective approach to leveraging VLMs for image cropping. First, we propose an efficient prompt retrieval mechanism for image cropping to automate the selection of in-context examples. Second, we introduce an iterative refinement strategy to iteratively enhance the predicted crops. The proposed framework, which we refer to as Cropper, is applicable to a wide range of cropping tasks, including free-form cropping, subject-aware cropping, and aspect ratio-aware cropping. Extensive experiments demonstrate that Cropper significantly outperforms state-of-the-art methods across several benchmarks.
Poster
Haoyuan Yang · Xiaoou Li · Jiaming Lv · Xianjun Cheng · Qilong Wang · Peihua Li

[ ExHall D ]

Abstract
Adapting CLIP models for few-shot recognition has recently attracted significant attention. Despite considerable progress, these adaptations remain hindered by the pervasive challenge of data scarcity. Text-to-image models, capable of generating abundant photorealistic labeled images, offer a promising solution. However, existing approaches treat synthetic images merely as complements to real images, rather than as standalone knowledge repositories stemming from distinct foundation models. To overcome this limitation, we reconceptualize synthetic images as an imagined base set, i.e., a unique, large-scale synthetic dataset encompassing diverse concepts. We introduce a novel CLIP adaptation methodology called ImagineFSL, involving pretraining on the imagined base set followed by fine-tuning on downstream few-shot tasks. We find that, compared to no pretraining, both supervised and self-supervised pretraining are beneficial, with the latter providing better performance. Building on this finding, we propose an improved self-supervised method tailored for few-shot scenarios, enhancing the transferability of representations from synthetic to real image domains. Additionally, we present an image generation pipeline that employs chain-of-thought and in-context learning techniques, harnessing foundation models to automatically generate diverse, realistic images. Our methods are validated across eleven datasets, consistently outperforming state-of-the-art methods by substantial margins.
Poster
Chenyu Zhang · Kunlun Xu · Zichen Liu · Yuxin Peng · Jiahuan Zhou

[ ExHall D ]

Abstract
Vision-language models (VLMs) exhibit promising generalization capabilities, yet face considerable challenges when adapting to domain shifts stemming from changes in data distributions. Test-time adaptation (TTA) has thus emerged as a promising approach for enhancing VLM performance under such conditions. In practice, test data often arrives in batches, which has led to increasing interest in the transductive TTA setting. Existing TTA methods, however, are typically limited by focusing solely on individual test samples, thereby overlooking the critical cross-sample correlations within a batch. While recent ViT-based TTA methods have started to incorporate batch-level adaptation, they remain suboptimal for VLMs due to insufficient integration of the essential text modality. To bridge key gaps in TTA for VLMs, we propose a novel transductive TTA framework called Supportive Clique-based Attribute Prompting (SCAP), which effectively combines visual and textual information to enhance adaptation by generating fine-grained attribute prompts across test batches. SCAP first unsupervisedly forms supportive cliques of test samples based on visual similarity and learns an attribute prompt for each clique, capturing shared attributes critical for adaptation. For each test sample, SCAP aggregates attribute prompts from its associated cliques, providing enriched contextual information. To ensure adaptability over time, we incorporate a retention module that dynamically …
Poster
Ruoyu Chen · Siyuan Liang · Jingzhi Li · Shiming Liu · Maosen Li · Zhen Huang · Hua Zhang · Xiaochun Cao

[ ExHall D ]

Abstract
Advances in multimodal pre-training have propelled object-level foundation models, such as Grounding DINO and Florence-2, in tasks like visual grounding and object detection. However, interpreting these models' decisions has grown increasingly challenging. Existing interpretable attribution methods for object-level task interpretation have notable limitations: (1) gradient-based methods lack precise localization due to visual-textual fusion in foundation models, and (2) perturbation-based methods produce noisy saliency maps, limiting fine-grained interpretability. To address these, we propose a Visual Precision Search method that generates accurate attribution maps with fewer regions. Our method bypasses internal model parameters to overcome attribution issues from multimodal fusion, dividing inputs into sparse sub-regions and using consistency and collaboration scores to accurately identify critical decision-making regions. We also conducted a theoretical analysis of the boundary guarantees and scope of applicability of our method. Experiments on RefCOCO, MS COCO, and LVIS show our approach enhances object-level task interpretability over SOTA for Grounding DINO and Florence-2 across various evaluation metrics, with faithfulness gains of 23.7%, 31.6%, and 20.1% on MS COCO, LVIS, and RefCOCO for Grounding DINO, and 102.9% and 66.9% on MS COCO and RefCOCO for Florence-2. Additionally, our method can interpret failures in visual grounding and object detection tasks, surpassing existing …
Poster
ZHANG LINTONG · Kang Yin · Seong-Whan Lee

[ ExHall D ]

Abstract
Attribution-based explanation techniques capture key patterns to enhance visual interpretability. However, these patterns often lack the granularity needed for insight in fine-grained tasks, particularly in cases of model misclassification, where explanations may be insufficiently detailed. To address this limitation, we propose a fine-grained counterfactual explanation framework that generates both object-level and part-level interpretability, addressing two fundamental questions: (1) which fine-grained features contribute to model misclassification, and (2) where dominant local features influence counterfactual adjustments. Our approach yields explainable counterfactuals in a non-generative manner by quantifying similarity and weighting component contributions within regions of interest between correctly classified and misclassified samples. Furthermore, we introduce an importance-isolation module grounded in Shapley value contributions, isolating features with region-specific relevance. Extensive experiments demonstrate the superiority of our approach in capturing more granular, intuitively meaningful regions, surpassing coarse-grained methods.
Poster
Itay Benou · Tammy Riklin Raviv

[ ExHall D ]

Abstract
Modern deep neural networks have now reached human-level performance across a variety of tasks. However, unlike humans, they lack the ability to explain their decisions by showing where they looked and telling which concepts guided them. In this work, we present a unified framework for transforming any vision neural network into a spatially and conceptually interpretable model. We introduce a spatially-aware concept bottleneck layer that projects “black-box” features of pre-trained backbone models into interpretable concept maps, without requiring human labels. By training a classification layer over this bottleneck, we obtain a self-explaining model that articulates which concepts most influenced its prediction, along with heatmaps that ground them in the input image. Accordingly, we name this method “Spatially-Aware and Label-Free Concept Bottleneck Model” (SALF-CBM). Our results show that the proposed SALF-CBM: (1) Outperforms non-spatial CBM methods, as well as the original backbone, on a variety of classification tasks; (2) Produces high-quality spatial explanations, outperforming widely used heatmap-based methods on a zero-shot segmentation task; (3) Facilitates model exploration and debugging, enabling users to query specific image regions and refine the model's decisions by locally editing its concept maps.
Poster
Jinseong Jang · Chunfei Ma · Byeongwon Lee

[ ExHall D ]

Abstract
Deploying high-performing neural networks in resource-constrained environments poses a significant challenge due to the computational demands of large-scale models. We introduce VL2Lite, a knowledge distillation framework designed to enhance the performance of lightweight neural networks in image classification tasks by leveraging the rich representational knowledge from Vision-Language Models (VLMs). VL2Lite directly integrates multi-modal knowledge from VLMs into compact models during training, effectively compensating for the limited computational and modeling capabilities of smaller networks. By transferring high-level features and complex data representations, our approach improves the accuracy and efficiency of image classification tasks without increasing computational overhead during inference. Experimental evaluations demonstrate that VL2Lite achieves up to a 7% improvement in classification performance across various datasets. This method addresses the challenge of deploying accurate models in environments with constrained computational resources, offering a balanced solution between model complexity and operational efficiency.
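A minimal sketch, with hypothetical module and loss names, of distilling frozen VLM image features into a lightweight classifier by combining cross-entropy with a feature-alignment term; VL2Lite's actual architecture and losses may differ.

```python
# Minimal sketch: lightweight student trained with a classification loss plus
# alignment to (frozen) VLM image features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistilledStudent(nn.Module):
    def __init__(self, num_classes=100, vlm_dim=768, feat_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, feat_dim), nn.ReLU())
        self.head = nn.Linear(feat_dim, num_classes)
        self.proj = nn.Linear(feat_dim, vlm_dim)   # maps student features to the VLM space

    def forward(self, x):
        f = self.backbone(x)
        return self.head(f), self.proj(f)

def distillation_loss(logits, labels, student_feat, vlm_feat, alpha=0.5):
    ce = F.cross_entropy(logits, labels)
    align = 1.0 - F.cosine_similarity(student_feat, vlm_feat, dim=-1).mean()
    return ce + alpha * align

model = DistilledStudent()
x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 100, (8,))
vlm_feat = torch.randn(8, 768)          # frozen VLM image features (precomputed)
logits, feat = model(x)
loss = distillation_loss(logits, y, feat, vlm_feat)
loss.backward()
```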
Poster
Mert Bülent Sarıyıldız · Philippe Weinzaepfel · Thomas Lucas · Pau de Jorge · Diane Larlus · Yannis Kalantidis

[ ExHall D ]

Abstract
Recent multi-teacher distillation methods have successfully unified the encoders of several foundation models into a single encoder capable of competitive performance on core computer vision tasks, such as classification, segmentation, and depth estimation. This led us to ask: Could similar success be achieved when the pool of teachers also includes vision models specialized in diverse tasks across 2D and 3D perception? In this paper, we define and investigate the problem of heterogeneous teacher distillation, or co-distillation -- a challenging multi-teacher distillation scenario where teacher models vary significantly in both (a) their design objectives and (b) the data they were trained on. We explore strategies for data sharing and encoding teacher-specific information and as a result, we obtain a single encoder that excels in challenging tasks spanning 3D understanding, 3D human perception, and 2D vision. The resulting model exhibits strong generalization capabilities and performs on par with its teachers, each one state-of-the-art for a specialized task. Notably, our model outperforms all known methods on the Map-free Visual Relocalization dataset with a highly compact encoder.
Poster
Xuweiyi Chen · Markus Marks · Zezhou Cheng

[ ExHall D ]

Abstract
Mid-level vision capabilities — such as generic object localization and 3D geometric understanding — are not only fundamental to human vision but are also crucial for many real-world applications of computer vision. These abilities emerge with minimal supervision during the early stages of human visual development. Despite their significance, current self-supervised learning (SSL) approaches are primarily designed and evaluated for high-level recognition tasks, leaving their mid-level vision capabilities largely unexamined. In this study, we introduce a suite of benchmark protocols to systematically assess mid-level vision capabilities and present a comprehensive, controlled evaluation of 22 prominent SSL models across 8 mid-level vision tasks. Our experiments reveal a weak correlation between mid-level and high-level task performance. We also identify several SSL methods with highly imbalanced performance across mid-level and high-level capabilities, as well as some that excel in both. Additionally, we investigate key factors contributing to mid-level vision performance, such as pretraining objectives and network architectures. Our study provides a holistic and timely view of what SSL models have learned, complementing existing research that primarily focuses on high-level vision tasks. We hope our findings guide future SSL research to benchmark models not only on high-level vision tasks but on mid-level as well.
Poster
Seokil Ham · Hee-Seon Kim · Sangmin Woo · Changick Kim

[ ExHall D ]

Abstract
Despite the growing interest in the Mamba architecture as a potential replacement for the Transformer architecture, parameter-efficient fine-tuning (PEFT) approaches for Mamba remain largely unexplored. In our study, we introduce two key insight-driven strategies for PEFT in the Mamba architecture: (1) while state-space models (SSMs) have been regarded as the cornerstone of the Mamba architecture, and are thus expected to play the primary role in transfer learning, our findings reveal that Projectors, not SSMs, are the predominant contributors to transfer learning; and (2) based on our observation that adapting pretrained Projectors to new tasks can be effectively approximated through a near-diagonal linear transformation, we propose a novel PEFT method specialized to the Mamba architecture: Projector-targeted Diagonal-centric Linear Transformation (ProDiaL). ProDiaL focuses on optimizing only diagonal-centric linear transformation matrices, without directly fine-tuning the pretrained Projector weights. This targeted approach allows efficient task adaptation, utilizing less than 1% of the total parameters, and exhibits strong performance across both vision and language Mamba models, highlighting its versatility and effectiveness.
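A minimal sketch of the diagonal-centric idea: a frozen projector weight is adapted only through learnable diagonal scaling on its input and output sides. The parameterization shown is illustrative and simpler than ProDiaL's near-diagonal transformation.

```python
# Minimal sketch: adapt a frozen linear projector with two learnable diagonals,
# keeping trainable parameters to O(d_in + d_out).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiagonalAdaptedLinear(nn.Module):
    def __init__(self, frozen_linear: nn.Linear):
        super().__init__()
        self.weight = frozen_linear.weight.detach()           # frozen projector weight
        self.bias = frozen_linear.bias.detach() if frozen_linear.bias is not None else None
        self.d_out = nn.Parameter(torch.ones(self.weight.shape[0]))  # output-side diagonal
        self.d_in = nn.Parameter(torch.ones(self.weight.shape[1]))   # input-side diagonal

    def forward(self, x):
        w = self.d_out[:, None] * self.weight * self.d_in[None, :]
        return F.linear(x, w, self.bias)

base = nn.Linear(256, 512)
adapted = DiagonalAdaptedLinear(base)
out = adapted(torch.randn(4, 256))     # shape: (4, 512)
```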
Poster
Jinqi Xiao · Shen Sang · Tiancheng Zhi · Jing Liu · Qing Yan · Linjie Luo · Bo Yuan

[ ExHall D ]

Abstract
Training large-scale neural networks in vision and multimodal domains demands substantial memory resources, primarily due to the storage of optimizer states. While LoRA, a popular parameter-efficient method, reduces memory usage, it often suffers from suboptimal performance due to the constraints of low-rank updates. Low-rank gradient projection methods (e.g., GaLore, Flora) reduce optimizer memory by projecting gradients and moment estimates into low-rank spaces via singular value decomposition or random projection. However, they fail to account for inter-projection correlation, causing performance degradation, and their projection strategies often incur high computational costs. In this paper, we present COAP (Correlation-Aware Gradient Projection), a memory-efficient method that minimizes computational overhead while maintaining training performance. Evaluated across various vision, language, and multimodal tasks, COAP outperforms existing methods in both training speed and model performance. For LLaMA-1B, it reduces optimizer memory by 61% with only a 2% additional time cost, achieving the same PPL as AdamW. With 8-bit quantization, COAP cuts optimizer memory by 81% and achieves a 4x speedup over GaLore for LLaVA-v1.5-7B fine-tuning, while delivering higher accuracy.
Poster
Elad Amrani · Leonid Karlinsky · Alex M. Bronstein

[ ExHall D ]

Abstract
We introduce XTRA, a vision model pre-trained with a novel auto-regressive objective that significantly enhances both sample and parameter efficiency compared to previous auto-regressive image models. Unlike contrastive or masked image modeling methods, which have not been demonstrated as having consistent scaling behavior on unbalanced internet data, auto-regressive vision models exhibit scalable and promising performance as model and dataset size increase. In contrast to standard auto-regressive models, XTRA employs a Block Causal Mask, where each Block represents k×k tokens rather than relying on a standard causal mask. By reconstructing pixel values block by block, XTRA captures higher-level structural patterns over larger image regions. Predicting on blocks allows the model to learn relationships across broader areas of pixels, enabling more abstract and semantically meaningful representations than traditional next-token prediction. This simple modification yields two key results. First, XTRA is sample-efficient. Despite being trained on 152× fewer samples (13.1M vs. 2B), XTRA ViT-H/14 surpasses the top-1 average accuracy of the previous state-of-the-art auto-regressive model across 15 diverse image recognition benchmarks. Second, XTRA is parameter-efficient. Compared to auto-regressive models trained on ImageNet-1k, XTRA ViT-B/16 outperforms in linear and attentive probing tasks, using 7–16× fewer parameters (85M vs. 1.36B/0.63B).
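A minimal sketch of a block causal mask in which each block spans a group of patch tokens (e.g., k×k patches per block) and queries may attend to keys in their own or earlier blocks; the token ordering and block size here are illustrative.

```python
# Minimal sketch: build a block causal attention mask.
import torch

def block_causal_mask(num_tokens: int, block_size: int) -> torch.Tensor:
    block_id = torch.arange(num_tokens) // block_size     # block index per token
    # True where attention is allowed: key block index <= query block index
    return block_id[None, :] <= block_id[:, None]

mask = block_causal_mask(num_tokens=16, block_size=4)     # 4 blocks of 4 tokens each
print(mask.int())
```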
Poster
Jeimin Jeon · Youngmin Oh · Junghyup Lee · Donghyeon Baek · Dohyung Kim · Chanho Eom · Bumsub Ham

[ ExHall D ]

Abstract
N-shot neural architecture search (NAS) exploits a supernet containing all candidate subnets for a given search space. The subnets are typically trained with a static training strategy (e.g., using the same learning rate (LR) scheduler and optimizer for all subnets). This, however, does not consider that individual subnets have distinct characteristics, leading to two problems: (1) the supernet training is biased towards low-complexity subnets (unfairness); (2) the momentum update in the supernet is noisy (noisy momentum). We present a dynamic supernet training technique that addresses these problems by adjusting the training strategy adaptively to the subnets. Specifically, we introduce a complexity-aware LR scheduler (CaLR) that controls the decay ratio of the LR adaptively to the complexities of subnets, which alleviates the unfairness problem. We also present a momentum separation technique (MS). It groups the subnets with similar structural characteristics and uses a separate momentum for each group, avoiding the noisy momentum problem. Our approach is applicable to various N-shot NAS methods at marginal cost, while improving the search performance drastically. We validate the effectiveness of our approach on various search spaces (e.g., NAS-Bench-201, MobileNet spaces) and datasets (e.g., CIFAR-10/100, ImageNet). Our code will be available online.
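A minimal sketch, with an illustrative formula, of complexity-aware decay: a cosine schedule whose floor depends on the sampled subnet's relative complexity, so low-complexity subnets decay faster; this is not the paper's exact CaLR rule.

```python
# Minimal sketch: complexity-aware cosine learning-rate decay.
import math

def calr_lr(base_lr: float, step: int, total_steps: int,
            subnet_flops: float, max_flops: float, min_ratio: float = 0.1) -> float:
    cosine = 0.5 * (1 + math.cos(math.pi * step / total_steps))   # decays 1 -> 0
    # low-complexity subnets get a smaller floor, hence a faster effective decay
    floor = min_ratio * (subnet_flops / max_flops)
    return base_lr * (floor + (1 - floor) * cosine)

for flops in (0.2, 1.0):   # a small subnet vs. the largest subnet (relative FLOPs)
    print([round(calr_lr(0.1, s, 100, flops, 1.0), 4) for s in (0, 50, 100)])
```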
Poster
Sabbir Ahmed · Abdullah Al Arafat · Deniz Najafi · Akhlak Mahmood · Mamshad Nayeem Rizve · Mohaiminul Al Nahian · RANYANG ZHOU · Shaahin Angizi · Adnan Rakin Rakin

[ ExHall D ]

Abstract
Vision Transformers (ViTs) excel in tackling complex vision tasks, yet their substantial size poses significant challenges for applications on resource-constrained edge devices. The increased size of these models leads to higher overhead (e.g., energy, latency) when transmitting model weights between the edge device and the server. Hence, ViTs are not ideal for edge devices where the entire model may not fit on the device. Current model compression techniques often achieve high compression ratios at the expense of performance degradation, particularly for ViTs. To overcome the limitations of existing works, we rethink the model compression strategy for ViTs from first principles and develop an orthogonal strategy called DeepCompress-ViT. The objective of DeepCompress-ViT is to encode the model weights into a highly compressed representation using a novel training method, denoted as Unified Compression Training (UCT). The proposed UCT is accompanied by a decoding mechanism during inference, which helps to recover any loss of accuracy due to the high compression ratio. We further optimize this decoding step by re-ordering the decoding operation using the associative property of matrix multiplication, ensuring that the compressed weights can be decoded during inference without incurring any computational overhead. Our extensive experiments across multiple ViT models on modern edge …
Poster
Minhyeok Lee · Suhwan Cho · Jungho Lee · Sunghun Yang · Heeseung Choi · Ig-Jae Kim · Sangyoun Lee

[ ExHall D ]

Abstract
Open-vocabulary semantic segmentation aims to assign pixel-level labels to images across an unlimited range of classes. Traditional methods address this by sequentially connecting a powerful mask proposal generator, such as the Segment Anything Model (SAM), with a pre-trained vision-language model like CLIP. However, these two-stage approaches often suffer from high computational costs and memory inefficiencies. In this paper, we propose ESC-Net, a novel one-stage open-vocabulary segmentation model that leverages the SAM decoder blocks for class-agnostic segmentation within an efficient inference framework. By embedding pseudo prompts generated from image-text correlations into SAM's promptable segmentation framework, ESC-Net achieves refined spatial aggregation for accurate mask predictions. Additionally, a Vision-Language Fusion (VLF) module enhances the final mask prediction through image and text guidance. ESC-Net achieves superior performance on standard benchmarks, including ADE20K, PASCAL-VOC, and PASCAL-Context, outperforming prior methods in both efficiency and accuracy. Comprehensive ablation studies further demonstrate its robustness across challenging conditions.
Poster
Feng Wang · Timing Yang · Yaodong Yu · Sucheng Ren · Guoyizhe Wei · Angtian Wang · Wei Shao · Yuyin Zhou · Alan L. Yuille · Cihang Xie

[ ExHall D ]

Abstract
In this work, we introduce the Adventurer series models where we treat images as sequences of patch tokens and employ uni-directional language models to learn visual representations. This modeling paradigm allows us to process images in a recurrent formulation with linear complexity relative to the sequence length, which can effectively address the memory and computation explosion issues posed by high-resolution and fine-grained images. In detail, we introduce two simple designs that seamlessly integrate image inputs into the causal inference framework: a global pooling token placed at the beginning of the sequence and a flipping operation between every two layers. Extensive empirical studies highlight that compared with the existing plain architectures such as DeiT and Vim, Adventurer offers an optimal efficiency-accuracy trade-off. For example, our Adventurer-Base attains a competitive test accuracy of 84.3% on the standard ImageNet-1k benchmark with 216 images/s training throughput, which is 3.8x and 6.2x faster than Vim and DeiT to achieve the same result. As Adventurer offers great computation and memory efficiency and allows scaling with linear complexity, we hope this architecture can benefit future explorations in modeling long sequences for high-resolution or fine-grained images.
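A toy sketch of the two designs described above (a prepended global pooling token and a sequence flip between every two layers), using a GRU as a stand-in linear-time causal mixer; the real Adventurer blocks are different.

```python
# Minimal sketch: global pooling token + sequence flipping in a toy causal model.
import torch
import torch.nn as nn

class ToyCausalBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.rnn = nn.GRU(dim, dim, batch_first=True)   # linear-time causal mixer (stand-in)
    def forward(self, x):
        return self.rnn(x)[0] + x

class ToyAdventurer(nn.Module):
    def __init__(self, dim=192, depth=4):
        super().__init__()
        self.pool_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.blocks = nn.ModuleList(ToyCausalBlock(dim) for _ in range(depth))

    def forward(self, patches):                          # patches: (B, N, dim)
        x = torch.cat([self.pool_token.expand(len(patches), -1, -1), patches], dim=1)
        for i, blk in enumerate(self.blocks):
            x = blk(x)
            if i % 2 == 1:                               # flip the sequence every two layers
                x = torch.flip(x, dims=[1])
        return x

out = ToyAdventurer()(torch.randn(2, 196, 192))          # (2, 197, 192)
```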
Poster
Yair Smadar · Assaf Hoogi

[ ExHall D ]

Abstract
Deep neural networks remain vulnerable to statistical variations in data despite advances in normalization techniques. Current approaches rely on fixed static normalization sets, fundamentally limiting their ability to adapt to dynamic data distributions. We introduce Dynamic Group Normalization (DGN), which treats channel grouping as a learnable component and leverages statistical awareness to form coherent groups adaptively. By employing an efficient spatio-temporal mechanism that continuously evaluates inter-channel relationships both within layers and across training epochs, DGN enables robust adaptation to evolving data distributions. Extensive evaluations across 24 architectures and 8 computer vision benchmarks demonstrate DGN's consistent superiority. Beyond achieving significant accuracy gains in classification, detection, and segmentation tasks while maintaining computational efficiency, DGN particularly excels in challenging scenarios where traditional methods struggle—notably in Out-Of-Distribution generalization and imbalanced data distributions.
Poster
Linwei Chen · Lin Gu · Liang Li · Chenggang Yan · Ying Fu

[ ExHall D ]

Abstract
While Dynamic Convolution (DY-Conv) has shown promising performance by enabling adaptive weight selection through multiple parallel weights combined with an attention mechanism, the frequency response of these weights tends to exhibit high similarity, resulting in high parameter costs but limited adaptability. In this work, we introduce Frequency Dynamic Convolution (FDConv), a novel approach that mitigates these limitations by learning a fixed parameter budget in the Fourier domain. FDConv divides this budget into frequency-based groups with disjoint Fourier indices, enabling the construction of frequency-diverse weights without increasing the parameter cost. To further enhance adaptability, we propose Kernel Spatial Modulation (KSM) and Frequency Band Modulation (FBM). KSM dynamically adjusts the frequency response of each filter at the spatial level, while FBM decomposes weights into distinct frequency bands in the frequency domain and modulates them dynamically based on local content. Extensive experiments on object detection, segmentation, and classification validate the effectiveness of FDConv. We demonstrate that when applied to ResNet-50, FDConv achieves superior performance with a modest increase of +3.6M parameters, outperforming previous methods that require substantial increases in parameter budgets (e.g., CondConv +90M, KW +76.5M). Moreover, FDConv seamlessly integrates into a variety of architectures, including ConvNeXt and Swin-Transformer, offering a flexible and efficient solution for modern …
Poster
Kwonyoung Kim · Jungin Park · Jin Kim · Hyeongjun Kwon · Kwanghoon Sohn

[ ExHall D ]

Abstract
Parameter-efficient tuning (PET) aims to transfer pre-trained foundation models to downstream tasks by learning a small number of parameters. In practice, PET requires much smaller storage and transmission cost for each task than traditional fine-tuning methods, which require updating all parameters, regardless of exponentially increasing pre-trained model capacity. However, most existing PET methods inherit the latency associated with their large backbones and often require extra computation from additional modules (e.g., adapters) during inference, making them less practical for computation-intensive applications. In this paper, we propose a Faster Parameter-Efficient Tuning (FPET) method to achieve high inference speed and computation efficiency while keeping storage efficiency high. Specifically, we introduce a plug-and-play token redundancy reduction module delicately engineered for PET. The proposed module refines tokens from the self-attention layer using an adapter to learn the accurate similarity between tokens and cuts off the token count through a token merging strategy. We formulate token merging to be fully differentiable using a straight-through estimator, making token redundancy reduction optimal. Experimental results show that our FPET achieves faster inference and higher memory efficiency than the pre-trained backbone while keeping competitive performance on par with state-of-the-art PET methods.
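For intuition, the "fully differentiable via a straight-through estimator" idea can be sketched as follows. Note that this simplification drops (masks) low-scoring tokens instead of merging them, and the adapter/scoring module names are assumptions rather than FPET's actual components.

```python
import torch
import torch.nn as nn

class STETokenReduction(nn.Module):
    """Illustrative only: an adapter scores tokens, a hard top-k mask is used in
    the forward pass, and gradients flow through the soft scores via a
    straight-through estimator. FPET itself merges (rather than masks) tokens."""
    def __init__(self, dim, keep_ratio=0.5):
        super().__init__()
        self.score = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 1))
        self.keep_ratio = keep_ratio

    def forward(self, tokens):                       # tokens: (B, N, dim)
        b, n, _ = tokens.shape
        k = max(1, int(n * self.keep_ratio))
        soft = torch.sigmoid(self.score(tokens))     # (B, N, 1), differentiable scores
        idx = soft.squeeze(-1).topk(k, dim=1).indices
        hard = torch.zeros_like(soft).scatter(1, idx.unsqueeze(-1), 1.0)
        mask = hard + soft - soft.detach()           # straight-through estimator
        return tokens * mask                         # dropped tokens are zeroed out
```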
Poster
Yan Xie · Zequn Zeng · Hao Zhang · Yucheng Ding · Yi Wang · Zhengjue Wang · Bo Chen · Hongwei Liu

[ ExHall D ]

Abstract
Concept Bottleneck Models (CBMs) try to make the decision-making process transparent by exploring an intermediate concept space between the input image and the output prediction. Existing CBMs learn only coarse-grained relations between the whole image and the concepts, paying little attention to local image information, which leads to two main drawbacks: i) they often produce spurious visual-concept relations, hence decreasing model reliability; and ii) though CBMs could explain the importance of every concept to the final prediction, it is still challenging to tell which visual region produces the prediction. To solve these problems, this paper proposes a Disentangled Optimal Transport CBM (DOT-CBM) framework to explore fine-grained visual-concept relations between local image patches and concepts. Specifically, we model the concept prediction process as a transportation problem between the patches and concepts, thereby achieving explicit fine-grained feature alignment. We also incorporate orthogonal projection losses within the modality to enhance local feature disentanglement. To further address the shortcut issues caused by statistical biases in the data, we utilize the visual saliency map and concept label statistics as transportation priors. Thus, DOT-CBM can visualize inversion heatmaps, provide more reliable concept predictions, and produce more accurate class predictions. Comprehensive experiments demonstrate that our proposed DOT-CBM achieves SOTA performance on …
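Casting patch-to-concept alignment as a transportation problem typically reduces to an entropic (Sinkhorn-style) solver over a patch-concept cost matrix. The sketch below assumes uniform marginals for simplicity, whereas the paper uses saliency maps and concept-label statistics as transport priors; all names and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def sinkhorn_transport(cost, n_iters=50, eps=0.05):
    """Entropic optimal transport between patches (rows) and concepts (columns).
    cost: (P, C) patch-to-concept cost, e.g. 1 - cosine similarity.
    Uniform marginals are assumed here; DOT-CBM instead uses saliency and
    concept-label statistics as transport priors."""
    P, C = cost.shape
    mu = torch.full((P,), 1.0 / P)              # patch marginal
    nu = torch.full((C,), 1.0 / C)              # concept marginal
    K = torch.exp(-cost / eps)                  # Gibbs kernel
    u = torch.ones_like(mu)
    for _ in range(n_iters):
        v = nu / (K.t() @ u)
        u = mu / (K @ v)
    return u[:, None] * K * v[None, :]          # transport plan, shape (P, C)

# toy usage: read concept evidence off the column masses of the plan
patch_feats, concept_feats = torch.randn(196, 512), torch.randn(50, 512)
cost = 1 - F.normalize(patch_feats, dim=-1) @ F.normalize(concept_feats, dim=-1).t()
plan = sinkhorn_transport(cost)
concept_scores = plan.sum(dim=0)                # mass each concept receives from the patches
```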
Poster
Aishwarya Agarwal · Srikrishna Karanam · Vineet Gandhi

[ ExHall D ]

Abstract
We consider the problem of single-source domain generalization. Existing methods typically rely on extensive augmentations to synthetically cover diverse domains during training. However, they struggle with semantic shifts (e.g., background and viewpoint changes), as they often learn global features instead of local concepts that tend to be domain invariant. To address this gap, we propose an approach that compels models to leverage such local concepts during prediction. Given no suitable dataset with per-class concepts and localization maps exists, we first develop a novel pipeline to generate annotations by exploiting the rich features of diffusion and large-language models. Our next innovation is TIDE, a novel training scheme with a concept saliency alignment loss that ensures model focus on the right per-concept regions and a local concept contrastive loss that promotes learning domain-invariant concept representations. This not only gives a robust model but also can be visually interpreted using the predicted concept saliency maps. Given these maps at test time, our final contribution is a new correction algorithm that uses the corresponding local concept representations to iteratively refine the prediction until it aligns with prototypical concept representations that we store at the end of model training. We evaluate our approach extensively on …
Poster
Zihang Lai

[ ExHall D ]

Abstract
Open-vocabulary semantic segmentation models aim to accurately assign a semantic label to each pixel in an image from a set of arbitrary open-vocabulary texts. In order to learn such pixel-level alignment, current approaches typically rely on a combination of (i) an image-level VL model (e.g., CLIP), (ii) ground truth masks, (iii) custom grouping encoders, and (iv) the Segment Anything Model (SAM). In this paper, we introduce S-Seg, a simple model that can achieve surprisingly strong performance without depending on any of the above elements. S-Seg leverages pseudo-masks and language to train a MaskFormer, and can be easily trained from publicly available image-text datasets. Contrary to prior works, our model directly trains for pixel-level features and language alignment. Once trained, S-Seg generalizes well to multiple testing datasets without requiring fine-tuning. In addition, S-Seg has the extra benefits of scaling with data and of consistent improvement when augmented with self-training. We believe that our simple yet effective approach will serve as a solid baseline for future research. Our code and demo will be made publicly available soon.
Poster
Lanyun Zhu · Tianrun Chen · Qianxiong Xu · Xuanyi Liu · Deyi Ji · Haiyang Wu · De Soh Soh · Jun Liu

[ ExHall D ]

Abstract
Existing LVLM-based reasoning segmentation methods often suffer from imprecise segmentation results and hallucinations in their text responses. This paper introduces POPEN, a novel framework designed to address these issues and achieve improved results. POPEN includes a preference-based optimization method to finetune the LVLM, aligning it more closely with human preferences and thereby generating better text responses and segmentation results. Additionally, POPEN introduces a preference-based ensemble method for inference, which integrates multiple outputs from the LVLM using a preference-score-based attention mechanism for refinement. To better adapt to the segmentation task, we incorporate several task-specific designs in our POPEN framework, including a new approach for collecting segmentation preference data with a curriculum learning mechanism, and a novel preference optimization loss to refine the segmentation capability of the LVLM. Experiments demonstrate that our method achieves state-of-the-art performance in reasoning segmentation, exhibiting minimal hallucination in text responses and the highest segmentation accuracy compared to previous advanced methods like LISA and PixelLM.
Poster
Songsong Duan · Xi Yang · Nannan Wang

[ ExHall D ]

Abstract
Existing Weakly Supervised Semantic Segmentation (WSSS) relies on the CNN-based Class Activation Map (CAM) and Transformer-based self-attention map to generate class-specific masks for semantic segmentation. However, CAM and self-attention maps usually cause incomplete segmentation due to the classification bias issue. To address this issue, we propose a Multi-Label Prototype Visual Spatial Search (MuP-VSS) method with a spatial query mechanism, which learns a set of class token vectors as queries to search for similar visual tokens among image patch tokens. Specifically, MuP-VSS consists of two key components: multi-label prototype representation and multi-label prototype optimization. The former designs a global embedding to learn the global tokens from the images, and then proposes a Prototype Embedding Module (PEM) to interact with patch tokens to understand the local semantic information. The latter utilizes the exclusivity and consistency principles of the multi-label prototypes to design three prototype losses to optimize them, which contain cross-class prototype (CCP) contrastive loss, cross-image prototype (CIP) contrastive loss, and patch-to-prototype (P2P) consistency loss. CCP loss models the exclusivity of multi-label prototypes learned from a single image to better enhance the discriminative properties of each class. CIP loss learns the consistency of the same class-specific prototypes extracted from multiple images to enhance …
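A rough sketch of what the exclusivity and consistency principles could look like as losses is given below; these are simplified stand-ins inferred from the loss names, not the paper's exact CCP/CIP formulations, and the P2P term is omitted.

```python
import torch
import torch.nn.functional as F

def ccp_exclusivity_loss(protos):
    """protos: (C, D) class prototypes extracted from a single image.
    Simplified reading of the CCP idea: penalise off-diagonal cosine
    similarities so different-class prototypes stay mutually exclusive."""
    p = F.normalize(protos, dim=-1)
    sim = p @ p.t()                                          # (C, C)
    off_diag = ~torch.eye(p.size(0), dtype=torch.bool, device=p.device)
    return sim[off_diag].abs().mean()

def cip_consistency_loss(protos_a, protos_b):
    """protos_a, protos_b: (C, D) prototypes of the same classes from two
    different images. Simplified reading of the CIP idea: pull matching
    class prototypes together across images."""
    return (1 - F.cosine_similarity(protos_a, protos_b, dim=-1)).mean()
```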
Poster
Farchan Hakim Raswa · Chun-Shien Lu · Jia-Ching Wang

[ ExHall D ]

Abstract
Federated learning for pathological whole slide image (WSI) classification allows multiple clients to train a global multiple instance learning (MIL) model without sharing their privacy-sensitive WSIs. To accommodate the non-independent and identically distributed (non-i.i.d.) feature shifts, cross-client style transfer has been popularly used but is subject to two fundamental issues: (1) WSIs contain multiple morphological structures due to tissue heterogeneity, and (2) the regions of interest (RoIs) are not guaranteed, particularly after augmenting local WSI data through style transfer. To address these challenges, we propose HistoFS, a federated learning framework for computational pathology on non-i.i.d. feature shifts in WSI classification. Specifically, we introduce pseudo bag styles that capture multiple style variations within a single WSI. In addition, an authenticity module is introduced to ensure that RoIs are preserved, allowing local models to learn WSIs with diverse styles while maintaining essential RoIs. Extensive experiments validate the superiority of HistoFS over state-of-the-art methods on three clinical datasets.
Poster
Ziqian Yang · Xinqiao Zhao · Xiaolei Wang · Quan Zhang · Jimin Xiao

[ ExHall D ]

Abstract
Image-level Weakly Supervised Semantic Segmentation (WSSS) has garnered significant attention due to its low annotation costs. Current single-stage state-of-the-art WSSS methods mainly rely on ViT to extract features from input images, generating more complete segmentation results based on comprehensive semantic information. However, these ViT-based methods often suffer from over-smoothing issues in segmentation results. In this paper, we identify that attenuated high-frequency features mislead the decoder of ViT-based WSSS models, resulting in over-smoothed false segmentation. To address this, we propose a Frequency Feature Rectification (FFR) framework. Quantitative and qualitative experimental results demonstrate that our FFR framework can effectively address the over-smoothed segmentation issue caused by attenuated high-frequency features and achieve new state-of-the-art WSSS performances. Codes will be released.
Poster
Qingchen Tang · Lei Fan · Maurice Pagnucco · Yang Song

[ ExHall D ]

Abstract
Weakly supervised image segmentation with image-level labels has drawn attention due to the high cost of pixel-level annotations. Traditional methods using Class Activation Maps (CAMs) often highlight only the most discriminative regions, leading to incomplete masks. Recent approaches that introduce textual information struggle with histopathological images due to inter-class homogeneity and intra-class heterogeneity. In this paper, we propose a prototype-based image prompting framework for histopathological image segmentation. It constructs an image bank from the training set using clustering, extracting multiple prototype features per class to capture intra-class heterogeneity. By designing a matching loss between input features and class-specific prototypes using contrastive learning, our method addresses inter-class homogeneity and guides the model to generate more accurate CAMs. Experiments on four datasets (LUAD-HistoSeg, BCSS-WSSS, GCSS, and BCSS) show that our method outperforms existing weakly supervised segmentation approaches, setting new benchmarks in histopathological image segmentation.
Poster
Pinzhuo Tian · Shengjie Yang · Hang Yu · Alex C. Kot

[ ExHall D ]

Abstract
The slot attention-based method is widely used in unsupervised object-centric learning, which aims to decompose scenes into interpretable objects and associate them with slots. However, complex backgrounds in the real images can disrupt the model’s focus, leading it to excessively segment background stuff into different regions based on low-level information such as color or texture variations. Consequently, the elaborate segmentation of foreground objects will be neglected, which requires detailed shape or geometric information. To address this issue, we introduce a contrastive learning-based indicator designed to differentiate between foreground and background. Integrating this indicator into the slot attention-based method allows the model to focus more effectively on segmenting foreground objects and minimize background distractions. During the testing phase, we utilize a spectral clustering mechanism to refine the results and mitigate oversegmentation according to the similarity between the slots. Experimental results show that incorporating our method with various state-of-the-art models significantly improves their performance on both simulated data and real-world datasets. Furthermore, multiple sets of ablation experiments confirm the effectiveness of each proposed component. Our code will be made available.
Poster
Jianyang Zhang · Qianli Luo · Guowu Yang · Wenjing Yang · Weide Liu · Guosheng Lin · Fengmao Lv

[ ExHall D ]

Abstract
Language Bottleneck Models (LBMs) are proposed to achieve interpretable image recognition by classifying images based on textual concept bottlenecks. However, current LBMs simply list all concepts together as the bottleneck layer, leading to the spurious cue inference problem and an inability to generalize to unseen classes. To address these limitations, we propose the Attribute-formed Language Bottleneck Model (ALBM). ALBM organizes concepts in the attribute-formed class-specific space, where concepts are descriptions of specific attributes for specific classes. In this way, ALBM can avoid the spurious cue inference problem by classifying solely based on the essential concepts of each class. In addition, the cross-class unified attribute set also ensures that the concept spaces of different classes have strong correlations; as a result, the learned concept classifier can be easily generalized to unseen classes. Moreover, to further improve interpretability, we propose Visual Attribute Prompt Learning (VAPL) to extract visual features on fine-grained attributes. Furthermore, to avoid labor-intensive concept annotation, we propose the Description, Summary, and Supplement (DSS) strategy to automatically generate high-quality concept sets with complete and precise attributes. Extensive experiments on 8 widely used few-shot benchmarks demonstrate the interpretability, transferability, and performance of our approach. The code and collected concept set will be …
Poster
Peng Wu · Xiankai Lu · Hao Hu · Yongqin Xian · Jianbing Shen · Wenguan Wang

[ ExHall D ]

Abstract
Compositional zero-shot learning (CZSL) aims to recognize unseen attribute-object compositions by learning the primitive concepts (*i.e.*, attribute and object) from the training set. While recent works achieve impressive results in CZSL by leveraging large vision-language models like CLIP, they ignore the rich semantic relationships between primitive concepts and their compositions. In this work, we propose LOGICZSL, a novel logic-induced learning framework to explicitly model the semantic relationships. Our logic-induced learning framework formulates the relational knowledge constructed from large language models as a set of logic rules, and grounds them onto the training data. Our logic-induced losses are complementary to the widely used CZSL losses, therefore can be employed to inject the semantic information into any existing CZSL methods. Extensive experimental results show that our method brings significant performance improvements across diverse datasets (*i.e.*, CGQA, UT-Zappos50K, MIT-States) with strong CLIP-based methods and settings (*i.e.*, Close World, Open World). Codes will be publicly released.
Poster
Xiaokun Li · Yaping Huang · Qingji Guan

[ ExHall D ]

Abstract
Fine-grained open-set semi-supervised learning (OSSL) investigates a practical scenario where unlabeled data may contain fine-grained out-of-distribution (OOD) samples. Due to the subtle visual differences among in-distribution (ID) samples, as well as between ID and OOD samples, it is extremely challenging to separate ID and OOD samples. Recent Vision-Language Models, such as CLIP, have shown excellent generalization capabilities. However, they tend to focus on general attributes and are thus insufficient for distinguishing fine-grained details. To tackle these issues, in this paper, we propose a novel CLIP-driven coarse-to-fine semantic-guided framework, named CFSG-CLIP, by progressively filtering and focusing the distinctive fine-grained clues. Specifically, CFSG-CLIP comprises a coarse-guidance module and a fine-guidance module derived from the pre-trained CLIP model. In the coarse-guidance module, we design a semantic filtering strategy to initially filter out local visual features guided by cross-modality guidance. Then, in the fine-guidance module, we further design a visual-semantic injection strategy, which embeds category-related visual cues into the visual encoder to further refine the local visual features. By the designed dual-guidance framework, the local subtle cues are progressively discovered to distinguish the subtle differences between ID and OOD samples. Extensive experiments demonstrate that CFSG-CLIP is able to not only improve the reliability …
Poster
Wei Zhang · Baopeng Zhang · Zhu Teng · Wenxin Luo · Junnan Zou · Jianping Fan

[ ExHall D ]

Abstract
Generalized Category Discovery (GCD) typically relies on the pre-trained Vision Transformer (ViT) to extract features from a global receptive field, followed by contrastive learning to simultaneously classify unlabeled known classes and unknown classes without priors. Owing to the deficiency in the modeling capacity for inner-patch local information within ViT, current methods primarily focus on discriminative features at the global level. This results in a model with more yet scattered attention, where neither excessive nor insufficient focus can grasp subtle differences to classify fine-grained unknown and known categories. To address this issue, we propose AptGCD to deliver apt attention for GCD. It mimics how the human brain leverages visual perception to refine local attention and comprehend global context, by introducing a Meta Visual Prompt (MVP) and a Prompt Transformer (PT). MVP is introduced into GCD for the first time, refining channel-level attention, while adaptively self-learning unique inner-patch features as prompts to achieve local visual modeling for our prompt transformer. Yet, relying solely on detailed features can lead to skewed judgments. Hence, PT harmonizes local and global representations, guiding the model's interpretation of features through broader contexts, thereby capturing more useful details with less attention. Extensive experiments on seven datasets demonstrate that AptGCD …
Poster
Shan Zhang · Yao Ni · Jinhao Du · Yuan Xue · Philip H.S. Torr · Piotr Koniusz · Anton van den Hengel

[ ExHall D ]

Abstract
The challenge in open-world object detection, as in many few- and zero-shot learning problems, is to generalize beyond the class distribution of the training data. We thus propose a general class-agnostic objectness measure to reduce bias toward labeled samples. To prevent previously unseen objects from being filtered as background or misclassified as known categories by classifiers, we explicitly model the joint distribution of objectness and category labels using variational approximation. Without sufficient labeled data, however, minimizing the KL divergence between the estimated posterior and a static normal prior fails to converge. Theoretical analysis illuminates the root cause and motivates adopting a Gaussian prior with variance dynamically adapted to the estimated posterior as a surrogate. To further reduce misclassification, we introduce an energy-based margin loss that encourages unknown objects to move toward high-density regions of the distribution, thus reducing the uncertainty of unknown detections. We introduce an energy-based Open-World OBJectness modeling (OWOBJ) to boost novel object detection, especially in low-data settings. As a flexible plugin, OWOBJ outperforms baselines in Open-World, Few-Shot, and Zero-Shot Open-Vocabulary Object Detection. Code will be released upon acceptance.
Poster
Zhenya Tian · Jun Xiao · Liu lupeng · Haiyong Jiang

[ ExHall D ]

Abstract
This work tackles the challenge of 3D Class-Incremental Learning (CIL), where a model must learn to classify new 3D objects while retaining knowledge of previously learned classes. Existing methods often struggle with catastrophic forgetting, misclassifying old objects due to overreliance on shortcut local features. Our approach addresses this issue by learning a set of part concepts for part-aware features. Particularly, we only activate a small subset of part concepts for the feature representation of each part-aware feature. This facilitates better generalization across categories and mitigates catastrophic forgetting. We further improve the task-wise classification through a part relation-aware Transformer design. At last, we devise learnable affinities to fuse task-wise classification heads and avoid confusion among different tasks. We evaluate our method on three 3D CIL benchmarks, achieving state-of-the-art performance. (Code and data will be released)
Poster
Xiang Song · Yuhang He · Jingyuan Li · Qiang Wang · Yihong Gong

[ ExHall D ]

Abstract
In this paper, we focus on a challenging Incremental Object Detection (IOD) problem. Existing IOD methods follow an image-to-annotation alignment paradigm, which attempts to complete the annotations for old categories and subsequently learns both new and old categories in new tasks. This paradigm inherently introduces missing/redundant/inaccurate annotations of old categories, resulting in a suboptimal performance. Instead, we propose a novel annotation-to-instance alignment IOD paradigm and develop a corresponding method named Learning Endogenous Attention (LEA). Inspired by the human brain, LEA enables the model to focus on annotated task-specific objects, while ignoring irrelevant ones, thus solving the annotation incomplete problem in IOD. Concretely, our LEA consists of Endogenous Attention Modules (EAMs) and an Energy-based Task Modulator (ETM). During training, we add the dedicated EAMs for each new task and train them to focus on the new categories. During testing, ETM predicts task IDs using energy functions, directing the model to detect task-specific objects. The detection results corresponding to all task IDs are combined as the final output, thereby alleviating the catastrophic forgetting of old knowledge. Extensive experiments on COCO 2017 and Pascal VOC 2007 demonstrate the effectiveness of our method.
Poster
Weiqi Yan · Lvhai Chen · Huaijia Kou · Shengchuan Zhang · Yan Zhang · Liujuan Cao

[ ExHall D ]

Abstract
Unsupervised Camouflaged Object Detection (UCOD) has gained attention since it does not rely on extensive pixel-level labels. Existing UCOD methods typically generate pseudo-labels using fixed strategies and train 1×1 convolutional layers as a simple decoder, leading to low performance compared to fully-supervised methods. We emphasize two drawbacks in these approaches: 1) The model is prone to fitting incorrect knowledge due to the pseudo-label containing substantial noise. 2) The simple decoder fails to capture and learn the semantic features of camouflaged objects, especially for small-sized objects, due to the low-resolution pseudo-labels and severe confusion between foreground and background pixels. To this end, we propose a UCOD method with a teacher-student framework via Dynamic Pseudo-label Learning called UCOD-DPL, which contains an Adaptive Pseudo-label Module (APM), a Dual-Branch Adversarial (DBA) decoder, and a Look-Twice mechanism. The APM module adaptively combines pseudo-labels generated by fixed strategies and the teacher model to prevent the model from overfitting incorrect knowledge while preserving the ability for self-correction; the DBA decoder adopts adversarial learning over different segmentation objectives, guiding the model to overcome the foreground-background confusion of camouflaged objects; and the Look-Twice mechanism mimics the human tendency to zoom in on camouflaged objects and performs …
Poster
Jinghao Bian · Mingtao Feng · Weisheng Dong · Fangfang Wu · Jianqiao Luo · Yaonan Wang · Guangming Shi

[ ExHall D ]

Abstract
Tiny object detection remains challenging in spite of the success of generic detectors. The dramatic performance degradation of generic detectors on tiny objects is mainly due to the weak representations of extremely limited pixels. To address this issue, we propose a plug-and-play architecture to enhance the extinguished regions. For the first time, we identify the regions to be enhanced from the perspective of the pixel-wise amount of information. Specifically, we model the feature information of all image pixels by minimizing an Information Entropy loss, generating an information map to attentively highlight weakly activated regions in an unsupervised way. To effectively assist the above phase with more attention to tiny objects, we next introduce the Position Gaussian Distribution Map, explicitly modeled using a Gaussian Mixture distribution, where each Gaussian component's parameters depend on the position and size of object instance labels, serving as supervision for further feature enhancement. Taking the information map as prior knowledge guidance, we construct a multi-scale Position Gaussian Distribution Map prediction module, simultaneously modulating the information map and distribution map to focus on tiny objects during training. Extensive experiments on three public tiny object datasets demonstrate the superiority of our method over current state-of-the-art competitors. The code is available …
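As a rough illustration of a pixel-wise information measure, one can compute the entropy of the channel distribution at every spatial location of a feature map. This generic sketch only conveys the idea of an unsupervised information map; it is not the paper's exact entropy loss, and the normalization is an assumption.

```python
import torch
import torch.nn.functional as F

def information_map(feat):
    """feat: (B, C, H, W) backbone feature map. Treat the channel activations
    at each spatial location as a categorical distribution and compute its
    entropy, normalised to [0, 1]. Generic sketch of a pixel-wise information
    measure, not the paper's exact formulation."""
    p = F.softmax(feat, dim=1)                               # per-pixel channel distribution
    ent = -(p * torch.log(p.clamp_min(1e-8))).sum(dim=1)     # (B, H, W)
    return ent / torch.log(torch.tensor(float(feat.size(1))))
```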
Poster
Vishesh Kumar · Akshay Agarwal

[ ExHall D ]

Abstract
Deep Neural Networks (DNNs), the backbone architecture in almost every computer vision task, are vulnerable to adversarial attacks, particularly physical out-of-distribution (OOD) adversarial patches. Existing methods often struggle to interpret these attacks in ways that align with human visual perception. Our proposed AdvPatchXAI introduces a generalized, robust, and explainable defense algorithm specifically designed to defend DNNs against physical adversarial threats. AdvPatchXAI employs a novel patch decorrelation loss that reduces feature redundancy and enhances the distinctiveness of patch representations, enabling better generalization across unseen adversarial scenarios. It learns prototypical parts in a self-supervised fashion, enhancing interpretability and correlation with human vision. The model utilizes a sparse linear layer for classification, making the decision-making process globally interpretable through a set of learned prototypes and locally explainable by pinpointing relevant prototypes within an image. Our comprehensive evaluation shows that AdvPatchXAI not only closes the "semantic" gap between latent space and pixel space but also effectively handles unseen adversarial patches even perturbed with unseen corruptions, thereby significantly advancing DNN robustness in practical settings.
Poster
Zhen Qu · Xian Tao · Xinyi Gong · ShiChen Qu · Qiyu Chen · Zhengtao Zhang · Xingang Wang · Guiguang Ding

[ ExHall D ]

Abstract
Recently, vision-language models (e.g. CLIP) have demonstrated remarkable performance in zero-shot anomaly detection (ZSAD). By leveraging auxiliary data during training, these models can directly perform cross-category anomaly detection on target datasets, such as detecting defects on industrial product surfaces or identifying tumors in organ tissues. Existing approaches typically construct text prompts through either manual design or the optimization of learnable prompt vectors. However, these methods face several challenges: 1) Hand-crafted text prompts depend heavily on expert knowledge and require extensive trial and error; 2) The single-form learnable prompts are insufficient to capture the complex semantics of anomalies; and 3) The prompt space is poorly constrained, leading to suboptimal generalization performance on unseen categories. To address these issues, we propose Bayesian Prompt Flow Learning (Bayes-PFL), which models the prompt space as a learnable probability distribution from a Bayesian perspective. Specifically, a prompt flow module is designed to learn both image-specific and image-agnostic distributions, which are jointly utilized to regularize the text prompt space and enhance the model's generalization on unseen categories. These learned distributions are then sampled to generate diverse text prompts, effectively covering the prompt space. Additionally, a residual cross-attention (RCA) module is introduced to better align dynamic text embeddings with …
Poster
wenqiao Li · Yao Gu · Xintao Chen · Xiaohao Xu · Ming Hu · Xiaonan Huang · Yingna Wu

[ ExHall D ]

Abstract
Humans detect real-world object anomalies by perceiving, interacting, and reasoning based on object-conditioned physical knowledge. The long-term goal of Industrial Anomaly Detection (IAD) is to enable machines to autonomously replicate this skill. However, current IAD algorithms are largely developed and tested on static, semantically simple datasets, which diverge from real-world scenarios where physical understanding and reasoning are essential. To bridge this gap, we introduce the Physics Anomaly Detection (Phys-AD) dataset, the first large-scale, real-world, physics-grounded video dataset for industrial anomaly detection. Collected using a real robot arm and motor, Phys-AD provides a diverse set of dynamic, semantically rich scenarios. The dataset includes more than 6400 videos across 22 real-world object categories, interacting with robot arms and motors, and exhibits 47 types of anomalies. Anomaly detection in Phys-AD requires visual reasoning, combining both physical knowledge and video content to determine object abnormality. We benchmark state-of-the-art anomaly detection methods under three settings: unsupervised AD, weakly-supervised AD, and video-understanding AD, highlighting their limitations in handling physics-grounded anomalies. Additionally, we introduce the Physics Anomaly Explanation (PAEval) metric, designed to assess the ability of visual-language foundation models to not only detect anomalies but also provide accurate explanations for their underlying physical causes. Our dataset and benchmark will …
Poster
Ying Jin · Jinlong Peng · Qingdong He · Teng Hu · Jiafu Wu · Hao Chen · Haoxuan Wang · wenbing zhu · Mingmin Chi · Jun Liu · Yabiao Wang

[ ExHall D ]

Abstract
The performance of anomaly inspection in industrial manufacturing is constrained by the scarcity of anomaly data. To overcome this challenge, researchers have started employing anomaly generation approaches to augment the anomaly dataset. However, existing anomaly generation methods suffer from limited diversity in the generated anomalies and struggle to achieve a seamless blending of this anomaly with the original image. Moreover, the generated mask is usually not aligned with the generated anomaly. In this paper, we overcome these challenges from a new perspective, simultaneously generating a pair of the overall image and the corresponding anomaly part. We propose DualAnoDiff, a novel diffusion-based few-shot anomaly image generation model, which can generate diverse and realistic anomaly images by using a dual-interrelated diffusion model, where one of them is employed to generate the whole image while the other one generates the anomaly part. Moreover, we extract background and shape information to mitigate the distortion and blurriness phenomenon in few-shot image generation. Extensive experiments demonstrate the superiority of our proposed model over state-of-the-art methods in terms of diversity, realism and the accuracy of mask. Overall, our approach significantly improves the performance of downstream anomaly inspection tasks, including anomaly detection, anomaly localization, and anomaly classification tasks. …
Poster
Yusuke Matsui

[ ExHall D ]

Abstract
Approximate nearest neighbor search (ANNS) is an essential building block for applications like RAG but can sometimes yield results that are overly similar to each other. In certain scenarios, it is desirable for search results to be similar to the query and diverse among themselves. We propose LotusFilter, a post-processing module to diversify ANNS results. We precompute a cut-off table summarizing vectors that are close to each other. During filtering, LotusFilter greedily looks up the table to delete redundant vectors from the candidates. We demonstrate that the proposed filter runs fast (0.02 ms/query) in settings resembling real-world RAG applications, utilizing features such as OpenAI embeddings.
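The described pipeline, precomputing a cut-off table of mutually close database vectors and then greedily dropping candidates that fall inside the cut-off set of an already accepted result, can be prototyped in a few lines of NumPy. This brute-force sketch is for illustration only; the function names, squared-L2 radius, and data shapes are assumptions rather than the authors' implementation.

```python
import numpy as np

def build_cutoff_table(vectors, radius):
    """Precompute, for every database vector, the ids of other vectors lying
    within `radius` (squared L2). Brute force for clarity; a real system
    would use an index."""
    d2 = ((vectors[:, None, :] - vectors[None, :, :]) ** 2).sum(-1)
    return [set(np.flatnonzero(row <= radius)) - {i} for i, row in enumerate(d2)]

def lotus_filter(candidate_ids, cutoff_table, k):
    """Greedily keep ANN candidates (assumed sorted by distance to the query),
    skipping any candidate listed in the cut-off table of one already kept,
    so the final k results are mutually diverse."""
    kept, banned = [], set()
    for cid in candidate_ids:
        if cid in banned:
            continue
        kept.append(cid)
        banned |= cutoff_table[cid]
        if len(kept) == k:
            break
    return kept

# toy usage
db = np.random.rand(1000, 64).astype(np.float32)
table = build_cutoff_table(db, radius=0.5)
query = np.random.rand(64)
candidates = np.argsort(((db - query) ** 2).sum(-1))[:100]   # stand-in for an ANN search
print(lotus_filter(candidates, table, k=10))
```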
Poster
Haokun Chen · Hang Li · Yao Zhang · Jinhe Bi · Gengyuan Zhang · Yueqi Zhang · Philip H.S. Torr · Jindong Gu · Denis Krompaß · Volker Tresp

[ ExHall D ]

Abstract
One-Shot Federated Learning (OSFL), a special decentralized machine learning paradigm, has recently gained significant attention. OSFL requires only a single round of client data or model upload, which reduces communication costs and mitigates privacy threats compared to traditional FL. Despite these promising prospects, existing methods face challenges due to client data heterogeneity and limited data quantity when applied to real-world OSFL systems. Recently, Latent Diffusion Models (LDM) have shown remarkable advancements in synthesizing high-quality images through pretraining on large-scale datasets, thereby presenting a potential solution to overcome these issues. However, directly applying pretrained LDM to heterogeneous OSFL results in significant distribution shifts in synthetic data, leading to performance degradation in classification models trained on such data. This issue is particularly pronounced in rare domains, such as medical imaging, which are underrepresented in LDM's pretraining data. To address this challenge, we propose Federated Bi-Level Personalization (FedBiP), which personalizes the pretrained LDM at both instance-level and concept-level. Hereby, FedBiP synthesizes images following the client's local data distribution without compromising the privacy regulations. FedBiP is also the first approach to simultaneously address feature space heterogeneity and client data scarcity in OSFL. Our method is validated through extensive experiments on three OSFL benchmarks with …
Poster
Kai Wang · Zekai Li · Zhi-Qi Cheng · Samir Khaki · Ahmad Sajedi · Ramakrishna Vedantam · Konstantinos N. Plataniotis · Alexander G. Hauptmann · Yang You

[ ExHall D ]

Abstract
Dataset distillation has demonstrated strong performance on simple datasets like CIFAR, MNIST, and TinyImageNet but struggles to achieve similar results in more complex scenarios. In this paper, we propose a novel approach that emphasizes the discriminative features (obtained by Grad-CAM) for dataset distillation, called EDF. Our approach is inspired by a key observation: in simple datasets, high-activation areas typically occupy most of the image, whereas in complex scenarios, the size of these areas is much smaller. Unlike previous methods that treat all pixels equally when synthesizing images, EDF uses Grad-CAM activation maps to enhance high-activation areas. From a supervision perspective, we downplay supervision signals that have lower losses, as they contain common patterns. Additionally, to help the DD community better explore complex scenarios, we build the Complex Dataset Distillation (Comp-DD) benchmark by meticulously selecting sixteen subsets, eight easy and eight hard, from ImageNet-1K. Notably, EDF consistently outperforms SOTA results in complex scenarios, such as ImageNet-1K subsets. Hopefully, more researchers will be inspired and encouraged to enhance the practicality and efficacy of DD. Our code and benchmark will be made public.
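One plausible reading of "using Grad-CAM activation maps to enhance high-activation areas" is to weight a per-pixel matching term by a Grad-CAM map. The sketch below shows standard Grad-CAM plus such a weighted loss; EDF's real objective operates on distillation supervision signals rather than raw pixel differences, so treat this purely as an illustration with assumed names and shapes.

```python
import torch
import torch.nn.functional as F

def grad_cam_map(features, logits, class_idx):
    """Standard Grad-CAM: weight the last conv feature map by the spatially
    averaged gradient of the target class score, apply ReLU, and normalise.
    features: (B, C, h, w) activations that are part of the graph producing logits."""
    score = logits[torch.arange(logits.size(0)), class_idx].sum()
    grads = torch.autograd.grad(score, features, retain_graph=True)[0]
    weights = grads.mean(dim=(2, 3), keepdim=True)
    cam = F.relu((weights * features).sum(dim=1, keepdim=True))    # (B, 1, h, w)
    cam = F.interpolate(cam, size=(224, 224), mode="bilinear", align_corners=False)
    return cam / cam.amax(dim=(2, 3), keepdim=True).clamp_min(1e-8)

def activation_weighted_mse(syn_imgs, real_imgs, cam):
    """Illustrative only: up-weight high-activation (discriminative) regions
    when matching synthetic to real images."""
    return ((syn_imgs - real_imgs) ** 2 * (1.0 + cam)).mean()
```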
Poster
Xinhao Zhong · Hao Fang · Bin Chen · Xulin Gu · Meikang Qiu · Shuhan Qi · Shu-Tao Xia

[ ExHall D ]

Abstract
Dataset distillation is an emerging dataset reduction method, which condenses large-scale datasets while maintaining task accuracy. Current parameterization methods achieve enhanced performance under extremely high compression ratios by optimizing a determined synthetic dataset in an informative feature domain. However, they limit themselves to a fixed optimization space for distillation, neglecting the diverse guidance across different informative latent spaces. To overcome this limitation, we propose a novel parameterization method dubbed Hierarchical Parameterization Distillation (H-PD), to systematically explore hierarchical features within the provided feature space (e.g., layers within pre-trained generative adversarial networks). We verify the correctness of our insights by applying the hierarchical optimization strategy to the GAN-based parameterization method. In addition, we introduce a novel class-relevant feature distance metric to alleviate the computational burden associated with synthetic dataset evaluation, bridging the gap between synthetic and original datasets. Experimental results demonstrate that the proposed H-PD achieves a significant performance improvement under various settings with equivalent time consumption, and even surpasses current generative distillation using diffusion models under extreme compression ratios IPC=1 and IPC=10.
Poster
Weixiang Zhang · Shuzhao Xie · Chengwei Ren · Siyi Xie · Chen Tang · Shijia Ge · Mingzi Wang · Zhi Wang

[ ExHall D ]

Abstract
We propose EVOlutionary Selector (EVOS), an efficient training paradigm for accelerating Implicit Neural Representation (INR). Unlike conventional INR training that feeds all samples through the neural network in each iteration, our approach restricts training to strategically selected points, reducing computational overhead by eliminating redundant forward passes. Specifically, we treat each sample as an individual in an evolutionary process, where only the fittest ones survive and merit inclusion in training, adaptively evolving with the neural network dynamics. While this is conceptually similar to Evolutionary Algorithms, their distinct objectives (selection for acceleration vs. iterative solution optimization) require a fundamental redefinition of evolutionary mechanisms for our context. In response, we design sparse fitness evaluation, frequency-guided crossover, and augmented unbiased mutation to comprise EVOS. These components respectively guide sample selection with reduced computational cost, enhance performance through frequency-domain balance, and mitigate selection bias from cached evaluation. Extensive experiments demonstrate that our method achieves approximately 48%-66% reduction in training time while ensuring superior convergence without additional cost, establishing state-of-the-art acceleration among recent sampling-based strategies.
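The core selection step, evaluating fitness sparsely and training only on the fittest samples, can be sketched as below. The stride, keep ratio, and squared-error fitness are assumptions; frequency-guided crossover and augmented unbiased mutation are not shown.

```python
import torch

def select_fittest(coords, targets, model, keep_ratio=0.4, eval_stride=4):
    """Sparse fitness evaluation: score a strided subset of coordinates by the
    current reconstruction error and keep only the top fraction for the next
    training step. Minimal sketch of the selection idea only."""
    with torch.no_grad():
        sub = slice(0, coords.size(0), eval_stride)
        err = (model(coords[sub]) - targets[sub]).pow(2).mean(dim=-1)   # per-sample fitness
    k = max(1, int(err.numel() * keep_ratio))
    fittest = err.topk(k).indices * eval_stride          # map back to full-resolution indices
    return coords[fittest], targets[fittest]

# one training step (model / coords / targets / optimizer assumed to exist):
#   x, y = select_fittest(coords, targets, model)
#   loss = (model(x) - y).pow(2).mean(); loss.backward(); optimizer.step()
```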
Poster
Shizhen Zhao · Xin Wen · Jiahui Liu · Chuofan Ma · Chunfeng Yuan · Xiaojuan Qi

[ ExHall D ]

Abstract
Balancing training on long-tail data distributions remains a long-standing challenge in deep learning. While methods such as re-weighting and re-sampling help alleviate the imbalance issue, limited sample diversity continues to hinder models from learning robust and generalizable feature representations, particularly for tail classes. In contrast to existing methods, we offer a novel perspective on long-tail learning, inspired by an observation: datasets with finer granularity tend to be less affected by data imbalance. In this paper, we investigate this phenomenon through both quantitative and qualitative studies, showing that increased granularity enhances the generalization of learned features in tail categories. Motivated by these findings, we propose a method to increase dataset granularity through category extrapolation. Specifically, we introduce open-set fine-grained classes that are related to existing ones, aiming to enhance representation learning for both head and tail classes. To automate the curation of auxiliary data, we leverage large language models (LLMs) as knowledge bases to search for auxiliary categories and retrieve relevant images through web crawling. To prevent the overwhelming presence of auxiliary classes from disrupting training, we introduce a neighbor-silencing loss that encourages the model to focus on class discrimination within the target dataset. During inference, the classifier weights for auxiliary …
Poster
Anshul Nasery · Jonathan Hayase · Pang Wei Koh · Sewoong Oh

[ ExHall D ]

Abstract
The democratization of machine learning systems has made the process of fine-tuning accessible to practitioners, leading to a wide range of open-source models fine-tuned on specialized tasks and datasets. Recent work has proposed to merge such models to combine their functionalities. However, prior approaches are usually restricted to models that are fine-tuned from the same base model. Furthermore, the final merged model is typically required to be of the same size as the original models. In this work, we propose a new two-step algorithm to merge models, termed PLeaS, which relaxes these constraints. First, leveraging the Permutation symmetries inherent in the two models, PLeaS partially matches nodes in each layer by maximizing alignment. Next, PLeaS computes the weights of the merged model as a layer-wise Least Squares solution to minimize the approximation error between the features of the merged model and the permuted features of the original models. PLeaS allows a practitioner to merge two models sharing the same architecture into a single performant model of a desired size, even when the two original models are fine-tuned from different base models. We also demonstrate how our method can be extended to address a challenging scenario where no data is available from the fine-tuning …
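The second step of the described algorithm amounts to a per-layer least-squares fit. A toy sketch for a single linear layer is shown below, with random stand-in activations; the permutation-matching first step and the handling of nonlinearities are omitted, and all names are illustrative.

```python
import torch

def merge_layer_least_squares(x_merged, y_target):
    """Layer-wise least squares: solve W = argmin ||x_merged @ W.T - y_target||_F
    for one linear layer of the merged model. x_merged: (N, d_in) features entering
    the merged layer; y_target: (N, d_out), e.g. an average of the two originals'
    (permutation-aligned) outputs on the same inputs."""
    sol = torch.linalg.lstsq(x_merged, y_target).solution     # (d_in, d_out)
    return sol.t()                                            # nn.Linear weight convention

# toy usage with stand-in activations
x = torch.randn(1024, 256)
y = 0.5 * (torch.randn(1024, 128) + torch.randn(1024, 128))  # placeholder for aligned outputs
w_merged = merge_layer_least_squares(x, y)                    # (128, 256)
```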
Poster
Jiayi Guo · Zhao Junhao · Chaoqun Du · Yulin Wang · Chunjiang Ge · Zanlin Ni · Shiji Song · Humphrey Shi · Gao Huang

[ ExHall D ]

Abstract
Test-time adaptation (TTA) aims to improve the performance of source-domain pre-trained models on previously unseen, shifted target domains. Traditional TTA methods primarily adapt model weights based on target data streams, making model performance sensitive to the amount and order of target data. The recently proposed diffusion-driven TTA methods mitigate this by adapting model inputs instead of weights, where an unconditional diffusion model, trained on the source domain, transforms target-domain data into a synthetic domain that is expected to approximate the source domain. However, in this paper, we reveal that although the synthetic data in diffusion-driven TTA seems indistinguishable from the source data, it is unaligned with, or even markedly different from the latter for deep networks. To address this issue, we propose a Synthetic-Domain Alignment (SDA) framework. Our key insight is to fine-tune the source model with synthetic data to ensure better alignment. Specifically, we first employ a conditional diffusion model to generate labeled samples, creating a synthetic dataset. Subsequently, we use the aforementioned unconditional diffusion model to add noise to and denoise each sample before fine-tuning. This Mix of Diffusion (MoD) process mitigates the potential domain misalignment between the conditional and unconditional models. Extensive experiments across classifiers, segmenters, and …
Poster
Ke Ma · Jiaqi Tang · Bin Guo · Fan Dang · Sicong Liu · Zhui Zhu · Lei Wu · Cheng Fang · Ying-Cong Chen · Zhiwen Yu · Yunhao Liu

[ ExHall D ]

Abstract
Despite the growing integration of deep models into mobile and embedded terminals, the accuracy of these models often declines significantly during inference due to various deployment interferences. Test-time adaptation (TTA) has emerged as an effective strategy to improve the performance of deep models by adapting them to unlabeled target data online. Yet, the significant memory cost, particularly in memory-constrained IoT terminals, impedes the effective deployment of most backward-propagation-based TTA methods. To tackle memory constraints, we introduce SURGEON, a method that substantially reduces memory cost while preserving comparable accuracy improvements during fully test-time adaptation (FTTA) without relying on specific network architectures or modifications to the original training procedure. Specifically, we propose a novel dynamic activation sparsity strategy that directly prunes activations at layer-specific dynamic ratios, allowing for flexible control of learning ability and memory cost in a data-sensitive manner during adaptation. Within this strategy, two metrics, Gradient Importance and Layer Activation Memory, are considered to determine the layer-wise activation pruning ratios, reflecting accuracy contribution and memory efficiency, respectively. Experimentally, our method surpasses previous TTA baselines by not only reducing memory usage but also achieving superior accuracy, delivering SOTA performance across diverse datasets, network architectures, and tasks.
Poster
Qiang Zhang · Mengsheng Zhao · Jiawei Liu · Fanrui Zhang · Yongchao Xu · Zheng-Jun Zha

[ ExHall D ]

Abstract
Test-time adaptation using vision-language model (such as CLIP) to quickly adjust to distributional shifts of downstream tasks has shown great potential. Despite significant progress, existing methods are still limited to single-task test-time adaptation scenarios and have not effectively explored the issue of multi-task adaptation. To address this practical problem, we propose a novel Hierarchical Knowledge Prompt Tuning (HKPT) method, which achieves joint adaptation to multiple target domains by mining more comprehensive source domain discriminative knowledge and hierarchically modeling task-specific and task-shared knowledge. Specifically, HKPT constructs a CLIP prompt distillation framework that utilizes the broader source domain knowledge of large teacher CLIP to guide prompt tuning for lightweight student CLIP from multiple views during testing. Meanwhile, HKPT establishes task-specific dual dynamic knowledge graph to capture fine-grained contextual knowledge from continuous test data. And to fully exploit the complementarity among multiple target tasks, HKPT employs an adaptive task grouping strategy for achieving inter-task knowledge sharing. Furthermore, HKPT can seamlessly transfer to basic single-task test-time adaptation scenarios while maintaining robust performance. Extensive experimental results in both multi-task and single-task test-time adaptation settings demonstrate that our HKPT significantly outperforms state-of-the-art methods.
Poster
Jiangpeng He · Zhihao Duan · Fengqing Zhu

[ ExHall D ]

Abstract
Class-Incremental Learning (CIL) aims to learn new classes sequentially while retaining the knowledge of previously learned classes. Recently, pre-trained models (PTMs) combined with parameter-efficient fine-tuning (PEFT) have shown remarkable performance in rehearsal-free CIL without requiring exemplars from previous tasks. However, existing adapter-based methods, which incorporate lightweight learnable modules into PTMs for CIL, create new adapters for each new task, leading to both parameter redundancy and failure to leverage shared knowledge across tasks. In this work, we propose ContinuaL Low-Rank Adaptation (CL-LoRA), which introduces a novel dual-adapter architecture combining task-shared adapters to learn cross-task knowledge and task-specific adapters to capture the unique feature of each new task. Specifically, the shared adapters utilize random orthogonal matrices and leverage knowledge distillation with gradient reassignment to preserve essential shared knowledge. In addition, we introduce learnable block-wise weights for task-specific adapters, which mitigates inter-task interference while maintaining the model's plasticity. Through comprehensive experiments across multiple benchmark datasets, we demonstrate that CL-LoRA consistently outperforms state-of-the-art methods while using fewer trainable parameters, establishing a more efficient and scalable paradigm for continual learning with pre-trained models.
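A minimal sketch of a shared low-rank adapter whose down-projection is a frozen random orthogonal matrix is shown below; the rank, scaling, and module name are assumptions, and the task-specific adapters, block-wise weights, and distillation components of CL-LoRA are not modeled.

```python
import torch
import torch.nn as nn

class SharedLoRA(nn.Module):
    """Low-rank adapter whose down-projection A is a frozen random orthogonal
    matrix (orthonormal columns) shared across tasks; only B is trained.
    Illustrative sketch, not the authors' implementation."""
    def __init__(self, dim, rank=8, alpha=16.0):
        super().__init__()
        q, _ = torch.linalg.qr(torch.randn(dim, rank))    # (dim, rank), orthonormal columns
        self.register_buffer("A", q)                      # frozen
        self.B = nn.Parameter(torch.zeros(rank, dim))     # trainable, zero-initialised
        self.scale = alpha / rank

    def forward(self, x):                                 # x: (..., dim)
        return x + self.scale * (x @ self.A) @ self.B
```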
Poster
Jiashuo Li · Shaokun Wang · Bo Qian · Yuhang He · Xing Wei · Qiang Wang · Yihong Gong

[ ExHall D ]

Abstract
Non-exemplar Class-Incremental Learning (NECIL) enables models to continuously acquire new classes without retraining from scratch or storing old task exemplars, addressing privacy and storage issues. However, the absence of data from earlier tasks exacerbates the challenge of catastrophic forgetting in NECIL. In this paper, we propose a novel framework called Dynamic Integration of task-specific Adapters (DIA), which comprises two key components: Task-Specific Adapter Integration (TSAI) and Patch-Level Model Alignment. TSAI boosts compositionality through a patch-level adapter integration strategy, aggregating richer task-specific information while maintaining low computation costs. Patch-Level Model Alignment maintains feature consistency and accurate decision boundaries via two specialized mechanisms: Patch-Level Distillation Loss (PDL) and Patch-Level Feature Reconstruction method (PFR). Specifically, on the one hand, the PDL preserves feature-level consistency between successive models by implementing a distillation loss based on the contributions of patch tokens to new class learning. On the other hand, the PFR promotes classifier alignment by reconstructing old class features from previous tasks that adapt to new task knowledge, thereby preserving well-calibrated decision boundaries. Extensive experiments validate the effectiveness of our DIA, revealing significant improvements on NECIL benchmark datasets while maintaining an optimal balance between computational complexity and accuracy. The full code implementation will be made publicly available upon …
Poster
Yunlong Li · Xiabi Liu · Liyuan Pan · Yuchen Ren

[ ExHall D ]

Abstract
Optimization-based meta-learning methods for few-shot one-class classification (FS-OCC) aim to fine-tune a meta-trained model to classify the positive and negative samples using only a few positive samples by adaptation. However, recent approaches primarily focus on adjusting existing meta-learning algorithms for FS-OCC, while overlooking issues stemming from the misalignment between the cross-entropy loss and OCC tasks during adaptation. This misalignment, combined with the limited availability of one-class samples and the restricted diversity of task-specific adaptation, can significantly exacerbate the adverse effects of gradient instability on generalization. To address these challenges, we propose a novel Task-Specific Gradient Adaptation (TSGA) method for FS-OCC. Without extra supervision, TSGA learns to generate appropriate, stable gradients by leveraging label prediction and feature representation details of one-class samples and refines the adaptation process by recalibrating task-specific gradients and regularization terms. We evaluate TSGA on three challenging datasets and a real-world CNC Milling Machine application and demonstrate consistent improvements over baseline methods. Furthermore, we illustrate the critical impact of gradient instability and task-agnostic adaptation. Notably, TSGA achieves state-of-the-art results by effectively addressing these issues.
Poster
Peihua Deng · Jiehua Zhang · Xichun Sheng · Chenggang Yan · Yaoqi Sun · Ying Fu · Liang Li

[ ExHall D ]

Abstract
This paper explores the Class-Incremental Source-Free Unsupervised Domain Adaptation (CI-SFUDA) problem, where the unlabeled target data come incrementally without access to labeled source instances. This problem poses two challenges: the disturbance of similar source-class knowledge to target-class representation learning, and the disturbance of new target knowledge to old knowledge. To address them, we propose the Multi-Granularity Class Prototype Topology Distillation (GROTO) algorithm, which effectively transfers the source knowledge to the unlabeled class-incremental target domain. Concretely, we design a multi-granularity class prototype self-organization module and a prototype topology distillation module. Firstly, the positive classes are mined by modeling two accumulation distributions. Then, we generate reliable pseudo-labels by introducing multi-granularity class prototypes, and use them to promote positive-class target feature self-organization. Secondly, the positive-class prototypes are leveraged to construct the topological structures of source and target feature spaces. Then, we perform topology distillation to continually mitigate the interference of new target knowledge with old knowledge. Extensive experiments demonstrate that our proposed method achieves state-of-the-art performances on three public datasets.
Poster
Xiran Wang · Jian Zhang · Lei Qi · Yinghuan Shi

[ ExHall D ]

Abstract
Domain generalization is proposed to address distribution shift, arising from statistical disparities between training source and unseen target domains. The widely used first-order meta-learning algorithms demonstrate strong performance for domain generalization by leveraging the gradient matching theory, which aims to establish balanced parameters across source domains to reduce overfitting to any particular domain. However, our analysis reveals that there are actually numerous directions to achieve gradient matching, with current methods representing just one possible path. These methods overlook another critical factor: the balanced parameters should also be close to the centroid of the optimal parameters of each source domain. To address this, we propose a simple yet effective arithmetic meta-learning approach with arithmetic-weighted gradients. This approach, while adhering to the principles of gradient matching, promotes a more precise balance by estimating the centroid between domain-specific optimal parameters. Experimental results conducted on ten datasets validate the effectiveness of our strategy. Our code is available in the supplementary material.
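As a generic first-order illustration of steering parameters toward the centroid of domain-specific descent directions, one can average per-domain gradients arithmetically before a single update; this sketch does not reproduce the paper's exact arithmetic weighting, and all names are assumptions.

```python
import torch

def arithmetic_domain_step(model, domain_batches, loss_fn, lr=1e-3):
    """Compute one gradient per source domain and apply their arithmetic mean,
    nudging parameters toward the centroid of the domain-specific descent
    directions. Generic first-order sketch only."""
    summed = None
    for x, y in domain_batches:                     # one (inputs, labels) batch per domain
        model.zero_grad()
        loss_fn(model(x), y).backward()
        grads = [p.grad.detach().clone() if p.grad is not None else torch.zeros_like(p)
                 for p in model.parameters()]
        summed = grads if summed is None else [s + g for s, g in zip(summed, grads)]
    with torch.no_grad():
        for p, g in zip(model.parameters(), summed):
            p -= lr * g / len(domain_batches)
```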
Poster
Yushan Lai · Guowen Li · Haoyuan Liang · Juepeng Zheng · Zhiyu Ye

[ ExHall D ]

Abstract
Black-box Domain Adaptation (BDA) utilizes a black-box predictor of the source domain to label target domain data, addressing privacy concerns in Unsupervised Domain Adaptation (UDA). However, BDA assumes identical label sets across domains, which is unrealistic. To overcome this limitation, we propose a study on BDA with unknown classes in the target domain. It uses a black-box predictor to label target data and identify "unknown" categories, without requiring access to source domain data or predictor parameters, thus addressing both data privacy and category shift issues in traditional UDA. Existing methods face two main challenges: (i) Noisy pseudo-labels in knowledge distillation (KD) accumulate prediction errors, and (ii) relying on a preset threshold fails to adapt to varying category shifts. To address these, we propose ADU, a framework that allows the target domain to autonomously learn pseudo-labels guided by quality and use an adaptive threshold to identify "unknown" categories. Specifically, ADU consists of Selective Amplification Knowledge Distillation (SAKD) and Entropy-Driven Label Differentiation (EDLD). SAKD improves KD by focusing on high-quality pseudo-labels, mitigating the impact of noisy labels. EDLD categorizes pseudo-labels by quality and applies tailored training strategies to distinguish "unknown" categories, improving detection accuracy and adaptability. Extensive experiments show that ADU achieves …
Poster
Dongkwan Lee · Kyomin Hwang · Nojun Kwak

[ ExHall D ]

Abstract
We address the problem of semi-supervised domain generalization (SSDG), where the distributions of train and test data differ, and only a small amount of labeled data along with a larger amount of unlabeled data are available during training. Existing SSDG methods, which leverage only the unlabeled samples for which the model's predictions are highly confident (confident-unlabeled samples), limit the full utilization of the available unlabeled data. To the best of our knowledge, we are the first to explore a method for incorporating the unconfident-unlabeled samples that were previously disregarded in the SSDG setting. To this end, we propose UPCSC, which utilizes these unconfident-unlabeled samples in SSDG and consists of two modules: 1) an Unlabeled Proxy-based Contrastive learning (UPC) module, treating unconfident-unlabeled samples as additional negative pairs, and 2) a Surrogate Class learning (SC) module, generating positive pairs for unconfident-unlabeled samples using their confusing class set. These modules are plug-and-play, do not require any domain labels, and can be easily integrated into existing approaches. Experiments on four widely used SSDG benchmarks demonstrate that our approach consistently improves performance when attached to baselines and outperforms competing plug-and-play methods. We also analyze the role of our method in SSDG, showing that it enhances class-level discriminability …
Poster
Zhenghao Zhao · Haoxuan Wang · Yuzhang Shang · Kai Wang · Yan Yan

[ ExHall D ]

Abstract
Dataset distillation (DD) aims to synthesize a small information-rich dataset from a large dataset for efficient neural network training. However, existing dataset distillation methods struggle with long-tailed datasets, which are prevalent in real-world scenarios. By investigating the reasons behind this unexpected result, we identified two main causes: 1) The distillation process on imbalanced datasets develops biased gradients, leading to the synthesis of similarly imbalanced distilled datasets. 2) The experts trained on such datasets perform suboptimally on tail classes, resulting in misguided distillation supervision and poor-quality soft-label initialization. To address these issues, we first propose Distribution-agnostic Matching to avoid directly matching the biased expert trajectories. It reduces the distance between the student and the biased expert trajectories and prevents the tail class bias from being distilled to the synthetic dataset. Moreover, we improve the distillation guidance with Expert Decoupling, which jointly matches the decoupled backbone and classifier to improve the tail class performance and initialize reliable soft labels. This work pioneers the field of long-tailed dataset distillation (LTDD), marking the first effective effort to distill long-tailed datasets.
Poster
Changkun Ye · Russell Tsuchida · Lars Petersson · Nick Barnes

[ ExHall D ]

Abstract
Open set label shift (OSLS) occurs when label distributions change from a source to a target distribution, and the target distribution has an additional out-of-distribution (OOD) class. In this work, we build estimators for both source and target open set label distributions using a source domain in-distribution (ID) classifier and an ID/OOD classifier. With reasonable assumptions on the ID/OOD classifier, the estimators are assembled into a sequence of three stages: 1) an estimate of the source label distribution of the OOD class, 2) an EM algorithm for Maximum Likelihood estimates (MLE) of the target label distribution, and 3) an estimate of the target label distribution of the OOD class under relaxed assumptions on the OOD classifier. The sampling errors of the estimates in 1) and 3) are quantified with a concentration inequality. The estimation result allows us to correct the ID classifier trained on the source distribution to the target distribution without retraining. Experiments on a variety of open set label shift settings demonstrate the effectiveness of our model in both estimation error and classification accuracy.
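For readers unfamiliar with stage 2, the sketch below shows a standard EM procedure for the maximum-likelihood target label prior under label shift, using source-classifier posteriors on unlabeled target data. It is a generic illustration of that step, not the paper's full three-stage estimator, and the function name is an assumption.

```python
import numpy as np

def em_target_prior(posteriors, source_prior, n_iter=100, tol=1e-8):
    """Estimate the target label distribution q_t(y) under label shift,
    given source-classifier posteriors p_s(y|x) on unlabeled target data
    and the source prior q_s(y).

    posteriors   : (n, k) rows of p_s(y|x) for n target samples
    source_prior : (k,) q_s(y)
    """
    q_t = source_prior.copy()
    for _ in range(n_iter):
        # E-step: reweight posteriors by the current prior ratio, renormalize per sample
        w = q_t / source_prior
        adjusted = posteriors * w
        adjusted /= adjusted.sum(axis=1, keepdims=True)
        # M-step: the new prior is the average adjusted posterior
        q_new = adjusted.mean(axis=0)
        if np.max(np.abs(q_new - q_t)) < tol:
            return q_new
        q_t = q_new
    return q_t
```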
Poster
Yifeng Yang · Lin Zhu · Zewen Sun · Hengyu Liu · Qinying Gu · Nanyang Ye

[ ExHall D ]

Abstract
Out-of-distribution (OOD) detection remains challenging for deep learning models, particularly when test-time OOD samples differ significantly from training outliers. We propose OODD, a novel test-time OOD detection method that dynamically maintains and updates an OOD dictionary without fine-tuning. Our approach leverages a priority queue-based dictionary that accumulates representative OOD features during testing, combined with an informative inlier sampling strategy for in-distribution (ID) samples. To ensure stable performance during early testing, we propose a dual OOD stabilization mechanism that leverages strategically generated outliers derived from ID data. To the best of our knowledge, extensive experiments on the OpenOOD benchmark demonstrate that OODD significantly outperforms existing methods, achieving a 26.0% improvement in FPR95 on CIFAR-100 Far OOD detection compared to the state-of-the-art approach. Furthermore, we present an optimized variant of the KNN-based OOD detection framework that achieves a 3x speedup while maintaining detection performance.
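A minimal sketch of the two ingredients named above, a KNN-style OOD score and a fixed-capacity, priority-queue OOD dictionary. The eviction rule, capacity, and class names are assumptions for illustration, not the authors' exact design.

```python
import heapq
import numpy as np

def knn_ood_score(feat, id_bank, k=10):
    """Standard KNN OOD score: distance to the k-th nearest ID feature
    (a larger distance means more OOD-like)."""
    d = np.linalg.norm(id_bank - feat, axis=1)
    return np.partition(d, k - 1)[k - 1]

class OODDictionary:
    """Fixed-capacity store of OOD features kept as a min-heap keyed by OOD
    score, so the least OOD-like entry is evicted first.  A sketch of a
    priority-queue dictionary, not the paper's exact update rule."""
    def __init__(self, capacity=256):
        self.capacity = capacity
        self.heap = []      # entries are (score, counter, feature)
        self._count = 0     # unique tie-breaker so features are never compared

    def push(self, score, feature):
        item = (score, self._count, feature)
        self._count += 1
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, item)
        elif score > self.heap[0][0]:          # more OOD-like than the weakest entry
            heapq.heapreplace(self.heap, item)

    def features(self):
        return np.stack([f for _, _, f in self.heap]) if self.heap else None
```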
Poster
Yifei Zhang · Hao Zhu · Alysa Ziying Tan · Dianzhi Yu · Longtao Huang · Han Yu

[ ExHall D ]

Abstract
Federated learning (FL) has emerged as a promising paradigm for privacy-preserving collaborative machine learning. However, extending FL to class incremental learning settings introduces three key challenges: 1) spatial heterogeneity due to non-IID data distributions across clients, 2) temporal heterogeneity due to sequential arrival of tasks, and 3) resource heterogeneity due to diverse client capabilities. Existing approaches generally address these challenges in isolation, potentially leading to interference between updates, catastrophic forgetting, or excessive communication overhead. In this paper, we propose personalized Federated class-incremental parameter efficient fine-tuning with Mixture of Frequency aggregation (pFedMixF), a novel framework that simultaneously addresses all three heterogeneity challenges through frequency domain decomposition. Our key insight is that assigning orthogonal frequency components to different clients and tasks enables interference-free learning to be achieved with minimal communication costs. We further design an Auto-Task Agnostic Classifier that automatically routes samples to task-specific classifiers while adapting to heterogeneous class distributions. We conduct extensive experiments on three benchmark datasets, comparing our approach with eight state-of-the-art methods. The results demonstrate that pFedMixF achieves comparable test accuracy while requiring only 25% of the entire model parameters and incurring significantly lower communication costs than baseline methods.
Poster
Changlong Shi · He Zhao · Bingjie Zhang · Mingyuan Zhou · Dandan Guo · Yi Chang

[ ExHall D ]

Abstract
Federated Learning (FL) has emerged as a promising framework for distributed machine learning, enabling collaborative model training without sharing local data, thereby preserving privacy and enhancing security. However, data heterogeneity resulting from differences across user behaviors, preferences, and device characteristics poses a significant challenge for federated learning. Most previous works overlook the adjustment of aggregation weights, relying solely on dataset size for weight assignment, which often leads to unstable convergence and reduced model performance. Recently, several studies have sought to refine aggregation strategies by incorporating dataset characteristics and model alignment. However, adaptively adjusting aggregation weights while ensuring data security—without requiring additional proxy data—remains a significant challenge. In this work, we propose Federated learning with Adaptive Weight Aggregation (FedAWA), a novel method that adaptively adjusts aggregation weights based on client vectors during the learning process. The client vector captures the direction of model updates, reflecting local data variations, and is used to optimize the aggregation weight without requiring additional datasets or violating privacy. By assigning higher aggregation weights to local models whose updates align closely with the global optimization direction, FedAWA enhances the stability and generalization of the global model. Extensive experiments under diverse scenarios demonstrate the superiority of our method, …
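To make the aggregation idea concrete, the sketch below assigns larger weights to clients whose update vectors align with the average update direction. The cosine-plus-softmax weighting and function names are assumptions used for illustration, not necessarily FedAWA's exact rule.

```python
import numpy as np

def adaptive_aggregation_weights(client_updates, temperature=1.0):
    """Compute aggregation weights from the alignment between each client's
    update vector and the mean update direction (illustrative scheme).

    client_updates : list of flattened update vectors (w_local - w_global)
    """
    U = np.stack(client_updates)                       # (n_clients, d)
    mean_dir = U.mean(axis=0)
    mean_dir /= (np.linalg.norm(mean_dir) + 1e-12)
    cos = U @ mean_dir / (np.linalg.norm(U, axis=1) + 1e-12)
    w = np.exp(cos / temperature)                      # softmax over cosine similarities
    return w / w.sum()

def aggregate(client_weights_list, agg_w):
    """Weighted average of client model parameters."""
    stacked = np.stack(client_weights_list)            # (n_clients, d)
    return (agg_w[:, None] * stacked).sum(axis=0)
```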
Poster
Zhengyi Zhong · Weidong Bao · Ji Wang · Shuai Zhang · Jingxuan Zhou · Lingjuan Lyu · Wei Yang Bryan Lim

[ ExHall D ]

Abstract
Federated Learning is a promising paradigm for privacy-preserving collaborative model training. In practice, it is essential not only to continuously train the model to acquire new knowledge but also to guarantee old knowledge the right to be forgotten (i.e., federated unlearning), especially for privacy-sensitive information or harmful knowledge. However, current federated unlearning methods face several challenges, including indiscriminate unlearning of cross-client knowledge, irreversibility of unlearning, and significant unlearning costs. To this end, we propose a method named FUSED, which first identifies critical layers by analyzing each layer’s sensitivity to knowledge and constructs sparse unlearning adapters for sensitive ones. Then, the adapters are trained without altering the original parameters, overwriting the unlearning knowledge with the remaining knowledge. This knowledge overwriting process enables FUSED to mitigate the effects of indiscriminate unlearning. Moreover, the introduction of independent adapters makes unlearning reversible and significantly reduces the unlearning costs. Finally, extensive experiments on five datasets across three unlearning scenarios demonstrate that FUSED's effectiveness is comparable to Retraining, surpassing all other baselines while greatly reducing unlearning costs. The code is available at https://anonymous.4open.science/r/FUSED-4E8E.
Poster
Yongli Xiang · Ziming Hong · Lina Yao · Dadong Wang · Tongliang Liu

[ ExHall D ]

Abstract
Non-transferable learning (NTL) has been proposed to protect model intellectual property (IP) by creating a "non-transferable barrier" to restrict generalization from authorized to unauthorized domains. Recently, a well-designed attack, which restores unauthorized-domain performance by fine-tuning NTL models on a few authorized samples, has highlighted the security risks of NTL-based applications. However, such an attack requires modifying model weights and is thus invalid in the black-box scenario. This raises a critical question: can we trust the security of NTL models deployed as black-box systems? In this work, we reveal the first loophole of black-box NTL models by proposing a novel attack method (dubbed JailNTL) to jailbreak the non-transferable barrier through test-time data disguising. The main idea of JailNTL is to disguise unauthorized data so it can be identified as authorized by the NTL model, thereby bypassing the non-transferable barrier without modifying the NTL model weights. Specifically, JailNTL encourages unauthorized-domain disguising at two levels, including: (i) *data-intrinsic disguising (DID)* for eliminating domain discrepancy and preserving class-related content at the input level, and (ii) *model-guided disguising (MGD)* for mitigating output-level statistical differences of the NTL model. Empirically, when attacking state-of-the-art (SOTA) NTL models in the black-box scenario, JailNTL achieves an accuracy increase of up to 54.3% in …
Poster
Zhaoyu Zhang · Yang Hua · Guanxiong Sun · Hui Wang · Seán F. McLoone

[ ExHall D ]

Abstract
Data Efficient Generative Adversarial Networks (DE-GANs) have become more and more popular in recent years. Existing methods apply data augmentation, noise injection and pre-trained models to maximally increase the number of training samples, thus improving the training of DE-GANs. However, none of these methods considers sample quality during training, which can also significantly influence DE-GAN training. Focusing on sample quality during training, in this paper we are the first to incorporate discriminator rejection sampling (DRS) into the training process and introduce a novel method, called quality-aware dynamic discriminator rejection sampling (QADDRS). Specifically, QADDRS consists of two steps: (1) the sample quality aware step, which obtains the sorted critic scores, i.e., the ordered discriminator outputs, on real/fake samples in the current training stage; (2) the dynamic rejection step, which obtains a dynamic rejection number N, where N is controlled by the overfitting degree of the discriminator D during training. When updating the parameters of D, the N real samples with the highest critic scores and the N fake samples with the lowest critic scores in the minibatch are rejected dynamically based on the overfitting degree of D. As a result, QADDRS can avoid D becoming overly confident in distinguishing both real …
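The dynamic rejection step can be sketched as follows: sort the minibatch by critic score, then drop the N highest-scoring real samples and the N lowest-scoring fake samples, with N tied to the measured overfitting degree. The mapping from overfitting degree to N below is an assumption; the paper may define it differently.

```python
import numpy as np

def qaddrs_reject(real_scores, fake_scores, overfit_degree, max_reject_frac=0.25):
    """Return the minibatch indices kept for the discriminator update.

    real_scores, fake_scores : (B,) critic outputs for real and fake samples
    overfit_degree           : scalar in [0, 1] measuring how overfitted D is
    """
    B = len(real_scores)
    N = int(np.clip(overfit_degree, 0.0, 1.0) * max_reject_frac * B)
    keep_real = np.argsort(real_scores)[: B - N]   # drop the N highest-scoring reals
    keep_fake = np.argsort(fake_scores)[N:]        # drop the N lowest-scoring fakes
    return keep_real, keep_fake
```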
Poster
Ming Sun · Rui Wang · Zixuan Zhu · Lihua Jing · Yuanfang Guo

[ ExHall D ]

Abstract
High-quality open-source datasets are essential for advancing deep neural networks. However, the unauthorized commercial use of these datasets has raised significant concerns about copyright protection. One promising approach is backdoor watermark-based dataset ownership verification (BW-DOV), in which dataset protectors implant specific backdoors into illicit models through dataset watermarking, enabling these models to be traced through their abnormal prediction behaviors. Unfortunately, the targeted nature of these BW-DOV methods can be maliciously exploited, potentially leading to harmful side effects. While existing harmless methods attempt to mitigate these risks, watermarked datasets can still negatively affect prediction results, partially compromising dataset functionality. In this paper, we propose a more harmless backdoor watermark, called EntropyMark, which improves prediction confidence without altering the final prediction results. For this purpose, an entropy-based constraint is introduced to regulate the probability distribution. Specifically, we design an iterative clean-label dataset watermarking framework. Our framework employs gradient matching and adaptive data selection to optimize backdoor injection. In parallel, we introduce a hypothesis test method grounded in entropy inconsistency to verify dataset ownership. Extensive experiments on benchmark datasets demonstrate the effectiveness, transferability, and defense resistance of our approach.
Poster
Yong Xie · Weijie Zheng · Hanxun Huang · Guangnan Ye · Xingjun Ma

[ ExHall D ]

Abstract
As deep learning models are increasingly deployed in safety-critical applications, evaluating their vulnerabilities to adversarial perturbations is essential for ensuring their reliability and trustworthiness. Over the past decade, a large number of white-box adversarial robustness methods (i.e., attacks) have been proposed, ranging from single-step to multi-step methods and from individual to ensemble methods. Despite these advances, challenges remain in conducting meaningful and comprehensive robustness evaluations, particularly when it comes to large-scale testing and ensuring evaluations reflect real-world adversarial risks. In this work, we focus on image classification models and propose a novel individual attack method, Probability Margin Attack (PMA), which defines the adversarial margin in the probability space rather than the logits space. We analyze the relationship between PMA and existing cross-entropy or logits-margin-based attacks, showing that PMA outperforms the current state-of-the-art individual methods. Building on PMA, we propose two types of ensemble attacks that balance effectiveness and efficiency. Furthermore, we create a million-scale dataset, CC1M, derived from the existing CC3M dataset, and use it to conduct the first million-scale white-box adversarial robustness evaluation of adversarially-trained ImageNet models. Our findings provide valuable insights into the robustness gaps between individual versus ensemble attacks and small-scale versus million-scale evaluations.
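For concreteness, the sketch below computes the probability-space margin named above: the true-class probability minus the largest other-class probability. The full PMA attack, which perturbs inputs against such a margin, is not reproduced here, and the function name is an assumption.

```python
import numpy as np

def probability_margin(logits, labels):
    """Margin defined in probability space: p_y - max_{j != y} p_j.
    An attack succeeds on a sample once this margin is pushed below zero.

    logits : (n, k) model outputs; labels : (n,) ground-truth class indices
    """
    z = logits - logits.max(axis=1, keepdims=True)          # numerically stable softmax
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    n = len(labels)
    p_true = p[np.arange(n), labels]
    p_other = p.copy()
    p_other[np.arange(n), labels] = -np.inf                 # mask out the true class
    return p_true - p_other.max(axis=1)
```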
Poster
Jiani Ni · He Zhao · Jintong Gao · Dandan Guo · Hongyuan Zha

[ ExHall D ]

Abstract
In recent years, deep neural networks (DNNs) have demonstrated state-of-the-art performance across various domains. However, despite their success, they often face calibration issues, particularly in safety-critical applications such as autonomous driving and healthcare, where unreliable predictions can have serious consequences. Recent research has begun to improve model calibration from the perspective of the classifier, but how to design the classifier to address calibration remains insufficiently explored, and most existing methods ignore the calibration errors arising from underconfidence. In this work, we propose BalCAL, a novel method that balances a learnable classifier and an ETF classifier to address both the overconfidence and underconfidence problems in model calibration. By introducing a confidence-tunable module and a dynamic adjustment method, we ensure better alignment between model confidence and its true accuracy. Extensive experimental validation shows that our method significantly improves model calibration performance while maintaining high predictive accuracy, outperforming existing techniques. This provides a novel solution to the calibration challenges commonly encountered in deep learning.
Poster
Xu Yan · Jun Yin · Jie Wen

[ ExHall D ]

Abstract
In incomplete multi-view multi-label learning scenarios, it is crucial to use multi-view data with missing views to extract consistent and specific representations from different data sources and to fully utilize the incomplete label information. However, most previous approaches ignore the problem of separating view-shared information from view-specific information. To address this problem, in this paper we build an approach that can separate view-consistent features from view-specific features under the Variational Autoencoder (VAE) framework. Specifically, we first introduce cross-view reconstruction to learn view-consistent features, and extract the shared information in each view through unsupervised pre-training. Subsequently, we develop a disentangling module to learn the disentangled specific features by minimizing an upper bound on the mutual information between the consistent features and the specific features. Finally, we utilize a priori label relevance to guide the learning of label semantic embeddings, aggregating relevant semantic embeddings and maintaining the topology of label relevance in the semantic space. In extensive experiments, our model outperforms existing state-of-the-art algorithms on several real-world datasets, which fully validates its strong adaptability in the absence of views and labels.
Poster
Yuan Sun · Yongxiang Li · Zhenwen Ren · Guiduo Duan · Dezhong Peng · Peng Hu

[ ExHall D ]

Abstract
Multi-view clustering (MVC) aims to exploit complementary information from diverse views to enhance clustering performance. Since pseudo-labels can provide additional semantic information, many MVC methods have been proposed to guide unsupervised multi-view learning through pseudo-labels. These methods implicitly assume that the predicted pseudo-labels are predicted correctly. However, due to the challenges in training a flawless unsupervised model, this assumption can be easily violated, thereby leading to the Noisy Pseudo-label Problem (NPP). Moreover, these existing approaches typically rely on the assumption of perfect cross-view alignment. In practice, it is frequently compromised due to noise or sensor differences, thereby resulting in the Noisy Correspondence Problem (NCP). Based on the above observations, we reveal and study unsupervised multi-view learning under NPP and NCP. To this end, we propose Robust Noisy Pseudo-label Learning (ROLL) to prevent the overfitting problem caused by both NPP and NCP. Specifically, we first adopt traditional contrastive learning to warm up the model, thereby generating the pseudo-labels in a self-supervised manner. Afterward, we propose noise-tolerance pseudo-label learning to deal with the noise in the predicted pseudo-labels, thereby embracing the robustness against NPP. To further mitigate the overfitting problem, we present robust multi-view contrastive learning to mitigate the negative impact of …
Poster
Rittwika Kansabanik · Adrian Barbu

[ ExHall D ]

Abstract
Feature selection is crucial for pinpointing relevant features in high-dimensional datasets, mitigating the 'curse of dimensionality,' and enhancing machine learning performance. Traditional feature selection methods for classification use data from all classes to select features for each class. This paper explores feature selection methods that select features for each class separately, using class models based on low-rank generative methods and introducing a signal-to-noise ratio (SNR) feature selection criterion. This novel approach theoretically guarantees true feature recovery under certain assumptions and is shown to outperform some existing feature selection methods on standard classification datasets.
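A minimal sketch of a per-class signal-to-noise ranking in the spirit of the criterion above: features are scored for one class at a time by the mean difference between that class and the rest over their pooled spread. The exact SNR definition used in the paper may differ; the function name is an assumption.

```python
import numpy as np

def snr_feature_ranking(X, y, target_class):
    """Rank features for one class by |mean difference| / (pooled std).

    X : (n, d) data matrix, y : (n,) labels
    Returns feature indices sorted from most to least informative for target_class.
    """
    in_c = X[y == target_class]
    rest = X[y != target_class]
    eps = 1e-12
    snr = np.abs(in_c.mean(axis=0) - rest.mean(axis=0)) / (
        in_c.std(axis=0) + rest.std(axis=0) + eps)
    return np.argsort(snr)[::-1]
```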
Poster
Jiahua Rao · Hanjing Lin · Leyu Chen · Jiancong Xie · Shuangjia Zheng · Yuedong Yang

[ ExHall D ]

Abstract
Phenotypic drug discovery presents a promising strategy for identifying first-in-class drugs by bypassing the need for specific drug targets. Recent advances in cell-based phenotypic screening tools, including Cell Painting and the LINCS L1000, provide essential cellular data that capture biological responses to compounds. While the integration of the multi-modal data enhances the use of contrastive learning (CL) methods for molecular phenotypic representation, these approaches treat all negative pairs equally, failing to discriminate molecules with similar phenotypes. To address these challenges, we introduce a foundational framework MINER that dynamically estimates the likelihoods of sample pairs as negative pairs based on uni-modal disentangled representations. In addition, our approach incorporates a mixture fusion strategy to effectively integrate multimodal data, even in cases where certain modalities are missing. Extensive experiments demonstrate that our method enhances both molecular property prediction and molecule-phenotype retrieval accuracy. Moreover, it successfully recommends drug candidates from phenotype for complex diseases documented in the literature. These findings underscore MINER’s potential to advance drug discovery by enabling deeper insights into disease mechanisms and improving drug candidate recommendations.
Poster
Wanyi Chen · Zihua Zhao · Jiangchao Yao · Ya Zhang · Jiajun Bu · Haishuai Wang

[ ExHall D ]

Abstract
Recent advances in medical AI have shown a clear trend towards large models in healthcare. However, developing large models for multi-modal medical diagnosis remains challenging due to a lack of sufficient modal-complete medical data. Most existing multi-modal diagnostic models are relatively small and struggle with limited feature extraction capabilities. To bridge this gap, we propose **AdaCoMed**, an **ada**ptive **co**llaborative-learning framework that synergistically integrates the off-the-shelf **med**ical single-modal large models with multi-modal small models. Our framework first employs a mixture-of-modality-experts (MoME) architecture to combine features extracted from multiple single-modal medical large models, and then introduces a novel adaptive co-learning mechanism to collaborate with a multi-modal small model. This co-learning mechanism, guided by an adaptive weighting strategy, dynamically balances the complementary strengths between the MoME-fused large model features and the cross-modal reasoning capabilities of the small model. Extensive experiments on two representative multi-modal medical datasets (MIMIC-IV-MM and MMIST ccRCC) across six modalities and four diagnostic tasks demonstrate consistent improvements over state-of-the-art baselines, making it a promising solution for real-world medical diagnosis applications.
Poster
Yuan Tian · Kaiyuan Ji · Rongzhao Zhang · Yankai Jiang · Chunyi Li · Xiaosong Wang · Guangtao Zhai

[ ExHall D ]

Abstract
Medical image re-identification (MedReID) is under-explored so far, despite its critical applications in personalized healthcare and privacy protection. In this paper, we introduce a thorough benchmark and a unified model for this problem. First, to handle various medical modalities, we propose a novel Continuous Modality-based Parameter Adapter (ComPA). ComPA condenses medical content into a continuous modality representation and dynamically adjusts the modality-agnostic model with modality-specific parameters at runtime. This allows a single model to adaptively learn and process diverse modality data. Furthermore, we integrate medical priors into our model by aligning it with a bag of pre-trained medical foundation models, in terms of the differential features. Compared to single-image features, modeling the inter-image difference better fits the re-identification problem, which involves discriminating between multiple images. We evaluate the proposed model against 25 foundation models and 8 large multi-modal language models across 11 image datasets, demonstrating consistently superior performance. Additionally, we deploy the proposed MedReID technique to two real-world applications, i.e., history-augmented personalized diagnosis and medical privacy protection.
Poster
Alice Heiman · Xiaoman Zhang · Emma Chen · Sung Eun Kim · Pranav Rajpurkar

[ ExHall D ]

Abstract
Medical vision-language models often struggle with generating accurate quantitative measurements in radiology reports, leading to hallucinations that undermine clinical reliability. We introduce FactCheXcker, a modular framework that de-hallucinates radiology report measurements by leveraging an improved query-code-update paradigm. Specifically, FactCheXcker employs specialized modules and the code generation capabilities of large language models to solve measurement queries generated based on the original report. After extracting measurable findings, the results are incorporated into an updated report. We evaluate FactCheXcker on endotracheal tube placement, which accounts for an average of 78% of report measurements, using the MIMIC-CXR dataset and 11 medical report-generation models. Our results show that FactCheXcker significantly reduces hallucinations, improves measurement precision, and maintains the quality of the original reports. Specifically, FactCheXcker improves the performance of all 11 models and achieves an average improvement of 94.0% in reducing measurement hallucinations measured by mean absolute error.
Poster
Ta Duc Huy · Sen Kim Tran · Phan Nguyen · Nguyen Hoang Tran · Tran Bao Sam · Anton van den Hengel · Zhibin Liao · Johan Verjans · Minh-Son To · Vu Minh Hieu Phan

[ ExHall D ]

Abstract
The ability to interpret and intervene in model decisions is important for the adoption of computer-aided diagnosis methods in clinical workflows. Recent concept-based methods link the model predictions with interpretable concepts and modify their activation scores to interact with the model. However, these concepts are at the image level, which hinders the model from pinpointing the exact patches where the concepts are activated. Alternatively, prototype-based methods learn representations from training image patches and compare these with test image patches, using the similarity scores for final class prediction. However, interpreting the underlying concepts of these patches can be challenging and often necessitates post-hoc guesswork. To address this issue, this paper introduces the novel Concept-based Similarity Reasoning network (CSR), which offers (i) patch-level prototypes with intrinsic concept interpretation, and (ii) spatial interactivity. First, the proposed CSR provides localized explanations by grounding the prototypes of each concept on image regions. Second, our model introduces novel spatial-level interaction, allowing doctors to engage directly with specific image areas, making it an intuitive and transparent tool for medical imaging. CSR improves upon prior state-of-the-art interpretable methods by up to 4% across three biomedical datasets.
Poster
Tim Lenz · Peter Neidlinger · Marta Ligero · Georg Wölflein · Marko van Treeck · Jakob Nikolas Kather

[ ExHall D ]

Abstract
Representation learning of pathology whole-slide images (WSIs) has primarily relied on weak supervision with Multiple Instance Learning (MIL). This approach leads to slide representations highly tailored to a specific clinical task. Self-supervised learning (SSL) has been successfully applied to train histopathology foundation models (FMs) for patch embedding generation. However, generating patient or slide level embeddings remains challenging. Existing approaches for slide representation learning extend the principles of SSL from patch level learning to entire slides by aligning different augmentations of the slide or by utilizing multimodal data. By integrating tile embeddings from multiple FMs, we propose a new single modality SSL method in feature space that generates useful slide representations. Our contrastive pretraining strategy, called COBRA, employs multiple FMs and an architecture based on Mamba-2. COBRA exceeds performance of state-of-the-art slide encoders on four different public CPTAC cohorts on average by at least +3.8% AUC, despite only being pretrained on 3048 WSIs from TCGA. Additionally, COBRA is readily compatible at inference time with previously unseen feature extractors.
Poster
Jiuyang Dong · Junjun Jiang · Kui Jiang · Jiahan Li · Yongbing Zhang

[ ExHall D ]

Abstract
Although multi-instance learning (MIL) has succeeded in pathological image classification, it faces the challenge of high inference costs due to processing numerous patches from gigapixel whole slide images (WSIs). To address this, we propose HDMIL, a hierarchical distillation multi-instance learning framework that achieves fast and accurate classification by eliminating irrelevant patches. HDMIL consists of two key components: the dynamic multi-instance network (DMIN) and the lightweight instance pre-screening network (LIPN). DMIN operates on high-resolution WSIs, while LIPN operates on the corresponding low-resolution counterparts. During training, DMIN is trained for WSI classification while generating attention-score-based masks that indicate irrelevant patches. These masks then guide the training of LIPN to predict the relevance of each low-resolution patch. During testing, LIPN first determines the useful regions within low-resolution WSIs, which indirectly enables us to eliminate irrelevant regions in high-resolution WSIs, thereby reducing inference time without causing performance degradation. In addition, we further design the first Chebyshev-polynomial-based Kolmogorov-Arnold classifier in computational pathology, which enhances the performance of HDMIL through learnable activation layers. Extensive experiments on three public datasets demonstrate that HDMIL outperforms previous state-of-the-art methods, e.g., achieving an improvement of 3.13% in AUC while reducing inference time by 28.6% on the Camelyon16 dataset. The project will be available.
Poster
Junchao Zhu · Ruining Deng · Tianyuan Yao · Juming Xiong · Chongyu Qu · Junlin Guo · Siqi Lu · Mengmeng Yin · Yu Wang · Shilin Zhao · Haichun Yang · Yuankai Huo

[ ExHall D ]

Abstract
Spatial transcriptomics (ST) is an emerging technology that enables medical computer vision scientists to automatically interpret the molecular profiles underlying morphological features. Currently, however, most deep learning-based ST analyses are limited to two-dimensional (2D) sections, which can introduce diagnostic errors due to the heterogeneity of pathological tissues across 3D sections. Expanding ST to three-dimensional (3D) volumes is challenging due to the prohibitive costs; a 2D ST acquisition already costs over 50 times more than whole slide imaging (WSI), and a full 3D volume with 10 sections can be an order of magnitude more expensive. To reduce costs, scientists have attempted to predict ST data directly from WSI without performing actual ST acquisition. However, these methods typically yield unsatisfying results. To address this, we introduce a novel problem setting: 3D ST imputation using 3D WSI histology sections combined with a single 2D ST slide. To do so, we present the Anatomy-aware Spatial Imputation Graph Network (ASIGN) for more precise, yet affordable, 3D ST modeling. The ASIGN architecture extends existing 2D spatial relationships into 3D by leveraging cross-layer overlap and similarity-based expansion. Moreover, a multi-level spatial attention graph network integrates features comprehensively across different data sources. We evaluated ASIGN on three public …
Poster
Ming Hu · Jianfu Yin · Zhuangzhuang Ma · Jianheng Ma · Feiyu Zhu · Bingbing Wu · Ya Wen · Meng Wu · C Hu · Bingliang Hu · Quan Wang

[ ExHall D ]

Abstract
Co-training has achieved significant success in the field of semi-supervised learning; however, the *homogenization phenomenon*, which arises from multiple models tending towards similar decision boundaries, remains inadequately addressed. To tackle this issue, we propose a novel algorithm called **β-FFT** from the perspectives of data diversity and training structure. First, from the perspective of data diversity, we introduce a nonlinear interpolation method based on the **Fast Fourier Transform (FFT)**. This method generates more diverse training samples by swapping low-frequency components between pairs of images, thereby enhancing the model's generalization capability. Second, from the structural perspective, we propose a differentiated training strategy to alleviate the homogenization issue in co-training. In this strategy, we apply additional training with labeled data to one model in the co-training framework, while employing linear interpolation based on the **Beta (β)** distribution for the unlabeled data as a regularization term for the additional training. This approach enables us to effectively utilize the limited labeled data while simultaneously improving the model's performance on unlabeled data, ultimately enhancing the overall performance of the system. Experimental results demonstrate that **β-FFT** outperforms current state-of-the-art (SOTA) methods on three public medical image datasets.
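The FFT-based interpolation can be illustrated as a low-frequency swap between two images. The hard square cutoff, the radius value, and the function name below are illustrative choices, not the authors' exact implementation.

```python
import numpy as np

def swap_low_frequencies(img_a, img_b, radius=0.1):
    """Swap the low-frequency FFT components of two grayscale images.

    img_a, img_b : (H, W) float arrays of the same shape
    radius       : fraction of the spectrum treated as 'low frequency'
    """
    Fa = np.fft.fftshift(np.fft.fft2(img_a))
    Fb = np.fft.fftshift(np.fft.fft2(img_b))
    H, W = img_a.shape
    cy, cx = H // 2, W // 2
    ry, rx = int(radius * H), int(radius * W)
    mask = np.zeros((H, W), dtype=bool)
    mask[cy - ry:cy + ry + 1, cx - rx:cx + rx + 1] = True   # central low-frequency block
    Fa_new = np.where(mask, Fb, Fa)    # img_a receives img_b's low frequencies
    Fb_new = np.where(mask, Fa, Fb)    # and vice versa
    out_a = np.real(np.fft.ifft2(np.fft.ifftshift(Fa_new)))
    out_b = np.real(np.fft.ifft2(np.fft.ifftshift(Fb_new)))
    return out_a, out_b
```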
Poster
Maregu Assefa · Muzammal Naseer · IYYAKUTTI IYAPPAN GANAPATHI · Syed Sadaf Ali · Mohamed L Seghier · Naoufel Werghi

[ ExHall D ]

Abstract
Semi-supervised learning in medical image segmentation leverages unlabeled data to reduce annotation burdens through consistency learning. However, current methods struggle with class imbalance and high uncertainty from pathology variations, leading to inaccurate segmentation in 3D medical images. To address these challenges, we present DyCON, a Dynamic Uncertainty-aware Consistency and Contrastive Learning framework that enhances the generalization of consistency methods with two complementary losses: an Uncertainty-aware Consistency Loss (UnCL) and a Focal Entropy-aware Contrastive Loss (FeCL). UnCL enforces global consistency by dynamically weighting the contribution of each voxel to the consistency loss based on its uncertainty, preserving high-uncertainty regions instead of filtering them out. Initially, UnCL prioritizes learning from uncertain voxels with lower penalties, encouraging the model to explore challenging regions. As training progresses, the penalty shifts towards confident voxels to refine predictions and ensure global consistency. Meanwhile, FeCL enhances local feature discrimination in imbalanced regions by introducing dual focal mechanisms and adaptive confidence adjustments into the contrastive principle. These mechanisms jointly prioritize hard positives and negatives while focusing on uncertain sample pairs, effectively capturing subtle lesion variations under class imbalance. Extensive evaluations on four diverse medical image segmentation datasets (ISLES'22, BraTS'19, LA, Pancreas) show DyCON's superior performance against SOTA methods.
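A rough sketch of an uncertainty-weighted consistency term in the spirit of UnCL, using teacher entropy as the per-voxel uncertainty and a scalar schedule `beta` that shifts the penalty toward confident voxels as it grows. The exact weighting in the paper is not reproduced; the function name and the exponential form are assumptions.

```python
import numpy as np

def uncertainty_weighted_consistency(p_student, p_teacher, beta):
    """Voxel-wise consistency loss weighted by teacher-prediction uncertainty.

    p_student, p_teacher : (..., C) class-probability maps
    beta                 : scalar; larger values down-weight uncertain voxels more,
                           shifting the penalty toward confident voxels over training
    """
    eps = 1e-8
    mse = ((p_student - p_teacher) ** 2).sum(axis=-1)               # per-voxel consistency
    entropy = -(p_teacher * np.log(p_teacher + eps)).sum(axis=-1)   # per-voxel uncertainty
    weight = np.exp(-beta * entropy)                                # uncertainty-based weight
    return (weight * mse).mean()
```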
Poster
Saad Wazir · Daeyoung Kim

[ ExHall D ]

Abstract
Segmenting biomarkers in medical images is crucial for various biotech applications. Despite advances, Transformer- and CNN-based methods often struggle with variations in staining and morphology, limiting feature extraction. In medical image segmentation, where datasets often have limited sample availability, recent state-of-the-art (SOTA) methods achieve higher accuracy by leveraging pre-trained encoders, whereas end-to-end methods tend to underperform. This is due to challenges in effectively transferring rich multiscale features from encoders to decoders, as well as limitations in decoder efficiency. To address these issues, we propose an architecture that captures multi-scale local and global contextual information and a novel decoder design, which effectively integrates features from the encoder, emphasizes important channels and regions, and reconstructs spatial dimensions to enhance segmentation accuracy. Our method, compatible with various encoders, outperforms SOTA methods, as demonstrated by experiments on four datasets and ablation studies. Specifically, our method achieves absolute performance gains of 2.76% on MoNuSeg, 3.12% on DSB, 2.87% on Electron Microscopy, and 4.03% on TNBC datasets compared to existing SOTA methods. The necessary codes and checkpoints for reproduction will be released publicly later.
Poster
Maximilian Rokuss · Yannick Kirchhoff · Seval Akbal · Balint Kovacs · Saikat Roy · Constantin Ulrich · Tassilo Wald · Lukas T. Rotkopf · Heinz-Peter Schlemmer · Klaus Maier-Hein

[ ExHall D ]

Abstract
In this work, we present LesionLocator, a framework for zero-shot longitudinal lesion tracking and segmentation in 3D medical imaging, establishing the first end-to-end model capable of 4D tracking with dense spatial prompts. Our model leverages an extensive dataset of 23,262 annotated medical scans, as well as synthesized longitudinal data across diverse lesion types. The diversity and scale of our dataset significantly enhances model generalizability to real-world medical imaging challenges and addresses key limitations in longitudinal data availability. LesionLocator outperforms all existing promptable models in lesion segmentation by nearly 10 dice points, reaching human-level performance, and achieves state-of-the-art results in lesion tracking, with superior lesion retrieval and segmentation accuracy. LesionLocator not only sets a new benchmark in universal promptable lesion segmentation and automated longitudinal lesion tracking but also provides the first open-access solution of its kind, releasing our synthetic 4D dataset and model to the community, empowering future advancements in medical imaging. Code will be made available at: www.github.com/anonymous
Poster
Junjie Zhou · Shouju Wang · Yuxia Tang · Qi Zhu · Daoqiang Zhang · WEI SHAO

[ ExHall D ]

Abstract
The prediction of the distribution of nanoparticles (NPs) is crucial for the diagnosis and treatment of tumors. Recent studies indicate that the heterogeneity of the tumor microenvironment (TME) strongly affects the distribution of NPs across tumors. Hence, it has become a research hotspot to generate the NP distribution with the aid of multi-modal TME components. However, the distribution divergence among multi-modal TME components may cause side effects, i.e., the best uni-modal model may outperform the joint generative model. To address the above issues, we propose a Divergence-Aware Multi-Modal Diffusion model (DAMM-Diffusion) to adaptively generate prediction results from uni-modal and multi-modal branches in a unified network. In detail, the uni-modal branch is composed of the U-Net architecture, while the multi-modal branch extends it by introducing two novel fusion modules, i.e., the Multi-Modal Fusion Module (MMFM) and the Uncertainty-Aware Fusion Module (UAFM). Specifically, the MMFM is proposed to fuse features from multiple modalities, while the UAFM module is introduced to learn the uncertainty map for cross-attention computation. Following the individual prediction results from each branch, the Divergence-Aware Multi-Modal Predictor (DAMMP) module is proposed to assess the consistency of multi-modal data with the uncertainty map, which determines whether the final prediction results come from multi-modal or …
Poster
Ziwei Zhao · Zhixing Zhang · Yuhang Liu · Zhao Zhang · Haojun Yu · Dong Wang · Liwei Wang

[ ExHall D ]

Abstract
In the field of 3D medical imaging, accurately extracting and representing blood vessels with curvilinear structures holds paramount importance for clinical diagnosis. Previous methods have commonly relied on discrete representations such as masks, often resulting in local fractures or scattered fragments due to the inherent limitations of the per-pixel classification paradigm. In this work, we introduce DeformCL, a new continuous representation based on Deformable Centerlines, where centerline points act as nodes connected by edges that capture spatial relationships. Compared with previous representations, DeformCL offers three key advantages: natural connectivity, noise robustness, and interaction facility. We present a comprehensive training pipeline structured in a cascaded manner to fully exploit these favorable properties of DeformCL. Extensive experiments on four 3D vessel segmentation datasets demonstrate the effectiveness and superiority of our method. Furthermore, the visualization of curved planar reformation images validates the clinical significance of the proposed framework.
Poster
S. Mazdak Abulnaga · Andrew Hoopes · Neel Dey · Malte Hoffmann · Bruce Fischl · John Guttag · Adrian V. Dalca

[ ExHall D ]

Abstract
We present a method for constructing anatomical atlases on the fly. An atlas is an image that represents the prototypical structure of a collection of images. Among other uses, atlases play a key role in studies of anatomical variability across populations. Existing atlas construction methods are computationally prohibitive, requiring days to weeks of computation. Consequently, many scientific studies are forced to use suboptimal atlases constructed for different population groups, negatively impacting downstream analyses. In this work, we present MultiMorph, a model that rapidly produces 3D anatomical atlases for any set of brain MRI images. MultiMorph enables medical researchers with no machine learning background to rapidly construct high-quality population-specific atlases in a single forward network pass, without requiring any fine tuning or optimization. MultiMorph is based on a linear group-interaction layer that aggregates and shares features within the group of input images. We demonstrate that MultiMorph outperforms state-of-the-art optimization-based and machine-learning-based atlas construction methods in both small and large population settings. It generates better atlases with a 100-fold reduction in computational time. Further, we demonstrate generalization to new imaging modalities and population groups at test-time.
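The group-interaction idea can be sketched as a linear layer that concatenates each image's features with the group mean before projecting, so information is shared across the input group. This is an assumption-level illustration, not the MultiMorph layer itself; the class name and initialization are invented for the sketch.

```python
import numpy as np

class GroupInteractionLayer:
    """Linear layer that shares information within a group of inputs by
    concatenating each item's features with the group mean before a linear map."""
    def __init__(self, d_in, d_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=1.0 / np.sqrt(2 * d_in), size=(2 * d_in, d_out))
        self.b = np.zeros(d_out)

    def __call__(self, x):
        # x: (n_images, d_in) features for one group of input images
        group_mean = x.mean(axis=0, keepdims=True)                     # shared group context
        x_aug = np.concatenate([x, np.broadcast_to(group_mean, x.shape)], axis=1)
        return x_aug @ self.W + self.b
```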
Poster
Yejee Shin · Yeeun Lee · Hanbyol Jang · Geonhui Son · Hyeongyu Kim · Dosik Hwang

[ ExHall D ]

Abstract
Multi-contrast magnetic resonance (MR) images offer critical diagnostic information but are limited by long scan times and high cost. While diffusion models (DMs) excel in medical image synthesis, they often struggle to maintain anatomical consistency and utilize the diverse characteristics of multi-contrast MR images effectively. We propose APT, a unified diffusion model designed to generate accurate and anatomically consistent multi-contrast MR images. APT introduces a mutual information fusion module and an anatomical consistency loss to preserve critical anatomical structures across multiple contrast inputs. To enhance synthesis, APT incorporates a two-stage inference process: in the first stage, a prior codebook provides coarse anatomical structures by selecting appropriate guidance based on precomputed similarity mappings and Bézier curve transformations. The second stage applies iterative unrolling with weighted averaging to refine the initial output, enhancing fine anatomical details and ensuring structural consistency. This approach enables the preservation of both global structures and local details, resulting in realistic and diagnostically valuable synthesized images. Extensive experiments on public multi-contrast MR brain images demonstrate that our approach significantly outperforms state-of-the-art methods.
Poster
Thomas Walker · Salvatore Esposito · Daniel Rebain · Amir Vaxman · Arno Onken · Changjian Li · Oisin Mac Aodha

[ ExHall D ]

Abstract
Reconstructing complex structures from planar cross-sections is a challenging problem, with wide-reaching applications in medical imaging, manufacturing, and topography. Out-of-the-box point cloud reconstruction methods can often fail due to the data sparsity between slicing planes, while current bespoke methods struggle to reconstruct thin geometric structures and preserve topological continuity. This is important for medical applications where thin vessel structures are present in CT and MRI scans. This paper introduces CrossSDF, a novel approach for extracting a 3D signed distance field from 2D signed distances generated from planar contours. Our approach makes the training of neural SDFs contour-aware by using losses designed for the case where geometry is known within 2D slices. Our results demonstrate a significant improvement over existing methods, effectively reconstructing thin structures and producing accurate 3D models without the interpolation artifacts or over-smoothing of prior approaches.