Poster
Zhedong Zhang · Liang Li · Chenggang Yan · Chunshan Liu · Anton van den Hengel · Yuankai Qi
[ ExHall D ]
Abstract
Movie dubbing describes the process of transforming a script into speech that aligns temporally and emotionally with a given movie clip while exemplifying the speaker’s voice demonstrated in a short reference audio clip. This task demands that the model bridge character performances and complicated prosody structures to build a high-quality video-synchronized dubbing track. The limited scale of movie dubbing datasets, along with the background noise inherent in audio data, hinders the acoustic modeling performance of trained models. To address these issues, we propose an acoustic-prosody disentangled two-stage method to achieve high-quality dubbing generation with precise prosody alignment. First, we propose prosody-enhanced acoustic pre-training to develop robust acoustic modeling capabilities. Then, we freeze the pre-trained acoustic system and design a disentangled framework to model prosodic text features and dubbing style while maintaining acoustic quality. Additionally, we incorporate an in-domain emotion analysis module to reduce the impact of visual domain shifts across different movies, thereby enhancing emotion-prosody alignment. Extensive experiments show that our method performs favorably against state-of-the-art models on two primary benchmarks. The demos are available at https://A2P-Dubber.github.io/.
Poster
Hao Li · Ju Dai · Xin Zhao · Feng Zhou · Junjun Pan · Lei Li
[ ExHall D ]
Abstract
In 3D speech-driven facial animation generation, existing methods commonly employ pre-trained self-supervised audio models as encoders. However, due to the prevalence of phonetically similar syllables with distinct lip shapes in language, these near-homophone syllables tend to exhibit significant coupling in self-supervised audio feature spaces, leading to an averaging effect in subsequent lip motion generation. To address this issue, this paper proposes a plug-and-play semantic decorrelation module, Wav2Sem. This module extracts semantic features corresponding to the entire audio sequence, leveraging the added semantic information to decorrelate audio encodings within the feature space, thereby achieving more expressive audio features. Extensive experiments across multiple speech-driven models indicate that the Wav2Sem module effectively decouples audio features, significantly alleviating the averaging effect of phonetically similar syllables in lip shape generation, thereby enhancing the precision and naturalness of facial animations. The source code will be publicly available.
Poster
Xiaozhong Ji · Xiaobin Hu · Zhihong Xu · Junwei Zhu · Chuming Lin · Qingdong He · Jiangning Zhang · Donghao Luo · Yi Chen · Qin Lin · qinglin lu · Chengjie Wang
[ ExHall D ]
Abstract
The study of talking face generation mainly explores the intricacies of synchronizing facial movements and crafting visually appealing, temporally coherent animations. However, due to the limited exploration of global audio perception, current approaches predominantly employ auxiliary visual and spatial knowledge to stabilize the movements, which often results in deteriorated naturalness and temporal inconsistencies. Considering the essence of audio-driven animation, the audio signal serves as the ideal and unique prior for adjusting facial expressions and lip movements, without resorting to any visual signals. Based on this motivation, we propose a novel paradigm, dubbed Sonic, that shifts the focus to the exploration of global audio perception. To effectively leverage global audio knowledge, we disentangle it into intra- and inter-clip audio perception and combine both aspects to enhance overall perception. For intra-clip audio perception: 1) Context-enhanced audio learning, in which long-range intra-clip temporal audio knowledge is extracted to provide facial expression and lip motion priors implicitly expressed through the tone and speed of speech. 2) Motion-decoupled Controller, in which the motion of the head and expression movement are disentangled and independently controlled by intra-audio clips. Most importantly, for inter-clip audio perception, as a bridge to connect the intra-clips …
Poster
Xuanchen Li · Jianyu Wang · Yuhao Cheng · Yikun Zeng · Xingyu Ren · Wenhan Zhu · Weiming Zhao · Yichao Yan
[ ExHall D ]
Abstract
Significant progress has been made in speech-driven 3D face animation, but most works focus on learning the motion of mesh/geometry, ignoring the impact of dynamic texture. In this work, we reveal that dynamic texture plays a key role in rendering high-fidelity talking avatars, and introduce a high-resolution 4D dataset, TexTalk4D, consisting of 100 minutes of audio-synced scan-level meshes with detailed 8K dynamic textures from 100 subjects. Based on the dataset, we explore the inherent correlation between motion and texture, and propose a diffusion-based framework, TexTalker, to simultaneously generate facial motions and dynamic textures from speech. Furthermore, we propose a novel pivot-based style injection strategy to capture the complexity of different texture and motion styles, which allows disentangled control. TexTalker, as the first method to generate audio-synced facial motion with dynamic texture, not only outperforms the prior art in synthesising facial motions, but also produces realistic textures that are consistent with the underlying facial movements.
Poster
Tim Büchner · Christoph Anders · Orlando Guntinas-Lichius · Joachim Denzler
[ ExHall D ]
Abstract
The relationship between muscle activity and resulting facial expressions is crucial for various fields, including psychology, medicine, and entertainment. The synchronous recording of facial mimicry and muscular activity via surface electromyography (sEMG) provides a unique window into these complex dynamics. Unfortunately, existing methods for facial analysis cannot handle electrode occlusion, rendering them ineffective. Even with occlusion-free reference images of the same person, variations in expression intensity and execution are unmatchable. Our electromyography-informed facial expression reconstruction (EIFER) approach is a novel method to restore faces under sEMG occlusion faithfully in an adversarial manner. We decouple facial geometry and visual appearance (e.g., skin texture, lighting, electrodes) by combining a 3D Morphable Model (3DMM) with neural unpaired image-to-image translation via reference recordings. Then, EIFER learns a bidirectional mapping between 3DMM expression parameters and muscle activity, establishing correspondence between the two domains. We validate the effectiveness of our approach through experiments on a dataset of synchronized sEMG recordings and facial mimicry, demonstrating faithful geometry and appearance reconstruction. Further, we show how expressions can be synthesized from muscle activity and how observed expressions can predict dynamic muscle activity. Consequently, EIFER introduces a new paradigm for facial electromyography, which could be extended to other forms of multi-modal face recordings.
Poster
Mingtao Guo · Guanyu Xing · Yanli Liu
[ ExHall D ]
Abstract
Relightable portrait animation aims to animate a static reference portrait to match the head movements and expressions of a driving video while adapting to user-specified or reference lighting conditions. Existing portrait animation methods fail to achieve relightable portraits because they do not separate and manipulate intrinsic (identity and appearance) and extrinsic (pose and lighting) features. In this paper, we present a Lighting Controllable Video Diffusion model (LCVD) for high-fidelity, relightable portrait animation. We address this limitation by distinguishing these feature types through dedicated subspaces within the feature space of a pre-trained image-to-video diffusion model. Specifically, we employ the 3D mesh, pose, and lighting-rendered shading hints of the portrait to represent the extrinsic attributes, while the reference represents the intrinsic attributes. In the training phase, we employ a reference adapter to map the reference into the intrinsic feature subspace and a shading adapter to map the shading hints into the extrinsic feature subspace. By merging features from these subspaces, the model achieves nuanced control over lighting, pose, and expression in generated animations. Extensive evaluations show that LCVD outperforms state-of-the-art methods in lighting realism, image quality, and video consistency, setting a new benchmark in relightable portrait animation.
Poster
Tuur Stuyck · Gene Wei-Chin Lin · Egor Larionov · Hsiaoyu Chen · Aljaž Božič · Nikolaos Sarafianos · Doug Roble
[ ExHall D ]
Abstract
Realistic hair motion is crucial for high-quality avatars, but it is often limited by the computational resources available for real-time applications. To address this challenge, we propose a novel neural approach to predict physically plausible hair deformations that generalizes to various body poses, shapes, and hair styles. Our model is trained using a self-supervised loss, eliminating the need for expensive data generation and storage. We demonstrate our method's effectiveness through numerous results across a wide range of pose and shape variations, showcasing its robust generalization capabilities and temporally smooth results. Our approach is highly suitable for real-time applications, with an inference time of only a few milliseconds on consumer hardware and the ability to scale to predicting 1,000 grooms in 0.3 seconds. Code will be released.
Poster
Weiqi Feng · Dong Han · Zekang Zhou · Shunkai Li · Xiaoqiang Liu · Pengfei Wan · Di ZHANG · Miao Wang
[ ExHall D ]
Abstract
Existing radiance field-based head avatar methods have mostly relied on pre-computed explicit priors (e.g., meshes, points) or neural implicit representations, making it challenging to achieve high fidelity with both computational efficiency and low memory consumption. To overcome this, we present GPAvatar, a novel and efficient Gaussian splatting-based method for reconstructing high-fidelity dynamic 3D head avatars from monocular videos. We extend Gaussians in 3D space to a high-dimensional embedding space encompassing each Gaussian's spatial position and the avatar's expression, enabling the representation of the head avatar under arbitrary pose and expression. To enable splatting-based rasterization, a linear transformation is learned to project each high-dimensional Gaussian back to 3D space, which is sufficient to capture expression variations without resorting to complex neural networks. Furthermore, we propose an adaptive densification strategy that dynamically allocates Gaussians to regions with high expression variance, improving the facial detail representation. Experimental results on three datasets show that our method outperforms existing state-of-the-art methods in rendering quality and speed while reducing memory usage in training and rendering.
Poster
Hongrui Cai · Yuting Xiao · Xuan Wang · Jiafei Li · Yudong Guo · Yanbo Fan · Shenghua Gao · Juyong Zhang
[ ExHall D ]
Abstract
We introduce a novel approach to creating ultra-realistic head avatars and rendering them in real time (≥30 fps at 2048×1334 resolution). First, we propose a hybrid explicit representation that combines the advantages of two efficient primitive-based rendering techniques. A UV-mapped 3D mesh is utilized to capture sharp and rich textures on smooth surfaces, while 3D Gaussian Splatting is employed to represent complex geometric structures. In the avatar modeling pipeline, after tracking parametric models based on captured multi-view RGB videos, our goal is to simultaneously optimize the texture and opacity maps of the mesh, as well as a set of 3D Gaussian splats localized and rigged onto the mesh facets. Specifically, we perform α-blending on the color and opacity values based on the merged and re-ordered z-buffer from the rasterization results of the mesh and 3DGS. This process involves the mesh and 3DGS adaptively fitting the captured visual information to outline a high-fidelity digital avatar. To avoid artifacts caused by Gaussian splats crossing the mesh facets, we design a stable hybrid depth sorting strategy. Experiments illustrate that our modeled results exceed those of state-of-the-art approaches.
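A minimal per-pixel sketch of the general idea of α-blending over a merged and re-ordered z-buffer: fragments from the mesh rasterizer and the 3DGS rasterizer are merged, sorted by depth, and composited front to back. The fragment tuples, opacity values, and early-termination threshold below are illustrative assumptions, not the paper's actual rasterizer.

```python
import numpy as np

def composite_merged_fragments(frags):
    """Front-to-back alpha compositing of per-pixel fragments.

    `frags` is a list of (depth, color, opacity) tuples for one pixel,
    coming from both the mesh rasterizer and the 3DGS rasterizer.
    """
    # Merge and re-order the combined z-buffer by depth (near to far).
    frags = sorted(frags, key=lambda f: f[0])
    color = np.zeros(3)
    transmittance = 1.0
    for _, c, alpha in frags:
        color += transmittance * alpha * np.asarray(c, dtype=float)
        transmittance *= (1.0 - alpha)
        if transmittance < 1e-4:   # early termination once nearly opaque
            break
    return color, 1.0 - transmittance

# Toy example: one mesh fragment and two Gaussian fragments at one pixel.
mesh_frag = (2.0, [0.8, 0.2, 0.2], 0.9)          # (depth, rgb, opacity)
gauss_frags = [(1.5, [0.1, 0.1, 0.9], 0.4),
               (2.5, [0.2, 0.9, 0.2], 0.6)]
rgb, alpha = composite_merged_fragments([mesh_frag] + gauss_frags)
print(rgb, alpha)
```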
Poster
Jack Saunders · Charlie Hewitt · Yanan Jian · Marek Kowalski · Tadas Baltrusaitis · Yiye Chen · Darren Cosker · Virginia Estellers · Nicholas Gydé · Vinay P. Namboodiri · Benjamin E Lundell
[ ExHall D ]
Abstract
Gaussian Splatting has changed the game for real-time photo-realistic rendering. One of the most popular applications of Gaussian Splatting is to create animatable avatars, known as Gaussian Avatars. Recent works have pushed the boundaries of quality and rendering efficiency but suffer from two main limitations. Either they require expensive multi-camera rigs to produce avatars with free-view rendering, or they can be trained with a single camera but only rendered at high quality from this fixed viewpoint. An ideal model would be trained using a short monocular video or image from available hardware, such as a webcam, and rendered from any view. To this end, we propose GASP: Gaussian Avatars with Synthetic Priors. To overcome the limitations of existing datasets, we exploit the pixel-perfect nature of synthetic data to train a Gaussian Avatar prior. By fitting this prior model to a single photo or video and fine-tuning it, we get a high-quality Gaussian Avatar, which supports 360° rendering. Our prior is only required for fitting, not inference, enabling real-time application. Through our method, we obtain high-quality, animatable Avatars from limited data which can be animated and rendered at 70 fps on commercial hardware.
Poster
Rong Wang · Fabian Prada · Ziyan Wang · Zhongshi Jiang · Chengxiang Yin · Junxuan Li · Shunsuke Saito · Igor Santesteban · Javier Romero · Rohan Joshi · Hongdong Li · Jason Saragih · Yaser Sheikh
[ ExHall D ]
Abstract
We present a novel method for reconstructing personalized 3D human avatars with realistic animation from only a few images. Due to the large variations in body shapes, poses, and cloth types, existing methods mostly require hours of per-subject optimization during inference, which limits their practical applications. In contrast, we learn a universal prior from over a thousand clothed humans to achieve instant feedforward generation and zero-shot generalization. Specifically, instead of rigging the avatar with shared skinning weights, we jointly infer personalized avatar shape, skinning weights, and pose-dependent deformations, which effectively improves overall geometric fidelity and reduces deformation artifacts. Moreover, to normalize pose variations and resolve the coupled ambiguity between canonical shapes and skinning weights, we design a 3D canonicalization process to produce pixel-aligned initial conditions, which helps to reconstruct fine-grained geometric details. We then propose a multi-frame feature aggregation to robustly reduce artifacts introduced in canonicalization and fuse a plausible avatar that preserves person-specific identities. Finally, we train the model in an end-to-end framework on a large-scale capture dataset, which contains diverse human subjects paired with high-quality 3D scans. Extensive experiments show that our method generates more authentic reconstruction and animation than state-of-the-art methods, and can be directly generalized to inputs from casually …
Poster
Jingyu Zhuang · Di Kang · Linchao Bao · Liang Lin · Guanbin Li
[ ExHall D ]
Abstract
Text-driven avatar generation has gained significant attention owing to its convenience. However, existing methods typically model the human body with all garments as a single 3D model, limiting applications such as clothing replacement and reducing user control over the generation process. To overcome these limitations, we propose DAGSM, a novel pipeline that generates disentangled human bodies and garments from the given text prompts. Specifically, we model each part (e.g., body, upper/lower clothes) of the clothed human as one GS-enhanced mesh (GSM), which is a traditional mesh attached with 2D Gaussians to better handle complicated textures (e.g., woolen, translucent clothes) and produce realistic cloth animations. During generation, we first create the unclothed body, followed by the sequential generation of individual garments conditioned on the body, where we introduce a semantic-based algorithm to achieve better human-cloth and garment-garment separation. To improve texture quality, we propose a view-consistent texture refinement module, including a cross-view attention mechanism for texture style consistency and an incident-angle-weighted denoising (IAW-DE) strategy to update the appearance. Extensive experiments have demonstrated that DAGSM generates high-quality disentangled avatars, supports clothing replacement and realistic animation, and outperforms the baselines in visual quality.
Poster
Zedong Chu · Feng Xiong · Meiduo Liu · Jinzhi Zhang · Mingqi Shao · Zhaoxu Sun · Di Wang · Mu Xu
[ ExHall D ]
Abstract
With the rapid evolution of 3D generation algorithms, the cost of producing 3D humanoid character models has plummeted, yet the field is impeded by the lack of a comprehensive dataset for automatic rigging—a pivotal step in character animation. Addressing this gap, we present HumanRig, the first large-scale dataset specifically designed for 3D humanoid character rigging, encompassing 11,434 meticulously curated T-posed meshes adhering to a uniform skeleton topology. Capitalizing on this dataset, we introduce an innovative, data-driven automatic rigging framework, which overcomes the limitations of GNN-based methods in handling complex AI-generated meshes. Our approach integrates a Prior-Guided Skeleton Estimator (PGSE) module, which uses 2D skeleton joints to provide a preliminary 3D skeleton, and a Mesh-Skeleton Mutual Attention Network (MSMAN) that fuses skeleton features with 3D mesh features extracted by a U-shaped point transformer. This enables coarse-to-fine 3D skeleton joint regression and robust skinning estimation, surpassing previous methods in quality and versatility. This work not only remedies the dataset deficiency in rigging research but also propels the animation industry towards more efficient and automated character rigging pipelines.
Poster
Yuanyou Xu · Zongxin Yang · Yi Yang
[ ExHall D ]
Abstract
Controllable generation has achieved substantial progress in both 2D and 3D domains, yet current conditional generation methods still face limitations in describing detailed shape structures. Skeletons can effectively represent and describe object anatomy and pose. Unfortunately, past studies are often limited to human skeletons. In this work, we generalize skeletal conditioned generation to arbitrary structures. First, we design a reliable mesh skeletonization pipeline to generate a large-scale mesh-skeleton paired dataset. Based on the dataset, a multi-view and 3D generation pipeline is built. We propose to represent 3D skeletons as 2D conditional images via Coordinate Color Encoding. A Skeletal Correlation Module is designed to extract global skeletal features for condition injection. After multi-view images are generated, 3D assets can be obtained by incorporating a large reconstruction model, followed by a UV texture refinement stage. As a result, our method achieves instant generation of multi-view and 3D contents which are aligned with given skeletons. The proposed techniques largely improve the object-skeleton alignment and generation quality. Project page at https://skdream3d.github.io/. Dataset, code and models will be publicly released.
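A minimal sketch of the general idea behind encoding a 3D skeleton as a 2D conditional image whose colors are derived from 3D coordinates. The pinhole projection, bone sampling density, and normalization scheme here are assumptions for illustration, not the paper's exact Coordinate Color Encoding.

```python
import numpy as np

def coordinate_color_encode(joints_3d, bones, K, H=256, W=256):
    """Render a skeleton as a 2D condition image whose RGB values encode
    normalized 3D coordinates (a generic coordinate-color scheme).

    joints_3d: (J, 3) camera-space joint positions.
    bones:     list of (parent, child) index pairs.
    K:         (3, 3) camera intrinsics.
    """
    lo, hi = joints_3d.min(0), joints_3d.max(0)
    colors = (joints_3d - lo) / np.maximum(hi - lo, 1e-6)   # xyz -> rgb in [0, 1]

    # Pinhole projection to pixel coordinates.
    uvw = joints_3d @ K.T
    uv = uvw[:, :2] / uvw[:, 2:3]

    img = np.zeros((H, W, 3), dtype=np.float32)
    for a, b in bones:
        # Sample points along each bone and splat interpolated colors.
        for t in np.linspace(0.0, 1.0, 64):
            u, v = (1 - t) * uv[a] + t * uv[b]
            c = (1 - t) * colors[a] + t * colors[b]
            x, y = int(round(u)), int(round(v))
            if 0 <= x < W and 0 <= y < H:
                img[y, x] = c
    return img

# Toy example: a 3-joint chain in front of the camera.
joints = np.array([[0.0, 0.0, 2.0], [0.0, 0.3, 2.0], [0.2, 0.6, 2.1]])
bones = [(0, 1), (1, 2)]
K = np.array([[200.0, 0, 128], [0, 200.0, 128], [0, 0, 1]])
cond_image = coordinate_color_encode(joints, bones, K)
```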
Poster
Xingchao Yang · Takafumi Taketomi · Yuki Endo · Yoshihiro Kanamori
[ ExHall D ]
Abstract
Recovering high-quality 3D facial textures from single-view 2D images is a challenging task, especially under constraints of limited data and complex facial details such as makeup, wrinkles, and occlusions. In this paper, we introduce FreeUV, a novel ground-truth-free UV texture recovery framework that eliminates the need for annotated or synthetic UV data. FreeUV leverages a pre-trained Stable Diffusion model alongside a Cross-Assembly inference strategy to fulfill this objective. In FreeUV, separate networks are trained independently to focus on realistic appearance and structural consistency, and these networks are combined during inference to generate coherent textures. Our approach accurately captures intricate facial features and demonstrates robust performance across diverse poses and occlusions. Extensive experiments validate FreeUV's effectiveness, with results surpassing state-of-the-art methods in both quantitative and qualitative metrics. Additionally, FreeUV enables new applications, including local editing, facial feature interpolation, and multi-view texture recovery. By reducing data requirements, FreeUV offers a scalable solution for generating high-fidelity 3D facial textures suitable for real-world scenarios.
Poster
Gangjian Zhang · Nanjie Yao · Shunsi Zhang · hanfeng Zhao · Guoliang Pang · Jian Shu · Hao Wang
[ ExHall D ]
Abstract
This paper investigates the research task of reconstructing the 3D clothed human body from a monocular image. Due to the inherent ambiguity of single-view input, existing approaches leverage pre-trained SMPL(-X) estimation models or generative models to provide auxiliary information for human reconstruction. However, these methods capture only the general human body geometry and overlook specific geometric details, leading to inaccurate skeleton reconstruction, incorrect joint positions, and unclear cloth wrinkles. In response to these issues, we propose a multi-level geometry learning framework. Technically, we design three key components: skeleton-level enhancement, joint-level augmentation, and wrinkle-level refinement modules. Specifically, we effectively integrate the projected 3D Fourier features into a Gaussian reconstruction model, introduce perturbations to improve joint depth estimation during training, and refine coarse human wrinkles in a manner resembling the denoising process of diffusion models. Extensive quantitative and qualitative experiments on two out-of-distribution test sets show the superior performance of our approach compared to state-of-the-art (SOTA) methods.
Poster
Zichen Tang · Yuan Yao · Miaomiao Cui · Liefeng Bo · Hongyu Yang
[ ExHall D ]
Abstract
Text-guided 3D human generation has advanced with the development of efficient 3D representations and 2D-lifting methods like score distillation sampling (SDS). However, current methods suffer from prolonged training times and often produce results that lack fine facial and garment details. In this paper, we propose GaussianIP, an effective two-stage framework for generating identity-preserving realistic 3D humans from text and image prompts. Our core insight is to leverage human-centric knowledge to facilitate the generation process. In stage 1, we propose a novel Adaptive Human Distillation Sampling (AHDS) method to rapidly generate a 3D human that maintains high identity consistency with the image prompt and achieves a realistic appearance. Compared to traditional SDS methods, AHDS better aligns with the human-centric generation process, enhancing visual quality with notably fewer training steps. To further improve the visual quality of the face and clothes regions, we design a View-Consistent Refinement (VCR) strategy in stage 2. Specifically, it iteratively produces detail-enhanced versions of the multi-view images from stage 1, ensuring 3D texture consistency across views via mutual attention and distance-guided attention fusion. Then a polished version of the 3D human can be achieved by directly performing reconstruction with the refined images. Extensive experiments demonstrate that …
Poster
Yingmao Miao · Zhanpeng Huang · Rui Han · Zibin Wang · Chenhao Lin · Chao Shen
[ ExHall D ]
Abstract
While virtual try-on for clothes and shoes with diffusion models has gained traction, virtual try-on for ornaments, such as bracelets, rings, earrings, and necklaces, remains largely unexplored. Due to the intricate tiny patterns and repeated geometric sub-structures in most ornaments, it is much more difficult to guarantee identity and appearance consistency under large pose and scale variances between ornaments and models. This paper proposes the task of virtual try-on for ornaments and presents a method to improve the geometric and appearance preservation of ornament virtual try-on. Specifically, we estimate an accurate wearing mask to improve the alignment between ornaments and models in an iterative scheme alongside the denoising process. To preserve structural details, we further regularize attention layers to map the reference ornament mask to the wearing mask in an implicit way. Experiment results demonstrate that our method successfully wears ornaments from reference images onto target models, handling substantial differences in scale and pose while preserving identity and achieving realistic visual effects.
Poster
Sumit Chaturvedi · Mengwei Ren · Yannick Hold-Geoffroy · Jingyuan Liu · Julie Dorsey · ZHIXIN SHU
[ ExHall D ]
Abstract
We introduce SynthLight, a diffusion model for portrait relighting. Our approach frames image relighting as a re-rendering problem, where pixels must be transformed in response to changes in environmental lighting conditions. We synthesize a dataset to simulate this lighting-conditioned transformation with 3D head assets under varying lighting using a physically-based rendering engine. We propose two training and inference strategies to bridge the gap between the synthetic and real image domains: (1) multi-task training that takes advantage of real human portraits without lighting labels; (2) an inference-time diffusion sampling procedure based on classifier-free guidance that leverages the input portrait to better preserve details. Our method generalizes to diverse real photographs and produces realistic illumination effects, such as specular highlights and cast shadows, while preserving identity well. Our quantitative experiments on Light Stage testing data demonstrate results comparable to state-of-the-art relighting methods trained with Light Stage data. Our qualitative results on in-the-wild images show high-quality illumination effects for portraits that have never been achieved before with traditional supervision.
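The guidance idea at inference time can be sketched with the standard classifier-free guidance combination of conditional and unconditional noise predictions. The denoiser interface, the way the condition is dropped, and the guidance scale below are placeholder assumptions, not SynthLight's actual sampling procedure.

```python
import torch

def cfg_noise_prediction(eps_model, x_t, t, cond_image, guidance_scale=3.0):
    """Standard classifier-free guidance: blend conditional and unconditional
    noise predictions. `eps_model` and its conditioning interface are
    placeholders, not the paper's exact sampler."""
    eps_cond = eps_model(x_t, t, cond_image)    # conditioned on the input portrait
    eps_uncond = eps_model(x_t, t, None)        # condition dropped
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy denoiser so the sketch runs end to end.
def toy_eps_model(x_t, t, cond):
    bias = 0.0 if cond is None else 0.1 * cond.mean()
    return 0.5 * x_t + bias

x_t = torch.randn(1, 3, 64, 64)
portrait = torch.rand(1, 3, 64, 64)
eps = cfg_noise_prediction(toy_eps_model, x_t, t=500, cond_image=portrait)
```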
Poster
Junying Wang · Jingyuan Liu · Xin Sun · Krishna Kumar Singh · ZHIXIN SHU · He Zhang · Jimei Yang · Nanxuan Zhao · Tuanfeng Y. Wang · Simon Su Chen · Ulrich Neumann · Jae Shin Yoon
[ ExHall D ]
Abstract
This paper introduces Comprehensive Relighting, the first all-in-one approach that can both control and harmonize the lighting from an image or video of humans with arbitrary body parts from any scene. Building such a generalizable model is extremely challenging due to the lack of datasets, restricting existing image-based relighting models to specific scenarios (e.g., faces or static humans). To address this challenge, we repurpose a pre-trained diffusion model as a general image prior and jointly model the human relighting and background harmonization in a coarse-to-fine framework. To further enhance the temporal coherence of the relighting, we introduce an unsupervised temporal lighting model that learns lighting cycle consistency from many real-world videos without any ground truth. At inference time, our motion module is combined with the diffusion models through spatio-temporal feature blending algorithms without extra training, and we apply a new guided refinement as a post-processing step to preserve the high-frequency details from the input image. In the experiments, Comprehensive Relighting shows strong generalizability and lighting temporal coherence, outperforming existing image-based human relighting and harmonization methods.
Poster
Kenji Enomoto · Scott Cohen · Brian Price · TJ Rhodes
[ ExHall D ]
Abstract
This paper considers the long-standing problem of extracting alpha mattes from video using a known background. While various color-based or polarization-based approaches have been studied in past decades, the problem remains ill-posed because the solutions solely rely on either color or polarization. We introduce Polarized Color Screen Matting, a single-shot, per-pixel matting theory for alpha matte and foreground color recovery using both color and polarization cues. Through a theoretical analysis of our diffuse-specular polarimetric compositing equation, we derive practical closed-form matting methods with their solvability conditions. Our theory concludes that an alpha matte can be extracted without manual corrections using off-the-shelf equipment such as an LCD monitor, polarization camera, and unpolarized lights with calibrated color. Experiments on synthetic and real-world datasets verify the validity of our theory and show the capability of our matting methods on real videos with quantitative and qualitative comparisons to color-based and polarization-based matting methods.
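For context on the color-based approaches this paper contrasts with, the classical closed-form known-background solution requires two shots against two different backgrounds (triangulation matting in the style of Smith and Blinn). The sketch below illustrates that color-only baseline, not the single-shot polarimetric method proposed here; the toy foreground and backgrounds are assumptions.

```python
import numpy as np

def triangulation_matte(I1, I2, B1, B2, eps=1e-6):
    """Classical two-shot, color-based matting with two known backgrounds.

    All inputs are (H, W, 3) float images; I_k is the composite captured in
    front of known background B_k. From I_k = alpha*F + (1 - alpha)*B_k:
        I1 - I2 = (1 - alpha) * (B1 - B2),
    solved per pixel in the least-squares sense over color channels.
    """
    dI, dB = I1 - I2, B1 - B2
    alpha = 1.0 - (dI * dB).sum(-1) / np.maximum((dB * dB).sum(-1), eps)
    return np.clip(alpha, 0.0, 1.0)

# Toy check: a half-transparent gray foreground over red/blue backgrounds.
H = W = 4
F, a = np.full((H, W, 3), 0.5), 0.5
B1 = np.zeros((H, W, 3)); B1[..., 0] = 1.0
B2 = np.zeros((H, W, 3)); B2[..., 2] = 1.0
I1, I2 = a * F + (1 - a) * B1, a * F + (1 - a) * B2
print(triangulation_matte(I1, I2, B1, B2))   # ~0.5 everywhere
```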
Poster
Ning Ni · Libao Zhang
[ ExHall D ]
Abstract
Recently, improving the residual structure and designing efficient convolutions have become important branches of lightweight visual reconstruction model design. We have observed that the feature addition mode (FAM) in existing residual structures tends to lead to slow feature learning or stagnation in feature evolution, a phenomenon we define as network inertia. In addition, although blueprint separable convolutions (BSConv) have proved the dominance of intra-kernel correlation, BSConv forces the blueprint to perform scale transformation on all channels, which may lead to incorrect intra-kernel correlation, introduce useless or disruptive features on some channels, and hinder the effective propagation of features. Therefore, in this paper, we rethink the FAM and BSConv for super-light visual reconstruction framework design. First, we design a novel linking mode, called feature diversity evolution link (FDEL), which aims to alleviate the phenomenon of network inertia by reducing the retention of previous low-level features, thereby promoting the evolution of feature diversity. Second, we propose blueprint controllable convolutions (B2Conv). The B2Conv can adaptively pick accurate intra-kernel correlation along the depth axis, effectively preventing the introduction of useless or disruptive features. Based on FDEL and B2Conv, we develop a super-light super-resolution (SR) framework, SLVR, for visual reconstruction. Both FDEL and B2Conv can …
Poster
Ping Wang · Lishun Wang · Gang Qu · Xiaodong Wang · Yulun Zhang · Xin Yuan
[ ExHall D ]
Abstract
Plug-and-play (PnP) and deep-unrolling solvers have become the de-facto standard tools for inverse problems in single-pixel imaging (SPI). The former, a class of iterative algorithms where regularization is implicitly performed by an off-the-shelf deep Gaussian denoiser, behaves well in terms of interpretability and generalization, but poorly in terms of accuracy and running time. By contrast, the latter results in better accuracy with faster inference, but cannot generalize well to varying degradations and cannot be interpreted as the proximal operator of any function. In this paper, we tackle the challenge of integrating the merits of both kinds of solvers into one. To this end, we propose a proximal optimization trajectory loss and an image restoration network to develop learned proximal restorers (LPRs) as PnP priors. LPRs approximate the proximal operator of an explicit restoration regularization and guarantee the fast convergence of PnP-HQS (half-quadratic splitting) and PnP-ADMM (alternating direction method of multipliers). Besides, extensive experiments show that our algorithms not only achieve state-of-the-art accuracy in almost real time but also generalize well to varying degradations caused by changing imaging settings (resolution and compression ratio). The source code will be released soon.
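A minimal sketch of the plug-and-play half-quadratic splitting scheme that such learned proximal restorers slot into: alternate a data-fidelity proximal step with a plug-in restorer. The linear forward model, the soft-threshold "restorer", and the penalty weight are toy assumptions, not the paper's SPI operator or LPR network.

```python
import numpy as np

def pnp_hqs(y, A, restorer, mu=1.0, iters=30):
    """Generic PnP-HQS for y = A @ x + noise.

    x-step: argmin_x ||A x - y||^2 + mu * ||x - z||^2  (closed form)
    z-step: plug-in restorer applied to the current estimate.
    """
    n = A.shape[1]
    AtA, Aty = A.T @ A, A.T @ y
    lhs = AtA + mu * np.eye(n)
    x = np.zeros(n)
    z = np.zeros(n)
    for _ in range(iters):
        x = np.linalg.solve(lhs, Aty + mu * z)   # data-fidelity proximal step
        z = restorer(x)                           # prior step (plug-and-play)
    return x

# Toy problem: compressive measurements of a sparse signal, with a
# soft-threshold operator standing in for a learned restorer.
rng = np.random.default_rng(0)
x_true = np.zeros(60)
x_true[[5, 25, 45]] = 1.0
A = rng.standard_normal((30, 60)) / np.sqrt(30)
y = A @ x_true
soft = lambda v: np.sign(v) * np.maximum(np.abs(v) - 0.05, 0.0)
x_hat = pnp_hqs(y, A, soft)
print(np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))
```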
Poster
Bojian Wu · YIFAN PENG · Ruizhen Hu · Xiaowei Zhou
[ ExHall D ]
Abstract
The challenge of image-based 3D reconstruction for glossy objects lies in separating diffuse and specular components on glossy surfaces from captured images, a task complicated by the ambiguity in discerning lighting conditions and material properties using RGB data alone. While state-of-the-art methods rely on tailored and/or high-end equipment for data acquisition, which can be cumbersome and time-consuming, this work introduces a scalable polarization-aided approach that employs cost-effective acquisition tools. By attaching a linear polarizer to readily available RGB cameras, multi-view polarization images can be captured without the need for advance calibration or precise measurements of the polarizer angle, substantially reducing system construction costs. The proposed approach represents polarimetric BRDF, Stokes vectors, and polarization states of object surfaces as neural implicit fields. These fields, combined with the polarizer angle, are retrieved by optimizing the rendering loss of input polarized images. By leveraging fundamental physical principles for the implicit representation of polarization rendering, our method demonstrates superiority over existing techniques through experiments on public datasets and real captured images. Code and captured data will be made available upon acceptance for further evaluation.
Poster
Wei Xu · Charlie Wagner · Junjie Luo · Qi Guo
[ ExHall D ]
Abstract
Extracting depth information from photon-limited, defocused images is challenging because depth from defocus (DfD) relies on accurate estimation of defocus blur, which is fundamentally sensitive to image noise. We present a novel approach to robustly measure object depths from photon-limited images along the defocused boundaries. It is based on a new image patch representation, Blurry-Edges, that explicitly stores and visualizes a rich set of low-level patch information, including boundaries, color, and blurriness. We develop a deep neural network architecture that predicts the Blurry-Edges representation from a pair of differently defocused images, from which depth can be analytically calculated using a novel DfD relation we derive. Our experiment shows that our method achieves the highest depth estimation accuracy on photon-limited images compared to a broad range of state-of-the-art DfD methods.
Poster
Xiaoyan Xing · Konrad Groh · Sezer Karaoglu · Theo Gevers · Anand Bhattad
[ ExHall D ]
Abstract
We introduce LUMINET, a novel architecture that leverages generative models and latent intrinsic representations for effective lighting transfer. Given a source image and a target lighting image, LUMINET synthesizes a relit version of the source scene that captures the target's lighting. Our approach makes two key contributions: a data curation strategy to enhance training efficiency for effective relighting, and a modified diffusion-based ControlNet that processes both latent intrinsic properties from the source image and latent extrinsic properties from the target image. We further improve lighting transfer through a learned adaptor (MLP) that injects the target's latent extrinsic properties via cross-attention and fine-tuning. Unlike traditional ControlNet, which generates images with conditional maps from a single scene, LUMINET processes latent representations from two different images - preserving geometry and albedo from the source while transferring lighting characteristics from the target. Experiments demonstrate that our method successfully transfers complex lighting phenomena including specular highlights and indirect illumination across scenes with varying spatial layouts and materials, outperforming existing approaches on challenging indoor scenes using only images as input.
Poster
Chao Wang · Zhihao Xia · Thomas Leimkuehler · Karol Myszkowski · Xuaner Zhang
[ ExHall D ]
Abstract
While consumer displays increasingly support more than 10 stops of dynamic range, most image assets, such as internet photographs and generative AI content, remain limited to 8-bit low dynamic range (LDR), constraining their utility across high dynamic range (HDR) applications. Currently, no generative model can produce high-bit, high-dynamic-range content in a generalizable way. Existing LDR-to-HDR conversion methods often struggle to produce photorealistic details and a physically plausible dynamic range in the clipped areas. We introduce LEDiff, a method that equips a generative model with HDR content generation through latent-space fusion inspired by image-space exposure fusion techniques. It also functions as an LDR-to-HDR converter, expanding the dynamic range of existing LDR images. Our approach uses a small HDR dataset to enable a pretrained diffusion model to recover detail and dynamic range in clipped highlights and shadows. LEDiff brings HDR capabilities to existing generative models and converts any LDR image to HDR, creating photorealistic HDR outputs for image generation, image-based lighting (HDR environment map generation), and photographic effects such as depth-of-field simulation, where linear HDR data is essential for realistic quality.
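Since the method is described as inspired by image-space exposure fusion, the sketch below shows that classical operation in its simplest single-scale form (well-exposedness weights only, without the Laplacian-pyramid blending of the full Mertens-style algorithm). It illustrates the inspiration, not the paper's latent-space fusion; the weighting function and toy exposure stack are assumptions.

```python
import numpy as np

def exposure_fusion(ldr_stack, sigma=0.2):
    """Weight each exposure by per-pixel 'well-exposedness' (a Gaussian
    around mid-gray) and blend; single-scale sketch only.

    ldr_stack: (N, H, W, 3) float images in [0, 1].
    """
    luma = ldr_stack.mean(-1, keepdims=True)              # (N, H, W, 1)
    w = np.exp(-0.5 * ((luma - 0.5) / sigma) ** 2)        # well-exposedness weight
    w = w / np.maximum(w.sum(0, keepdims=True), 1e-8)     # normalize over exposures
    return (w * ldr_stack).sum(0)

# Toy bracketed stack: under-, mid-, and over-exposed versions of a ramp.
base = np.linspace(0, 1, 64)[None, :, None].repeat(64, 0).repeat(3, 2)
stack = np.stack([np.clip(base * gain, 0, 1) for gain in (0.5, 1.0, 2.0)])
fused = exposure_fusion(stack)
```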
Poster
Chih-Hao Lin · Jia-Bin Huang · Zhengqin Li · Zhao Dong · Christian Richardt · Michael Zollhoefer · Tuotuo Li · Johannes Kopf · Shenlong Wang · Changil Kim
[ ExHall D ]
Abstract
Inverse rendering seeks to recover 3D geometry, surface material, and lighting from captured images, enabling advanced applications such as novel-view synthesis, relighting, and virtual object insertion. However, most existing techniques rely on high dynamic range (HDR) images as input, limiting accessibility for general users. In response, we introduce IRIS, an inverse rendering framework that recovers the physically based material, spatially-varying HDR lighting, and camera response functions from multi-view, low-dynamic-range (LDR) images. By eliminating the dependence on HDR input, we make inverse rendering technology more accessible. We evaluate our approach on real-world and synthetic scenes and compare it with state-of-the-art methods. Our results show that IRIS effectively recovers HDR lighting, accurate material, and plausible camera response functions, supporting photorealistic relighting and object insertion.
Poster
Hoon-Gyu Chung · Seokjun Choi · Seung-Hwan Baek
[ ExHall D ]
Abstract
Inverse rendering seeks to reconstruct both geometry and spatially varying BRDFs (SVBRDFs) from captured images. To address the inherent ill-posedness of inverse rendering, basis BRDF representations are commonly used, modeling SVBRDFs as spatially varying blends of a set of basis BRDFs. However, existing methods often yield basis BRDFs that lack intuitive separation and have limited scalability to scenes of varying complexity. In this paper, we introduce a differentiable inverse rendering method that produces interpretable basis BRDFs. Our approach models a scene using 2D Gaussians, where the reflectance of each Gaussian is defined by a weighted blend of basis BRDFs. We efficiently render an image from the 2D Gaussians and basis BRDFs using differentiable rasterization and impose a rendering loss with the input images. During this analysis-by-synthesis optimization process of differentiable inverse rendering, we dynamically adjust the number of basis BRDFs to fit the target scene while encouraging sparsity in the basis weights. This ensures that the reflectance of each Gaussian is represented by only a few basis BRDFs. This approach enables the reconstruction of accurate geometry and interpretable basis BRDFs that are spatially separated. Consequently, the resulting scene representation, comprising basis BRDFs and 2D Gaussians, supports physically-based novel-view relighting and intuitive scene editing.
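A minimal sketch of representing each Gaussian's reflectance as a weighted blend of a small basis-BRDF set with a sparsity-encouraging term. The parameter layout, softmax weighting, and entropy penalty are illustrative assumptions, not the paper's exact formulation or its dynamic basis adjustment.

```python
import torch

# Each 2D Gaussian blends K basis BRDFs through per-Gaussian weights.
G, K = 1000, 8                                        # Gaussians / basis BRDFs (toy sizes)
basis_brdf = torch.nn.Parameter(torch.rand(K, 5))     # e.g. albedo (3), roughness, metallic
blend_logits = torch.nn.Parameter(torch.zeros(G, K))  # per-Gaussian blend logits

def per_gaussian_reflectance():
    w = torch.softmax(blend_logits, dim=-1)           # (G, K), weights sum to 1
    reflectance = w @ basis_brdf                       # (G, 5) blended BRDF parameters
    # Low entropy pushes each Gaussian toward using only a few bases.
    sparsity = -(w * torch.log(w + 1e-8)).sum(-1).mean()
    return reflectance, sparsity

reflectance, sparsity_loss = per_gaussian_reflectance()
# total_loss = rendering_loss(...) + lambda_s * sparsity_loss   (rendering loss not shown)
```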
Poster
Samuel Rota Bulò · Lorenzo Porzi · Nemanja Bartolovic · Peter Kontschieder
[ ExHall D ]
Abstract
We present a novel, hardware rasterized rendering approach for ray-based 3D Gaussian Splatting (RayGS), obtaining both fast and high-quality results for novel view synthesis. Our work contains a mathematically rigorous and geometrically intuitive derivation about how to efficiently estimate all relevant quantities for rendering RayGS models, structured with respect to standard hardware rasterization shaders. Our solution is the first enabling rendering RayGS models at sufficiently high frame rates to support quality-sensitive applications like Virtual and Mixed Reality. Our second contribution enables alias-free rendering for RayGS, by addressing mip-related issues arising when rendering diverging scales during training and testing. We demonstrate significant performance gains, across different benchmark scenes, while retaining state-of-the-art appearance quality of RayGS.
Poster
Chun Gu · Xiaofei Wei · Li Zhang · Xiatian Zhu
[ ExHall D ]
Abstract
Inverse rendering aims to recover scene geometry, material properties, and lighting from multi-view images. Given the complexity of light-surface interactions, importance sampling is essential for the evaluation of the rendering equation, as it reduces variance and enhances the efficiency of Monte Carlo sampling. Existing inverse rendering methods typically use manually pre-defined, non-learnable importance samplers, struggling to effectively match the spatially and directionally varied integrand and resulting in high variance and suboptimal performance. To address this limitation, we propose the concept of learning a spatially and directionally aware importance sampler for the rendering equation to accurately and flexibly capture the unconstrained complexity of a typical scene. We further formulate TensoFlow, a generic approach for sampler learning in inverse rendering, enabling it to closely match the integrand of the rendering equation spatially and directionally. Concretely, our sampler is parameterized by normalizing flows, allowing both directional sampling of incident light and probability density function (PDF) inference. To capture the characteristics of the sampler spatially, we learn a tensorial representation of the scene space, which imposes spatial conditions, together with the reflected direction, leading to spatially and directionally aware sampling distributions. Our model can be optimized by minimizing the difference between the integrand and …
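The role of an importance sampler in evaluating the rendering equation can be sketched with a standard Monte Carlo estimator: draw incident directions from a sampling PDF and weight integrand samples by 1/pdf. Here a fixed cosine-weighted hemisphere sampler stands in for the learned normalizing-flow sampler, and the Lambertian toy scene is an assumption.

```python
import numpy as np

def mc_outgoing_radiance(brdf, incoming_radiance, n_samples=256, rng=None):
    """Monte Carlo estimate of Lo = integral of f(wi) * Li(wi) * cos(theta_i)
    over the hemisphere, using cosine-weighted importance sampling with
    pdf = cos(theta) / pi. Directions are expressed in the local shading
    frame (z = surface normal); a learned sampler would replace this one."""
    rng = rng or np.random.default_rng(0)
    u1, u2 = rng.random(n_samples), rng.random(n_samples)
    r, phi = np.sqrt(u1), 2.0 * np.pi * u2
    wi = np.stack([r * np.cos(phi), r * np.sin(phi), np.sqrt(1.0 - u1)], axis=-1)
    cos_theta = wi[:, 2]
    pdf = cos_theta / np.pi
    samples = brdf(wi) * incoming_radiance(wi) * cos_theta / np.maximum(pdf, 1e-8)
    return samples.mean()

# Toy check: a Lambertian BRDF (albedo / pi) under a constant white
# environment integrates to the albedo itself.
albedo = 0.7
print(mc_outgoing_radiance(lambda wi: albedo / np.pi, lambda wi: 1.0))   # ~0.7
```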
Poster
Zhengqin Li · Dilin Wang · Ka chen · Zhaoyang Lv · Thu Nguyen-Phuoc · Milim Lee · Jia-Bin Huang · Lei Xiao · Yufeng Zhu · Carl Marshall · Carl Ren · Richard Newcombe · Zhao Dong
[ ExHall D ]
Abstract
We present the Large Inverse Rendering Model (LIRM), a transformer architecture that jointly reconstructs high-quality shape, materials, and radiance fields with view-dependent effects in less than a second. Our model builds upon the recent Large Reconstruction Models (LRMs) that achieve state-of-the-art sparse-view reconstruction quality. However, existing LRMs struggle to reconstruct unseen parts accurately and cannot recover glossy appearance or generate relightable 3D contents that can be consumed by standard graphics engines. To address these limitations, we make three key technical contributions to build a more practical multi-view 3D reconstruction framework. First, we introduce an update model that allows us to progressively add more input views to improve our reconstruction. Second, we propose a hexa-plane neural SDF representation to better recover detailed textures, geometry and material parameters. Third, we develop a novel neural directional-embedding mechanism to handle view-dependent effects. Trained on a large-scale shape and material dataset with a tailored coarse-to-fine training scheme, our model achieves compelling results. It compares favorably to optimization-based dense-view inverse rendering methods in terms of geometry and relighting accuracy, while requiring only a fraction of the inference time.
Poster
Yutao Feng · Xiang Feng · Yintong Shang · Ying Jiang · Chang Yu · Zeshun Zong · Tianjia Shao · Hongzhi Wu · Kun Zhou · Chenfanfu Jiang · Yin Yang
[ ExHall D ]
Abstract
We demonstrate the feasibility of integrating physics-based animations of solids and fluids with 3D Gaussian Splatting (3DGS) to create novel effects in virtual scenes reconstructed using 3DGS. Leveraging the coherence of the Gaussian Splatting and Position-Based Dynamics (PBD) in the underlying representation, we manage rendering, view synthesis, and the dynamics of solids and fluids in a cohesive manner. Similar to GaussianShader, we enhance each Gaussian kernel with an added normal, aligning the kernel's orientation with the surface normal to refine the PBD simulation. This approach effectively eliminates spiky noises that arise from rotational deformation in solids. It also allows us to integrate physically based rendering to augment the dynamic surface reflections on fluids. Consequently, our framework is capable of realistically reproducing surface highlights on dynamic fluids and facilitating interactions between scene objects and fluids from new views.
Poster
Aditya Chetan · Guandao Yang · Zichen Wang · Steve Marschner · Bharath Hariharan
[ ExHall D ]
Abstract
Neural fields have become widely used in various fields, from shape representation to neural rendering, and for solving partial differential equations (PDEs). With the advent of hybrid neural field representations like Instant NGP that leverage small MLPs and explicit representations, these models train quickly and can fit large scenes. Yet in many applications like rendering and simulation, hybrid neural fields can cause noticeable and unreasonable artifacts. This is because they do not yield accurate spatial derivatives needed for these downstream applications. In this work, we propose two ways to circumvent these challenges. Our first approach is a post hoc operator that uses local polynomial fitting to obtain more accurate derivatives from pre-trained hybrid neural fields. Additionally, we also propose a self-supervised fine-tuning approach that refines the hybrid neural field to yield accurate derivatives directly while preserving the initial signal. We show applications of our method to rendering, collision simulation, and solving PDEs. We observe that using our approach yields more accurate derivatives, reducing artifacts and leading to more accurate simulations in downstream applications.
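A minimal sketch of the post hoc idea of estimating derivatives by locally fitting a polynomial to field samples rather than differentiating the network directly. The stencil size, neighborhood radius, and quadratic order are assumptions for illustration, not the paper's exact operator.

```python
import numpy as np

def poly_fit_gradient(field, x0, h=0.05, n=32, rng=None):
    """Estimate the spatial gradient of a noisy field at x0 by fitting a
    quadratic polynomial to random samples in a small neighborhood.

    field: callable R^3 -> R, x0: (3,), h: neighborhood radius.
    """
    rng = rng or np.random.default_rng(0)
    offsets = rng.uniform(-h, h, size=(n, 3))
    values = np.array([field(x0 + d) for d in offsets])
    # Design matrix for f(x0 + d) ~ c0 + g.d + d^T H d (quadratic terms flattened).
    quad = np.stack([offsets[:, i] * offsets[:, j]
                     for i in range(3) for j in range(i, 3)], axis=1)
    A = np.concatenate([np.ones((n, 1)), offsets, quad], axis=1)
    coeffs, *_ = np.linalg.lstsq(A, values, rcond=None)
    return coeffs[1:4]                 # linear coefficients = gradient estimate

# Toy "hybrid field": a smooth sphere SDF plus high-frequency noise whose
# analytic derivative would be badly corrupted.
sphere = lambda x: np.linalg.norm(x) - 1.0
noisy = lambda x: sphere(x) + 1e-3 * np.sin(200 * x).sum()
x0 = np.array([0.6, 0.0, 0.0])
print(poly_fit_gradient(noisy, x0), x0 / np.linalg.norm(x0))  # close to the true normal
```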
Poster
Feixiang He · Jiangbei Yue · Jialin Zhu · Armin Seyfried · Dan Casas · Julien Pettré · He Wang
[ ExHall D ]
Abstract
Video-based high-density crowd analysis and prediction has been a long-standing topic in computer vision. It is notoriously difficult due to, but not limited to, the lack of high-quality data and complex crowd dynamics. Consequently, it has been relatively understudied. In this paper, we propose a new approach that aims to learn from in-the-wild videos, often of low quality, where it is difficult to track individuals or count heads. The key novelty is a new physics prior for modeling crowd dynamics. We model high-density crowds as active matter, a continuum of active particles subject to stochastic forces, which we name 'crowd material'. Our physics model is combined with neural networks, resulting in a neural stochastic differential equation system which can mimic the complex crowd dynamics. Due to the lack of similar research, we adapt a range of existing methods that are close to ours, as well as other possible alternatives, for comparison. Through exhaustive evaluation, we show our model outperforms existing methods in analyzing and forecasting extremely high-density crowds. Furthermore, since our model is a continuous-time physics model, it can be used for simulation and analysis, providing strong interpretability. This is categorically different from most deep learning methods, which are discrete-time models and black boxes.
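A neural stochastic differential equation of this kind is typically simulated with Euler-Maruyama integration. The sketch below uses hand-written drift and diffusion terms as stand-ins for the learned 'crowd material' model; the attraction-toward-an-exit drift and constant noise magnitude are toy assumptions.

```python
import numpy as np

def euler_maruyama(drift, diffusion, x0, dt=0.01, steps=500, rng=None):
    """Simulate dX = drift(X) dt + diffusion(X) dW with Euler-Maruyama.
    In a neural SDE, `drift` (and possibly `diffusion`) would be neural
    networks trained from crowd videos; here they are simple stand-ins."""
    rng = rng or np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    traj = [x.copy()]
    for _ in range(steps):
        dW = rng.normal(0.0, np.sqrt(dt), size=x.shape)
        x = x + drift(x) * dt + diffusion(x) * dW
        traj.append(x.copy())
    return np.stack(traj)

# Toy "active particles": agents drawn toward an exit at the origin,
# with isotropic stochastic forcing.
drift = lambda x: -1.5 * x          # attraction toward the exit
diffusion = lambda x: 0.2           # constant noise magnitude
agents0 = np.random.default_rng(1).uniform(-5, 5, size=(100, 2))
trajectories = euler_maruyama(drift, diffusion, agents0)
print(trajectories.shape)           # (steps + 1, 100, 2)
```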
Poster
Bojun Xiong · Jialun Liu · JiaKui Hu · Chenming Wu · Jinbo Wu · Xing Liu · Chen Zhao · Errui Ding · Zhouhui Lian
[ ExHall D ]
Abstract
Physically Based Rendering (PBR) materials play a crucial role in modern graphics, enabling photorealistic rendering across diverse environment maps. Developing an effective and efficient algorithm that is capable of automatically generating high-quality PBR materials rather than RGB textures for 3D meshes can significantly streamline 3D content creation. Most existing methods leverage pre-trained 2D diffusion models for multi-view image synthesis, which often leads to severe inconsistency between the generated textures and input 3D meshes. This paper presents TexGaussian, a novel method that uses octant-aligned 3D Gaussian Splatting for rapid PBR material generation. Specifically, we place each 3D Gaussian on the finest leaf node of the octree built from the input 3D mesh to render multi-view images not only for the albedo map but also for the roughness and metallic maps. Moreover, our model is trained in a regression manner instead of diffusion denoising, and is capable of generating the PBR material for a 3D mesh in a single feed-forward process. Extensive experiments on publicly available benchmarks demonstrate that our method synthesizes more visually pleasing PBR materials and runs faster than previous methods in both unconditional and text-conditional scenarios, exhibiting better consistency with the given geometry. Our code and trained models will be …
Poster
Guoxing Sun · Rishabh Dabral · Heming Zhu · Pascal Fua · Christian Theobalt · Marc Habermann
[ ExHall D ]
Abstract
Real-time free-view human rendering from sparse-view RGB inputs is a challenging task due to sensor scarcity and the tight time budget. To ensure efficiency, recent methods leverage 2D CNNs operating in texture space to learn rendering primitives. However, they either jointly learn geometry and appearance, or they ignore relevant sparse image information for geometry estimation, significantly harming visual quality and robustness to unseen body poses. To address this issue, we present Double Unprojected Textures, which at its core disentangles coarse geometric deformation estimation from appearance synthesis, enabling robust and photorealistic 4K rendering in real time. Specifically, we first introduce a novel image-conditioned template deformation network, which estimates the coarse deformation of the human template from a first unprojected texture. This updated geometry is then used to apply a second and more accurate texture unprojection. The resulting texture map has fewer artifacts and better alignment with input views, which benefits our learning of finer-level geometry and appearance represented by Gaussian splats. We validate the effectiveness and efficiency of the proposed method in quantitative and qualitative experiments, where it significantly surpasses other state-of-the-art methods.
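The generic texture-unprojection step, projecting surface points into an input view and scattering the sampled colors into UV space, can be sketched as follows. The camera conventions, nearest-neighbor sampling, and the omission of occlusion handling and multi-view blending are simplifying assumptions, not the paper's pipeline.

```python
import numpy as np

def unproject_texture(image, K, R, t, surface_pts, uv_coords, tex_size=256):
    """Write image colors onto a UV texture map by projecting surface points
    into the view ("texture unprojection"). Occlusion handling and
    multi-view fusion are intentionally omitted in this sketch.

    image:       (H, W, 3) input view.
    K, R, t:     camera intrinsics / extrinsics (world -> camera).
    surface_pts: (N, 3) points sampled on the template surface.
    uv_coords:   (N, 2) their UV coordinates in [0, 1].
    """
    H, W, _ = image.shape
    cam = surface_pts @ R.T + t                           # world -> camera
    pix = cam @ K.T
    u, v = pix[:, 0] / pix[:, 2], pix[:, 1] / pix[:, 2]
    tex = np.zeros((tex_size, tex_size, 3), dtype=image.dtype)
    valid = (u >= 0) & (u < W - 1) & (v >= 0) & (v < H - 1) & (cam[:, 2] > 0)
    ui, vi = u[valid].astype(int), v[valid].astype(int)   # nearest-neighbor sample
    tu = (uv_coords[valid, 0] * (tex_size - 1)).astype(int)
    tv = (uv_coords[valid, 1] * (tex_size - 1)).astype(int)
    tex[tv, tu] = image[vi, ui]
    return tex

# Toy usage with a random image and a single surface point at the origin.
img = np.random.rand(64, 64, 3)
K = np.array([[50.0, 0, 32], [0, 50.0, 32], [0, 0, 1]])
tex = unproject_texture(img, K, np.eye(3), np.array([0, 0, 2.0]),
                        surface_pts=np.array([[0.0, 0.0, 0.0]]),
                        uv_coords=np.array([[0.5, 0.5]]))
```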
Poster
Zhipeng Huang · Wangbo Yu · Xinhua Cheng · ChengShu Zhao · Yunyang Ge · Mingyi Guo · Li Yuan · Yonghong Tian
[ ExHall D ]
Abstract
Indoor scene texture synthesis has garnered significant interest due to its important potential applications in virtual reality, digital media, and creative arts. Existing diffusion model-based approaches either rely on per-view inpainting techniques, which are plagued by severe cross-view inconsistencies and conspicuous seams, or they resort to optimization-based approaches that entail substantial computational overhead. In this work, we present **RoomPainter**, a framework that seamlessly integrates efficiency and consistency to achieve high-fidelity texturing of indoor scenes. The core of RoomPainter features a zero-shot technique that effectively adapts a 2D diffusion model for 3D-consistent texture synthesis, along with a two-stage generation strategy that ensures both global and local consistency. Specifically, we introduce Attention-Guided Multi-View Integrated Sampling (**MVIS**) combined with a neighbor-integrated attention mechanism for zero-shot texture map generation. Using **MVIS**, we first generate a texture map for the entire room to ensure global consistency, and then adopt its variant, attention-guided multi-view integrated repaint sampling (**MVRS**), to repaint individual instances within the room, thereby further enhancing local consistency. Experiments demonstrate that RoomPainter achieves superior performance for indoor scene texture synthesis in visual quality, global consistency, and generation efficiency.
Poster
Wei Cheng · Juncheng Mu · Xianfang Zeng · Xin Chen · Anqi Pang · Chi Zhang · Zhibin Wang · Bin Fu · Gang Yu · Ziwei Liu · Liang Pan
[ ExHall D ]
Abstract
Texturing is a crucial step in the 3D asset production workflow, which enhances the visual appeal and diversity of 3D assets. Despite recent advancements in Text-to-Texture (T2T) generation, existing methods often yield subpar results, primarily due to local discontinuities, inconsistencies across multiple views, and their heavy dependence on UV unwrapping outcomes. To tackle these challenges, we propose a novel generation-refinement 3D texturing framework called MVPaint, which can generate high-resolution, seamless textures while emphasizing multi-view consistency. MVPaint mainly consists of three key modules. 1) Synchronized Multi-view Generation (SMG). Given a 3D mesh model, MVPaint first simultaneously generates multi-view images by employing an SMG model, which leads to coarse texturing results with unpainted parts due to missing observations. 2) Spatial-aware 3D Inpainting (S3I). To ensure complete 3D texturing, we introduce the S3I method, specifically designed to effectively texture previously unobserved areas. 3) UV Refinement (UVR). Furthermore, MVPaint employs a UVR module to improve the texture quality in the UV space, which first performs a UV-space Super-Resolution, followed by a Spatial-aware Seam-Smoothing algorithm for revising spatial texturing discontinuities caused by UV unwrapping. Moreover, we establish two T2T evaluation benchmarks: the Objaverse T2T benchmark and the GSO T2T benchmark, based on selected high-quality 3D …
Poster
Qiao Yu · Xianzhi Li · Yuan Tang · Xu Han · Long Hu · yixue Hao · Min Chen
[ ExHall D ]
Abstract
Generating 3D meshes from a single image is an important but ill-posed task. Existing methods mainly adopt 2D multiview diffusion models to generate intermediate multiview images, and use the Large Reconstruction Model (LRM) to create the final meshes. However, the multiview images exhibit local inconsistencies, and the meshes often lack fidelity to the input image or look blurry. We propose Fancy123, featuring two enhancement modules and an unprojection operation to address the above three issues, respectively. The appearance enhancement module deforms the 2D multiview images to realign misaligned pixels for better multiview consistency. The fidelity enhancement module deforms the 3D mesh to match the input image. The unprojection of the input image and deformed multiview images onto LRM's generated mesh ensures high clarity, discarding LRM's predicted blurry-looking mesh colors. Extensive qualitative and quantitative experiments verify Fancy123's SoTA performance with significant improvement. Also, the two enhancement modules are plug-and-play and work at inference time, allowing seamless integration into various existing single-image-to-3D methods.
Poster
Nissim Maruani · Wang Yifan · Matthew Fisher · Pierre Alliez · Mathieu Desbrun
[ ExHall D ]
Abstract
This paper proposes a new 3D generative model that learns to synthesize shape variations based on a single example. While generative methods for 3D objects have recently attracted much attention, current techniques often lack geometric details and/or require long training times and large resources. Our approach remedies these issues by combining sparse voxel grids and multiscale point, normal, and color sampling within an encoder-free neural architecture that can be trained efficiently and in parallel. We show that our resulting variations better capture the fine details of their original input and can capture more general types of surfaces than previous SDF-based methods. Moreover, we offer interactive generation of 3D shape variants, allowing more human control in the design loop if needed.
Poster
Daoyi Gao · Mohd Yawar Nihal Siddiqui · Lei Li · Angela Dai
[ ExHall D ]
Abstract
Articulated 3D object generation is fundamental for creating realistic, functional, and interactable virtual assets which are not simply static. We introduce MeshArt, a hierarchical transformer-based approach to generate articulated 3D meshes with clean, compact geometry, reminiscent of human-crafted 3D models. We approach articulated mesh generation in a part-by-part fashion across two stages. First, we generate a high-level articulation-aware object structure; then, based on this structural information, we synthesize each part's mesh faces. Key to our approach is modeling both articulation structures and part meshes as sequences of quantized triangle embeddings, leading to a unified hierarchical framework with transformers for autoregressive generation. Object part structures are first generated as their bounding primitives and articulation modes; a second transformer, guided by these articulation structures, then generates each part's mesh triangles. To ensure coherency among generated parts, we introduce structure-guided conditioning that also incorporates local part mesh connectivity. MeshArt shows significant improvements over state of the art, with 57.1% improvement in structure coverage and a 209-point improvement in mesh generation FID.
Poster
Aleksei Bokhovkin · Quan Meng · Shubham Tulsiani · Angela Dai
[ ExHall D ]
Abstract
We present SceneFactor, a diffusion-based approach for large-scale 3D scene generation that enables controllable generation and effortless editing. SceneFactor enables text-guided 3D scene synthesis through our factored diffusion formulation, leveraging latent semantic and geometric manifolds for generation of arbitrary-sized 3D scenes. While text input enables easy, controllable generation, text guidance remains imprecise for intuitive, localized editing and manipulation of the generated 3D scenes. Our factored semantic diffusion generates a proxy semantic space composed of semantic 3D boxes that enables controllable editing of generated scenes by adding, removing, or changing the size of the semantic 3D proxy boxes, which guides high-fidelity, consistent 3D geometric editing. Extensive experiments demonstrate that our approach enables high-fidelity 3D scene synthesis with effective controllable editing through our factored diffusion approach.
Poster
Ziya Erkoc · Can Gümeli · Chaoyang Wang · Matthias Nießner · Angela Dai · Peter Wonka · Hsin-Ying Lee · Peiye Zhuang
[ ExHall D ]
Abstract
We propose a training-free approach to 3D editing that enables the editing of a single shape and the reconstruction of a mesh within a few minutes. Leveraging 4-view images, user-guided text prompts, and rough 2D masks, our method produces an edited 3D mesh that aligns with the prompt. For this, our approach performs synchronized multi-view image editing in 2D. However, targeted regions to be edited are ambiguous due to projection from 3D to 2D. To ensure precise editing only in intended regions, we develop a 3D segmentation pipeline that detects edited areas in 3D space. Additionally, we introduce a merging algorithm to seamlessly integrate edited 3D regions with the original input. Extensive experiments demonstrate the superiority of our method over previous approaches, enabling fast, high-quality editing while preserving unintended regions.
Poster
Quan Meng · Lei Li · Matthias Nießner · Angela Dai
[ ExHall D ]
Abstract
We present LT3SD, a novel latent diffusion model for large-scale 3D scene generation. Recent advances in diffusion models have shown impressive results in 3D object generation, but are limited in spatial extent and quality when extended to 3D scenes. To generate complex and diverse 3D scene structures, we introduce a latent tree representation to effectively encode both lower-frequency geometry and higher-frequency detail in a coarse-to-fine hierarchy. We can then learn a generative diffusion process in this latent 3D scene space, modeling the latent components of a scene at each resolution level. To synthesize large-scale scenes with varying sizes, we train our diffusion model on scene patches and synthesize arbitrary-sized output 3D scenes through shared diffusion generation across multiple scene patches. Through extensive experiments, we demonstrate the efficacy and benefits of LT3SD for large-scale, high-quality unconditional 3D scene generation and for probabilistic completion for partial scene observations.
Poster
Yian Zhao · Wanshi Xu · Ruochong Zheng · Pengchong Qiao · Chang Liu · Jie Chen
[ ExHall D ]
Abstract
The efficient rendering and explicit nature of 3DGS promote the advancement of 3D scene manipulation. However, existing methods typically encounter challenges in controlling the manipulation region and are unable to furnish the user with interactive feedback, which inevitably leads to unexpected results. Intuitively, incorporating interactive 3D segmentation tools can compensate for this deficiency. Nevertheless, existing segmentation frameworks impose a pre-processing step of scene-specific parameter training, which limits the efficiency and flexibility of scene manipulation. To deliver a 3D region control module that is well-suited for scene manipulation with reliable efficiency, we propose interactive Segment-and-Manipulate 3D Gaussians (iSegMan), an interactive segmentation and manipulation framework that only requires simple 2D user interactions in any view. To propagate user interactions to other views, we propose Epipolar-guided Interaction Propagation (EIP), which innovatively exploits the epipolar constraint for efficient and robust interaction matching. To avoid scene-specific training and maintain efficiency, we further propose the novel Visibility-based Gaussian Voting (VGV), which obtains 2D segmentations from SAM and models the region extraction as a voting game between 2D pixels and 3D Gaussians based on Gaussian visibility. Taking advantage of the efficient and precise region control of EIP and VGV, we put forth a Manipulation Toolbox to implement various functions on selected regions, enhancing the …
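The epipolar constraint that EIP builds on can be illustrated with a minimal sketch: a user interaction in one view is accepted as a match in another view only if the candidate point lies close to the epipolar line induced by the relative pose. The code below illustrates that constraint only, not the authors' implementation; the intrinsics, pose, and pixel threshold are assumptions.

```python
# Minimal sketch of epipolar-constrained interaction matching (illustrative only;
# camera parameters and the 2-pixel threshold are assumptions).
import numpy as np

def skew(t):
    """Skew-symmetric matrix so that skew(t) @ x == np.cross(t, x)."""
    return np.array([[0, -t[2], t[1]],
                     [t[2], 0, -t[0]],
                     [-t[1], t[0], 0]])

def fundamental_matrix(K1, K2, R, t):
    """F maps a pixel x1 in view 1 to its epipolar line l2 = F @ x1 in view 2."""
    E = skew(t) @ R                      # essential matrix
    return np.linalg.inv(K2).T @ E @ np.linalg.inv(K1)

def epipolar_distance(F, x1, x2):
    """Distance (in pixels) of x2 from the epipolar line of x1."""
    l2 = F @ np.append(x1, 1.0)          # line coefficients (a, b, c)
    return abs(l2 @ np.append(x2, 1.0)) / np.hypot(l2[0], l2[1])

# Toy example: two views related by a pure horizontal baseline.
K = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
R, t = np.eye(3), np.array([0.2, 0.0, 0.0])
F = fundamental_matrix(K, K, R, t)
click_view1 = np.array([300., 200.])
candidates_view2 = [np.array([280., 200.]), np.array([280., 260.])]
matches = [c for c in candidates_view2 if epipolar_distance(F, click_view1, c) < 2.0]
print(matches)   # only candidates near the epipolar line survive
```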
Poster
Jianxiong Shen · Yue Qian · Xiaohang Zhan
[ ExHall D ]
Abstract
Current 3D Gaussian Splatting methods often overlook structured information, leading to disorganized 3D Gaussians that compromise memory and structure efficiency, particularly in Level-of-Detail (LOD) management. This inefficiency results in excessive memory use, limiting their application in resource-sensitive environments like virtual reality and interactive simulations. To overcome these challenges, we introduce a scalable Gaussian Soup that enables high-performance LOD management with progressively reduced memory usage. Our method utilizes triangle primitives with Gaussian splats embedded and adaptive pruning/growing strategies to ensure high-quality scene rendering with significantly reduced memory demands. By embedding neural Gaussians within the triangle primitives through the triangle-MLP, we achieve further memory savings while maintaining rendering fidelity. Experimental results demonstrate that our approach achieves consistently superior performance to recent leading techniques across various LODs while progressively reducing memory usage.
Poster
Yifei Liu · Zhihang Zhong · Yifan Zhan · Sheng Xu · Xiao Sun
[ ExHall D ]
Abstract
While 3D Gaussian Splatting (3DGS) has demonstrated remarkable performance in novel view synthesis and real-time rendering, the high memory consumption due to the use of millions of Gaussians limits its practicality. To mitigate this issue, improvements have been made by pruning unnecessary Gaussians, either through a hand-crafted criterion or by using learned masks. However, these methods deterministically remove Gaussians based on a snapshot of the pruning moment, leading to sub-optimal reconstruction performance from a long-term perspective. To address this issue, we introduce MaskGaussian, which models Gaussians as probabilistic entities rather than permanently removing them, and utilizes them according to their probability of existence. To achieve this, we propose a masked-rasterization technique that enables unused yet probabilistically existing Gaussians to receive gradients, allowing for dynamic assessment of their contribution to the evolving scene and adjustment of their probability of existence. Hence, the importance of Gaussians iteratively changes and the pruned Gaussians are selected diversely. Various experimental results demonstrate the superiority of the proposed method in achieving better rendering quality with fewer Gaussians, compared to previous pruning methods.
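One common way to let stochastically sampled binary masks still carry gradients is a straight-through relaxation over a per-Gaussian existence probability. The sketch below illustrates that general idea under assumed names; the paper's masked rasterization differs in how the mask enters the renderer.

```python
# Sketch of probabilistic Gaussian masking with a straight-through relaxation
# (illustrative; not the authors' masked-rasterization code).
import torch

n_gaussians = 1000
exist_logits = torch.zeros(n_gaussians, requires_grad=True)   # learnable existence logits
opacity = torch.rand(n_gaussians, requires_grad=True)

def sample_mask(logits):
    p = torch.sigmoid(logits)                    # probability of existence
    hard = torch.bernoulli(p.detach())           # stochastic 0/1 mask
    # Straight-through: forward uses the hard sample, backward flows through p,
    # so even currently "unused" Gaussians keep receiving gradient signal.
    return hard + p - p.detach()

mask = sample_mask(exist_logits)
effective_opacity = opacity * mask               # masked contribution to rendering

# Stand-in for a rendering loss plus a sparsity term encouraging pruning.
loss = effective_opacity.sum() * 1e-3 + torch.sigmoid(exist_logits).mean() * 1e-2
loss.backward()
print(exist_logits.grad.abs().mean())            # every Gaussian receives a gradient
```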
Poster
Kun Yang · Yuxiang Liu · Zeyu Cui · Yu Liu · Maojun Zhang · Shen Yan · Qing Wang
[ ExHall D ]
Abstract
Thermal infrared imaging offers the advantage of all-weather capability, enabling non-intrusive measurement of an object's surface temperature. Consequently, thermal infrared images are employed to reconstruct 3D models that accurately reflect the temperature distribution of a scene, aiding in applications such as building monitoring and energy management. However, existing approaches predominantly focus on static 3D reconstruction for a single time period, overlooking the impact of environmental factors on thermal radiation and failing to predict or analyze temperature variations over time. To address these challenges, we propose the NTR-Gaussian method, which treats temperature as a form of thermal radiation, incorporating elements like convective heat transfer and radiative heat dissipation. Our approach utilizes neural networks to predict thermodynamic parameters such as emissivity, convective heat transfer coefficient, and heat capacity. By integrating these predictions, we can accurately forecast thermal temperatures at various times throughout a nighttime scene. Furthermore, we introduce a dynamic dataset specifically for nighttime thermal imagery. Extensive experiments and evaluations demonstrate that NTR-Gaussian significantly outperforms comparison methods in thermal reconstruction, achieving a predicted temperature error within 1 degree Celsius.
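The nighttime temperature dynamics invoked above (convective heat transfer plus radiative dissipation) can be illustrated with a textbook lumped-parameter model integrated forward in time. This is only an illustration of the underlying physics, not the paper's formulation; all coefficients below are example values.

```python
# Toy surface-temperature model combining Newton's law of cooling and
# Stefan-Boltzmann radiation (an illustration, not NTR-Gaussian itself).
import numpy as np

SIGMA = 5.670e-8          # Stefan-Boltzmann constant [W m^-2 K^-4]

def simulate_cooling(T0, T_air, T_sky, h, emissivity, heat_capacity, hours, dt=60.0):
    """Forward-Euler integration of dT/dt for one surface patch (temperatures in Kelvin)."""
    T = T0
    temps = []
    for _ in range(int(hours * 3600 / dt)):
        q_conv = h * (T_air - T)                              # convective exchange
        q_rad = emissivity * SIGMA * (T_sky**4 - T**4)        # radiative dissipation
        T = T + dt * (q_conv + q_rad) / heat_capacity         # areal heat capacity [J m^-2 K^-1]
        temps.append(T)
    return np.array(temps)

# Example: a wall at 295 K cooling overnight toward 283 K air under a colder sky.
curve = simulate_cooling(T0=295.0, T_air=283.0, T_sky=263.0,
                         h=10.0, emissivity=0.9, heat_capacity=2.0e5, hours=8)
print(curve[::60][:5])    # hourly samples of the predicted surface temperature
```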
Poster
Yexing Xu · Longguang Wang · Minglin Chen · Sheng Ao · Li Li · Yulan Guo
[ ExHall D ]
Abstract
Although 3D Gaussian Splatting (3DGS) has demonstrated promising results in novel view synthesis, its performance degrades dramatically with sparse inputs and generates undesirable artifacts. As the number of training views decreases, the novel view synthesis task degrades to a highly under-determined problem such that existing methods suffer from the notorious overfitting issue. Interestingly, we observe that models with fewer Gaussian primitives exhibit less overfitting under sparse inputs. Inspired by this observation, we propose a Random Dropout Regularization (RDR) to exploit the advantages of low-complexity models to alleviate overfitting. In addition, to remedy the lack of high-frequency details for these models, an Edge-guided Splitting Strategy (ESS) is developed. With these two techniques, our method (termed DropoutGS) provides a simple yet effective plug-in approach to improve the generalization performance of existing 3DGS methods. Extensive experiments show that our DropoutGS produces state-of-the-art performance under sparse views on benchmark datasets including Blender, LLFF, and DTU.
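A minimal sketch of the dropout idea, with an assumed drop rate and rescaling (not the authors' code): during each training iteration a random subset of Gaussian primitives is temporarily deactivated, so the effective model behaves like a lower-complexity one and overfits less under sparse views.

```python
# Sketch of random dropout over Gaussian primitives during training
# (illustrative; drop rate and rescaling are assumptions, not DropoutGS itself).
import torch

def dropout_gaussians(opacities, drop_rate=0.1, training=True):
    """Randomly zero out a fraction of Gaussians for the current iteration.

    Rescaling by 1/(1 - drop_rate) keeps the expected accumulated opacity
    roughly unchanged, mirroring standard dropout.
    """
    if not training or drop_rate == 0.0:
        return opacities
    keep = (torch.rand_like(opacities) > drop_rate).float()
    return opacities * keep / (1.0 - drop_rate)

opacities = torch.rand(10000, requires_grad=True)
effective = dropout_gaussians(opacities, drop_rate=0.1)
# `effective` would be passed to the differentiable rasterizer in place of the full
# opacity vector during training; at test time all Gaussians are used.
print(effective.shape, (effective == 0).float().mean())
```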
Poster
Yecong Wan · Mingwen Shao · Yuanshuo Cheng · Wangmeng Zuo
[ ExHall D ]
Abstract
In this paper, we aim ambitiously for a realistic yet challenging problem, namely, how to reconstruct high-quality 3D scenes from sparse low-resolution views that simultaneously suffer from deficient perspectives and contents. Whereas existing methods only deal with either sparse views or low-resolution observations, they fail to handle such hybrid and complicated scenarios. To this end, we propose a novel Sparse-view Super-resolution 3D Gaussian Splatting framework, dubbed S2Gaussian, that can reconstruct structure-accurate and detail-faithful 3D scenes with only sparse and low-resolution views. The S2Gaussian operates in a two-stage fashion. In the first stage, we initially optimize a low-resolution Gaussian representation with depth regularization and densify it to initialize the high-resolution Gaussians through a tailored Gaussian Shuffle operation. In the second stage, we refine the high-resolution Gaussians with the super-resolved images generated from both original sparse views and pseudo-views rendered by the low-resolution Gaussians, in which a customized blur-free inconsistency modeling scheme and a 3D robust optimization strategy are elaborately designed to mitigate multi-view inconsistency and eliminate erroneous updates caused by imperfect supervision. Extensive experiments demonstrate superior results and, in particular, establish new state-of-the-art performance with more consistent geometry and finer details.
Poster
Yihao Wang · Marcus Klasson · Matias Turkulainen · Shuzhe Wang · Juho Kannala · Arno Solin
[ ExHall D ]
Abstract
Gaussian splatting enables fast novel-view synthesis in static 3D environments. However, reconstructing real-world environments remains challenging as distractors or occluders break the multi-view consistency assumption required for good 3D reconstruction. Most existing methods for unconstrained novel-view synthesis rely on external semantic information from foundation models, which introduces additional computational overhead as pre-processing steps or during optimization. In this work, we propose a novel method that directly separates occluders and static scene elements purely based on volume-rendering of Gaussian primitives. We model occluders in training images as two-dimensional hallucinations embedded in the camera views and apply alpha-blending in a back-to-front manner in which a 3D scene and 2D occluders are explicitly modelled within the alpha-compositing stages. Our approach implicitly separates the scene into distinct static and distractor elements achieving comparable results to prior approaches with only photometric supervision. We demonstrate the effectiveness of our approach on in-the-wild datasets.
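Compositing a per-image 2D occluder layer over the rendered static 3D scene in a back-to-front manner reduces to standard alpha blending. The per-pixel sketch below illustrates only that compositing step; the array names and toy inputs are assumptions.

```python
# Minimal per-pixel sketch of blending a 2D occluder layer (front) over the rendered
# 3D scene (back); illustrative only, not the paper's rendering pipeline.
import numpy as np

def composite(scene_rgb, occluder_rgb, occluder_alpha):
    """Alpha-blend the occluder layer over the static scene render.

    scene_rgb, occluder_rgb: (H, W, 3) arrays in [0, 1].
    occluder_alpha:          (H, W, 1) per-pixel opacity of the occluder layer.
    """
    return occluder_alpha * occluder_rgb + (1.0 - occluder_alpha) * scene_rgb

H, W = 4, 4
scene = np.full((H, W, 3), 0.2)                      # static 3D scene render
occluder = np.full((H, W, 3), 0.9)                   # hallucinated distractor colors
alpha = np.zeros((H, W, 1)); alpha[1:3, 1:3] = 1.0   # distractor covers a small patch
image = composite(scene, occluder, alpha)
print(image[0, 0], image[1, 1])                      # background pixel vs. occluded pixel
```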
Poster
Zhihao Liu · Zhanglin Cheng · Naoto Yokoya
[ ExHall D ]
Abstract
Obtaining high-quality, practically usable 3D models of biological plants remains a significant challenge in computer vision and graphics. In this paper, we present a novel method for generating realistic 3D plant models from single-view photographs. Our approach employs a neural decomposition technique to learn a lightweight hierarchical box representation from the image, effectively capturing the structures and botanical features of plants. Then, this representation can be subsequently refined through a shape-guided parametric modeling module to produce complete 3D plant models. By combining hierarchical learning and parametric modeling, our method generates structured 3D plant assets with fine geometric details. Notably, through learning the decomposition in different levels of detail, our method can adapt to two distinct plant categories: outdoor trees and houseplants, each with unique appearance features. Within the scope of plant modeling, our method is the first comprehensive solution capable of reconstructing both plant categories from single-view images.
Poster
Xiang Li · Zixuan Huang · Anh Thai · James Rehg
[ ExHall D ]
Abstract
Symmetry is a ubiquitous and fundamental property in the visual world, serving as a critical cue for perception and structure interpretation. This paper investigates the detection of 3D reflection symmetry from a single RGB image, and reveals its significant benefit on single-image 3D generation. We introduce Reflect3D, a scalable, zero-shot symmetry detector capable of robust generalization to diverse and real-world scenarios. Inspired by the success of foundation models, our method scales up symmetry detection with a transformer-based architecture. We also leverage generative priors from multi-view diffusion models to address the inherent ambiguity in single-view symmetry detection. Extensive evaluations on various data sources demonstrate that Reflect3D establishes a new state-of-the-art in single-image symmetry detection. Furthermore, we show the practical benefit of incorporating detected symmetry into single-image 3D generation pipelines through a symmetry-aware optimization process. The integration of symmetry significantly enhances the structural accuracy, cohesiveness, and visual fidelity of the reconstructed 3D geometry and textures, advancing the capabilities of 3D content creation.
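A candidate 3D reflection plane can be scored by mirroring a point cloud across it and measuring how well the reflection overlaps the original, e.g., with a nearest-neighbor (chamfer-style) distance. The following sketch shows this classical check, not the Reflect3D detector or its symmetry-aware optimization.

```python
# Sketch of scoring a candidate reflection plane on a point cloud (illustrative only).
import numpy as np

def reflect(points, normal, offset):
    """Reflect points across the plane {x : normal . x + offset = 0} (unit normal)."""
    d = points @ normal + offset
    return points - 2.0 * d[:, None] * normal[None, :]

def symmetry_score(points, normal, offset):
    """Mean nearest-neighbor distance between the cloud and its mirror image (lower = more symmetric)."""
    mirrored = reflect(points, normal, offset)
    dists = np.linalg.norm(points[:, None, :] - mirrored[None, :, :], axis=-1)
    return dists.min(axis=1).mean()

rng = np.random.default_rng(0)
half = rng.uniform(-1, 1, size=(200, 3)); half[:, 0] = np.abs(half[:, 0])
cloud = np.concatenate([half, half * np.array([-1.0, 1.0, 1.0])])  # symmetric about x = 0
print(symmetry_score(cloud, np.array([1.0, 0.0, 0.0]), 0.0))       # ~0 for the true plane
print(symmetry_score(cloud, np.array([0.0, 1.0, 0.0]), 0.0))       # larger for a wrong plane
```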
Poster
Zhao Dong · Ka chen · Zhaoyang Lv · Hong-Xing Yu · Yunzhi Zhang · Cheng Zhang · Yufeng Zhu · Stephen Tian · Zhengqin Li · Geordie Moffatt · Sean Christofferson · James Fort · Xiaqing Pan · Mingfei Yan · Jiajun Wu · Carl Ren · Richard Newcombe
[ ExHall D ]
Abstract
We introduce Digital Twin Catalog (DTC), a new large-scale photorealistic 3D object digital twin dataset. A digital twin of a 3D object is a highly detailed, virtually indistinguishable representation of a physical object, accurately capturing its shape, appearance, physical properties, and other attributes. Recent advances in neural-based 3D reconstruction and inverse rendering have significantly improved the quality of 3D object reconstruction. Despite these advancements, there remains a lack of a large-scale, digital twin quality real-world dataset and benchmark that can quantitatively assess and compare the performance of different reconstruction methods, as well as improve reconstruction quality through training or fine-tuning. Moreover, to democratize 3D digital twin creation, it is essential to integrate creation techniques with next-generation egocentric computing platforms, such as AR glasses. Currently, there is no dataset available to evaluate 3D object reconstruction using egocentric captured images. To address these gaps, the DTC dataset features 2,000 scanned digital twin-quality 3D objects, along with image sequences captured under different lighting conditions using DSLR cameras and egocentric AR glasses. This dataset establishes the first comprehensive real-world evaluation benchmark for 3D digital twin creation tasks, offering a robust foundation for comparing and improving existing reconstruction methods. We will make the full dataset …
Poster
Vitor Guizilini · Zubair Irshad · Dian Chen · Greg Shakhnarovich · Rares Andrei Ambrus
[ ExHall D ]
Abstract
Current methods for 3D scene reconstruction from sparse posed images employ intermediate 3D representations such as neural fields, voxel grids, or 3D Gaussians, to achieve multi-view consistent scene appearance and geometry. In this paper we introduce MVGD, a diffusion-based architecture capable of direct pixel-level generation of images and depth maps from novel viewpoints, given an arbitrary number of input views. Our method uses raymap conditioning to both augment visual features with spatial information from different viewpoints, as well as to guide the generation of images and depth maps from novel views. A key aspect of our approach is the multi-task generation of images and depth maps, using learnable task embeddings to guide the diffusion process towards specific modalities. We train this model on a collection of more than 60 million multi-view samples from publicly available datasets, and propose techniques to enable efficient and consistent learning in such diverse conditions. We also propose a novel strategy that enables the efficient training of larger models by incrementally fine-tuning smaller ones, with promising scaling behavior. Through extensive experiments, we report state-of-the-art results in multiple novel view synthesis benchmarks, as well as multi-view stereo and video depth estimation.
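Raymap conditioning attaches to every pixel the origin and direction of its camera ray. A minimal construction from intrinsics and a camera-to-world pose is sketched below; the conventions (pixel-center offset, channel layout) are assumptions and may differ from the paper's parameterization.

```python
# Minimal sketch of building a per-pixel raymap from intrinsics and a camera-to-world
# pose (conventions are assumptions; illustrative only).
import numpy as np

def build_raymap(K, cam_to_world, height, width):
    """Return a (H, W, 6) map of ray origins and unit directions in world coordinates."""
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    pix = np.stack([xs + 0.5, ys + 0.5, np.ones_like(xs, dtype=float)], axis=-1)
    dirs_cam = pix @ np.linalg.inv(K).T                 # unproject pixels to camera space
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    dirs_world = dirs_cam @ R.T
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)
    origins = np.broadcast_to(t, dirs_world.shape)
    return np.concatenate([origins, dirs_world], axis=-1)

K = np.array([[128., 0., 64.], [0., 128., 64.], [0., 0., 1.]])
pose = np.eye(4)                                        # camera at the world origin
raymap = build_raymap(K, pose, height=128, width=128)
print(raymap.shape, raymap[64, 64, 3:])                 # center pixel looks roughly along +z
```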
Poster
Lingen Li · Zhaoyang Zhang · Yaowei Li · Jiale Xu · Wenbo Hu · Xiaoyu Li · Weihao Cheng · Jinwei Gu · Tianfan Xue · Ying Shan
[ ExHall D ]
Abstract
Recent advancements in generative models have significantly improved novel view synthesis (NVS) from multi-view data. However, existing methods still depend on external multi-view alignment processes, such as explicit pose estimation or pre-reconstruction, which limits their flexibility and accessibility, especially when alignment is unstable due to insufficient overlap or occlusions between views. In this paper, we propose NVComposer, a novel approach that eliminates the need for explicit external alignment. NVComposer enables the generative model to implicitly infer spatial and geometric relationships between multiple conditional views by introducing two key components: 1) an image-pose dual-stream diffusion model that simultaneously generates target novel views and condition camera poses, and 2) a geometry-aware feature alignment module that distills geometric priors from pretrained dense stereo models during training. Extensive experiments demonstrate that NVComposer achieves state-of-the-art performance in generative multi-view NVS tasks, removing the reliance on external alignment and thus improving model accessibility. Our approach shows substantial improvements in synthesis quality as the number of unposed input views increases, highlighting its potential for more flexible and accessible generative NVS systems.
Poster
Jingyu Lin · Jiaqi Gu · Lubin Fan · Bojian Wu · Yujing Lou · Renjie Chen · Ligang Liu · Jieping Ye
[ ExHall D ]
Abstract
Generating high-quality novel view renderings of 3D Gaussian Splatting (3DGS) in scenes featuring transient objects is challenging. We propose a novel hybrid representation, termed HybridGS, using 2D Gaussians for transient objects per image and maintaining traditional 3D Gaussians for the whole static scene. Note that 3DGS itself is better suited for modeling static scenes that assume multi-view consistency, but the transient objects appear occasionally and do not adhere to this assumption; thus, we model them as planar objects from a single view, represented with 2D Gaussians. Our novel representation decomposes the scene from the perspective of fundamental viewpoint consistency, making it more reasonable. Additionally, we present a novel multi-view regulated supervision method for 3DGS that leverages information from co-visible regions, further enhancing the distinctions between the transients and statics. Then, we propose a straightforward yet effective multi-stage training strategy to ensure robust training and high-quality view synthesis across various settings. Experiments on benchmark datasets show our state-of-the-art performance in novel view synthesis for both indoor and outdoor scenes, even in the presence of distracting elements.
Poster
Hanwen Liang · Junli Cao · Vidit Goel · Guocheng Qian · Sergei Korolev · Demetri Terzopoulos · Konstantinos N. Plataniotis · Sergey Tulyakov · Jian Ren
[ ExHall D ]
Abstract
This paper addresses a challenging question: how can we efficiently create high-quality, wide-scope 3D scenes from a single arbitrary image? Existing methods face several constraints, such as requiring multi-view data, time-consuming per-scene optimization, low visual quality, and distorted reconstructions for unseen areas. We propose a novel pipeline to overcome these limitations. Specifically, we introduce a large-scale reconstruction model that uses latents from a video diffusion model to predict 3D Gaussian Splatting, even from a single-condition image, in a feed-forward manner. The video diffusion model is designed to precisely follow a specified camera trajectory, allowing it to generate compressed latents that contain multi-view information while maintaining 3D consistency. We further train the 3D reconstruction model to operate on the video latent space with a progressive training strategy, enabling the generation of high-quality, wide-scope, and generic 3D scenes. Extensive evaluations on various datasets show that our model significantly outperforms existing methods for single-view 3D rendering, particularly with out-of-domain images. For the first time, we demonstrate that a 3D reconstruction model can be effectively built upon the latent space of a diffusion model to realize efficient 3D scene generation.
Poster
Zhen Lv · Yangqi Long · Congzhentao Huang · Cao Li · Chengfei Lv · Hao Ren · Dian Zheng
[ ExHall D ]
Abstract
Stereo video synthesis from a monocular input is a demanding task in the fields of spatial computing and virtual reality. The main challenges of this task lie in the insufficiency of high-quality paired stereo videos for training and the difficulty of maintaining spatio-temporal consistency between frames. Existing methods mainly handle these problems by directly applying novel view synthesis (NVS) methods to video, which is naturally unsuitable. In this paper, we introduce a novel self-supervised stereo video synthesis paradigm via a video diffusion model, termed SpatialDreamer, which meets the challenges head-on. Firstly, to address the stereo video data insufficiency, we propose a Depth-based Video Generation (DVG) module, which employs a forward-backward rendering mechanism to generate paired videos with geometric and temporal priors. Leveraging data generated by DVG, we propose RefinerNet along with a self-supervised synthetic framework designed to facilitate efficient and dedicated training. More importantly, we devise a consistency control module, which consists of a metric of stereo deviation strength and a Temporal Interaction Learning (TIL) module to ensure geometric and temporal consistency, respectively. We evaluated the proposed method against various benchmark methods, with the results showcasing its superior performance.
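Depth-based generation of a stereo counterpart can be sketched as disparity-driven forward warping: disparity = focal x baseline / depth, and each left-view pixel is splatted to x - disparity in the right view, with a z-buffer resolving collisions and dis-occluded holes left for later inpainting. This is an illustration of the general mechanism, not the DVG module; all parameters are example values.

```python
# Sketch of depth-based forward warping of a left view into a right view
# (illustrative only; occlusion handling here is a simple z-buffer).
import numpy as np

def warp_left_to_right(left_rgb, depth, focal, baseline):
    """Splat left-view pixels into the right view using disparity = focal*baseline/depth."""
    h, w, _ = left_rgb.shape
    right = np.zeros_like(left_rgb)
    zbuf = np.full((h, w), np.inf)
    disparity = focal * baseline / depth
    for y in range(h):
        for x in range(w):
            xr = int(round(x - disparity[y, x]))
            if 0 <= xr < w and depth[y, x] < zbuf[y, xr]:   # keep the nearest surface
                right[y, xr] = left_rgb[y, x]
                zbuf[y, xr] = depth[y, x]
    return right                        # holes (dis-occlusions) remain to be inpainted

h, w = 32, 32
left = np.random.rand(h, w, 3)
depth = np.full((h, w), 4.0); depth[:, 16:] = 2.0       # a nearer object on the right half
right = warp_left_to_right(left, depth, focal=100.0, baseline=0.1)
print((right.sum(axis=-1) == 0).mean())                 # fraction of holes to inpaint
```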
Poster
Yunzhi Yan · Zhen Xu · Haotong Lin · Haian Jin · Haoyu Guo · Yida Wang · Kun Zhan · XianPeng Lang · Hujun Bao · Xiaowei Zhou · Sida Peng
[ ExHall D ]
Abstract
This paper aims to tackle the problem of photorealistic view synthesis from vehicle sensor data. Recent advancements in neural scene representation have achieved notable success in rendering high-quality autonomous driving scenes, but the performance significantly degrades as the viewpoint deviates from the training trajectory. To mitigate this problem, we introduce StreetCrafter, a novel controllable video diffusion model that utilizes LiDAR point cloud renderings as pixel-level conditions, which fully exploits the generative prior for novel view synthesis, while preserving precise camera control. Moreover, the utilization of pixel-level LiDAR conditions allows us to make accurate pixel-level edits to target scenes. In addition, the generative prior of StreetCrafter can be effectively incorporated into dynamic scene representations to achieve real-time rendering. Experiments on the Waymo Open and Pandaset datasets demonstrate that our model enables flexible control over viewpoint changes, enlarging the view synthesis regions for satisfactory rendering, and outperforms existing methods.
Poster
Jiadong Tang · Yu Gao · Dianyi Yang · Liqi Yan · Yufeng Yue · Yi Yang
[ ExHall D ]
Abstract
Drones have become essential tools for reconstructing wild scenes due to their outstanding maneuverability. Recent advances in radiance field methods have achieved remarkable rendering quality, providing a new avenue for 3D reconstruction from drone imagery. However, dynamic distractors in wild environments challenge the static scene assumption in radiance fields, while limited view constraints hinder the accurate capture of underlying scene geometry. To address these challenges, we introduce DroneSplat, a novel framework designed for robust 3D reconstruction from in-the-wild drone imagery. Our method adaptively adjusts masking thresholds by integrating local-global segmentation heuristics with statistical approaches, enabling precise identification and elimination of dynamic distractors in static scenes. We enhance 3D Gaussian Splatting with multi-view stereo predictions and a voxel-guided optimization strategy, supporting high-quality rendering under limited view constraints. For comprehensive evaluation, we provide a drone-captured 3D reconstruction dataset encompassing both dynamic and static scenes. Extensive experiments demonstrate that DroneSplat outperforms both 3DGS and NeRF baselines in handling in-the-wild drone imagery.
Poster
Cong Ruan · Yuesong Wang · Bin Zhang · Lili Ju · Tao Guan
[ ExHall D ]
Abstract
3D Gaussian Splatting (3DGS) has shown impressive performance in scene reconstruction, offering high rendering quality and rapid rendering speed with short training time. However, it often yields unsatisfactory results when applied to indoor scenes due to its poor ability to learn geometries without enough textural information. In this paper, we propose a new 3DGS-based method, IndoorGS, which leverages the commonly found yet important geometric cues in indoor scenes to improve the reconstruction quality. Specifically, we first extract 2D lines from input images and fuse them into 3D line cues via feature-based matching, which can provide a structural understanding of the target scene. We then apply statistical outlier removal to refine Structure-from-Motion (SfM) points, ensuring robust cues in texture-rich areas. Based on these two types of cues, we further extract reliable 3D plane cues for textureless regions. Such geometric information will be utilized not only for initialization but also in the realization of a geometric-cue-guided adaptive density control (ADC) strategy. The proposed ADC approach is grounded in the principle of divide-and-conquer and optimizes the use of each type of geometric cue to enhance overall reconstruction performance. Extensive experiments on multiple indoor datasets show that our method can deliver much more …
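Statistical outlier removal for SfM points is a standard filter: discard points whose mean distance to their k nearest neighbors is far above the population statistics. A brute-force sketch with assumed parameters (k, threshold) follows; it is not the paper's exact setting.

```python
# Sketch of statistical outlier removal for SfM points (a standard filter; k and the
# threshold are assumptions).
import numpy as np

def remove_statistical_outliers(points, k=8, std_ratio=2.0):
    """Keep points whose mean k-NN distance is below mean + std_ratio * std."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    np.fill_diagonal(dists, np.inf)                      # ignore self-distances
    knn_mean = np.sort(dists, axis=1)[:, :k].mean(axis=1)
    threshold = knn_mean.mean() + std_ratio * knn_mean.std()
    return points[knn_mean < threshold]

rng = np.random.default_rng(1)
inliers = rng.normal(0.0, 0.1, size=(500, 3))            # dense structure
outliers = rng.uniform(-5.0, 5.0, size=(20, 3))          # spurious SfM points
cleaned = remove_statistical_outliers(np.concatenate([inliers, outliers]))
print(len(cleaned))                                      # most spurious points are removed
```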
Poster
Xiaohao Xu · Feng Xue · Shibo Zhao · Yike Pan · Sebastian Scherer · Xiaonan Huang
[ ExHall D ]
Abstract
Real-time multi-agent collaboration for ego-motion estimation and high-fidelity 3D reconstruction is vital for scalable spatial intelligence. However, traditional methods produce sparse, low-detail maps, while recent dense mapping approaches struggle with high latency. To overcome these challenges, we present MAC-Ego3D, a novel framework for real-time collaborative photorealistic 3D reconstruction via Multi-Agent Gaussian Consensus. MAC-Ego3D enables agents to independently construct, align, and iteratively refine local maps using a unified Gaussian splat representation. Through Intra-Agent Gaussian Consensus, it enforces spatial coherence among neighboring Gaussian splats within an agent. For global alignment, parallelized Inter-Agent Gaussian Consensus, which asynchronously aligns and optimizes local maps by regularizing multi-agent Gaussian splats, seamlessly integrates them into a high-fidelity 3D model. Leveraging Gaussian primitives, MAC-Ego3D supports efficient RGB-D rendering, enabling rapid inter-agent Gaussian association and alignment. MAC-Ego3D bridges local precision and global coherence, delivering higher efficiency, largely reducing localization error, and improving mapping fidelity. It establishes a new SOTA on synthetic and real-world benchmarks, achieving a 15× increase in inference speed, order-of-magnitude reductions in ego-motion estimation error for partial cases, and RGB PSNR gains of 4 to 10 dB. Our code will be made publicly available.
Poster
Sangmin Kim · Seunguk Do · Jaesik Park
[ ExHall D ]
Abstract
Reconstructing dynamic radiance fields from video clips is challenging, especially when entertainment videos like TV shows are given. The reconstruction is difficult due to (1) actors occluding each other and having diverse facial expressions, (2) cluttered stages, and (3) small-baseline views or sudden shot changes. To address these issues, we present VirtualPCR, a comprehensive reconstruction pipeline that allows the editing of scenes like how video clips are made in a production control room (PCR). In VirtualPCR, a 3DLocator module locates recovered actors on the stage using a depth prior, and it estimates unseen human poses via interpolation. The proposed ShotMatcher module tracks the actors under shot changes. VirtualPCR has a face-fitting branch that dynamically recovers the actors' expressions. Experiments on the Sitcoms3D dataset show that our pipeline can reassemble TV shows with new cameras at different time steps. We also show that VirtualPCR offers interesting applications, such as scene editing, synthetic shot-making, relocating actors, and human pose manipulation.
Poster
Qiang Hu · Zihan Zheng · Houqiang Zhong · Sihua Fu · Li Song · Xiaoyun Zhang · Guangtao Zhai · Yanfeng Wang
[ ExHall D ]
Abstract
3D Gaussian Splatting (3DGS) has substantial potential for enabling photorealistic Free-Viewpoint Video (FVV) experiences. However, the vast number of Gaussians and their associated attributes poses significant challenges for storage and transmission. Existing methods typically handle dynamic 3DGS representation and compression separately, neglecting motion information and the rate-distortion (RD) trade-off during training, leading to performance degradation and increased model redundancy. To address this gap, we propose 4DGC, a novel rate-aware 4D Gaussian compression framework that significantly reduces storage size while maintaining superior RD performance for FVV. Specifically, 4DGC introduces a motion-aware dynamic Gaussian representation that utilizes a compact motion grid combined with sparse compensated Gaussians to exploit inter-frame similarities. This representation effectively handles large motions, preserving quality and reducing temporal redundancy. Furthermore, we present an end-to-end compression scheme that employs differentiable quantization and a tiny implicit entropy model to compress the motion grid and compensated Gaussians efficiently. The entire framework is jointly optimized using a rate-distortion trade-off. Extensive experiments demonstrate that 4DGC supports variable bitrates and consistently outperforms existing methods in RD performance across multiple datasets.
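Differentiable quantization with a rate-distortion objective is commonly realized with straight-through rounding and a loss of the form distortion + lambda x rate, where the rate is estimated by an entropy model. The sketch below uses a stand-in factorized Gaussian entropy model and assumed shapes; it illustrates the training pattern, not 4DGC's actual entropy model.

```python
# Sketch of straight-through quantization with a rate-distortion training objective
# (illustrative; the entropy model and shapes are stand-ins).
import math
import torch

def quantize_ste(x):
    """Round in the forward pass, pass gradients straight through in the backward pass."""
    return x + (torch.round(x) - x).detach()

motion_grid = torch.randn(1, 8, 16, 16, requires_grad=True)   # toy latent to compress
q = quantize_ste(motion_grid)

# Stand-in "entropy model": a factorized unit Gaussian whose negative log-likelihood
# approximates the bitrate needed to code the quantized symbols.
scale = torch.ones_like(q)
nll_nats = 0.5 * (q / scale) ** 2 + torch.log(scale * math.sqrt(2 * math.pi))
rate_bits = nll_nats.sum() / math.log(2.0)

target = torch.randn_like(q)                 # stand-in for rendering supervision
distortion = torch.nn.functional.mse_loss(q, target)

lmbda = 0.01
loss = distortion + lmbda * rate_bits        # rate-distortion trade-off
loss.backward()
print(float(distortion), float(rate_bits), motion_grid.grad is not None)
```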
Poster
Yiming Liang · Tianhan Xu · Yuta Kikuchi
[ ExHall D ]
Abstract
We present Hierarchical Motion Representation (HiMoR), a novel deformation representation for 3D Gaussian primitives that is capable of achieving high-quality monocular dynamic 3D reconstruction. HiMoR's foundation is the insight that motions in everyday scenes can be decomposed into coarser motion, which forms the basis for finer details. Using a tree structure, HiMoR's nodes represent different levels of motion detail, with shallower nodes depicting coarse motion for temporal smoothness and deeper nodes denoting finer motion. Additionally, our model uses a few shared motion bases to represent the motions of different sets of nodes, aligning with the assumption that motion is generally low-rank. This motion representation design provides Gaussians with a more rational deformation, maximizing the use of temporal relationships to tackle the challenging task of monocular dynamic 3D reconstruction. We also propose using a more reliable perceptual metric as a substitute, given that pixel-level metrics for evaluating monocular dynamic 3D reconstruction can sometimes fail to effectively reflect true reconstruction quality. Extensive experiments demonstrate our method's efficacy in achieving superior novel view synthesis from challenging monocular videos with complex motions.
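The shared-basis, low-rank view of motion can be written as a per-node linear combination of a few global trajectories: each node's displacement over time is its coefficients times the shared bases. The sketch below uses assumed shapes and omits the tree hierarchy; it only illustrates the low-rank factorization.

```python
# Sketch of a low-rank motion representation with shared bases
# (shapes and names are assumptions, not HiMoR's exact parameterization).
import torch

num_nodes, num_bases, num_frames = 64, 6, 100

# A few shared motion bases: each is a 3D trajectory over time.
motion_bases = torch.randn(num_bases, num_frames, 3) * 0.01
# Per-node coefficients select how strongly each node follows each basis.
coefficients = torch.randn(num_nodes, num_bases)

# Node displacements for every frame via a linear combination: (nodes, frames, 3).
displacements = torch.einsum("nb,bfc->nfc", coefficients, motion_bases)

rest_positions = torch.rand(num_nodes, 3)
positions_t = rest_positions[:, None, :] + displacements   # (nodes, frames, 3)
print(positions_t.shape)
# In a hierarchy, finer-level nodes would add residual motion on top of the coarse
# motion interpolated from their parent nodes.
```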
Poster
Siyuan Shen · Tianjia Shao · Kun Zhou · Chenfanfu Jiang · Yin Yang
[ ExHall D ]
Abstract
This paper presents a novel pipeline named EnliveningGS, which enables active locomotion of 3D models represented with 3D Gaussian splatting (3DGS). We are inspired by the fact that real-world living creatures pose their bodies in a natural and physically meaningful manner by compressing or elongating muscle fibers embedded in the body. EnliveningGS aims to replicate similar functionality for 3DGS models so that objects within a 3DGS scene act like living creatures rather than static shapes --- they walk, jump, and twist in the scene under provided motion trajectories driven by muscle activations. While the concept is straightforward, many challenging technical difficulties need to be taken care of. Synthesizing realistic locomotion of a 3DGS model embodies an inverse physics problem of very high dimensions. The core challenge is how to efficiently and robustly model frictional contacts between an "enlivened model" and the environment, as it is the composition of contact/collision/friction forces triggered by muscle activation that generates the final movement of the object. We propose a hybrid numerical method mixing LCP and penalty methods to tackle this NP-hard problem robustly. Our pipeline also addresses the limitation of existing 3DGS deformation algorithms and inpaints the missing information when models …
Poster
Hongye Cheng · Tianyu Wang · guangsi shi · Zexing Zhao · Yanwei Fu
[ ExHall D ]
Abstract
Co-speech gestures are crucial non-verbal cues that enhance speech clarity and expressiveness in human communication, which have attracted increasing attention in multimodal research. While the existing methods have made strides in gesture accuracy, challenges remain in generating diverse and coherent gestures, as most approaches assume independence among multimodal inputs and lack explicit modeling of their interactions. In this work, we propose a novel multimodal learning method named HOP for co-speech gesture generation that captures the heterogeneous entanglement between gesture motion, audio rhythm, and text semantics, enabling the generation of coordinated gestures. By leveraging spatiotemporal graph modeling, we achieve the alignment of audio and action. Moreover, to enhance modality coherence, we build the audio-text semantic representation based on a reprogramming module, which is beneficial for cross-modality adaptation. Our approach enables the trimodal system to learn each other's features and represent them in the form of topological entanglement. Extensive experiments demonstrate that HOP achieves state-of-the-art performance, offering more natural and expressive co-speech gesture generation.
Poster
Haolin Liu · Xiaohang Zhan · Zizheng Yan · Zhongjin Luo · Yuxin Wen · Xiaoguang Han
[ ExHall D ]
Abstract
Establishing character shape correspondence is a critical and fundamental task in computer vision and graphics, with diverse applications including re-topology, attribute transfer, and shape interpolation. Current dominant functional map methods, while effective in controlled scenarios, struggle in real situations with more complex challenges such as non-isometric shape discrepancies. In response, we revisit registration-for-correspondence methods and tap their potential for more stable shape correspondence estimation. To overcome their common issues including unstable deformations and the necessity for pre-alignment or high-quality initial 3D correspondences, we introduce Stable-SCore: A Stable Registration-based Framework for 3D Shape Correspondence. We first re-purpose a foundation model for 2D character correspondence that ensures reliable and stable 2D mappings. Crucially, we propose a novel Semantic Flow Guided Registration approach that leverages 2D correspondence to guide mesh deformations. Our framework significantly surpasses existing methods in more challenging scenarios, and brings possibilities for a wide array of real applications, as demonstrated in our results.
Poster
Bohan Yu · Jinxiu Liang · Zhuofeng Wang · Bin Fan · Art Subpaasa · Boxin Shi · Imari Sato
[ ExHall D ]
Abstract
Hyperspectral imaging plays a critical role in numerous scientific and industrial fields. Conventional hyperspectral imaging systems often struggle with the trade-off between capture speed, spectral resolution, and bandwidth, particularly in dynamic environments. In this work, we present a novel event-based active hyperspectral imaging system designed for real-time capture with low bandwidth in dynamic scenes. By combining an event camera with a dynamic illumination strategy, our system achieves unprecedented temporal resolution while maintaining high spectral fidelity, all at a fraction of the bandwidth requirements of traditional systems. Unlike basis-based methods that sacrifice spectral resolution for efficiency, our approach enables continuous spectral sampling through an innovative "sweeping rainbow" illumination pattern synchronized with a rotating mirror array. The key insight is leveraging the sparse, asynchronous nature of event cameras to encode spectral variations as temporal contrasts, effectively transforming the spectral reconstruction problem into a series of geometric constraints. Extensive evaluations on both synthetic and real data demonstrate that our system outperforms state-of-the-art methods in temporal resolution while maintaining competitive spectral reconstruction quality.
Poster
Yaniv Benny · Lior Wolf
[ ExHall D ]
Abstract
This paper proposes a novel method for omnidirectional 360° perception. Most common previous methods relied on equirectangular projection. This representation is easily applicable to 2D operation layers but introduces distortions into the image. Other methods attempted to remove the distortions by maintaining a sphere representation but relied on complicated convolution kernels that failed to show competitive results. In this work, we introduce a transformer-based architecture that, by incorporating a novel "Spherical Local Self-Attention" and other spherically-oriented modules, successfully operates in the spherical domain and outperforms the state-of-the-art in 360° perception benchmarks for depth estimation and semantic segmentation. Our code is attached as supplementary material.
Poster
Huan Zheng · Wencheng Han · Jianbing Shen
[ ExHall D ]
Abstract
Recovering high-quality depth maps from compressed sources has gained significant attention due to the limitations of consumer-grade depth cameras and the bandwidth restrictions during data transmission. However, current methods still suffer from two challenges. First, bit-depth compression produces a uniform depth representation in regions with subtle variations, hindering the recovery of detailed information. Second, densely distributed random noise reduces the accuracy of estimating the global geometric structure of the scene. To address these challenges, we propose a novel framework, termed geometry-decoupled network (GDNet), for compressed depth map super-resolution that decouples the high-quality depth map reconstruction process by handling global and detailed geometric features separately. To be specific, we propose the fine geometry detail encoder (FGDE), which is designed to aggregate fine geometry details in high-resolution low-level image features while simultaneously enriching them with complementary information from low-resolution context-level image features. In addition, we develop the global geometry encoder (GGE) that aims at suppressing noise and extracting global geometric information effectively via constructing compact feature representation in a low-rank space. We conduct experiments on multiple benchmark datasets, demonstrating that our GDNet significantly outperforms current methods in terms of geometric consistency and detail recovery. Our codes will be available.
Poster
Hongkai Lin · Dingkang Liang · Zhenghao Qi · Xiang Bai
[ ExHall D ]
Abstract
Underwater dense prediction, especially depth estimation and semantic segmentation, is crucial for comprehensively understanding underwater scenes. Nevertheless, high-quality and large-scale underwater datasets with dense annotations remain scarce because of the complex environment and the exorbitant data collection costs. This paper proposes a unified Text-to-Image and DEnse annotation generation method (TIDE) for underwater scenes. It relies solely on text as input to simultaneously generate realistic underwater images and multiple highly consistent dense annotations. Specifically, we unify the generation of text-to-image and text-to-dense annotations within a single model. The Implicit Layout Sharing mechanism (ILS) and a cross-modal interaction method called Time Adaptive Normalization (TAN) are introduced to jointly optimize the consistency between image and dense annotations. We synthesize a large underwater dataset using TIDE to validate the effectiveness of our method in underwater dense prediction tasks. The results demonstrate that our method effectively improves the performance of existing underwater dense prediction models and mitigates the scarcity of underwater data with dense annotations. Our method can offer new perspectives on alleviating data scarcity issues in other fields.
Poster
Jianing Li · Yunjian Zhang · Haiqian Han · Xiangyang Ji
[ ExHall D ]
Abstract
Conventional frame-based imaging for active stereo systems has encountered major challenges in fast-motion scenarios. However, how to design a novel paradigm for high-speed depth sensing still remains an open issue. In this paper, we propose a novel problem setting, namely active stereo event-based vision, which first integrates binocular event cameras and an infrared projector for high-speed depth sensing. Technically, we first build a stereo camera prototype system and present a real-world dataset with over 21.5k spatiotemporally synchronized labels at 15 Hz, while also creating a realistic synthetic dataset with stereo event streams and 23.8k synchronized labels at 20 Hz. Then, we propose ActiveEventNet, a lightweight yet effective active event-based stereo matching neural network that learns to generate high-quality dense disparity maps from stereo event streams with low latency. Experiments demonstrate that our ActiveEventNet outperforms state-of-the-art methods while significantly reducing computational complexity. Our solution offers superior depth sensing compared to conventional stereo cameras in high-speed scenarios, while also achieving inference speeds of up to 150 FPS with our prototype. We believe that this novel paradigm will provide new insights into future depth sensing systems. Our dataset descriptions and open-source code will be available in the supplemental material.
Poster
Zidong Cao · Jinjing Zhu · Weiming Zhang · Hao Ai · Haotian Bai · Hengshuang Zhao · Lin Wang
[ ExHall D ]
Abstract
Recently, Depth Anything Models (DAMs) - a type of depth foundation model - have demonstrated impressive zero-shot capabilities across diverse perspective images. Despite their success, it remains an open question regarding DAMs' performance on panorama images that enjoy a large field-of-view (180°x360°) but suffer from spherical distortions. To address this gap, we conduct an empirical analysis to evaluate the performance of DAMs on panoramic images and identify their limitations. For this, we undertake comprehensive experiments to assess the performance of DAMs from three key factors: panoramic representations, 360 camera positions for capturing scenarios, and spherical spatial transformations. This way, we reveal some key findings, e.g., DAMs are sensitive to spatial transformations. We then propose a semi-supervised learning (SSL) framework to learn a panoramic DAM, dubbed PanDA. Under the umbrella of SSL, PanDA first learns a teacher model by fine-tuning DAM through joint training on synthetic indoor and outdoor panoramic datasets. Then, a student model is trained using large-scale unlabeled data, leveraging pseudo-labels generated by the teacher model. To enhance PanDA's generalization capability, Möbius transformation-based spatial augmentation (MTSA) is proposed to impose consistency regularization between the predicted depth maps from the original and spatially transformed ones. This subtly improves the student …
Poster
Xunzhi Zheng · Dan Xu
[ ExHall D ]
Abstract
Learning accurate scene reconstruction without pose priors in neural radiance fields is challenging due to inherent geometric ambiguity. Recent developments either rely on correspondence priors for regularization or use off-the-shelf flow estimators to derive analytical poses. However, the potential for jointly learning scene geometry, camera poses, and dense flow within a unified neural representation remains largely unexplored. In this paper, we present Flow-NeRF, a unified framework that simultaneously optimizes scene geometry, camera poses, and dense optical flow all on-the-fly. To enable the learning of dense flow within the neural radiance field, we design and build a bijective mapping for flow estimation, conditioned on pose. To make the scene reconstruction benefit from the flow estimation, we develop an effective feature enhancement mechanism to pass canonical space features to world space representations, significantly enhancing scene geometry. We validate our model across four important tasks, i.e., novel view synthesis, depth estimation, camera pose prediction, and dense optical flow estimation, using several datasets. Our approach surpasses previous methods in almost all metrics for novel view synthesis and depth estimation and yields both qualitatively sound and quantitatively accurate novel-view flow. Our code and models will be made publicly available upon acceptance.
Poster
Jiaxi Deng · Yushen Wang · Haitao Meng · Zuoxun Hou · Yi Chang · Gang Chen
[ ExHall D ]
Abstract
Fast and reliable omnidirectional 3D sensing is essential to many applications such as autonomous driving, robotics and drone navigation. While many well-recognized methods have been developed to produce high-quality omnidirectional 3D information, they are too slow for real-time computation, limiting their feasibility in practical applications. Motivated by these shortcomings, we propose an efficient omnidirectional depth sensing framework, called OmniStereo, which generates high-quality 3D information in real-time. Unlike prior works, OmniStereo employs the Cassini projection to simplify photometric matching and introduces a lightweight stereo matching network to minimize computational overhead. Additionally, OmniStereo proposes a novel fusion method to handle depth discontinuities and invalid pixels, complemented by a refinement module to reduce mapping-introduced errors and recover fine details. As a result, OmniStereo achieves state-of-the-art (SOTA) accuracy, surpassing the second-best method by over 32% in MAE, while maintaining real-time efficiency. It operates more than 16.5× faster than the second-best method in accuracy on a TITAN RTX, achieving 12.3 FPS on the embedded device Jetson AGX Orin, underscoring its suitability for real-world deployment. The code will be open-sourced upon acceptance of the paper.
Poster
Luca Bartolomei · Fabio Tosi · Matteo Poggi · Stefano Mattoccia
[ ExHall D ]
Abstract
We introduce Stereo Anywhere, a novel stereo-matching framework that combines geometric constraints with robust priors from monocular depth Vision Foundation Models (VFMs). By elegantly coupling these complementary worlds through a dual-branch architecture, we seamlessly integrate stereo matching with learned contextual cues. Following this design, our framework introduces novel cost volume fusion mechanisms that effectively handle critical challenges such as textureless regions, occlusions, and non-Lambertian surfaces. Through our novel optical illusion dataset, MonoTrap, and extensive evaluation across multiple benchmarks, we demonstrate that our synthetic-only trained model achieves state-of-the-art results in zero-shot generalization, significantly outperforming existing solutions while showing remarkable robustness to challenging cases such as mirrors and transparencies.
Poster
Luigi Piccinelli · Christos Sakaridis · Mattia Segu · Yung-Hsu Yang · Siyuan Li · Wim Abbeloos · Luc Van Gool
[ ExHall D ]
Abstract
Monocular 3D estimation is crucial for visual perception. However, current methods fall short by relying on oversimplified assumptions, such as pinhole camera models or rectified images. These limitations severely restrict their general applicability, causing poor performance in real-world scenarios with fisheye or panoramic images and resulting in substantial context loss. To address this, we present UniK3D, the first generalizable method for monocular 3D estimation able to model any camera. Our method introduces a spherical 3D representation which allows for better disentanglement of camera and scene geometry and enables accurate metric 3D reconstruction for unconstrained camera models. Our camera component features a novel, model-independent representation of the pencil of rays, achieved through a learned superposition of spherical harmonics. We also introduce an angular loss, which, together with the camera module design, prevents the contraction of the 3D outputs for wide-view cameras. A comprehensive zero-shot evaluation on 13 diverse datasets demonstrates the state-of-the-art performance of UniK3D across 3D, depth, and camera metrics, with substantial gains in challenging large-field-of-view and panoramic settings, while maintaining top accuracy in conventional pinhole small-field-of-view domains. Code and models will be made public.
Poster
Yihan Wang · Linfei Pan · Marc Pollefeys · Viktor Larsson
[ ExHall D ]
Abstract
In this paper, we present a new Structure-from-Motion pipeline that uses a non-parametric camera projection model. The model is self-calibrated during the reconstruction process and can fit a wide variety of cameras, ranging from simple low-distortion pinhole cameras to more extreme optical systems such as fisheye or catadioptric cameras. The key component in our framework is an adaptive calibration procedure that can estimate partial calibrations, only modeling regions of the image where sufficient constraints are available. In experiments, we show that our method achieves comparable accuracy in scenarios where traditional Structure-from-Motion pipelines excel and it outperforms its parametric opponents in cases where they are unable to self-calibrate their parametric models.
Poster
Yohann Cabon · Lucas Stoffl · Leonid Antsfeld · Gabriela Csurka · Boris Chidlovskii · Jerome Revaud · Vincent Leroy
[ ExHall D ]
Abstract
DUSt3R introduced a novel paradigm in geometric computer vision by proposing a model that can provide dense and unconstrained Stereo 3D Reconstruction of arbitrary image collections with no prior information about camera calibration nor viewpoint poses. Under the hood, however, DUSt3R processes image pairs, regressing local 3D reconstructions that need to be aligned in a global coordinate system. The number of pairs, growing quadratically, is an inherent limitation that becomes especially concerning for robust and fast optimization in the case of large image collections. In this paper, we propose an extension of DUSt3R from pairs to multiple views that addresses all aforementioned concerns. First, we propose a Multi-view Network for Stereo 3D Reconstruction, or MUNSt3R, that modifies the DUSt3R architecture by making it symmetric and extending it to directly predict 3D structure for all views in a common coordinate frame. Second, we equip the model with a multi-layer memory mechanism which reduces the computational complexity and scales the reconstruction to large collections, inferring 3D pointmaps at high frame-rates with limited added complexity. The framework is designed to perform 3D reconstruction both offline and online, and hence can be seamlessly applied to SfM and SLAM scenarios showing state-of-the-art …
Poster
Hana Bezalel · Dotan Ankri · Ruojin Cai · Hadar Averbuch-Elor
[ ExHall D ]
Abstract
We present a technique and benchmark dataset for estimating the relative 3D orientation between a pair of Internet images captured in an extreme setting, where the images have limited or non-overlapping fields of view. Prior work targeting extreme rotation estimation assumes constrained 3D environments and emulates perspective images by cropping regions from panoramic views. However, real images captured in the wild are highly diverse, exhibiting variation in both appearance and camera intrinsics. In this work, we propose a Transformer-based method for estimating relative rotations in extreme real-world settings, and contribute the ExtremeLandmarkPairs dataset, assembled from scene-level Internet photo collections. Our evaluation demonstrates that our approach succeeds in estimating the relative rotations in a wide variety of extreme-view Internet image pairs, outperforming various baselines, including dedicated rotation estimation techniques and contemporary 3D reconstruction methods. We will release our data, code, and trained models.
Poster
Wonbong Jang · Philippe Weinzaepfel · Vincent Leroy · Lourdes Agapito · Jerome Revaud
[ ExHall D ]
Abstract
We present Pow3R, a novel large 3D vision regression model that is highly versatile in the input modalities it accepts. Unlike previous feed-forward models that lack any mechanism to exploit known camera or scene priors at test time, Pow3R incorporates any combination of auxiliary information such as intrinsics, relative pose, dense or sparse depth, alongside input images, within a single network. Building upon the recent DUSt3R paradigm, a transformer-based architecture that leverages powerful pre-training, our lightweight and versatile conditioning acts as additional guidance for the network to predict more accurate estimates when auxiliary information is available. During training we feed the model with random subsets of modalities at each iteration, which enables the model to operate under different levels of known priors at test time. This in turn opens up new capabilities, such as performing inference in native image resolution, or point-cloud completion. Our experiments on 3D reconstruction, depth completion, multi-view depth prediction, multi-view stereo, and multi-view pose estimation tasks yield state-of-the-art results and confirm the effectiveness of Pow3R at exploiting all available information.
Poster
Maxime Pietrantoni · Gabriela Csurka · Torsten Sattler
[ ExHall D ]
Abstract
Visual localization is the task of estimating a camera pose in a known environment. In this paper, we utilize 3D Gaussian Splatting (3DGS)-based representations for accurate and privacy-preserving visual localization. We propose Gaussian Splatting Feature Fields (GSFFs), a scene representation for visual localization that combines an explicit geometry model (3DGS) with an implicit feature field. We leverage the dense geometric information and differentiable rasterization algorithm from 3DGS to learn robust feature representations grounded in 3D. In particular, we align a 3D scale-aware feature field and a 2D feature encoder in a common embedding space through a contrastive framework. Using a 3D structure-informed clustering procedure, we further regularize the representation learning and seamlessly convert the features to segmentations, which can be used for privacy-preserving visual localization. Localization is solved through pose refinement by aligning feature maps or segmentations from a query image and feature maps or segmentations rendered from the GSFFs scene representation. The resulting privacy- and non-privacy-preserving localization pipelines are evaluated on multiple real world datasets, and outperform state-of-the-art baselines.
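Aligning a rendered 3D feature field with a 2D image encoder in a common embedding space is typically done with an InfoNCE-style contrastive loss over corresponding pixel/feature pairs. The sketch below is such a generic stand-in with an assumed batch construction, not the paper's exact objective or its clustering-based regularization.

```python
# Sketch of contrastive alignment between rendered 3D features and 2D encoder features
# (an InfoNCE-style stand-in; sampling details and temperature are assumptions).
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(feat_3d, feat_2d, temperature=0.07):
    """feat_3d, feat_2d: (N, C) features for N corresponding pixels; row i matches row i."""
    feat_3d = F.normalize(feat_3d, dim=-1)
    feat_2d = F.normalize(feat_2d, dim=-1)
    logits = feat_3d @ feat_2d.t() / temperature      # (N, N) similarity matrix
    labels = torch.arange(feat_3d.shape[0], device=feat_3d.device)
    # Symmetric cross-entropy: each rendered feature should pick its own pixel and vice versa.
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

rendered = torch.randn(256, 64, requires_grad=True)   # features rasterized from the field
encoded = torch.randn(256, 64, requires_grad=True)    # features from the 2D encoder
loss = contrastive_alignment_loss(rendered, encoded)
loss.backward()
print(float(loss))
```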
Poster
Jonathan Astermark · Anders Heyden · Viktor Larsson
[ ExHall D ]
Abstract
In this paper, we speed up robust estimation from dense two-view correspondences. Previous work has shown that dense matchers can significantly improve both the accuracy and robustness of two-view relative pose estimation. However, the large number of matches comes with a significantly increased runtime during robust estimation. To avoid this, we propose an efficient match summarization scheme which provides comparable accuracy to using the full set of dense matches, while having 10-100x faster runtime. We validate our approach on standard benchmark datasets together with multiple state-of-the-art dense matchers.
Poster
Honggyu An · Jin Hyeon Kim · Seonghoon Park · Sunghwan Hong · Jaewoo Jung · Jisang Han · Seungryong Kim
[ ExHall D ]
Abstract
In this work, we analyze new aspects of cross-view completion, mainly through the analogy between cross-view completion and traditional self-supervised correspondence learning algorithms. Based on our analysis, we reveal that the cross-attention map of Croco-v2 best reflects this correspondence information compared to other correlations from the encoder or decoder features. We further verify the effectiveness of the cross-attention map by evaluating it on both zero-shot and supervised dense geometric correspondence and on multi-frame depth estimation.
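As a rough illustration of how a decoder cross-attention map can be read out as dense correspondence, the snippet below takes, for each source-view token, the argmax over target-view tokens; this is a simple readout under our assumptions and not necessarily the paper's evaluation protocol.

```python
import torch

def correspondences_from_cross_attention(attn, w2):
    # attn: (N1, N2) cross-attention from view-1 tokens to view-2 tokens,
    # rows summing to 1; w2: token-grid width of view 2.
    idx = attn.argmax(dim=1)                 # best view-2 token per view-1 token
    ys, xs = idx // w2, idx % w2             # unravel to 2D token coordinates
    conf = attn.max(dim=1).values            # attention mass as a confidence proxy
    return torch.stack([xs, ys], dim=1), conf
```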
Poster
David Yifan Yao · Albert J. Zhai · Shenlong Wang
[ ExHall D ]
Abstract
This paper presents a unified approach to understanding dynamic scenes from casual videos. Large pretrained vision models, such as vision-language, video depth prediction, motion tracking, and segmentation models, offer promising capabilities. However, training a single model for comprehensive 4D understanding remains challenging. We introduce Uni4D, a multi-stage optimization framework that harnesses multiple pretrained models to advance dynamic 3D modeling, including static/dynamic reconstruction, camera pose estimation, and dense 3D motion tracking. Our results show state-of-the-art performance in dynamic 4D modeling with superior visual quality. Notably, Uni4D requires no retraining or fine-tuning, highlighting the effectiveness of repurposing large visual models for 4D understanding.
Poster
Yuzhen Liu · Qiulei Dong
[ ExHall D ]
Abstract
Relative camera pose estimation between two images is a fundamental task in 3D computer vision. Recently, many relative pose estimation networks have been explored for learning a mapping from two input images to their corresponding relative pose; however, the relative poses estimated by these methods do not have the intrinsic Pose Permutation Equivariance (PPE) property: the estimated relative pose from Image A to Image B should be the inverse of that from Image B to Image A. This means that permuting the input order of the two images would cause these methods to obtain inconsistent relative poses. To address this problem, we first introduce the concept of a PPE mapping, which indicates such a mapping that captures the intrinsic PPE property of relative poses. Then, by enforcing the aforementioned PPE property, we propose a general framework for relative pose estimation, called EquiPose, which can easily accommodate various relative pose estimation networks in the literature as its baseline models. We further theoretically prove that the proposed EquiPose framework guarantees that its obtained mapping is a PPE mapping. Given a pre-trained baseline model, the proposed EquiPose framework can improve its performance even without fine-tuning, and can further boost its performance with fine-tuning. Experimental results …
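To make the PPE idea concrete for the rotation component, one minimal way to wrap any baseline estimator so that swapping the inputs returns the inverse rotation is to average the forward estimate with the transpose of the swapped estimate and project back onto SO(3). This numpy sketch is our own illustration, not the EquiPose construction itself.

```python
import numpy as np

def project_to_so3(M):
    # Project a 3x3 matrix onto SO(3) via SVD (orthogonal Procrustes).
    U, _, Vt = np.linalg.svd(M)
    if np.linalg.det(U @ Vt) < 0:   # fix an improper rotation (reflection)
        U[:, -1] *= -1
    return U @ Vt

def ppe_wrap(baseline, img_a, img_b):
    # baseline(img_a, img_b) is a hypothetical estimator returning the
    # 3x3 relative rotation from A to B.
    R_ab = baseline(img_a, img_b)            # estimate A -> B
    R_ba = baseline(img_b, img_a)            # estimate B -> A
    # Average the forward estimate with the inverse of the swapped one,
    # then project back onto the rotation manifold.
    return project_to_so3(0.5 * (R_ab + R_ba.T))
```

With this wrapper, calling it with the images swapped yields, up to numerical precision, the transpose (i.e. inverse) of the original output, which is exactly the PPE property for rotations.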
Poster
Krispin Wandel · Hesheng Wang
[ ExHall D ]
Abstract
Semantic correspondence has made tremendous progress through the recent advancements of large vision models (LVMs). While these LVMs have been shown to reliably capture local semantics, the same cannot currently be said for capturing global geometric relationships between semantic object regions. This problem leads to unreliable performance for semantic correspondence between images with extreme view variation. In this work, we aim to leverage monocular depth estimates to capture these geometric relationships for more robust and data-efficient semantic correspondence. First, we introduce a simple but effective method to build 3D object-class representations from monocular depth estimates and LVM features using a sparsely annotated image correspondence dataset. Second, we formulate an alignment energy that can be minimized using gradient descent to obtain an alignment between the 3D object-class representation and the object-class instance in the input RGB image. Our method achieves state-of-the-art matching accuracy in multiple categories on the challenging SPair-71k dataset, increasing the PCK@0.1 score by more than 10 points on three categories and overall by 3.3 points, from 85.6% to 88.9%.
Poster
Yufu Wang · Yu Sun · Priyanka Patel · Kostas Daniilidis · Michael J. Black · Muhammed Kocabas
[ ExHall D ]
Abstract
Human pose and shape (HPS) estimation presents challenges in diverse scenarios such as crowded scenes, person-person interactions, and single-view reconstruction. Existing approaches lack mechanisms to incorporate auxiliary "side information" that could enhance reconstruction accuracy in such challenging scenarios. Furthermore, the most accurate methods rely on cropped person detections and cannot exploit scene context, while methods that process the whole image often fail to detect people and are less accurate than methods that use crops. While recent language-based methods explore HPS reasoning through large language or vision-language models, their metric accuracy is well below the state of the art. In contrast, we present PromptHMR, a transformer-based promptable method that reformulates HPS estimation through spatial and semantic prompts. Our method processes full images to maintain scene context and accepts multiple input modalities: spatial prompts like face or body bounding boxes, and semantic prompts like language descriptions or interaction labels. PromptHMR demonstrates robust performance across challenging scenarios: estimating people from bounding boxes as small as faces in crowded scenes, improving body shape estimation through language descriptions, modeling person-person interactions, and producing temporally coherent motions in videos. Experiments on benchmarks show that PromptHMR achieves state-of-the-art performance while offering flexible prompt-based control over the HPS …
Poster
Yalong Xu · Lin Zhao · Chen Gong · Guangyu Li · Di Wang · Nannan Wang
[ ExHall D ]
Abstract
Top-down approaches for human pose estimation (HPE) have reached a high level of sophistication, exemplified by models such as HRNet and ViTPose. Nonetheless, the low efficiency of top-down methods is a recognized issue that has not been sufficiently explored in current research. Our analysis suggests that the primary cause of inefficiency stems from the substantial diversity found in pose samples. On one hand, simple poses can be accurately estimated without requiring the computational resources of larger models. On the other hand, a more prominent issue arises from the abundance of bounding boxes, which remain excessive even after NMS. In this paper, we present a straightforward yet effective dynamic framework called DynPose, designed to match diverse pose samples with the most appropriate models, thereby ensuring optimal performance and high efficiency. Specifically, the framework contains a lightweight router and two pre-trained HPE models: one small and one large. The router is optimized to classify samples and dynamically determine the appropriate inference paths. Extensive experiments demonstrate the effectiveness of the framework. For example, using ResNet-50 and HRNet-W32 as the pre-trained models, our DynPose achieves an almost 50% increase in speed over HRNet-W32 while maintaining the same level of accuracy. More importantly, the framework can be …
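A minimal sketch of such a routing wrapper is shown below; the tiny router architecture, the 0.5 threshold, and the per-sample dispatch are illustrative placeholders rather than the paper's actual design.

```python
import torch
import torch.nn as nn

class DynRouter(nn.Module):
    """Route each person crop to a small or a large pose model
    (illustrative sketch of the dynamic-routing idea)."""

    def __init__(self, small_model, large_model, threshold=0.5):
        super().__init__()
        self.small, self.large = small_model, large_model
        # deliberately tiny binary classifier deciding "hard" vs "easy" crops
        self.router = nn.Sequential(
            nn.AdaptiveAvgPool2d(8), nn.Flatten(),
            nn.Linear(3 * 8 * 8, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid())
        self.threshold = threshold

    @torch.no_grad()
    def forward(self, crops):                      # crops: (B, 3, H, W)
        hard = self.router(crops).squeeze(1) > self.threshold
        results = [None] * crops.shape[0]
        for model, mask in ((self.small, ~hard), (self.large, hard)):
            idx = mask.nonzero(as_tuple=True)[0].tolist()
            if idx:
                out = model(crops[idx])            # run only the selected crops
                for j, i in enumerate(idx):
                    results[i] = out[j]
        return torch.stack(results)
```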
Poster
Huan Ren · Wenfei Yang · Shifeng Zhang · Tianzhu Zhang
[ ExHall D ]
Abstract
Category-level object pose estimation aims to determine the pose and size of arbitrary objects within given categories. Existing two-stage correspondence-based methods first establish correspondences between camera and object coordinates, and then acquire the object pose using a pose fitting algorithm. In this paper, we conduct a comprehensive analysis of this paradigm and introduce two crucial essentials: 1) shape-sensitive and pose-invariant feature extraction for accurate correspondence prediction, and 2) outlier correspondence removal for robust pose fitting. Based on these insights, we propose a simple yet effective correspondence-based method called SpotPose, which includes two stages. During the correspondence prediction stage, the pose-invariant geometric structure of objects is thoroughly exploited to facilitate shape-sensitive holistic interaction among keypoint-wise features. During the pose fitting stage, outlier scores of correspondences are explicitly predicted to facilitate efficient identification and removal of outliers. Experimental results on the CAMERA25, REAL275 and HouseCat6D benchmarks demonstrate that the proposed SpotPose outperforms state-of-the-art approaches by a large margin. Our code will be released.
Poster
Ming-Feng Li · Xin Yang · Fu-En Wang · Hritam Basak · Yuyin Sun · Shreekant Gayaka · Min Sun · Cheng-Hao Kuo
[ ExHall D ]
Abstract
6D object pose estimation has shown strong generalizability to novel objects. However, existing methods often require either a complete, well-reconstructed 3D model or numerous reference images that fully cover the object. Estimating 6D poses from partial references, which capture only fragments of an object's appearance and geometry, remains challenging. To address this, we propose UA-Pose, an uncertainty-aware approach for 6D object pose estimation and online object completion specifically designed for partial references. We assume access to either (1) a limited set of RGBD images with known poses or (2) a single 2D image. For the first case, we initialize a partial object 3D model based on the provided images and poses, while for the second, we use image-to-3D techniques to generate an initial object 3D model. Our method integrates uncertainty into the incomplete 3D model, distinguishing between seen and unseen regions. This uncertainty enables confidence assessment in pose estimation and guides an uncertainty-aware sampling strategy for online object completion, enhancing the robustness and accuracy of pose estimation and improving object completeness. We evaluate our method on the YCB-Video, YCBInEOAT, and HO3D datasets, including RGBD sequences of YCB objects manipulated by robots and human hands. Experimental results demonstrate significant performance improvements over existing methods, particularly when …
Poster
Bin Tan · Rui Yu · Yujun Shen · Nan Xue
[ ExHall D ]
Abstract
This paper presents PlanarSplatting, an ultra-fast and accurate surface reconstruction approach for multiview indoor images. We take 3D planes as the main objective due to their compactness and structural expressiveness in indoor scenes, and develop an explicit optimization framework that learns to fit the expected surface of indoor scenes by splatting the 3D planes into 2.5D depth and normal maps. As our PlanarSplatting operates directly on the 3D plane primitives, it eliminates the dependencies on 2D/3D plane detection and plane matching and tracking for planar surface reconstruction. Furthermore, thanks to the essential merits of the plane-based representation and a CUDA-based implementation of the planar splatting functions, PlanarSplatting reconstructs an indoor scene in 3 minutes while achieving significantly better geometric accuracy. Thanks to our ultra-fast reconstruction speed, the largest quantitative evaluation on the ScanNet and ScanNet++ datasets, covering hundreds of scenes, clearly demonstrates the advantages of our method. We believe that our accurate and ultra-fast planar surface reconstruction method will be applied to structured data curation for surface reconstruction in the future. The code of our CUDA implementation will be publicly available.
Poster
Xiuqiang Song · Li Jin · Zhengxian Zhang · Jiachen Li · Fan Zhong · Guofeng Zhang · Xueying Qin
[ ExHall D ]
Abstract
In this paper, we introduce a novel, truly prior-free 3D object tracking method that operates without being given any model or training priors. Unlike existing methods that typically require pre-defined 3D models or specific training datasets as priors, which limit their applicability, our method is free from these constraints. Our method consists of a geometry generation module and a pose optimization module. Its core idea is to enable these two modules to automatically and iteratively enhance each other, thereby gradually building all the information necessary for the tracking task. We thus call the method Bidirectional Iterative Tracking (BIT). The geometry generation module starts without priors and gradually generates high-precision mesh models for tracking, while the pose optimization module generates additional data during object tracking to further refine the generated models. Moreover, the generated 3D models can be stored and easily reused, allowing for seamless integration into various other tracking systems, not just our own. Experimental results demonstrate that BIT outperforms many existing methods, even those that extensively utilize prior knowledge, although BIT does not rely on such information. Additionally, the generated 3D models deliver results comparable to actual 3D models, highlighting their superior and innovative qualities.
Poster
Guiyu Zhao · Sheng Ao · Ye Zhang · Kai Xu · Yulan Guo
[ ExHall D ]
Abstract
Obtaining enough high-quality correspondences is crucial for robust registration. Existing correspondence refinement methods mostly follow the paradigm of outlier removal, which either fails to correctly identify the accurate correspondences under extreme outlier ratios, or selects too few correct correspondences to support robust registration. To address this challenge, we propose a novel approach named Regor, a progressive correspondence regenerator that generates higher-quality matches whilst remaining sufficiently robust to numerous outliers. In each iteration, we first apply prior-guided local grouping and generalized mutual matching to generate the local region correspondences. A powerful center-aware three-point consistency is then presented to achieve local correspondence correction, instead of removal. Further, we employ global correspondence refinement to obtain accurate correspondences from a global perspective. Through progressive iterations, this process yields a large number of high-quality correspondences. Extensive experiments on both indoor and outdoor datasets demonstrate that the proposed Regor significantly outperforms existing outlier removal techniques. More critically, our approach obtains 10 times more correct correspondences than outlier removal methods. As a result, our method is able to achieve robust registration even with weak features. The code will be released.
Poster
Amir Etefaghi Daryani · M. Usman Maqbool Bhutta · Byron Hernandez · Henry Medeiros
[ ExHall D ]
Abstract
Multi-view object detection in crowded environments presents significant challenges, particularly for occlusion management across multiple camera views. This paper introduces a novel approach that extends conventional multi-view detection to operate directly within each camera's image space. Our method finds object bounding boxes in images from various perspectives without resorting to a bird's eye view (BEV) representation. Thus, our approach removes the need for camera calibration by leveraging a learnable architecture that facilitates flexible transformations and improves feature fusion across perspectives to increase detection accuracy. Our model achieves Multi-Object Detection Accuracy (MODA) scores of 95.0% and 96.5% on the Wildtrack and MultiviewX datasets, respectively, significantly advancing the state of the art in multi-view detection. Furthermore, it demonstrates robust performance even without ground truth annotations, highlighting its resilience and practicality in real-world applications. These results emphasize the effectiveness of our calibration-free, multi-view object detector.
Poster
Theo Bodrito · Olivier Flasseur · Julien Mairal · Jean Ponce · Maud Langlois · Anne-Marie Lagrange
[ ExHall D ]
Abstract
The search for exoplanets is an active field in astronomy, with direct imaging as one of the most challenging methods due to faint exoplanet signals buried within stronger residual starlight. Successful detection requires advanced image processing to separate the exoplanet signal from this nuisance component. This paper presents a novel statistical model that captures nuisance fluctuations using a multi-scale approach, leveraging problem symmetries and a joint spectral channel representation grounded in physical principles. Our model integrates into an interpretable, end-to-end learnable framework for simultaneous exoplanet detection and flux estimation. The proposed algorithm is evaluated against the state of the art using datasets from the SPHERE instrument operating at the Very Large Telescope (VLT). It significantly improves the precision-recall trade-off, notably on challenging datasets that are otherwise unusable by astronomers. The proposed approach is computationally efficient, robust to varying data quality, and well suited for large-scale observational surveys.
Poster
Huy Nguyen · Kien Nguyen Thanh · Akila Pemasiri · Feng Liu · Sridha Sridharan · Clinton Fookes
[ ExHall D ]
Abstract
We introduce AG-VPReID, a challenging large-scale benchmark dataset for aerial-ground video-based person re-identification (ReID), comprising 6,632 identities, 32,321 tracklets, and 9.6 million frames captured from drones (15-120m altitude), CCTV, and wearable cameras. This dataset presents a real-world benchmark to investigate the robustness of Person ReID approaches against the unique challenges of cross-platform aerial-ground settings. To address these challenges, we propose AG-VPReID-Net, an end-to-end framework combining three complementary streams: (1) an Adapted Temporal-Spatial Stream addressing motion pattern inconsistencies and temporal feature learning, (2) a Normalized Appearance Stream using physics-informed techniques to tackle resolution and appearance changes, and (3) a Multi-Scale Attention Stream handling scale variations across drone altitudes. Our approach integrates complementary visual-semantic information from all streams to generate robust, viewpoint-invariant person representations. Extensive experiments demonstrate that AG-VPReID-Net outperforms state-of-the-art approaches on both our new dataset and other existing video-based ReID benchmarks, showcasing its effectiveness and generalizability. The relatively lower performance of all state-of-the-art approaches, including our proposed approach, on our new dataset highlights its challenging nature. The AG-VPReID dataset, code and models will be released upon publication.
Poster
Shuo Wang · Wanting Li · Yongcai Wang · Zhaoxin Fan · Zhe Huang · xudong cai · Jian Zhao · Deying Li
[ ExHall D ]
Abstract
Deep visual odometry has demonstrated great advancements through learning-to-optimize technology. This approach relies heavily on visual matching across frames. However, ambiguous matching in challenging scenarios leads to significant errors in geometric modeling and bundle adjustment optimization, which undermines the accuracy and robustness of pose estimation. To address this challenge, this paper proposes MambaVO, which conducts robust initialization, Mamba-based sequential matching refinement, and smoothed training to enhance the matching quality and improve pose estimation in deep visual odometry. Specifically, when a new frame is received, it is matched with the closest keyframe in the maintained Point-Frame Graph (PFG) via the semi-dense Geometric Initialization Module (GIM). The initialized PFG is then processed by the proposed Geometric Mamba Module (GMM), which exploits the matching features to refine the overall inter-frame pixel-to-pixel matching. The refined PFG is finally processed by deep BA to optimize the poses and the map. To deal with gradient variance, a Trending-Aware Penalty (TAP) is proposed to smooth training by balancing the pose loss and the matching loss to enhance convergence and stability. A loop closure module is finally applied to enable MambaVO++. On public benchmarks, MambaVO and MambaVO++ demonstrate SOTA accuracy performance, while ensuring real-time …
Poster
Hongyu Sun · Qiuhong Ke · Ming Cheng · Yongcai Wang · Deying Li · Chenhui Gou · Jianfei Cai
[ ExHall D ]
Abstract
This paper proposes a general solution to enable point cloud recognition models to handle distribution shifts at test time. Unlike prior methods, which rely heavily on training data—often inaccessible during online inference—and are limited to recognizing a fixed set of point cloud classes predefined during training, we explore a more practical and challenging scenario: adapting the model solely based on online test data to recognize both previously seen classes and novel, unseen classes at test time. To this end, we develop Point-Cache, a hierarchical cache model that captures essential clues of online test samples, particularly focusing on the global structure of point clouds and their local-part details. Point-Cache, which serves as a rich 3D knowledge base, is dynamically managed to prioritize the inclusion of high-quality samples. Designed as a plug-and-play module, our method can be flexibly integrated into large multimodal 3D models to support open-vocabulary point cloud recognition. Notably, our solution operates with efficiency comparable to zero-shot inference, as it is entirely training-free. Point-Cache demonstrates substantial gains across 8 challenging benchmarks and 4 representative large 3D models, highlighting its effectiveness.
Poster
Zimo Wang · Cheng Wang · Taiki Yoshino · Sirui Tao · Ziyang Fu · Tzu-Mao Li
[ ExHall D ]
Abstract
We propose a method, HotSpot, for optimizing neural signed distance functions, based on a relation between the solution of a screened Poisson equation and the distance function. Existing losses such as the eikonal loss cannot guarantee that the recovered implicit function is a distance function, even when the implicit function satisfies the eikonal equation almost everywhere. Furthermore, the eikonal loss suffers from stability issues in optimization, and the remedies that introduce area or divergence minimization can lead to oversmoothing. We address these challenges by designing a loss function that, when minimized, can converge to the true distance function, is stable, and naturally penalizes large surface area. We provide theoretical analysis and experiments on both challenging 2D and 3D datasets and show that our method provides better surface reconstruction and more accurate distance approximation.
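For reference, the eikonal regularizer that the abstract argues is insufficient on its own is typically implemented as the squared deviation of the predicted gradient norm from 1. The sketch below shows that standard baseline loss, not the HotSpot loss itself.

```python
import torch

def eikonal_loss(sdf_net, points):
    # Standard eikonal term (||grad f(x)|| - 1)^2 used by prior neural-SDF
    # methods; sdf_net maps (N, 3) points to (N, 1) signed distances.
    points = points.clone().requires_grad_(True)
    f = sdf_net(points)
    grad, = torch.autograd.grad(f.sum(), points, create_graph=True)
    return ((grad.norm(dim=-1) - 1.0) ** 2).mean()
```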
Poster
Yuanqi Li · Jingcheng Huang · Hongshen Wang · Peiyuan Lv · Yansong Liu · Jiuming Zheng · Jie Guo · Yanwen Guo
[ ExHall D ]
Abstract
The proliferation of Light Detection and Ranging (LiDAR) technology has facilitated the acquisition of three-dimensional point clouds, which are integral to applications in VR, AR, and digital twins. Oriented normals, critical for 3D reconstruction and scene analysis, cannot be directly extracted from scenes using LiDAR due to its operational principles. Previous traditional and learning-based methods are prone to inaccuracies under uneven point distributions and noise because of their dependence on local geometry features. This paper addresses the challenge of estimating oriented point normals by introducing a point cloud normal estimation framework via hybrid angular and Euclidean distance encoding (HAE). Our method overcomes the limitations of local geometric information by combining angular and Euclidean spaces to extract features from both point cloud coordinates and light rays, leading to more accurate normal estimation. The core of our network consists of an angular distance encoding module, which leverages both ray directions and point coordinates for unoriented normal refinement, and a ray feature fusion module for normal orientation that is robust to noise. We also provide a point cloud dataset with ground truth normals, generated by a virtual scanner, which reflects real scanning distributions and noise profiles.
Poster
Jiangbei Hu · Yanggeng Li · Fei Hou · Junhui Hou · Zhebin Zhang · Shengfa Wang · Na Lei · Ying He
[ ExHall D ]
Abstract
Unsigned distance fields (UDFs) provide a versatile framework for representing a diverse array of 3D shapes, encompassing both watertight and non-watertight geometries. Traditional UDF learning methods typically require extensive training on large 3D shape datasets, which is costly and necessitates re-training for new datasets. This paper presents a novel neural framework, LoSF-UDF, for reconstructing surfaces from 3D point clouds by leveraging local shape functions to learn UDFs. We observe that 3D shapes manifest simple patterns in localized regions, prompting us to develop a training dataset of point cloud patches characterized by mathematical functions that represent a continuum from smooth surfaces to sharp edges and corners. Our approach learns features within a specific radius around each query point and utilizes an attention mechanism to focus on the crucial features for UDF estimation. Despite being highly lightweight, with only 653 KB of trainable parameters and a modest-sized training dataset requiring 0.5 GB of storage, our method enables efficient and robust surface reconstruction from point clouds without requiring shape-specific training. Furthermore, our method exhibits enhanced resilience to noise and outliers in point clouds compared to existing methods. We conduct comprehensive experiments and comparisons across various datasets, including synthetic and real-scanned point clouds, to …
Poster
An Li · Zhe Zhu · Mingqiang Wei
[ ExHall D ]
Abstract
Existing point cloud completion methods, which typically depend on predefined synthetic training datasets, encounter significant challenges when applied to out-of-distribution, real-world scans. To overcome this limitation, we introduce a zero-shot completion framework, termed GenPC, designed to reconstruct high-quality real-world scans by leveraging explicit 3D generative priors. Our key insight is that recent feed-forward 3D generative models, trained on extensive internet-scale data, have demonstrated the ability to perform 3D generation from single-view images in a zero-shot setting. To harness this for completion, we first develop a Depth Prompting module that links partial point clouds with image-to-3D generative models by leveraging depth images as a stepping stone. To retain the original partial structure in the final results, we design the Geometric Preserving Fusion module, which aligns the generated shape with the input by adaptively adjusting its pose and scale. Extensive experiments on widely used benchmarks validate the superiority and generalizability of our approach, bringing us a step closer to robust real-world scan completion.
Poster
Ziyi Wang · Yanran Zhang · Jie Zhou · Jiwen Lu
[ ExHall D ]
Abstract
The scale diversity of point cloud data presents significant challenges in developing unified representation learning techniques for 3D vision. Currently, there are few unified 3D models, and no existing pre-training method is equally effective for both object- and scene-level point clouds. In this paper, we introduce UniPre3D, the first unified pre-training method that can be seamlessly applied to point clouds of any scale and 3D models of any architecture. Our approach predicts Gaussian primitives as the pre-training task and employs differentiable Gaussian splatting to render images, enabling precise pixel-level supervision and end-to-end optimization. To further regulate the complexity of the pre-training task and direct the model's focus toward geometric structures, we integrate 2D features from pre-trained image models to incorporate well-established texture knowledge. We validate the universal effectiveness of our proposed method through extensive experiments across a variety of object- and scene-level tasks, using diverse point cloud models as backbones.
Poster
Ziyin Zeng · Mingyue Dong · Jian Zhou · Huan Qiu · Zhen Dong · Man Luo · Bijun Li
[ ExHall D ]
Abstract
Due to the irregular and disordered data structure of 3D point clouds, prior works have focused on designing more sophisticated local representation methods to capture these complex local patterns. However, recognition performance has saturated over the past few years, indicating that increasingly complex and redundant designs no longer improve local learning. This phenomenon prompts us to diverge from the trend in 3D vision and instead pursue an alternative and successful solution: deeper neural networks. In this paper, we propose DeepLA-Net, a series of very deep networks for point cloud analysis. The key insight of our approach is to exploit a small but mighty local learning block, which requires 10× fewer FLOPs, enabling the construction of very deep networks. Furthermore, we design a training supervision strategy to ensure smooth gradient backpropagation and optimization in very deep networks. We construct the DeepLA-Net family with a depth of up to 120 blocks, at least 5× deeper than recent methods, trained on a single RTX 3090. An ensemble of DeepLA-Net achieves state-of-the-art performance on classification and segmentation tasks on S3DIS Area 5 (+2.2% mIoU), the ScanNet test set (+1.6% mIoU), ScanObjectNN (+2.1% OA), and ShapeNetPart (+0.9% cls. mIoU).
Poster
Chengzhi Wu · Yuxin Wan · Hao Fu · Julius Pfrommer · Zeyun Zhong · Junwei Zheng · Jiaming Zhang · Jürgen Beyerer
[ ExHall D ]
Abstract
Driven by the increasing demand for accurate and efficient representation of 3D data in various domains, point cloud sampling has emerged as a pivotal research topic in 3D computer vision. Recently, learning-to-sample methods have garnered growing interest from the community, particularly for their ability to be jointly trained with downstream tasks. However, previous learning-based sampling methods either lead to unrecognizable sampling patterns by generating a new point cloud, or to biased sampling results by focusing excessively on sharp edge details. Moreover, they all overlook the natural variations in point distribution across different shapes, applying a similar sampling strategy to all point clouds. In this paper, we propose a Sparse Attention Map and Bin-based Learning method (termed SAMBLE) to learn shape-specific sampling strategies for point cloud shapes. SAMBLE effectively achieves an improved balance between sampling edge points for local details and preserving uniformity in the global shape, resulting in superior performance across multiple common point cloud downstream tasks, even in scenarios with few-point sampling.
Poster
Jianan Ye · Weiguang Zhao · Xi Yang · Guangliang Cheng · Kaizhu Huang
[ ExHall D ]
Abstract
Point cloud anomaly detection under the anomaly-free setting poses significant challenges, as it requires accurately capturing the features of 3D normal data to identify deviations indicative of anomalies. Current efforts focus on devising reconstruction tasks, such as acquiring normal data representations by restoring normal samples from altered, pseudo-anomalous counterparts. Our findings reveal that distributing attention equally across normal and pseudo-anomalous data tends to dilute the model's focus on anomalous deviations. The challenge is further compounded by the inherently disordered and sparse nature of 3D point cloud data. In response to these predicaments, we introduce an innovative approach that emphasizes learning point offsets, targeting more informative pseudo-abnormal points and thus fostering more effective distillation of normal data representations. We have also crafted an augmentation technique that is steered by normal vectors, facilitating the creation of credible pseudo anomalies that enhance the efficiency of the training process. Our comprehensive experimental evaluation on the Anomaly-ShapeNet and Real3D-AD datasets evidences that our proposed method outperforms existing state-of-the-art approaches, achieving an average enhancement of 9.0% and 1.4% in the AUC-ROC detection metric across these datasets, respectively.
Poster
Shaocheng Yan · Yiming Wang · Kaiyan Zhao · Pengcheng Shi · Zhenjun Zhao · Yongjun Zhang · Jiayuan Li
[ ExHall D ]
Abstract
Heuristic information for consensus set sampling is essential for correspondence-based point cloud registration, but existing approaches typically rely on supervised learning or expert-driven parameter tuning. In this work, we propose HeMoRa, a new unsupervised framework that trains a Heuristic information Generator (HeGen) to estimate sampling probabilities for correspondences using a Multi-order Reward Aggregator (MoRa) loss. The core of MoRa is to train HeGen through extensive trials and feedback, enabling unsupervised learning. While this process can be implemented using policy optimization, directly applying the policy gradient to optimize HeGen presents challenges such as sensitivity to noise and low reward efficiency. To address these issues, we propose a Maximal Reward Propagation (MRP) mechanism that enhances the training process by prioritizing noise-free signals and improving reward utilization. Experimental results show that, equipped with HeMoRa, the consensus set sampler achieves improvements in both robustness and accuracy. For example, on the 3DMatch dataset with the FCGF feature, the registration recall of our unsupervised methods (Ours+SM and Ours+SC2) even outperforms the state-of-the-art supervised method VBreg. Our code is available at https://anonymous.4open.science/r/HeMoRa-CVPR/README.md.
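The trial-and-feedback training can be pictured as a policy-gradient loop over sampled consensus sets. The REINFORCE sketch below, with a simple mean baseline, is a generic stand-in for illustration; it does not implement the MoRa loss or the Maximal Reward Propagation mechanism, and `reward_fn` is a hypothetical registration-quality score (e.g. inlier count of the fitted pose).

```python
import torch

def reinforce_loss(probs, reward_fn, n_trials=8):
    # probs: (N,) per-correspondence sampling probabilities output by the
    # generator network (requires grad so the loss backpropagates into it).
    rewards, log_probs = [], []
    for _ in range(n_trials):
        mask = torch.bernoulli(probs.detach())        # sample a consensus set
        rewards.append(float(reward_fn(mask)))        # hypothetical reward
        log_p = (mask * torch.log(probs.clamp_min(1e-8)) +
                 (1 - mask) * torch.log((1 - probs).clamp_min(1e-8))).sum()
        log_probs.append(log_p)
    baseline = sum(rewards) / len(rewards)            # variance-reduction baseline
    return -sum((r - baseline) * lp
                for r, lp in zip(rewards, log_probs)) / n_trials
```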
Poster
Zihui Zhang · Weisheng Dai · Hongtao Wen · Bo Yang
[ ExHall D ]
Abstract
We study the hard problem of unsupervised 3D semantic segmentation on raw point clouds without needing human labels in training. Existing methods usually formulate this problem into learning per-point local features followed by a simple grouping strategy, lacking the ability to discover additional and possibly richer semantic priors beyond local features. In this paper, we introduce LogoSP to learn 3D semantics from both local and global point features. The key to our approach is to discover 3D semantic information by grouping superpoints according to their global patterns in the frequency domain, thus generating highly accurate semantic pseudo-labels for training a segmentation network. Extensive experiments on two popular indoor scene datasets and a challenging outdoor dataset show that our LogoSP clearly surpasses all existing unsupervised methods by large margins, achieving the state-of-the-art performance for unsupervised 3D semantic segmentation. Notably, our investigation into the learned global patterns reveals that they truly represent meaningful 3D semantics in the absence of human labels during training.
Poster
Runmao Yao · Yi Du · Zhuoqun Chen · Haoze Zheng · Chen Wang
[ ExHall D ]
Abstract
Room reidentification (ReID) is a challenging yet essential task with numerous applications in fields such as augmented reality (AR) and homecare robotics. Existing visual place recognition (VPR) methods, which typically rely on global descriptors or aggregate local features, often struggle in cluttered indoor environments densely populated with man-made objects. These methods tend to overlook the crucial role of object-oriented information. To address this, we propose AirRoom, an object-aware pipeline that integrates multi-level object-oriented information—from global context to object patches, object segmentation, and keypoints—utilizing a coarse-to-fine retrieval approach. Extensive experiments on four newly constructed datasets—MPReID, HMReID, GibsonReID, and ReplicaReID—demonstrate that AirRoom outperforms state-of-the-art (SOTA) models across nearly all evaluation metrics, with improvements ranging from 6% to 80%. Moreover, AirRoom exhibits significant flexibility, allowing various modules within the pipeline to be substituted with different alternatives without compromising overall performance. It also shows robust and consistent performance under diverse viewpoint variations.
Poster
Fajwel Fogel · Yohann PERRON · Nikola Besic · Laurent Saint-André · Agnès Pellissier-Tanon · Thomas Boudras · Martin Schwartz · Ibrahim Fayad · Alexandre d'Aspremont · Loic Landrieu · Philippe Ciais
[ ExHall D ]
Abstract
Estimating canopy height and its changes at meter resolution from satellite imagery is a significant challenge in computer vision with critical environmental applications. However, the lack of open-access datasets at this resolution hinders the reproducibility and evaluation of models. We introduce Open-Canopy, the first open-access, country-scale benchmark for very high-resolution (1.5 m) canopy height estimation, covering over 87,000 km² across France with 1.5 m resolution satellite imagery and aerial LiDAR data. Additionally, we present Open-Canopy-∆, a benchmark for canopy height change detection between images from different years at tree level—a challenging task for current computer vision models. We evaluate state-of-the-art architectures on these benchmarks, highlighting significant challenges and opportunities for improvement. Our datasets and code are publicly available at [URL].
Poster
Xin Jin · Haisheng Su · Kai Liu · CONG MA · Wei Wu · Fei HUI · Junchi Yan
[ ExHall D ]
Abstract
Recent advances in LiDAR 3D detection have demonstrated the effectiveness of Transformer-based frameworks in capturing the global dependencies from point cloud spaces, which serialize the 3D voxels into a flattened 1D sequence for iterative self-attention. However, the spatial structure of the 3D voxels is inevitably destroyed during the serialization process. Besides, due to the considerable number of 3D voxels and the quadratic complexity of Transformers, multiple sequences are grouped before feeding to Transformers, leading to a limited receptive field. Inspired by the impressive performance of State Space Models (SSMs) in 2D vision tasks, in this paper, we propose a novel Unified Mamba (UniMamba), which seamlessly integrates the merits of 3D convolution and SSMs in a concise multi-head manner, aiming to perform "local and global" spatial context aggregation efficiently and simultaneously. Specifically, a UniMamba block is designed which mainly consists of spatial locality modeling, complementary Z-order serialization and a local-global sequential aggregator. The spatial locality modeling module integrates 3D submanifold convolution to capture the dynamic spatial position embedding before serialization. Then the efficient Z-order curve is adopted for serialization both horizontally and vertically. Furthermore, the local-global sequential aggregator adopts the channel grouping strategy to efficiently encode both "local and …
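Z-order (Morton) serialization, one of the ingredients named above, interleaves the bits of the integer voxel coordinates so that spatially close voxels tend to stay close in the 1D sequence. The helper below is a generic implementation for illustration; UniMamba's complementary serialization scheme is more involved.

```python
import numpy as np

def z_order_key(x, y, z, bits=10):
    # x, y, z: integer voxel coordinates (numpy arrays) in [0, 2**bits).
    # Interleave their bits into a single Morton / Z-order key.
    key = np.zeros_like(x, dtype=np.int64)
    for i in range(bits):
        key |= ((x >> i) & 1).astype(np.int64) << (3 * i)
        key |= ((y >> i) & 1).astype(np.int64) << (3 * i + 1)
        key |= ((z >> i) & 1).astype(np.int64) << (3 * i + 2)
    return key

# Sorting voxels by their key yields a flattened 1D sequence:
# order = np.argsort(z_order_key(vx, vy, vz))
```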
Poster
Qiming Xia · Wenkai Lin · Haoen Xiang · Xun Huang · Siheng Chen · Zhen Dong · Cheng Wang · Chenglu Wen
[ ExHall D ]
Abstract
Unsupervised 3D object detection serves as an important solution for offline 3D object annotation. However, due to data sparsity and limited views, the clustering-based label fitting in unsupervised object detection often generates low-quality pseudo-labels. Multi-agent collaborative datasets, which involve the sharing of complementary observations among agents, hold the potential to break through this bottleneck. In this paper, we introduce a novel unsupervised method that learns to Detect Objects from Multi-Agent LiDAR scans, termed DOtA, without using labels from external sources. DOtA first uses the internally shared ego-pose and ego-shape of collaborative agents to initialize the detector, leveraging the generalization performance of neural networks to infer preliminary labels. Subsequently, DOtA uses the complementary observations between agents to perform multi-scale encoding on the preliminary labels, then decodes high-quality and low-quality labels. These labels are further used as prompts to guide a correct feature learning process, thereby enhancing the performance of the unsupervised object detection task. Extensive experiments on the V2V4Real and OPV2V datasets show that our DOtA outperforms state-of-the-art unsupervised 3D object detection methods. Additionally, we also validate the effectiveness of the DOtA labels under various collaborative perception frameworks. The code will be publicly available.
Poster
Longyu Yang · Ping Hu · Shangbo Yuan · Lu Zhang · Jun Liu · Heng Tao Shen · Xiaofeng Zhu
[ ExHall D ]
Abstract
Existing LiDAR semantic segmentation models often suffer from decreased accuracy when exposed to adverse weather conditions. Recent methods addressing this issue focus on enhancing training data through weather simulation or universal augmentation techniques. However, few works have studied the negative impacts caused by the heterogeneous domain shifts in the geometric structure and reflectance intensity of point clouds. In this paper, we delve into this challenge and address it with a novel Geometry-Reflectance Collaboration (GRC) framework that explicitly separates feature extraction for geometry and reflectance. Specifically, GRC employs a dual-branch architecture designed to initially process geometric and reflectance features independently, thereby capitalizing on their distinct characteristics. Then, GRC adopts a robust multi-level feature collaboration module to suppress redundant and unreliable information from both branches. Consequently, without complex simulation or augmentation, our method effectively extracts intrinsic information about the scene while suppressing interference, thus achieving better robustness and generalization in adverse weather conditions. We demonstrate the effectiveness of GRC through comprehensive experiments on challenging benchmarks, showing that our method outperforms previous approaches and establishes new state-of-the-art results.
Poster
R.D. Lin · Pengcheng Weng · Yinqiao Wang · Han Ding · Jinsong Han · Fei Wang
[ ExHall D ]
Abstract
LiDAR point cloud semantic segmentation plays a crucial role in autonomous driving. In recent years, semi-supervised methods have gained popularity due to their significant reduction in annotation labor and time costs. Current semi-supervised methods typically focus on point cloud spatial distribution or consider short-term temporal representations, e.g., only two adjacent frames, often overlooking the rich long-term temporal properties inherent in autonomous driving scenarios. We observe that nearby objects, such as roads and vehicles, remain stable while driving, whereas distant objects exhibit greater variability in category and shape. This natural phenomenon is also captured by LiDAR, which reflects lower temporal sensitivity for nearby objects and higher sensitivity for distant ones. To leverage these characteristics, we propose HiLoTs, which learns high-temporal sensitivity and low-temporal sensitivity representations from continuous LiDAR frames. These representations are further enhanced and fused using a cross-attention mechanism. Additionally, we employ a teacher-student framework to align the representations learned by the labeled and unlabeled branches, effectively utilizing the large amounts of unlabeled data. Experimental results on the SemanticKITTI and nuScenes datasets demonstrate that our proposed HiLoTs outperforms state-of-the-art semi-supervised methods, and achieves performance close to LiDAR+Camera multimodal approaches.
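The teacher-student alignment mentioned above is commonly realized with an exponential-moving-average teacher and a consistency loss on unlabeled frames; the sketch below shows that generic recipe (our assumption of the standard form, not HiLoTs' exact schedule).

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    # Mean-teacher style update: the teacher tracks an exponential
    # moving average of the student's parameters.
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_(ps.detach(), alpha=1.0 - momentum)

def consistency_loss(student_logits, teacher_logits):
    # KL divergence between the student's prediction and the
    # (stop-gradient) teacher's prediction on unlabeled points.
    return F.kl_div(F.log_softmax(student_logits, dim=1),
                    F.softmax(teacher_logits.detach(), dim=1),
                    reduction="batchmean")
```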
Poster
Zakaria Laskar · Tomas Vojir · Matej Grcic · Iaroslav Melekhov · Shankar Gangisetty · Juho Kannala · Jiri Matas · Giorgos Tolias · C.V. Jawahar
[ ExHall D ]
Abstract
Before deployment in the real world, deep neural networks require thorough evaluation of how they handle both knowns (inputs represented in the training data) and unknowns (anomalies). This is especially important for scene understanding tasks with safety-critical applications, such as autonomous driving. Existing datasets allow evaluation of either knowns or unknowns, but not both, which is required to establish "in the wild" suitability of deep neural network models. To bridge this gap, we propose a novel anomaly segmentation dataset, ISSU, featuring a diverse set of anomaly inputs from cluttered real-world environments. The dataset is twice as large as existing anomaly segmentation datasets, and provides training, validation and test sets for controlled in-domain evaluation. The test set consists of a static and a temporal part, with the latter comprising videos. The dataset provides annotations for both the closed set (knowns) and anomalies, enabling closed-set and open-set evaluation. The dataset covers diverse conditions, such as domain and cross-sensor shift and illumination variation, and allows ablation of anomaly detection methods with respect to these variations. Evaluation results of current state-of-the-art methods confirm the need for improvements, especially in domain generalization and in small- and large-object segmentation. The benchmarking code and the dataset will be made available.
Poster
Ben Agro · Sergio Casas · Patrick Wang · Thomas Gilles · Raquel Urtasun
[ ExHall D ]
Abstract
To perceive, humans use memory to fill in gaps caused by our limited visibility, whether due to occlusion or our narrow field of view. However, most 3D object detectors are limited to using sensor evidence from a short temporal window (0.1s-0.3s). In this work, we present a simple and effective add-on for enhancing any existing 3D object detector with long-term memory regardless of its sensor modality (e.g., LiDAR, camera) and network architecture. We propose a model to effectively align and fuse object proposals from a detector with object proposals from a memory bank of past predictions, exploiting trajectory forecasts to align proposals across time. We propose a novel schedule to train our model on temporal data that balances data diversity and the gap between training and inference. By applying our method to existing LiDAR and camera-based detectors on the Waymo Open Dataset (WOD) and Argoverse 2 Sensor (AV2) dataset, we demonstrate significant improvements in detection performance (+2.5 to +7.6 AP points). Our method attains the best performance on the WOD 3D detection leaderboard among online methods (excluding ensembles or test-time augmentation).
Poster
Cédric Vincent · Taehyoung Kim · Henri Meeß
[ ExHall D ]
Abstract
Semantic segmentation from RGB cameras is essential to the perception of autonomous flying vehicles. The stability of predictions through the captured videos is paramount to their reliability and, by extension, to the trustworthiness of the agents. In this paper, we propose a lightweight video semantic segmentation approach—suited to onboard real-time inference—achieving high temporal consistency on aerial data through Semantic Similarity Propagation (SSP) across frames. SSP temporally propagates the predictions of an efficient image segmentation model with global registration alignment to compensate for camera movements. It combines the current estimation and the prior prediction with linear interpolation, using weights computed from the feature similarities of the two frames. Because data availability is a challenge in this domain, we propose a consistency-aware Knowledge Distillation training procedure for sparsely labeled datasets with few annotations. Using a large image segmentation model as a teacher to train the efficient SSP, we leverage the strong correlations between labeled and unlabeled frames in the same training videos to obtain high-quality supervision on all frames. KD-SSP obtains a significant temporal consistency increase over the base image segmentation model of 12.5% and 6.7% TC on UAVid and RuralScapes respectively, with higher accuracy and comparable inference speed. On these aerial datasets, …
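The linear-interpolation rule described above can be sketched as a per-pixel blend whose weight is the cosine similarity between the two frames' features; the function below is our reading of that rule with illustrative tensor shapes, not the exact KD-SSP implementation.

```python
import torch
import torch.nn.functional as F

def propagate_predictions(curr_logits, prev_logits_warped,
                          curr_feats, prev_feats_warped):
    # curr_logits, prev_logits_warped: (B, K, H, W) class scores, the prior
    # frame already registered to the current view; curr_feats,
    # prev_feats_warped: (B, C, H, W) features of the two frames.
    sim = F.cosine_similarity(curr_feats, prev_feats_warped, dim=1)
    w = sim.clamp(min=0.0).unsqueeze(1)          # per-pixel weight in [0, 1]
    # high similarity -> trust the propagated prior prediction more
    return w * prev_logits_warped + (1.0 - w) * curr_logits
```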
Poster
Lingdong Kong · Dongyue Lu · Xiang Xu · Lai Xing Ng · Wei Tsang Ooi · Benoit Cottereau
[ ExHall D ]
Abstract
Cross-platform adaptation in event-based dense perception is crucial for deploying event cameras across diverse settings, such as vehicles, drones, and quadrupeds, each with unique motion dynamics, viewpoints, and class distributions. In this work, we introduce EventFly, a framework for robust cross-platform adaptation in event camera perception. Our approach comprises three key components: i) Event Activation Prior (EAP), which identifies high-activation regions in the target domain to minimize prediction entropy, fostering confident, domain-adaptive predictions; ii) EventBlend, a data-mixing strategy that integrates source and target event voxel grids based on EAP-driven similarity and density maps, enhancing feature alignment; and iii) EventMatch, a dual-discriminator technique that aligns features from source, target, and blended domains for better domain-invariant learning. To holistically assess cross-platform adaptation abilities, we introduce EXPo, a large-scale benchmark with diverse samples across vehicle, drone, and quadruped platforms. Extensive experiments validate our effectiveness, demonstrating substantial gains over popular adaptation methods. We hope this work can pave the way for adaptive, high-performing event perception across diverse and complex environments.
Poster
Tianchen Deng · Guole Shen · Chen Xun · Shenghai Yuan · Tongxing Jin · Hongming Shen · Yanbo Wang · Jingchuan Wang · Hesheng Wang · Danwei Wang · Weidong Chen
[ ExHall D ]
Abstract
Neural implicit scene representations have recently shown promising results in dense visual SLAM. However, existing implicit SLAM algorithms are constrained to single-agent scenarios and struggle with large indoor scenes and long sequences. Existing multi-agent SLAM frameworks cannot meet the constraints of communication bandwidth. To this end, we propose the first distributed multi-agent collaborative SLAM framework with distributed mapping and camera tracking, joint scene representation, intra-to-inter loop closure, and multi-submap fusion. Specifically, our proposed distributed neural mapping and tracking framework only needs peer-to-peer communication, which can greatly improve multi-agent cooperation and communication efficiency. A novel intra-to-inter loop closure method is designed to achieve local (single-agent) and global (multi-agent) consistency. Furthermore, to the best of our knowledge, there is no real-world dataset for NeRF-based/GS-based SLAM that provides both continuous-time trajectory ground truth and high-accuracy 3D mesh ground truth. To this end, we propose the first real-world indoor neural SLAM (INS) dataset covering both single-agent and multi-agent scenarios, ranging from small rooms to large-scale scenes, with high-accuracy ground truth for both 3D mesh and continuous-time camera trajectory. This dataset can advance the development of the community. Experiments on various datasets demonstrate the superiority of the proposed method in mapping, tracking, and communication. The dataset …
Poster
Xin Ye · Burhaneddin Yaman · Sheng Cheng · Feng Tao · Abhirup Mallik · Liu Ren
[ ExHall D ]
Abstract
Bird's-eye-view (BEV) representations play a crucial role in autonomous driving tasks. Despite recent advancements in BEV generation, inherent noise, stemming from sensor limitations and the learning process, remains largely unaddressed, resulting in suboptimal BEV representations that adversely impact the performance of downstream tasks. To address this, we propose BEVDiffuser, a novel diffusion model that effectively denoises BEV feature maps using the ground-truth object layout as guidance. BEVDiffuser can be operated in a plug-and-play manner during training time to enhance existing BEV models without requiring any architectural modifications. Extensive experiments on the challenging nuScenes dataset demonstrate BEVDiffuser's exceptional denoising and generation capabilities, which enable significant enhancement to existing BEV models, as evidenced by notable improvements of 12.3% in mAP and 10.1% in NDS achieved for 3D object detection without introducing additional computational complexity. Moreover, substantial improvements in long-tail object detection and under challenging weather and lighting conditions further validate BEVDiffuser's effectiveness in denoising and enhancing BEV representations.
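A training-time denoising step of this kind can be pictured as a standard DDPM objective applied to BEV feature maps with the ground-truth layout as conditioning. The sketch below is a generic diffusion step under our assumptions; `denoiser(noisy, t, layout_embed)` is a hypothetical noise-prediction network and the linear beta schedule is illustrative, not the actual BEVDiffuser objective.

```python
import torch
import torch.nn.functional as F

def bev_denoising_loss(bev_feats, layout_embed, denoiser, T=1000):
    # bev_feats: (B, C, H, W) BEV feature map; layout_embed: conditioning
    # derived from the ground-truth object layout.
    b = bev_feats.shape[0]
    t = torch.randint(0, T, (b,), device=bev_feats.device)
    betas = torch.linspace(1e-4, 0.02, T, device=bev_feats.device)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t].view(b, 1, 1, 1)
    noise = torch.randn_like(bev_feats)
    noisy = alpha_bar.sqrt() * bev_feats + (1.0 - alpha_bar).sqrt() * noise
    # predict the injected noise, conditioned on the layout
    return F.mse_loss(denoiser(noisy, t, layout_embed), noise)
```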
Poster
Dubing Chen · Huan Zheng · Jin Fang · Xingping Dong · Xianfei Li · Wenlong Liao · Tao He · Pai Peng · Jianbing Shen
[ ExHall D ]
Abstract
We present GDFusion, a temporal fusion method for vision-based 3D semantic occupancy prediction (VisionOcc). GDFusion opens up the underexplored aspects of temporal fusion within the VisionOcc framework, with a focus on both temporal cues and fusion strategies. It systematically examines the entire VisionOcc pipeline, identifying three fundamental yet previously overlooked temporal cues: scene-level consistencies, motion calibration, and geometric complementation. These cues capture diverse facets of temporal evolution and provide distinctive contributions across various modules in the general VisionOcc framework. To effectively fuse temporal signals of different representations, we introduce a novel fusion strategy by reinterpreting vanilla RNNs. This approach utilizes gradient descent on features to unify the integration of diverse temporal information. Extensive experiments on nuScenes demonstrate that GDFusion significantly outperforms established baselines, delivering a consistent increase in mIoU of between 2.2% and 4.7% with less memory consumption. Codes will be made publicly available.
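One way to picture "gradient descent on features" as a fusion rule is to refine the current features with a few explicit gradient steps toward the temporally aligned history features, as in the toy sketch below; the quadratic objective and step count are our own illustrative choices, not the GDFusion formulation.

```python
import torch

def fuse_by_feature_gradient(curr_feats, hist_feats, steps=3, lr=0.1):
    # curr_feats: current-frame features; hist_feats: history features
    # already aligned to the current frame (any matching shape).
    fused = curr_feats.clone().detach().requires_grad_(True)
    for _ in range(steps):
        # simple quadratic discrepancy with the aligned history
        loss = (fused - hist_feats.detach()).pow(2).mean()
        grad, = torch.autograd.grad(loss, fused)
        fused = (fused - lr * grad).detach().requires_grad_(True)
    return fused.detach()
```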
Poster
Zhimin Liao · Ping Wei · Shuaijia Chen · Haoxuan Wang · Ziyang Ren
[ ExHall D ]
Abstract
3D occupancy and scene flow offer a detailed and dynamic representation of 3D space, which is particularly critical for autonomous driving applications. Recognizing the sparsity and complexity of 3D space, previous vision-centric methods have employed implicit learning-based approaches to model spatial and temporal information. However, these approaches struggle to capture local details and diminish the model's spatial discriminative ability. To address these challenges, we propose a novel explicit state-based modeling method designed to leverage the occupied state to renovate the 3D features. Specifically, we propose a sparse occlusion-aware attention mechanism, integrated with a cascade refinement strategy, which accurately renovates 3D features with the guidance of occupied state information. Additionally, we introduce a novel sparse-based method to model long-term dynamic information, conserving computation while preserving spatial information. Compared to the previous state-of-the-art methods, our efficient explicit renovation strategy not only delivers superior performance in terms of RayIoU and MAVE for occupancy and scene flow prediction but also markedly reduces GPU memory usage during training, bringing it down to less than 8.7GB.
Poster
Pan Yin · Kaiyu Li · Xiangyong Cao · Jing Yao · Lei Liu · Xueru Bai · Feng Zhou · Deyu Meng
[ ExHall D ]
Abstract
Recently, road graph extraction has garnered increasing attention due to its crucial role in autonomous driving, navigation, etc. However, accurately and efficiently extracting road graphs remains a persistent challenge, primarily due to the severe scarcity of labeled data. To address this limitation, we collect a global-scale satellite road graph extraction dataset, i.e., the Global-Scale dataset. Specifically, the Global-Scale dataset is ∼20× larger than the largest existing public road extraction dataset and spans over 13,800 km² globally. Additionally, we develop a novel road graph extraction model, i.e., SAM-Road++, which adopts a node-guided resampling method to alleviate the mismatch issue between training and inference in SAM-Road, a pioneering state-of-the-art road graph extraction model. Furthermore, we propose a simple yet effective "extended-line" strategy in SAM-Road++ to mitigate the occlusion issue on the road. Extensive experiments demonstrate the validity of the collected Global-Scale dataset and the proposed SAM-Road++ method, particularly highlighting its superior predictive power in unseen regions.
Poster
Chenxu Zhou · Lvchang Fu · Sida Peng · Yunzhi Yan · Zhanhua Zhang · chen yong · Jiazhi Xia · Xiaowei Zhou
[ ExHall D ]
Abstract
This paper targets the challenge of real-time LiDAR re-simulation in dynamic driving scenarios. Recent approaches utilize neural radiance fields combined with the physical modeling of LiDAR sensors to achieve high-fidelity re-simulation results. Unfortunately, these methods face limitations due to high computational demands in large-scale scenes and cannot perform real-time LiDAR rendering. To overcome these constraints, we propose LiDAR-RT, a novel framework that supports real-time, physically accurate LiDAR re-simulation for driving scenes. Our primary contribution is the development of an efficient and effective rendering pipeline, which integrates Gaussian primitives and hardware-accelerated ray tracing technology. Specifically, we model the physical properties of LiDAR sensors using Gaussian primitives with learnable parameters and incorporate scene graphs to handle scene dynamics. Building upon this scene representation, our framework first constructs a bounding volume hierarchy (BVH), then casts rays for each pixel and generates novel LiDAR views through a differentiable rendering algorithm. Importantly, our framework supports realistic rendering with flexible scene editing operations and various sensor configurations. Extensive experiments across multiple public benchmarks demonstrate that our method outperforms state-of-the-art methods in terms of rendering quality and efficiency. Codes will be released to ensure reproducibility.
Poster
Jingqiu Zhou · Lue Fan · Linjiang Huang · Zhaoxiang Zhang · Xiaoyu Shi · Si Liu · Hongsheng Li
[ ExHall D ]
Abstract
Driving scene reconstruction and rendering have advanced significantly using 3D Gaussian Splatting. However, most prior research has focused on the rendering quality along a pre-recorded vehicle path and struggles to generalize to out-of-path viewpoints, which is caused by the lack of high-quality supervision in those out-of-path views. To address this issue, we introduce an Inverse View Warping technique to create compact and high-quality images as supervision for the reconstruction of the out-of-path views, enabling high-quality rendering results for those views. For accurate and robust inverse view warping, a depth bootstrap strategy is proposed to obtain on-the-fly dense depth maps during the optimization process, overcoming the sparsity and incompleteness of LiDAR depth data. Our method achieves superior in-path and out-of-path reconstruction and rendering performance on the widely used Waymo Open dataset. In addition, a simulator-based benchmark is proposed to obtain the out-of-path ground truth and quantitatively evaluate the performance of out-of-path rendering, where our method outperforms previous methods by a significant margin.
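A minimal sketch of depth-based inverse warping (an assumed formulation, not the paper's exact pipeline): target pixels are backprojected with a dense depth map, transformed by the relative pose, and used to sample the source image, which is how supervision for a new viewpoint of this kind is typically synthesized.

# Hedged sketch of depth-based inverse view warping.
import torch
import torch.nn.functional as F

def inverse_warp(src_img, tgt_depth, K, R, t):
    """src_img: (1,3,H,W); tgt_depth: (1,1,H,W); K: (3,3); R: (3,3); t: (3,)"""
    _, _, H, W = src_img.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=0).reshape(3, -1)   # (3, H*W)
    cam = K.inverse() @ pix * tgt_depth.reshape(1, -1)                    # backproject
    cam_src = R @ cam + t.reshape(3, 1)                                   # to source frame
    proj = K @ cam_src
    uv = proj[:2] / proj[2:3].clamp(min=1e-6)                             # perspective divide
    grid = torch.stack([uv[0] / (W - 1) * 2 - 1, uv[1] / (H - 1) * 2 - 1], dim=-1)
    grid = grid.reshape(1, H, W, 2)                                       # normalized for grid_sample
    return F.grid_sample(src_img, grid, align_corners=True)

H, W = 48, 64
K = torch.tensor([[50.0, 0, W / 2], [0, 50.0, H / 2], [0, 0, 1.0]])
warped = inverse_warp(torch.rand(1, 3, H, W), torch.full((1, 1, H, W), 5.0),
                      K, torch.eye(3), torch.tensor([0.2, 0.0, 0.0]))
print(warped.shape)  # torch.Size([1, 3, 48, 64])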
Poster
Chaojun Ni · Guosheng Zhao · Xiaofeng Wang · Zheng Zhu · Wenkang Qin · Guan Huang · Chen Liu · Yuyin Chen · Yida Wang · Xueyang Zhang · Yifei Zhan · Kun Zhan · Peng Jia · XianPeng Lang · Xingang Wang · Wenjun Mei
[ ExHall D ]
Abstract
Closed-loop simulation is crucial for end-to-end autonomous driving. Existing sensor simulation methods (e.g., NeRF and 3DGS) reconstruct driving scenes based on conditions that closely mirror training data distributions. However, these methods struggle with rendering novel trajectories, such as lane changes. Recent work, DriveDreamer4D, has demonstrated that integrating world model knowledge alleviates these issues. Although the training-free integration approach is efficient, it still struggles to render larger maneuvers, such as multi-lane shifts. Therefore, we introduce ReconDreamer, which enhances driving scene reconstruction through incremental integration of world model knowledge. Specifically, based on the world model, DriveRestorer is proposed to mitigate ghosting artifacts via online restoration. Additionally, we propose a progressive data update strategy to ensure high-quality rendering for larger maneuvers. Notably, ReconDreamer is the first method to effectively render large maneuvers (e.g., across multiple lanes, spanning up to 6 meters). Additionally, experimental results demonstrate that ReconDreamer outperforms Street Gaussians in NTA-IoU, NTL-IoU, and FID, with relative improvements of 24.87%, 6.72%, and 29.97%. Furthermore, ReconDreamer surpasses DriveDreamer4D with PVG during large maneuver rendering, as verified by a relative improvement of 195.87% in the NTA-IoU metric and a comprehensive user study.
Poster
Shuhan Tan · John Wheatley Lambert · Hong Jeon · Sakshum Kulshrestha · Yijing Bai · Jing Luo · Dragomir Anguelov · Mingxing Tan · Chiyu “Max” Jiang
[ ExHall D ]
Abstract
The goal of traffic simulation is to augment a potentially limited amount of manually-driven miles that is available for testing and validation, with a much larger amount of simulated synthetic miles. The culmination of this vision would be a generative simulated city, where given a map of the city and an autonomous vehicle (AV) software stack, the simulator can seamlessly simulate the trip from point A to point B by populating the city around the AV and controlling all aspects of the scene, from animating the dynamic agents (e.g., vehicles, pedestrians) to controlling the traffic light states. We refer to this vision as CitySim, which requires an agglomeration of simulation technologies: scene generation to populate the initial scene, agent behavior modeling to animate the scene, occlusion reasoning, dynamic scene generation to seamlessly spawn and remove agents, and environment simulation for factors such as traffic lights. While some key technologies have been separately studied in various works, others such as dynamic scene generation and environment simulation have received less attention in the research community. We propose SceneDiffuser++, the first end-to-end generative world model trained on a single loss function capable of point A-to-B simulation on a city scale integrating all the …
Poster
Ziyang Xie · Zhizheng Liu · Zhenghao Peng · Wayne Wu · Bolei Zhou
[ ExHall D ]
Abstract
The sim-to-real gap has long posed a significant challenge for robot learning in simulation, preventing the deployment of learned models in the real world. Previous work has primarily focused on domain randomization and system identification to mitigate this gap. However, these methods are often limited by the inherent constraints of the simulation and graphics engines. In this work, we propose Vid2Sim, a novel framework that effectively bridges the sim2real gap through a scalable and cost-efficient real2sim pipeline for neural 3D scene reconstruction and simulation. Given a monocular video as input, Vid2Sim can generate photo-realistic and physically interactable 3D simulation environments to enable the reinforcement learning of visual navigation agents in complex urban environments. Extensive experiments demonstrate that Vid2Sim significantly improves the success rate of urban navigation in the digital twins and the real world by 31.2% and 68.3%, respectively, compared with agents trained with prior simulation methods. Code and data will be made publicly available.
Poster
Yuchen Xia · Quan Yuan · Guiyang Luo · Xiaoyuan Fu · Yang Li · Xuanhan Zhu · Tianyou Luo · Siheng Chen · Jinglin Li
[ ExHall D ]
Abstract
Collaborative perception in autonomous driving significantly enhances the perception capabilities of individual agents. Immutable heterogeneity in collaborative perception, where agents have different and fixed perception networks, presents a major challenge due to the semantic gap in their exchanged intermediate features without modifying the perception networks. Most existing methods bridge the semantic gap through interpreters. However, they either require training a new interpreter for each new agent type, limiting extensibility, or rely on a two-stage interpretation via an intermediate standardized semantic space, causing cumulative semantic loss. To achieve both extensibility in immutable heterogeneous scenarios and low-loss feature interpretation, we propose PolyInter, a polymorphic feature interpreter. It contains an extension point through which emerging new agents can seamlessly integrate by overriding only their specific prompts, which are learnable parameters intended to guide the interpretation, while reusing PolyInter's remaining parameters. By leveraging polymorphism, our design ensures that a single interpreter is sufficient to accommodate diverse agents and interpret their features into the ego agent's semantic space. Experiments conducted on the OPV2V dataset demonstrate that PolyInter improves collaborative perception precision by up to 11.1% compared to SOTA interpreters, while comparable results can be achieved by training only 1.4% of PolyInter's parameters when adapting to new agents.
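A hedged sketch of the prompt-override idea: a shared interpreter is kept frozen and each new agent type contributes only a small set of learnable prompt tokens. Class and parameter names (PolyInterpreterSketch, prompt_len) are illustrative, not the authors' API.

# Hedged sketch of prompt-based polymorphic feature interpretation.
import torch
import torch.nn as nn

class PolyInterpreterSketch(nn.Module):
    def __init__(self, feat_dim=128, prompt_len=4):
        super().__init__()
        self.shared = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4, batch_first=True),
            num_layers=2)
        self.prompts = nn.ParameterDict()   # one prompt per registered agent type
        self.prompt_len = prompt_len
        self.feat_dim = feat_dim

    def register_agent(self, name: str):
        self.prompts[name] = nn.Parameter(torch.randn(self.prompt_len, self.feat_dim) * 0.02)

    def forward(self, feats: torch.Tensor, agent: str) -> torch.Tensor:
        # feats: (B, N, C) intermediate features from a heterogeneous agent.
        B = feats.shape[0]
        prompt = self.prompts[agent].unsqueeze(0).expand(B, -1, -1)
        out = self.shared(torch.cat([prompt, feats], dim=1))
        return out[:, self.prompt_len:]      # interpreted features in the ego semantic space

interp = PolyInterpreterSketch()
interp.register_agent("lidar_agent_v2")      # hypothetical agent type
for p in interp.shared.parameters():         # when adapting, only the new prompt would train
    p.requires_grad_(False)
print(interp(torch.randn(2, 10, 128), "lidar_agent_v2").shape)  # torch.Size([2, 10, 128])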
Poster
Zebin Xing · Xingyu Zhang · Yang Hu · Bo Jiang · Tong He · Qian Zhang · Xiaoxiao Long · Wei Yin
[ ExHall D ]
Abstract
In this paper, we propose an end-to-end autonomous driving method focused on generating high-quality multimodal trajectories. In autonomous driving scenarios, there is rarely a single suitable trajectory; multimodal trajectory generation aims to provide multiple trajectory candidates, enhancing the system's reliability and human-computer interaction. Although recent works have shown success, they suffer from trajectory selection complexity and reduced trajectory quality due to high trajectory divergence and inconsistencies between guidance and scene information. To address these issues, we introduce GoalFlow, a novel method that effectively constrains the generative process to produce high-quality, multimodal trajectories. To resolve the trajectory divergence problem inherent in diffusion-based methods, GoalFlow constrains the generated trajectories by introducing a goal point. To tackle the issue of information inconsistency, GoalFlow establishes a novel scoring mechanism that selects the most appropriate goal point from the candidate points based on scene information. Furthermore, GoalFlow employs an efficient generative method, Flow Matching, to generate multimodal trajectories, and incorporates a refined scoring mechanism to select the optimal trajectory from the candidates. Our experimental results, validated on the Navsim benchmark, demonstrate that GoalFlow achieves state-of-the-art performance, delivering robust multimodal trajectories for autonomous driving. Compared to the state-of-the-art methods, which scored 84.0, our model achieves an improvement of …
Poster
Zikang Zhou · Hengjian Zhou · Haibo Hu · Zihao WEN · Jianping Wang · Yung-Hui Li · Yu-Kai Huang
[ ExHall D ]
Abstract
Anticipating the multimodality of future events lays the foundation for safe autonomous driving. However, multimodal motion prediction for traffic agents has been clouded by the lack of multimodal ground truth. Existing works predominantly adopt the winner-take-all training strategy to tackle this challenge, yet still suffer from limited trajectory diversity and misaligned mode confidence. While some approaches address these limitations by generating excessive trajectory candidates, they necessitate a post-processing stage to identify the most representative modes, a process lacking universal principles and compromising trajectory accuracy. We are thus motivated to introduce ModeSeq, a new multimodal prediction paradigm that models modes as sequences. Unlike the common practice of decoding multiple plausible trajectories in one shot, ModeSeq requires motion decoders to infer the next mode step by step, thereby more explicitly capturing the correlation between modes and significantly enhancing the ability to reason about multimodality. Leveraging the inductive bias of sequential mode prediction, we also propose the Early-Match-Take-All (EMTA) training strategy to diversify the trajectories further. Without relying on dense mode prediction or rule-based trajectory selection, ModeSeq considerably improves the diversity of multimodal output while attaining satisfactory trajectory accuracy, resulting in balanced performance on motion prediction benchmarks. Moreover, ModeSeq naturally emerges with the …
Poster
Yichen Xie · Runsheng Xu · Tong He · Jyh-Jing Hwang · Katie Z Luo · Jingwei Ji · Hubert Lin · Letian Chen · Yiren Lu · Zhaoqi Leng · Dragomir Anguelov · Mingxing Tan
[ ExHall D ]
Abstract
The latest advancements in multi-modal large language models (MLLMs) have spurred a strong renewed interest in end-to-end motion planning approaches for autonomous driving. Many end-to-end approaches rely on human annotations to learn intermediate perception and prediction tasks, while purely self-supervised approaches—which directly learn from sensor inputs to generate planning trajectories without human annotations—often underperform the state of the art. We observe a key gap in the input representation space: end-to-end approaches built on MLLMs are often pretrained with reasoning tasks in perspective view space rather than the native 3D space that autonomous vehicles plan in. To this end, we propose PaLI-Driver, based on the popular PaLI vision-language model. PaLI-Driver uses a novel sparse volume strategy to seamlessly transform the strong visual representation of MLLMs from perspective view to 3D space without the need to finetune the vision encoder. This representation aggregates multiview and multi-frame visual inputs and enables better prediction of planning trajectories in 3D space. To validate our method, we run experiments on both nuScenes and our in-house collected dataset X-Planning. Results show that PaLI-Driver performs favorably against existing supervised multi-task approaches while requiring no human annotations. It also demonstrates great scalability when pretrained on large volumes of …
Poster
Yifan Wang · Jian Zhao · Zhaoxin Fan · Xin Zhang · Xuecheng Wu · Yudian Zhang · Lei Jin · Xinyue Li · Gang Wang · Mengxi Jia · Ping Hu · Zheng Zhu · Xuelong Li
[ ExHall D ]
Abstract
Unmanned Aerial Vehicles (UAVs) are widely adopted across various fields, yet they raise significant privacy and safety concerns, demanding robust monitoring solutions. Existing anti-UAV methods primarily focus on position tracking but fail to capture UAV behavior and intent. To address this, we introduce a novel task—UAV Tracking and Intent Understanding (UTIU)—which aims to track UAVs while inferring and describing their motion states and intent for a more comprehensive monitoring approach. To tackle the task, we propose JTD-UAV, the first joint tracking and intent description framework based on large language models. Our dual-branch architecture integrates UAV tracking with Visual Question Answering (VQA), allowing simultaneous localization and behavior description. To benchmark this task, we introduce the TDUAV dataset, the largest dataset for joint UAV tracking and intent understanding, featuring 1,328 challenging video sequences, over 163K annotated thermal frames, and 3K VQA pairs. Our benchmark demonstrates the effectiveness of JTD-UAV, and both the dataset and code will be publicly available.
Poster
Ruiqi Qiu · JUN GONG · Xinyu Zhang · Siqi Luo · Bowen Zhang · Yi Cen
[ ExHall D ]
Abstract
The ability to adapt to varying observation lengths is crucial for trajectory prediction tasks, particularly in scenarios with limited observation lengths or missing data. Existing approaches mainly focus on introducing novel architectures or additional structural components, which substantially increase model complexity and present challenges for integration into existing models. We argue that current network architectures are sufficiently sophisticated to handle the Observation Length Shift problem, with the key challenge lying in improving feature representation for trajectories with limited lengths. To tackle this issue, we introduce a general and effective contrastive learning approach, called Contrastive Learning for Length Shift (CLLS). By incorporating contrastive learning during the training phase, our method encourages the model to extract length-invariant features, thus mitigating the impact of observation length variations. Furthermore, to better accommodate length adaptation tasks, we introduce a lightweight RNN network that, combined with CLLS, achieves state-of-the-art performance in both general prediction and observation length shift tasks. Experimental results demonstrate that our approach outperforms existing methods across multiple widely-used trajectory prediction datasets.
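A minimal sketch of length-shift contrastive training under assumed details (a GRU encoder, an InfoNCE loss, and random truncation are choices made here for illustration): embeddings of the full observation and a truncated view of the same trajectory are pulled together, encouraging length-invariant features.

# Hedged sketch of contrastive training for observation-length invariance.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrajEncoder(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.rnn = nn.GRU(input_size=2, hidden_size=dim, batch_first=True)

    def forward(self, traj):                 # traj: (B, T, 2) x/y coordinates
        _, h = self.rnn(traj)
        return F.normalize(h[-1], dim=-1)    # (B, dim), unit-norm embedding

def info_nce(z_full, z_short, temperature=0.1):
    logits = z_full @ z_short.t() / temperature        # (B, B) similarity matrix
    labels = torch.arange(z_full.shape[0])             # positives on the diagonal
    return F.cross_entropy(logits, labels)

enc = TrajEncoder()
traj = torch.randn(8, 12, 2)                           # 12 observed steps
keep = torch.randint(3, 12, (1,)).item()               # simulate a shorter observation
loss = info_nce(enc(traj), enc(traj[:, -keep:]))       # keep only the most recent steps
loss.backward()
print(float(loss))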
Poster
Dianze Li · Jianing Li · Xu Liu · Xiaopeng Fan · Yonghong Tian
[ ExHall D ]
Abstract
Integrating frames and events has become a widely accepted solution for various tasks in challenging scenarios. However, most multimodal methods directly convert events into image-like formats synchronized with frames and process each stream through separate two-branch backbones, making it difficult to fully exploit the spatiotemporal events while limiting inference frequency to the frame rate. To address these problems, we propose a novel asynchronous collaborative graph representation, namely ACGR, which is the first trial to explore a unified graph framework for asynchronously processing frames and events with high performance and low latency. Technically, we first construct unimodal graphs for frames and events to preserve their spatiotemporal properties and sparsity. Then, an asynchronous collaborative alignment module is designed to align and fuse frames and events into a unified graph and the ACGR is generated through graph convolutional networks. Finally, we innovatively introduce domain adaptation to enable cross-modal interactions between frames and events by aligning their feature spaces. Experimental results show that our approach outperforms state-of-the-art methods in both object detection and depth estimation tasks, while significantly reducing computational latency and achieving real-time inference up to 200 Hz. Our code will be open-sourced and available in the supplementary material.
Poster
Huangyue Yu · Baoxiong Jia · Yixin Chen · Yandan Yang · Puhao Li · Rongpeng Su · Jiaxin Li · Qing Li · Wei Liang · Song-Chun Zhu · Tengyu Liu · Siyuan Huang
[ ExHall D ]
Abstract
Embodied AI (EAI) research depends on high-quality and diverse 3D scenes to enable effective skill acquisition, sim-to-real transfer, and domain generalization. Recent 3D scene datasets are still limited in scalability due to their dependence on artist-driven designs and challenges in replicating the diversity of real-world objects. To address these limitations and automate the creation of 3D simulatable scenes, we present METASCENES, a large-scale 3D scene dataset constructed from real-world scans. It features 706 scenes with 15,366 objects across a wide array of types, arranged in realistic layouts, with visually accurate appearances and physical plausibility. Leveraging the recent advancements in object-level modeling, we provide each object with a curated set of candidates, ranked through human annotation for optimal replacements based on geometry, texture, and functionality. These annotations enable a novel multi-modal alignment model, SCAN2SIM, which facilitates automated and high-quality asset replacement. We further validate the utility of our dataset with two benchmarks: Micro-Scene Synthesis for small object layout generation and cross-domain vision-language navigation (VLN). Results confirm the potential of METASCENES to enhance EAI by supporting more generalizable agent learning and sim-to-real applications, introducing new possibilities for EAI research.
Poster
Dongyue Lu · Lingdong Kong · Tianxin Huang · Gim Hee Lee
[ ExHall D ]
Abstract
Identifying affordance regions on 3D objects from semantic cues is essential for robotics and human-machine interaction. However, existing 3D affordance learning methods struggle with generalization and robustness due to limited annotated data and a reliance on 3D backbones focused on geometric encoding, which often lack resilience to real-world noise and data corruption. We propose GEAL, a novel framework designed to enhance the generalization and robustness of 3D affordance learning by leveraging large-scale pre-trained 2D models. We employ a dual-branch architecture with Gaussian splatting to establish consistent mappings between 3D point clouds and 2D representations, enabling realistic 2D renderings from sparse point clouds. A granularity-adaptive fusion module and a 2D-3D consistency alignment module further strengthen cross-modal alignment and knowledge transfer, allowing the 3D branch to benefit from the rich semantics and generalization capacity of 2D models. To holistically assess the robustness, we introduce two new corruption-based benchmarks: PIAD-C and LASO-C. Extensive experiments on public datasets and our benchmarks show that GEAL consistently outperforms existing methods across seen and novel object categories, as well as corrupted data, demonstrating robust and adaptable affordance prediction under diverse conditions. We will release our datasets and source code as open source upon paper acceptance.
Poster
Chunlin Yu · Hanqing Wang · Ye Shi · Haoyang Luo · Sibei Yang · Jingyi Yu · Jingya Wang
[ ExHall D ]
Abstract
3D affordance segmentation aims to link human instructions to touchable regions of 3D objects for embodied manipulations. Existing efforts typically adhere to single-object, single-affordance paradigms, where each affordance type or explicit instruction strictly corresponds to a specific affordance region; such paradigms are unable to handle long-horizon tasks and cannot actively reason about complex user intentions that often imply sequential affordances. In this paper, we introduce the Sequential 3D Affordance Reasoning task, which extends the traditional paradigm by reasoning from complex user intentions and then decomposing them into a series of segmentation maps. Toward this, we construct the first instruction-based affordance segmentation benchmark that includes reasoning over both single and sequential affordances, comprising 180K instruction-point cloud pairs. Based on the benchmark, we propose our model, SeqAfford, to unlock the 3D multi-modal large language model with additional affordance segmentation abilities, which ensures reasoning with world knowledge and fine-grained affordance grounding in a cohesive framework. We further introduce a multi-granular language-point integration module to endow 3D dense prediction. Extensive experimental evaluations show that our model excels over well-established methods and exhibits open-world generalization with sequential reasoning abilities.
Poster
Qingqing Zhao · Yao Lu · Moo Jin Kim · Zipeng Fu · Zhuoyang Zhang · Yecheng Wu · Max Li · Qianli Ma · Song Han · Chelsea Finn · Ankur Handa · Tsung-Yi Lin · Gordon Wetzstein · Ming-Yu Liu · Donglai Xiang
[ ExHall D ]
Abstract
Vision-language-action models (VLAs) have shown potential in leveraging pretrained vision-language models and diverse robot demonstrations for learning generalizable sensorimotor control. While this paradigm effectively utilizes large-scale data from both robotic and non-robotic sources, current VLAs primarily focus on direct input-output mappings, lacking the intermediate reasoning steps crucial for complex manipulation tasks. As a result, existing VLAs lack temporal planning or reasoning capabilities. In this paper, we introduce a method that incorporates explicit visual chain-of-thought (CoT) reasoning into vision-language-action models (VLAs) by predicting future image frames autoregressively as visual goals before generating a short action sequence to achieve these goals. We introduce CoT-VLA, a state-of-the-art 7B VLA that can understand and generate visual and action tokens. Our experimental results demonstrate that CoT-VLA achieves strong performance, outperforming the state-of-the-art VLA model by 17% in real-world manipulation tasks and 6% in simulation benchmarks.
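The decoding order described above can be sketched as follows; the tiny autoregressor, vocabulary size, and token counts are placeholders, not the CoT-VLA architecture. The point is only the order: goal-image tokens are emitted before the action chunk.

# Hedged sketch of visual chain-of-thought decoding order (illustrative only).
import torch
import torch.nn as nn

VOCAB, IMG_TOKENS, ACT_TOKENS, DIM = 512, 16, 8, 128

class TinyAutoregressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):               # tokens: (B, T) integer ids
        T = tokens.shape[1]
        mask = nn.Transformer.generate_square_subsequent_mask(T)   # causal attention
        h = self.backbone(self.embed(tokens), mask=mask)
        return self.head(h[:, -1])           # next-token logits

@torch.no_grad()
def rollout(model, prompt):
    seq = prompt                              # (1, T0) observation/instruction tokens
    for _ in range(IMG_TOKENS + ACT_TOKENS):  # goal-image tokens first, then actions
        nxt = model(seq).argmax(dim=-1, keepdim=True)
        seq = torch.cat([seq, nxt], dim=1)
    goal_img = seq[:, -(IMG_TOKENS + ACT_TOKENS):-ACT_TOKENS]
    actions = seq[:, -ACT_TOKENS:]
    return goal_img, actions

goal, act = rollout(TinyAutoregressor(), torch.randint(0, VOCAB, (1, 4)))
print(goal.shape, act.shape)   # torch.Size([1, 16]) torch.Size([1, 8])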
Poster
Zhenyu Wu · Yuheng Zhou · Xiuwei Xu · Ziwei Wang · Haibin Yan
[ ExHall D ]
Abstract
Mobile manipulation is the fundamental challenge for robotics to assist humans with diverse tasks and environments in everyday life. However, conventional mobile manipulation approaches often struggle to generalize across different tasks and environments because of the lack of large-scale training. In contrast, recent advances in vision-language-action (VLA) models have shown impressive generalization capabilities, but these foundation models are developed for fixed-base manipulation tasks. Therefore, we propose an efficient policy adaptation framework to transfer pre-trained VLA models of fixed-base manipulation to mobile manipulation, so that high generalization ability across tasks and environments can be achieved in the mobile manipulation policy. Specifically, we utilize pre-trained VLA models to generate waypoints of the end-effector with high generalization ability. We design motion planning objectives for the mobile base and the robot arm, which aim at maximizing the physical feasibility of the trajectory. Finally, we present an efficient bi-level objective optimization framework for trajectory generation, where the upper-level optimization predicts waypoints for base movement to enhance the manipulator policy space, and the lower-level optimization selects the optimal end-effector trajectory to complete the manipulation task. Extensive experimental results on OVMM and the real world demonstrate that our method achieves a 4.2% higher success rate than the state-of-the-art mobile manipulation, and …
Poster
Yuheng Ji · Huajie Tan · Jiayu Shi · Xiaoshuai Hao · Yuan Zhang · Hengyuan Zhang · Pengwei Wang · Mengdi Zhao · Yao Mu · Pengju An · Xinda Xue · Qinghang Su · Huaihai Lyu · Xiaolong Zheng · Jiaming Liu · Zhongyuan Wang · Shanghang Zhang
[ ExHall D ]
Abstract
Recent advancements in Multimodal Large Language Models (MLLMs) have shown remarkable capabilities across various multimodal contexts. However, their application in robotic scenarios, particularly for long-horizon manipulation tasks, reveals significant limitations. These limitations arise from the current MLLMs lacking three essential robotic brain capabilities: Planning Capability, which involves decomposing complex manipulation instructions into manageable sub-tasks; Affordance Perception, the ability to recognize and interpret the affordances of interactive objects; and Trajectory Prediction, the foresight to anticipate the complete manipulation trajectory necessary for successful execution. To enhance the robotic brain's core capabilities from abstract to concrete, we introduce ShareRobot, a high-quality heterogeneous dataset that labels multi-dimensional information such as task planning, object affordance, and end-effector trajectory. ShareRobot's diversity and accuracy have been meticulously refined by three human annotators. Building on this dataset, we developed RoboBrain, an MLLM-based model that combines robotic and general multi-modal data, utilizes a multi-stage training strategy, and incorporates long videos and high-resolution images to improve its robotic manipulation capabilities. Extensive experiments demonstrate that RoboBrain achieves state-of-the-art performance across various robotic tasks, highlighting its potential to advance robotic brain capabilities.
Poster
Tianxing Chen · Yao Mu · Zhixuan Liang · Zanxin Chen · Shijia Peng · Qiangyu Chen · Mingkun Xu · Ruizhen Hu · Hongyuan Zhang · Xuelong Li · Ping Luo
[ ExHall D ]
Abstract
Recent advances in imitation learning for 3D robotic manipulation have shown promising results with diffusion-based policies. However, achieving human-level dexterity requires seamless integration of geometric precision and semantic understanding. We present G3Flow, a novel framework that constructs real-time semantic flow, a dynamic, object-centric 3D semantic representation by leveraging foundation models. Our approach uniquely combines 3D generative models for digital twin creation, vision foundation models for semantic feature extraction, and robust pose tracking for continuous semantic flow updates. This integration enables complete semantic understanding even under occlusions while eliminating manual annotation requirements. By incorporating semantic flow into diffusion policies, we demonstrate significant improvements in both terminal-constrained manipulation and cross-object generalization. Extensive experiments across five simulation tasks show that G3Flow consistently outperforms existing approaches, achieving up to 68.3% and 50.1% average success rates on terminal-constrained manipulation and cross-object generalization tasks respectively. Our results demonstrate the effectiveness of G3Flow in enhancing real-time dynamic semantic feature understanding for robotic manipulation policies.
Poster
Zhixuan Liang · Yao Mu · Yixiao Wang · Fei Ni · Tianxing Chen · Wenqi Shao · Wei Zhan · Masayoshi Tomizuka · Ping Luo · Mingyu Ding
[ ExHall D ]
Abstract
Dexterous manipulation with contact-rich interactions is crucial for advanced robotics. While recent diffusion-based planning approaches show promise for simpler manipulation tasks, they often produce unrealistic ghost states (e.g., the object automatically moves without hand contact) or lack adaptability when handling complex sequential interactions. In this work, we introduce DexDiffuser, an interaction-aware diffusion planning framework for adaptive dexterous manipulation. DexDiffuser models joint state-action dynamics through a dual-phase diffusion process which consists of pre-interaction contact alignment and post-contact goal-directed control, enabling goal-adaptive generalizable dexterous manipulation. Additionally, we incorporate dynamics model-based dual guidance and leverage large language models for automated guidance function generation, enhancing generalizability for physical interactions and facilitating diverse goal adaptation through language cues. Experiments on physical interaction tasks such as door opening, pen and block re-orientation, and hammer striking demonstrate DexDiffuser's effectiveness on goals outside training distributions, achieving over twice the average success rate (59.2% vs. 29.5%) compared to existing methods. Our framework achieves 70.0% success on 30-degree door opening, 40.0% and 36.7% on pen and block half-side re-orientation respectively, and 46.7% on hammer nail half drive, highlighting its robustness and flexibility in contact-rich manipulation.
Poster
Guangyan Chen · Te Cui · Meiling Wang · Yang Chengcai · Mengxiao Hu · Haoyang Lu · Yao Mu · Zicai Peng · Tianxing Zhou · XINRAN JIANG · Yi Yang · Yufeng Yue
[ ExHall D ]
Abstract
Learning from demonstration is a powerful method for robotic skill acquisition. However, the significant expense of collecting such action-labeled robot data presents a major bottleneck. Video data, a rich data source encompassing diverse behavioral and physical knowledge, emerges as a promising alternative. In this paper, we present GraphMimic, a novel paradigm that leverages video data via graph-to-graphs generative modeling, which pre-trains models to generate future graphs conditioned on the graph within a video frame. Specifically, GraphMimic abstracts video frames into object and visual action vertices, and constructs graphs for state representations. The graph generative modeling network then effectively models internal structures and spatial relationships within the constructed graphs, aiming to generate future graphs. The generated graphs serve as conditions for the control policy, mapping to robot actions. Our concise approach captures important spatial relations and enhances future graph generation accuracy, enabling the acquisition of robust policies from limited action-labeled data. Furthermore, the transferable graph representations facilitate the effective learning of manipulation skills from cross-embodiment videos. Our experiments show that GraphMimic achieves superior performance using merely 20% of the action-labeled data. Moreover, our method outperforms the state-of-the-art method by over 17% and 23% in simulation and real-world experiments, and delivers improvements of over …
Poster
Yun Liu · Chengwen Zhang · Ruofan Xing · Bingda Tang · Bowen Yang · Li Yi
[ ExHall D ]
Abstract
Understanding how humans cooperatively rearrange household objects is critical for VR/AR and human-robot interaction. However, in-depth studies on modeling these behaviors are under-researched due to the lack of relevant datasets. We fill this gap by presenting CORE4D, a novel large-scale 4D human-object-human interaction dataset focusing on collaborative object rearrangement, which encompasses diverse compositions of various object geometries, collaboration modes, and 3D scenes. With 1K human-object-human motion sequences captured in the real world, we enrich CORE4D by contributing an iterative collaboration retargeting strategy to augment motions to a variety of novel objects. Leveraging this approach, CORE4D comprises a total of 11K collaboration sequences spanning 3K real and virtual object shapes. Benefiting from extensive motion patterns provided by CORE4D, we benchmark two tasks aiming at generating human-object interaction: human-object motion forecasting and interaction synthesis. Extensive experiments demonstrate the effectiveness of our collaboration retargeting strategy and indicate that CORE4D has posed new challenges to existing human-object interaction generation methodologies.
Poster
Alpár Cseke · Shashank Tripathi · Sai Kumar Dwivedi · Arjun Lakshmipathy · Agniv Chatterjee · Michael J. Black · Dimitrios Tzionas
[ ExHall D ]
Abstract
Recovering 3D Human-Object Interaction (HOI) from single color images is challenging due to depth ambiguities, occlusions, and the huge variation in object shape and appearance. Thus, past work requires controlled settings such as known object shapes and contacts, and tackles only limited object classes. Instead, we need methods that generalize to natural images and novel object classes. We tackle this in two main ways: (1) We collect PICO-db, a new dataset of natural images uniquely paired with dense 3D contact on both body and object meshes. To this end, we use images from the recent DAMON dataset that are paired with contacts, but these contacts are only annotated on a canonical 3D body. In contrast, we seek contact labels on both the body and the object. To infer these given an image, we retrieve an appropriate 3D object mesh from a database by leveraging vision foundation models. Then, we project DAMON's body contact patches onto the object via a novel method needing only 2 clicks per patch. This minimal human input establishes rich contact correspondences between bodies and objects. (2) We exploit our new dataset of contact correspondences in a novel render-and-compare fitting method, called PICO-fit, to recover 3D body and …
Poster
Yiming Dou · Wonseok Oh · Yuqing Luo · Antonio Loquercio · Andrew Owens
[ ExHall D ]
Abstract
We study the problem of making 3D scene reconstructions interactive by asking the following question: can we predict the sounds of human hands interacting with a 3D reconstruction of a scene? We focus on human hands since they are versatile in their actions (e.g., tapping, scratching, patting) and very important to simulate human-like avatars in virtual reality applications. To predict the sound of hands, we train a video and hand-conditioned rectified flow model on a novel dataset of 3D-aligned hand-scene interactions with synchronized audio. Evaluation through psychophysical studies shows that our generated sounds are frequently indistinguishable from real sounds, outperforming baselines lacking hand pose or visual scene information. Through quantitative evaluations, we show that the generated sounds accurately convey material properties and actions. We will release our code and dataset to support further development in interactive 3D reconstructions.
Poster
Jinglei Zhang · Jiankang Deng · Chao Ma · Rolandos Alexandros Potamias
[ ExHall D ]
Abstract
Despite recent advances in 3D hand pose estimation, current methods predominantly focus on single-image 3D hand reconstruction in the camera frame, overlooking the world-space motion of the hands. This limitation prohibits their direct use in egocentric video settings, where hands and camera are continuously in motion. In this work, we propose HaWoR, a high-fidelity method for hand motion reconstruction in world coordinates from egocentric videos. We propose to decouple the task by reconstructing the hand motion in the camera space and estimating the camera trajectory in the world coordinate system. To achieve precise camera trajectory estimation, we propose an adaptive egocentric SLAM framework that addresses the shortcomings of traditional SLAM methods, providing robust performance under challenging camera dynamics. To ensure robust hand motion trajectories, even when the hands move out of the view frustum, we devise a novel motion infiller network that effectively completes the missing frames of the sequence. Through extensive quantitative and qualitative evaluations, we demonstrate that HaWoR achieves state-of-the-art performance on both hand motion reconstruction and world-frame camera trajectory estimation under different egocentric benchmark datasets. Code and models will be made publicly available for research purposes.
Poster
Jeonghwan Kim · Jisoo Kim · Jeonghyeon Na · Hanbyul Joo
[ ExHall D ]
Abstract
To enable machines to understand the way humans interact with the physical world in daily life, 3D interaction signals should be captured in natural settings, allowing people to engage with multiple objects in a range of sequential and casual manipulations. To achieve this goal, we introduce our ParaHome system designed to capture dynamic 3D movements of humans and objects within a common home environment. Our system features a multi-view setup with 70 synchronized RGB cameras, along with wearable motion capture devices including an IMU-based body suit and hand motion capture gloves. By leveraging the ParaHome system, we collect a new human-object interaction dataset, including 486 minutes of sequences across 208 captures with 38 participants, offering advancements with three key aspects: (1) capturing body motion and dexterous hand manipulation motion alongside multiple objects within a contextual home environment; (2) encompassing sequential and concurrent manipulations paired with text descriptions; and (3) including articulated objects with multiple parts represented by 3D parameterized models. We present detailed design justifications for our system, and perform key generative modeling experiments to demonstrate the potential of our dataset.
Poster
Jing Gao · Ce Zheng · Laszlo Jeni · Zackory Erickson
[ ExHall D ]
Abstract
In-bed human mesh recovery can be crucial and enabling for several healthcare applications, including sleep pattern monitoring, rehabilitation support, and pressure ulcer prevention. However, it is difficult to collect large real-world visual datasets in this domain, in part due to privacy and expense constraints, which in turn presents significant challenges for training and deploying deep learning models. Existing in-bed human mesh estimation methods often rely heavily on real-world data, limiting their ability to generalize across different in-bed scenarios, such as varying coverings and environmental settings. To address this, we propose a Sim-to-Real Transfer Framework for in-bed human mesh recovery from overhead depth images, which leverages large-scale synthetic data alongside limited or no real-world samples. We introduce a diffusion model that bridges the gap between synthetic data and real data to support generalization in real-world in-bed pose and body inference scenarios. Extensive experiments and ablation studies validate the effectiveness of our framework, demonstrating significant improvements in robustness and adaptability across diverse healthcare scenarios.
Poster
Songpengcheng Xia · Yu Zhang · Zhuo Su · Xiaozheng Zheng · Zheng Lv · Guidong Wang · Yongjie Zhang · Qi Wu · Lei Chu · Ling Pei
[ ExHall D ]
Abstract
Estimating full-body motion using the tracking signals of head and hands from VR devices holds great potential for various applications. However, the sparsity and unique distribution of observations present a significant challenge, resulting in an ill-posed problem with multiple feasible solutions (i.e., hypotheses). This amplifies uncertainty and ambiguity in full-body motion estimation, especially for the lower-body joints. Therefore, we propose a new method, EnvPoser, that employs a two-stage framework to perform full-body motion estimation using sparse tracking signals and pre-scanned environment from VR devices. EnvPoser models the multi-hypothesis nature of human motion through an uncertainty-aware estimation module in the first stage. In the second stage, we refine these multi-hypothesis estimates by integrating semantic and geometric environmental constraints, ensuring that the final motion estimation aligns realistically with both the environmental context and physical interactions. Qualitative and quantitative experiments on two public datasets demonstrate that our method achieves state-of-the-art performance, highlighting significant improvements in human motion estimation within motion-environment interaction scenarios. Code will be released upon paper acceptance.
Poster
German Barquero · Nadine Bertsch · Manojkumar Marramreddy · Carlos Chacón · Filippo Arcadu · Ferran Rigual · Nicky Sijia He · Cristina Palmero · Sergio Escalera · Yuting Ye · Robin Kips
[ ExHall D ]
Abstract
In extended reality (XR), generating full-body motion of the users is important to understand their actions, drive their virtual avatars for social interaction, and convey a realistic sense of presence. While prior works focused on spatially sparse and always-on input signals from motion controllers, many XR applications opt for vision-based hand tracking for reduced user friction and better immersion. Compared to controllers, hand tracking signals are less accurate and can even be missing for an extended period of time. To handle such unreliable inputs, we present Rolling Prediction Model (RPM), an online and real-time approach that generates smooth full-body motion from temporally and spatially sparse input signals. Our model generates 1) accurate motion that matches the inputs (i.e., tracking mode) and 2) plausible motion when inputs are missing (i.e., synthesis mode). More importantly, RPM generates seamless transitions from tracking to synthesis, and vice versa. To demonstrate the practical importance of handling noisy and missing inputs, we present GORP, the first dataset of realistic sparse inputs from a commercial virtual reality (VR) headset with paired high quality body motion ground truth. GORP provides >14 hours of VR gameplay data from 28 people using motion controllers (spatially sparse) and hand tracking (spatially …
Poster
Dong Wei · Xiaoning Sun · Xizhan Gao · Shengxiang Hu · Huaijiang Sun
[ ExHall D ]
Abstract
We investigate a new task in human motion prediction, which aims to forecast future body poses from historically observed sequences while accounting for arbitrary latency. This differs from existing works that assume an ideal scenario where future motions can be "instantaneously" predicted, thereby neglecting time delays caused by network transmission and algorithm execution. Addressing this task requires tackling two key challenges: the length of the latency period can vary significantly across samples, and the prediction model must be efficient. In this paper, we propose ALIEN, which treats the motion as a continuous function parameterized by a neural network, enabling predictions under any latency condition. By incorporating Mamba-like linear attention as a hyper-network and designing subsequent low-rank modulation, ALIEN efficiently learns a set of implicit neural representation weights from the observed motion to encode instance-specific information. Additionally, our model integrates the primary motion prediction task with an additional variable-delay pose reconstruction task in a unified multi-task learning framework, enhancing its ability to capture richer motion patterns. Extensive experiments demonstrate that our approach outperforms state-of-the-art baselines adapted for our new task, while maintaining competitive performance in the traditional prediction setting.
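A minimal sketch of querying a continuous motion representation at an arbitrary latency; the hyper-network and low-rank modulation are replaced here by a plain conditioning MLP for brevity, and all module names and sizes are assumptions.

# Hedged sketch: future motion as a continuous function of time, queried at any latency.
import torch
import torch.nn as nn

class ContinuousMotion(nn.Module):
    def __init__(self, pose_dim=17 * 3, hidden=128):
        super().__init__()
        self.cond = nn.GRU(pose_dim, hidden, batch_first=True)    # summarize observed poses
        self.inr = nn.Sequential(nn.Linear(hidden + 1, hidden), nn.ReLU(),
                                 nn.Linear(hidden, pose_dim))

    def forward(self, observed, query_t):
        # observed: (B, T, J*3) past poses; query_t: (B,) seconds after the last observation
        _, h = self.cond(observed)
        z = torch.cat([h[-1], query_t.unsqueeze(-1)], dim=-1)
        return self.inr(z)                                        # pose at the queried latency

model = ContinuousMotion()
past = torch.randn(4, 25, 51)
for delay in (0.05, 0.20, 0.37):                                  # arbitrary latencies
    pose = model(past, torch.full((4,), delay))
    print(delay, pose.shape)                                      # (4, 51) each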
Poster
Cecilia Curreli · Dominik Muhle · Abhishek Saroha · Zhenzhang Ye · Riccardo Marin · Daniel Cremers
[ ExHall D ]
Abstract
Probabilistic human motion prediction aims to forecast multiple possible future movements from past observations. While current approaches report high diversity and realism, they often generate motions with undetected limb stretching and jitter. To address this, we introduce SkeletonDiffusion, a latent diffusion model that embeds an explicit inductive bias on the human body within its architecture and training. Our model is trained with a novel nonisotropic Gaussian diffusion formulation that aligns with the natural kinematic structure of the human skeleton. Results show that our approach outperforms conventional isotropic alternatives, consistently generating realistic predictions while avoiding artifacts such as limb distortion. Additionally, we identify a limitation in commonly used diversity metrics, which may inadvertently favor models that produce inconsistent limb lengths within the same sequence. SkeletonDiffusion sets a new benchmark on three real-world datasets, outperforming various baselines across multiple evaluation metrics.
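A hedged sketch of nonisotropic forward diffusion: noise is drawn with a joint-level covariance tied to the skeleton graph rather than i.i.d. per joint. The Laplacian-based covariance below is an assumption chosen for illustration, not the paper's exact formulation.

# Hedged sketch: forward diffusion with skeleton-aware, joint-correlated noise.
import torch

J = 5                                            # toy skeleton with 5 joints
edges = [(0, 1), (1, 2), (1, 3), (3, 4)]
A = torch.zeros(J, J)
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
L = torch.diag(A.sum(1)) - A                     # graph Laplacian
cov = torch.linalg.inv(L + 0.5 * torch.eye(J))   # smooth, skeleton-aware covariance (assumed)
chol = torch.linalg.cholesky(cov)

def q_sample(x0, alpha_bar_t):
    """x0: (B, J, 3) clean poses -> noised poses with joint-correlated noise."""
    eps = torch.einsum("jk,bkc->bjc", chol, torch.randn_like(x0))
    return alpha_bar_t.sqrt() * x0 + (1 - alpha_bar_t).sqrt() * eps

x0 = torch.randn(8, J, 3)
xt = q_sample(x0, torch.tensor(0.6))
print(xt.shape)   # torch.Size([8, 5, 3])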
Poster
Jianwei Tang · Hong Yang · Tengyue Chen · Jian-Fang Hu
[ ExHall D ]
Abstract
Action-driven stochastic human motion prediction aims to generate future motion sequences of a pre-defined target action based on given past observed sequences performing non-target actions. This task primarily presents two challenges. First, generating smooth transition motions is hard due to the varying transition speeds of different actions. Second, action characteristics are difficult to learn because some actions are similar to one another. These issues cause the predicted results to be unreasonable and inconsistent. We therefore propose two memory banks, the Soft-transition Action Bank (STAB) and the Action Characteristic Bank (ACB), to tackle the problems above. The STAB stores action transition information. It is equipped with a novel soft-searching approach, which encourages the model to focus on multiple possible action categories of observed motions. The ACB records action characteristics, providing richer prior information for predicting a given action. To better fuse the features retrieved from the two banks, we further propose the Adaptive Attention Adjustment (AAA) strategy. Extensive experiments on four motion prediction datasets demonstrate that our approach consistently outperforms previous state-of-the-art methods.
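A minimal sketch of a soft-searched memory bank in the spirit of STAB/ACB (the slot count, temperature, and attention-style readout are assumptions): an observed-motion feature queries learnable slots, and a softmax mixes several plausible entries rather than committing to a single one.

# Hedged sketch of a soft-searched memory bank.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftMemoryBank(nn.Module):
    def __init__(self, slots=32, dim=64, temperature=0.5):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(slots, dim) * 0.02)
        self.values = nn.Parameter(torch.randn(slots, dim) * 0.02)
        self.temperature = temperature

    def forward(self, query):                       # query: (B, dim) observed-motion feature
        sim = query @ self.keys.t() / self.temperature
        weights = F.softmax(sim, dim=-1)            # soft search over all slots
        return weights @ self.values                # (B, dim) retrieved prior

bank = SoftMemoryBank()
retrieved = bank(torch.randn(4, 64))
print(retrieved.shape)                              # torch.Size([4, 64])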
Poster
Jiayi Su · Youhe Feng · Zheng Li · Jinhua Song · Yangfan He · Botao Ren · Botian Xu
[ ExHall D ]
Abstract
This paper presents a novel framework for modeling and conditional generation of 3D articulated objects. Troubled by flexibility-quality tradeoffs, existing methods are often limited to using predefined structures or retrieving shapes from static datasets. To address these challenges, we parameterize an articulated object as a tree of tokens and employ a transformer to generate both the object's high-level geometry code and its kinematic relations. Subsequently, each sub-part's geometry is further decoded using a signed-distance-function (SDF) shape prior, facilitating the synthesis of high-quality 3D shapes. Our approach enables the generation of diverse objects with high-quality geometry and varying number of parts. Comprehensive experiments on conditional generation from text descriptions demonstrate the effectiveness and flexibility of our method.
Poster
Dian Shao · Mingfei Shi · Shengda Xu · Haodong Chen · Yongle Huang · Binglu Wang
[ ExHall D ]
Abstract
Although remarkable progress has been achieved in video generation, synthesizing physically plausible human actions remains an unresolved challenge, especially when addressing fine-grained semantics and complex temporal dynamics. For instance, generating gymnastics routines such as “two turns on one leg with the free leg optionally below horizontal” poses substantial difficulties for current video generation methods, which often fail to produce satisfactory results. To address this, we propose FinePhys, a Fine-grained human action generation framework incorporating Physics for effective skeletal guidance. Specifically, FinePhys first performs online 2D pose estimation and then accomplishes dimension lifting through in-context learning. Recognizing that such data-driven 3D pose estimations may lack stability and interpretability, we incorporate a physics-based module that re-estimates motion dynamics using Euler-Lagrange equations, calculating joint accelerations bidirectionally across the temporal dimension. The physically predicted 3D poses are then fused with data-driven poses to provide multi-scale 2D heatmap-based guidance for the video generation process. Evaluated on three fine-grained action subsets from FineGym (FX-JUMP, FX-TURN, and FX-SALTO), FinePhys significantly outperforms competitive baselines. Comprehensive qualitative results further demonstrate FinePhys's ability to generate more natural and plausible fine-grained human actions.
Poster
Ting-Hsuan Liao · Yi Zhou · Yu Shen · Chun-Hao P. Huang · Saayan Mitra · Jia-Bin Huang · Uttaran Bhattacharya
[ ExHall D ]
Abstract
We explore how body shapes influence human motion synthesis, an aspect often overlooked in existing text-to-motion generation methods due to the ease of learning a homogenized, canonical body shape. However, this homogenization can distort the natural correlations between different body shapes and their motion dynamics. Our method addresses this gap by generating body-shape-aware human motions from natural language prompts. We utilize a finite scalar quantization-based variational autoencoder (FSQ-VAE) to quantize motion into discrete tokens and then leverage continuous body shape information to de-quantize these tokens back into continuous, detailed motion. Additionally, we harness the capabilities of a pretrained language model to predict both continuous shape parameters and motion tokens, facilitating the synthesis of text-aligned motions and decoding them into shape-aware motions. We evaluate our method quantitatively and qualitatively, and also conduct a comprehensive perceptual study to demonstrate its efficacy in generating shape-aware motions.
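A small sketch of finite scalar quantization with a straight-through gradient, the kind of tokenizer an FSQ-VAE builds on; the per-channel level counts are illustrative, and the encoder/decoder and shape conditioning are omitted.

# Hedged sketch of finite scalar quantization (FSQ) with straight-through gradients.
import torch

LEVELS = torch.tensor([8, 8, 5, 5])          # assumed quantization levels per latent channel

def fsq(z):
    """z: (..., 4) continuous latents -> quantized latents with the same shape."""
    bounded = torch.tanh(z)                              # squash into (-1, 1)
    half = (LEVELS - 1) / 2.0
    quant = torch.round(bounded * half) / half           # snap to the per-channel grid
    return bounded + (quant - bounded).detach()          # straight-through estimator

def fsq_codes(z):
    """Map quantized latents to integer token ids (useful for a language-model prior)."""
    half = (LEVELS - 1) / 2.0
    idx = torch.round(torch.tanh(z.detach()) * half + half).long()   # per-channel indices
    strides = torch.cumprod(torch.cat([torch.ones(1, dtype=torch.long), LEVELS[:-1]]), 0)
    return (idx * strides).sum(-1)                                   # flatten to a single id

z = torch.randn(2, 16, 4, requires_grad=True)            # (batch, time, channels)
z_q = fsq(z)
z_q.sum().backward()                                      # gradients flow back to z
print(z_q.shape, fsq_codes(z).shape)                       # (2, 16, 4) (2, 16)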
Poster
Xuan Wang · Kai Ruan · Xing Zhang · Gaoang Wang
[ ExHall D ]
Abstract
Text-driven motion generation has made significant strides in recent years. However, most work has primarily focused on human motion, largely ignoring the rich and diverse behaviors exhibited by animals. Modeling animal motion has important applications in wildlife conservation, animal ecology, and biomechanics. Animal motion modeling presents unique challenges due to species diversity, varied morphological structures, and different behavioral patterns in response to identical textual descriptions. To address these challenges, we propose AniMo for text-driven animal motion generation. AniMo consists of two stages: motion tokenization and text-to-motion generation. In the motion tokenization stage, we encode motions using a joint-aware spatiotemporal encoder with species-aware feature adaptive modulation, enabling the model to adapt to diverse skeletal structures across species. In the text-to-motion generation stage, we employ masked modeling to jointly learn the mappings between textual descriptions and motion tokens. Additionally, we introduce AniMo4D, a large-scale dataset containing 78,149 motion sequences and 185,435 textual descriptions across 114 animal species. Experimental results show that AniMo achieves superior performance on both the AniMo4D and AnimalML3D datasets, effectively capturing diverse morphological structures and behavioral patterns across various animal species.
Poster
Yifeng Ma · Jinwei Qi · Chaonan Ji · Peng Zhang · Bang Zhang · Zhidong Deng · Liefeng Bo
[ ExHall D ]
Abstract
This paper introduces a new control signal for facial motion generation: timeline control. Compared to audio and text signals, timelines provide more fine-grained control, such as generating specific facial motions with precise timing. Users can specify a multi-track timeline of facial actions arranged in temporal intervals, allowing precise control over the timing of each action. To model the timeline control capability, we first annotate the time intervals of facial actions in natural facial motion sequences at a frame-level granularity. This process is facilitated by Toeplitz Inverse Covariance-based Clustering to minimize human labor. Based on the annotations, we propose a diffusion-based generation model capable of generating facial motions that are natural and accurately aligned with input timelines. Our method supports text-guided motion generation by using ChatGPT to convert text into timelines. Experimental results show that our method can annotate facial action intervals with satisfactory accuracy, and produces natural facial motions accurately aligned with timelines.
Poster
Ruineng Li · Daitao Xing · Huiming Sun · Yuanzhou Ha · Jinglin Shen · Chiuman Ho
[ ExHall D ]
Abstract
Human-centric motion control in video generation remains a critical challenge, particularly when jointly controlling camera movements and human poses in scenarios like the iconic Grammy Glambot moment. While recent video diffusion models have made significant progress, existing approaches struggle with limited motion representations and inadequate integration of camera and human motion controls. In this work, we present TokenMotion, the first DiT-based video diffusion framework that enables fine-grained control over camera motion, human motion, and their joint interaction. We represent camera trajectories and human poses as spatio-temporal tokens to enable local control granularity. Our approach introduces a unified modeling framework utilizing a decouple-and-fuse strategy, bridged by a human-aware dynamic mask that effectively handles the spatially varying nature of combined motion signals. Through extensive experiments, we demonstrate TokenMotion's effectiveness across both text-to-video and image-to-video paradigms, consistently outperforming current state-of-the-art methods in human-centric motion control tasks. Our work represents a significant advancement in controllable video generation, with particular relevance for creative production applications ranging from films to short-form content.
Poster
Inès Hyeonsu Kim · Seokju Cho · Gabriel Huang · Jung Yi · Joon-Young Lee · Seungryong Kim
[ ExHall D ]
Abstract
Point tracking in videos is a fundamental task with applications in robotics, video editing, and more. While many vision tasks benefit from pre-trained feature backbones to improve generalizability, point tracking has primarily relied on simpler backbones trained from scratch on synthetic data, which may limit robustness in real-world scenarios. Additionally, point tracking requires temporal awareness to ensure coherence across frames, but using temporally-aware features is still underexplored. Most current methods employ a two-stage process: an initial coarse prediction followed by a refinement stage to inject temporal information and correct errors from the coarse stage. This approach, however, is computationally expensive and potentially redundant if the feature backbone itself captures sufficient temporal information. In this work, we introduce Chrono, a feature backbone specifically designed for point tracking with built-in temporal awareness. Leveraging pre-trained representations from the self-supervised learner DINOv2 and enhanced with a temporal adapter, Chrono effectively captures long-term temporal context, enabling precise prediction even without the refinement stage. Experimental results demonstrate that Chrono achieves state-of-the-art performance in a refiner-free setting on the TAP-Vid-DAVIS and TAP-Vid-Kinetics datasets, outperforming common point-tracking feature backbones as well as DINOv2, with exceptional efficiency. Code and pre-trained weights will be publicly available.
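A hedged sketch of adding temporal awareness on top of frozen per-frame features; a small convolution stands in for the frozen DINOv2 backbone (loading the real model is omitted), and the adapter attends across frames at each spatial location. The module names and sizes are assumptions, not Chrono's design.

# Hedged sketch: lightweight temporal adapter over frozen per-frame features.
import torch
import torch.nn as nn

class TemporalAdapter(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats):                 # feats: (B, T, H, W, C)
        B, T, H, W, C = feats.shape
        x = feats.permute(0, 2, 3, 1, 4).reshape(B * H * W, T, C)   # attend over time
        x = x + self.attn(self.norm(x), self.norm(x), self.norm(x))[0]
        return x.reshape(B, H, W, T, C).permute(0, 3, 1, 2, 4)

backbone = nn.Conv2d(3, 64, kernel_size=8, stride=8)     # stand-in backbone, kept frozen
for p in backbone.parameters():
    p.requires_grad_(False)
adapter = TemporalAdapter()

video = torch.randn(2, 5, 3, 64, 64)                     # (B, T, 3, H, W)
per_frame = torch.stack([backbone(video[:, t]) for t in range(5)], dim=1)  # (B, T, 64, 8, 8)
temporal = adapter(per_frame.permute(0, 1, 3, 4, 2))     # (B, T, 8, 8, 64)
print(temporal.shape)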
Poster
Yuhong Zhang · Guanlin Wu · Ling-Hao Chen · Zhuokai Zhao · Jing Lin · Xiaoke Jiang · Jiamin WU · Zhuoheng Li · Hao Frank Yang · Haoqian Wang · Lei Zhang
[ ExHall D ]
Abstract
In this paper, we present a novel framework designed to reconstruct long-sequence 3D human motion in world coordinates from in-the-wild videos with multiple shot transitions. Such long-sequence in-the-wild motions are highly valuable to applications such as motion generation and motion understanding, but are challenging to recover due to the abrupt shot transitions, partial occlusions, and dynamic backgrounds present in such videos. Existing methods primarily focus on single-shot videos, where continuity is maintained within a single camera view, or simplify multi-shot alignment in camera space only. In this work, we tackle these challenges by integrating enhanced camera pose estimation with Human Motion Recovery (HMR), incorporating a shot transition detector and a robust alignment module for accurate pose and orientation continuity across shots. By leveraging a custom motion integrator, we effectively mitigate the problem of foot sliding and ensure temporal consistency in human pose. Extensive evaluations on our multi-shot dataset, created from public 3D human datasets, demonstrate the robustness of our method in reconstructing realistic human motion in world coordinates.
Poster
Daikun Liu · Lei Cheng · Teng Wang · Changyin Sun
[ ExHall D ]
Abstract
Recent learning-based methods for event-based optical flow estimation utilize cost volumes for pixel matching but suffer from redundant computations and limited scalability to higher resolutions for flow refinement. In this work, we take advantage of the complementarity between the temporally dense feature differences of adjacent event frames and the cost volume, and present a lightweight event-based optical flow network (EDCFlow) to achieve high-quality flow estimation at a higher resolution. Specifically, an attention-based multi-scale temporal feature difference layer is developed to capture diverse motion patterns at high resolution in a computation-efficient manner. An adaptive fusion of high-resolution difference motion features and low-resolution correlation motion features is performed to enhance motion representation and model generalization. Notably, EDCFlow can serve as a plug-and-play refinement module for RAFT-like event-based methods to enhance flow details. Extensive experiments demonstrate that EDCFlow achieves a better model accuracy-complexity trade-off compared to existing methods, offering superior generalization.
Poster
Eric Kee · Adam Pikielny · Kevin Blackburn-Matzen · Marc Levoy
[ ExHall D ]
Abstract
We describe a system to remove real-world reflections from images for consumer photography. Our system operates on linear (RAW) photos, and accepts an optional contextual photo looking in the opposite direction (e.g., the "selfie" camera on a mobile device). This optional photo helps disambiguate what should be considered the reflection. The system is trained solely on synthetic mixtures of real-world RAW images, which we combine using a reflection simulation that is photometrically and geometrically accurate. Our system comprises a base model that accepts the captured photo and optional context photo as input, and runs at 256p, followed by an up-sampling model that transforms 256p images to full resolution. The system can produce images for review at 1K in 4.5 to 6.5 seconds on a MacBook or iPhone 14 Pro. We test on RAW photos that were captured in the field and embody typical consumer photos, and show that our RAW-image simulation yields SOTA performance.
Poster
Yan Zaoming · Pengcheng Lei · Tingting Wang · Faming Fang · Junkang Zhang · Yaomin Huang · Haichuan Song
[ ExHall D ]
Abstract
Blurry video frame interpolation (BVFI), which aims to generate high-frame-rate clear videos from low-frame-rate blurry inputs, is a challenging yet significant task in computer vision. Current state-of-the-art approaches typically rely on linear or quadratic models to estimate intermediate motion. However, these methods often overlook depth-related changes, such as object size and viewing angle variations, which occur as objects move through the scene. This paper proposes a novel approach to addressing this challenge by leveraging differential curves that can describe motion velocity and depth changes. Specifically, we introduce an explicit framework, termed Differential Curve-guided Blurry Video Multi-Frame Interpolation (DC-BMFI), to eliminate the effects of depth variation caused by object motion on the BVFI task. In contrast to existing methods that utilize optical flow for 2D awareness, we introduce an MPNet submodule within the DC-BMFI framework to estimate 3D scene flow, thereby enhancing the awareness of depth and velocity changes. To estimate the 3D scene flow from video frames, we propose a submodule termed UBNet that transforms video frames into 3D camera-space point maps and then estimates scene flow between the point maps. Extensive experiments demonstrate that the proposed DC-BMFI surpasses state-of-the-art performance on simulated and real-world datasets.
Poster
Wenbo Hu · Xiangjun Gao · Xiaoyu Li · Sijie Zhao · Xiaodong Cun · Yong Zhang · Long Quan · Ying Shan
[ ExHall D ]
Abstract
Estimating video depth in open-world scenarios is challenging due to the diversity of videos in appearance, content motion, camera movement, and length. We present DepthCrafter, an innovative method for generating temporally consistent long depth sequences with intricate details for open-world videos, without requiring any supplementary information such as camera poses or optical flow. The generalization ability to open-world videos is achieved by training the video-to-depth model from a pre-trained image-to-video diffusion model, through our meticulously designed three-stage training strategy. Our training approach enables the model to generate depth sequences with variable lengths at one time, up to 110 frames, and harvest both precise depth details and rich content diversity from realistic and synthetic datasets. We also propose an inference strategy that can process extremely long videos through segment-wise estimation and seamless stitching. Comprehensive evaluations on multiple datasets reveal that DepthCrafter achieves state-of-the-art performance in open-world video depth estimation under zero-shot settings. Furthermore, DepthCrafter facilitates various downstream applications, including depth-based visual effects and conditional video generation.
Poster
Baorui Ma · Huachen Gao · Haoge Deng · Zhengxiong Luo · Tiejun Huang · Lulu Tang · Xinlong Wang
[ ExHall D ]
Abstract
Recent 3D generation models typically rely on limited-scale 3D `gold-labels' or 2D diffusion priors for 3D content creation. However, their performance is upper-bounded by constrained 3D priors due to the lack of scalable learning paradigms. In this work, we present See3D, a visual-conditional multi-view diffusion model trained on large-scale Internet videos for open-world 3D creation. The model aims to Get 3D knowledge by solely Seeing the visual contents from the vast and rapidly growing video data --- You See it, You Got it. To achieve this, we first scale up the training data using a proposed data curation pipeline that automatically filters out multi-view inconsistencies and insufficient observations from source videos. This results in a high-quality, richly diverse, large-scale dataset of multi-view images, termed WebVi3D, containing 320M frames from 16M video clips. Nevertheless, learning generic 3D priors from videos without explicit 3D geometry or camera pose annotations is nontrivial, and annotating poses for web-scale videos is prohibitively expensive. To eliminate the need for pose conditions, we introduce an innovative visual-condition - a purely 2D-inductive visual signal generated by adding time-dependent noise to the masked video data. Finally, we introduce a novel visual-conditional 3D generation framework by integrating See3D into a …
Poster
Daniel Geng · Charles Herrmann · Junhwa Hur · Forrester Cole · Serena Zhang · Tobias Pfaff · Tatiana Lopez-Guevara · Yusuf Aytar · Michael Rubinstein · Chen Sun · Oliver Wang · Andrew Owens · Deqing Sun
[ ExHall D ]
Abstract
Motion control is crucial for generating expressive and compelling video content; however, most existing video generation models rely mainly on text prompts for control, which struggle to capture the nuances of dynamic actions and temporal compositions. To this end, we train a video generation model conditioned on spatio-temporally sparse _or_ dense motion trajectories. In contrast to prior motion conditioning work, this flexible representation can encode any number of trajectories, object-specific or global scene motion, and temporally sparse motion; due to its flexibility we refer to this conditioning as _motion prompts_. While users may directly specify sparse trajectories, we also show how to translate high-level user requests into detailed, semi-dense motion prompts, a process we term _motion prompt expansion_. We demonstrate the versatility of our approach through various applications, including camera and object motion control, "interacting" with an image, motion transfer, and image editing. Our results showcase emergent behaviors, such as realistic physics, suggesting the potential of motion prompts for probing video models and interacting with future generative world models. Finally, we evaluate quantitatively, conduct a human study, and demonstrate strong performance.
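As a rough illustration of trajectory-based conditioning, the sketch below rasterizes sparse point tracks into a per-frame volume of displacements and visibility flags; this encoding is an assumption for illustration, not the paper's motion-prompt representation.

```python
# Minimal sketch (an assumption about one plausible encoding, not the paper's):
# rasterize a set of sparse point trajectories into a per-frame conditioning
# volume holding (dx, dy, visibility) at each tracked location.
import numpy as np

def encode_motion_prompt(tracks, T, H, W):
    """tracks: array (N, T, 2) of (x, y) positions per frame; NaN marks untracked frames."""
    cond = np.zeros((T, H, W, 3), dtype=np.float32)
    for n in range(tracks.shape[0]):
        for t in range(T - 1):
            x, y = tracks[n, t]
            if np.isnan(x):
                continue
            dx, dy = tracks[n, t + 1] - tracks[n, t]      # displacement to next frame
            xi, yi = int(round(x)), int(round(y))
            if 0 <= xi < W and 0 <= yi < H and not np.isnan(dx):
                cond[t, yi, xi, :2] = (dx, dy)
                cond[t, yi, xi, 2] = 1.0                   # visibility / validity flag
    return cond

tracks = np.cumsum(np.random.randn(5, 16, 2), axis=1) + 32.0  # 5 random walks
print(encode_motion_prompt(tracks, T=16, H=64, W=64).shape)    # (16, 64, 64, 3)
```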
Poster
Ryan Burgert · Yuancheng Xu · Wenqi Xian · Oliver Pilarski · Pascal Clausen · Mingming He · Li Ma · Yitong Deng · Lingxiao Li · Mohsen Mousavi · Michael Ryoo · Paul Debevec · Ning Yu
[ ExHall D ]
Abstract
Generative modeling aims to transform chaotic noise into structured outputs that align with training data distributions. In this work, we enhance video diffusion generative models by introducing motion control as a structured component within latent space sampling. Specifically, we propose a novel real-time noise warping method that replaces random temporal Gaussianity with correlated warped noise derived from optical flow fields, enabling fine-grained motion control independent of model architecture and guidance type. We fine-tune modern video diffusion base models and provide a unified paradigm for a wide range of user-friendly motion control: local object motion control, global camera movement control, and motion transfer. By leveraging a real-time noise-warping algorithm that preserves spatial Gaussianity while efficiently maintaining temporal consistency, we enable flexible and diverse motion control applications with minimal trade-offs in pixel quality and temporal coherence. Extensive experiments and user studies demonstrate the advantages of our method in terms of visual quality, motion controllability, and temporal consistency, making it a robust and scalable solution for motion-controllable video synthesis.
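The sketch below gives a deliberately simplified approximation of flow-driven noise warping: the previous frame's Gaussian noise is resampled along optical flow and crudely re-standardized. The paper's real-time algorithm preserves Gaussianity more carefully; this is only for intuition.

```python
# Heavily simplified sketch (an approximation for intuition, not the paper's
# real-time algorithm): warp the previous frame's Gaussian noise along optical
# flow with grid_sample, then re-standardize so the marginal statistics stay
# roughly unit-Gaussian.
import torch
import torch.nn.functional as F

def warp_noise(noise_prev, flow):
    # noise_prev: (1, C, H, W), flow: (1, 2, H, W) in pixels (dx, dy)
    _, _, H, W = noise_prev.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid_x = (xs + flow[:, 0]) / (W - 1) * 2 - 1          # normalize to [-1, 1]
    grid_y = (ys + flow[:, 1]) / (H - 1) * 2 - 1
    grid = torch.stack([grid_x, grid_y], dim=-1)          # (1, H, W, 2)
    warped = F.grid_sample(noise_prev, grid, align_corners=True)
    return (warped - warped.mean()) / (warped.std() + 1e-6)  # crude re-whitening

noise = torch.randn(1, 4, 64, 64)                          # e.g., latent-space noise
flow = torch.ones(1, 2, 64, 64) * 2.0                      # constant 2-pixel shift
print(warp_noise(noise, flow).shape)                        # torch.Size([1, 4, 64, 64])
```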
Poster
Karran Pandey · Yannick Hold-Geoffroy · Matheus Gadelha · Niloy J. Mitra · Karan Singh · Paul Guerrero
[ ExHall D ]
Abstract
Predicting diverse object motions from a single static image remains challenging, as current video generation models often entangle object movement with camera motion and other scene changes. While recent methods can predict specific motions from motion arrow input, they rely on synthetic data and predefined motions, limiting their application to complex scenes. We introduce Motion Modes, a training-free approach that explores a pre-trained image-to-video generator’s latent distribution to discover various distinct and plausible motions focused on selected objects in static images. We achieve this by employing a flow generator guided by energy functions designed to disentangle object and camera motion. Additionally, we use an energy inspired by particle guidance to diversify the generated motions, without requiring explicit training data. Experimental results demonstrate that Motion Modes generates realistic and varied object animations, surpassing previous methods and sometimes even human predictions regarding plausibility and diversity. Code will be released upon acceptance.
Poster
Wonjoon Jin · Qi Dai · Chong Luo · Seung-Hwan Baek · Sunghyun Cho
[ ExHall D ]
Abstract
This paper presents FloVD, a novel optical-flow-based video diffusion model for camera-controllable video generation. FloVD leverages optical flow maps to represent motions of the camera and moving objects. Since optical flow can be directly estimated from videos, this approach allows for the use of arbitrary training videos without ground-truth camera parameters. To synthesize natural object motion while supporting detailed camera control, we divide the camera-controllable video generation task into two sub-problems: object motion synthesis and flow-conditioned video synthesis. Specifically, we introduce an object motion synthesis model along with 3D-structure-based warping to generate foreground and background motions, which are fed into the flow-conditioned video diffusion model as conditional input, enabling camera-controlled video synthesis. Extensive experiments demonstrate the superiority of our method over previous approaches in terms of accurate camera control and natural object motion synthesis.
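As background for the 3D-structure-based warping idea, the sketch below computes the optical flow induced purely by a camera transform from per-pixel depth and intrinsics; it is a standard geometric computation given as an assumption of the mechanism, not FloVD's code.

```python
# Minimal sketch (a standard geometric computation, given as an assumption of
# what "3D-structure-based warping" could look like, not FloVD's code):
# optical flow induced purely by a camera transform, given per-pixel depth and
# intrinsics K. Points are back-projected, moved into the new camera frame,
# and re-projected; the displacement of their projections is the camera flow.
import numpy as np

def camera_flow(depth, K, R, t):
    H, W = depth.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = np.linalg.inv(K) @ pix.T                       # (3, H*W) unit-plane rays
    pts = rays * depth.reshape(1, -1)                     # back-project with depth
    pts2 = R @ pts + t.reshape(3, 1)                      # move into the second camera
    proj = K @ pts2
    proj = (proj[:2] / proj[2:]).T.reshape(H, W, 2)       # re-project to pixels
    return proj - np.stack([xs, ys], axis=-1)             # flow = new pixel - old pixel

K = np.array([[100.0, 0, 32], [0, 100.0, 32], [0, 0, 1]])
flow = camera_flow(np.full((64, 64), 5.0), K, np.eye(3), np.array([0.1, 0.0, 0.0]))
print(flow.shape, flow[..., 0].mean())                    # (64, 64, 2), ~2.0 px shift
```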
Poster
David Junhao Zhang · Roni Paiss · Shiran Zada · Nikhil Karnad · David E. Jacobs · Yael Pritch · Inbar Mosseri · Mike Zheng Shou · Neal Wadhwa · Nataniel Ruiz
[ ExHall D ]
Abstract
Recently, breakthroughs in video modeling have allowed for controllable camera trajectories in generated videos. However, these methods cannot be directly applied to user-provided videos that are not generated by a video model. In this paper, we present ReCapture, a method for generating new videos with novel camera trajectories from a single user-provided video. Our method allows us to re-generate the reference video, with all its existing scene motion, from vastly different angles and with cinematic camera motion. Notably, using our method we can also plausibly hallucinate parts of the scene that were not observable in the reference video. Our method works by (1) generating a noisy anchor video with a new camera trajectory using multiview diffusion models or depth-based point cloud rendering and then (2) regenerating the anchor video into a clean and temporally consistent reangled video using our proposed masked video fine-tuning technique.
Poster
Zhenghao Zhang · Junchao Liao · Menghao Li · Zuozhuo Dai · Bingxue Qiu · Siyu Zhu · Long Qin · Weizhi Wang
[ ExHall D ]
Abstract
Recent advancements in Diffusion Transformer (DiT) have demonstrated remarkable proficiency in producing high-quality video content. Nonetheless, the potential of transformer-based diffusion models for effectively generating videos with controllable motion remains an area of limited exploration. This paper introduces Tora, the first trajectory-oriented DiT framework that integrates textual, visual, and trajectory conditions concurrently for video generation. Specifically, Tora consists of a Trajectory Extractor (TE), a Spatial-Temporal DiT, and a Motion-guidance Fuser (MGF). The TE encodes arbitrary trajectories into hierarchical spacetime motion patches with a 3D video compression network. The MGF integrates the motion patches into the DiT blocks to generate consistent videos following trajectories. Our design aligns seamlessly with DiT’s scalability, allowing precise control of video content’s dynamics with diverse durations, aspect ratios, and resolutions. Extensive experiments demonstrate Tora’s excellence in achieving high motion fidelity, while also meticulously simulating the movement of the physical world.
Poster
Shengzhi Wang · Yingkang Zhong · Jiangchuan Mu · Kai WU · Mingliang Xiong · Wen Fang · Mingqing Liu · Hao Deng · Bin He · Gang Li · Qingwen Liu
[ ExHall D ]
Abstract
Due to control limitations in the denoising process and the lack of training, zero-shot video editing methods often struggle to meet user instructions, resulting in generated videos that are visually unappealing and fail to fully satisfy expectations. To address this problem, we propose Align-A-Video, a video editing pipeline that incorporates human feedback through reward fine-tuning. Our approach consists of two key steps: 1) Deterministic Reward Fine-tuning. To reduce optimization costs for expected noise distributions, we propose a deterministic reward tuning strategy. This method improves tuning stability by increasing sample determinism, allowing the tuning process to be completed in minutes; 2) Feature Propagation Across Frames. We optimize a selected anchor frame and propagate its features to the remaining frames, improving both visual quality and semantic fidelity. This approach avoids temporal consistency degradation from reward optimization. Extensive qualitative and quantitative experiments confirm the effectiveness of using reward fine-tuning in Align-A-Video, significantly improving the overall quality of video generation.
Poster
Xin Jin · Simon Niklaus · Zhoutong Zhang · Zhihao Xia · Chun-Le Guo · Yuting Yang · Jiawen Chen · Chongyi Li
[ ExHall D ]
Abstract
Denoising is a crucial step in many video processing pipelines such as in interactive editing, where high quality, speed, and user control are essential. While recent approaches achieve significant improvements in denoising quality by leveraging deep learning, they are prone to unexpected failures due to discrepancies between training data distributions and the wide variety of noise patterns found in real-world videos. These methods also tend to be slow and lack user control. In contrast, traditional denoising methods perform reliably on in-the-wild videos and run relatively quickly on modern hardware. However, they require manually tuning parameters for each input video, which is not only tedious but also requires skill. We bridge the gap between these two paradigms by proposing a differentiable denoising pipeline based on traditional methods. A neural network is then trained to predict the optimal denoising parameters for each specific input, resulting in a robust and efficient approach that also supports user control.
Poster
Yifan Bian · Chuanbo Tang · Li Li · Dong Liu
[ ExHall D ]
Abstract
Most Neural Video Codecs (NVCs) only employ temporal references to generate temporal-only contexts and latent prior. These temporal-only NVCs fail to handle large motions or emerging objects due to limited contexts and misaligned latent prior. To relieve these limitations, we propose a Spatially Embedded Video Codec (SEVC), in which the low-resolution video is compressed for spatial references. Firstly, our SEVC leverages both spatial and temporal references to generate augmented motion vectors and hybrid spatial-temporal contexts. Secondly, to address the misalignment issue in the latent prior and enrich the prior information, we introduce a spatial-guided latent prior augmented by multiple temporal latent representations. Finally, we design a joint spatial-temporal optimization to learn quality-adaptive bit allocation for spatial references, further boosting rate-distortion performance. Experimental results show that our SEVC effectively alleviates the limitations in handling large motions or emerging objects, and also reduces 11.9% more bitrate than the previous state-of-the-art NVC while providing an additional low-resolution bitstream.
Poster
Zihao Zhang · Haoran Chen · Haoyu Zhao · Guansong Lu · Yanwei Fu · Hang Xu · Zuxuan Wu
[ ExHall D ]
Abstract
Handling complex or nonlinear motion patterns has long posed challenges for video frame interpolation. Although recent advances in diffusion-based methods offer improvements over traditional optical flow-based approaches, they still struggle to generate sharp, temporally consistent frames in scenarios with large motion. To address this limitation, we introduce EDEN, an Enhanced Diffusion for high-quality large-motion vidEo frame iNterpolation. Our approach first utilizes a transformer-based tokenizer to produce refined latent representations of the intermediate frames for diffusion models. We then enhance the diffusion transformer with temporal attention across the process and incorporate a start-end frame difference embedding to guide the generation of dynamic motion. Extensive experiments demonstrate that EDEN achieves state-of-the-art results across popular benchmarks, including nearly a 10% LPIPS reduction on DAVIS and SNU-FILM, and an 8% improvement on DAIN-HD.
Poster
Yuantong Zhang · Zhenzhong Chen
[ ExHall D ]
Abstract
Space-time video resampling aims to conduct both spatial-temporal downsampling and upsampling processes to achieve high-quality video reconstruction. Although there has been much progress, some major challenges still exist, such as how to preserve motion information during temporal resampling while avoiding blurring artifacts, and how to achieve flexible temporal and spatial resampling factors. In this paper, we introduce an Invertible Motion Steganography Module (IMSM), designed to preserve motion information in high-frame-rate videos. This module embeds motion information from high-frame-rate videos into downsampled frames with lower frame rates in a visually imperceptible manner. Its reversible nature allows the motion information to be recovered, facilitating the reconstruction of high-frame-rate videos. Furthermore, we propose a 3D implicit feature modulation technique that enables continuous spatiotemporal resampling. With tailored training strategies, our method supports flexible frame rate conversions, including non-integer changes like 30 FPS to 24 FPS and vice versa. Extensive experiments show that our method significantly outperforms existing solutions across multiple datasets in various video resampling tasks with high flexibility. Codes will be made available upon publication.
Poster
Xingguang Zhang · Nicholas M Chimitt · Xijun Wang · Yu Yuan · Stanley H. Chan
[ ExHall D ]
Abstract
Atmospheric turbulence is a major source of image degradation in long-range imaging systems. Although numerous deep learning-based turbulence mitigation (TM) methods have been proposed, many are slow, memory-hungry, and do not generalize well. In the spatial domain, methods based on convolutional operators have a limited receptive field, so they cannot handle the large spatial dependencies required by turbulence. In the temporal domain, methods relying on self-attention can, in theory, leverage the lucky effects of turbulence, but their quadratic complexity makes it difficult to scale to many frames. Traditional recurrent aggregation methods face parallelization challenges. In this paper, we present a new TM method based on two concepts: (1) A turbulence mitigation network based on the Selective State Space Model (MambaTM). MambaTM provides a global receptive field in each layer across spatial and temporal dimensions while maintaining linear computational complexity. (2) Learned Latent Phase Distortion (LPD). LPD guides the state space model. Unlike classical Zernike-based representations of phase distortion, the new LPD map uniquely captures the actual effects of turbulence, significantly improving the model’s capability to estimate degradation by reducing the ill-posedness. Our proposed method exceeds current state-of-the-art networks on various synthetic and real-world TM benchmarks with significantly faster inference speed. The …
Poster
Yiran Xu · Taesung Park · Richard Zhang · Yang Zhou · Eli Shechtman · Feng Liu · Jia-Bin Huang · Difan Liu
[ ExHall D ]
Abstract
Video super-resolution (VSR) models achieve temporal consistency but often produce blurrier results than their image-based counterparts due to limited generative capacity. This prompts the question: can we adapt a generative image upsampler for VSR while preserving temporal consistency? We introduce VideoGigaGAN, a new generative VSR model that combines high-frequency detail with temporal stability, building on the large-scale GigaGAN image upsampler. Simple adaptations of GigaGAN for VSR led to flickering issues, so we propose techniques to enhance temporal consistency. We validate the effectiveness of VideoGigaGAN by comparing it with state-of-the-art VSR models on public datasets and showcasing video results with 8× upsampling.
Poster
Yunpeng Qu · Kun Yuan · Qizhi Xie · Ming Sun · Chao Zhou · Jian Wang
[ ExHall D ]
Abstract
Video Quality Assessment (VQA), which intends to predict the perceptual quality of videos, has attracted increasing attention. Due to factors like motion blur or specific distortions, the quality of different regions in a video varies. Recognizing the region-wise local quality within a video is beneficial for assessing overall quality and can guide us in adopting fine-grained enhancement or transcoding strategies. Due to the difficulty and heavy cost of annotating region-wise quality, the lack of ground truth constraints from relevant datasets further complicates the utilization of local perception. Inspired by the Human Visual System (HVS), in which the overall quality is closely related to the local texture of different regions and their visual saliency, we propose a Saliency-guided Local Perception VQA (SLP-VQA) framework, which aims to effectively assess both saliency and local texture, thereby facilitating the assessment of overall quality. Our framework extracts global visual saliency and allocates attention using FusionWindow Attention (FWA) while incorporating a Local Perception Constraint (LPC) to mitigate the reliance of regional texture perception on neighboring areas. Compared to SOTA methods, SLP-VQA obtains significant improvements across multiple scenarios on five VQA benchmarks. Furthermore, to assess local perception, we establish a new Local Perception Visual Quality (LPVQ) dataset with …
Poster
Jianyi Wang · Zhijie Lin · Meng Wei · Yang Zhao · Ceyuan Yang · Chen Change Loy · Lu Jiang
[ ExHall D ]
Abstract
Video restoration poses non-trivial challenges in maintaining fidelity while recovering temporally consistent details from unknown degradations in the wild. Despite recent advances in diffusion-based restoration, these methods often face limitations in generation capability and sampling efficiency. In this work, we present SeedVR, a diffusion transformer designed to handle real-world video restoration with arbitrary length and resolution. The core design of SeedVR lies in the shifted window attention that facilitates effective restoration on long video sequences. SeedVR further supports variable-sized windows near the boundary of both spatial and temporal dimensions, overcoming the resolution constraints of traditional window attention. Equipped with contemporary practices, including causal video autoencoder, mixed image and video training, and progressive training, SeedVR achieves highly-competitive performance on both synthetic and real-world benchmarks, as well as AI-generated videos. Extensive experiments demonstrate SeedVR's superiority over existing methods for generic video restoration.
Poster
Weichen Dai · Wu Hexing · Xiaoyang Weng · Yuxin Zheng · Yuhang Ming · Wanzeng Kong
[ ExHall D ]
Abstract
As a fundamental visual task, optical flow estimation has widespread applications in computer vision. However, it faces significant challenges under adverse lighting conditions, where low texture and noise make accurate optical flow estimation particularly difficult. In this paper, we propose an optical flow method that employs implicit image enhancement through multi-modal synergistic training. To supplement the scene information missing in the original low-quality image, we utilize a high-low frequency feature enhancement network. The enhancement network is implicitly guided by multi-modal data and the specific subsequent tasks, enabling the model to learn multi-modal knowledge that enhances feature information suitable for optical flow estimation during inference. By using RGBD multi-modal data, the proposed method avoids reliance on images captured from the same view, a common limitation of traditional image enhancement methods. During training, the encoded features extracted from the enhanced images are synergistically supervised by features from the RGBD fusion as well as by the optical flow task. Experiments conducted on both synthetic and real datasets demonstrate that the proposed method significantly improves performance on public datasets.
Poster
Yutong Wang · Jiajie Teng · Jiajiong Cao · Yuming Li · Chenguang Ma · Hongteng Xu · Dixin Luo
[ ExHall D ]
Abstract
As a very common type of video, face videos often appear in movies, talk shows, live broadcasts, and other scenes. Real-world online videos are often plagued by degradations such as blurring and quantization noise, due to the high compression ratio caused by high communication costs and limited transmission bandwidth. These degradations have a particularly serious impact on face videos because the human visual system is highly sensitive to facial details. Despite the significant advancement in video face enhancement, current methods still suffer from i) long processing time and ii) inconsistent spatial-temporal visual effects (e.g., flickering). This study proposes a novel and efficient blind video face enhancement method to overcome the above two challenges, restoring high-quality videos from their compressed low-quality versions with an effective de-flickering mechanism. In particular, the proposed method builds upon a 3D-VQGAN backbone associated with spatial-temporal codebooks recording high-quality portrait features and residual-based temporal information. We develop a two-stage learning framework for the model. In Stage I, we learn the model with a regularizer mitigating the codebook collapse problem. In Stage II, we learn two transformers to look up codes from the codebooks and further update the encoder of low-quality videos. Experiments conducted on the VFHQ-Test dataset demonstrate that our method …
Poster
Xinan Xie · Qing Zhang · Wei-Shi Zheng
[ ExHall D ]
Abstract
While event-based deblurring methods have demonstrated impressive results, they are impractical for consumer photos captured by cell phones and digital cameras that are not equipped with an event sensor. To address this issue, in this paper we propose a novel deblurring framework called Event Generation Deblurring (EGDeblurring), which effectively deblurs an image by generating event guidance that describes the motion information using a diffusion model. Specifically, we design a Motion Prior Generation Diffusion Model (MPG-Diff) and a Feature Extractor (FE) to produce prior information beneficial to the deblurring task, rather than generating the raw event representation. To achieve effective fusion of the motion prior information with blurry images and produce high-quality results, we propose a Regression Deblurring network (RDNet) embedded with a Dual-Branch Fusion Block (DBFB) that incorporates a multi-branch cross-modal attention mechanism. Experiments on multiple datasets demonstrate that our method outperforms state-of-the-art methods.
Poster
Yanis Benidir · Nicolas Gonthier · Clement Mallet
[ ExHall D ]
Abstract
Bi-temporal change detection at scale based on Very High Resolution (VHR) images is crucial for Earth monitoring. This task remains poorly addressed even in the deep learning era: it either requires large volumes of annotated data – in the semantic case – or is limited to restricted datasets for binary set-ups. Most approaches do not exhibit the versatility required for temporal and spatial adaptation: simplicity in architecture design and pretraining on realistic and comprehensive datasets. Synthetic datasets are a key solution but still fail to handle complex and diverse scenes. In this paper, we present HySCDG, a generative pipeline for creating a large hybrid semantic change detection dataset that contains both real VHR images and inpainted ones, along with land cover semantic maps at both dates and the change map. Being semantically and spatially guided, HySCDG generates realistic images, leading to a comprehensive and hybrid transfer-proof dataset, FSC-180k. We evaluate FSC-180k on five change detection cases (binary and semantic), from zero-shot to mixed and sequential training, and also under low data regime training. Experiments demonstrate that pretraining on our hybrid dataset leads to a significant performance boost, outperforming SyntheWorld, a fully synthetic dataset, in every configuration. All codes, models, and data …
Poster
Hyejin Oh · Woo-Shik Kim · Sangyoon Lee · YungKyung Park · Jewon Kang
[ ExHall D ]
Abstract
Multispectral (MS) images contain richer spectral information than RGB images due to their increased number of channels and are widely used for various applications. However, achieving accurate illumination estimation in MS images remains challenging, as previous studies have struggled with spectral diversity and the inherent entanglement between the illuminant and surface reflectance spectra. To tackle these challenges, in this paper, we propose a novel illumination spectrum estimation technique for MS images via surface reflectance modeling and spatial-spectral feature generation (ISS). The proposed technique employs a learnable spectral unmixing (SU) block to enhance surface reflectance modeling, which has not previously been attempted for illumination spectrum estimation, and a feature mixing block to fuse spectral and spatial features of MS images with cross-attention. The features are refined iteratively and processed through a decoder to produce an illumination spectrum estimate. Experimental results demonstrate that the proposed technique achieves state-of-the-art performance in illumination spectrum estimation on various MS image datasets.
Poster
Jinyuan Liu · Bowei Zhang · Qingyun Mei · Xingyuan Li · Yang Zou · Zhiying Jiang · Long Ma · Risheng Liu · Xin Fan
[ ExHall D ]
Abstract
Infrared and visible image fusion integrates information from distinct spectral bands to enhance image quality by leveraging the strengths and mitigating the limitations of each modality. Existing approaches typically treat image fusion and subsequent high-level tasks as separate processes, resulting in fused images that offer only marginal gains in task performance and fail to provide constructive feedback for optimizing the fusion process. To overcome these limitations, we propose a Discriminative Cross-Dimension Evolutionary Learning Network, termed DCEvo, which simultaneously enhances visual quality and perception accuracy. Leveraging the robust search capabilities of Evolutionary Learning, our approach formulates the optimization of the dual tasks as a multi-objective problem by employing an Evolutionary Algorithm to dynamically balance loss function parameters. Inspired by visual neuroscience, we integrate a Discriminative Enhancer (DE) within both the encoder and decoder, enabling the effective learning of complementary features from different modalities. Additionally, our Cross-Dimensional Feature Embedding Block facilitates mutual enhancement between high-dimensional task features and low-dimensional fusion features, ensuring a cohesive and efficient feature integration process. Experimental results on three benchmarks demonstrate that our method significantly outperforms state-of-the-art approaches, achieving an average improvement of 9.32% in visual quality while also enhancing subsequent high-level tasks.
Poster
Junming Hou · Xiaoyu Chen · Ran Ran · Xiaofeng Cong · Xinyang Liu · Jian Wei You · Liang-Jian Deng
[ ExHall D ]
Abstract
Pan-sharpening technology refers to generating a high-resolution (HR) multi-spectral (MS) image with broad applications by fusing a low-resolution (LR) MS image and HR panchromatic (PAN) image. While deep learning approaches have shown impressive performance in pan-sharpening, they generally require extensive hardware with high memory and computational power, limiting their deployment on resource-constrained satellites. In this study, we investigate the use of binary neural networks (BNNs) for pan-sharpening and observe that binarization leads to distinct information degradation across different frequency components of an image. Building on this insight, we propose a novel binary pan-sharpening network, termed BNNPan, structured around the Prior-Integrated Binary Frequency (PIBF) module that features three key ingredients: Binary Wavelet Transform Convolution, Latent Diffusion Prior Compensation, and Channel-wise Distribution Calibration. Specifically, the first decomposes input features into distinct frequency components using Wavelet Transform, then applies a “divide-and-conquer” strategy to optimize binary feature learning for each component, informed by the corresponding full-precision residual statistics. The second integrates a latent diffusion prior to compensate for compromised information during binarization, while the third performs channel-wise calibration to further refine feature representation. Our BNNPan, developed using the proposed techniques, achieves promising pan-sharpening performance while maintaining favorable computational overhead. Experiments on multiple remote sensing …
Poster
Xiaoyi Liu · Hao Tang
[ ExHall D ]
Abstract
We introduce DiffFNO, a novel framework for arbitrary-scale super-resolution that incorporates a Weighted Fourier Neural Operator (WFNO) enhanced by a diffusion process. DiffFNO's adaptive mode weighting mechanism in the Fourier domain effectively captures critical frequency components, significantly improving the reconstruction of high-frequency image details that are essential for super-resolution tasks.Additionally, we propose a Gated Fusion Mechanism to efficiently integrate features from the WFNO and an attention-based neural operator, enhancing the network's capability to capture both global and local image details. To further improve efficiency, DiffFNO employs a deterministic ODE sampling strategy called the Adaptive Time-step ODE Solver (AT-ODE), which accelerates inference by dynamically adjusting step sizes while preserving output quality.Extensive experiments demonstrate that DiffFNO achieves state-of-the-art results, outperforming existing methods across various scaling factors, including those beyond the training distribution, by a margin of 2–4 dB in PSNR. Our approach sets a new standard in super-resolution, delivering both superior accuracy and computational efficiency.
Poster
Haitao Wu · Qing Li · Changqing Zhang · Zhen He · Xiaomin Ying
[ ExHall D ]
Abstract
Can our brain signals faithfully reflect the original visual stimuli, even including high-frequency details? Although human perceptual and cognitive capacities enable us to process and remember visual information, these abilities are constrained by several factors, such as limited attentional resources and the finite capacity of visual working memory. When visual stimuli are processed by the human visual system into brain signals, some information is inevitably lost, leading to a discrepancy known as the System GAP. Additionally, perceptual and cognitive dynamics, along with technical noise in signal acquisition, reduce the fidelity of brain signals relative to the original visual stimuli, known as the Random GAP. When encoded brain signal representations are directly aligned with the corresponding pretrained image features, the System GAP and Random GAP between paired data challenge the model, requiring it to bridge these gaps. However, in the context of limited paired data, these gaps become difficult for the model to learn, leading to overfitting and poor generalization to new data. To address these GAPs, we propose a simple yet effective approach called the Uncertainty-aware Blur Prior (UBP). It estimates the uncertainty within the paired data, reflecting the mismatch between brain signals and visual stimuli. Based on this uncertainty, UBP dynamically blurs the …
Poster
Jiuchen Chen · Xinyu Yan · Qizhi Xu · Kaiqi Li
[ ExHall D ]
Abstract
Global contextual information and local detail features are essential for haze removal tasks. Deep learning models perform well on small, low-resolution images, but they encounter difficulties with large, high-resolution ones due to GPU memory limitations. As a compromise, they often resort to image slicing or downsampling. The former diminishes global information, while the latter discards high-frequency details. To address these challenges, we propose DehazeXL, a haze removal method that effectively balances global context and local feature extraction, enabling end-to-end modeling of large images on mainstream GPU hardware. Additionally, to evaluate the efficiency of global context utilization in haze removal performance, we design a visual attribution method tailored to the characteristics of haze removal tasks. Finally, recognizing the lack of benchmark datasets for haze removal in large images, we have developed an ultra-high-resolution haze removal dataset (8KDehaze) to support model training and testing. It includes 10000 pairs of clear and hazy remote sensing images, each sized at 8192 × 8192 pixels. Extensive experiments demonstrate that DehazeXL can infer images up to 10240 × 10240 pixels with only 21 GB of memory, achieving state-of-the-art results among all evaluated methods. The source code and experimental dataset will soon be made publicly available.
Poster
Woo Kyoung Han · Byeonghun Lee · Hyunmin Cho · Sunghoon Im · Kyong Hwan Jin
[ ExHall D ]
Abstract
We quantify the upper bound on the size of the implicit neural representation (INR) model from a digital perspective. The upper bound of the model size increases exponentially as the required bit-precision increases. To this end, we present a bit-plane decomposition method that makes the INR predict bit-planes, producing the same effect as reducing the upper bound of the model size. We validate our hypothesis that reducing the upper bound leads to faster convergence with constant model size. Our method achieves lossless representation in 2D image and audio fitting, even for high bit-depth signals, such as 16-bit, which was previously unachievable. We are also the first to identify bit bias, whereby the INR prioritizes the most significant bit (MSB). We further expand the application of INRs to bit depth expansion, lossless image compression, and extreme network quantization. Source codes are included in the supplementary material.
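The core representational idea can be illustrated in a few lines: decompose a 16-bit signal into binary bit-planes and reassemble it exactly. The snippet below is only a sketch of that decomposition, not the paper's INR training code.

```python
# Minimal sketch (illustrating the idea, not the paper's INR training code):
# decompose a 16-bit signal into binary bit-planes and reconstruct it exactly,
# which is the representation the INR is asked to predict plane by plane.
import numpy as np

img = np.random.randint(0, 2**16, size=(4, 4), dtype=np.uint16)

planes = [((img >> b) & 1).astype(np.uint8) for b in range(16)]   # LSB ... MSB
recon = np.zeros_like(img)
for b, plane in enumerate(planes):
    recon |= plane.astype(np.uint16) << b                          # reassemble

assert np.array_equal(recon, img)                                  # lossless round trip
print("planes:", len(planes), "reconstruction exact:", np.array_equal(recon, img))
```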
Poster
Wei Long · Xingyu Zhou · Leheng Zhang · Shuhang Gu
[ ExHall D ]
Abstract
Transformer-based methods have achieved remarkable results in image super-resolution tasks because they can capture non-local dependencies in low-quality input images. However, this feature-intensive modeling approach is computationally expensive because it calculates the similarities between numerous features that are irrelevant to the query features when obtaining attention weights. These unnecessary similarity calculations not only degrade the reconstruction performance but also introduce significant computational overhead. How to accurately identify the features that are important to the current query features and avoid similarity calculations between irrelevant features remains an urgent problem. To address this issue, we propose a novel and effective Progressive Focused Transformer (PFT) that links all isolated attention maps in the network through Progressive Focused Attention (PFA) to focus attention on the most important tokens. PFA not only enables the network to capture more critical similar features, but also significantly reduces the computational cost of the overall network by filtering out irrelevant features before calculating similarities. Extensive experiments demonstrate the effectiveness of the proposed method, achieving state-of-the-art performance on various single image super-resolution benchmarks.
Poster
Yuxuan Jiang · Ho Man Kwan · Jasmine Peng · Ge Gao · Fan Zhang · Xiaoqing Zhu · Joel Sole · David Bull
[ ExHall D ]
Abstract
Recent advances in implicit neural representations (INRs) have shown significant promise in modeling visual signals for various low-level vision tasks, including image super-resolution (ISR). INR-based ISR methods typically learn continuous representations, providing flexibility for generating high-resolution images at any desired scale from their low-resolution counterparts. However, existing INR-based ISR methods utilize multi-layer perceptrons for parameterization in the network; this does not take into account the hierarchical structure existing in local sampling points and hence constrains the representation capability. In this paper, we propose a new Hierarchical encoding based Implicit Image Function for continuous image super-resolution, HIIF, which leverages a novel hierarchical positional encoding that enhances the local implicit representation, enabling it to capture fine details at multiple scales. Our approach also embeds a multi-head linear attention mechanism within the implicit attention network by taking additional non-local information into account. Our experiments show that, when integrated with different backbone encoders, HIIF outperforms the state-of-the-art continuous image super-resolution methods by up to 0.17dB in PSNR. The source code of HIIF will be made publicly available at www.github.com.
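The sketch below shows one plausible form of hierarchical positional encoding, expanding a local coordinate offset at several frequency scales; the specific scales and layout are assumptions and may differ from HIIF's actual encoding.

```python
# Minimal sketch (one plausible form, under assumed scales; HIIF's exact
# encoding may differ): encode the local offset between a query coordinate and
# its nearest feature location at several frequency scales, so the implicit
# function sees both coarse and fine positional context.
import numpy as np

def hierarchical_encoding(offsets, num_levels=4):
    # offsets: (N, 2) local (dx, dy) offsets in [-1, 1]
    feats = [offsets]
    for level in range(num_levels):
        freq = 2.0 ** level * np.pi
        feats += [np.sin(freq * offsets), np.cos(freq * offsets)]
    return np.concatenate(feats, axis=-1)   # (N, 2 + 4 * num_levels)

offsets = np.random.uniform(-1, 1, size=(5, 2))
print(hierarchical_encoding(offsets).shape)  # (5, 18)
```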
Poster
Yulu Bai · Jiahong Fu · Qi Xie · Deyu Meng
[ ExHall D ]
Abstract
Equivariant and invariant deep learning models have been developed to exploit intrinsic symmetries in data, demonstrating significant effectiveness in certain scenarios. However, these methods often suffer from limited representation accuracy and rely on strict symmetry assumptions that may not hold in practice. These limitations pose a significant drawback for image restoration tasks, which demand high accuracy and precise symmetry representation. To address these challenges, we propose a rotation-equivariant regularization strategy that adaptively enforces the appropriate symmetry constraints on the data while preserving the network's representational accuracy. Specifically, we introduce EQ-Reg, a regularizer designed to enhance rotation equivariance, which innovatively extends the insights of data-augmentation-based and equivariant-based methodologies. This is achieved through self-supervised learning and the spatial rotation and cyclic channel shift of feature maps deduced in the equivariant framework. Our approach is the first to enable a non-strictly equivariant network suitable for image restoration, providing a simple and adaptive mechanism for adjusting equivariance based on task requirements. Extensive experiments across three low-level tasks demonstrate the superior accuracy and generalization capability of our method, outperforming state-of-the-art approaches. We will release the source code once the paper is accepted for publication.
Poster
Fengjia Zhang · Samrudhdhi Rangrej · Tristan T Aumentado-Armstrong · Afsaneh Fazly · Alex Levinshtein
[ ExHall D ]
Abstract
Super-resolution (SR), a classical inverse problem in computer vision, is inherently ill-posed, inducing a distribution of plausible solutions for every input. However, the desired result is not simply the expectation of this distribution, which is the blurry image obtained by minimizing pixelwise error, but rather the sample with the highest image quality. A variety of techniques, from perceptual metrics to adversarial losses, are employed to this end. In this work, we explore an alternative: utilizing powerful non-reference image quality assessment (NR-IQA) models in the SR context. We begin with a comprehensive analysis of NR-IQA metrics on human-derived SR data, identifying both the accuracy (human alignment) and complementarity of different metrics. Then, we explore two methods of applying NR-IQA models to SR learning: (i) altering data sampling, by building on an existing multi-ground-truth SR framework, and (ii) directly optimizing a differentiable quality score. Our results demonstrate a more human-centric perception-distortion tradeoff, focusing less on non-perceptual pixelwise distortion, instead improving the balance between perceptual fidelity and human-tuned NR-IQA measures.
Poster
Long Ma · Tengyu Ma · Ziye Li · Yuetong Wang · Jinyuan Liu · Chengpei Xu · Risheng Liu
[ ExHall D ]
Abstract
Recently, enhancing image quality in the original RAW domain has garnered significant attention, with denoising and reconstruction emerging as fundamental tasks. Although some works attempt to couple these tasks, they primarily focus on multi-stage learning while neglecting task associativity within a broader parameter space, leading to suboptimal performance. This work introduces a novel approach by rethinking denoising and reconstruction from a "backbone-head" perspective, leveraging the stronger shared parameter space offered by the backbone, compared to the encoder used in existing works. We derive task-specific heads with fewer parameters to mitigate learning pressure. By incorporating Chromaticity-aware attention into the backbone and introducing an adaptive denoising prior during training, we enable simultaneous reconstruction and denoising. Additionally, we design a dual-head interaction module to capture the latent correspondence between the two tasks, significantly enhancing multi-task accuracy. Extensive experiments validate the superiority of our approach.
Poster
Lingchen Sun · Rongyuan Wu · Zhiyuan Ma · Shuaizheng Liu · Qiaosi Yi · Lei Zhang
[ ExHall D ]
Abstract
Diffusion prior-based methods have shown impressive results in real-world image super-resolution (SR). However, most existing methods entangle pixel-level and semantic-level SR objectives in the training process, struggling to balance pixel-wise fidelity and perceptual quality. Meanwhile, users have varying preferences on SR results, and thus it is desirable to develop an adjustable SR model that can be tailored to different fidelity-perception preferences during inference without re-training. We present Pixel-level and Semantic-level Adjustable SR (PiSA-SR), which learns two LoRA modules upon the pre-trained stable-diffusion (SD) model to achieve improved and adjustable SR results. We first formulate the SD-based SR problem as learning the residual between the low-quality input and the high-quality output, then show that the learning objective can be decoupled into two distinct LoRA weight spaces: one is characterized by the ℓ2-loss for pixel-level regression, and another is characterized by the LPIPS and classifier score distillation losses to extract semantic information from pre-trained classification and SD models. In its default setting, PiSA-SR can be performed in a single diffusion step, achieving leading real-world SR results in both quality and efficiency. By introducing two adjustable guidance scales on the two LoRA modules to control the strengths of pixel-wise fidelity and semantic-level details during inference, …
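To illustrate the adjustable two-LoRA idea, the sketch below adds two low-rank branches to a frozen linear layer and exposes separate inference-time scales for them; shapes, ranks, and scale semantics are assumptions, not the PiSA-SR implementation.

```python
# Minimal sketch (assumed shapes and scale semantics; not the PiSA-SR code):
# a linear layer augmented with two independent LoRA branches whose strengths
# can be dialed at inference to trade pixel-level fidelity against
# semantic-level detail without retraining.
import torch
import torch.nn as nn

class DualLoRALinear(nn.Module):
    def __init__(self, dim, rank=8):
        super().__init__()
        self.base = nn.Linear(dim, dim)                      # stands in for a frozen pretrained weight
        self.base.weight.requires_grad_(False)
        self.pix_down, self.pix_up = nn.Linear(dim, rank, bias=False), nn.Linear(rank, dim, bias=False)
        self.sem_down, self.sem_up = nn.Linear(dim, rank, bias=False), nn.Linear(rank, dim, bias=False)

    def forward(self, x, pix_scale=1.0, sem_scale=1.0):
        out = self.base(x)
        out = out + pix_scale * self.pix_up(self.pix_down(x))   # pixel-fidelity branch
        out = out + sem_scale * self.sem_up(self.sem_down(x))   # semantic-detail branch
        return out

layer = DualLoRALinear(64)
x = torch.randn(1, 64)
faithful = layer(x, pix_scale=1.2, sem_scale=0.6)   # lean toward fidelity
detailed = layer(x, pix_scale=0.8, sem_scale=1.4)   # lean toward perceptual detail
print(faithful.shape, detailed.shape)
```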
Poster
Xudong Li · Wenjie Nie · Yan Zhang · Runze Hu · Ke Li · Xiawu Zheng · Liujuan Cao
[ ExHall D ]
Abstract
In the Blind Image Quality Assessment (BIQA) field, accurately assessing the quality of authentically distorted images presents a substantial challenge due to the diverse distortion types in natural settings. Existing state-of-the-art IQA methods apply a sequence of distortions to entire images to establish distortion priors, but this is inadequate for images with spatially local distortions. To address this, we introduce a novel IQA framework that employs knowledge distillation tailored to perceive spatially heterogeneous distortions, enhancing quality-distortion awareness. Specifically, we introduce a novel Block-wise Degradation Modelling approach that applies distinct distortions to different spatial blocks of an image, thereby expanding varied distortion priors. Following this, we present a Block-wise Aggregation and Filtering module that enables fine-grained attention to the quality perception within different distortion areas of the image. Furthermore, to enhance the granularity of distortion perception across various regions while maintaining quality perception, we incorporate strategies of Contrastive Knowledge Distillation and Affinity Knowledge Distillation to learn the distortion discrimination power and distortion correlation of different regions, respectively. Extensive experiments on seven standard BIQA datasets demonstrate superior performance over state-of-the-art BIQA methods, i.e., achieving PLCC values of 0.947 (↑1.2% vs. 0.936 in KADID) and 0.735 (↑8.3% vs. 0.679 …
Poster
Jinho Jeong · Sangmin Han · Jinwoo Kim · Seon Joo Kim
[ ExHall D ]
Abstract
In this paper, we propose LSRNA, a novel framework for higher-resolution (exceeding 1K) image generation using diffusion models by leveraging super-resolution directly in the latent space. Existing diffusion models struggle with scaling beyond their training resolutions, often leading to structural distortions or content repetition. Reference-based methods address the issues by upsampling a low-resolution reference to guide higher-resolution generation. However, they face significant challenges: upsampling in latent space often causes manifold deviation, which degrades output quality. On the other hand, upsampling in RGB space tends to produce overly smoothed outputs. To overcome these limitations, LSRNA combines Latent space Super-Resolution (LSR) for manifold alignment and Region-wise Noise Addition (RNA) to enhance high-frequency details. Our extensive experiments demonstrate that LSRNA outperforms state-of-the-art reference-based methods across various resolutions and metrics, while showing the critical role of latent upsampling in preserving detail and sharpness.
Poster
Guangqian Guo · Yong Guo · Xuehui Yu · Wenbo Li · Yaoxing Wang · Shan Gao
[ ExHall D ]
Abstract
Despite their success, Segment Anything Models (SAMs) experience significant performance drops on severely degraded, low-quality images, limiting their effectiveness in real-world scenarios. To address this, we propose GleSAM, which utilizes Generative Latent space Enhancement to boost robustness on low-quality images, thus enabling generalization across various image qualities. Specifically, we adapt the concept of latent diffusion to SAM-based segmentation frameworks and perform the generative diffusion process in the latent space of SAM to reconstruct high-quality representation, thereby improving segmentation. Additionally, we introduce two techniques to improve compatibility between the pre-trained diffusion model and the segmentation framework. Our method can be applied to pre-trained SAM and SAM2 with only minimal additional learnable parameters, allowing for efficient optimization. We also construct the LQSeg dataset with a greater diversity of degradation types and levels for training and evaluating the model. Extensive experiments demonstrate that GleSAM significantly improves segmentation robustness on complex degradations while maintaining generalization to clear images. Furthermore, GleSAM also performs well on unseen degradations, underscoring the versatility of our approach and dataset. Codes and datasets will be available soon.
Poster
Yuhan Wang · Suzhi Bi · Ying-Jun Angela Zhang · Xiaojun Yuan
[ ExHall D ]
Abstract
The distortion-perception (DP) tradeoff reveals a fundamental conflict between distortion metrics (e.g., MSE and PSNR) and perceptual quality. Recent research has increasingly concentrated on evaluating denoising algorithms within the DP framework. However, existing algorithms either prioritize perceptual quality by sacrificing acceptable distortion, or focus on minimizing MSE for faithful restoration. When the goal shifts or noisy measurements vary, adapting to different points on the DP plane needs retraining or even re-designing the model. Inspired by recent advances in solving inverse problems using score-based generative models, we explore the potential of flexibly and optimally traversing DP tradeoffs using a single pre-trained score-based model. Specifically, we introduce a variance-scaled reverse diffusion process and theoretically characterize the marginal distribution. We then prove that the proposed sample process is an optimal solution to the DP tradeoff for conditional Gaussian distribution. Experimental results on two-dimensional and image datasets illustrate that a single score network can effectively and flexibly traverse the DP tradeoff for general denoising problems.
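The sketch below is a schematic of a variance-scaled reverse step: the injected noise of a standard reverse diffusion update is multiplied by a factor in [0, 1], moving between a deterministic, distortion-oriented update and the usual stochastic one. The score function, noise schedule, and step form are placeholders, not the paper's exact process.

```python
# Schematic sketch (not the paper's exact process): a reverse diffusion step in
# which the injected noise is scaled by `perception` in [0, 1]. Setting it to 0
# gives a deterministic, distortion-oriented update; 1 recovers the usual
# stochastic step. `score_fn`, betas, and shapes are placeholders.
import torch

def variance_scaled_reverse_step(x_t, t, score_fn, betas, perception=1.0):
    beta_t = betas[t]
    alpha_t = 1.0 - beta_t
    mean = (x_t + beta_t * score_fn(x_t, t)) / alpha_t.sqrt()   # standard reverse mean
    noise = torch.randn_like(x_t)
    return mean + perception * beta_t.sqrt() * noise            # scaled injected noise

betas = torch.linspace(1e-4, 0.02, 1000)
fake_score = lambda x, t: -x                                     # placeholder score model
x = torch.randn(1, 3, 8, 8)
x_detailed = variance_scaled_reverse_step(x, 999, fake_score, betas, perception=1.0)
x_faithful = variance_scaled_reverse_step(x, 999, fake_score, betas, perception=0.0)
print(x_detailed.shape, x_faithful.shape)
```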
Poster
Zhifu Tian · Tao Hu · Chaoyang Niu · Di Wu · Shu Wang
[ ExHall D ]
Abstract
Scene-aware Adaptive Compressive Sensing (ACS) has attracted significant interest due to its promising capability for efficient and high-fidelity acquisition of scene images. ACS typically prescribes adaptive sampling allocation (ASA) based on previous samples in the absence of ground truth. However, when confronting unknown scenes, existing ACS methods often lack accurate judgment and robust feedback mechanisms for ASA, thus limiting the high-fidelity sensing of the scene. In this paper, we introduce a Sampling Innovation-Based ACS (SIB-ACS) method that can effectively identify and allocate sampling to challenging image reconstruction areas, culminating in high-fidelity image reconstruction. An innovation criterion is proposed to judge ASA by predicting the decrease in image reconstruction error attributable to sampling increments, thereby directing more samples towards regions where the reconstruction error diminishes significantly. A sampling innovation-guided multi-stage adaptive sampling (AS) framework is proposed, which iteratively refines the ASA through a multi-stage feedback process. For image reconstruction, we propose a Principal Component Compressed Domain Network (PCCD-Net), which efficiently and faithfully reconstructs images under AS scenarios. Extensive experiments demonstrate that the proposed SIB-ACS method significantly outperforms the state-of-the-art methods in terms of image reconstruction fidelity and visual effects.
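As a toy illustration of innovation-guided allocation, the sketch below splits a sampling budget across blocks in proportion to a predicted per-block error decrease; the proportional rule and inputs are assumptions, whereas SIB-ACS uses a learned, multi-stage feedback criterion.

```python
# Minimal sketch (the allocation rule is assumed for illustration; SIB-ACS
# derives its innovation criterion from a learned predictor): split a sampling
# budget across image blocks in proportion to the predicted drop in
# reconstruction error that extra samples would bring.
import numpy as np

def allocate_samples(predicted_innovation, total_budget, min_per_block=1):
    # predicted_innovation: (num_blocks,) predicted error decrease per extra sample
    w = np.clip(predicted_innovation, 0, None)
    w = w / (w.sum() + 1e-12)
    alloc = np.floor(w * (total_budget - min_per_block * len(w))).astype(int) + min_per_block
    alloc[np.argmax(w)] += total_budget - alloc.sum()   # give the remainder to the hardest block
    return alloc

innovation = np.array([0.02, 0.40, 0.10, 0.01])          # block 1 is hardest to reconstruct
print(allocate_samples(innovation, total_budget=100))     # most samples go to block 1
```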
Poster
Tomer Garber · Tom Tirer
[ ExHall D ]
Abstract
In recent years, it has become popular to tackle image restoration tasks with a single pretrained diffusion model (DM) and data-fidelity guidance, instead of training a dedicated deep neural network per task. However, such "zero-shot" restoration schemes currently require many Neural Function Evaluations (NFEs) for performing well, which may be attributed to the many NFEs needed in the original generative functionality of the DMs. Recently, faster variants of DMs have been explored for image generation. These include Consistency Models (CMs), which can generate samples via a couple of NFEs. However, existing works that use guided CMs for restoration still require tens of NFEs or fine-tuning of the model per task, which leads to a performance drop if the assumptions made during fine-tuning are not accurate. In this paper, we propose a zero-shot restoration scheme that uses CMs and operates well with as few as 4 NFEs. It is based on a wise combination of several ingredients: better initialization, back-projection guidance, and above all a novel noise injection mechanism. We demonstrate the advantages of our approach for image super-resolution and inpainting. Interestingly, we show that the usefulness of our noise injection technique goes beyond CMs: it can also mitigate the performance degradation …
Poster
Zhi Jiang · Jingbo Hu · Ling Zhang · Gang Fu · Chunxia Xiao
[ ExHall D ]
Abstract
Despite significant advances in the field of specular highlight removal in recent years, existing methods predominantly focus on natural images, where highlights typically appear on raised or edged surfaces of objects. These highlights are often small and sparsely distributed. However, for text images such as cards and posters, the flat surfaces reflect light uniformly, resulting in large areas of highlights. Current methods struggle with these large-area highlights in text images, often producing severe visual artifacts or noticeable discrepancies between filled pixels and the original image in the central high-intensity highlight areas. To address these challenges, we propose the Hierarchical Adaptive Filtering Network (HAFNet). Our approach performs filtering at both the downsampled deep feature layer and the upsampled image reconstruction layer. By designing and applying the Adaptive Comprehensive Filtering Module (ACFM) and Adaptive Dilated Filtering Module (ADFM) at different layers, our method effectively restores semantic information in large-area specular highlight regions and recovers detail loss at various scales. The required filtering kernels are pre-generated by a prediction network, allowing them to adaptively adjust according to different images and their semantic content, enabling robust performance across diverse scenarios. Additionally, we utilize Unity3D to construct a comprehensive large-area highlight dataset featuring images with …
Poster
Yi Liu · Hao Zhou · Benlei Cui · Wenxiang Shang · Ran Lin
[ ExHall D ]
Abstract
Erase inpainting, or object removal, aims to precisely remove target objects within masked regions while preserving the overall consistency of the surrounding content. Although diffusion-based methods have made significant strides in the field of image inpainting, challenges remain regarding the emergence of unexpected objects or artifacts. We assert that the inexact diffusion pathways established by existing standard optimization paradigms constrain the efficacy of object removal. To tackle these challenges, we propose a novel Erase Diffusion, termed EraDiff, aimed at unleashing the potential power of standard diffusion in the context of object removal. In contrast to standard diffusion, EraDiff adapts both the optimization paradigm and the network to improve the coherence and the completeness of the erasure results. We first introduce a Chain-Rectifying Optimization (CRO) paradigm, a sophisticated diffusion process specifically designed to align with the objectives of erasure. This paradigm establishes innovative diffusion transition pathways that simulate the gradual elimination of objects during optimization, allowing the model to accurately capture the intent of object removal. Furthermore, to mitigate deviations caused by artifacts during the sampling pathways, we develop a simple yet effective Self-Rectifying Attention (SRA) mechanism. The SRA calibrates the sampling pathways by altering self-attention activation, allowing the model to effectively …
Poster
Yichi Zhang · Zhihao Duan · Yuning Huang · Fengqing Zhu
[ ExHall D ]
Abstract
Learned image compression (LIC) using deep learning architectures has seen significant advancements, yet standard rate-distortion (R-D) optimization often encounters imbalanced updates due to diverse gradients of the rate and distortion objectives. This imbalance can lead to suboptimal optimization, where one objective dominates, thereby reducing overall compression efficiency. To address this challenge, we reformulate R-D optimization as a multi-objective optimization (MOO) problem and introduce two balanced R-D optimization strategies that adaptively adjust gradient updates to achieve more equitable improvements in both rate and distortion. The first proposed strategy utilizes a coarse-to-fine gradient descent approach along standard R-D optimization trajectories, making it particularly suitable for training LIC models from scratch. The second proposed strategy analytically addresses the reformulated optimization as a quadratic programming problem with an equality constraint, which is ideal for fine-tuning existing models. Experimental results demonstrate that both proposed methods enhance the R-D performance of LIC models, achieving around a 2\% BD-Rate reduction with acceptable additional training cost, leading to a more balanced and efficient optimization process. The code will be made publicly available.
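The second strategy, posed above as a small quadratic program over the rate and distortion gradients, can be illustrated with the classical closed-form minimum-norm combination of two task gradients (as in MGDA). This is a hedged sketch under that assumption; the paper's exact constraint set and solver may differ, and all names below are placeholders.

```python
import torch

def balanced_rd_gradient(g_rate, g_dist):
    """Return the minimum-norm convex combination w*g_rate + (1-w)*g_dist.

    This is the classical closed-form solution for balancing two task
    gradients; the resulting direction improves rate and distortion more
    evenly than summing raw gradients.
    """
    gr, gd = g_rate.flatten(), g_dist.flatten()
    # argmin_w ||w*gr + (1-w)*gd||^2  subject to  0 <= w <= 1
    w = torch.dot(gd - gr, gd) / (gr - gd).pow(2).sum().clamp_min(1e-12)
    w = w.clamp(0.0, 1.0)
    return w * g_rate + (1.0 - w) * g_dist
```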
Poster
Yifan Zhou · Zeqi Xiao · Shuai Yang · Xingang Pan
[ ExHall D ]
Abstract
Latent Diffusion Models (LDMs) are known to have an unstable generation process, where even small perturbations or shifts in the input noise can lead to significantly different outputs. This hinders their applicability in applications requiring consistent results. In this work, we redesign LDMs to enhance consistency by making them shift-equivariant. While introducing anti-aliasing operations can partially improve shift-equivariance, significant aliasing and inconsistency persist due to the unique challenges in LDMs, including 1) aliasing amplification during VAE training and multiple U-Net inferences, and 2) self-attention modules that inherently lack shift-equivariance. To address these issues, we redesign the attention modules to be shift-equivariant and propose an equivariance loss that effectively suppresses the frequency bandwidth of the features in the continuous domain. The resulting alias-free LDM (AF-LDM) achieves strong shift-equivariance and is also robust to irregular warping. Extensive experiments demonstrate that AF-LDM produces significantly more consistent results than vanilla LDM across various applications, including video editing and image-to-image translation.
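The equivariance objective can be illustrated with an integer-pixel version: penalize the mismatch between shifting the model's output and running the model on a shifted input. The continuous-domain, band-limited loss described above would require fractional shifts; this sketch, with hypothetical names, only conveys the structure of the objective.

```python
import torch
import torch.nn.functional as F

def shift_equivariance_loss(model, x, dx=3, dy=5):
    """Penalize the gap between shifting the output and shifting the input.

    `model` maps images (B, C, H, W) to feature maps of the same spatial size.
    Integer circular shifts are used here for simplicity.
    """
    def shift(img):
        return torch.roll(img, shifts=(dy, dx), dims=(-2, -1))

    return F.mse_loss(model(shift(x)), shift(model(x)))
```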
Poster
Pascal Chang · Sergio Sancho · Jingwei Tang · Markus Gross · Vinicius C. Azevedo
[ ExHall D ]
Abstract
Anamorphosis refers to a category of images that are intentionally distorted, making them unrecognizable when viewed directly. Their true form only reveals itself when seen from a specific viewpoint, which can be through some catadioptric device like a mirror or a lens. While the construction of these mathematical devices can be traced back to as early as the 17th century, they are only interpretable when viewed from a specific vantage point and tend to lose meaning when seen normally. In this paper, we revisit these famous optical illusions with a generative twist. With the help of latent rectified flow models, we propose a method to create anamorphic images that still retain a valid interpretation when viewed directly. To this end, we introduce _Laplacian Pyramid Warping_, a frequency-aware image warping technique key to generating high-quality visuals. Our work extends Visual Anagrams [Geng et al. 2024] to latent space models and to a wider range of spatial transforms, enabling the creation of novel generative perceptual illusions.
Poster
Sora Kim · Sungho Suh · Minsik Lee
[ ExHall D ]
Abstract
Diffusion models have achieved remarkable success in image generation, with applications broadening across various domains. Inpainting is one such application that can benefit significantly from diffusion models. Existing methods either hijack the reverse process of a pretrained diffusion model or cast the problem into a larger framework, i.e., conditioned generation. However, these approaches often require nested loops in the generation process or additional components for conditioning. In this paper, we present region-aware diffusion models (RAD) for inpainting with a simple yet effective reformulation of the vanilla diffusion models. RAD utilizes a different noise schedule for each pixel, which allows local regions to be generated asynchronously while considering the global image context. A plain reverse process requires no additional components, enabling RAD to achieve inference time up to 100 times faster than the state-of-the-art approaches. Moreover, we employ low-rank adaptation (LoRA) to fine-tune RAD based on other pretrained diffusion models, reducing computational burdens in training as well. Experiments demonstrate that RAD provides state-of-the-art results both qualitatively and quantitatively on the FFHQ, LSUN Bedroom, and ImageNet datasets.
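A sketch of the key idea, a per-pixel noise schedule: each pixel follows its own forward-diffusion timestep, so local regions can be generated asynchronously. The DDPM-style parameterization and tensor shapes are assumptions made for illustration, not RAD's exact formulation.

```python
import torch

def per_pixel_forward_diffusion(x0, t_map, alphas_cumprod):
    """Noise an image with a per-pixel timestep map.

    x0: (B, C, H, W) clean image; t_map: (H, W) integer timesteps;
    alphas_cumprod: (T,) cumulative alpha schedule.  Pixels assigned small
    timesteps stay nearly clean, pixels assigned large ones become noise.
    """
    a_bar = alphas_cumprod[t_map].view(1, 1, *t_map.shape)  # (1, 1, H, W)
    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    return x_t, noise
```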
Poster
Lucas Relic · Roberto Azevedo · Yang Zhang · Markus Gross · Christopher Schroers
[ ExHall D ]
Abstract
Generative neural image compression supports data representation with extremely low bitrate, allowing clients to synthesize details and consistently producing highly realistic images. By leveraging the similarities between quantization error and additive noise, diffusion-based generative image compression codecs can be built using a latent diffusion model to "denoise" the artifacts introduced by quantization. However, we identify three critical gaps in previous approaches following this paradigm (namely, the noise level, noise type, and discretization gaps) that result in the quantized data falling out of the data distribution known by the diffusion model. In this work, we propose a novel quantization-based forward diffusion process with theoretical foundations that tackles all three aforementioned gaps. We achieve this through universal quantization with a carefully tailored quantization schedule and a diffusion model trained with uniform noise. Compared to previous work, our proposal produces consistently realistic and detailed reconstructions, even at very low bitrates. In such a regime, we achieve the best rate-distortion-realism performance, outperforming previous related works.
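The quantization-based forward process can be illustrated with universal quantization: with a dither shared between encoder and decoder, the quantization error is exactly uniform noise, which is what a diffusion model trained with uniform noise can then remove. The helper below is a sketch under that assumption; the tailored quantization schedule mentioned above is not modeled here.

```python
import torch

def universal_quantize(latent, delta, dither=None):
    """Universal quantization with a dither shared by encoder and decoder.

    The reconstruction error delta*(round(x/delta + u) - u) - x is uniformly
    distributed and independent of x, so it resembles the uniform noise the
    diffusion model was trained to remove.
    """
    if dither is None:
        dither = torch.rand_like(latent) - 0.5       # shared u ~ U[-0.5, 0.5)
    q = delta * (torch.round(latent / delta + dither) - dither)
    return q, dither
```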
Poster
Nick Stracke · Stefan Andreas Baumann · Kolja Bauer · Frank Fundel · Björn Ommer
[ ExHall D ]
Abstract
Internal features from large-scale pre-trained diffusion models have recently been established as powerful semantic descriptors for a wide range of downstream tasks. Works that use these features generally need to add noise to images before passing them through the model to obtain the semantic features, as the models do not offer the most useful features when given images with little to no noise. We show that this noise has a critical impact on the usefulness of these features that cannot be remedied by ensembling with different random noises. We address this issue by introducing a lightweight, unsupervised fine-tuning method that enables diffusion backbones to provide high-quality, noise-free semantic features. We show that these features readily outperform previous diffusion features by a wide margin in a wide variety of extraction setups and downstream tasks, offering better performance than even ensemble-based methods at a fraction of the cost.
Poster
Haosen Yang · Adrian Bulat · Isma Hadji · Hai X. Pham · Xiatian Zhu · Georgios Tzimiropoulos · Brais Martinez
[ ExHall D ]
Abstract
Diffusion models are proficient at generating high-quality images. They are however effective only when operating at the resolution used during training. Inference at a scaled resolution leads to repetitive patterns and structural distortions. Retraining at higher resolutions quickly becomes prohibitive. Thus, methods enabling pre-existing diffusion models to operate at flexible test-time resolutions are highly desirable. Previous works suffer from frequent artifacts and often introduce large latency overheads. We propose two simple modules that combine to solve these issues. We introduce a Frequency Modulation (FM) module that leverages the Fourier domain to improve the global structure consistency, and an Attention Modulation (AM) module which improves the consistency of local texture patterns, a problem largely ignored in prior works. Our method, coined FAM diffusion, can seamlessly integrate into any latent diffusion model and requires no additional training. Extensive qualitative results highlight the effectiveness of our method in addressing structural and local artifacts, while quantitative results show state-of-the-art performance. Also, our method avoids redundant inference tricks for improved consistency such as patch-based or progressive generation, leading to negligible latency overheads.
Poster
Yuchuan Tian · Jing Han · Chengcheng Wang · Yuchen Liang · Chao Xu · Hanting Chen
[ ExHall D ]
Abstract
Diffusion models have shown exceptional performance in visual generation tasks. Recently, these models have shifted from traditional U-Shaped CNN-Attention hybrid structures to fully transformer-based isotropic architectures. While these transformers exhibit strong scalability and performance, their reliance on complicated self-attention operations results in slow inference speeds. Contrary to these works, we rethink one of the simplest yet fastest modules in deep learning, the 3x3 convolution, to construct a scaled-up purely convolutional diffusion model. We first discover that an Encoder-Decoder Hourglass design outperforms scalable isotropic architectures for Conv3x3, but still falls short of our expectations. Further improving the architecture, we introduce sparse skip connections to reduce redundancy and improve scalability. Based on the architecture, we introduce conditioning improvements including stage-specific embeddings, mid-block condition injection, and conditional gating. These improvements lead to our proposed Diffusion CNN (DiC), which serves as a swift yet competitive diffusion architecture baseline. Experiments on various scales and settings show that DiC surpasses existing diffusion transformers by considerable margins in terms of performance while keeping a good speed advantage.
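A hedged sketch of one of the conditioning improvements listed above, conditional gating, in a purely convolutional block: two 3x3 convolutions whose residual is scaled by a condition-dependent gate. Channel sizes, the activation, and the gating form are illustrative assumptions, not the DiC architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedConv3x3Block(nn.Module):
    """Two 3x3 convolutions with a condition-dependent gate on the residual."""

    def __init__(self, channels, cond_dim):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.gate = nn.Linear(cond_dim, channels)

    def forward(self, x, cond):
        h = self.conv2(F.silu(self.conv1(x)))
        g = torch.sigmoid(self.gate(cond))[:, :, None, None]  # (B, C, 1, 1)
        return x + g * h                                        # gated residual
```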
Poster
Yushu Wu · Zhixing Zhang · Yanyu Li · Yanwu Xu · Anil Kag · Yang Sui · Huseyin Coskun · Ke Ma · Aleksei Lebedev · Ju Hu · Dimitris N. Metaxas · Yanzhi Wang · Sergey Tulyakov · Jian Ren
[ ExHall D ]
Abstract
We have witnessed the unprecedented success of diffusion-based video generation over the past year. Recently proposed models from the community have wielded the power to generate cinematic and high-resolution videos with smooth motions from arbitrary input prompts. However, as a supertask of image generation, video generation models require more computation and are thus hosted mostly on cloud servers, limiting broader adoption among content creators. In this work, we propose a comprehensive acceleration framework to bring the power of the large-scale video diffusion model to the hands of edge users. From the network architecture perspective, we initialize from a compact image backbone and search out the design and arrangement of temporal layers to maximize hardware efficiency. In addition, we propose a dedicated adversarial fine-tuning algorithm for our efficient model and reduce the denoising steps to 4. Our model, with only 0.6B parameters, can generate a 5-second video on an iPhone 16 PM within 5 seconds. Compared to server-side models that take minutes on powerful GPUs to generate a single video, we accelerate generation by orders of magnitude while delivering on-par quality.
Poster
Ziqi Pang · Tianyuan Zhang · Fujun Luan · Yunze Man · Hao Tan · Kai Zhang · William Freeman · Yu-Xiong Wang
[ ExHall D ]
Abstract
We introduce RandAR, a decoder-only visual autoregressive (AR) model capable of generating images in arbitrary token orders. Unlike previous decoder-only AR models that rely on a predefined generation order, RandAR removes this inductive bias, unlocking new capabilities in decoder-only generation. Our essential design enabling random order is to insert a "position instruction token" before each image token to be predicted, representing the spatial location of the next image token. Trained on randomly permuted token sequences -- a more challenging task than fixed-order generation -- RandAR achieves comparable performance to its conventional raster-order counterpart. More importantly, decoder-only transformers trained on random orders acquire new capabilities. For the efficiency bottleneck of AR models, RandAR adopts parallel decoding with KV-Cache at inference time, enjoying a 2.5x acceleration without sacrificing generation quality. Additionally, RandAR supports inpainting, outpainting and resolution extrapolation in a zero-shot manner. We hope RandAR inspires new directions for decoder-only visual generation models and broadens their applications across diverse scenarios.
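A minimal sketch of the sequence construction: a position instruction token (here, an entry of a hypothetical positional-embedding table) is inserted before each image token, and the pairs are visited in a random order. Shapes and names are assumptions for illustration.

```python
import torch

def build_random_order_sequence(image_tokens, pos_embeds):
    """Interleave a position instruction token before each image token,
    visiting positions in a random order.

    image_tokens, pos_embeds: (N, D) tensors; returns a (2N, D) sequence
    of [pos_0, tok_0, pos_1, tok_1, ...] in the permuted order.
    """
    order = torch.randperm(image_tokens.shape[0])
    interleaved = torch.stack([pos_embeds[order], image_tokens[order]], dim=1)
    return interleaved.reshape(-1, image_tokens.shape[-1])
```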
Poster
Zijian Zhou · Shikun Liu · Xiao Han · Haozhe Liu · Kam Woh Ng · Tian Xie · Yuren Cong · Hang Li · Mengmeng Xu · Juan-Manuel Pérez-Rúa · Aditya Patel · Tao Xiang · Miaojing Shi · Sen He
[ ExHall D ]
Abstract
Controllable person image generation aims to generate a person image conditioned on reference images, allowing precise control over the person’s appearance or pose. However, prior methods often distort fine-grained textural details from the reference image, despite achieving high overall image quality. We attribute these distortions to inadequate attention to corresponding regions in the reference image. To address this, we propose learning flow fields in attention (Leffa), which explicitly guides the target query to attend to the correct reference key in the attention layer during training. Specifically, it is realized via a regularization loss on top of the attention map within a diffusion-based baseline. Our extensive experiments show that Leffa achieves state-of-the-art performance in controlling appearance (virtual try-on) and pose (pose transfer), significantly reducing fine-grained detail distortion while maintaining high image quality. Additionally, we show that our loss is model-agnostic and can be used to improve the performance of other diffusion models.
Poster
Xiao Zhang · Ruoxi Jiang · Rebecca Willett · Michael Maire
[ ExHall D ]
Abstract
We introduce nested diffusion models, an efficient and powerful hierarchical generative framework that substantially enhances the generation quality of diffusion models, particularly for images of complex scenes. Our approach employs a series of diffusion models to progressively generate latent variables at different semantic levels. Each model in this series is conditioned on the output of the preceding higher-level models, culminating in image generation. Hierarchical latent variables guide the generation process along predefined semantic pathways, allowing our approach to capture intricate structural details while significantly improving image quality. To construct these latent variables, we leverage a pre-trained visual encoder, which learns strong semantic visual representations, and modulate its capacity via dimensionality reduction and noise injection. Across multiple datasets, our system demonstrates significant enhancements in image quality for both unconditional and class/text conditional generation. Moreover, our unconditional generation system substantially outperforms the baseline conditional system. These advancements incur minimal computational overhead as the more abstract levels of our hierarchy work with lower-dimensional representations.
Poster
Myunsoo Kim · Donghyeon Ki · Seong-Woong Shim · Byung-Jun Lee
[ ExHall D ]
Abstract
As a highly expressive generative model, diffusion models have demonstrated exceptional success across various domains, including image generation, natural language processing, and combinatorial optimization. However, as data distributions grow more complex, training these models to convergence becomes increasingly computationally intensive. While diffusion models are typically trained using uniform timestep sampling, our research shows that the variance in stochastic gradients varies significantly across timesteps, with high-variance timesteps becoming bottlenecks that hinder faster convergence. To address this issue, we introduce a non-uniform timestep sampling method that prioritizes these more critical timesteps. Our method tracks the impact of gradient updates on the objective for each timestep, adaptively selecting those most likely to minimize the objective effectively. Experimental results demonstrate that this approach not only accelerates the training process, but also leads to improved performance at convergence. Furthermore, our method shows robust performance across various datasets, scheduling strategies, and diffusion architectures, outperforming previously proposed timestep sampling and weighting heuristics that lack this degree of robustness.
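The sampling mechanism can be sketched as follows: keep a running per-timestep statistic of how much each timestep contributes and draw training timesteps in proportion to it. The EMA-of-loss statistic below is a stand-in for the gradient-impact criterion described above; the class and method names are hypothetical.

```python
import torch

class AdaptiveTimestepSampler:
    """Sample training timesteps in proportion to a running importance score."""

    def __init__(self, num_timesteps, ema=0.99):
        self.scores = torch.ones(num_timesteps)
        self.ema = ema

    def sample(self, batch_size):
        probs = self.scores / self.scores.sum()
        return torch.multinomial(probs, batch_size, replacement=True)

    def update(self, timesteps, losses):
        # EMA of the observed loss per timestep, used as the importance score.
        for t, l in zip(timesteps.tolist(), losses.tolist()):
            self.scores[t] = self.ema * self.scores[t] + (1.0 - self.ema) * l
```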
Poster
Nanye Ma · Shangyuan Tong · Haolin Jia · Hexiang Hu · Yu-Chuan Su · Mingda Zhang · Xuan Yang · Yandong Li · Tommi Jaakkola · Xuhui Jia · Saining Xie
[ ExHall D ]
Abstract
Generative models have made significant impacts across various domains, largely due to their ability to scale during training by increasing data, computational resources, and model size, a phenomenon characterized by the scaling laws. Recent research has begun to explore inference-time scaling behavior in large language models (LLMs), revealing how performance can further improve with additional computation during inference. Unlike LLMs, diffusion models inherently possess the flexibility to adjust inference-time computation via the number of function evaluations (NFE), although the performance gains typically flatten after a few dozen steps. In this work, we present a framework for inference-time scaling of diffusion models that enables them to further benefit from increased computation beyond the NFE plateau. Specifically, we consider a search problem aimed at identifying better noises during the sampling process. We structure the design space along two axes: the verifiers used to provide feedback, and the algorithms used to find better candidates. Through extensive experiments on class-conditioned and text-conditioned image generation benchmarks, our findings reveal that increasing inference-time compute leads to substantial improvements in the quality of samples generated by diffusion models, and with the complicated nature of images, combinations of the components in the framework can be …
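The simplest point in this design space, random search over initial noises guided by a verifier, can be sketched as a best-of-N loop. Here `sample_fn` and `verifier_fn` are hypothetical stand-ins for a full diffusion sampler and a feedback model.

```python
import torch

def best_of_n_noise_search(sample_fn, verifier_fn, shape, n_candidates=8):
    """Run the sampler from several candidate noises and keep the sample the
    verifier scores highest."""
    best_img, best_score = None, float("-inf")
    for _ in range(n_candidates):
        img = sample_fn(torch.randn(shape))      # full diffusion sampling run
        score = verifier_fn(img)                 # scalar feedback signal
        if score > best_score:
            best_img, best_score = img, score
    return best_img, best_score
```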
Poster
Hermann Kumbong · Xian Liu · Tsung-Yi Lin · Ming-Yu Liu · Xihui Liu · Ziwei Liu · Daniel Y Fu · Christopher Re · David W. Romero
[ ExHall D ]
Abstract
Visual AutoRegressive modeling (VAR) shows promise in bridging the speed and quality gap between autoregressive image models and diffusion models. VAR reformulates autoregressive modeling by decomposing an image into successive resolution scales. During inference, an image is generated by predicting all the tokens in the next (higher-resolution) scale, conditioned on all tokens in all previous (lower-resolution) scales. However, this formulation suffers from reduced image quality due to parallel generation of all tokens in a resolution scale; has sequence lengths scaling superlinearly in image resolution; and requires retraining to change the resolution sampling schedule. We introduce Hierarchical Masked AutoRegressive modeling (HMAR), a new image generation algorithm that alleviates these issues using next-scale prediction and masked prediction to generate high-quality images with fast sampling. HMAR reformulates next-scale prediction as a Markovian process, wherein prediction of each resolution scale is conditioned only on tokens in its immediate predecessor instead of the tokens in all predecessor resolutions. When predicting a resolution scale, HMAR uses a controllable multi-step masked generation procedure to generate a subset of the tokens in each step. On ImageNet 256×256 and 512×512 benchmarks, HMAR models match or outperform parameter-matched VAR, diffusion, and autoregressive baselines. We develop efficient IO-aware …
Poster
Liao Qu · Huichao Zhang · Yiheng Liu · Xu Wang · Yi Jiang · Yiming Gao · Hu Ye · Daniel Kang Du · Zehuan Yuan · Xinglong Wu
[ ExHall D ]
Abstract
We present TokenFlow, a novel unified image tokenizer that bridges the long-standing gap between multimodal understanding and generation. Prior research attempts to employ a single reconstruction-targeted Vector Quantization (VQ) encoder for unifying these two tasks. We observe that understanding and generation require fundamentally different granularities of visual information. This leads to a critical trade-off, particularly compromising performance in multimodal understanding tasks. TokenFlow addresses this challenge through an innovative dual-codebook architecture that decouples semantic and pixel-level feature learning while maintaining their alignment via a shared mapping mechanism. This design enables direct access to both high-level semantic representations crucial for understanding tasks and fine-grained visual features essential for generation through shared indices. Our extensive experiments demonstrate TokenFlow's superiority across multiple dimensions. Leveraging TokenFlow, we demonstrate for the first time that discrete visual input can surpass LLaVA-1.5 13B in understanding performance, achieving a 7.2\% average improvement. For image reconstruction, we achieve a strong FID score of 0.63 at 384×384 resolution. Moreover, TokenFlow establishes state-of-the-art performance in autoregressive image generation with a GenEval score of 0.55 at 256×256 resolution, achieving comparable results to SDXL.
Poster
Subhadeep Koley · Tapas Kumar Dutta · Aneeshan Sain · Pinaki Nath Chowdhury · Ayan Kumar Bhunia · Yi-Zhe Song
[ ExHall D ]
Abstract
While foundation models have revolutionised computer vision, their effectiveness for sketch understanding remains limited by the unique challenges of abstract, sparse visual inputs. Through systematic analysis, we uncover two fundamental limitations: Stable Diffusion (SD) struggles to extract meaningful features from abstract sketches (unlike its success with photos), and exhibits a pronounced frequency-domain bias that suppresses essential low-frequency components needed for sketch understanding. Rather than costly retraining, we address these limitations by strategically combining SD with CLIP, whose strong semantic understanding naturally compensates for SD's spatial-frequency biases. By dynamically injecting CLIP features into SD's denoising process and adaptively aggregating features across semantic levels, our method achieves state-of-the-art performance in sketch retrieval (+3.35\%), recognition (+1.06\%), segmentation (+29.42\%), and correspondence learning (+21.22\%), demonstrating the first truly universal sketch feature representation in the era of foundation models.
Poster
Roberto Henschel · Levon Khachatryan · Hayk Poghosyan · Daniil Hayrapetyan · Vahram Tadevosyan · Zhangyang Wang · Shant Navasardyan · Humphrey Shi
[ ExHall D ]
Abstract
Text-to-video diffusion models enable the generation of high-quality videos that follow text instructions, simplifying the process of producing diverse and individual content. Current methods excel in generating short videos (up to 16s), but produce hard-cuts when naively extended to long video synthesis. To overcome these limitations, we present StreamingT2V, an autoregressive method that generates long videos of up to 2 minutes or longer with seamless transitions. The key components are: (i) a short-term memory block called conditional attention module (CAM), which conditions the current generation on the features extracted from the preceding chunk via an attentional mechanism, leading to consistent chunk transitions, (ii) a long-term memory block called appearance preservation module (APM), which extracts high-level scene and object features from the first video chunk to prevent the model from forgetting the initial scene, and (iii) a randomized blending approach that allows for the autoregressive application of a video enhancer on videos of indefinite length, ensuring consistency across chunks. Experiments show that StreamingT2V produces more motion, while competing methods suffer from video stagnation when applied naively in an autoregressive fashion. Thus, we propose with StreamingT2V a high-quality seamless text-to-long video generator, surpassing competitors in both consistency and motion.
Poster
Hongjie Wang · Chih-Yao Ma · Yen-Cheng Liu · Ji Hou · Tao Xu · Jialiang Wang · Felix Juefei-Xu · Yaqiao Luo · Peizhao Zhang · Tingbo Hou · Peter Vajda · Niraj Jha · Xiaoliang Dai
[ ExHall D ]
Abstract
Text-to-video generation enhances content creation but is highly computationally intensive: The computational cost of Diffusion Transformers (DiTs) scales quadratically in the number of pixels. This makes minute-length video generation extremely expensive, limiting most existing models to generating videos of only 10-20 seconds length. We propose a Linear-complexity text-to-video Generation (LinGen) framework whose cost scales linearly in the number of pixels. For the first time, LinGen enables high-resolution minute-length video generation on a single GPU without compromising quality. It replaces the computationally-dominant and quadratic-complexity block, self-attention, with a linear-complexity block called MATE, which consists of an MA-branch and a TE-branch. The MA-branch targets short-to-long-range correlations, combining a bidirectional Mamba2 block with our token rearrangement method, Rotary Major Scan, and our review tokens developed for long video generation. The TE-branch is a novel TEmporal Swin Attention block that focuses on temporal correlations between adjacent tokens and medium-range tokens. The MATE block addresses the adjacency preservation issue of Mamba and improves the consistency of generated videos significantly. Experimental results show that LinGen outperforms DiT (with a 75.6\% win rate) in video quality with up to 15× (11.5×) FLOPs (latency) reduction. Furthermore, both automatic metrics and human evaluation demonstrate our LinGen-4B yields comparable video …
Poster
Yukun Wang · Longguang Wang · Zhiyuan Ma · Qibin Hu · Kai Xu · Yulan Guo
[ ExHall D ]
Abstract
Although the typical inversion-then-editing paradigm using text-to-image (T2I) models has demonstrated promising results, directly extending it to text-to-video (T2V) models still suffers from severe artifacts such as color flickering and content distortion. Consequently, current video editing methods primarily rely on T2I models, which inherently lack temporal-coherence generative ability, often resulting in inferior editing results. In this paper, we attribute the failure of the typical editing paradigm to: 1) Tight Spatial-temporal Coupling. The vanilla pivotal-based inversion strategy struggles to disentangle spatial-temporal information in the video diffusion model; 2) Complicated Spatial-temporal Layout. The vanilla cross-attention control strategy is deficient in preserving the unedited content. To address these limitations, we propose a spatial-temporal decoupled guidance (STDG) and multi-frame null-text optimization strategy to provide pivotal temporal cues for more precise pivotal inversion. Furthermore, we introduce a self-attention control strategy to maintain higher fidelity for precise partial content editing. Experimental results demonstrate that our method (termed VideoDirector) effectively harnesses the powerful temporal generation capabilities of T2V models, producing edited videos with state-of-the-art performance in accuracy, motion smoothness, realism, and fidelity to unedited content.
Poster
Dohun Lee · Bryan Sangwoo Kim · Geon Yeong Park · Jong Chul Ye
[ ExHall D ]
Abstract
Text-to-image (T2I) diffusion models have revolutionized visual content creation, but extending these capabilities to text-to-video (T2V) generation remains a challenge, particularly in preserving temporal consistency. Existing methods that aim to improve consistency often cause trade-offs such as reduced imaging quality and impractical computational time. To address these issues we introduce VideoGuide, a novel framework that enhances the temporal consistency of pretrained T2V models without the need for additional training or fine-tuning. Instead, VideoGuide leverages any pretrained video diffusion model (VDM) or itself as a guide during the early stages of inference, improving temporal quality by interpolating the guiding model’s denoised samples into the sampling model's denoising process. The proposed method brings about significant improvement in temporal consistency and image fidelity, providing a cost-effective and practical solution that synergizes the strengths of various video diffusion models. Furthermore, we demonstrate prior distillation, revealing that base models can achieve enhanced text coherence by utilizing the superior data prior of the guiding model through the proposed method. Project Page: https://videoguide2025.github.io/
Poster
Xi Wang · Robin Courant · Marc Christie · Vicky Kalogeiton
[ ExHall D ]
Abstract
Recent advances in text-conditioned video diffusion have greatly improved video quality. However, these methods offer limited or sometimes no control to users on camera aspects, including dynamic camera motion, zoom, distorted lens and focus shifts. These motion and optical aspects are crucial for adding controllability and cinematic elements to generation frameworks, ultimately resulting in visual content that draws focus, enhances mood, and guides emotions according to filmmakers' controls. In this paper, we aim to close the gap between controllable video generation and camera optics. To achieve this, we propose AKiRa (Augmentation Kit on Rays), a novel augmentation framework that builds and trains a camera adapter with a complex camera model over an existing video generation backbone. It enables fine-tuned control over camera motion as well as complex optical parameters (focal length, distortion, aperture) to achieve cinematic effects such as zoom, fisheye effect, and bokeh. Extensive experiments demonstrate AKiRa's effectiveness in combining and composing camera optics while outperforming all state-of-the-art methods. This work sets a new landmark in controlled and optically enhanced video generation, paving the way for future camera diffusion methods.
Poster
Mingi Kwon · Shin seong Kim · Jaeseok Jeong · Yi-Ting Hsiao · Youngjung Uh
[ ExHall D ]
Abstract
Diffusion models have achieved remarkable success in text-to-image synthesis, largely attributed to the use of classifier-free guidance (CFG), which enables high-quality, condition-aligned image generation. CFG combines the conditional score (e.g., text-conditioned) with the unconditional score to control the output. However, the unconditional score is in charge of estimating the transition between manifolds of adjacent timesteps from $x_t$ to $x_{t-1}$, which may inadvertently interfere with the trajectory toward the specific condition. In this work, we introduce a novel approach that leverages a geometric perspective on the unconditional score to enhance CFG performance when conditional scores are available. Specifically, we propose a method that filters the singular vectors of both conditional and unconditional scores using singular value decomposition. This filtering process aligns the unconditional score with the conditional score, thereby refining the sampling trajectory to stay closer to the manifold. Our approach improves image quality with negligible additional computation. We provide deeper insights into the score function behavior in diffusion models and present a practical technique for achieving more accurate and contextually coherent image synthesis.
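One way to realize such filtering, assumed here purely for illustration, is to project the unconditional prediction onto the leading left-singular directions of the conditional one before applying standard classifier-free guidance. Treating each noise prediction as a (channels x pixels) matrix and the fraction of retained directions are assumptions of this sketch, not the paper's exact rule.

```python
import torch

def svd_filtered_cfg(eps_cond, eps_uncond, guidance_scale=7.5, keep=0.9):
    """CFG after projecting the unconditional prediction onto the leading
    singular directions of the conditional one.

    eps_cond, eps_uncond: (B, C, H, W) noise predictions.
    """
    b, c, h, w = eps_cond.shape
    out = []
    for ec, eu in zip(eps_cond, eps_uncond):
        U, S, _ = torch.linalg.svd(ec.reshape(c, h * w), full_matrices=False)
        k = max(1, int(keep * S.numel()))
        proj = U[:, :k] @ U[:, :k].T                  # projector onto span(U_k)
        eu_f = (proj @ eu.reshape(c, h * w)).reshape(c, h, w)
        out.append(eu_f + guidance_scale * (ec - eu_f))
    return torch.stack(out)
```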
Poster
Zixuan Ye · Huijuan Huang · Xintao Wang · Pengfei Wan · Di ZHANG · Wenhan Luo
[ ExHall D ]
Abstract
Style control has been popular in video generation models. Existing methods often generate videos far from the given style, cause content leakage, and struggle to transfer one video to the desired style. Our first observation is that the style extraction stage matters, whereas existing methods emphasize global style but ignore local textures. In order to bring in texture features while preventing content leakage, we filter content-related patches while retaining style ones based on prompt-patch similarity; for global style extraction, we generate a paired style dataset through model illusion to facilitate contrastive learning, which greatly enhances the absolute style consistency. Moreover, to fill in the image-to-video gap, we train a lightweight motion adapter on still videos, which implicitly enhances the stylization extent, and enables our image-trained model to be seamlessly applied to videos. Benefiting from these efforts, our approach, StyleMaster, not only achieves significant improvement in both style resemblance and temporal coherence, but also can easily generalize to video style transfer with a gray tile ControlNet. Extensive experiments and visualizations demonstrate that StyleMaster significantly outperforms competitors, effectively generating high-quality stylized videos that align with textual content and closely resemble the style of reference images.
Poster
Nadav Z. Cohen · Oron Nir · Ariel Shamir
[ ExHall D ]
Abstract
Balancing content fidelity and artistic style is a pivotal challenge in image generation. While traditional style transfer methods and modern Denoising Diffusion Probabilistic Models (DDPMs) strive to achieve this balance, they often struggle to do so without sacrificing either style, content, or sometimes both. This work addresses this challenge by analyzing the ability of DDPMs to maintain content and style equilibrium. We introduce a novel method to identify sensitivities within the DDPM attention layers, identifying specific layers that correspond to different stylistic aspects. By directing conditional inputs only to these sensitive layers, our approach enables fine-grained control over style and content, significantly reducing issues arising from over-constrained inputs. Our findings demonstrate that this method enhances recent stylization techniques by better aligning style and content, ultimately improving the quality of generated visual content.
Poster
Yufan Ren · Zicong Jiang · Tong Zhang · Søren Forchhammer · Sabine Süsstrunk
[ ExHall D ]
Abstract
Text-guided image editing using Text-to-Image (T2I) models often fails to yield satisfactory results, frequently introducing unintended modifications such as loss of local details and color alterations. In this paper, we analyze these failure cases and attribute them to the indiscriminate optimization across all frequency bands, even though only specific frequencies may require adjustment. To address this, we introduce a simple yet effective approach that enables selective optimization of specific frequency bands within spatially localized regions, allowing for precise edits. Our method leverages wavelets to decompose images into different spatial resolutions across multiple frequency bands, enabling precise modifications across different levels of detail. To extend the applicability of our approach, we also provide a comparative analysis of different frequency-domain techniques. Additionally, we extend our method to 3D texture editing by performing frequency decomposition on the triplane representation, achieving frequency-aware adjustments for 3D textures. Quantitative evaluations and user studies demonstrate the effectiveness of our method in producing high-quality and precise edits. Code will be released upon publication.
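A sketch of frequency-selective editing with an off-the-shelf wavelet transform (PyWavelets): decompose both the original and the edited image, and swap in only one detail band inside the masked region. This grayscale, single-band blend is illustrative only; the paper optimizes selected bands rather than copying them, and the mask resampling here is deliberately crude.

```python
import numpy as np
import pywt

def blend_wavelet_band(original, edited, mask, level=2, band=1):
    """Keep the original image except for one wavelet detail band, which is
    taken from the edited image inside the masked region.

    original, edited: 2-D grayscale arrays; mask: binary array, crudely
    resized to each coefficient map's shape.
    """
    c_orig = pywt.wavedec2(original, "haar", level=level)
    c_edit = pywt.wavedec2(edited, "haar", level=level)
    out = list(c_orig)
    blended = []
    for co, ce in zip(c_orig[band], c_edit[band]):    # (cH, cV, cD) at `band`
        m = np.resize(mask, co.shape)                 # crude mask resampling
        blended.append(np.where(m > 0.5, ce, co))
    out[band] = tuple(blended)
    return pywt.waverec2(out, "haar")
```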
Poster
Fengyi Fu · Lei Zhang · Mengqi Huang · Zhendong Mao
[ ExHall D ]
Abstract
Text-based image editing, which aims at generating rigid or non-rigid changes to images conditioned on the given text, has recently attracted considerable interest. Previous works mainly follow the multi-step denoising diffusion paradigm, which adopts a fixed text guidance intensity (i.e., editing intensity) to inject textual features, while ignoring the step-specific editing requirements. This work argues that the editing intensity at each denoising step should be adaptively adjusted conditioned on the historical editing degree, to provide accurate text guidance for the whole denoising process. We thereby propose a novel feedback editing framework (FeedEdit), a training-free method that explicitly exploits feedback regulation of the editing intensity to ensure precise and harmonious editing at all steps. Specifically, we design (1) a Dynamic Editing Degree Perceiving module, which is based on specific frequency-domain filtering, to enhance and exploit the correlation between feature differences and editing degree for perceiving; (2) a Proportional-Integral feedback controller, to automatically map the perceived editing errors into appropriate feedback control signals; and (3) a Phrase-level Regulating Strategy, to achieve fine-grained function-specific regulation of textual features. Extensive experiments demonstrate the superiority of FeedEdit over existing methods in both editability and quality, especially for multi-function editing scenarios.
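The feedback controller can be sketched as a textbook proportional-integral loop that converts the perceived editing degree at each denoising step into an adjustment of the text-guidance intensity. Gains, the target value, and the error definition are placeholders; only the control structure follows the description above.

```python
class PIEditController:
    """Proportional-Integral controller mapping the perceived editing degree
    at each denoising step to an adjustment of the guidance intensity."""

    def __init__(self, kp=0.5, ki=0.1, target_degree=1.0):
        self.kp, self.ki, self.target = kp, ki, target_degree
        self.integral = 0.0

    def step(self, perceived_degree):
        error = self.target - perceived_degree     # under-/over-editing so far
        self.integral += error                     # accumulated error
        return self.kp * error + self.ki * self.integral
```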
Poster
Duong H. Le · Tuan Pham · Sangho Lee · Christopher Clark · Aniruddha Kembhavi · Stephan Mandt · Ranjay Krishna · Jiasen Lu
[ ExHall D ]
Abstract
We introduce OneDiffusion - a single large-scale diffusion model designed to tackle a wide range of image synthesis and understanding tasks. It can generate images conditioned on text, depth, pose, layout, or semantic maps. It also handles super-resolution, multi-view generation, instant personalization, and text-guided image editing. Additionally, the same model is also effective for image understanding tasks such as depth estimation, pose estimation, open-vocabulary segmentation, and camera pose estimation. Our unified model is trained with a simple but effective recipe: we cast all tasks as modeling a sequence of frames, where we inject a different noise scale into each frame during training. During inference, we can set any frame to a clean image for tasks like conditional image generation, camera pose estimation, or generating paired depth and images from a caption. Experimental results show that our proposed method achieves competitive performance on a wide range of tasks. By eliminating the need for specialized architectures or finetuning, OneDiffusion provides enhanced flexibility and scalability, reflecting the emerging trend towards general-purpose diffusion models in the field.
Poster
Yanfeng Li · Ka-Hou Chan · Yue Sun · Chan-Tong Lam · Tong Tong · Zitong YU · Keren Fu · Xiaohong Liu · Tao Tan
[ ExHall D ]
Abstract
Multi-object images are widely present in the real world, spanning various areas of daily life. Efficient and accurate editing of these images is crucial for applications such as augmented reality, advertisement design, and medical imaging. Stable Diffusion (SD) has ushered in a new era of high-quality image generation and editing. However, existing methods struggle to analyze abstract relationships between multiple objects, often yielding suboptimal performance. To address this, we propose MoEdit, an auxiliary-free method for multi-object image editing. This method enables high-quality editing of attributes in multi-object images, such as style, object features, and background, by maintaining quantity consistency between the input and output images. To preserve the concept of quantity, we introduce a Quantity Attention (QTTN) module to control the editing process. Additionally, we present a Feature Compensation (FeCom) module to enhance the robustness of object preservation. We also employ a null-text training strategy to retain the guiding capability of text prompts, making them plug-and-play modules. This method leverages the powerful image perception and generation capabilities of the SD model, enabling specific concepts to be preserved and altered between input and output images. Experimental results demonstrate that MoEdit achieves the state-of-the-art performance in high-quality multi-object image editing. Code will …
Poster
Yingjing Xu · Jie Kong · Jiazhi Wang · Xiao Pan · Bo Lin · Qiang Liu
[ ExHall D ]
Abstract
In this paper, we focus on the task of instruction-based image editing. Previous works like InstructPix2Pix, InstructDiffusion, and SmartEdit have explored end-to-end editing. However, two limitations still remain: First, existing datasets suffer from low resolution, poor background consistency, and overly simplistic instructions. Second, current approaches mainly condition on text while the rich image information is underexplored, and are therefore inferior in complex instruction following and in maintaining background consistency. Targeting these issues, we first curated the AdvancedEdit dataset using a novel data construction pipeline, formulating a large-scale dataset with high visual quality, complex instructions, and good background consistency. Then, to further inject the rich image information, we introduce a two-stream bridging mechanism utilizing both the textual and visual features reasoned by the powerful Multimodal Large Language Models (MLLM) to guide the image editing process more precisely. Extensive results demonstrate that our approach, InsightEdit, achieves state-of-the-art performance, excelling in complex instruction following and maintaining high background consistency with the original image.
Poster
Mingdeng Cao · Xuaner Zhang · Yinqiang Zheng · Zhihao Xia
[ ExHall D ]
Abstract
This paper introduces a novel dataset construction pipeline that samples pairs of frames from videos and uses multimodal large language models (MLLMs) to generate editing instructions for training instruction-based image manipulation models. Video frames inherently preserve the identity of subjects and scenes, ensuring consistent content preservation during editing. Additionally, video data captures diverse, natural dynamics—such as non-rigid subject motion and complex camera movements—that are difficult to model otherwise, making it an ideal source for scalable dataset construction. Using this approach, we create a new dataset to train InstructMove, a model capable of instruction-based complex manipulations that are difficult to achieve with synthetically generated datasets. Our model demonstrates state-of-the-art performance in tasks such as adjusting subject poses, rearranging elements, and altering camera perspectives.
Poster
Mushui Liu · Dong She · Qihan Huang · Jiacheng Ying · Wanggui He · Jingxuan Pang · Yuanlei Hou · Siming Fu
[ ExHall D ]
Abstract
Subject-driven image personalization has seen notable advancements, especially with the advent of the ReferenceNet paradigm. ReferenceNet excels in integrating image reference features, making it highly applicable in creative and commercial settings. However, current implementations of ReferenceNet primarily operate as latent-level feature extractors, which limits their potential. This constraint hinders the provision of appropriate features to the denoising backbone across different timesteps, leading to suboptimal image consistency. In this paper, we revisit the extraction of reference features and propose TFCustom, a model framework designed to focus on reference image features at different temporal steps and frequency levels. Specifically, we first propose a synchronized ReferenceNet to extract reference image features while simultaneously optimizing noise injection and denoising for the reference image. We also propose a time-aware frequency feature refinement module that leverages high- and low-frequency filters, combined with time embeddings, to adaptively select the degree of reference feature injection. Additionally, to enhance the similarity between reference objects and the generated image, we introduce a novel reward-based loss that encourages greater alignment between the reference and generated images. Experimental results demonstrate state-of-the-art performance in both multi-object and single-object reference generation, with significant improvements in texture and textual detail generation over existing methods.
Poster
Pengfei Zhou · Xiaopeng Peng · Jiajun Song · Chuanhao Li · Zhaopan Xu · Yue Yang · Ziyao Guo · Hao Zhang · Yuqi Lin · Yefei He · Lirui Zhao · Shuo Liu · Tianhua Li · Yuxuan Xie · Xiaojun Chang · Yu Qiao · Wenqi Shao · Kaipeng Zhang
[ ExHall D ]
Abstract
Multimodal Large Language Models (MLLMs) have made significant strides in visual understanding and generation tasks. However, generating interleaved image-text content remains a challenge, which requires integrated multimodal understanding and generation abilities. While the progress in unified models offers new solutions, existing benchmarks are insufficient for evaluating these methods due to data size and diversity limitations. To bridge this gap, we introduce OpenING, a comprehensive benchmark comprising 5,400 high-quality human-annotated instances across 56 real-world tasks. OpenING covers diverse daily scenarios such as travel guides, design, and brainstorming, offering a robust platform for challenging interleaved generation methods. In addition, we present IntJudge, a judge model for evaluating open-ended multimodal generation methods. Trained with a novel data pipeline, our IntJudge achieves an agreement rate of 82.42% with human judgments, outperforming GPT-based evaluators by 11.34%. Extensive experiments on OpenING reveal that current interleaved generation methods still have substantial room for improvement. Key findings on interleaved image-text generation are further presented to guide the development of next-generation models. The benchmark, code and judge models will be released.
Poster
Edurne Bernal-Berdun · Ana Serrano · Belen Masia · Matheus Gadelha · Yannick Hold-Geoffroy · Xin Sun · Diego Gutierrez
[ ExHall D ]
Abstract
Images as an artistic medium often rely on specific camera angles and lens distortions to convey ideas or emotions; however, such precise control is missing in current text-to-image models. We propose an efficient and general solution that allows precise control over the camera when generating both photographic and artistic images. Unlike prior methods that rely on predefined shots, we rely solely on four simple extrinsic and intrinsic camera parameters, removing the need for pre-existing geometry, reference 3D objects, and multi-view data. We also present a novel dataset with more than 57,000 images, along with their text prompts and ground-truth camera parameters. Our evaluation shows precise camera control in text-to-image generation, surpassing traditional prompt engineering approaches.
Poster
Jialuo Li · Wenhao Chai · XINGYU FU · Haiyang Xu · Saining Xie
[ ExHall D ]
Abstract
This paper presents a novel approach to integrating scientific knowledge into generative models, enhancing their realism and consistency in image synthesis. We present SciScore, an end-to-end reward model that refines the assessment of generated images based on scientific knowledge, which is achieved by augmenting both the scientific comprehension and visual capabilities of the pre-trained CLIP model. We also introduce SciBench, an expert-annotated adversarial dataset comprising 30k image pairs with 9k prompts, covering a wide range of distinct scientific knowledge categories. Leveraging SciBench, we propose a two-stage training framework, comprising a supervised fine-tuning phase and a masked online fine-tuning phase, to incorporate scientific knowledge into existing generative models. Through comprehensive experiments, we demonstrate the effectiveness of our framework in establishing new standards for evaluating the scientific realism of generated content. Specifically, SciScore attains performance comparable to human level, demonstrating a 5% improvement similar to evaluations conducted by experienced human experts. Furthermore, by applying our proposed fine-tuning method to FLUX, we achieve a performance enhancement exceeding 50% based on SciScore.
Poster
Wataru Shimoda · Naoto Inoue · Daichi Haraguchi · Hayato Mitani · Seiichi Uchida · Kota Yamaguchi
[ ExHall D ]
Abstract
While recent text-to-image models can generate photorealistic images from text prompts that reflect detailed instructions, they still face significant challenges in accurately rendering words in the image. In this paper, we propose to retouch erroneous text renderings in the post-processing pipeline. Our approach, called Type-R, identifies typographical errors in the generated image, erases the erroneous text, regenerates text boxes for missing words, and finally corrects typos in the rendered words. Through extensive experiments, we show that Type-R, in combination with the latest text-to-image models such as Stable Diffusion or Flux, achieves the highest text rendering accuracy while maintaining image quality and also outperforms text-focused generation baselines in terms of balancing text accuracy and image quality.
Poster
Qihao Liu · Xi Yin · Alan L. Yuille · Andrew Brown · Mannat Singh
[ ExHall D ]
Abstract
Diffusion models, and their generalization, flow matching, have had a remarkable impact on the field of media generation. Here, the conventional approach is to learn the complex mapping from a simple source distribution of Gaussian noise to the target media distribution. For cross-modal tasks such as text-to-image generation, this same mapping from noise to image is learnt whilst including a conditioning mechanism in the model. One key and thus far relatively unexplored feature of flow matching is that, unlike Diffusion models, they are not constrained for the source distribution to be noise. Hence, in this paper, we propose a paradigm shift, and ask the question of whether we can instead train flow matching models to learn a direct mapping from the distribution of one modality to the distribution of another, thus obviating the need for both the noise distribution and conditioning mechanism. We present a general and simple framework, CrossFlow, for cross-modal flow matching. We show the importance of applying Variational Encoders to the input data, and introduce a method to enable Classifier-free guidance. Surprisingly, for text-to-image, CrossFlow with a vanilla transformer without cross attention slightly outperforms standard flow matching, and we show that it scales better with training steps …
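A hedged sketch of cross-modal flow matching with a straight-line interpolant: the source sample is a text latent from a variational encoder and the target is the image latent, so no Gaussian noise or cross-attention conditioning is involved. Latent shapes, the interpolant, and `velocity_net` are assumptions for illustration, not CrossFlow's exact training objective.

```python
import torch
import torch.nn.functional as F

def cross_modal_flow_matching_loss(velocity_net, z_text, z_image):
    """Flow matching where the source is a text latent and the target is the
    image latent, with a straight-line interpolant.

    z_text, z_image: (B, D) latents; velocity_net(z, t) -> (B, D).
    """
    t = torch.rand(z_text.shape[0], 1)              # one time per sample
    z_t = (1.0 - t) * z_text + t * z_image          # linear interpolant
    target_v = z_image - z_text                     # its constant velocity
    return F.mse_loss(velocity_net(z_t, t.squeeze(1)), target_v)
```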
Poster
Chao Feng · Ziyang Chen · Aleksander Holynski · Alexei A. Efros · Andrew Owens
[ ExHall D ]
Abstract
We show that the GPS tags within image metadata provide a useful control signal for image generation. We train GPS-to-image models and use them for tasks that require a fine-grained understanding of how images change their appearance throughout a city. In particular, we train a diffusion model to generate images conditioned on both GPS tag and text, and find that the resulting model learns to successfully vary the contents of its generated images using both control signals, such as by creating images that capture the distinctive appearance of different neighborhoods, parks, and landmarks. We also show that GPS conditioning enables us to "lift" 3D models from 2D GPS-to-image models using score distillation sampling, without the need for explicit camera pose estimation. Our evaluations suggest that our GPS-conditioned models successfully learn to generate images that vary based on location, and that GPS conditioning improves estimated 3D structure.
Poster
Zijie Li · Henry Li · Yichun Shi · Amir Barati Farimani · Yuval Kluger · Linjie Yang · Peng Wang
[ ExHall D ]
Abstract
Diffusion models have gained tremendous success in text-to-image generation, yet still lag behind on visual understanding tasks, an area dominated by autoregressive vision-language models. We propose a large-scale and fully end-to-end diffusion model for multimodal understanding and generation that significantly improves on existing diffusion-based multimodal models, and is the first of its kind to support the full suite of vision-language modeling capabilities. Inspired by the multimodal diffusion transformer (MM-DiT) and recent advances in discrete diffusion language modeling, we leverage a cross-modal maximum likelihood estimation framework that simultaneously trains the conditional likelihoods of both images and text jointly under a single loss function, which is back-propagated through both branches of the diffusion transformer. The resulting model is highly flexible and capable of a wide range of tasks including image generation, captioning, and visual question answering. Our model attains competitive performance compared to recent unified image understanding and generation models, demonstrating the potential of multimodal diffusion modeling as a promising alternative to autoregressive next-token prediction models.
Poster
Rishubh Parihar · Vaibhav Agrawal · Sachidanand VS · R. Venkatesh Babu
[ ExHall D ]
Abstract
Existing approaches for controlling text-to-image diffusion models, while powerful, do not allow for explicit 3D object-centric control, such as precise control of object orientation. In this work, we address the problem of multi-object orientation control in text-to-image diffusion models. This enables the generation of diverse multi-object scenes with precise orientation control for each object. The key idea is to condition the diffusion model with a set of orientation-aware compass tokens, one for each object, along with text tokens. A lightweight encoder network predicts these compass tokens taking object orientation as the input. The model is trained on a synthetic dataset of procedurally generated scenes, each containing one or two 3D assets on a plain background. However, directly training this framework results in poor orientation control and leads to entanglement among objects. To mitigate this, we intervene in the generation process and constrain the cross-attention maps of each compass token to its corresponding object regions. The trained model is able to achieve precise orientation control for a) complex objects not seen during training, and b) multi-object scenes with more than two objects, indicating strong generalization capabilities. Further, when combined with personalization methods, our method precisely controls the orientation of the …
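The attention-constraining step can be pictured as a loss that penalizes the cross-attention mass of each compass token falling outside its object's region. The sketch below is a minimal, assumed formulation; shapes and the exact penalty are placeholders rather than the authors' implementation.

```python
# Hypothetical attention-constraint loss for per-object "compass" tokens.
import torch

def compass_attention_loss(attn_maps, object_masks):
    """
    attn_maps:    (B, num_objects, H, W) cross-attention of each compass token
    object_masks: (B, num_objects, H, W) binary region of the matching object
    Penalize attention mass that falls outside the corresponding object region.
    """
    attn = attn_maps / (attn_maps.sum(dim=(-2, -1), keepdim=True) + 1e-6)
    outside = attn * (1.0 - object_masks)
    return outside.sum(dim=(-2, -1)).mean()

B, N, H, W = 2, 2, 16, 16
attn = torch.rand(B, N, H, W)
masks = (torch.rand(B, N, H, W) > 0.5).float()
print(compass_attention_loss(attn, masks))
```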
Poster
Jiaxiu Jiang · Yabo Zhang · Kailai Feng · Xiaohe Wu · Wenbo Li · Renjing Pei · Fan Li · Wangmeng Zuo
[ ExHall D ]
Abstract
Customized text-to-image generation, which synthesizes images based on user-specified concepts, has made significant progress in handling individual concepts. However, when extended to multiple concepts, existing methods often struggle with properly integrating different models and avoiding the unintended blending of characteristics from distinct concepts. In this paper, we propose MC2, a novel approach for multi-concept customization that enhances flexibility and fidelity through inference-time optimization. MC2 enables the integration of multiple single-concept models with heterogeneous architectures. By adaptively refining attention weights between visual and textual tokens, our method ensures that image regions accurately correspond to their associated concepts while minimizing interference between concepts. Extensive experiments demonstrate that MC2 outperforms training-based methods in terms of prompt-reference alignment. Furthermore, MC2 can be seamlessly applied to text-to-image generation, providing robust compositional capabilities. To facilitate the evaluation of multi-concept customization, we also introduce a new benchmark, MC++. The code will be publicly available.
Poster
Bin Wu · Wuxuan Shi · Jinqiao Wang · Mang Ye
[ ExHall D ]
Abstract
Pre-trained Vision-Language Models (VLMs) require Continual Learning (CL) to efficiently update their knowledge and adapt to various downstream tasks without retraining from scratch. However, for VLMs, in addition to the loss of knowledge previously learned from downstream tasks, pre-training knowledge is also corrupted during continual fine-tuning. This issue is exacerbated by the unavailability of original pre-training data, causing the VLM's generalization ability to degrade. In this paper, we propose GIFT, a novel continual fine-tuning approach that utilizes synthetic data to overcome catastrophic forgetting in VLMs. Taking advantage of recent advances in text-to-image synthesis, we employ a pre-trained diffusion model to recreate both pre-training and learned downstream task data. In this way, the VLM can revisit previous knowledge through distillation on matching diffusion-generated images and corresponding text prompts. Leveraging the broad distribution and high alignment between synthetic image-text pairs in the VLM's feature space, we propose a contrastive distillation loss along with an image-text alignment constraint. To further combat in-distribution overfitting and enhance distillation performance with a limited amount of generated data, we incorporate adaptive weight consolidation, utilizing Fisher information from these synthetic image-text pairs and achieving a better stability-plasticity balance. Extensive experiments demonstrate that our method consistently outperforms previous state-of-the-art approaches across various settings.
Poster
Florinel Croitoru · Vlad Hondru · Radu Tudor Ionescu · Nicu Sebe · Mubarak Shah
[ ExHall D ]
Abstract
Direct Preference Optimization (DPO) has been proposed as an effective and efficient alternative to reinforcement learning from human feedback (RLHF). In this paper, we propose a novel and enhanced version of DPO based on curriculum learning for text-to-image generation. Our method is divided into two training stages. First, a ranking of the examples generated for each prompt is obtained by employing a reward model. Then, increasingly difficult pairs of examples are sampled and provided to a text-to-image generative (diffusion or consistency) model. Generated samples that are far apart in the ranking are considered to form easy pairs, while those that are close in the ranking form hard pairs. In other words, we use the rank difference between samples as a measure of difficulty. The sampled pairs are split into batches according to their difficulty levels, which are gradually used to train the generative model. Our approach, Curriculum DPO, is compared against state-of-the-art fine-tuning approaches on six benchmarks, outperforming the competing methods in terms of text alignment, aesthetics and human preference. Our code is available at https://anonymous.4open.science/r/Curriculum-DPO-757F.
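A rough sketch of the curriculum pairing logic described above: generations for a prompt are ranked by a reward model, the rank gap between two generations serves as an (inverse) difficulty measure, and pairs are bucketed from easy (large gap) to hard (small gap) for staged DPO training. The bucketing rule and level count are illustrative assumptions.

```python
# Hypothetical curriculum pairing: rank gap -> difficulty level (easy levels first).
def build_curriculum_pairs(samples, rewards, num_levels=3):
    """samples: generations for one prompt; rewards: matching reward-model scores."""
    ranked = sorted(range(len(samples)), key=lambda i: rewards[i], reverse=True)
    pairs = []
    for a in range(len(ranked)):
        for b in range(a + 1, len(ranked)):
            win, lose = samples[ranked[a]], samples[ranked[b]]
            pairs.append((b - a, win, lose))            # rank gap = inverse difficulty
    max_gap = max(gap for gap, _, _ in pairs)
    levels = [[] for _ in range(num_levels)]            # level 0 = easiest pairs
    for gap, win, lose in pairs:
        level = min(num_levels - 1, int((1 - gap / max_gap) * num_levels))
        levels[level].append((win, lose))
    return levels  # train DPO on levels[0] first, then levels[1], ...

rewards = [0.9, 0.2, 0.7, 0.4]
levels = build_curriculum_pairs(["img_a", "img_b", "img_c", "img_d"], rewards)
for i, lv in enumerate(levels):
    print(f"level {i}: {lv}")
```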
Poster
Rui Zhao · Weijia Mao · Mike Zheng Shou
[ ExHall D ]
Abstract
Adapting generative models to specific domains presents an effective solution for satisfying specialized requirements. However, adapting to some complex domains remains challenging, especially when these domains require substantial paired data to capture the targeted distributions. Since unpaired data from a single modality, such as vision or language, is more readily available, we utilize the bidirectional mappings between vision and language learned by the unified generative model to enable training on unpaired data for domain adaptation. Specifically, we propose DoraCycle, which integrates two multimodal cycles: text-to-image-to-text and image-to-text-to-image. The model is optimized through cross-entropy loss computed at the cycle endpoints, where both endpoints share the same modality. This facilitates self-evolution of the model without reliance on annotated text-image pairs. Experimental results demonstrate that for tasks independent of paired knowledge, such as stylization, DoraCycle can effectively adapt the unified model using only unpaired data. For tasks involving new paired knowledge, such as specific identities, a combination of a small set of paired image-text examples and larger-scale unpaired data is sufficient for effective domain-oriented adaptation. The code and model weights will be released.
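The two cycles can be sketched with a toy unified model over discrete tokens: the input is mapped to the other modality and back, and cross-entropy is computed against the original tokens at the endpoint, whose modality matches the input. The model, token spaces, and the argmax "generation" step are stand-ins, not DoraCycle's actual architecture.

```python
# Toy illustration of text->image->text and image->text->image cycle losses.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyUnifiedModel(nn.Module):
    """Stand-in for a unified vision-language generator over discrete tokens."""
    def __init__(self, vocab=256, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.to_image = nn.Linear(dim, vocab)   # text tokens -> image-token logits
        self.to_text = nn.Linear(dim, vocab)    # image tokens -> text-token logits

    def forward(self, tokens, direction):
        h = self.embed(tokens)
        head = self.to_image if direction == "t2i" else self.to_text
        return head(h)                          # (B, L, vocab)

def cycle_loss(model, tokens, first, second):
    logits_mid = model(tokens, first)
    mid_tokens = logits_mid.argmax(dim=-1)      # "generation" step (non-differentiable here)
    logits_end = model(mid_tokens, second)
    # the endpoint shares the modality of the input, so the ground truth is the input itself
    return F.cross_entropy(logits_end.flatten(0, 1), tokens.flatten())

model = ToyUnifiedModel()
text = torch.randint(0, 256, (2, 8))
image = torch.randint(0, 256, (2, 8))
loss = cycle_loss(model, text, "t2i", "i2t") + cycle_loss(model, image, "i2t", "t2i")
loss.backward()
```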
Poster
Cong Xie · Han Zou · Ruiqi Yu · Yan Zhang · Zhan Zhenpeng
[ ExHall D ]
Abstract
In this work, we are interested in achieving both high text controllability and overall appearance consistency in the generation of personalized human characters. We propose a novel framework, named SerialGen, which is a serial generation method consisting of two stages: first, a standardization stage that standardizes reference images, and then a personalized generation stage based on the standardized reference. Furthermore, we introduce two modules aimed at enhancing the standardization process. Our experimental results validate the proposed framework's ability to produce personalized images that faithfully recover the reference image's overall appearance while accurately responding to a wide range of text prompts. Through thorough analysis, we highlight the critical contribution of the proposed serial generation method and standardization model, evidencing enhancements in appearance consistency between reference and output images and across serial outputs generated from diverse text prompts. The term "Serial" in this work carries a double meaning: it refers to the two-stage method and also underlines our ability to generate serial images with consistent appearance throughout.
Poster
Yuanbo Yang · Jiahao Shao · Xinyang Li · Yujun Shen · Andreas Geiger · Yiyi Liao
[ ExHall D ]
Abstract
In this work, we introduce Prometheus, a 3D-aware latent diffusion model for text-to-3D generation at both object and scene levels in seconds. We formulate 3D scene generation as multi-view, feed-forward, pixel-aligned 3D Gaussian generation within the latent diffusion paradigm. To ensure generalizability, we build our model upon a pre-trained text-to-image generation model with only minimal adjustments, and further train it using a large number of images from both single-view and multi-view datasets. Furthermore, we introduce an RGB-D latent space into 3D Gaussian generation to disentangle appearance and geometry information, enabling efficient feed-forward generation of 3D Gaussians with better fidelity and geometry. Extensive experimental results demonstrate the effectiveness of our method in both feed-forward 3D Gaussian reconstruction and text-to-3D generation.
Poster
Silin Gao · Sheryl Mathew · Li Mi · Sepideh Mamooler · Mengjie Zhao · Hiromi Wakaki · Yuki Mitsufuji · Syrielle Montariol · Antoine Bosselut
[ ExHall D ]
Abstract
Visual narrative generation transforms textual narratives into sequences of images illustrating the content of the text. However, generating visual narratives that are faithful to the input text and self-consistent across generated images remains an open challenge, due to the lack of knowledge constraints used for planning the stories. In this work, we propose a new benchmark, VinaBench, to address this challenge. Our benchmark annotates the underlying commonsense and discourse constraints in visual narrative samples, offering systematic scaffolds for learning the implicit strategies of visual storytelling. Based on the incorporated narrative constraints, we further propose novel metrics to closely evaluate the consistency of generated narrative images and the alignment of generations with the input textual narrative. Our results across three generative vision models demonstrate that learning with our VinaBench's knowledge constraints effectively improves the faithfulness and cohesion of generated visual narratives.
Poster
Bonan Li · Zicheng Zhang · Xingyi Yang · Xinchao Wang
[ ExHall D ]
Abstract
Generating dense multiview images from text prompts is crucial for creating high-fidelity 3D assets. Nevertheless, existing methods struggle with space-view correspondences, resulting in sparse and low-quality outputs. In this paper, we introduce CoSER, a novel consistent dense Multiview Text-to-Image Generator for Text-to-3D, achieving both efficiency and quality by meticulously learning neighbor-view coherence and further alleviating ambiguity through the swift traversal of all views. To achieve neighbor-view consistency, each viewpoint densely interacts with adjacent viewpoints to perceive the global spatial structure, and aggregates information along motion paths explicitly defined by physical principles to refine details. To further enhance cross-view consistency and alleviate content drift, CoSER rapidly scans all views in a spiral bidirectional manner to capture holistic information and then scores each point based on semantic material. Subsequently, we conduct weighted down-sampling along the spatial dimension based on scores, thereby facilitating prominent information fusion across all views with lightweight computation. Technically, the core module is built by integrating the attention mechanism with a selective state space model, exploiting the robust learning capabilities of the former and the low overhead of the latter. Extensive evaluation shows that CoSER is capable of producing dense, high-fidelity, content-consistent multiview images that can be flexibly integrated into …
Poster
Zeqi Gu · Yin Cui · Max Li · Fangyin Wei · Yunhao Ge · Jinwei Gu · Ming-Yu Liu · Abe Davis · Yifan Ding
[ ExHall D ]
Abstract
Designing 3D scenes is traditionally a challenging and laborious task that demands both artistic expertise and proficiency with complex software. Recent advances in text-to-3D generation have greatly simplified this process by letting users create scenes based on simple text descriptions. However, as these methods generally require extra training or in-context learning, their performance is often hindered by the limited availability of high-quality 3D data. In contrast, modern text-to-image models learned from web-scale images can generate scenes with diverse, reliable spatial layouts and consistent, visually appealing styles. Our key insight is that instead of learning directly from 3D scenes, we can leverage generated 2D images as an intermediary to guide 3D synthesis. In light of this, we introduce ArtiScene, a training-free automated pipeline for scene design that integrates the flexibility of free-form text-to-image generation with the diversity and reliability of 2D intermediary layouts. We generate the 2D intermediary image from a scene description, extract object shapes and appearances, create 3D models, and assemble them into the final scene with geometry, position, and pose extracted from the same image. Being generalizable to a wide range of scenes and styles, ArtiScene is shown to outperform state-of-the-art benchmarks by a large margin in layout …
Poster
Jiaxin Ge · Zora Zhiruo Wang · Xuhui Zhou · Yi-Hao Peng · Sanjay Subramanian · Qinyue Tan · Maarten Sap · Alane Suhr · Daniel Fried · Graham Neubig · Trevor Darrell
[ ExHall D ]
Abstract
Designing structured visuals such as presentation slides is essential for communicative needs, necessitating both content creation and visual planning skills to deliver insights. In this work, we study automated slide generation, where models produce slide presentations from natural language (NL) instructions of varying specificity. We first introduce our SLIDESBENCH benchmark, with 585 examples derived from slide decks across 10 domains. SLIDESBENCH supports evaluations that are both (i) reference-based to measure similarity to a target slide, and (ii) reference-free to measure the design quality of generated slides alone. We benchmark end-to-end image generation and program generation methods with varied models, and find that programmatic methods produce higher-quality slides in user-interactable formats. Built on the success of program generation, we create PRESENTER, an 8B LLAMA slide generation model trained on 7k (NL, slide code) pairs, and achieve results comparable to the strongest model GPT-4O. Beyond one-pass slide generation, we explore iterative design refinement with model self-critiques and effectively improve element layout in slides. We hope that our work could provide a basis for future work on generating structured visuals.
Poster
Xi Wang · Hongzhen Li · Heng Fang · YICHEN PENG · Haoran Xie · Xi Yang · Chuntao Li
[ ExHall D ]
Abstract
Image rendering from line drawings is vital in design, and image generation technologies reduce its costs, yet professional line drawings demand the preservation of complex details. Text prompts struggle with accuracy, and image translation struggles with consistency and fine-grained control. We present LineArt, a framework that transfers complex appearance onto detailed design drawings, facilitating design and artistic creation. It generates high-fidelity materials while preserving structural accuracy by simulating hierarchical visual cognition and integrating human artistic experience to guide the diffusion process. LineArt overcomes the limitations of current methods in terms of difficulty in fine-grained control and style degradation in design drawings. It requires no precise 3D modeling, physical property specs, or network training, making it more convenient for design tasks. LineArt consists of two stages: a multi-frequency line fusion module to supplement the input design drawing with detailed structural information and a two-part painting process for Base Layer Shaping and Surface Layer Coloring. We also present a new design drawing dataset, ProLines, for evaluation. The experiments show that LineArt performs better in accuracy, realism, and material precision compared to SOTAs.
Poster
Siyuan Bian · Chenghao Xu · Yuliang Xiu · Artur Grigorev · Zhen Liu · Cewu Lu · Michael J. Black · Yao Feng
[ ExHall D ]
Abstract
We introduce ChatGarment, a novel approach that leverages large vision-language models (VLMs) to automate the estimation, generation, and editing of 3D garment sewing patterns from images or text descriptions. Unlike previous methods that often lack robustness and interactive editing capabilities, ChatGarment finetunes a VLM to produce GarmentCode, a JSON-based, language-friendly format for 2D sewing patterns, enabling both estimating and editing from images and text instructions. To optimize performance, we refine GarmentCode by expanding its support for more diverse garment types and simplifying its structure, making it more efficient for VLM finetuning. Additionally, we develop an automated data construction pipeline to generate a large-scale dataset of image-to-sewing-pattern and text-to-sewing-pattern pairs, empowering ChatGarment with strong generalization across various garment types. Extensive evaluations demonstrate ChatGarment’s ability to accurately reconstruct, generate, and edit garments from multimodal inputs, highlighting its potential to revolutionize workflows in fashion and gaming applications.
Poster
Haobin Zhong · Shuai He · Anlong Ming · Huadong Ma
[ ExHall D ]
Abstract
The Personalized Aesthetics Assessment (PAA) aims to accurately predict an individual's unique perception of aesthetics. With the surging demand for customization, PAA enables applications to generate personalized outcomes by aligning with individual aesthetic preferences. The prevailing PAA paradigm involves two stages: pre-training and fine-tuning, but it faces three inherent challenges: 1) The model is pre-trained using datasets of the Generic Aesthetics Assessment (GAA), but the collective preferences of GAA lead to conflicts in individualized aesthetic predictions. 2) The scope and stage of personalized surveys are related to both the user and the assessed object; however, the prevailing personalized surveys fail to adequately address assessed objects' characteristics. 3) During application usage, the cumulative multimodal feedback from an individual holds great value that should be considered for improving the PAA model but unfortunately attracts insufficient attention. To address the aforementioned challenges, we introduce a new PAA paradigm called PAA+, which is structured into three distinct stages: pre-training, fine-tuning, and domain-incremental learning. Furthermore, to better reflect individual differences, we employ a familiar and intuitive application, physique aesthetics assessment (PhysiqueAA), to validate the PAA+ paradigm. We propose a dataset called PhysiqueAA50K, consisting of over 50,000 fully annotated physique images. Furthermore, we develop a PhysiqueAA …
Poster
Zirun Guo · Tao Jin
[ ExHall D ]
Abstract
Diffusion customization methods have achieved impressive results with only a minimal number of user-provided images. However, existing approaches customize concepts collectively, whereas real-world applications often require sequential concept integration. This sequential nature can lead to catastrophic forgetting, where previously learned concepts are lost. In this paper, we investigate concept forgetting and concept confusion in continual customization. To tackle these challenges, we present a comprehensive approach that combines shift embedding, concept-binding prompts, and memory preservation regularization, supplemented by a priority queue which can adaptively update the importance and occurrence order of different concepts. Through comprehensive experiments, we demonstrate that our approach outperforms all the baseline methods consistently and significantly in both quantitative and qualitative analyses.
Poster
Qianlong Xiang · Miao Zhang · Yuzhang Shang · Jianlong Wu · Yan Yan · Liqiang Nie
[ ExHall D ]
Abstract
Diffusion models (DMs) have demonstrated exceptional generative capabilities across various domains, including image, video, and so on. A key factor contributing to their effectiveness is the high quantity and quality of data used during training. However, mainstream DMs now consume increasingly large amounts of data. For example, training a Stable Diffusion model requires billions of image-text pairs. This enormous data requirement poses significant challenges for training large DMs due to high data acquisition costs and storage expenses. To alleviate this data burden, we propose a novel scenario: using existing DMs as data sources to train new DMs with any architecture. We refer to this scenario as Data-Free Knowledge Distillation for Diffusion Models (DKDM), where the generative ability of DMs is transferred to new ones in a data-free manner. To tackle this challenge, we make two main contributions. First, we introduce a DKDM objective that enables the training of new DMs via distillation, without requiring access to the data. Second, we develop a dynamic iterative distillation method that efficiently extracts time-domain knowledge from existing DMs, enabling direct retrieval of training data without the need for a prolonged generative process. To the best of our knowledge, we are the first to explore …
Poster
Matan Rusanovsky · Shimon Malnick · Amir Jevnisek · Ohad Fried · Shai Avidan
[ ExHall D ]
Abstract
Diffusion models dominate the space of text-to-image generation, yet they may produce undesirable outputs, including explicit content or private data. To mitigate this, concept ablation techniques have been explored to limit the generation of certain concepts. In this paper, we reveal that the erased concept information persists in the model and that erased concept images can be generated using the right latent. Utilizing inversion methods, we show that there exist latent seeds capable of generating high-quality images of erased concepts. Moreover, we show that these latents have likelihoods that overlap with those of images outside the erased concept. We extend this to demonstrate that for every image from the erased concept set, we can generate many seeds that generate the erased concept. Given the vast space of latents capable of generating ablated concept images, our results suggest that fully erasing concept information may be intractable, highlighting possible vulnerabilities in current concept ablation techniques.
Poster
Basim Azam · Naveed Akhtar
[ ExHall D ]
Abstract
Ethical issues around text-to-image (T2I) models demand comprehensive control over the generative content. Existing techniques addressing these issues for responsible T2I models aim for the generated content to be fair and safe (non-violent/explicit). However, these methods remain limited to handling individual facets of responsibility concepts, while also lacking interpretability. Moreover, they often require alteration to the original model, which compromises the model performance. In this work, we propose a unique technique to enable responsible T2I generation by simultaneously accounting for an extensive range of concepts for fair and safe content generation in a scalable manner. The key idea is to distill the target T2I pipeline with an external plug-and-play mechanism that learns an interpretable composite responsible space for the desired concepts, conditioned on the target T2I pipeline. We use knowledge distillation and concept whitening to enable this. At inference, the learned space is utilized to modulate the generative content. A typical T2I pipeline presents two plug-in points for our approach, namely the text embedding space and the diffusion model latent space. We develop modules for both points and show the effectiveness of our approach with a range of strong results. Our code and models will be made …
Poster
Yimeng Zhang · Tiancheng Zhi · Jing Liu · Shen Sang · Liming Jiang · Qing Yan · Sijia Liu · Linjie Luo
[ ExHall D ]
Abstract
The ability to synthesize personalized group photos and specify the positions of each identity offers immense creative potential. While such imagery can be visually appealing, it presents significant challenges for existing technologies. A persistent issue is identity (ID) leakage, where injected facial features interfere with one another, resulting in low face resemblance, incorrect positioning, and visual artifacts. Existing methods suffer from limitations such as the reliance on segmentation models, increased runtime, or a high probability of ID leakage. To address these challenges, we propose ID-Patch, a novel method that provides robust association between identities and 2D positions. Our approach generates an ID patch and ID embeddings from the same facial features: the ID patch is positioned on the conditional image for precise spatial control, while the ID embeddings integrate with text embeddings to ensure high resemblance. Experimental results demonstrate that ID-Patch surpasses baseline methods across metrics, such as face ID resemblance, ID-position association accuracy, and generation efficiency. Code and models will be made publicly available upon publication of the paper.
Poster
Hao Cheng · Erjia Xiao · Jiayan Yang · Jiahang Cao · Qiang Zhang · Jize Zhang · Kaidi Xu · Jindong Gu · Renjing Xu
[ ExHall D ]
Abstract
Current image generation models can effortlessly produce high-quality, highly realistic images, but this also increases the risk of misuse. In various Text-to-Image or Image-to-Image tasks, attackers can generate a series of images containing inappropriate content by simply editing the language modality input. Currently, to prevent this security threat, the various guard or defense methods that have been proposed also focus on defending the language modality. However, in practical applications, threats in the visual modality, particularly in tasks involving the editing of real-world images, pose greater security risks as they can easily infringe upon the rights of the image owner. Therefore, this paper uses a method named typographic attack to reveal that various image generation models also commonly face threats in the vision modality. Furthermore, we also evaluate the defense performance of various existing methods when facing threats in the vision modality and uncover their ineffectiveness. Finally, we propose the Vision Modal Threats in Image Generation Models (VMT-IGMs) dataset, which would serve as a baseline for evaluating the vision modality vulnerability of various image generation models. Warning: This paper includes explicit sexual content, discussions of pornography, racial terminology, and other content that may cause discomfort or distress. Potentially disturbing content has been blocked …
Poster
Xuanyu Zhang · Zecheng Tang · Zhipei Xu · Runyi Li · Youmin Xu · Bin Chen · Feng Gao · Jian Zhang
[ ExHall D ]
Abstract
With the rapid growth of generative AI and its widespread application in image editing, new risks have emerged regarding the authenticity and integrity of digital content. Existing versatile watermarking approaches suffer from trade-offs between tamper localization precision and visual quality. Constrained by the limited flexibility of previous frameworks, their localized watermark must remain fixed across all images. Under AIGC-editing, their copyright extraction accuracy is also unsatisfactory. To address these challenges, we propose OmniGuard, a novel augmented versatile watermarking approach that integrates proactive embedding with passive, blind extraction for robust copyright protection and tamper localization. OmniGuard employs a hybrid forensic framework that enables flexible localization watermark selection and introduces a degradation-aware tamper extraction network for precise localization under challenging conditions. Additionally, a lightweight AIGC-editing simulation layer is designed to enhance robustness across global and local editing. Extensive experiments show that OmniGuard achieves superior fidelity, robustness, and flexibility. Compared to the recent state-of-the-art approach EditGuard, our method outperforms it by 4.25 dB in PSNR of the container image, 20.7% in F1-Score under noisy conditions, and 14.8% in average bit accuracy.
Poster
Yiren Song · Pei Yang · Hai Ci · Mike Zheng Shou
[ ExHall D ]
Abstract
Recently, zero-shot methods like InstantID have revolutionized identity-preserving generation. Unlike multi-image finetuning approaches such as DreamBooth, these zero-shot methods leverage powerful facial encoders to extract identity information from a single portrait photo, enabling efficient identity-preserving generation through a single inference pass. However, this convenience introduces new threats to the facial identity protection. This paper aims to safeguard portrait photos from unauthorized encoder-based customization. We introduce IDProtector, an adversarial noise encoder that applies imperceptible adversarial noise to portrait photos in a single forward pass. Our approach offers universal protection for portraits against multiple state-of-the-art encoder-based methods, including InstantID, IP-Adapter, and PhotoMaker, while ensuring robustness to common image transformations such as JPEG compression, resizing, and affine transformations. Experiments across diverse portrait datasets and generative models reveal that IDProtector generalizes effectively to unseen data and even closed-source proprietary models.
Poster
Mischa Dombrowski · Weitong Zhang · Hadrien Reynaud · Sarah Cechnicka · Bernhard Kainz
[ ExHall D ]
Abstract
Generative methods have reached a level of quality that is almost indistinguishable from real data. However, while individual samples may appear unique, generative models often exhibit limitations in covering the full data distribution. Unlike quality issues, diversity problems within generative models are not easily detected by simply observing single images or generated datasets, which means we need a specific measure to assess the diversity of these models. In this paper, we draw attention to the current lack of diversity in generative models and the inability of common metrics to measure this. We achieve this by framing diversity as an image retrieval problem, where we measure how many real images can be retrieved using synthetic data as queries. This yields the Image Retrieval Score (IRS), an interpretable, hyperparameter-free metric that quantifies the diversity of a generative model's output. IRS requires only a subset of synthetic samples and provides a statistical measure of confidence. Our experiments indicate that current feature extractors commonly used in generative model assessment are inadequate for evaluating diversity effectively. Consequently, we perform an extensive search for the best feature extractors to assess diversity. Evaluation reveals that current diffusion models converge to limited subsets of the real distribution, with no current state-of-the-art …
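The retrieval framing of diversity is easy to prototype: use synthetic features as queries against a gallery of real features and count how many distinct real images are ever retrieved as nearest neighbors. The sketch below assumes L2-normalized features and a simple coverage ratio; the paper's exact IRS normalization and confidence estimate may differ.

```python
# Toy diversity-as-retrieval score (an assumed simplification of IRS).
import numpy as np

def image_retrieval_score(real_feats, synth_feats):
    """real_feats: (N, D), synth_feats: (M, D) L2-normalized feature vectors."""
    sims = synth_feats @ real_feats.T        # (M, N) cosine similarities
    nearest = sims.argmax(axis=1)            # nearest real image per synthetic query
    retrieved = np.unique(nearest).size      # number of distinct reals ever retrieved
    return retrieved / real_feats.shape[0]   # fraction of the real set that is covered

rng = np.random.default_rng(0)
real = rng.normal(size=(1000, 128))
real /= np.linalg.norm(real, axis=1, keepdims=True)
# a low-diversity "generator" that only covers part of the real distribution
synth = real[:100] + 0.01 * rng.normal(size=(100, 128))
synth /= np.linalg.norm(synth, axis=1, keepdims=True)
print(image_retrieval_score(real, synth))   # well below 1.0 for a low-diversity model
```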
Poster
Tai Nguyen · Aref Azizpour · Matthew Stamm
[ ExHall D ]
Abstract
The emergence of advanced AI-based tools to generate realistic images poses significant challenges for forensic detection and source attribution, especially as new generative techniques appear rapidly. Traditional methods often fail to generalize to unseen generators due to reliance on features specific to known sources during training. To address this problem, we propose a novel approach that explicitly models forensic microstructures—subtle, pixel-level patterns unique to the image creation process. Using only real images in a self-supervised manner, we learn a set of diverse predictive filters to extract residuals that capture different aspects of these microstructures. By jointly modeling these residuals across multiple scales, we obtain a compact model whose parameters constitute a unique forensic self-description for each image. This self-description enables us to perform zero-shot detection of synthetic images, open-set source attribution of images, and clustering based on source without prior knowledge. Extensive experiments demonstrate that our method achieves superior accuracy and adaptability compared to competing techniques, advancing the state of the art in synthetic media forensics.
Poster
Jinwoo Kim · Sangmin Han · Jinho Jeong · Jiwoo Choi · Dongyoung Kim · Seon Joo Kim
[ ExHall D ]
Abstract
Object compositing, the task of placing and harmonizing objects in images of diverse visual scenes, has become an important task in computer vision with the rise of generative models. However, existing datasets lack the diversity and scale required to comprehensively explore real-world scenarios. We introduce ORIDa (Object-centric Real-world Image Composition Dataset), a large-scale, real-captured dataset containing over 30,000 images featuring 200 unique objects, each of which is presented across varied positions and scenes. ORIDa has two types of data: factual-counterfactual sets and factual-only scenes. The factual-counterfactual sets consist of four factual images showing an object in different positions within a scene and a single counterfactual (or background) image of the scene without the object, resulting in five images per scene. The factual-only scenes include a single image containing an object in a specific context, expanding the variety of environments. To our knowledge, ORIDa is the first publicly available dataset with its scale and complexity for real-world image composition. Extensive analysis and experiments highlight the value of ORIDa as a resource for advancing further research in object compositing.
Poster
Dhananjaya Jayasundara · Sudarshan Rajagopalan · Yasiru Ranasinghe · Trac Tran · Vishal M. Patel
[ ExHall D ]
Abstract
Implicit Neural Representations (INRs) are increasingly recognized as a versatile data modality for representing discretized signals, offering benefits such as infinite query resolution and reduced storage requirements. Existing signal compression approaches for INRs typically employ one of two strategies: 1. direct quantization with entropy coding of the trained INR; 2. deriving a latent code on top of the INR through a learnable transformation. Thus, their performance is heavily dependent on the quantization and entropy coding schemes employed. In this paper, we introduce SINR, an innovative compression algorithm that leverages the patterns in the vector spaces formed by weights of INRs. We compress these vector spaces using a high-dimensional sparse code within a dictionary. Further analysis reveals that the atoms of the dictionary used to generate the sparse code do not need to be learned or transmitted to successfully recover the INR weights. We demonstrate that the proposed approach can be integrated with any existing INR-based signal compression technique. Our results indicate that SINR achieves substantial reductions in storage requirements for INRs across various configurations, outperforming conventional INR-based compression baselines. Furthermore, SINR maintains high-quality decoding across diverse data modalities, including images, occupancy fields, and Neural Radiance Fields.
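A minimal sketch of the underlying idea: chunks of flattened INR weights are approximated as sparse codes over a fixed dictionary (here random Gaussian atoms), so only a few coefficients per chunk need to be stored. The greedy matching-pursuit encoder, chunk size, and dictionary choice are assumptions for illustration, not SINR's actual algorithm.

```python
# Hypothetical sparse coding of an INR weight chunk over a fixed random dictionary.
import numpy as np

def matching_pursuit(D, y, k):
    """Greedy sparse code: approximate y with k atoms of dictionary D of shape (d, n_atoms)."""
    residual, code = y.copy(), np.zeros(D.shape[1])
    for _ in range(k):
        idx = np.argmax(np.abs(D.T @ residual))   # best-correlated atom
        coef = D[:, idx] @ residual
        code[idx] += coef
        residual -= coef * D[:, idx]
    return code

rng = np.random.default_rng(0)
d, n_atoms, k = 64, 512, 8
D = rng.normal(size=(d, n_atoms))
D /= np.linalg.norm(D, axis=0)        # unit-norm atoms, shared by encoder and decoder
w = rng.normal(size=d)                # one chunk of flattened INR weights
code = matching_pursuit(D, w, k)      # only k coefficients (plus indices) are stored
w_hat = D @ code                      # reconstruction on the decoder side
print(np.linalg.norm(w - w_hat) / np.linalg.norm(w))  # relative reconstruction error
```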
Poster
Tiago Novello · Diana Aldana Moreno · André Araujo · Luiz Velho
[ ExHall D ]
Abstract
Sinusoidal neural networks have been shown to be effective as implicit neural representations (INRs) of low-dimensional signals, due to their smoothness and high representation capacity. However, initializing and training them remain empirical tasks that lack a deeper understanding to guide the learning process. To fill this gap, our work introduces a theoretical framework that explains the capacity property of sinusoidal networks and offers robust control mechanisms for initialization and training. Our analysis is based on a novel amplitude-phase expansion of the sinusoidal multilayer perceptron, showing how its layer compositions produce a large number of new frequencies expressed as integer combinations of the input frequencies. This relationship can be directly used to initialize the input neurons, as a form of spectral sampling, and to bound the network's spectrum while training. Our method, referred to as TUNER (TUNing sinusoidal nEtwoRks), greatly improves the stability and convergence of sinusoidal INR training, leading to detailed reconstructions, while preventing overfitting.
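A hedged sketch of frequency-aware initialization for a sinusoidal MLP: the first layer's weights are set to sampled integer frequencies bounded by a chosen maximum, a simple form of spectral sampling. This illustrates the idea of controlling the input spectrum; it is not the exact TUNER procedure.

```python
# Illustrative sinusoidal INR with bounded, integer-frequency first-layer initialization.
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        return torch.sin(self.linear(x))

class SinusoidalINR(nn.Module):
    def __init__(self, in_dim=2, hidden=128, depth=3, max_freq=32):
        super().__init__()
        layers = [SineLayer(in_dim, hidden)]
        layers += [SineLayer(hidden, hidden) for _ in range(depth - 1)]
        self.net = nn.Sequential(*layers, nn.Linear(hidden, 1))
        with torch.no_grad():
            # spectral sampling: integer input frequencies bounded by max_freq (assumption)
            freqs = torch.randint(-max_freq, max_freq + 1, (hidden, in_dim)).float()
            self.net[0].linear.weight.copy_(freqs)
            self.net[0].linear.bias.uniform_(-torch.pi, torch.pi)  # random phases

model = SinusoidalINR()
coords = torch.rand(1024, 2) * 2 - 1   # query coordinates in [-1, 1]^2
print(model(coords).shape)             # torch.Size([1024, 1])
```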
Poster
Yuki Kawana · Shintaro Shiba · Quan Kong · Norimasa Kobori
[ ExHall D ]
Abstract
We propose a novel 3D gaze estimation approach that learns spatial relationships between the subject, scene, and objects and outputs 3D gaze direction. Our method targets unconstrained settings, including cases where close-up views of the subject’s eyes are unavailable, such as when the subject is distant or facing away. Previous methods rely on either 2D appearance alone or limited spatial cues by using depth maps in non-learnable post-processing. Estimating 3D gaze direction from 2D observations in these settings is challenging; besides variations in subject, scene, and gaze direction, different camera poses produce various 2D appearances and 3D gaze directions, even when targeting the same 3D scene. To address this issue, we propose gaze-aware 3D context encoding. Our method represents the subject and scene as 3D poses and object positions as 3D context, learning spatial relationships in 3D space. Inspired by human vision, we geometrically align this context in a subject-centric space, significantly reducing the spatial complexity. Furthermore, we propose D3 (distance-direction-decomposed) positional encoding to better capture the spatial relationship between 3D context and gaze direction in direction and distance space. Experiments show substantial improvement, reducing mean angle error of leading baselines by 13%–37% on benchmark datasets in single-view settings.
Poster
Yunfeng Xiao · Xiaowei Bai · Baojun Chen · Hao Su · Hao He · Liang Xie · Erwei Yin
[ ExHall D ]
Abstract
3D gaze estimation is a challenging task due to two main issues. First, existing methods focus on analyzing dense features (e.g., large areas of pixels), which are sensitive to local noise (e.g., light spots, blurs) and increase computational complexity. Second, an eyeball model can exhibit multiple gaze directions, and the coupled representation between gazes and models increases learning difficulty. To address these issues, we propose De²Gaze, a lightweight and accurate model-aware 3D gaze estimation method. In De²Gaze, we introduce two improvements for deformable and decoupled representation learning. Specifically, first, we propose a deformable sparse attention mechanism that can adapt sparse sampling points to attention areas to avoid local noise influences. Second, a spatial decoupling network with a double-branch decoding architecture is proposed to disentangle the invariant (e.g., eyeball radius, position) and variable (e.g., gaze, pupil, iris) features from the latent space. Compared to existing methods, De²Gaze requires fewer sparse features, and achieves faster convergence speed, lower computational complexity, and higher accuracy in 3D gaze estimation. Qualitative and quantitative experiments show that De²Gaze achieves state-of-the-art accuracy and semantic segmentation for 3D gaze estimation on the TEyeD dataset.
Poster
Tianyun Zhong · Chao Liang · Jianwen Jiang · Gaojie Lin · Jiaqi Yang · Zhou Zhao
[ ExHall D ]
Abstract
Diffusion-based audio-driven talking avatar methods have recently gained attention for their high-fidelity, vivid, and expressive results. However, their slow inference speed limits practical applications. Despite the development of various distillation techniques for diffusion models, we found that naive diffusion distillation methods do not yield satisfactory results. Distilled models exhibit reduced robustness with open-set input images and a decreased correlation between audio and video compared to teacher models, undermining the advantages of diffusion models. To address this, we propose FADA (Fast Diffusion Avatar Synthesis with Mixed-Supervised Multi-CFG Distillation). We first design a mixed-supervised loss to leverage data of varying quality and enhance the overall model capability as well as robustness. Additionally, we propose a multi-CFG distillation with learnable tokens to utilize the correlation between audio and reference image conditions, reducing the threefold inference runs caused by multi-CFG with acceptable quality degradation. Extensive experiments across multiple datasets show that FADA generates vivid videos comparable to recent diffusion model-based methods while achieving an NFE speedup of 4.17-12.5 times.
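To see why multi-CFG triples the per-step cost, the sketch below evaluates a denoiser with no condition, audio only, and audio plus reference image, then combines the three outputs with two guidance scales; the distillation in FADA aims to avoid these extra runs. The nested combination rule and the toy denoiser are assumptions, not the paper's formulation.

```python
# Illustrative multi-condition classifier-free guidance (three denoiser evaluations per step).
import torch

def multi_cfg_step(denoiser, x_t, t, audio, ref_img, s_audio=3.0, s_img=1.5):
    eps_uncond = denoiser(x_t, t, audio=None, ref_img=None)    # run 1: unconditional
    eps_audio = denoiser(x_t, t, audio=audio, ref_img=None)    # run 2: audio only
    eps_full = denoiser(x_t, t, audio=audio, ref_img=ref_img)  # run 3: audio + reference image
    return (eps_uncond
            + s_audio * (eps_audio - eps_uncond)
            + s_img * (eps_full - eps_audio))

# toy denoiser so the sketch runs end to end
def toy_denoiser(x_t, t, audio=None, ref_img=None):
    out = x_t * 0.9
    if audio is not None:
        out = out + 0.01
    if ref_img is not None:
        out = out + 0.01
    return out

x = torch.randn(1, 4, 32, 32)
out = multi_cfg_step(toy_denoiser, x, t=10,
                     audio=torch.randn(1, 16), ref_img=torch.randn(1, 4, 32, 32))
print(out.shape)  # torch.Size([1, 4, 32, 32])
```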
Poster
Juncheng Wang · Chao Xu · Cheng Yu · Lei Shang · Zhe Hu · Shujun Wang · Liefeng Bo
[ ExHall D ]
Abstract
Video-to-audio generation is essential for synthesizing realistic audio tracks that synchronize effectively with silent videos. Following the perspective of extracting essential signals from videos that can precisely control mature text-to-audio generative diffusion models, this paper presents how to balance the representation of mel-spectrograms in terms of completeness and complexity through a new approach called Mel Quantization-Continuum Decomposition (Mel-QCD). We decompose the mel-spectrogram into three distinct types of signals, applying quantization or continuity to each, so that they can be effectively predicted from video by a devised video-to-all (V2X) predictor. Then, the predicted signals are recomposed and fed into a ControlNet, along with a textual inversion design, to control the audio generation process. Our proposed Mel-QCD method demonstrates state-of-the-art performance across eight metrics, evaluating dimensions such as quality, synchronization, and semantic consistency.
Poster
Inho Kim · YOUNGKIL SONG · Jicheol Park · Won Hwa Kim · Suha Kwak
[ ExHall D ]
Abstract
Sound source localization (SSL) is the task of locating the source of sound within an image. Due to the lack of localization labels, the de facto standard in SSL has been to represent an image and audio as a single embedding vector each, and use them to learn SSL via contrastive learning. To this end, previous work samples one of the local image features as the image embedding and aggregates all local audio features to obtain the audio embedding, which is far from optimal due to the presence of noise and background irrelevant to the actual target in the input. We present a novel SSL method that addresses this chronic issue by joint slot attention on image and audio. To be specific, two slots competitively attend to image and audio features to decompose them into target and off-target representations, and only the target representations of image and audio are used for contrastive learning. Also, we introduce cross-modal attention matching to further align local features of image and audio. Our method achieves the best results in almost all settings on three public benchmarks for SSL, and substantially outperforms all the prior work in cross-modal retrieval.
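The two-slot idea can be sketched as a tiny attention module where two learnable slots compete (via a softmax over slots) for each local feature, and only the "target" slot feeds a standard InfoNCE loss across modalities. Sharing one module for image and audio, the temperature, and the slot count of two are simplifying assumptions, not the authors' exact design.

```python
# Toy two-slot competitive attention for decomposing features into target / off-target parts.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoSlotAttention(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.slots = nn.Parameter(torch.randn(2, dim))   # [target, off-target]
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)

    def forward(self, feats):                  # feats: (B, N, dim) local features
        q = self.to_q(self.slots)              # (2, dim)
        k, v = self.to_k(feats), self.to_v(feats)
        attn = torch.einsum("sd,bnd->bsn", q, k) / feats.size(-1) ** 0.5
        attn = attn.softmax(dim=1)             # softmax over slots -> slots compete per location
        attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-6)
        slots = torch.einsum("bsn,bnd->bsd", attn, v)
        return slots[:, 0], slots[:, 1]        # target, off-target representations

img_feats, aud_feats = torch.randn(4, 196, 256), torch.randn(4, 64, 256)
slot_attn = TwoSlotAttention()
img_target, _ = slot_attn(img_feats)
aud_target, _ = slot_attn(aud_feats)
# contrastive objective on the target slots only (InfoNCE over the batch)
logits = F.normalize(img_target, dim=-1) @ F.normalize(aud_target, dim=-1).T / 0.07
loss = F.cross_entropy(logits, torch.arange(4))
print(loss.item())
```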
Poster
Chen Liu · Liying Yang · Peike Li · Dadong Wang · Lincheng Li · Xin Yu
[ ExHall D ]
Abstract
Sound-guided object segmentation has drawn considerable attention for its potential to enhance multimodal perception. Previous methods primarily focus on developing advanced architectures to facilitate effective audio-visual interactions, without fully addressing the inherent challenges posed by the nature of audio, i.e., (1) feature confusion due to the overlapping nature of audio signals, and (2) audio-visual matching difficulty from the varied sounds produced by the same object. To address these challenges, we propose Dynamic Derivation and Elimination (DDESeg): a novel audio-visual segmentation framework. Specifically, to mitigate feature confusion, DDESeg reconstructs the semantic content of the mixed audio signal by enriching the distinct semantic information of each individual source, deriving representations that preserve the unique characteristics of each sound. To reduce the matching difficulty, we introduce a discriminative feature learning module, which enhances the semantic distinctiveness of generated audio representations. Considering that not all derived audio representations directly correspond to visual features (e.g., off-screen sounds), we propose a dynamic elimination module to filter out non-matching elements. This module facilitates targeted interaction between sounding regions and relevant audio semantics. By scoring the interacted features, we identify and filter out irrelevant audio information, ensuring accurate audio-visual alignment. Comprehensive experiments demonstrate that our framework achieves superior performance in …
Poster
Eitan Shaar · Ariel Shaulov · Gal Chechik · Lior Wolf
[ ExHall D ]
Abstract
In the domain of audio-visual event perception, which focuses on the temporal localization and classification of events across distinct modalities (audio and visual), existing approaches are constrained by the vocabulary available in their training data. This limitation significantly impedes their capacity to generalize to novel, unseen event categories. Furthermore, the annotation process for this task is labor-intensive, requiring extensive manual labeling across modalities and temporal segments, limiting the scalability of current methods. Current state-of-the-art models ignore the shifts in event distributions over time, reducing their ability to adjust to changing video dynamics. Additionally, previous methods rely on late fusion to combine audio and visual information. While straightforward, this approach results in a significant loss of multimodal interactions. To address these challenges, we propose Audio-Visual Adaptive Video Analysis (AV2A), a model-agnostic approach that requires no further training and integrates an early fusion technique to retain richer multimodal interactions. AV2A also includes a within-video label shift algorithm, leveraging input video data and predictions from prior frames to dynamically adjust event distributions for subsequent frames. Moreover, we present the first training-free, open-vocabulary baseline for audio-visual event perception, demonstrating that AV2A achieves substantial improvements over naive training-free baselines. We demonstrate the effectiveness of AV2A …
Poster
Zitang Zhou · Ke Mei · Yu Lu · Tianyi Wang · Fengyun Rao
[ ExHall D ]
Abstract
This paper introduces HarmonySet, a comprehensive dataset designed to advance video-music understanding. HarmonySet consists of 48,328 diverse video-music pairs, annotated with detailed information on rhythmic synchronization, emotional alignment, thematic coherence, and cultural relevance. We propose a multi-step human-machine collaborative framework for efficient annotation, combining human insights with machine-generated descriptions to identify key transitions and assess alignment across multiple dimensions. Additionally, we introduce a novel evaluation framework with tasks and metrics to assess the multi-dimensional alignment of video and music, including rhythm, emotion, theme, and cultural context. Our extensive experiments demonstrate that HarmonySet, along with the proposed evaluation framework, significantly improves the ability of multimodal models to capture and analyze the intricate relationships between video and music. Project page: https://harmonyset.github.io/.
Poster
Sanchayan Santra · Vishal Chudasama · Pankaj Wasnik · Vineeth Balasubramanian
[ ExHall D ]
Abstract
Precise Event Spotting (PES) aims to identify events and their class from long, untrimmed videos, particularly in sports. The main objective of PES is to detect the event at the exact moment it occurs. Existing methods mainly rely on features from a large pre-trained network, which may not be ideal for the task. Furthermore, these methods overlook the issue of imbalanced event class distribution present in the data, negatively impacting performance in challenging scenarios. This paper demonstrates that an appropriately designed network, trained end-to-end, can outperform state-of-the-art (SOTA) methods. Particularly, we propose a network with a convolutional spatio-temporal feature extractor enhanced with our proposed Adaptive Spatio-Temporal Refinement Module (ASTRM) and a long-range temporal module. The ASTRM enhances the features with spatio-temporal information, while the long-range temporal module extracts global context from the data by modeling long-range dependencies. To address the class imbalance issue, we introduce the Soft Instance Contrastive (SoftIC) loss, which promotes feature compactness and class separation. Extensive experiments show that the proposed method is efficient and outperforms the SOTA methods, specifically in more challenging settings.
Poster
Bingjie Gao · Xinyu Gao · Xiaoxue Wu · yujie zhou · Yu Qiao · Li Niu · Xinyuan Chen · Yaohui Wang
[ ExHall D ]
Abstract
Text-to-video (T2V) generative models trained on large-scale datasets have made remarkable advancements. Well-performing prompts are often model-specific and coherent with the distribution of training prompts for T2V models. Previous studies directly utilize large language models to optimize user-provided prompts toward the distribution of training prompts, lacking refined guidance that considers both prompt vocabulary and specific sentence format. In this paper, we introduce a retrieval-augmented prompt optimization framework for T2V generation. The user-provided prompts are augmented with relevant and diverse modifiers retrieved from a built relation graph, and then refactored into the format of training prompts through a fine-tuned language model. To balance retrieval speed and the vocabulary diversity of the relation graph, we propose a two-branch optimization mechanism to determine whether the better prompt comes from our method or directly from the large language model. Extensive experiments demonstrate that our proposed method can effectively enhance both the static and dynamic dimensions of generated videos, demonstrating the significance of prompt optimization for simple user-provided prompts.
Poster
Xin Yan · Yuxuan Cai · Qiuyue Wang · Yuan Zhou · Wenhao Huang · Huan Yang
[ ExHall D ]
Abstract
We introduce Presto, a novel video diffusion model designed to generate 15-second videos with long-range coherence and rich content. Extending video generation to maintain scenario diversity over long durations presents significant challenges. To address this, we propose a Segmented Cross-Attention (SCA) strategy, which splits hidden states into segments along the temporal dimension, allowing each segment to cross-attend to a corresponding sub-caption. SCA requires no additional parameters, enabling seamless incorporation into current DiT-based architectures. To facilitate high-quality long video generation, we build the LongTake-HD dataset, consisting of 261k content-rich videos with scenario coherence, annotated with an overall video caption and five progressive sub-captions. Experiments show that our Presto achieves 78.5% on the VBench Semantic Score and 100% on the Dynamic Degree, outperforming existing state-of-the-art video generation methods. This demonstrates that our proposed Presto significantly enhances the content richness, maintains long-range coherence, and captures intricate textual details. All code and model weights will be made publicly available.
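Segmented Cross-Attention as described (splitting hidden states along time so each segment attends to its own sub-caption) can be sketched with a shared attention module and no extra parameters. The shapes, chunking, and use of nn.MultiheadAttention below are illustrative assumptions rather than Presto's actual DiT blocks.

```python
# Minimal sketch of Segmented Cross-Attention (SCA) under assumed shapes.
import torch
import torch.nn as nn

def segmented_cross_attention(attn, hidden, sub_captions):
    """
    hidden:       (B, T, D) video tokens ordered along time
    sub_captions: list of S tensors, each (B, L_s, D), one per temporal segment
    attn:         a shared nn.MultiheadAttention (no extra parameters vs. plain cross-attention)
    """
    S = len(sub_captions)
    segments = hidden.chunk(S, dim=1)                  # split along the temporal axis
    outputs = []
    for seg, cap in zip(segments, sub_captions):
        out, _ = attn(query=seg, key=cap, value=cap)   # each segment attends to its sub-caption
        outputs.append(out)
    return torch.cat(outputs, dim=1)                   # (B, T, D)

D = 128
attn = nn.MultiheadAttention(embed_dim=D, num_heads=4, batch_first=True)
hidden = torch.randn(2, 60, D)                         # e.g. 60 latent frames
sub_caps = [torch.randn(2, 16, D) for _ in range(5)]   # five progressive sub-captions
print(segmented_cross_attention(attn, hidden, sub_caps).shape)  # torch.Size([2, 60, 128])
```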
Poster
Zhengrong Yue · Shaobin Zhuang · Kunchang Li · Yanbo Ding · Yali Wang
[ ExHall D ]
Abstract
Despite the recent advancement in video stylization, most existing methods struggle to render any video with complex transitions, based on an open style description in the user query. To fill this gap, we introduce a generic multi-agent system for video stylization, V-Stylist, built on a novel collaboration and reflection paradigm of multi-modal large language models. Specifically, our V-Stylist is a systematic workflow with three key roles: (1) Video Parser decomposes the input video into a number of shots and generates text prompts of the key shot content. Via a concise video-to-shot prompting paradigm, it allows our V-Stylist to effectively handle videos with complex transitions. (2) Style Parser identifies the style in the user query and progressively searches for the matched style model in a style tree. Via a robust tree-of-thought searching paradigm, it allows our V-Stylist to precisely specify a vague style preference in the open user query. (3) Style Artist leverages the matched model to render all the video shots into the required style. Via a novel multi-round self-reflection paradigm, it allows our V-Stylist to adaptively adjust detail control according to the style requirement. With such a distinct design of mimicking human professionals, our V-Stylist achieves a major breakthrough over the primary challenges for effective and automatic video stylization. Moreover, we further construct a new …
Poster
Huiyu Duan · Qiang Hu · Wang Jiarui · Liu Yang · Zitong Xu · Lu Liu · Xiongkuo Min · Chunlei Cai · Tianxiao Ye · Xiaoyun Zhang · Guangtao Zhai
[ ExHall D ]
Abstract
The rapid growth of user-generated content (UGC) videos has produced an urgent need for effective video quality assessment (VQA) algorithms to monitor video quality and guide optimization and recommendation procedures. However, current VQA models generally only give an overall rating for a UGC video, which lacks fine-grained labels for serving video processing and recommendation applications. To address the challenges and promote the development of UGC videos, we establish the first large-scale Fine-grained Video quality assessment Database, termed FineVD, which comprises 6104 UGC videos with fine-grained quality scores and descriptions across multiple dimensions. Based on this database, we propose a Fine-grained Video Quality assessment (FineVQ) model to learn the fine-grained quality of UGC videos, with the capabilities of quality rating, quality scoring, and quality attribution. Extensive experimental results demonstrate that our proposed FineVQ can produce fine-grained video-quality results and achieve state-of-the-art performance on FineVD and other commonly used UGC-VQA datasets. Both FineVD and FineVQ will be released upon publication.
Poster
Kevin Qinghong Lin · Mike Zheng Shou
[ ExHall D ]
Abstract
Human daily activities can be concisely narrated as sequences of routine events (e.g., turning off an alarm) in video streams, forming an event vocabulary. Motivated by this, we introduce **VLog**, a novel video understanding framework that defines video narrations as a vocabulary, going beyond the typical subword vocabularies in existing generative video-language models. Built on the lightweight language model GPT-2, **VLog** features three key innovations: 1. **A Generative Retrieval Model**, marrying the language model's complex reasoning capabilities with contrastive retrieval's efficient similarity search. 2. **A Hierarchical Vocabulary**, derived from large-scale video narrations using our narration pair encoding algorithm, enabling efficient indexing of specific events (e.g., cutting a tomato) by identifying broader scenarios (e.g., kitchen) with expressive postfixes (e.g., by the left hand). 3. **A Vocabulary Update Strategy**, leveraging generative models to extend the vocabulary for novel events encountered during inference. To validate our approach, we introduce **VidCab-Eval**, a development set requiring concise narrations with reasoning relationships (e.g., before and after). Experiments on **EgoSchema**, **COIN**, and **HiREST** further demonstrate the effectiveness of **VLog**, highlighting its ability to generate concise, contextually accurate, and efficient narrations. This offers a novel perspective on video understanding.
Poster
Zicheng Zhang · Ziheng Jia · Haoning Wu · Chunyi Li · Zijian Chen · Yingjie Zhou · Wei Sun · Xiaohong Liu · Xiongkuo Min · Weisi Lin · Guangtao Zhai
[ ExHall D ]
Abstract
With the rising interest in research on Large Multi-modal Models (LMMs) for video understanding, many studies have emphasized general video comprehension capabilities, neglecting the systematic exploration into video quality understanding. To address this oversight, we introduce Q-Bench-Video in this paper, a new benchmark specifically designed to evaluate LMMs' proficiency in discerning video quality. a) To ensure video source diversity, Q-Bench-Video encompasses videos from natural scenes, AI-generated Content (AIGC), and Computer Graphics (CG). b) Building on the traditional multiple-choice questions format with the Yes-or-No and What-How categories, we include Open-ended questions to better evaluate complex scenarios. Additionally, we incorporate the video pair quality comparison question to enhance comprehensiveness. c) Beyond the traditional Technical, Aesthetic, and Temporal distortions, we have expanded our evaluation aspects to include the dimension of AIGC distortions, which addresses the increasing demand for video generation. Finally, we collect a total of 2,378 question-answer pairs and test them on 12 open-source & 5 proprietary LMMs. Our findings indicate that while LMMs have a foundational understanding of video quality, their performance remains incomplete and imprecise, with a notable discrepancy compared to human performance. Through Q-Bench-Video, we seek to catalyze community interest, stimulate further research, and unlock the untapped potential of …
Poster
Wei Li · Bing Hu · Rui Shao · Leyang Shen · Liqiang Nie
[ ExHall D ]
Abstract
First-person video assistants are highly anticipated to enhance our daily life through online video dialogue. However, existing online video assistants often sacrifice assistant efficacy for real-time efficiency by processing low-frame-rate videos with coarse-grained visual features. To overcome the trade-off between efficacy and efficiency, we propose "**F**ast & **S**low Video-Language Thinker" as an on**LI**ne vide**O** assista**N**t, **LION-FS**, achieving real-time, proactive, temporally accurate, and contextually precise responses. LION-FS adopts a two-stage optimization strategy: **1) Fast Path: Routing-Based Response Determination** evaluates frame-by-frame whether an immediate response is necessary. To enhance response determination accuracy and handle higher frame-rate inputs efficiently, we employ Token Aggregation Routing to dynamically fuse spatiotemporal features without increasing token numbers, while utilizing Token Dropping Routing to eliminate redundant features, and **2) Slow Path: Multi-granularity Keyframe Augmentation** optimizes keyframes during response generation. To provide comprehensive and detailed responses beyond atomic actions constrained by training data, fine-grained spatial features and human-environment interaction features are extracted through multi-granular pooling. They are further integrated into a meticulously designed multimodal Thinking Template to guide more precise response generation. Comprehensive evaluations on online video tasks demonstrate that LION-FS achieves state-of-the-art efficacy and efficiency. The codes will be released soon.
Poster
Kaixuan Wu · Xinde Li · Xinglin Li · Chuanfei Hu · Guoliang Wu
[ ExHall D ]
Abstract
In this paper, a novel benchmark for audio-visual question answering continual learning (AVQACL) is introduced, aiming to study fine-grained scene understanding and spatial-temporal reasoning in videos under a continual learning setting. To facilitate this multimodal continual learning task, we create two audio-visual question answering continual learning datasets, named Split-AVQA and Split-MUSIC-AVQA, based on the AVQA and MUSIC-AVQA datasets, respectively. The experimental results suggest that the model exhibits limited cognitive and reasoning abilities and experiences catastrophic forgetting when processing three modalities simultaneously in a continuous data stream. To address the above challenges, we propose a novel continual learning method that incorporates question-guided cross-modal information fusion (QCIF) to focus on question-relevant details for improved feature representation and task-specific knowledge distillation with spatial-temporal feature constraints (TKD-STFC) to preserve the spatial-temporal reasoning knowledge acquired from previous dynamic scenarios. Furthermore, a question semantic consistency constraint (QSCC) is employed to ensure that the model maintains a consistent understanding of question semantics across tasks throughout the continual learning process. Extensive experimental results on the Split-AVQA and Split-MUSIC-AVQA datasets illustrate that our method achieves state-of-the-art audio-visual question answering continual learning performance. Code is available in the Supplementary Material.
Poster
Huabin Liu · Filip Ilievski · Cees G. M. Snoek
[ ExHall D ]
Abstract
This paper proposes the first video-grounded entailment tree reasoning method for commonsense video question answering (VQA). Despite the remarkable progress of large visual-language models (VLMs), there are growing concerns that they learn spurious correlations between videos and likely answers, reinforced by their black-box nature and remaining benchmarking biases. Our method explicitly grounds VQA tasks to video fragments in four steps: entailment tree construction, video-language entailment verification, tree reasoning, and dynamic tree expansion. A vital benefit of the method is its generalizability to current video- and image-based VLMs across reasoning types. To support fair evaluation, we devise a de-biasing procedure based on large language models that rewrites VQA benchmark answer sets to enforce model reasoning. Systematic experiments on existing and de-biased benchmarks highlight the impact of our method components across benchmarks, VLMs, and reasoning types.
Poster
Ziyang Wang · Shoubin Yu · Elias Stengel-Eskin · Jaehong Yoon · Feng Cheng · Gedas Bertasius · Mohit Bansal
[ ExHall D ]
Abstract
Long-form video understanding has been a challenging task due to the high redundancy in video data and the abundance of query-irrelevant information. To tackle this challenge, we propose VideoTree, a training-free framework which builds a query-adaptive and hierarchical video representation for LLM reasoning over long-form videos. First, VideoTree extracts query-relevant information from the input video through an iterative process, progressively refining the selection of keyframes based on their relevance to the query. Furthermore, VideoTree leverages the inherent hierarchical structure of long video data, which is often overlooked by existing LLM-based methods. Specifically, we incorporate multi-granularity information into a tree-based representation, allowing VideoTree to extract query-relevant details from long videos in a coarse-to-fine manner. This enables the model to effectively handle a wide range of video queries with varying levels of detail. Finally, VideoTree aggregates the hierarchical query-relevant information within the tree structure and feeds it into an LLM reasoning model to answer the query. Our experiments show that VideoTree improves both reasoning accuracy and efficiency compared to existing methods. Specifically, VideoTree outperforms the existing training-free approaches on the popular EgoSchema and NExT-QA benchmarks with less inference time, achieving 61.1% and 75.6% accuracy on the test set without additional video-specific training. …
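As a rough illustration of the coarse-to-fine, query-adaptive selection described above, the following sketch clusters precomputed frame embeddings, scores each cluster against the query, and re-clusters only the relevant branches. The embeddings, cluster counts, and relevance threshold are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_keyframes(frame_feats, query_feat, k_coarse=8, k_fine=4, rel_thresh=0.3):
    """Coarse-to-fine, query-adaptive keyframe selection (illustrative sketch).

    frame_feats: (N, D) L2-normalized frame embeddings.
    query_feat:  (D,)  L2-normalized query embedding.
    Returns sorted indices of selected keyframes.
    """
    # Coarse level: cluster all frames and score each cluster by the
    # similarity between its centroid and the query.
    coarse = KMeans(n_clusters=min(k_coarse, len(frame_feats)), n_init=10).fit(frame_feats)
    keyframes = []
    for c in range(coarse.n_clusters):
        idx = np.where(coarse.labels_ == c)[0]
        centroid = frame_feats[idx].mean(axis=0)
        centroid /= np.linalg.norm(centroid) + 1e-8
        relevance = float(centroid @ query_feat)
        if relevance < rel_thresh:
            # Irrelevant branch: keep only the frame closest to the centroid.
            keyframes.append(idx[np.argmax(frame_feats[idx] @ centroid)])
        else:
            # Relevant branch: refine with a finer clustering and keep one
            # representative frame per sub-cluster.
            fine = KMeans(n_clusters=min(k_fine, len(idx)), n_init=10).fit(frame_feats[idx])
            for f in range(fine.n_clusters):
                sub = idx[fine.labels_ == f]
                sub_centroid = frame_feats[sub].mean(axis=0)
                keyframes.append(sub[np.argmax(frame_feats[sub] @ sub_centroid)])
    return sorted(set(int(i) for i in keyframes))
```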
Poster
Haiyi Qiu · Minghe Gao · Long Qian · Kaihang Pan · Qifan Yu · Juncheng Li · Wenjie Wang · Siliang Tang · Yueting Zhuang · Tat-seng Chua
[ ExHall D ]
Abstract
Video Large Language Models (Video-LLMs) have recently shown strong performance in basic video understanding tasks, such as captioning and coarse-grained question answering, but struggle with compositional reasoning that requires multi-step spatio-temporal inference across object relations, interactions, and events. The hurdles to enhancing this capability include extensive manual labor, the lack of spatio-temporal compositionality in existing data, and the absence of explicit reasoning supervision. In this paper, we propose STEP, a novel graph-guided self-training method that enables Video-LLMs to generate reasoning-rich fine-tuning data from any raw videos to improve themselves. Specifically, we first induce a Spatial-Temporal Scene Graph (STSG) representation of diverse videos to capture fine-grained, multi-granular video semantics. Then, the STSG guides the derivation of multi-step reasoning Question-Answer (QA) data with Chain-of-Thought (CoT) rationales. Both answers and rationales are integrated as the training objective, aiming to enhance the model's reasoning abilities by supervision over explicit reasoning steps. Experimental results demonstrate the effectiveness of STEP across models of varying scales, with a significant 21.3% improvement in tasks requiring three or more reasoning steps. Furthermore, it achieves superior performance with a minimal amount of self-generated rationale-enriched training samples in both compositional reasoning and comprehensive understanding benchmarks, highlighting the broad applicability and vast potential. The code …
Poster
Kangsan Kim · Geon Park · Youngwan Lee · Woongyeong Yeo · Sung Ju Hwang
[ ExHall D ]
Abstract
Recent advancements in video large multimodal models (LMMs) have significantly improved their video understanding and reasoning capabilities. However, their performance drops on out-of-distribution (OOD) tasks that are underrepresented in training data. Traditional methods like fine-tuning on OOD datasets are impractical due to high computational costs. While in-context learning (ICL) with demonstration examples has shown promising generalization performance in language tasks and image-language tasks without fine-tuning, applying ICL to video-language tasks faces challenges due to the limited context length in Video LMMs, as videos require longer token lengths. To address these issues, we propose VideoICL, a novel video in-context learning framework for OOD tasks that introduces a similarity-based relevant example selection strategy and a confidence-based iterative inference approach. This allows the most relevant examples to be selected and ranked by similarity before being used for inference. If the generated response has low confidence, our framework selects new examples and performs inference again, iteratively refining the results until a high-confidence response is obtained. This approach improves OOD video understanding performance by extending effective context length without incurring high costs. The experimental results on multiple benchmarks demonstrate significant performance gains, especially in domain-specific scenarios, laying the groundwork for broader video comprehension applications.
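A minimal sketch of the two ingredients named above, similarity-based example selection and confidence-based iterative inference, is given below. The `run_lmm` callable and the confidence threshold are hypothetical placeholders standing in for whatever Video LMM and scoring rule the method actually uses.

```python
import numpy as np

def video_icl_infer(query_emb, pool_embs, pool_examples, run_lmm,
                    batch_size=4, conf_thresh=0.7, max_rounds=3):
    """Similarity-ranked example selection with confidence-gated retries (sketch).

    query_emb:  (D,) embedding of the test video/question.
    pool_embs:  (M, D) embeddings of candidate demonstration examples.
    run_lmm:    hypothetical callable(examples) -> (answer, confidence).
    """
    # Rank the pool once by cosine similarity to the query (embeddings assumed normalized).
    order = np.argsort(-(pool_embs @ query_emb))
    best_answer, best_conf = None, -1.0
    for r in range(max_rounds):
        # Each round uses the next-most-similar batch of demonstrations.
        chosen = [pool_examples[i] for i in order[r * batch_size:(r + 1) * batch_size]]
        if not chosen:
            break
        answer, conf = run_lmm(chosen)
        if conf > best_conf:
            best_answer, best_conf = answer, conf
        if conf >= conf_thresh:       # high-confidence response: stop early
            break
    return best_answer, best_conf
```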
Poster
Zhuoming Liu · Yiquan Li · Khoi D Nguyen · Yiwu Zhong · Yin Li
[ ExHall D ]
Abstract
We present PAVE, a framework for adapting pre-trained video large language models to downstream tasks featuring temporal supplementary signals, such as audio, camera pose, or high frame rate videos. PAVE adapts these models through "patching", introducing a small number of additional parameters and operations without modifying the base model architecture or pre-trained weights. We demonstrate that PAVE effectively adapts video LLMs for tasks including audio-visual understanding and 3D reasoning, surpassing state-of-the-art task-specific models, while using less than 1% additional parameters and FLOPs. Furthermore, when applied to high-frame-rate videos, PAVE enhances video understanding, improving the performance of strong base models. Our analysis also highlights that this framework generalizes well across different video LLMs.
Poster
Shuming Liu · Chen Zhao · Tianqi Xu · Bernard Ghanem
[ ExHall D ]
Abstract
Large vision-language models (VLMs) have shown promising progress in various video understanding tasks. However, their potential for long-form video analysis is limited by their high computational resource requirements and constrained context windows. Traditional methods, particularly uniform frame sampling, often allocate resources to irrelevant content, reducing effectiveness in real-world scenarios. This paper introduces BOLT, a method to BOost Large VLMs without additional Training, through an extensive study of frame selection strategies for large VLMs. To provide a realistic evaluation of VLMs in long-form video understanding, we first present a multi-source retrieval evaluation setting. Our findings show that uniform sampling significantly underperforms when dealing with noisy contexts, highlighting the importance of selecting the right frames. Furthermore, we introduce several frame selection strategies based on query-frame similarity and analyze their effectiveness in enhancing VLM performance without retraining. We find that inverse transform sampling with refined query descriptions yields the most substantial performance improvement, boosting accuracy on the Video-MME benchmark from 49.94% to 53.8%. Our code will be released.
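The winning strategy reported above, inverse transform sampling over query-frame similarity, can be sketched as follows. The temperature, the use of evenly spaced quantiles, and the similarity source (e.g., a CLIP-style text-frame score) are assumptions made for illustration rather than the paper's exact recipe.

```python
import numpy as np

def inverse_transform_frame_selection(similarities, num_frames, temperature=0.1):
    """Select frames by inverse transform sampling over query-frame similarity (sketch).

    similarities: (N,) cosine similarities between the (refined) query text and
                  each frame, e.g. from a CLIP-style model.
    Returns sorted indices of the selected frames.
    """
    s = np.asarray(similarities, dtype=np.float64)
    # Turn similarities into a sampling distribution; lower temperature
    # concentrates the frame budget on the most query-relevant frames.
    p = np.exp((s - s.max()) / temperature)
    p /= p.sum()
    cdf = np.cumsum(p)
    # Evenly spaced quantiles give a deterministic, duplicate-light selection;
    # np.random.uniform could be used instead for stochastic sampling.
    quantiles = (np.arange(num_frames) + 0.5) / num_frames
    idx = np.searchsorted(cdf, quantiles)
    return sorted(set(int(i) for i in np.clip(idx, 0, len(s) - 1)))
```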
Poster
Zhenpeng Huang · Xinhao Li · Jiaqi Li · Jing Wang · Xiangyu Zeng · Cheng Liang · Tao Wu · Xi Chen · Liang Li · Limin Wang
[ ExHall D ]
Abstract
Multimodal Large Language Models (MLLMs) have shown significant progress in video understanding. However, applying these models to real-world streaming scenarios, such as autonomous driving, augmented reality, and surveillance, presents unique challenges due to real-time dynamics. This paper introduces the Online Spatial-Temporal Video Understanding Benchmark, OVBench, designed specifically to evaluate models’ capacity to interpret spatiotemporal features in streaming contexts. Unlike existing offline benchmarks, OVBench integrates six core task types, each defined across three temporal contexts (past, current, and future) to capture real-time complexity, resulting in a comprehensive set of 15 subtasks based on diverse datasets. To address the unique data constraints and architectural challenges in streaming video understanding, we present our strong baseline model, VideoChat-Online. This model incorporates a hierarchical memory bank architecture that effectively balances spatial and temporal representation, achieving state-of-the-art performance across online and offline scenarios. Our approach surpasses the existing state-of-the-art offline model Qwen2-VL 7B and the online model Flash-VStream by 4.19% and 23.7% on OVBench, respectively. The results demonstrate VideoChat-Online’s efficacy in providing real-time responses while maintaining high accuracy in spatiotemporal comprehension.
Poster
Gengyuan Zhang · Mang Ling Ada Fok · Jialu Ma · Yan Xia · Philip H.S. Torr · Daniel Cremers · Volker Tresp · Jindong Gu
[ ExHall D ]
Abstract
Localizing events in videos based on semantic queries is a pivotal task in video understanding, with the growing significance of user-oriented applications like video search. Yet, current research predominantly relies on natural language queries (NLQs), overlooking the potential of using multimodal queries (MQs) that integrate images to more flexibly represent semantic queries, especially when it is difficult to express non-verbal or unfamiliar concepts in words. To bridge this gap, we introduce ICQ, a new benchmark designed for localizing events in videos with MQs, alongside an evaluation dataset ICQ-Highlight. To accommodate and evaluate existing video localization models for this new task, we propose 3 Multimodal Query Adaptation methods and a Surrogate Fine-tuning on pseudo-MQs strategy. We systematically benchmark 12 state-of-the-art backbone models, spanning from specialized video localization models to Video LLMs, across diverse data domains. Our experiments highlight the high potential of MQs in real-world applications. We believe this benchmark is a first step toward advancing MQs in video event localization. Our code and dataset will be publicly available.
Poster
Junho Kim · Hyunjun Kim · Hosu Lee · Yong Man Ro
[ ExHall D ]
Abstract
Despite advances in Large Multi-modal Models, applying them to long and untrimmed video content remains challenging due to limitations in context length and substantial memory overhead. These constraints often lead to significant information loss and reduced relevance in the model responses. With the exponential growth of video data across web platforms, understanding long-form video is crucial for advancing generalized intelligence. In this paper, we introduce SALOVA: Segment-Augmented LOng Video Assistant, a novel video-LLM framework designed to enhance the comprehension of lengthy video content through a targeted retrieval process. We address two main challenges to achieve it: (i) We present the SceneWalk dataset, a high-quality collection of 87.8K long videos, each densely captioned at the segment level to enable models to capture scene continuity and maintain rich descriptive context. (ii) We develop robust architectural designs integrating a dynamic routing mechanism and a spatio-temporal projector to efficiently retrieve and process relevant video segments based on user queries. Our framework mitigates the limitations of current video-LMMs by allowing for precise identification and retrieval of relevant video segments in response to queries, thereby improving the contextual relevance of the generated responses. Through extensive experiments, SALOVA demonstrates enhanced capability in processing complex long-form videos, showing significant capability to …
Poster
Sheng Zhou · Junbin Xiao · Qingyun Li · Yicong Li · Xun Yang · Dan Guo · Meng Wang · Tat-seng Chua · Angela Yao
[ ExHall D ]
Abstract
We introduce EgoTextVQA, a novel and rigorously constructed benchmark for egocentric QA assistance involving scene text. EgoTextVQA contains 1.5K ego-view videos and 7K scene-text aware questions that reflect real-user needs in outdoor driving and indoor housekeeping activities. The questions are designed to elicit identification and reasoning on scene text in an egocentric and dynamic environment. With EgoTextVQA, we comprehensively evaluate 10 prominent multimodal large language models. Currently, all models struggle, and the best results (Gemini Pro) are around 33% accuracy, highlighting the severe deficiency of these techniques in egocentric QA assistance. Our further investigations suggest that precise temporal grounding and multi-frame reasoning, along with high resolution and auxiliary scene-text inputs, are key for better performance. With thorough analyses and heuristic suggestions, we hope EgoTextVQA can serve as a solid testbed for research in egocentric scene-text QA assistance.
Poster
Felix Vogel · Walid Bousselham · Anna Kukleva · Nina Shvetsova · Hilde Kuehne
[ ExHall D ]
Abstract
Vision-language foundation models have shown impressive capabilities across various zero-shot tasks, including training-free localization and grounding, primarily focusing on localizing objects in images. However, leveraging those capabilities to localize actions and events in videos is challenging, as actions have less physical outline and are usually described by higher-level concepts. In this work, we propose VideoGEM, the first training-free action grounding method based on pretrained image- and video-language backbones. Namely, we adapt the self-self attention formulation of GEM to activity grounding. By doing so, we observe that high-level semantic concepts, such as actions, usually emerge in the higher layers of the image- and video-language models. We, therefore, propose a layer weighting in the self-attention path to prioritize higher layers. Additionally, we introduce a dynamic weighting method to automatically tune layer weights to capture each layer’s relevance to a specific prompt. Finally, we introduce a prompt decomposition, processing action, verb, and object prompts separately, resulting in a better localization of actions. We evaluate the proposed approach on three image- and video-language backbones, CLIP, OpenCLIP, and ViCLIP, and on four video grounding datasets, V-HICO, DALY, YouCook-Interactions, and GroundingYouTube, showing that the proposed training-free approach is able to outperform current trained state-of-the-art approaches for video …
Poster
Aaryan Garg · Akash Kumar · Yogesh S. Rawat
[ ExHall D ]
Abstract
In this work we study Weakly Supervised Spatio-Temporal Video Grounding (WSTVG), the challenging task of localizing subjects spatio-temporally in videos using only textual queries and no bounding box supervision. Inspired by recent advances in vision-language foundation models, we investigate their utility for WSTVG, leveraging their zero-shot grounding capabilities. However, we find that a simple adaptation lacks essential spatio-temporal grounding abilities. To bridge this gap, we introduce Tubelet Referral Grounding (TRG), which connects textual queries to tubelets to enable spatio-temporal predictions. Despite its promise, TRG struggles with compositional action understanding and dense scene scenarios. To address these limitations, we propose STPro, a novel progressive learning framework with two key modules: (1) Sub-Action Temporal Curriculum Learning (SA-TCL), which incrementally builds compositional action understanding, and (2) Congestion-Guided Spatial Curriculum Learning (CG-SCL), which adapts the model to complex scenes by spatially increasing task difficulty. STPro achieves state-of-the-art results on three benchmark datasets, with improvements of 1.0% on VidSTG-Declarative and 3.0% on HCSTVG-v1. Code and models will be released publicly.
Poster
Claudia Cuttano · Gabriele Trivigno · Gabriele Rosi · Carlo Masone · Giuseppe Averta
[ ExHall D ]
Abstract
Referring Video Object Segmentation (RVOS) relies on natural language expressions to segment an object in a video clip. Existing methods restrict reasoning either to independent short clips, losing global context, or process the entire video offline, impairing their application in a streaming fashion. In this work, we aim to surpass these limitations and design an RVOS method capable of effectively operating in streaming-like scenarios while retaining contextual information from past frames. We build upon the Segment-Anything 2 (SAM2) model, which provides robust segmentation and tracking capabilities and is naturally suited for streaming processing. We make SAM2 wiser by empowering it with natural language understanding and explicit temporal modeling at the feature extraction stage, without fine-tuning its weights, and without outsourcing modality interaction to external models. To this end, we introduce a novel adapter module that injects temporal information and multi-modal cues into the feature extraction process. We further reveal the phenomenon of tracking bias in SAM2 and propose a learnable module to adjust its tracking focus when the current frame features suggest a new object more aligned with the caption. Our proposed method achieves state-of-the-art performance across various benchmarks while adding a negligible overhead of just 4.2 M parameters.
Poster
Nan Huang · Wenzhao Zheng · Chenfeng Xu · Kurt Keutzer · Shanghang Zhang · Angjoo Kanazawa · Qianqian Wang
[ ExHall D ]
Abstract
Moving object segmentation is a crucial task for achieving a high-level understanding of visual scenes and has numerous downstream applications. Humans can effortlessly segment moving objects in videos. Previous work has largely relied on optical flow to provide motion cues; however, this approach often results in imperfect predictions due to challenges such as partial motion, complex deformations, motion blur and background distractions. We propose a novel approach for moving object segmentation that combines long-range trajectory motion cues with DINO-based semantic features and leverages SAM2 for pixel-level mask densification through an iterative prompting strategy. Our model employs Spatio-Temporal Trajectory Attention and Motion-Semantic Decoupled Embedding to prioritize motion while integrating semantic support. Extensive testing on diverse datasets demonstrates state-of-the-art performance, excelling in challenging scenarios and fine-grained segmentation of multiple objects.
Poster
Haiyang Mei · Pengyu Zhang · Mike Zheng Shou
[ ExHall D ]
Abstract
Foundation models like the Segment Anything Model (SAM) have significantly advanced promptable image segmentation in computer vision. However, extending these capabilities to videos presents substantial challenges, particularly in ensuring precise and temporally consistent mask propagation in dynamic scenes. SAM 2 attempts to address this by training a model on massive image and video data from scratch to learn complex spatiotemporal associations, resulting in huge training costs that hinder research and practical deployment. In this paper, we introduce SAM-I2V, an effective image-to-video upgradation method for cultivating a promptable video segmentation (PVS) model. Our approach strategically upgrades the pre-trained SAM to support PVS, significantly reducing training complexity and resource requirements. To achieve this, we introduce two key innovations: (i) a novel memory-as-prompt mechanism that leverages object memory to ensure segmentation consistency across dynamic scenes; and (ii) a new memory filtering mechanism designed to select the most informative historical information, thereby avoiding errors and noise interference and enhancing the stability of segmentation. Comprehensive experiments demonstrate that our method achieves over 90% of SAM 2's performance while using only 0.2% of its training cost. Our work presents a resource-efficient pathway to PVS, lowering barriers for further research in PVS model design and enabling broader …
Poster
Andrei Dumitriu · Florin Tatui · Florin Miron · Aakash Ralhan · Radu Tudor Ionescu · Radu Timofte
[ ExHall D ]
Abstract
Rip currents are strong, localized and narrow currents of water that flow outwards into the sea, causing numerous beach-related injuries and fatalities worldwide. Accurate identification of rip currents remains challenging due to their amorphous nature and a lack of annotated data, which often requires expert knowledge. To address these issues, we present RipVIS, a large-scale video instance segmentation benchmark explicitly designed for rip current segmentation. RipVIS is an order of magnitude larger than previous datasets, featuring 140 videos (204,056 frames), of which 115 videos (161,600 frames) contain rip currents, collected from various sources, including drones, mobile phones, and fixed beach cameras. Our dataset encompasses diverse visual contexts, such as wave-breaking patterns, sediment flows, and water color variations, across multiple global locations, including the USA, Greece, Portugal, Romania, Costa Rica, Sri Lanka, New Zealand and Australia. Most videos are annotated at 5 FPS to ensure accuracy in dynamic scenarios, supplemented by an additional 25 videos (42,456 frames) without rip currents. We conduct comprehensive ablation studies on Mask R-CNN, Cascade Mask R-CNN, YOLACT, and YOLO11, fine-tuning these models for the task of rip current segmentation. Baseline results are reported across standard metrics, with a particular focus on the F2 score …
Poster
Olga Zatsarynna · Emad Bahrami · Yazan Abu Farha · Gianpiero Francesca · Jürgen Gall
[ ExHall D ]
Abstract
Our work addresses the problem of stochastic long-term dense anticipation. The goal of this task is to predict actions and their durations several minutes into the future based on provided video observations. Anticipation over extended horizons introduces high uncertainty, as a single observation can lead to multiple plausible future outcomes. To address this uncertainty, stochastic models are designed to predict several potential future action sequences. Recent work has further proposed to incorporate uncertainty modelling for observed frames by simultaneously predicting per-frame past and future actions in a unified manner. While such joint modelling of actions is beneficial, it requires long-range temporal capabilities to connect events across distant past and future time points. However, the previous work struggles to achieve such a long-range understanding due to its limited and/or sparse receptive field. To alleviate this issue, we propose a novel MANTA (MAmba for ANTicipation) network. Our model enables effective long-term temporal modelling even for very long sequences while maintaining linear complexity in sequence length. We demonstrate that our approach achieves state-of-the-art results on three datasets—Breakfast, 50Salads, and Assembly101—while also significantly improving computational and memory efficiency.
Poster
yilong wang · Zilin Gao · Qilong Wang · Zhaofeng Chen · Peihua Li · Qinghua Hu
[ ExHall D ]
Abstract
Going beyond few-shot action recognition (FSAR), cross-domain FSAR (CDFSAR) has attracted recent research interest by addressing the domain gap in source-to-target transfer learning. Existing CDFSAR methods mainly focus on joint training of source and target data to mitigate the side effect of the domain gap. However, such methods suffer from two limitations: First, pair-wise joint training requires retraining deep models when one source dataset is paired with multiple target ones, which incurs heavy computation cost, especially for large source and small target data. Second, pre-trained models after joint training are adapted to the target domain in a straightforward manner, hardly exploiting the full potential of pre-trained models and thus limiting recognition performance. To overcome the above limitations, this paper proposes a simple yet effective baseline, namely Temporal-Aware Model Tuning (TAMT), for CDFSAR. Specifically, our TAMT involves a decoupled paradigm that performs pre-training on source data and fine-tuning on target data, which avoids retraining for multiple target datasets with a single source. To effectively and efficiently explore the potential of pre-trained models in transferring to the target domain, our TAMT proposes a Hierarchical Temporal Tuning Network (HTTN), whose core involves local temporal-aware adapters (TAA) and a global temporal-aware moment tuning (GTMT). Particularly, TAA learns few …
Poster
Shaopeng Yang · Jilong Wang · Saihui Hou · Xu Liu · Chunshui Cao · Liang Wang · Yongzhen Huang
[ ExHall D ]
Abstract
Gait sequences exhibit sequential structures and contextual relationships similar to those in natural language, where each element—whether a word or a gait step—is connected to its predecessors and successors. This similarity enables the transformation of gait sequences into "texts" containing identity-related information. Large Language Models (LLMs), designed to understand and generate sequential data, can thus be utilized for gait sequence modeling to enhance gait recognition performance. Leveraging these insights, we make a pioneering effort to apply LLMs to gait recognition, which we refer to as GaitLLM. Specifically, we propose the Gait-to-Language (G2L) module, which converts gait sequences into a textual format suitable for LLMs, and the Language-to-Gait (L2G) module, which maps the LLM's output back to the gait feature space, thereby bridging the gap between LLM outputs and gait recognition. Notably, GaitLLM leverages the powerful modeling capabilities of LLMs without relying on complex architectural designs, improving gait recognition performance with only a small number of trainable parameters. Our method achieves state-of-the-art results on four popular gait datasets—SUSTech1K, CCPG, Gait3D, and GREW—demonstrating the effectiveness of applying LLMs in this domain. This work highlights the potential of LLMs to significantly enhance gait recognition, paving the way for future research and practical applications.
Poster
Lorenzo Mur-Labadia · Jose J. Guerrero · Ruben Martinez-Cantin
[ ExHall D ]
Abstract
Environment understanding in egocentric videos is an important step for applications like robotics, augmented reality and assistive technologies. These videos are characterized by dynamic interactions and a strong dependence on the wearer’s engagement with the environment. Traditional approaches often focus on isolated clips or fail to integrate rich semantic and geometric information, limiting scene comprehension. We introduce Dynamic Image-Video Feature Fields (DIV-FF), a framework that decomposes the egocentric scene into persistent, dynamic, and actor-based components while integrating both image and video-language features. Our model enables detailed segmentation, captures affordances, understands the surroundings and maintains consistent understanding over time. DIV-FF outperforms state-of-the-art methods, particularly in dynamically evolving scenarios, demonstrating its potential to advance long-term, spatio-temporal scene understanding.
Poster
Shengeng Tang · Jiayi He · Lechao Cheng · Jingjing Wu · Dan Guo · Richang Hong
[ ExHall D ]
Abstract
Generating continuous sign language videos from discrete segments is challenging due to the need for smooth transitions that preserve natural flow and meaning. Traditional approaches that simply concatenate isolated signs often result in abrupt transitions, disrupting video coherence. To address this, we propose a novel framework, Sign-D2C, that employs a conditional diffusion model to synthesize contextually smooth transition frames, enabling the seamless construction of continuous sign language sequences. Our approach transforms the unsupervised problem of transition frame generation into a supervised training task by simulating the absence of transition frames through random masking of segments in long-duration sign videos. The model learns to predict these masked frames by denoising Gaussian noise, conditioned on the surrounding sign observations, allowing it to handle complex, unstructured transitions. During inference, we apply a linearly interpolating padding strategy that initializes missing frames through interpolation between boundary frames, providing a stable foundation for iterative refinement by the diffusion model. Extensive experiments on the PHOENIX14T, USTC-CSL100, and USTC-SLR500 datasets demonstrate the effectiveness of our method in producing continuous, natural sign language videos.
Poster
Zezeng Li · Xiaoyu Du · Na Lei · Liming Chen · Weimin Wang
[ ExHall D ]
Abstract
Adversarial attacks exploit the vulnerability of deep models to adversarial samples. Existing point cloud attackers are tailored to specific models, iteratively optimizing perturbations based on gradients in either a white-box or black-box setting. Despite their promising attack performance, they often struggle to produce transferable adversarial samples due to overfitting to the specific parameters of surrogate models. To overcome this issue, we shift our focus to the data distribution itself and introduce a novel approach named NoPain, which employs optimal transport (OT) to identify the inherent singular boundaries of the data manifold for cross-network point cloud attacks. Specifically, we first calculate the OT mapping from noise to the target feature space, then identify singular boundaries by locating non-differentiable positions. Finally, we sample along singular boundaries to generate adversarial point clouds. Once the singular boundaries are determined, NoPain can efficiently produce adversarial samples without the need for iterative updates or guidance from surrogate classifiers. Extensive experiments demonstrate that the proposed end-to-end method outperforms baseline approaches in terms of both transferability and efficiency, while also maintaining notable advantages even against defense strategies. The source code will be publicly available.
Poster
Li Lin · Santosh Santosh · Mingyang Wu · Xin Wang · Shu Hu
[ ExHall D ]
Abstract
AI-generated faces have enriched human life, such as entertainment, education, and art. However, they also pose misuse risks. Therefore, detecting AI-generated faces becomes crucial, yet current detectors show biased performance across different demographic groups. Mitigating biases can be done by designing algorithmic fairness methods, which usually require demographically annotated face datasets for model training. However, no existing dataset encompasses both demographic attributes and diverse generative methods simultaneously, which hinders the development of fair detectors for AI-generated faces. In this work, we introduce the AI-Face dataset, the first million-scale demographically annotated AI-generated face image dataset, including real faces, faces from deepfake videos, and faces generated by Generative Adversarial Networks and Diffusion Models. Based on this dataset, we conduct the first comprehensive fairness benchmark to assess various AI face detectors and provide valuable insights and findings to promote the future fair design of AI face detectors. Our AI-Face dataset and benchmark code are publicly available at https://anonymous.4open.science/r/AI_Face_FairnessBench-E417.
Poster
Fengfan Zhou · Bangjie Yin · Hefei Ling · Qianyu Zhou · Wenxuan Wang
[ ExHall D ]
Abstract
Face Recognition (FR) models are vulnerable to adversarial examples that subtly manipulate benign face images, underscoring the urgent need to improve the transferability of adversarial attacks in order to expose the blind spots of these systems. Existing adversarial attack methods often overlook the potential benefits of augmenting the surrogate model with diverse initializations, which limits the transferability of the generated adversarial examples. To address this gap, we propose a novel attack method called Diverse Parameters Augmentation (DPA), which enhances surrogate models by incorporating diverse parameter initializations, resulting in a broader and more diverse set of surrogate models. Specifically, DPA consists of two key stages: Diverse Parameters Optimization (DPO) and Hard Model Aggregation (HMA). In the DPO stage, we initialize the parameters of the surrogate model using both pre-trained and random parameters. Subsequently, we save the models from the intermediate training process to obtain a diverse set of surrogate models. During the HMA stage, we enhance the feature maps of the diversified surrogate models by incorporating beneficial perturbations optimized in directions opposite to those of adversarial perturbations, thereby further improving the transferability. Experimental results demonstrate that our proposed attack method can effectively enhance the transferability of the crafted adversarial face …
Poster
Mohammad Saadabadi Saadabadi · Sahar Rahimi Malakshan · Ali Dabouei · Srinjoy Das · Jeremy M. Dawson · Nasser M Nasrabadi
[ ExHall D ]
Abstract
Aiming to reduce the computational cost of Softmax in the massive label space of Face Recognition (FR) benchmarks, recent studies estimate the output using a subset of identities. Although promising, the computation cost still scales linearly with the number of identities in the dataset, only at a reduced ratio. A shared characteristic among available FR methods is the employment of atomic scalar labels during training. Consequently, input-to-label matching is performed through a dot product between the feature vector of the input and the Softmax centroids. In this work, we present a simple yet effective method that substitutes scalar labels with structured identity codes, i.e., sequences of integers. Specifically, we propose a tokenization scheme that transforms atomic scalar labels into structured identity codes. Then, we train an FR backbone to predict the code for each input instead of its scalar label. As a result, the associated computational cost becomes logarithmic with respect to the number of identities. We demonstrate the benefits of the proposed method by conducting experiments on LFW, CFP-FP, CPLFW, CALFW, AgeDB, IJB-B, and IJB-C using different backbone network architectures. In particular, with less training computational load, our method outperforms its competitors by 1.52% and 0.6% at TAR@FAR=1e−4 on …
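A minimal sketch of the label-to-code idea described above: a scalar identity label is rewritten as a fixed-length sequence of base-k digits, so the classification head predicts a handful of small softmaxes instead of one over the full identity set. The base and code length here are illustrative assumptions; the paper's actual tokenization scheme may differ.

```python
import math

def encode_identity(label, num_ids, base=16):
    """Map a scalar identity label to a fixed-length sequence of base-`base` digits."""
    length = max(1, math.ceil(math.log(num_ids, base)))
    code = []
    for _ in range(length):
        code.append(label % base)
        label //= base
    return code[::-1]                      # most-significant digit first

def decode_identity(code, base=16):
    """Inverse mapping from a digit sequence back to the scalar label."""
    label = 0
    for d in code:
        label = label * base + d
    return label

# With 1,000,000 identities and base 16, each identity becomes a 5-digit code,
# so the head predicts 5 softmaxes over 16 classes instead of one over 10^6,
# i.e. the output cost grows logarithmically with the number of identities.
assert decode_identity(encode_identity(123456, 1_000_000)) == 123456
```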
Poster
Li Lun · Kunyu Feng · Qinglong Ni · Ling Liang · Yuan Wang · Ying Li · dunshan yu · Xiaoxin CUI
[ ExHall D ]
Abstract
Spiking neural networks (SNNs) have shown their competence in handling spatial-temporal event-based data with low energy consumption. Similar to conventional artificial neural networks (ANNs), SNNs are also vulnerable to gradient-based adversarial attacks, wherein gradients are calculated by spatial-temporal back-propagation (STBP) and surrogate gradients (SGs). However, the SGs may be invisible for an inference-only model as they do not influence the inference results, and current gradient-based attacks are ineffective for binary dynamic images captured by the dynamic vision sensor (DVS). While some approaches have addressed the issue of invisible SGs through universal SGs, their SGs lack a correlation with the victim model, resulting in sub-optimal performance. Moreover, the imperceptibility of existing SNN-based binary attacks is still insufficient. In this paper, we introduce an innovative potential-dependent surrogate gradient (PDSG) method to establish a robust connection between the SG and the model, thereby enhancing the adaptability of adversarial attacks across various models with invisible SGs. Additionally, we propose the sparse dynamic attack (SDA) to effectively attack binary dynamic images. Utilizing a generation-reduction paradigm, SDA can fully optimize the sparsity of adversarial perturbations. Experimental results demonstrate that our PDSG and SDA outperform state-of-the-art SNN-based attacks across various models and datasets. Specifically, our PDSG achieves 100% …
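For context, the sketch below shows a standard surrogate-gradient spiking function in PyTorch, the mechanism PDSG builds on: the forward pass emits binary spikes, and the backward pass substitutes a smooth surrogate derivative. The sigmoid-shaped surrogate and its fixed sharpness are illustrative assumptions; the paper's potential-dependent formulation adapts this part and is not reproduced here.

```python
import torch

class SpikeFn(torch.autograd.Function):
    """Heaviside spike with a surrogate gradient for STBP training (sketch)."""

    @staticmethod
    def forward(ctx, membrane, threshold, alpha):
        ctx.save_for_backward(membrane)
        ctx.threshold, ctx.alpha = threshold, alpha
        return (membrane >= threshold).float()

    @staticmethod
    def backward(ctx, grad_output):
        (membrane,) = ctx.saved_tensors
        # Sigmoid-shaped surrogate derivative centered at the firing threshold.
        # A potential-dependent scheme like PDSG would adapt alpha (the sharpness)
        # to the statistics of `membrane` instead of fixing it per layer (assumption).
        sig = torch.sigmoid(ctx.alpha * (membrane - ctx.threshold))
        surrogate = ctx.alpha * sig * (1.0 - sig)
        return grad_output * surrogate, None, None

# Usage inside a leaky integrate-and-fire step (illustrative):
v = torch.randn(8, 128, requires_grad=True)   # membrane potentials
spikes = SpikeFn.apply(v, 1.0, 2.0)            # binary spikes, differentiable via surrogate
spikes.sum().backward()                        # gradients flow through the surrogate
```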
Poster
Ziqi Li · Tao Gao · Yisheng An · Ting Chen · Jing Zhang · Yuanbo Wen · Mengkun Liu · Qianxi Zhang
[ ExHall D ]
Abstract
Brain-inspired spiking neural networks (SNNs) have the capability of energy-efficient processing of temporal information. However, leveraging the rich dynamic characteristics of SNNs and prior work in artificial neural networks (ANNs) to construct an effective object detection model for visual tasks remains an open question for further exploration. To develop a directly trained, low-energy-consumption, and high-performance multi-scale SNN model, we propose a novel interpretable object detection framework, Multi-scale Spiking Detector (MSD). Initially, we propose a spiking convolutional neuron as a core component of the Optic Nerve Nucleus Block (ONNB), designed to significantly enhance the deep feature extraction capabilities of SNNs. ONNB enables direct training with improved energy efficiency, demonstrating superior performance compared to state-of-the-art ANN-to-SNN conversion and SNN techniques. In addition, we propose a Multi-scale Spiking Detection Framework to emulate the biological response to and comprehension of stimuli from different objects. Within it, spiking multi-scale fusion and the spiking detector are employed to integrate features across different depths and to detect response outcomes, respectively. Our method outperforms state-of-the-art ANN detectors, with only 7.8 M parameters and 6.43 mJ energy consumption. MSD obtains mean average precision (mAP) of 62.0% and 66.3% on the COCO and Gen1 datasets, respectively.
Poster
Tian Gao · Yu Zhang · Zhiyuan Zhang · Huajun Liu · Kaijie Yin · Cheng-Zhong Xu · Hui Kong
[ ExHall D ]
Abstract
Model binarization has made significant progress in enabling real-time and energy-efficient computation for convolutional neural networks (CNNs), offering a potential solution to the deployment challenges faced by Vision Transformers (ViTs) on edge devices. However, due to the structural differences between CNN and Transformer architectures, simply applying binary CNN strategies to ViT models leads to a significant performance drop. To tackle this challenge, we propose BHViT, a binarization-friendly hybrid ViT architecture, and its fully binarized model, guided by three important observations. Initially, BHViT utilizes the local information interaction and hierarchical feature aggregation technique from coarse to fine levels to address redundant computations stemming from excessive tokens. Then, a novel module based on shift operations is proposed to enhance the performance of the binary Multilayer Perceptron (MLP) module without significantly increasing computational overhead. In addition, an innovative attention matrix binarization method based on quantization decomposition is proposed to evaluate token importance in the binarized attention matrix. Finally, we propose a regularization loss to address the inadequate optimization caused by the incompatibility between weight oscillation in the binary layers and the Adam optimizer. Extensive experimental results demonstrate that our proposed algorithm achieves SOTA performance among binary ViT …
Poster
Zhenyu Cui · Jiahuan Zhou · Yuxin Peng
[ ExHall D ]
Abstract
Lifelong person re-identification (LReID) aims to match the same person using sequentially collected data. However, due to the long-term nature of lifelong learning, the inevitable changes in human clothes prevent the model from relying on unified discriminative information (e.g., clothing style) to match the same person in the streaming data, demanding differentiated cloth-irrelevant information. Unfortunately, existing LReID methods typically fail to leverage such knowledge, exacerbating catastrophic forgetting issues. Therefore, in this paper, we focus on a challenging practical task called Cloth-Hybrid Lifelong Person Re-identification (CH-LReID), which requires matching the same person wearing different clothes using sequentially collected data. A Differentiated Knowledge Consolidation (DKC) framework is designed to unify and balance distinct knowledge across streaming data. The core idea is to adaptively balance differentiated knowledge and compatibly consolidate cloth-relevant and cloth-irrelevant information. To this end, a Differentiated Knowledge Transfer (DKT) module and a Latent Knowledge Consolidation (LKC) module are designed to adaptively discover differentiated new knowledge, while eliminating the derived domain shift of old knowledge via reconstructing the old latent feature space, respectively. Then, to further alleviate the catastrophic conflict between differentiated new and old knowledge, we further propose a Dual-level Distribution Alignment (DDA) module to align …
Poster
Jingwei Zhang · Anh Tien Nguyen · Xi Han · Vincent Quoc-Huy Trinh · Hong Qin · Dimitris Samaras · Mahdi Hosseini
[ ExHall D ]
Abstract
Efficiently modeling large 2D contexts is essential for various fields including Giga-Pixel Whole Slide Imaging (WSI) and remote sensing. Transformer-based models offer high parallelism but face challenges due to their quadratic complexity for handling long sequences. Recently, Mamba introduced a selective State Space Model (SSM) with linear complexity and high parallelism, enabling effective and efficient modeling of wide context in 1D sequences. However, extending Mamba to vision tasks, which inherently involve 2D structures, results in spatial discrepancies due to the limitations of 1D sequence processing. On the other hand, current 2D SSMs inherently model 2D structures but suffer from prohibitively slow computation due to the lack of efficient parallel algorithms. In this work, we propose 2DMamba, a novel 2D selective SSM framework that incorporates the 2D spatial structure of images into Mamba, with a highly optimized hardware-aware operator, adopting both spatial continuity and computational efficiency. We validate the versatility of our approach on both WSIs and natural images. Extensive experiments on 10 public datasets for WSI classification and survival analysis show that 2DMamba improves AUC by up to 2.48%, F1 score by 3.11%, accuracy by 2.47%, and C-index by 5.52%. Additionally, integrating our method with VMamba for natural imaging …
Poster
Jeffri Erwin Murrugarra Llerena · José Henrique Marques · Claudio Jung
[ ExHall D ]
Abstract
Oriented Object Detection (OOD) has received increased attention in the past years, being a suitable solution for detecting elongated objects in remote sensing analysis. In particular, using regression loss functions based on Gaussian distributions has become attractive since they yield simple and differentiable terms. However, existing solutions are still based on regression heads that produce Oriented Bounding Boxes (OBBs), and the known problem of angular boundary discontinuity persists. In this work, we propose a regression head for OOD that directly produces Gaussian distributions based on the Cholesky matrix decomposition. The proposed head, named GauCho, theoretically mitigates the boundary discontinuity problem and is fully compatible with recent Gaussian-based regression loss functions. Furthermore, we advocate using Oriented Ellipses (OEs) to represent oriented objects, which relate to GauCho through a bijective function and alleviate the encoding ambiguity problem for circular objects. Our experimental results show that GauCho can be a viable alternative to the traditional OBB head, achieving results comparable to or better than state-of-the-art detectors for the challenging DOTA dataset. Our code will be available at \url{omitted-for-blind-review}.
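A minimal sketch of a Cholesky-parameterized Gaussian regression head, the construction the abstract describes, is shown below: the network predicts a mean and the entries of a lower-triangular factor L, and the covariance L Lᵀ is positive definite by construction. Layer sizes and the softplus stabilizer are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CholeskyGaussianHead(nn.Module):
    """Predicts a 2D Gaussian (mean, covariance) per proposal via Cholesky factors (sketch)."""

    def __init__(self, in_dim=256):
        super().__init__()
        # 2 outputs for the mean, 3 for the lower-triangular factor L = [[l11, 0], [l21, l22]].
        self.fc = nn.Linear(in_dim, 5)

    def forward(self, feats):
        out = self.fc(feats)
        mu = out[..., :2]                                  # object center
        l11 = F.softplus(out[..., 2]) + 1e-4               # positive diagonal -> valid covariance
        l21 = out[..., 3]
        l22 = F.softplus(out[..., 4]) + 1e-4
        zeros = torch.zeros_like(l11)
        L = torch.stack([
            torch.stack([l11, zeros], dim=-1),
            torch.stack([l21, l22], dim=-1),
        ], dim=-2)                                          # (..., 2, 2) lower-triangular
        sigma = L @ L.transpose(-1, -2)                     # symmetric positive definite by construction
        return mu, sigma

# The Gaussian maps to an oriented ellipse: the eigenvectors of sigma give the axes'
# directions and the square roots of its eigenvalues give their lengths, which avoids
# the angular boundary discontinuity of (w, h, theta) box encodings.
head = CholeskyGaussianHead()
mu, sigma = head(torch.randn(4, 256))   # 4 proposals -> 4 means and 4 covariances
```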
Poster
Biplab Das · Viswanath Gopalakrishnan
[ ExHall D ]
Abstract
In this work, we introduce Camouflage Anything, a novel and robust approach to generate camouflaged datasets. To the best of our knowledge, we are the first to apply Controlled Out-painting and Representation Engineering (CO + RE) for generating realistic camouflaged images, with the objective of hiding any segmented object coming from a generic or salient database. Our proposed method uses a novel control design to out-paint a given segmented object with a camouflaged background. We also uncover the role of representation engineering in enhancing the quality of generated camouflage datasets. We address the limitations of the existing metrics FID and KID in capturing 'camouflage quality' by proposing a novel metric, namely CamOT. CamOT uses Optimal Transport between the foreground and background (boundary) Gaussian Mixture Models (GMMs) of the camouflaged object to assign an image quality score. Furthermore, we conduct LoRA-based fine-tuning of the robust BiRefNet baseline with our generated camouflaged datasets, leading to notable improvements in camouflaged object segmentation accuracy. The experimental results showcase the efficacy and potential of Camouflage Anything, outperforming existing methods in camouflaged generation tasks.
Poster
Datao Tang · Xiangyong Cao · Xuan Wu · Jialin Li · Jing Yao · Xueru Bai · Dongsheng Jiang · Yin Li · Deyu Meng
[ ExHall D ]
Abstract
Remote sensing image object detection (RSIOD) aims to identify and locate specific objects within satellite or aerial imagery. However, there is a scarcity of labeled data in current RSIOD datasets, which significantly limits the performance of current detection algorithms. Although existing techniques, e.g., data augmentation and semi-supervised learning, can mitigate this scarcity issue to some extent, they are heavily dependent on high-quality labeled data and perform worse in rare object classes. To address this issue, this paper proposes a layout-controllable diffusion generative model (i.e., AeroGen) tailored for RSIOD. To our knowledge, AeroGen is the first model to simultaneously support horizontal and rotated bounding box condition generation, thus enabling the generation of high-quality synthetic images that meet specific layout and object category requirements. Additionally, we propose an end-to-end data augmentation framework that integrates a diversity-conditioned generator and a filtering mechanism to enhance both the diversity and quality of generated data. Experimental results demonstrate that the synthetic data produced by our method are of high quality and diversity. Furthermore, the synthetic RSIOD data can significantly improve the detection performance of existing RSIOD models, i.e., the mAP metrics on DIOR, DIOR-R, and HRSC datasets are improved by 3.7%, 4.3%, and 2.43%, respectively.
Poster
Zhe Shan · Yang Liu · Lei Zhou · Cheng Yan · Heng Wang · Xia Xie
[ ExHall D ]
Abstract
The availability of large-scale remote sensing video data underscores the importance of high-quality interactive segmentation. However, challenges such as small object sizes, ambiguous features, and limited generalization make it difficult for current methods to achieve this goal. In this work, we propose RSVOS-SAM, a method designed to achieve high-quality interactive masks while preserving generalization across diverse remote sensing data. The RSVOS-SAM is built upon three key innovations: 1) LoRA-based fine-tuning, which enables efficient domain adaptation while maintaining SAM’s generalization ability, 2) Enhancement of deep network layers to improve the discriminability of extracted features, thereby reducing misclassifications, and 3) Integration of global context with local boundary details in the mask decoder to generate high-quality segmentation masks. Additionally, we redesign the data pipeline to ensure the model learns to better handle objects at varying scales during training while focusing on high-quality predictions during inference. Experiments on remote sensing video datasets show that the redesigned data pipeline boosts the IoU by 6%, while RSVOS-SAM increases the IoU by 13%. Finally, when evaluated on existing remote sensing object tracking datasets, RSVOS-SAM demonstrates impressive zero-shot capabilities, generating masks that closely resemble manual annotations. These results confirm RSVOS-SAM as a powerful tool for fine-grained segmentation in …
Poster
Phuc Nguyen · Minh Luu · Anh Tran · Cuong Pham · Khoi Nguyen
[ ExHall D ]
Abstract
Existing 3D instance segmentation methods frequently encounter issues with over-segmentation, leading to redundant and inaccurate 3D proposals that complicate downstream tasks. This challenge arises from their unsupervised merging approach, where dense 2D instance masks are lifted across frames into point clouds to form 3D candidate proposals without direct supervision. These candidates are then hierarchically merged based on heuristic criteria, often resulting in numerous redundant segments that fail to combine into precise 3D proposals. To overcome these limitations, we propose a 3D-Aware 2D Mask Tracking module that uses robust 3D priors from a 2D mask segmentation and tracking foundation model (SAM-2) to ensure consistent object masks across video frames. Rather than merging all visible superpoints across views to create a 3D mask, our 3D Mask Optimization module leverages a dynamic programming algorithm to select an optimal set of views, refining the superpoints to produce a final 3D proposal for each object. Our approach achieves comprehensive object coverage within the scene while reducing unnecessary proposals, which could otherwise impair downstream applications. Evaluations on ScanNet200 and ScanNet++ confirm the effectiveness of our method, with improvements across Class-Agnostic, Open-Vocabulary, and Open-Ended 3D Instance Segmentation tasks.
Poster
Joey Wilson · Marcelino M. de Almeida · Sachit Mahajan · Martin Labrie · Maani Ghaffari · Omid Ghasemalizadeh · Min Sun · Cheng-Hao Kuo · Arnab Sen
[ ExHall D ]
Abstract
In this paper, we present a novel algorithm for quantifying uncertainty and information gained within 3D Gaussian Splatting (3D-GS) through P-Optimality. While 3D-GS has proven to be a useful world model with high-quality rasterizations, it does not natively quantify uncertainty. Quantifying uncertainty in parameters of 3D-GS is necessary to understand the information gained from acquiring new images as in active perception, or identify redundant images which can be removed from memory due to resource constraints in online 3D-GS SLAM. We propose to quantify uncertainty and information gain in 3D-GS by reformulating the problem through the lens of optimal experimental design, which is a classical solution to measuring information gain. By restructuring information quantification of 3D-GS through optimal experimental design, we arrive at multiple solutions, of which T-Optimality and D-Optimality perform the best quantitatively and qualitatively as measured on two popular datasets. Additionally, we propose a block diagonal approximation of the 3D-GS uncertainty, which provides a measure of correlation for computing more accurate information gain, at the expense of a greater computation cost.
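For reference, the T- and D-Optimality criteria from optimal experimental design can be sketched as follows: a candidate view is scored by how much the trace (T) or log-determinant (D) of the accumulated information matrix grows when its per-view information is added. The per-view Fisher information inputs here are hypothetical stand-ins for the quantities the paper derives from 3D-GS.

```python
import numpy as np

def t_optimality(info_matrix):
    """T-Optimality: total information measured as the trace of the information matrix."""
    return float(np.trace(info_matrix))

def d_optimality(info_matrix, eps=1e-8):
    """D-Optimality: log-determinant, i.e. the log-volume of the confidence ellipsoid."""
    sign, logdet = np.linalg.slogdet(info_matrix + eps * np.eye(info_matrix.shape[0]))
    return float(logdet)

def rank_views(prior_info, candidate_infos, criterion=d_optimality):
    """Rank candidate views by how much information they add to the current prior.

    prior_info:      (P, P) accumulated information matrix of the scene parameters.
    candidate_infos: list of (P, P) per-view information matrices (hypothetical inputs,
                     e.g. J^T J from each view's rendering Jacobian).
    """
    gains = [criterion(prior_info + Hi) - criterion(prior_info) for Hi in candidate_infos]
    return np.argsort(gains)[::-1], gains   # most informative next view first

# Toy usage: three candidate views over a 6-parameter block.
rng = np.random.default_rng(0)
prior = np.eye(6)
cands = [(lambda A: A @ A.T)(rng.normal(size=(6, 6))) for _ in range(3)]
order, gains = rank_views(prior, cands)
```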
Poster
Runsong Zhu · Shi Qiu · ZHENGZHE LIU · Ka-Hei Hui · Qianyi Wu · Pheng-Ann Heng · Chi-Wing Fu
[ ExHall D ]
Abstract
Lifting multi-view 2D instance segmentation to a radiance field has proven to be effective to enhance 3D understanding. Existing methods for addressing multi-view inconsistency in 2D segmentations either use linear assignment for end-to-end lifting, resulting in inferior results, or adopt a two-stage solution, which is limited by complex pre- or post-processing. In this work, we design a new object-aware lifting approach to realize an end-to-end lifting pipeline based on the 3D Gaussian representation, such that we can jointly learn Gaussian-level point features and a global object-level codebook across multiple views. To start, we augment each Gaussian point with an additional Gaussian-level feature learned using a contrastive loss to encode instance information. Importantly, we introduce a learnable *object-level codebook* to account for individual objects in the scene for an explicit object-level understanding and associate the encoded object-level features with the Gaussian-level point features for segmentation predictions. Further, we formulate the association learning module and the noisy label filtering module for effective and robust codebook learning. We conduct experiments on three benchmarks: LERF-Masked, Replica, and Messy Rooms datasets. Both qualitative and quantitative results manifest that our approach clearly outperforms existing methods in terms of segmentation quality and time efficiency.
Poster
Wenxuan Guo · Xiuwei Xu · Ziwei Wang · Jianjiang Feng · Jie Zhou · Jiwen Lu
[ ExHall D ]
Abstract
In this paper, we propose an efficient multi-level convolution architecture for 3D visual grounding. Conventional methods struggle to meet the requirements of real-time inference due to their two-stage or point-based architectures. Inspired by the success of multi-level fully sparse convolutional architectures in 3D object detection, we aim to build a new 3D visual grounding framework following this technical route. However, since in the 3D visual grounding task the 3D scene representation must interact deeply with text features, a sparse convolution-based architecture is inefficient for this interaction due to the large number of voxel features. To this end, we propose text-guided pruning (TGP) and completion-based addition (CBA) to deeply fuse the 3D scene representation and text features in an efficient way through gradual region pruning and target completion. Specifically, TGP iteratively sparsifies the 3D scene representation and thus efficiently interacts the voxel features with text features via cross-attention. To mitigate the effect of pruning on delicate geometric information, CBA adaptively fixes the over-pruned region by voxel completion with negligible computational overhead. Compared with previous single-stage methods, our method achieves top inference speed and surpasses the previous fastest method by 100\% in FPS. Our method also achieves state-of-the-art accuracy even compared with two-stage methods, with …
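The text-guided pruning idea can be pictured with a small, hedged sketch: score each voxel by its cross-attention against the text tokens and retain only the top fraction. The scoring rule, `keep_ratio`, and tensor shapes below are assumptions for illustration rather than the paper's exact design.

```python
import torch
import torch.nn.functional as F

def text_guided_prune(voxel_feats, text_feats, keep_ratio=0.5):
    # voxel_feats: (N, D) sparse voxel features; text_feats: (T, D) text tokens.
    # Attention of each text token over voxels, summed per voxel, is used
    # as a relevance score; only the top keep_ratio fraction survives.
    attn = F.softmax(voxel_feats @ text_feats.t() / voxel_feats.shape[-1] ** 0.5, dim=0)
    relevance = attn.sum(dim=-1)                        # (N,)
    k = max(1, int(keep_ratio * voxel_feats.shape[0]))
    keep_idx = relevance.topk(k).indices
    return voxel_feats[keep_idx], keep_idx

kept_voxels, kept_idx = text_guided_prune(torch.randn(4096, 128), torch.randn(12, 128))
```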
Poster
Yijie Tang · Jiazhao Zhang · Yuqing Lan · Yulan Guo · Dezun Dong · Chenyang Zhu · Kai Xu
[ ExHall D ]
Abstract
Online 3D open-vocabulary segmentation of a progressively reconstructed scene is both a critical and challenging task for embodied applications. With the success of visual foundation models (VFMs) in the image domain, leveraging 2D priors to address 3D online segmentation has become a prominent research focus. Since segmentation results provided by 2D priors often require spatial consistency to be lifted into a final 3D segmentation, an efficient method for identifying spatial overlap among 2D masks is essential, yet existing methods rarely achieve this in real time, largely restricting such lifting to offline approaches. To address this, we propose an efficient method that lifts 2D masks generated by VFMs into a unified 3D instance using a hashing technique. By employing voxel hashing for efficient 3D scene querying, our approach reduces the time complexity of costly spatial overlap queries from O(n²) to O(n). Accurate spatial associations further enable 3D merging of 2D masks through simple similarity-based filtering in a zero-shot manner, making our approach more robust to incomplete and noisy data. Evaluated on the ScanNet and SceneNN benchmarks, our approach achieves state-of-the-art performance in online, open-vocabulary 3D instance segmentation with leading efficiency.
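A minimal sketch of the voxel-hashing idea, assuming masks are given as 3D point sets: each point is quantized to a voxel key and inserted into a hash table in one pass, so masks that share voxels are discovered without pairwise comparison. The `voxel_size` and input format are illustrative assumptions.

```python
from collections import defaultdict

def voxel_key(point, voxel_size=0.05):
    # Quantize a 3D point to its voxel index (the hash key).
    return tuple(int(c // voxel_size) for c in point)

def overlapping_masks(mask_points):
    # mask_points: mask_id -> list of (x, y, z) points (assumed format).
    # One pass builds a voxel -> {mask ids} table, so overlapping mask
    # pairs are read off shared voxels instead of compared pairwise.
    table = defaultdict(set)
    for mask_id, pts in mask_points.items():
        for p in pts:
            table[voxel_key(p)].add(mask_id)
    overlaps = defaultdict(int)            # (id_a, id_b) -> shared voxel count
    for ids in table.values():
        ids = sorted(ids)
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                overlaps[(ids[i], ids[j])] += 1
    return overlaps

print(overlapping_masks({0: [(0.01, 0.02, 0.0)], 1: [(0.03, 0.01, 0.01)], 2: [(1.0, 1.0, 1.0)]}))
```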
Poster
Muzhi Zhu · Yuzhuo Tian · Hao Chen · Chunluan Zhou · Qingpei Guo · Yang Liu · Ming Yang · Chunhua Shen
[ ExHall D ]
Abstract
While MLLMs have demonstrated adequate image understanding capabilities, they still struggle with pixel-level comprehension, limiting their practical applications. Current evaluation tasks like VQA and visual grounding remain too coarse to assess fine-grained pixel comprehension accurately. Though segmentation is foundational for pixel-level understanding, existing methods often require MLLMs to generate implicit tokens, decoded through external pixel decoders. This approach disrupts the MLLM’s text output space, potentially compromising language capabilities and reducing flexibility and extensibility, while failing to reflect the model’s intrinsic pixel-level understanding. Thus, we introduce the Human-Like Mask Annotation Task (HLMAT), a new paradigm where MLLMs mimic human annotators using interactive segmentation tools. Modeling segmentation as a multi-step Markov Decision Process, HLMAT enables MLLMs to iteratively generate text-based click points, achieving high-quality masks without architectural changes or implicit tokens. Through this setup, we develop SegAgent, a model fine-tuned on human-like annotation trajectories, which achieves performance comparable to SOTA methods and supports additional tasks like mask refinement and annotation filtering. HLMAT provides a protocol for assessing fine-grained pixel understanding in MLLMs and introduces a vision-centric, multi-step decision-making task that facilitates exploration of MLLMs’ visual reasoning abilities. Our adaptations of the policy improvement method StaR and PRM-guided tree search further enhance model robustness in …
Poster
Savya Khosla · Sethuraman T V · Alexander G. Schwing · Derek Hoiem
[ ExHall D ]
Abstract
We present RELOCATE, a simple training-free baseline designed to perform the challenging task of visual query localization in long videos. To eliminate the need for task-specific training and efficiently handle long videos, RELOCATE leverages a region-based representation derived from pretrained vision models. At a high level, it follows the classic object localization approach: (1) identify all objects in each video frame, (2) compare the objects with the given query and select the most similar ones, and (3) perform bidirectional tracking to get a spatio-temporal response. However, we propose some key enhancements to handle small objects, cluttered scenes, partial visibility, and varying appearances. Notably, we refine the selected objects for accurate localization and generate additional visual queries to capture visual variations. We evaluate RELOCATE on the challenging Ego4D Visual Query 2D Localization dataset, establishing a new baseline that outperforms prior task-specific methods by 49% (relative improvement) in spatio-temporal average precision.
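Step (2) of this pipeline reduces to a similarity search between region embeddings and the query embedding. The sketch below is a simplified stand-in (cosine similarity with a frozen backbone assumed), not RELOCATE's exact scoring.

```python
import numpy as np

def select_query_regions(region_embs, query_emb, top_k=3):
    # Cosine similarity between every candidate region and the visual
    # query; returns the indices and scores of the top_k matches.
    r = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    sims = r @ q
    order = np.argsort(-sims)[:top_k]
    return order, sims[order]

regions = np.random.randn(20, 256)   # embeddings of detected regions in one frame
query = np.random.randn(256)         # embedding of the visual query crop
idx, scores = select_query_regions(regions, query)
```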
Poster
Rong Li · Shijie Li · Lingdong Kong · Xulei Yang · Junwei Liang
[ ExHall D ]
Abstract
3D Visual Grounding (3DVG) aims to locate objects in 3D scenes based on textual descriptions, which is essential for applications like augmented reality and robotics. Traditional 3DVG approaches rely on annotated 3D datasets and predefined object categories, limiting scalability and adaptability. To overcome these limitations, we introduce SeeGround, a zero-shot 3DVG framework leveraging 2D Vision-Language Models (VLMs) trained on large-scale 2D data. SeeGround represents 3D scenes as a hybrid of query-aligned rendered images and spatially enriched text descriptions, bridging the gap between 3D data and the input formats of 2D VLMs. We propose two modules: the Perspective Adaptation Module, which dynamically selects viewpoints for query-relevant image rendering, and the Fusion Alignment Module, which integrates 2D images with 3D spatial descriptions to enhance object localization. Extensive experiments on ScanRefer and Nr3D demonstrate that our approach outperforms existing zero-shot methods by large margins. Notably, we exceed weakly supervised methods and rival some fully supervised ones, outperforming the previous SOTA by 7.7\% on ScanRefer and 7.1\% on Nr3D, showcasing its effectiveness in complex 3DVG tasks.
Poster
Zhenyang Liu · Yikai Wang · Sixiao Zheng · Tongying Pan · Longfei Liang · Yanwei Fu · Xiangyang Xue
[ ExHall D ]
Abstract
Open-vocabulary 3D visual grounding and reasoning aim to localize objects in a scene based on implicit language descriptions, even when they are occluded. This ability is crucial for tasks such as vision-language navigation and autonomous robotics. However, current methods struggle because they rely heavily on fine-tuning with 3D annotations and mask proposals, which limits their ability to handle diverse semantics and common knowledge required for effective reasoning. To address this, we propose ReasonGrounder, an LVLM-guided framework that uses hierarchical 3D feature Gaussian fields for adaptive grouping based on physical scale, enabling open-vocabulary 3D grounding and reasoning. ReasonGrounder interprets implicit instructions using large vision-language models (LVLM) and localizes occluded objects through 3D Gaussian splatting. By incorporating 2D segmentation masks from the Segment Anything Model (SAM) and multi-view CLIP embeddings, ReasonGrounder selects Gaussian groups based on object scale, enabling accurate localization through both explicit and implicit language understanding, even in novel, occluded views. We also contribute ReasoningGD, a new dataset containing over 10K scenes and 2 million annotations for evaluating open-vocabulary 3D grounding and amodal perception under occlusion. Experiments show that ReasonGrounder significantly improves 3D grounding accuracy in real-world scenarios.
Poster
Ziyang Zhou · Pinghui Wang · Zi Liang · Haitao Bai · Ruofei Zhang
[ ExHall D ]
Abstract
The advancement of 3D understanding and representation is a crucial step for the next phase of autonomous driving, robotics, augmented and virtual reality, 3D gaming and 3D e-commerce products. However, existing research has primarily focused on point clouds to perceive 3D objects and scenes, overlooking the rich visual details offered by multi-view images, thereby limiting the potential of 3D semantic representation. This paper introduces OpenView, a novel representation method that integrates both point clouds and multi-view images to form a unified 3D representation. OpenView comprises a unique fusion framework, sequence-independent modeling, a cross-modal fusion encoder, and a progressive hard learning strategy. Our experiments demonstrate that OpenView outperforms the state-of-the-art by 11.5% and 5.5% on the R@1 metric for cross-modal retrieval and the Top-1 metric for zero-shot classification tasks, respectively. Furthermore, we showcase some applications of OpenView: 3D retrieval, 3D captioning and hierarchical data clustering, highlighting its generality in the field of 3D representation learning.
Poster
Austin Stone · Hagen Soltau · Robert Geirhos · Xi Yi · Ye Xia · Bingyi Cao · Kaifeng Chen · Abhijit Ogale · Jonathon Shlens
[ ExHall D ]
Abstract
Visual imagery does not consist of solitary objects, but instead reflects the composition of a multitude of fluid concepts. While there have been great advances in visual representation learning, such advances have focused on building better representations for a small number of discrete objects bereft of an understanding of how these objects are interacting. One can observe this limitation in representations learned through captions or contrastive learning -- where the learned model treats an image essentially as a bag of words. Several works have attempted to address this limitation through the development of bespoke learned architectures to directly address the shortcomings in compositional learning. In this work, we focus on simple, and scalable approaches. In particular, we demonstrate that by substantially improving weakly labeled data, i.e. captions, we can vastly improve the performance of standard contrastive learning approaches. Previous CLIP models achieved near chance rate on challenging tasks probing compositional learning. However, our simple approach boosts performance of CLIP substantially and surpasses all bespoke architectures. Furthermore, we showcase our results on a relatively new captioning benchmark derived from DOCCI. We demonstrate through a series of ablations that a standard CLIP model trained with enhanced data may demonstrate impressive performance on …
Poster
Keyu Guo · Yongle Huang · Shijie Sun · Xiangyu Song · Mingtao Feng · Zedong Liu · Huansheng Song · Tiantian Wang · Jianxin Li · Naveed Akhtar · Ajmal Mian
[ ExHall D ]
Abstract
Language and binocular vision play a crucial role in human understanding of the world. Advancements in artificial intelligence have also made it possible for machines to develop 3D perception capabilities essential for high-level scene understanding. However, only monocular cameras are often available in practice due to cost and space constraints. Enabling machines to achieve accurate 3D understanding from a monocular view is practical but presents significant challenges. We introduce MonoMulti-3DVG, a novel task aimed at achieving multi-object 3D Visual Grounding (3DVG) based on monocular RGB images, allowing machines to better understand and interact with the 3D world. To this end, we construct a large-scale benchmark dataset, MonoMulti3D-ROPE, and propose a model, CyclopsNet that integrates a State-Prompt Visual Encoder (SPVE) module with a Denoising Alignment Fusion (DAF) module to achieve robust multi-modal semantic alignment and fusion. This leads to more stable and robust multi-modal joint representations for downstream tasks. Experimental results show that our method significantly outperforms existing techniques on the MonoMulti3D-ROPE dataset. Our dataset and code are available in Supplementary Material to ease reproducibility.
Poster
Hongyan Zhi · Peihao Chen · Junyan Li · Shuailei Ma · Xinyu Sun · Tianhang Xiang · Yinjie Lei · Mingkui Tan · Chuang Gan
[ ExHall D ]
Abstract
Research on 3D Vision-Language Models (3D-VLMs) is gaining increasing attention, as such models are crucial for developing embodied AI within 3D scenes, e.g., for visual navigation and embodied question answering. Due to the high density of visual features, especially in large 3D scenes, accurately locating task-relevant visual information is challenging. Existing works attempt to segment all objects and consider their features as scene representations. However, these task-agnostic object features include much redundant information while missing details of the task-relevant area. To tackle these problems, we propose LSceneLLM, an adaptive framework that automatically identifies task-relevant areas by leveraging the LLM's visual preference for different tasks, followed by a plug-and-play scene magnifier module to capture fine-grained details in focused areas. Specifically, a dense token selector examines the attention map of the LLM to identify visual preferences for the instruction input. It then magnifies fine-grained details of the focused area. An adaptive self-attention module is leveraged to fuse the coarse-grained and selected fine-grained visual information. To comprehensively evaluate the large scene understanding ability of 3D-VLMs, we further introduce a cross-room understanding benchmark, XR-Scene, which contains a series of large scene understanding tasks including XR-QA, XR-EmbodiedPlanning, and XR-SceneCaption. Experiments show that our method surpasses existing methods on both large …
Poster
Jiajun Deng · Tianyu He · Li Jiang · Tianyu Wang · Feras Dayoub · Ian Reid
[ ExHall D ]
Abstract
Current 3D Large Multimodal Models (3D LMMs) have shown tremendous potential in 3D-vision-based dialogue and reasoning. However, how to further enhance 3D LMMs to achieve fine-grained scene understanding and facilitate flexible human-agent interaction remains a challenging problem. In this work, we introduce \textbf{3D-LLaVA}, a simple yet highly powerful 3D LMM designed to act as an intelligent assistant in comprehending, reasoning, and interacting with the 3D world. Unlike existing top-performing methods that rely on complicated pipelines—such as offline multi-view feature extraction or additional task-specific heads—3D-LLaVA adopts a minimalist design with integrated architecture and only takes point clouds as input. At the core of 3D-LLaVA is a new Omni Superpoint Transformer (OST), which integrates three functionalities: (1) a \textbf{visual feature selector} that converts and selects visual tokens, (2) a \textbf{visual prompt encoder} that embeds interactive visual prompts into the visual token space, and (3) a \textbf{referring mask decoder} that produces 3D masks based on text description. This versatile OST is empowered by the hybrid pretraining to obtain perception priors and leveraged as the visual connector that bridges the 3D data to the LLM. After performing unified instruction tuning, our 3D-LLaVA reports impressive results on various benchmarks. The code and model will be …
Poster
Benlin Liu · Yuhao Dong · Yiqin Wang · Zixian Ma · Yansong Tang · Luming Tang · Yongming Rao · Wei-Chiu Ma · Ranjay Krishna
[ ExHall D ]
Abstract
Multimodal language models (MLLMs) are increasingly being applied in real-world environments, necessitating their ability to interpret 3D spaces and comprehend temporal dynamics. Current methods often rely on specialized architectural designs or task-specific fine-tuning to achieve this. We introduce Coarse Correspondences, a simple lightweight method that enhances MLLMs’ spatial-temporal reasoning with 2D images as input, without modifying the architecture or requiring task-specific fine-tuning. Our method uses a lightweight tracking model to identify primary object correspondences between frames in a video or across different image viewpoints, and then conveys this information to MLLMs through visual prompting. We demonstrate that this simple training-free approach brings substantial gains to GPT4-V/O consistently on four benchmarks that require spatial-temporal reasoning, including +20.5% improvement on ScanQA, +9.7% on OpenEQA’s episodic memory subset, +6.0% on the long-form video benchmark EgoSchema, and +11% on the R2R navigation benchmark. Additionally, we show that Coarse Correspondences can also enhance open-source MLLMs’ spatial reasoning (by +6.9% on ScanQA) when applied in both training and inference and that the improvement can generalize to unseen datasets such as SQA3D (+3.1%). Taken together, we show that Coarse Correspondences effectively and efficiently boosts models’ performance on downstream tasks requiring spatial-temporal reasoning.
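A hedged sketch of the general idea: run an off-the-shelf tracker over sampled frames, keep only the most frequently appearing track IDs, and return the per-frame marks to overlay before prompting the MLLM. The input format and `top_k` choice are assumptions, not the paper's specification.

```python
from collections import Counter

def coarse_correspondences(frame_tracks, top_k=3):
    # frame_tracks: frame_idx -> list of (track_id, box) from any tracker.
    # Keep the top_k most frequently seen track ids and return, per frame,
    # the (track_id, box) marks to overlay before prompting the MLLM.
    counts = Counter(tid for tracks in frame_tracks.values() for tid, _ in tracks)
    primary = {tid for tid, _ in counts.most_common(top_k)}
    return {
        frame: [(tid, box) for tid, box in tracks if tid in primary]
        for frame, tracks in frame_tracks.items()
    }

marks = coarse_correspondences({
    0: [(1, (10, 10, 50, 50)), (2, (60, 60, 90, 90))],
    1: [(1, (12, 11, 52, 51)), (3, (5, 5, 20, 20))],
})
```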
Poster
Efstathios Karypidis · Ioannis Kakogeorgiou · Spyros Gidaris · Nikos Komodakis
[ ExHall D ]
Abstract
Semantic future prediction is important for autonomous systems navigating dynamic environments. This paper introduces FUTURIST, a method for multimodal future semantic prediction that uses a unified and efficient visual sequence transformer architecture. Our approach incorporates a multimodal masked visual modeling objective and a novel masking mechanism designed for multimodal training. This allows the model to effectively integrate visible information from various modalities, improving prediction accuracy. Additionally, we propose a VAE-free hierarchical tokenization process, which reduces computational complexity, streamlines the training pipeline, and enables end-to-end training with high-resolution, multimodal inputs. We validate FUTURIST on the Cityscapes dataset, demonstrating state-of-the-art performance in future semantic segmentation for both short- and mid-term forecasting.
Poster
Weiming Ren · Huan Yang · Jie Min · Cong Wei · Wenhu Chen
[ ExHall D ]
Abstract
Current large multimodal models (LMMs) face significant challenges in processing and comprehending long-duration or high-resolution videos, which is mainly due to the lack of high-quality datasets. To address this issue from a data-centric perspective, we propose VISTA, a simple yet effective video spatiotemporal augmentation framework that synthesizes long-duration and high-resolution video instruction-following pairs from existing video-caption datasets. VISTA spatially and temporally combines videos to create new synthetic videos with extended durations and enhanced resolutions, and subsequently produces question-answer pairs pertaining to these newly synthesized videos. Based on this paradigm, we develop seven video augmentation methods and curate VISTA-400K, a video instruction-following dataset aimed at enhancing long-duration and high-resolution video understanding. Finetuning various video LMMs on our data resulted in an average improvement of 3.3% across four challenging benchmarks for long-video understanding. Furthermore, we introduce the first comprehensive high-resolution video understanding benchmark HRVideoBench, on which our finetuned models achieve a 6.5% performance gain. These results highlight the effectiveness of our framework. Our dataset, code and finetuned models will be released.
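The spatial and temporal combination operations can be sketched in a few lines, assuming clips are stored as uint8 arrays of shape (T, H, W, C); the 2x2 grid layout and simple concatenation below are illustrative simplifications of the augmentation framework.

```python
import numpy as np

def temporal_combine(clips):
    # Concatenate clips along time to form a longer synthetic video.
    # Each clip: (T, H, W, C) array with matching H, W, C.
    return np.concatenate(clips, axis=0)

def spatial_combine(clips):
    # Tile four equally shaped clips into a 2x2 grid, doubling resolution.
    a, b, c, d = clips
    top = np.concatenate([a, b], axis=2)            # concat along width
    bottom = np.concatenate([c, d], axis=2)
    return np.concatenate([top, bottom], axis=1)    # concat along height

clips = [np.zeros((8, 64, 64, 3), dtype=np.uint8) for _ in range(4)]
longer = temporal_combine(clips[:2])   # (16, 64, 64, 3)
bigger = spatial_combine(clips)        # (8, 128, 128, 3)
```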
Poster
Haoqiang Kang · Enna Sachdeva · Piyush Gupta · Sangjae Bae · Kwonjoon Lee
[ ExHall D ]
Abstract
Vision-Language Models (VLMs) have recently shown promising advancements in sequential decision-making tasks through task-specific fine-tuning. However, common fine-tuning methods, such as Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) techniques like Proximal Policy Optimization (PPO), present notable limitations: SFT assumes Independent and Identically Distributed (IID) data, while PPO focuses on maximizing cumulative rewards. These limitations often restrict solution diversity and hinder generalization in multi-step reasoning tasks. To address these challenges, we introduce GFlowVLM, a novel framework that fine-tunes VLMs using Generative Flow Networks (GFlowNets) to promote the generation of diverse solutions for complex reasoning tasks. GFlowVLM models the environment as a non-Markovian decision process, allowing it to capture long-term dependencies essential for real-world applications. It takes observations and task descriptions as inputs to prompt chain-of-thought (CoT) reasoning, which subsequently guides action selection. We use task-based rewards to fine-tune the VLM with GFlowNets. This approach enables VLMs to outperform prior fine-tuning methods, including SFT and RL. Empirical results demonstrate the effectiveness of GFlowVLM on complex tasks such as card games (NumberLine, BlackJack) and embodied planning tasks (ALFWorld), showing enhanced training efficiency, solution diversity, and stronger generalization capabilities across both in-distribution and out-of-distribution scenarios.
Poster
Cheng Chen · Yunpeng Zhai · Yifan Zhao · Jinyang Gao · Bolin Ding · Jia Li
[ ExHall D ]
Abstract
In-context learning (ICL), a predominant trend in instruction learning, aims at enhancing the performance of large language models by providing clear task guidance and examples, improving their capability in task understanding and execution. This paper investigates ICL on Large Vision-Language Models (LVLMs) and explores the policies of multi-modal demonstration selection. Existing research efforts in ICL face significant challenges: first, they rely on pre-defined demonstrations or heuristic selection strategies based on human intuition, which are usually inadequate for covering diverse task requirements, leading to sub-optimal solutions; second, individually selecting each demonstration fails to model the interactions among them, resulting in information redundancy. Unlike these prevailing efforts, we propose a new exploration-exploitation reinforcement learning framework, which explores policies to fuse multi-modal information and adaptively select adequate demonstrations as an integrated whole. The framework allows LVLMs to optimize themselves by continually refining their demonstrations through self-exploration, enabling them to autonomously identify and generate the most effective selection policies for in-context learning. Experimental results verify that our approach achieves significant performance improvements on four Visual Question-Answering (VQA) datasets, demonstrating its effectiveness in enhancing the generalization capability of few-shot LVLMs. The code will be publicly available.
Poster
Mahtab Bigverdi · Zelun Luo · Cheng-Yu Hsieh · Ethan Shen · Dongping Chen · Linda Shapiro · Ranjay Krishna
[ ExHall D ]
Abstract
Multimodal language models (MLMs) continue to struggle with fundamental visual perception tasks that specialist models excel at: tasks that require reasoning over 3D structure benefit from depth estimation; similarly, reasoning over 2D object instances benefits from object detection. Yet, MLMs cannot produce intermediate depth maps or bounding boxes to reason over. Finetuning MLMs on relevant data does not generalize well, and outsourcing computation to specialized vision tools is too compute-intensive and memory-inefficient. To address this, we introduce Perception Tokens, intrinsic image representations designed to assist reasoning tasks where language is insufficient. Perception tokens act as auxiliary reasoning tokens, akin to chain-of-thought prompts in language models. For example, in a depth-related task, an MLM augmented with perception tokens can reason by generating a depth map as tokens, enabling it to solve the problem effectively. We propose Aurora, a training method that augments MLMs with perception tokens for improved reasoning over visual inputs. Aurora employs a VQVAE to transform intermediate image representations, such as depth maps or bounding boxes, into a tokenized format, which is then used in a multi-task training framework. Aurora achieves notable improvements across counting benchmarks: +10.8% on BLINK, +11.3% on CVBench, and +8.3% on SEED-Bench, outperforming finetuning approaches in generalization across datasets. It also improves …
Poster
Akhil Perincherry · Jacob Krantz · Stefan Lee
[ ExHall D ]
Abstract
Vision-and-Language Navigation (VLN) agents are tasked with navigating an unseen environment using natural language instructions. In this work, we study if visual representations of sub-goals implied by the instructions can serve as navigational cues and lead to increased navigation performance. To synthesize these visual representations or "imaginations", we leverage a text-to-image diffusion model on landmark references contained in segmented instructions. These imaginations are provided to VLN agents as an added modality to act as landmark cues, and an auxiliary loss is added to explicitly encourage relating these with their corresponding referring expressions. Our findings reveal an increase in success rate (SR) of ∼1 point and up to ∼0.5 points in success scaled by inverse path length (SPL) across agents. These results suggest that the proposed approach reinforces visual understanding compared to relying on language instructions alone.
Poster
Fan Yang · Ru Zhen · Jianing Wang · Yanhao Zhang · Haoxiang Chen · Haonan Lu · Sicheng Zhao · Guiguang Ding
[ ExHall D ]
Abstract
AIGC images are prevalent across various fields, yet they frequently suffer from quality issues like artifacts and unnatural textures. Specialized models aim to predict defect region heatmaps but face two primary challenges: (1) lack of explainability, failing to provide reasons and analyses for subtle defects, and (2) inability to leverage common sense and logical reasoning, leading to poor generalization. Multimodal large language models (MLLMs) promise better comprehension and reasoning but face their own challenges: (1) difficulty in fine-grained defect localization due to the limitations in capturing tiny details; and (2) constraints in providing pixel-wise outputs necessary for precise heatmap generation. To address these challenges, we propose HEIE: a novel MLLM-Based Hierarchical Explainable image Implausibility Evaluator. We introduce the CoT-Driven Explainable Trinity Evaluator, which integrates heatmaps, scores, and explanation outputs, using CoT to decompose complex tasks into subtasks of increasing difficulty and enhance interpretability. Our Adaptive Hierarchical Implausibility Mapper synergizes low-level image features with high-level mapper tokens from LLMs, enabling precise local-to-global hierarchical heatmap predictions through an uncertainty-based adaptive token approach. Moreover, we propose a new dataset: Expl-AIGI-Eval, designed to facilitate interpretable implausibility evaluation of AIGC images. Our method demonstrates state-of-the-art performance through extensive experiments. Our dataset and code are included in the …
Poster
Ailin Deng · Tri Cao · Zhirui Chen · Bryan Hooi
[ ExHall D ]
Abstract
Vision-Language Models (VLMs) excel in integrating visual and textual information for vision-centric tasks, but their handling of inconsistencies between modalities is underexplored. We investigate VLMs' modality preferences when faced with visual data and varied textual inputs in vision-centered settings. By introducing textual variations to four vision-centric tasks and evaluating ten VLMs, we discover a \emph{``blind faith in text''} phenomenon: VLMs disproportionately trust textual data over visual data when inconsistencies arise, leading to significant performance drops under corrupted text and raising safety concerns. We analyze factors influencing this text bias, including instruction prompts, language model size, text relevance, token order, and the interplay between visual and textual certainty. While certain factors, such as scaling up the language model size, slightly mitigate text bias, others like token order can exacerbate it due to positional biases inherited from language models. To address this issue, we explore supervised fine-tuning with text augmentation and demonstrate its effectiveness in reducing text bias. Additionally, we provide a theoretical analysis suggesting that the blind faith in text phenomenon may stem from an imbalance of pure text and multi-modal data during training. Our findings highlight the need for balanced training and careful consideration of modality interactions in VLMs to enhance …
Poster
Christopher Chou · Lisa Dunlap · Wei-Lin Chiang · Koki Mashita · Krishna Mandal · Trevor Darrell · Ion Stoica · Joseph Gonzalez
[ ExHall D ]
Abstract
The growing adoption and capabilities of vision-language models (VLMs) demand benchmarks that reflect real-world user interactions. We introduce VisionArena, the largest existing dataset of crowdsourced real-world conversations between users and VLMs. While most visual question-answering datasets focus on close-ended problems or synthetic scenarios, VisionArena contains a wide variety of closed and open ended problems across 230K conversations, 73K unique users, 138 languages, and 45 VLMs. VisionArena consists of VisionArena-Chat, a set of 200k single-turn and multi-turn chat logs with a VLM, VisionArena-Battle, a set of 30K conversations between a user and 2 anonymous VLMs with preference votes, and VisionArena-Bench, an automatic benchmark consisting of 500 diverse user prompts which can be used to cheaply approximate model rankings.We analyze these datasets and highlight the types of question asked by users, the influence of style on user preference, and areas where models often fall short. We find that more open-ended questions like captioning and humor are heavily influenced by style, which causes certain models like Reka Flash and InternVL which are tuned for style to perform significantly better on these categories compared to other categories. We show that VisionArena-Chat and VisionArena-Battle can be used for post-training to align VLMs to human preferences …
Poster
Wen Yin · Yong Wang · Guiduo Duan · Dongyang Zhang · XIN Hu · Yuan-Fang Li · Tao He
[ ExHall D ]
Abstract
Visual Emotion Recognition (VER) is a critical yet challenging task aimed at inferring the emotional states of individuals based on visual cues. However, recent approaches predominantly focus on single domains, e.g., realistic images or stickers, limiting VER models' cross-domain generalizability. To address this limitation, we introduce an Unsupervised Cross-Domain Visual Emotion Recognition (UCDVER) task, which aims to generalize visual emotion recognition from the source domain (e.g., realistic images) to the low-resource target domain (e.g., stickers) in an unsupervised manner. Compared to conventional unsupervised domain adaptation problems, UCDVER presents two key challenges: significant emotional expression variability and an affective distribution shift. To mitigate these issues, we propose the Knowledge-aligned Counterfactual-enhancement Diffusion Perception (KCDP) framework for UCDVER. Specifically, KCDP first leverages a vision-language model to align emotional representations in a shared knowledge space and guides diffusion models for improved visual affective perception. Furthermore, a Counterfactual-Enhanced Language-image Emotional Alignment (CLIEA) method generates high-quality pseudo-labels for the target domain. Extensive experiments demonstrate that our approach surpasses state-of-the-art models in both perceptibility and generalization, e.g., gaining a 12% improvement over the SOTA VER model TGCA-PVT.
Poster
Yangyu Huang · Tianyi Gao · Haoran Xu · Qihao Zhao · Yang Song · Zhipeng Gui · Tengchao Lv · Hao Chen · Lei Cui · Scarlett Li · Furu Wei
[ ExHall D ]
Abstract
The geologic map, a fundamental diagram in geological science, provides critical insights into the structure and composition of Earth's subsurface and surface. These maps are indispensable in various fields, including disaster detection, resource exploration, and civil engineering. Despite their significance, current Multimodal Large Language Models (MLLMs) often fall short in geologic map understanding. This gap is primarily due to the challenging nature of cartographic generalization, which involves handling high-resolution maps, managing multiple associated components, and requiring domain-specific knowledge. To quantify this gap, we construct GeoMap-Bench, the first-ever benchmark for evaluating MLLMs in geologic map understanding, which assesses the full-scale abilities in extracting, referring, grounding, reasoning, and analyzing. To bridge this gap, we introduce GeoMap-Agent, the inaugural agent designed for geologic map understanding, which features three modules: Hierarchical Information Extraction (HIE), Domain Knowledge Injection (DKI), and Prompt-enhanced Question Answering (PEQA). Inspired by the interdisciplinary collaboration among human scientists, an AI expert group acts as consultants, utilizing a diverse tool pool to comprehensively analyze questions. Through comprehensive experiments, GeoMap-Agent achieves an overall score of 0.811 on GeoMap-Bench, significantly outperforming the 0.369 of GPT-4o. Our work, em**P**owering g**E**ologic m**A**p holisti**C** und**E**rstanding (**PEACE**) with MLLMs, paves the way for advanced AI applications in geology, enhancing the efficiency and accuracy of geological investigations.
Poster
Chengyue Huang · Brisa Maneechotesuwan · Shivang Chopra · Zsolt Kira
[ ExHall D ]
Abstract
Visual question answering (VQA) systems face significant challenges when adapting to real-world data shifts, especially in multi-modal contexts. While robust fine-tuning strategies are essential for maintaining performance across in-distribution (ID) and out-of-distribution (OOD) scenarios, current evaluation settings are primarily unimodal or particular to some types of OOD, offering limited insight into the complexities of multi-modal contexts. In this work, we propose a new benchmark FRAMES-VQA (Fine-Tuning Robustness Across Multi-Modal Shifts in VQA) for evaluating robust fine-tuning for VQA tasks. We utilize ten existing VQA benchmarks, including VQAv2, IV-VQA, VQA-CP, OK-VQA and others, and categorize them into ID, near and far OOD datasets covering uni-modal, multi-modal and adversarial distribution shifts. We first conduct a comprehensive comparison of existing robust fine-tuning methods. We then quantify the distribution shifts by calculating the Mahalanobis distance using uni-modal and multi-modal embeddings extracted from various models. Further, we perform an extensive analysis to explore the interactions between uni- and multi-modal shifts as well as modality importance for ID and OOD samples. These analyses offer valuable guidance on developing more robust fine-tuning methods to handle multi-modal distribution shifts. We will release our code to support future work in this area.
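The Mahalanobis-based shift measurement mentioned above can be approximated as follows: fit a Gaussian (mean and covariance) to in-distribution embeddings and average the Mahalanobis distances of the shifted benchmark's embeddings to that fit. The regularization term and toy data are assumptions for illustration.

```python
import numpy as np

def mahalanobis_shift(id_embeddings, ood_embeddings, eps=1e-6):
    # Fit mean/covariance to ID embeddings, then average the Mahalanobis
    # distance of each OOD embedding to that Gaussian fit.
    mu = id_embeddings.mean(axis=0)
    cov = np.cov(id_embeddings, rowvar=False) + eps * np.eye(id_embeddings.shape[1])
    cov_inv = np.linalg.inv(cov)
    diffs = ood_embeddings - mu
    d2 = np.einsum("nd,dk,nk->n", diffs, cov_inv, diffs)
    return float(np.sqrt(np.maximum(d2, 0.0)).mean())

id_feats = np.random.randn(500, 32)          # toy in-distribution embeddings
ood_feats = np.random.randn(100, 32) + 2.0   # toy shifted embeddings
print(mahalanobis_shift(id_feats, ood_feats))
```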
Poster
Miran Heo · Min-Hung Chen · De-An Huang · Sifei Liu · Subhashree Radhakrishnan · Seon Joo Kim · Yu-Chiang Frank Wang · Ryo Hachiuma
[ ExHall D ]
Abstract
We present Omni-RGPT, a multimodal large language model designed to facilitate region-level comprehension for both images and videos. To achieve consistent region representation across spatio-temporal dimensions, we introduce Token Mark, a set of discretized tokens highlighting the target regions within the visual feature space. These tokens are directly embedded into spatial regions using region prompts (e.g., boxes or masks) and simultaneously incorporated into the text prompt to specify the target, establishing a direct connection between visual and language tokens. To further support robust video understanding without requiring tracklets, we introduce an auxiliary task that guides Token Mark by leveraging the consistency of the tokens, enabling stable region interpretation across video sequences.Additionally, we introduce RegVID-300k, a large-scale region-level video instruction dataset curated from diverse public video sources. Omni-RGPT achieves state-of-the-art results on visual commonsense reasoning benchmarks in image-based (VCR) and video-based (Causal-VidQA) tasks while also demonstrating strong performance in captioning and Referring Expression Comprehension (REC) tasks.
Poster
Wenbo Chen · Zhen Xu · Ruotao Xu · Si Wu · Hau San Wong
[ ExHall D ]
Abstract
The goal of visual grounding is to establish connections between target objects and textual descriptions. Large Language Models (LLMs) have demonstrated strong comprehension abilities across a variety of visual tasks. To establish precise associations between the text and the corresponding visual region, we propose a Task-aware Cross-modal feature Refinement Transformer with LLMs for visual grounding, and our model is referred to as TCRT. To enable the LLM trained solely on text to understand images, we introduce an LLM adaptation module that extracts text-related visual features to bridge the domain discrepancy between the textual and visual modalities. We feed the text and visual features into the LLM to obtain task-aware priors. To enable the priors to guide feature fusion process, we further incorporate a cross-modal feature fusion module, which allows task-aware embeddings to refine visual features and facilitate information interaction between the Referring Expression Comprehension (REC) and Referring Expression Segmentation (RES) tasks. We have performed extensive experiments to verify the effectiveness of the main components and demonstrate the superior performance of the proposed TCRT over state-of-the-art end-to-end visual grounding methods on RefCOCO, RefCOCOg, RefCOCO+ and ReferItGame.
Poster
Yue Han · Jiangning Zhang · Junwei Zhu · Runze Hou · Xiaozhong Ji · Chuming Lin · Xiaobin Hu · Xuezhucun Xue · Yong Liu
[ ExHall D ]
Abstract
Multimodal Large Language Models (MLLMs) have shown remarkable performance in image understanding, generation, and editing, with recent advancements achieving pixel-level grounding with reasoning. However, these models, designed for common objects, struggle with fine-grained face understanding. In this work, we introduce the \textbf{\textit{FacePlayGround-240K}} dataset, the first large-scale, pixel-grounded face caption and question-answer (QA) dataset, meticulously curated for alignment pretraining and instruction-tuning. We present the \textbf{\textit{GroundingFace}} framework, specifically designed to enhance fine-grained face understanding. This framework significantly augments the capabilities of existing grounding models in face part segmentation and face attribute comprehension, while preserving general scene understanding. Comprehensive experiments validate that our approach surpasses current state-of-the-art models in pixel-grounded face captioning/QA and various downstream tasks, including face captioning, referring segmentation, and zero-shot face attribute recognition.
Poster
Yang Bai · Yucheng Ji · Min Cao · Jinqiao Wang · Mang Ye
[ ExHall D ]
Abstract
Traditional text-based person retrieval (TPR) relies on a single-shot text as query to retrieve the target person, assuming that the query completely captures the user's search intent. However, in real-world scenarios, it can be challenging to ensure the information completeness of such single-shot text. To address this limitation, we propose chat-based person retrieval (ChatPR), a new paradigm that takes an interactive dialogue as query to perform the person retrieval, engaging the user in conversational context to progressively refine the query for accurate person retrieval. The primary challenge in ChatPR is the lack of available dialogue-image paired data. To overcome this challenge, we establish ChatPedes, the first dataset designed for ChatPR, which is constructed by leveraging large language models to automate the question generation and simulate user responses. Additionally, to bridge the modality gap between dialogues and images, we propose a dialogue-refined cross-modal alignment (DiaNA) framework, which leverages two adaptive attribute refiners to bottleneck the conversational and visual information for fine-grained cross-modal alignment. Moreover, we propose a dialogue-specific data augmentation strategy, random round retaining, to further enhance the model's generalization ability across varying dialogue lengths. Extensive experiments demonstrate that DiaNA significantly outperforms existing TPR approaches, highlighting the effectiveness of conversational interactions …
Poster
ruotian peng · Haiying He · Yake Wei · Yandong Wen · Di Hu
[ ExHall D ]
Abstract
High-quality image captions play a crucial role in improving the performance of cross-modal applications such as text-to-image generation, text-to-video generation, and text-image retrieval. To generate long-form, high-quality captions, many recent studies have employed multimodal large language models (MLLMs). However, current MLLMs often produce captions that lack fine-grained details or suffer from hallucinations, a challenge that persists in both open-source and closed-source models. Inspired by Feature-Integration theory, which suggests that attention must focus on specific regions to integrate visual information effectively, we propose a divide-then-aggregate strategy. Our method first divides the image into semantic and spatial patches to extract fine-grained details, enhancing the model's local perception of the image. These local details are then hierarchically aggregated to generate a comprehensive global description. To address hallucinations and inconsistencies in the generated captions, we apply a semantic-level filtering process during hierarchical aggregation. This training-free pipeline can be applied to both open-source models (LLaVA-1.5, LLaVA-1.6, Mini-Gemini) and closed-source models (Claude-3.5-Sonnet, GPT-4o, GLM-4V-Plus). Extensive experiments demonstrate that our method generates more detailed, reliable captions, advancing multimodal description generation without requiring model retraining.
Poster
Likai Tian · Jian Zhao · Zechao Hu · Zhengwei Yang · Hao Li · Lei Jin · Zheng Wang · Xuelong Li
[ ExHall D ]
Abstract
Composed Image Retrieval (CIR) is a multi-modal task that seeks to retrieve target images by harmonizing a reference image with a modified instruction. The main challenge in CIR lies in compositional conflicts between the reference image (e.g., blue, long sleeve) and the modified instruction (e.g., grey, short sleeve). Previous works attempt to mitigate such conflicts through feature-level manipulation, commonly employing learnable masks to obscure conflicting features within the reference image. However, the inherent complexity of feature spaces poses significant challenges in precise conflict neutralization, thereby leading to uncontrollable results. To this end, this paper proposes the Compositional Conflict Identification and Neutralization (CCIN) framework, which sequentially identifies and neutralizes compositional conflicts for effective CIR. Specifically, CCIN comprises two core modules: 1) a Compositional Conflict Identification module, which utilizes LLM-based analysis to identify specific conflicting attributes, and 2) a Compositional Conflict Neutralization module, which first generates a kept instruction to preserve non-conflicting attributes, then neutralizes conflicts under the collaborative guidance of both the kept and modified instructions. Furthermore, an orthogonal parameter regularization loss is introduced to emphasize the distinction between target and conflicting features. Extensive experiments demonstrate the superiority of CCIN over state-of-the-art methods.
Poster
You Li · Fan Ma · Yi Yang
[ ExHall D ]
Abstract
Zero-shot Composed Image Retrieval (ZSCIR) requires retrieving images that match the query image and the relative captions. Current methods focus on projecting the query image into the text feature space and subsequently combining it with the features of the query text for retrieval. However, retrieving images only with the text features cannot guarantee detailed alignment due to the natural gap between images and text. In this paper, we introduce Imagined Proxy for CIR, a training-free method that creates a proxy image aligned with the query image and text description, enhancing query representation in the retrieval process. We first leverage the large language model’s generalization capability to generate an image layout, and then apply both the query text and image for conditional generation. The robust query features are enhanced by merging the proxy image, query image, and text semantic perturbation. Our newly proposed balancing metric integrates text-based and proxy retrieval similarities, allowing for more accurate retrieval of the target image while incorporating image-side information into the process. Experiments on three public datasets demonstrate that our method significantly improves retrieval performance. We achieve state-of-the-art (SOTA) results on the CIRR dataset with a Recall@K of 70.07 at K=10. Additionally, we achieved an improvement in Recall@10 on the FashionIQ dataset, rising …
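The balancing metric can be pictured as a weighted mix of text-side and proxy-image-side similarities before ranking; the linear combination and the `alpha` weight below are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def balanced_retrieval_scores(gallery, text_query, proxy_image, alpha=0.5):
    # gallery: (N, D) L2-normalized gallery embeddings;
    # text_query / proxy_image: (D,) L2-normalized query-side embeddings.
    text_sim = gallery @ text_query
    proxy_sim = gallery @ proxy_image
    scores = alpha * text_sim + (1.0 - alpha) * proxy_sim
    return np.argsort(-scores)               # gallery indices, best match first

g = np.random.randn(1000, 64); g /= np.linalg.norm(g, axis=1, keepdims=True)
t = np.random.randn(64); t /= np.linalg.norm(t)
p = np.random.randn(64); p /= np.linalg.norm(p)
ranking = balanced_retrieval_scores(g, t, p)
```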
Poster
Chuong Huynh · Jinyu Yang · Ashish Tawari · Mubarak Shah · Son Dinh Tran · Raffay Hamid · Trishul Chilimbi · Abhinav Shrivastava
[ ExHall D ]
Abstract
Composed Image Retrieval (CIR) is a complex task that aims to retrieve images based on a multimodal query. Typical training data consists of triplets containing a reference image, a textual description of desired modifications, and the target image, which are expensive and time-consuming to acquire. The scarcity of CIR datasets has led to zero-shot approaches utilizing synthetic triplets or leveraging vision-language models (VLMs) with ubiquitous web-crawled image-caption pairs. However, these methods have significant limitations: synthetic triplets suffer from limited scale, lack of diversity, and unnatural modification text, while image-caption pairs hinder joint embedding learning of the multimodal query due to the absence of triplet data. Moreover, existing approaches struggle with complex and nuanced modification texts that demand sophisticated fusion and understanding of vision and language modalities. We present CoLLM, a one-stop framework that effectively addresses these limitations. Our approach generates triplets on-the-fly from image-caption pairs, enabling supervised training without manual annotation. We leverage Large Language Models (LLMs) to generate joint embeddings of reference images and modification texts, facilitating deeper multimodal fusion. Additionally, we introduce Multi-Text CIR (MTCIR), a large-scale dataset comprising 3.4M samples, and refine existing CIR benchmarks (CIRR and Fashion-IQ) to enhance evaluation reliability. Experimental results demonstrate that CoLLM …
Poster
Zhenxing Zhang · Yaxiong Wang · Lechao Cheng · Zhun Zhong · Dan Guo · Meng Wang
[ ExHall D ]
Abstract
We present ASAP, a new framework for detecting and grounding multi-modal media manipulation (DGM4). Upon thorough examination, we observe that accurate fine-grained cross-modal semantic alignment between the image and text is vital for accurate manipulation detection and grounding. However, existing DGM4 methods pay little attention to cross-modal alignment, which hampers the accuracy of manipulation detection. To remedy this issue, this work aims to advance semantic alignment learning to promote this task. Particularly, we utilize off-the-shelf Multimodal Large Language Models (MLLMs) and Large Language Models (LLMs) to construct aligned image-text pairs, especially for the manipulated instances. Subsequently, cross-modal alignment learning is performed to enhance the semantic alignment. Besides the explicit auxiliary clues, we further design a Manipulation-Guided Cross Attention (MGCA) mechanism to provide implicit guidance for augmenting manipulation perception. With the ground truth available during training, MGCA encourages the model to concentrate more on manipulated components while downplaying normal ones, enhancing the model's ability to capture manipulations. Extensive experiments are conducted on the DGM4 dataset; the results demonstrate that our model surpasses comparison methods by a clear margin.
Poster
Yikun Liu · Pingan Chen · jiayin cai · Xiaolong Jiang · Yao Hu · Jiangchao Yao · Yanfeng Wang · Weidi Xie
[ ExHall D ]
Abstract
With the rapid advancement of multimodal information retrieval, increasingly complex retrieval tasks have emerged. Existing methods predominantly rely on task-specific fine-tuning of vision-language models, often those trained with image-text contrastive learning. In this paper, we explore the possibility of re-purposing generative Large Multimodal Models (LMMs) for retrieval. This approach enables the unification of all retrieval tasks under the same formulation and, more importantly, allows for extrapolation towards unseen retrieval tasks without additional training. Our contributions can be summarised in the following aspects: (i) We introduce **LamRA**, a versatile framework designed to empower LMMs with sophisticated retrieval and reranking capabilities. (ii) For retrieval, we adopt a two-stage training strategy comprising language-only pre-training and multimodal instruction tuning to progressively enhance the LMM's retrieval performance. (iii) For reranking, we employ joint training for both pointwise and listwise reranking, offering two distinct ways to further boost the retrieval performance. (iv) Extensive experimental results underscore the efficacy of our method in handling more than ten retrieval tasks, demonstrating robust performance in both supervised and zero-shot settings, including scenarios involving previously unseen retrieval tasks. The code and model will be made publicly available for reproduction.
Poster
Yuchen Duan · Zhe Chen · Yusong Hu · Weiyun Wang · Shenglong Ye · Botian Shi · Lewei Lu · Qibin Hou · Tong Lu · Hongsheng Li · Jifeng Dai · Wenhai Wang
[ ExHall D ]
Abstract
Despite significant progress in multimodal large language models (MLLMs), their performance on complex, multi-page document comprehension remains inadequate, largely due to the lack of high-quality, document-level datasets. While current retrieval-augmented generation (RAG) methods offer partial solutions, they suffer from issues such as fragmented retrieval contexts, multi-stage error accumulation, and the extra time costs of retrieval. In this work, we present a high-quality document-level dataset, Doc-750K, designed to support in-depth understanding of multimodal documents. This dataset includes diverse document structures, extensive cross-page dependencies, and real question-answer pairs derived from original documents. Building on the dataset, we developed a native multimodal model, Docopilot, which can accurately handle document-level dependencies without relying on RAG. Experiments demonstrate that Docopilot achieves superior coherence, accuracy, and efficiency in document understanding tasks and multi-turn interactions, setting a new baseline for document-level multimodal understanding. Data, code, and models will be released.
Poster
Wenhui Liao · Jiapeng Wang · Hongliang Li · Chengyu Wang · Jun Huang · Lianwen Jin
[ ExHall D ]
Abstract
Text-rich document understanding (TDU) refers to analyzing and comprehending documents containing substantial textual content. With the rapid evolution of large language models (LLMs), they have been widely leveraged for TDU due to their remarkable versatility and generalization. In this paper, we introduce DocLayLLM, an efficient multi-modal extension of LLMs specifically designed for TDU. By lightly integrating visual patch tokens and 2D positional tokens into LLMs and encoding the document content using the LLMs themselves, we fully take advantage of the document comprehension capability of LLMs and enhance their perception of OCR information. We have also deeply considered the role of chain-of-thought (CoT) and innovatively proposed the techniques of CoT Pre-training and CoT Annealing. Our DocLayLLM can achieve remarkable performances with lightweight training settings, showcasing its efficiency and effectiveness. Experimental results demonstrate that our DocLayLLM outperforms existing OCR-dependent methods and OCR-free competitors.
Poster
Jeongryong Lee · Yejee Shin · Geonhui Son · Dosik Hwang
[ ExHall D ]
Abstract
The modality gap between vision and text embeddings in CLIP presents a significant challenge for zero-shot image captioning, limiting effective cross-modal representation. Traditional approaches, such as noise injection and memory-based similarity matching, attempt to address this gap, yet these methods either rely on indirect alignment or relatively naive solutions with heavy computation. Diffusion Bridge introduces a novel approach to directly reduce this modality gap by leveraging Denoising Diffusion Probabilistic Models (DDPM), trained exclusively on text embeddings to model their distribution. Our approach is motivated by the observation that, while paired vision and text embeddings are relatively close, a modality gap still exists due to stable regions created by the contrastive loss. This gap can be interpreted as noise in cross-modal mappings, which we approximate as Gaussian noise. To bridge this gap, we employ a reverse diffusion process, where image embeddings are strategically introduced at an intermediate step in the reverse process, allowing them to be refined progressively toward the text embedding distribution. This process transforms vision embeddings into text-like representations closely aligned with paired text embeddings, effectively minimizing residual discrepancies. Experimental results demonstrate that these text-like vision embeddings significantly enhance alignment with their paired text embeddings, leading to improved zero-shot …
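A minimal sketch of the bridging step, assuming a DDPM noise predictor `eps_model(x, t)` trained on text embeddings (hypothetical interface): the image embedding is treated as a noisy sample at an intermediate timestep and denoised down to t=0 with the standard DDPM reverse update.

```python
import torch

@torch.no_grad()
def bridge_image_embedding(img_emb, eps_model, betas, t_start):
    # Treat the image embedding as a noisy sample at timestep t_start and
    # run the standard DDPM reverse updates down to t = 0, pulling it
    # toward the text-embedding distribution the model was trained on.
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = img_emb.clone()
    for t in range(t_start, -1, -1):
        eps = eps_model(x, torch.tensor([t]))
        coef = (1.0 - alphas[t]) / torch.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x   # text-like embedding to feed into the captioning decoder

# Toy usage with a stand-in noise predictor (the real one is learned on text embeddings).
betas = torch.linspace(1e-4, 0.02, 1000)
dummy_eps = lambda x, t: torch.zeros_like(x)
text_like = bridge_image_embedding(torch.randn(1, 512), dummy_eps, betas, t_start=250)
```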
Poster
Matt Deitke · Christopher Clark · Sangho Lee · Rohun Tripathi · Yue Yang · Jae Sung Park · Reza Salehi · Niklas Muennighoff · Kyle Lo · Luca Soldaini · Jiasen Lu · Taira Anderson · Erin Bransom · Kiana Ehsani · Huong Ngo · Yen-Sung Chen · Ajay Patel · Mark Yatskar · Chris Callison-Burch · Andrew Head · Rose Hendrix · Favyen Bastani · Eli VanderBilt · Nathan Lambert · Yvonne Chou · Arnavi Chheda-Kothary · Jenna Sparks · Sam Skjonsberg · Michael Schmitz · Aaron Sarnat · Byron Bischoff · Pete Walsh · Christopher Newell · Piper Wolters · Tanmay Gupta · Kuo-Hao Zeng · Jon Borchardt · Dirk Groeneveld · Crystal Nam · Sophie Lebrecht · Caitlin Wittlif · Carissa Schoenick · Oscar Michel · Ranjay Krishna · Luca Weihs · Noah A. Smith · Hannaneh Hajishirzi · Ross Girshick · Ali Farhadi · Aniruddha Kembhavi
[ ExHall D ]
Abstract
Today's most advanced vision-language models (VLMs) remain proprietary. The strongest open-weight models rely heavily on synthetic data from proprietary VLMs to achieve good performance, effectively distilling these closed VLMs into open ones. As a result, the community has been missing foundational knowledge about how to build performant VLMs from scratch. We present \textbf{Molmo}, a new family of VLMs that are state-of-the-art in their class of openness. Our key contribution is a collection of new datasets, including a dataset of highly detailed image captions for pre-training called \textbf{PixMo}, a free-form image Q\&A dataset for fine-tuning, and an innovative 2D pointing dataset, all collected without the use of external VLMs. The success of our approach relies on careful modeling choices, a well-tuned training pipeline, and, most critically, the quality of our newly collected datasets. Our best-in-class 72B model not only outperforms others in the class of open weight and data models, but also outperforms larger proprietary models including Claude 3.5 Sonnet, and Gemini 1.5 Pro and Flash, second only to GPT-4o based on both academic benchmarks and on a large human evaluation. Our model weights, new datasets, and source code will all be released.
Poster
Guotao liang · Baoquan Zhang · Zhiyuan Wen · Junteng Zhao · Yunming Ye · Guangming Ye · Yao He
[ ExHall D ]
Abstract
Image quantization is a crucial technique in image generation, aimed at learning a codebook that encodes an image into a discrete token sequence. Recent advancements have seen researchers exploring learning a multi-modal codebook (i.e., a text-aligned codebook) by utilizing image caption semantics, aiming to enhance codebook performance in cross-modal tasks. However, existing image-text paired datasets exhibit a notable flaw in that the text descriptions tend to be overly concise, failing to adequately describe the images and provide sufficient semantic knowledge, resulting in limited alignment of text and codebook at a fine-grained level. In this paper, we propose a novel Text-Augmented Codebook Learning framework, named TA-VQ, which generates longer text for each image using a vision-language model for improved text-aligned codebook learning. However, the long text presents two key challenges: how to encode the text and how to align the codebook and text. To tackle these two challenges, we propose to split the long text into multiple granularities for encoding, i.e., word, phrase, and sentence, so that the long text can be fully encoded without losing any key semantic knowledge. Following this, a hierarchical encoder and novel sampling-based alignment strategy are designed to achieve fine-grained codebook-text alignment. Additionally, our method can be seamlessly integrated into existing VQ models. …
Poster
Hyungyu Choi · Young Kyun Jang · Chanho Eom
[ ExHall D ]
Abstract
Vision-language models like CLIP have shown impressive capabilities in aligning images and text, but they often struggle with lengthy, detailed text descriptions due to their training focus on concise captions. We present GOAL (Global-local Object Alignment Learning), a novel fine-tuning method that enhances CLIP's ability to handle lengthy text by leveraging both global and local semantic alignments. Our approach consists of two key components: Local Image-Sentence Matching (LISM), which identifies corresponding pairs between image segments and descriptive sentences, and Token Similarity-based Learning (TSL), which efficiently propagates local element attention through these matched pairs. Evaluating GOAL on three new benchmarks for image-lengthy text retrieval, we demonstrate significant improvements over baseline CLIP fine-tuning, establishing a simple yet effective approach for adapting CLIP to detailed textual descriptions. Through extensive experiments, we show that our method's focus on local semantic alignment alongside global context leads to more nuanced and representative embeddings, particularly beneficial for tasks requiring fine-grained understanding of lengthy text descriptions.
Poster
Anjia Cao · Xing Wei · Zhiheng Ma
[ ExHall D ]
Abstract
Language-image pre-training faces significant challenges due to limited data in specific formats and the constrained capacities of text encoders. While prevailing methods attempt to address these issues through data augmentation and architecture modifications, they continue to struggle with processing long-form text inputs, and the inherent limitations of traditional CLIP text encoders lead to suboptimal downstream generalization. In this paper, we propose FLAME (Frozen Large lAnguage Models Enable data-efficient language-image pre-training) that leverages frozen large language models as text encoders, naturally processing long text inputs and demonstrating impressive multilingual generalization. FLAME comprises two key components: 1) a multifaceted prompt distillation technique for extracting diverse semantic representations from long captions, which better aligns with the multifaceted nature of images, and 2) a facet-decoupled attention mechanism, complemented by an offline embedding strategy, to ensure efficient computation. Extensive empirical evaluations demonstrate FLAME's superior performance. When trained on CC3M, FLAME surpasses the previous state-of-the-art by 4.9% in ImageNet top-1 accuracy. On YFCC15M, FLAME surpasses the WIT-400M-trained CLIP by 44.4% in average image-to-text recall@1 across 36 languages, and by 34.6% in text-to-image recall@1 for long-context retrieval on Urban-1k. The code will be released soon.
Poster
Shuokai Pan · Gerti Tuzi · Sudarshan Sreeram · Dibakar Gope
[ ExHall D ]
Abstract
Despite the revolutionary breakthroughs of large-scale text-to-image diffusion models for complex vision and downstream tasks, their extremely high computational and storage costs limit their usability. Quantization of diffusion models has been explored in recent works to reduce compute costs and memory bandwidth usage. To further improve inference time, fast convolution algorithms such as Winograd can be used for convolution layers, which account for a significant portion of computations in diffusion models. However, the significant quality loss of fully quantized Winograd using existing coarser-grained post-training quantization methods, combined with the complexity and cost of finetuning the Winograd transformation matrices for such large models to recover quality, makes them unsuitable for large-scale foundation models. Motivated by the presence of a large range of values in them, we investigate the impact of finer-grained group-wise quantization in quantizing diffusion models. While group-wise quantization can largely handle the fully quantized Winograd convolution, it struggles to deal with the large distribution imbalance in a sizable portion of the Winograd domain computation. To reduce range differences in the Winograd domain, we propose finetuning only the scale parameters of the Winograd transform matrices without using any domain-specific training data. Because our method does not depend on any training …
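To illustrate the finer granularity the abstract refers to, here is a minimal sketch of symmetric group-wise ("fake") quantization, where each small group of values gets its own scale so that a single scale never has to cover the full dynamic range of the Winograd-domain values. Group size, bit width, and the symmetric rounding scheme are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def groupwise_quantize(w, group_size=64, n_bits=8):
    """Quantize-dequantize `w` with one scale per contiguous group of
    `group_size` values along the flattened last axis."""
    qmax = 2 ** (n_bits - 1) - 1
    flat = w.reshape(-1, group_size)
    scales = np.abs(flat).max(axis=1, keepdims=True) / qmax
    scales = np.where(scales == 0, 1.0, scales)            # guard all-zero groups
    q = np.clip(np.round(flat / scales), -qmax - 1, qmax)
    return (q * scales).reshape(w.shape)                   # dequantized weights

# usage: array size must be a multiple of the group size
w = np.random.randn(128, 64).astype(np.float32)
w_q = groupwise_quantize(w)
```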
Poster
Yassine Ouali · Adrian Bulat · ALEXANDROS XENOS · Anestis Zaganidis · Ioannis Maniadis Metaxas · Brais Martinez · Georgios Tzimiropoulos
[ ExHall D ]
Abstract
Contrastively-trained Vision-Language Models (VLMs) like CLIP have become the de facto approach for discriminative vision-language representation learning. However, these models have limited language understanding, often exhibiting a "bag of words" behavior. At the same time, Large Vision-Language Models (LVLMs), which combine vision encoders with LLMs, have been shown capable of detailed vision-language reasoning, yet their autoregressive nature renders them less suitable for discriminative tasks. In this work, we propose to combine "the best of both worlds": a new training approach for discriminative fine-tuning of LVLMs that results in strong discriminative and compositional capabilities. Essentially, our approach converts a generative LVLM into a discriminative one, unlocking its capability for powerful image-text discrimination combined with enhanced language understanding. Our contributions include (1) A carefully designed training/optimization framework that utilizes image-text pairs of variable length and granularity for training the model with both contrastive and next-token prediction losses. This is accompanied by ablation studies that justify the necessity of our framework's components. (2) A parameter-efficient adaptation method using a combination of soft prompting and LoRA adapters. (3) Significant improvements over state-of-the-art CLIP-like models of similar size, including standard image-text retrieval benchmarks and notable gains in compositionality.
Poster
Tianyu Chen · Xingcheng Fu · Yisen Gao · Haodong Qian · Yuecen Wei · Kun Yan · Haoyi Zhou · Jianxin Li
[ ExHall D ]
Abstract
Modern vision-language models (VLMs) develop patch embeddings and convolution backbones within vector spaces, especially Euclidean ones, from their very foundation. When expanding VLMs to galaxy scale for understanding astronomical phenomena, the integration of spherical spaces for planetary orbits and hyperbolic spaces for black holes raises two formidable challenges: a) the current pre-training model is confined to Euclidean space rather than a comprehensive geometric embedding, and b) the predominant architectures lack suitable backbones for anisotropic physical geometries. In this paper, we introduce Galaxy-Walker, a geometry-aware VLM for universe-level vision understanding tasks. We propose a geometry prompt that generates geometry tokens by random walks across diverse spaces on a multi-scale physical graph, along with a geometry adapter that compresses and reshapes the space anisotropy in a mixture-of-experts manner. Extensive experiments demonstrate the effectiveness of our approach, with Galaxy-Walker achieving state-of-the-art performance in both galaxy property estimation (R² scores up to 0.91) and morphology classification tasks (up to +0.17 F1 improvement in challenging features), significantly outperforming both domain-specific models and general-purpose VLMs.
Poster
Zhijian Liu · Ligeng Zhu · Baifeng Shi · Zhuoyang Zhang · Yuming Lou · Shang Yang · Haocheng Xi · Shiyi Cao · Yuxian Gu · Dacheng Li · Xiuyu Li · Haotian Tang · Yunhao Fang · Yukang Chen · Cheng-Yu Hsieh · De-An Huang · An-Chieh Cheng · Jinyi Hu · Sifei Liu · Ranjay Krishna · Pavlo Molchanov · Jan Kautz · Danny Yin · Song Han · Yao Lu
[ ExHall D ]
Abstract
Visual language models (VLMs) have made significant strides in accuracy in recent years, but their efficiency has often been overlooked. This paper presents NVILA, a family of open frontier VLMs designed to optimize both efficiency and accuracy. Building upon VILA, we improve its model architecture by first scaling up spatial and temporal resolutions, and then compressing visual tokens. This "scale-then-compress" approach allows NVILA to efficiently process high-resolution images and long videos. We also conduct a systematic study to improve NVILA's efficiency throughout its entire lifecycle—from training and fine-tuning to deployment. NVILA is both efficient and accurate, reducing training costs by 4.5×, fine-tuning memory usage by 3.4×, prefilling latency by 1.8× and decoding throughput by 1.2×. It also achieves state-of-the-art results across diverse image and video benchmarks. We will release our implementation and trained models to support full reproducibility.
Poster
Jing Bi · Lianggong Bruce Wen · Zhang Liu · JunJia Guo · Yunlong Tang · Bingjie Wang · Chenliang Xu
[ ExHall D ]
Abstract
Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated remarkable progress in visual understanding. This impressive leap raises a compelling question: how can language models, initially trained solely on linguistic data, effectively interpret and process visual content? This paper aims to address this question with systematic investigation across 4 model families and 4 model scales, uncovering a unique class of attention heads that focus specifically on visual content. Our analysis reveals a strong correlation between the behavior of these attention heads, the distribution of attention weights, and their concentration on visual tokens within the input. These findings enhance our understanding of how LLMs adapt to multimodal tasks, demonstrating their potential to bridge the gap between textual and visual understanding. This work paves the way for the development of AI systems capable of engaging with diverse modalities.
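One simple way to surface such heads is to measure, per head, the share of attention mass that lands on the image tokens. The sketch below assumes access to a layer's attention tensor of shape (heads, query_len, key_len) and a known index range for the visual tokens; it is an illustration of the general idea, not the paper's exact analysis protocol.

```python
import torch

def visual_attention_share(attn, visual_idx):
    """attn: (num_heads, q_len, k_len) attention weights from one layer.
    visual_idx: indices of the image tokens among the keys.
    Returns, per head, the average fraction of attention mass on visual tokens."""
    mass_on_visual = attn[:, :, visual_idx].sum(dim=-1)    # (heads, q_len)
    return mass_on_visual.mean(dim=-1)                     # (heads,)

# hypothetical usage: rank the heads of one layer by their focus on image tokens
attn = torch.softmax(torch.randn(32, 200, 200), dim=-1)
scores = visual_attention_share(attn, visual_idx=torch.arange(5, 149))
top_heads = scores.argsort(descending=True)[:5]
```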
Poster
Xudong LU · Yinghao Chen · chencheng Chen · Hui Tan · Boheng Chen · yina xie · Rui Hu · Guanxin tan · Renshou Wu · Yan Hu · Yi Zeng · Lei Wu · Liuyang Bian · Zhaoxiong Wang · Long Liu · Yanzhou Yang · Han Xiao · Aojun Zhou · Yafei Wen · Xiaoxin Chen · Shuai Ren · Hongsheng Li
[ ExHall D ]
Abstract
The emergence and growing popularity of multimodal large language models (MLLMs) have significant potential to enhance various aspects of daily life, from improving communication to facilitating learning and problem-solving. Mobile phones, as essential daily companions, represent the most effective and accessible deployment platform for MLLMs, enabling seamless integration into everyday tasks. However, deploying MLLMs on mobile phones presents challenges due to limitations in memory size and computational capability, making it difficult to achieve smooth and real-time processing without extensive optimization. In this paper, we present **BlueLM-V-3B**, an algorithm and system co-design approach specifically tailored for the efficient deployment of MLLMs on mobile platforms. To be specific, we redesign the dynamic resolution scheme adopted by mainstream MLLMs and implement system optimization for hardware-aware deployment to optimize model inference on mobile phones. BlueLM-V-3B boasts the following key highlights: (1) **Small Size**: BlueLM-V-3B features a language model with 2.7B parameters and a vision encoder with 400M parameters. (2) **Fast Speed**: BlueLM-V-3B achieves a generation speed of **24.4** token/s on the MediaTek Dimensity 9300 processor with 4-bit LLM weight quantization. (3) **Strong Performance**: BlueLM-V-3B has attained the highest average score of **66.1** on the OpenCompass benchmark among models with ≤ 4B parameters and surpassed …
Poster
Junyan Lin · Haoran Chen · Yue Fan · Yingqi Fan · Xin Jin · Hui Su · Jinlan Fu · Xiaoyu Shen
[ ExHall D ]
Abstract
Multimodal Large Language Models (MLLMs) have made significant advancements in recent years, with visual features playing an increasingly critical role in enhancing model performance. However, the integration of multi-layer visual features in MLLMs remains underexplored, particularly with regard to optimal layer selection and fusion strategies. Existing methods often rely on arbitrary design choices, leading to suboptimal outcomes. In this paper, we systematically investigate two core aspects of multi-layer visual feature fusion: (1) selecting the most effective visual layers and (2) identifying the best fusion approach with the language model. Our experiments reveal that while combining visual features from multiple stages improves generalization, incorporating additional features from the same stage typically leads to diminished performance. Furthermore, we find that direct fusion of multi-layer visual features at the input stage consistently yields superior and more stable performance across various configurations.
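The "direct fusion at the input stage" finding can be sketched as follows: features taken from several vision-encoder layers are concatenated channel-wise and passed through a single projector before entering the language model. The particular layer choice, dimensions, and two-layer MLP projector are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiLayerFusion(nn.Module):
    """Concatenate features from several vision-encoder layers and project them
    into the LLM embedding space with one connector."""
    def __init__(self, vis_dim=1024, n_layers=3, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vis_dim * n_layers, llm_dim), nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, layer_feats):             # list of (B, N, vis_dim) tensors
        fused = torch.cat(layer_feats, dim=-1)  # (B, N, vis_dim * n_layers)
        return self.proj(fused)                 # (B, N, llm_dim)
```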
Poster
Xiao Guo · Xiufeng Song · Yue Zhang · Xiaohong Liu · Xiaoming Liu
[ ExHall D ]
Abstract
Deepfake detection is a long-established research topic crucial for combating the spread of malicious misinformation. Unlike previous methods that provide either binary classification results or textual explanations for deepfake detection, we propose a novel method that delivers both simultaneously. Our method harnesses the multi-modal learning power of the pre-trained CLIP and the unprecedented interpretability of large language models (LLMs) to enhance both the generalization and interpretability of deepfake detection. Specifically, we introduce a multi-modal face forgery detector (M2F2-Det) that employs specially designed face forgery prompt learning, integrating zero-shot learning capabilities of the pre-trained CLIP to improve generalization to unseen forgeries. Also, M2F2-Det incorporates the LLM to provide detailed explanations for detection decisions, offering strong interpretability by bridging the gap between natural language and the subtle nuances of facial forgery detection. Empirically, we evaluate M2F2-Det for both detection and sentence generation tasks, on both of which M2F2-Det achieves state-of-the-art performance, showing its effectiveness in detecting and explaining diverse and unseen forgeries. Code and models will be released upon publication.
Poster
Shiyao Li · Yingchun Hu · Xuefei Ning · Xihui Liu · Ke Hong · xiaotao jia · Xiuhong Li · Yaqi Yan · PEI RAN · Guohao Dai · Shengen Yan · Huazhong Yang · Yu Wang
[ ExHall D ]
Abstract
Vision-Language Models (VLMs) have already enabled a variety of real-world applications. The large parameter size of VLMs brings large memory and computation overhead, which poses significant challenges for deployment. Post-Training Quantization (PTQ) is an effective technique to reduce the memory and computation overhead. Existing PTQ methods mainly focus on the language modality in large language models (LLMs), without considering the differences across other modalities. In this paper, we discover that there is a significant difference in sensitivity between language and vision tokens in large VLMs. Therefore, treating tokens from different modalities equally, as in existing PTQ methods, may over-emphasize the insensitive modalities, leading to significant accuracy loss. To deal with the above issue, we propose a simple yet effective method, Modality-Balanced Quantization (MBQ), for large VLMs. Specifically, MBQ incorporates the different sensitivities across modalities during the calibration process to minimize the reconstruction loss for better quantization parameters. Extensive experiments show that MBQ can significantly improve task accuracy by up to 4.4% and 11.6% under W3 and W4A8 quantization for 7B to 70B VLMs, compared to SOTA baselines. Additionally, we implement a W3 GPU kernel that fuses the dequantization and GEMV operators, achieving a 1.4x speedup on LLaVA-onevision-7B on the RTX …
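A modality-balanced calibration objective of this kind can be sketched as a reconstruction loss in which vision and text tokens are weighted by per-modality sensitivities. The weights `w_vision` and `w_text` are assumed to come from a calibration pass (e.g., gradient magnitudes per modality); the exact weighting scheme here is an assumption, not the paper's formula.

```python
import torch

def modality_balanced_loss(fp_out, q_out, is_vision, w_vision, w_text):
    """fp_out / q_out: (B, N, D) full-precision vs. quantized layer outputs.
    is_vision: (N,) boolean mask marking image tokens.
    w_vision / w_text: scalar sensitivity weights per modality."""
    err = (fp_out - q_out).pow(2).mean(dim=-1)              # (B, N) per-token error
    weights = torch.where(is_vision,
                          torch.as_tensor(w_vision, dtype=err.dtype),
                          torch.as_tensor(w_text, dtype=err.dtype))
    return (err * weights).mean()
```

Minimizing this loss over the quantization parameters emphasizes whichever modality the calibration step found more sensitive.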
Poster
Qianhan Feng · Wenshuo Li · Tong Lin · Xinghao Chen
[ ExHall D ]
Abstract
Vision-Language Models (VLMs) bring powerful understanding and reasoning capabilities to multimodal tasks. Meanwhile, the need for capable artificial intelligence on mobile devices, such as AI assistant software, is also growing. Some efforts try to migrate VLMs to edge devices to expand their application scope. Simplifying the model structure is a common method, but as the model shrinks, the trade-off between performance and size becomes more and more difficult. Knowledge distillation (KD) can help models improve comprehensive capabilities without increasing size or data volume. However, most of the existing large model distillation techniques only consider applications on single-modal LLMs, or only use teachers to create new data environments for students. None of these methods take into account the distillation of the most important cross-modal alignment knowledge in VLMs. We propose a method called Align-KD to guide the student model to learn the cross-modal matching that occurs at the shallow layer. The teacher also helps the student learn the projection of vision tokens into the text embedding space based on the focus of the text. Under the guidance of Align-KD, the 1.7B MobileVLM V2 model can learn rich knowledge from the 7B teacher model with a lightweight training loss design, and achieve an average …
Poster
Xianwei Zhuang · Zhihong Zhu · Yuxin Xie · Liming Liang · Yuexian Zou
[ ExHall D ]
Abstract
Large Vision-Language Models (LVLMs) may produce outputs that are unfaithful to reality, also known as visual hallucinations (VH), which significantly impedes their real-world usage. To alleviate VH, various decoding strategies have been proposed to enhance visual information. However, many of these methods may require secondary decoding and rollback, which significantly reduces inference speed. In this work, we propose an efficient plug-and-play decoding algorithm via Visual-Aware Sparsification (VASparse) from the perspective of token sparsity for mitigating VH. VASparse is inspired by empirical observations: (1) the sparse activation of attention in LVLMs, and (2) that visual-agnostic token sparsification exacerbates VH. Based on these insights, we propose a novel token sparsification strategy that balances efficiency and trustworthiness. Specifically, VASparse implements a visual-aware token selection strategy during decoding to reduce redundant tokens while preserving visual context effectively. Additionally, we innovatively introduce a sparse-based visual contrastive decoding method to recalibrate the distribution of hallucinated outputs without the time overhead associated with secondary decoding. Subsequently, VASparse recalibrates attention scores to penalize attention sinking of LVLMs towards text tokens. Extensive experiments across four popular benchmarks confirm the effectiveness of VASparse in mitigating VH across different LVLM families without requiring additional training or post-processing. Impressively, VASparse achieves state-of-the-art performance …
Poster
Dokyoon Yoon · Youngsook Song · Woomyoung Park
[ ExHall D ]
Abstract
Multimodal Large Language Models (MLLMs) frequently suffer from hallucination issues, generating information about objects that are not present in input images during vision-language tasks. These hallucinations particularly undermine model reliability in practical applications requiring accurate object identification. To address this challenge, we propose TL-DPO, a preference learning approach that mitigates hallucinations by focusing on targeted areas where they occur. To implement this, we build a dataset containing hallucinated responses, correct responses, and target information (i.e., objects present in the images and the corresponding chunk positions in responses affected by hallucinations). By applying a preference learning method restricted to these specific targets, the model can filter out irrelevant signals and focus on correcting hallucinations. This allows the model to produce more factual responses by concentrating solely on relevant information. Experimental results demonstrate that TL-DPO effectively reduces hallucinations across multiple vision hallucination tasks, improving the reliability and performance of MLLMs without diminishing overall performance.
Poster
Junzhe Chen · Tianshu Zhang · Shiyu Huang · Yuwei Niu · Linfeng Zhang · Lijie Wen · Xuming Hu
[ ExHall D ]
Abstract
Despite the recent breakthroughs achieved by Large Vision Language Models (LVLMs) in understanding and responding to complex visual-textual contexts, their inherent hallucination tendencies limit their practical application in real-world scenarios that demand high levels of precision. Existing methods typically either fine-tune the LVLMs using additional data, which incurs extra costs in manual annotation and computational resources, or perform comparisons at the decoding stage, which may eliminate useful language priors for reasoning while introducing inference time overhead. Therefore, we propose ICT, a lightweight, training-free method that calculates an intervention direction to shift the model's focus towards different levels of visual information, enhancing its attention to high-level and fine-grained visual details. During the forward pass stage, the intervention is applied to the attention heads that encode the overall image information and the fine-grained object details, effectively mitigating the over-reliance on language priors, and thereby alleviating hallucinations. Extensive experiments demonstrate that ICT achieves strong performance with a small amount of data and generalizes well across different datasets and models. Our code will be made public.
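The forward-pass intervention can be sketched as adding a precomputed direction to the outputs of selected attention heads. In this sketch the `directions` dictionary is assumed to hold unit vectors estimated offline (e.g., by contrasting grounded and hallucinated runs), and `alpha` is an assumed strength hyperparameter; neither name comes from the paper.

```python
import torch

def intervene_head_outputs(head_out, directions, heads, alpha=1.0):
    """head_out: (B, num_heads, N, head_dim) attention-head outputs of one layer.
    directions: dict {head_index: (head_dim,) unit vector}.
    Adds alpha * direction to the chosen heads' outputs during the forward pass."""
    out = head_out.clone()
    for h in heads:
        out[:, h] = out[:, h] + alpha * directions[h]   # broadcast over (B, N, head_dim)
    return out
```

Because the shift is applied only at inference time to a small set of heads, no training or secondary decoding pass is needed.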
Poster
Tobia Poppi · Tejaswi Kasarla · Pascal Mettes · Lorenzo Baraldi · Rita Cucchiara
[ ExHall D ]
Abstract
Addressing the retrieval of unsafe content from vision-language models such as CLIP is an important step towards real-world integration. Current efforts have relied on unlearning techniques that try to erase the model’s knowledge of unsafe concepts. While effective in reducing unwanted outputs, unlearning limits the model's capacity to discern between safe and unsafe content. In this work, we introduce a novel approach that shifts from unlearning to an awareness paradigm by leveraging the inherent hierarchical properties of the hyperbolic space. We propose to encode safe and unsafe content as an entailment hierarchy, where both are placed in different regions of hyperbolic space. Our HySAC, Hyperbolic Safety-Aware CLIP, employs entailment loss functions to model the hierarchical and asymmetrical relations between safe and unsafe image-text pairs. This modelling – ineffective in standard vision-language models due to their reliance on Euclidean embeddings – endows the model with awareness of unsafe content, enabling it to serve as both a multimodal unsafe classifier and a flexible content retriever, with the option to dynamically redirect unsafe queries toward safer alternatives or retain the original output. Extensive experiments show that our approach not only enhances safety recognition, but also establishes a more adaptable and interpretable framework for …
Poster
Jin Wang · Chenghui Lv · Xian Li · Shichao Dong · Huadong Li · kelu Yao · Chao Li · Wenqi Shao · Ping Luo
[ ExHall D ]
Abstract
Recently, the rapid development of AIGC has significantly increased the diversity of fake media spread on the Internet, posing unprecedented threats to social security, politics, law, and beyond. To detect the increasingly **diverse** malicious fake media in the new era of AIGC, recent studies have proposed to exploit Large Vision Language Models (LVLMs) to design **robust** forgery detectors, owing to their impressive performance on a **wide** range of multimodal tasks. However, the field still lacks a comprehensive benchmark designed to thoroughly assess LVLMs' discerning capabilities on forgery media. To fill this gap, we present Forensics-Bench, a new forgery detection evaluation benchmark suite that assesses LVLMs across massive forgery detection tasks, requiring comprehensive recognition, localization and reasoning capabilities on diverse forgeries. Forensics-Bench comprises 63,292 meticulously curated multi-choice visual questions, covering 112 unique forgery detection types from 5 perspectives: forgery semantics, forgery modalities, forgery tasks, forgery types and forgery models. We conduct thorough evaluations on 22 open-sourced LVLMs and 3 proprietary models (GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet), highlighting the significant challenges of comprehensive forgery detection posed by Forensics-Bench. We anticipate that Forensics-Bench will motivate the community to advance the frontier of LVLMs, striving for all-around forgery detectors in the era of AIGC.
Poster
Haoyu Zhang · Yangyang Guo · Mohan Kankanhalli
[ ExHall D ]
Abstract
Vision-Language (V-L) pre-trained models such as CLIP show prominent capabilities in various downstream tasks. Despite this promise, V-L models are notoriously limited by their inherent social biases. A typical demonstration is that V-L models often produce biased predictions against specific groups of people, significantly undermining their real-world applicability. Existing approaches endeavor to mitigate the social bias problem in V-L models by removing biased attribute information from model embeddings. However, after our revisiting of these methods, we find that their bias removal is frequently accompanied by greatly compromised V-L alignment capabilities. We then reveal that this performance degradation stems from the unbalanced debiasing in image and text embeddings. To address this issue, we propose a novel V-L debiasing framework to align image and text biases followed by removing them from both modalities. By doing so, our method achieves multi-modal bias mitigation while maintaining the V-L alignment in the debiased embeddings. Additionally, we advocate a new evaluation protocol that can 1) holistically quantify the model debiasing and V-L alignment ability, and 2) evaluate the generalization of social bias removal models. We believe this work will offer new insights and guidance for future studies addressing the social bias problem in CLIP.
Poster
Shin'ya Yamaguchi · Dewei Feng · Sekitoshi Kanai · Kazuki Adachi · Daiki Chijiwa
[ ExHall D ]
Abstract
Contrastive language-image pre-training (CLIP) is an essential component of building modern vision-language foundation models. While CLIP demonstrates remarkable zero-shot performance on downstream tasks, the multi-modal feature spaces still suffer from a modality gap, which is a gap between image and text feature clusters and limits downstream task performance. Although existing works attempt to address the modality gap by modifying pre-training or fine-tuning, they struggle with heavy training costs on large datasets or degraded zero-shot performance. This paper presents CLIP-Refine, a post-pre-training method for CLIP models at a phase between pre-training and fine-tuning. CLIP-Refine aims to align the feature space with one epoch of training on small image-text datasets without degrading zero-shot performance. To this end, we introduce two techniques: random feature alignment (RaFA) and hybrid contrastive-distillation (HyCD). RaFA aligns the image and text features to follow a shared prior distribution by minimizing the distance to random reference vectors sampled from the prior. HyCD updates the model with hybrid soft labels generated by combining ground-truth image-text pair labels and outputs from the pre-trained CLIP model. This helps the model both maintain past knowledge and learn new knowledge for aligning features. Our extensive experiments with multiple classification and retrieval tasks …
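The RaFA idea can be sketched as pulling both modalities toward the same randomly sampled reference vectors so that their distributions converge to a shared prior. A unit-Gaussian prior projected to the unit sphere and a squared L2 distance are assumptions of this sketch; the paper's exact prior and distance may differ.

```python
import torch
import torch.nn.functional as F

def rafa_loss(img_feat, txt_feat):
    """Random feature alignment sketch: image and text features of a batch are
    both pulled toward the same random reference vectors drawn from a prior."""
    ref = F.normalize(torch.randn_like(img_feat), dim=-1)   # shared random references
    img = F.normalize(img_feat, dim=-1)
    txt = F.normalize(txt_feat, dim=-1)
    return ((img - ref).pow(2).sum(-1) + (txt - ref).pow(2).sum(-1)).mean()
```

In practice this term would be combined with the HyCD soft-label objective so that alignment does not erase the pre-trained knowledge.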
Poster
Karsten Roth · Zeynep Akata · Dima Damen · Ivana Balazevic · Olivier J Henaff
[ ExHall D ]
Abstract
Large-scale multimodal representation learning successfully optimizes for zero-shot transfer at test time. Yet the standard pretraining paradigm (contrastive learning on large amounts of image-text data) does not explicitly encourage representations to support few-shot adaptation. In this work, we propose a simple, but carefully designed extension to multimodal pretraining which enables representations to accommodate additional context. Using this objective, we show that vision-language models can be trained to exhibit significantly increased few-shot adaptation: across 21 downstream tasks, we find up to four-fold improvements in test-time sample efficiency, and average few-shot adaptation gains of over 5%, while retaining zero-shot generalization performance across model scales and training durations. In particular, equipped with simple, training-free, metric-based adaptation mechanisms, our representations surpass significantly more complex optimization-based adaptation schemes.
Poster
Yi Zhang · Yi-Xuan Deng · Meng-Hao Guo · Shi-Min Hu
[ ExHall D ]
Abstract
Vision-language models (VLMs) like CLIP have been widely used in various specific tasks. Parameter-efficient fine-tuning (PEFT) methods, such as prompt and adapter tuning, have become key techniques for adapting these models to specific domains. However, existing approaches rely on prior knowledge to manually identify the locations requiring fine-tuning. Adaptively selecting which parameters in VLMs should be tuned remains unexplored. In this paper, we propose CLIP with Adaptive Selective Tuning (CLIP-AST), which automatically selects critical parameters in VLMs for fine-tuning on specific tasks. It leverages the adaptive learning rate in the optimizer and improves model performance without extra parameter overhead. We conduct extensive experiments on 13 benchmarks, such as ImageNet, Food101, and Flowers102, with different settings, including few-shot learning, base-to-novel class generalization, and out-of-distribution generalization. The results show that CLIP-AST consistently outperforms the original CLIP model as well as its variants and achieves state-of-the-art (SOTA) performance in all cases. For example, with 16-shot learning, CLIP-AST surpasses GraphAdapter and PromptSRC by 3.56% and 2.20% in average accuracy on 11 datasets, respectively. Code will be publicly available.
Poster
Yabin Wang · Zhiwu Huang · Xiaopeng Hong
[ ExHall D ]
Abstract
This paper identifies OpenSDI, a challenge for spotting diffusion-generated images in open-world settings. In response to this challenge, we define a new benchmark, the OpenSDI dataset (OpenSDID), which stands out from existing datasets due to its diverse use of large vision-language models that simulate open-world diffusion-based manipulations. Another outstanding feature of OpenSDID is its inclusion of both detection and localization tasks for images manipulated globally and locally by diffusion models. To address the OpenSDI challenge, we propose a Synergizing Pretrained Models (SPM) scheme to build up a mixture of foundation models. This approach exploits a collaboration mechanism with multiple pretrained foundation models to enhance generalization in the OpenSDI context, moving beyond traditional training by synergizing multiple pretrained models through prompting and attending strategies. Building on this scheme, we introduce MaskCLIP, an SPM-based model that aligns Contrastive Language-Image Pre-Training (CLIP) with Masked Autoencoder (MAE). Extensive evaluations on OpenSDID show that MaskCLIP significantly outperforms current state-of-the-art methods for the OpenSDI challenge, achieving remarkable relative improvements of 14.23% in IoU (14.11% in F1) and 2.05% in accuracy (2.38% in F1) compared to the second-best model in localization and detection tasks, respectively.
Poster
Jianyu LAI · Sixiang Chen · yunlong lin · Tian Ye · Yun Liu · Song Fei · Zhaohu Xing · Hongtao Wu · Weiming Wang · Lei Zhu
[ ExHall D ]
Abstract
Snowfall poses a significant challenge to visual data processing, requiring specialized desnowing algorithms. However, current models often struggle with generalization due to their reliance on synthetic datasets, creating a domain gap. Evaluating real snowfall images is difficult due to the lack of ground truth. To tackle these issues, we (1) introduce a large-scale, high-quality dataset of 10,000 annotated real snow scenes, (2) develop a dataset of 36k preference pairs based on human expert rankings, (3) enhance multimodal large language models' perception of snowfall images using direct preference optimization (DPO), and (4) refine desnowing models through a mean-teacher semi-supervised framework with high-quality pseudo-label screening. This framework substantially improves the generalization and performance of desnowing models on real snowfall images.
Poster
Meng Lou · Yizhou Yu
[ ExHall D ]
Abstract
In the human vision system, top-down attention plays a crucial role in perception, wherein the brain initially performs an overall but rough scene analysis to extract salient cues (i.e., overview first), followed by a finer-grained examination to make more accurate judgments (i.e., look closely next). However, recent efforts in ConvNet designs primarily focused on increasing kernel size to obtain a larger receptive field without considering this crucial biomimetic mechanism to further improve performance. To this end, we propose a novel pure ConvNet vision backbone, termed OverLoCK, which is carefully devised from both the architecture and mixer perspectives. Specifically, we introduce a biomimetic Deep-stage Decomposition Strategy (DDS) that fuses semantically meaningful context representations into middle and deep layers by providing dynamic top-down context guidance at both feature and kernel weight levels. To fully unleash the power of top-down context guidance, we further propose a novel **Cont**ext-**Mix**ing Dynamic Convolution (ContMix) that effectively models long-range dependencies while preserving inherent local inductive biases even when the input resolution increases. These properties are absent in previous convolutions. With the support from both DDS and ContMix, our OverLoCK exhibits notable performance improvement over existing methods. For instance, OverLoCK-T achieves a Top-1 accuracy of 84.2\%, significantly surpassing …
Poster
Damien Teney · Liangze Jiang · Florin Gogianu · Ehsan Abbasnejad
[ ExHall D ]
Abstract
Common choices of architecture give neural networks a preference for fitting data with simple functions. This simplicity bias is known to be key to their success. This paper explores the limits of this assumption. Building on recent work that showed that activation functions are the origin of the simplicity bias (Teney, 2024), we introduce a method to meta-learn activation functions to modulate this bias. **Findings.** We discover multiple tasks where the assumption of simplicity is inadequate, and standard ReLU architectures are therefore suboptimal. In these cases, we find activation functions that perform better by inducing a prior of higher complexity. Interestingly, these cases correspond to domains where neural networks have historically struggled: tabular data, regression tasks, cases of shortcut learning, and algorithmic grokking tasks. In comparison, the simplicity bias proves adequate on image tasks, where learned activations are nearly identical to ReLUs and GeLUs. **Implications.** (1) Contrary to common belief, the simplicity bias is not universally useful. There exist real tasks where it is suboptimal. (2) The suitability of ReLU models for image classification is not accidental. (3) The success of ML ultimately depends on the adequacy between data and architectures, and there may be benefits for architectures tailored to specific distributions of …
Poster
Faridoun Mehri · Mahdieh Baghshah · Mohammad Taher Pilehvar
[ ExHall D ]
Abstract
Why do gradient-based explanations struggle with Transformers, and how can we improve them? We identify gradient flow imbalances in Transformers that violate FullGrad-completeness, a critical property for attribution faithfulness that CNNs naturally possess. To address this issue, we introduce LibraGrad—a theoretically grounded post-hoc approach that corrects gradient imbalances through pruning and scaling of backward paths, without changing the forward pass or adding computational overhead. We evaluate LibraGrad using three metric families: Faithfulness, which quantifies prediction changes under perturbations of the most and least relevant features; Completeness Error, which measures attribution conservation relative to model outputs; and Segmentation AP, which assesses alignment with human perception. Extensive experiments across 8 architectures, 4 model sizes, and 4 datasets show that LibraGrad universally enhances gradient-based methods, outperforming existing white-box methods—including Transformer-specific approaches—across all metrics. We demonstrate superior qualitative results through two complementary evaluations: precise text-prompted region highlighting on CLIP models and accurate class discrimination between co-occurring animals on ImageNet-finetuned models—two settings on which existing methods often struggle. LibraGrad is effective even on the attention-free MLP-Mixer architecture, indicating potential for extension to other modern architectures. Our code is freely available.
Poster
Kevin Miller · Aditya Gangrade · Samarth Mishra · Kate Saenko · Venkatesh Saligrama
[ ExHall D ]
Abstract
Zero-shot multi-label recognition (MLR) with Vision-Language Models (VLMs) faces significant challenges without training data, model tuning, or architectural modifications. Existing approaches require prompt tuning or architectural adaptations, limiting zero-shot applicability. Our work proposes a novel solution treating VLMs as black boxes, leveraging scores without training data or ground truth. Using large language model insights on object co-occurrence, we introduce compound prompts grounded in realistic object combinations. Analysis of these prompt scores reveals VLM biases and "AND"/"OR" signal ambiguities, notably that maximum compound scores are surprisingly suboptimal compared to second-highest scores. We address these through a debiasing and score-fusion algorithm that corrects image bias and clarifies VLM response behaviors. Our method enhances other zero-shot approaches, consistently improving their results. Experiments show superior mean Average Precision (mAP) compared to methods requiring training data, achieved through refined object ranking for robust zero-shot MLR.
Poster
Haozhen Zhang · Zhaogeng Liu · Hualin Zhang · Xingchen Li · Wanli Shi · Bin Gu · Yi Chang
[ ExHall D ]
Abstract
Visual Prompt Learning (VPL) has emerged as a powerful strategy for harnessing the capabilities of large-scale pre-trained models (PTMs) to tackle specific downstream tasks. However, the opaque nature of PTMs in many real-world applications has led to a growing interest in gradient-free approaches within VPL. A significant challenge with existing black-box VPL methods lies in the high dimensionality of visual prompts, which necessitates a large number of API queries for tuning, thereby impacting efficiency. To address this issue, we propose a novel query-efficient framework for black-box visual prompting, designed to generate input-dependent visual prompts efficiently for large-scale black-box PTMs. Our framework is built upon the insight of reparameterizing prompts using neural networks, improving the typical pre-training-fine-tuning paradigm through the subspace learning strategy to maximize efficiency and adaptability from both the perspective of initial weights and parameter dimensionality. This tuning intrinsically optimizes low-dimensional representations within the well-learned subspace, enabling the efficient adaptation of the network to downstream tasks. Our approach significantly reduces the necessity for substantial API queries to PTMs, presenting an efficient method for leveraging large-scale black-box PTMs in visual prompting tasks. Most experimental results across various benchmarks demonstrate the effectiveness of our method, showcasing substantial reductions in the number of required API queries to …
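The query-saving principle can be illustrated with a generic gradient-free update in the low-dimensional subspace: only a small vector z is perturbed, and the full prompt is decoded from it before querying the black-box model. The SPSA-style estimator below is a stand-in for whatever gradient-free optimizer is actually used, and `loss_fn` is a hypothetical wrapper that decodes z into a prompt and queries the PTM.

```python
import numpy as np

def spsa_step(z, loss_fn, lr=0.1, c=0.01):
    """One simultaneous-perturbation step on the low-dimensional variable z.
    loss_fn(z) is assumed to decode z into a visual prompt, query the black-box
    PTM, and return a scalar task loss (each call costs one API query)."""
    delta = np.random.choice([-1.0, 1.0], size=z.shape)     # random +/-1 perturbation
    g = (loss_fn(z + c * delta) - loss_fn(z - c * delta)) / (2 * c) * delta
    return z - lr * g

# usage sketch: optimizing a 32-dimensional subspace variable
z = np.zeros(32)
# for _ in range(num_steps): z = spsa_step(z, loss_fn)
```

Because each step only needs two queries regardless of the prompt's pixel dimensionality, the query budget scales with the subspace size rather than the prompt size.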
Poster
Xueyu Liu · Rui Wang · Yexin Lai · Guangze Shi · Feixue Shao · Fang Hao · Jianan Zhang · Jia Shen · Yongfei Wu · Wen Zheng
[ ExHall D ]
Abstract
Powered by extensive curated training data, the Segment Anything Model (SAM) demonstrates impressive generalization capabilities in open-world scenarios, effectively guided by user-provided prompts. However, the class-agnostic characteristic of SAM renders its segmentation accuracy highly dependent on prompt quality. In this paper, we propose a novel plug-and-play dual-space Point Prompt Optimizer (PPO) designed to enhance prompt distribution through deep reinforcement learning (DRL)-based heterogeneous graph optimization. PPO optimizes initial prompts for any task without requiring additional training, thereby improving SAM’s downstream segmentation performance. Specifically, PPO constructs a dual-space heterogeneous graph, leveraging the robust feature-matching capabilities of a foundational pre-trained model to create internal feature and physical distance matrices. A DRL policy network iteratively refines the distribution of prompt points, optimizing segmentation predictions. We conducted experiments on four public datasets. The ablation study explores the necessity and balance of optimizing prompts in both feature and physical spaces. The comparative study shows that PPO enables SAM to surpass recent one-shot methods. Additionally, experiments with different initial prompts demonstrate PPO's generality across prompts generated by various methods. In conclusion, PPO redefines the prompt optimization problem as a heterogeneous graph optimization task, using DRL to construct an effective, plug-and-play prompt optimizer. This approach holds potential for …
Poster
Xueyi Ke · Satoshi Tsutsui · Yayun Zhang · Bihan Wen
[ ExHall D ]
Abstract
Infants develop complex visual understanding rapidly, even preceding the acquisition of linguistic inputs. As computer vision seeks to replicate the human vision system, understanding infant visual development may offer valuable insights. In this paper, we present an interdisciplinary study exploring this question: can a computational model that imitates the infant learning process develop broader visual concepts that extend beyond the vocabulary it has heard, similar to how infants naturally learn? To investigate this, we analyze a model recently published in Science by Vong et al., which is trained on longitudinal, egocentric images of a single child paired with transcribed parental speech. We introduce a training-free framework that can discover visual concept neurons hidden in the model's internal representations. Our findings show that these neurons can classify objects outside the model's original vocabulary. Furthermore, we compare the visual representations in infant-like models with those in modern computer vision models, such as CLIP or ImageNet pre-trained models, highlighting key similarities and differences. Ultimately, our work bridges cognitive science and computer vision by analyzing the internal representations of a computational model trained on an infant's visual and linguistic inputs.
Poster
Li Ren · Chen Chen · Liqiang Wang · Kien A. Hua
[ ExHall D ]
Abstract
Visual Prompt Tuning (VPT) has become a promising Parameter-Efficient Fine-Tuning (PEFT) approach for Vision Transformer (ViT) models by partially fine-tuning learnable tokens while keeping most model parameters frozen. Recent research has explored modifying the connection structures of the prompts. However, the fundamental correlation and distribution between the prompts and image tokens remain unexplored. In this paper, we leverage *metric learning* techniques to investigate how the distribution of prompts affects fine-tuning performance. Specifically, we propose a novel framework, **D**istribution **A**ware **V**isual **P**rompt Tuning (DA-VPT), to guide the distributions of the prompts by learning the distance metric from their class-related semantic data. Our method demonstrates that the prompts can serve as an effective bridge to share semantic information between image patches and the class token. We extensively evaluated our approach on popular benchmarks in both recognition and segmentation tasks. The results demonstrate that our approach enables more effective and efficient fine-tuning of ViT models by leveraging semantic information to guide the learning of the prompts, leading to improved performance on various downstream vision tasks. The code will be released.
Poster
wenlong yu · Qilong Wang · Chuang Liu · Dong Li · Qinghua Hu
[ ExHall D ]
Abstract
Explainability is a critical factor influencing the wide deployment of deep vision models (DVMs). Concept-based post-hoc explanation methods can provide both global and local insights into model decisions. However, current methods in this field face a key challenge: they cannot flexibly and automatically construct accurate and sufficient linguistic explanations for global concepts and local circuits. In particular, the intrinsic polysemanticity of semantic Visual Concepts (VCs) impedes the interpretability of concepts and DVMs, and this issue is severely underestimated. In this paper, we propose a Chain-of-Explanation (CoE) approach to address these issues. Specifically, CoE automates the decoding and description of VCs to construct global concept explanation datasets. Further, to alleviate the effect of polysemanticity on model explainability, we design a concept polysemanticity disentanglement and filtering mechanism to distinguish the most contextually relevant concept atoms. Besides, a Concept Polysemanticity Entropy (CPE), as a measure of model interpretability, is formulated to quantify the degree of concept uncertainty. The modeling of deterministic concepts is upgraded to uncertain concept atom distributions. Ultimately, CoE automatically enables linguistic local explanations of the decision-making process of DVMs by tracing the concept circuit. GPT-4o and human-based experiments demonstrate the effectiveness of CPE and the superiority of CoE, achieving an average absolute improvement …
Poster
Arpita Chowdhury · Dipanjyoti Paul · Zheda Mai · Jianyang Gu · Ziheng Zhang · Kazi Sajeed Mehrab · Elizabeth Campolongo · Daniel Rubenstein · Charles Stewart · Anuj Karpatne · Tanya Berger-Wolf · Yu Su · Wei-Lun Chao
[ ExHall D ]
Abstract
We present a simple usage of pre-trained Vision Transformers (ViTs) for fine-grained analysis, aiming to identify and localize the traits that highlight visually similar categories' identities (e.g., different bird species or dog breeds). Pre-trained ViTs such as DINO have shown remarkable capabilities to extract localized, informative features. However, saliency maps like Grad-CAM can hardly point out the traits: they often locate the whole object with a blurred, coarse heatmap, not traits. We propose a novel approach, **Prompt Class Attention Map (Prompt-CAM)**, to the rescue. Prompt-CAM learns class-specific prompts for a pre-trained ViT and uses the corresponding outputs for classification. To classify an image correctly, the true-class prompt must attend to the unique image patches not seen in other classes' images, i.e., traits. As such, the true class's multi-head attention maps reveal traits and their locations. Implementation-wise, Prompt-CAM is almost a free lunch by simply modifying the prediction head of Visual Prompt Tuning (VPT). This makes Prompt-CAM fairly easy to train and apply, sharply contrasting other interpretable methods that design specific models and training processes. It is even simpler than the recently published INterpretable TRansformer (INTR), whose encoder-decoder architecture prevents it from leveraging pre-trained ViTs. Extensive empirical studies on a dozen datasets …
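The class-specific prompting idea can be sketched as follows: one learnable prompt per class is prepended to the (frozen) ViT patch tokens, and class c is scored from its own prompt's output token, so the true-class prompt's attention over patches serves as the trait map. Dimensions, initialization, and the single-linear scoring head are illustrative assumptions rather than the paper's exact head design.

```python
import torch
import torch.nn as nn

class ClassPromptHead(nn.Module):
    """Sketch: class-specific prompts prepended to patch tokens of a frozen ViT;
    each class is scored from its own prompt's output token."""
    def __init__(self, num_classes, dim=768):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_classes, dim) * 0.02)
        self.score = nn.Linear(dim, 1)

    def forward(self, patch_tokens, vit_blocks):
        B = patch_tokens.size(0)
        x = torch.cat([self.prompts.unsqueeze(0).expand(B, -1, -1),
                       patch_tokens], dim=1)           # (B, C + N, dim)
        for blk in vit_blocks:                         # frozen, pre-trained blocks
            x = blk(x)
        prompt_out = x[:, : self.prompts.size(0)]      # (B, C, dim)
        return self.score(prompt_out).squeeze(-1)      # (B, C) class logits
```

The attention weights of the true-class prompt over the patch tokens (read out from the ViT blocks) would then be visualized as the trait heatmap.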
Poster
Aaron Serianni · Tyler Zhu · Olga Russakovsky · Vikram V. Ramaswamy
[ ExHall D ]
Abstract
Computer vision models have been shown to exhibit and amplify biases across a wide array of datasets and tasks. Existing methods for quantifying bias in classification models primarily focus on dataset distribution and model performance on subgroups, overlooking the internal workings of a model. We introduce the Attention-IoU (Attention Intersection over Union) metric and related scores, which use attention maps to reveal biases within a model's internal representations and identify image features potentially causing the biases. First, we validate Attention-IoU on the synthetic Waterbirds dataset, showing that the metric accurately measures model bias. We then analyze the CelebA dataset, finding that Attention-IoU uncovers correlations beyond accuracy disparities. Through an investigation of individual attributes with respect to the protected attribute Male, we examine the distinct ways biases are represented in CelebA. Lastly, by subsampling the training set to change attribute correlations, we demonstrate that Attention-IoU reveals potential confounding variables not present in dataset labels.
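The core quantity can be sketched as a soft intersection-over-union between two attention heatmaps, e.g., the map for a target attribute and the map for a protected attribute. The sum-normalization and the min/max soft IoU below are illustrative assumptions, not the paper's exact definition.

```python
import numpy as np

def attention_iou(map_a, map_b, eps=1e-8):
    """Soft IoU between two non-negative attention heatmaps of the same shape:
    element-wise min acts as the intersection, element-wise max as the union."""
    a = map_a / (map_a.sum() + eps)
    b = map_b / (map_b.sum() + eps)
    inter = np.minimum(a, b).sum()
    union = np.maximum(a, b).sum()
    return inter / (union + eps)
```

A high score between an attribute's map and the protected attribute's map would indicate that the model attends to overlapping image regions for the two, a signal of potential bias inside the representation.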
Poster
Guangda Ji · Silvan Weder · Francis Engelmann · Marc Pollefeys · Hermann Blum
[ ExHall D ]
Abstract
Neural network performance scales with both model size and data volume, as shown in language and image processing. This requires scaling-friendly architectures and large datasets. While transformers have been adapted for 3D vision, a 'GPT-moment' remains elusive due to limited training data. We introduce ARKit LabelMaker, the first large-scale, real-world 3D dataset with dense semantic annotations. Specifically, we enhance ARKitScenes with automatically generated dense labels using an extended LabelMaker pipeline, tailored for large-scale pre-training. Training on this dataset improves accuracy across architectures, achieving state-of-the-art results on ScanNet and ScanNet200, with notable gains on tail classes. We compare our results with self-supervised methods and synthetic data, evaluating the effects on downstream tasks and zero-shot generalization. The dataset will be publicly available.
Poster
Andrey Gizdov · Shimon Ullman · Daniel Harari
[ ExHall D ]
Abstract
Large multimodal models (LMMs) typically process visual inputs with uniform resolution across the entire field of view, leading to inefficiencies when non-critical image regions are processed as precisely as key areas. Inspired by the human visual system's foveated approach, we apply a biologically inspired method to leading architectures such as MDETR, BLIP2, InstructBLIP, LLaVA, and ViLT, and evaluate their performance with variable resolution inputs. Results show that foveated sampling boosts accuracy in visual tasks like question answering and object detection under tight pixel budgets, improving performance by up to 2.7% on the GQA dataset, 2.1% on SEED-Bench, and 2.0% on VQAv2 compared to uniform sampling. Furthermore, our research indicates that indiscriminate resolution increases yield diminishing returns, with models achieving up to 80% of their full capability using just 3% of the pixels, even on complex tasks. Foveated sampling also prompts more human-like processing within models, such as neuronal selectivity and globally-acting self-attention in vision transformers. This paper provides a foundational analysis of foveated sampling's impact on existing models, suggesting that more efficient architectural adaptations, mimicking human visual processing, are a promising research avenue for the community.
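A minimal two-level foveation sketch conveys the idea: keep full resolution inside a window around the fixation point and block-average the periphery, cutting the effective pixel budget while preserving detail where it matters. Window size, downsampling factor, and the simple block-average kernel are illustrative assumptions, not the paper's sampling scheme.

```python
import numpy as np

def foveate(img, center, fovea=64, factor=4):
    """img: (H, W, C) array; center: (row, col) fixation point.
    Pixels inside a fovea x fovea window stay sharp; the rest is block-averaged
    with a factor x factor kernel and upsampled back (coarse peripheral vision)."""
    H, W, C = img.shape
    Hc, Wc = H - H % factor, W - W % factor
    coarse = img[:Hc, :Wc].reshape(Hc // factor, factor,
                                   Wc // factor, factor, C).mean(axis=(1, 3))
    blurred = np.repeat(np.repeat(coarse, factor, axis=0), factor, axis=1)
    out = img.copy()
    out[:Hc, :Wc] = blurred
    cy, cx = center
    y0, x0 = max(cy - fovea // 2, 0), max(cx - fovea // 2, 0)
    out[y0:y0 + fovea, x0:x0 + fovea] = img[y0:y0 + fovea, x0:x0 + fovea]
    return out
```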
Poster
Weiming Zhuang · Chen Chen · Zhizhong Li · Sina Sajadmanesh · Jingtao Li · Jiabo Huang · Vikash Sehwag · Vivek Sharma · Hirotaka Shinozaki · Felan Carlo Garcia · Yihao Zhan · Naohiro Adachi · Ryoji Eki · Michael Spranger · Peter Stone · Lingjuan Lyu
[ ExHall D ]
Abstract
While existing vision and multi-modal foundation models can handle multiple computer vision tasks, they often suffer from significant limitations, including huge demand for data and computational resources during training and inconsistent performance across vision tasks at deployment time. To address these challenges, we introduce Argus (the name comes from Argus Panoptes, a hundred-eyed giant with "all-seeing" capability in Greek mythology), a compact and versatile vision foundation model designed to support a wide range of vision tasks through a unified multitask architecture. Argus employs a two-stage training strategy: (i) multitask pretraining over core vision tasks with a shared backbone that includes a lightweight adapter to inject task-specific inductive biases, and (ii) scalable and efficient adaptation to new tasks by fine-tuning only the task-specific decoders. Extensive evaluations demonstrate that Argus, despite its relatively compact and training-efficient design of merely 100M backbone parameters (only 13.6% of which are trained using 1.6M images), competes with and even surpasses much larger models. Compared to state-of-the-art foundation models, Argus not only covers a broader set of vision tasks but also matches or outperforms models of similar size on 12 tasks. We expect that Argus will accelerate the real-world adoption of vision foundation models in resource-constrained scenarios.
Poster
Unki Park · Seongmoon Jeong · Jang Youngchan · Gyeong-Moon Park · Jong Hwan Ko
[ ExHall D ]
Abstract
The field of computer vision initially focused on human visual perception and has progressively expanded to encompass machine vision, with ongoing advancements in technology expected to drive further expansion. Consequently, image compressors must effectively respond not only to human visual perception but also to current and future machine vision tasks. Towards this goal, this paper proposes a Test-Time Fine-Tuning (TTFT) approach for adapting Learned Image Compression (LIC) to multiple tasks. A large-scale LIC model, originally trained for human perception, is adapted to both closed-set and open-set machine tasks through TTFT using Singular Value Decomposition-based Low-Rank Adaptation (SVD-LoRA). The SVD-LoRA layers are applied to both the encoder and decoder of the backbone LIC model, with a modified learning scheme employed for the decoder during TTFT to train only the singular values, preventing excessive bitstream overhead. This enables instance-specific optimization for the target task. Experimental results demonstrate that the proposed method effectively adapts the backbone compressor to diverse machine tasks, outperforming competing methods.
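A minimal sketch of an SVD-based low-rank adapter is given below: the frozen weight is augmented with a branch initialized from its top-r singular triplets, and setting `train_singular_only=True` mirrors the decoder-side scheme in which only the singular values are updated (so only r scalars need to be signaled). Rank, initialization, and the zero-initialized singular values are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class SVDLoRALinear(nn.Module):
    """Frozen linear layer plus a low-rank update U diag(s) V^T built from the
    weight's own SVD; optionally only the singular values s are trainable."""
    def __init__(self, linear, rank=8, train_singular_only=False):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():
            p.requires_grad = False
        U, S, Vh = torch.linalg.svd(linear.weight.detach(), full_matrices=False)
        self.U = nn.Parameter(U[:, :rank], requires_grad=not train_singular_only)
        self.Vh = nn.Parameter(Vh[:rank], requires_grad=not train_singular_only)
        self.s = nn.Parameter(torch.zeros(rank))   # zero update at init: output matches frozen layer

    def forward(self, x):
        delta = (self.U * self.s) @ self.Vh        # (out_features, in_features) update
        return self.linear(x) + x @ delta.T
```

During test-time fine-tuning, only `s` (and optionally `U`, `Vh`) would be optimized per instance, which keeps the side information that must be transmitted very small.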
Poster
Sofia Casarin · Sergio Escalera · Oswald Lanz
[ ExHall D ]
Abstract
Training-free Neural Architecture Search (NAS) efficiently identifies high-performing neural networks using zero-cost (ZC) proxies. Unlike multi-shot and one-shot NAS approaches, ZC-NAS is both (i) time-efficient, eliminating the need for model training, and (ii) interpretable, with proxy designs often theoretically grounded. Despite rapid developments in the field, current SOTA ZC proxies are typically constrained to well-established convolutional search spaces. With the rise of Large Language Models shaping the future of deep learning, this work extends ZC proxy applicability to Vision Transformers (ViTs). We present a new benchmark using the Autoformer search space evaluated on 6 distinct tasks, and propose Layer-Sample Wise Activation with Gradients information (L-SWAG), a novel, generalizable metric that characterises both convolutional and transformer architectures across 14 tasks. Additionally, previous works have highlighted that different proxies contain complementary information, motivating the need for an ML model to identify useful combinations. To further enhance ZC-NAS, we therefore introduce LIBRA-NAS (Low Information gain and Bias Re-Alignment), a method that strategically combines proxies to best represent a specific benchmark. Integrated into the NAS search, LIBRA-NAS outperforms evolution- and gradient-based NAS techniques by identifying an architecture with a 17.0% test error on ImageNet1k in just 0.1 GPU days.
Poster
Zekang Yang · Wang ZENG · Sheng Jin · Chen Qian · Ping Luo · Wentao Liu
[ ExHall D ]
Abstract
Designing effective neural architectures poses a significant challenge in deep learning. While Neural Architecture Search (NAS) automates the search for optimal architectures, existing methods are often constrained by predetermined search spaces and may miss critical neural architectures. In this paper, we introduce NADER (Neural Architecture Design via multi-agEnt collaboRation), a novel framework that formulates neural architecture design (NAD) as an LLM-based multi-agent collaboration problem. NADER employs a team of specialized agents to enhance a base architecture through iterative modification. Current LLM-based NAD methods typically operate independently, lacking the ability to learn from past experiences, which results in repeated mistakes and inefficient exploration. To address this issue, we propose the Reflector, which effectively learns from immediate feedback and long-term experiences. Additionally, unlike previous LLM-based methods that use code to represent neural architectures, we utilize a graph-based representation. This approach allows agents to focus on design aspects without being distracted by coding. We demonstrate the effectiveness of NADER in discovering high-performing architectures beyond predetermined search spaces through extensive experiments on benchmark tasks, showcasing its advantages over state-of-the-art methods.
Poster
Minghao Fu · Hao Yu · Jie Shao · Junjie Zhou · Ke Zhu · Jianxin Wu
[ ExHall D ]
Abstract
Deep neural networks, while achieving remarkable success across diverse tasks, demand significant resources, including computation, GPU memory, bandwidth, storage, and energy. Network quantization, as a standard compression and acceleration technique, reduces storage costs and enables potential inference acceleration by discretizing network weights and activations into a finite set of integer values. However, current quantization methods are often complex and sensitive, requiring extensive task-specific hyperparameters, where even a single misconfiguration can impair model performance, limiting generality across different models and tasks. In this paper, we propose Quantization without Tears (QwT), a method that simultaneously achieves quantization speed, accuracy, simplicity, and generality. The key insight of QwT is to incorporate a lightweight additional structure into the quantized network to mitigate information loss during quantization. This structure consists solely of a small set of linear layers, keeping the method simple and efficient. More importantly, it provides a closed-form solution, allowing us to improve accuracy effortlessly in under 2 minutes. Extensive experiments across various vision, language, and multimodal tasks demonstrate that QwT is both highly effective and versatile. In fact, our approach offers a robust solution for network quantization that combines simplicity, accuracy, and adaptability, which provides new insights for the design of novel quantization …
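The compensation idea above, a small linear structure with a closed-form solution that mitigates quantization error, can be illustrated with a least-squares fit on calibration features; this is a hedged sketch of the general idea, not the authors' exact QwT formulation, and the quantizer and feature shapes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def fake_quantize(x, num_bits=4):
    # Simple symmetric uniform quantizer, for illustration only.
    scale = np.abs(x).max() / (2 ** (num_bits - 1) - 1)
    return np.round(x / scale) * scale

# Calibration features from a hypothetical block: full precision vs. quantized.
feats_fp = rng.normal(size=(512, 64))
feats_q = fake_quantize(feats_fp, num_bits=4)

# Closed-form fit: solve min_{W,b} || feats_fp - (feats_q @ W + b) ||^2.
X = np.hstack([feats_q, np.ones((feats_q.shape[0], 1))])   # append bias column
Wb, *_ = np.linalg.lstsq(X, feats_fp, rcond=None)
W, b = Wb[:-1], Wb[-1]

compensated = feats_q @ W + b
print("error before:", np.mean((feats_fp - feats_q) ** 2))
print("error after: ", np.mean((feats_fp - compensated) ** 2))
```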
Poster
Hongjun Wang · Wonmin Byeon · Jiarui Xu · Jinwei Gu · Ka Chun Cheung · Jan Kautz · Xiaolong Wang · Kai Han · Sifei Liu
[ ExHall D ]
Abstract
We present the Generalized Spatial Propagation Network (GSPN), a new attention mechanism optimized for vision tasks that inherently captures 2D spatial structures. Existing attention models, including transformers, linear attention, and state-space models like Mamba, process multi-dimensional data as 1D sequences, compromising spatial coherence and efficiency. GSPN overcomes these limitations by directly operating on spatially coherent image data and forming dense pairwise connections through a unique line-scan approach. Central to GSPN is the Stability-Context Condition, which ensures stable, context-aware propagation across 2D sequences and reduces the effective sequence length to √N, significantly enhancing computational efficiency. With learnable, input-dependent weights and no reliance on positional embeddings, GSPN achieves superior spatial fidelity and state-of-the-art performance in vision tasks, including ImageNet classification, class-guided image generation, and text-to-image generation. Notably, GSPN accelerates SD-XL with softmax-attention by over 84× when generating 16K images. Codes will be released upon publication.
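As a loose illustration of line-scan propagation over a 2D feature map (the basic recurrence family GSPN builds on), here is a minimal sketch with a fixed propagation weight; the Stability-Context Condition and the learnable, input-dependent weights of GSPN are not reproduced.

```python
import torch

def line_scan(x: torch.Tensor, w: float = 0.5) -> torch.Tensor:
    """x: (H, W) feature map. Propagate left-to-right along each row with a
    fixed weight: h[:, j] = x[:, j] + w * h[:, j - 1]."""
    H, W = x.shape
    h = torch.zeros_like(x)
    h[:, 0] = x[:, 0]
    for j in range(1, W):
        h[:, j] = x[:, j] + w * h[:, j - 1]
    return h

print(line_scan(torch.randn(4, 8)).shape)   # torch.Size([4, 8])
```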
Poster
Weihao Yu · Xinchao Wang
[ ExHall D ]
Abstract
Mamba, an architecture whose RNN-like token mixer is a state space model (SSM), was recently introduced to address the quadratic complexity of the attention mechanism and subsequently applied to vision tasks. Nevertheless, the performance of Mamba for vision is often underwhelming when compared with convolutional and attention-based models. In this paper, we delve into the essence of Mamba, and conceptually conclude that Mamba is ideally suited for tasks with long-sequence and autoregressive characteristics. For vision tasks, as image classification on ImageNet does not align with either characteristic, we hypothesize that Mamba is not necessary for this task; detection and segmentation tasks on COCO or ADE20K are also not autoregressive, yet they adhere to the long-sequence characteristic, so we believe it is still worthwhile to explore Mamba's potential for these tasks. To empirically verify our hypotheses, we construct a series of models named MambaOut through stacking Mamba blocks while removing their core token mixer, SSM. Experimental results strongly support our hypotheses. Specifically, our MambaOut model surpasses all visual Mamba models on ImageNet image classification, indicating that Mamba is indeed unnecessary for this task. As for detection and segmentation, MambaOut cannot match the performance of state-of-the-art visual Mamba models, demonstrating the potential of …
Poster
Haoyang He · Jiangning Zhang · Yuxuan Cai · Hongxu Chen · Xiaobin Hu · Zhenye Gan · Yabiao Wang · Chengjie Wang · Yunsheng Wu · Lei Xie
[ ExHall D ]
Abstract
Previous research on lightweight models has primarily focused on CNNs and Transformer-based designs. CNNs, with their local receptive fields, struggle to capture long-range dependencies, while Transformers, despite their global modeling capabilities, are limited by quadratic computational complexity in high-resolution scenarios. Recently, state-space models have gained popularity in the visual domain due to their linear computational complexity. Despite their low FLOPs, current lightweight Mamba-based models exhibit suboptimal throughput. In this work, we propose the MobileMamba framework, which balances efficiency and performance. We design a three-stage network to enhance inference speed significantly. At a fine-grained level, we introduce the Multi-Receptive Field Feature Interaction (MRFFI) module, comprising the Long-Range Wavelet Transform-Enhanced Mamba (WTE-Mamba), Efficient Multi-Kernel Depthwise Deconvolution (MK-DeConv), and Eliminate Redundant Identity components. This module integrates multi-receptive field information and enhances high-frequency detail extraction. Additionally, we employ training and testing strategies to further improve performance and efficiency. MobileMamba achieves up to 83.6% Top-1 accuracy, surpassing existing state-of-the-art methods, while running up to 21× faster than LocalVim on GPU. Extensive experiments on high-resolution downstream tasks demonstrate that MobileMamba surpasses current efficient models, achieving an optimal balance between speed and accuracy.
Poster
Hao Yu · Tangyu Jiang · Shuning Jia · Shannan Yan · Shunning Liu · Haolong Qian · Guanghao Li · Shuting Dong · Chun Yuan
[ ExHall D ]
Abstract
The Transformer architecture has revolutionized various domains since it was proposed, and its effectiveness largely depends on the ability to encode positional information. Traditional position encoding methods exhibit significant limitations due to their lack of robustness and flexibility with respect to position. Therefore, Rotary Positional Encoding (RoPE) was proposed to alleviate these issues; it integrates positional information by rotating the embeddings in the attention mechanism. However, RoPE requires manually defined rotation matrices with a limited transformation space, constraining the model's capacity. In this work, we propose **ComRoPE**, which generalizes **RoPE** by defining it in terms of trainable **com**muting angle matrices. Specifically, we demonstrate that pairwise commutativity of these matrices is essential for RoPE to achieve scalability and positional robustness. We formally define the RoPE Equation, an essential condition that ensures consistent performance under position offsets. Based on the theoretical analysis, we present two types of trainable commuting angle matrices as sufficient solutions to the RoPE equation, which significantly improve performance, surpassing the current state-of-the-art method by 1.6% at training resolution and 2.9% at higher resolution on the ImageNet-1K dataset. Furthermore, our framework shows versatility in generalizing to existing RoPE formulations and offering new insights for future positional encoding research. To ensure reproducibility, the source code and instructions are available …
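For reference, here is a minimal sketch of the standard fixed-rotation RoPE that ComRoPE generalizes: each 2D pair of feature channels is rotated by a position-dependent angle. The trainable commuting angle matrices of the paper are not reproduced; the function below is an illustrative baseline only, using the common "rotate-half" pairing.

```python
import torch

def rotary_embed(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0):
    """Apply rotary position encoding to x of shape (seq, dim), dim even."""
    seq, dim = x.shape
    half = dim // 2
    # Per-pair rotation frequencies, as in the original RoPE formulation.
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = positions[:, None].float() * freqs[None, :]          # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each 2D channel pair (x1_i, x2_i) by its position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(16, 64)
q_rot = rotary_embed(q, torch.arange(16))
print(q_rot.shape)   # torch.Size([16, 64])
```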
Poster
Yuwei Sun · Hideya Ochiai · Zhirong Wu · Stephen Lin · Ryota Kanai
[ ExHall D ]
Abstract
Moving beyond the pairwise attention of conventional Transformers, there is a growing interest in sparse attention mechanisms that align more closely with localized, contextual learning in the biological brain. Existing studies such as the Coordination method employ iterative cross-attention mechanisms with a bottleneck to enable the sparse association of inputs. However, these methods are parameter inefficient and fail in more complex relational reasoning tasks. To this end, we propose the Associative Transformer (AiT) to enhance the association among sparsely attended input tokens, improving parameter efficiency and performance in various classification and relational reasoning tasks. AiT leverages a learnable explicit memory comprising specialized priors that guide bottleneck attentions to facilitate the extraction of diverse localized tokens. Moreover, AiT employs an associative memory-based token reconstruction using a Hopfield energy function. Extensive empirical experiments demonstrate that AiT requires significantly fewer parameters and attention layers while outperforming a broad range of sparse Transformer models. Additionally, AiT outperforms the SOTA sparse Transformer models, including the Coordination method, on the Sort-of-CLEVR dataset.
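The associative-memory ingredient above can be pictured with a single modern Hopfield-style retrieval step, where noisy tokens are pulled toward stored patterns via a softmax over similarities; this is a hedged sketch of that generic update, not AiT's full energy-based token reconstruction, and the memory size and temperature are assumptions.

```python
import torch

def hopfield_retrieve(queries: torch.Tensor, memory: torch.Tensor, beta: float = 8.0):
    """queries: (n, d); memory: (m, d). One retrieval step pulls each query
    toward the stored pattern(s) it is most similar to."""
    attn = torch.softmax(beta * queries @ memory.T, dim=-1)   # (n, m)
    return attn @ memory                                      # (n, d)

memory = torch.nn.functional.normalize(torch.randn(32, 64), dim=-1)
noisy = memory[:4] + 0.1 * torch.randn(4, 64)                 # corrupted patterns
retrieved = hopfield_retrieve(noisy, memory)
print(torch.nn.functional.cosine_similarity(retrieved, memory[:4], dim=-1))
```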
Poster
Jon Donnelly · Zhicheng Guo · Alina Jade Barnett · Hayden McTavish · Chaofan Chen · Cynthia Rudin
[ ExHall D ]
Abstract
Interpretability is critical for machine learning models in high-stakes settings because it allows users to verify the model's reasoning. In computer vision, prototypical part models (ProtoPNets) have become the dominant model type to meet this need. Users can easily identify flaws in ProtoPNets, but fixing problems in a ProtoPNet requires slow, difficult retraining that is not guaranteed to resolve the issue. This problem is called the "interaction bottleneck." We solve the interaction bottleneck for ProtoPNets by simultaneously finding many equally good ProtoPNets (i.e., a draw from a "Rashomon set"). We show that our framework, called Proto-RSet, quickly produces many accurate, diverse ProtoPNets, allowing users to correct problems in real time while maintaining performance guarantees with respect to the training set. We demonstrate the utility of this method in two settings: 1) removing synthetic bias introduced to a bird-identification model and 2) debugging a skin cancer identification model. This tool empowers non-machine-learning experts, such as clinicians or domain experts, to quickly refine and correct machine learning models _without_ repeated retraining by machine learning experts.
Poster
Xin Lin · Chong Shi · Zuopeng Yang · Haojin Tang · Zhili Zhou
[ ExHall D ]
Abstract
Recent open-vocabulary human-object interaction (OV-HOI) detection methods primarily rely on large language models (LLMs) for generating auxiliary descriptions and leverage knowledge distilled from CLIP to detect unseen interaction categories. Despite their effectiveness, these methods face two challenges: (1) feature granularity deficiency, due to reliance on last-layer visual features for text alignment, leading to the neglect of crucial object-level details from intermediate layers; (2) semantic similarity confusion, resulting from CLIP's inherent biases toward certain classes, while LLM-generated descriptions based solely on labels fail to adequately capture inter-class similarities. To address these challenges, we propose a stratified granular comparison network. First, we introduce a granularity sensing alignment module that aggregates global semantic features with local details, refining interaction representations and ensuring robust alignment between intermediate visual features and text embeddings. Second, we develop a hierarchical group comparison module that recursively compares and groups classes using LLMs, generating fine-grained and discriminative descriptions for each interaction category. Experimental results on two widely-used benchmark datasets, SWIG-HOI and HICO-DET, demonstrate that our method achieves state-of-the-art results in OV-HOI detection. Codes will be released on GitHub.
Poster
Kiet A. Nguyen · Adheesh Juvekar · Tianjiao Yu · Muntasir Wahed · Ismini Lourentzou
[ ExHall D ]
Abstract
Recent advances in Large Vision-Language Models (LVLMs) have sparked significant progress in general-purpose vision tasks through visual instruction tuning. While some works have demonstrated the capability of LVLMs to generate segmentation masks that align phrases with natural language descriptions in a single image, they struggle with segmentation-grounded comparisons across multiple images, particularly at finer granularities such as object parts. In this paper, we introduce the new task of *part-focused semantic co-segmentation*, which seeks to identify and segment common and unique objects and parts across multiple images. To address this task, we present CALICO, the first LVLM that can segment and reason over multiple masks across images, enabling object comparison based on their constituent parts. CALICO features two novel components: a Correspondence Extraction Module, which captures semantic-rich information to identify part-level correspondences between objects, and a Correspondence Adaptation Module, which embeds this information into the LLM and facilitates multi-image understanding in a parameter-efficient manner. To support training and evaluation, we curate MIXEDPARTS, a comprehensive multi-image segmentation dataset containing ∼2.4M samples across ∼44K images with diverse object and part categories. Experimental results show CALICO, finetuned on only 0.3% of its architecture, achieves robust performance in part-focused semantic co-segmentation. Code, models, and …
Poster
Zelin Peng · Zhengqin Xu · Zhilin Zeng · Changsong Wen · Yu Huang · Menglin Yang · feilong tang · Wei Shen
[ ExHall D ]
Abstract
CLIP, a foundational vision-language model, has emerged as a powerful tool for open-vocabulary semantic segmentation. While freezing the text encoder preserves its powerful embeddings, recent studies show that fine-tuning both the text and image encoders jointly significantly enhances segmentation performance, especially for classes from open sets. In this work, we explain this phenomenon from the perspective of hierarchical alignment, since during fine-tuning, the hierarchy level of image embeddings shifts from image-level to pixel-level. We achieve this by leveraging hyperbolic space, which naturally encodes hierarchical structures. Our key observation is that, during fine-tuning, the hyperbolic radius of CLIP's text embeddings decreases, facilitating better alignment with the pixel-level hierarchical structure of visual data. Building on this insight, we propose HyperCLIP, a novel fine-tuning strategy that adjusts the hyperbolic radius of the text embeddings through scaling transformations. By doing so, HyperCLIP equips CLIP with segmentation capability while introducing only a small number of learnable parameters. Our experiments demonstrate that HyperCLIP achieves state-of-the-art performance on open-vocabulary semantic segmentation tasks across three benchmarks, while fine-tuning only approximately 4% of the total parameters of CLIP. More importantly, we observe that after adjustment, CLIP's text embeddings exhibit a relatively fixed hyperbolic radius across datasets, suggesting that the …
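To make the "hyperbolic radius" notion concrete, here is a small sketch in the Poincare ball with curvature -1: the radius of an embedding is its hyperbolic distance from the origin, and a scaling transformation can shrink or enlarge that radius. The function names and the exact scaling rule are illustrative assumptions, not HyperCLIP itself.

```python
import torch

def poincare_radius(x: torch.Tensor, eps: float = 1e-5):
    """Hyperbolic distance of each row of x from the origin of the Poincare ball."""
    norm = x.norm(dim=-1).clamp(max=1 - eps)
    return 2.0 * torch.atanh(norm)

def rescale_radius(x: torch.Tensor, scale: float, eps: float = 1e-5):
    """Scale the hyperbolic radius of each embedding by `scale`."""
    norm = x.norm(dim=-1, keepdim=True).clamp(min=eps, max=1 - eps)
    new_norm = torch.tanh(scale * torch.atanh(norm))
    return x / norm * new_norm

emb = torch.nn.functional.normalize(torch.randn(4, 512), dim=-1) * 0.6
print(poincare_radius(emb))                        # original radii
print(poincare_radius(rescale_radius(emb, 0.5)))   # radii halved
```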
Poster
Pei Wang · Zhaowei Cai · Hao Yang · Ashwin Swaminathan · R. Manmatha · Stefano Soatto
[ ExHall D ]
Abstract
Traditional segmentation models, while effective in isolated tasks, often fail to generalize to more complex and open-ended segmentation problems, such as free-form, open-vocabulary, and in-the-wild scenarios. To bridge this gap, we propose to scale up image segmentation across diverse datasets and tasks such that the knowledge across different tasks and datasets can be integrated while improving the generalization ability. QueryMeldNet, a novel segmentation framework, is introduced and designed to scale seamlessly across both data size and task diversity. It is built upon a dynamic object query mechanism called query meld, which fuses different types of queries using cross-attention. This hybrid approach enables the model to balance between instance- and stuff-level segmentation, providing enhanced scalability for handling diverse object types. We further enhance scalability by leveraging synthetic data, generating segmentation masks and captions for pixel-level and open-vocabulary tasks, drastically reducing the need for costly human annotations. By training on multiple datasets and tasks at scale, QueryMeldNet continuously improves performance as the volume and diversity of data and tasks increase. It exhibits strong generalization capabilities, boosting performance on the open-set segmentation benchmark SeginW by 7 points. These advancements mark a key step toward universal, scalable segmentation models capable of addressing the demands of real-world applications.
Poster
Amin Karimi · Charalambos Poullis
[ ExHall D ]
Abstract
Few-shot semantic segmentation (FSS) aims to enable models to segment novel/unseen object classes using only a limited number of labeled examples. However, current FSS methods frequently struggle with generalization due to incomplete and biased feature representations, especially when support images do not capture the full appearance variability of the target class. To improve the FSS pipeline, we propose a novel framework that utilizes large language models (LLMs) to adapt general class semantic information to the query image. Furthermore, the framework employs dense pixel-wise matching to identify similarities between query and support images, resulting in enhanced FSS performance. Inspired by reasoning-based segmentation frameworks, our method, named DSV-LFS, introduces an additional token into the LLM vocabulary, allowing a multimodal LLM to generate a "semantic prompt" from class descriptions. In parallel, a dense matching module identifies visual similarities between the query and support images, generating a "visual prompt". These prompts are then jointly employed to guide the prompt-based decoder for accurate segmentation of the query image. Comprehensive experiments on the benchmark datasets Pascal-5i and COCO-20i demonstrate that our framework achieves state-of-the-art performance by a significant margin, demonstrating superior generalization to novel classes and robustness across diverse scenarios.
Poster
Yuchen Zhu · Cheng Shi · Dingyou Wang · Jiajin Tang · Zhengxuan Wei · Yu Wu · Guanbin Li · Sibei Yang
[ ExHall D ]
Abstract
Class-incremental/Continual image segmentation (CIS) aims to train an image segmenter in stages, where the set of available categories differs at each stage. To leverage the built-in objectness of query-based transformers, which mitigates catastrophic forgetting of mask proposals, current methods often decouple mask generation from the continual learning process. This study, however, identifies two key issues with decoupled frameworks: loss of plasticity and heavy reliance on input data order. To address these, we conduct an in-depth investigation of the built-in objectness and find that highly aggregated image features provide a shortcut for queries to generate masks through simple feature alignment. Based on this, we propose SimCIS, a simple yet powerful baseline for CIS. Its core idea is to directly select image features for query assignment, ensuring "perfect alignment" to preserve objectness, while simultaneously allowing queries to select new classes to promote plasticity. To further combat catastrophic forgetting of categories, we introduce cross-stage consistency in selection and an innovative "visual query"-based replay mechanism. Experiments demonstrate that SimCIS consistently outperforms state-of-the-art methods across various segmentation tasks, settings, splits, and input data orders. All models and codes will be made publicly available.
Poster
Seun-An Choe · Keon Hee Park · Jinwoo Choi · Gyeong-Moon Park
[ ExHall D ]
Abstract
Unsupervised domain adaptation for semantic segmentation (UDA-SS) aims to transfer knowledge from labeled synthetic data (source) to unlabeled real-world data (target). Traditional UDA-SS methods work on the assumption that the category settings between the source and target domains are known in advance. However, in real-world scenarios, the category settings of the source and target are unknown due to the lack of labels in the target, resulting in the existence of target-private or source-private classes. Traditional UDA-SS methods struggle with this change, leading to negative transfer and performance degradation. To address these issues, we propose Universal Domain Adaptation for Semantic Segmentation (UniDA-SS) for the first time, to achieve good performance even when the category settings of source and target are unknown. We identify the core problem in the UniDA-SS scenario as the lowered confidence scores of common classes in the target domain, which leads to confusion with private classes. To solve this problem, we propose two methods. The first method is Domain-Specific Prototype-based Distinction, which represents each class with two domain-specific prototypes and then weights common classes based on the observation that a truly common class will be similar to both prototypes. The second method is Target-based …
Poster
Yuhan Liu · Yixiong Zou · Yuhua Li · Ruixuan Li
[ ExHall D ]
Abstract
Cross-Domain Few-Shot Segmentation (CDFSS) is proposed to transfer the pixel-level segmentation capabilities learned from large-scale source-domain datasets to downstream target-domain datasets, with only a few annotated images per class. In this paper, we focus on a well-observed but unresolved phenomenon in CDFSS: for target domains, particularly those distant from the source domain, segmentation performance peaks at the very early epochs, and declines sharply as the source-domain training proceeds. We delve into this phenomenon for an interpretation: low-level features are vulnerable to domain shifts, leading to sharper loss landscapes during the source-domain training, which is the devil of CDFSS. Based on this phenomenon and interpretation, we further propose a method that includes two plug-and-play modules: one to flatten the loss landscapes for low-level features during source-domain training as a novel sharpness-aware minimization method, and the other to directly supplement target-domain information to the model during target-domain testing by low-level-based calibration. Extensive experiments on four target datasets validate our rationale and demonstrate that our method surpasses the state-of-the-art method in CDFSS significantly, by 3.71% and 5.34% average mIoU in the 1-shot and 5-shot scenarios, respectively.
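For context, the flattening idea builds on sharpness-aware minimization. Below is a minimal, generic SAM step in PyTorch (ascend to a worst-case weight perturbation within an L2 ball, compute the gradient there, then update from the original weights); the paper's low-level-feature variant is not reproduced, and the toy model and rho value are assumptions.

```python
import torch

def sam_step(model, loss_fn, x, y, optimizer, rho=0.05):
    # 1) Ascend to the worst-case point within an L2 ball of radius rho.
    loss_fn(model(x), y).backward()
    grad_norm = torch.norm(torch.stack(
        [p.grad.norm() for p in model.parameters() if p.grad is not None]))
    eps = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                eps.append(None)
                continue
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)
            eps.append(e)
    model.zero_grad()
    # 2) Gradient at the perturbed point, then undo the perturbation and step.
    loss_fn(model(x), y).backward()
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            if e is not None:
                p.sub_(e)
    optimizer.step()
    optimizer.zero_grad()

model = torch.nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
sam_step(model, torch.nn.functional.cross_entropy,
         torch.randn(8, 10), torch.randint(0, 2, (8,)), opt)
```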
Poster
Yan Yang · Liyuan Pan · Dongxu Li · Liu Liu
[ ExHall D ]
Abstract
This paper studies zero-shot object recognition using event camera data. Guided by CLIP, which is pre-trained on RGB images, existing approaches achieve zero-shot object recognition by optimizing embedding similarities between event data and RGB images respectively encoded by an event encoder and the CLIP image encoder. Alternatively, several methods learn RGB frame reconstructions from event data for the CLIP image encoder. However, they often result in suboptimal zero-shot performance. This study develops an event encoder without relying on additional reconstruction networks. We theoretically analyze the performance bottlenecks of previous approaches: the embedding optimization objectives are prone to suffer from the spatial sparsity of event data, causing semantic misalignments between the learned event embedding space and the CLIP text embedding space. To mitigate the issue, we explore a scalar-wise modulation strategy. Furthermore, to scale up the number of event and RGB data pairs for training, we also study a pipeline for synthesizing event data from static RGB images at scale. Experimentally, we demonstrate an attractive scaling property in the number of parameters and synthesized data. We achieve superior zero-shot object recognition performance on extensive standard benchmark datasets, even compared with past supervised learning approaches. For example, our model with a ViT/B-16 backbone achieves …
Poster
Xianing Chen · Si Huo · Borui Jiang · Hailin Hu · Xinghao Chen
[ ExHall D ]
Abstract
Few-shot counting estimates the number of target objects in an image using only a few annotated exemplars. However, domain shift severely hinders existing methods from generalizing to unseen scenarios. This falls into the realm of single domain generalization, which remains unexplored in few-shot counting. To solve this problem, we begin by analyzing the main limitations of current methods, which typically follow a standard pipeline that extracts object prototypes from exemplars and then matches them with image features to construct a correlation map. We argue that existing methods overlook the significance of learning highly generalized prototypes. Building on this insight, we propose the first domain generalization few-shot counter, Universal Representation Matching, termed URM. Our primary contribution is the discovery that incorporating universal vision-language representations distilled from a large-scale pretrained vision-language model into the correlation construction process substantially improves robustness to domain shifts without compromising in-domain performance. As a result, URM achieves state-of-the-art performance in both the in-domain and the newly introduced domain generalization settings.
Poster
Hao Tan · Zichang Tan · Jun Li · Ajian Liu · Jun Wan · Zhen Lei
[ ExHall D ]
Abstract
Identifying multiple novel classes in an image, known as open-vocabulary multi-label recognition, is a challenging task in computer vision. Recent studies explore the transfer of powerful vision-language models such as CLIP. However, these approaches face two critical challenges: (1) The local semantics of CLIP are disrupted due to its global pre-training objectives, resulting in unreliable regional predictions. (2) The matching property between image regions and candidate labels has been neglected, relying instead on naive feature aggregation such as average pooling, which leads to spurious predictions from irrelevant regions. In this paper, we present RAM (Recover And Match), a novel framework that effectively addresses the above issues. To tackle the first problem, we propose Ladder Local Adapter (LLA) to enforce the model to refocus on its surrounding local regions, recovering local semantics in a memory-friendly way. For the second issue, we propose Knowledge-Constrained Optimal Transport (KCOT) to suppress meaningless matching to non-GT labels by formulating the task as an optimal transport problem. As a result, RAM achieves state-of-the-art performance on NUS-WIDE, MS-COCO, RAPv1, PA100K, MultiScene and MLRSNet datasets from three distinct domains, and shows great potential to boost the existing methods. Code will be made available.
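The matching step above is cast as an optimal transport problem. A standard way to solve entropy-regularized OT is Sinkhorn iteration, sketched below for matching region features to candidate label embeddings; this illustrates only the generic machinery, and the knowledge constraints of KCOT as well as the shapes and uniform marginals here are assumptions.

```python
import numpy as np

def sinkhorn(cost, a, b, reg=0.05, n_iters=200):
    """Return a transport plan T minimizing <T, cost> - reg * H(T)
    with row marginals a and column marginals b."""
    K = np.exp(-cost / reg)                  # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
regions = rng.normal(size=(7, 32))           # 7 image-region features
labels = rng.normal(size=(5, 32))            # 5 candidate label embeddings
cost = 1.0 - (regions @ labels.T) / (
    np.linalg.norm(regions, axis=1, keepdims=True)
    * np.linalg.norm(labels, axis=1))        # cosine distance
plan = sinkhorn(cost, np.full(7, 1 / 7), np.full(5, 1 / 5))
print(plan.shape, plan.sum())                # (7, 5), total mass ~1.0
```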
Poster
Dongseob Kim · Hyunjung Shim
[ ExHall D ]
Abstract
Multi-label classification is crucial for comprehensive image understanding, yet acquiring accurate annotations is challenging and costly. To address this, a recent study suggests exploiting unsupervised multi-label classification leveraging CLIP, a powerful vision-language model. Despite CLIP's proficiency, it suffers from view-dependent predictions and inherent bias, limiting its effectiveness. We propose a novel method that addresses these issues by leveraging multiple views near target objects, guided by Class Activation Mapping (CAM) of the classifier, and debiasing pseudo-labels derived from CLIP predictions. Our Classifier-guided CLIP Distillation (CCD) enables selecting multiple local views without extra labels and debiasing predictions to enhance classification performance. Experimental results validate our method's superiority over existing techniques across diverse datasets. The code will be publicly available.
Poster
Phi Vu Tran
[ ExHall D ]
Abstract
Recent years have witnessed tremendous advances on modern visual recognition systems. Despite such progress, many vision models still struggle with the open problem of learning from few exemplars. This paper focuses on the task of object detection in the setting where object classes follow a natural long-tailed distribution. Existing approaches to long-tailed detection resort to external ImageNet labels to augment the low-shot training instances. However, such dependency on a large labeled database is impractical and has limited utility in realistic scenarios. We propose a more versatile approach to leverage optional unlabeled images, which are easy to collect without the burden of human annotations. Our framework is straightforward and intuitive, and consists of three simple steps: (1) pre-training on abundant head classes; (2) transfer learning on scarce tail classes; and (3) fine-tuning on a sampled set of both head and tail classes. Our approach can be viewed as an improved head-to-tail model transfer paradigm without the added complexities of meta-learning or knowledge distillation, as was required in past research. By harnessing supplementary unlabeled images, without the need for extra image labels, SimLTD establishes new record results on the challenging LVIS v1 benchmark across both supervised and semi-supervised settings.
Poster
Aming Wu · Cheng Deng
[ ExHall D ]
Abstract
To accelerate the safe deployment of object detectors, we focus on reducing the impact of both covariate and semantic shifts. We consider a realistic yet challenging scenario, namely Open-Domain Unknown Object Detection (ODU-OD), which aims to detect unknown objects in unseen target domains without accessing any auxiliary data. Towards ODU-OD, it is feasible to learn a robust discriminative boundary by synthesizing virtual features. Generally, perception, memory, and imagination are three essential capacities of human beings. Through multi-level perception and rich memory about known objects, the characteristics of unknown objects can be imagined sufficiently, enhancing the ability to discriminate known from unknown objects. Inspired by this idea, we propose a World Feature Simulation (WFS) approach, mainly consisting of a multi-level perception, a memory recorder, and an unknown-feature generator. Specifically, after extracting the features of the input, we separately employ a Mamba and a Graph Network to obtain the global-level and connective-level representations. Next, a codebook containing multiple learnable codewords is defined to preserve fragmented memory of known objects. Meanwhile, we perform a modulated operation on the memory to form the imagination bank involving unknown characteristics. Finally, to alleviate the impact of lacking supervision data, based on the multi-level representation and imagination bank, …
Poster
Marc-Antoine Lavoie · Anas Mahmoud · Steven L. Waslander
[ ExHall D ]
Abstract
The current state-of-the-art methods in domain adaptive object detection (DAOD) use Mean Teacher self-labelling, where a teacher model, directly derived as an exponential moving average of the student model, is used to generate labels on the target domain, which are then used to improve both models in a positive loop. This couples learning and generating labels on the target domain, and other recent works also leverage the generated labels to add additional domain alignment losses. We believe this coupling is brittle and excessively constrained: there is no guarantee that a student trained only on source data can generate accurate target domain labels and initiate the positive feedback loop, and much better target domain labels can likely be generated by using a large pretrained network that has been exposed to much more data. Vision foundation models are exactly such models, and they have shown impressive task generalization capabilities even when frozen. We want to leverage these models for DAOD and introduce DINO Teacher, which consists of two components. First, we train a new labeller on source data only using a large frozen DINOv2 backbone and show it generates more accurate labels than Mean Teacher. Next, we align the student's source and …
Poster
Zhixiong Nan · Xianghong Li · Tao Xiang · Jifeng Dai
[ ExHall D ]
Abstract
Based on an analysis of the cascaded decoder architecture commonly adopted in existing DETR-like models, this paper proposes a new decoder architecture. The cascaded decoder architecture constrains object queries to update in the cascaded direction, enabling object queries to learn only relatively limited information from image features. However, the challenges of object detection in natural scenes (e.g., objects that are extremely small, heavily occluded, or confusingly mixed with the background) require an object detection model to fully utilize image features, which motivates us to propose a new decoder architecture with a parallel Multi-time Inquiries (MI) mechanism. MI enables object queries to learn more comprehensive information, and our MI-based model, MI-DETR, outperforms all existing DETR-like models on the COCO benchmark under different backbones and training epochs, achieving +2.3 AP and +0.6 AP improvements over the most representative model DINO and the SOTA model Relation-DETR under a ResNet-50 backbone. In addition, a series of diagnostic and visualization experiments demonstrate the effectiveness, rationality, and interpretability of MI.
Poster
Huixin Sun · Runqi Wang · Yanjing Li · Linlin Yang · Shaohui Lin · Xianbin Cao · Baochang Zhang
[ ExHall D ]
Abstract
Deep learning has significantly advanced the object detection field. However, tiny object detection (TOD) remains a challenging problem. We provide a new analysis method to examine the TOD challenge through occlusion-based attribution analysis in the frequency domain. We observe that tiny objects become less distinct after feature encoding and can benefit from the removal of high-frequency information. In this paper, we propose a novel approach named Spectral Enhancement for Tiny object detection (SET), which amplifies the frequency signatures of tiny objects in a heterogeneous architecture. SET includes two modules. The Hierarchical Background Smoothing (HBS) module suppresses high-frequency noise in the background through adaptive smoothing operations. The Adversarial Perturbation Injection (API) module leverages adversarial perturbations to increase feature saliency in critical regions and prompt the refinement of object features during training. Extensive experiments on four datasets demonstrate the effectiveness of our method. In particular, SET boosts the prior art RFLA by 3.2% AP on the AI-TOD dataset.
Poster
Wenxi Chen · Raymond A. Yeh · Shaoshuai Mou · Yan Gu
[ ExHall D ]
Abstract
Out-of-distribution (OOD) detection is the task of identifying inputs that deviate from the training data distribution. This capability is essential for the safe deployment of deep computer vision models in open-world environments. In this work, we propose a post-hoc method, Perturbation-Rectified OOD detection (PRO), based on the insight that prediction confidence for OOD inputs is more susceptible to reduction under perturbation than for in-distribution (IND) inputs. From this observation, we propose a meta-score function that searches for local minimum scores near the original inputs by applying gradient descent. This procedure enhances the separability between IND and OOD samples. Importantly, the approach improves OOD detection performance without complex modifications to the underlying model architectures or training protocol. To validate our approach, we conduct extensive experiments using the OpenOOD benchmark. Our approach further pushes the limit of softmax-based OOD detection and is the leading post-hoc method for small-scale models. On a CIFAR-10 model with adversarial training, PRO effectively detects near-OOD inputs, achieving a reduction of more than 10% on FPR@95 compared to state-of-the-art methods.
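The meta-score idea (descend the confidence near an input and use the resulting local minimum as the detection score) can be sketched as follows; the step size, number of steps, and toy classifier are assumptions, and this is not the exact PRO score.

```python
import torch

def perturbation_rectified_score(model, x, steps=5, step_size=0.01):
    """Search for a local minimum of the softmax confidence near x and return
    that minimum confidence as the (lower-is-more-OOD) score."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        conf = torch.softmax(model(x_adv), dim=-1).max(dim=-1).values.sum()
        grad, = torch.autograd.grad(conf, x_adv)
        x_adv = (x_adv - step_size * grad.sign()).detach()   # descend confidence
    with torch.no_grad():
        return torch.softmax(model(x_adv), dim=-1).max(dim=-1).values

model = torch.nn.Sequential(torch.nn.Linear(32, 16), torch.nn.ReLU(),
                            torch.nn.Linear(16, 10))
print(perturbation_rectified_score(model, torch.randn(4, 32)))
```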
Poster
Kaichen Yang · Junjie Cao · Zeyu Bai · Zhixun Su · Andrea Tagliasacchi
[ ExHall D ]
Abstract
We introduce the Pose and Illumination agnostic Anomaly Detection (PIAD) problem, a generalization of pose-agnostic anomaly detection (PAD). Being illumination agnostic is critical, as it relaxes the assumption that training data for an object has to be acquired in the same lighting configuration as the query images that we want to test. Moreover, even if the object is placed within the same capture environment, being illumination agnostic implies that we can relax the assumption that the relative pose between the environment light and the query object has to match the one in the training data. We introduce a new dataset to study this problem, containing both synthetic and real-world examples, propose a new baseline for PIAD, and demonstrate how our baseline provides state-of-the-art results in both PAD and PIAD, not only on the newly proposed dataset but also on existing datasets that were designed for the simpler PAD problem. Source code and data will be made publicly available upon paper acceptance.
Poster
wenxin ma · Xu Zhang · Qingsong Yao · Fenghe Tang · Chenxu Wu · Yingtai Li · Rui Yan · Zihang Jiang · S Kevin Zhou
[ ExHall D ]
Abstract
Anomaly detection (AD) identifies outliers for applications like defect and lesion detection. While CLIP shows promise for zero-shot AD tasks due to its strong generalization capabilities, its inherent Anomaly-Unawareness leads to limited discrimination between normal and abnormal features. To address this problem, we propose Anomaly-Aware CLIP (AA-CLIP), which enhances CLIP's anomaly discrimination ability in both text and visual spaces while preserving its generalization capability. AA-CLIP is achieved by a straightforward and effective two-stage approach: it first creates anomaly-aware text anchors to clearly differentiate normal and abnormal semantics, then aligns patch-level visual features with these anchors for precise anomaly localization. AA-CLIP uses lightweight linear residual adapters to maintain CLIP's generalization and improves AD performance efficiently. Extensive experiments validate AA-CLIP as a resource-efficient solution for zero-shot AD tasks, achieving state-of-the-art results in industrial and medical applications. (code is available in Supplementary Material)
Poster
Ziming Huang · Xurui Li · Haotian Liu · Feng Xue · Yuzhe Wang · Yu Zhou
[ ExHall D ]
Abstract
Recently, multi-class anomaly classification has garnered increasing attention. Previous methods directly cluster anomalies but often struggle due to the lack of anomaly-prior knowledge. Acquiring this knowledge faces two issues: non-prominent and weak-semantics anomalies. In this paper, we propose AnomalyNCD, a multi-class anomaly classification network compatible with different anomaly detection methods. To address the non-prominence of anomalies, we design main element binarization (MEBin) to obtain anomaly-centered images, ensuring anomalies are learned while avoiding the impact of incorrect detections. Next, to learn anomalies with weak semantics, we design mask-guided representation learning, which focuses on isolated anomalies guided by masks and reduces confusion from erroneous inputs through re-corrected pseudo labels. Finally, to enable flexible classification at both region and image levels, we develop a region merging strategy that determines the overall image category based on the classified anomaly regions. Our method outperforms the state-of-the-art works on the MVTec AD and MTD datasets. Compared with current methods, AnomalyNCD combined with a zero-shot anomaly detection method achieves a 10.8% F1 gain, 8.8% NMI gain, and 9.5% ARI gain on MVTec AD, and a 12.8% F1 gain, 5.7% NMI gain, and 10.8% ARI gain on MTD. The code will be publicly available.
Poster
Xiaofan Li · Xin Tan · Zhuo Chen · Zhizhong Zhang · Ruixin Zhang · Rizen Guo · Guannan Jiang · Yulong Chen · Yanyun Qu · Lizhuang Ma · Yuan Xie
[ ExHall D ]
Abstract
With the rise of generative models, there is a growing interest in unifying all tasks within a generative framework. Anomaly detection methods also fall into this scope and utilize diffusion models to generate or reconstruct normal samples when given arbitrary anomaly images. However, our study found that the diffusion model suffers from severe faithfulness hallucination and catastrophic forgetting, which cannot cope with unpredictable pattern increments. To mitigate the above problems, we propose a continual diffusion model that uses gradient projection to achieve stable continual learning. Gradient projection regularizes model updates by modifying the gradient toward directions that protect the learned knowledge. However, as a double-edged sword, it also incurs huge memory costs brought by the Markov process. Hence, we propose an iterative singular value decomposition method based on the transitive property of linear representation, which consumes tiny memory and incurs almost no performance loss. Finally, considering the risk of over-fitting to normal images of the diffusion model, we propose an anomaly-masked network to enhance the condition mechanism of the diffusion model. For continual anomaly detection, our method achieves first place in 17/18 settings on MVTec and VisA.
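To illustrate the gradient-projection ingredient in isolation: given an orthonormal basis spanning directions tied to previously learned knowledge, each new gradient can be projected onto the orthogonal complement of that subspace before the update. The sketch below shows only this generic operation; the paper's iterative SVD memory reduction is not reproduced, and the basis here is synthetic.

```python
import torch

def project_gradient(grad: torch.Tensor, basis: torch.Tensor) -> torch.Tensor:
    """Remove from `grad` its component inside span(basis).
    grad: (d,) flattened gradient; basis: (d, k) with orthonormal columns."""
    return grad - basis @ (basis.T @ grad)

d, k = 256, 8
# Synthetic orthonormal basis of the "protected" subspace.
basis, _ = torch.linalg.qr(torch.randn(d, k))
grad = torch.randn(d)
grad_proj = project_gradient(grad, basis)
# The projected gradient has no component along the protected directions.
print(torch.allclose(basis.T @ grad_proj, torch.zeros(k), atol=1e-5))  # True
```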
Poster
Shibin Mei · Hang Wang · Bingbing Ni
[ ExHall D ]
Abstract
Geodesic distance serves as a reliable means of measuring distance in nonlinear spaces, and such nonlinear manifolds are prevalent in current multimodal learning. In these scenarios, some samples may exhibit high similarity, yet they convey different semantics, making traditional distance metrics inadequate for distinguishing between positive and negative samples. This paper introduces geodesic distance as a novel distance metric in multi-modal learning for the first time, to mine correlations between samples, aiming to address the limitations of common distance metrics. Our approach incorporates a comprehensive series of strategies to adapt geodesic distance to current multimodal learning. Specifically, we construct a graph structure to represent the adjacency relationships among samples by thresholding distances between them and then apply the shortest-path algorithm to obtain the geodesic distance within this graph. To facilitate efficient computation, we further propose a hierarchical graph structure built through clustering, combined with incremental update strategies for dynamic status updates. Extensive experiments across various downstream tasks validate the effectiveness of our proposed method, demonstrating its capability to capture complex relationships between samples and improve the performance of multimodal learning models.
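A minimal sketch of the graph-based geodesic distance just described, assuming Euclidean embeddings and SciPy: threshold pairwise distances to form an adjacency graph, then run a shortest-path algorithm over it. The hierarchical and incremental-update variants are not reproduced, and the threshold choice is an assumption.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 16))            # sample embeddings

dist = cdist(feats, feats)                    # Euclidean pairwise distances
threshold = np.quantile(dist[dist > 0], 0.1)  # keep the closest 10% of edges
adj = np.where(dist <= threshold, dist, 0.0)  # 0 means "no edge" for csgraph

# Geodesic distance = length of the shortest path through the adjacency graph.
geo = shortest_path(adj, method="D", directed=False)
print(geo.shape)                  # (100, 100); np.inf marks disconnected pairs
```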
Poster
Seonggon Kim · Juncheol Shin · Seung-taek Woo · Eunhyeok Park
[ ExHall D ]
Abstract
It has become increasingly important to optimize backpropagation to reduce memory usage and computational overhead. Achieving this goal is highly challenging, as multiple objectives must be considered jointly while maintaining training quality. In this paper, we focus on matrix multiplication, which accounts for the largest portion of training costs, and analyze its backpropagation in detail to identify lightweight techniques that offer the best benefits. Based on this analysis, we introduce a novel method, Hadamard-based Optimized Training (HOT). In this approach, we apply Hadamard-based optimizations, such as Hadamard quantization and Hadamard low-rank approximation, selectively and with awareness of the suitability of each optimization for different backward paths. Additionally, we introduce two enhancements: activation buffer compression and layer-wise quantizer selection. Our extensive analysis shows that HOT achieves up to 75% memory savings and a 2.6× acceleration on real GPUs, with negligible accuracy loss compared to FP32 precision.
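As a toy illustration of the Hadamard quantization ingredient named above: rotating a tensor with an orthonormal Hadamard matrix spreads outlier channels before uniform quantization, and the rotation is undone afterwards. This is a hedged sketch under simple assumptions (NumPy, a 4-bit symmetric quantizer, a power-of-two dimension), not the authors' HOT pipeline.

```python
import numpy as np
from scipy.linalg import hadamard

def quantize(x, num_bits=4):
    # Simple symmetric uniform quantizer, for illustration only.
    scale = np.abs(x).max() / (2 ** (num_bits - 1) - 1)
    return np.round(x / scale) * scale

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 64))
x[:, 3] *= 25.0                           # inject an outlier channel

H = hadamard(64) / np.sqrt(64)            # orthonormal Hadamard matrix

plain = quantize(x)                       # quantize directly
rotated = quantize(x @ H) @ H.T           # quantize in the Hadamard domain

print("plain error   :", np.mean((x - plain) ** 2))
print("hadamard error:", np.mean((x - rotated) ** 2))
```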
Poster
Zhiqiang Shen · Ammar Sherif · Zeyuan Yin · Shitong Shao
[ ExHall D ]
Abstract
Recent advances in dataset distillation have led to solutions in two main directions. The conventional batch-to-batch matching mechanism is ideal for small-scale datasets and includes bi-level optimization methods on models and syntheses, such as FRePo, RCIG, and RaT-BPTT, as well as other methods like distribution matching, gradient matching, and weight trajectory matching. Conversely, batch-to-global matching typifies decoupled methods, which are particularly advantageous for large-scale datasets. This approach has garnered substantial interest within the community, as seen in SRe2L, G-VBSM, WMDD, and CDA. A primary challenge with the second approach is the lack of diversity among syntheses within each class, since samples are optimized independently and the same global supervision signals are reused across different synthetic images. In this study, we propose a new Diversity-driven EarlyLate Training (DELT) scheme to enhance the diversity of images in batch-to-global matching with less computation. Our approach is conceptually simple yet effective: it partitions predefined IPC samples into smaller subtasks and employs local optimizations to distill each subset into distributions from distinct phases, reducing the uniformity induced by the unified optimization process. These distilled images from the subtasks demonstrate effective generalization when applied to the entire task. We conduct extensive experiments on CIFAR, Tiny-ImageNet, ImageNet-1K, …
Poster
Jiamu Zhang · Shaochen (Henry) Zhong · Andrew Ye · Zirui Liu · Sebastian Zhao · Kaixiong Zhou · Li Li · Soo-Hyun Choi · Rui Chen · Xia Hu · Shuai Xu · Vipin Chaudhary
[ ExHall D ]
Abstract
Densely structured pruning methods — which generate pruned models in a fully dense format, allowing immediate compression benefits without additional demands — are evolving owing to their practical significance. Traditional techniques in this domain mainly revolve around coarser granularities, such as filter pruning, thereby limiting their performance due to restricted pruning freedom. Recent advancements in Grouped Kernel Pruning (GKP) have enabled the utilization of finer granularity while maintaining the densely structured format. We observed that existing GKP methods often introduce dynamic operations to different aspects of their procedures, many of which come at the cost of added complications and/or imposed limitations — e.g., requiring an expensive mixture of clustering schemes, or having dynamic pruning rates and sizes among groups, which leads to reliance on custom architecture support for the pruned models. In this work, we argue that the best practice for introducing such dynamic operations to GKP is to make Conv2d(groups) (a.k.a. the group count) flexible under an integral optimization, leveraging its ideal alignment with the infrastructure support of Grouped Convolution. Pursuing this direction, we present a one-shot, post-train, data-agnostic GKP method that is more performant, adaptive, and efficient than its predecessors, while simultaneously being a lot more user-friendly with little-to-no hyper-parameter tuning …
Poster
Fu Feng · Yucheng Xie · Jing Wang · Xin Geng
[ ExHall D ]
Abstract
The growing complexity of model parameters underscores the significance of pre-trained models; however, the constraints encountered during deployment necessitate models of variable sizes. Consequently, the traditional pre-training and fine-tuning paradigm fails to address initialization problems when target models are incompatible with pre-trained ones. We tackle this issue from a multitasking perspective and introduce **WAVE**, which incorporates a set of shared **W**eight templates for **A**daptive initialization of **V**ariable-siz**E**d models. During initialization, target models initialize the corresponding weight scalers tailored to their sizes, which are sufficient to learn the connection rules of the weight templates, based on the Kronecker product, from a limited amount of data. To construct the weight templates, WAVE utilizes the *Learngene* framework, which condenses size-agnostic knowledge from ancestry models into the weight templates as the learngenes through knowledge distillation. This process structures the pre-trained models' knowledge according to the rules of the weight templates. We provide a comprehensive benchmark for learngenes, with extensive experiments demonstrating WAVE's efficiency. Results show that WAVE achieves state-of-the-art performance when initializing models of various depths and widths. With only 10 epochs of training, n variable-sized models initialized by WAVE outperform directly pre-trained models trained over 150 epochs, particularly for smaller models, achieving approximately …
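The Kronecker-product construction mentioned above can be pictured with a tiny example: a shared template combined with a small per-model scaler yields weights of different target sizes. The shapes, names, and initialization scale below are illustrative assumptions, not the actual WAVE templates or learngenes.

```python
import torch

template = torch.randn(16, 16)            # shared, size-agnostic weight template

def init_weight(template: torch.Tensor, out_mult: int, in_mult: int):
    """Build a (16*out_mult, 16*in_mult) weight from a small trainable scaler
    combined with the frozen template via the Kronecker product."""
    scaler = torch.nn.Parameter(torch.randn(out_mult, in_mult) * 0.02)
    weight = torch.kron(scaler, template)  # shape: (16*out_mult, 16*in_mult)
    return weight, scaler

w_small, _ = init_weight(template, out_mult=2, in_mult=2)   # 32 x 32 layer
w_large, _ = init_weight(template, out_mult=8, in_mult=4)   # 128 x 64 layer
print(w_small.shape, w_large.shape)
```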
Poster
Hongxu chen · Zhen Wang · Runshi Li · Bowei Zhu · Long Chen
[ ExHall D ]
Abstract
Low-rank adaptations (LoRA) are widely used to fine-tune large models across various domains for specific downstream tasks. While task-specific LoRAs are often available, concerns about data privacy and intellectual property can restrict access to training data, limiting the acquisition of a multi-task model through gradient-based training. In response, LoRA merging presents an effective solution by combining multiple LoRAs into a unified adapter while maintaining data privacy. Prior works on LoRA merging primarily frame it as an optimization problem, yet these approaches face several limitations, including rough assumptions about the input features utilized in optimization, massive sample requirements, and an unbalanced optimization objective. These limitations can significantly degrade performance. To address these, we propose a novel optimization-based method, named IterIS: 1) We formulate LoRA merging as an advanced optimization problem to mitigate the rough assumption. Additionally, we employ an iterative inference-solving framework in our algorithm, which can progressively refine the optimization objective for improved performance. 2) We introduce an efficient regularization term to reduce the sample requirements (requiring only 1-5% of the unlabeled samples used by prior methods). 3) We utilize adaptive weights in the optimization objective to mitigate potential imbalances in the LoRA merging process. Our method demonstrates …
Poster
Qiang Wang · Xiang Song · Yuhang He · Jizhou Han · Chenhao Ding · Xinyuan Gao · Yihong Gong
[ ExHall D ]
Abstract
Deep neural networks (DNNs) often underperform in real-world, dynamic settings where data distributions change over time. Domain Incremental Learning (DIL) offers a solution by enabling continuous model adaptation, with Parameter-Isolation DIL (PIDIL) emerging as a promising paradigm to reduce knowledge conflicts. However, existing PIDIL methods struggle with parameter selection accuracy, especially as the number of domains grows. To address this, we propose SOYO, a lightweight framework that improves domain selection in PIDIL. SOYO introduces a Gaussian Mixture Compressor (GMC) and Domain Feature Resampler (DFR) to store and balance prior domain data efficiently, while a Multi-level Domain Feature Fusion Network (MDFN) enhances domain feature extraction. Our framework supports multiple Parameter-Efficient Fine-Tuning (PEFT) methods and is validated across tasks such as image classification, object detection, and speech enhancement. Experimental results on six benchmarks demonstrate SOYO’s consistent superiority over existing baselines, showcasing its robustness and adaptability in complex, evolving environments. The source code will be released to the community.
Poster
Yuhao Zhou · Yuxin Tian · Jindi Lv · Mingjia Shi · Yuanxi Li · Qing Ye · Shuhao Zhang · Jiancheng Lv
[ ExHall D ]
Abstract
In the realm of high-frequency data streams, achieving real-time learning within varying memory constraints is paramount. This paper presents Ferret, a comprehensive framework designed to enhance the *online accuracy* of Online Continual Learning (OCL) algorithms while dynamically adapting to varying memory budgets. Ferret employs a fine-grained pipeline parallelism strategy combined with an iterative gradient compensation algorithm, ensuring seamless handling of high-frequency data with minimal latency, and effectively counteracting the challenge of stale gradients in parallel training. To adapt to varying memory budgets, its automated model partitioning and pipeline planning optimize performance regardless of memory limitations. Extensive experiments across 20 benchmarks and 5 integrated OCL algorithms show Ferret's remarkable efficiency, achieving up to 3.7× lower memory overhead to reach the same online accuracy compared to competing methods. Furthermore, Ferret consistently outperforms these methods across diverse memory budgets, underscoring its superior adaptability. These findings position Ferret as a premier solution for efficient and adaptive OCL in real-time environments.
Poster
Xiaohan Zou · Wenchao Ma · Shu Zhao
[ ExHall D ]
Abstract
Recent advancements in prompt-based learning have significantly advanced image and video class-incremental learning. However, the prompts learned by these methods often fail to capture the diverse and informative characteristics of videos, and struggle to generalize effectively to future tasks and classes. To address these challenges, this paper proposes modeling the distribution of space-time prompts conditioned on the input video using a diffusion model. This generative approach allows the proposed model to naturally handle the diverse characteristics of videos, leading to more robust prompt learning and enhanced generalization capabilities. Additionally, we develop a mechanism that transfers the token relationship modeling capabilities of a pre-trained image transformer to spatio-temporal modeling for videos. Our approach has been thoroughly evaluated across four established benchmarks, showing remarkable improvements over existing state-of-the-art methods in video class-incremental learning.
Poster
Hao Yu · Xin Yang · Le Zhang · Hanlin Gu · Tianrui Li · Lixin Fan · Qiang Yang
[ ExHall D ]
Abstract
Federated continual learning (FCL) allows each client to continually update its knowledge from task streams, enhancing the applicability of federated learning in real-world scenarios. However, FCL needs to address not only spatial data heterogeneity between clients but also temporal data heterogeneity between tasks. In this paper, empirical experiments demonstrate that such input-level heterogeneity significantly affects the model's internal parameters and outputs, leading to severe spatial-temporal catastrophic forgetting of previous and local knowledge. To this end, we propose Federated Tail Anchor (FedTA) to mix trainable Tail Anchor with the frozen output features to adjust their position in the feature space, thereby overcoming parameter-forgetting and output-forgetting. Moreover, three novel components are also included in FedTA: Input Enhancement for improving the performance of pre-trained models on downstream tasks; Selective Input Knowledge Fusion for fusion of heterogeneous local knowledge on the server side; and Best Global Prototype Selection for finding the best anchor point for each class in the feature space. Extensive experiments demonstrate that FedTA not only outperforms existing FCL methods but also effectively preserves the relative positions of features, remaining unaffected by spatial and temporal changes.
Poster
Takuma Fukuda · Hiroshi Kera · Kazuhiko Kawamoto
[ ExHall D ]
Abstract
We propose Adapter Merging with Centroid Prototype Mapping (ACMap), an exemplar-free framework for class-incremental learning (CIL) that addresses both catastrophic forgetting and scalability. While existing methods trade off between inference time and accuracy, ACMap consolidates task-specific adapters into a single adapter, ensuring constant inference time across tasks without compromising accuracy. The framework employs adapter merging to build a shared subspace that aligns task representations and mitigates forgetting, while centroid prototype mapping maintains high accuracy through consistent adaptation in the shared subspace. To further improve scalability, an early stopping strategy limits adapter merging as tasks increase. Extensive experiments on five benchmark datasets demonstrate that ACMap matches state-of-the-art accuracy while maintaining inference speed comparable to the fastest existing methods.
Poster
Guannan Lai · Yujie Li · Xiangkun Wang · Junbo Zhang · Tianrui Li · Xin Yang
[ ExHall D ]
Abstract
Class Incremental Learning (CIL) requires a model to continuously learn new classes without forgetting previously learned ones. While recent studies have significantly alleviated the problem of catastrophic forgetting (CF), more and more research reveals that the order in which classes appear has a significant influence on CIL models. Specifically, prioritizing the learning of classes with lower similarity will enhance the model's generalization performance and its ability to mitigate forgetting. Hence, it is imperative to develop an order-robust class incremental learning model that maintains stable performance even when faced with varying levels of class similarity in different orders. In response, we first provide additional theoretical analysis, which reveals that when the similarity among a group of classes is lower, the model demonstrates increased robustness to the class order. Then, we introduce a novel **G**raph-**D**riven **D**ynamic **S**imilarity **G**rouping (**GDDSG**) method, which leverages a graph coloring algorithm for class-based similarity grouping. The proposed approach trains independent CIL models for each group of classes, ultimately combining these models to facilitate joint prediction. Experimental results demonstrate that our method effectively addresses the issue of class order sensitivity while achieving optimal performance in both model accuracy and anti-forgetting capability. Our code is included in the supplementary materials.
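One plausible reading of similarity grouping via graph coloring is sketched below: classes whose pairwise similarity exceeds a threshold are connected, and a proper coloring then places highly similar classes into different groups, so each group contains mutually dissimilar classes. The cosine-similarity measure, threshold, and greedy ordering are assumptions for illustration, not GDDSG's exact procedure.

```python
# Hedged sketch: greedy graph coloring over a class-similarity graph.
import numpy as np

def similarity_groups(class_prototypes, threshold=0.6):
    protos = np.asarray(class_prototypes, dtype=float)
    protos /= np.linalg.norm(protos, axis=1, keepdims=True)
    sim = protos @ protos.T                      # cosine similarity between class prototypes
    n = len(protos)
    color = [-1] * n
    for c in sorted(range(n), key=lambda i: -(sim[i] > threshold).sum()):  # high-degree classes first
        taken = {color[j] for j in range(n) if j != c and sim[c, j] > threshold and color[j] != -1}
        k = 0
        while k in taken:
            k += 1
        color[c] = k
    groups = {}
    for c, k in enumerate(color):
        groups.setdefault(k, []).append(c)
    return list(groups.values())                 # each group would train its own CIL model

prototypes = np.random.default_rng(0).normal(size=(10, 32))
print(similarity_groups(prototypes))
```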
Poster
Vaibhav Rathore · Shubhranil B · Saikat Dutta · Sarthak Mehrotra · Zsolt Kira · Biplab Banerjee
[ ExHall D ]
Abstract
Generalized Class Discovery (GCD) clusters base and novel classes in a target domain, using supervision from a source domain with only base classes. Current methods often falter with distribution shifts and typically require access to target data during training, which can sometimes be impractical. To address this issue, we introduce the novel paradigm of Domain Generalization in GCD (DG-GCD), where only source data is available for training, while the target domain—with a distinct data distribution—remains unseen until inference. To this end, our solution, \ourmodel, aims to construct a domain-independent, discriminative embedding space for GCD. The core innovation is an episodic training strategy that enhances cross-domain generalization by adapting a base model on tasks derived from source and synthetic domains generated by a foundation model. Each episode focuses on a cross-domain GCD task, diversifying task setups over episodes and combining open-set domain adaptation with a novel margin loss and representation learning to optimize the feature space progressively. To capture the effects of fine-tuning on the base model, we extend task arithmetic by adaptively weighting the local task vectors of the fine-tuned models based on their GCD performance on a validation distribution. This episodic update mechanism boosts the adaptability of the base …
Poster
Yue Zhang · Mingyue Bin · Yuyang Zhang · Zhongyuan Wang · Zhen Han · Chao Liang
[ ExHall D ]
Abstract
Unsupervised domain adaptation (UDA) aims to learn discriminative features from a labeled source domain by supervised learning and to transfer the knowledge to an unlabeled target domain via distribution alignment. However, in some real-world scenarios, e.g., public safety or access control, it’s difficult to obtain sufficient source domain data, which hinders the application of existing UDA methods. To this end, this paper investigates a realistic but rarely studied problem called one-shot unsupervised domain adaptation (OSUDA), where there is only one example per category in the source domain and abundant unlabeled samples in the target domain. Compared with UDA, OSUDA faces dual challenges in both feature learning and domain alignment due to the lack of sufficient source data. To address these challenges, we propose a simple but effective link-based contrastive learning (LCL) method for OSUDA. On the one hand, with the help of in-domain links that indicate whether two samples are from the same cluster, LCL can learn discriminative features with abundant unlabeled target data. On the other hand, by constructing cross-domain links that show whether two clusters are bidirectionally matched, LCL can realize accurate domain alignment with only one source sample per category. Extensive experiments conducted on 4 public domain …
Poster
Weiming Liu · Jun Dan · Fan Wang · Xinting Liao · Junhao Dong · Hua Yu · Shunjie Dong · Lianyong Qi
[ ExHall D ]
Abstract
Nowadays, domain adaptation techniques have been widely investigated for knowledge sharing from a labeled source domain to an unlabeled target domain. However, the target domain may include some data samples that belong to unknown categories in real-world scenarios. Moreover, the target domain cannot access the source data samples due to privacy-preserving restrictions. In this paper, we focus on the source-free open set domain adaptation problem, which includes two main challenges, i.e., how to distinguish known and unknown target samples and how to exploit useful source information to provide trustworthy pseudo labels for known target samples. Existing approaches that directly apply conventional domain alignment methods can lead to sample mismatch and misclassification in this scenario. To overcome these issues, we propose the Distinguish Then Exploit model (DTE) with two components, i.e., weight barcode estimation and sparse label assignment. Weight barcode estimation first calculates the marginal probability of target samples via partially unbalanced optimal transport and then quantizes the barcode results to distinguish unknown target samples. Sparse label assignment utilizes sparse sample-label matching via a proximal term to fully exploit useful source information. Our empirical study on several datasets shows that DTE outperforms state-of-the-art models in tackling the source-free open set domain adaptation problem.
Poster
Shinnosuke Matsuo · Riku Togashi · Ryoma Bise · Seiichi Uchida · Masahiro Nomura
[ ExHall D ]
Abstract
Active learning (AL) is a label-efficient machine learning paradigm that focuses on selectively annotating high-value instances to maximize learning efficiency. Its effectiveness can be further enhanced by incorporating weak supervision, which uses rough yet cost-effective annotations instead of exact (i.e., full) but expensive annotations. We introduce a novel AL framework, Instance-wise Supervision-Level Optimization (ISO), which not only selects the instances to annotate but also determines their optimal annotation level within a fixed annotation budget. Its optimization criterion leverages the value-to-cost ratio (VCR) of each instance while ensuring diversity among the selected instances. In classification experiments, ISO consistently outperforms traditional AL methods and surpasses a state-of-the-art AL approach that combines full and weak supervision, achieving higher accuracy at a lower overall cost.
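A minimal sketch of budgeted selection by value-to-cost ratio (VCR) with a simple diversity penalty follows. ISO's actual value estimates, cost model, and diversity criterion are not specified here; the greedy rule, penalty weight, and two annotation levels are illustrative assumptions.

```python
# Hedged sketch: greedy instance/level selection by value-to-cost ratio under a budget.
import numpy as np

def select_annotations(values, costs, features, budget, diversity_weight=0.5):
    """values[i, l], costs[i, l]: estimated value / cost of annotating instance i at level l
    (e.g., l=0 weak label, l=1 full label). Returns a list of (instance, level) picks."""
    values, costs = np.asarray(values, float), np.asarray(costs, float)
    feats = np.asarray(features, float)
    feats /= np.linalg.norm(feats, axis=1, keepdims=True)
    picked, spent, chosen_feats = [], 0.0, []
    remaining = set(range(len(values)))
    while remaining:
        best, best_score = None, -np.inf
        for i in remaining:
            for l in range(values.shape[1]):
                if spent + costs[i, l] > budget:
                    continue
                redundancy = max((feats[i] @ f for f in chosen_feats), default=0.0)
                score = values[i, l] / costs[i, l] - diversity_weight * redundancy
                if score > best_score:
                    best, best_score = (i, l), score
        if best is None:
            break
        picked.append(best)
        spent += costs[best]
        chosen_feats.append(feats[best[0]])
        remaining.discard(best[0])
    return picked

rng = np.random.default_rng(1)
print(select_annotations(rng.random((6, 2)), np.tile([1.0, 4.0], (6, 1)),
                         rng.normal(size=(6, 8)), budget=10))
```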
Poster
Sk Miraj Ahmed · Umit Basaran · Dripta S. Raychaudhuri · Arindam Dutta · Rohit Kundu · Fahim Faisal Niloy · Basak Guler · Amit K. Roy-Chowdhury
[ ExHall D ]
Abstract
As machine learning becomes more pervasive and data privacy regulations evolve, the ability to remove private or copyrighted information from trained models is becoming an increasingly critical requirement. Existing unlearning methods often rely on the assumption of having access to the entire training dataset during the forgetting process. However, this assumption may not hold true in practical scenarios where the original training data may not be accessible, i.e., the source-free setting. To address this challenge, we focus on the source-free unlearning scenario, where an unlearning algorithm must be capable of removing specific data from a trained model without requiring access to the original training dataset. Building on recent work, we present a method that can estimate the Hessian of the unknown remaining training data, a crucial component required for efficient unlearning. Leveraging this estimation technique, our method enables efficient zero-shot unlearning while providing robust theoretical guarantees on unlearning performance and maintaining performance on the remaining data. Extensive experiments over a wide range of datasets verify the efficacy of our method.
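To make the role of the Hessian concrete, below is a hedged sketch of the standard Newton-style unlearning update from the influence-function / certified-removal literature, assuming some estimate of the remaining-data Hessian is available (which is exactly what this paper estimates without source data). The toy ridge-style Hessian, damping, and model are illustrative assumptions.

```python
# Illustrative Newton-style unlearning update; not the paper's exact estimator.
import numpy as np

def newton_unlearn(theta, forget_grads, hessian_estimate, damping=1e-3):
    """Remove the influence of the forget set: theta' = theta + H^{-1} * sum of forget gradients."""
    H = hessian_estimate + damping * np.eye(len(theta))
    g_forget = np.sum(forget_grads, axis=0)
    return theta + np.linalg.solve(H, g_forget)

# toy usage with a ridge-regression-style Hessian estimate H = X^T X / n + lambda I
rng = np.random.default_rng(0)
X_proxy = rng.normal(size=(200, 5))           # proxy data standing in for the unknown remaining set
H_hat = X_proxy.T @ X_proxy / len(X_proxy) + 0.1 * np.eye(5)
theta = rng.normal(size=5)
forget_grads = rng.normal(size=(3, 5))        # per-sample gradients of the data to forget
print(newton_unlearn(theta, forget_grads, H_hat))
```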
Poster
Taero Kim · Subeen Park · Sungjun Lim · Yonghan Jung · Krikamol Muandet · Kyungwoo Song
[ ExHall D ]
Abstract
Learning robust models under distribution shifts between training and test datasets is a fundamental challenge in machine learning. While learning invariant features across environments is a popular approach, it often assumes that these features are fully observed in both training and test sets—a condition frequently violated in practice. When models rely on invariant features absent in the test set, their robustness in new environments can deteriorate. To tackle this problem, we introduce a novel learning principle called the Sufficient Invariant Learning (SIL) framework, which focuses on learning a sufficient subset of invariant features rather than relying on a single feature. After demonstrating the limitation of existing invariant learning methods, we propose a new algorithm, Adaptive Sharpness-aware Group Distributionally Robust Optimization (ASGDRO), to learn diverse invariant features by seeking common flat minima across the environments. We theoretically demonstrate that finding a common flat minima enables robust predictions based on diverse invariant features. Empirical evaluations on multiple datasets, including our new benchmark, confirm ASGDRO's robustness against distribution shifts, highlighting the limitations of existing methods.
Poster
Zhiwei Ling · Yachen Chang · Hailiang Zhao · Xinkui Zhao · Kingsum Chow · Shuiguang Deng
[ ExHall D ]
Abstract
Deep neural networks (DNNs) have been widely criticized for their overconfidence when dealing with out-of-distribution (OOD) samples, highlighting the critical need for effective OOD detection to ensure the safe deployment of DNNs in real-world settings. Existing post-hoc OOD detection methods primarily enhance the discriminative power of logit-based approaches by reshaping sample features, yet they often neglect critical information inherent in the features themselves. In this paper, we propose the Class-Aware Relative Feature-based method (CARef), which utilizes the error between a sample’s feature and its class-aware average feature as a discriminative criterion. To further refine this approach, we introduce the Class-Aware Decoupled Relative Feature-based method (CADRef), which decouples sample features based on the alignment of signs between the relative feature and corresponding model weights, enhancing the discriminative capabilities of CARef. Extensive experimental results across multiple datasets and models demonstrate that both proposed methods exhibit effectiveness and robustness in OOD detection compared to state-of-the-art methods. Specifically, our two methods outperform the best baseline by 2.82\% and 3.27\% in AUROC, with improvements of 4.03\% and 6.32\% in FPR95, respectively.
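A minimal sketch of a class-aware relative-feature OOD score in the spirit of CARef is given below: compare a test feature with the average training feature of its predicted class and use the relative error as the criterion. The exact normalization, and the way CADRef decouples features by weight-sign alignment, are simplified assumptions here.

```python
# Hedged sketch of a class-aware relative-feature OOD score.
import numpy as np

def fit_class_means(train_features, train_labels, num_classes):
    return np.stack([train_features[train_labels == c].mean(axis=0) for c in range(num_classes)])

def caref_score(feature, logits, class_means):
    pred = int(np.argmax(logits))
    mean_c = class_means[pred]
    # relative L1 error between the sample feature and its class-aware average feature
    return np.abs(feature - mean_c).sum() / (np.abs(mean_c).sum() + 1e-8)  # larger => more OOD-like

rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 16)) + np.repeat(np.eye(4, 16) * 5, 25, axis=0)
labels = np.repeat(np.arange(4), 25)
means = fit_class_means(feats, labels, 4)
in_dist, ood = feats[0], rng.normal(size=16) * 3
print(caref_score(in_dist, means @ in_dist, means),   # low score for an in-distribution sample
      caref_score(ood, means @ ood, means))           # high score for an OOD sample
```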
Poster
Zheng Wang · Zihui Wang · Zheng Wang · Xiaoliang Fan · Cheng Wang
[ ExHall D ]
Abstract
Federated learning (FL) is emerging as a promising technique for collaborative learning without local data leaving their devices. However, clients' data originating from diverse domains may degrade model performance due to domain shifts, preventing the model from learning a consistent representation space. In this paper, we propose a novel FL framework, Federated Domain Shift Eraser (FDSE), to improve model performance by erasing each client's domain skew differently and enhancing their consensus. First, we formulate the model forward pass as an iterative deskewing process that extracts and then deskews features alternately. This is efficiently achieved by decomposing each original layer in the neural network into a Domain-agnostic Feature Extractor (DFE) and a Domain-specific Skew Eraser (DSE). Then, a regularization term is applied to ensure the effectiveness of feature deskewing by pulling local statistics of DSE's outputs close to the globally consistent ones. Finally, DFE modules are fairly aggregated and broadcast to all the clients to maximize their consensus, and DSE modules are personalized for each client via similarity-aware aggregation to erase their domain skew differently. Comprehensive experiments were conducted on three datasets to confirm the advantages of our method in terms of accuracy, efficiency, and generalizability.
Poster
Run He · Kai Tong · Di Fang · Han Sun · Ziqian Zeng · Haoran Li · Tianyi Chen · Huiping Zhuang
[ ExHall D ]
Abstract
In this paper, we introduce analytic federated learning (AFL), a new training paradigm that brings analytical (i.e., closed-form) solutions to federated learning (FL) with pre-trained models. Our AFL draws inspiration from analytic learning---a gradient-free technique that trains neural networks with analytical solutions in one epoch. In the local client training stage, AFL facilitates one-epoch training, eliminating the necessity for multi-epoch updates. In the aggregation stage, we derive an absolute aggregation (AA) law. This AA law allows single-round aggregation, reducing heavy communication overhead and achieving fast convergence by removing the need for multiple aggregation rounds. More importantly, AFL exhibits invariance to data partitioning, meaning that regardless of how the full dataset is distributed among clients, the aggregated result remains identical. This property opens up various possibilities, such as data heterogeneity invariance and client-number invariance. We conduct experiments across various FL settings, including extremely non-IID ones and scenarios with a large number of clients (e.g., ≥1000). In all these settings, our AFL consistently performs competitively while existing FL techniques encounter various obstacles.
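The partition-invariance property can be illustrated with a generic closed-form (ridge-regression-style) classifier head on frozen features: each client sends sufficient statistics, and one aggregation round recovers exactly the pooled-data solution, whatever the split. This is a sketch of the generic closed form, not necessarily the paper's specific AA law.

```python
# Sketch: single-round aggregation of closed-form classifier statistics is partition-invariant.
import numpy as np

def client_statistics(features, one_hot_labels):
    return features.T @ features, features.T @ one_hot_labels   # (A_k, B_k)

def aggregate_and_solve(stats, gamma=1.0):
    A = sum(a for a, _ in stats)
    B = sum(b for _, b in stats)
    return np.linalg.solve(A + gamma * np.eye(A.shape[0]), B)   # W = (sum A_k + gamma I)^-1 sum B_k

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(300, 8)), np.eye(3)[rng.integers(0, 3, 300)]
# split the same data across clients in two different ways -> identical global weights
split_a = [client_statistics(X[i::3], Y[i::3]) for i in range(3)]
split_b = [client_statistics(X[:50], Y[:50]), client_statistics(X[50:], Y[50:])]
print(np.allclose(aggregate_and_solve(split_a), aggregate_and_solve(split_b)))   # True
```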
Poster
Naveen Kumar Kummari · Ranjeet Ranjan Jha · Krishna Mohan Chalavadi · Ravindra Babu Tallamraju
[ ExHall D ]
Abstract
Ensuring auditability and verifiability in federated learning (FL) is both challenging and essential to guarantee that local data remain untampered and client updates are trustworthy. Recent FL frameworks assess client contributions through a trusted central server using various client selection and aggregation techniques. However, reliance on a central server can create a single point of failure, making it vulnerable to privacy-centric attacks and limiting its ability to audit and verify client-side data contributions due to restricted access. In addition, data quality and fairness evaluations are often inadequate, failing to distinguish between high-impact contributions and those from low-quality or poisoned data. To address these challenges, we propose Federated Auditable and Verifiable Data valuation (FAVD), a privacy-preserving method that ensures auditability and verifiability of client contributions through data valuation, independent of any central authority or predefined training algorithm. FAVD utilizes shared local data density functions to construct a global density function, aligning data contributions and facilitating effective valuation prior to local model training. This proactive approach improves transparency in data valuation and ensures that only benign updates are generated, even in the presence of malicious data. Further, to mitigate privacy risks associated with sharing data density functions, we add Gaussian noise to each client’s …
Poster
Tae-Young Lee · Sundong Park · Minwoo Jeon · Hyoseok Hwang · Gyeong-Moon Park
[ ExHall D ]
Abstract
As concerns regarding privacy in deep learning continue to grow, individuals are increasingly apprehensive about the potential exploitation of their personal knowledge in trained models. Despite several research efforts to address this, existing methods often fail to consider the real-world demand from users for complete knowledge erasure. Furthermore, our investigation reveals that existing methods risk leaking personal knowledge through embedding features. To address these issues, we introduce the novel concept of Knowledge Deletion (KD), an advanced task that considers both concerns, and provide an appropriate metric, named the Knowledge Retention score (KR), for assessing knowledge retention in the feature space. To achieve this, we propose a novel training-free erasing approach named Erasing Space Concept (ESC), which restricts the important subspace for the forgetting knowledge by eliminating the relevant activations in the feature space. In addition, we suggest ESC with Training (ESC-T), which uses a learnable mask to better balance the trade-off between forgetting and preserving knowledge in KD. Our extensive experiments on various datasets and models demonstrate that our proposed methods achieve the fastest and state-of-the-art performance. Notably, our methods are applicable to diverse forgetting scenarios, such as the facial domain setting, demonstrating their generalizability.
Poster
Jiate Li · Meng Pang · Yun Dong · Binghui Wang
[ ExHall D ]
Abstract
Graph neural networks (GNNs) are becoming the de facto method to learn on graph data and have achieved state-of-the-art performance on node and graph classification tasks. However, recent works show GNNs are vulnerable to training-time poisoning attacks -- marginally perturbing edges, nodes, and node features of training graphs can largely degrade GNN performance. Most previous defenses against such attacks are empirical and are soon broken by adaptive / stronger ones. A few provable defenses provide robustness guarantees, but have large gaps when applied in practice: 1) all restrict the attacker’s capability to only one type of perturbation; 2) all are designed for a particular GNN task; and 3) their robustness guarantees are not 100\% accurate. In this work, we bridge all these gaps by developing PGNNCert, the first certified defense of GNNs against poisoning attacks under arbitrary (edge, node, and node feature) perturbations with deterministic (100\% accurate) robustness guarantees. PGNNCert is also applicable to the two most widely-studied node and graph classification tasks. Extensive evaluations on multiple node and graph classification datasets and GNNs demonstrate the effectiveness of PGNNCert to provably defend against arbitrary poisoning perturbations. PGNNCert is also shown to significantly outperform the state-of-the-art certified defenses against …
Poster
Keke Tang · Chao Hou · Weilong Peng · Xiang Fang · Zhize Wu · Yongwei Nie · Wenping Wang · Zhihong Tian
[ ExHall D ]
Abstract
Deep neural networks (DNNs) often exhibit out-of-distribution (OOD) overconfidence, producing overly confident predictions on OOD samples. We attribute this issue to the inherent over-complexity of DNNs and investigate two key aspects: capacity and nonlinearity. First, we demonstrate that reducing model capacity through knowledge distillation can effectively mitigate OOD overconfidence. Second, we show that selectively reducing nonlinearity by removing ReLU operations further alleviates the issue. Building on these findings, we present a practical guide to model simplification, combining both strategies to significantly reduce OOD overconfidence. Extensive experiments validate the effectiveness of this approach in mitigating OOD overconfidence and demonstrate its superiority over state-of-the-art methods. Additionally, our simplification strategies can be combined with existing OOD detection techniques to further enhance OOD detection performance. Codes will be made publicly available upon acceptance.
Poster
Ping Guo · Cheng Gong · Fei Liu · Xi Lin · Zhichao Lu · Qingfu Zhang · Zhenkun Wang
[ ExHall D ]
Abstract
Crafting adversarial examples is crucial for evaluating and enhancing the robustness of Deep Neural Networks (DNNs), presenting a challenge equivalent to maximizing a non-differentiable 0-1 loss function. However, existing single-objective methods, namely adversarial attacks that focus on a surrogate loss function, do not fully harness the benefits of engaging multiple loss functions, owing to an insufficient understanding of their synergistic and conflicting nature. To overcome these limitations, we propose the Multi-Objective Set-based Attack (MOS Attack), a novel adversarial attack framework that leverages multiple loss functions and automatically uncovers their interrelations. The MOS Attack adopts a set-based multi-objective optimization strategy, enabling the incorporation of numerous loss functions without additional parameters. It also automatically mines synergistic patterns among various losses, facilitating the generation of potent adversarial attacks with fewer objectives. Extensive experiments show that our MOS Attack outperforms single-objective attacks. Furthermore, by harnessing the identified synergistic patterns, MOS Attack continues to show superior results with a reduced number of loss functions.
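For intuition, the sketch below shows a single sign-gradient attack step that aggregates two surrogate losses (cross-entropy and a CW-style margin) with equal weights. MOS Attack's set-based multi-objective optimization and its automatic mining of synergistic loss patterns are not reproduced here; this only illustrates "attacking with multiple losses", and the toy model and step sizes are assumptions.

```python
# Hedged sketch: one multi-loss sign-gradient attack step (not the MOS Attack algorithm).
import torch
import torch.nn.functional as F

def multi_loss_step(model, x, y, eps=8 / 255, alpha=2 / 255):
    x_adv = x.clone().detach().requires_grad_(True)
    logits = model(x_adv)
    ce = F.cross_entropy(logits, y)
    true = logits.gather(1, y[:, None]).squeeze(1)
    mask = F.one_hot(y, logits.size(1)).bool()
    other = logits.masked_fill(mask, float("-inf")).max(dim=1).values
    margin = (true - other).mean()                     # CW-style margin; the attack wants it small
    grad = torch.autograd.grad(ce - margin, x_adv)[0]  # equal-weight aggregation of the two losses
    x_adv = x_adv.detach() + alpha * grad.sign()
    return torch.clamp(torch.min(torch.max(x_adv, x - eps), x + eps), 0, 1)

# toy usage with a tiny linear "image classifier"
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 10))
x, y = torch.rand(4, 3, 8, 8), torch.randint(0, 10, (4,))
print(multi_loss_step(model, x, y).shape)
```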
Poster
Banglong Liu · Niuniu Qi · Xia Zeng · Lydia Dehbi · Zhengfeng Yang
[ ExHall D ]
Abstract
Polynomial inequality proving is fundamental to many mathematical disciplines and finds wide applications in diverse fields. Traditional algebraic methods are based on searching for a positive definite polynomial representation over a set of basis functions. However, these methods are limited by the truncation degree. To address this issue, this paper proposes an approach based on reinforcement learning to find a Krivine-basis representation for proving polynomial inequalities. Specifically, we formulate the inequality proving problem as a linear programming (LP) problem and encode it as a basis selection problem using reinforcement learning (RL), achieving a non-negative Krivine basis. Moreover, a fast multivariate polynomial multiplication method based on the Fast Fourier Transform (FFT) is deployed to enhance the efficiency of action space search. Furthermore, we have implemented a tool called APPIRL (Automated Proof of Polynomial Inequalities via Reinforcement Learning). Experimental evaluation on benchmark problems demonstrates the feasibility and effectiveness of our approach. In addition, APPIRL can be successfully applied to attack the maximum stable set problem.
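For context, a standard Krivine/Handelman-style certificate on the unit box (an assumption; the paper's exact domain, truncation, and RL-driven basis selection may differ) seeks nonnegative multipliers such that

```latex
p(x) \;=\; \sum_{|\alpha|+|\beta|\le d} \lambda_{\alpha\beta}\,
\prod_{i=1}^{n} x_i^{\alpha_i}\,(1-x_i)^{\beta_i},
\qquad \lambda_{\alpha\beta}\ge 0 .
```

Matching coefficients on both sides turns the search for such a certificate into a linear program in the multipliers; in the setup described above, the RL agent's role is to select which basis terms enter this LP.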
Poster
HaiMing Xu · Qianqian Wang · Boyue Wang · Quanxue Gao
[ ExHall D ]
Abstract
Multi-view clustering, while effective in integrating information from diverse data sources, may lead to biased outcomes when sensitive attributes are involved. Despite the substantial progress in recent research, most existing methods suffer from limited interpretability and lack strong mathematical theoretical foundations. In this work, we propose a novel approach, Deep Fair Multi-View Clustering with Attention Kolmogorov-Arnold Network (DFMVC-AKAN), designed to generate fair clustering results while maintaining robust performance. DFMVC-AKAN integrates attention mechanisms with multi-view data reconstruction to enhance both clustering accuracy and fairness. The model introduces an attention mechanism and Kolmogorov-Arnold Networks (KAN), which together address the challenges of feature fusion and the influence of sensitive attributes in multi-view data. The attention mechanism enables the model to dynamically focus on the most relevant features across different views, while KAN provides a nonlinear feature representation capable of efficiently approximating arbitrary multivariate continuous functions, thereby capturing complex relationships and latent patterns within the data. Experimental results on four datasets containing sensitive attributes demonstrate that DFMVC-AKAN significantly improves fairness and clustering performance compared to state-of-the-art methods.
Poster
yuzhuo dai · Jiaqi Jin · Zhibin Dong · Siwei Wang · Xinwang Liu · En Zhu · Xihong Yang · Xinbiao Gan · Yu Feng
[ ExHall D ]
Abstract
In incomplete multi-view clustering (IMVC), missing data introduces noise, causing view prototypes to shift and exhibit inconsistent semantics across views. A feasible solution is to explore cross-view consistency in paired complete observations for imputation and alignment. However, the existing paradigm is limited to the instance or cluster level. The former emphasizes instance alignment across views and wrongly regards unpaired observations with semantic consistency as negative pairs; the latter focuses on cluster counterparts across views but ignores intra-cluster observations within views. Neither can construct a semantic space shared by all view data to learn consensus semantic representations. Meanwhile, excessive reliance on consistency results in unreliable imputation and alignment without incorporating view-specific cluster information. To this end, we propose an IMVC framework that is imputation- and alignment-free for consensus semantics learning (FreeCSL). For consensus semantics learning, a consensus prototype, linearly weighted by all available data, is introduced to promote consensus assignments between paired observations in a shared semantic space, referred to as prototype-based contrastive clustering. For cluster semantics enhancement, we design a heuristic graph clustering based on a modularity metric for each specific view to recover cluster structure with intra-cluster compactness and inter-cluster separation. Extensive experiments demonstrate that, compared to existing state-of-the-art competitors, FreeCSL is better at achieving confident and robust …
Poster
Zijian Dong · Yilei Wu · Chongyao Chen · Yingtian Zou · Yichi Zhang · Juan Helen Zhou
[ ExHall D ]
Abstract
In representation learning, uniformity refers to the uniform feature distribution in the latent space (*i.e.*, unit hypersphere). Previous work has shown that improving uniformity contributes to the learning of under-represented classes. However, most of the previous work focused on classification; the representation space of imbalanced regression remains unexplored. Classification-based methods are not suitable for regression tasks because they cluster features into distinct groups without considering the continuous and ordered nature essential for regression. From a geometric perspective, we focus on ensuring uniformity in the latent space for imbalanced regression through two key losses: **enveloping** and **homogeneity**. The enveloping loss encourages the induced trace to uniformly occupy the surface of a hypersphere, while the homogeneity loss ensures smoothness, with representations evenly spaced at consistent intervals. Our method integrates these geometric principles into the data representations via a **S**urrogate-driven **R**epresentation **L**earning (**SRL**) framework. Experiments with real-world regression and operator learning tasks highlight the importance of uniformity in imbalanced regression and validate the efficacy of our geometry-based loss functions.
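Below is a hedged sketch of two geometry losses in the spirit of "enveloping" and "homogeneity" on the unit hypersphere: the enveloping term is the standard log-sum-exp uniformity loss, and the homogeneity term penalizes uneven spacing between label-ordered neighbors. The paper's exact formulations may differ, and the temperature `t` is an assumed hyperparameter.

```python
# Hedged sketch of hypersphere uniformity ("enveloping") and even-spacing ("homogeneity") losses.
import torch
import torch.nn.functional as F

def enveloping_loss(z, t=2.0):
    z = F.normalize(z, dim=1)                         # project features onto the unit hypersphere
    d2 = torch.cdist(z, z).pow(2)
    off = ~torch.eye(z.size(0), dtype=torch.bool)
    return torch.log(torch.exp(-t * d2[off]).mean())  # lower => more uniform coverage of the sphere

def homogeneity_loss(z, targets):
    z = F.normalize(z, dim=1)
    order = torch.argsort(targets)
    gaps = (z[order][1:] - z[order][:-1]).norm(dim=1)  # spacing between consecutive (by label) points
    return gaps.var()                                  # lower => evenly spaced along the trace

z, y = torch.randn(64, 16), torch.rand(64)
print(enveloping_loss(z).item(), homogeneity_loss(z, y).item())
```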
Poster
Shanglin Liu · Jianming Lv · Jingdan Kang · Huaidong Zhang · Zequan Liang · Shengfeng He
[ ExHall D ]
Abstract
Multimodal unsupervised domain adaptation leverages unlabeled data in the target domain to enhance multimodal systems continuously. While current state-of-the-art methods encourage interaction between sub-models of different modalities through pseudo-labeling and feature-level exchange, varying sample quality across modalities can lead to the propagation of inaccurate information, resulting in error accumulation. To address this, we propose Modal-Affinity Multimodal Domain Adaptation (MODfinity), a method that dynamically manages multimodal information flow through fine-grained control over teacher model selection, guiding information intertwining at both feature and label levels. By treating labels as an independent modality, MODfinity enables balanced performance assessment across modalities, employing a novel modal-affinity measurement to evaluate information quality. Additionally, we introduce a modal-affinity distillation technique to control sample-level information exchange, ensuring reliable multimodal interaction based on affinity evaluations within the feature space. Extensive experiments on three multimodal datasets demonstrate that our framework consistently outperforms state-of-the-art methods, particularly in high-noise environments.
Poster
Yingxue Xu · Fengtao ZHOU · Chenyu Zhao · Yihui Wang · Can Yang · Hao Chen
[ ExHall D ]
Abstract
The integration of multimodal data including pathology images and gene profiles is widely applied in precise survival prediction. Despite recent advances in multimodal survival models, collecting complete modalities for multimodal fusion still poses a significant challenge, hindering their application in clinical settings. Current approaches tackling incomplete modalities often fall short, as they typically compensate for only a limited part of the knowledge of missing modalities. To address this issue, we propose a Distilled Prompt Learning framework (DisPro) to utilize the strong robustness of Large Language Models (LLMs) to missing modalities, which employs two-stage prompting for compensation of comprehensive information for missing modalities. In the first stage, Unimodal Prompting (UniPro) distills the knowledge distribution of each modality, preparing for supplementing modality-specific knowledge of the missing modality in the subsequent stage. In the second stage, Multimodal Prompting (MultiPro) leverages available modalities as prompts for LLMs to infer the missing modality, which provides modality-common information. Simultaneously, the unimodal knowledge acquired in the first stage is injected into multimodal inference to compensate for the modality-specific knowledge of the missing modality. Extensive experiments covering various missing scenarios demonstrated the superiority of the proposed method. The code will be released upon acceptance.
Poster
Wei Li · jiawei jiang · Jie Wu · Kaihao Yu · Jianwei Zheng
[ ExHall D ]
Abstract
Interpretability and consistency have long been crucial factors in MRI reconstruction. While interpretability has been significantly improved by the emerging deep unfolding networks, current solutions still suffer from inconsistency issues and produce inferior anatomical structures. Especially in out-of-distribution cases, e.g., when the acceleration rate (AR) varies, the generalization performance is often catastrophic. To counteract this dilemma, we propose an innovative Linear Mamba Operator (LMO) to ensure consistency and generalization, while still enjoying desirable interpretability. Theoretically, we argue that mapping between function spaces, rather than between signal instances, provides a solid foundation for high generalization. Technically, LMO achieves a good balance between global integration, facilitated by a state space model that scans the whole function domain, and local integration, engaged with an appealing property of continuous-discrete equivalence. On that basis, learning holistic features can be guaranteed, tapping the potential of maximizing data consistency. Quantitative and qualitative results demonstrate that LMO significantly outperforms other state-of-the-art methods. More importantly, LMO is the only model that, when the AR changes, matches retraining performance without any retraining steps. Codes are attached and will be released on GitHub.
Poster
Xiao Wang · Fuling Wang · Yuehang Li · Qingchuan Ma · Shiao Wang · Bo Jiang · Jin Tang
[ ExHall D ]
Abstract
X-ray image-based medical report generation (MRG) is a pivotal area of artificial intelligence that can significantly reduce diagnostic burdens and patient wait times. Despite significant progress, we believe that the task has reached a bottleneck due to the limited benchmark datasets and existing large models' insufficient capability enhancements in this specialized domain. Specifically, the recently released CheXpert Plus dataset lacks comparative evaluation algorithms and their results, providing only the dataset itself. This situation makes the training, evaluation, and comparison of subsequent algorithms challenging. Thus, we conduct a comprehensive benchmarking of existing mainstream X-ray report generation models and large language models (LLMs) on the CheXpert Plus dataset. We believe that the proposed benchmark can provide a solid comparative basis for subsequent algorithms and serve as a guide for researchers to quickly grasp the state-of-the-art models in this field. More importantly, we propose a large model for X-ray image report generation using a multi-stage pre-training strategy, including self-supervised autoregressive generation, X-ray-report contrastive learning, and supervised fine-tuning. Extensive experimental results indicate that the autoregressive pre-training based on Mamba effectively encodes X-ray images, and the image-text contrastive pre-training further aligns the feature spaces, achieving better experimental results. All the source codes …
Poster
Ying Chen · Guoan Wang · Yuanfeng Ji · Yanjun Li · Jin Ye · Tianbin Li · Ming Hu · Rongshan Yu · Yu Qiao · Junjun He
[ ExHall D ]
Abstract
Despite the progress made by multimodal large language models (MLLMs) in computational pathology, they remain limited by a predominant focus on patch-level analysis, missing essential contextual information at the whole-slide level. The lack of large-scale instruction datasets and the gigapixel scale of whole slide images (WSIs) pose significant developmental challenges. In this paper, we present SlideChat, the first vision-language assistant capable of understanding gigapixel whole-slide images, exhibiting excellent multimodal conversational capability and the ability to respond to complex instructions across diverse pathology scenarios. To support its development, we created SlideInstruction, the largest instruction-following dataset for WSIs, consisting of 4.2K WSI captions and 176K VQA pairs with multiple categories. Furthermore, we propose SlideBench, a multimodal benchmark that incorporates captioning and VQA tasks to assess SlideChat's capabilities in various settings such as microscopy, diagnosis, and clinical settings. Compared to both general and specialized MLLMs, SlideChat exhibits exceptional capabilities, achieving state-of-the-art performance on 18 of 22 tasks. For example, it achieved an overall accuracy of 81.17% on SlideBench-VQA (TCGA), and 54.15% on SlideBench-VQA (BCNB). We will fully release SlideChat, SlideInstruction and SlideBench as open-source resources to facilitate research and development in computational pathology.
Poster
Ziyuan Yang · Yingyu Chen · Zhiwen Wang · Hongming Shan · Yang Chen · Yi Zhang
[ ExHall D ]
Abstract
Reducing radiation doses benefits patients; however, the resultant low-dose computed tomography (LDCT) images often suffer from clinically unacceptable noise and artifacts. While deep learning (DL) shows promise in LDCT reconstruction, it requires large-scale data collection from multiple clients, raising privacy concerns. Federated learning (FL) has been introduced to address these privacy concerns; however, current methods are typically tailored to specific scanning protocols, which limits their generalizability and makes them less effective for unseen protocols. To address these issues, we propose SCAN-PhysFed, a novel SCanning- and ANatomy-level personalized Physics-Driven Federated learning paradigm for LDCT reconstruction. Since the noise distribution in LDCT data is closely tied to the scanning protocols and the anatomical structures being scanned, we design a dual-level physics-informed way to address these challenges. Specifically, we incorporate physical and anatomical prompts into our physics-informed hypernetworks to capture scanning- and anatomy-specific information, enabling dual-level physics-driven personalization of imaging features. These prompts are derived from the scanning protocol and the radiology report generated by a medical large language model (MLLM), respectively. Subsequently, client-specific decoders project these dual-level personalized imaging features back into the image domain. Besides, to tackle the challenge of unseen data, we introduce a novel protocol vector-quantization strategy (PVQS), which ensures consistent …
Poster
Shaohao Rui · Lingzhi Chen · Zhenyu Tang · Lilong Wang · Mianxin Liu · Shaoting Zhang · Xiaosong Wang
[ ExHall D ]
Abstract
Self-supervised learning has greatly facilitated medical image analysis by suppressing the training data requirement for real-world applications. Current paradigms predominantly rely on self-supervision within uni-modal image data, thereby neglecting the inter-modal correlations essential for effective learning of cross-modal image representations. This limitation is particularly significant for naturally grouped multi-modal data, e.g., multi-parametric MRI scans for a patient undergoing various functional imaging protocols in the same study. To bridge this gap, we conduct a novel multi-modal image pre-training with three proxy tasks to facilitate the learning of cross-modality representations and correlations using multi-modal brain MRI scans (over 2.4 million images in 16,022 scans of 3,755 patients), i.e., cross-modal image reconstruction, modality-aware contrastive learning, and modality template distillation. To demonstrate the generalizability of our pre-trained model, we conduct extensive experiments on various benchmarks with ten downstream tasks. The superior performance of our method is reported in comparison to state-of-the-art pre-training methods, with Dice Score improvement of 0.28\%-14.47\% across six segmentation benchmarks and a consistent accuracy boost of 0.65\%-18.07\% in four individual image classification tasks.
Poster
Qinghe Ma · Jian Zhang · Zekun Li · Lei Qi · Qian Yu · Yinghuan Shi
[ ExHall D ]
Abstract
Large pretrained visual foundation models exhibit impressive general capabilities. However, the extensive prior knowledge inherent in these models can sometimes be a double-edged sword when adapting them to downstream tasks in specific domains. In the context of semi-supervised medical image segmentation with domain shift, foundation models like MedSAM tend to make overconfident predictions, some of which are incorrect. The error accumulation hinders the effective utilization of unlabeled data and limits further improvements. In this paper, we introduce a Synergistic training framework for Foundation and Conventional models (SynFoC) to address the issue. We observe that a conventional model trained from scratch has the ability to correct the high-confidence mispredictions of the foundation model, while the foundation model can supervise it with high-quality pseudo-labels in the early training stages. Furthermore, to enhance the collaborative training effectiveness of both models and promote reliable convergence during optimization, consensus-divergence consistency regularization is proposed. We demonstrate the superiority of our method across four public multi-domain datasets. In particular, our method improves the Dice score by 10.31\% on the Prostate dataset. Our code is available in the supplementary material.
Poster
Tassilo Wald · Constantin Ulrich · Stanislav Lukyanenko · Andrei Goncharov · Alberto Paderno · Maximilian Miller · Leander Maerkisch · Paul F Jaeger · Klaus Maier-Hein
[ ExHall D ]
Abstract
Self-Supervised Learning (SSL) presents an exciting opportunity to unlock the potential of vast, untapped clinical datasets for various downstream applications that suffer from the scarcity of labeled data. While SSL has revolutionized fields like natural language processing and computer vision, its adoption in 3D medical image computing has been limited by three key pitfalls: small pre-training dataset sizes, architectures inadequate for 3D medical image analysis, and insufficient evaluation practices. In this paper, we address these issues by i) leveraging a large-scale dataset of 39k 3D brain MRI volumes and ii) using a Residual Encoder U-Net architecture within the state-of-the-art nnU-Net framework. iii) A robust development framework, incorporating 5 development and 8 testing brain MRI segmentation datasets, allowed performance-driven design decisions to optimize the simple concept of Masked Auto Encoders (MAEs) for 3D CNNs. The resulting model not only surpasses previous SSL methods but also outperforms the strong nnU-Net baseline by an average of approximately 3 Dice points, setting a new state of the art. Our code and models are made available.
Poster
Feng Yu · Jiacheng Cao · Li Liu · Minghua Jiang
[ ExHall D ]
Abstract
Multimodal 3D segmentation involves a significant number of 3D convolution operations, which require substantial computational resources and high-performance computing devices in MRI multimodal brain tumor segmentation. The key challenge in multimodal 3D segmentation is how to minimize network computational load while maintaining high accuracy. To address this issue, a novel lightweight parameter aggregation network (SuperLightNet) is proposed to realize an efficient encoder and decoder with high accuracy and low computation. A random multi-view drop encoder is designed to learn the spatial structure of multimodal images through a random multi-view approach, addressing the high computational time complexity that has arisen in recent years with methods relying on transformers and Mamba. A learnable residual skip decoder is designed to incorporate learnable residual and group skip weights, addressing the reduced computational efficiency caused by overly heavy convolution and deconvolution decoders. Experimental results demonstrate that the proposed method achieves a leading 95.59\% reduction in parameter count, a 96.78\% improvement in computational efficiency, a 96.86\% enhancement in memory access performance, and an average performance gain of 0.21\% on the BraTS2019 and BraTS2021 datasets in comparison with state-of-the-art methods. Code is available at https://github.com/WTU1020-Medical-Segmentation/SuperLightNet.
Poster
Jiongtong Hu · Wei Zhuo · Jun Cheng · YINGYING LIU · Wufeng Xue · Dong Ni
[ ExHall D ]
Abstract
In the clinical practice of echocardiography examinations, multiple planes containing heart structures from different views are usually required for the screening, diagnosis, and treatment of cardiac disease. AI models for echocardiography have to be tailored to a specific plane due to the dramatic structural differences, resulting in repeated development and extra complexity. An effective solution to this multi-plane segmentation (MPS) problem is in high demand for medical images, yet it has not been well investigated. In this paper, we propose a novel solution, EchoONE, for this problem with a SAM-based segmentation architecture, a prior-composable mask learning (PC-Mask) module for semantic-aware dense prompt generation, and a learnable CNN-branch with a simple yet effective local feature fusion and adaption (LFFA) module for adapting SAM. We extensively evaluated our method on multiple internal and external echocardiography datasets, and consistently achieved state-of-the-art performance across multi-source datasets with different heart planes. This is the first time that the MPS problem is solved in one model for echocardiography data. The code will be available at https://github.com/a2503/EchoONE.
Poster
Jinho Joo · Hyeseong Kim · Hyeyeon Won · Deukhee Lee · Taejoon Eo · Dosik Hwang
[ ExHall D ]
Abstract
This study introduces a novel zero-shot scan-specific self-supervised reconstruction method for magnetic resonance imaging (MRI) to reduce scan times. Conventional supervised reconstruction methods require large amounts of fully-sampled reference data, which is often impractical to obtain and can lead to artifacts by overly emphasizing learned patterns. Existing zero-shot scan-specific methods have attempted to overcome this data dependency but show limited performance due to insufficient utilization of k-space information and constraints derived from the MRI forward model. To address these limitations, we introduce a framework utilizing all acquired k-space measurements for both network inputs and training targets. While this framework suffers from training instability, we resolve these challenges through three key components: an Attention-guided K-space Selective Mechanism (AKSM) that provides indirect constraints for non-sampled k-space points, Iteration-wise K-space Masking (IKM) that enhances training stability, and a robust sensitivity map estimation model utilizing a cross-channel constraint that performs effectively even at high reduction factors. Experimental results on the FastMRI knee and brain datasets with reduction factors of 4 and 8 demonstrate that the proposed method achieves superior reconstruction quality and faster convergence compared to existing zero-shot scan-specific methods, making it suitable for practical clinical applications. Our code will be released on Github.
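A minimal sketch of iteration-wise k-space masking for zero-shot, scan-specific training follows: at every step, a fresh random subset of the acquired k-space samples is hidden from the network input and used as the training target instead. The split ratio and uniform sampling are assumptions; the AKSM attention mechanism and sensitivity-map estimation are not shown.

```python
# Hedged sketch: per-iteration random masking of acquired k-space locations.
import numpy as np

def iteration_mask(acquired_mask, hide_fraction=0.2, rng=None):
    rng = rng or np.random.default_rng()
    acquired = np.flatnonzero(acquired_mask.ravel())
    hidden = rng.choice(acquired, size=int(hide_fraction * acquired.size), replace=False)
    input_mask = acquired_mask.copy().ravel()
    input_mask[hidden] = False                       # hide a fresh subset from the network input
    target_mask = np.zeros_like(input_mask)
    target_mask[hidden] = True                       # supervise only on the hidden locations
    return input_mask.reshape(acquired_mask.shape), target_mask.reshape(acquired_mask.shape)

acq = np.zeros((256, 256), dtype=bool)
acq[:, ::4] = True                                   # a 4x-undersampled Cartesian pattern
inp, tgt = iteration_mask(acq, rng=np.random.default_rng(0))
print(inp.sum(), tgt.sum(), acq.sum())               # input + target locations partition the acquired set
```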
Poster
Xinxing Cheng · Tianyang Zhang · Wenqi Lu · Qingjie Meng · Alejandro F Frangi · Jinming Duan
[ ExHall D ]
Abstract
Deep learning-based image registration methods have shown state-of-the-art performance and rapid inference speeds. Despite these advances, many existing approaches fall short in capturing spatially varying information in non-local regions of feature maps due to their reliance on spatially shared convolution kernels. This limitation leads to suboptimal estimation of deformation fields. In this paper, we propose a 3D Spatial-Awareness Convolution Block (SACB) to enhance the spatial information within feature representations. Our SACB estimates the spatial clusters within feature maps by leveraging feature similarity and subsequently parameterizes adaptive convolution kernels across diverse regions. This adaptive mechanism generates convolution kernels (weights and biases) tailored to spatial variations, thereby enabling the network to effectively capture spatially varying information. Building on SACB, we introduce a pyramid flow estimator (named SACB-Net) that integrates SACBs to facilitate multi-scale flow composition, particularly addressing large deformations. Experimental results on the brain IXI and LPBA datasets as well as Abdomen CT datasets demonstrate the effectiveness of SACB and the superiority of SACB-Net over state-of-the-art learning-based registration methods. The code will be made publicly available.
Poster
Federico Bolelli · Kevin Marchesini · Niels van Nistelrooij · Luca Lumetti · Vittorio Pipoli · Elisa Ficarra · Shankeeth Vinayahalingam · Costantino Grana
[ ExHall D ]
Abstract
Cone-beam computed tomography (CBCT) is a standard imaging modality in orofacial and dental practice, providing essential 3D volumetric imaging of anatomical structures, including jawbones, teeth, sinuses, and neurovascular canals. Accurately segmenting these structures is fundamental to numerous clinical applications, such as surgical planning and implant placement. However, manual segmentation of CBCT scans is time-intensive and requires expert input, creating a demand for automated solutions through deep learning. Effective development of such algorithms relies on access to large, well-annotated datasets, yet current datasets are often privately stored or limited in scope and in the structures considered, especially concerning 3D annotations. This paper proposes a comprehensive, publicly accessible CBCT dataset with voxel-level 3D annotations of 42 distinct classes corresponding to maxillofacial structures. We validate the dataset by benchmarking state-of-the-art neural network models, including convolutional, transformer-based, and hybrid Mamba-based architectures, to evaluate segmentation performance across complex anatomical regions. Our work also explores adaptations to the nnU-Net framework to optimize multi-class segmentation for maxillofacial anatomy. The proposed dataset provides a fundamental resource for advancing maxillofacial segmentation and supports future research in automated 3D image analysis in digital dentistry.