Poster Session 6
EmotiveTalk: Expressive Talking Head Generation through Audio Information Decoupling and Emotional Video Diffusion
Haotian Wang · Yuzhe Weng · Yueyan Li · Zilu Guo · Jun Du · Shutong Niu · Jiefeng Ma · Shan He · Wu Xiaoyan · Qiming Hu · Bing Yin · Cong Liu · Qingfeng Liu
Diffusion models have revolutionized the field of talking head generation, yet they still face challenges in expressiveness, controllability, and stability during long-duration generation. In this work, we propose the EmotiveTalk framework to address these issues. First, to gain better control over the generation of lip movements and facial expressions, we design a Vision-guided Audio Information Decoupling (V-AID) approach that produces audio-based decoupled representations aligned with lip movements and expressions. Specifically, to align the audio and facial expression representation spaces, we present a Diffusion-based Co-speech Temporal Expansion (Di-CTE) module within V-AID that generates expression-related representations under multi-source emotion condition constraints. We then propose an Emotional Talking Head Diffusion (ETHD) backbone to efficiently generate highly expressive talking head videos; it contains an Expression Decoupling Injection (EDI) module that automatically decouples expressions from the reference portrait while integrating the target expression information, achieving more expressive generation. Experimental results show that EmotiveTalk generates expressive talking head videos with controllable emotions and stable long-duration generation, yielding state-of-the-art performance compared to existing methods.
MoEE: Mixture of Emotion Experts for Audio-Driven Portrait Animation
Huaize Liu · WenZhang Sun · Donglin Di · Shibo Sun · Jiahui Yang · Hujun Bao · Changqing Zou
The generation of talking avatars has achieved significant advancements in precise audio synchronization. However, crafting lifelike talking head videos requires capturing a broad spectrum of emotions and subtle facial expressions. Current methods face fundamental challenges: a) the absence of frameworks for modeling single basic emotional expressions, which restricts the generation of complex emotions such as compound emotions; b) the lack of comprehensive datasets rich in human emotional expressions, which limits the potential of models. To address these challenges, we propose the following innovations: 1) the Mixture of Emotion Experts (MoEE) model, which decouples six fundamental emotions to enable the precise synthesis of both singular and compound emotional states; 2) the DH-FaceEmoVid-150 dataset, specifically curated to include six prevalent human emotional expressions as well as four types of compound emotions, thereby expanding the training potential of emotion-driven models; 3) an emotion-to-latents module that leverages multimodal inputs, aligning diverse control signals—such as audio, text, and labels—to enhance audio-driven emotion control. Through extensive quantitative and qualitative evaluations, we demonstrate that the MoEE framework, in conjunction with the DH-FaceEmoVid-150 dataset, excels in generating complex emotional expressions and nuanced facial details, setting a new benchmark in the field. These datasets will be publicly released.
Synergizing Motion and Appearance: Multi-Scale Compensatory Codebooks for Talking Head Video Generation
Shuling Zhao · Fa-Ting Hong · Xiaoshui Huang · Dan Xu
Talking head video generation aims to generate a realistic talking head video that preserves the person’s identity from a source image and the motion from a driving video. Despite the promising progress made in the field, it remains a challenging and critical problem to generate videos with accurate poses and fine-grained facial details simultaneously. Essentially, facial motion is often highly complex to model precisely, and the one-shot source face image cannot provide sufficient appearance guidance during generation due to dynamic pose changes. To tackle the problem, we propose to jointly learn motion and appearance codebooks and perform multi-scale codebook compensation to effectively refine both the facial motion conditions and appearance features for talking face image decoding. Specifically, the designed multi-scale motion and appearance codebooks are learned simultaneously in a unified framework to store representative global facial motion flow and appearance patterns. Then, we present a novel multi-scale motion and appearance compensation module, which utilizes a transformer-based codebook retrieval strategy to query complementary information from the two codebooks for joint motion and appearance compensation. The entire process produces motion flows of greater flexibility and appearance features with fewer distortions across different scales, resulting in a high-quality talking head video generation framework. Extensive experiments on various benchmarks validate the effectiveness of our approach and demonstrate superior generation results from both qualitative and quantitative perspectives when compared to state-of-the-art competitors.
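To make the codebook-compensation idea concrete, here is a minimal PyTorch sketch of a single-scale compensation block in which degraded features query a learned codebook through cross-attention and add the retrieved patterns back. The block name, tensor sizes, and the use of plain multi-head attention are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CodebookCompensation(nn.Module):
    """Illustrative single-scale codebook compensation via cross-attention retrieval."""
    def __init__(self, num_codes=256, dim=128, heads=4):
        super().__init__()
        # Learnable codebook storing representative motion/appearance patterns.
        self.codebook = nn.Parameter(torch.randn(num_codes, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, feats):                       # feats: (B, N, dim) features to refine
        codes = self.codebook.unsqueeze(0).expand(feats.shape[0], -1, -1)
        # Features act as queries; codebook entries serve as keys/values to retrieve
        # complementary motion/appearance information.
        comp, _ = self.attn(feats, codes, codes)
        return feats + self.proj(comp)              # compensated features

feats = torch.randn(2, 196, 128)                    # e.g. flattened spatial features
print(CodebookCompensation()(feats).shape)          # torch.Size([2, 196, 128])
```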
MVPortrait: Text-Guided Motion and Emotion Control for Multi-view Vivid Portrait Animation
Yukang Lin · Hokit Fung · Jianjin Xu · Zeping Ren · Adela S.M. Lau · Guosheng Yin · Xiu Li
Recent portrait animation methods have made significant strides in generating realistic lip synchronization. However, they often lack explicit control over head movements and facial expressions, and cannot produce videos from multiple viewpoints, resulting in less controllable and expressive animations. Moreover, text-guided portrait animation remains underexplored, despite its user-friendly nature. In this paper, we present a novel two-stage text-guided framework, MVPortrait, to generate expressive multi-view portrait animations that faithfully capture the described motion and emotion. MVPortrait is the first to introduce FLAME as an intermediate representation, effectively embedding facial movements, expressions, and view transformations within its parameter space. In the first stage, we separately train the FLAME motion and emotion diffusion models based on text input. In the second stage, we train a multi-view video generation model conditioned on a reference portrait image and multi-view FLAME rendering sequences from the first stage. Experimental results show that MVPortrait outperforms existing methods in terms of motion and emotion control, as well as view consistency. Furthermore, by leveraging FLAME as a bridge, MVPortrait becomes the first controllable portrait animation framework that is compatible with text, speech, and video as driving signals.
Free-viewpoint Human Animation with Pose-correlated Reference Selection
Fa-Ting Hong · Zhan Xu · Haiyang Liu · Qinjie Lin · Luchuan Song · ZHIXIN SHU · Yang Zhou · Duygu Ceylan · Dan Xu
Diffusion-based human animation aims to animate a human character based on a source human image as well as driving signals such as a sequence of poses. Leveraging the generative capacity of diffusion models, existing approaches are able to generate high-fidelity poses, but struggle with significant viewpoint changes, especially in zoom-in/zoom-out scenarios where the camera-character distance varies. This limits applications such as cinematic shot-type planning or camera control. We propose a pose-correlated reference selection diffusion network, supporting substantial viewpoint variations in human animation. Our key idea is to enable the network to utilize multiple reference images as input, since significant viewpoint changes often lead to missing appearance details on the human body. To limit the additional computational cost, we first introduce a novel pose correlation module to compute similarities between non-aligned target and source poses, and then propose an adaptive reference selection strategy, utilizing the attention map to identify key regions for animation generation. To train our model, we curated a large dataset from public TED talks featuring varied shots of the same character, helping the model learn synthesis for different perspectives. Our experimental results show that with the same number of reference images, our model performs favorably compared to current SOTA methods under large viewpoint changes. We further show that the adaptive reference selection is able to choose the most relevant reference regions to generate humans under free viewpoints.
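As a rough illustration of how pose correlation can drive reference selection, the sketch below scores each reference image by the similarity between its pose embedding and the driving pose, then keeps only the tokens of the top-k references. The cosine-similarity scoring and hard top-k choice are simplifying assumptions; the paper describes an attention-map-based selection of key regions rather than whole images.

```python
import torch

def select_reference_tokens(target_pose, ref_poses, ref_tokens, k=2):
    """
    target_pose: (D,)      embedding of the driving pose
    ref_poses:   (R, D)    pose embeddings of the R reference images
    ref_tokens:  (R, N, C) appearance tokens extracted from each reference image
    Returns tokens of the k references whose poses correlate most with the target.
    """
    sim = torch.nn.functional.cosine_similarity(ref_poses, target_pose[None], dim=-1)  # (R,)
    topk = sim.topk(k).indices
    return ref_tokens[topk].flatten(0, 1), sim   # (k*N, C) tokens passed to the denoiser

target = torch.randn(64)
refs, tokens = torch.randn(5, 64), torch.randn(5, 100, 256)
selected, scores = select_reference_tokens(target, refs, tokens)
print(selected.shape, scores.shape)              # torch.Size([200, 256]) torch.Size([5])
```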
DiffPortrait360: Consistent Portrait Diffusion for 360 View Synthesis
Yuming Gu · Phong Tran · Yujian Zheng · Hongyi Xu · Heyuan Li · Adilbek Karmanov · Hao Li
Generating high-quality 360-degree views of human heads from single-view images is essential for enabling accessible immersive telepresence applications and scalable personalized content creation. While cutting-edge methods for full head generation are limited to modeling realistic human heads, the latest diffusion-based approaches for style-omniscient head synthesis can produce only frontal views and struggle with view consistency, preventing their conversion into true 3D models for rendering from arbitrary angles. We introduce a novel approach that generates fully consistent 360-degree head views, accommodating human, stylized, and anthropomorphic forms, including accessories like glasses and hats. Our method builds on the DiffPortrait3D framework, incorporating a custom ControlNet for back-of-head detail generation and a dual appearance module to ensure global front-back consistency. By training on continuous view sequences and integrating a back reference image, our approach achieves robust, locally continuous view synthesis. Our model can be used to produce high-quality neural radiance fields (NeRFs) for real-time, free-viewpoint rendering, outperforming state-of-the-art methods in object synthesis and 360-degree head generation for very challenging input portraits.
MeGA: Hybrid Mesh-Gaussian Head Avatar for High-Fidelity Rendering and Head Editing
Cong Wang · Di Kang · Heyi Sun · SHENHAN QIAN · Zixuan Wang · Linchao Bao · Song-Hai Zhang
Creating high-fidelity head avatars from multi-view videos is essential for many AR/VR applications. However, current methods often struggle to achieve high-quality renderings across all head components (e.g., skin vs. hair) due to the limitations of using one single representation for elements with varying characteristics. In this paper, we introduce a Hybrid Mesh-Gaussian Head Avatar (MeGA) that models different head components with more suitable representations. Specifically, we employ an enhanced FLAME mesh for the facial representation and predict a UV displacement map to provide per-vertex offsets for improved personalized geometric details. To achieve photorealistic rendering, we use deferred neural rendering to obtain facial colors and decompose neural textures into three meaningful parts. For hair modeling, we first build a static canonical hair using 3D Gaussian Splatting. A rigid transformation and an MLP-based deformation field are further applied to handle complex dynamic expressions. Combined with our occlusion-aware blending, MeGA generates higher-fidelity renderings for the whole head and naturally supports diverse downstream tasks. Experiments on the NeRSemble dataset validate the effectiveness of our designs, outperforming previous state-of-the-art methods and enabling versatile editing capabilities, including hairstyle alteration and texture editing.
HRAvatar: High-Quality and Relightable Gaussian Head Avatar
Dongbin Zhang · Yunfei Liu · Lijian Lin · Ye Zhu · Kangjie Chen · Minghan Qin · Yu Li · Haoqian Wang
Reconstructing animatable and high-quality 3D head avatars from monocular videos, especially with realistic relighting, is a valuable task. However, the limited information from single-view input, combined with the complex head poses and facial movements, makes this challenging. Previous methods achieve real-time performance by combining 3D Gaussian Splatting with a parametric head model, but the resulting head quality suffers from inaccurate face tracking and limited expressiveness of the deformation model. These methods also fail to produce realistic effects under novel lighting conditions. To address these issues, we propose HRAvatar, a 3DGS-based method that reconstructs high-fidelity, relightable 3D head avatars. HRAvatar reduces tracking errors through end-to-end optimization and better captures individual facial deformations using learnable blendshapes and learnable linear blend skinning. Additionally, it decomposes head appearance into several physical properties and incorporates physically-based shading to account for environmental lighting. Extensive experiments demonstrate that HRAvatar not only reconstructs superior-quality heads but also achieves realistic visual effects under varying lighting conditions.
Real-time High-fidelity Gaussian Human Avatars with Position-based Interpolation of Spatially Distributed MLPs
Youyi Zhan · Tianjia Shao · Yin Yang · Kun Zhou
Many works have succeeded in reconstructing Gaussian human avatars from multi-view videos. However, they either struggle to capture pose-dependent appearance details with a single MLP, or rely on a computationally intensive neural network to reconstruct high-fidelity appearance, with rendering performance degraded to non-real-time. We propose a novel Gaussian human avatar representation that reconstructs high-fidelity, detailed pose-dependent appearance while still being renderable in real time. Our Gaussian avatar is empowered by spatially distributed MLPs that are explicitly located at different positions on the human body. The parameters stored in each Gaussian are obtained by interpolating the outputs of its nearby MLPs based on their distances. To avoid undesirably smooth changes of Gaussian properties during interpolation, we define for each Gaussian a set of Gaussian offset basis vectors, and a linear combination of this basis represents the Gaussian property offsets relative to the neutral properties. We then let the MLPs output a set of coefficients corresponding to the basis. In this way, although the Gaussian coefficients are derived from interpolation and change smoothly, the Gaussian offset basis is learned freely without constraints. The smoothly varying coefficients combined with the freely learned basis can still produce distinctly different Gaussian property offsets, allowing the model to learn high-frequency spatial signals. We further use control points to constrain the Gaussians to a surface layer rather than allowing them to be irregularly distributed inside the body, helping the avatar generalize better when animated under novel poses. Compared to the state-of-the-art method, our method achieves better appearance quality with finer details while rendering significantly faster under novel views and novel poses.
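A minimal sketch of the position-based interpolation described above: each Gaussian blends the coefficient vectors of its nearest MLPs with inverse-distance weights and multiplies the result with its own freely learned offset basis. Tensor shapes, the number of neighbours, and the 1/d weighting are assumptions made for illustration only.

```python
import torch

def interpolate_offsets(gauss_pos, mlp_pos, mlp_coeffs, offset_basis, knn=4, eps=1e-6):
    """
    gauss_pos:    (G, 3)    Gaussian positions
    mlp_pos:      (M, 3)    positions of the spatially distributed MLPs
    mlp_coeffs:   (M, K)    coefficients output by each MLP for the current pose
    offset_basis: (G, K, P) per-Gaussian offset basis (P = dim of the property offsets)
    """
    d = torch.cdist(gauss_pos, mlp_pos)                      # (G, M) distances
    dist, idx = d.topk(knn, largest=False)                   # nearest MLPs per Gaussian
    w = 1.0 / (dist + eps)
    w = w / w.sum(dim=1, keepdim=True)                       # normalized inverse-distance weights
    coeffs = (w.unsqueeze(-1) * mlp_coeffs[idx]).sum(dim=1)  # (G, K) smooth coefficients
    offsets = torch.einsum('gk,gkp->gp', coeffs, offset_basis)
    return offsets                                           # offsets added to the neutral Gaussian properties

offsets = interpolate_offsets(torch.rand(1000, 3), torch.rand(16, 3),
                              torch.randn(16, 8), torch.randn(1000, 8, 10))
print(offsets.shape)                                         # torch.Size([1000, 10])
```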
IDOL: Instant Photorealistic 3D Human Creation from a Single Image
Yiyu Zhuang · Jiaxi Lv · Hao Wen · Qing Shuai · Ailing Zeng · Hao Zhu · Shifeng Chen · Yujiu Yang · Xun Cao · Wei Liu
Creating a high-fidelity, animatable 3D full-body avatar from a single image is a challenging task due to the diverse appearance and poses of humans and the limited availability of high-quality training data. To achieve fast and high-quality human reconstruction, this work rethinks the task from the perspectives of dataset, model, and representation. First, we introduce a large-scale HUman GEnerated training dataset, HuGe100K, consisting of 100K diverse, photorealistic human images with corresponding 24-view frames in static or dynamic poses, generated via a pose-controllable image-to-video model. Next, leveraging the diversity in views, poses, and appearances within HuGe100K, we develop a scalable feed-forward transformer model to predict a 3D human Gaussian representation in a uniform space from a given human image. This model is trained to disentangle human pose, shape, clothing geometry, and texture. Accordingly, the estimated Gaussians can be animated robustly without post-processing. We conduct comprehensive experiments to validate the effectiveness of the proposed dataset and method. Our model demonstrates the generalizable ability to efficiently reconstruct photorealistic humans in under 1 second using a single GPU. Additionally, it seamlessly supports various applications, including animation, shape, and texture editing tasks.
We study the problem of generating temporal object intrinsics—temporally evolving sequences of object geometry, reflectance, and texture, such as a blooming rose—from pre-trained 2D foundation models. Unlike conventional 3D modeling and animation techniques that require extensive manual effort and expertise, we introduce a method that generates such assets with signals distilled from pretrained 2D diffusion models. To ensure the temporal consistency of object intrinsics, we propose Neural Templates for temporal-state-guided distillation, derived automatically from image features from self-supervised learning. Our method can generate high-quality temporal object intrinsics for several natural phenomena and enable the sampling and controllable rendering of these dynamic objects from any viewpoint, under any environmental lighting conditions, at any time of their lifespan.
DNF: Unconditional 4D Generation with Dictionary-based Neural Fields
Xinyi Zhang · Naiqi Li · Angela Dai
While remarkable success has been achieved through diffusion-based 3D generative models for shapes, 4D generative modeling remains challenging due to the complexity of object deformations over time. We propose DNF, a new 4D representation for unconditional generative modeling that efficiently models deformable shapes with disentangled shape and motion while capturing high-fidelity details in the deforming objects. To achieve this, we propose a dictionary learning approach to disentangle 4D motion from shape as neural fields. Both shape and motion are represented as learned latent spaces, where each deformable shape is represented by its shape and motion global latent codes, shape-specific coefficient vectors, and shared dictionary information. This captures both shape-specific detail and global shared information in the learned dictionary. Our dictionary-based representation well balances fidelity, contiguity and compression -- combined with a transformer-based diffusion model, our method is able to generate effective, high-fidelity 4D animations.
SimAvatar: Simulation-Ready Avatars with Layered Hair and Clothing
Xueting Li · Ye Yuan · Shalini De Mello · Miles Macklin · Jonathan Leaf · Gilles Daviet · Jan Kautz · Umar Iqbal
We introduce SimAvatar, a framework designed to generate simulation-ready clothed 3D human avatars from a text prompt. Current text-driven human avatar generation methods either model hair, clothing and human body using a unified geometry or produce hair and garments that are not easily adaptable for simulation within existing graphics pipelines. The primary challenge lies in representing the hair and garment geometry in a way that allows leveraging established prior knowledge from foundational image diffusion models (e.g., Stable Diffusion) while being simulation-ready using either physics or neural simulators. To address this task, we propose a two-stage framework that combines the flexibility of 3D Gaussians with simulation-ready hair strands and garment meshes. Specifically, we first leverage two text-conditioned diffusion models to generate garment meshes and hair strands from the given text prompt. To leverage prior knowledge from foundational diffusion models, we attach 3D Gaussians to the body mesh, garment mesh, as well as hair strands, and learn the avatar appearance through optimization. To drive the avatar given a pose sequence, we first run physics simulators on the garment meshes and hair strands. We then transfer the motion onto the 3D Gaussians through a carefully designed mechanism for different body parts. As a result, our synthesized avatars have vivid texture and realistic dynamic motion. To the best of our knowledge, our method is the first to produce highly realistic, fully simulation-ready 3D avatars, surpassing the capabilities of current approaches.
Disco4D: Disentangled 4D Human Generation and Animation from a Single Image
Hui En Pang · Shuai Liu · Zhongang Cai · Lei Yang · Tianwei Zhang · Ziwei Liu
We present Disco4D, a novel Gaussian Splatting framework for 4D human generation and animation from a single image. Different from existing methods, Disco4D distinctively disentangles clothing (with Gaussian models) from the human body (with the SMPL-X model), significantly enhancing generation detail and flexibility. It has the following technical innovations. 1) Disco4D learns to efficiently fit the clothing Gaussians over the SMPL-X Gaussians. 2) It adopts diffusion models to enhance the 3D generation process, e.g., modeling occluded parts not visible in the input image. 3) It learns an identity encoding for each clothing Gaussian to facilitate the separation and extraction of clothing assets. Furthermore, Disco4D naturally supports 4D human animation with vivid dynamics. Extensive experiments demonstrate the superiority of Disco4D on 4D human generation and animation tasks.
StdGEN: Semantic-Decomposed 3D Character Generation from Single Images
Yuze He · Yanning Zhou · Wang Zhao · Zhongkai Wu · Kaiwen Xiao · Yang Wei · Yong-Jin Liu · Xiao Han
We present StdGEN, an innovative pipeline for generating semantically decomposed high-quality 3D characters from single images, enabling broad applications in virtual reality, gaming, and filmmaking, etc. Unlike previous methods which struggle with limited decomposability, unsatisfactory quality, and long optimization times, StdGEN features decomposability, effectiveness and efficiency; i.e., it generates intricately detailed 3D characters with separated semantic components such as the body, clothes, and hair, in three minutes. At the core of StdGEN is our proposed Semantic-aware Large Reconstruction Model (S-LRM), a transformer-based generalizable model that jointly reconstructs geometry, color and semantics from multi-view images in a feed-forward manner. A differentiable multi-layer semantic surface extraction scheme is introduced to acquire meshes from hybrid implicit fields reconstructed by our S-LRM. Additionally, a specialized efficient multi-view diffusion model and an iterative multi-layer surface refinement module are integrated into the pipeline to facilitate high-quality, decomposable 3D character generation. Extensive experiments demonstrate our state-of-the-art performance in 3D anime character generation, surpassing existing baselines by a significant margin in geometry, texture and decomposability. StdGEN offers ready-to-use semantic-decomposed 3D characters and enables flexible customization for a wide range of applications.
T-FAKE: Synthesizing Thermal Images for Facial Landmarking
Philipp Flotho · Moritz Piening · Anna Kukleva · Gabriele Steidl
Facial analysis is a key component in a wide range of applications such as security, autonomous driving, entertainment, and healthcare. Despite the availability of various facial RGB datasets, the thermal modality, which plays a crucial role in life sciences, medicine, and biometrics, has been largely overlooked. To address this gap, we introduce the T-FAKE dataset, a new large-scale synthetic thermal dataset with sparse and dense landmarks. To facilitate the creation of the dataset, we propose a novel RGB2Thermal loss function, which enables the domain-adaptive transfer of thermal style to RGB faces. By utilizing the Wasserstein distance between thermal and RGB patches and the statistical analysis of clinical temperature distributions on faces, we ensure that the generated thermal images closely resemble real samples. Using RGB2Thermal style transfer based on our RGB2Thermal loss function, we create the large-scale synthetic thermal T-FAKE dataset. Leveraging our novel T-FAKE dataset, probabilistic landmark prediction, and label adaptation networks, we demonstrate significant improvements in landmark detection methods on thermal images across different landmark conventions. Our models show excellent performance with both sparse 70-point landmarks and dense 478-point landmark annotations.
Diff-Palm: Realistic Palmprint Generation with Polynomial Creases and Intra-Class Variation Controllable Diffusion Models
Jianlong Jin · Chenglong Zhao · Ruixin Zhang · Sheng Shang · Jianqing Xu · Jingyun Zhang · ShaoMing Wang · Yang Zhao · Shouhong Ding · Wei Jia · Yunsheng Wu
Palmprint recognition is significantly limited by the lack of large-scale publicly available datasets. Previous methods have adopted Bézier curves to simulate the palm creases, which then serve as input for conditional GANs to generate realistic palmprints. However, without real-data fine-tuning, the performance of recognition models trained on these synthetic datasets drastically declines, indicating a large gap between generated and real palmprints. This is primarily due to the utilization of an inaccurate palm crease representation and challenges in balancing intra-class variation with identity consistency. To address this, we introduce a polynomial-based palm crease representation that provides a new palm crease generation mechanism more closely aligned with the real distribution. We also propose a palm-crease-conditioned diffusion model with a novel intra-class variation control method. By applying our proposed K-step noise-sharing sampling, we are able to synthesize palmprint datasets with large intra-class variation and high identity consistency. Experimental results show that, for the first time, recognition models trained solely on our synthetic datasets, without any fine-tuning, outperform those trained on real datasets. Furthermore, our approach achieves superior recognition performance as the number of generated identities increases. Our code and pre-trained models will be released.
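The K-step noise-sharing idea can be sketched as follows: samples of the same identity reuse one noise draw for the first K reverse-diffusion steps (pinning identity-level structure) and then switch to independent noise to create intra-class variation. The DDPM-style loop and the `denoise_step` callable are placeholders for illustration, not the authors' sampler.

```python
import torch

def k_step_noise_sharing(denoise_step, shape, num_samples, T=1000, K=200, device='cpu'):
    """Run a reverse-diffusion loop where the first K steps reuse one shared noise draw."""
    x = torch.randn(shape, device=device).expand(num_samples, *shape).clone()
    for t in reversed(range(T)):
        if t >= T - K:   # early steps: identical noise -> samples keep the same identity
            noise = torch.randn(shape, device=device).expand(num_samples, *shape)
        else:            # later steps: independent noise -> intra-class variation
            noise = torch.randn(num_samples, *shape, device=device)
        x = denoise_step(x, t, noise)   # placeholder for a real DDPM/DDIM update
    return x

dummy_step = lambda x, t, n: x          # stand-in update, only to show the call pattern
out = k_step_noise_sharing(dummy_step, (3, 64, 64), num_samples=4, T=10, K=3)
print(out.shape)                        # torch.Size([4, 3, 64, 64])
```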
GBC-Splat: Generalizable Gaussian-Based Clothed Human Digitalization under Sparse RGB Cameras
Hanzhang Tu · Zhanfeng Liao · Boyao Zhou · Shunyuan Zheng · Xilong Zhou · Liuxin ZHANG · QianYing Wang · Yebin Liu
We present an efficient approach for generalizable clothed human digitalization. Unlike previous methods that necessitate subject-wise optimization or disregard watertight geometry, the proposed method reconstructs complete human shapes and Gaussian Splatting from sparse-view RGB input. We extract a fine-grained mesh by combining implicit occupancy field regression with explicit disparity estimation between views. The reconstructed high-quality geometry allows us to easily anchor Gaussian primitives according to surface normals and texture, enabling 6-DoF photorealistic novel view synthesis. Further, we introduce a simple yet effective algorithm to split Gaussian primitives in high-frequency areas to enhance visual quality. Without the assistance of templates like SMPL, our method can tackle loose clothing like dresses and costumes. To this end, we train our reconstruction pipeline on a large amount of human scan data to achieve generalization capability across datasets. Our method outperforms recent methods in terms of novel view synthesis while remaining highly efficient, enabling the potential for deployment in real-time applications.
VTON 360: High-Fidelity Virtual Try-On from Any Viewing Direction
Zijian He · Yuwei Ning · Yipeng Qin · Guangrun Wang · Sibei Yang · Liang Lin · Guanbin Li
Virtual Try-On (VTON) is a transformative technology in e-commerce and fashion design, enabling realistic digital visualization of clothing on individuals. In this work, we propose VTON 360, a novel 3D VTON method that addresses the open challenge of achieving high-fidelity VTON that supports any-view rendering. Specifically, we leverage the equivalence between a 3D model and its rendered multi-view 2D images, and reformulate 3D VTON as an extension of 2D VTON that ensures 3D-consistent results across multiple views. To achieve this, we extend 2D VTON models to include multi-view garments and clothing-agnostic human body images as input, and propose several novel techniques to enhance them, including: i) a pseudo-3D pose representation using normal maps derived from the SMPL-X 3D human model, ii) a multi-view spatial attention mechanism that models the correlations between features from different viewing angles, and iii) a multi-view CLIP embedding that enhances the garment CLIP features used in 2D VTON with camera information. Extensive experiments on large-scale real datasets and clothing images from e-commerce platforms demonstrate the effectiveness of our approach.
BooW-VTON: Boosting In-the-Wild Virtual Try-On via Mask-Free Pseudo Data Training
Xuanpu Zhang · Dan Song · pengxin zhan · Tianyu Chang · Jianhao Zeng · Qing-Guo Chen · Weihua Luo · An-An Liu
Image-based virtual try-on is an increasingly popular and important task that generates realistic try-on images of a specific person. Recent methods model virtual try-on as an image mask-inpainting task, which requires masking the person image and results in a significant loss of spatial information. In particular, for in-the-wild try-on scenarios with complex poses and occlusions, mask-based methods often introduce noticeable artifacts. Our research found that a mask-free approach can fully leverage spatial and lighting information from the original person image, enabling high-quality virtual try-on. Consequently, we propose a novel training paradigm for a mask-free try-on diffusion model. We ensure the model's mask-free try-on capability by creating high-quality pseudo-data and further enhance its handling of complex spatial information through effective in-the-wild data augmentation. In addition, a try-on localization loss is designed to concentrate on the try-on area while suppressing garment features in non-try-on areas, ensuring precise rendering of garments and preservation of the foreground and background. Finally, we introduce BooW-VTON, a mask-free virtual try-on diffusion model that delivers SOTA try-on quality without any parsing cost. Extensive qualitative and quantitative experiments demonstrate superior performance in wild scenarios with such low-demand inputs.
SFDM: Robust Decomposition of Geometry and Reflectance for Realistic Face Rendering from Sparse-view Images
Daisheng Jin · Jiangbei Hu · Baixin Xu · Yuxin Dai · Chen Qian · Ying He
In this study, we introduce a novel two-stage technique for decomposing and reconstructing facial features from sparse-view images, a task made challenging by the unique geometry and complex skin reflectance of each individual. To synthesize 3D facial models more realistically, we endeavor to decouple key facial attributes from the RGB color, including geometry, diffuse reflectance, and specular reflectance. Specifically, we design a Sparse-view Face Decomposition Model (SFDM): 1) In the first stage, we create a general facial template from a wide array of individual faces, encapsulating essential geometric and reflectance characteristics. 2) Guided by this template, we refine a specific facial model for each individual in the second stage, considering the interaction between geometry and reflectance, as well as the effects of subsurface scattering on the skin. With these advances, our method can reconstruct high-quality facial representations from as few as three images. The comprehensive evaluation and comparison reveal that our approach outperforms existing methods by effectively disentangling geometric and reflectance components, significantly enhancing the quality of synthesized novel views, and paving the way for applications in facial relighting and reflectance editing. The code will be made available to the public.
Integral Fast Fourier Color Constancy
Wenjun Wei · Yanlin Qian · Huaian Chen · Junkang Dai · Yi Jin
Traditional auto white balance (AWB) algorithms typically assume a single global illuminant source, which leads to color distortions in multi-illuminant scenes. While recent neural network-based methods have shown excellent accuracy in such scenarios, their high parameter count and computational demands limit their practicality for real-time video applications. The Fast Fourier Color Constancy (FFCC) algorithm was proposed for single-illuminant-source scenes, predicting a global illuminant source with high efficiency. However, it cannot be directly applied to multi-illuminant scenarios unless specifically modified. To address this, we propose Integral Fast Fourier Color Constancy (IFFCC), an extension of FFCC tailored for multi-illuminant scenes. IFFCC leverages the proposed integral UV histogram to accelerate histogram computations across all possible regions in Cartesian space and parallelizes Fourier-based convolution operations, resulting in a spatially smooth illumination map. This approach enables high-accuracy, real-time AWB in multi-illuminant scenes. Extensive experiments show that IFFCC achieves accuracy that is on par with or surpasses that of pixel-level neural networks, while reducing the parameter count by over 400× and running 20-100× faster than network-based approaches.
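The integral UV histogram can be illustrated with a summed-area table built per chroma bin, after which the UV histogram of any rectangular region is obtained from four lookups per bin; this is what allows region-wise FFCC-style estimates to run quickly. The bin count, log-chroma range, and NumPy layout below are assumptions made for illustration.

```python
import numpy as np

def integral_uv_histogram(img, n_bins=32, eps=1e-6):
    """Build a per-bin summed-area table of the per-pixel UV (log-chroma) histogram."""
    r, g, b = img[..., 0] + eps, img[..., 1] + eps, img[..., 2] + eps
    u = np.log(g / r)                              # log-chroma coordinates
    v = np.log(g / b)
    ub = np.clip(((u + 2) / 4 * n_bins).astype(int), 0, n_bins - 1)
    vb = np.clip(((v + 2) / 4 * n_bins).astype(int), 0, n_bins - 1)
    H, W = img.shape[:2]
    hist = np.zeros((H, W, n_bins, n_bins), np.float32)
    hist[np.arange(H)[:, None], np.arange(W)[None, :], ub, vb] = 1.0
    return hist.cumsum(0).cumsum(1)                # integral over the two spatial axes

def region_histogram(integral, y0, x0, y1, x1):
    """UV histogram of the box [y0:y1, x0:x1) via the summed-area identity."""
    total = integral[y1 - 1, x1 - 1].copy()
    if y0 > 0: total -= integral[y0 - 1, x1 - 1]
    if x0 > 0: total -= integral[y1 - 1, x0 - 1]
    if y0 > 0 and x0 > 0: total += integral[y0 - 1, x0 - 1]
    return total                                   # (n_bins, n_bins) histogram of the region

img = np.random.rand(64, 64, 3).astype(np.float32)
I = integral_uv_histogram(img)
print(region_histogram(I, 8, 8, 40, 40).sum())     # equals the pixel count of the region: 1024.0
```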
Reversible Decoupling Network for Single Image Reflection Removal
Hao Zhao · Mingjia Li · Qiming Hu · Xiaojie Guo
Recent deep-learning-based approaches to single-image reflection removal have shown promising advances, primarily for two reasons: 1) the utilization of recognition-pretrained features as inputs, and 2) the design of dual-stream interaction networks. However, according to the Information Bottleneck principle, high-level semantic clues tend to be compressed or discarded during layer-by-layer propagation. Additionally, interactions in dual-stream networks follow a fixed pattern across different layers, limiting overall performance. To address these limitations, we propose a novel architecture called Reversible Decoupling Network (RDNet), which employs a reversible encoder to secure valuable information while flexibly decoupling transmission- and reflection-relevant features during the forward pass. Furthermore, we customize a transmission-rate-aware prompt generator to dynamically calibrate features, further boosting performance. Extensive experiments demonstrate the superiority of RDNet over existing SOTA methods on five widely-adopted benchmark datasets. Our code will be made publicly available.
Stabilizing and Accelerating Autofocus with Expert Trajectory Regularized Deep Reinforcement Learning
Shouhang Zhu · Chenglin Li · Yuankun Jiang · Li Wei · Nuowen Kan · Ziyang Zheng · Wenrui Dai · Junni Zou · Hongkai Xiong
Autofocus is a crucial component of modern digital cameras. While recent learning-based methods achieve state-of-the-art in-focus prediction accuracy, they unfortunately ignore the potential focus hunting phenomenon of back-and-forth lens movement in the multi-step focusing procedure. To address this, in this paper, we propose an expert regularized deep reinforcement learning (DRL)-based approach for autofocus, which is able to utilize the sequential information of lens movement trajectory to both enhance the multi-step in-focus prediction accuracy and reduce the chance of focus hunting. Our method generally follows an actor-critic framework. To accelerate the DRL's training with a higher sample efficiency, we initialize the policy with a pre-trained single-step prediction network, where the network is further improved by modifying the output of absolute in-focus position distribution to the relative lens movement distribution to establish a better mapping between input images and lens movement. To further stabilize DRL's training with lower focus hunting occurrence in the resulting lens movement trajectory, we generate some offline trajectories based on the prior knowledge to avoid focus hunting, which are then leveraged as an offline dataset of expert trajectories to regularize actor network's training. Empirical evaluations show that our method outperforms those learning-based methods on public benchmarks, with higher single- and multi-step prediction accuracies, and a significant reduction of focus hunting rate.
V2V3D: View-to-View Denoised 3D Reconstruction for Light Field Microscopy
Jiayin Zhao · Zhenqi Fu · Tao Yu · Hui Qiao
Light field microscopy (LFM) has gained significant attention due to its ability to capture snapshot-based, large-scale 3D fluorescence images. However, current LFM reconstruction algorithms are highly sensitive to sensor noise and lack robustness when applied to experimental data. To address these challenges, this paper presents an unsupervised view-to-view LFM 3D reconstruction framework, named V2V3D. Unlike existing methods that directly use all views for reconstruction, V2V3D divides the views into two subsets, with each subset generating corresponding volumes and working together to effectively remove sensor noise. To enhance the recovery of high-frequency details, we propose a novel wave-optics-based feature alignment technique, which transforms the point spread function, used for forward propagation in wave optics, into convolution kernels specifically designed for feature alignment. Moreover, we introduce an LFM dataset generated using two-photon excitation, including both the light field images and the corresponding 3D intensity volumes. Extensive experiments demonstrate that our unsupervised approach achieves high computational efficiency and outperforms the other state-of-the-art methods. These advancements position V2V3D as a promising solution for 3D imaging under challenging conditions. Our code and dataset will be made publicly available.
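The view-splitting idea admits a simple noise2noise-style reading: the volume reconstructed from one view subset is asked to explain the noisy views of the other subset, and vice versa, so that only the signal shared across views survives. The cross-rendering loss below is an assumption based on that reading, and `render_lf` is a hypothetical forward projector standing in for the paper's wave-optics model.

```python
import torch

def cross_view_loss(volume_a, volume_b, views_a, views_b, render_lf):
    """
    volume_a/b: volumes reconstructed from view subsets A and B
    views_a/b:  dicts {view_id: noisy 2D view} for each subset
    render_lf(volume, view_id) -> predicted 2D view at that angular position
    """
    loss = 0.0
    for vid, img in views_b.items():   # the volume from subset A must explain subset B's views
        loss = loss + torch.mean((render_lf(volume_a, vid) - img) ** 2)
    for vid, img in views_a.items():   # and vice versa
        loss = loss + torch.mean((render_lf(volume_b, vid) - img) ** 2)
    return loss

render = lambda vol, vid: vol.mean(dim=0)           # hypothetical stand-in projector
vol_a, vol_b = torch.rand(16, 32, 32), torch.rand(16, 32, 32)
views = {0: torch.rand(32, 32), 1: torch.rand(32, 32)}
print(float(cross_view_loss(vol_a, vol_b, views, views, render)))
```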
DoF-Gaussian: Controllable Depth-of-Field for 3D Gaussian Splatting
Liao Shen · Tianqi Liu · Huiqiang Sun · Jiaqi Li · Zhiguo Cao · Wei Li · Chen Change Loy
Recent advances in 3D Gaussian Splatting (3D-GS) have shown remarkable success in representing 3D scenes and generating high-quality, novel views in real-time. However, 3D-GS and its variants assume that input images are captured based on pinhole imaging and are fully in focus. This assumption limits their applicability, as real-world images often feature shallow depth-of-field (DoF). In this paper, we introduce DoF-Gaussian, a controllable depth-of-field method for 3D-GS. We develop a lens-based imaging model based on geometric optics principles to control DoF effects. To ensure accurate scene geometry, we incorporate depth priors adjusted per scene, and we apply defocus-to-focus adaptation to minimize the gap in the circle of confusion. We also introduce a synthetic dataset to assess refocusing capabilities and the model’s ability to learn precise lens parameters. Our framework is customizable and supports various interactive applications. Extensive experiments confirm the effectiveness of our method. Code and the dataset will be made publicly available.
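For reference, the geometric-optics quantity such a lens-based imaging model revolves around is the thin-lens circle of confusion (CoC), which maps a point's depth, the focus distance, the focal length, and the aperture to a blur diameter on the sensor. The formula is standard optics; using it as a per-point blur radius, and the function below, are only an illustration of the idea, not the paper's implementation.

```python
import numpy as np

def circle_of_confusion(depth, focus_dist, focal_len, aperture):
    """
    depth:      scene depth of a point (same units as focal_len)
    focus_dist: distance the lens is focused at
    focal_len:  lens focal length
    aperture:   aperture (entrance pupil) diameter
    Returns the CoC diameter on the sensor.
    """
    return aperture * focal_len * np.abs(depth - focus_dist) / (depth * (focus_dist - focal_len))

# A point twice as far as the focus plane, with a 50 mm f/2 lens focused at 2 m:
print(circle_of_confusion(depth=4000.0, focus_dist=2000.0, focal_len=50.0, aperture=25.0))
```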
Luminance-GS: Adapting 3D Gaussian Splatting to Challenging Lighting Conditions with View-Adaptive Curve Adjustment
Ziteng Cui · Xuangeng Chu · Tatsuya Harada
Capturing high-quality photographs across diverse real-world lighting conditions is challenging, as both natural lighting (e.g., low-light) and camera exposure settings (e.g., exposure time) strongly influence image quality. This difficulty intensifies in multi-view scenarios, where each viewpoint can have distinct lighting and image signal processor (ISP) settings, causing photometric inconsistencies between views. These lighting degradations and view variations pose significant challenges to both NeRF- and 3D Gaussian Splatting (3DGS)-based novel view synthesis (NVS) frameworks. To address this, we introduce Luminance-GS, a novel approach to achieve high-quality novel view synthesis results under diverse and challenging lighting conditions using 3DGS. By adopting per-view color space mapping and view-adaptive curve adjustments, Luminance-GS achieves state-of-the-art (SOTA) results across various lighting conditions—including low-light, overexposure, and varying exposure—without altering the original 3DGS explicit representation. Compared to previous NeRF- and 3DGS-based baselines, Luminance-GS provides real-time rendering speed with improved reconstruction quality. We will release the source code.
3DGUT: Enabling Distorted Cameras and Secondary Rays in Gaussian Splatting
Qi Wu · Janick Martinez Esturo · Ashkan Mirzaei · Nicolas Moënne-Loccoz · Žan Gojčič
3D Gaussian Splatting (3DGS) has shown great potential for efficient reconstruction and high-fidelity real-time rendering of complex scenes on consumer hardware. However, due to its rasterization-based formulation, 3DGS is constrained to ideal pinhole cameras and lacks support for secondary lighting effects. Recent methods address these limitations by tracing volumetric particles instead; however, this comes at the cost of significantly slower rendering speeds. In this work, we propose the 3D Gaussian Unscented Transform (3DGUT), replacing the EWA splatting formulation in 3DGS with the Unscented Transform, which approximates the particles through sigma points that can be projected exactly under any nonlinear projection function. This modification enables trivial support of distorted cameras with time-dependent effects such as rolling shutter, while retaining the efficiency of rasterization. Additionally, we align our rendering formulation with that of tracing-based methods, enabling the secondary ray tracing required to represent phenomena such as reflections and refraction within the same 3D representation.
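A compact sketch of the Unscented Transform at the heart of this formulation: sigma points drawn from a 3D Gaussian are pushed exactly through an arbitrary (possibly distorted) camera projection, and the projected 2D mean and covariance are re-estimated from them. The UT parameters and the pinhole stand-in projection are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def unscented_project(mu, cov, project, alpha=1.0, beta=2.0, kappa=0.0):
    """Project a Gaussian (mu, cov) through a nonlinear function via sigma points."""
    n = mu.shape[0]
    lam = alpha**2 * (n + kappa) - n
    L = np.linalg.cholesky((n + lam) * cov)          # matrix square root; its columns are the offsets
    pts = np.vstack([mu, mu + L.T, mu - L.T])        # (2n+1, n) sigma points
    wm = np.full(2 * n + 1, 1.0 / (2 * (n + lam)))
    wc = wm.copy()
    wm[0] = lam / (n + lam)
    wc[0] = lam / (n + lam) + (1 - alpha**2 + beta)
    proj = np.array([project(p) for p in pts])       # exact nonlinear projection of each point
    mean2d = wm @ proj
    diff = proj - mean2d
    cov2d = (wc[:, None] * diff).T @ diff            # weighted outer products
    return mean2d, cov2d

def pinhole(p, f=500.0):                             # stand-in; any distortion model works here
    return f * p[:2] / p[2]

m, C = unscented_project(np.array([0.1, -0.2, 4.0]), np.diag([0.02, 0.02, 0.05]), pinhole)
print(m, C.shape)                                    # approximate 2D mean and 2x2 covariance
```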
Diffusion Renderer: Neural Inverse and Forward Rendering with Video Diffusion Models
Ruofan Liang · Žan Gojčič · Huan Ling · Jacob Munkberg · Jon Hasselgren · Chih-Hao Lin · Jun Gao · Alexander Keller · Nandita Vijaykumar · Sanja Fidler · Zian Wang
Understanding and modeling lighting effects are fundamental tasks in computer vision and graphics. Classic physically-based rendering (PBR) accurately simulates the light transport, but relies on precise scene representations--explicit 3D geometry, high-quality material properties, and lighting conditions--that are often impractical to obtain in real-world scenarios. Therefore, we introduce Diffusion Renderer, a neural approach that addresses the dual problem of inverse and forward rendering within a holistic framework. Leveraging powerful video diffusion model priors, the inverse rendering model accurately estimates G-buffers from real-world videos, providing an interface for image editing tasks, and training data for the rendering model. Conversely, our rendering model generates photorealistic images from G-buffers without explicit light transport simulation. Specifically, we first train a video diffusion model for inverse rendering on synthetic data, which generalizes well to real-world videos and allows us to auto-label diverse real-world videos. We then co-train our rendering model using both synthetic and auto-labeled real-world data. Experiments demonstrate that Diffusion Renderer effectively approximates inverse and forward rendering, consistently outperforming the state of the art. Our model enables practical applications from a single video input—including relighting, material editing, and realistic object insertion.
Ref-GS: Directional Factorization for 2D Gaussian Splatting
Youjia Zhang · Anpei Chen · Yumin Wan · Zikai Song · Junqing Yu · Yawei Luo · Wei Yang
In this paper, we introduce Ref-GS, a novel approach for directional light factorization in 2D Gaussian splatting, which enables photorealistic view-dependent appearance rendering and precise geometry recovery. Ref-GS builds upon the deferred rendering of Gaussian splatting and applies directional encoding to the deferred-rendered surface, effectively reducing the ambiguity between orientation and viewing angle. Next, we introduce a spherical mip-grid to capture varying levels of surface roughness, enabling roughness-aware Gaussian shading. Additionally, we propose a simple yet efficient geometry-lighting factorization that connects geometry and lighting via the vector outer product, significantly reducing renderer overhead when integrating volumetric attributes. Our method achieves superior photorealistic rendering for a range of open-world scenes while also accurately recovering geometry.
NeISF++: Neural Incident Stokes Field for Polarized Inverse Rendering of Conductors and Dielectrics
Chenhao Li · Taishi Ono · Takeshi Uemori · Sho Nitta · Hajime Mihara · Alexander Gatto · Hajime Nagahara · Yusuke Moriuchi
Recent inverse rendering methods have greatly improved shape, material, and illumination reconstruction by utilizing polarization cues. However, existing methods only support dielectrics, ignoring conductors that are found everywhere in daily life. Since conductors and dielectrics have different reflection properties, applying previous methods to conductors leads to obvious errors. In addition, conductors are glossy, which may cause strong specular reflections that are hard to reconstruct. To solve the above issues, we propose NeISF++, an inverse rendering pipeline that supports both conductors and dielectrics. The key ingredient of our proposal is a general pBRDF that describes both conductors and dielectrics. As for the strong specular reflection problem, we propose a novel geometry initialization method using DoLP images. This physical cue is invariant to intensities and thus robust to strong specular reflections. Experimental results on our synthetic and real datasets show that our method surpasses existing polarized inverse rendering methods for geometry and material decomposition as well as downstream tasks like relighting.
FluidNexus: 3D Fluid Reconstruction and Prediction from a Single Video
Yue Gao · Hong-Xing Yu · Bo Zhu · Jiajun Wu
We study reconstructing and predicting 3D fluid appearance and velocity from a single video. Current methods require multi-view videos for fluid reconstruction. We present FluidNexus, a novel framework that bridges video generation and physics simulation to tackle this task. Our key insight is to synthesize multiple novel-view videos as references for reconstruction. FluidNexus consists of two key components: (1) a novel-view video synthesizer that combines frame-wise view synthesis with video diffusion refinement for generating realistic videos, and (2) a physics-integrated particle representation coupling differentiable simulation and rendering to simultaneously facilitate 3D fluid reconstruction and prediction. To evaluate our approach, we collect two new real-world fluid datasets featuring textured backgrounds and object interactions. Our method enables dynamic novel view synthesis, future prediction, and interaction simulation from a single fluid video. We will release code and datasets.
Uni-Renderer: Unifying Rendering and Inverse Rendering Via Dual Stream Diffusion
ZhiFei Chen · Tianshuo Xu · Wenhang Ge · Leyi Wu · Dongyu Yan · Jing He · Luozhou Wang · Lu Zeng · Shunsi Zhang · Ying-Cong Chen
Rendering and inverse rendering are pivotal tasks in both computer vision and graphics. The rendering equation is the core of the two tasks, as an ideal conditional distribution transfer function from intrinsic properties to RGB images. Despite the promising results of existing rendering methods, they merely approximate the ideal estimation for a specific scene and come with a high computational cost. Additionally, the inverse conditional distribution transfer is intractable due to the inherent ambiguity. To address these challenges, we propose a data-driven method that jointly models rendering and inverse rendering as two conditional generation tasks within a single diffusion framework. Inspired by UniDiffuser, we utilize two distinct time schedules to model both tasks, and with a tailored dual streaming module, we achieve cross-conditioning of two pre-trained diffusion models. This unified approach, named Uni-Renderer, allows the two processes to facilitate each other through a cycle-consistent constraint, mitigating ambiguity by enforcing consistency between intrinsic properties and rendered images. Combined with a meticulously prepared dataset, our method effectively decomposes intrinsic properties and demonstrates a strong capability to recognize changes during rendering. We will open-source our training and inference code to the public, fostering further research and development in this area.
Neural LightRig: Unlocking Accurate Object Normal and Material Estimation with Multi-Light Diffusion
Zexin He · Tengfei Wang · Xin Huang · Xingang Pan · Ziwei Liu
Recovering the geometry and materials of objects from a single image is challenging due to its under-constrained nature. In this paper, we present Neural LightRig, a novel framework that boosts intrinsic estimation by leveraging auxiliary multi-lighting conditions from 2D diffusion priors. Specifically, 1) we first leverage illumination priors from large-scale diffusion models to build our multi-light diffusion model on a synthetic relighting dataset with dedicated designs. This diffusion model generates multiple consistent images, each illuminated by point light sources in different directions. 2) By using these varied lighting images to reduce estimation uncertainty, we train a large G-buffer model with a U-Net backbone to accurately predict surface normals and materials. Extensive experiments validate that our approach significantly outperforms state-of-the-art methods, enabling accurate surface normal and PBR material estimation with vivid relighting effects. Our code and dataset will be made publicly available.
RNG: Relightable Neural Gaussians
Jiahui Fan · Fujun Luan · Jian Yang · Milos Hasan · Beibei Wang
3D Gaussian Splatting (3DGS) has shown impressive results for the novel view synthesis task, where lighting is assumed to be fixed. However, creating relightable 3D assets, especially for objects with ill-defined shapes (fur, fabric, etc.), remains a challenging task. The decomposition between light, geometry, and material is ambiguous, especially if either smooth surface assumptions or surface-based analytical shading models do not apply. We propose Relightable Neural Gaussians (RNG), a novel 3DGS-based framework that enables the relighting of objects with either hard surfaces or soft boundaries, while avoiding assumptions on the shading model. We condition the radiance at each point on both view and light directions. We also introduce a shadow cue, as well as a depth refinement network, to improve shadow accuracy. Finally, we propose a hybrid forward-deferred fitting strategy to balance geometry and appearance quality. Our method achieves significantly faster training (1.3 hours) and rendering (60 frames per second) compared to a prior method based on neural radiance fields, and produces higher-quality shadows than a concurrent 3DGS-based method.
SGSST: Scaling Gaussian Splatting Style Transfer
Bruno Galerne · Jianling WANG · Lara Raad · Jean-michel Morel
Applying style transfer to a full 3D environment is a challenging task that has seen many developments since the advent of neural rendering. 3D Gaussian splatting (3DGS) has recently pushed further many limits of neural rendering in terms of training speed and reconstruction quality. This work introduces SGSST: Scaling Gaussian Splatting Style Transfer, an optimization-based method to apply style transfer to pretrained 3DGS scenes. We demonstrate that a new multiscale loss based on global neural statistics, which we name SOS for Simultaneously Optimized Scales, enables style transfer to ultra-high-resolution 3D scenes. Not only does SGSST pioneer 3D scene style transfer at such high image resolutions, it also produces superior visual quality as assessed by thorough qualitative, quantitative and perceptual comparisons.
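One way to picture a "simultaneously optimized scales" objective is a Gram-matrix style loss summed over a pyramid of downsampled renderings, so coarse and fine structure are optimized at once. The loss form, pyramid depth, and the externally supplied feature extractor below are assumptions for illustration, not the exact SOS loss.

```python
import torch
import torch.nn.functional as F

def gram(feat):                                      # feat: (B, C, H, W)
    B, C, H, W = feat.shape
    f = feat.reshape(B, C, H * W)
    return f @ f.transpose(1, 2) / (C * H * W)       # normalized Gram matrix of feature channels

def multiscale_style_loss(rendering, style_image, features, num_scales=3):
    """features(x) -> list of (B, C_l, H_l, W_l) feature maps (e.g. VGG activations)."""
    loss = 0.0
    for s in range(num_scales):
        scale = 0.5 ** s
        r = F.interpolate(rendering, scale_factor=scale, mode='bilinear') if s else rendering
        t = F.interpolate(style_image, scale_factor=scale, mode='bilinear') if s else style_image
        for fr, ft in zip(features(r), features(t)): # compare style statistics at every scale
            loss = loss + F.mse_loss(gram(fr), gram(ft))
    return loss

feats = lambda x: [x]                                # stand-in for a real feature extractor
loss = multiscale_style_loss(torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256), feats)
print(float(loss))
```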
Vid2Sim: Generalizable, Video-based Reconstruction of Appearance, Geometry and Physics for Mesh-free Simulation
Chuhao Chen · Zhiyang Dou · Chen Wang · Yiming Huang · Anjun Chen · Qiao Feng · Jiatao Gu · Lingjie Liu
Faithfully reconstructing textured shapes and physical properties from videos presents an intriguing yet challenging problem. Significant efforts have been dedicated to advancing system identification in this area. Previous methods often rely on heavy optimization pipelines with a differentiable simulator and renderer to estimate physical parameters. However, these approaches frequently necessitate extensive hyperparameter tuning for each scene and involve a costly optimization process, which limits both their practicality and generalizability. In this work, we propose a novel framework, Vid2Sim, a generalizable video-based approach for recovering geometry and physical properties through a mesh-free reduced simulation based on Linear Blend Skinning (LBS), offering high computational efficiency and versatile representation capability. Specifically, Vid2Sim first reconstructs the observed configuration of the physical system from video using a feed-forward neural network trained to capture physical world knowledge. A lightweight optimization pipeline then refines the estimated appearance, geometry, and physical properties to closely align with video observations within just a few minutes. Additionally, after the reconstruction, Vid2Sim enables high-quality, mesh-free simulation with high efficiency. Extensive experiments demonstrate that our method achieves superior accuracy and efficiency in reconstructing geometry and physical properties from video data. Our code and models will be publicly available upon acceptance.
Material Anything: Generating Materials for Any 3D Object via Diffusion
Xin Huang · Tengfei Wang · Ziwei Liu · Qing Wang
We present Material Anything, a fully-automated, unified diffusion framework designed to generate physically-based materials for 3D objects. Unlike existing methods that rely on complex pipelines or case-specific optimizations, Material Anything offers a robust, end-to-end solution adaptable to objects under diverse lighting conditions. Our approach leverages a pre-trained image diffusion model, enhanced with a triple-head architecture and rendering loss to improve stability and material quality. Additionally, we introduce confidence masks as a dynamic switcher within the diffusion model, enabling it to effectively handle both textured and texture-less objects across varying lighting conditions. By employing a progressive material generation strategy guided by these confidence masks, along with a UV-space material refiner, our method ensures consistent, UV-ready material outputs. Extensive experiments demonstrate our approach outperforms existing methods across a wide range of object categories and lighting conditions.
TexGarment: Consistent Garment UV Texture Generation via Efficient 3D Structure-Guided Diffusion Transformer
Jialun Liu · Jinbo Wu · Xiaobo Gao · JiaKui Hu · Bojun Xiong · Xing Liu · Chen Zhao · Hongbin Pei · Haocheng Feng · Yingying Li · Errui Ding · Jingdong Wang
This paper introduces TexGarment, an efficient method for synthesizing high-quality, 3D-consistent garment textures in UV space. Traditional approaches based on 2D-to-3D mapping often suffer from 3D inconsistency, while methods learning from limited 3D data lack sufficient texture diversity. These limitations are particularly problematic in garment texture generation, where high demands exist for both detail and variety. To address these challenges, TexGarment leverages a pre-trained text-to-image diffusion Transformer model with robust generalization capabilities, introducing structural information to guide the model in generating 3D-consistent garment textures in a single inference step. Specifically, we utilize the 2D UV position map to guide the layout during the UV texture generation process, ensuring a coherent texture arrangement, and enhance it by integrating global 3D structural information from the mesh surface point cloud. This combined guidance effectively aligns 3D structural integrity with the 2D layout. Our method efficiently generates high-quality, diverse UV textures in a single inference step while maintaining 3D consistency. Experimental results validate the effectiveness of TexGarment, achieving state-of-the-art performance in 3D garment texture generation.
3DTopia-XL: Scaling High-quality 3D Asset Generation via Primitive Diffusion
Zhaoxi Chen · Jiaxiang Tang · Yuhao Dong · Ziang Cao · Fangzhou Hong · Yushi Lan · Tengfei Wang · Haozhe Xie · Tong Wu · Shunsuke Saito · Liang Pan · Dahua Lin · Ziwei Liu
The increasing demand for high-quality 3D assets across various industries necessitates efficient and automated 3D content creation. Despite recent advancements in 3D generative models, existing methods still face challenges with optimization speed, geometric fidelity, and the lack of assets for physically based rendering (PBR). In this paper, we introduce 3DTopia-XL, a scalable native 3D generative model designed to overcome these limitations. 3DTopia-XL leverages a novel primitive-based 3D representation, PrimX, which encodes detailed shape, albedo, and material fields into a compact tensorial format, facilitating the modeling of high-resolution geometry with PBR assets. On top of the novel representation, we propose a generative framework based on the Diffusion Transformer (DiT), which comprises 1) Primitive Patch Compression and 2) Latent Primitive Diffusion. 3DTopia-XL learns to generate high-quality 3D assets from textual or visual inputs. Extensive qualitative and quantitative experiments are conducted to demonstrate that 3DTopia-XL significantly outperforms existing methods in generating high-quality 3D assets with fine-grained textures and materials, efficiently bridging the quality gap between generative models and real-world applications.
BrepGiff: Lightweight Generation of Complex B-rep with 3D GAT Diffusion
Hao Guo · Xiaoshui Huang · Hao jiacheng · Yunpeng Bai · Hongping Gan · Yilei Shi
Despite advancements in Computer-Aided-Design (CAD) generation, direct generation of complex Boundary Representation (B-rep) CAD models remains challenging. This difficulty arises from the parametric nature of B-rep data, complicating the encoding and generation of its geometric and topological information. To address this, we introduce BrepGiff, a lightweight generation approach for high-quality and complex B-rep models based on 3D Graph Diffusion. First, we transfer B-rep models into a 3D graph representation. Specifically, BrepGiff extracts and integrates topological and geometric features to construct a 3D graph where nodes correspond to face centroids in 3D space, preserving adjacency and degree information. Geometric features are derived by sampling points in the UV domain and extracting face and edge features. Then, BrepGiff applies a Graph Attention Network (GAT) to enforce topological constraints from local to global during the degree-guided diffusion process. With the 3D graph representation and efficient diffusion process, our method significantly reduces the computational cost and improves the quality, thus achieving lightweight generation of complex models. Experiments show that BrepGiff can generate complex B-rep models ($>$100 faces) using only 2 RTX4090 GPUs, achieving state-of-the-art performance in B-rep generation.
Towards Realistic Example-based Modeling via 3D Gaussian Stitching
Xinyu Gao · Ziyi Yang · Bingchen Gong · Xiaoguang Han · Sipeng Yang · Xiaogang Jin
Using parts of existing models to rebuild new models, commonly termed example-based modeling, is a classical methodology in the realm of computer graphics. Previous works mostly focus on shape composition, making them very hard to use for realistic composition of 3D objects captured from real-world scenes. This has led to approaches that combine multiple NeRFs into a single 3D scene to achieve seamless appearance blending. However, the current SeamlessNeRF method struggles to achieve interactive editing and harmonious stitching for real-world scenes due to its gradient-based strategy and grid-based representation. To this end, we present an example-based modeling method that combines multiple Gaussian fields in a point-based representation using sample-guided synthesis. Specifically, as for composition, we create a GUI to segment and transform multiple fields in real time, easily obtaining a semantically meaningful composition of models represented by 3D Gaussian Splatting (3DGS). For texture blending, due to the discrete and irregular nature of 3DGS, straightforwardly applying gradient propagation as in SeamlessNeRF is not supported. Thus, a novel sampling-based cloning method is proposed to harmonize the blending while preserving the original rich texture and content. Our workflow consists of three steps: 1) real-time segmentation and transformation of a Gaussian model using a well-tailored GUI, 2) KNN analysis to identify boundary points in the intersecting area between the source and target models, and 3) two-phase optimization of the target model using sampling-based cloning and gradient constraints. Extensive experimental results validate that our approach significantly outperforms previous works in terms of realistic synthesis, demonstrating its practicality.
TreeMeshGPT: Artistic Mesh Generation with Autoregressive Tree Sequencing
Stefan Lionar · Jiabin Liang · Gim Hee Lee
We introduce TreeMeshGPT, an autoregressive Transformer designed to generate high-quality artistic meshes aligned with input point clouds. Instead of the conventional next-token prediction in autoregressive Transformers, we propose a novel Autoregressive Tree Sequencing where the next input token is retrieved from a dynamically growing tree structure that is built upon the triangle adjacency of faces within the mesh. Our sequencing enables the mesh to extend locally from the last generated triangular face at each step, and therefore reduces training difficulty and improves mesh quality. Our approach represents each triangular face with two tokens, achieving a compression rate of approximately 22% compared to the naive face tokenization. Due to this efficient tokenization technique, we push the boundary of artistic mesh generation to the face limit of 5,500 triangles with a strong point cloud condition of 2,048 tokens, surpassing previous methods. Furthermore, our method generates meshes with strong normal orientation constraints, minimizing flipped normals commonly encountered in previous methods. Our experiments show that TreeMeshGPT enhances the mesh generation quality with refined details and normal orientation consistency.
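The tree-structured sequencing is the part that differs most from vanilla next-token decoding, so here is a minimal sketch of the general idea under simplifying assumptions: a frontier of directed boundary edges is kept on a stack, each popped edge is expanded into at most one new triangle, and the two new edges are pushed so generation always continues from the most recently generated face. The `predict_vertex` callable is a stand-in for the autoregressive model; the paper's actual two-token-per-face tokenization is not reproduced here.

```python
from collections import deque

def decode_mesh(predict_vertex, first_triangle, max_faces=5500):
    """Grow a triangle mesh face-by-face from a seed triangle.

    predict_vertex(edge, faces) stands in for the autoregressive model: given a
    directed boundary edge it returns a vertex id (new or existing) forming the
    adjacent triangle, or None to stop expanding across that edge.
    """
    faces = [tuple(first_triangle)]
    # Frontier of directed boundary edges; pushing the two new edges of each
    # generated face makes the expansion follow a growing tree of adjacent faces.
    frontier = deque([(first_triangle[0], first_triangle[1]),
                      (first_triangle[1], first_triangle[2]),
                      (first_triangle[2], first_triangle[0])])
    while frontier and len(faces) < max_faces:
        a, b = frontier.pop()              # expand from the most recent face first
        v = predict_vertex((a, b), faces)
        if v is None:                      # a "stop" decision for this edge
            continue
        faces.append((b, a, v))            # reversed edge keeps consistent winding
        frontier.append((a, v))
        frontier.append((v, b))
    return faces

# Dummy "model": mint six fresh vertices around the seed triangle, then stop.
budget = {"n": 6}
def dummy_predict(edge, faces):
    if budget["n"] == 0:
        return None
    budget["n"] -= 1
    return len({v for f in faces for v in f})  # next unused vertex id

print(decode_mesh(dummy_predict, (0, 1, 2)))   # seed face plus six generated faces
```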
GenVDM: Generating Vector Displacement Maps From a Single Image
Yuezhi Yang · Qimin Chen · Vladimir G. Kim · Siddhartha Chaudhuri · Qixing Huang · Zhiqin Chen
We introduce the first method for generating Vector Displacement Maps (VDMs): parameterized, detailed geometric stamps commonly used in 3D modeling. Given a single input image, our method first generates multi-view normal maps and then reconstructs a VDM from the normals via a novel reconstruction pipeline. We also propose an efficient algorithm for extracting VDMs from 3D objects, and present the first academic VDM dataset. Compared to existing 3D generative models focusing on complete shapes, we focus on generating parts that can be seamlessly attached to shape surfaces. The method gives artists rich control over adding geometric details to a 3D shape. Experiments demonstrate that our approach outperforms existing baselines. Generating VDMs offers additional benefits, such as using 2D image editing to customize and refine 3D details.
CTRL-D: Controllable Dynamic 3D Scene Editing with Personalized 2D Diffusion
Kai He · Chin-Hsuan Wu · Igor Gilitschenski
Recent advances in 3D representations, such as Neural Radiance Fields and 3D Gaussian Splatting, have greatly improved realistic scene modeling and novel-view synthesis. However, achieving controllable and consistent editing in dynamic 3D scenes remains a significant challenge. Previous work is largely constrained by its editing backbones, resulting in inconsistent edits and limited controllability. In our work, we introduce a novel framework that first fine-tunes the InstructPix2Pix model, followed by a two-stage optimization of the scene based on deformable 3D Gaussians. Our fine-tuning enables the model to ``learn'' the editing ability from a single edited reference image, transforming the complex task of dynamic scene editing into a simple 2D image editing process. By directly learning editing regions and styles from the reference, our approach enables consistent and precise local edits without the need for tracking desired editing regions, effectively addressing key challenges in dynamic scene editing. Then, our two-stage optimization progressively edits the trained dynamic scene, using a designed edited image buffer to accelerate convergence and improve temporal consistency. Compared to state-of-the-art methods, our approach offers more flexible and controllable local scene editing, achieving high-quality and consistent results.
LeanGaussian: Breaking Pixel or Point Cloud Correspondence in Modeling 3D Gaussians
Jiamin WU · Kenkun Liu · Han Gao · Xiaoke Jiang · Yuan Yao · Lei Zhang
Recently, Gaussian splatting has demonstrated significant success in novel view synthesis. Current methods often regress Gaussians with pixel or point cloud correspondence, linking each Gaussian with a pixel or a 3D point. This leads to the redundancy of Gaussians being used to overfit the correspondence rather than the objects represented by the 3D Gaussians themselves, consequently wasting resources and lacking accurate geometries or textures. In this paper, we introduce LeanGaussian, a novel approach that treats each query in a deformable Transformer as one 3D Gaussian ellipsoid, breaking the pixel or point cloud correspondence constraints. We leverage a deformable decoder to iteratively refine the Gaussians layer by layer, with the image features as keys and values. Notably, the center of each 3D Gaussian serves as a 3D reference point, which is then projected onto the image for deformable attention in 2D space. On both the ShapeNet SRN dataset (category level) and the Google Scanned Objects dataset (open-category level, trained with the Objaverse dataset), our approach outperforms prior methods by approximately 6.1\%, achieving PSNRs of 25.44 and 22.36, respectively. Additionally, our method achieves a 3D reconstruction speed of 7.2 FPS and a rendering speed of 500 FPS.
FlashGS: Efficient 3D Gaussian Splatting for Large-scale and High-resolution Rendering
Guofeng Feng · Siyan Chen · Rong Fu · Zimu Liao · Yi Wang · Tao Liu · Boni Hu · Linning Xu · PeiZhilin · Hengjie Li · Xiuhong Li · Ninghui Sun · Xingcheng Zhang · Bo Dai
Recently, the remarkable progress in 3D Gaussian Splatting (3DGS) has demonstrated huge potential over traditional rendering techniques, attracting significant attention from both industry and academia. Due to the presence of numerous anisotropic Gaussian representations in large-scale and high-resolution scenes, real-time rendering with 3DGS remains a challenging problem and is also rarely studied. We propose FlashGS, an open-source CUDA library with Python bindings, featuring comprehensive algorithm design and optimizations that encompass redundancy elimination, adaptive scheduling, and efficient pipelining. We first eliminate substantial redundant tasks through precise Gaussian intersection tests, considering the essence of the 3DGS rasterizer. During task partitioning, we introduce an adaptive scheduling strategy that accounts for variations in the size and shape of Gaussians. We also design a multi-stage pipeline strategy for color computations in rendering, further accelerating the process. An extensive evaluation of FlashGS has been conducted across a diverse spectrum of synthetic and real-world 3D scenes, covering scene sizes up to a 2.7 km$^2$ cityscape and resolutions up to 4K. FlashGS runs up to 30.53$\times$ faster than 3DGS, with an average speedup of $7.2\times$, rendering at a minimum of 125.9 FPS and achieving state-of-the-art performance.
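To make the redundancy-elimination idea concrete: vanilla 3DGS assigns a Gaussian to every tile touched by a circle whose radius comes from its largest covariance eigenvalue, which over-covers elongated Gaussians. The sketch below shows one simple way to tighten that bound with per-axis extents; it only illustrates the kind of culling FlashGS targets and is not the library's actual intersection test.

```python
import numpy as np

TILE = 16  # tile size in pixels (the value used by the reference 3DGS rasterizer)

def tiles_for_gaussian(center, cov2d, cutoff=3.0):
    """Conservative tile assignment for one projected 2D Gaussian.

    The cutoff-level ellipse of a Gaussian with covariance cov2d extends exactly
    cutoff * sqrt(cov2d[i, i]) along image axis i, so this axis-aligned box is a
    tighter bound than the circle of radius cutoff * sqrt(max eigenvalue).
    """
    rx = cutoff * np.sqrt(cov2d[0, 0])
    ry = cutoff * np.sqrt(cov2d[1, 1])
    x_min, x_max = center[0] - rx, center[0] + rx
    y_min, y_max = center[1] - ry, center[1] + ry
    tx0, tx1 = int(x_min // TILE), int(x_max // TILE)
    ty0, ty1 = int(y_min // TILE), int(y_max // TILE)
    return [(tx, ty) for ty in range(ty0, ty1 + 1) for tx in range(tx0, tx1 + 1)]

# An elongated Gaussian: 8 tiles here versus 16 for the circle-radius bound.
print(tiles_for_gaussian(np.array([40.0, 18.0]),
                         np.array([[64.0, 0.0], [0.0, 4.0]])))
```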
Steepest Descent Density Control for Compact 3D Gaussian Splatting
Peihao Wang · Yuehao Wang · Dilin Wang · Sreyas Mohan · Zhiwen Fan · Lemeng Wu · Ruisi Cai · Yu-Ying Yeh · Zhangyang Wang · Qiang Liu · Rakesh Ranjan
3D Gaussian Splatting (3DGS) has emerged as a powerful technique for real-time, high-resolution novel view synthesis. By representing scenes as a mixture of Gaussian primitives, 3DGS leverages GPU rasterization pipelines for efficient rendering and reconstruction. To optimize scene coverage and capture fine details, 3DGS employs a densification algorithm to generate additional points. However, this process often leads to redundant point clouds, resulting in excessive memory usage, slower performance, and substantial storage demands, posing significant challenges for deployment on resource-constrained devices. To address this limitation, we propose a theoretical framework that demystifies and improves density control in 3DGS. Our analysis reveals that splitting is crucial for escaping saddle points. Through an optimization-theoretic approach, we establish the necessary conditions for densification, determine the minimal number of offspring Gaussians, identify the optimal parameter update direction, and provide an analytical solution for normalizing offspring opacity. Building on these insights, we introduce SteepGS, incorporating steepest density control, a principled strategy that minimizes loss while maintaining a compact point cloud. SteepGS achieves a $\sim$50\% reduction in Gaussian points without compromising rendering quality, significantly enhancing both efficiency and scalability.
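For intuition on offspring opacity, the snippet below shows the commonly used opacity-preservation heuristic when one Gaussian of opacity alpha is replaced by n overlapping offspring; SteepGS derives its own analytical normalization, which need not coincide with this formula.

```python
def offspring_opacity(alpha, n_offspring):
    """Opacity for each of n_offspring Gaussians replacing a Gaussian of opacity
    alpha, chosen so that n stacked offspring attenuate a ray by the same factor
    as the original: 1 - (1 - a_new)**n = alpha.

    This is a common heuristic, not necessarily SteepGS's analytical solution.
    """
    return 1.0 - (1.0 - alpha) ** (1.0 / n_offspring)

print(offspring_opacity(0.9, 2))  # ~0.684 per offspring instead of 0.9
```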
GaussianSpa: An “Optimizing-Sparsifying” Simplification Framework for Compact and High-Quality 3D Gaussian Splatting
Yangming Zhang · Wenqi Jia · Wei Niu · Miao Yin
3D Gaussian Splatting (3DGS) has emerged as a mainstream approach for novel view synthesis, leveraging continuous aggregations of Gaussian functions to model scene geometry. However, 3DGS suffers from substantial memory requirements to store the large number of Gaussians, hindering its efficiency and practicality. To address this challenge, we introduce GaussianSpa, an optimization-based simplification framework for compact and high-quality 3DGS. Specifically, we formulate the simplification objective as a constrained optimization problem associated with the 3DGS training. Correspondingly, we propose an efficient "optimizing-sparsifying" solution for the formulated problem, alternately solving two independent sub-problems and gradually imposing substantial sparsity onto the Gaussians in the 3DGS training process. We conduct quantitative and qualitative evaluations on various datasets, demonstrating the superiority of GaussianSpa over existing state-of-the-art approaches. Notably, GaussianSpa achieves an average PSNR improvement of 0.9 dB on the real-world Deep Blending dataset with 10$\times$ fewer Gaussians compared to the vanilla 3DGS.
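A minimal sketch of what an alternating "optimizing-sparsifying" loop can look like is given below: plain gradient steps on a loss alternate with a projection that zeroes all but the top-k opacities. The hard-thresholding projection and the schedule are illustrative assumptions, not the paper's exact sub-problem solvers.

```python
import torch

def sparsify_step(opacity, keep_ratio):
    """Project opacities onto a sparsity budget: keep the top-k values, zero the rest
    (a hard-thresholding stand-in for the "sparsifying" sub-problem)."""
    k = max(1, int(keep_ratio * opacity.numel()))
    threshold = torch.topk(opacity, k).values.min()
    return torch.where(opacity >= threshold, opacity, torch.zeros_like(opacity))

def train_with_sparsification(opacity, loss_fn, steps=1000, lr=1e-2,
                              keep_ratio=0.1, sparsify_every=100):
    """Alternate gradient steps ("optimizing") with the sparsity projection ("sparsifying")."""
    opacity = opacity.clone().requires_grad_(True)
    opt = torch.optim.Adam([opacity], lr=lr)
    for step in range(steps):
        opt.zero_grad()
        loss_fn(opacity).backward()
        opt.step()
        if (step + 1) % sparsify_every == 0:
            with torch.no_grad():
                opacity.copy_(sparsify_step(opacity, keep_ratio))
    return opacity.detach()

# Toy usage: drive opacities toward a random target under a 10% sparsity budget.
target = torch.rand(10_000)
result = train_with_sparsification(torch.rand(10_000),
                                   lambda o: ((o - target) ** 2).mean())
print((result > 0).float().mean())  # roughly the keep_ratio (0.1)
```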
Generative Densification: Learning to Densify Gaussians for High-Fidelity Generalizable 3D Reconstruction
Seungtae Nam · Xiangyu Sun · Gyeongjin Kang · Younggeun Lee · Seungjun Oh · Eunbyung Park
Generalized feed-forward Gaussian models have shown remarkable progress in sparse-view 3D reconstruction, leveraging prior knowledge learned from large multi-view datasets. However, these models often struggle to represent high-frequency details due to the limited number of generated Gaussians. While the densification strategy used in per-scene 3D Gaussian splatting (3D-GS) optimization can be extended and applied to the feed-forward models, it may not be ideally suited for generalized settings. In this paper, we present Generative Densification, an efficient and generalizable densification strategy that can selectively generate fine Gaussians for high-fidelity 3D reconstruction. Unlike the 3D-GS densification strategy, we densify the feature representations from the feed-forward models rather than the raw Gaussians, making use of the prior knowledge embedded in the features for enhanced generalization. Experimental results demonstrate the effectiveness of our approach, achieving the state-of-the-art rendering quality in both object-level and scene-level reconstruction, with noticeable improvements in representing fine details.
IMFine: 3D Inpainting via Geometry-guided Multi-view Refinement
Zhihao Shi · Dong Huo · Yuhongze Zhou · Yan Min · Juwei Lu · Xinxin Zuo
Current 3D inpainting and object removal methods are largely limited to front-facing scenes, facing substantial challenges when applied to diverse, "unconstrained" scenes where the camera orientation and trajectory are unrestricted. To bridge this gap, we introduce a novel approach that produces inpainted 3D scenes with consistent visual quality and coherent underlying geometry across both front-facing and unconstrained scenes. Specifically, we propose a robust 3D inpainting pipeline that incorporates geometric priors and a multi-view refinement network trained via test-time adaptation, building on a pre-trained image inpainting model. Additionally, we develop a novel inpainting mask detection technique to derive targeted inpainting masks from object masks, boosting the performance in handling unconstrained scenes. To validate the efficacy of our approach, we create a challenging and diverse benchmark that spans a wide range of scenes. Comprehensive experiments demonstrate that our proposed method substantially outperforms existing state-of-the-art approaches.
3D Gaussian Inpainting with Depth-Guided Cross-View Consistency
Sheng-Yu Huang · Zi-Ting Chou · Yu-Chiang Frank Wang
When performing 3D inpainting using novel-view rendering methods like Neural Radiance Field (NeRF) or 3D Gaussian Splatting (3DGS), how to achieve texture and geometry consistency across camera views has been a challenge. In this paper, we propose a framework of 3D Gaussian Inpainting with Depth-Guided Cross-View Consistency (3DGIC) for cross-view consistent 3D inpainting. Guided by the rendered depth information from each training view, our 3DGIC exploits background pixels visible across different views for updating the inpainting mask, allowing us to refine the 3DGS for inpainting purposes. Through extensive experiments on benchmark datasets, we confirm that our 3DGIC outperforms current state-of-the-art 3D inpainting methods quantitatively and qualitatively.
CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models
Rundi Wu · Ruiqi Gao · Ben Poole · Alex Trevithick · Changxi Zheng · Jonathan T. Barron · Aleksander Holynski
We present CAT4D, a method for creating 4D (dynamic 3D) scenes from monocular video. CAT4D leverages a multi-view video diffusion model trained on a diverse combination of datasets to enable novel view synthesis at any specified camera poses and timestamps. Combined with a novel sampling approach, this model can transform a single monocular video into a multi-view video, enabling robust 4D reconstruction via optimization of a deformable 3D Gaussian representation. We demonstrate competitive performance on novel view synthesis and dynamic scene reconstruction benchmarks, and highlight the creative capabilities for 4D scene generation from real or generated videos.
HoGS: Unified Near and Far Object Reconstruction via Homogeneous Gaussian Splatting
Xinpeng Liu · Zeyi Huang · Fumio Okura · Yasuyuki Matsushita
Novel view synthesis has demonstrated impressive progress recently, with 3D Gaussian splatting (3DGS) offering efficient training time and photorealistic real-time rendering. However, reliance on Cartesian coordinates limits 3DGS's performance on distant objects, which is important for reconstructing unbounded outdoor environments. We found that, despite its ultimate simplicity, using homogeneous coordinates, a concept from projective geometry, in the 3DGS pipeline remarkably improves the rendering accuracy of distant objects. We therefore propose Homogeneous Gaussian Splatting (HoGS), which incorporates homogeneous coordinates into the 3DGS framework, providing a unified representation for enhancing near and distant objects. HoGS effectively manages both expansive spatial positions and scales, particularly in unbounded outdoor environments, by adopting projective geometry principles. Experiments show that HoGS significantly enhances accuracy in reconstructing distant objects while maintaining high-quality rendering of nearby objects, along with fast training speed and real-time rendering capability. Our implementation will be released upon acceptance.
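The homogeneous-coordinate idea is easy to state in code: store each position as a 4-vector (x, y, z, w) and divide by w when a Cartesian point is needed, so very distant content can be represented with bounded (x, y, z) and a small w. The sketch below is an illustrative parameterization under that assumption, not necessarily HoGS's exact formulation.

```python
import numpy as np

def to_homogeneous(p_cartesian):
    """Lift 3D points to homogeneous 4-vectors with w = 1."""
    return np.concatenate([p_cartesian, np.ones((len(p_cartesian), 1))], axis=1)

def to_cartesian(p_homogeneous):
    """Project homogeneous 4-vectors back to 3D by dividing by w (assumed nonzero)."""
    return p_homogeneous[:, :3] / p_homogeneous[:, 3:4]

# A distant point written with bounded (x, y, z) and a small w: optimizing the
# four homogeneous entries keeps parameter magnitudes comparable for near and
# far content, which is the intuition behind using projective coordinates here.
far_point = np.array([[0.5, 0.2, 0.9, 1e-4]])
print(to_cartesian(far_point))  # ~ (5000, 2000, 9000) in Cartesian coordinates
```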
Scene4U: Hierarchical Layered 3D Scene Reconstruction from Single Panoramic Image for Your Immerse Exploration
Zilong Huang · Jun He · Junyan Ye · Lihan Jiang · Weijia Li · Yiping Chen · Ting Han
The reconstruction of immersive and realistic 3D scenes holds significant practical importance in various fields of computer vision and computer graphics. Typically, immersive and realistic scenes should be free from obstructions by dynamic objects, maintain global texture consistency, and allow for unrestricted exploration. The current mainstream methods for image-driven scene construction involve iteratively refining the initial image using a moving virtual camera to generate the scene. However, previous methods struggle with visual discontinuities due to global texture inconsistencies under varying camera poses, and they frequently exhibit scene voids caused by foreground-background occlusions. To this end, we propose a novel layered 3D scene reconstruction framework from a panoramic image, named Scene4U. Specifically, Scene4U integrates an open-vocabulary segmentation model with a large language model to decompose a real panorama into multiple layers. Then, we employ a layered repair module based on a diffusion model to restore occluded regions using visual cues and depth information, generating a hierarchical representation of the scene. The multi-layer panorama is then initialized as a 3D Gaussian Splatting representation, followed by layered optimization, which ultimately produces an immersive 3D scene with semantic and structural consistency that supports free exploration. Our Scene4U outperforms state-of-the-art methods, improving by 24.24% in LPIPS and 24.40% in BRISQUE, while also achieving the fastest training speed. Additionally, to demonstrate the robustness of Scene4U and allow users to experience immersive scenes from various landmarks, we build the WorldVista3D dataset for 3D scene reconstruction, which contains panoramic images of globally renowned sites. The implementation code and dataset will be made publicly available.
Learning Partonomic 3D Reconstruction from Image Collections
Xiaoqian Ruan · Pei Yu · Dian Jia · Hyeonjeong Park · Peixi Xiong · Wei Tang
Reconstructing the 3D shape of an object from a single-view image is a fundamental task in computer vision. Recent advances in differentiable rendering have enabled 3D reconstruction from image collections using only 2D annotations. However, these methods mainly focus on whole-object reconstruction and overlook object partonomy, which is essential for intelligent agents interacting with physical environments. This paper aims at learning partonomic 3D reconstruction from collections of images with only 2D annotations. Our goal is not only to reconstruct the shape of an object from a single-view image but also to decompose the shape into meaningful semantic parts. To handle the expanded solution space and frequent part occlusions in single-view images, we introduce a novel approach that represents, parses, and learns the structural compositionality of 3D objects. This approach comprises: (1) a compact and expressive compositional representation of object geometry, achieved through disentangled modeling of large shape variations, constituent parts, and detailed part deformations as multi-granularity neural fields; (2) a part transformer that recovers precise partonomic geometry and handles occlusions, through effective part-to-pixel grounding and part-to-part relational modeling; and (3) a self-supervised method that jointly learns the compositional representation and part transformer, by bridging object shape and parts, image synthesis, and differentiable rendering. Extensive experiments on ShapeNetPart, PartNet, and CUB-200-2011 demonstrate the effectiveness of our approach on both overall and partonomic reconstruction. We will make our code and data publicly available.
DIFIX3D+: Improving 3D Reconstructions with Single-Step Diffusion Models
Jay Zhangjie Wu · Yuxuan Zhang · Haithem Turki · Xuanchi Ren · Jun Gao · Mike Zheng Shou · Sanja Fidler · Žan Gojčič · Huan Ling
Neural Radiance Fields and 3D Gaussian Splatting have revolutionized 3D reconstruction and novel-view synthesis tasks. However, achieving photorealistic rendering from extreme novel viewpoints remains challenging, as artifacts persist across representations. In this work, we introduce Difix3D+, a novel pipeline designed to enhance 3D reconstruction and novel-view synthesis through single-step diffusion models. At the core of our approach is Difix, a single-step image diffusion model trained to enhance and remove artifacts in rendered novel views caused by underconstrained regions of the 3D representation. Difix serves two critical roles in our pipeline. First, it is used during the reconstruction phase to clean up pseudo-training views that are rendered from the reconstruction and then distilled back into 3D. This greatly enhances underconstrained regions and improves the overall 3D representation quality. More importantly, Difix also acts as a neural enhancer during inference, effectively removing residual artifacts arising from imperfect 3D supervision and the limited capacity of current reconstruction models. Difix3D+ is a general solution, a single model compatible with both NeRF and 3DGS representations, and it achieves an average 2$\times$ improvement in FID score over baselines while maintaining 3D consistency.
Novel view synthesis from limited observations remains a significant challenge due to the lack of information in under-sampled regions, often resulting in noticeable artifacts. We introduce Generative Sparse-view Gaussian Splatting (GS-GS), a general pipeline designed to enhance the rendering quality of 3D/4D Gaussian Splatting (GS) when training views are sparse. Our method generates unseen views using generative models, specifically leveraging pre-trained image diffusion models to iteratively refine view consistency and hallucinate additional images at pseudo views. This approach improves 3D/4D scene reconstruction by explicitly enforcing semantic correspondences during the generation of unseen views, thereby enhancing geometric consistency—unlike purely generative methods that often fail to maintain view consistency. Extensive evaluations on various 3D/4D datasets—including Blender, LLFF, Mip-NeRF360, and Neural 3D Video—demonstrate that our GS-GS outperforms existing state-of-the-art methods in rendering quality without sacrificing efficiency.
Novel View Synthesis with Pixel-Space Diffusion Models
Noam Elata · Bahjat Kawar · Yaron Ostrovsky-Berman · Miriam Farber · Ron Sokolovsky
Synthesizing a novel view from a single input image is a challenging task. Traditionally, this task was approached by estimating scene depth, warping, and inpainting, with machine learning models enabling parts of the pipeline. More recently, generative models are being increasingly employed in novel view synthesis (NVS), often encompassing the entire end-to-end system. In this work, we adapt a modern diffusion model architecture for end-to-end NVS in the pixel space, substantially outperforming previous state-of-the-art (SOTA) techniques. We explore different ways to encode geometric information into the network. Our experiments show that while these methods may enhance performance, their impact is minor compared to utilizing improved generative models. Moreover, we introduce a novel NVS training scheme that utilizes single-view datasets, capitalizing on their relative abundance compared to their multi-view counterparts. This leads to improved generalization capabilities to scenes with out-of-domain content. We plan to publish code and model weights upon acceptance.
MOVIS: Enhancing Multi-Object Novel View Synthesis for Indoor Scenes
Ruijie Lu · Yixin Chen · Junfeng Ni · Baoxiong Jia · Yu Liu · Diwen Wan · Gang Zeng · Siyuan Huang
Repurposing pre-trained diffusion models has been proven to be effective for novel view synthesis (NVS). However, these methods are mostly limited to a single object; directly applying such methods to compositional multi-object scenarios yields inferior results, especially incorrect object placement and inconsistent shape and appearance under novel views. How to enhance and systematically evaluate the cross-view consistency of such models remains under-explored. To address this issue, we propose MOVIS to enhance the structural awareness of the view-conditioned diffusion model for multi-object novel view synthesis (NVS) in terms of model inputs, auxiliary tasks, and training strategy. First, we inject structure-aware features, including depth and object mask, into the denoising U-Net to enhance the model's comprehension of object instances and their spatial relationships. Second, we introduce an auxiliary task requiring the model to simultaneously predict novel view object masks, further improving the model's capability in differentiating and placing objects. Finally, we conduct an in-depth analysis of the diffusion sampling process and carefully devise a structure-guided timestep sampling scheduler during training, which balances the learning of global object placement and fine-grained detail recovery. To systematically evaluate the plausibility of synthesized images, we propose to assess cross-view consistency and novel view object placement alongside existing image-level NVS metrics. Extensive experiments on challenging synthetic and realistic datasets demonstrate that our method exhibits strong generalization capabilities and produces consistent novel view synthesis, highlighting its potential to guide future 3D-aware multi-object NVS tasks.
CoMapGS: Covisibility Map-based Gaussian Splatting for Sparse Novel View Synthesis
Youngkyoon Jang · Eduardo Pérez-Pellitero
We propose Covisibility Map-based Gaussian Splatting (CoMapGS), designed to recover underrepresented sparse regions in sparse novel view synthesis. CoMapGS addresses both high- and low-uncertainty regions by constructing covisibility maps, enhancing initial point clouds, and applying uncertainty-aware weighted supervision with a proximity classifier. Our contributions are threefold: (1) CoMapGS reframes novel view synthesis by leveraging covisibility maps as a core component to address region-specific uncertainty levels; (2) Enhanced initial point clouds for both low- and high-uncertainty regions compensate for sparse COLMAP-derived point clouds, improving reconstruction quality and benefiting few-shot 3DGS methods; (3) Adaptive supervision with covisibility-score-based weighting and proximity classification achieves consistent performance gains across scenes with various sparsity scores derived from covisibility maps. Experimental results demonstrate that CoMapGS outperforms state-of-the-art methods on datasets including Mip-NeRF 360 and LLFF.
Horizon-GS: Unified 3D Gaussian Splatting for Large-Scale Aerial-to-Ground Scenes
Lihan Jiang · Kerui Ren · Mulin Yu · Linning Xu · Junting Dong · Tao Lu · Feng Zhao · Dahua Lin · Bo Dai
Seamless integration of both aerial and street view images remains a significant challenge in neural scene reconstruction and rendering. Existing methods predominantly focus on a single domain, limiting their applications in immersive environments, which demand extensive free view exploration with large view changes both horizontally and vertically. We introduce Horizon-GS, a novel approach built upon Gaussian Splatting techniques, which tackles unified reconstruction and rendering for aerial and street views. Our method addresses the key challenges of combining these perspectives with a new training strategy, overcoming viewpoint discrepancies to generate high-fidelity scenes. We also curated a high-quality aerial-to-ground view dataset encompassing both synthetic and real-world scenes to advance further research. Experiments across diverse urban scene datasets confirm the effectiveness of our method.
NexusGS: Sparse View Synthesis with Epipolar Depth Priors in 3D Gaussian Splatting
Yulong Zheng · Zicheng Jiang · Shengfeng He · Yandu Sun · Junyu Dong · Huaidong Zhang · Yong Du
Neural Radiance Field (NeRF) and 3D Gaussian Splatting (3DGS) have noticeably advanced photo-realistic novel view synthesis using images from densely spaced camera viewpoints. However, these methods struggle in few-shot scenarios due to limited supervision. In this paper, we present NexusGS, a 3DGS-based approach that enhances novel view synthesis from sparse-view images by directly embedding depth information into point clouds, without relying on complex manual regularizations. Exploiting the inherent epipolar geometry of 3DGS, our method introduces a novel point cloud densification strategy that initializes 3DGS with a dense point cloud, reducing randomness in point placement while preventing over-smoothing and overfitting. Specifically, NexusGS comprises three key steps: Epipolar Depth Nexus, Flow-Resilient Depth Blending, and Flow-Filtered Depth Pruning. These steps leverage optical flow and camera poses to compute accurate depth maps, while mitigating the inaccuracies often associated with optical flow. By incorporating epipolar depth priors, NexusGS ensures reliable dense point cloud coverage and supports stable 3DGS training under sparse-view conditions. Experiments demonstrate that NexusGS significantly enhances depth accuracy and rendering quality, surpassing state-of-the-art methods by a considerable margin. Furthermore, we validate the superiority of our generated point clouds by substantially boosting the performance of competing methods.
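As background for the epipolar-depth step, the snippet below shows one standard way to turn a flow correspondence plus two camera poses into a 3D point via linear (DLT) triangulation; it illustrates the geometry NexusGS relies on but is not the paper's specific Epipolar Depth Nexus formulation.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one correspondence.

    P1, P2 : 3x4 projection matrices K[R|t] of the two views.
    x1, x2 : matched pixel coordinates, e.g. x2 = x1 + optical_flow(x1).
    Returns the 3D point in world coordinates.
    """
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

# Toy check: a known point, two known cameras, and the induced correspondence.
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])  # 1 m baseline
X_true = np.array([0.3, -0.2, 4.0, 1.0])
x1 = P1 @ X_true; x1 = x1[:2] / x1[2]
x2 = P2 @ X_true; x2 = x2[:2] / x2[2]
print(triangulate(P1, P2, x1, x2))  # ~ [0.3, -0.2, 4.0]
```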
SPARS3R: Semantic Prior Alignment and Regularization for Sparse 3D Reconstruction
Yutao Tang · Yuxiang Guo · Deming Li · Cheng Peng
Recent efforts in Gaussian-Splat-based Novel View Synthesis can achieve photorealistic rendering; however, such capability is limited in sparse-view scenarios due to sparse initialization and over-fitting floaters. Recent progress in depth estimation and alignment can provide a dense point cloud from only a few views; however, the resulting pose accuracy is suboptimal. In this work, we present SPARS3R, which combines the advantages of accurate pose estimation from Structure-from-Motion and dense point clouds from depth estimation. To this end, SPARS3R first performs a Global Fusion Alignment process that maps a prior dense point cloud to a sparse point cloud from Structure-from-Motion based on triangulated correspondences. RANSAC is applied during this process to distinguish inliers and outliers. SPARS3R then performs a second, Semantic Outlier Alignment step, which extracts semantically coherent regions around the outliers and performs local alignment in these regions. Along with several improvements in the evaluation process, we demonstrate that SPARS3R can achieve photorealistic rendering with sparse images and significantly outperforms existing approaches.
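A minimal sketch in the spirit of such a global fusion alignment is given below: a closed-form similarity (Umeyama) fit on random 3-point samples of corresponding points, keeping the largest RANSAC inlier set. The threshold, sample count, and robust loop are assumptions for illustration, not SPARS3R's exact procedure.

```python
import numpy as np

def umeyama(src, dst):
    """Closed-form similarity (scale, rotation, translation) mapping src onto dst."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    cs, cd = src - mu_s, dst - mu_d
    cov = cd.T @ cs / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / ((cs ** 2).sum() / len(src))
    t = mu_d - s * R @ mu_s
    return s, R, t

def ransac_align(src, dst, iters=500, thresh=0.05, seed=0):
    """Fit a similarity on random 3-point samples; keep the largest inlier set."""
    rng = np.random.default_rng(seed)
    best_inliers, best_count = None, -1
    for _ in range(iters):
        idx = rng.choice(len(src), 3, replace=False)
        s, R, t = umeyama(src[idx], dst[idx])
        err = np.linalg.norm((s * (R @ src.T)).T + t - dst, axis=1)
        inliers = err < thresh
        if inliers.sum() > best_count:
            best_inliers, best_count = inliers, inliers.sum()
    return umeyama(src[best_inliers], dst[best_inliers]), best_inliers

# Toy usage: align a shrunken, shifted copy of random points back onto the original.
pts = np.random.default_rng(1).normal(size=(200, 3))
(s, R, t), inl = ransac_align(0.5 * pts + 2.0, pts)
print(round(s, 3), int(inl.sum()))  # scale ~2.0 recovered, all 200 points inliers
```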
StarGen: A Spatiotemporal Autoregression Framework with Video Diffusion Model for Scalable and Controllable Scene Generation
Shangjin Zhai · Zhichao Ye · Jialin Liu · Weijian Xie · Jiaqi Hu · Zhen Peng · Hua Xue · Danpeng Chen · Xiaomeng Wang · Lei Yang · Nan Wang · Haomin Liu · Guofeng Zhang
Recent advances in large reconstruction and generative models have significantly improved scene reconstruction and novel view generation. However, due to compute limitations, each inference with these large models is confined to a small area, making long-range consistent scene generation challenging. To address this, we propose StarGen, a novel framework that employs a pre-trained video diffusion model in an autoregressive manner for long-range scene generation. Each video clip generation is conditioned on the 3D warping of spatially adjacent images and the temporally overlapping image from the last generated clip, improving spatiotemporal consistency in long-range scene generation with precise pose control. The spatiotemporal condition is compatible with various input conditions facilitating diverse tasks, including sparse view interpolation, perpetual view generation, and layout-conditioned city generation. Quantitative and qualitative evaluations demonstrate StarGen's superior scalability, fidelity, and pose accuracy compared to state-of-the-art methods.
PMNI: Pose-free Multi-view Normal Integration for Reflective and Textureless Surface Reconstruction
Mingzhi Pei · Xu Cao · Xiangyi Wang · Heng Guo · Zhanyu Ma
Multi-view 3D reconstruction for reflective and textureless surfaces remains a challenging problem. Both camera pose calibration and shape reconstruction fail due to insufficient or unreliable visual features across views. To address these issues, we present PMNI (Pose-free Multiview Normal Integration), a novel neural surface reconstruction method that leverages surface normal maps instead of RGB images to incorporate rich geometric information. By enforcing geometric constraints from surface normals and multiview shape consistency within a neural signed distance function (SDF) optimization framework, PMNI robustly recovers both camera poses and high-fidelity surface geometry simultaneously. Experimental results on synthetic and real-world datasets show that our method achieves state-of-the-art performance in the reconstruction of reflective surfaces, even without reliable initial camera poses.
Learnable Infinite Taylor Gaussian for Dynamic View Rendering
Bingbing Hu · Yanyan Li · rui xie · Bo Xu · Haoye Dong · Junfeng Yao · Gim Hee Lee
Capturing the temporal evolution of Gaussian properties such as position, rotation, and scale is a challenging task due to the vast number of time-varying parameters and the limited photometric data available, which generally results in convergence issues, making it difficult to find an optimal solution. While feeding all inputs into an end-to-end neural network can effectively model complex temporal dynamics, this approach lacks explicit supervision and struggles to generate high-quality transformation fields. On the other hand, using time-conditioned polynomial functions to model Gaussian trajectories and orientations provides a more explicit and interpretable solution, but requires significant handcrafted effort and lacks generalizability across diverse scenes. To overcome these limitations, this paper introduces a novel approach based on a learnable infinite Taylor Formula to model the temporal evolution of Gaussians. This method offers both the flexibility of an implicit network-based approach and the interpretability of explicit polynomial functions, allowing for more robust and generalizable modeling of Gaussian dynamics across various dynamic scenes. Extensive experiments on the dynamic novel view rendering task are conducted on public datasets, demonstrating that the proposed method achieves state-of-the-art performance in this domain.
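To make the idea concrete, the sketch below models each Gaussian's position with a truncated, learnable Taylor expansion in time. Truncating at a fixed order K is a simplification made purely for illustration; the paper's formulation targets the learnable infinite series and also covers rotation and scale.

```python
import math
import torch

class TaylorMotion(torch.nn.Module):
    """Per-Gaussian truncated Taylor expansion of position over time:
        p(t) = c0 + c1 t + c2 t^2/2! + ... + cK t^K/K!
    with the coefficients c_k learned as free parameters."""

    def __init__(self, num_gaussians, order=3):
        super().__init__()
        self.coeffs = torch.nn.Parameter(torch.zeros(order + 1, num_gaussians, 3))

    def forward(self, t):
        # Taylor basis [1, t, t^2/2!, ..., t^K/K!] evaluated at scalar time t.
        basis = torch.stack([t ** k / math.factorial(k)
                             for k in range(self.coeffs.shape[0])])
        return torch.einsum("k,knd->nd", basis, self.coeffs)

motion = TaylorMotion(num_gaussians=100_000, order=3)
positions_at_t = motion(torch.tensor(0.25))
print(positions_at_t.shape)  # (100000, 3)
```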
Efficient Dynamic Scene Editing via 4D Gaussian-based Static-Dynamic Separation
Joohyun Kwon · Hanbyel Cho · Junmo Kim
Recent 4D dynamic scene editing methods require editing thousands of 2D images used for dynamic scene synthesis and updating the entire scene with additional training loops, resulting in several hours of processing to edit a single dynamic scene. Therefore, these methods are not scalable with respect to the temporal dimension of the dynamic scene (i.e., the number of timesteps). In this work, we propose an efficient dynamic scene editing method that is more scalable in terms of temporal dimension. To achieve computational efficiency, we leverage a 4D Gaussian representation that models a 4D dynamic scene by combining static 3D Gaussians with a Hexplane-based deformation field, which handles dynamic information. We then perform editing solely on the static 3D Gaussians, which is the minimal but sufficient component required for visual editing. To resolve the misalignment between the edited 3D Gaussians and the deformation field potentially resulting from the editing process, we additionally conduct a refinement stage using a score distillation mechanism. Extensive editing results demonstrate that our method is efficient, reducing editing time by more than half compared to existing methods, while achieving high editing quality that better follows user instructions.
SplineGS: Robust Motion-Adaptive Spline for Real-Time Dynamic 3D Gaussians from Monocular Video
Jongmin Park · Minh-Quan Viet Bui · Juan Luis Gonzalez Bello · Jaeho Moon · Jihyong Oh · Munchurl Kim
Synthesizing novel views from in-the-wild monocular videos is challenging due to scene dynamics and the lack of multi-view cues. To address this, we propose SplineGS, a COLMAP-free dynamic 3D Gaussian Splatting (3DGS) framework for high-quality reconstruction and fast rendering from monocular videos. At its core is a novel Motion-Adaptive Spline (MAS) method, which represents continuous dynamic 3D Gaussian trajectories using cubic Hermite splines with a small number of control points. For MAS, we introduce a Motion-Adaptive Control points Pruning (MACP) method to model the deformation of each dynamic 3D Gaussian across varying motions, progressively pruning control points while maintaining dynamic modeling integrity. Additionally, we present a joint optimization strategy for camera parameter estimation and 3D Gaussian attributes, leveraging photometric and geometric consistency. This eliminates the need for Structure-from-Motion preprocessing and enhances SplineGS’s robustness in real-world conditions. Experiments show that SplineGS significantly outperforms state-of-the-art methods in novel view synthesis quality for dynamic scenes from monocular videos, achieving rendering speeds thousands of times faster.
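For reference, the sketch below evaluates a cubic Hermite spline trajectory shared across Gaussians from a handful of control points; the knot layout and tangent handling are assumptions for illustration, not SplineGS's exact MAS parameterization.

```python
import torch

def hermite_eval(t, knot_t, values, tangents):
    """Evaluate a cubic Hermite spline at normalized time t in [0, 1].

    knot_t   : (C,) increasing knot times covering [0, 1]
    values   : (C, N, 3) control-point positions per Gaussian
    tangents : (C, N, 3) tangents at the control points
    """
    t = torch.as_tensor(t, dtype=knot_t.dtype)
    i = int(torch.clamp(torch.searchsorted(knot_t, t.reshape(1)) - 1,
                        0, len(knot_t) - 2))
    h = knot_t[i + 1] - knot_t[i]
    u = (t - knot_t[i]) / h                      # local parameter in [0, 1]
    h00 = 2 * u**3 - 3 * u**2 + 1                # standard Hermite basis functions
    h10 = u**3 - 2 * u**2 + u
    h01 = -2 * u**3 + 3 * u**2
    h11 = u**3 - u**2
    return (h00 * values[i] + h10 * h * tangents[i]
            + h01 * values[i + 1] + h11 * h * tangents[i + 1])

# Toy usage: 5 control points shared by 10k Gaussian trajectories, queried at t = 0.37.
knots = torch.linspace(0, 1, 5)
vals = torch.randn(5, 10_000, 3)
tans = torch.zeros_like(vals)
print(hermite_eval(0.37, knots, vals, tans).shape)  # (10000, 3)
```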
EventSplat: 3D Gaussian Splatting from Moving Event Cameras for Real-time Rendering
Toshiya Yura · Ashkan Mirzaei · Igor Gilitschenski
We introduce a method for using event camera data in novel view synthesis via Gaussian Splatting. Event cameras offer exceptional temporal resolution and a high dynamic range. Leveraging these capabilities allows us to effectively address the novel view synthesis challenge in the presence of fast camera motion. For initialization of the optimization process, our approach uses prior knowledge encoded in an event-to-video model. We also use spline interpolation for obtaining high-quality poses along the event camera trajectory. This enhances the reconstruction quality from fast-moving cameras while overcoming the computational limitations traditionally associated with event-based Neural Radiance Field (NeRF) methods. Our experimental evaluation demonstrates that our results achieve higher visual fidelity and better performance than existing event-based NeRF approaches while being an order of magnitude faster to render.
SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters
Jianping Jiang · Weiye Xiao · Zhengyu Lin · Huaizhong Zhang · Tianxiang Ren · Yang Gao · Zhiqian Lin · Zhongang Cai · Lei Yang · Ziwei Liu
Human beings are social animals. How to equip 3D autonomous characters with similar social intelligence that can perceive, understand, and interact with humans remains an open yet fundamental problem. In this paper, we introduce SOLAMI, the first end-to-end Social vision-Language-Action (VLA) Modeling framework for Immersive interaction with 3D autonomous characters. Specifically, SOLAMI builds 3D autonomous characters from three aspects: 1) Social VLA Architecture: We propose a unified social VLA framework to generate multimodal responses (speech and motion) based on the user's multimodal input to drive the character for social interaction. 2) Interactive Multimodal Data: We present SynMSI, a synthetic multimodal social interaction dataset generated by an automatic pipeline using only existing motion datasets to address the issue of data scarcity. 3) Immersive VR Interface: We develop a VR interface that enables users to immersively interact with these characters driven by various architectures. Extensive quantitative experiments and user studies demonstrate that our framework leads to more precise and natural character responses (in both speech and motion) that align with user expectations with lower latency.
Denoising Functional Maps: Diffusion Models for Shape Correspondence
Aleksei Zhuravlev · Zorah Lähner · Vladislav Golyanik
Estimating correspondences between pairs of deformable shapes remains challenging. Despite substantial progress, existing methods lack broad generalization capabilities and require domain-specific training data. To address these limitations, we propose a fundamentally new approach to shape correspondence based on denoising diffusion models. In our method, a diffusion model learns to directly predict the functional map, i.e. a low-dimensional representation for a point-wise map between shapes. We use a large dataset of synthetic human meshes for training and apply two steps to reduce the number of functional maps that need to be learned. First, maps refer to a template rather than to shape pairs. Second, a functional map is defined in the basis of eigenvectors of the Laplacian, which is not unique due to sign ambiguity. We, hence, introduce an unsupervised approach to select a specific basis by correcting the signs of eigenvectors based on surface features. Our approach achieves superior performance on standard human datasets, meshes with anisotropic connectivity, and non-isometric humanoid shapes compared to existing descriptor-based and large-scale shape deformation methods. We will release the source code and the datasets for reproducibility and research purposes.
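Two pieces of background make this pipeline easier to picture: how a point-to-point map is converted into a compact functional map in the Laplacian eigenbasis, and how the sign ambiguity of eigenvectors can be resolved against a surface feature. The snippet below sketches both with standard formulas; the feature-based sign correction shown is an illustrative stand-in for the paper's unsupervised procedure.

```python
import numpy as np

def p2p_to_functional_map(pi, evecs_src, evecs_tgt, k=30):
    """Convert a point-to-point map into a k x k functional map.

    pi        : (Vs,) for each source vertex, the index of its image on the target
    evecs_src : (Vs, >=k) Laplacian eigenvectors of the source shape
    evecs_tgt : (Vt, >=k) Laplacian eigenvectors of the target shape
    The matrix C transfers coefficients of functions on the target (in evecs_tgt)
    to coefficients of their pull-back on the source (in evecs_src).
    """
    return np.linalg.pinv(evecs_src[:, :k]) @ evecs_tgt[pi, :k]

def fix_eigenvector_signs(evecs, feature):
    """Resolve the +/- sign ambiguity of each eigenvector by forcing a positive
    correlation with a chosen per-vertex surface feature."""
    signs = np.sign(evecs.T @ feature)
    signs[signs == 0] = 1.0
    return evecs * signs

# Synthetic sanity check with a random orthonormal "eigenbasis" and an identity map.
rng = np.random.default_rng(0)
Phi = np.linalg.qr(rng.normal(size=(500, 30)))[0]
C = p2p_to_functional_map(np.arange(500), Phi, Phi)
print(np.allclose(C, np.eye(30), atol=1e-6))  # True: identity map -> identity C
```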
Event Fields: Capturing Light Fields at High Speed, Resolution, and Dynamic Range
Ziyuan Qu · Zihao Zou · Vivek Boominathan · Praneeth Chakravarthula · Adithya Pediredla
Event cameras, which feature pixels that independently respond to changes in brightness, are becoming increasingly popular in high-speed applications due to their lower latency, reduced bandwidth requirements, and enhanced dynamic range compared to traditional frame-based cameras. Numerous imaging and vision techniques have leveraged event cameras for high-speed scene understanding by capturing high-framerate, high-dynamic range videos, primarily utilizing the temporal advantages inherent to event cameras. Additionally, imaging and vision techniques have utilized the light field---a complementary dimension to temporal information---for enhanced scene understanding. In this work, we propose "Event Fields", a new approach that utilizes innovative optical designs for event cameras to capture light fields at high speed. We develop the underlying mathematical framework for Event Fields and introduce two foundational frameworks to capture them practically: spatial multiplexing to capture temporal derivatives and temporal multiplexing to capture angular derivatives. To realize these, we design two complementary optical setups---one using a kaleidoscope for spatial multiplexing and another using a galvanometer for temporal multiplexing. We evaluate the performance of both designs using a custom-built simulator and real hardware prototypes, showcasing their distinct benefits. Our event fields unlock the full advantages of typical light fields—like post-capture refocusing and depth estimation—now supercharged for high-speed and high-dynamic range scenes. This novel light-sensing paradigm opens doors to new applications in photography, robotics, and AR/VR, and presents fresh challenges in rendering and machine learning.
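For readers unfamiliar with event sensing, the snippet below simulates the standard idealized event-camera model, in which a pixel fires an event whenever its log intensity changes by more than a contrast threshold. It only makes the sensing principle concrete and says nothing about the paper's kaleidoscope or galvanometer designs.

```python
import numpy as np

def video_to_events(frames, timestamps, contrast=0.2, eps=1e-3):
    """Convert a frame sequence into events (t, y, x, polarity) with the common
    idealized model: an event of polarity +/-1 fires whenever the per-pixel log
    intensity changes by more than `contrast` since that pixel's last event."""
    log_ref = np.log(frames[0] + eps)
    events = []
    for frame, t in zip(frames[1:], timestamps[1:]):
        log_i = np.log(frame + eps)
        while True:
            diff = log_i - log_ref
            ys, xs = np.where(np.abs(diff) >= contrast)
            if len(ys) == 0:
                break
            pol = np.sign(diff[ys, xs])
            events.extend(zip([t] * len(ys), ys, xs, pol))
            log_ref[ys, xs] += pol * contrast   # each event moves the reference level
    return events

frames = np.random.rand(10, 64, 64).astype(np.float32)
print(len(video_to_events(frames, np.linspace(0.0, 0.01, 10))))
```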
4DTAM: Non-Rigid Tracking and Mapping via Dynamic Surface Gaussians
Hidenobu Matsuki · Gwangbin Bae · Andrew J. Davison
We propose the first tracking and mapping approach for a single RGB-D camera capable of non-rigid surface reconstruction via differentiable rendering. We perform 4D scene capture from an online stream by joint optimization of geometry, appearance, dynamics, and camera ego-motion. Although the natural environment contains complex non-rigid motions, non-rigid SLAM has remained difficult; even with 2.5D sensor measurements, it is still ill-posed due to the high dimensionality of the optimization problem. Our novel SLAM method based on Gaussian surface primitives allows accurate 3D reconstruction and real-time rendering without any template, using a warp-field represented by a multi-layer perceptron (MLP) and regularization terms to enable spatio-temporal reconstruction. A challenge in non-rigid SLAM research is the lack of publicly available datasets with reliable ground truth and standardized evaluation protocols. To address this, we introduce a novel synthetic dataset of everyday objects featuring diverse motions, leveraging the availability of large-scale object assets and advancements in animation modeling.
IncEventGS: Pose-Free Gaussian Splatting from a Single Event Camera
Jian Huang · Chengrui Dong · Xuanhua Chen · Peidong Liu
Implicit neural representation and explicit 3D Gaussian Splatting (3D-GS) for novel view synthesis have achieved remarkable progress with frame-based cameras (e.g., RGB and RGB-D cameras) recently. Compared to frame-based cameras, a novel type of bio-inspired visual sensor, i.e., the event camera, has demonstrated advantages in high temporal resolution, high dynamic range, low power consumption, and low latency, which make it favorable for many robotic applications. In this work, we present IncEventGS, an incremental 3D Gaussian Splatting reconstruction algorithm with a single event camera, without the assumption of known camera poses. To recover the 3D scene representation incrementally, we exploit the tracking and mapping paradigm of conventional SLAM pipelines for IncEventGS. Given the incoming event stream, the tracker first estimates an initial camera motion based on the prior reconstructed 3D-GS scene representation. The mapper then jointly refines both the 3D scene representation and the camera motion based on the previously estimated motion trajectory from the tracker. The experimental results demonstrate that IncEventGS delivers superior performance compared to prior NeRF-based methods and other related baselines, even though we do not have ground-truth camera poses. Furthermore, our method can also deliver better performance compared to state-of-the-art event visual odometry methods in terms of camera motion estimation.
Completion as Enhancement: A Degradation-Aware Selective Image Guided Network for Depth Completion
Zhiqiang Yan · Zhengxue Wang · Kun Wang · Jun Li · Jian Yang
In this paper, we introduce the Selective Image Guided Network (SigNet), a novel degradation-aware framework that transforms depth completion into depth enhancement for the first time. Moving beyond direct completion using convolutional neural networks (CNNs), SigNet initially densifies sparse depth data through non-CNN densification tools to obtain coarse yet dense depth. This approach eliminates the mismatch and ambiguity caused by direct convolution over irregularly sampled sparse data. Subsequently, SigNet redefines completion as enhancement, establishing a self-supervised degradation bridge between the coarse depth and the targeted dense depth for effective RGB-D fusion. To achieve this, SigNet leverages the implicit degradation to adaptively select high-frequency components (e.g., edges) of RGB data to compensate for the coarse depth. This degradation is further integrated into a multi-modal conditional Mamba, dynamically generating the state coefficients to enable efficient global high-frequency information interaction. We conduct extensive experiments on the NYUv2, DIML, SUN RGBD, and TOFDC datasets, demonstrating the state-of-the-art (SOTA) performance of SigNet.
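As an example of the kind of non-CNN densification tool the paper alludes to, the snippet below fills a sparse depth map with nearest-neighbor interpolation; the specific interpolator is an assumption for illustration, not a choice prescribed by SigNet.

```python
import numpy as np
from scipy.interpolate import griddata

def densify_sparse_depth(sparse_depth):
    """Densify a sparse depth map (zeros = missing) by nearest-neighbor
    interpolation over the valid measurements: a coarse-but-dense starting
    point of the sort that can then be treated as degraded depth to enhance."""
    h, w = sparse_depth.shape
    ys, xs = np.nonzero(sparse_depth)                    # valid measurements
    grid_y, grid_x = np.mgrid[0:h, 0:w]
    return griddata((ys, xs), sparse_depth[ys, xs],
                    (grid_y, grid_x), method="nearest")

# Toy usage: 1% valid pixels -> coarse yet fully dense depth, no holes left.
rng = np.random.default_rng(0)
depth = np.zeros((240, 320), dtype=np.float32)
mask = rng.random(depth.shape) < 0.01
depth[mask] = rng.uniform(0.5, 10.0, mask.sum())
dense = densify_sparse_depth(depth)
print(dense.shape, bool(np.isnan(dense).any()))  # (240, 320) False
```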
Blurred LiDAR for Sharper 3D: Robust Handheld 3D Scanning with Diffuse LiDAR and RGB
Nikhil Behari · Aaron Young · Siddharth Somasundaram · Tzofi Klinghoffer · Akshat Dave · Ramesh Raskar
3D surface reconstruction is essential across applications of virtual reality, robotics, and mobile scanning. However, RGB-based reconstruction often fails in low-texture, low-light, and low-albedo scenes. Handheld LiDARs, now common on mobile devices, aim to address these challenges by capturing depth information from time-of-flight measurements of a coarse grid of projected dots. Yet, these sparse LiDARs struggle with scene coverage on limited input views, leaving large gaps in depth information. In this work, we propose using an alternative class of "blurred" LiDAR that emits a diffuse flash, greatly improving scene coverage but introducing spatial ambiguity from mixed time-of-flight measurements across a wide field of view. To handle these ambiguities, we propose leveraging the complementary strengths of diffuse LiDAR with RGB. We introduce a Gaussian surfel-based rendering framework with a scene-adaptive loss function that dynamically balances RGB and diffuse LiDAR signals. We demonstrate that, surprisingly, diffuse LiDAR can outperform traditional sparse LiDAR, enabling robust 3D scanning with accurate color and geometry estimation in challenging environments.
Focal Split: Untethered Snapshot Depth from Differential Defocus
Junjie Luo · John Mamish · Alan Fu · Thomas Concannon · Josiah Hester · Emma Alexander · Qi Guo
Depth cameras promise to revolutionize mobile systems, but their size and power consumption limit their adoption. In this work, we introduce Focal Split, the first handheld depth-from-differential-defocus (DfDD) camera with fully onboard power and compute. Unlike active illumination systems like LiDAR, we avoid power consumption associated with light sources, and our use of differential defocus sidesteps energy-intensive computation associated with passive triangulation methods like multi-view stereo and traditional depth-from-defocus. We extend DfDD theory around a portable, handheld opto-mechanical design which is robust due to its snapshot depth images. Our camera shows that a depth-from-defocus system can feasibly be operated in real-time on resource-constrained systems, with a battery life of 2 hours. Focal Split is DIY friendly. We include a guide to building the depth sensor using off-the-shelf optics, circuits, and mechanics with 3D-printed housing under \$500.
HELVIPAD: A Real-World Dataset for Omnidirectional Stereo Depth Estimation
Mehdi Zayene · Albias Havolli · Jannik Endres · Charles Corbière · Alexandre Ben Ahmed Kontouli · Salim Cherkaoui · Alex Alahi
Despite considerable progress in stereo depth estimation, omnidirectional imaging remains underexplored, mainly due to the lack of appropriate data. We introduce Helvipad, a real-world dataset for omnidirectional stereo depth estimation, consisting of 40K frames from video sequences across diverse environments, including crowded indoor and outdoor scenes with diverse lighting conditions. Collected using two 360° cameras in a top-bottom setup and a LiDAR sensor, the dataset includes accurate depth and disparity labels by projecting 3D point clouds onto equirectangular images. Additionally, we provide an augmented training set with a significantly increased label density by using depth completion. We benchmark leading stereo depth estimation models for both standard and omnidirectional images. Results show that while recent stereo methods perform decently, a significant challenge persists in accurately estimating depth in omnidirectional imaging. To address this, we introduce necessary adaptations to stereo models, achieving improved performance.
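The depth labels come from projecting LiDAR points onto equirectangular images; the sketch below shows that projection under an assumed axis convention (x right, y down, z forward) to make the geometry concrete. The dataset's exact calibration and conventions may differ.

```python
import numpy as np

def project_to_equirectangular(points_cam, width, height):
    """Project 3D points in the camera frame onto an equirectangular image,
    returning pixel coordinates and per-point depth (Euclidean distance)."""
    x, y, z = points_cam[:, 0], points_cam[:, 1], points_cam[:, 2]
    r = np.linalg.norm(points_cam, axis=1)
    lon = np.arctan2(x, z)                                       # [-pi, pi]
    lat = np.arcsin(np.clip(y / np.maximum(r, 1e-9), -1, 1))     # [-pi/2, pi/2]
    u = (lon / (2 * np.pi) + 0.5) * width
    v = (lat / np.pi + 0.5) * height
    return np.stack([u, v], axis=1), r

pts = np.array([[0.0, 0.0, 5.0],     # straight ahead -> image center
                [5.0, 0.0, 0.0]])    # 90 degrees to the right -> 3/4 across the width
uv, depth = project_to_equirectangular(pts, 1920, 960)
print(uv, depth)                     # [[960, 480], [1440, 480]], depths [5, 5]
```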
OFER: Occluded Face Expression Reconstruction
Pratheba Selvaraju · Victoria Abrevaya · Timo Bolkart · Rick Akkerman · Tianyu Ding · Faezeh Amjadi · Ilya Zharkov
Reconstructing 3D face models from a single image is an inherently ill-posed problem, which becomes even more challenging in the presence of occlusions. In addition to fewer available observations, occlusions introduce an extra source of ambiguity, where multiple reconstructions can be equally valid. Despite the ubiquity of the problem, very few methods address its multi-hypothesis nature. In this paper we introduce OFER, a novel approach for single-image 3D face reconstruction that can generate plausible, diverse, and expressive 3D faces, even under strong occlusions. Specifically, we train two diffusion models to generate the shape and expression coefficients of a face parametric model, conditioned on the input image. This approach captures the multi-modal nature of the problem, generating a distribution of solutions as output. Although this addresses the ambiguity problem, the challenge remains to pick the best matching shape to ensure consistency across diverse expressions. To achieve this, we propose a novel ranking mechanism that sorts the outputs of the shape diffusion network based on the predicted shape accuracy scores to select the best match. We evaluate our method using standard benchmarks and introduce CO-545, a new protocol and dataset designed to assess the accuracy of expressive faces under occlusion. Our results show improved performance over occlusion-based methods, with added ability to generate multiple expressions for a given image.
Depth Any Camera: Zero-Shot Metric Depth Estimation from Any Camera
Yuliang Guo · Sparsh Garg · S. Mahdi H. Miangoleh · Xinyu Huang · Liu Ren
Accurate metric depth estimation from monocular cameras is essential for applications such as autonomous driving, AR/VR, and robotics. While recent depth estimation methods demonstrate strong zero-shot generalization, achieving accurate metric depth across diverse camera types—particularly those with large fields of view (FoV) like fisheye and $360^\circ$ cameras—remains challenging. This paper introduces Depth Any Camera (DAC), a novel zero-shot metric depth estimation framework that extends a perspective-trained model to handle varying FoVs effectively. Notably, DAC is trained exclusively on perspective images, yet it generalizes seamlessly to fisheye and $360^\circ$ cameras without requiring specialized training. DAC leverages Equi-Rectangular Projection (ERP) as a unified image representation, enabling consistent processing of images with diverse FoVs. Key components include an efficient Image-to-ERP patch conversion for online ERP-space augmentation, a FoV alignment operation to support effective training across a broad range of FoVs, and multi-resolution data augmentation to address resolution discrepancies between training and testing. DAC achieves state-of-the-art zero-shot metric depth estimation, improving $\delta_1$ accuracy by up to 50\% on multiple indoor fisheye and $360^\circ$ datasets, demonstrating robust generalization across camera types while relying only on perspective training data.
Order-One Rolling Shutter Cameras
Marvin Anas Hahn · Kathlén Kohn · Orlando Marigliano · Tomas Pajdla
Rolling shutter (RS) cameras dominate consumer and smartphone markets. Several methods for computing the absolute pose of RS cameras have appeared in the last 20 years, but the relative pose problem has not been fully solved yet. We provide a unified theory for the important class of order-one rolling shutter (RS$_1$) cameras. These cameras generalize the perspective projection to RS cameras, projecting a generic space point to exactly one image point via a rational map. We introduce a new back-projection RS camera model, characterize RS$_1$ cameras, construct explicit parameterizations of such cameras, and determine the image of a space line. We classify all minimal problems for solving the relative camera pose problem with linear RS$_1$ cameras and discover new practical cases. Finally, we show how the theory can be used to explain RS models previously used for absolute pose computation.
Research on bundle adjustment has focused on photo collections where each image is accompanied by its own set of camera parameters. However, real-world applications overwhelmingly call for shared intrinsics bundle adjustment (SI-BA) where camera parameters are shared across multiple images. Utilizing overlooked optimization opportunities specific to SI-BA, most notably matrix-free computation, we present a solver that is eight times faster than alternatives while consuming a tenth of the memory. Additionally, we examine reasons for BA instability under single-precision computation and propose minimal mitigations.
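The matrix-free idea mentioned above can be sketched as follows: the damped normal equations are solved with conjugate gradients using only Jacobian-vector products, so the matrix J^T J is never materialized. The `jvp`/`jtvp` callables are hypothetical stand-ins for problem-specific routines; this is a generic sketch, not the paper's solver.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def matrix_free_gn_step(jvp, jtvp, residuals, n_params, damping=1e-4):
    """Solve (J^T J + damping * I) dx = -J^T r with conjugate gradients,
    using only Jacobian-vector products (jvp) and transposed products (jtvp).
    Generic matrix-free Gauss-Newton sketch."""
    A = LinearOperator((n_params, n_params),
                       matvec=lambda x: jtvp(jvp(x)) + damping * x)
    rhs = -jtvp(residuals)
    dx, info = cg(A, rhs, maxiter=100)
    return dx
```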
Towards In-the-wild 3D Plane Reconstruction from a Single Image
Jiachen Liu · Rui Yu · Sili Chen · Sharon X. Huang · Hengkai Guo
3D plane reconstruction from a single image is a crucial yet challenging topic in 3D computer vision. Previous state-of-the-art (SOTA) methods have focused on training their systems on a single dataset from either the indoor or outdoor domain, limiting their generalizability across diverse testing data. In this work, we introduce a novel framework dubbed ZeroPlane, a Transformer-based model targeting zero-shot 3D plane detection and reconstruction from a single image, over diverse domains and environments. To enable data-driven models across multiple domains, we have curated a large-scale (over 14 datasets and 560,000 images), high-resolution, densely-annotated planar benchmark from various indoor and outdoor scenes. To address the challenge of achieving desirable planar geometry on multi-dataset training, we propose to disentangle the representation of plane normal and offset, and employ an exemplar-guided, classification-then-regression paradigm to learn the normal and offset, respectively. Additionally, we employ advanced backbones as the image encoder, and present an effective pixel-geometry-enhanced plane embedding module to further facilitate planar reconstruction. Extensive experiments across multiple zero-shot evaluation datasets have demonstrated that our approach significantly outperforms previous methods on both reconstruction accuracy and generalizability, especially over in-the-wild data. We will release all of the labeled data, code and models upon the acceptance of this paper.
Learning Affine Correspondences by Integrating Geometric Constraints
Pengju Sun · Banglei Guan · Zhenbao Yu · Yang Shang · Qifeng Yu · Daniel Barath
Affine correspondences have received significant attention due to their benefits in tasks like image matching and pose estimation. Existing methods for extracting affine correspondences still have many limitations in terms of performance; thus, exploring a new paradigm is crucial. In this paper, we present a new pipeline designed for extracting accurate affine correspondences by integrating dense matching and geometric constraints. Specifically, a novel extraction framework is introduced, with the aid of dense matching and a novel keypoint scale and orientation estimator. For this purpose, we propose loss functions based on geometric constraints, which can effectively improve accuracy by supervising neural networks to learn feature geometry. Experimental results show that the accuracy and robustness of our method outperform the existing ones in image matching tasks. To further demonstrate the effectiveness of the proposed method, we applied it to relative pose estimation. Affine correspondences extracted by our method lead to more accurate poses than the baselines on a range of real-world datasets. The source code will be made public.
DiskVPS, a novel vanishing point (VP) detection scheme, detects VPs with extreme efficiency via a Hough Transform (HT) over an image-plane-mapped disk space. DiskVPS differs from the state-of-the-art (SOTA) algorithms that use Gaussian Sphere (GS)-based VP detection models, in which camera parameters are required and edge pairs cast votes. The DiskVPS approach has two fundamental advantages over other VP detection schemes: 1) the potential to achieve substantially higher accuracy at significantly faster processing speed by using individual edges rather than more error-prone and less efficient edge pairs as voters, and 2) the applicability of VP detection to all image types without the need for calibration, as no camera parameters are involved in the algorithm. In a comparative experimental study, we demonstrate that DiskVPS significantly outperforms the SOTA in detection accuracy and processing speed on real-world images.
From Sparse to Dense: Camera Relocalization with Scene-Specific Detector from Feature Gaussian Splatting
Zhiwei Huang · Hailin Yu · Yichun Shentu · Jin Yuan · Guofeng Zhang
This paper presents a novel camera relocalization method, STDLoc, which leverages feature Gaussians as the scene representation. STDLoc is a full relocalization pipeline that can achieve accurate relocalization without relying on any pose prior. Unlike previous coarse-to-fine localization methods that require image retrieval first and then feature matching, we propose a novel sparse-to-dense localization paradigm. Based on this scene representation, we introduce a novel matching-oriented Gaussian sampling strategy and a scene-specific detector to achieve efficient and robust initial pose estimation. Furthermore, based on the initial localization results, we align the query feature map to the Gaussian feature field by dense feature matching to enable accurate localization. Experiments on indoor and outdoor datasets show that STDLoc outperforms current state-of-the-art localization methods in terms of localization accuracy and recall. The code will be released after the paper is accepted.
RUBIK: A Structured Benchmark for Image Matching across Geometric Challenges
Thibaut Loiseau · Guillaume Bourmaud
Camera pose estimation is crucial for many computer vision applications, yet existing benchmarks offer limited insight into method limitations across different geometric challenges. We introduce RUBIK, a novel benchmark that systematically evaluates image matching methods across well-defined geometric difficulty levels. Using three complementary criteria (overlap, scale ratio, and viewpoint angle), we organize 16.5K image pairs from nuScenes into 33 difficulty levels. Our comprehensive evaluation of 14 methods reveals that while recent detector-free approaches achieve the best performance (>47% success rate), they come with significant computational overhead compared to detector-based methods (150-600ms vs. 40-70ms). Even the best-performing method succeeds on only 54.8% of the pairs, highlighting substantial room for improvement, particularly in challenging scenarios combining low overlap, large scale differences, and extreme viewpoint changes. The benchmark will be made publicly available.
MATCHA: Towards Matching Anything
Fei Xue · Sven Elflein · Laura Leal-Taixe · Qunjie Zhou
Establishing correspondences across images is a fundamental challenge in computer vision, underpinning tasks like Structure-from-Motion, image editing, and point tracking. Traditional methods are often specialized for specific correspondence types (geometric, semantic, or temporal), whereas humans naturally identify alignments across these domains. Inspired by this flexibility, we propose MATCHA, a unified feature model designed to “rule them all”, establishing robust correspondences across diverse matching tasks. Building on insights that diffusion model features can encode multiple correspondence types, MATCHA augments this capacity by dynamically fusing high-level semantic and low-level geometric features through an attention-based module, creating expressive, versatile, and robust features. Additionally, MATCHA integrates object-level features from DINOv2 to further boost generalization, enabling a single feature capable of “matching anything.” Extensive experiments validate that MATCHA consistently surpasses state-of-the-art methods across geometric, semantic, and temporal tasks, setting a new foundation for a unified approach to the fundamental correspondence problem in computer vision. To the best of our knowledge, MATCHA is the first approach that is able to effectively tackle diverse matching tasks with a single unified feature.
Scene-agnostic Pose Regression for Visual Localization
Junwei Zheng · Ruiping Liu · Yufan Chen · Zhenfang Chen · Kailun Yang · Jiaming Zhang · Rainer Stiefelhagen
Absolute Pose Regression (APR) predicts 6D camera poses but lacks the adaptability to unknown environments without retraining, while Relative Pose Regression (RPR) generalizes better yet requires a large image retrieval database. To address this dilemma, we introduce a new task, Scene-agnostic Pose Regression (SPR), which can achieve accurate pose regression in a flexible way while eliminating the need for retraining or databases. To benchmark SPR, we create a large-scale dataset, 360SPR, with over 200K photorealistic panoramas, 3.6M pinhole images and camera poses in 270 scenes at 3 different sensor heights. Furthermore, an SPR-Mamba model is proposed to address SPR in a dual-branch manner. While the local branch focuses on the poses between consecutive frames, the global branch is designed for the pose between the query and origin frame. Extensive experiments and studies demonstrate the effectiveness of our SPR task, dataset, and methods. In unknown 360SPR scenes, our method outperforms APR (27.45m/47.01°) and RPR (11.92m/21.27°), achieving a significant reduction of error to 3.85m/3.97°. The dataset and code will be made publicly available.
Simulator HC: Regression-based Online Simulation of Starting Problem-Solution Pairs for Homotopy Continuation in Geometric Vision
Xinyue Zhang · Zijia Dai · Wanting Xu · Laurent Kneip
While automatically generated polynomial elimination templates have sparked great progress in the field of 3D computer vision, there remain many problems for which the degree of the constraints or the number of unknowns leads to intractability. In recent years, homotopy continuation has been introduced as a plausible alternative. However, the method currently depends on expensive parallel tracking of all possible solutions in the complex domain, or a classification network for starting problem-solution pairs trained over a limited set of real-world examples. Our innovation lies in a novel approach to finding solution-problem pairs, where we only need to predict a rough initial solution, with the corresponding problem generated by an online simulator. Subsequently, homotopy continuation is applied to track that single solution back to the original problem. We apply this elegant combination to generalized camera resectioning, and also introduce a new solution to the challenging generalized relative pose and scale problem. As demonstrated, the proposed method successfully compensates for the raw error committed by the regressor alone, and leads to state-of-the-art efficiency and success rates.
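To give a feel for tracking a single start solution, here is a toy scalar predictor-corrector on the straight-line homotopy H(x, t) = (1 - t) g(x) + t f(x): a known root of the start system g is continued to a root of the target f. This is an illustrative sketch only; the paper tracks solutions of polynomial systems arising from geometric vision problems.

```python
def track_single_solution(f, g, x0, steps=100, newton_iters=5, eps=1e-7):
    """Continue a known root x0 of the start system g to a root of the target
    system f along H(x, t) = (1 - t) * g(x) + t * f(x). Toy scalar sketch."""
    x = float(x0)
    for k in range(1, steps + 1):
        t = k / steps
        for _ in range(newton_iters):                 # corrector: Newton on H(., t)
            h = (1 - t) * g(x) + t * f(x)
            dh = ((1 - t) * (g(x + eps) - g(x)) + t * (f(x + eps) - f(x))) / eps
            if abs(dh) < 1e-12:
                break
            x -= h / dh
    return x

# Example: start system x^2 - 1 (known root 1.0), target system x^2 - 4 -> root ~2.0.
print(track_single_solution(lambda x: x**2 - 4, lambda x: x**2 - 1, x0=1.0))
```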
GaussianUDF: Inferring Unsigned Distance Functions through 3D Gaussian Splatting
Shujuan Li · Yu-Shen Liu · Zhizhong Han
Reconstructing open surfaces from multi-view images is vital in digitalizing complex objects in daily life. A widely used strategy is to learn unsigned distance functions (UDFs) by checking if their appearance conforms to the image observations through neural rendering. However, it is still hard to learn continuous and implicit UDF representations through 3D Gaussian Splatting (3DGS) due to its discrete and explicit scene representation, i.e., 3D Gaussians. To resolve this issue, we propose a novel approach to bridge the gap between 3D Gaussians and UDFs. Our key idea is to overfit thin and flat 2D Gaussian planes on surfaces, and then leverage self-supervision and gradient-based inference to supervise unsigned distances in areas both near to and far from surfaces. To this end, we introduce novel constraints and strategies to constrain the learning of 2D Gaussians to pursue more stable optimization and more reliable self-supervision, addressing the challenges brought by the complicated gradient field on or near the zero level set of UDFs. We report numerical and visual comparisons with the state-of-the-art on widely used benchmarks and real data to show our advantages in terms of accuracy, efficiency, completeness, and sharpness of reconstructed open surfaces with boundaries.
ProbPose: A Probabilistic Approach to 2D Human Pose Estimation
Miroslav Purkrábek · Jiri Matas
Current Human Pose Estimation methods have achieved significant improvements. However, state-of-the-art models ignore out-of-image keypoints and use uncalibrated heatmaps as the keypoint location representation. To address these limitations, we propose ProbPose, which predicts for each keypoint: a calibrated probability of keypoint presence at each location in the activation window, the probability of being outside of it, and its predicted visibility. To address the lack of evaluation protocols for out-of-image keypoints, we introduce the CropCOCO dataset and the Extended OKS (Ex-OKS) metric, which extends OKS to out-of-image points. Tested on COCO, CropCOCO, and OCHuman, ProbPose shows significant gains in out-of-image keypoint localization while also improving in-image localization through data augmentation. Additionally, the model improves robustness along the edges of the bounding box and offers better flexibility in keypoint evaluation. The code and models will be released on the project website for research purposes.
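For reference, the sketch below computes the standard COCO-style OKS over in-image keypoints; the Extended OKS (Ex-OKS) proposed in the paper additionally handles out-of-image points and is not reproduced here. The per-keypoint constants `kappas` and the use of object area as the scale term follow the usual COCO convention.

```python
import numpy as np

def standard_oks(pred, gt, visibility, area, kappas):
    """pred, gt: (K, 2) keypoint coordinates; visibility: (K,) flags;
    area: object segment area; kappas: (K,) per-keypoint constants.
    Returns the usual COCO object keypoint similarity."""
    d2 = np.sum((pred - gt) ** 2, axis=1)
    e = d2 / (2.0 * area * kappas ** 2 + np.finfo(float).eps)
    mask = visibility > 0
    if mask.sum() == 0:
        return 0.0
    return float(np.mean(np.exp(-e)[mask]))   # average similarity over labeled keypoints
```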
Floating No More: Object-Ground Reconstruction from a Single Image
Yunze Man · Yichen Sheng · Jianming Zhang · Liangyan Gui · Yu-Xiong Wang
Recent advancements in 3D object reconstruction from single images have primarily focused on improving the accuracy of object shapes. Yet, these techniques often fail to accurately capture the inter-relation between the object, ground, and camera. As a result, the reconstructed objects often appear floating or tilted when placed on flat surfaces. This limitation significantly affects 3D-aware image editing applications like shadow rendering and object pose manipulation. To address this issue, we introduce ORG (Object Reconstruction with Ground), a novel task aimed at reconstructing 3D object geometry in conjunction with the ground surface. Our method uses two compact pixel-level representations to depict the relationship between camera, object, and ground. Experiments show that the proposed ORG model can effectively reconstruct object-ground geometry on unseen data, significantly enhancing the quality of shadow generation and pose manipulation compared to conventional single-image 3D reconstruction techniques.
ArticulatedGS: Self-supervised Digital Twin Modeling of Articulated Objects using 3D Gaussian Splatting
Guo Junfu · Yu Xin · Gaoyi Liu · Kai Xu · Ligang Liu · Ruizhen Hu
We tackle the challenge of concurrently reconstructing part-level geometry and RGB appearance and estimating motion parameters for building digital twins of articulated objects using the 3D Gaussian Splatting (3D-GS) method. With two distinct sets of multi-view imagery, each depicting an object in a separate static articulation configuration, we reconstruct the articulated object in 3D Gaussian representations with both appearance and geometry information at the same time. Our approach decouples multiple highly interdependent parameters through a multi-step optimization process, thereby achieving a stable optimization procedure and high-quality outcomes. We introduce ArticulatedGS, a self-supervised, comprehensive framework that autonomously learns to model shapes and appearances at the part level and synchronizes the optimization of motion parameters, all without reliance on 3D supervision, motion cues, or semantic labels. Our experimental results demonstrate that, among comparable methodologies, our approach achieves the best results in terms of part segmentation accuracy, motion estimation accuracy, and visual quality.
GCE-Pose: Global Context Enhancement for Category-level Object Pose Estimation
Weihang Li · Hongli XU · Junwen Huang · HyunJun Jung · Kuan-Ting Yu · Nassir Navab · Benjamin Busam
A key challenge in model-free category-level pose estimation is the extraction of contextual object features that generalize across varying instances within a specific category. Recent approaches leverage foundational features to capture semantic and geometry cues from data. However, these approaches fail under partial visibility. We overcome this with a first-complete-then-aggregate strategy for feature extraction utilizing class priors. In this paper, we present GCE-Pose, a method that enhances pose estimation for novel instances by integrating a category-level global context prior. GCE-Pose first performs semantic shape reconstruction with a proposed Semantic Shape Reconstruction (SSR) module. Given an unseen partial RGB-D object instance, our SSR module reconstructs the instance's global geometry and semantics by deforming category-specific 3D semantic prototypes through a learned deep Linear Shape Model. We then introduce a Global Context Enhanced (GCE) feature fusion module that effectively fuses features from partial RGB-D observations and the reconstructed global context. Extensive experiments validate the impact of our global context prior and the effectiveness of the GCE fusion module, demonstrating that GCE-Pose significantly outperforms existing methods on the challenging real-world datasets HouseCat6D and NOCS-REAL275.
Doppelgangers++: Improved Visual Disambiguation with Geometric 3D Features
Yuanbo Xiangli · Ruojin Cai · Hanyu Chen · Jeffrey Byrne · Noah Snavely
Accurate 3D reconstruction is frequently hindered by visual aliasing, where visually similar but distinct surfaces (a.k.a. doppelgangers) are incorrectly matched. These spurious matches distort the structure-from-motion (SfM) process, leading to misplaced model elements and reduced accuracy. Prior efforts addressed this with CNN classifiers trained on curated datasets, but these approaches struggle to generalize across diverse real-world scenes and can require extensive parameter tuning. In this work, we present Doppelgangers++, a method to enhance doppelganger detection and improve 3D reconstruction accuracy. Our contributions include a diversified training dataset that incorporates geo-tagged images from everyday scenes to expand robustness beyond landmark-based datasets. We further propose a Transformer-based classifier that leverages 3D-aware features from the MASt3R model, achieving superior precision and recall across both in-domain and out-of-domain tests. Doppelgangers++ integrates seamlessly into standard SfM and MASt3R-SfM pipelines, offering efficiency and adaptability across varied scenes. To evaluate SfM accuracy, we introduce an automated, geotag-based method for validating reconstructed models, eliminating the need for manual inspection. Through extensive experiments, we demonstrate that Doppelgangers++ significantly enhances pairwise visual disambiguation and improves 3D reconstruction quality in complex and diverse scenarios.
MITracker: Multi-View Integration for Visual Object Tracking
Mengjie Xu · Yitao Zhu · Haotian Jiang · Jiaming Li · Zhenrong Shen · Sheng Wang · Haolin Huang · Xinyu Wang · Han Zhang · Qing Yang · Qian Wang
Multi-view object tracking (MVOT) offers promising solutions to challenges such as occlusion and target loss, which are common in traditional single-view tracking. However, progress has been limited by the lack of comprehensive multi-view datasets and effective cross-view integration methods. To overcome these limitations, we compiled a Multi-View object Tracking (MVTrack) dataset of 234K high-quality annotated frames featuring 27 distinct objects across various scenes. In conjunction with this dataset, we introduce a novel MVOT method, Multi-View Integration Tracker (MITracker), to efficiently integrate multi-view object features and provide stable tracking outcomes. MITracker can track any object in video frames of arbitrary length from arbitrary viewpoints. The key advancements of our method over traditional single-view approaches come from two aspects: (1) MITracker transforms 2D image features into a 3D feature volume and compresses it into a bird’s eye view (BEV) plane, facilitating inter-view information fusion; (2) we propose an attention mechanism that leverages geometric information from fused 3D feature volume to refine the tracking results at each view. MITracker outperforms existing methods on the MVTrack and GMTD datasets, achieving state-of-the-art performance.
ETAP: Event-based Tracking of Any Point
Friedhelm Hamann · Daniel Gehrig · Filbert Febryanto · Kostas Daniilidis · Guillermo Gallego
Tracking any point (TAP) recently shifted the motion estimation paradigm from focusing on individual salient points with local templates to tracking arbitrary points with global image contexts. However, while research has mostly focused on driving the accuracy of models in nominal settings, addressing scenarios with difficult lighting conditions and high-speed motions remains out of reach due to the limitations of the sensor. This work addresses this challenge with the first event camera-based TAP method. It leverages the high temporal resolution and high dynamic range of event cameras for robust high-speed tracking, and the global contexts in TAP methods to handle asynchronous and sparse event measurements. We further extend the TAP framework to handle event feature variations induced by motion - thereby addressing an open challenge in purely event-based tracking - with a novel feature alignment loss which ensures the learning of motion-robust features. Our method is trained with data from a new data generation pipeline and systematically ablated across all design decisions. Our method shows strong cross-dataset generalization and performs 135% better on the average Jaccard metric than the baselines. Moreover, on an established feature tracking benchmark, it achieves a 19% improvement over the previous best event-only method and even surpasses the previous best events-and-frames method by 3.7%.
Ev-3DOD: Pushing the Temporal Boundaries of 3D Object Detection with Event Cameras
Hoonhee Cho · Jae-Young Kang · Youngho Kim · Kuk-Jin Yoon
Detecting 3D objects in point clouds plays a crucial role in autonomous driving systems. Recently, advanced multi-modal methods incorporating camera information have achieved notable performance. For a safe and effective autonomous driving system, algorithms that excel not only in accuracy but also in speed and low latency are essential. However, existing algorithms fail to meet these requirements due to the latency and bandwidth limitations of fixed frame rate sensors, e.g., LiDAR and camera. To address this limitation, we introduce asynchronous event cameras into 3D object detection for the first time. We leverage their high temporal resolution and low bandwidth to enable high-speed 3D object detection. Our method enables detection even during inter-frame intervals when synchronized data is unavailable, by retrieving previous 3D information through the event camera. Furthermore, we introduce the first event-based 3D object detection dataset, DSEC-3DOD, which includes ground-truth 3D bounding boxes at 100 FPS, establishing the first benchmark for event-based 3D detectors. Our code and dataset will be publicly available.
GO-N3RDet: Geometry Optimized NeRF-enhanced 3D Object Detector
Zechuan Li · Hongshan Yu · Yihao Ding · Jinhao Qiao · Basim Azam · Naveed Akhtar
We propose GO-N3RDet, a scene-geometry optimized multi-view 3D object detector enhanced by neural radiance fields (NeRF). The key to accurate 3D object detection is in effective voxel representation. However, due to occlusion and lack of 3D information, constructing 3D features from multi-view 2D images is challenging. Addressing that, we introduce a unique 3D positional information embedded voxel optimization mechanism to fuse multi-view features. To prioritize neural field reconstruction in object regions, we also devise a double importance sampling scheme for the NeRF branch of our detector. We additionally propose an opacity optimization module for precise voxel opacity prediction by enforcing multi-view consistency constraints. Moreover, to further improve voxel density consistency across multiple perspectives, we incorporate ray distance as a weighting factor to minimize cumulative ray errors. Our unique modules synergetically form an end-to-end neural model that establishes new state-of-the-art in NeRF-based multi-view 3D detection, verified with extensive experiments on ScanNet and ARKITScenes. Our code and models will be made public after acceptance.
Preconditioners for the Stochastic Training of Neural Fields
Shin-Fang Chng · Hemanth Saratchandran · Simon Lucey
Neural fields encode continuous multidimensional signals as neural networks, enabling diverse applications in computer vision, robotics, and geometry. While Adam is effective for stochastic optimization, it often requires long training times. To address this, we explore alternative optimization techniques to accelerate training without sacrificing accuracy. Traditional second-order methods like L-BFGS are unsuitable for stochastic settings. We propose a theoretical framework for training neural fields with curvature-aware diagonal preconditioners, demonstrating their effectiveness across tasks such as image reconstruction, shape modeling, and Neural Radiance Fields (NeRF).
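As a minimal picture of preconditioned stochastic training, the step below rescales each gradient coordinate by a diagonal preconditioner. The running squared-gradient estimate used here is only an assumed stand-in for the curvature-aware diagonal preconditioners the paper derives.

```python
import torch

@torch.no_grad()
def diag_precond_step(params, precond, lr=1e-3, beta=0.99, eps=1e-8):
    """One stochastic update x <- x - lr * D^{-1/2} g with a diagonal
    preconditioner D tracked per parameter (precond: list of tensors,
    initialized to torch.zeros_like(p) for each parameter p)."""
    for p, d in zip(params, precond):
        if p.grad is None:
            continue
        d.mul_(beta).addcmul_(p.grad, p.grad, value=1 - beta)   # update diagonal estimate
        p.add_(p.grad / (d.sqrt() + eps), alpha=-lr)            # preconditioned step
```

In use, this would be paired with any neural-field objective: compute the loss on a random batch of coordinates, call backward(), then invoke the step.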
3D-SLNR: A Super Lightweight Neural Representation for Large-scale 3D Mapping
Chenhui Shi · Fulin Tang · Ning An · Yihong Wu
We propose 3D-SLNR, a new and ultra-lightweight neural representation with outstanding performance for large-scale 3D mapping. The representation defines a global signed distance function (SDF) in near-surface space based on a set of band-limited local SDFs anchored at support points sampled from point clouds. These SDFs are parameterized only by a tiny multi-layer perceptron (MLP) with no latent features, and the state of each SDF is modulated by three learnable geometric properties: position, rotation, and scaling, which make the representation adapt to complex geometries. Then, we develop a novel parallel algorithm tailored for this unordered representation to efficiently detect local SDFs where each sampled point is located, allowing for real-time updates of local SDF states during training. Additionally, a prune-and-expand strategy is introduced to enhance adaptability further. The synergy of our low-parameter model and its adaptive capabilities results in an extremely compact representation with excellent expressiveness. Extensive experiments demonstrate that our method achieves state-of-the-art reconstruction performance with less than 1/5 of the memory footprint compared with previous advanced methods.
PCDreamer: Point Cloud Completion Through Multi-view Diffusion Priors
Guangshun Wei · Yuan Feng · Long Ma · Chen Wang · Yuanfeng Zhou · Changjian Li
This paper presents PCDreamer, a novel method for point cloud completion. Traditional methods typically extract features from partial point clouds to predict missing regions, but the large solution space often leads to unsatisfactory results. More recent approaches have started to use images as extra guidance, effectively improving performance, but obtaining paired data of images and partial point clouds is challenging in practice. To overcome these limitations, we harness the relatively view-consistent multi-view diffusion priors within large models, to generate novel views of the desired shape. The resulting image set encodes both global and local shape cues, which is especially beneficial for shape completion. To fully exploit the priors, we have designed a shape fusion module for producing an initial complete shape from multi-modality input (i.e., images and point clouds), and a follow-up shape consolidation module to obtain the final complete shape by discarding unreliable points introduced by the inconsistency from diffusion priors. Extensive experimental results demonstrate our superior performance, especially in recovering fine details. The code, model, and datasets will be made publicly available upon publication.
STAR-Edge: Structure-aware Local Spherical Curve Representation for Thin-walled Edge Extraction from Unstructured Point Clouds
Zikuan Li · Honghua Chen · Yuecheng Wang · Sibo Wu · Mingqiang Wei · Jun Wang
Extracting geometric edges from unstructured point clouds remains a significant challenge, particularly in thin-walled structures that are commonly found in everyday objects. Traditional geometric methods and recent learning-based approaches frequently struggle with these structures, as both rely heavily on sufficient contextual information from local point neighborhoods. However, 3D measurement data of thin-walled structures often lack the accurate, dense, and regular neighborhood sampling required for reliable edge extraction, resulting in degraded performance. In this work, we introduce STAR-Edge, a novel approach designed for detecting and refining edge points in thin-walled structures. Our method leverages a unique representation—the local spherical curve—to create structure-aware neighborhoods that emphasize co-planar points while reducing interference from close-by, non-co-planar surfaces. This representation is transformed into a rotation-invariant descriptor, which, combined with a lightweight multi-layer perceptron, enables robust edge point classification even in the presence of noise and sparse or irregular sampling. We also use the local spherical curve representation to estimate more precise normals and introduce an optimization function to project initially identified edge points exactly onto the true edges. Experiments conducted on the ABC dataset and thin-walled structure-specific datasets demonstrate that STAR-Edge outperforms existing edge detection methods, showcasing better robustness under various challenging conditions. The source code is available in the supplemental file.
DV-Matcher: Deformation-based Non-rigid Point Cloud Matching Guided by Pre-trained Visual Features
Zhangquan Chen · Puhua Jiang · Ruqi Huang
In this paper, we present DV-Matcher, a novel learning-based framework for estimating dense correspondences between non-rigidly deformable point clouds. Learning directly from unstructured point clouds without meshing or manual labelling, our framework delivers high-quality dense correspondences, which is of significant practical utility in point cloud processing. Our key contributions are two-fold: First, we propose a scheme to inject prior knowledge from pre-trained vision models into geometric feature learning, which effectively complements the local nature of geometric features with global and semantic information; Second, we propose a novel deformation-based module to promote the extrinsic alignment induced by the learned correspondences, which effectively enhances the feature learning. Experimental results show that our method achieves state-of-the-art results in matching non-rigid point clouds in both near-isometric and heterogeneous shape collections, as well as on more realistic partial and noisy data.
Mitigating Ambiguities in 3D Classification with Gaussian Splatting
Ruiqi Zhang · Hao Zhu · Jingyi Zhao · Qi Zhang · Xun Cao · Zhan Ma
3D classification with point cloud input is a fundamental problem in 3D vision. However, due to the discrete nature and the insufficient material description of point cloud representations, there are ambiguities in distinguishing wire-like and flat surfaces, as well as transparent or reflective objects. To address these issues, we propose Gaussian Splatting (GS) point cloud-based 3D classification. We find that the scale and rotation coefficients in the GS point cloud help characterize surface types. Specifically, wire-like surfaces consist of multiple slender Gaussian ellipsoids, while flat surfaces are composed of a few flat Gaussian ellipsoids. Additionally, the opacity in the GS point cloud represents the transparency characteristics of objects. As a result, ambiguities in point cloud-based 3D classification can be mitigated utilizing GS point cloud as input. To verify the effectiveness of GS point cloud input, we construct the first real-world GS point cloud dataset in the community, which includes 20 categories with 200 objects in each category. Experiments not only validate the superiority of GS point cloud input, especially in distinguishing ambiguous objects, but also demonstrate the generalization ability across different classification methods.
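The ambiguity-mitigating attributes described above can be assembled into a simple per-Gaussian feature vector, for instance by concatenating position, scale, rotation, and opacity, plus an anisotropy ratio that separates slender from flat ellipsoids. The concatenation layout and the extra ratio are illustrative assumptions, not the paper's exact input encoding.

```python
import numpy as np

def gs_point_features(xyz, scales, quats, opacities):
    """xyz: (N, 3), scales: (N, 3), quats: (N, 4), opacities: (N, 1).
    Returns an (N, 12) feature matrix for a GS point cloud classifier."""
    s_sorted = np.sort(scales, axis=1)                               # ascending per Gaussian
    anisotropy = s_sorted[:, 2:3] / np.maximum(s_sorted[:, 0:1], 1e-8)
    return np.concatenate([xyz, scales, quats, opacities, anisotropy], axis=1)
```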
Sparse Point Cloud Patches Rendering via Splitting 2D Gaussians
Changfeng Ma · Ran Bi · Jie Guo · Chongjun Wang · Yanwen Guo
Current learning-based methods predict NeRF or 3D Gaussians from point clouds to achieve photo-realistic rendering but still depend on categorical priors, dense point clouds, or additional refinements. Hence, we introduce a novel point cloud rendering method by predicting 2D Gaussians from point clouds. Our method incorporates two identical modules with an entire-patch architecture enabling the network to be generalized to multiple datasets. The module normalizes and initializes the Gaussians utilizing the point cloud information including normals, colors and distances. Then, splitting decoders are employed to refine the initial Gaussians by duplicating them and predicting more accurate results, making our methodology effectively accommodate sparse point clouds as well. Once trained, our approach exhibits direct generalization to point clouds across different categories. The predicted Gaussians are employed directly for rendering without additional refinement on the rendered images, retaining the benefits of 2D Gaussians. We conduct extensive experiments on various datasets, and the results demonstrate the superiority and generalization of our method, which achieves SOTA performance.
SASep: Saliency-Aware Structured Separation of Geometry and Feature for Open Set Learning on Point Clouds
Jinfeng Xu · Xianzhi Li · Yuan Tang · Xu Han · Qiao Yu · yixue Hao · Long Hu · Min Chen
Recent advancements in deep learning have greatly enhanced 3D object recognition, but most models are limited to closed-set scenarios, unable to handle unknown samples in real-world applications. Open-set recognition (OSR) addresses this limitation by enabling models to both classify known classes and identify novel classes. However, current OSR methods rely on global features to differentiate known and unknown classes, treating the entire object uniformly and overlooking the varying semantic importance of its different parts. To address this gap, we propose Salience-Aware Structured Separation (SASep), which includes (i) a tunable semantic decomposition (TSD) module to semantically decompose objects into important and unimportant parts, (ii) a geometric synthesis strategy (GSS) to generate pseudo-unknown objects by combining these unimportant parts, and (iii) a synth-aided margin separation (SMS) module to enhance feature-level separation by expanding the feature distributions between classes. Together, these components improve both geometric and feature representations, enhancing the model’s ability to effectively distinguish known and unknown classes. Experimental results show that SASep achieves superior performance in 3D OSR, outperforming existing state-of-the-art methods. We shall release our code and models upon publication of this work.
TopNet: Transformer-Efficient Occupancy Prediction Network for Octree-Structured Point Cloud Geometry Compression
Xinjie Wang · Yifan Zhang · Ting Liu · Xinpu Liu · Ke Xu · Jianwei Wan · Yulan Guo · Hanyun Wang
Efficient Point Cloud Geometry Compression (PCGC) with a lower bits per point (BPP) and higher peak signal-to-noise ratio (PSNR) is essential for the transportation of large-scale 3D data. Although octree-based entropy models can reduce BPP without introducing geometry distortion, existing CNN-based models struggle with limited receptive fields to capture long-range dependencies, while Transformer-built architectures always neglect fine-grained details due to their reliance on global self-attention. In this paper, we propose a Transformer-efficient occupancy prediction network, termed TopNet, to overcome these challenges by developing several novel components: Locally-enhanced Context Encoding (LeCE) for enhancing the translation-invariance of the octree nodes, Adaptive-Length Sliding Window Attention (AL-SWA) for capturing both global and local dependencies while adaptively adjusting attention weights based on the input window length, Spatial-Gated-enhanced Channel Mixer (SG-CM) for efficient feature aggregation from ancestors and siblings, and Latent-guided Node Occupancy Predictor (LNOP) for improving prediction accuracy of spatially adjacent octree nodes. Comprehensive experiments across both indoor and outdoor point cloud datasets demonstrate that our TopNet achieves state-of-the-art performance with fewer parameters, further advancing the reduction-efficiency boundaries of PCGC.
A Unified Approach to Interpreting Self-supervised Pre-training Methods for 3D Point Clouds via Interactions
Qiang Li · Jian Ruan · Fanghao Wu · Yuchi Chen · Zhihua Wei · Wen Shen
Recently, many self-supervised pre-training methods have been proposed to improve the performance of deep neural networks (DNNs) for 3D point cloud processing. However, the common mechanism underlying the effectiveness of different pre-training methods remains unclear. In this paper, we use game-theoretic interactions as a unified approach to explore the common mechanism of pre-training methods. Specifically, we decompose the output score of a DNN into the sum of numerous interaction effects, with each interaction representing a distinct 3D substructure of the input point cloud. Based on the decomposed interactions, we draw the following conclusions. (1) The common mechanism across different pre-training methods is that they enhance the strength of high-order interactions encoded by DNNs, which represent complex and global 3D structures, while reducing the strength of low-order interactions, which represent simple and local 3D structures. (2) Sufficient pre-training and adequate fine-tuning data for downstream tasks further reinforce the mechanism described above. (3) Pre-training methods carry a potential risk of reducing the transferability of features encoded by DNNs. Inspired by the observed common mechanism, we propose a new method to directly enhance the strength of high-order interactions and reduce the strength of low-order interactions encoded by DNNs, improving performance without the need for pre-training on large-scale datasets. Experiments show that our method achieves performance comparable to traditional pre-training methods.
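For context, a commonly used game-theoretic interaction in this line of work is the Harsanyi dividend, shown below; the exact definition adopted by the paper may differ in normalization or conditioning.

```latex
% Interaction effect of a subset S of input units (a 3D substructure),
% with v(T) the network output when only the units in T are kept:
I(S) \;=\; \sum_{T \subseteq S} (-1)^{\,|S| - |T|}\, v(T),
\qquad\text{so that}\qquad
v(N) \;=\; \sum_{S \subseteq N} I(S).
```

The order of an interaction is |S|, so the "high-order" interactions referred to in the abstract correspond to large substructures and "low-order" interactions to small, local ones.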
An End-to-End Robust Point Cloud Semantic Segmentation Network with Single-Step Conditional Diffusion Models
Wentao Qu · Jing Wang · Yongshun Gong · Xiaoshui Huang · Liang Xiao
Existing conditional Denoising Diffusion Probabilistic Models (DDPMs) with a Noise-Conditional Framework (NCF) remain challenging for 3D scene understanding tasks, as the complex geometric details in scenes increase the difficulty of fitting the gradients of the data distribution (the scores) from semantic labels. This also results in longer training and inference time for DDPMs compared to non-DDPMs. From a different perspective, we delve deeply into the model paradigm dominated by the Conditional Network. In this paper, we propose an end-to-end robust semantic Segmentation Network based on a Conditional-Noise Framework (CNF) of DDPMs, named CDSegNet. Specifically, CDSegNet models the Noise Network (NN) as a learnable noise-feature generator. This enables the Conditional Network (CN) to understand 3D scene semantics under multi-level feature perturbations, enhancing the generalization in unseen scenes. Meanwhile, benefiting from the noise system of DDPMs, CDSegNet exhibits strong noise and sparsity robustness in experiments. Moreover, thanks to CNF, CDSegNet can generate the semantic labels in a single-step inference like non-DDPMs, due to avoiding directly fitting the scores from semantic labels in the dominant network of CDSegNet. On public indoor and outdoor benchmarks, CDSegNet significantly outperforms existing methods, achieving state-of-the-art performance.
PillarHist: A Quantization-aware Pillar Feature Encoder based on Height-aware Histogram
Sifan Zhou · Zhihang Yuan · Dawei Yang · Ziyu Zhao · Jian Qian · Xing Hu
Real-time and high-performance 3D object detection plays a critical role in autonomous driving and robotics. Recent pillar-based 3D object detectors have gained significant attention due to their compact representation and low computational overhead, making them suitable for onboard deployment and quantization. However, existing pillar-based detectors still suffer from information loss along the height dimension and large numerical distribution differences during pillar feature encoding (PFE), which severely limits their performance and quantization potential. To address this issue, we first unveil the importance of different input information during PFE and identify the height dimension as a key factor in enhancing 3D detection performance. Motivated by this observation, we propose a height-aware pillar feature encoder, called PillarHist. Specifically, PillarHist computes statistics of the discrete distribution of point heights within each pillar. This simple yet effective design greatly preserves the information along the height dimension while significantly reducing the computation overhead of the PFE. Meanwhile, PillarHist also constrains the arithmetic distribution of the PFE input to a stable range, making it quantization-friendly. Notably, PillarHist operates exclusively within the PFE stage to enhance performance, enabling seamless integration into existing pillar-based methods without introducing complex operations. Extensive experiments show the effectiveness of PillarHist in terms of both efficiency and performance.
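The height-aware statistic at the core of the encoder can be sketched as a per-pillar histogram over height bins, as below. The bin count, height range, and per-pillar normalization are assumptions chosen to keep the input range bounded (and hence quantization-friendly); the paper's exact PillarHist design may differ.

```python
import numpy as np

def pillar_height_histogram(points, pillar_ids, num_pillars,
                            z_min=-3.0, z_max=1.0, num_bins=8):
    """points: (N, 3+) with height in column 2; pillar_ids: (N,) int pillar index per point.
    Returns a (num_pillars, num_bins) normalized height histogram."""
    bins = np.linspace(z_min, z_max, num_bins + 1)
    z_bin = np.clip(np.digitize(points[:, 2], bins) - 1, 0, num_bins - 1)
    hist = np.zeros((num_pillars, num_bins), dtype=np.float32)
    np.add.at(hist, (pillar_ids, z_bin), 1.0)                    # scatter point counts
    hist /= np.maximum(hist.sum(axis=1, keepdims=True), 1.0)     # bounded, stable range
    return hist
```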
Deep Change Monitoring: A Hyperbolic Representative Learning Framework and a Dataset for Long-term Fine-grained Tree Change Detection
Yante Li · Hanwen Qi · Haoyu Chen · Liang Xinlian · Guoying Zhao
In environmental protection, tree monitoring plays an essential role in maintaining and improving ecosystem health. However, precise monitoring is challenging because existing datasets fail to capture continuous fine-grained changes in trees due to low-resolution images and high acquisition costs. In this paper, we introduce UAVTC, a large-scale, long-term, high-resolution dataset collected using UAVs equipped with cameras, specifically designed to detect individual Tree Changes (TCs). UAVTC includes rich annotations and statistics based on biological knowledge, offering a fine-grained view for tree monitoring. To address environmental influences and effectively model the hierarchical diversity of physiological TCs, we propose a novel Hyperbolic Siamese Network (HSN) for TC detection, enabling compact and hierarchical representations of dynamic tree changes. Extensive experiments show that HSN can effectively capture complex hierarchical changes and provide a robust solution for fine-grained TC detection. In addition, HSN generalizes well to the cross-domain face anti-spoofing task, highlighting its broader significance in AI. We believe our work, combining ecological insights and interdisciplinary expertise, will benefit the community by offering a new benchmark and innovative AI technologies. The source code and dataset will be made available.
GBlobs: Explicit Local Structure via Gaussian Blobs for Improved Cross-Domain LiDAR-based 3D Object Detection
Dušan Malić · Christian Fruhwirth-Reisinger · Samuel Schulter · Horst Possegger
LiDAR-based 3D detectors need large datasets for training, yet they struggle to generalize to novel domains. Domain Generalization (DG) aims to mitigate this by training detectors that are invariant to such domain shifts. Current DG approaches exclusively rely on global geometric features (point cloud Cartesian coordinates) as input features. Over-reliance on these global geometric features can, however, cause 3D detectors to prioritize object location and absolute position, resulting in poor cross-domain performance. To mitigate this, we propose to exploit explicit local point cloud structure for DG, in particular by encoding point cloud neighborhoods with Gaussian blobs, GBlobs. Our proposed formulation is highly efficient and requires no additional parameters. Without any bells and whistles, simply by integrating GBlobs in existing detectors, we beat the current state-of-the-art in challenging single-source DG benchmarks by over 21 mAP (Waymo->KITTI), 13 mAP (KITTI->Waymo), and 12 mAP (nuScenes->KITTI), without sacrificing in-domain performance. Additionally, GBlobs demonstrate exceptional performance in multi-source DG, surpassing the current state-of-the-art by 17, 12, and 5 mAP on Waymo, KITTI, and ONCE, respectively.
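Encoding a neighborhood as a Gaussian blob boils down to its mean and covariance, as in the parameter-free sketch below; the neighbor search, feature layout, and any normalization are assumptions, and how these features are fed to the detector is described in the paper.

```python
import numpy as np

def gaussian_blob_features(points, neighbor_idx):
    """points: (N, 3); neighbor_idx: list of index arrays (e.g. k-NN per point).
    Returns (N, 12) features per point: mean offset (3) + flattened covariance (9)."""
    feats = []
    for i, idx in enumerate(neighbor_idx):
        nbrs = points[idx]
        mu = nbrs.mean(axis=0)
        cov = np.cov(nbrs.T, bias=True)              # 3x3 local covariance
        feats.append(np.concatenate([mu - points[i], cov.reshape(-1)]))
    return np.stack(feats)
```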
LiMoE: Mixture of LiDAR Representation Learners from Automotive Scenes
Xiang Xu · Lingdong Kong · hui shuai · Liang Pan · Ziwei Liu · Qingshan Liu
LiDAR data pretraining offers a promising approach to leveraging large-scale, readily available datasets for enhanced data utilization. However, existing methods predominantly focus on sparse voxel representation, overlooking the complementary attributes provided by other LiDAR representations. In this work, we propose LiMoE, a framework that integrates the Mixture of Experts (MoE) paradigm into LiDAR data representation learning to synergistically combine multiple representations, such as range images, sparse voxels, and raw points. Our approach consists of three stages: i) Image-to-LiDAR Pretraining, which transfers prior knowledge from images to point clouds across different representations; ii) Contrastive Mixture Learning (CML), which uses MoE to adaptively activate relevant attributes from each representation and distills these mixed features into a unified 3D network; iii) Semantic Mixture Supervision (SMS), which combines semantic logits from multiple representations to boost downstream segmentation performance. Extensive experiments across 11 large-scale LiDAR datasets demonstrate our effectiveness and superiority. The code will be made publicly accessible.
Exploring Scene Affinity for Semi-Supervised LiDAR Semantic Segmentation
Chuandong Liu · Xingxing Weng · Shuguo Jiang · Pengcheng Li · Lei Yu · Gui-Song Xia
This paper explores scene affinity (AIScene), namely intra-scene consistency and inter-scene correlation, for semi-supervised LiDAR semantic segmentation in driving scenes. Adopting teacher-student training, AIScene employs a teacher network to generate pseudo-labeled scenes from unlabeled data, which then supervise the student network's learning. Unlike most methods that include all points in pseudo-labeled scenes for forward propagation but only pseudo-labeled points for backpropagation, AIScene removes points without pseudo-labels, ensuring consistency in both forward and backward propagation within the scene. This simple point erasure strategy effectively prevents unsupervised, semantically ambiguous points (excluded in backpropagation) from affecting the learning of pseudo-labeled points. Moreover, AIScene incorporates patch-based data augmentation, mixing multiple scenes at both scene and instance levels. Compared to existing augmentation techniques that typically perform scene-level mixing between two scenes, our method enhances the semantic diversity of labeled (or pseudo-labeled) scenes, thereby improving the semi-supervised performance of segmentation models. Experiments show that AIScene outperforms previous methods on two popular benchmarks across four settings, achieving notable improvements of 1.9% and 5.3% in the most challenging 1% labeled-data setting.
V2X-R: Cooperative LiDAR-4D Radar Fusion with Denoising Diffusion for 3D Object Detection
Xun Huang · Jinlong Wang · Qiming Xia · Siheng Chen · Bisheng Yang · Xin Li · Cheng Wang · Chenglu Wen
Current Vehicle-to-Everything (V2X) systems have significantly enhanced 3D object detection using LiDAR and camera data. However, they face performance degradation in adverse weather. Weather-robust 4D radar, with Doppler velocity and additional geometric information, offers a promising solution to this challenge. To this end, we present V2X-R, the first simulated V2X dataset incorporating LiDAR, camera, and 4D radar modalities. V2X-R contains 12,079 scenarios with 37,727 frames of LiDAR and 4D radar point clouds, 150,908 images, and 170,859 annotated 3D vehicle bounding boxes. Subsequently, we propose a novel cooperative LiDAR-4D radar fusion pipeline for 3D object detection and implement it with multiple fusion strategies. To achieve weather-robust detection, we additionally propose a Multi-modal Denoising Diffusion (MDD) module in our fusion pipeline. MDD utilizes weather-robust 4D radar features as a condition to guide the diffusion model in denoising noisy LiDAR features. Experiments show that our LiDAR-4D radar fusion pipeline demonstrates superior performance on the V2X-R dataset. Moreover, our MDD module further improves the foggy/snowy performance of the basic fusion model by up to 5.73%/6.70% while barely affecting performance in normal weather. The dataset and code will be publicly available.
Leveraging Temporal Cues for Semi-Supervised Multi-View 3D Object Detection
Jinhyung Park · Navyata Sanghvi · Hiroki Adachi · Yoshihisa Shibata · Shawn Hunt · Shinya Tanaka · Hironobu Fujiyioshi · Kris Kitani
While recent advancements in camera-based 3D object detection demonstrate remarkable performance, they require thousands or even millions of human-annotated frames. This requirement significantly inhibits their deployment in various locations and sensor configurations. To address this gap, we propose a performant semi-supervised framework that leverages unlabeled RGB-only driving sequences - data easily collected with cost-effective RGB cameras - to significantly improve temporal, camera-only 3D detectors. We observe that the standard semi-supervised pseudo-labeling paradigm under-performs in this temporal, camera-only setting due to poor 3D localization of pseudo-labels. To address this, we train a single 3D detector to handle RGB sequences both forwards and backwards in time, then ensemble both its forwards and backwards pseudo-labels for semi-supervised learning. We further improve the pseudo-label quality by leveraging 3D object tracking to in-fill missing detections and by eschewing simple confidence thresholding in favor of using the auxiliary 2D detection head to filter 3D predictions. Finally, to enable the backbone to learn directly from the unlabeled data itself, we introduce an object-query conditioned masked reconstruction objective. Our framework demonstrates remarkable performance improvement on large-scale autonomous driving datasets nuScenes and nuPlan.
CorrBEV: Multi-View 3D Object Detection by Correlation Learning with Multi-modal Prototypes
ziteng xue · Mingzhe Guo · Heng Fan · Shihui Zhang · Zhipeng Zhang
Camera-only multi-view 3D object detection in autonomous driving has witnessed encouraging developments in recent years, largely attributed to the revolution of fundamental architectures in modeling bird's eye view (BEV). Despite the growing overall average performance, we contend that the exploration of more specific and challenging corner cases has not received adequate attention. In this work, we delve into a specific yet critical issue for safe autonomous driving: occlusion. To alleviate this challenge, we draw inspiration from the human amodal perception system, which is proven to have the capacity for mentally reconstructing the complete semantic concept of occluded objects with prior knowledge. More specifically, we introduce auxiliary visual and language prototypes, akin to human prior knowledge, to enhance the diminished object features caused by occlusion. Inspired by Siamese object tracking, we fuse the information from these prototypes with the baseline model through an efficient depth-wise correlation, thereby enhancing the quality of object-related features and guiding the learning of 3D object queries, especially for partially occluded ones. Furthermore, we propose a random pixel drop to mimic occlusion and a multi-modal contrastive loss to align visual features of different occlusion levels to a unified space during training. Although our inspiration originates from addressing occlusion, we are surprised to find that the proposed framework also enhances robustness in various challenging scenarios that diminish object representation, such as inclement weather conditions. By applying our model to different baselines, i.e., BEVFormer and SparseBEV, we demonstrate consistent improvements.
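The fusion operation referenced above, depth-wise correlation as used in Siamese trackers, can be written in a few lines of PyTorch; the tensor shapes and the "same" padding are assumptions for illustration, not the exact CorrBEV implementation.

```python
import torch
import torch.nn.functional as F

def depthwise_correlation(feat, prototype):
    """feat: (B, C, H, W) object/BEV features; prototype: (B, C, h, w) visual or
    language prototype. Correlates each channel independently (Siamese-style)."""
    b, c, H, W = feat.shape
    kernel = prototype.reshape(b * c, 1, *prototype.shape[-2:])
    out = F.conv2d(feat.reshape(1, b * c, H, W), kernel,
                   padding="same", groups=b * c)
    return out.reshape(b, c, H, W)
```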
CroCoDL: Cross-device Collaborative Dataset for Localization
Hermann Blum · Alessandro Mercurio · Joshua O'Reilly · Tim Engelbracht · Mihai Dusmanu · Marc Pollefeys · Zuria Bauer
Accurate localization plays a pivotal role in the autonomy of systems operating in unfamiliar environments, particularly when interaction with humans is expected. High-accuracy visual localization systems encompass various components, such as feature extractors, matchers, and pose estimation methods. This complexity translates to the necessity of robust evaluation settings and pipelines. However, existing datasets and benchmarks primarily focus on single-agent scenarios, overlooking the critical issue of cross-device localization. Different agents with different sensors will show their own specific strengths and weaknesses, and the data they have available varies substantially. This work addresses this gap by enhancing an existing augmented reality visual localization benchmark with data from legged robots, and evaluating human-robot, cross-device mapping and localization. Our contributions extend beyond device diversity and include high environment variability, spanning ten distinct locations ranging from disaster sites to art exhibitions. Each scene in our dataset features recordings from robot agents, hand-held and head-mounted devices, and high-accuracy ground truth LiDAR scanners, resulting in a comprehensive multi-agent dataset and benchmark. This work represents a significant advancement in the field of visual localization benchmarking, with key insights into the performance of cross-device localization methods across diverse settings.
ShowHowTo: Generating Scene-Conditioned Step-by-Step Visual Instructions
Tomas Soucek · Prajwal Gatti · Michael Wray · Ivan Laptev · Dima Damen · Josef Sivic
The goal of this work is to generate step-by-step visual instructions in the form of a sequence of images, given an input image that provides the scene context and the sequence of textual instructions. This is a challenging problem as it requires generating multi-step image sequences to achieve a complex goal while being grounded in a specific environment. Part of the challenge stems from the lack of large-scale training data for this problem. The contribution of this work is thus three-fold. First, we introduce an automatic approach for collecting large step-by-step visual instruction training data from instructional videos. We apply this approach to one million videos and create a large-scale, high-quality dataset of 0.6M sequences of image-text pairs. Second, we develop and train ShowHowTo, a video diffusion model capable of generating step-by-step visual instructions consistent with the provided input image. Third, we evaluate the generated image sequences across three dimensions of accuracy (step, scene, and task) and show our model achieves state-of-the-art results on all of them. Our code, dataset, and trained models will be publicly available.
RoboSense: Large-scale Dataset and Benchmark for Egocentric Robot Perception and Navigation in Crowded and Unstructured Environments
Haisheng Su · Feixiang Song · CONG MA · Wei Wu · Junchi Yan
Reliable embodied perception from an egocentric perspective is challenging yet essential for autonomous navigation technology of intelligent mobile agents. With the growing demand for social robotics, near-field scene understanding becomes an important research topic in the areas of egocentric perceptual tasks related to navigation in both crowded and unstructured environments. Due to the complexity of environmental conditions and the difficulty of perceiving surrounding obstacles under truncation and occlusion, perception capability in this setting remains limited. To further enhance the intelligence of mobile robots, in this paper, we set up an egocentric multi-sensor data collection platform based on 3 main types of sensors (Camera, LiDAR and Fisheye), which supports flexible sensor configurations to enable a dynamic field of view from the ego perspective, capturing either near or farther areas. Meanwhile, a large-scale multimodal dataset is constructed, named RoboSense, to facilitate egocentric robot perception. Specifically, RoboSense contains more than 133K synchronized data frames with 1.4M 3D bounding boxes and IDs annotated in the full $360^{\circ}$ view, forming 216K trajectories across 7.6K temporal sequences. It has $270\times$ and $18\times$ as many annotations of surrounding obstacles within near ranges as previous datasets collected for autonomous driving scenarios such as KITTI and nuScenes. Moreover, we define a novel matching criterion for near-field 3D perception and prediction metrics. Based on RoboSense, we formulate 6 popular tasks to facilitate future research development, where detailed analysis as well as benchmarks are also provided accordingly. Data desensitization measures have been conducted for privacy protection.
DIO: Decomposable Implicit 4D Occupancy-Flow World Model
Christopher Diehl · Quinlan Sykora · Ben Agro · Thomas Gilles · Sergio Casas · Raquel Urtasun
We present DIO, a flexible world model that can estimate the scene occupancy-flow from a sparse set of LiDAR observations, and decompose it into individual instances. DIO can not only complete instance shapes at the present time, but also forecast their occupancy-flow evolution over a future horizon. Thanks to its flexible prompt representation, DIO can take instance prompts from off-the-shelf models like 3D detectors, achieving state-of-the-art performance in the task of 4D semantic occupancy completion and forecasting on the Argoverse 2 dataset. Moreover, our world model can easily and effectively be transferred to downstream tasks like LiDAR point cloud forecasting, ranking first compared to all baselines in the Argoverse 4D occupancy forecasting challenge.
EvOcc: Accurate Semantic Occupancy for Automated Driving Using Evidence Theory
Jonas Kälble · Sascha Wirges · Maxim Tatarchenko · Eddy Ilg
We present EvOcc, a novel evidential semantic occupancy mapping framework. It consists of two parts: (1) an evidential approach for calculating the ground-truth 3D semantic occupancy maps from noisy LiDAR measurements, and (2) a method for training image-based occupancy estimation models through a new loss formulation. In contrast to state-of-the-art semantic occupancy maps, our approach explicitly models the uncertainty introduced by unobserved spaces or contradicting measurements and we show that using it results in significantly stronger models. Evaluated as ray-based mIoU, our evidential semantic occupancy mapping approach improves over the baselines by at least $15.8\%$ for the ground truth and $5.5\%$ for the trained model. Overall, we make a significant contribution towards more detailed and uncertainty-aware 3D environment understanding and safe operation in autonomous driving.
GaussianFormer-2: Probabilistic Gaussian Superposition for Efficient 3D Occupancy Prediction
Yuanhui Huang · Amonnut Thammatadatrakoon · Wenzhao Zheng · Yunpeng Zhang · Dalong Du · Jiwen Lu
3D semantic occupancy prediction has garnered attention as an important task for the robustness of vision-centric autonomous driving, which predicts fine-grained geometry and semantics of the surrounding scene. Most existing methods leverage dense grid-based scene representations, overlooking the spatial sparsity of the driving scenes, which leads to computational redundancy. Although 3D semantic Gaussians serve as an object-centric sparse alternative, most of the Gaussians still describe empty regions with low efficiency. To address this, we propose a probabilistic Gaussian superposition model which interprets each Gaussian as a probability distribution of its neighborhood being occupied and conforms to probabilistic multiplication to derive the overall geometry. Furthermore, we adopt the exact Gaussian mixture model for semantics calculation to avoid unnecessary overlapping of Gaussians. To effectively initialize Gaussians in non-empty regions, we design a distribution-based initialization module which learns the pixel-aligned occupancy distribution instead of the depth of surfaces. We conduct extensive experiments on the nuScenes and KITTI-360 datasets and our GaussianFormer-2 achieves state-of-the-art performance with high efficiency.
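The probabilistic superposition described above can be illustrated with a minimal sketch. Assuming isotropic Gaussians and per-primitive occupancy amplitudes (both assumptions for illustration, not the paper's exact parameterization), the occupancy at a query point follows the complement product of per-Gaussian probabilities.

```python
# Hedged sketch (assumptions, not the paper's exact formulation): compose
# per-Gaussian occupancy probabilities at a query point via the complement
# product  p(occ) = 1 - prod_i (1 - p_i),  with p_i given by an unnormalized
# isotropic Gaussian around each primitive's mean.
import numpy as np

def occupancy_probability(query, means, scales, alphas):
    """query: (3,); means: (N,3); scales: (N,); alphas: (N,) in [0,1]."""
    d2 = np.sum((means - query) ** 2, axis=1)          # squared distances
    p_i = alphas * np.exp(-0.5 * d2 / scales ** 2)     # per-Gaussian occupancy prob.
    return 1.0 - np.prod(1.0 - p_i)                    # probabilistic superposition

means = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
scales = np.array([0.5, 0.3])
alphas = np.array([0.9, 0.7])
print(occupancy_probability(np.array([0.2, 0.0, 0.0]), means, scales, alphas))
```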
SplatFlow: Self-Supervised Dynamic Gaussian Splatting in Neural Motion Flow Field for Autonomous Driving
Su Sun · Cheng Zhao · Zhuoyang Sun · Yingjie Chen · Mei Chen
Most existing Dynamic Gaussian Splatting methods for complex dynamic urban scenarios rely on accurate object-level supervision from expensive manual labeling, limiting their scalability in real-world applications. In this paper, we introduce SplatFlow, a Self-Supervised Dynamic Gaussian Splatting within Neural Motion Flow Fields (NMFF) to learn 4D space-time representations without requiring tracked 3D bounding boxes, enabling accurate dynamic scene reconstruction and novel view RGB/depth/flow synthesis. SplatFlow designs a unified framework to seamlessly integrate time-dependent 4D Gaussian representation within NMFF, where NMFF is a set of implicit functions to model temporal motions of both LiDAR points and Gaussians as continuous motion flow fields. Leveraging NMFF, SplatFlow effectively decomposes static background and dynamic objects, representing them with 3D and 4D Gaussian primitives, respectively. NMFF also models the status correspondences of each 4D Gaussian across time, which aggregates temporal features to enhance cross-view consistency of dynamic components. SplatFlow further improves dynamic scene identification by distilling features from 2D foundational models into 4D space-time representation. Comprehensive evaluations conducted on the Waymo Open Dataset and KITTI Dataset validate SplatFlow's state-of-the-art (SOTA) performance for both image reconstruction and novel view synthesis in dynamic urban scenarios. The code and model will be released upon the paper's acceptance.
DriveGEN: Generalized and Robust 3D Detection in Driving via Controllable Text-to-Image Diffusion Generation
Hongbin Lin · Zilu Guo · Yifan Zhang · Shuaicheng Niu · Yafeng Li · Ruimao Zhang · Shuguang Cui · Zhen Li
In autonomous driving, vision-centric 3D detection aims to identify 3D objects from images. However, high data collection costs and diverse real-world scenarios limit the scale of training data. Once distribution shifts occur between training and test data, existing methods often suffer from performance degradation, known as Out-of-Distribution (OOD) problems. To address this, controllable Text-to-Image (T2I) diffusion offers a potential solution for training data enhancement, which is required to generate diverse OOD scenarios with precise 3D object geometry. Nevertheless, existing controllable T2I approaches are restricted by the limited scale of training data or struggle to preserve all annotated 3D objects. In this paper, we present DriveGEN, a method designed to improve the robustness of 3D detectors in Driving via Training-Free Controllable Text-to-Image Diffusion Generation. Without extra diffusion model training, DriveGEN consistently preserves objects with precise 3D geometry across diverse OOD generations, consisting of 2 stages: 1) Self-Prototype Extraction: we empirically find that self-attention features are semantic-aware but tend to be relatively coarse for 3D objects. Thus, we extract precise object features via layouts to capture 3D object geometry, termed self-prototypes. 2) Prototype-Guided Diffusion: To preserve objects across various OOD scenarios, we perform semantic-aware feature alignment and shallow feature alignment during denoising. Extensive experiments demonstrate the effectiveness of DriveGEN in improving 3D detectors across 13 OOD scenarios.
GLane3D: Detecting Lanes with Graph of 3D Keypoints
Halil İbrahim Öztürk · Muhammet Esat Kalfaoglu · Ozsel Kilinc
Accurate and efficient lane detection in 3D space is essential for autonomous driving systems, where robust generalization is the foremost requirement for 3D lane detection algorithms. Considering the extensive variation in lane structures worldwide, achieving high generalization capacity is particularly challenging, as algorithms must accurately identify a wide variety of lane patterns. Traditional top-down approaches rely heavily on learning lane characteristics from training datasets, often struggling with lanes exhibiting previously unseen attributes. To address this generalization limitation, we propose a method that detects keypoints of lanes and subsequently predicts sequential connections between them to construct complete 3D lanes. Each keypoint is essential for maintaining lane continuity, and we predict multiple proposals per keypoint by allowing adjacent grids to predict the same keypoint using an offset mechanism. PointNMS is employed to eliminate overlapping proposal keypoints, reducing redundancy in the estimated BEV graph and minimizing computational overhead from connection estimations. Our model surpasses previous state-of-the-art methods on both the Apollo and OpenLane datasets, demonstrating superior F1 scores and a strong generalization capacity when models trained on OpenLane are evaluated on the Apollo dataset, compared to prior approaches.
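PointNMS is only named in the abstract; a plausible reading is a greedy, radius-based suppression over scored keypoint proposals in the BEV plane. The sketch below is such an assumed variant, not the authors' implementation.

```python
# Minimal sketch of a greedy point NMS (an assumption of how "PointNMS" could
# behave, not the authors' implementation): keep the highest-scoring keypoint
# proposals and suppress neighbours within a radius in the BEV plane.
import numpy as np

def point_nms(points: np.ndarray, scores: np.ndarray, radius: float) -> np.ndarray:
    """points: (N, 2) BEV coordinates; scores: (N,); returns indices of kept points."""
    order = np.argsort(-scores)
    keep, suppressed = [], np.zeros(len(points), dtype=bool)
    for idx in order:
        if suppressed[idx]:
            continue
        keep.append(idx)
        dists = np.linalg.norm(points - points[idx], axis=1)
        suppressed |= dists < radius
    return np.array(keep)

pts = np.array([[0.0, 0.0], [0.1, 0.0], [2.0, 1.0]])
print(point_nms(pts, np.array([0.9, 0.8, 0.7]), radius=0.5))  # [0 2]
```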
UrbanCAD: Towards Highly Controllable and Photorealistic 3D Vehicles for Urban Scene Simulation
Yichong Lu · Yichi Cai · Shangzhan Zhang · Hongyu Zhou · Haoji Hu · Huimin Yu · Andreas Geiger · Yiyi Liao
Photorealistic 3D vehicle models with high controllability are essential for autonomous driving simulation and data augmentation. While handcrafted CAD models provide flexible controllability, free CAD libraries often lack the high-quality materials necessary for photorealistic rendering. Conversely, reconstructed 3D models offer high-fidelity rendering but lack controllability. In this work, we introduce UrbanCAD, a framework that pushes the frontier of the photorealism-controllability trade-off by generating highly controllable and photorealistic 3D vehicle digital twins from a single urban image and a collection of free 3D CAD models and handcrafted materials. These digital twins enable realistic 360-degree rendering, vehicle insertion, material transfer, relighting, and component manipulation such as opening doors and rolling down windows, supporting the construction of long-tail scenarios. To achieve this, we propose a novel pipeline that operates in a retrieval-optimization manner, adapting to observational data while preserving flexible controllability and fine-grained handcrafted details. Furthermore, given multi-view background perspective and fisheye images, we approximate environment lighting using fisheye images and reconstruct the background with 3DGS, enabling the photorealistic insertion of optimized CAD models into rendered novel view backgrounds. Experimental results demonstrate that UrbanCAD outperforms baselines based on reconstruction and retrieval in terms of photorealism. Additionally, we show that various perception models maintain their accuracy when evaluated on UrbanCAD with in-distribution configurations but degrade when applied to realistic out-of-distribution data generated by our method. This suggests that UrbanCAD is a significant advancement in creating photorealistic, safety-critical driving scenarios for downstream applications.
DrivingSphere: Building a High-fidelity 4D World for Closed-loop Simulation
Tianyi Yan · Dongming Wu · Wencheng Han · Junpeng Jiang · xia zhou · Kun Zhan · Cheng-Zhong Xu · Jianbing Shen
Autonomous driving evaluation requires simulation environments that closely replicate actual road conditions, including real-world sensory data and responsive feedback loops. However, many existing simulations only predict waypoints along fixed routes on public datasets or synthetic photorealistic data, i.e., open-loop simulation, and thus lack the ability to assess dynamic decision-making. While recent efforts in closed-loop simulation offer feedback-driven environments, they cannot process visual sensor inputs or produce outputs that differ from real-world data. To address these challenges, we propose DrivingSphere, a realistic and closed-loop simulation framework. Its core idea is to build a 4D world representation and generate real-life and controllable driving scenarios. Specifically, our framework includes a Dynamic Environment Composition module that constructs a detailed 4D driving world in an occupancy format equipped with static backgrounds and dynamic objects, and a Visual Scene Synthesis module that transforms this data into high-fidelity, multi-view video outputs, ensuring spatial and temporal consistency. By providing a dynamic and realistic simulation environment, DrivingSphere enables comprehensive testing and validation of autonomous driving algorithms, ultimately advancing the development of more reliable autonomous cars. The benchmark will be publicly released.
Causal Composition Diffusion Model for Closed-loop Traffic Generation
Haohong Lin · Xin Huang · Tung Phan-Minh · David S Hayden · Huan Zhang · DING ZHAO · Siddhartha Srinivasa · Eric M. Wolff · Hongge Chen
Simulation is critical for safety evaluation in autonomous driving, particularly in capturing complex interactive behaviors. However, generating realistic and controllable traffic scenarios in long-tail situations remains a significant challenge. Existing generative models suffer from the conflicting objective between user-defined controllability and realism constraints, which is amplified in safety-critical contexts. In this work, we introduce the Causal Compositional Diffusion Model (CCDiff), a structure-guided diffusion framework to address these challenges. We first formulate the learning of controllable and realistic closed-loop simulation as a constrained optimization problem. Then, CCDiff maximizes controllability while adhering to realism by automatically identifying and injecting causal structures directly into the diffusion process, providing structured guidance to enhance both realism and controllability. Through rigorous evaluations on benchmark datasets and in a closed-loop simulator, CCDiff demonstrates substantial gains over state-of-the-art approaches in generating realistic and user-preferred trajectories. Our results show CCDiff’s effectiveness in extracting and leveraging causal structures, showing improved closed-loop performance based on key metrics such as collision rate, off-road rate, FDE, and comfort.
Towards Autonomous Micromobility through Scalable Urban Simulation
Wayne Wu · Honglin He · Chaoyuan Zhang · Jack He · Seth Z. Zhao · Ran Gong · Quanyi Li · Bolei Zhou
Micromobility, which utilizes lightweight devices moving in urban public spaces - such as delivery robots and electric wheelchairs - emerges as a promising alternative to vehicular mobility. Current micromobility depends mostly on human manual operation (in-person or remote control), which raises safety and efficiency concerns when navigating busy urban environments full of obstacles and pedestrians. Assisting humans with AI agents in maneuvering micromobility devices presents a viable solution for enhancing safety and efficiency. In this work, we present a scalable urban simulation solution to advance autonomous micromobility. First, we build URBAN-SIM -- a high-performance robot learning platform for large-scale training of embodied agents in interactive urban scenes. URBAN-SIM contains three critical modules: Hierarchical Urban Generation pipeline, Interactive Dynamics Generation strategy, and Asynchronous Scene Sampling scheme, to improve the diversity, realism, and efficiency of robot learning in simulation. Then, we propose URBAN-BENCH -- a suite of essential tasks and benchmarks to gauge various capabilities of the AI agents in achieving autonomous micromobility. URBAN-BENCH includes eight tasks based on three core skills of the agents: Urban Locomotion, Urban Navigation, and Urban Traverse. We evaluate four robots with heterogeneous embodiments, such as the wheeled and legged robots, across these tasks. Experiments on diverse terrains and urban structures reveal each robot's unique strengths and limitations. This work will be open-sourced and under sustainable maintenance to foster future research in autonomous micromobility.
Towards Generalizable Trajectory Prediction using Dual-Level Representation Learning and Adaptive Prompting
Kaouther Messaoud · Matthieu Cord · Alex Alahi
Existing vehicle trajectory prediction models struggle with generalizability, prediction uncertainties, and handling complex interactions. This is often due to limitations such as complex architectures customized for a specific dataset and inefficient multimodal handling. We propose Perceiver with Register queries (PerReg+), a novel trajectory prediction framework that introduces: (1) Dual-Level Representation Learning via Self-Distillation (SD) and Masked Reconstruction (MR), capturing global context and fine-grained details. Additionally, our approach of reconstructing segment-level trajectories and lane segments from masked inputs with query drop enables effective use of contextual information and improves generalization; (2) Enhanced Multimodality using register-based queries and pretraining, eliminating the need for clustering and suppression; and (3) Adaptive Prompt Tuning during fine-tuning, freezing the main architecture and optimizing a small number of prompts for efficient adaptation. PerReg+ sets a new state of the art on nuScenes, Argoverse 2, and the Waymo Open Motion Dataset (WOMD). Remarkably, our pretrained model reduces the error by 6.8% on smaller datasets, and multi-dataset training enhances generalization. In cross-domain tests, PerReg+ reduces B-FDE by 11.8% compared to its non-pretrained variant.
Distilling Multi-modal Large Language Models for Autonomous Driving
Deepti Hegde · Rajeev Yasarla · Hong Cai · Shizhong Han · Apratim Bhattacharyya · Shweta Mahajan · Litian Liu · Risheek Garrepalli · Vishal M. Patel · Fatih Porikli
Autonomous driving demands safe motion planning, especially in critical "long-tail" scenarios. Recent end-to-end autonomous driving systems leverage large language models (LLMs) as planners to improve generalizability to rare events. However, using LLMs at test time introduces high computational costs. To address this, we propose DiMA, an end-to-end autonomous driving system that maintains the efficiency of an LLM-free (or vision-based) planner while leveraging the world knowledge of an LLM. DiMA distills the information from a multi-modal LLM to a vision-based end-to-end planner through a set of specially designed surrogate tasks. Under a joint training strategy, a scene encoder common to both networks produces structured representations that are semantically grounded as well as aligned to the final planning objective. Notably, the LLM is optional at inference, enabling robust planning without compromising on efficiency. Training with DiMA results in a 37% reduction in the L2 trajectory error and an 80% reduction in the collision rate of the vision-based planner, as well as a 44% trajectory error reduction in long-tail scenarios. DiMA also achieves state-of-the-art performance on the nuScenes planning benchmark.
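As a rough illustration of cross-model distillation with a shared scene encoder, the sketch below aligns planner features to frozen LLM features with a projection head and a cosine loss; this is purely hypothetical and not the DiMA objective or architecture.

```python
# Purely illustrative sketch (not the DiMA objective): align planner scene
# features to LLM features with a projection head and a cosine distillation
# loss, so that the LLM can be dropped at inference time.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistillHead(nn.Module):
    def __init__(self, planner_dim=256, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(planner_dim, llm_dim)

    def forward(self, planner_feat, llm_feat):
        # 1 - cosine similarity, averaged over the batch of scene tokens.
        z = F.normalize(self.proj(planner_feat), dim=-1)
        y = F.normalize(llm_feat.detach(), dim=-1)  # teacher features are frozen
        return (1.0 - (z * y).sum(dim=-1)).mean()

head = DistillHead()
loss = head(torch.randn(8, 256), torch.randn(8, 4096))
print(loss.item())
```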
RoomTour3D: Geometry-Aware Video-Instruction Tuning for Embodied Navigation
Mingfei Han · Liang Ma · Kamila Zhumakhanova · Ekaterina Radionova · Jingyi Zhang · Xiaojun Chang · Xiaodan Liang · Ivan Laptev
Vision-and-Language Navigation (VLN) suffers from the limited diversity and scale of training data, primarily constrained by the manual curation of existing simulators. To address this, we introduce RoomTour3D, a video-instruction dataset derived from web-based room tour videos that capture real-world indoor spaces and human walking demonstrations. Unlike existing VLN datasets, RoomTour3D leverages the scale and diversity of online videos to generate open-ended human walking trajectories and open-world navigable instructions. To compensate for the lack of navigation data in online videos, we perform 3D reconstruction and obtain 3D trajectories of walking paths augmented with additional information on the room types, object locations and 3D shape of surrounding scenes. Our dataset includes $\sim$100K open-ended description-enriched trajectories with $\sim$200K instructions, and 17K action-enriched trajectories from 1847 room tour environments. We demonstrate experimentally that RoomTour3D enables significant improvements across multiple VLN tasks including CVDN, SOON, R2R, and REVERIE. Moreover, RoomTour3D facilitates the development of trainable zero-shot VLN agents, showcasing the potential and challenges of advancing towards open-world navigation.
Exploration-Driven Generative Interactive Environments
Nedko Savov · Naser Kazemi · Mohammad Mahdi · Danda Paudel · Xi Wang · Luc Van Gool
Modern world models require costly and time-consuming collection of large video datasets with action demonstrations by people or by environment-specific agents. To simplify training, we focus on using many virtual environments for inexpensive, automatically collected interaction data. Genie, a recent multi-environment world model, demonstrates generalization abilities on many environments with shared behavior. Unfortunately, training their model requires expensive demonstrations. Therefore, we propose a training framework merely using a random agent in virtual environments. While the model trained in this manner exhibits good controls, it is limited by the random exploration possibilities. To address this limitation, we propose AutoExplore Agent, an exploration agent which entirely relies on the uncertainty of the world model, delivering diverse data from which it can best learn. Our agent is fully independent of environment-specific rewards, and thus adapts easily to new environments. With this approach, the pretrained multi-environment model can quickly adapt to new environments, achieving video fidelity improvement of up to 6.7 PSNR and controllability of up to 1.3 $\Delta$PSNR. In order to automatically obtain large-scale interaction datasets for pretraining, we group environments with similar behavior and controls. To this end, we annotate the behavior and controls of 975 virtual environments, a dataset that we name RetroAct. For building our model, we first create an open implementation of Genie, GenieRedux, and apply enhancements and adaptations in our version, GenieRedux-G.
Neural Motion Simulator Pushing the Limit of World Models in Reinforcement Learning
Chenjie Hao · Weyl Lu · Yifan Xu · Yubei Chen
An embodied system must not only model the patterns of the external world but also understand its own motion dynamics. A motion dynamics model is essential for efficient skill acquisition and effective planning. In this work, we introduce Neural Motion Simulator (MoSim), a world model that predicts the physical future state of an embodied system based on current observations and actions. MoSim achieves state-of-the-art performance in physical state prediction and also provides competitive performance across a range of downstream tasks. This model enables embodied systems to perform long-horizon predictions, facilitating efficient skill acquisition in imagined environments and even enabling zero-shot reinforcement learning. Furthermore, MoSim can transform any model-free reinforcement learning (RL) algorithm into a model-based approach, effectively decoupling physical environment modeling from RL algorithm development. This separation allows for independent advancements in RL algorithms and world modeling, significantly improving sample efficiency and enhancing generalization capabilities. Our findings highlight that modeling world models for motion dynamics is a promising direction for developing more versatile and capable embodied systems.
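The core interface of such a world model, predicting the next physical state from the current state and action and rolling it out for imagined training, can be sketched as follows; the MLP, the state and action dimensions, and the placeholder policy are illustrative assumptions, not MoSim itself.

```python
# Illustrative sketch only (hypothetical shapes, not MoSim itself): a learned
# dynamics model s_{t+1} = f(s_t, a_t) and an imagined rollout that a model-free
# RL algorithm could consume instead of real environment steps.
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    def __init__(self, state_dim=17, action_dim=6, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action):
        # Predict the state delta; residual prediction is a common stabiliser.
        return state + self.net(torch.cat([state, action], dim=-1))

def imagined_rollout(model, policy, s0, horizon=10):
    states, s = [s0], s0
    for _ in range(horizon):
        a = policy(s)                 # any policy, e.g. from a model-free learner
        s = model(s, a)
        states.append(s)
    return torch.stack(states)

model = DynamicsModel()
policy = lambda s: torch.tanh(torch.randn(s.shape[0], 6))  # placeholder policy
print(imagined_rollout(model, policy, torch.randn(4, 17)).shape)  # (11, 4, 17)
```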
Reasoning Mamba: Hypergraph-Guided Region Relation Calculating for Weakly Supervised Affordance Grounding
Yuxuan Wang · Aming Wu · Muli Yang · Yukuan Min · Yihang Zhu · Cheng Deng
This paper focuses on the Weakly Supervised Affordance Grounding (WSAG) task, which aims to train a model to identify affordance regions using human-object interaction images and egocentric images without the need for costly pixel-level annotations. Most existing methods usually consider the affordance regions to be isolated and directly employ class activation maps to conduct localization, ignoring their relation with other object components and weakening the performance. For example, a cup’s handle is combined with its body to achieve the pouring ability. Obviously, capturing the region relations is beneficial for improving the localization accuracy of affordance regions. To this end, we first explore exploiting hypergraphs to discover these relations and propose a Reasoning Mamba (R-Mamba) framework. We first extract feature embeddings from exocentric and egocentric images to construct hypergraphs consisting of multiple vertices and hyperedges, which capture the in-context local region relationships between different visual components. Subsequently, we design a Hypergraph-guided State Space (HSS) block to reorganize these local relationships from a global perspective. By this mechanism, the model could leverage the captured relationships to improve the localization accuracy of affordance regions. Extensive experiments and visualization analyses demonstrate the superiority of our method.
AutoURDF: Unsupervised Robot Modeling from Point Cloud Frames Using Cluster Registration
Jiong Lin · Lechen Zhang · Kwansoo Lee · Jialong Ning · Judah A Goldfeder · Hod Lipson
Robot description models are essential for simulation and control, yet their creation often requires significant manual effort. To streamline this modeling process, we introduce AutoURDF, an unsupervised approach for constructing description files for unseen robots from point cloud frames. Our method leverages a cluster-based point cloud registration model that tracks the 6-DoF transformations of point clusters. Through analyzing cluster movements, we hierarchically address the following challenges: (1) moving part segmentation, (2) body topology inference, and (3) joint parameter estimation. The complete pipeline produces robot description files that are fully compatible with existing simulators. We validate our method across a variety of robots, using both synthetic and real-world scan data. Results indicate that our approach outperforms previous methods in registration and body topology estimation accuracy, offering a scalable solution for automated robot modeling.
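For intuition, the per-cluster 6-DoF estimation underlying such registration can be illustrated with the classical Kabsch/SVD rigid fit between two frames of one cluster, assuming known point correspondences; the paper's registration model is learned and need not work this way.

```python
# A minimal sketch of estimating one cluster's 6-DoF transform between two
# frames with the Kabsch/SVD algorithm, assuming known point correspondences
# (only an illustrative building block, not the paper's learned registration model).
import numpy as np

def rigid_fit(src: np.ndarray, dst: np.ndarray):
    """src, dst: (N, 3) corresponding points. Returns R (3x3), t (3,) with dst ~ R @ src + t."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    H = (src - mu_s).T @ (dst - mu_d)                                # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])      # avoid reflections
    R = Vt.T @ D @ U.T
    t = mu_d - R @ mu_s
    return R, t

rng = np.random.default_rng(0)
src = rng.normal(size=(100, 3))
theta = np.pi / 6
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
dst = src @ R_true.T + np.array([0.5, -0.2, 1.0])
R, t = rigid_fit(src, dst)
print(np.allclose(R, R_true, atol=1e-6), np.round(t, 3))  # True [ 0.5 -0.2  1. ]
```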
Object-Centric Prompt-Driven Vision-Language-Action Model for Robotic Manipulation
Xiaoqi Li · Lingyun Xu · Mingxu Zhang · Jiaming Liu · Yan Shen · Iaroslav Ponomarenko · Jiahui Xu · Liang Heng · Siyuan Huang · Shanghang Zhang · Hao Dong
In robotic manipulation, task goals can be conveyed through various modalities, such as language, goal images, and goal videos. However, natural language can be ambiguous, while images or videos may offer overly detailed specifications. To address these challenges, we propose a novel approach using comprehensive multi-modal prompts that explicitly convey both low-level actions and high-level planning in a simple manner. Specifically, for each key-frame in the task sequence, our method allows for manual or automatic generation of simple and expressive 2D visual prompts overlaid on RGB images. These prompts represent the required task goals, such as the end-effector pose and the desired movement direction after contact. We develop a training strategy that enables the model to interpret these visual-language prompts and predict the corresponding contact poses and movement directions in SE(3) space. Furthermore, by sequentially executing all key-frame steps, the model can complete long-horizon tasks. This approach not only helps the model explicitly understand the task objectives but also enhances its robustness on unseen tasks by providing easily interpretable prompts. We evaluate our method in both simulated and real-world environments, demonstrating its robust manipulation capabilities.
RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins
Yao Mu · Tianxing Chen · Zanxin Chen · ShijiaPeng · Zhiqian Lan · Zeyu Gao · Zhixuan Liang · Qiaojun Yu · Yude Zou · Mingkun Xu · Lunkai Lin · Zhiqiang Xie · Mingyu Ding · Ping Luo
In the rapidly advancing field of robotics, dual-arm coordination and complex object manipulation are essential capabilities for developing advanced autonomous systems. However, the scarcity of diverse, high-quality demonstration data and real-world-aligned evaluation benchmarks severely limits such development. To address this, we introduce RoboTwin, a generative digital twin framework that uses 3D generative foundation models and large language models to produce diverse expert datasets and provide a real-world-aligned evaluation platform for dual-arm robotic tasks. Specifically, RoboTwin creates varied digital twins of objects from single 2D images, generating realistic and interactive scenarios. It also introduces a spatial relation-aware code generation framework that combines object annotations with large language models to break down tasks, determine spatial constraints, and generate precise robotic movement code. Our framework offers a comprehensive benchmark with both simulated and real-world data, enabling standardized evaluation and better alignment between simulated training and real-world performance. We validated our approach using the open-source COBOT Magic Robot platform. Policies pre-trained on RoboTwin-generated data and fine-tuned with limited real-world samples improve the success rate by over 70\% for single-arm tasks and over 40\% for dual-arm tasks compared to models trained solely on real-world data. This significant improvement demonstrates RoboTwin's potential to enhance the development and evaluation of dual-arm robotic manipulation systems.
VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation
Hanzhi Chen · Boyang Sun · Anran Zhang · Marc Pollefeys · Stefan Leutenegger
Future robots are envisioned as versatile systems capable of performing a variety of household tasks. The big question remains: how can we bridge the embodiment gap while minimizing physical robot learning, which fundamentally does not scale well? We argue that learning from in-the-wild human videos offers a promising solution for robotic manipulation tasks, as vast amounts of relevant data already exist on the internet. In this work, we present VidBot, a framework enabling zero-shot robotic manipulation using 3D affordances learned from in-the-wild monocular RGB-only human videos. VidBot leverages a pipeline to extract explicit representations, namely 3D hand trajectories, from such videos, combining a depth foundation model with structure-from-motion techniques to reconstruct temporally consistent, metric-scale 3D affordance representations that are agnostic to embodiment. We introduce a coarse-to-fine affordance learning model that first identifies coarse actions from the pixel space and then generates fine-grained interaction trajectories with a diffusion model, conditioned on coarse actions and guided by test-time constraints for context-aware interaction planning, enabling substantial generalization to novel scenes and embodiments. Extensive experiments demonstrate the efficacy of VidBot, which significantly outperforms counterparts across 13 manipulation tasks in zero-shot settings and can be seamlessly deployed across robot systems in real-world environments. VidBot paves the way for leveraging everyday human videos to make robot learning more scalable. Our project will be open-sourced upon acceptance.
Learning Physics-Based Full-Body Human Reaching and Grasping from Brief Walking References
Yitang Li · Mingxian Lin · Zhuo Lin · Yipeng Deng · Yue Cao · Li Yi
Existing motion generation methods based on mocap data are often limited by data quality and coverage. In this work, we propose a framework that generates diverse, physically feasible full-body human reaching and grasping motions using only brief walking mocap data. Our framework is based on the observation that walking data captures valuable movement patterns transferable across tasks, while advanced kinematic methods can generate diverse grasping poses, which can then be interpolated into motions to serve as task-specific guidance. Our approach incorporates an active data generation strategy to maximize the utility of the generated motions, along with a local feature alignment mechanism that transfers natural movement patterns from walking data to enhance both the success rate and naturalness of the synthesized motions. By combining the fidelity and stability of natural walking with the flexibility and generalizability of task-specific generated data, our method demonstrates strong performance and robust adaptability in diverse scenes and with unseen objects.
TASTE-Rob: Advancing Video Generation of Task-Oriented Hand-Object Interaction for Generalizable Robotic Manipulation
Hongxiang Zhao · Xingchen Liu · Mutian Xu · Yiming Hao · Weikai Chen · Xiaoguang Han
We address key limitations in existing datasets and models for task-oriented hand-object interaction video generation, a critical approach for generating video demonstrations for robotic imitation learning. Current datasets, such as Ego4D, often suffer from inconsistent view perspectives and misaligned interactions, leading to reduced video quality and limiting their applicability for precise imitation learning tasks. Towards this end, we introduce Roger, a pioneering large-scale dataset of 103,856 ego-centric hand-object interaction videos. Each video is meticulously aligned with language instructions and recorded from a consistent camera viewpoint to ensure interaction clarity. By fine-tuning a Video Diffusion Model (VDM) on Roger, we achieve realistic object interactions, though we observed occasional inconsistencies in hand grasping postures. To enhance realism, we introduce a three-stage pose-refinement pipeline that improves hand posture accuracy in generated videos. Our curated dataset, coupled with the specialized pose-refinement framework, provides notable performance gains in generating high-quality, task-oriented hand-object interaction videos, ultimately enabling superior and generalizable robotic manipulation. The Roger dataset will be made publicly available upon publication to foster further advancements in the field.
BimArt: A Unified Approach for the Synthesis of 3D Bimanual Interaction with Articulated Objects
Wanyue Zhang · Rishabh Dabral · Vladislav Golyanik · Vasileios Choutas · Eduardo Alvarado · Thabo Beeler · Marc Habermann · Christian Theobalt
We present BimArt, a novel generative approach for synthesizing 3D bimanual hand interactions with articulated objects. Unlike prior works, we do not rely on a reference grasp, a coarse hand trajectory, or separate modes for grasping and articulating. To achieve this, we first generate distance-based contact maps conditioned on the object trajectory with an articulation-aware feature representation, revealing rich bimanual patterns for manipulation. The learned contact prior is then used to guide our hand motion generator, producing diverse and realistic bimanual motions for object movement and articulation. Our work offers key insights into feature representation and contact prior for articulated objects, demonstrating their effectiveness in taming the complex, high-dimensional space of bimanual hand-object interactions. Through comprehensive quantitative experiments, we demonstrate a clear step towards simplified and high-quality hand-object animations that excel over the state-of-the-art in motion quality and diversity.
End-to-End HOI Reconstruction Transformer with Graph-based Encoding
Zhenrong Wang · Qi Zheng · Sihan Ma · Maosheng Ye · Yibing Zhan · Dongjiang Li
Human-object interaction (HOI) reconstruction has garnered significant attention due to its diverse applications and the success of capturing human meshes. Existing HOI reconstruction methods often rely on explicitly modeling interactions between humans and objects. However, such a way leads to a natural conflict between 3D mesh reconstruction, which emphasizes global structure, and fine-grained contact reconstruction, which focuses on local details. To address the limitations of explicit modeling, we propose the End-to-End HOI Reconstruction Transformer with Graph-based Encoding (HOI-TG). It implicitly learns the interaction between humans and objects by leveraging self-attention mechanisms. Within the transformer architecture, we devise graph residual blocks to aggregate the topology among vertices of different spatial structures. This dual focus effectively balances global and local representations. Without bells and whistles, HOI-TG achieves state-of-the-art performance on BEHAVE and InterCap datasets. Particularly on the challenging InterCap dataset, our method improves the reconstruction results for human and object meshes by 8.9% and 8.6%, respectively.
Dyn-HaMR: Recovering 4D Interacting Hand Motion from a Dynamic Camera
Zhengdi Yu · Stefanos Zafeiriou · Tolga Birdal
We propose Dyn-HaMR, to the best of our knowledge, the first approach to reconstruct 4D global hand motion from monocular videos recorded by dynamic cameras in the wild. Reconstructing accurate 3D hand meshes from monocular videos is a crucial task for understanding human behaviour, with significant applications in augmented and virtual reality (AR/VR). However, existing methods for monocular hand reconstruction typically rely on a weak perspective camera model, which simulates hand motion within a limited camera frustum. As a result, these approaches struggle to recover the full 3D global trajectory and often produce noisy or incorrect depth estimations, particularly when the video is captured by dynamic or moving cameras, which is common in egocentric scenarios. Dyn-HaMR consists of a multi-stage, multi-objective optimization pipeline that factors in (i) simultaneous localization and mapping (SLAM) to robustly estimate relative camera motion, (ii) an interacting-hand prior for generative infilling and to refine the interaction dynamics, ensuring plausible recovery under (self-)occlusions, and (iii) hierarchical initialization through a combination of state-of-the-art hand tracking methods.
EgoPressure: A Dataset for Hand Pressure and Pose Estimation in Egocentric Vision
Yiming Zhao · Taein Kwon · Paul Streli · Marc Pollefeys · Christian Holz
Estimating touch contact and pressure in egocentric vision is a central task for downstream applications in Augmented Reality, Virtual Reality, as well as many robotic applications, because it provides precise physical insights into hand-object interaction and object manipulation. However, existing contact pressure datasets lack egocentric views and hand poses, which are essential for accurate estimation during in-situ operation, both for AR/VR interaction and robotic manipulation. In this paper, we introduce a novel dataset of touch contact and pressure interaction from an egocentric perspective, complemented with hand pose meshes and fine-grained pressure intensities for each contact. The hand poses in our dataset are optimized using our proposed multi-view sequence-based method that processes footage from our capture rig of 8 accurately calibrated RGBD cameras. EgoPressure comprises 5.0 hours of touch contact and pressure interaction from 21 participants captured by a moving egocentric camera and 7 stationary Kinect cameras, which provided RGB images and depth maps at 30 Hz. In addition, we provide baselines for estimating pressure with different modalities, which will enable future developments and benchmarking on the dataset. Overall, we demonstrate that pressure and hand poses are complementary, which supports our intention to better facilitate the physical understanding of hand-object interactions in AR/VR and robotics research.
PI-HMR: Towards Robust In-bed Temporal Human Shape Reconstruction with Contact Pressure Sensing
Ziyu Wu · Yufan Xiong · Mengting Niu · Fangting Xie · Quan Wan · Qijun Ying · Boyan Liu · Xiaohui Cai
Long-term in-bed monitoring benefits automatic and real-time health management within healthcare, and the advancement of human shape reconstruction technologies further enhances the representation and visualization of users' activity patterns. However, existing technologies are primarily based on visual cues, facing serious challenges in non-line-of-sight and privacy-sensitive in-bed scenes. Pressure-sensing bedsheets offer a promising solution for real-time motion reconstruction. Yet, limited exploration in model designs and data has hindered its further development. To tackle these issues, we propose a general framework that bridges gaps in data annotation and model design. Firstly, we introduce SMPLify-IB, an optimization method that overcomes the depth ambiguity issue in top-view scenarios through gravity constraints, enabling the generation of high-quality 3D human shape annotations for in-bed datasets. Then we present PI-HMR, a temporal-based human shape estimator to regress meshes from pressure sequences. By integrating multi-scale feature fusion with high-pressure distribution and spatial position priors, PI-HMR outperforms SOTA methods with a 17.01mm decrease in Mean Per-Joint Error. This work provides a whole tool-chain to support the development of in-bed monitoring with pressure contact sensing.
MVDoppler-Pose: Multi-Modal Multi-View mmWave Sensing for Long-Distance Self-Occluded Human Walking Pose Estimation
Jae-Ho Choi · Soheil Hor · Shubo Yang · Amin Arbabian
One of the main challenges in reliable camera-based 3D pose estimation for walking subjects is to deal with self-occlusions, especially when using low-resolution cameras or at longer distances. In recent years, millimeter-wave (mmWave) radar has emerged as a promising alternative, offering inherent resilience to the effect of occlusions and distance variations. However, mmWave-based human walking pose estimation (HWPE) is still in the nascent development stages, primarily due to its unique set of practical challenges, including the dependence of the observed radar signal quality on the subject’s motion direction. This paper introduces the first comprehensive study comparing mmWave radar to camera systems for HWPE, highlighting its utility for distance-agnostic and occlusion-resilient pose estimation. Building upon mmWave’s unique advantages, we address its intrinsic directionality issue through a new approach: the synergetic integration of multi-modal, multi-view mmWave signals, achieving robust HWPE against variations both in distance and walking direction. Extensive experiments on a newly curated dataset not only demonstrate the superior potential of mmWave technology over traditional camera-based HWPE systems, but also validate the effectiveness of our approach in overcoming the core limitations of mmWave HWPE.
MotionPRO: Exploring the Role of Pressure in Human MoCap and Beyond
Shenghao Ren · Yi Lu · Jiayi Huang · Jiayi Zhao · He Zhang · Tao Yu · Qiu Shen · Xun Cao
Existing human Motion Capture (MoCap) methods mostly focus on visual similarity while neglecting physical plausibility. As a result, downstream tasks such as driving virtual humans in 3D scenes or humanoid robots in the real world suffer from issues such as timing drift and jitter, spatial problems like sliding and penetration, and poor global trajectory accuracy. In this paper, we revisit human MoCap from the perspective of the interaction between the human body and the physical world by exploring the role of pressure. Firstly, we construct a large-scale Human Motion capture dataset with Pressure, RGB and Optical sensors (named MotionPRO), which comprises 70 volunteers performing 400 types of motion. Secondly, we examine both the necessity and effectiveness of the pressure signal through two challenging tasks: (1) pose and trajectory estimation based solely on pressure: we propose a network that incorporates a small-kernel decoder and a long-short-term attention module, and prove that pressure can provide an accurate global trajectory and plausible lower-body pose. (2) pose and trajectory estimation by fusing pressure and RGB: we impose constraints on orthographic similarity along the camera axis and whole-body contact along the vertical axis to enhance the cross-attention strategy for fusing pressure and RGB feature maps. Experiments demonstrate that fusing pressure with RGB features not only significantly improves performance in terms of objective metrics but also plausibly drives virtual humans (SMPL) in 3D scenes. Furthermore, we demonstrate that incorporating physical perception enables humanoid robots to perform more precise and stable actions, which is highly beneficial for the development of embodied artificial intelligence.
MODA: Motion-Drift Augmentation for Inertial Human Motion Analysis
Yinghao Wu · Shihui Guo · Yipeng Qin
While data augmentation (DA) has been extensively studied in computer vision, its application to Inertial Measurement Unit (IMU) signals remains largely unexplored, despite IMUs' growing importance in human motion analysis. In this paper, we present the first systematic study of IMU-specific data augmentation, beginning with a comprehensive analysis that identifies three fundamental properties of IMU signals: their time-series nature, inherent multimodality (rotation and acceleration) and motion-consistency characteristics. Through this analysis, we demonstrate the limitations of applying conventional time-series augmentation techniques to IMU data. We then introduce Motion-Drift Augmentation (MODA), a novel technique that simulates the natural displacement of body-worn IMUs during motion. We evaluate our approach across five diverse datasets and five deep learning settings, including i) fully-supervised, ii) semi-supervised, iii) domain adaptation, iv) domain generalization and v) few-shot learning for both Human Action Recognition (HAR) and Human Pose Estimation (HPE) tasks. Experimental results show that our proposed MODA consistently outperforms existing augmentation methods, with semi-supervised learning performance approaching state-of-the-art fully-supervised methods.
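As a hedged illustration of the motion-drift idea (the exact MODA formulation is not reproduced here), the sketch below applies a slowly varying orientation offset, mimicking a sensor shifting on the body, to the accelerometer and gyroscope channels of an IMU sequence.

```python
# Hedged illustration of the general idea (the exact MODA formulation is not
# reproduced here): simulate a slowly drifting sensor orientation and apply it
# to the accelerometer and gyroscope channels of an IMU sequence.
import numpy as np

def drift_augment(acc: np.ndarray, gyro: np.ndarray, max_deg=10.0, seed=0):
    """acc, gyro: (T, 3). Returns copies rotated by a slowly varying z-axis drift."""
    rng = np.random.default_rng(seed)
    T = acc.shape[0]
    # Smooth drift angle: a linear ramp to a random final offset (an assumption).
    angles = np.linspace(0.0, np.deg2rad(rng.uniform(-max_deg, max_deg)), T)
    cos, sin = np.cos(angles), np.sin(angles)
    acc_aug, gyro_aug = acc.copy(), gyro.copy()
    # Per-timestep rotation about the z-axis applied to the x/y components.
    acc_aug[:, 0] = cos * acc[:, 0] - sin * acc[:, 1]
    acc_aug[:, 1] = sin * acc[:, 0] + cos * acc[:, 1]
    gyro_aug[:, 0] = cos * gyro[:, 0] - sin * gyro[:, 1]
    gyro_aug[:, 1] = sin * gyro[:, 0] + cos * gyro[:, 1]
    return acc_aug, gyro_aug

acc, gyro = np.random.randn(200, 3), np.random.randn(200, 3)
acc_aug, gyro_aug = drift_augment(acc, gyro)
print(acc_aug.shape, gyro_aug.shape)  # (200, 3) (200, 3)
```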
Homogeneous Dynamics Space for Heterogeneous Humans
Xinpeng Liu · Junxuan Liang · Chenshuo Zhang · Zixuan Cai · Cewu Lu · Yonglu Li
Analyses of human motion kinematics have achieved tremendous advances. However, the production mechanism, known as human dynamics, remains underexplored. In this paper, we aim to push data-driven human dynamics understanding forward. We identify a major obstacle to this as the heterogeneity of existing human motion understanding efforts. Specifically, heterogeneity exists not only in the diverse kinematics representations and hierarchical dynamics representations but also in the data from different domains, namely biomechanics and reinforcement learning. With an in-depth analysis of the existing heterogeneity, we propose to emphasize the underlying homogeneity: all of them represent the homogeneous fact of human motion, though from different perspectives. Given this, we propose Homogeneous Dynamics Space (HDyS) as a fundamental space for human dynamics by aggregating heterogeneous data and training a homogeneous latent space with inspiration from the inverse-forward dynamics procedure. Leveraging the heterogeneous representations and datasets, HDyS achieves decent mapping between human kinematics and dynamics. We demonstrate the feasibility of HDyS with extensive experiments and applications. Our code will be made publicly available.
Modeling Multiple Normal Action Representations for Error Detection in Procedural Tasks
Wei-Jin Huang · Yuan-Ming Li · Zhi-Wei Xia · Yu-Ming Tang · Kun-Yu Lin · Jian-Fang Hu · Wei-Shi Zheng
Error detection in procedural activities is essential for consistent and correct outcomes in AR-assisted and robotic systems. Some existing methods can only detect errors in action labels, while others can only detect errors by comparing the actual action with static prototypes. Prototype-based methods overlook situations where more than one action is valid following a sequence of executed actions. This leads to two issues: not only can the model not effectively detect errors using static prototypes when the inference environment or action execution distribution differs from training, but the model may also use the wrong prototypes to detect errors if the ongoing action's label is not the same as the predicted one. To address this problem, we propose an Adaptive Multiple Normal Action Representation (AMNAR) framework. AMNAR predicts all valid next actions and reconstructs their corresponding normal action representations, which are compared against the ongoing action to detect errors. Extensive experiments demonstrate that AMNAR achieves state-of-the-art performance, highlighting the effectiveness of AMNAR and the importance of modeling multiple valid next actions in error detection.
UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing
Yiheng Li · RuiBing Hou · Hong Chang · Shiguang Shan · Xilin Chen
Human pose plays a crucial role in the digital age. While recent works have achieved impressive progress in understanding and generating human poses, they often support only a single modality of control signals and operate in isolation, limiting their application in real-world scenarios. This paper presents UniPose, a framework employing Large Language Models (LLMs) to comprehend, generate, and edit human poses across various modalities, including images, text, and 3D SMPL poses. Specifically, we apply a pose tokenizer to convert 3D poses into discrete pose tokens, enabling seamless integration into the LLM within a unified vocabulary. To further enhance the fine-grained pose perception capabilities, we facilitate UniPose with a mixture of visual encoders, among them a pose-specific visual encoder. Benefiting from a unified learning strategy, UniPose effectively transfers knowledge across different pose-relevant tasks, adapts to unseen tasks, and exhibits extended capabilities. This work serves as the first attempt at building a general-purpose framework for pose comprehension, generation, and editing. Extensive experiments highlight UniPose's competitive and even superior performance across various pose-relevant tasks.
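The pose tokenizer mentioned above maps continuous 3D poses to discrete tokens; a generic nearest-codebook quantization conveys the idea. The codebook size, pose dimensionality, and flat-vector representation below are illustrative assumptions, not UniPose's actual tokenizer.

```python
# Generic sketch of discrete pose tokenization via nearest-codebook lookup
# (hypothetical sizes; the paper's actual tokenizer architecture may differ).
import torch

def tokenize_poses(poses: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """poses: (B, D) flattened SMPL-style pose vectors; codebook: (K, D)."""
    # Euclidean distance to every codebook entry, then argmin -> integer token id.
    d = torch.cdist(poses, codebook)    # (B, K)
    return d.argmin(dim=-1)             # (B,) pose tokens

codebook = torch.randn(512, 72)          # K=512 tokens for 72-D axis-angle poses
poses = torch.randn(4, 72)
tokens = tokenize_poses(poses, codebook)
print(tokens.shape, tokens.dtype)        # torch.Size([4]) torch.int64
```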
Symbolic Representation for Any-to-Any Generative Tasks
Jiaqi Chen · Xiaoye Zhu · Yue Wang · Tianyang Liu · Xinhui Chen · Ying Chen · Chak Tou Leong · Yifei Ke · Joseph Liu · Yiwen Yuan · Julian McAuley · Li-jia Li
We propose a symbolic generative task description language and inference engine, capable of representing arbitrary multimodal tasks as symbolic flows. The inference engine maps natural language instructions to symbolic flows, eliminating the need for task-specific training. Conventional generative models rely heavily on large-scale training and implicit neural representations to learn cross-modal mappings, which demands extensive computational resources and restricts expandability. In this paper, we propose an explicit symbolic task description language, comprising three types of primitives: functions, parameters, and topological logic. Using a pre-trained language model to infer symbolic workflows in a training-free manner, our framework successfully performs over 12 multimodal generative tasks based on user instructions, demonstrating enhanced efficiency and flexibility. Extensive experiments demonstrate that our approach can generate multimodal content competitive with, and often surpassing, that of previous state-of-the-art unified models, while offering robust interruptibility and editability. We believe that symbolic task representations are capable of cost-effectively expanding the boundaries of generative AI capabilities. All code and results are available in the Supplementary Materials.
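To make the three primitive types concrete, the toy sketch below represents a flow as nodes with a function name, parameters, and topological links to upstream nodes, then executes it against a hypothetical tool registry; the paper's actual language and engine may differ substantially.

```python
# Toy illustration of the three primitive types named in the abstract (functions,
# parameters, topological logic); the concrete language in the paper may differ.
from dataclasses import dataclass, field

@dataclass
class Node:
    function: str                                # e.g. "text_to_image", "upscale"
    params: dict = field(default_factory=dict)   # parameters of the function call
    inputs: list = field(default_factory=list)   # topology: indices of upstream nodes

def run_flow(nodes, registry):
    """Execute nodes in order, passing upstream outputs as positional arguments."""
    outputs = []
    for node in nodes:
        args = [outputs[i] for i in node.inputs]
        outputs.append(registry[node.function](*args, **node.params))
    return outputs[-1]

# Hypothetical registry of modality-specific tools.
registry = {
    "text_to_image": lambda prompt: f"<image of {prompt}>",
    "upscale": lambda img, factor=2: f"{img} x{factor}",
}
flow = [Node("text_to_image", {"prompt": "a red bicycle"}),
        Node("upscale", {"factor": 4}, inputs=[0])]
print(run_flow(flow, registry))  # <image of a red bicycle> x4
```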
SimMotionEdit: Text-Based Human Motion Editing with Motion Similarity Prediction
Zhengyuan Li · Kai Cheng · Anindita Ghosh · Uttaran Bhattacharya · Liangyan Gui · Aniket Bera
Text-based 3D human motion editing is a critical yet challenging task in computer vision and graphics. While training-free approaches have been explored, the recent release of the MotionFix dataset, which includes source-text-motion triplets, has opened new avenues for training, yielding promising results. However, existing methods struggle with precise control, often resulting in misalignment between motion semantics and language instructions. In this paper, we introduce MotionDiT, an advanced Diffusion-Transformer-based motion editing model that effectively incorporates editing features both as layer-wise control signals and as input prefixes. To enhance the model's semantic understanding, we also propose a novel auxiliary task, motion similarity prediction, which fosters the learning of semantically meaningful representations. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in both editing alignment and fidelity.
AnyMoLe: Any Character Motion In-betweening Leveraging Video Diffusion Models
Kwan Yun · Seokhyeon Hong · Chaelin Kim · Junyong Noh
Despite recent advancements in learning-based motion in-betweening, a key limitation has been overlooked: the requirement for character-specific datasets. In this work, we introduce AnyMoLe, a novel method that addresses this limitation by leveraging video diffusion models to generate motion in-between frames for arbitrary characters without external data. Our approach employs a two-stage frame generation process to enhance contextual understanding. Furthermore, to bridge the domain gap between real-world and rendered character animations, we introduce ICAdapt, a fine-tuning technique for video diffusion models. Additionally, we propose a ``motion-video mimicking'' optimization technique, enabling seamless motion generation for characters with arbitrary joint structures using 2D and 3D-aware features. AnyMoLe significantly reduces data dependency while generating smooth and realistic transitions, making it applicable to a wide range of motion in-betweening tasks.
MG-MotionLLM: A Unified Framework for Motion Comprehension and Generation across Multiple Granularities
Bizhu Wu · Jinheng Xie · Keming Shen · Zhe Kong · Jianfeng Ren · Ruibin Bai · Rong Qu · Linlin Shen
Recent motion-aware large language models have demonstrated promising potential in unifying motion comprehension and generation. However, existing studies often focus on coarse-grained motion-text modeling, limiting their ability to handle fine-grained motion-relevant tasks. To overcome this limitation, we pioneer MG-MotionLLM, a unified motion-language model for multi-granular motion comprehension and generation. We further introduce a comprehensive multi-granularity training scheme by incorporating a set of novel auxiliary tasks, such as localizing temporal boundaries of motion segments via detailed text and motion detailed captioning, to facilitate mutual reinforcement for motion-text modeling across various levels of granularity. Extensive experiments show that our MG-MotionLLM achieves superior performance on classical text-to-motion and motion-to-text tasks, and exhibits potential in novel fine-grained motion comprehension and editing tasks. Dataset and code will be released upon paper acceptance.
Rethinking Diffusion for Text-Driven Human Motion Generation: Redundant Representations, Evaluation, and Masked Autoregression
Zichong Meng · Yiming Xie · Xiaogang Peng · Zeyu Han · Huaizu Jiang
Since 2023, Vector Quantization (VQ)-based discrete generation methods have rapidly come to dominate human motion generation, primarily surpassing diffusion-based continuous generation methods in standard performance metrics. However, VQ-based methods have inherent limitations. Representing continuous motion data as a limited set of discrete tokens leads to inevitable information loss, reduces the diversity of generated motions, and restricts their ability to function effectively as motion priors or generation guidance. In contrast, the continuous-space nature of diffusion-based generation makes these methods well suited to address these limitations, with additional potential for model scalability. In this work, we systematically investigate why current VQ-based methods perform well and explore the limitations of existing diffusion-based methods from the perspective of motion data representation and distribution. Drawing on these insights, we preserve the inherent strengths of a diffusion-based human motion generation model and gradually optimize it with inspiration from VQ-based approaches. Our approach introduces a human motion diffusion model capable of bidirectional masked autoregression, optimized with a reformed data representation and distribution. Additionally, we propose more robust evaluation methods to fairly assess methods based on different generation paradigms. Extensive experiments on benchmark human motion generation datasets demonstrate that our method surpasses previous methods and achieves state-of-the-art performance.
ScaMo: Exploring the Scaling Law in Autoregressive Motion Generation Model
Shunlin Lu · Jingbo Wang · Zeyu Lu · Ling-Hao Chen · Wenxun Dai · Junting Dong · Zhiyang Dou · Bo Dai · Ruimao Zhang
The scaling law has been validated in various domains, such as natural language processing (NLP) and large-scale computer vision tasks; however, its application to motion generation remains largely unexplored. In this paper, we introduce a scalable motion generation framework that includes the motion tokenizer Motion FSQ-VAE and a text-prefix autoregressive transformer. Through comprehensive experiments, we observe the scaling behavior of this system. For the first time, we confirm the existence of scaling laws within the context of motion generation. Specifically, our results demonstrate that the normalized test loss of our prefix autoregressive models adheres to a logarithmic law with respect to the compute budget. Furthermore, we confirm power laws relating the non-vocabulary parameters, vocabulary parameters, and data tokens to the compute budget, respectively. Leveraging the scaling law, we predict the optimal transformer size, vocabulary size, and data requirements for a compute budget of 1e18. The test loss of the system, when trained with the optimal model size, vocabulary size, and required data, aligns precisely with the predicted test loss.
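The abstract states these scaling relations only qualitatively. As a hedged illustration, the logarithmic compute-loss law and the compute-allocation power laws described above can be written in the generic form below; the coefficients and exponents are placeholders, not ScaMo's fitted values.

```latex
% Generic functional forms only; a, b and the exponents are placeholders,
% not the values reported by ScaMo.
\[
  \hat{\mathcal{L}}(C) = a - b \, \log_{10} C,
  \qquad
  N^{*}(C) \propto C^{\alpha_N}, \quad
  V^{*}(C) \propto C^{\alpha_V}, \quad
  D^{*}(C) \propto C^{\alpha_D},
\]
% where C is the compute budget, N* the optimal non-vocabulary parameter count,
% V* the optimal vocabulary size, and D* the required number of data tokens.
```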
Multi-Object Tracking has been a long-standing challenge in video understanding. A natural and intuitive approach is to split this task into two parts: object detection and association. Most mainstream methods employ meticulously crafted heuristic techniques to maintain trajectory information and compute cost matrices for object matching. Although these methods can achieve notable tracking performance, they commonly encounter issues in complex scenarios and therefore often require a series of elaborate handcrafted modifications. We believe that manually assumed priors limit a method's adaptability and flexibility, preventing it from directly learning optimal tracking capabilities from domain-specific data. Therefore, we propose a new perspective that treats Multiple Object Tracking as an in-context ID Prediction task, transforming the aforementioned object association into an end-to-end trainable task. Based on this, we propose a straightforward method termed MOTIP. Without using tailored or sophisticated architectures, our method achieves state-of-the-art results across multiple benchmarks by solely leveraging object-level features as tracking cues. The simplicity and impressive results of MOTIP leave substantial room for future advancements, making it a promising baseline for subsequent research.
Shape and Texture: What Influences Reliable Optical Flow Estimation?
Libo Long · Xiao Hu · Jochen Lang
Recent methods have made significant progress in optical flow estimation. However, the evaluation of these methods mainly focuses on improved accuracy on benchmarks and often overlooks the analysis of the robustness or behavior of the networks, which can be important in safety-critical scenarios such as autonomous driving. In this paper, we propose a novel method for robustness evaluation by modifying data from original benchmarks. Unlike previous benchmarks that focus on complex scenes, we propose to modify key objects in the original images in order to analyze the sensitivity to these changes observed in the output. Our aim is to identify common failure cases of state-of-the-art (SOTA) methods to evaluate their robustness and understand their behaviors. We show that optical flow methods are more sensitive to shape changes than to texture changes, and that they tend to “remember” objects seen during training and may “ignore” the motion of unseen objects. Our experimental results and findings provide a more in-depth understanding of the behavior of recent optical flow methods, suggesting the need for more careful design, especially in safety-critical scenarios. The code and data will be made available.
Bridge Frame and Event: Common Spatiotemporal Fusion for High-Dynamic Scene Optical Flow
Hanyu Zhou · Haonan Wang · Haoyue Liu · Yuxing Duan · Yi Chang · Luxin Yan
High-dynamic scene optical flow is a challenging task that suffers from spatial blur and temporally discontinuous motion due to large displacements in frame imaging, which deteriorate the spatiotemporal features of optical flow. Existing methods typically introduce an event camera and directly fuse the spatiotemporal features of the two modalities. However, this direct fusion is ineffective, since there is a large gap between the heterogeneous data representations of the frame and event modalities. To address this issue, we explore a common latent space as an intermediate bridge to mitigate the modality gap. In this work, we propose a novel common spatiotemporal fusion between frame and event modalities for high-dynamic scene optical flow, including visual boundary localization and motion correlation fusion. Specifically, in visual boundary localization, we observe that frame and event share similar spatiotemporal gradients, whose similarity distribution is consistent with the extracted boundary distribution. This motivates us to design a common spatiotemporal gradient to constrain the reference boundary localization. In motion correlation fusion, we discover that frame-based motion possesses spatially dense but temporally discontinuous correlation, while event-based motion has spatially sparse but temporally continuous correlation. This inspires us to use the reference boundary to guide the complementary fusion of motion knowledge between the two modalities. Moreover, the common spatiotemporal fusion not only relieves the cross-modal feature discrepancy but also makes the fusion process interpretable for dense and continuous optical flow. Extensive experiments verify the superiority of the proposed method.
Unified Reconstruction of Static and Dynamic Scenes from Events
Qiyao Gao · Peiqi Duan · Hanyue Lou · Minggui Teng · Ziqi Cai · Xu Chen · Boxin Shi
This paper addresses the challenge that current event-based video reconstruction methods cannot produce static background information. Recent research has uncovered the potential of event cameras in capturing static scenes. Nonetheless, image quality deteriorates due to noise interference and detail loss, failing to provide reliable background information. We propose a two-stage reconstruction strategy to address these challenges and reconstruct static scene images comparable to frame cameras. Building on this, we introduce the URSEE framework, the first unified framework designed for reconstructing motion videos with static backgrounds. This framework includes a parallel channel that can simultaneously process static and dynamic events, and a network module designed to reconstruct videos encompassing both static and dynamic scenes in an end-to-end manner. We also collect a real-captured dataset for static reconstruction, containing both indoor and outdoor scenes. Comparison results indicate that the proposed approach achieves state-of-the-art reconstruction results on both synthetic and real data.
Learning Physics From Video: Unsupervised Physical Parameter Estimation for Continuous Dynamical Systems
Alejandro Castañeda Garcia · Jan Warchocki · Jan van Gemert · Daan Brinks · Nergis Tomen
Extracting physical dynamical system parameters from recorded observations is key in natural science. Current methods for automatic parameter estimation from video train supervised deep networks on large datasets. Such datasets require labels, which are difficult to acquire. While some unsupervised techniques--which depend on frame prediction--exist, they suffer from long training times and initialization instabilities, only consider motion-based dynamical systems, and are evaluated mainly on synthetic data. In this work, we propose an unsupervised method to estimate the physical parameters of known, continuous governing equations from single videos that is suitable for different dynamical systems beyond motion and robust to initialization. Moreover, we remove the need for frame prediction by implementing a KL-divergence-based loss function in the latent space, which avoids convergence to trivial solutions and reduces model size and compute. We first evaluate our model on synthetic data, as is commonly done, and then take the field closer to reality by recording our own real-world dataset of 75 videos covering five different types of dynamical systems to evaluate our method and others. Our method compares favorably to existing approaches. We will release all data and code.
Generating 3D-Consistent Videos from Unposed Internet Photos
Gene Chou · Kai Zhang · Sai Bi · Hao Tan · Zexiang Xu · Fujun Luan · Bharath Hariharan · Noah Snavely
We address the problem of generating videos from unposed internet photos. A handful of input images serve as keyframes, and our model interpolates between them to simulate a path moving between the cameras. Given random images, a model’s ability to capture underlying geometry, recognize scene identity, and relate frames in terms of camera position and orientation reflects a fundamental understanding of 3D structure and scene layout. However, existing video models such as Luma Dream Machine fail at this task. We design a self-supervised method that takes advantage of the consistency of videos and variability of multiview internet photos to train a scalable, 3D-aware video model without any 3D annotations such as camera parameters. We validate that our method outperforms commercial models in terms of geometric and appearance consistency. We also show our model benefits applications that enable camera control, such as 3D Gaussian Splatting. Our results suggest that we can scale up scene-level 3D learning using only 2D data such as videos and multiview internet photos.
AnimateAnything: Consistent and Controllable Animation for Video Generation
guojun lei · Chi Wang · Rong Zhang · Yikai Wang · Hong Li · Weiwei Xu
We propose a unified approach for controllable video generation, enabling text-based guidance and manual annotations, similar to camera direction guidance, to control the generation of videos. Specifically, we design a two-stage algorithm. In the first stage, we convert all control information into frame-by-frame motion flows. In the second stage, we use these motion flows as guidance to control the final video generation. Additionally, to reduce instability in the generated videos caused by large motion variations (such as those from camera movement, object motion, or manual inputs), which can result in flickering or the intermittent disappearance of objects, we transform the temporal feature computation in the video model into frequency-domain feature computation. Frequency-domain signals better capture the essential characteristics of an image, and by enforcing consistency in the video's frequency-domain features, we enhance temporal coherence and reduce flickering in the final generated video.
MotionPro: A Precise Motion Controller for Image-to-Video Generation
Zhongwei Zhang · Fuchen Long · Zhaofan Qiu · Yingwei Pan · Wu Liu · Ting Yao · Tao Mei
Animating images with interactive motion control has garnered popularity for image-to-video (I2V) generation. Modern approaches typically regard the Gaussian-filtered trajectory as the sole motion control signal. Nevertheless, the flow approximation via a Gaussian kernel limits the controllability of fine-grained movement and commonly fails to disentangle object and camera motion. To alleviate these issues, we present MotionPro, a new recipe for region-wise motion control that novelly leverages a region-wise trajectory and a motion mask to regulate fine-grained motion synthesis and identify the exact target motion category (i.e., object or camera motion), respectively. Technically, MotionPro first estimates the flow maps of each training video via a tracking model and then samples region-wise trajectories from multiple local regions to simulate the inference scenario. Instead of approximating flow distributions coarsely with a large Gaussian kernel, our region-wise trajectory provides more precise control by directly employing trajectories in local regions and thus manages to characterize fine-grained movement. A motion mask is simultaneously derived from the predicted flow maps to represent the holistic motion dynamics. To pursue natural motion control, MotionPro further strengthens video denoising with the additional conditions of the region-wise trajectory and motion mask in a feature modulation manner. More remarkably, we meticulously construct a benchmark, i.e., MC-Bench, with 1.1K user-annotated image-trajectory pairs, for the evaluation of both fine-grained and object-level I2V motion control. Extensive experiments conducted on WebVid-10M and MC-Bench demonstrate the effectiveness of MotionPro.
Generative Inbetweening through Frame-wise Conditions-Driven Video Generation
Tianyi Zhu · Dongwei Ren · Qilong Wang · Xiaohe Wu · Wangmeng Zuo
Generative inbetweening aims to generate intermediate frame sequences by utilizing two key frames as input. Although remarkable progress has been made in video generation models, generative inbetweening still faces challenges in maintaining temporal stability due to the ambiguous interpolation path between two key frames. This issue becomes particularly severe when there is a large motion gap between input frames. In this paper, we propose a straightforward yet highly effective Frame-wise Conditions-driven Video Generation (FCVG) method that significantly enhances the temporal stability of interpolated video frames. Specifically, our FCVG provides an explicit condition for each frame, making it much easier to identify the interpolation path between two input frames and thus ensuring temporally stable production of visually plausible video frames. To achieve this, we suggest extracting matched lines from two input frames that can then be easily interpolated frame by frame, serving as frame-wise conditions seamlessly integrated into existing video generation models. In extensive evaluations covering diverse scenarios such as natural landscapes, complex human poses, camera movements and animations, existing methods often exhibit incoherent transitions across frames. In contrast, our FCVG demonstrates the capability to generate temporally stable videos using both linear and non-linear interpolation curves. The source code will be publicly available.
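As a rough sketch of the frame-wise conditioning idea above (the function name and data layout are hypothetical, not the authors' implementation), matched line segments from the two key frames can be interpolated endpoint-wise to produce one explicit geometric condition per intermediate frame, with either a linear or a non-linear interpolation curve:

```python
import numpy as np

def framewise_line_conditions(lines_a, lines_b, num_frames, t_curve=None):
    """Hypothetical sketch: interpolate matched line segments frame by frame.

    lines_a, lines_b: (N, 2, 2) arrays of matched segments [[x1, y1], [x2, y2]].
    t_curve: optional array of num_frames interpolation weights in [0, 1],
             allowing non-linear interpolation curves; defaults to linear."""
    if t_curve is None:
        t_curve = np.linspace(0.0, 1.0, num_frames)
    # One interpolated set of lines per intermediate frame.
    conditions = [(1.0 - t) * lines_a + t * lines_b for t in t_curve]
    return np.stack(conditions)  # (num_frames, N, 2, 2)

# usage sketch: 16 frame-wise conditions from 5 matched lines
la, lb = np.random.rand(5, 2, 2), np.random.rand(5, 2, 2)
conds = framewise_line_conditions(la, lb, num_frames=16)
```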
FreePCA: Integrating Consistency Information across Long-short Frames in Training-free Long Video Generation via Principal Component Analysis
Jiangtong Tan · Hu Yu · Jie Huang · Jie Xiao · Feng Zhao
Long video generation involves generating extended videos using models trained on short videos, suffering from distribution shifts due to varying frame counts. It necessitates the use of local information from the original short frames to enhance visual and motion quality, and global information from the entire long frames to ensure appearance consistency. Existing training-free methods struggle to effectively integrate the benefits of both, as appearance and motion in videos are closely coupled, leading to inconsistency and poor quality. In this paper, we reveal that global and local information can be precisely decoupled into consistent appearance and motion intensity information by applying Principal Component Analysis (PCA), allowing for refined complementary integration of global consistency and local quality. With this insight, we propose FreePCA, a training-free long video generation paradigm based on PCA that simultaneously achieves high consistency and quality. Concretely, we decouple consistent appearance and motion intensity features by measuring cosine similarity in the principal component space. Critically, we progressively integrate these features to preserve original quality and ensure smooth transitions, while further enhancing consistency by reusing the mean statistics of the initial noise. Experiments demonstrate that FreePCA can be applied to various video diffusion models without requiring training, leading to substantial improvements.
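A minimal sketch of the PCA-based decoupling idea, assuming per-frame feature vectors: the top-k principal-component reconstruction plays the role of the globally consistent (appearance-like) part and the residual the local (motion-related) part, with cosine similarity measured in the principal-component space. All names and the exact split rule are assumptions, not the FreePCA algorithm itself.

```python
import torch

def pca_decouple(feats, k=8):
    """Toy sketch: split per-frame features into a top-k PCA reconstruction
    (treated as the consistent appearance part) and a residual (treated as the
    motion-related part).  feats: (T, C) tensor of per-frame features."""
    mean = feats.mean(dim=0, keepdim=True)
    centered = feats - mean
    # Top-k principal directions of the frame features.
    _, _, v = torch.pca_lowrank(centered, q=k)       # v: (C, k)
    coeffs = centered @ v                             # (T, k) PC-space scores
    appearance = coeffs @ v.T + mean                  # reconstruction from top-k PCs
    motion_residual = feats - appearance
    # Cosine similarity in the principal-component space w.r.t. the first frame.
    sim = torch.nn.functional.cosine_similarity(coeffs[:1], coeffs, dim=-1)
    return appearance, motion_residual, sim

# usage sketch on random frame features
appearance, motion, sim = pca_decouple(torch.randn(16, 512), k=8)
```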
Probability Density Geodesics in Image Diffusion Latent Space
Qingtao Yu · Jaskirat Singh · Zhaoyuan Yang · Peter Henry Tu · Jing Zhang · Richard Hartley · Hongdong Li · Dylan Campbell
Diffusion models indirectly estimate the probability density over a data space, which can be used to study its structure. In this work, we show that geodesics can be computed in diffusion latent space, where the norm induced by the spatially-varying inner product is inversely proportional to the probability density. In this formulation, a path that traverses a high density (that is, probable) region of image latent space is shorter than the equivalent path through a low density region. We present algorithms for solving the associated initial and boundary value problems and show how to compute the probability density along the path and the geodesic distance between two points. Using these techniques, we analyze how closely video clips approximate geodesics in a pre-trained image diffusion space. Finally, we demonstrate how these techniques can be applied to training-free image sequence interpolation and extrapolation, given a pre-trained image diffusion model.
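One natural reading of the inverse-density metric described above (the exact exponent on the density is an assumption) is a conformal length functional of the following form, under which paths through high-density regions of latent space are shorter:

```latex
% Illustrative conformal metric; the exponent on p is an assumption.
\[
  L[\gamma] \;=\; \int_{0}^{1} \frac{\lVert \dot{\gamma}(t) \rVert}{p\big(\gamma(t)\big)} \, dt ,
\]
% where p is the probability density estimated by the diffusion model; geodesics
% minimize this density-weighted path length between two latent points.
```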
Bias for Action: Video Implicit Neural Representations with Bias Modulation
Alper Kayabasi · Anil Kumar Vadathya · Guha Balakrishnan · Vishwanath Saragadam
We propose a new continuous video modeling framework based on implicit neural representations (INRs) called \textbf{ActINR}. At the core of our approach is the observation that INRs can be considered as a learnable dictionary, with the shapes of the basis functions governed by the weights of the INR, and their locations governed by the biases. Given compact non-linear activation functions, we hypothesize that an INR's biases are suitable to capture motion across images, and facilitate compact representations for video sequences. Using these observations, we design ActINR to share INR weights across frames of a video sequence, while using unique biases for each frame. We further model the biases as the output of a separate INR conditioned on time index to promote smoothness. By training the video INR and this bias INR together, we demonstrate unique capabilities, including 10x video slow motion, 4x spatial super resolution along with 2x slow motion, denoising, and video inpainting. ActINR performs remarkably well across numerous video processing tasks (often achieving more than 6dB improvement), setting a new standard for continuous modeling of videos.
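A minimal PyTorch-style sketch of the weight-sharing and bias-modulation idea, assuming a sinusoidal MLP and a small time-conditioned bias network. The class, layer sizes, and activation are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class BiasModulatedINR(nn.Module):
    """Sketch: layer weights are shared across all frames, while a small
    'bias INR' conditioned on the frame time predicts per-frame biases."""

    def __init__(self, hidden=256, layers=4, out_dim=3):
        super().__init__()
        dims = [2] + [hidden] * layers
        # Shared weights: linear layers whose biases are supplied externally.
        self.layers = nn.ModuleList(
            [nn.Linear(i, o, bias=False) for i, o in zip(dims[:-1], dims[1:])])
        self.head = nn.Linear(hidden, out_dim)
        self._splits = dims[1:]
        # Bias network: maps a scalar time index to all per-layer biases,
        # encouraging biases (and hence motion) to vary smoothly over time.
        self.bias_net = nn.Sequential(
            nn.Linear(1, 64), nn.GELU(), nn.Linear(64, sum(self._splits)))

    def forward(self, xy, t):
        # xy: (N, 2) pixel coordinates in [-1, 1]; t: scalar frame time in [0, 1]
        biases = self.bias_net(t.view(1, 1)).split(self._splits, dim=-1)
        h = xy
        for layer, b in zip(self.layers, biases):
            h = torch.sin(layer(h) + b)  # compact sinusoidal activation
        return self.head(h)

# usage sketch: query frame t = 0.37 at random coordinates
model = BiasModulatedINR()
rgb = model(torch.rand(1024, 2) * 2 - 1, torch.tensor(0.37))
```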
BF-STVSR: B-Splines and Fourier---Best Friends for High Fidelity Spatial-Temporal Video Super-Resolution
Eunjin Kim · HYEONJIN KIM · Kyong Hwan Jin · Jaejun Yoo
Enhancing low-resolution, low-frame-rate videos to high-resolution, high-frame-rate quality is essential for a seamless user experience, motivating advancements in Continuous Spatial-Temporal Video Super Resolution (C-STVSR). While prior methods employ Implicit Neural Representation (INR) for continuous encoding, they often struggle to capture the complexity of video data, relying on simple coordinate concatenation and a pre-trained optical flow network for motion representation. Interestingly, we find that adding position encoding, contrary to common observations, does not improve, and can even degrade, performance. This issue becomes particularly pronounced when combined with pre-trained optical flow networks, which can limit the model’s flexibility. To address these issues, we propose BF-STVSR, a C-STVSR framework with two key modules tailored to better represent the spatial and temporal characteristics of video: 1) a B-spline Mapper for smooth temporal interpolation, and 2) a Fourier Mapper for capturing dominant spatial frequencies. Our approach achieves state-of-the-art PSNR and SSIM performance, showing enhanced spatial details and natural temporal consistency. Our code will be available soon.
FLAVC: Learned Video Compression with Feature Level Attention
Chun Zhang · Heming Sun · Jiro Katto
Learned Video Compression (LVC) aims to reduce redundancy in sequential data through deep learning approaches. Recent advances have significantly boosted LVC performance by shifting compression operations to the feature domain, often combining a Motion Estimation and Motion Compensation (MEMC) module with CNN-based context extraction. However, the reliance on motion and convolution-driven context models limits generalizability and global perception. To address these issues, we propose a Feature-level Attention (FLA) module within a Transformer-based framework that explicitly perceives the full frame, thus bypassing confined motion signatures. FLA accomplishes global perception by converting high-level local patch embeddings into one-dimensional batch-wise vectors and replacing traditional attention weights with a global context matrix. In addition, a dense overlapping patcher (DP) is introduced to retain local features before embedding projection. Furthermore, a Transformer-CNN mixed encoder is applied to alleviate the spatial feature bottleneck without expanding the latent size. Experiments demonstrate excellent generalizability with universally efficient redundancy reduction in different scenarios. Extensive tests on four video compression datasets show that our method achieves state-of-the-art Rate-Distortion performance compared to existing LVC methods and traditional codecs. A down-scaled version of our model reduces computational overhead by a large margin while maintaining strong performance.
ProReflow: Progressive Reflow with Decomposed Velocity
Lei Ke · Haohang Xu · Xuefei Ning · Yu Li · Jiajun Li · Haoling Li · Yuxuan Lin · Dongsheng Jiang · Yujiu Yang · Linfeng Zhang
Diffusion models have achieved significant progress in both image and video generation but still suffer from huge computation costs. As an effective solution, flow matching aims to reflow the diffusion process into a straight line for few-step and even one-step generation. However, in this paper, we suggest that the original training pipeline of flow matching is not optimal and introduce two techniques to improve it. First, we introduce progressive reflow, which progressively reflows the diffusion model over local timesteps until the whole diffusion process is covered, reducing the difficulty of flow matching. Second, we introduce aligned v-prediction, which highlights the importance of direction matching over magnitude matching in flow matching. Our experimental results on SDv1.5 demonstrate that our method achieves an FID of 10.70 on the MSCOCO2014 validation set with only 4 sampling steps, close to our teacher model (32 DDIM steps, FID = 10.05). Our code will be released on GitHub.
Making Old Film Great Again: Degradation-aware State Space Model for Old Film Restoration
Yudong Mao · Hao Luo · Zhiwei Zhong · Peilin CHEN · Zhijiang Zhang · Shiqi Wang
Unlike modern native digital videos, the restoration of old films requires addressing specific degradations inherent to analog sources. However, existing specialized methods still fall short compared to general video restoration techniques. In this work, we propose a new baseline to re-examine the challenges in old film restoration. First, we develop an improved Mamba-based framework, dubbed MambaOFR, which can dynamically adjust the degradation removal patterns by generating degradation-aware prompts to tackle the complex and composite degradations present in old films. Second, we introduce a flow-guided mask deformable alignment module to mitigate the propagation of structured defect features in the temporal domain. Third, we introduce the first benchmark dataset that includes both synthetic and real-world old film clips. Extensive experiments show that the proposed method achieves state-of-the-art performance, outperforming existing advanced approaches in old film restoration. The implementation and model will be released.
Towards a Universal Synthetic Video Detector: From Face or Background Manipulations to Fully AI-Generated Content
Rohit Kundu · Hao Xiong · Vishal Mohanty · Athula Balachandran · Amit K. Roy-Chowdhury
Existing DeepFake detection techniques primarily focus on facial manipulations, such as face-swapping or lip-syncing. However, advancements in text-to-video (T2V) and image-to-video (I2V) generative models now allow fully AI-generated synthetic content and seamless background alterations, challenging face-centric detection methods and demanding more versatile approaches. To address this, we introduce the Universal Network for Identifying Tampered and Engineered videos (UNITE) model, which, unlike traditional detectors, captures full-frame manipulations. UNITE extends detection capabilities to scenarios without faces, non-human subjects, and complex background modifications. It leverages a transformer-based architecture that processes domain-agnostic features extracted from videos via the SigLIP-So400M foundation model. Given limited datasets encompassing both facial/background alterations and T2V/I2V content, we integrate task-irrelevant data alongside standard DeepFake datasets in training. We further mitigate the model’s tendency to over-focus on faces by incorporating an attention-diversity (AD) loss, which promotes diverse spatial attention across video frames. Combining AD loss with cross-entropy improves detection performance across varied contexts. Comparative evaluations demonstrate that UNITE outperforms state-of-the-art detectors on datasets (in cross-data settings) featuring face/background manipulations and fully synthetic T2V/I2V videos, showcasing its adaptability and generalizable detection capabilities.
A Polarization-Aided Transformer for Image Deblurring via Motion Vector Decomposition
Duosheng Chen · Shihao Zhou · Jinshan Pan · Jinglei Shi · lishen qu · Jufeng Yang
Effectively leveraging motion information is crucial for the image deblurring task. Existing methods typically build deep-learning models to restore a clean image by estimating blur patterns over the entire movement. This suggests that the blur caused by rotational motion components is processed together with the translational one. Exploring the movement without separation leads to limited performance for complex motion deblurring, especially rotational motion. In this paper, we propose Motion Decomposition Transformer (MDT), a transformer-based architecture augmented with polarized modules for deblurring via motion vector decomposition. MDT consists of a Motion Decomposition Module (MDM) for extracting hybrid rotation and translation features, and a Radial Stripe Attention Solver (RSAS) for sharp image reconstruction with enhanced rotational information. Specifically, the MDM uses a deformable Cartesian convolutional branch to capture translational motion, complemented by a polar-system branch to capture rotational motion. The RSAS employs radial stripe windows and angular relative positional encoding in the polar system to enhance rotational information. This design preserves translational details while keeping computational costs lower than dual-coordinate design. Experimental results on 6 image deblurring datasets show that MDT outperforms state-of-the-art methods, particularly in handling blur caused by complex motions with significant rotational components.
Satellite Observations Guided Diffusion Model for Accurate Meteorological States at Arbitrary Resolution
Siwei Tu · Ben Fei · Weidong Yang · Fenghua Ling · Hao Chen · Zili Liu · Kun Chen · Hang Fan · Wanli Ouyang · Lei Bai
Accurate acquisition of surface meteorological conditions at arbitrary locations holds significant importance for weather forecasting and climate simulation. Because meteorological states derived from satellite observations are often provided in the form of low-resolution grid fields, the direct application of spatial interpolation to obtain meteorological states for specific locations often results in significant discrepancies when compared to actual observations. Existing downscaling methods for acquiring meteorological state information at higher resolutions commonly overlook the correlation with satellite observations. To bridge this gap, we propose the $\textbf{S}$atellite-observations $\textbf{G}$uided $\textbf{D}$iffusion Model ($\textbf{SGD}$), a conditional diffusion model pre-trained on ERA5 reanalysis data with satellite observations (GridSat) as conditions, which is employed for sampling downscaled meteorological states through a zero-shot guided sampling strategy and patch-based methods. During training, we fuse the information from GridSat satellite observations into ERA5 maps via an attention mechanism, enabling SGD to generate atmospheric states that align more accurately with actual conditions. During sampling, we employ optimizable convolutional kernels to simulate the upscaling process, thereby generating high-resolution ERA5 maps using low-resolution ERA5 maps as well as observations from weather stations as guidance. Moreover, our devised patch-based method enables SGD to generate meteorological states at arbitrary resolutions. Experiments demonstrate that SGD achieves accurate downscaling of meteorological states to 6.25 km.
Automatic Spectral Calibration of Hyperspectral Images: Method, Dataset and Benchmark
Zhuoran Du · Shaodi You · Cheng Cheng · Shikui Wei
A hyperspectral image (HSI) densely samples the world in both the spatial and frequency domains and is therefore more distinctive than an RGB image. Usually, an HSI needs to be calibrated to minimize the impact of varying illumination conditions. The traditional way to calibrate an HSI utilizes a physical reference, which involves manual operation, introduces occlusions, and/or limits camera mobility. These limitations inspire this paper to calibrate HSIs automatically using a learning-based method. Towards this goal, a large-scale HSI calibration dataset is created, which has 765 high-quality HSI pairs covering diverse natural scenes and illuminations. The dataset is further expanded to 7650 pairs by combining it with 10 different physically measured illuminations. A spectral illumination transformer (SIT) together with an illumination attention module is proposed. Extensive benchmarks demonstrate the SoTA performance of the proposed SIT. The benchmarks also indicate that low-light conditions are more challenging than normal conditions. The dataset and code are anonymously available online: https://anonymous.4open.science/r/Automatic-spectral-calibration-of-HSI-0C5A
VolFormer: Explore More Comprehensive Cube Interaction for Hyperspectral Image Restoration and Beyond
Dabing Yu · Zheng Gao
Capitalizing on the strength of self-attention in capturing non-local features, Transformer architectures have exhibited remarkable performance in single hyperspectral image restoration. In a hyperspectral image, each pixel lies within an image cube with a large spectral dimension and two spatial dimensions. Although uni-dimensional self-attention, such as channel self-attention or spatial self-attention, builds long-range dependencies in the spectral or spatial dimension, it lacks more comprehensive interactions across dimensions. To tackle this drawback, we propose VolFormer, a volumetric self-attention embedded Transformer network for single hyperspectral image restoration. Specifically, we propose volumetric self-attention (VolSA), which extends the interaction from 2D planes to 3D cubes. VolSA can simultaneously model token interactions in the 3D cube, mining the potential correlations within the hyperspectral image cube. An attention decomposition form is proposed to reduce the computational burden of modeling volumetric information. In practical terms, VolSA adopts dual similarity matrices in the spatial and channel dimensions to implicitly model 3D context information while reducing the complexity from cubic to quadratic. Additionally, we introduce an explicit spectral location prior to enhance the proposed self-attention. This property allows the target token to perceive global spectral information while assigning different levels of attention to tokens at varying wavelength bands. Extensive experiments demonstrate that VolFormer achieves record-high performance on hyperspectral image super-resolution, denoising, and classification benchmarks. Notably, VolSA is portable and achieves promising results in hyperspectral classification. The source code is available in the supplementary material.
One Model for ALL: Low-Level Task Interaction Is a Key to Task-Agnostic Image Fusion
Chunyang Cheng · Tianyang Xu · Zhenhua Feng · Xiaojun Wu · Zhangyong Tang · Hui Li · Zhang Zeyang · Sara Atito · Muhammad Awais · Josef Kittler
Advanced image fusion methods mostly prioritise high-level tasks, where task interaction struggles with semantic gaps and requires complex bridging mechanisms. In contrast, we propose to leverage low-level vision tasks from digital photography fusion, allowing for effective feature interaction through pixel-level supervision. This new paradigm provides strong guidance for unsupervised multimodal fusion without relying on abstract semantics, enhancing task-shared feature learning for broader applicability. Owing to the hybrid image features and enhanced universal representations, the proposed GIFNet supports diverse fusion tasks, achieving high performance across both seen and unseen scenarios with a single model. Uniquely, experimental results reveal that our framework also supports single-modality enhancement, offering superior flexibility for practical applications. Our code will be released.
Continuous Adverse Weather Removal via Degradation-Aware Distillation
Xin Lu · Jie Xiao · Yurui Zhu · Xueyang Fu
All-in-one models for adverse weather removal aim to process various degraded images using a single set of parameters, making them ideal for real-world scenarios. However, they encounter two main challenges: catastrophic forgetting and limited degradation awareness. The former causes the model to lose knowledge of previously learned scenarios, reducing its overall effectiveness, while the latter hampers the model’s ability to accurately identify and respond to specific types of degradation, limiting its performance across diverse adverse weather conditions. To address these issues, we introduce the Incremental Learning Adverse Weather Removal (ILAWR) framework, which uses a novel degradation-aware distillation strategy for continuous weather removal. Specifically, we first design a degradation-aware module that utilizes Fourier priors to capture a broad range of degradation features, effectively mitigating catastrophic forgetting in low-level visual tasks. Then, we implement multilateral distillation, which combines knowledge from multiple teacher models using an importance-guided aggregation approach. This enables the model to balance adaptation to new degradation types with the preservation of background details. Extensive experiments on both synthetic and real-world datasets confirm that ILAWR outperforms existing models across multiple benchmarks, proving its effectiveness in continuous adverse weather removal.
MambaIRv2: Attentive State Space Restoration
Hang Guo · Yong Guo · Yaohua Zha · Yulun Zhang · Wenbo Li · Tao Dai · Shu-Tao Xia · Yawei Li
Mamba-based image restoration backbones have recently demonstrated significant potential in balancing global reception and computational efficiency. However, the inherent causal modeling limitation of Mamba, where each token depends solely on its predecessors in the scanned sequence, restricts the full utilization of pixels across the image and thus presents new challenges in image restoration. In this work, we propose MambaIRv2, which equips Mamba with a non-causal modeling ability similar to ViTs to reach an attentive state space restoration model. Specifically, the proposed attentive state-space equation allows attention beyond the scanned sequence and facilitates image unfolding with just a single scan. Moreover, we further introduce a semantic-guided neighboring mechanism to encourage interaction between distant but similar pixels. Extensive experiments show that our MambaIRv2 outperforms SRFormer by up to 0.35 dB PSNR on lightweight SR even with 9.3% fewer parameters, and surpasses HAT on classic SR by up to 0.29 dB.
TSP-Mamba: The Travelling Salesman Problem Meets Mamba for Image Super-resolution and Beyond
Kun Zhou · Xinyu Lin · Jiangbo Lu
Recently, Mamba-based frameworks have achieved substantial advancements across diverse computer vision and NLP tasks, particularly in their capacity for reasoning over long-range information with linear complexity. However, the fixed 2D-to-1D scanning pattern overlooks the local structures of an image, limiting its effectiveness in aggregating 2D spatial information. While stacking additional Mamba layers can partially address this issue, it increases the parameter count and constrains real-time application. In this work, we reconsider the locally optimal scanning path in Mamba, enhancing the rigid and uniform 1D scan through local shortest-path theory, thus creating a structure-aware Mamba suited for lightweight single-image super-resolution. Specifically, we draw inspiration from the Traveling Salesman Problem (TSP) to establish a locally optimal scanning path for improved use of 2D structural information. Here, local patch aggregation occurs in a content-adaptive manner with minimal propagation cost. TSP-Mamba demonstrates substantial improvements over existing Mamba-based and Transformer-based architectures. For example, TSP-Mamba surpasses MambaIR by up to 0.7 dB in lightweight SISR, with comparable parameters and only slightly higher computational demands (1-2 GFlops for 720P images).
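For intuition only, the sketch below orders patch tokens with a greedy nearest-neighbour tour, a classic cheap TSP heuristic, so that consecutive tokens in the resulting 1D scan are similar in feature space. The actual TSP-Mamba scanning rule may differ; all names here are illustrative.

```python
import torch

def greedy_scan_order(patch_feats):
    """Illustrative greedy nearest-neighbour tour over patch features, giving a
    content-adaptive 1D scan order with low total 'propagation cost'.

    patch_feats: (N, C) features of N patches. Returns a permutation of 0..N-1."""
    n = patch_feats.shape[0]
    dist = torch.cdist(patch_feats, patch_feats)  # (N, N) pairwise costs
    visited = torch.zeros(n, dtype=torch.bool)
    order = [0]
    visited[0] = True
    for _ in range(n - 1):
        d = dist[order[-1]].clone()
        d[visited] = float("inf")                  # never revisit a patch
        nxt = int(torch.argmin(d))
        order.append(nxt)
        visited[nxt] = True
    return torch.tensor(order)

# usage sketch: reorder tokens before a 1D (e.g. Mamba-style) scan
tokens = torch.randn(64, 192)
scanned = tokens[greedy_scan_order(tokens)]
```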
Semi-Supervised State-Space Model with Dynamic Stacking Filter for Real-World Video Deraining
Shangquan Sun · Wenqi Ren · Juxiang Zhou · Shu Wang · Jianhou Gan · Xiaochun Cao
Significant progress has been made in video restoration under rainy conditions over the past decade, largely propelled by advancements in deep learning. Nevertheless, existing methods that depend on paired data struggle to generalize effectively to real-world scenarios, primarily due to the disparity between synthetic and authentic rain effects. To address these limitations, we propose a dual-branch spatio-temporal state-space model to enhance rain streak removal in video sequences. Specifically, we design spatial and temporal state-space model layers to extract spatial features and incorporate temporal dependencies across frames, respectively. To improve multi-frame feature fusion, we derive a dynamic stacking filter, which adaptively approximates statistical filters for superior pixel-wise feature refinement. Moreover, we integrate a median stacking loss to enable semi-supervised learning by generating pseudo-clean patches based on the sparsity prior of rain. To further explore the capacity of deraining models in supporting other vision-based tasks in rainy environments, we introduce a novel real-world benchmark focused on object detection and tracking in rainy conditions. Our method is extensively evaluated across multiple benchmarks containing numerous synthetic and real-world rainy videos, consistently demonstrating its superiority in quantitative metrics, visual quality, efficiency, and its utility for downstream tasks. Our code will be made publicly available.
GenDeg: Diffusion-based Degradation Synthesis for Generalizable All-In-One Image Restoration
Sudarshan Rajagopalan · Nithin Gopalakrishnan Nair · Jay Paranjape · Vishal M. Patel
Deep learning–based models for All-In-One image Restoration (AIOR) have achieved significant advancements in recent years. However, their practical applicability is limited by poor generalization to samples outside the training distribution. This limitation arises primarily from insufficient diversity in degradation variations and scenes within existing datasets, resulting in inadequate representations of real-world scenarios. Additionally, capturing large-scale real-world paired data for degradations such as haze, low-light, and raindrops is often cumbersome and sometimes infeasible. In this paper, we leverage the generative capabilities of latent diffusion models to synthesize high-quality degraded images from their clean counterparts. Specifically, we introduce GenDeg, a degradation and intensity-aware conditional diffusion model, capable of producing diverse degradation patterns on clean images. Using GenDeg, we synthesize over $550$k samples across six degradation types: haze, rain, snow, motion blur, low-light, and raindrops. These generated samples are integrated with existing datasets to form the GenDS dataset, comprising over $750$k samples. Our experiments reveal that image restoration models trained on GenDS dataset exhibit significant improvements in out-of-distribution performance as compared to when trained solely on existing datasets. Furthermore, we provide comprehensive analyses on implications of diffusion model-based synthetic degradations for AIOR. The code will be made publicly available.
Generalized Recorrupted-to-Recorrupted: Self-Supervised Learning Beyond Gaussian Noise
Brayan Monroy · Jorge Bacca · Julián Tachella
Recorrupted-to-Recorrupted (R2R) has emerged as a methodology for training deep networks for image restoration in a self-supervised manner from noisy measurement data alone, demonstrating equivalence in expectation to the supervised squared loss in the case of Gaussian noise. However, its effectiveness with non-Gaussian noise remains unexplored. In this paper, we propose Generalized R2R (GR2R), extending the R2R framework to a broader class of noise distributions, covering additive noise such as log-Rayleigh as well as the natural exponential family, including the Poisson, Gamma, and Binomial distributions, which play a key role in many applications such as low-photon imaging and synthetic aperture radar. We show that the GR2R loss is an unbiased estimator of the supervised loss and that the popular Stein's unbiased risk estimator can be seen as a special case. A series of experiments with Gaussian, Poisson, and Gamma noise validates GR2R's performance, showing its effectiveness compared to other self-supervised methods.
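For reference, the Gaussian special case that GR2R generalizes pairs two recorrupted copies of each noisy measurement; the notation below follows the usual R2R construction and is written from general knowledge rather than taken from the paper.

```latex
% Gaussian R2R recorruption (notation assumed): given y = x + n, n ~ N(0, sigma^2 I),
% draw z ~ N(0, I) and form
\[
  \hat{y} = y + \alpha \sigma z,
  \qquad
  \tilde{y} = y - \tfrac{\sigma}{\alpha} z,
\]
% so that the self-supervised loss E\|f(\hat{y}) - \tilde{y}\|^2 equals the
% supervised loss E\|f(\hat{y}) - x\|^2 up to a constant; GR2R extends this
% construction to noise beyond the Gaussian case.
```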
Degradation-Aware Feature Perturbation for All-in-One Image Restoration
Xiangpeng Tian · Xiangyu Liao · Xiao Liu · Meng Li · Chao Ren
All-in-one image restoration aims to recover clear images from various degradation types and levels with a unified model. Nonetheless, the significant variations among degradation types present challenges for training a universal model, often resulting in task interference, where the gradient update directions of different tasks may diverge due to shared parameters. To address this issue, motivated by the routing strategy, we propose DFPIR, a novel all-in-one image restorer that introduces Degradation-aware Feature Perturbations (DFP) to adjust the feature space to align with the unified parameter space. In this paper, the feature perturbations primarily include channel-wise perturbations and attention-wise perturbations. Specifically, channel-wise perturbations are implemented by shuffling the channels in high-dimensional space guided by degradation types, while attention-wise perturbations are achieved through selective masking in the attention space. To achieve these goals, we propose a Degradation-Guided Perturbation Block (DGPB) to implement these two functions, positioned between the encoding and decoding stages of the encoder-decoder architecture. Extensive experimental results demonstrate that DFPIR achieves state-of-the-art performance on several all-in-one image restoration tasks including image denoising, image dehazing, image deraining, motion deblurring, and low-light image enhancement. All the source code and trained models will be made available to the public.
Exploring Semantic Feature Discrimination for Perceptual Image Super-Resolution and Opinion-Unaware No-Reference Image Quality Assessment
Guanglu Dong · Xiangyu Liao · Mingyang Li · Guihuan Guo · Chao Ren
Generative Adversarial Networks (GANs) have been widely applied to image super-resolution (SR) to enhance the perceptual quality. However, most existing GAN-based SR methods typically perform coarse-grained discrimination directly on images and ignore the semantic information of images, making it challenging for the super-resolution network (SRN) to learn fine-grained and semantic-related texture details. To alleviate this issue, we propose a semantic feature discrimination method, SFD, for perceptual SR. Specifically, we first design a feature discriminator (Feat-D) to discriminate the pixel-wise intermediate semantic features from CLIP, aligning the feature distributions of SR images with those of high-quality images. Additionally, we propose a text-guided discrimination method (TG-D) by introducing learnable prompt pairs (LPP) in an adversarial manner to perform discrimination on the more abstract output feature of CLIP, further enhancing the discriminative ability of our method. With both Feat-D and TG-D, our SFD can effectively distinguish between the semantic feature distributions of low-quality and high-quality images, encouraging SRN to generate more realistic and semantic-relevant textures. Furthermore, based on the trained Feat-D and LPP, we propose a novel opinion-unaware no-reference image quality assessment (OU NR-IQA) method, SFD-IQA, greatly improving OU NR-IQA performance without any additional targeted training. Extensive experiments on classical SISR, real-world SISR, and OU NR-IQA tasks demonstrate the effectiveness of our proposed methods.
FaithDiff: Unleashing Diffusion Priors for Faithful Image Super-resolution
Junyang Chen · Jinshan Pan · Jiangxin Dong
Faithful image super-resolution (SR) not only needs to recover images that appear realistic, similar to image generation tasks, but also requires that the restored images maintain fidelity and structural consistency with the input. To this end, we propose a simple and effective method, named FaithDiff, to fully harness the impressive power of latent diffusion models (LDMs) for faithful image SR. In contrast to existing diffusion-based SR methods that freeze the diffusion model pre-trained on high-quality images, we propose to unleash the diffusion prior to identify useful information and recover faithful structures. As there exists a significant gap between the features of degraded inputs and the noisy latents of the diffusion model, we develop an effective alignment module to explore useful features from degraded inputs that align well with the diffusion process. Considering the indispensable roles and interplay of the encoder and diffusion model in LDMs, we jointly fine-tune them in a unified optimization framework, enabling the encoder to extract useful features that coincide with the diffusion process. Extensive experimental results demonstrate that FaithDiff outperforms state-of-the-art methods, providing high-quality and faithful SR results.
DEAL: Data-Efficient Adversarial Learning for High-Quality Infrared Imaging
Zhu Liu · Zijun Wang · Jinyuan Liu · Fanqi Meng · Long Ma · Risheng Liu
Thermal imaging is often compromised by dynamic, complex degradations caused by hardware limitations and unpredictable environmental factors. The scarcity of high-quality infrared data, coupled with the challenges of dynamic, intricate degradations, makes it difficult to recover details using existing methods. In this paper, we introduce thermal degradation simulation integrated into the training process via a mini-max optimization, modeling these degradation factors as adversarial attacks on thermal images. The simulation is dynamic, maximizing the objective function so as to capture a broad spectrum of degraded data distributions. This approach enables training with limited data, thereby improving model performance. Additionally, we introduce a dual-interaction network that combines the benefits of spiking neural networks with scale transformation to capture degraded features with sharp spike signal intensities. This architecture ensures compact model parameters while preserving efficient feature representation. Extensive experiments demonstrate that our method not only achieves superior visual quality under diverse single and composite degradations, but also delivers a significant reduction in processing overhead when trained on only fifty clear images, outperforming existing techniques in efficiency and accuracy.
Adversarial Diffusion Compression for Real-World Image Super-Resolution
Bin Chen · Gehui Li · Rongyuan Wu · Xindong Zhang · Jie Chen · Jian Zhang · Lei Zhang
Real-world image super-resolution (Real-ISR) aims to reconstruct high-resolution images from low-resolution inputs degraded by complex, unknown processes. While many Stable Diffusion (SD)-based Real-ISR methods have achieved remarkable success, their slow, multi-step inference hinders practical deployment. Recent SD-based one-step networks like OSEDiff and S3Diff alleviate this issue but still incur high computational costs due to their reliance on large pretrained SD models. This paper proposes a novel Real-ISR method, **AdcSR**, by distilling the one-step diffusion network OSEDiff into a streamlined diffusion-GAN model under our **A**dversarial **D**iffusion **C**ompression (**ADC**) framework. We meticulously examine the modules of OSEDiff, categorizing them into two types: **(1) Removable** (VAE encoder, prompt extractor, text encoder, *etc.*) and **(2) Prunable** (denoising UNet and VAE decoder). Since direct removal and pruning can degrade the model's generation capability, we pretrain our pruned VAE decoder to restore its ability to decode images and employ adversarial distillation to compensate for performance loss. This ADC-based diffusion-GAN hybrid design effectively reduces complexity by 73\% in inference time, 78\% in computation, and 74\% in parameters, while preserving the model’s generation capability. Experiments demonstrate that our proposed AdcSR achieves competitive recovery quality on both synthetic and real-world datasets, offering up to 9.3$\times$ speedup over previous one-step diffusion-based methods. Code and models will be made available.
All-Optical Nonlinear Diffractive Deep Network for Ultrafast Image Denoising
Xiaoling Zhou · Zhemg Lee · Wei Ye · Rui Xie · Wenbo Zhang · Guanju Peng · Zongze Li · Shikun Zhang
Image denoising poses a significant challenge in image processing, aiming to remove noise and artifacts from input images. However, current denoising algorithms implemented on electronic chips frequently encounter latency issues and demand substantial computational resources. In this paper, we introduce an all-optical Nonlinear Diffractive Denoising Deep Network (N3DNet) for image denoising at the speed of light. Initially, we incorporate an image encoding and pre-denoising module into the Diffractive Deep Neural Network and integrate a nonlinear activation function, termed the phase exponential linear function, after each diffractive layer, thereby boosting the network's nonlinear modeling and denoising capabilities. Subsequently, we devise a new reinforcement learning algorithm called regularization-assisted deep Q-network to optimize N3DNet. Finally, leveraging 3D printing techniques, we fabricate N3DNet using the trained parameters and construct a physical experimental system for real-world applications. A new benchmark dataset, termed MIDD, is constructed for mode image denoising, comprising 120K pairs of noisy/noise-free images captured from real fiber communication systems across various transmission lengths. Through extensive simulation and real experiments, we validate that N3DNet outperforms both traditional and deep learning-based denoising approaches across various datasets. Remarkably, its processing speed is nearly 3,800 times faster than electronic chip-based methods.
Deterministic Image-to-Image Translation via Denoising Brownian Bridge Models with Dual Approximators
Bohan Xiao · PEIYONG WANG · Qisheng He · Ming Dong
Image-to-Image (I2I) translation involves converting an image from one domain to another. Deterministic I2I translation, such as in image super-resolution, extends this concept by guaranteeing that each input generates a consistent and predictable output, closely matching the ground truth (GT) with high fidelity. In this paper, we propose a denoising Brownian bridge model with dual approximators (Dual-approx Bridge), a novel generative model that exploits the Brownian bridge dynamics and two neural network-based approximators (one for the forward and one for the reverse process) to produce faithful output with negligible variance and high image quality in I2I translations. Our extensive experiments on benchmark datasets, including image generation and super-resolution, demonstrate the consistent and superior performance of Dual-approx Bridge in terms of image quality and faithfulness to the GT when compared to both stochastic and deterministic baselines. Project page and code: https://github.com/bohan95/dual-app-bridge
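For context, the standard Brownian bridge marginal that such models build on pins the process at both image endpoints, which is what removes most of the sampling variance; the exact variance scaling used by Dual-approx Bridge may differ.

```latex
% Standard Brownian bridge between a target x_0 (t = 0) and a source y (t = T).
\[
  x_t = \Big(1 - \tfrac{t}{T}\Big) x_0 + \tfrac{t}{T}\, y
        + \sqrt{\tfrac{t (T - t)}{T}}\, \epsilon,
  \qquad \epsilon \sim \mathcal{N}(0, I),
\]
% the variance vanishes at both endpoints, so the mapping between the two image
% domains is (nearly) deterministic.
```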
Towards Smart Point-and-Shoot Photography
Jiawan Li · Fei Zhou · Zhipeng Zhong · Jiongzhi Lin · Guoping Qiu
Hundreds of millions of people routinely take photos using their smartphones as point-and-shoot (PAS) cameras, yet very few have the photography skills to compose a good shot of a scene. While traditional PAS cameras have built-in functions to ensure a photo is well focused and has the right brightness, they cannot tell users how to compose the best shot of a scene. In this paper, we present a first-of-its-kind smart point-and-shoot (SPAS) system to help users take good photos. Our SPAS helps users compose a good shot of a scene by automatically guiding them to adjust the camera pose live on the scene. We first constructed a large dataset containing $320K$ images with camera pose information from 4000 scenes. We then developed an innovative CLIP-based Composition Quality Assessment (CCQA) model to assign pseudo labels to these images. CCQA introduces a unique learnable text embedding technique to learn continuous word embeddings capable of discerning subtle visual quality differences in the range covered by five quality description words {$bad, poor, fair, good, perfect$}. Finally, we developed a camera pose adjustment model (CPAM), which first determines whether the current view can be further improved and, if so, outputs the adjustment suggestion in the form of two camera pose adjustment angles. Because the two tasks of CPAM make decisions sequentially and involve different sets of training samples, we developed a mixture-of-experts model with a gated loss function to train the CPAM in an end-to-end manner. We present extensive results to demonstrate the performance of our SPAS system using publicly available image composition datasets.
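A toy sketch of how composition quality could be scored with five learnable quality-word embeddings against CLIP-style image features, in the spirit of the CCQA module above; the dimensions, the expected-score readout, and all names are assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class QualityPromptScorer(nn.Module):
    """Toy sketch: one learnable continuous embedding per quality word, compared
    against normalized image features to produce an expected quality score."""

    LEVELS = ("bad", "poor", "fair", "good", "perfect")

    def __init__(self, embed_dim=512):
        super().__init__()
        self.quality_embeds = nn.Parameter(torch.randn(len(self.LEVELS), embed_dim) * 0.02)
        self.logit_scale = nn.Parameter(torch.tensor(14.0))

    def forward(self, image_feats):
        # image_feats: (B, D) CLIP-style image features, assumed L2-normalized.
        text = nn.functional.normalize(self.quality_embeds, dim=-1)
        probs = (self.logit_scale * image_feats @ text.T).softmax(dim=-1)  # (B, 5)
        # Expected score in [0, 1]: weight each quality level by its rank.
        ranks = torch.linspace(0.0, 1.0, len(self.LEVELS), device=probs.device)
        return probs @ ranks

# usage sketch on random (already normalized) image features
feats = nn.functional.normalize(torch.randn(4, 512), dim=-1)
scores = QualityPromptScorer()(feats)
```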
MetaShadow: Object-Centered Shadow Detection, Removal, and Synthesis
Tianyu Wang · Jianming Zhang · Haitian Zheng · Zhihong Ding · Scott Cohen · Zhe Lin · Wei Xiong · Chi-Wing Fu · Luis Figueroa · Soo Ye Kim
Shadows are often underconsidered or even ignored in image editing applications, limiting the realism of the edited results. In this paper, we introduce MetaShadow, a three-in-one versatile framework that enables detection, removal, and controllable synthesis of shadows in natural images in an object-centered fashion. MetaShadow combines the strengths of two cooperative components: Shadow Analyzer, for object-centered shadow detection and removal, and Shadow Synthesizer, for reference-based controllable shadow synthesis. Notably, we optimize the learning of the intermediate features from Shadow Analyzer to guide Shadow Synthesizer to generate more realistic shadows that blend seamlessly with the scene. Extensive evaluations on multiple shadow benchmark datasets show significant improvements of MetaShadow over the existing state-of-the-art methods on object-centered shadow detection, removal, and synthesis. MetaShadow excels in supporting image editing tasks such as object removal, relocation, and insertion, pushing the boundaries of object-centered image editing.
Erasing Undesirable Influence in Diffusion Models
Jing Wu · Trung Le · Munawar Hayat · Mehrtash Harandi
Diffusion models are highly effective at generating high-quality images but pose risks, such as the unintentional generation of NSFW (not safe for work) content. Although various techniques have been proposed to mitigate unwanted influences in diffusion models while preserving overall performance, achieving a balance between these goals remains challenging. In this work, we introduce EraseDiff, an algorithm designed to preserve the utility of the diffusion model on retained data while removing the unwanted information associated with the data to be forgotten. Our approach formulates this task as a constrained optimization problem using the value function, resulting in a natural first-order algorithm for solving the optimization problem. By altering the generative process to deviate from the ground-truth denoising trajectory, we update parameters for preservation while controlling constraint reduction to ensure effective erasure, striking an optimal trade-off. Extensive experiments and thorough comparisons with state-of-the-art algorithms demonstrate that EraseDiff effectively preserves the model's utility, efficacy, and efficiency.
EntityErasure: Erasing Entity Cleanly via Amodal Entity Segmentation and Completion
Yixing Zhu · Qing Zhang · Yitong Wang · Yongwei Nie · Wei-Shi Zheng
This paper presents EntityErasure, a novel diffusion-based method that can effectively erase entities without inducing unwanted sundries. To this end, we propose to address this problem by dividing it into amodal entity segmentation and completion, such that the region to inpaint takes only entities in the non-inpainting area as reference, avoiding the possibility of generating unpredictable sundries. Moreover, we propose two novel metrics for assessing the quality of object erasure based on entity segmentation, which are shown to be more effective than existing metrics. Experimental results demonstrate that our approach significantly outperforms other state-of-the-art object erasure methods.
ITA-MDT: Image-Timestep-Adaptive Masked Diffusion Transformer Framework for Image-Based Virtual Try-On
Ji Woo Hong · Tri Ton · Trung X. Pham · Gwanhyeong Koo · Sunjae Yoon · Chang D. Yoo
This paper introduces ITA-MDT, the Image-Timestep-Adaptive Masked Diffusion Transformer Framework for Image-Based Virtual Try-On (IVTON), designed to overcome the limitations of previous approaches by leveraging the Masked Diffusion Transformer (MDT) for improved handling of both global garment context and fine-grained details. The IVTON task involves seamlessly superimposing a garment from one image onto a person in another, creating a realistic depiction of the person wearing the specified garment. Unlike conventional diffusion-based virtual try-on models that depend on large pre-trained U-Net architectures, ITA-MDT leverages a lightweight, scalable transformer-based denoising diffusion model with a mask latent modeling scheme, achieving competitive results while reducing computational overhead. A key component of ITA-MDT is the Image-Timestep Adaptive Feature Aggregator (ITAFA), a dynamic feature aggregator that combines all of the features from the image encoder into a unified feature of the same size, guided by the diffusion timestep and garment image complexity. This enables adaptive weighting of features, allowing the model to emphasize either global information or fine-grained details based on the requirements of the denoising stage. Additionally, the Salient Region Extractor (SRE) module is presented to identify complex regions of the garment and provide high-resolution local information to the denoising model as an additional condition alongside the global information of the full garment image. This targeted conditioning strategy enhances the preservation of fine details in highly salient garment regions while optimizing computational resources by avoiding unnecessary processing of the entire garment image. Comparative evaluations confirm that ITA-MDT improves efficiency while maintaining strong performance, reaching state-of-the-art results in several metrics.
Latent Space Imaging
Matheus Souza · Yidan Zheng · Kaizhang Kang · Yogeshwar Nath Mishra · Qiang Fu · Wolfgang Heidrich
Digital imaging systems have traditionally relied on brute-force measurement and processing of pixels arranged on regular grids. In contrast, the human visual system performs significant data reduction from the large number of photoreceptors to the optic nerve, effectively encoding visual information into a low-bandwidth latent space representation optimized for brain processing. Inspired by this, we propose a similar approach to advance artificial vision systems. Latent Space Imaging introduces a new paradigm that combines optics and software to encode image information directly into the semantically rich latent space of a generative model. This approach substantially reduces bandwidth and memory demands during image capture and enables a range of downstream tasks focused on the latent space. We validate this principle through an initial hardware prototype based on a single-pixel camera. By implementing an amplitude modulation scheme that encodes into the generative model's latent space, we achieve compression ratios ranging from 1:100 to 1:1000 during imaging, and up to 1:16384 for downstream applications. This approach leverages the model's intrinsic linear boundaries, demonstrating the potential of latent space imaging for highly efficient imaging hardware, adaptable future applications in high-speed imaging, and task-specific cameras with significantly reduced hardware complexity.
Q-DiT: Accurate Post-Training Quantization for Diffusion Transformers
Lei Chen · Yuan Meng · Chen Tang · Xinzhu Ma · Jingyan Jiang · Xin Wang · Zhi Wang · Wenwu Zhu
Recent advancements in diffusion models, particularly the architectural transformation from UNet-based models to Diffusion Transformers (DiTs), significantly improve the quality and scalability of image and video generation. However, despite their impressive capabilities, the substantial computational costs of these large-scale models pose significant challenges for real-world deployment. Post-Training Quantization (PTQ) emerges as a promising solution, enabling model compression and accelerated inference for pretrained models, without the costly retraining. However, research on DiT quantization remains sparse, and existing PTQ frameworks, primarily designed for traditional diffusion models, tend to suffer from biased quantization, leading to notable performance degradation. In this work, we identify that DiTs typically exhibit significant spatial variance in both weights and activations, along with temporal variance in activations. To address these issues, we propose Q-DiT, a novel approach that seamlessly integrates two key techniques: automatic quantization granularity allocation to handle the significant variance of weights and activations across input channels, and sample-wise dynamic activation quantization to adaptively capture activation changes across both timesteps and samples. Extensive experiments conducted on ImageNet and VBench demonstrate the effectiveness of the proposed Q-DiT. Specifically, when quantizing DiT-XL/2 to W6A8 on ImageNet ($256 \times 256$), Q-DiT achieves a remarkable reduction in FID by 1.09 compared to the baseline. Under the more challenging W4A8 setting, it maintains high fidelity in image and video generation, establishing a new benchmark for efficient, high-quality quantization in DiTs.
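The automatic granularity allocation in Q-DiT is search-based, but the underlying primitive it allocates, per-group weight quantization along input channels, can be sketched as follows. The group size, bit width, and symmetric rounding scheme here are illustrative assumptions, not the paper's searched configuration.

```python
import torch

def quantize_weight_groupwise(w, n_bits=4, group_size=64):
    """Uniform, symmetric fake-quantization of a (out, in) weight matrix with a
    separate scale per group of input channels (the granularity is fixed here;
    the paper allocates it automatically)."""
    out_c, in_c = w.shape
    assert in_c % group_size == 0
    qmax = 2 ** (n_bits - 1) - 1
    wg = w.view(out_c, in_c // group_size, group_size)
    scale = wg.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(wg / scale), -qmax - 1, qmax)
    return (q * scale).view(out_c, in_c)

w = torch.randn(128, 256)
w_q = quantize_weight_groupwise(w, n_bits=4, group_size=64)
```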
FlexiDiT: Your Diffusion Transformer Can Easily Generate High-Quality Samples with Less Compute
Sotiris Anagnostidis · Gregor Bachmann · Yeongmin Kim · Jonas Kohler · Markos Georgopoulos · Artsiom Sanakoyeu · Yuming Du · Albert Pumarola · Ali Thabet · Edgar Schoenfeld
Despite their remarkable performance, modern Diffusion Transformers (DiTs) are hindered by substantial resource requirements during inference, stemming from the fixed and large amount of compute needed for each denoising step. In this work, we revisit the conventional static paradigm that allocates a fixed compute budget per denoising iteration and propose a dynamic strategy instead. Our simple and sample-efficient framework enables pre-trained DiT models to be converted into flexible ones --- dubbed FlexiDiT --- allowing them to process inputs at varying compute budgets. We demonstrate how a single flexible model can generate images without any drop in quality, while reducing the required FLOPs by more than $40$\% compared to their static counterparts, for both class-conditioned and text-conditioned image generation. Our method is general and agnostic to input and conditioning modalities. We show how our approach can be readily extended for video generation, where FlexiDiT models generate samples with up to $75$\% less compute without compromising performance.
Consistency Posterior Sampling for Diverse Image Synthesis
Vishal Purohit · Matthew Repasky · Jianfeng Lu · Qiang Qiu · Yao Xie · Xiuyuan Cheng
Posterior sampling in high-dimensional spaces using generative models holds significant promise for various applications, including but not limited to inverse problems and guided generation tasks. Generating diverse posterior samples remains expensive, as existing methods require restarting the entire generative process for each new sample. In this work, we propose a posterior sampling approach that simulates Langevin dynamics in the noise space of a pre-trained generative model. By exploiting the mapping between the noise and data spaces which can be provided by distilled flows or consistency models, our method enables seamless exploration of the posterior without the need to re-run the full sampling chain, drastically reducing computational overhead. Theoretically, we prove a guarantee for the proposed noise-space Langevin dynamics to approximate the posterior, assuming that the generative model sufficiently approximates the prior distribution. Our framework is experimentally validated on image restoration tasks involving noisy linear and nonlinear forward operators applied to LSUN-Bedroom (256 x 256) and ImageNet (64 x 64) datasets. The results demonstrate that our approach generates high-fidelity samples with enhanced semantic diversity even under a limited number of function evaluations, offering superior efficiency and performance compared to existing diffusion-based posterior sampling techniques.
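A minimal sketch of Langevin dynamics in the noise space of a one-step generator is given below, assuming a Gaussian measurement model and a standard-normal prior on the noise. The generator, forward operator, and step sizes are toy stand-ins rather than the paper's configuration.

```python
import torch

def noise_space_langevin(G, A, y, z0, steps=50, step_size=1e-3, sigma=0.05):
    """Langevin dynamics in the noise space of a one-step generator G.

    Target (up to a constant):
      log p(z | y) ∝ -||y - A(G(z))||^2 / (2 sigma^2) - ||z||^2 / 2
    """
    z = z0.clone().requires_grad_(True)
    for _ in range(steps):
        residual = y - A(G(z))
        log_post = -(residual ** 2).sum() / (2 * sigma ** 2) - 0.5 * (z ** 2).sum()
        grad, = torch.autograd.grad(log_post, z)
        with torch.no_grad():
            z = z + 0.5 * step_size * grad + (step_size ** 0.5) * torch.randn_like(z)
        z.requires_grad_(True)
    return z.detach()

# toy usage with a nonlinear "generator" and a masking forward operator
G = lambda z: torch.tanh(z)                    # stand-in for the pre-trained one-step model
mask = (torch.rand(16) > 0.5).float()
A = lambda x: mask * x                         # noisy linear measurement operator
z_true = torch.randn(16)
y = A(G(z_true)) + 0.05 * torch.randn(16)
sample = noise_space_langevin(G, A, y, z0=torch.randn(16))
```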
Domain Adaptive Diabetic Retinopathy Grading with Model Absence and Flowing Data
Wenxin Su · Song Tang · Xiaofeng Liu · Xiaojing Yi · Mao Ye · Chunxiao Zu · Jiahao Li · Xiatian Zhu
Domain shift (the difference between source and target domains) poses a significant challenge in clinical applications, e.g., Diabetic Retinopathy (DR) grading. Despite considering certain clinical requirements, like source data privacy, conventional transfer methods are predominantly model-centered and often struggle to prevent model-targeted attacks. In this paper, we address a challenging Online Model-aGnostic Domain Adaptation (OMG-DA) setting, driven by the demands of clinical environments. This setting is characterized by the absence of the model and the flow of target data. To tackle the new challenge, we propose a novel approach, Generative Unadversarial ExampleS (GUES), which enables adaptation from a data-centric perspective. Specifically, we first theoretically reformulate conventional perturbation optimization in a generative way—learning a perturbation generation function with a latent input variable. During model instantiation, we leverage a Variational AutoEncoder to express this function. The encoder with the reparameterization trick predicts the latent input, whilst the decoder is responsible for the generation. Furthermore, the saliency map is selected as the pseudo-perturbation label, because it not only captures potential lesions but also theoretically provides an upper bound on the function input, enabling identification of the latent variable. Extensive comparative experiments on DR benchmarks with both frozen pre-trained models and trainable models demonstrate the superiority of GUES, showing robustness even with small batch sizes.
Diff2Flow: Training Flow Matching Models via Diffusion Model Alignment
Johannes Schusterbauer · Ming Gui · Frank Fundel · Björn Ommer
Recent advancements in diffusion models have established new benchmarks in both generative tasks and downstream applications. In contrast, flow matching models have shown promising improvements in performance but have not been as extensively explored, particularly due to the difficulty of inheriting knowledge from a pretrained diffusion prior foundation model. In this work, we propose a novel method to bridge the gap between pretrained diffusion models and flow matching models by aligning their trajectories and matching their objectives. Our approach mathematically formalizes this alignment and enables the efficient transfer of knowledge from diffusion priors to flow matching models. We demonstrate that our method outperforms traditional diffusion and flow matching finetuning, achieving competitive results across a variety of tasks.
SoftVQ-VAE: Efficient 1-Dimensional Continuous Tokenizer
Hao Chen · Ze Wang · Xiang Li · Ximeng Sun · Fangyi Chen · Jiang Liu · Jindong Wang · Bhiksha Raj · Zicheng Liu · Emad Barsoum
Efficient image tokenization with high compression ratios remains a critical challenge for training generative models. We present SoftVQ-VAE, a continuous image tokenizer that leverages soft categorical posteriors to aggregate multiple codewords into each latent token, substantially increasing the representation capacity of the latent space. When applied to Transformer-based architectures, our approach compresses 256$\times$256 and 512$\times$512 images using only 32 or 64 1-dimensional tokens. Not only does SoftVQ-VAE show consistent and high-quality reconstruction, but, more importantly, it also achieves state-of-the-art and significantly faster image generation results across different denoising-based generative models. Remarkably, SoftVQ-VAE improves inference throughput by up to 18x for generating 256$\times$256 images and 55x for 512$\times$512 images while achieving competitive FID scores of 1.78 and 2.21 for SiT-XL. It also improves the training efficiency of the generative models by reducing the number of training iterations by 2.3x while maintaining comparable performance. With its fully-differentiable design and semantic-rich latent space, our experiments demonstrate that SoftVQ-VAE achieves efficient tokenization without compromising generation quality, paving the way for more efficient generative models. Code and model will be released.
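One plausible reading of the soft categorical aggregation is a softmax-weighted mixture of codebook entries per token, which keeps the tokenizer fully differentiable. The sketch below illustrates that reading with assumed codebook sizes; it is not the authors' exact module.

```python
import torch
import torch.nn as nn

class SoftCodebookAggregation(nn.Module):
    """Each latent token is a softmax-weighted mixture of codebook entries,
    so gradients flow through the (soft) assignment — a sketch of the idea,
    not the paper's exact design."""

    def __init__(self, num_codes=1024, code_dim=32, temperature=1.0):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(num_codes, code_dim) * 0.02)
        self.temperature = temperature

    def forward(self, z):                         # z: (B, num_tokens, code_dim)
        logits = z @ self.codebook.t()            # similarity to every codeword
        weights = (logits / self.temperature).softmax(dim=-1)
        return weights @ self.codebook            # (B, num_tokens, code_dim)

tokens = SoftCodebookAggregation()(torch.randn(2, 32, 32))  # 32 1-D tokens per image
```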
SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE
YONGWEI CHEN · Yushi Lan · Shangchen Zhou · Tengfei Wang · Xingang Pan
Autoregressive models have demonstrated remarkable success across various fields, from large language models (LLMs) to large multimodal models (LMMs) and 2D content generation, moving closer to artificial general intelligence (AGI). Despite these advances, applying autoregressive approaches to 3D object generation and understanding remains largely unexplored. This paper introduces Scale AutoRegressive 3D (SAR3D), a novel framework that leverages a multi-scale 3D vector-quantized variational autoencoder (VQVAE) to tokenize 3D objects for efficient autoregressive generation and detailed understanding. By predicting the next scale in a multi-scale latent representation instead of the next single token, SAR3D reduces generation time significantly, achieving fast 3D object generation in just $0.82$ seconds on an A6000 GPU. Additionally, given the tokens enriched with hierarchical 3D-aware information, we finetune a pretrained LLM on them, enabling multimodal comprehension of 3D content. Our experiments show that SAR3D surpasses current 3D generation methods in both speed and quality and allows LLMs to interpret and caption 3D models comprehensively.
Sketch Down the FLOPs: Towards Efficient Networks for Human Sketch
Aneeshan Sain · Subhajit Maity · Pinaki Nath Chowdhury · Subhadeep Koley · Ayan Kumar Bhunia · Yi-Zhe Song
As sketch research has collectively matured over time, its adaptation for mass commercialisation emerges on the immediate horizon. Despite an already mature research endeavour for photos, there is no research on efficient inference specifically designed for sketch data. In this paper, we first demonstrate that existing state-of-the-art efficient lightweight models designed for photos do not work on sketches. We then propose two sketch-specific components which work in a plug-and-play manner on any efficient photo network to adapt it to sketch data. We specifically chose fine-grained sketch-based image retrieval (FG-SBIR) as a demonstrator, as it is the most recognised sketch problem with immediate commercial value. Technically speaking, we first propose a cross-modal knowledge distillation network to transfer existing efficient photo networks to be compatible with sketch, which brings down the number of FLOPs and model parameters by $97.96$\% and $84.89$\% respectively. We then exploit the abstract trait of sketch to introduce an RL-based canvas selector that dynamically adjusts to the abstraction level, which further cuts down the number of FLOPs by two thirds. The end result is an overall reduction of $99.37$\% of FLOPs (from $40.18$G to $0.254$G) when compared with a full network, while retaining accuracy ($33.03$\% vs $32.77$\%) -- finally yielding an efficient network for sparse sketch data that exhibits even fewer FLOPs than the best photo counterpart.
FlipSketch: Flipping Static Drawings to Text-Guided Sketch Animations
Hmrishav Bandyopadhyay · Yi-Zhe Song
Sketch animations offer a powerful medium for visual storytelling, from simple flip-book doodles to professional studio productions. While traditional animation requires teams of skilled artists to draw key frames and in-between frames, existing automation attempts still demand significant artistic effort through precise motion paths or keyframe specification. We present FlipSketch, a system that brings back the magic of flip-book animation -- just draw your idea and describe how you want it to move! Our approach harnesses motion priors from text-to-video diffusion models, adapting them to generate sketch animations through three key innovations: (i) fine-tuning for sketch-style frame generation, (ii) a reference frame mechanism that preserves visual integrity of input sketch through noise refinement, and (iii) a dual-attention composition that enables fluid motion without losing visual consistency. Unlike constrained vector animations, our raster frames support dynamic sketch transformations, capturing the expressive freedom of traditional animation. The result is an intuitive system that makes sketch animation as simple as doodling and describing, while maintaining the artistic essence of hand-drawn animation.
ShotAdapter: Text-to-Multi-Shot Video Generation with Diffusion Models
Ozgur Kara · Krishna Kumar Singh · Feng Liu · Duygu Ceylan · James Rehg · Tobias Hinz
Current diffusion-based text-to-video methods are limited to producing short video clips of a single shot and lack the capability to generate multi-shot videos with discrete transitions where the same character performs distinct activities across the same or different backgrounds. To address this limitation, we propose a framework that includes a dataset collection pipeline and architectural extensions to video diffusion models to enable text-to-multi-shot video generation. Our approach enables generation of multi-shot videos as a single video with full attention across all frames of all shots, ensuring character and background consistency, and allows users to control the number, duration, and content of shots through shot-specific conditioning. This is achieved by incorporating a transition token into the text-to-video model to control at which frames a new shot begins, and a local attention masking strategy which controls the transition token's effect and allows shot-specific prompting. To obtain training data, we propose a novel data collection pipeline to construct a multi-shot video dataset from existing single-shot video datasets. Extensive experiments demonstrate that fine-tuning a pre-trained text-to-video model for a few thousand iterations is enough for the model to subsequently be able to generate multi-shot videos with shot-specific control, outperforming the baselines.
AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea
Qifan Yu · Wei Chow · Zhongqi Yue · Kaihang Pan · Yang Wu · Xiaoyang Wan · Juncheng Li · Siliang Tang · Hanwang Zhang · Yueting Zhuang
Instruction-based image editing aims to modify specific image elements with natural language instructions. However, current models in this domain often struggle to accurately execute complex user instructions, as they are trained on low-quality data with limited editing types. We present AnyEdit, a comprehensive multi-modal instruction editing dataset, comprising 2.5 million high-quality editing pairs spanning over 20 editing types and five domains. We ensure the diversity and quality of the AnyEdit collection through three aspects: initial data diversity, adaptive editing process, and automated selection of editing results. Using the dataset, we further train a novel AnyEdit Stable Diffusion with task-aware routing and learnable task embedding for unified image editing. Comprehensive experiments on three benchmark datasets show that AnyEdit consistently boosts the performance of diffusion-based editing models. This presents prospects for developing instruction-driven image editing models that support human creativity. The code is available in \url{https://anonymous.4open.science/r/AnyEdit-C53B}.
VIRES: Video Instance Repainting via Sketch and Text Guided Generation
Shuchen Weng · Haojie Zheng · Peixuan Zhang · Yuchen Hong · Han Jiang · Si Li · Boxin Shi
We introduce VIRES, a video instance repainting method with sketch and text guidance, enabling video instance repainting, replacement, generation, and removal. Existing approaches struggle with temporal consistency and accurate alignment with the provided sketch sequence. VIRES leverages the generative priors of text-to-video models to maintain temporal consistency and produce visually pleasing results. We propose the Sequential ControlNet with the standardized self-scaling, which effectively extracts structure layouts and adaptively captures high-contrast sketch details. We further augment the diffusion transformer backbone with the sketch attention to interpret and inject fine-grained sketch semantics. A sketch-aware encoder ensures that repainted results are aligned with the provided sketch sequence. Additionally, we contribute the VireSet, a dataset with detailed annotations tailored for training and evaluating video instance editing methods. Experimental results demonstrate the effectiveness of VIRES, which outperforms state-of-the-art methods in visual quality, temporal consistency, condition alignment, and human ratings.
FADE: Frequency-Aware Diffusion Model Factorization for Video Editing
Yixuan Zhu · Haolin Wang · Shilin Ma · Wenliang Zhao · Yansong Tang · Lei Chen · Jie Zhou
Recent advancements in diffusion frameworks have significantly enhanced video editing, achieving high fidelity and strong alignment with textual prompts. However, conventional approaches using image diffusion models fall short in handling video dynamics, particularly for challenging temporal edits like motion adjustments. While current video diffusion models produce high-quality results, adapting them for efficient editing remains difficult due to the heavy computational demands that prevent the direct application of previous image editing techniques. To overcome these limitations, we introduce FADE—a training-free yet highly effective video editing approach that fully leverages the inherent priors from pre-trained video diffusion models via frequency-aware factorization. Rather than simply using these models, we first analyze the attention patterns within the video model to reveal how video priors are distributed across different components. Building on these insights, we propose a factorization strategy to optimize each component’s specialized role. Furthermore, we devise spectrum-guided modulation to refine the sampling trajectory with frequency domain cues, preventing information leakage and supporting efficient, versatile edits while preserving the basic spatial and temporal structure. Extensive experiments on real-world videos demonstrate that our method consistently delivers high-quality, realistic and temporally coherent editing results both qualitatively and quantitatively.
PICD: Versatile Perceptual Image Compression with Diffusion Rendering
Tongda Xu · Jiahao Li · Bin Li · Yan Wang · Ya-Qin Zhang · Yan Lu
Recently, perceptual image compression has achieved significant advancements, delivering high visual quality at low bitrates for natural images. However, existing methods often produce noticeable artifacts when compressing text in screen content. To tackle this challenge, we propose versatile perceptual screen image compression with diffusion rendering (\textbf{PICD}), a codec that works well for both screen and natural images. More specifically, we propose a compression framework that encodes the text and image separately, and renders them into one image using a diffusion model. For this diffusion rendering, we integrate conditional information into diffusion models at three distinct levels: 1). Domain level: We fine-tune the base diffusion model using text content prompts with screen content. 2). Adaptor level: We develop an efficient adaptor to control the diffusion model using the compressed image and text as input. 3). Instance level: We apply instance-wise guidance to further enhance the decoding process. Empirically, our PICD surpasses existing perceptual codecs in terms of both text accuracy and perceptual quality. Additionally, without text conditions, our approach serves effectively as a perceptual codec for natural images.
Diffusion models have shown great promise in synthesizing visually appealing images. However, it remains challenging to condition the synthesis at a fine-grained level, for instance, synthesizing image pixels following some generic color pattern. Existing image synthesis methods often produce contents that fall outside the desired pixel conditions. To address this, we introduce a novel color alignment algorithm that confines the generative process in diffusion models within a given color pattern. Specifically, we project diffusion terms, either imagery samples or latent representations, into a conditional color space to align with the input color distribution. This strategy simplifies the prediction in diffusion models within a color manifold while still allowing plausible structures in generated contents, thus enabling the generation of diverse contents that comply with the target color pattern. Experimental results demonstrate our state-of-the-art performance in conditioning and controlling of color pixels, while maintaining on-par generation quality and diversity in comparison with regular diffusion models.
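As a concrete (if simplified) instance of projecting an intermediate prediction onto a color condition, the sketch below re-standardizes each channel of a predicted clean image to match target per-channel statistics. The paper's conditional color-space projection may differ; the statistics-matching rule here is an assumption for illustration.

```python
import torch

def project_to_color_stats(x, target_mean, target_std, eps=1e-6):
    """Re-standardize each channel of an intermediate prediction so its
    per-channel statistics match the target color pattern (a simple stand-in
    for a conditional color-space projection).

    x: (B, 3, H, W); target_mean / target_std: (3,)
    """
    mean = x.mean(dim=(2, 3), keepdim=True)
    std = x.std(dim=(2, 3), keepdim=True).clamp(min=eps)
    x_norm = (x - mean) / std
    return x_norm * target_std.view(1, 3, 1, 1) + target_mean.view(1, 3, 1, 1)

x0_pred = torch.rand(1, 3, 64, 64)                 # predicted clean image at some step
aligned = project_to_color_stats(x0_pred,
                                 target_mean=torch.tensor([0.8, 0.3, 0.3]),
                                 target_std=torch.tensor([0.1, 0.1, 0.1]))
```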
Geometry in Style: 3D Stylization via Surface Normal Deformation
Nam Anh Dinh · Itai Lang · Hyunwoo Kim · Oded Stein · Rana Hanocka
In this work, we present Geometry in Style, a new method for identity-preserving mesh stylization. Existing techniques either adhere to the original shape through overly restrictive deformations such as bump maps or significantly modify the input shape using expressive deformations that may introduce artifacts or alter the identity of the source shape. In contrast, we represent a deformation of a triangle mesh as a target normal vector for each vertex neighborhood. The deformations we recover from target normals are expressive enough to enable detailed stylizations and at the same time restrictive enough to preserve the shape's identity. We achieve such deformations using our novel differentiable As-Rigid-As-Possible (dARAP) layer, a neural-network-ready adaptation of the classical ARAP algorithm which we use to solve for per-vertex rotations and deformed vertices. As a differentiable layer, dARAP is paired with a visual loss from a text-to-image model to drive deformations toward style prompts, altogether giving us Geometry in Style.
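The classical ARAP local step that dARAP adapts fits one rotation per vertex neighborhood from rest and deformed edge vectors via an SVD. The sketch below implements that standard step in a batched, autograd-friendly way with illustrative tensor shapes; it is not the paper's full dARAP layer, which also includes the global solve for deformed vertices.

```python
import torch

def arap_local_rotations(rest_edges, deformed_edges, weights):
    """Classical ARAP local step: fit the best rotation per vertex neighborhood.

    rest_edges, deformed_edges: (V, K, 3) edge vectors to K neighbors
    weights: (V, K) cotangent (or uniform) edge weights
    Returns per-vertex rotations R of shape (V, 3, 3) with det(R) = +1.
    """
    # covariance S_i = sum_j w_ij * e_ij * e'_ij^T
    S = torch.einsum("vk,vki,vkj->vij", weights, rest_edges, deformed_edges)
    U, _, Vh = torch.linalg.svd(S)
    V_mat = Vh.transpose(-1, -2)
    R = V_mat @ U.transpose(-1, -2)
    # fix reflections: flip the column of V belonging to the smallest singular value
    d = torch.sign(torch.linalg.det(R))
    D = torch.diag_embed(torch.stack([torch.ones_like(d), torch.ones_like(d), d], dim=-1))
    return V_mat @ D @ U.transpose(-1, -2)

V, K = 100, 6
R = arap_local_rotations(torch.randn(V, K, 3), torch.randn(V, K, 3), torch.ones(V, K))
```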
SaMam: Style-aware State Space Model for Arbitrary Image Style Transfer
Hongda Liu · Longguang Wang · Ye Zhang · Ziru YU · Yulan Guo
A global effective receptive field plays a crucial role in image style transfer (ST) for obtaining high-quality stylized results. However, existing ST backbones (e.g., CNNs and Transformers) suffer from huge computational complexity to achieve global receptive fields. Recently, the State Space Model (SSM), especially its improved variant Mamba, has shown great potential for long-range dependency modeling with linear complexity, which offers an approach to resolve the above dilemma. In this paper, we develop a Mamba-based style transfer framework, termed SaMam. Specifically, a Mamba encoder is designed to efficiently extract content and style information. In addition, a style-aware Mamba decoder is developed to flexibly adapt to various styles. Moreover, to address the problems of local pixel forgetting, channel redundancy and spatial discontinuity of existing SSMs, we introduce both local enhancement and zigzag scan. Qualitative and quantitative results demonstrate that our SaMam outperforms state-of-the-art methods in terms of both accuracy and efficiency.
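One common way to realize a zigzag scan is a boustrophedon ordering of the patch grid, which keeps consecutive tokens spatially adjacent and avoids the row-boundary jumps of raster scanning. The sketch below shows that ordering as an assumption of what such a scan may look like; the paper's exact pattern may differ.

```python
import torch

def zigzag_indices(h, w):
    """Row-wise zigzag (boustrophedon) ordering of an h x w patch grid:
    even rows left-to-right, odd rows right-to-left, so consecutive tokens
    stay spatially adjacent."""
    idx = torch.arange(h * w).view(h, w)
    idx[1::2] = idx[1::2].flip(-1)
    return idx.flatten()

order = zigzag_indices(4, 4)
# tensor([ 0,  1,  2,  3,  7,  6,  5,  4,  8,  9, 10, 11, 15, 14, 13, 12])
tokens = torch.randn(2, 16, 64)                 # (batch, patches, channels)
scanned = tokens[:, order]                      # feed this sequence to the SSM
```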
Unveil Inversion and Invariance in Flow Transformer for Versatile Image Editing
Pengcheng Xu · Boyuan Jiang · Xiaobin Hu · Donghao Luo · Qingdong He · Jiangning Zhang · Chengjie Wang · Yunsheng Wu · Charles Ling · Boyu Wang
Leveraging the large generative prior of the flow transformer for tuning-free image editing requires authentic inversion to project the image into the model's domain and a flexible invariance control mechanism to preserve non-target contents. However, the prevailing diffusion inversion performs deficiently in flow-based models, and the invariance control cannot reconcile diverse rigid and non-rigid editing tasks. To address these, we systematically analyze the \textbf{inversion and invariance} control based on the flow transformer. Specifically, we unveil that the Euler inversion shares a similar structure to DDIM yet is more susceptible to the approximation error. Thus, we propose a two-stage inversion that first refines the velocity estimation and then compensates for the leftover error, which stays close to the model prior and benefits editing. Meanwhile, we propose the invariance control that manipulates the text features within the adaptive layer normalization, connecting changes in the text prompt to image semantics. This mechanism can simultaneously preserve the non-target contents while allowing rigid and non-rigid manipulation, enabling a wide range of editing types. Experiments on various scenarios demonstrate that our framework achieves flexible and accurate editing, unlocking the potential of the flow transformer for versatile image editing.
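For reference, the plain Euler inversion that the paper identifies as error-prone can be written as below. The velocity field is a placeholder, the time convention is an assumption, and the two-stage refinement and error compensation proposed in the paper are not reproduced.

```python
import torch

@torch.no_grad()
def euler_invert(velocity, x_image, num_steps=50):
    """Invert a rectified-flow ODE dx/dt = v(x, t) with plain Euler steps,
    integrating from the image end (t = 0 here) toward the noise end (t = 1).
    This is the naive inversion the paper improves on, not its two-stage scheme."""
    x = x_image.clone()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((x.shape[0],), i * dt)
        x = x + dt * velocity(x, t)
    return x

# toy velocity field standing in for the flow transformer v_theta(x, t)
velocity = lambda x, t: -x
latent = euler_invert(velocity, torch.randn(1, 4, 32, 32))
```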
h-Edit: Effective and Flexible Diffusion-Based Editing via Doob's h-Transform
Toan Nguyen · Kien Do · Duc Kieu · Thin Nguyen
We introduce a theoretical framework for diffusion-based image editing by formulating it as a reverse-time bridge modeling problem. This approach modifies the backward process of a pretrained diffusion model to construct a bridge that converges to an implicit distribution associated with the editing target at time 0. Building on this framework, we propose h-Edit, a novel editing method that utilizes Doob's h-transform and Langevin Monte Carlo to decompose the update of an intermediate edited sample into two components: a "reconstruction" term and an "editing" term. This decomposition provides flexibility, allowing the reconstruction term to be computed via existing inversion techniques and enabling the combination of multiple editing terms to handle complex editing tasks. To our knowledge, h-Edit is the first training-free method capable of performing simultaneous text-guided and reward-model-based editing. Extensive experiments, both quantitative and qualitative, show that h-Edit outperforms state-of-the-art baselines in terms of editing effectiveness and faithfulness.
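A schematic of the reconstruction-plus-editing decomposition is sketched below: a reconstruction step (any inversion-based denoising update) combined with the gradient of a log-reward acting as the editing term. This is only an illustrative approximation; the paper derives the editing term rigorously via Doob's h-transform, and all function names here are hypothetical.

```python
import torch

def h_edit_style_step(x_t, reconstruction_step, log_reward, edit_scale=1.0):
    """Schematic update: a "reconstruction" term (an inversion-based denoising
    step) plus an "editing" term (gradient of a log-reward / guidance signal).
    Not the paper's exact derivation."""
    x = x_t.detach().requires_grad_(True)
    r = log_reward(x)                              # scalar editing objective
    edit_grad, = torch.autograd.grad(r, x)
    with torch.no_grad():
        x_recon = reconstruction_step(x_t)         # reconstruction term
        return x_recon + edit_scale * edit_grad    # plus the editing term

# toy usage: pull samples toward a target image while roughly reconstructing
target = torch.zeros(1, 3, 16, 16)
recon = lambda x: 0.99 * x                         # stand-in for a denoising step
reward = lambda x: -((x - target) ** 2).mean()
x_next = h_edit_style_step(torch.randn(1, 3, 16, 16), recon, reward, edit_scale=0.1)
```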
Concept Lancet: Image Editing with Compositional Representation Transplant
Jinqi Luo · Tianjiao Ding · Kwan Ho Ryan Chan · Hancheng Min · Chris Callison-Burch · Rene Vidal
Diffusion models are widely used for image editing tasks. Existing editing methods often design a representation manipulation procedure (e.g., Cat $\rightarrow$ Dog, Sketch $\rightarrow$ Painting) by curating an edit direction in the text embedding or score space. However, such a procedure faces a key challenge: overestimating the edit strength harms visual consistency while underestimating it fails the editing task. Notably, each source image may require a different editing strength, and it is costly to search for an appropriate strength via trial-and-error. To address this challenge, we propose Concept Lancet (CoLan), a zero-shot plug-and-play framework for principled representation manipulation in diffusion-based image editing. At inference time, we decompose the source input in the latent (text embedding or diffusion score) space as a sparse linear combination of the representations of the collected visual concepts and phrases. This allows us to accurately estimate the presence of concepts in each image, which informs the edit. Based on the editing task (replace, add, or remove), we perform a customized concept transplant process to impose the corresponding editing direction. To sufficiently model the concept space, we curate a conceptual representation dataset, CoLan-150K, which contains diverse descriptions and scenarios of visual concepts and phrases for the latent dictionary. Experiments on multiple diffusion-based image editing baselines show that methods equipped with CoLan achieve state-of-the-art performance in editing effectiveness and consistency preservation.
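The sparse decomposition step can be illustrated with ordinary Lasso regression over a concept dictionary, as sketched below. The dictionary, embedding dimension, and sparsity penalty are toy assumptions, and CoLan's transplant procedure is not shown.

```python
import numpy as np
from sklearn.linear_model import Lasso

def decompose_over_concepts(source_embed, concept_dict, alpha=0.01):
    """Estimate sparse coefficients c so that concept_dict @ c ≈ source_embed.

    source_embed: (d,) embedding of the source prompt/image in text or score space
    concept_dict: (d, n_concepts), columns are embeddings of collected concepts
    """
    lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000)
    lasso.fit(concept_dict, source_embed)
    return lasso.coef_                              # (n_concepts,) sparse weights

rng = np.random.default_rng(0)
D = rng.normal(size=(768, 50))                      # toy dictionary of 50 concepts
x = 0.8 * D[:, 3] + 0.5 * D[:, 17]                  # source built from two concepts
coeffs = decompose_over_concepts(x, D)
print(np.argsort(-np.abs(coeffs))[:2])              # should recover concepts 3 and 17
```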
Instruct-CLIP: Improving Instruction-Guided Image Editing with Automated Data Refinement Using Contrastive Learning
Sherry X. Chen · Misha Sra · Pradeep Sen
Although natural language instructions offer an intuitive way to guide automated image editing, deep-learning models often struggle to achieve high-quality results, largely due to challenges in creating large, high-quality training datasets. Previous work has typically relied on text-to-image (T2I) generative models to produce pairs of original and edited images that simulate the input/output of an instruction-guided image-editing model. However, these image pairs often fail to align with the specified edit instructions due to the limitations of T2I models, which negatively impacts models trained on such datasets. To address this, we present Instruct-CLIP, a self-supervised method that learns the semantic changes between original and edited images to refine and better align the instructions in existing datasets. Furthermore, we adapt Instruct-CLIP to handle noisy latent images and diffusion timesteps so that it can be used to train latent diffusion models (LDMs) and efficiently enforce alignment between the edit instruction and the image changes in latent space at any step of the diffusion pipeline. We use Instruct-CLIP to correct the InstructPix2Pix (IP2P) dataset and get over 60K refined samples we then use to fine-tune the IP2P model, guided by our novel Instruct-CLIP-based loss function. The resulting model produces better edits that are more aligned with the given instructions, and visibly and quantitatively outperforms state-of-the-art approaches. Our code and dataset will be released upon acceptance of the paper.
GlyphMastero: A Glyph Encoder for High-Fidelity Scene Text Editing
Tong Wang · Ting Liu · Xiaochao Qu · WU CHENGJING · Luoqi Liu · Xiaolin Hu
Scene text editing, a subfield of image editing, requires modifying texts in images while preserving style consistency and visual coherence with the surrounding environment. While diffusion-based methods have shown promise in text generation, they still struggle to produce high-quality results. These methods often generate distorted or unrecognizable characters, particularly when dealing with complex characters like Chinese. In such systems, characters are composed of intricate stroke patterns and spatial relationships that must be precisely maintained. We present GlyphMastero, a specialized glyph encoder designed to guide the latent diffusion model for generating texts with stroke-level precision. Our key insight is that existing methods, despite using pretrained OCR models for feature extraction, fail to capture the hierarchical nature of text structures - from individual strokes to stroke-level interactions to overall character-level structure. To address this, our glyph encoder explicitly models and captures the cross-level interactions between local-level individual characters and global-level text lines through our novel glyph attention module. Meanwhile, our model implements a feature pyramid network to fuse the multi-scale OCR backbone features at the global-level. Through these cross-level and multi-scale fusions, we obtain more detailed glyph-aware guidance, enabling precise control over the scene text generation process. Our method achieves an 18.02\% improvement in sentence accuracy over the state-of-the-art baseline, while simultaneously reducing the text-region Fréchet inception distance by 53.28\%.
DreamOmni: Unified Image Generation and Editing
Bin Xia · Yuechen Zhang · Jingyao Li · Chengyao Wang · Yitong Wang · Xinglong Wu · Bei Yu · Jiaya Jia
Currently, the success of large language models (LLMs) illustrates that a unified multitasking approach can significantly enhance model usability, streamline deployment, and foster synergistic benefits across different tasks. However, in computer vision, while text-to-image (T2I) models have significantly improved generation quality through scaling up, their framework design did not initially consider how to unify with downstream tasks, such as various types of editing. To address this, we introduce DreamOmni, a unified model for image generation and editing. We begin by analyzing existing frameworks and the requirements of downstream tasks, proposing a unified framework that integrates both T2I models and various editing tasks. Furthermore, another key challenge is the efficient creation of high-quality editing data, particularly for instruction-based and drag-based editing. To this end, we develop a synthetic data pipeline using sticker-like elements to synthesize accurate, high-quality datasets efficiently, which enables editing data scaling up for unified model training. For training, DreamOmni jointly trains T2I generation and downstream tasks. T2I training enhances the model's understanding of specific concepts and improves generation quality, while editing training helps the model grasp the nuances of the editing task. This collaboration significantly boosts editing performance. Extensive experiments confirm the effectiveness of DreamOmni. The code and model will be released.
Black Hole-Driven Identity Absorbing in Diffusion Models
Muhammad Shaheryar · Jong Taek Lee · Soon Ki Jung
Recent advances in diffusion models have positioned them as powerful generative frameworks for high-resolution image synthesis across diverse domains. The emerging "h-space" within these models, defined by bottleneck activations in the denoiser, offers promising pathways for semantic image editing similar to GAN latent spaces. However, as demand grows for content erasure and concept removal, privacy concerns highlight the need for identity disentanglement in the latent space of diffusion models. The high-dimensional latent space poses challenges for identity removal, as traversing with random or orthogonal directions often leads to semantically unvalidated regions, resulting in unrealistic outputs. To address these issues, we propose $\textbf{B}$lack $\textbf{H}$ole-Driven $\textbf{I}$dentity $\textbf{A}$bsorption (BIA) within the latent space of diffusion models for arbitrary identity erasure. BIA uses a "black hole" metaphor, where the latent region representing a specified identity acts as an attractor, drawing in nearby latent points of surrounding identities to "wrap" the black hole. Instead of relying on random traversals for optimization, BIA employs an identity-absorption mechanism that attracts and wraps nearby validated latent points associated with other identities to achieve a vanishing effect for the specified identity. Our method effectively prevents the generation of a specified identity while preserving other attributes, as validated by improved scores on identity similarity (SID) and FID metrics, qualitative evaluations, and user studies compared to SOTA.
DreamText: High Fidelity Scene Text Synthesis
Yibin Wang · Weizhong Zhang · honghui xu · Cheng Jin
Scene text synthesis involves rendering specified texts onto arbitrary images. Current methods typically formulate this task in an end-to-end manner but lack effective character-level guidance during training. Besides, their text encoders, pre-trained on a single font type, struggle to adapt to the diverse font styles encountered in practical applications. Consequently, these methods suffer from character distortion, repetition, and absence, particularly in polystylistic scenarios. To this end, this paper proposes DreamText for high-fidelity scene text synthesis. Our key idea is to reconstruct the diffusion training process, introducing more refined guidance tailored to this task, to expose and rectify the model's attention at the character level and strengthen its learning of text regions. This transformation poses a hybrid optimization challenge, involving both discrete and continuous variables. To effectively tackle this challenge, we employ a heuristic alternate optimization strategy. Meanwhile, we jointly train the text encoder and generator to comprehensively learn and utilize the diverse fonts present in the training dataset. This joint training is seamlessly integrated into the alternate optimization process, fostering a synergistic relationship between learning character embeddings and re-estimating character attention. Specifically, in each step, we first encode potential character-generated position information from cross-attention maps into latent character masks. These masks are then utilized to update the representation of specific characters in the current step, which, in turn, enables the generator to correct the characters' attention in subsequent steps. Both qualitative and quantitative results demonstrate the superiority of our method over the state of the art.
Prompt2Perturb (P2P): Text-Guided Diffusion-Based Adversarial Attack on Breast Ultrasound Images
Yasamin Medghalchi · Moein Heidari · Clayton Allard · Leonid Sigal · Ilker Hacihaliloglu
Deep neural networks (DNNs) offer significant promise for improving breast cancer diagnosis in medical imaging. However, these models are highly susceptible to adversarial attacks—small, imperceptible changes that can mislead classifiers—raising critical concerns about their reliability and security. Traditional attack methods typically either require substantial extra data for malicious model pre-training or involve a fixed-norm perturbation budget, which does not align with human perception of these alterations. In medical imaging, however, this is often unfeasible due to the limited availability of datasets. Building on recent advancements in learnable prompts, we propose Prompt2Perturb (P2P), a novel language-guided semantic attack method capable of generating meaningful perturbations driven by text instructions. During the prompt learning phase, our approach leverages learnable prompts within the text encoder to create subtle, yet impactful, perturbations that remain imperceptible while guiding the model towards targeted outcomes. In contrast to current prompt learning-based approaches, our P2P stands out by directly updating text embeddings, avoiding the need for retraining diffusion models or using large pre-trained models, which is typically infeasible in the medical domain. Further, we leverage the finding that optimizing only the early reverse diffusion steps boosts efficiency while ensuring that the generated adversarial examples incorporate subtle low-frequency noise, thus preserving ultrasound image quality without introducing noticeable artifacts. We show that our method outperforms state-of-the-art attack techniques across three breast ultrasound datasets. Moreover, the generated images are both more natural in appearance and more effective compared to existing adversarial attacks.
A Comprehensive Study of Decoder-Only LLMs for Text-to-Image Generation
Andrew Z Wang · Songwei Ge · Tero Karras · Ming-Yu Liu · Yogesh Balaji
Both text-to-image generation and large language models (LLMs) have made significant advancements. However, many text-to-image models still employ the somewhat outdated T5 and CLIP as their text encoders. In this work, we investigate the effectiveness of using modern decoder-only LLMs as text encoders for text-to-image diffusion models. We build a standardized training and evaluation pipeline that allows us to isolate and evaluate the effect of different text embeddings. We train a total of 22 text-to-image models with 12 different text encoders to analyze the critical aspects of LLMs that could impact text-to-image generation, including the approaches to extract embeddings, different LLM variants, and model sizes. Our experiments reveal that the de facto way of using last-layer embeddings as conditioning leads to inferior performance. Instead, we explore embeddings from various layers and find that using layer-normalized averaging across all layers significantly improves alignment with complex prompts. LLMs with this conditioning outperform the baseline T5 model, showing enhanced performance in advanced visio-linguistic reasoning skills.
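The layer-normalized averaging described above is simple to write down. The sketch below assumes hidden states shaped like a Hugging Face `hidden_states` tuple (one tensor per layer) and uses random tensors as stand-ins so it runs on its own; layer count and dimensions are illustrative.

```python
import torch
import torch.nn.functional as F

def layer_normalized_average(hidden_states):
    """Average decoder-LLM hidden states across layers, layer-normalizing each
    layer first so that layers with very different scales contribute equally.

    hidden_states: tuple/list of (B, seq_len, d) tensors, one per layer.
    Returns a (B, seq_len, d) text conditioning tensor.
    """
    d = hidden_states[0].shape[-1]
    normed = [F.layer_norm(h, normalized_shape=(d,)) for h in hidden_states]
    return torch.stack(normed, dim=0).mean(dim=0)

# toy stand-in for `model(..., output_hidden_states=True).hidden_states`
fake_hidden = tuple(torch.randn(2, 77, 1024) * (i + 1) for i in range(25))
text_cond = layer_normalized_average(fake_hidden)   # feed this to the diffusion model
```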
Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis
Bingda Tang · Sayak Paul · Boyang Zheng · Saining Xie
Recent advances in text-to-image synthesis have delivered impressive results, yet existing approaches still struggle to align with complex prompts. While decoder-only Large Language Models (LLMs) excel at handling such intricate texts, their integration with text-to-image generative models remains unsatisfactory. The rise of Diffusion Transformers (DiTs) presents a promising path forward via deep fusion with LLMs. In this work, we explore this deep fusion for text-to-image synthesis by replacing the text stream Transformer in the MM-DiT model with an LLM, establishing shared self-attention between the LLM and DiT models. This design better aligns with the training objective and inference nature of both autoregressive and diffusion models, bridging the gap between the two paradigms. We empirically examine the design spaces of this approach and demonstrate its effectiveness through extensive experiments. We hope the positive evidence will kindle interest in this approach and inspire reflection on the pursuit of utilizing LLMs for text-to-image synthesis.
Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget
Vikash Sehwag · Xianghao Kong · Jingtao Li · Michael Spranger · Lingjuan Lyu
As scaling laws in generative AI push performance, they simultaneously concentrate the development of these models among actors with large computational resources. With a focus on text-to-image (T2I) generative models, we aim to unlock this bottleneck by demonstrating very low-cost training of large-scale T2I diffusion transformer models. As the computational cost of transformers increases with the number of patches in each image, we propose randomly masking up to 75% of the image patches during training. We propose a deferred masking strategy that preprocesses all patches using a patch-mixer before masking, thus significantly reducing the performance degradation caused by masking and making it superior to model downscaling in reducing computational cost. We also incorporate the latest improvements in transformer architecture, such as the use of mixture-of-experts layers, to improve performance and further identify the critical benefit of using synthetic images in micro-budget training. Finally, using only 37M publicly available real and synthetic images, we train a 1.16 billion parameter sparse transformer at an economical cost of only 1,890 USD and achieve a 12.7 FID in zero-shot generation on the COCO dataset. Notably, our model achieves competitive performance across both automated and human-centric evaluations, as well as high-quality generations, while incurring 118$\times$ lower costs than Stable Diffusion models and 14$\times$ lower costs than the current state-of-the-art approach, which costs \$28,400. We further investigate the influence of synthetic images on performance and demonstrate that micro-budget training on only synthetic images is sufficient for achieving high-quality data generation.
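A minimal sketch of the deferred masking idea, assuming a single linear layer as a stand-in for the patch-mixer and a fixed 75% mask ratio, is given below. The real model applies the mask only during training and uses a more substantial mixer; the names and sizes here are illustrative.

```python
import torch
import torch.nn as nn

class DeferredMasking(nn.Module):
    """Process every patch with a cheap patch-mixer, then keep only a random
    subset of patch tokens for the heavy DiT backbone (a sketch of the idea,
    with one linear layer standing in for the real mixer blocks)."""

    def __init__(self, dim=512, mask_ratio=0.75):
        super().__init__()
        self.patch_mixer = nn.Linear(dim, dim)    # stand-in for the patch-mixer
        self.mask_ratio = mask_ratio

    def forward(self, tokens):                    # tokens: (B, N, dim) patch embeddings
        tokens = self.patch_mixer(tokens)         # every patch still sees the mixer
        B, N, D = tokens.shape
        n_keep = max(1, int(N * (1 - self.mask_ratio)))
        keep = torch.rand(B, N).argsort(dim=1)[:, :n_keep]   # random subset per sample
        return torch.gather(tokens, 1, keep.unsqueeze(-1).expand(-1, -1, D))

kept = DeferredMasking()(torch.randn(4, 256, 512))   # (4, 64, 512) goes to the backbone
```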
Enhancing Creative Generation on Stable Diffusion-based Models
Jiyeon Han · Dahee Kwon · Gayoung Lee · Junho Kim · Jaesik Choi
Recent text-to-image generative models, particularly Stable Diffusion and its distilled variants, have achieved impressive fidelity and strong text-image alignment. However, their creative generation capacity remains limited, as simply adding the term ``creative" to prompts often fails to yield genuinely creative results. In this paper, we introduce C3 (Creative Concept Catalyst), a training-free approach designed to enhance creativity in Stable Diffusion-based models. C3 selectively amplifies features during the denoising process to foster more creative outputs. We offer practical guidelines for choosing amplification factors based on two main aspects of creativity. C3 allows user-friendly creativity control in image generation and is the first study to enhance creativity in diffusion models without extensive computational costs. We demonstrate its effectiveness across various Stable Diffusion-based models. Source codes will be publicly available.
APT: Adaptive Personalized Training for Diffusion Models with Limited Data
JungWoo Chae · Jiyoon Kim · Jaewoong Choi · Kyungyul Kim · Sangheum Hwang
Personalizing diffusion models using limited data presents significant challenges, including overfitting, loss of prior knowledge, and degradation of text alignment. Overfitting leads to shifts in the noise prediction distribution, disrupting the denoising trajectory and causing the model to lose semantic coherence. In this paper, we propose Adaptive Personalized Training (APT), a novel framework that mitigates overfitting by employing adaptive training strategies and stabilizing the model's internal representations during fine-tuning. APT consists of three key components: (1) Adaptive Training Adjustment, which introduces an overfitting indicator to detect the degree of overfitting at each time step bin and applies adaptive data augmentation and adaptive loss weighting based on this indicator; (2) Representation Stabilization, which regularizes the mean and variance of intermediate feature maps to prevent excessive shifts in noise prediction; and (3) Attention Alignment for Prior Knowledge Preservation, which aligns the cross-attention maps of the fine-tuned model with those of the pretrained model to maintain prior knowledge and semantic coherence. Through extensive experiments, we demonstrate that APT effectively mitigates overfitting, preserves prior knowledge, and outperforms existing methods in generating high-quality, diverse images with limited reference data.
InPO: Inversion Preference Optimization with Reparametrized DDIM for Efficient Diffusion Model Alignment
Yunhong Lu · Qichao Wang · Hengyuan Cao · Xierui Wang · Xiaoyin Xu · Min Zhang
Without using an explicit reward, direct preference optimization (DPO) employs paired human preference data to fine-tune generative models, a method that has garnered considerable attention in large language models (LLMs). However, exploration of aligning text-to-image (T2I) diffusion models with human preferences remains limited. In comparison to supervised fine-tuning, existing methods that align diffusion models suffer from low training efficiency and subpar generation quality due to the long Markov chain process and the intractability of the reverse process. To address these limitations, we introduce DDIM-InPO, an efficient method for direct preference alignment of diffusion models. Our approach conceptualizes the diffusion model as a single-step generative model, allowing us to fine-tune the outputs of specific latent variables selectively. In order to accomplish this objective, we first assign implicit rewards to any latent variable directly via a reparameterization technique. Then we construct an inversion technique to estimate appropriate latent variables for preference optimization. This modification process enables the diffusion model to fine-tune only the outputs of latent variables that have a strong correlation with the preference dataset. Experimental results indicate that our DDIM-InPO achieves state-of-the-art performance with just 400 steps of fine-tuning, surpassing all preference aligning baselines for T2I diffusion models in human preference evaluation tasks.
STEPS: Sequential Probability Tensor Estimation for Text-to-Image Hard Prompt Search
Yuning Qiu · Andong Wang · Chao Li · Haonan Huang · Guoxu Zhou · Qibin Zhao
Recent text-to-image (T2I) diffusion models have demonstrated remarkable capabilities in visual synthesis, yet their performance heavily relies on the quality of input prompts. However, optimizing discrete prompts remains challenging because the discrete nature of tokens prevents the direct application of gradient descent and because the search space of possible token combinations is vast. As a result, existing approaches either suffer from quantization errors when employing continuous optimization techniques or become trapped in local optima due to coordinate-wise greedy search. In this paper, we propose STEPS, a novel Sequential probability Tensor Estimation approach for hard Prompt Search. Our method reformulates discrete prompt optimization as a sequential probability tensor estimation problem, leveraging the inherent low-rank characteristics to address the curse of dimensionality. To further improve computational efficiency, we develop a memory-bounded sampling approach that shrinks the sequential probability without the iteration-step dependency while preserving sequential optimization dynamics. Extensive experiments on various public datasets demonstrate that our method consistently outperforms existing approaches in T2I generation, cross-model prompt transferability, and harmful prompt optimization, validating the effectiveness of the proposed framework.
PQPP: A Joint Benchmark for Text-to-Image Prompt and Query Performance Prediction
Eduard Poesina · Adriana Valentina Costache · Adrian-Gabriel Chifu · Josiane Mothe · Radu Tudor Ionescu
Text-to-image generation has recently emerged as a viable alternative to text-to-image retrieval, driven by the visually impressive results of generative diffusion models. Although query performance prediction is an active research topic in information retrieval, to the best of our knowledge, there is no prior study that analyzes the difficulty of queries (referred to as prompts) in text-to-image generation, based on human judgments. To this end, we introduce the first dataset of prompts which are manually annotated in terms of image generation performance. Additionally, we extend these evaluations to text-to-image retrieval by collecting manual annotations that represent retrieval performance. We thus establish the first joint benchmark for prompt and query performance prediction (PQPP) across both tasks, comprising over 10K queries. Our benchmark enables (i) the comparative assessment of prompt/query difficulty in both image generation and image retrieval, and (ii) the evaluation of prompt/query performance predictors addressing both generation and retrieval. We evaluate several pre- and post-generation/retrieval performance predictors, thus providing competitive baselines for future research. Our benchmark and code are publicly available at https://anonymous.4open.science/r/PQPP-D332.
Let's Verify and Reinforce Image Generation Step by Step
Renrui Zhang · Chengzhuo Tong · Zhizheng Zhao · Ziyu Guo · Haoquan Zhang · Manyuan Zhang · Jiaming Liu · Peng Gao · Hongsheng Li
Chain-of-Thought (CoT) reasoning has been extensively explored in large models to tackle complex understanding tasks. However, it remains an open question whether such strategies can be applied to verifying and reinforcing image generation scenarios. In this paper, we provide the first comprehensive investigation into the potential of CoT reasoning to enhance autoregressive image generation. We focus on three techniques: scaling test-time computation for verification, aligning model preferences with Direct Preference Optimization (DPO), and integrating these techniques for complementary effects. Our results demonstrate that these approaches can be effectively adapted and combined to significantly improve image generation performance. Furthermore, given the pivotal role of reward models in our findings, we propose the Potential Assessment Reward Model (PARM) specialized for autoregressive image generation. PARM adaptively assesses each generation step through a potential assessment mechanism, merging the strengths of existing reward models. Using our investigated reasoning strategies, we enhance a baseline model, Show-o, to achieve superior results, with a significant +24% improvement on the GenEval benchmark, surpassing Stable Diffusion 3 by +15%. We hope our study provides unique insights and paves a new path for integrating CoT reasoning with autoregressive image generation.
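A minimal illustration of the simplest of the three techniques, test-time verification, is best-of-N reranking with a reward model; PARM as described above goes further by scoring intermediate autoregressive steps. The generator and reward_model callables below are hypothetical interfaces, not the paper's code.

```python
import torch

@torch.no_grad()
def best_of_n(generator, reward_model, prompt, n=8):
    """Sample N candidate images for a prompt and keep the one the reward model
    scores highest; a bare-bones form of scaling test-time compute for verification."""
    candidates = [generator(prompt) for _ in range(n)]               # each a (C, H, W) tensor
    scores = torch.stack([reward_model(prompt, img) for img in candidates])
    return candidates[int(scores.argmax())]
```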
GLASS: Guided Latent Slot Diffusion for Object-Centric Learning
Krishnakant Singh · Simone Schaub-Meyer · Stefan Roth
Object-centric learning aims to decompose an input image into a set of meaningful object files (slots). These latent object representations enable a variety of downstream tasks. Yet, object-centric learning struggles on real-world datasets, which contain multiple objects of complex textures and shapes in natural everyday scenes. To address this, we introduce Guided Latent Slot Diffusion (GLASS), a novel slot attention model that learns in the space of generated images and uses semantic and instance guidance modules to learn better slot embeddings for various downstream tasks. Our experiments show that GLASS surpasses state-of-the-art slot attention methods by a wide margin on tasks such as (zero-shot) object discovery and conditional image generation for real-world scenes. Moreover, GLASS enables the first application of slot attention to compositional generation of complex, realistic scenes.
DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation
Jianzong Wu · Chao Tang · Jingbo Wang · Yanhong Zeng · Xiangtai Li · Yunhai Tong
Story visualization, the task of creating visual narratives from textual descriptions, has seen progress with text-to-image generation models. However, these models often lack effective control over character appearances and interactions, particularly in multi-character scenes. To address these limitations, we propose a new task: \textbf{customized manga generation} and introduce \textbf{DiffSensei}, an innovative framework specifically designed for generating manga with dynamic multi-character control. DiffSensei integrates a diffusion-based image generator with a multimodal large language model (MLLM) that acts as a text-compatible identity adapter. Our approach employs masked cross-attention to seamlessly incorporate character features, enabling precise layout control without direct pixel transfer. Additionally, the MLLM-based adapter adjusts character features to align with panel-specific text cues, allowing flexible adjustments in character expressions, poses, and actions. We also introduce \textbf{MangaZero}, a large-scale dataset tailored to this task, containing 43,264 manga pages and 427,147 annotated panels, supporting the visualization of varied character interactions and movements across sequential frames. Extensive experiments demonstrate that DiffSensei outperforms existing models, marking a significant advancement in manga generation by enabling text-adaptable character customization. The code, model, and dataset will be open-sourced to the community.
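The masked cross-attention described here can be pictured as restricting each image token's attention to the character tokens whose layout region covers it. The sketch below is a generic illustration with assumed tensor shapes, not DiffSensei's actual module.

```python
import torch
import torch.nn.functional as F

def masked_cross_attention(q, char_kv, layout_mask):
    """q:           (B, N_img, D)  image latent tokens (queries)
       char_kv:     (B, N_char, D) per-character identity tokens (keys/values)
       layout_mask: (B, N_img, N_char) 1 where an image token falls inside the panel
                    region assigned to that character.
       Each image token only attends to characters placed at its location; tokens
       covered by no character receive zero identity injection."""
    d = q.shape[-1]
    attn = q @ char_kv.transpose(1, 2) / d ** 0.5                    # (B, N_img, N_char)
    has_char = layout_mask.any(dim=-1, keepdim=True).to(q.dtype)
    attn = attn.masked_fill(layout_mask == 0, torch.finfo(attn.dtype).min)
    weights = F.softmax(attn, dim=-1)
    return (weights @ char_kv) * has_char                            # (B, N_img, D)
```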
POSTA: A Go-to Framework for Customized Artistic Poster Generation
Haoyu Chen · Xiaojie Xu · Wenbo Li · Jingjing Ren · Tian Ye · Songhua Liu · Ying-Cong Chen · Lei Zhu · Xinchao Wang
Poster design is a critical medium for visual communication. Prior work has explored automatic poster design using deep learning techniques, but these approaches lack text accuracy, user customization, and aesthetic appeal, limiting their applicability in artistic domains such as movies and exhibitions, where both clear content delivery and visual impact are essential. To address these limitations, we present POSTA: a modular framework powered by diffusion models and multimodal large language models (MLLMs) for customized artistic poster generation. The framework consists of three modules. Background Diffusion creates a themed background based on user input. Design MLLM then generates layout and typography elements that align with and complement the background style. Finally, to enhance the poster's aesthetic appeal, ArtText Diffusion applies additional stylization to key text elements. The final result is a visually cohesive and appealing poster, with a fully modular process that allows for complete customization. To train our models, we develop the PosterArt dataset, comprising high-quality artistic posters annotated with layout, typography, and pixel-level stylized text segmentation. Our comprehensive experimental analysis demonstrates POSTA’s exceptional controllability and design diversity, outperforming existing models in both text accuracy and aesthetic quality.
StageDesigner: Artistic Stage Generation for Scenography via Theater Scripts
Zhaoxing Gan · Mengtian Li · Ruhua Chen · Zhongxia JI · Sichen Guo · Huanling Hu · Guangnan Ye · Zuo Hu
In this work, we introduce $\textbf{StageDesigner}$, the first comprehensive framework for artistic stage generation using large language models (LLMs) combined with layout-controlled diffusion models. Given the professional requirements of stage scenography, StageDesigner simulates the workflows of seasoned artists to generate immersive 3D stage scenes. Specifically, our approach is divided into three primary modules: $\textit{Script Analysis}$, which extracts thematic and spatial cues from input scripts; $\textit{Foreground Generation}$, which constructs and arranges essential 3D objects; and $\textit{Background Generation}$, which produces a harmonious background aligned with the narrative atmosphere and maintains spatial coherence by managing occlusions between foreground and background elements. Furthermore, we introduce the $\textbf{StagePro-V1}$ dataset, a dedicated dataset with 276 unique stage scenes spanning different historical styles and annotated with scripts, images, and detailed 3D layouts, specifically tailored for this task. Finally, evaluations using both standard and newly proposed metrics, along with extensive user studies, demonstrate the effectiveness of StageDesigner, showcasing its ability to produce visually and thematically cohesive stages that meet both artistic and spatial coherence standards.
Pattern Analogies: Learning to Perform Programmatic Image Edits by Analogy
Aditya Ganeshan · Thibault Groueix · Paul Guerrero · Radomir Mech · Matthew Fisher · Daniel Ritchie
Pattern images are everywhere in the digital and physical worlds, and tools to edit them are valuable. But editing pattern images is tricky: desired edits are often programmatic, i.e., structure-aware edits that alter the underlying program which generates the pattern. One could attempt to infer this underlying program, but current methods for doing so struggle with complex images and produce unorganized programs that make editing tedious. In this work, we introduce a novel approach to perform programmatic edits on pattern images. By using a pattern analogy—a pair of simple patterns to demonstrate the intended edit—and a learning-based generative model to execute these edits, our method allows users to intuitively edit patterns. To enable this paradigm, we introduce SplitWeave, a domain-specific language that, combined with a framework for sampling synthetic pattern analogies, enables the creation of a large, high-quality synthetic training dataset. We also present TriFuser, a Latent Diffusion Model (LDM) designed to overcome critical issues that arise when naively deploying LDMs to this task. Extensive experiments on real-world, artist-sourced patterns reveal that our method faithfully performs the demonstrated edit while also generalizing to related pattern styles beyond its training distribution.
Text-Driven Fashion Image Editing with Compositional Concept Learning and Counterfactual Abduction
Shanshan Huang · Haoxuan Li · Chunyuan Zheng · Mingyuan Ge · Wei Gao · Lei Wang · Li Liu
Fashion image editing is a valuable tool for designers to convey their creative ideas by visualizing design concepts. With the recent advances in text editing methods, significant progress has been made in fashion image editing. However, these methods face two key challenges: spurious correlations in training data often induce changes in other regions when editing a given concept, and these models typically lack the ability to edit multiple concepts simultaneously. To address the above challenges, we propose a novel Text-driven Fashion Image ediTing framework called T-FIT that mitigates the impact of spurious correlations by integrating counterfactual reasoning with compositional concept learning, enabling precise compositional multi-concept fashion image editing that relies solely on text descriptions. Specifically, T-FIT includes three key components: (i) a counterfactual abduction module, which learns an exogenous variable of the source image via a denoising U-Net model; (ii) a concept learning module, which identifies concepts in fashion image editing—such as clothing types and colors—and projects a target concept into the space spanned by a series of textual prompts; (iii) a concept composition module, which enables simultaneous adjustments of multiple concepts by aggregating each concept’s direction vector obtained from the concept learning module. Extensive experiments demonstrate that our method efficiently achieves state-of-the-art performance on various fine-grained fashion image editing tasks, including single-concept editing (e.g., sleeve length, clothing type) as well as multi-concept editing (e.g., color \& sleeve length, fabric \& clothing type).
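As a rough picture of the concept composition module in point (iii) above, one can aggregate per-concept direction vectors in a text embedding space and apply them jointly to the source embedding. The sketch below is a minimal illustration under that assumption; the names and weighting scheme are hypothetical.

```python
import torch

def compose_concept_edit(src_embed, concept_dirs, weights=None):
    """src_embed:    (D,) embedding of the source description
       concept_dirs: dict name -> (D,) direction, e.g. embed('long sleeve') - embed('short sleeve')
       Returns an edited embedding that moves along all requested concept directions at once."""
    weights = weights or {k: 1.0 for k in concept_dirs}
    edit = torch.zeros_like(src_embed)
    for name, d in concept_dirs.items():
        edit = edit + weights[name] * d / d.norm()             # unit directions keep concepts comparable
    return src_embed + edit
```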
Controllable Human Image Generation with Personalized Multi-Garments
Yisol Choi · Sangkyung Kwak · Sihyun Yu · Hyungwon Choi · Jinwoo Shin
We present BootControl, a novel framework based on text-to-image diffusion models for controllable human image generation with multiple reference garments. Here, the main bottleneck is data acquisition for training: collecting a large-scale dataset of high-quality reference garment images per human subject is quite challenging, i.e., ideally, one needs to manually gather every single garment photograph worn by each human. To address this, we propose a data generation pipeline to construct a large synthetic dataset, consisting of human and multiple-garment pairs, by introducing a model to extract any reference garment images from each human image. To ensure data quality, we also propose a filtering strategy to remove undesirable generated data based on measuring perceptual similarities between the garment presented in the human image and the extracted garment. Finally, by utilizing the constructed synthetic dataset, we train a diffusion model with two parallel denoising paths that use multiple garment images as conditions to generate human images while preserving their fine-grained details. We further show the wide applicability of our framework by adapting it to different types of reference-based generation in the fashion domain, including virtual try-on, and controllable human image generation with other conditions, e.g., pose, face, etc.
AIM-Fair: Advancing Algorithmic Fairness via Selectively Fine-Tuning Biased Models with Contextual Synthetic Data
Zengqun Zhao · Ziquan Liu · Yu Cao · Shaogang Gong · Ioannis Patras
Recent advances in generative models have sparked research on improving model fairness with AI-generated data. However, existing methods often face limitations in the diversity and quality of synthetic data, leading to compromised fairness and overall model accuracy. Moreover, many approaches rely on the availability of demographic group labels, which are often costly to annotate. This paper proposes AIM-Fair, aiming to overcome these limitations and harness the potential of cutting-edge generative models in promoting algorithmic fairness. We investigate a fine-tuning paradigm starting from a biased model initially trained on real-world data without demographic annotations. This model is then fine-tuned using unbiased synthetic data generated by a state-of-the-art diffusion model to improve its fairness. Two key challenges are identified in this fine-tuning paradigm: 1) the low quality of synthetic data, which can still happen even with advanced generative models, and 2) the domain and bias gap between real and synthetic data. To address the limitation of synthetic data quality, we propose Contextual Synthetic Data Generation (CSDG) to generate data using a text-to-image diffusion model (T2I) with prompts generated by a context-aware LLM, ensuring both data diversity and control of bias in synthetic data. To resolve domain and bias shifts, we introduce a novel selective fine-tuning scheme in which only model parameters more sensitive to bias and less sensitive to domain shift are updated. Experiments on CelebA and UTKFace datasets show that our AIM-Fair improves model fairness while maintaining utility, outperforming both fully and partially fine-tuned approaches to model fairness.
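The selective fine-tuning scheme updates only parameters that are sensitive to bias but stable under domain shift; the abstract does not state how sensitivity is measured. One plausible proxy, sketched below, scores each parameter tensor by squared-gradient magnitude under a bias objective versus a domain objective and freezes the rest. The loss inputs, scoring rule, and keep ratio are assumptions.

```python
import torch

def select_parameters(model, bias_loss, domain_loss, keep_ratio=0.2):
    """Score each trainable parameter tensor by gradient sensitivity to a bias objective
    relative to a domain-shift objective, then keep only the top fraction trainable."""
    params = [p for p in model.parameters() if p.requires_grad]

    def grad_norms(loss):
        grads = torch.autograd.grad(loss, params, retain_graph=True, allow_unused=True)
        return [g.pow(2).sum() if g is not None else torch.tensor(0.0, device=loss.device)
                for g in grads]

    s_bias, s_dom = grad_norms(bias_loss), grad_norms(domain_loss)
    scores = torch.stack([b / (d + 1e-8) for b, d in zip(s_bias, s_dom)])
    k = max(1, int(keep_ratio * len(scores)))
    selected = set(scores.topk(k).indices.tolist())

    for i, p in enumerate(params):
        p.requires_grad_(i in selected)        # freeze all but bias-sensitive, domain-stable tensors
    return selected
```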
Precise, Fast, and Low-cost Concept Erasure in Value Space: Orthogonal Complement Matters
Yuan Wang · Ouxiang Li · Tingting Mu · Yanbin Hao · Kuien Liu · Xiang Wang · Xiangnan He
The success of text-to-image generation enabled by diffusion models has imposed an urgent need to erase unwanted concepts, e.g., copyrighted, offensive, and unsafe ones, from the pre-trained models in a precise, timely, and low-cost manner. The twofold demand of concept erasure requires a precise removal of the target concept during generation (i.e., erasure efficacy) while having a minimal impact on non-target content generation (i.e., prior preservation). Existing methods are either computationally costly or face challenges in maintaining an effective balance between erasure efficacy and prior preservation. To improve, we propose a precise, fast, and low-cost concept erasure method, called \textbf{Ada}ptive \textbf{V}alue \textbf{D}ecomposer (AdaVD), which is training-free. This method is grounded in a classical linear algebraic orthogonal complement operation, implemented in the value space of each cross-attention layer within the UNet of diffusion models. An effective shift factor is designed to adaptively navigate the erasure strength, enhancing prior preservation without sacrificing erasure efficacy. Extensive experimental results show that the proposed AdaVD is effective at both single and multiple concept erasure, showing a 2- to 10-fold improvement in prior preservation as compared to the second best, meanwhile achieving the best or near-best erasure efficacy when compared with both training-based and training-free state-of-the-art methods. AdaVD supports a series of diffusion models and downstream image generation tasks, with code to be made publicly available.
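The core orthogonal-complement operation in value space can be illustrated directly: remove, or partially remove via the shift factor, the component of each value vector that lies along the target concept's value direction. The sketch below is a minimal rendering of that idea, not AdaVD's full adaptive shift design.

```python
import torch

def erase_in_value_space(values, concept_value, shift=1.0):
    """values:        (N, D) value vectors in one cross-attention layer
       concept_value: (D,)   value vector of the concept to erase (e.g., from its token)
       Removes the component of each value vector along the concept direction; shift < 1
       erases only partially, trading erasure strength for prior preservation."""
    u = concept_value / concept_value.norm()                   # unit concept direction
    proj = (values @ u).unsqueeze(-1) * u                      # component along the concept
    return values - shift * proj                               # orthogonal-complement projection when shift = 1
```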
Six-CD: Benchmarking Concept Removals for Text-to-image Diffusion Models
Jie Ren · Kangrui Chen · Yingqian Cui · Shenglai Zeng · Hui Liu · Yue Xing · Jiliang Tang · Lingjuan Lyu
Text-to-image (T2I) diffusion models have shown exceptional capabilities in generating images that closely correspond to textual prompts. However, the advancement of T2I diffusion models presents significant risks, as the models could be exploited for malicious purposes, such as generating images with violence or nudity, or creating unauthorized portraits of public figures in inappropriate contexts. To mitigate these risks, concept removal methods have been proposed. These methods aim to modify diffusion models to prevent the generation of malicious and unwanted concepts. Despite these efforts, existing research faces several challenges: (1) a lack of consistent comparisons on a comprehensive dataset, (2) ineffective prompts in harmful and nudity concepts, (3) overlooked evaluation of the ability to generate the benign part within prompts containing malicious concepts. To address these gaps, we propose to benchmark the concept removal methods by introducing a new dataset, Six-CD, along with a novel evaluation metric. In this benchmark, we conduct a thorough evaluation of concept removals, with the experimental observations and discussions offering valuable insights in the field.
Implicit Bias Injection Attacks against Text-to-Image Diffusion Models
Huayang Huang · Xiangye Jin · Jiaxu Miao · Yu Wu
The proliferation of text-to-image diffusion models (T2I DMs) has led to an increased presence of AI-generated images in daily life. However, biased T2I models can generate content with specific tendencies, potentially influencing people's perceptions. Intentional exploitation of these biases risks conveying misleading information to the public. Current research on bias primarily addresses explicit biases with recognizable visual patterns, such as skin color and gender. This paper introduces a novel form of implicit bias that lacks explicit visual features but can manifest in diverse ways across various semantic contexts. This subtle and versatile nature makes this bias challenging to detect, easy to propagate, and adaptable to a wide range of scenarios. We further propose an implicit bias injection attack framework (IBI-Attacks) against T2I diffusion models by precomputing a general bias direction in the prompt embedding space and adaptively adjusting it based on different inputs. Our attack module can be seamlessly integrated into pre-trained diffusion models in a plug-and-play manner without direct manipulation of user input or model retraining. Extensive experiments validate the effectiveness of our scheme in introducing bias through subtle and diverse modifications while preserving the original semantics. The strong concealment and transferability of our attack across various scenarios further underscore the significance of our approach.
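The attack's central step, adding a precomputed bias direction to the prompt embedding in a plug-and-play way, can be sketched as below. The per-input rescaling hook (input_scale) and all names are illustrative assumptions rather than the paper's interface.

```python
import torch

def apply_bias_direction(prompt_embed, bias_dir, strength=1.0, input_scale=None):
    """prompt_embed: (N_tokens, D) text-encoder output for the user prompt
       bias_dir:     (D,) precomputed general bias direction in the embedding space
       The direction is added to the prompt embedding, optionally rescaled per input,
       leaving the user's text itself untouched."""
    d = bias_dir / bias_dir.norm()
    scale = strength if input_scale is None else strength * input_scale(prompt_embed)
    return prompt_embed + scale * d                            # broadcast over tokens
```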
Are Images Indistinguishable to Humans Also Indistinguishable to Classifiers?
Zebin You · Xinyu Zhang · Hanzhong Guo · Jingdong Wang · Chongxuan Li
The ultimate goal of generative models is to perfectly capture the data distribution. For image generation, common metrics of visual quality (e.g., FID) and the perceived truthfulness of generated images seem to suggest that we are nearing this goal. However, through distribution classification tasks, we reveal that, from the perspective of neural network-based classifiers, even advanced diffusion models are still far from this goal. Specifically, classifiers are able to consistently and effortlessly distinguish real images from generated ones across various settings. Moreover, we uncover an intriguing discrepancy: classifiers can easily differentiate between diffusion models with comparable performance (e.g., U-ViT-H vs. DiT-XL), but struggle to distinguish between models within the same family but of different scales (e.g., EDM2-XS vs. EDM2-XXL). Our methodology carries several important implications. First, it naturally serves as a diagnostic tool for diffusion models by analyzing specific features of generated data. Second, it sheds light on the model autophagy disorder and offers insights into the use of generated data: augmenting real data with generated data is more effective than replacing it.
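The distribution-classification protocol reduces to training a binary probe on real versus generated images. A minimal sketch over pre-extracted features follows; the backbone, probe, and evaluation setup are assumptions, and the accuracy here is computed on the training set for brevity (a held-out split should be used in practice).

```python
import torch
import torch.nn as nn

def train_real_vs_fake_probe(features_real, features_fake, epochs=10, lr=1e-3):
    """Linear probe on pre-extracted image features (e.g., from a frozen backbone).
    High accuracy means the classifier separates real from generated images easily."""
    x = torch.cat([features_real, features_fake])              # (N, D)
    y = torch.cat([torch.ones(len(features_real)), torch.zeros(len(features_fake))])
    probe = nn.Linear(x.shape[1], 1)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(x).squeeze(-1), y)
        loss.backward()
        opt.step()
    with torch.no_grad():
        acc = ((probe(x).squeeze(-1) > 0).float() == y).float().mean()
    return probe, acc.item()
```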
Nearly Zero-Cost Protection Against Mimicry by Personalized Diffusion Models
Namhyuk Ahn · KiYoon Yoo · Wonhyuk Ahn · Daesik Kim · Seung-Hun Nam
Recent advancements in diffusion models revolutionize image generation but pose risks of misuse, such as replicating artworks or generating deepfakes. Existing image protection methods, though effective, struggle to balance protection efficacy, invisibility, and latency, thus limiting practical use. We introduce perturbation pre-training to reduce latency and propose a mixture-of-perturbations approach that dynamically adapts to input images to minimize performance degradation. Our novel training strategy computes protection loss across multiple VAE feature spaces, while adaptive targeted protection at inference enhances robustness and invisibility. Experiments show comparable protection performance with improved invisibility and drastically reduced inference time. The code and demo are available at
Fingerprinting Denoising Diffusion Probabilistic Models
Huan Teng · Yuhui Quan · Chengyu Wang · Jun Huang · Hui Ji
Diffusion models, especially Denoising Diffusion Probabilistic Models (DDPMs) and their variants, are prevalent tools in generative AI, making the protection of their Intellectual Property (IP) rights increasingly important. Most existing methods for IP right protection of DDPMs are invasive, e.g., watermarking methods, which alter model parameters, raise concerns about performance degradation, and require extra computational resources for retraining or fine-tuning. In this paper, we propose the first non-invasive fingerprinting scheme for DDPMs, requiring no parameter changes or fine-tuning, and ensuring that the generation quality of DDPMs remains intact. We introduce a discriminative and robust fingerprint latent space, based on the well-designed crossing route of samples that span the performance border zone of DDPMs, with only black-box access required for the diffusion denoiser in the ownership verification stage. Extensive experiments demonstrate that our fingerprinting approach enjoys both robustness against commonly seen attacks and distinctiveness on various DDPMs, providing an alternative for protecting DDPMs' IP rights without compromising their performance or integrity.
Where's the Liability in the Generative Era? Recovery-based Black-Box Detection of AI-Generated Content
Haoyue Bai · Yiyou Sun · Wei Cheng · Haifeng Chen
The recent proliferation of photorealistic images created by generative models has sparked both excitement and concern, as these images are increasingly indistinguishable from real ones to the human eye. While offering new creative and commercial possibilities, the potential for misuse, such as in misinformation and fraud, highlights the need for effective detection methods. Current detection approaches often rely on access to model weights or require extensive collections of real image datasets, limiting their scalability and practical application in real-world scenarios. In this work, we introduce a novel black-box detection framework that requires only API access, sidestepping the need for model weights or large auxiliary datasets. Our approach leverages a corrupt-and-recover strategy: by masking part of an image and assessing the model’s ability to reconstruct it, we measure the likelihood that the image was generated by the model itself. For black-box models that do not support masked-image inputs, we incorporate a cost-efficient surrogate model trained to align with the target model’s distribution, enhancing detection capability. Our framework demonstrates strong performance, outperforming baseline methods by 4.31% in mean average precision across eight diffusion model variant datasets.
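The corrupt-and-recover score can be sketched as follows: mask a region, ask the target model (or an aligned surrogate) to reconstruct it, and treat low reconstruction error as evidence that the image came from that model. The inpaint_fn interface below is an assumed stand-in for the API access the paper relies on, not its actual implementation.

```python
import torch

@torch.no_grad()
def recovery_score(image, inpaint_fn, mask, n_trials=4):
    """image:      (C, H, W) tensor in [0, 1]
       inpaint_fn: callable (corrupted_image, mask) -> reconstruction; assumed interface
       mask:       (1, H, W) binary tensor, 1 = region to corrupt and recover
       Lower average reconstruction error in the masked region suggests the image was
       generated by the model behind inpaint_fn."""
    errs = []
    for _ in range(n_trials):
        corrupted = image * (1 - mask)                         # zero out the masked region
        recon = inpaint_fn(corrupted, mask)
        errs.append(((recon - image) * mask).pow(2).sum() / mask.sum())
    return torch.stack(errs).mean()                            # detection statistic (lower = more suspicious)
```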
SIDA: Social Media Image Deepfake Detection, Localization and Explanation with Large Multimodal Model
Zhenglin Huang · Jinwei Hu · Yiwei He · Xiangtai Li · Xiaowei Huang · Bei Peng · Xingyu Zhao · Baoyuan Wu · Guangliang Cheng
The rapid advancement of generative models in creating highly realistic images poses substantial risks for misinformation dissemination. For instance, a synthetic image, when shared on social media, can mislead extensive audiences and erode trust in digital content, resulting in severe repercussions. Despite some progress, academia has not yet created a large and diversified deepfake detection dataset for social media, nor has it devised an effective solution to address this issue. In this paper, we introduce the $\textbf{S}$ocial media $\textbf{I}$mage $\textbf{D}$etection data$\textbf{Set}$ (SID-Set), which offers three key advantages: (1) $\textbf{extensive volume}$, featuring 300K AI-generated/tampered and authentic images with comprehensive annotations, (2) $\textbf{broad diversity}$, encompassing fully synthetic and tampered images across various classes, and (3) $\textbf{elevated realism}$, with images that are predominantly indistinguishable from genuine ones through mere visual inspection. Furthermore, leveraging the exceptional capabilities of large multimodal models, we propose a new image deepfake detection, localization, and explanation framework, named SIDA ($\textbf{S}$ocial media $\textbf{I}$mage $\textbf{D}$etection, localization, and explanation $\textbf{A}$ssistant). SIDA not only discerns the authenticity of images, but also delineates tampered regions through mask prediction and provides textual explanations of the model's judgment criteria. Compared with state-of-the-art deepfake detection models on SID-Set and other benchmarks, extensive experiments demonstrate that SIDA achieves superior performance among diversified settings. The code, model, and dataset will be released.
Be More Specific: Evaluating Object-centric Realism in Synthetic Images
Anqi Liang · Ciprian Adrian Corneanu · Qianli Feng · Giorgio Giannone · Aleix Martinez
Evaluation of synthetic images is important for both model development and selection. An ideal evaluation should be specific, accurate, and aligned with human perception. This paper addresses the problem of evaluating the realism of objects in synthetic images. Although methods have been proposed to evaluate holistic realism, there are no methods tailored towards object-centric realism evaluation. In this work, we define a new standard for assessing object-centric realism that follows a shape-texture breakdown and propose the first object-centric realism evaluation dataset for synthetic images. The dataset contains images generated from state-of-the-art image generative models and is richly annotated at object level across a diverse set of object categories. We then design and train the OLIP model, a dedicated architecture that considerably outperforms any existing baseline on object-centric realism evaluation.
NSD-Imagery: A Benchmark Dataset for Extending fMRI Vision Decoding Methods to Mental Imagery
Reese Kneeland · Paul Scotti · Ghislain St-Yves · Jesse L Breedlove · Kendrick N Kay · Thomas Naselaris
We release NSD-Imagery, a benchmark dataset of human fMRI activity paired with mental images, to complement the existing Natural Scenes Dataset (NSD), a large-scale dataset of fMRI activity paired with seen images that enabled unprecedented improvements in fMRI-to-image reconstruction efforts. Recent models trained on NSD have been evaluated only on seen image reconstruction. Using NSD-Imagery, it is possible to assess how well these models perform on mental image reconstruction. This is a challenging generalization requirement because mental images are encoded in human brain activity with relatively lower signal-to-noise and spatial resolution; however, generalization from seen to mental imagery is critical for real-world applications in medical domains and brain-computer interfaces, where the desired information is always internally generated. We provide benchmarks for a suite of recent NSD-trained open-source visual decoding models (MindEye1, MindEye2, Brain Diffuser, iCNN, Takagi et al.) on NSD-Imagery, and show that the performance of decoding methods on mental images is largely decoupled from performance on vision tasks. We further demonstrate that architectural choices significantly impact cross-decoding performance: models employing simple linear decoding architectures and multimodal feature decoding generalize better to mental imagery, while complex architectures tend to overfit training data recorded exclusively from vision. Our findings indicate that mental imagery datasets are critical for the development of practical applications, and establish NSD-Imagery as a useful resource for better aligning visual decoding methods with this goal.
State Space Models (SSMs) are powerful tools for modeling sequential data in computer vision and time series analysis domains. However, traditional SSMs are limited by fixed, one-dimensional sequential processing, which restricts their ability to model non-local interactions in high-dimensional data. While methods like Mamba and VMamba introduce selective and flexible scanning strategies, they rely on predetermined paths, which fail to efficiently capture complex dependencies. We introduce Graph-Generating State Space Models (GG-SSMs), a novel framework that overcomes these limitations by dynamically constructing graphs based on feature relationships. Using Chazelle's Minimum Spanning Tree algorithm, GG-SSMs adapt to the inherent data structure, enabling robust feature propagation across dynamically generated graphs and efficiently modeling complex dependencies. We validate GG-SSMs on 11 diverse datasets, including event-based eye-tracking, ImageNet classification, optical flow estimation, and six time series datasets. GG-SSMs achieve state-of-the-art performance across all tasks, surpassing existing methods by significant margins. Specifically, GG-SSM attains a top-1 accuracy of 84.9% on ImageNet, outperforming prior SSMs by 1%, reducing the KITTI-15 error rate to 2.77%, and improving eye-tracking detection rates by up to 0.33% with fewer parameters. These results demonstrate that dynamic scanning based on feature relationships significantly improves SSMs' representational power and efficiency, offering a versatile tool for various applications in computer vision and beyond.
Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders
Fiona Ryan · Ajay Bati · Sangmin Lee · Daniel Bolya · Judy Hoffman · James Rehg
We address the problem of gaze target estimation, which aims to predict where a person is looking in a scene. Predicting a person’s gaze target requires reasoning both about the person’s appearance and the contents of the scene. Prior works have developed increasingly complex, hand-crafted pipelines for gaze target estimation that carefully fuse features from separate scene encoders, head encoders, and auxiliary models for signals like depth and pose. Motivated by the success of general-purpose feature extractors on a variety of visual tasks, we propose Gaze-LLE, a novel transformer framework that streamlines gaze target estimation by leveraging features from a frozen DINOv2 encoder. We extract a single feature representation for the scene, and apply a person-specific positional prompt to decode gaze with a lightweight module. We demonstrate state-of-the-art performance across several gaze benchmarks and provide extensive analysis to validate our design choices.
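The overall composition, frozen backbone features plus a person-specific positional prompt and a lightweight decoder, can be sketched as below. The decoder depth, head, and token shapes are assumptions; Gaze-LLE's actual head may differ.

```python
import torch
import torch.nn as nn

class GazeHead(nn.Module):
    """Lightweight decoder over frozen scene features: add a learned positional prompt
    at the tokens covering the queried person's head, then predict a gaze heatmap."""
    def __init__(self, dim=768, num_layers=3):
        super().__init__()
        self.person_prompt = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.to_heatmap = nn.Linear(dim, 1)

    def forward(self, frozen_tokens, head_mask):
        # frozen_tokens: (B, N, D) from a frozen backbone (e.g., DINOv2), no gradients needed
        # head_mask:     (B, N, 1) 1 for tokens inside the queried person's head region
        x = frozen_tokens + head_mask * self.person_prompt      # person-specific positional prompt
        x = self.decoder(x)
        return self.to_heatmap(x).squeeze(-1)                   # (B, N) per-token gaze logits
```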
EgoLife: Towards Egocentric Life Assistant
Jingkang Yang · Shuai Liu · Hongming Guo · Yuhao Dong · Xiamengwei Zhang · Sicheng Zhang · Pengyun Wang · Zitang Zhou · Binzhu Xie · Ziyue Wang · Bei Ouyang · Zhengyu Lin · Marco Cominelli · Zhongang Cai · Bo Li · Yuanhan Zhang · Peiyuan Zhang · Fangzhou Hong · Joerg Widmer · Francesco Gringoli · Lei Yang · Ziwei Liu
We introduce EgoLife, a project to develop an egocentric life assistant that accompanies and enhances personal efficiency through AI-powered wearable glasses. To lay the foundation for this assistant, we conducted a comprehensive data collection study where six participants lived together for one week, continuously recording their daily activities—including discussions, shopping, cooking, socializing, and entertainment—using AI glasses for multimodal egocentric video capture, along with synchronized third-person-view video references. This effort resulted in the EgoLife Dataset, a comprehensive 300-hour egocentric, interpersonal, multiview, and multimodal daily life dataset with intensive annotation. Leveraging this dataset, we introduce EgoLifeQA, a suite of long-context, life-oriented question-answering tasks designed to provide meaningful assistance in daily life by addressing practical questions such as recalling past relevant events, monitoring health habits, and offering personalized recommendations. To address the key technical challenges of 1) developing robust visual-audio models for egocentric data, 2) enabling accurate identity recognition, and 3) facilitating long-context question answering over extensive temporal information, we introduce EgoButler, an integrated system comprising EgoGPT and EgoRAG. EgoGPT is a vision-language model trained on egocentric datasets, achieving state-of-the-art performance on egocentric video understanding. EgoRAG is a retrieval-based component that supports answering ultra-long-context questions. Our experimental studies verify their working mechanisms and reveal critical factors and bottlenecks, guiding future improvements. By releasing our datasets, models, and benchmarks, we aim to stimulate further research in egocentric AI assistants.
MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
Ho Kei Cheng · Masato Ishii · Akio Hayakawa · Takashi Shibuya · Alexander G. Schwing · Yuki Mitsufuji
We propose to synthesize high-quality and synchronized audio, given video and optional text conditions, using a novel multimodal joint training framework (MMAudio). In contrast to single-modality training conditioned on (limited) video data only, MMAudio is jointly trained with larger-scale, readily available text-audio data to learn to generate semantically aligned high-quality audio samples. Additionally, we improve audio-visual synchrony with a conditional synchronization module that aligns video conditions with audio latents at the frame level. Trained with a flow matching objective, MMAudio achieves new video-to-audio state-of-the-art among public models in terms of audio quality, semantic alignment, and audio-visual synchronization, while having a low inference time (1.23s to generate an 8s clip) and just 157M parameters. MMAudio also achieves surprisingly competitive performance in text-to-audio generation, showing that joint training does not hinder single-modality performance. Code and models will be made available.
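The flow matching objective mentioned above has a standard generic form: interpolate between noise and data latents along a straight line and regress the predicted velocity onto x1 - x0. The sketch below shows that generic loss; MMAudio's exact conditioning interface is an assumption.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, audio_latents, video_cond, text_cond):
    """Generic rectified-flow / flow-matching objective; the model signature
    (x_t, t, video_cond, text_cond) is assumed for illustration."""
    x1 = audio_latents                                          # data latents
    x0 = torch.randn_like(x1)                                   # noise sample
    t = torch.rand(x1.shape[0], device=x1.device).view(-1, *([1] * (x1.dim() - 1)))
    x_t = (1 - t) * x0 + t * x1                                 # straight-line interpolation
    v_target = x1 - x0                                          # constant velocity along the path
    v_pred = model(x_t, t.flatten(), video_cond, text_cond)
    return F.mse_loss(v_pred, v_target)
```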
Foley-Flow: Coordinated Video-to-Audio Generation with Masked Audio-Visual Alignment and Dynamic Conditional Flows
Shentong Mo · Yibing Song
Coordinated audio generation based on video inputs typically requires a strict audio-visual (AV) alignment, where both semantics and rhythmics of the generated audio segments shall correspond to those in the video frames. Previous studies leverage a two-stage design where the AV encoders are firstly aligned via contrastive learning, then the encoded video representations guide the audio generation process. We observe that both contrastive learning and global video guidance are effective in aligning overall AV semantics while limiting temporally rhythmic synchronization. In this work, we propose Foley-Flow to first align unimodal AV encoders via masked modeling training, where the masked audio segments are recovered under the guidance of the corresponding video segments. After training, the AV encoders which are separately pretrained using only unimodal data are aligned with semantic and rhythmic consistency. Then, we develop a dynamic conditional flow for the final audio generation. Built upon the efficient velocity flow generation framework, our dynamic conditional flow utilizes temporally varying video features as the dynamic condition to guide corresponding audio segment generations. To this end, we extract coherent semantic and rhythmic representations during masked AV alignment, and use this representation of video segments to guide audio generation temporally. Our audio results are evaluated on the standard benchmarks and largely surpass existing results under several metrics. The superior performance indicates that Foley-Flow is effective in generating coordinated audios that are both semantically and rhythmically coherent to various video sequences.
Robust Audio-Visual Segmentation via Audio-Guided Visual Convergent Alignment
Chen Liu · Peike Li · Liying Yang · Dadong Wang · Lincheng Li · Xin Yu
Accurately localizing audible objects based on audio-visual cues is the core objective of audio-visual segmentation. Most previous methods emphasize spatial or temporal multi-modal modeling, yet overlook challenges from ambiguous audio-visual correspondences—such as nearby visually similar but acoustically different objects and frequent shifts in objects' sounding status. Consequently, they may struggle to reliably correlate audio and visual cues, leading to over- or under-segmentation. To address these limitations, we propose a novel framework with two primary components: an audio-guided modality alignment (AMA) module and an uncertainty estimation (UE) module. Instead of indiscriminately correlating audio-visual cues through a global attention mechanism, AMA performs audio-visual interactions within multiple groups and consolidates group features into compact representations based on their responsiveness to audio cues, effectively directing the model’s attention to audio-relevant areas. Leveraging contrastive learning, AMA further distinguishes sounding regions from silent areas by treating features with strong audio responses as positive samples and weaker responses as negatives. Additionally, UE integrates spatial and temporal information to identify high-uncertainty regions caused by frequent changes in sound state, reducing prediction errors by lowering confidence in these areas. Experimental results demonstrate that our approach achieves superior accuracy compared to existing state-of-the-art methods, particularly in challenging scenarios where traditional approaches struggle to maintain reliable segmentation.
Learning Audio-guided Video Representation with Gated Attention for Video-Text Retrieval
Boseung Jeong · Jicheol Park · Sungyeon Kim · Suha Kwak
Video-text retrieval, the task of retrieving videos based on a textual query or vice versa, is of paramount importance for video understanding and multimodal information retrieval. Recent methods in this area rely primarily on visual and textual features and often ignore audio, although it helps enhance overall comprehension of video content. Moreover, traditional models that incorporate audio blindly utilize the audio input regardless of whether it is useful or not, resulting in suboptimal video representation. To address these limitations, we propose a novel video-text retrieval framework, Audio-guided VIdeo representation learning with GATEd attention (AVIGATE), that effectively leverages audio cues through a gated attention mechanism that selectively filters out uninformative audio signals. In addition, we propose an adaptive margin-based contrastive loss to deal with the inherently unclear positive-negative relationship between video and text, which facilitates learning better video-text alignment. Our extensive experiments demonstrate that AVIGATE achieves state-of-the-art performance on all the public benchmarks.
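The gated attention idea can be sketched as cross-attending video tokens to audio tokens and scaling the attended audio by a learned gate, so uninformative audio contributes little. The module below is a generic illustration with assumed dimensions, not AVIGATE's exact design.

```python
import torch
import torch.nn as nn

class GatedAudioFusion(nn.Module):
    """Video tokens attend to audio tokens; a scalar gate per video token decides how
    much of the attended audio is mixed back into the video representation."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                  nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, video_tokens, audio_tokens):
        # video_tokens: (B, T, D), audio_tokens: (B, S, D)
        attended, _ = self.attn(video_tokens, audio_tokens, audio_tokens)
        g = self.gate(torch.cat([video_tokens, attended], dim=-1))   # (B, T, 1), ~0 for uninformative audio
        return video_tokens + g * attended
```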
SAM2-LOVE: Segment Anything Model 2 in Language-aided Audio-Visual Scenes
Yuji Wang · Haoran Xu · Yong Liu · Jiaze Li · Yansong Tang
Reference Audio-Visual Segmentation (Ref-AVS) aims to provide a pixel-wise scene understanding in Language-aided Audio-Visual Scenes (LAVS). This task requires the model to continuously segment objects referred to by text and audio from a video. Previous dual-modality methods always fail due to the lack of a third modality and the existing triple-modality method struggles with spatio-temporal consistency, leading to the target shift of different frames. In this work, we introduce a novel framework, termed SAM2-LOVE, which integrates textual, audio, and visual representations into a learnable token to prompt and align SAM2 for achieving Ref-AVS in the LAVS. Technically, our approach includes a multimodal fusion module aimed at improving multimodal understanding of SAM2, as well as token propagation and accumulation strategies designed to enhance spatio-temporal consistency without forgetting historical information. We conducted extensive experiments to demonstrate that SAM2-LOVE outperforms the SOTA by 8.5\% in J&F on the Ref-AVS benchmark and showcase the simplicity and effectiveness of the components. Our code will be available soon.
Sound Bridge: Associating Egocentric and Exocentric Videos via Audio Cues
Sihong Huang · Jiaxin Wu · Xiaoyong Wei · Yi Cai · Dongmei Jiang · Yaowei Wang
Understanding human behavior and environmental information in egocentric video is very challenging due to the invisibility of some actions (e.g., laughing and sneezing) and the local nature of the first-person view. Leveraging the corresponding exocentric video to provide global context has shown promising results. However, existing visual-to-visual and visual-to-textual Ego-Exo video alignment methods struggle with the problem that there could be non-visual overlap for the same activity. To address this, we propose using sound as a bridge, as audio is often consistent across Ego-Exo videos. However, direct audio-to-audio alignment lacks context. Thus, we introduce two context-aware sound modules: one aligns audio with vision via a visual-audio cross-attention module, and another aligns text with sound closed captions generated by an LLM. Experimental results on two Ego-Exo video association benchmarks show that either of the two proposed modules manages to improve the state-of-the-art methods. Moreover, the proposed sound-aware egocentric or exocentric representation boosts the performance of downstream tasks, such as action recognition of exocentric videos and scene recognition of egocentric videos. The code and models can be accessed at https://github.com/openuponacceptance.
VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection
Songhao Han · Wei Huang · Hairong Shi · Le Zhuo · Xiu Su · Shifeng Zhang · Xu Zhou · Xiaojuan Qi · Yue Liao · Si Liu
The advancement of Large Vision Language Models (LVLMs) has significantly improved multimodal understanding, yet challenges remain in video reasoning tasks due to the scarcity of high-quality, large-scale datasets. Existing video question-answering (VideoQA) datasets often rely on costly manual annotations with insufficient granularity or automatic construction methods with redundant frame-by-frame analysis, limiting their scalability and effectiveness for complex reasoning. To address these challenges, we introduce VideoEspresso, a novel dataset that features VideoQA pairs preserving essential spatial details and temporal coherence, along with multimodal annotations of intermediate reasoning steps. Our construction pipeline employs a semantic-aware method to reduce redundancy, followed by generating QA pairs using GPT-4o. We further develop video Chain-of-Thought (CoT) annotations to enrich reasoning processes, guiding GPT-4o in extracting logical relationships from QA pairs and video content. To exploit the potential of high-quality VideoQA pairs, we propose a Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a two-stage instruction fine-tuned reasoning LVLM. This framework adaptively selects core frames and performs CoT reasoning using multimodal evidence. Evaluated on our proposed benchmark with 14 tasks against 9 popular LVLMs, our method outperforms existing baselines on most tasks, demonstrating superior video reasoning capabilities.
BASKET: A Large-Scale Video Dataset for Fine-Grained Skill Estimation
Yulu Pan · Ce Zhang · Gedas Bertasius
We present BASKET, a large-scale basketball video dataset for fine-grained skill estimation. BASKET contains more than 4,400 hours of video capturing 32,232 basketball players from all over the world. Compared to prior skill estimation datasets, our dataset includes a massive number of skilled participants with unprecedented diversity in terms of gender, age, skill level, geographical location, etc. BASKET includes 20 fine-grained basketball skills, challenging modern video recognition models to capture the intricate nuances of player skill through in-depth video analysis. Given a long highlight video (8-10 minutes) of a particular player, the model needs to predict the skill level (e.g., excellent, good, average, fair, poor) for each of the 20 basketball skills. Our empirical analysis reveals that the current state-of-the-art video models struggle with this task, significantly lagging behind the human baseline. We believe that BASKET could be a useful resource for developing new video models with advanced long-range, fine-grained recognition capabilities. In addition, we hope that our dataset will be useful for domain-specific applications such as fair basketball scouting, personalized player development, and many others. We will release the dataset upon the acceptance of the paper.
SEAL: Semantic Attention Learning for Long Video Representation
Lan Wang · Yujia Chen · Wen-Sheng Chu · Vishnu Naresh Boddeti · Du Tran
Long video understanding presents challenges due to the inherent high computational complexity and redundant temporal information. An effective representation for long videos must process such redundancy efficiently while preserving essential contents for downstream tasks. This paper introduces SEmantic Attention Learning (SEAL), a novel unified representation for long videos. To reduce computational complexity, long videos are decomposed into three distinct types of semantic entities: scenes, objects, and actions, allowing models to operate on a handful of entities rather than a large number of frames or pixels. To further address redundancy, we propose an attention learning module that balances token relevance with diversity, formulated as a subset selection optimization problem. Our representation is versatile, enabling applications across various long video understanding tasks. Extensive experiments show that SEAL significantly outperforms state-of-the-art methods on video question answering and temporal grounding tasks across benchmarks including LVBench, MovieChat-1K, and Ego4D.
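The subset-selection view of attention learning can be illustrated with a greedy maximal-marginal-relevance heuristic that trades off token relevance against redundancy; SEAL's actual module is learned, so the sketch below is only an analogy under assumed inputs.

```python
import torch
import torch.nn.functional as F

def select_tokens(tokens, relevance, k, lam=0.5):
    """tokens:    (N, D) candidate entity/token embeddings
       relevance: (N,)   task-relevance scores (e.g., similarity to the query)
       Greedily select k tokens that are relevant but not redundant with what
       has already been selected."""
    feats = F.normalize(tokens, dim=-1)
    selected = [int(relevance.argmax())]
    for _ in range(k - 1):
        sim_to_sel = feats @ feats[selected].T                  # (N, |selected|)
        redundancy = sim_to_sel.max(dim=-1).values
        score = lam * relevance - (1 - lam) * redundancy
        score[selected] = float("-inf")                         # never re-pick a token
        selected.append(int(score.argmax()))
    return tokens[selected], selected
```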
Unified Dense Prediction of Video Diffusion
Lehan Yang · Lu Qi · Xiangtai Li · Sheng Li · Varun Jampani · Ming-Hsuan Yang
We present a unified network for simultaneously generating videos and their corresponding entity segmentation and depth maps from text prompts. We utilize colormaps to represent entity masks and depth maps, tightly integrating dense prediction with RGB video generation. Introducing dense prediction information improves video generation's consistency and motion smoothness without increasing computational costs. Incorporating learnable task embeddings brings multiple dense prediction tasks into a single model, enhancing flexibility and further boosting performance. We further propose Panda-Dense, a large-scale dense prediction video dataset, addressing the issue that existing datasets do not concurrently contain captions, videos, segmentation, and depth maps. Comprehensive experiments demonstrate the high efficiency of our method, surpassing the state-of-the-art in terms of video quality, consistency, and motion smoothness. All source codes and models will be made publicly available.
InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption
Tiehan Fan · Kepan Nan · Rui Xie · Penghao Zhou · Zhenheng Yang · Chaoyou Fu · Xiang Li · Jian Yang · Ying Tai
Text-to-video generation has evolved rapidly in recent years, delivering remarkable results. Training typically relies on video-caption paired data, which plays a crucial role in enhancing generation performance. However, current video captions often suffer from insufficient details, hallucinations, and imprecise motion depiction, affecting the fidelity and consistency of generated videos. In this work, we propose a novel instance-aware structured caption framework, termed $\mathtt{InstanceCap}$, to achieve instance-level and fine-grained video caption for the first time. Based on this scheme, we design an auxiliary model cluster to convert original videos into instances to enhance instance fidelity. Video instances are further used to refine dense prompts into structured phrases, achieving concise yet precise descriptions. Furthermore, a $22$K $\mathtt{InstanceVid}$ dataset is curated for training, and an enhancement pipeline tailored to the $\mathtt{InstanceCap}$ structure is proposed for inference. Experimental results demonstrate that our proposed $\mathtt{InstanceCap}$ significantly outperforms previous models, ensuring high fidelity between captions and videos while reducing hallucinations.
MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation
Weijia Wu · Mingyu Liu · Zeyu Zhu · Haoen Feng · Xi Xia · Wen Wang · Kevin Qinghong Lin · Chunhua Shen · Mike Zheng Shou
Recent advancements in video generation models, such as Stable Video Diffusion, have shown promising results, but these works primarily focus on short videos, often limited to a single scene and lacking a rich storyline. These models struggle with generating long videos that involve multiple scenes, coherent narratives, and consistent characters. Furthermore, there is currently no publicly accessible dataset specifically designed for analyzing, evaluating, and training models for long video generation. In this paper, we present MovieBench: A Hierarchical Movie-Level Dataset for Long Video Generation, which addresses these challenges by providing unique contributions: (1) character consistency across scenes, (2) long videos with rich and coherent storylines, and (3) multi-scene narratives. MovieBench features three distinct levels of annotation: the movie level, which provides a broad overview of the film; the scene level, offering a mid-level understanding of the narrative; and the shot level, which emphasizes specific moments with detailed descriptions.
Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding
Feilong Tang · Chengzhi Liu · Zhongxing Xu · Ming Hu · Zile Huang · Haochen Xue · Ziyang Chen · Zelin Peng · Zhiwei Yang · Sijin Zhou · Wenxue Li · Yulong Li · Wenxuan Song · Shiyan Su · Wei Feng · Jionglong Su · Mingquan Lin · Yifan Peng · Xuelian Cheng · Imran Razzak · Zongyuan Ge
Recent advancements in multimodal large language models (MLLMs) have significantly improved performance in visual question answering. However, they often suffer from hallucinations. In this work, hallucinations are categorized into two main types: initial hallucinations and snowball hallucinations. We argue that adequate contextual information can be extracted directly from the token interaction process. Inspired by causal inference in decoding strategies, we propose to leverage causal masks to establish information propagation between multimodal tokens. The hypothesis is that insufficient interaction between those tokens may lead the model to rely on outlier tokens, overlooking dense and rich contextual cues. Therefore, we propose to intervene in the propagation process by tackling outlier tokens to enhance in-context inference. With this goal, we present FarSight, a versatile plug-and-play decoding strategy to reduce attention interference from outlier tokens merely by optimizing the causal mask. The heart of our method is effective token propagation. We design an attention register structure within the upper triangular matrix of the causal mask, dynamically allocating registers to capture the attention diverted to outlier tokens. Moreover, a positional awareness encoding method with a diminishing masking rate is proposed, allowing the model to attend to further preceding tokens, especially for video sequence tasks. With extensive experiments, FarSight demonstrates significant hallucination-mitigating performance across different MLLMs on both image and video benchmarks, proving its effectiveness.
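The abstract describes an attention register structure inside the causal mask and a diminishing masking rate with distance, without giving the exact parameterization. The sketch below is one hypothetical construction of such a soft causal mask, returned as an additive bias over register and token columns; all constants are illustrative, not FarSight's values.

```python
import torch

def soft_causal_mask_with_registers(seq_len, num_registers=4, base_rate=0.3, device="cpu"):
    """Hypothetical additive attention bias of shape (seq_len, seq_len + num_registers):
    register columns stay fully visible to every query, future tokens stay masked, and
    the masking penalty on past tokens shrinks with distance so that far-back context
    is not starved of attention."""
    n = seq_len + num_registers
    mask = torch.full((seq_len, n), float("-inf"), device=device)
    mask[:, :num_registers] = 0.0                               # registers visible to every query
    for i in range(seq_len):
        mask[i, num_registers + i] = 0.0                        # each token fully attends to itself
        if i == 0:
            continue
        j = torch.arange(i, device=device)                      # strictly preceding positions
        dist = (i - j).float()
        rate = base_rate / dist                                 # diminishing masking rate with distance
        mask[i, num_registers + j] = torch.log1p(-rate)         # soft penalty instead of hard -inf
    return mask
```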
SeriesBench: A Benchmark for Narrative-Driven Drama Series Understanding
Chenkai Zhang · Yiming Lei · Zeming Liu · Haitao Leng · Shaoguo Liu · Tingting Gao · Qingjie Liu · Yunhong Wang
With the rapid development of Multi-modal Large Language Models (MLLMs), an increasing number of benchmarks have been established to evaluate the video understanding capabilities of these models. However, these benchmarks focus solely on standalone videos and assess only “visual elements” in videos, such as human actions and object states. In reality, contemporary videos often encompass complex and continuous narratives, typically presented as a series. To address this challenge, we propose SeriesBench, a benchmark consisting of 105 carefully curated narrative-driven series, covering 28 specialized tasks that require deep narrative understanding to solve. Specifically, we first select a diverse set of drama series spanning various genres. Then, we introduce a novel long-span narrative annotation method, combined with a full-information transformation approach to convert manual annotations into diverse task formats. To further enhance the model's capacity for detailed analysis of plot structures and character relationships within series, we propose a novel narrative reasoning framework, PC-DCoT. Extensive results on SeriesBench indicate that existing MLLMs still face significant challenges in understanding narrative-driven series, while PC-DCoT enables these MLLMs to achieve performance improvements. Overall, our SeriesBench and PC-DCoT highlight the critical necessity of advancing model capabilities for understanding narrative-driven series, guiding future MLLM development.
Which Viewpoint Shows it Best? Language for Weakly Supervising View Selection in Multi-view Instructional Videos
Sagnik Majumder · Tushar Nagarajan · Ziad Al-Halah · Reina Pradhan · Kristen Grauman
Given a multi-view video, which viewpoint is most informative for a human observer? Existing methods rely on heuristics or expensive “best-view” supervision to answer this question, limiting their applicability. We propose a weakly supervised approach that leverages language accompanying an instructional multi-view video as a means to recover its most informative viewpoint(s). Our key hypothesis is that the more accurately an individual view can predict a view-agnostic text summary, the more informative it is. To put this into action, we propose a framework that uses the relative accuracy of view-dependent caption predictions as a proxy for best view pseudo-labels. Then, those pseudo-labels are used to train a view selector, together with an auxiliary camera pose predictor that enhances view-sensitivity. During inference, our model takes as input only a multi-view video—no language or camera poses—and returns the best viewpoint to watch at each timestep. On two challenging datasets comprised of diverse multi-camera setups and how-to activities, our model consistently outperforms state-of-the-art baselines, both with quantitative metrics and human evaluation.
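The pseudo-labeling rule reduces to: for each timestep, pick the view whose view-dependent caption best predicts the view-agnostic summary. A minimal sketch, assuming the per-view captioning losses have already been computed:

```python
import torch

def best_view_pseudo_labels(caption_losses):
    """caption_losses: (T, V) captioning loss of each of V views against the
       view-agnostic summary at each of T timesteps.
       The view with the lowest loss becomes the pseudo-label for training
       the view selector."""
    return caption_losses.argmin(dim=1)                         # (T,) best-view index per timestep
```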
VideoWorld: Exploring Knowledge Learning from Unlabeled Videos
Zhongwei Ren · Yunchao Wei · Xun Guo · Yao Zhao · Bingyi Kang · Jiashi Feng · Xiaojie Jin
This work explores whether a deep generative model can learn complex knowledge solely from visual input, in contrast to the prevalent focus on text-based models like large language models (LLMs). We develop an autoregressive video generation model, Visioner, trained exclusively on raw video data, and test its knowledge acquisition abilities in video-based Go and robotic control environments. Our experiments reveal two key findings: (1) video-only training provides sufficient information for learning extensive knowledge, and (2) the compactness of visual representations significantly enhances learning efficiency. To improve both the efficiency and efficacy of knowledge learning, we introduce the Latent Dynamics Model (LDM). Remarkably, Visioner reaches a 5-dan professional level in the Video-GoBench with just a 300-million-parameter model, without relying on search algorithms or reward mechanisms typical in reinforcement learning. This study opens new avenues for knowledge acquisition from visual data, with all code, data, and models to be open-sourced for further research.
ReSpec: Relevance and Specificity Grounded Online Filtering for Learning on Video-Text Data Streams
Chris Dongjoo Kim · Jihwan Moon · Sangwoo Moon · Heeseung Yun · Sihaeng Lee · Aniruddha Kembhavi · Soonyoung Lee · Gunhee Kim · Sangho Lee · Christopher Clark
The rapid growth of video-text data presents challenges in storage and computation during training. Online learning, which processes streaming data in real-time, offers a promising solution to these issues while also allowing swift adaptations in scenarios demanding real-time responsiveness. One strategy to enhance the efficiency and effectiveness of learning involves identifying and prioritizing data that enhances performance on target downstream tasks. We propose the $\textbf{Re}$levance and $\textbf{Spec}$ificity-based online filtering framework ($\textbf{ReSpec}$), which selects data based on four criteria: (i) modality alignment for clean data, (ii) task relevance for target-focused data, (iii) specificity for informative and detailed data, and (iv) efficiency for low-latency processing. Relevance is determined by the probabilistic alignment of incoming data with downstream tasks, while specificity employs the distance to a root embedding representing the least specific data as an efficient proxy for informativeness. By establishing reference points from target task data, ReSpec filters incoming data in real-time, eliminating the need for extensive storage and compute. Evaluated on the large-scale WebVid2M and VideoCC3M datasets, ReSpec attains state-of-the-art performance on five zero-shot video retrieval tasks, using as little as 5\% of the data while incurring minimal compute.
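A minimal sketch of what such a four-criteria online filter could look like is given below; the thresholds, scoring functions, and variable names are assumptions for illustration, not the ReSpec implementation:

```python
# Minimal, hypothetical relevance/specificity filter; thresholds and names are
# assumptions for illustration, not the ReSpec implementation.
import numpy as np

def keep_sample(video_emb, text_emb, task_refs, root_emb,
                align_thr=0.3, rel_thr=0.5, spec_thr=0.4):
    v = video_emb / np.linalg.norm(video_emb)
    t = text_emb / np.linalg.norm(text_emb)
    root = root_emb / np.linalg.norm(root_emb)
    # (i) modality alignment: the clip and its caption should agree
    if float(v @ t) < align_thr:
        return False
    # (ii) relevance: similarity to reference embeddings of the target task
    refs = task_refs / np.linalg.norm(task_refs, axis=1, keepdims=True)
    if float((refs @ t).max()) < rel_thr:
        return False
    # (iii) specificity: distance from a "root" embedding of least-specific data
    if float(np.linalg.norm(t - root)) < spec_thr:
        return False
    return True  # (iv) efficiency: only a handful of dot products per incoming sample

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 64
    print(keep_sample(rng.normal(size=d), rng.normal(size=d),
                      rng.normal(size=(5, d)), rng.normal(size=d)))
```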
Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks
Nina Shvetsova · Arsha Nagrani · Bernt Schiele · Hilde Kuehne · Christian Rupprecht
We propose a new “Unbiased through Textual Description (UTD)” video benchmark based on unbiased subsets of existing video classification and retrieval datasets to enable a more robust assessment of video understanding capabilities. Namely, we tackle the problem that current video benchmarks may suffer from different representation biases, e.g., object bias or single-frame bias, where mere recognition of objects or utilization of only a single frame is sufficient for correct prediction. We leverage VLMs and LLMs to analyze and debias video benchmarks from such representation biases. Specifically, we generate frame-wise textual descriptions of videos, filter them for specific information (e.g., only objects), and leverage them to examine representation biases across three dimensions: 1) concept bias — determining if a specific concept (e.g., objects) alone suffices for prediction; 2) temporal bias — assessing if temporal information contributes to prediction; and 3) common sense vs. dataset bias — evaluating whether zero-shot reasoning or dataset correlations contribute to prediction. Since our new toolkit allows us to analyze representation biases at scale without additional human annotation, we conduct a systematic and comprehensive analysis of representation biases in 12 popular video classification and retrieval datasets and create new object-debiased test splits for these datasets. Moreover, we benchmark 33 state-of-the-art video models on original and debiased splits and analyze biases in the models. To facilitate the future development of more robust video understanding benchmarks and models, we release: “UTD-descriptions”, a dataset with our rich structured descriptions for each dataset, and “UTD-splits”, a dataset of object-debiased test splits.
VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models
Dahun Kim · AJ Piergiovanni · Ganesh Satish Mallya · Anelia Angelova
We introduce a benchmark and learning framework for advancing video-text compositionality understanding, aimed at enhancing vision-language models (VLMs) in fine-grained temporal alignment. Unlike existing benchmarks focused on static image-text compositionality or isolated single-event videos, our benchmark focuses on fine-grained video-text alignment in continuous multi-event videos. Leveraging video-text datasets with temporally localized event captions (e.g., ActivityNet-Captions, YouCook2), we create challenging negative samples with subtle temporal disruptions such as reordering, action word replacements, partial captioning, and combined disruptions that comprehensively test models’ compositional sensitivity across extended, cohesive video-text sequences. To enhance model performance, we propose a hierarchical pairwise preference loss that strengthens alignment with temporally accurate pairs and progressively reduces similarity for increasingly disrupted pairs, encouraging fine-grained compositional alignment. To mitigate the limited availability of densely annotated video data, we introduce a pretraining strategy that concatenates short video-caption pairs to simulate multi-event sequences, facilitating effective compositional learning. We evaluate large multimodal models (LMMs) on our benchmark, identifying both strengths and areas for improvement in video-text compositionality. Our work provides a comprehensive framework for assessing and advancing model capabilities in achieving fine-grained, temporally coherent video-text alignment.
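The hierarchical pairwise preference loss could be sketched roughly as follows, assuming captions ordered from temporally correct to increasingly disrupted and a fixed margin (both assumptions, not the paper's exact formulation):

```python
# Hypothetical sketch of a hierarchical pairwise preference loss: each less-disrupted
# caption should score higher than the next, more heavily disrupted one, by a margin.
import torch
import torch.nn.functional as F

def hierarchical_preference_loss(video_emb, caption_embs, margin=0.1):
    """video_emb: (d,). caption_embs: list ordered from the temporally correct
    caption to increasingly disrupted negatives; each of shape (d,)."""
    sims = torch.stack([F.cosine_similarity(video_emb, c, dim=0) for c in caption_embs])
    loss = video_emb.new_zeros(())
    for level in range(len(sims) - 1):
        # penalize whenever a more-disrupted caption comes within `margin` of a cleaner one
        loss = loss + F.relu(margin + sims[level + 1] - sims[level])
    return loss

if __name__ == "__main__":
    d = 32
    v = torch.randn(d, requires_grad=True)
    caps = [torch.randn(d) for _ in range(4)]   # correct, reorder, word swap, combined
    hierarchical_preference_loss(v, caps).backward()
```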
Flexible Frame Selection for Efficient Video Reasoning
Shyamal Buch · Arsha Nagrani · Anurag Arnab · Cordelia Schmid
Video-language models have shown promise for addressing a range of multimodal tasks for video understanding, such as video question-answering. However, the inherent computational challenges of processing long video data and increasing model sizes have led to standard approaches that are limited by the number of frames they can process. In this work, we propose the Flexible Frame Selector (FFS), a learnable policy model with a new flexible selection operation, that helps alleviate input context restrictions by enabling video-language models to focus on the most informative frames for the downstream multimodal task, without adding undue processing cost. Our method differs from prior work in its learnability, efficiency, and flexibility. We verify the efficacy of our method on standard video question-answering and reasoning benchmarks, and observe that our model can improve base video-language model accuracy while reducing the number of downstream processed frames.
LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale
Joya Chen · Yiqi Lin · Ziyun Zeng · Wei Li · Zejun Ma · Mike Zheng Shou
Recent video large language models (Video LLMs) often depend on costly human annotations or proprietary APIs (e.g., GPT-4) to produce training data, which limits their training at scale. In this paper, we explore large-scale training for Video LLMs with cheap automatic speech recognition (ASR) transcripts. Specifically, we propose a novel streaming training approach that densely interleaves the ASR words and video frames according to their timestamps. Compared to previous studies in vision-language representation with ASR, our method enables the model to learn fine-grained vision-language correlations in the temporal dimension. To support this, we introduce a series of data processing techniques on YouTube videos and closed captions (CC), resulting in 30M pre-training data samples and 1.5M for instruction tuning. Benefiting from our training paradigm, the trained model is powerful at streaming applications and can naturally support real-time video commentary. We also introduce a new benchmark focused on sports commentary and event understanding, a domain where live performance is critical. Experiments show that our model outperforms state-of-the-art models in both accuracy and latency. Additionally, our model achieves state-of-the-art or competitive results on several mainstream benchmarks, demonstrating its broad generalizability. We will release the codes, datasets, and models to facilitate further research.
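The core interleaving step can be illustrated with a small, hypothetical sketch that merges timestamped ASR words and frame tokens into a single time-ordered stream (the token markers are made up):

```python
# Sketch of densely interleaving timestamped ASR words with video frames into one
# training sequence; a hypothetical illustration of the idea, not the paper's pipeline.
def interleave(frames, asr_words):
    """frames: list of (timestamp_sec, frame_token); asr_words: list of
    (timestamp_sec, word). Returns a single stream ordered by time."""
    stream = [(t, "frame", f) for t, f in frames] + [(t, "word", w) for t, w in asr_words]
    stream.sort(key=lambda item: item[0])
    return [(kind, payload) for _, kind, payload in stream]

if __name__ == "__main__":
    frames = [(0.0, "<f0>"), (0.5, "<f1>"), (1.0, "<f2>")]
    words = [(0.2, "the"), (0.4, "player"), (0.9, "shoots")]
    print(interleave(frames, words))
    # [('frame', '<f0>'), ('word', 'the'), ('word', 'player'),
    #  ('frame', '<f1>'), ('word', 'shoots'), ('frame', '<f2>')]
```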
BIMBA: Selective-Scan Compression for Long-Range Video Question Answering
Md Mohaiminul Islam · Tushar Nagarajan · Huiyu Wang · Gedas Bertasius · Lorenzo Torresani
Video Question Answering (VQA) in long videos poses the key challenge of extracting relevant information and modeling long-range dependencies from many redundant frames. The self-attention mechanism provides a general solution for sequence modeling, but it has a prohibitive cost when applied to a massive number of spatiotemporal tokens in long videos. To lower the computational cost, most prior methods rely on compression strategies, such as reducing the input length via sparse frame sampling or compressing the output sequence passed to the large language model (LLM) via space-time pooling. However, these naive approaches over-represent redundant information and often miss salient events or fast-occurring space-time patterns. In this work, we introduce BIMBA, an efficient state-space model to handle long-form videos. Our model leverages the selective scan algorithm to learn to effectively select critical information from high-dimensional video and transform it into a token sequence that is orders of magnitude smaller for efficient LLM processing. Extensive experiments demonstrate that BIMBA achieves state-of-the-art accuracy on multiple long-form VQA benchmarks, including EgoSchema, NextQA, TempCompass, and MVBench.
SF2T: Self-supervised Fragment Finetuning of Video-LLMs for Fine-Grained Understanding
Yangliu Hu · Zikai Song · Na Feng · Yawei Luo · Junqing Yu · Yi-Ping Phoebe Chen · Wei Yang
Video-based Large Language Models (Video-LLMs) have witnessed substantial advancements in recent years, propelled by advances in multi-modal LLMs. Although these models have demonstrated proficiency in providing the overall description of videos, they struggle with fine-grained understanding, particularly in aspects such as visual dynamics and video detail inquiries. To tackle these shortcomings, we find that fine-tuning Video-LLMs on self-supervised fragment tasks greatly improves their fine-grained video understanding abilities. Hence, we propose two key contributions: (1) Self-Supervised Fragment Fine-Tuning (SF$^2$T), a novel effortless fine-tuning method that employs the rich inherent characteristics of videos for training, while unlocking more fine-grained understanding ability of Video-LLMs. Moreover, it relieves researchers from labor-intensive annotations and smartly circumvents the limitations of natural language, which often fails to capture the complex spatiotemporal variations in videos; (2) a novel benchmark dataset, namely FineVidBench, for rigorously assessing Video-LLMs' performance at both the scene and fragment levels, offering a comprehensive evaluation of their capabilities. We assessed multiple models and validated the effectiveness of SF$^2$T on them. Experimental results reveal that our approach improves their ability to capture and interpret spatiotemporal details.
Adaptive Keyframe Sampling for Long Video Understanding
Xi Tang · Jihao Qiu · Lingxi Xie · Yunjie Tian · Jianbin Jiao · Qixiang Ye
Multimodal large language models (MLLMs) have enabled open-world visual understanding by injecting visual input as extra tokens into large language models (LLMs) as contexts. However, when the visual input changes from a single image to a long video, the above paradigm encounters difficulty because the vast amount of video tokens has significantly exceeded the maximal capacity of MLLMs. Therefore, existing video-based MLLMs are mostly established upon sampling a small portion of tokens from input data, which can cause key information to be lost and thus produce incorrect answers. This paper presents a simple yet effective algorithm named Adaptive Keyframe Sampling (AKS). It inserts a plug-and-play module known as keyframe selection, which aims to maximize the useful information with a fixed number of video tokens. We formulate keyframe selection as an optimization involving (1) the relevance between the keyframes and the prompt, and (2) the coverage of the keyframes over the video, and present an adaptive algorithm to approximate the best solution. Experiments on two long video understanding benchmarks validate that AKS improves video QA accuracy (beyond strong baselines) upon selecting informative keyframes. Our study reveals the importance of information pre-filtering in video-based MLLMs. Our code and models will be open-sourced.
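A simplified, hypothetical version of relevance-plus-coverage keyframe selection is sketched below (binning by equal temporal segments is an assumption; the paper's adaptive algorithm may differ):

```python
# Simplified, hypothetical keyframe selection: equal temporal bins guarantee coverage,
# and the most prompt-relevant frame per bin favors relevance. Not the AKS algorithm itself.
import numpy as np

def select_keyframes(relevance, budget):
    """relevance: (T,) per-frame relevance to the prompt (e.g., CLIP similarity)."""
    bins = np.array_split(np.arange(len(relevance)), budget)   # coverage constraint
    return [int(b[np.argmax(relevance[b])]) for b in bins if len(b) > 0]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    print(select_keyframes(rng.random(100), budget=8))
```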
Efficient Transfer Learning for Video-language Foundation Models
Haoxing Chen · Zizheng Huang · Yan Hong · YANSHUO WANG · Zhongcai Lyu · Zhuoer Xu · Jun Lan · Zhangxuan Gu
Pre-trained vision-language models provide a robust foundation for efficient transfer learning across various downstream tasks. In the field of video action recognition, mainstream approaches often introduce additional parameter modules to capture temporal information. While the increased model capacity brought by these additional parameters helps better fit the video-specific inductive biases, existing methods require learning a large number of parameters and are prone to catastrophic forgetting of the original generalizable knowledge. In this paper, we propose a simple yet effective Multi-modal Spatio-Temporal Adapter (MSTA) to improve the alignment between representations in the text and vision branches, achieving a balance between general knowledge and task-specific knowledge. Furthermore, to mitigate over-fitting and enhance generalizability, we introduce a spatio-temporal description-guided consistency constraint. This constraint involves feeding template inputs (i.e., “a video of $\{\textbf{cls}\}$”) into the trainable language branch, while LLM-generated spatio-temporal descriptions are input into the pre-trained language branch, enforcing consistency between the outputs of the two branches. This mechanism prevents over-fitting to downstream tasks and improves the distinguishability of the trainable branch within the spatio-temporal semantic space. We evaluate the effectiveness of our approach across four tasks: zero-shot transfer, few-shot learning, base-to-novel generalization, and fully-supervised learning. Compared to many state-of-the-art methods, our MSTA achieves outstanding performance across all evaluations, while using only 2-7\% of the trainable parameters in the original model.
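The consistency constraint between the trainable and frozen text branches might be sketched as a simple cosine-based loss, as below; the feature shapes and function names are assumptions:

```python
# Hypothetical sketch of a description-guided consistency constraint: keep the
# trainable text branch (fed the plain template) close to the frozen branch fed an
# LLM-generated spatio-temporal description of the same class.
import torch
import torch.nn.functional as F

def consistency_loss(trainable_text_feat: torch.Tensor,
                     frozen_desc_feat: torch.Tensor) -> torch.Tensor:
    """Both inputs: (num_classes, d) text embeddings; 1 - cosine similarity per class."""
    t = F.normalize(trainable_text_feat, dim=-1)
    f = F.normalize(frozen_desc_feat.detach(), dim=-1)   # frozen branch gives no gradient
    return (1.0 - (t * f).sum(dim=-1)).mean()

if __name__ == "__main__":
    trainable = torch.randn(10, 512, requires_grad=True)   # template-text embeddings
    frozen = torch.randn(10, 512)                          # LLM-description embeddings
    consistency_loss(trainable, frozen).backward()
```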
EventGPT: Event Stream Understanding with Multimodal Large Language Models
shaoyu liu · Jianing Li · guanghui zhao · Yunjian Zhang · Xin Meng · Fei Richard Yu · Xiangyang Ji · Ming Li
Event cameras record visual information as asynchronous pixel change streams, excelling at scene perception under unsatisfactory lighting or high-dynamic conditions. Existing multimodal large language models (MLLMs) concentrate on natural RGB images, failing in scenarios where event data fits better. In this paper, we introduce EventGPT, the first MLLM for event stream understanding, to the best of our knowledge, marking a pioneering attempt to integrate large language models (LLMs) with event stream comprehension. Our EventGPT comprises an event encoder, followed by a spatio-temporal aggregator, a linear projector, an event-language adapter, and an LLM. Firstly, RGB image-text pairs generated by GPT are leveraged to warm up the linear projector, following LLaVA, as the gap between natural image and language modalities is relatively smaller. Secondly, we construct a synthetic yet large dataset, N-ImageNet-Chat, consisting of event frames and corresponding texts to enable the use of the spatio-temporal aggregator and to train the event-language adapter, thereby aligning event features more closely with the language space. Finally, we gather an instruction dataset, Event-Chat, which contains extensive real-world data to fine-tune the entire model, further enhancing its generalization ability. We construct a comprehensive evaluation benchmark, and extensive experiments demonstrate that EventGPT outperforms previous state-of-the-art MLLMs in generation quality, descriptive accuracy, and reasoning capability.
HyperGLM: HyperGraph for Video Scene Graph Generation and Anticipation
Trong-Thuan Nguyen · Pha Nguyen · Jackson Cothren · Alper Yilmaz · Khoa Luu
Multimodal LLMs have advanced vision-language tasks but still struggle with understanding video scenes. To bridge this gap, Video Scene Graph Generation (VidSGG) has emerged to capture multi-object relationships across video frames. However, prior methods rely on pairwise connections, limiting their ability to handle complex multi-object interactions and reasoning. To this end, we propose Multimodal LLMs on a Scene HyperGraph (HyperGLM), promoting reasoning about multi-way interactions and higher-order relationships. Our approach uniquely integrates entity scene graphs, which capture spatial relationships between objects, with a procedural graph that models their causal transitions, forming a unified HyperGraph. Significantly, HyperGLM enables reasoning by injecting this unified HyperGraph into LLMs. Additionally, we introduce a new Video Scene Graph Reasoning (VSGR) dataset featuring 1.9M frames from third-person, egocentric, and drone views and supporting five tasks: Scene Graph Generation, Scene Graph Anticipation, Video Question Answering, Video Captioning, and Relation Reasoning. Empirically, HyperGLM consistently outperforms state-of-the-art methods across the five tasks, effectively modeling and reasoning about complex relationships in diverse video scenes.
DiffVsgg: Diffusion-Driven Online Video Scene Graph Generation
Mu Chen · Liulei Li · Wenguan Wang · Yi Yang
Top-leading solutions for Video Scene Graph Generation (VSGG) typically adopt an offline pipeline. Though demonstrating promising performance, they remain unable to handle real-time video streams and consume large GPU memory. Moreover, these approaches fall short in temporal reasoning, merely aggregating frame-level predictions over a temporal context. In response, we introduce DiffVsgg, an online VSGG solution that frames this task as an iterative scene graph update problem. Drawing inspiration from Latent Diffusion Models (LDMs), which generate images via denoising a latent feature embedding, we unify three tasks, i.e., object classification, bounding box regression, and graph generation, by decoding them from one shared feature embedding. Then, given an embedding containing unified features of object pairs, we conduct step-wise denoising on it within LDMs, so as to deliver a clean embedding which clearly indicates the relationships between objects. This embedding then serves as the input to task-specific heads for object classification, scene graph generation, etc. DiffVsgg further facilitates continuous temporal reasoning, where predictions for subsequent frames leverage results of past frames as the conditional inputs of LDMs, to guide the reverse diffusion process for current frames. Extensive experiments on three setups of Action Genome demonstrate the superiority of DiffVsgg. Our code shall be released.
CASAGPT: Cuboid Arrangement and Scene Assembly for Interior Design
Weitao Feng · Hang Zhou · Jing Liao · Li Cheng · Wenbo Zhou
We present a novel approach for indoor scene synthesis, which learns to arrange decomposed cuboid primitives to represent 3D objects within a scene. Unlike conventional methods that use bounding boxes to determine the placement and scale of 3D objects, our approach leverages cuboids as a straightforward yet highly effective alternative for modeling objects. This allows for compact scene generation while minimizing object intersections. Our approach, coined CASAGPT for Cuboid Arrangement and Scene Assembly, employs an autoregressive model to sequentially arrange cuboids, producing physically plausible scenes. By applying rejection sampling during the fine-tuning stage to filter out scenes with object collisions, our model further reduces intersections and enhances scene quality. Additionally, we introduce a refined dataset, 3DFRONT-NC, which eliminates significant noise present in the original dataset, 3D-FRONT. Extensive experiments on the 3D-FRONT dataset as well as our dataset demonstrate that our approach consistently outperforms the state-of-the-art methods, enhancing the realism of generated scenes, and providing a promising direction for 3D scene synthesis.
The Devil is in Temporal Token: High Quality Video Reasoning Segmentation
Sitong Gong · Yunzhi Zhuge · Lu Zhang · Zongxin Yang · Pingping Zhang · Huchuan Lu
Existing methods for Video Reasoning Segmentation rely heavily on a single special token to represent the object in the keyframe or the entire video, inadequately capturing spatial complexity and inter-frame motion. To overcome these challenges, we propose VRS-HQ, an end-to-end video reasoning segmentation approach that leverages Multimodal Large Language Models (MLLMs) to inject rich spatiotemporal features into hierarchical tokens. Our key innovations include a Temporal Dynamic Aggregation (TDA) and a Token-driven Keyframe Selection (TKS). Specifically, we design frame-level
M^3-VOS: Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation
Zixuan Chen · Jiaxin Li · Junxuan Liang · Liming Tan · Yejie Guo · Cewu Lu · Yonglu Li
Intelligent robots need to interact with diverse objects across various environments. The appearance and state of objects frequently undergo complex transformations depending on the object properties, e.g., phase transitions. However, in the vision community, segmenting dynamic objects with phase transitions is overlooked. In light of this, we introduce the concept of phase in segmentation, which categorizes real-world objects based on their visual characteristics and potential morphological and appearance changes. Then, we present a new benchmark, Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation (M$^3$-VOS), to verify the ability of models to understand object phases, which consists of 479 high-resolution videos spanning over 10 distinct everyday scenarios. It provides dense instance mask annotations that capture both object phases and their transitions. We evaluate state-of-the-art methods on M$^3$-VOS, yielding several key insights. Notably, current appearance-based approaches show significant room for improvement when handling objects with phase transitions. The inherent changes in disorder suggest that the predictive performance of the forward entropy-increasing process can be improved through a reverse entropy-reducing process. These findings lead us to propose ReVOS, a new plug-and-play model that improves its performance by reversal refinement. Our data and code will be publicly available.
Anomize: Better Open Vocabulary Video Anomaly Detection
Fei Li · Wenxuan Liu · Jingjing Chen · Ruixu Zhang · Yuran Wang · Xian Zhong · Zheng Wang
Open Vocabulary Video Anomaly Detection (OVVAD) aims to detect and categorize both base and novel anomalies. However, there are two specific challenges related to novel anomalies that remain unexplored by existing methods. The first challenge is detection ambiguity, where the model struggles to assign accurate anomaly scores to unfamiliar anomalies. The second challenge is categorization confusion, where novel anomalies are often miscategorized as visually similar base instances. To address the aforementioned challenges, we investigate supportive information from multiple sources, aiming to reduce detection ambiguity by leveraging multiple levels of visual data with matching textual information. Additionally, we propose introducing relationships between labels to guide the encoding of new labels, thereby enhancing the alignment between novel videos and their corresponding labels, which helps reduce categorization confusion. Our resulting Anomize framework effectively addresses these challenges, achieving superior performance on UCF-Crime and XD-Violence datasets, demonstrating its strength in OVVAD.
UniSTD: Towards Unified Spatio-Temporal Learning across Diverse Disciplines
Chen Tang · Xinzhu Ma · Encheng Su · Xiufeng Song · Xiaohong Liu · Wei-Hong Li · Lei Bai · Wanli Ouyang · Xiangyu Yue
Traditional spatiotemporal models generally rely on task-specific architectures, which limit their generalizability and scalability across diverse tasks due to domain-specific design requirements. In this paper, we introduce UniSTD, a unified Transformer-based framework for spatiotemporal modeling, which is inspired by advances in recent foundation models with the two-stage pretraining-then-adaption paradigm. Specifically, our work demonstrates that task-agnostic pretraining on 2D vision and vision-text datasets can build a generalizable model foundation for spatiotemporal learning, followed by specialized joint training on spatiotemporal datasets to enhance task-specific adaptability. To improve the learning capabilities across domains, our framework employs a rank-adaptive mixture-of-expert adaptation, using fractional interpolation to relax the discrete variables so that they can be optimized in a continuous space. Additionally, we introduce a temporal module to incorporate temporal dynamics explicitly. We evaluate our approach on a large-scale dataset covering 10 tasks across 4 disciplines, demonstrating that a unified spatiotemporal model can achieve scalable, cross-task learning and support up to 10 tasks simultaneously within one model while reducing training costs in multi-domain applications. Our code and dataset will be released soon.
Temporal Action Detection Model Compression by Progressive Block Drop
Xiaoyong Chen · Yong Guo · Jiaming Liang · Sitong Zhuang · Runhao Zeng · Xiping Hu
Temporal action detection (TAD) aims to identify and localize action instances in untrimmed videos, which is essential for various video understanding tasks. However, recent improvements in model performance, driven by larger feature extractors and datasets, have led to increased computational demands. This presents a challenge for applications like autonomous driving and robotics, which rely on limited computational resources. While existing channel pruning methods can compress these models, reducing the number of channels often hinders the parallelization efficiency of GPUs due to inefficient multiplication between small matrices. Instead of pruning channels, we propose a Progressive Block Drop method that reduces model depth while retaining layer width. In this way, we still use large matrices for computation but reduce the number of multiplications. Our approach iteratively removes redundant blocks in two steps: first, we drop blocks with minimal impact on model performance; and second, we employ a parameter-efficient cross-depth alignment technique, fine-tuning the pruned model to restore model accuracy. Our method achieves a 25\% reduction in computational overhead on two TAD benchmarks (THUMOS14 and ActivityNet-1.3) while achieving lossless compression. More critically, we empirically show that our method is orthogonal to channel pruning methods and can be combined with them to yield further efficiency gains.
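A schematic, hypothetical rendering of the block-drop loop follows; the evaluation callback and the toy scoring are placeholders, and the real method interleaves cross-depth alignment fine-tuning after each drop:

```python
# Schematic, hypothetical progressive block drop: repeatedly remove the block whose
# absence hurts validation accuracy least; fine-tuning would follow each removal.
from typing import Callable, List

def progressive_block_drop(blocks: List[str],
                           evaluate: Callable[[List[str]], float],
                           drop_steps: int) -> List[str]:
    kept = list(blocks)
    for _ in range(drop_steps):
        # try removing each remaining block and keep the least harmful removal
        candidates = [(evaluate([b for b in kept if b != cand]), cand) for cand in kept]
        best_acc, to_drop = max(candidates)
        kept.remove(to_drop)
        # ... cross-depth alignment fine-tuning would happen here in the real method ...
    return kept

if __name__ == "__main__":
    # toy "accuracy": deeper configurations score higher, but block "b2" is redundant
    def fake_eval(cfg): return len(cfg) - (0.5 if "b2" in cfg else 0.0)
    print(progressive_block_drop(["b1", "b2", "b3", "b4"], fake_eval, drop_steps=2))
```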
Period-LLM: Extending the Periodic Capability of Multimodal Large Language Model
Yuting Zhang · Hao Lu · Qingyong Hu · Yin Wang · Kaishen Yuan · Xin Liu · Kaishun Wu
Periodic or quasi-periodic phenomena reveal intrinsic characteristics in various natural processes, such as weather patterns, movement behaviors, traffic flows, and biological signals. Given that these phenomena span multiple modalities, the capabilities of Multimodal Large Language Models (MLLMs) offer promising potential to effectively capture and understand their complex nature. However, current MLLMs struggle with periodic tasks due to limitations in: 1) lack of temporal modelling and 2) conflict between short and long periods. This paper introduces Period-LLM, a multimodal large language model designed to enhance the performance of periodic tasks across various modalities, and constructs a benchmark of varying difficulty for evaluating the cross-modal periodic capabilities of large models. Specifically, we adopt an “Easy to Hard Generalization” paradigm, starting with relatively simple text-based tasks and progressing to more complex visual and multimodal tasks, ensuring that the model gradually builds robust periodic reasoning capabilities. Additionally, we propose a Resisting Logical Oblivion optimization strategy to maintain periodic reasoning abilities during semantic alignment. Extensive experiments demonstrate the superiority of the proposed Period-LLM over existing MLLMs in periodic tasks. The code will be available on GitHub.
Revealing Key Details to See Differences: A Novel Prototypical Perspective for Skeleton-based Action Recognition
Hongda Liu · Yunfan Liu · Min Ren · Hao Wang · Yunlong Wang · Zhenan Sun
In skeleton-based action recognition, a key challenge is distinguishing between actions with similar trajectories of joints due to the lack of image-level details in skeletal representations. Recognizing that the differentiation of similar actions relies on subtle motion details in specific body parts, we direct our approach to focus on the fine-grained motion of local skeleton components. To this end, we introduce ProtoGCN, a Graph Convolutional Network (GCN)-based model that breaks down the dynamics of entire skeleton sequences into a combination of learnable prototypes representing core motion patterns of action units. By contrasting the reconstruction of prototypes, ProtoGCN can effectively identify and enhance the discriminative representation of similar actions. Without bells and whistles, ProtoGCN achieves state-of-the-art performance on multiple benchmark datasets, including NTU RGB+D, NTU RGB+D 120, Kinetics-Skeleton, and FineGYM, which demonstrates the effectiveness of the proposed method. The source code is enclosed in the supplementary material and will be released upon acceptance.
DiSciPLE: Learning Interpretable Programs for Scientific Visual Discovery
Utkarsh Mall · Cheng Perng Phoo · Mia Chiquier · Bharath Hariharan · Kavita Bala · Carl Vondrick
Visual data is used in numerous different scientific workflows ranging from remote sensing to ecology. As the amount of observation data increases, the challenge is not just to make accurate predictions but also to understand the underlying mechanisms for those predictions. Good interpretation is important in scientific workflows, as it allows for better decision-making by providing insights into the data. This paper introduces an automatic way of obtaining such interpretable-by-design models, by learning programs that interleave neural networks. We propose DiSciPLE (Discovering Scientific Programs using LLMs and Evolution), an evolutionary algorithm that leverages common sense and prior knowledge of large language models (LLMs) to create Python programs explaining visual data. Additionally, we propose two improvements, a program critic and a program simplifier, that further improve our method's ability to synthesize good programs. On three different real-world problems, DiSciPLE learns state-of-the-art programs on novel tasks with no prior literature. For example, we can learn programs with 35% lower error than the closest non-interpretable baseline for population density estimation.
Divide and Conquer: Heterogeneous Noise Integration for Diffusion-based Adversarial Purification
Gaozheng Pei · Shaojie Lyu · Gong Chen · Ke Ma · Qianqian Xu · Yingfei Sun · Qingming Huang
Existing diffusion-based purification methods aim to disrupt adversarial perturbations by introducing a certain amount of noise through a forward diffusion process, followed by a reverse process to recover clean examples. However, this approach is fundamentally flawed: the uniform operation of the forward process across all pixels compromises normal pixels while attempting to combat adversarial perturbations, resulting in the target model producing incorrect predictions. Simply relying on low-intensity noise is insufficient for effective defense. To address this critical issue, we implement a heterogeneous purification strategy grounded in the interpretability of neural networks. Our method decisively applies higher-intensity noise to specific pixels that the target model focuses on, while the remaining pixels are subjected to only low-intensity noise. This requirement motivates us to redesign the sampling process of the diffusion model, allowing for effective removal of varying noise levels. Furthermore, to enable evaluation against strong adaptive attacks, our proposed method sharply reduces time cost and memory usage through single-step resampling. The empirical evidence from extensive experiments across three datasets demonstrates that our method outperforms most current adversarial training and purification techniques by a substantial margin. Code is available at \url{https://anonymous.4open.science/r/Purification-35BE-0829}.
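A toy sketch of the heterogeneous noising step is given below, using a saliency map as a stand-in for the interpretability signal; the noise levels and threshold are assumptions, and the paper's redesigned sampling process is not shown:

```python
# Hypothetical illustration of heterogeneous forward noising: pixels the classifier
# attends to receive strong noise, the rest only mild noise.
import torch

def heterogeneous_noise(image: torch.Tensor, saliency: torch.Tensor,
                        sigma_high: float = 0.5, sigma_low: float = 0.1,
                        thr: float = 0.6) -> torch.Tensor:
    """image: (C, H, W) in [0, 1]; saliency: (H, W) in [0, 1]."""
    focus = (saliency >= thr).float()                        # pixels the model relies on
    sigma = sigma_high * focus + sigma_low * (1.0 - focus)   # per-pixel noise level
    noisy = image + sigma.unsqueeze(0) * torch.randn_like(image)
    return noisy.clamp(0.0, 1.0)

if __name__ == "__main__":
    img = torch.rand(3, 32, 32)
    cam = torch.rand(32, 32)          # stand-in for an interpretability heatmap
    print(heterogeneous_noise(img, cam).shape)
```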
SDBF: Steep-Decision-Boundary Fingerprinting for Hard-Label Tampering Detection of DNN Models
Xiaofan Bai · Shixin Li · Xiaojing Ma · Bin Benjamin Zhu · Dongmei Zhang · Linchen Yu
Cloud-based AI systems offer significant benefits but also introduce vulnerabilities, making deep neural network (DNN) models susceptible to malicious tampering. This tampering may involve harmful behavior injection or resource reduction, compromising model integrity and performance. To detect model tampering, hard-label fingerprinting techniques generate sensitive samples to probe and reveal tampering. Existing fingerprinting methods are mainly based on \textbf{gradient-defined sensitivity} or \textbf{decision boundary}, with the latter showing manifestly superior detection performance. However, existing decision-boundary-based fingerprinting methods remain conceptual, lacking a theoretical explanation for why samples near the decision boundary are more sensitive to tampering. Moreover, all existing fingerprinting methods either suffer from insufficient sensitivity or incur high computational costs. In this paper, we provide the first theoretical justification for why samples near the decision boundary are more sensitive to tampering-induced shifts than faraway samples. Based on this, we further propose \textbf{Steep-Decision-Boundary Fingerprinting (SDBF)}, a novel lightweight approach for hard-label tampering detection. SDBF places fingerprint samples near the \textbf{steep decision boundary}, where the outputs of samples are inherently highly sensitive to tampering. We also design a \textbf{Max Boundary Coverage Strategy (MBCS)}, which enhances samples' diversity over the decision boundary. Theoretical analysis and extensive experimental results show that SDBF outperforms existing SOTA hard-label fingerprinting methods in both sensitivity and efficiency.
From Head to Tail: Efficient Black-box Model Inversion Attack via Long-tailed Learning
Ziang Li · Hongguang Zhang · Juan Wang · Meihui Chen · Hongxin Hu · Wenzhe Yi · Xiaoyang Xu · Mengda Yang · Chenjun Ma
Model Inversion Attacks (MIAs) aim to reconstruct private training data from models, leading to privacy leakage, particularly in facial recognition systems. Although many studies have enhanced the effectiveness of white-box MIAs, less attention has been paid to improving efficiency and utility under limited attacker capabilities. Existing black-box MIAs necessitate an impractical number of queries, incurring significant overhead. Therefore, we analyze the limitations of existing MIAs and introduce Surrogate Model-based Inversion with Long-tailed Enhancement (SMILE), a high-resolution oriented and query-efficient MIA for the black-box setting. We begin by analyzing the initialization of MIAs from a data distribution perspective and propose a long-tailed surrogate training method to obtain high-quality initial points. We then enhance the attack's effectiveness by employing the gradient-free black-box optimization algorithm selected by NGOpt. Our experiments show that SMILE outperforms existing state-of-the-art black-box MIAs while requiring only about 5% of the query overhead.
UMFN: Unified Multi-Domain Face Normalization for Joint Cross-domain Prototype Learning and Heterogeneous Face Recognition
Meng Pang · Wenjun Zhang · Nanrun Zhou · Shengbo Chen · Hong Rao
Face normalization aims to enhance the robustness and effectiveness of face recognition systems by mitigating intra-personal variations in expressions, poses, occlusions, illuminations, and domains. Existing methods face limitations in handling multiple variations and adapting to cross-domain scenarios. To address these challenges, we propose a novel Unified Multi-Domain Face Normalization Network (UMFN) model, which can process face images with various types of facial variations from different domains, and reconstruct frontal, neutral-expression facial prototypes in the target domain. As an unsupervised domain adaptation model, UMFN facilitates concurrent training on multiple datasets across domains and demonstrates strong prototype reconstruction capabilities. Notably, UMFN serves as a joint prototype and feature learning framework, enabling the simultaneous extraction of domain-agnostic identity features through a decoupling mapping network and a feature domain classifier for adversarial training. Moreover, we design an efficient Heterogeneous Face Recognition (HFR) network that fuses domain-agnostic and identity-discriminative features for HFR, and introduce contrastive learning to enhance identity recognition accuracy. Empirical studies on diverse cross-domain face datasets validate the effectiveness of our proposed method.
MEET: Towards Memory-Efficient Temporal Sparse Deep Neural Networks
Zeqi Zhu · Ibrahim Batuhan Akkaya · Luc Waeijen · Egor Bondarev · Arash Pourtaherian · Orlando Moreira
Deep Neural Networks (DNNs) are accurate but compute-intensive, leading to substantial energy consumption during inference. Exploiting temporal redundancy through $\Delta$-$\Sigma$ convolution in video processing has proven to greatly enhance computation efficiency. However, temporal $\Delta$-$\Sigma$ DNNs typically require substantial memory for storing neuron states to compute inter-frame differences, hindering their on-chip deployment. To mitigate this memory cost, directly compressing the states can disrupt the linearity of temporal $\Delta$-$\Sigma$ convolution, causing accumulated errors in long-term $\Delta$-$\Sigma$ processing. Thus, we propose $\textbf{MEET}$, an optimization framework for $\textbf{ME}$mory-$\textbf{E}$fficient $\textbf{T}$emporal $\Delta$-$\Sigma$ DNNs. MEET transfers the state compression challenge to a well-established weight compression problem by trading fewer activations for more weights and introduces a co-design of network architecture and suppression method to optimize for mixed spatial-temporal execution. Evaluations on three vision applications demonstrate a reduction of 5.1$\sim$13.3 $\times$ in total memory compared to the most computation-efficient temporal DNNs, while preserving the computation efficiency and model accuracy in long-term $\Delta$-$\Sigma$ processing. MEET facilitates the deployment of temporal $\Delta$-$\Sigma$ DNNs within on-chip memory of embedded event-driven platforms, empowering low-power edge processing.
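For intuition, a toy $\Delta$-$\Sigma$ convolution can be written as follows: convolution is linear, so the output state for a frame is the stored previous output plus the convolution of the (thresholded) frame difference. The class name and threshold are assumptions; this stored state is the neuron memory whose footprint MEET targets:

```python
# Toy illustration of delta-sigma convolution: update the output by convolving only
# the sparse frame difference and adding it to the stored previous output.
import torch
import torch.nn.functional as F

class DeltaSigmaConv:
    def __init__(self, weight: torch.Tensor, threshold: float = 1e-3):
        self.weight = weight          # (out_c, in_c, k, k)
        self.threshold = threshold
        self.prev_in = None           # stored input state
        self.prev_out = None          # stored output state

    def step(self, frame: torch.Tensor) -> torch.Tensor:
        if self.prev_in is None:
            self.prev_in = frame
            out = F.conv2d(frame, self.weight, padding=1)
        else:
            delta = frame - self.prev_in
            delta = torch.where(delta.abs() > self.threshold, delta,
                                torch.zeros_like(delta))      # suppress tiny changes
            out = self.prev_out + F.conv2d(delta, self.weight, padding=1)
            self.prev_in = self.prev_in + delta
        self.prev_out = out
        return out

if __name__ == "__main__":
    conv = DeltaSigmaConv(torch.randn(8, 3, 3, 3))
    video = torch.rand(4, 1, 3, 16, 16)                # 4 nearly static frames
    for t in range(4):
        y = conv.step(video[t])
    print(y.shape)                                     # torch.Size([1, 8, 16, 16])
```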
Object Detection using Event Camera: A MoE Heat Conduction based Detector and A New Benchmark Dataset
Xiao Wang · Yu Jin · Wentao Wu · Wei Zhang · Lin Zhu · Bo Jiang · Yonghong Tian
Object detection in event streams has emerged as a cutting-edge research area, demonstrating superior performance in low-light conditions, scenarios with motion blur, and rapid movements. Current detectors leverage spiking neural networks, Transformers, or convolutional neural networks as their core architectures, each with its own set of limitations including restricted performance, high computational overhead, or limited local receptive fields. This paper introduces a novel MoE (Mixture of Experts) heat conduction-based object detection algorithm that strikingly balances accuracy and computational efficiency. Initially, we employ a stem network for event data embedding, followed by processing through our innovative MoE-HCO blocks. Each block integrates various expert modules to mimic heat conduction within event streams. Subsequently, an IoU-based query selection module is utilized for efficient token extraction, which is then channeled into a detection head for the final object detection process. Furthermore, we are pleased to introduce EvDET200K, a novel benchmark dataset for event-based object detection. Captured with a high-definition Prophesee EVK4-HD event camera, this dataset encompasses 10 distinct categories, 200,000 bounding boxes, and 10,054 samples, each spanning 2 to 5 seconds. We also provide comprehensive results from over 15 state-of-the-art detectors, offering a solid foundation for future research and comparison.
Person De-reidentification: A Variation-guided Identity Shift Modeling
Yi-Xing Peng · Yu-Ming Tang · Kun-Yu Lin · Qize Yang · Jingke Meng · Xihan Wei · Wei-Shi Zheng
Person re-identification (ReID) aims to associate images of individuals across different camera views despite cross-view variations. Like other surveillance technologies, ReID faces serious privacy challenges, particularly the potential for unauthorized tracking. Although various tasks (e.g., face recognition) have developed machine unlearning techniques to address privacy concerns, such approaches have not yet been explored within the ReID field. In this work, we pioneer the exploration of the person de-reidentification (De-ReID) problem and present its inherent challenges. In the context of ReID, De-ReID is to unlearn the knowledge of accurately matching specific persons so that these “unlearned persons” cannot be re-identified across cameras, providing a privacy guarantee. The primary challenge is to achieve the unlearning without degrading the identity-discriminative feature embeddings, so as to ensure the model's utility. To address this, we formulate a De-ReID framework that utilizes a labeled dataset of unlearned persons for unlearning and an unlabeled dataset of other persons for knowledge preservation. Instead of unlearning based on (pseudo) identity labels, we introduce a variation-guided identity shift mechanism that unlearns the specific persons by fitting the variations in their images, irrespective of their identity, while preserving ReID ability on other persons by overcoming the variations in other images. As a result, the model shifts the unlearned persons to a feature space that is vulnerable to cross-view variations. Extensive experiments on benchmark datasets demonstrate the superiority of our method.
WISE: A Framework for Gigapixel Whole-Slide-Image Lossless Compression
Yu Mao · Jun Wang · Nan Guan · Chun Jason Xue
Whole-Slide Images (WSIs) have revolutionized medical analysis by presenting high-resolution images of the whole tissue slide. Despite avoiding the physical storage of the slides, WSIs require considerable data volume, which makes the storage and maintenance of WSI records costly and unsustainable. To this end, this work presents the first investigation of lossless compression of WSI images. Interestingly, we find that most existing compression methods fail to compress the WSI images effectively. Furthermore, our analysis reveals that the failure of existing compressors is mainly due to information irregularity in WSI images. To resolve this issue, we develop a simple yet effective lossless compressor called WISE, specifically designed for WSI images. WISE employs a hierarchical encoding strategy to extract effective bits, reducing the entropy of the image, and then adopts a dictionary-based method to handle the irregular frequency patterns. Through extensive experiments, we show that WISE can effectively compress gigapixel WSI images by 36 times on average and up to 136 times.
BOE-ViT: Boosting Orientation Estimation with Equivariance in Self-Supervised 3D Subtomogram Alignment
Runmin Jiang · Jackson Daggett · Shriya Pingulkar · Yizhou Zhao · Priyanshu Dhingra · Daniel Brown · Qifeng Wu · Xiangrui Zeng · Xingjian Li · Min Xu
Subtomogram alignment is a critical task in cryo-electron tomography (cryo-ET) analysis, essential for achieving high-resolution reconstructions of macromolecular complexes. However, learning effective positional representations remains challenging due to limited labels and high noise levels inherent in cryo-ET data. In this work, we address this challenge by proposing a self-supervised learning approach that leverages intrinsic geometric transformations as implicit supervisory signals, enabling robust representation learning despite data scarcity. We introduce BOE-ViT, the first Vision Transformer (ViT) framework for 3D subtomogram alignment. Recognizing that traditional ViTs lack equivariance and are therefore suboptimal for orientation estimation, we enhance the model with two innovative modules that introduce equivariance: 1) the Polyshift module for improved shift estimation, and 2) Multi-Axis Rotation Encoding (MARE) for enhanced rotation estimation. Experimental results demonstrate that BOE-ViT significantly outperforms state-of-the-art methods. Notably, on the SNR 0.01 dataset, our approach achieves a 77.3\% reduction in rotation estimation error and a 62.5\% reduction in translation estimation error, effectively overcoming the challenges in cryo-ET subtomogram alignment.
Point-to-Region Loss for Semi-Supervised Point-Based Crowd Counting
Wei Lin · Chenyang ZHAO · Antoni B. Chan
Point detection has been developed to locate pedestrians in crowded scenes by training a counter through a point-to-point (P2P) supervision scheme. Despite its excellent localization and counting performance, training a point-based counter still faces challenges concerning annotation labor: hundreds to thousands of points are required to annotate a single sample containing a dense crowd. In this paper, we integrate point-based methods into a semi-supervised counting framework based on pseudo-labeling, enabling the training of a counter with only a few annotated samples supplemented by a large volume of pseudo-labeled data. However, during implementation, the training process encounters issues as the confidence for pseudo-labels fails to propagate to background pixels via the P2P scheme. To tackle this challenge, we devise a point-specific activation map (PSAM) to visually interpret the phenomena occurring during the ill-posed training. Observations from the PSAM suggest that the feature map is excessively activated by the loss for unlabeled data, causing the decoder to misinterpret these over-activations as pedestrians. To mitigate this issue, we propose a point-to-region (P2R) matching scheme to substitute P2P, which segments out local regions, rather than detecting a single point, corresponding to a pedestrian for supervision. Consequently, pixels in the local region can share the same confidence as the corresponding pseudo points. Experimental results in both semi-supervised counting and unsupervised domain adaptation highlight the advantages of our method, illustrating that P2R resolves the issues identified via the PSAM.
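A hypothetical sketch of point-to-region supervision targets is shown below: each pseudo point labels all pixels within a small radius, which then share its confidence (radius, weighting, and names are assumptions):

```python
# Hypothetical P2R-style targets: each pseudo point labels all pixels within a radius,
# and those pixels share the point's confidence.
import torch

def point_to_region_targets(points, confidences, height, width, radius=4):
    """points: (N, 2) integer (y, x) pseudo-point locations; confidences: (N,)."""
    ys = torch.arange(height).view(-1, 1).float()
    xs = torch.arange(width).view(1, -1).float()
    target = torch.zeros(height, width)
    weight = torch.zeros(height, width)
    for (py, px), conf in zip(points.tolist(), confidences.tolist()):
        region = ((ys - py) ** 2 + (xs - px) ** 2).sqrt() <= radius
        target[region] = 1.0          # every pixel in the region counts as pedestrian
        weight[region] = conf         # the region shares the pseudo-point confidence
    weight[target == 0] = 1.0         # background pixels supervised with full weight
    return target, weight

if __name__ == "__main__":
    tgt, w = point_to_region_targets(torch.tensor([[10, 12], [30, 40]]),
                                     torch.tensor([0.9, 0.6]), height=64, width=64)
    print(int(tgt.sum()), float(w.max()))
```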
SP3D: Boosting Sparsely-Supervised 3D Object Detection via Accurate Cross-Modal Semantic Prompts
Shijia Zhao · Qiming Xia · Xusheng Guo · Pufan Zou · Maoji Zheng · Hai Wu · Chenglu Wen · Cheng Wang
Recently, sparsely-supervised 3D object detection has gained great attention, achieving performance close to fully-supervised 3D object detectors while requiring only a few annotated instances. Nevertheless, these methods face challenges when accurate labels are extremely scarce. In this paper, we propose a boosting strategy, termed SP3D, explicitly utilizing the cross-modal semantic prompts generated from Large Multimodal Models (LMMs) to boost the 3D detector with robust feature discrimination capability under sparse annotation settings. Specifically, we first develop a Confident Points Semantic Transfer (CPST) module that generates accurate cross-modal semantic prompts through boundary-constrained center cluster selection. Based on these accurate semantic prompts, which we treat as seed points, we introduce a Dynamic Cluster Pseudo-label Generation (DCPG) module to yield pseudo-supervision signals from the geometry shape of multi-scale neighbor points. Additionally, we design a Distribution Shape score (DS score) that chooses high-quality supervision signals for the initial training of the 3D detector. Experiments on the KITTI dataset and Waymo Open Dataset (WOD) have validated that SP3D can enhance the performance of sparsely supervised detectors by a large margin under meager labeling conditions. Moreover, we verified SP3D in the zero-shot setting, where its performance exceeded that of the state-of-the-art methods. The code will be made publicly available.
Segment Anything, Even Occluded
Wei-En Tai · Yu-Lin Shih · Cheng Sun · Yu-Chiang Frank Wang · Hwann-Tzong Chen
Amodal instance segmentation, which aims to detect and segment both visible and invisible parts of objects in images, plays a crucial role in various applications including autonomous driving, robotic manipulation, and scene understanding. While existing methods require training both front-end detectors and mask decoders jointly, this approach lacks flexibility and fails to leverage the strengths of pre-existing modal detectors. To address this limitation, we propose SAMEO, a novel framework that adapts the Segment Anything Model (SAM) as a versatile mask decoder capable of interfacing with various front-end detectors to enable mask prediction even for partially occluded objects. Acknowledging the constraints of limited amodal segmentation datasets, we introduce Amodal-LVIS, a large-scale synthetic dataset comprising 300K images derived from the modal LVIS and LVVIS datasets. This dataset significantly expands the training data available for amodal segmentation research. Our experimental results demonstrate that our approach, when trained on the newly extended dataset, including Amodal-LVIS, achieves remarkable zero-shot performance on both COCOA-cls and D2SA benchmarks, highlighting its potential for generalization to unseen scenarios.
BFANet: Revisiting 3D Semantic Segmentation with Boundary Feature Analysis
Weiguang Zhao · Rui Zhang · Qiufeng Wang · Guangliang Cheng · Kaizhu Huang
3D semantic segmentation plays a fundamental and crucial role in understanding 3D scenes. While contemporary state-of-the-art techniques predominantly concentrate on elevating the overall performance of 3D semantic segmentation based on general metrics (e.g., mIoU, mAcc, and oAcc), they unfortunately leave the exploration of challenging regions for segmentation mostly neglected. In this paper, we revisit 3D semantic segmentation through a more granular lens, shedding light on subtle complexities that are typically overshadowed by broader performance metrics. Concretely, we have delineated 3D semantic segmentation errors into four comprehensive categories, as well as corresponding evaluation metrics tailored to each. Building upon this categorical framework, we introduce an innovative 3D semantic segmentation network called BFANet that incorporates detailed analysis of semantic boundary features. First, we design the boundary-semantic module to decouple point cloud features into semantic and boundary features, and fuse their query queue to enhance semantic features with attention. Second, we introduce a more concise and accelerated boundary pseudo-label calculation algorithm, which is 3.9 times faster than the state-of-the-art, offering compatibility with data augmentation and enabling efficient computation in training. Extensive experiments on benchmark data indicate the superiority of our BFANet model, confirming the significance of emphasizing the four uniquely designed metrics. In particular, our method ranks 2nd on the ScanNet200 official benchmark challenge, presenting the highest mIoU so far when excluding the 1st-place winner, which involves large-scale training with auxiliary data.
SCSegamba: Lightweight Structure-Aware Vision Mamba for Crack Segmentation in Structures
Hui Liu · Chen Jia · Fan Shi · Xu Cheng · Shengyong Chen
Pixel-level segmentation of structural cracks across various scenarios remains a considerable challenge. Current methods struggle to effectively model crack morphology and texture, and to balance segmentation quality against low computational resource usage. To overcome these limitations, we propose a lightweight Structure-Aware Vision Mamba Network (SCSegamba), capable of generating high-quality pixel-level segmentation maps by leveraging both the morphological information and texture cues of crack pixels with minimal computational cost. Specifically, we developed a Structure-Aware Visual State Space module (SAVSS), which incorporates a lightweight Gated Bottleneck Convolution (GBC) and a Structure-Aware Scanning Strategy (SASS). The key insight of GBC lies in its effectiveness in modeling the morphological information of cracks, while the SASS enhances the perception of crack topology and texture by strengthening the continuity of semantic information between crack pixels. Experiments on crack benchmark datasets demonstrate that our method outperforms other state-of-the-art (SOTA) methods, achieving the highest performance with only 2.8M parameters. On the multi-scenario dataset, our method reached 0.8390 in F1 score and 0.8479 in mIoU.
Despite the significant progress in continual image segmentation, existing methods still struggle to balance stability and plasticity. Additionally, they are specialized to specific tasks and models, which hinders extension to more general settings. In this work, we present CUE, a novel Continual Universal sEgmentation pipeline that not only inherently tackles the stability-plasticity dilemma, but also unifies any segmentation across tasks and models. Our key insight: any segmentation task can be reformulated as an understanding-then-refinement paradigm, which is inspired by humans' visual perception system to first perform high-level semantic understanding, then focus on low-level vision cues. We claim three desiderata for this design: Continuity, by inherently avoiding the stability-plasticity dilemma via exploiting the natural differences between high-level and low-level knowledge; Generality, by unifying and simplifying the landscape towards various segmentation tasks; and Efficiency, as an interesting by-product, by significantly reducing the research effort. Our resulting model, built upon this pipeline with complementary expert models, shows significant improvements over previous state-of-the-art methods across various segmentation tasks and datasets. We believe that our work is a significant step towards making continual segmentation more universal and practicable.
Segment This Thing: Foveated Tokenization for Efficient Point-Prompted Segmentation
Tanner Schmidt · Richard Newcombe
This paper presents Segment This Thing (STT), a new efficient image segmentation model designed to produce a single segment given a single point prompt. Instead of following prior work and increasing efficiency by decreasing model size, we gain efficiency by foveating input images. Given an image and a point prompt, we extract a crop centered on the prompt and apply a novel variable-resolution patch tokenization in which patches are downsampled at a rate that increases with increased distance from the prompt. This approach yields far fewer image tokens than uniform patch tokenization. As a result we can drastically reduce the computational cost of segmentation without reducing model size. Furthermore, the foveation focuses the model on the region of interest, a potentially useful inductive bias. We show that our Segment This Thing model is more efficient than prior work while remaining competitive on segmentation benchmarks. It can easily run at interactive frame rates on consumer hardware and is thus a promising tool for augmented reality or robotics applications.
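A toy version of foveated patch tokenization might look like the following, with patch size doubling per ring away from the prompt (the ring schedule, pooling, and coverage bookkeeping are assumptions, not the paper's tokenizer). On the toy 256x256 example it yields far fewer tokens than a uniform 16-pixel grid, which is the efficiency argument made in the abstract.

```python
# Toy sketch of foveated patch tokenization: patches near the prompt keep the base
# size, while patch size doubles with each ring outward, producing fewer tokens.
import numpy as np

def foveated_tokens(image: np.ndarray, prompt_xy, base_patch=16):
    """image: (H, W, C); prompt_xy: (x, y). Returns pooled one-vector-per-patch tokens."""
    H, W, _ = image.shape
    px, py = prompt_xy
    tokens = []
    covered = np.zeros((H, W), dtype=bool)
    for ring, size in enumerate([base_patch, base_patch * 2, base_patch * 4]):
        reach = base_patch * (2 ** (ring + 1)) * 2          # how far this ring extends
        y0, y1 = max(0, int(py) - reach), min(H, int(py) + reach)
        x0, x1 = max(0, int(px) - reach), min(W, int(px) + reach)
        for y in range(y0, y1, size):
            for x in range(x0, x1, size):
                if covered[y, x]:
                    continue                                 # already claimed by a finer ring
                patch = image[y:y + size, x:x + size]
                tokens.append(patch.mean(axis=(0, 1)))       # pool each patch to one token
                covered[y:y + size, x:x + size] = True
    return np.stack(tokens)

if __name__ == "__main__":
    img = np.random.rand(256, 256, 3)
    toks = foveated_tokens(img, prompt_xy=(128, 128))
    print(len(toks), "foveated tokens vs", (256 // 16) ** 2, "uniform 16x16 tokens")
```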
Probabilistic Prompt Distribution Learning for Animal Pose Estimation
Jiyong Rao · Brian Nlong Zhao · Yu Wang
Multi-species animal pose estimation has emerged as a challenging yet critical task, hindered by substantial visual diversity and uncertainty. This paper tackles the problem through efficient prompt learning for Vision-Language Pretrained (VLP) models, e.g., CLIP, aiming to resolve the cross-species generalization problem. The core of the solution lies in prompt design, probabilistic prompt modeling, and cross-modal adaptation, enabling prompts to compensate for cross-modal information and effectively overcome large data variances under unbalanced data distributions. To this end, we propose a novel probabilistic prompting approach that fully explores textual descriptions, which alleviates the diversity issues caused by the long-tail property and increases the adaptability of prompts to unseen category instances. Specifically, we first introduce a set of learnable prompts and propose a diversity loss to maintain distinctiveness among prompts, thus representing diverse image attributes. Diverse textual probabilistic representations are sampled and used as guidance for pose estimation. Subsequently, we explore three different cross-modal fusion strategies at the spatial level to alleviate the adverse impacts of visual uncertainty. Extensive experiments on multi-species animal pose benchmarks show that our method achieves state-of-the-art performance under both supervised and zero-shot settings.
Navigating the Unseen: Zero-shot Scene Graph Generation via Capsule-Based Equivariant Features
Wenhuan Huang · Yi Ji · Guiqian Zhu · Ying Li · Chunping Liu
In scene graph generation (SGG), the accurate prediction of unseen triples is essential for effectiveness in downstream vision-language tasks. We hypothesize that the predicates of unseen triples can be viewed as transformations of seen predicates in feature space, and that the essence of the zero-shot task is to bridge the gap caused by this transformation. Traditional models, however, have difficulty addressing this challenge, which we attribute to their inability to model predicate equivariance. To overcome this limitation, we introduce a novel framework based on capsule networks (CAPSGG). We propose a $\textbf{Three-Stream Pipeline}$ that generates modality-specific representations for predicates while building low-level predicate capsules for these modalities. These capsules are then aggregated into high-level predicate capsules using a $\textbf{Routing Capsule Layer}$. In addition, we introduce $\textbf{GroupLoss}$ to aggregate capsules with the same predicate label into groups. This replaces the global loss with an intra-group loss, effectively balancing the learning of predicate-invariant and equivariant features while mitigating the impact of the severe long-tail distribution of predicate categories. Our extensive experiments demonstrate the notable superiority of our approach over state-of-the-art methods, with zero-shot indicators outperforming T-CAR [21] by up to $\textbf{132.26\%}$ on the SGCls task. Our code will be available upon publication.
ASHiTA: Automatic Scene-grounded HIerarchical Task Analysis
Yun Chang · Leonor Fermoselle · Duy Ta · Bernadette Bucher · Luca Carlone · Jiuguang Wang
While recent work in scene reconstruction and understanding has made strides in grounding natural language to physical 3D environments, it is still challenging to ground abstract, high-level instructions to a 3D scene. High-level instructions might not explicitly invoke semantic elements in the scene, and even the process of breaking a high-level task into a set of more concrete subtasks —a process called hierarchical task analysis— is environment-dependent. In this work, we propose ASHiTA, the first framework that generates a task hierarchy grounded to a 3D scene graph by breaking down high-level tasks into grounded subtasks. ASHiTA alternates LLM-assisted hierarchical task analysis —to generate the task breakdown— with task-driven scene graph construction to generate a suitable representation of the environment. Our experiments show that ASHiTA performs significantly better than LLM baselines in breaking down high-level tasks into environment-dependent subtasks and is additionally able to achieve grounding performance comparable to state-of-the-art methods.
LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Models
Fan-Yun Sun · Weiyu Liu · Siyi Gu · Dylan Lim · Goutam Bhat · Federico Tombari · Manling Li · Nick Haber · Jiajun Wu
Open-universe 3D layout generation arranges unlabeled 3D assets conditioned on language instructions. Large language models (LLMs) struggle to generate physically plausible 3D scenes and to adhere to input instructions, particularly in dense scenes. We introduce LayoutVLM, a framework and scene layout representation that exploits the semantic knowledge of Vision-Language Models (VLMs) and supports differentiable optimization to ensure physical plausibility. LayoutVLM employs VLMs to generate two mutually reinforcing representations from visually marked images, and a self-consistent decoding process to improve VLMs' spatial planning. Our experiments show that LayoutVLM addresses the limitations of existing LLM and constraint-based approaches, producing physically plausible 3D layouts that are better aligned with the semantic intent of input language instructions. We also demonstrate that fine-tuning VLMs with the proposed scene layout representation extracted from existing scene datasets can improve performance.
Depth estimation is a core problem in robotic perception and vision tasks, but 3D reconstruction from a single image presents inherent uncertainties. With the development of deep learning, current methods primarily rely on inter-image relationships to train supervised models, often overlooking the intrinsic information provided by the camera itself. From the perspective of embodied intelligence, perception and understanding are not only based on external data inputs but are also closely linked to the physical environment in which the model is embedded. Following this concept, we propose a method that embeds the camera model and its physical characteristics into a deep learning model to compute Embodied Scene Depth through interactions with road environments. This approach leverages the intrinsic properties of the camera and provides robust depth priors without the need for additional equipment. By combining Embodied Scene Depth with RGB image features, the model gains a comprehensive perspective of both geometric and visual details. Additionally, we incorporate text descriptions containing environmental content and depth information as another dimension of embodied intelligence, embedding them as scale priors for scene understanding, thus enriching the model's perception of the scene. This integration of image and language — two inherently ambiguous modalities — leverages their complementary strengths for monocular depth estimation, ensuring a more realistic understanding of scenes in diverse environments. We validated this method on the outdoor datasets KITTI and Cityscapes, with experimental results demonstrating that this embodied intelligence-based depth estimation method consistently enhances model performance across different scenes.
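One plausible instantiation of a camera-derived depth prior is flat-ground geometry: with known intrinsics and camera height, every pixel below the horizon maps to a road depth in closed form. The sketch below implements that standard ground-plane formula as an illustration; the paper's Embodied Scene Depth may additionally model pitch, road slope, or other interactions, and the KITTI-like numbers in the example are assumptions.

```python
import numpy as np

def ground_plane_depth(H, W, fy, cy, cam_height, max_depth=80.0):
    """Illustrative flat-road depth prior from camera geometry alone.

    Assumes a level camera at `cam_height` metres above a planar road:
    a pixel on image row v (below the horizon row cy) back-projects to a
    ground point at depth Z = fy * cam_height / (v - cy). This is one
    plausible way to realise an "embodied" depth prior; the paper's exact
    formulation may differ (e.g. it may account for camera pitch).
    """
    v = np.arange(H, dtype=np.float64)
    depth_row = np.full(H, np.nan)                   # rows above the horizon stay undefined
    below = v > cy                                   # only rows under the horizon hit the road
    depth_row[below] = fy * cam_height / (v[below] - cy)
    depth_row = np.clip(depth_row, 0.0, max_depth)
    return np.tile(depth_row[:, None], (1, W))       # same depth across each row

if __name__ == "__main__":
    # rough KITTI-like intrinsics: fy ~ 721 px, cy ~ 173 px, camera ~1.65 m high
    prior = ground_plane_depth(H=375, W=1242, fy=721.0, cy=173.0, cam_height=1.65)
    print(prior[300, 0])   # depth prior for a row near the bottom of the image
```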
SpiritSight Agent: Advanced GUI Agent with One Look
Zhiyuan Huang · Ziming Cheng · Junting Pan · Zhaohui Hou · Mingjie Zhan
Graphical User Interface (GUI) agents show amazing abilities in assisting human-computer interaction, automating human users' navigation on digital devices. An ideal GUI agent is expected to achieve high accuracy, low latency, and compatibility across different GUI platforms. Recent vision-based approaches have shown promise by leveraging advanced Vision Language Models (VLMs). While they generally meet the requirements of compatibility and low latency, these vision-based GUI agents tend to have low accuracy due to their limitations in element grounding. To address this issue, we propose $\textbf{SpiritSight}$, a vision-based, end-to-end GUI agent that excels in GUI navigation tasks across various GUI platforms. First, we create a multi-level, large-scale, high-quality GUI dataset called $\textbf{GUI-Lasagne}$ using scalable methods, empowering SpiritSight with robust GUI understanding and grounding capabilities. Second, we introduce the $\textbf{Universal Block Parsing (UBP)}$ method to resolve the ambiguity problem in dynamic high-resolution visual inputs, further enhancing SpiritSight's ability to ground GUI objects. Through these efforts, the SpiritSight agent outperforms other advanced methods on diverse GUI benchmarks, demonstrating its superior capability and compatibility in GUI navigation tasks. The models and code will be made available upon publication.
3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination
Jianing "Jed" Yang · Xuweiyi Chen · Nikhil Madaan · Madhavan Iyengar · Shengyi Qian · David Fouhey · Joyce Chai
The integration of language and 3D perception is crucial for embodied agents and robots that comprehend and interact with the physical world. While large language models (LLMs) have demonstrated impressive language understanding and generation capabilities, their adaptation to 3D environments (3D-LLMs) remains in its early stages. A primary challenge is a lack of large-scale datasets with dense grounding between language and 3D scenes. We introduce 3D-GRAND, a pioneering large-scale dataset comprising 40,087 household scenes paired with 6.2 million densely-grounded scene-language instructions. Our results show that instruction tuning with 3D-GRAND significantly enhances grounding capabilities and reduces hallucinations in 3D-LLMs. As part of our contributions, we propose a comprehensive benchmark 3D-POPE to systematically evaluate hallucination in 3D-LLMs, enabling fair comparisons of models. Our experiments highlight a scaling effect between dataset size and 3D-LLM performance, emphasizing the importance of large-scale 3D-text datasets for embodied AI research. Our results demonstrate early signals for effective sim-to-real transfer, indicating that models trained on large synthetic data can perform well on real-world 3D scans. Through 3D-GRAND and 3D-POPE, we aim to equip the embodied AI community with resources and insights to lead to more reliable and better-grounded 3D-LLMs.
Collaborative Tree Search for Enhancing Embodied Multi-Agent Collaboration
Lizheng Zu · Lin Lin · Song Fu · Na Zhao · Pan Zhou
Embodied agents based on large language models (LLMs) face significant challenges in collaborative tasks, requiring effective communication and a reasonable division of labor to ensure efficient and correct task completion. Previous approaches with simple communication patterns can produce erroneous or incoherent agent actions, which leads to additional risks. To address these problems, we propose Cooperative Tree Search (CoTS), a framework designed to significantly improve collaborative planning and task execution efficiency among embodied agents. CoTS guides multiple agents to discuss long-term strategic plans within a modified Monte Carlo tree, searching along LLM-driven reward functions to provide a more thoughtful and promising approach to cooperation. Another key feature of our method is the introduction of a plan evaluation module, which not only prevents agent action confusion caused by frequent plan updates but also ensures plan updates when the current plan becomes unsuitable. Experimental results show that the proposed method performs excellently in planning, communication, and collaboration in embodied environments (CWAH and TDW-MAT), efficiently completing long-term, complex tasks and significantly outperforming existing methods.
CTRL-O: Language-Controllable Object-Centric Visual Representation Learning
Aniket Rajiv Didolkar · Andrii Zadaianchuk · Rabiul Awal · Maximilian Seitzer · Efstratios Gavves · Aishwarya Agrawal
Object-centric representation learning aims to decompose visual scenes into fixed-size vectors called "slots" or "object files", where each slot captures a distinct object. Current state-of-the-art object-centric models have shown remarkable success in object discovery across diverse domains, including complex real-world scenes. However, these models suffer from a key limitation: they lack controllability. Specifically, current object-centric models learn representations based on their preconceived understanding of objects and parts, without allowing user input to guide which objects are represented. Introducing controllability into object-centric models could unlock a range of useful capabilities, such as the ability to extract instance-specific representations from a scene. In this work, we propose a novel approach for user-directed control over slot representations by conditioning slots on language descriptions. The proposed ConTRoLlable Object-centric representation learning approach, which we term CTRL-O, achieves targeted object-language binding in complex real-world scenes without requiring mask supervision. Next, we apply these controllable slot representations to two downstream vision-language tasks: text-to-image generation and visual question answering. We find that the proposed approach enables instance-specific text-to-image generation and also achieves strong performance on visual question answering.
VLMs-Guided Representation Distillation for Efficient Vision-Based Reinforcement Learning
Haoran Xu · Peixi Peng · Guang Tan · Yiqian Chang · Luntong Li · Yonghong Tian
Vision-based Reinforcement Learning (VRL) attempts to establish associations between visual inputs and optimal actions through interactions with the environment. Given the high-dimensional and complex nature of visual data, it becomes essential to learn policy upon high-quality state representation. To this end, existing VRL methods primarily rely on interaction-collected data, combined with self-supervised auxiliary tasks. However, two key challenges remain: limited data samples and a lack of task-relevant semantic constraints. To tackle this, we propose \textbf{DGC}, a method that \textbf{d}istills \textbf{g}uidance from Visual Language Models (VLMs) alongside self-supervised learning into a \textbf{c}ompact VRL agent. Notably, we leverage the state representation capabilities of VLMs, rather than their decision-making abilities. Within DGC, a novel prompting-reasoning pipeline is designed to convert historical observations and actions into usable supervision signals, enabling semantic understanding within the compact visual encoder. By leveraging these distilled semantic representations, the VRL agent achieves significant improvements in the sample efficiency. Extensive experiments on the Carla benchmark demonstrate our state-of-the-art performance. The source code is available in the supplementary material.
VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models
Byung-Kwan Lee · Ryo Hachiuma · Yu-Chiang Frank Wang · Yong Man Ro · Yueh-Hua Wu
The recent surge in high-quality visual instruction tuning samples from closed-source vision-language models (VLMs) such as GPT-4V has accelerated the release of open-source VLMs across various model sizes. However, scaling VLMs to improve performance using larger models brings significant computational challenges, especially for deployment on resource-constrained devices like mobile platforms and robots. To address this, we propose VLsI: Verbalized Layers-to-Interactions, a new VLM family in 2B and 7B model sizes, which prioritizes efficiency without compromising accuracy. VLsI leverages a unique, layer-wise distillation process, introducing intermediate "verbalizers" that map features from each layer to natural language space, allowing smaller VLMs to flexibly align with the reasoning processes of larger VLMs. This approach mitigates the training instability often encountered in output imitation and goes beyond typical final-layer tuning by aligning the small VLMs’ layer-wise progression with that of the large ones. We validate VLsI across ten challenging vision-language benchmarks, achieving notable performance gains (11.0% for 2B and 17.4% for 7B) over GPT-4V without the need for model scaling, merging, or architectural changes.
Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding
Han Xiao · yina xie · Guanxin tan · Yinghao Chen · Rui Hu · Ke Wang · Aojun Zhou · Hao Li · Hao Shao · Xudong LU · Peng Gao · Yafei Wen · Xiaoxin Chen · Shuai Ren · Hongsheng Li
Visual Document Understanding has become essential with the increase of text-rich visual content. This field poses significant challenges due to the need for effective integration of visual perception and textual comprehension, particularly across diverse document types with complex layouts. Moreover, existing fine-tuning datasets for this domain often fall short in providing the detailed contextual information needed for robust understanding, leading to hallucinations and limited comprehension of spatial relationships among visual elements. To address these challenges, we propose an innovative pipeline that utilizes adaptive generation of markup languages, such as Markdown, JSON, HTML, and TiKZ, to build highly structured document representations and deliver contextually-grounded responses. We introduce two fine-grained structured datasets: DocMark-Pile, comprising approximately 3.8M pretraining data pairs for document parsing, and DocMark-Instruct, featuring 624k fine-tuning data annotations for grounded instruction following. Extensive experiments demonstrate that our proposed model significantly outperforms existing state-of-the-art MLLMs across a range of visual document understanding benchmarks, facilitating advanced reasoning and comprehension capabilities in complex visual scenarios.
CoSpace: Benchmarking Continuous Space Perception Ability for Vision-Language Models
Yiqi Zhu · Ziyue Wang · Can Zhang · Peng Li · Yang Liu
Vision-Language Models (VLMs) have recently witnessed significant progress in visual comprehension. As the permitted length of image context grows, VLMs can now comprehend a broader range of views and spaces. Current benchmarks provide insightful analysis of VLMs in tasks involving complex visual instruction following, multi-image understanding, and spatial reasoning. However, they usually focus on spatially irrelevant images or discrete images captured from varied viewpoints. The compositional characteristic of images captured from a static viewpoint remains underexplored. We term this characteristic $\textbf{Continuous Space Perception}$. Observing a scene from a static viewpoint while shifting orientation produces a series of spatially continuous images, enabling reconstruction of the entire space. In this paper, we present CoSpace, a multi-image visual understanding benchmark designed to assess the $\textbf{Co}$ntinuous $\textbf{Space}$ perception ability of VLMs. CoSpace contains 2,918 images and 1,626 question-answer pairs, covering seven types of tasks. We conduct evaluation across 16 proprietary and open-source VLMs. Results reveal pitfalls in the continuous space perception ability of most evaluated models, including proprietary ones. Interestingly, we find that the main discrepancy between open-source and proprietary models lies not in accuracy but in the consistency of responses. We believe that enhancing continuous space perception is essential for VLMs to perform effectively in real-world tasks, and we encourage further research to advance this capability.
Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation
Yuhui Zhang · Yuchang Su · Yiming Liu · Xiaohan Wang · James Burgess · Elaine Sui · Chenyu Wang · Josiah Aklilu · Alejandro Lozano · Anjiang Wei · Ludwig Schmidt · Serena Yeung
The rapid development of vision language models (VLMs) demands rigorous and reliable evaluation. However, current visual question answering (VQA) benchmarks often depend on open-ended questions, making accurate evaluation difficult due to the variability in natural language responses. To address this, we introduce AutoConverter, an agentic framework that automatically converts these open-ended questions into multiple-choice format, enabling objective evaluation while reducing the costly question creation process. Our experiments demonstrate that AutoConverter can generate correct and challenging multiple-choice questions, with VLMs demonstrating consistently similar or lower accuracy on these questions compared to human-created ones. Using AutoConverter, we construct VMCBench, a benchmark created by transforming 20 existing VQA datasets into a unified multiple-choice format, totaling 9,018 questions. We comprehensively evaluate 28 state-of-the-art VLMs on VMCBench, setting a new standard for scalable, consistent, and reproducible VLM evaluation.
CocoER: Aligning Multi-Level Feature by Competition and Coordination for Emotion Recognition
Xuli Shen · Hua Cai · Weilin Shen · Qing Xu · Dingding Yu · Weifeng Ge · Xiangyang Xue
With the explosion of human-machine interaction, emotion recognition has regained attention. Previous works focus on improving visual feature fusion and reasoning from multiple image levels. Although it is non-trivial to deduce a person's emotion by integrating multi-level features (head, body, and context), the recognition results at each level are usually different from one another, which creates inconsistency for the prevailing feature alignment methods and decreases recognition performance. In this work, we propose a multi-level image feature refinement method for emotion recognition (CocoER) to mitigate the impact of conflicting results from multi-level recognition. First, we leverage cross-level attention to improve visual feature consistency between hierarchically cropped head, body, and context windows. Then, vocabulary-informed alignment is incorporated into the recognition framework to produce pseudo labels and guide hierarchical visual feature refinement. To effectively fuse multi-level features, we design a competition process that eliminates irrelevant image-level predictions and a coordination process that enhances features across all levels. Extensive experiments are conducted on two popular datasets, and our method achieves state-of-the-art performance with multi-level interpretation results.
LoRASculpt: Sculpting LoRA for Harmonizing General and Specialized Knowledge in Multimodal Large Language Models
Jian Liang · Wenke Huang · Guancheng Wan · Qu Yang · Mang Ye
While Multimodal Large Language Models (MLLMs) excel at generalizing across modalities and tasks, effectively adapting them to specific downstream tasks while simultaneously retaining both general and specialized knowledge remains challenging. Although Low-Rank Adaptation (LoRA) is widely used to efficiently acquire specialized knowledge in MLLMs, it introduces substantial harmful redundancy during visual instruction tuning, which exacerbates the forgetting of general knowledge and degrades downstream task performance. To address this issue, we propose LoRASculpt to eliminate harmful redundant parameters, thereby harmonizing general and specialized knowledge. Specifically, under theoretical guarantees, we introduce sparse updates into LoRA to discard redundant parameters effectively. Furthermore, we propose a Conflict Mitigation Regularizer to refine the update trajectory of LoRA, mitigating knowledge conflicts with the pretrained weights. Extensive experimental results demonstrate that even at a very high degree of sparsity ($\le$ 5\%), our method simultaneously enhances generalization and downstream task performance. This confirms that our approach effectively mitigates the catastrophic forgetting issue and further promotes knowledge harmonization in MLLMs.
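To make the idea of sparse LoRA updates concrete, here is a hedged sketch: the low-rank update B @ A is pruned to its largest-magnitude entries, and a simple regularizer penalizes updates that overlap with large pretrained weights as a rough stand-in for the Conflict Mitigation Regularizer. The keep ratio, pruning criterion, and regularizer form are illustrative assumptions, not LoRASculpt's actual formulation.

```python
import torch

def sparsify_lora_update(A: torch.Tensor, B: torch.Tensor, keep_ratio: float = 0.05):
    """Keep only the largest-magnitude entries of the LoRA update B @ A.

    A: (r, in_features), B: (out_features, r). Illustrates discarding
    redundant LoRA parameters; the 5% figure is only an example threshold.
    """
    delta = B @ A                                   # dense low-rank update
    k = max(1, int(keep_ratio * delta.numel()))
    thresh = delta.abs().flatten().kthvalue(delta.numel() - k + 1).values
    mask = (delta.abs() >= thresh).float()
    return delta * mask

def conflict_regularizer(delta_sparse: torch.Tensor, W_pretrained: torch.Tensor):
    """Penalise updates that collide with large pretrained weights
    (a rough stand-in for the Conflict Mitigation Regularizer)."""
    importance = W_pretrained.abs() / (W_pretrained.abs().max() + 1e-8)
    return (importance * delta_sparse.abs()).sum()

if __name__ == "__main__":
    r, d_in, d_out = 8, 256, 256
    A, B = torch.randn(r, d_in) * 0.01, torch.randn(d_out, r) * 0.01
    W0 = torch.randn(d_out, d_in)
    delta = sparsify_lora_update(A, B, keep_ratio=0.05)
    print(float((delta != 0).float().mean()), float(conflict_regularizer(delta, W0)))
```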
Seek Common Ground While Reserving Differences: Semi-Supervised Image-Text Sentiment Recognition
Wuyou Xia · Guoli Jia · Sicheng Zhao · Jufeng Yang
Multimodal sentiment analysis has attracted extensive research attention as a growing number of users share images and text to express their emotions and opinions on social media. Collecting large amounts of labeled sentiment data is an expensive and challenging task due to the high cost of labeling and unavoidable label ambiguity. Semi-supervised learning (SSL) has been explored to utilize extensive unlabeled data and alleviate the demand for annotation. However, unlike typical multimodal tasks, the inconsistent sentiment between image and text leads to sub-optimal performance of SSL algorithms. To address this issue, we propose SCDR, the first semi-supervised image-text sentiment recognition framework. To better utilize the discriminative features of each modality, we decouple features into common and private parts and then use the private features to train unimodal classifiers for enhanced modality-specific sentiment representation. Considering the complex relation between modalities, we devise a modal selection-based attention module that adaptively assesses the dominant sentiment modality at the sample level to guide the fusion of multimodal representations. Furthermore, to prevent the model predictions from overly relying on common features under the guidance of multimodal labels, we design a pseudo-label filtering strategy based on the matching degree of prediction and dominant modality. Extensive experiments and comparisons on five publicly available datasets demonstrate that SCDR outperforms state-of-the-art methods. The code is provided in the supplementary material and will be released to the public.
Vision-Language Models Do Not Understand Negation
Kumail Alhamoud · Shaden Alshammari · Yonglong Tian · Guohao Li · Philip H.S. Torr · Yoon Kim · Marzyeh Ghassemi
Many practical vision-language applications require models that understand negation, e.g., when using natural language to retrieve images which contain certain objects but not others. Despite advancements in vision-language models (VLMs) through large-scale training, their ability to comprehend negation remains underexplored. This study addresses the question: how well do current VLMs understand negation? We introduce NegBench, a new benchmark designed to evaluate negation understanding across 18 task variations and 79k examples spanning image, video, and medical datasets. The benchmark consists of two core tasks designed to evaluate negation understanding in diverse multimodal settings: Retrieval with Negation and Multiple Choice Questions with Negated Captions. Our evaluation reveals that modern VLMs struggle significantly with negation, often performing at chance level. To address these shortcomings, we explore a data-centric approach wherein we finetune CLIP models on large-scale synthetic datasets containing millions of negated captions. We show that this approach can result in a 10\% increase in recall on negated queries and a 40\% boost in accuracy on multiple-choice questions with negated captions.
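The data-centric fix described above can be pictured as ordinary CLIP-style contrastive fine-tuning in which each image's negated caption is appended as an extra hard negative. The sketch below shows one such loss under the assumption of L2-normalized embeddings; it is an illustration of the idea, not the authors' training objective.

```python
import torch
import torch.nn.functional as F

def clip_loss_with_negated_captions(img_emb, pos_txt_emb, neg_txt_emb, temperature=0.07):
    """Illustrative InfoNCE-style loss in which each image's negated caption
    is appended as an extra hard negative. All embeddings are assumed
    L2-normalised; shapes are (B, D). This sketches the data-centric idea of
    training on negated captions, not the paper's exact objective.
    """
    logits_pos = img_emb @ pos_txt_emb.t() / temperature        # (B, B) in-batch negatives
    logits_neg = (img_emb * neg_txt_emb).sum(-1, keepdim=True) / temperature  # (B, 1) own negated caption
    logits = torch.cat([logits_pos, logits_neg], dim=1)         # (B, B + 1)
    labels = torch.arange(img_emb.size(0), device=img_emb.device)
    return F.cross_entropy(logits, labels)

if __name__ == "__main__":
    B, D = 8, 512
    img = F.normalize(torch.randn(B, D), dim=-1)
    pos = F.normalize(torch.randn(B, D), dim=-1)
    neg = F.normalize(torch.randn(B, D), dim=-1)
    print(float(clip_loss_with_negated_captions(img, pos, neg)))
```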
Alignment, Mining and Fusion: Representation Alignment with Hard Negative Mining and Selective Knowledge Fusion for Medical Visual Question Answering
Yuanhao Zou · Zhaozheng Yin
Medical Visual Question Answering (Med-VQA) is a challenging task that requires a deep understanding of both medical images and textual questions. Although recent works leveraging Medical Vision-Language Pre-training (Med-VLP) have shown strong performance on the Med-VQA task, there is still no unified solution for modality alignment, and the issue of hard negatives remains under-explored. Additionally, commonly used knowledge fusion techniques for Med-VQA may introduce irrelevant information. In this work, we propose a framework to address these challenges through three key contributions: (1) a unified solution for heterogeneous modality alignments across multiple levels, modalities, views, and stages, leveraging methods such as contrastive learning and optimal transport theory; (2) a hard negative mining method that employs soft labels for multi-modality alignments and enforces hard negative pair discrimination; and (3) a Gated Cross-Attention Module for Med-VQA that integrates the answer vocabulary as prior knowledge and selects relevant information from it. Our framework outperforms the previous state-of-the-art on widely used Med-VQA datasets like RAD-VQA, SLAKE, PathVQA and VQA-2019. The code will be publicly available.
Hybrid Global-Local Representation with Augmented Spatial Guidance for Zero-Shot Referring Image Segmentation
Ting Liu · Siyuan Li
Recent advances in zero-shot referring image segmentation (RIS), driven by models such as the Segment Anything Model (SAM) and CLIP, have made substantial progress in aligning visual and textual information. Despite these successes, the extraction of precise and high-quality mask region representations remains a critical challenge, limiting the full potential of RIS tasks. In this paper, we introduce a training-free, hybrid global-local feature extraction approach that integrates detailed mask-specific features with contextual information from the surrounding area, enhancing mask region representation. To further strengthen alignment between mask regions and referring expressions, we propose a spatial guidance augmentation strategy that improves spatial coherence, which is essential for accurately localizing described areas. By incorporating multiple spatial cues, this approach facilitates more robust and precise referring segmentation. Extensive experiments on standard RIS benchmarks demonstrate that our method significantly outperforms existing zero-shot referring segmentation models, achieving substantial performance gains. We believe our approach advances RIS tasks and establishes a versatile framework for region-text alignment, offering broader implications for cross-modal understanding and interaction. The code will be publicly available.
Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens
Kaihang Pan · Wang Lin · Zhongqi Yue · Tenglong Ao · Liyu Jia · Wei Zhao · Juncheng Li · Siliang Tang · Hanwang Zhang
Recent endeavors in Multimodal Large Language Models (MLLMs) aim to unify visual comprehension and generation by combining LLMs and diffusion models, the state of the art in each task, respectively. Existing approaches rely on spatial visual tokens, where image patches are encoded and arranged according to a spatial order (e.g., raster scan). However, we show that spatial tokens lack the recursive structure inherent to languages, and hence form a language that is impossible for an LLM to master. In this paper, we build a proper visual language by leveraging diffusion timesteps to learn discrete, recursive visual tokens. Our proposed tokens recursively compensate for the progressive attribute loss in noisy images as timesteps increase, enabling the diffusion model to reconstruct the original image at any timestep. This approach allows us to effectively integrate the strengths of LLMs in autoregressive reasoning and diffusion models in precise image generation, achieving seamless multimodal comprehension and generation within a unified framework. Extensive experiments show that we achieve a new SOTA for multimodal comprehension and generation simultaneously compared with other MLLMs.
UNIALIGN: Scaling Multimodal Alignment within One Unified Model
Bo Zhou · Liulei Li · Yujia Wang · Huafeng Liu · Yazhou Yao · Wenguan Wang
We present UNIALIGN, a unified model to align an arbitrary number of modalities ($\text{e.g.}$, image, text, audio, 3D point cloud, $\textit{etc.}$) through one encoder and a single training phase. Existing solutions typically employ distinct encoders for each modality, resulting in increased parameters as the number of modalities grows. In contrast, UNIALIGN proposes a modality-aware adaptation of the powerful mixture-of-experts (MoE) scheme and further integrates it with Low-Rank Adaptation (LoRA), efficiently scaling the encoder to accommodate inputs in diverse modalities while maintaining a fixed computational overhead. Moreover, prior work often requires separate training for each extended modality. This leads to task-specific models and further hinders communication between modalities. To address this, we propose a soft modality binding strategy that aligns all modalities using unpaired data samples across datasets. Two additional training objectives are introduced to distill knowledge from well-aligned anchor modalities and prior multimodal models, elevating UNIALIGN into a high-performance multimodal foundation model. Experiments on 11 benchmarks across 6 different modalities demonstrate that UNIALIGN achieves performance comparable to SOTA approaches, while using merely 7.8M trainable parameters and maintaining an identical model with the same weights across all tasks. Our code will be released.
SpatialCLIP: Learning 3D-aware Image Representations from Spatially Discriminative Language
zehan wang · Sashuai zhou · Shaoxuan He · Haifeng Huang · Lihe Yang · Ziang Zhang · Xize Cheng · Shengpeng Ji · Tao Jin · Hengshuang Zhao · Zhou Zhao
Contrastive Language-Image Pre-training (CLIP) learns robust visual models through language supervision, making it a crucial visual encoding technique for various applications. However, CLIP struggles with comprehending spatial concepts in images, potentially restricting the spatial intelligence of CLIP-based AI systems. In this work, we propose SpatialCLIP, an enhanced version of CLIP with better spatial understanding capabilities. To capture the intricate 3D spatial relationships in images, we improve both "visual model" and "language supervision" of CLIP. Specifically, we design 3D-inspired ViT to replace the standard ViT in CLIP. By lifting 2D image tokens into 3D space and incorporating design insights from point cloud networks, our visual model gains greater potential for spatial perception. Meanwhile, captions with accurate and detailed spatial information are very rare. To explore better language supervision for spatial understanding, we re-caption images and perturb their spatial phrases as negative descriptions, which compels the visual model to seek spatial cues to distinguish these hard negative captions. With the enhanced visual model, we introduce SpatialLLaVA, following the same LLaVA-1.5 training protocol, to investigate the importance of visual representations for MLLM's spatial intelligence. Furthermore, we create SpatialBench, a benchmark specifically designed to evaluate CLIP and MLLM in spatial reasoning. SpatialCLIP and SpatialLLaVA achieve substantial performance improvements, demonstrating stronger capabilities in spatial perception and reasoning, while maintaining comparable results on general-purpose benchmarks.
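The hard-negative construction described above (perturbing spatial phrases in re-captioned text) can be illustrated with a tiny rule-based helper that flips one spatial relation per caption. The swap table and single-flip policy are hypothetical; the paper's perturbation procedure is presumably more sophisticated.

```python
# Hypothetical helper illustrating how spatial phrases in a caption could be
# perturbed to build hard negative descriptions; the paper's perturbation
# rules are likely richer than this word-swap table.
SPATIAL_SWAPS = {
    "left": "right", "right": "left",
    "above": "below", "below": "above",
    "in front of": "behind", "behind": "in front of",
    "near": "far from", "far from": "near",
}

def perturb_spatial_caption(caption: str) -> str:
    """Return a hard-negative caption with one spatial phrase flipped."""
    out = caption.lower()
    # Try multi-word phrases first so "in front of" is not partially matched.
    for phrase in sorted(SPATIAL_SWAPS, key=len, reverse=True):
        if phrase in out:
            out = out.replace(phrase, SPATIAL_SWAPS[phrase])
            break  # flip a single relation to keep the negative plausible
    return out

if __name__ == "__main__":
    print(perturb_spatial_caption("A mug sits to the left of the laptop, near the window."))
    # -> "a mug sits to the right of the laptop, near the window."
```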
Semantic and Expressive Variations in Image Captions Across Languages
Andre Ye · Sebastin Santy · Jena D. Hwang · Amy X Zhang · Ranjay Krishna
Most vision-language models today are primarily trained on English image-text pairs, with non-English pairs often filtered out. Evidence from cross-cultural psychology suggests that this approach biases models against perceptual modes exhibited by people who speak other (non-English) languages. We investigate semantic and expressive variation in image captions across different languages, analyzing both human-annotated datasets and model-produced captions. By analyzing captions across seven languages (English, French, German, Russian, Chinese, Japanese, Korean) in high-quality image captioning datasets (Crossmodal and Visual Genome), we find that multilingual caption sets tend to provide richer visual descriptions than monolingual (including English-only) ones; multilingual sets contain 46.0% more objects, 66.1% more relationships, and 66.8% more attributes. We observe the same results with multilingual captions produced by LLaVA and the Google Vertex API: for example, compared to monolingual captions, they cover 21.9% more objects, 18.8% more relations, and 20.1% more attributes. These results suggest that, across a large number of samples, different languages bias people and models to focus on different visual concepts. Finally, we show that models trained on image-text data in one language perform distinctly better on that language's test set. Our work points towards the potential value of training vision models on multilingual data sources to widen the range of descriptive information those models are exposed to.
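The counting protocol implied by these numbers can be illustrated with a toy example: per-language object mentions (already mapped to a shared vocabulary) are unioned and compared against a single language. The tiny hand-written caption sets below are purely illustrative; the extraction of objects, relations, and attributes from captions is assumed and not shown.

```python
# Toy illustration of the multilingual-vs-monolingual comparison:
# object mentions extracted per language (hand-written here for clarity)
# are translated into a shared vocabulary, unioned, and compared against
# any single language. Caption parsing itself is assumed and not shown.
captions_objects = {
    "en": {"dog", "frisbee", "grass"},
    "ja": {"dog", "frisbee", "sky", "shadow"},
    "de": {"dog", "grass", "collar"},
}

multilingual = set().union(*captions_objects.values())
english_only = captions_objects["en"]
gain = (len(multilingual) - len(english_only)) / len(english_only)
print(f"multilingual set covers {gain:.0%} more objects than English alone")
# -> multilingual set covers 100% more objects than English alone
```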
ReCon: Enhancing True Correspondence Discrimination through Relation Consistency for Robust Noisy Correspondence Learning
Quanxing Zha · Xin Liu · Shu-Juan Peng · Yiu-ming Cheung · Xing Xu · Nannan Wang
Can we accurately identify the true correspondences from multimodal datasets containing mismatched data pairs? Existing methods primarily emphasize the similarity matching between the representations of objects across modalities, potentially neglecting the crucial relation consistency within modalities, which is particularly important for distinguishing true from false correspondences. Such an omission often runs the risk of misidentifying negatives as positives, thus leading to unanticipated performance degradation. To address this problem, we propose a general $\textbf{Re}$lation $\textbf{Con}$sistency learning framework, namely $\textbf{ReCon}$, to accurately discriminate the true correspondences among the multimodal data and thus effectively mitigate the adverse impact caused by mismatches. Specifically, ReCon leverages a novel relation consistency learning scheme to ensure dual alignment: cross-modal relation consistency between different modalities and intra-modal relation consistency within each modality. Thanks to these dual constraints on relations, ReCon significantly enhances its effectiveness in true correspondence discrimination and therefore reliably filters out mismatched pairs to mitigate the risk of wrong supervision. Extensive experiments on three widely-used benchmark datasets, including Flickr30K, MS-COCO, and Conceptual Captions, are conducted to demonstrate the effectiveness and superiority of ReCon compared with other SOTAs. The code is available at: https://anonymous.4open.science/r/ReCon-NCL.
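A minimal sketch of an intra-modal relation consistency term, assuming batch-paired image and text embeddings: relations are expressed as softmax-normalized similarity distributions within each modality and matched with a symmetric KL divergence. The temperature and divergence choice are assumptions; ReCon's exact dual-alignment objective may differ.

```python
import torch
import torch.nn.functional as F

def relation_consistency_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor, tau: float = 0.1):
    """Sketch of an intra-modal relation consistency term.

    The idea (as described in the abstract) is that relations among samples
    *within* the image modality should agree with relations among their
    paired texts. Both relation structures are expressed as softmax-normalised
    similarity distributions and matched with a symmetric KL divergence.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    rel_img = F.softmax(img_emb @ img_emb.t() / tau, dim=-1)   # (B, B) intra-image relations
    rel_txt = F.softmax(txt_emb @ txt_emb.t() / tau, dim=-1)   # (B, B) intra-text relations
    kl = F.kl_div(rel_img.log(), rel_txt, reduction="batchmean")
    kl_rev = F.kl_div(rel_txt.log(), rel_img, reduction="batchmean")
    return 0.5 * (kl + kl_rev)

if __name__ == "__main__":
    loss = relation_consistency_loss(torch.randn(16, 256), torch.randn(16, 256))
    print(float(loss))
```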
Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding
Yan Shu · Zheng Liu · Peitian Zhang · Minghao Qin · Junjie Zhou · Zhengyang Liang · Tiejun Huang · Bo Zhao
Long video understanding poses a significant challenge for current Multi-modal Large Language Models (MLLMs). Notably, MLLMs are constrained by their limited context lengths and the substantial costs of processing long videos. Although several existing methods attempt to reduce visual tokens, their strategies encounter severe bottlenecks, restricting MLLMs' ability to perceive fine-grained visual details. In this work, we propose Video-XL, a novel approach that leverages MLLMs' inherent key-value (KV) sparsification capacity to condense the visual input. Specifically, we introduce a new special token, the Visual Summarization Token (VST), for each interval of the video, which summarizes the visual information within the interval as its associated KV. The VST module is trained by instruction fine-tuning, where two optimization strategies are used: 1. Curriculum learning, where the VST learns to perform small (easy) and large (hard) compression progressively. 2. Composite data curation, which integrates single-image, multi-image, and synthetic data to overcome the scarcity of long-video instruction data. The compression quality is further improved by dynamic compression, which customizes compression granularity based on the information density of different video intervals. Video-XL's effectiveness is verified from three aspects. First, it achieves superior long-video understanding capability, outperforming state-of-the-art models of comparable sizes across multiple popular benchmarks. Second, it effectively preserves video information, with minimal compression loss even at a 16x compression ratio. Third, it realizes outstanding cost-effectiveness, enabling high-quality processing of thousands of frames on a single A100 GPU.
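The token layout behind the VST mechanism can be sketched as follows: one summarization token is appended after every interval of visual tokens. What this sketch does not model is the KV sparsification itself (keeping only the VSTs' key-value states for the tokens they summarize); the interval size and embedding here are placeholders.

```python
import torch

def interleave_vst(frame_tokens: torch.Tensor, interval: int, vst_embed: torch.Tensor):
    """Sketch of inserting one Visual Summarization Token (VST) per interval.

    frame_tokens: (T, D) visual tokens for a video, vst_embed: (D,) learnable
    VST embedding. After every `interval` visual tokens a VST is appended; in
    Video-XL only the VSTs' key/value states would be retained for the tokens
    they summarise, which is the part this sketch does not model.
    """
    chunks = []
    for start in range(0, frame_tokens.size(0), interval):
        chunks.append(frame_tokens[start:start + interval])
        chunks.append(vst_embed.unsqueeze(0))          # summarisation slot for this interval
    return torch.cat(chunks, dim=0)

if __name__ == "__main__":
    tokens = torch.randn(10, 64)
    vst = torch.zeros(64)
    seq = interleave_vst(tokens, interval=4, vst_embed=vst)
    print(seq.shape)   # 10 visual tokens + 3 VSTs -> torch.Size([13, 64])
```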
Generative Zero-Shot Composed Image Retrieval
Lan Wang · Wei Ao · Vishnu Naresh Boddeti · Ser-Nam Lim
Composed Image Retrieval (CIR) is a vision-language task utilizing queries comprising images and textual descriptions to achieve precise image retrieval. This task seeks to find images that are visually similar to a reference image and incorporate specific changes or features described textually (visual delta). CIR enables a more flexible and user-specific retrieval by bridging visual data with verbal instructions. This paper introduces a novel generative method that augments Composed Image Retrieval by Composed Image Generation (CIG) to provide pseudo-target images. CIG utilizes a textual inversion network to map reference images into semantic word space, which generates pseudo-target images in combination with textual descriptions. These images serve as additional visual information, significantly improving the accuracy and relevance of retrieved images when integrated into existing retrieval frameworks. Experiments conducted across multiple CIR datasets and several baseline methods demonstrate improvements in retrieval performance, which shows the potential of our approach as an effective add-on for existing composed image retrieval.
IDEA: Inverted Text with Cooperative Deformable Aggregation for Multi-modal Object Re-Identification
Yuhao Wang · Yongfeng Lv · Pingping Zhang · Huchuan Lu
Multi-modal object Re-IDentification (ReID) aims to retrieve specific objects by utilizing complementary information from various modalities. However, existing methods focus on fusing heterogeneous visual features, neglecting the potential benefits of text-based semantic information. To address this issue, we first construct three text-enhanced multi-modal object ReID benchmarks. To be specific, we propose a standardized multi-modal caption generation pipeline for structured and concise text annotations with Multi-modal Large Language Models (MLLMs). Additionally, current methods often directly aggregate multi-modal features without selecting representative local features, leading to redundancy and high complexity. To address the above issues, we introduce IDEA, a novel feature learning framework comprising the Inverted Multi-modal Feature Extractor (IMFE) and Cooperative Deformable Aggregation (CDA). The IMFE utilizes Modal Prefixes and an InverseNet to integrate multi-modal information with semantic guidance from inverted text. The CDA adaptively generates sampling positions, enabling the model to focus on the interplay between global features and discriminative local features. With the constructed benchmarks and the proposed modules, our framework can generate more robust multi-modal features under complex scenarios. Extensive experiments on three multi-modal object ReID benchmarks demonstrate the effectiveness of our proposed method.
MP-GUI: Modality Perception with MLLMs for GUI Understanding
Ziwei Wang · Weizhi Chen · Leyang Yang · Sheng Zhou · Shengchu Zhao · Hanbei Zhan · Jiongchao Jin · Liangcheng Li · Zirui Shao · Jiajun Bu
Graphical user interfaces (GUIs) have become integral to modern society, making it crucial for them to be understood by human-centric systems. The rapid development of multi-modal large language models (MLLMs) in recent years has revealed their significant potential in GUI understanding. However, unlike natural images or documents, GUIs comprise artificially designed graphical elements arranged to convey specific semantic meanings. Current MLLMs, already proficient in processing graphical and textual components, still face hurdles in GUI understanding due to the lack of explicit spatial structure modeling. Moreover, obtaining high-quality spatial structure data is challenging due to privacy issues and noisy environments. To tackle these challenges, this paper presents MP-GUI, a specially designed MLLM for GUI understanding. MP-GUI features three precisely specialized perceivers to extract graphical, textual, and spatial modalities from GUIs, with a spatial structure enhancing strategy, adaptively combined via a fusion gate to meet the distinct requirements of different GUI interpretation tasks. To cope with the scarcity of high-quality data, we also introduce a pipeline for automatically collecting spatial information. Our extensive experiments demonstrate that MP-GUI achieves impressive results on numerous GUI understanding tasks even with a limited amount of generated data.
Towards Natural Language-Based Document Image Retrieval: New Dataset and Benchmark
Hao Guo · Xugong Qin · Jun Jie Ou Yang · Peng Zhang · Gangyan Zeng · Yubo Li · Hailun Lin
Document image retrieval (DIR) aims to retrieve document images from a gallery according to a given query. Existing DIR methods are primarily based on image queries that retrieve documents within the same coarse semantic category, e.g., newspapers or receipts. However, these methods struggle to effectively retrieve document images in real-world scenarios when using fine-grained semantics from text queries. To bridge this gap, this paper introduces a new benchmark of Natural Language-based Document Image Retrieval (NL-DIR) along with corresponding evaluation metrics. In this work, natural language descriptions serve as semantically rich queries for the DIR task. The NL-DIR dataset contains 41K authentic document images, each paired with five high-quality, fine-grained semantic queries generated and evaluated through large language models in conjunction with manual verification. We propose a two-stage retrieval method for DIR that enhances retrieval performance while optimizing both time and space efficiency. Furthermore, we perform zero-shot and fine-tuning evaluations of existing contrastive vision-language models and OCR-free visual document understanding (VDU) models on this dataset. The datasets and codes will be publicly available to facilitate research in the VDU community.
Incorporating Dense Knowledge Alignment into Unified Multimodal Representation Models
Yuhao Cui · Xinxing Zu · Wenhua Zhang · Zhongzhou Zhao · Jinyang Gao
Leveraging Large Language Models (LLMs) for text representation has achieved significant success, but the exploration of using Multimodal LLMs (MLLMs) for multimodal representation remains limited. Previous MLLM-based representation studies have primarily focused on unifying the embedding space while neglecting the importance of multimodal alignment. As a result, their cross-modal retrieval performance falls markedly behind that of the CLIP series models. To address this, in our work, we 1) construct DeKon5M, a contrastive learning dataset enriched with dense multimodal knowledge, which efficiently enhances multimodal alignment capabilities in representation tasks. 2) design a framework for training unified representation on MLLMs. Building upon this unified representation framework and the dense knowledge dataset DeKon5M, we developed the dense knowledge representation model DeKR on Qwen2VL. Through extensive quantitative and qualitative experiments, our results demonstrate that DeKR not only aligns text, image, video, and text-image combinations within a unified embedding space but also achieves cross-modal retrieval performance comparable to SoTA CLIP series models. This fully validates the effectiveness of our approach and provides new insights for multimodal representation research.
MedUnifier: Unifying Vision-and-Language Pre-training on Medical Data with Vision Generation Task using Discrete Visual Representations
Ziyang Zhang · Yang Yu · Yucheng Chen · Xulei Yang · Si Yong Yeo
Despite significant progress in Vision-Language Pre-training (VLP), existing VLP approaches predominantly emphasize feature extraction and cross-modal comprehension, with limited attention to generating or transforming visual content. This misalignment constrains the model's ability to synthesize coherent and novel visual representations from textual prompts, thereby reducing the effectiveness of multi-modal learning. In this work, we propose \textbf{MedUnifier}, a unified vision-language pre-training framework tailored for medical data. MedUnifier seamlessly integrates text-grounded image generation capabilities with multi-modal learning strategies, including image-text contrastive alignment, image-text matching and image-grounded text generation. Unlike traditional methods that rely on continuous visual representations, our approach employs visual vector quantization, which not only facilitates a more cohesive learning strategy for cross-modal understanding but also enhances multi-modal generation quality by effectively leveraging discrete representations. Our framework's effectiveness is evidenced by experiments on established benchmarks, including uni-modal tasks (supervised fine-tuning), cross-modal tasks (image-text retrieval and zero-shot image classification), and multi-modal tasks (medical report generation, image synthesis), where it achieves state-of-the-art performance across various tasks. It also offers a highly adaptable tool designed for a broad spectrum of language and vision tasks in healthcare, marking an advancement toward the development of a genuinely generalizable AI model for medical contexts.
Non-Natural Image Understanding with Advancing Frequency-based Vision Encoders
Wang Lin · Qingsong Wang · Yueying Feng · Shulei Wang · Tao Jin · Zhou Zhao · Fei Wu · Chang Yao · Jingyuan Chen
Large language models (LLMs) have significantly enhanced cross-modal understanding capabilities by integrating visual encoders with textual embeddings, giving rise to multimodal large language models (MLLMs). However, these models struggle with non-natural images such as geometric figures and charts, particularly in fields like education and finance. Despite efforts to collect datasets and fine-tune MLLMs, the gap with natural image understanding is still evident, and the cost of collecting large and diverse non-natural image datasets is high. To address this, we analyze the limitations of transformer-based vision encoders (ViT) within existing MLLMs from a frequency perspective. Studies have shown that ViT models are less effective at capturing high-frequency information, impairing their ability to capture elements like points, lines, and angles in non-natural images. In response, we introduce FM-ViT, a frequency-modulated vision encoder that utilizes Fourier decomposition to extract high- and low-frequency components from self-attention features and re-weight them when tuning on non-natural images. In addition, we combine the features of CNN models with FM-ViT and propose EDGE, an MLLM with enhanced graphical encoders tailored for understanding non-natural images. Extensive experiments confirm the effectiveness of our FM-ViT and EDGE on 4 types of comprehension tasks (classification, retrieval, captioning, and question answering) over 3 types of non-natural images (geometric, charts, and functional).
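A hedged sketch of the frequency re-weighting idea: patch-token features reshaped into a 2D grid are decomposed with an FFT, split into low and high bands by a radial mask, re-scaled by two learnable gains, and transformed back. The cutoff, the per-band scalar gains, and the placement outside the attention block are illustrative assumptions rather than FM-ViT's actual design.

```python
import torch
import torch.nn as nn

class FrequencyReweight(nn.Module):
    """Minimal sketch of frequency-modulated feature re-weighting.

    Token features are viewed as an (H, W) grid, decomposed with a 2D FFT,
    split into low/high bands by a radial mask, and re-scaled by two learnable
    gains before the inverse FFT. FM-ViT applies this idea to self-attention
    features; the band split and placement here are illustrative assumptions.
    """

    def __init__(self, cutoff_ratio: float = 0.25):
        super().__init__()
        self.cutoff_ratio = cutoff_ratio
        self.low_gain = nn.Parameter(torch.tensor(1.0))
        self.high_gain = nn.Parameter(torch.tensor(1.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature grid reshaped from patch tokens
        B, C, H, W = x.shape
        freq = torch.fft.fftshift(torch.fft.fft2(x, norm="ortho"), dim=(-2, -1))
        yy, xx = torch.meshgrid(
            torch.arange(H, device=x.device) - H // 2,
            torch.arange(W, device=x.device) - W // 2,
            indexing="ij",
        )
        radius = torch.sqrt(yy.float() ** 2 + xx.float() ** 2)
        low_mask = (radius <= self.cutoff_ratio * min(H, W)).float()
        freq = freq * (self.low_gain * low_mask + self.high_gain * (1 - low_mask))
        out = torch.fft.ifft2(torch.fft.ifftshift(freq, dim=(-2, -1)), norm="ortho")
        return out.real

if __name__ == "__main__":
    layer = FrequencyReweight()
    print(layer(torch.randn(1, 32, 14, 14)).shape)
```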
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding
Hao Li · Changyao TIAN · Jie Shao · Xizhou Zhu · Zhaokai Wang · Jinguo Zhu · Wenhan Dou · Xiaogang Wang · Hongsheng Li · Lewei Lu · Jifeng Dai
The remarkable success of Large Language Models (LLMs) has extended to the multimodal domain, achieving outstanding performance in image understanding and generation. Recent efforts to develop unified Multimodal Large Language Models (MLLMs) that integrate these capabilities have shown promising results. However, existing approaches often involve complex designs in model architecture or training pipeline, increasing the difficulty of model training and scaling. In this paper, we propose SynerGen-VL, a simple yet powerful encoder-free MLLM capable of both image understanding and generation. To address challenges identified in existing encoder-free unified MLLMs, we introduce the token folding mechanism and the vision-expert-based progressive alignment pretraining strategy, effectively supporting high-resolution image understanding while reducing training complexity. After being trained on large-scale mixed image-text data with a unified next-token prediction objective, SynerGen-VL achieves or surpasses the performance of existing encoder-free unified MLLMs with comparable or smaller parameter sizes, and narrows the gap with task-specific state-of-the-art models, highlighting a promising path toward future unified MLLMs. Our code and models shall be released.
SmartCLIP: Modular Vision-language Alignment with Identification Guarantees
Shaoan Xie · Lingjing Kong · Yujia Zheng · Yu Yao · Zeyu Tang · Eric P. Xing · Guangyi Chen · Kun Zhang
Contrastive Language-Image Pre-training (CLIP)~\citep{radford2021learning} has emerged as a pivotal model in computer vision and multimodal learning, achieving state-of-the-art performance in aligning visual and textual representations through contrastive learning. However, CLIP struggles with potential information misalignment in many image-text datasets and suffers from entangled representations. On the one hand, short captions for a single image in datasets like MSCOCO may describe disjoint regions in the image, leaving the model uncertain about which visual features to retain or disregard. On the other hand, directly aligning long captions with images can lead to the retention of entangled details, preventing the model from learning disentangled, atomic concepts -- ultimately limiting its generalization on certain downstream tasks involving short prompts. In this paper, we establish theoretical conditions that enable flexible alignment between textual and visual representations across varying levels of granularity. Specifically, our framework ensures that a model can not only \emph{preserve} cross-modal semantic information in its entirety but also \emph{disentangle} visual representations to capture fine-grained textual concepts. Building on this foundation, we introduce SmartCLIP, a novel approach that identifies and aligns the most relevant visual and textual representations in a modular manner. Superior performance across various tasks demonstrates its capability to handle information misalignment and supports our identification theory.
Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training
Haicheng Wang · Chen Ju · Weixiong Lin · Mengting Chen · Shuai Xiao · Yixuan Huang · Chang Liu · Mingshuai Yao · Jinsong Lan · Ying Chen · Qingwen Liu · Yanfeng Wang
In the rapidly evolving field of vision-language models (VLMs), contrastive language-image pre-training (CLIP) has made significant strides, becoming the foundation for various downstream tasks. However, relying on a one-to-one (image, text) contrastive paradigm to learn alignment from large-scale, messy web data, CLIP faces a serious myopic dilemma, resulting in biases towards monotonous short texts and shallow visual expressivity. To overcome these issues, this paper advances CLIP into a novel holistic paradigm by updating both diverse data and alignment optimization. To obtain rich data at low cost, we use image-to-text captioning to generate multiple texts for each image from multiple perspectives, granularities, and hierarchies. Two gadgets are proposed to encourage textual diversity. To match such (image, multi-texts) pairs, we modify the CLIP image encoder to be multi-branch, and propose a multi-to-multi contrastive optimization for image-text part-to-part matching. As a result, diverse visual embeddings are learned for each image, bringing good interpretability and generalization. Extensive experiments and ablations across over ten benchmarks indicate that our holistic CLIP significantly outperforms the existing myopic CLIP, including on image-text retrieval, open-vocabulary classification, and dense visual tasks. Code for holistic CLIP will be released upon publication, to further promote the prosperity of VLMs.
Language-Guided Salient Object Ranking
Fang Liu · Yuhao Liu · Ke Xu · Shuquan Ye · Gerhard Hancke · Rynson W.H. Lau
Salient Object Ranking (SOR) aims to study human attention shifts across different objects in the scene. It is a challenging task, as it requires comprehension of the relations among the salient objects in the scene. However, existing works often overlook such relations or model them implicitly. In this work, we observe that when Large Vision-Language Models (LVLMs) describe a scene, they usually focus on the most salient object first, and then discuss the relations as they move on to the next (less salient) one. Based on this observation, we propose a novel Language-Guided Salient Object Ranking approach (named LG-SOR), which utilizes the internal knowledge within the LVLM-generated language descriptions, i.e., semantic relation cues and the implicit entity order cues, to facilitate saliency ranking. Specifically, we first propose a novel Text-Guided Visual Modulation (TGVM) module to incorporate semantic information in the description for saliency ranking. TGVM controls the flow of linguistic information to the visual features, suppresses noisy background image features, and enables propagation of useful textual features. We then propose a novel Text-Aware Visual Reasoning (TAVR) module to enhance model reasoning in object ranking, by explicitly learning a multimodal graph based on the entity and relation cues derived from the description. Extensive experiments demonstrate superior performances of our model on two SOR benchmarks.
HiRes-LLaVA: Restoring Fragmentation Input in High-Resolution Large Vision-Language Models
Runhui Huang · Xinpeng Ding · Chunwei Wang · Jianhua Han · Yulong Liu · Hengshuang Zhao · Hang Xu · Lu Hou · Wei Zhang · Xiaodan Liang
High-resolution image inputs allow Large Vision-Language Models (LVLMs) to capture finer visual details, improving comprehension. However, the increased training and computational costs associated with such inputs pose significant challenges. A common approach to mitigate these costs involves slicing the input into uniform patches using sliding windows, each aligned with the vision encoder’s input size. While efficient, this method fragments the input, disrupting the continuity of context, which negatively impacts cross-patch perception tasks. To address these limitations, we propose HiRes-LLaVA, a novel framework designed to efficiently process high-resolution inputs of any size without altering the original contextual and geometric information. HiRes-LLaVA introduces two key components: (i) a SliceRestore Adapter (SRA) that reconstructs sliced patches into their original form, enabling efficient extraction of both global and local features through down-up-sampling and convolutional layers, and (ii) a Self-Mining Sampler (SMS) that compresses visual tokens based on internal relationships, preserving original context and positional information while reducing training overhead. To assess the ability to handle context fragmentation, we construct a new benchmark, EntityGrid-QA, consisting of edge-related tasks. Extensive experiments demonstrate the superiority of HiRes-LLaVA on both existing public benchmarks and EntityGrid-QA. For example, with SRA, our method achieves a performance improvement of ∼ 12% over state-of-the-art LVLMs in addressing fragmentation issues. Additionally, our SMS outperforms other visual token downsamplers, while offering high data efficiency.
Mimic In-Context Learning for Multimodal Tasks
Yuchu Jiang · Jiale Fu · chenduo hao · Xinting Hu · Yingzhe Peng · Xin Geng · Xu Yang
Recently, In-context Learning (ICL) has become a significant inference paradigm in Large Multimodal Models (LMMs), utilizing a few in-context demonstrations (ICDs) to prompt LMMs for new tasks. However, the synergistic effects in multimodal data increase the sensitivity of ICL performance to the configurations of ICDs, stimulating the need for a more stable and general mapping function. Mathematically, in Transformer-based models, ICDs act as "shift vectors" added to the hidden states of query tokens. Inspired by this, we introduce Mimic In-Context Learning (MimIC) to learn stable and generalizable shift effects from ICDs. Specifically, compared with some previous shift vector-based methods, MimIC more strictly approximates the shift effects by integrating lightweight learnable modules into LMMs with four key enhancements: 1) inserting shift vectors after attention layers, 2) assigning a shift vector to each attention head, 3) making shift magnitude query-dependent, and 4) employing a layer-wise alignment loss. Extensive experiments on two LMMs (Idefics-9b and Idefics2-8b-base) across three multimodal tasks (VQAv2, OK-VQA, Captioning) demonstrate that MimIC outperforms existing shift vector-based methods. The code is available at https://anonymous.4open.science/r/MimIC/.
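As an editorial illustration of the shift-vector idea summarized above (a learnable, per-head shift with a query-dependent magnitude inserted after attention), the following PyTorch sketch may help; module names, shapes, and the sigmoid gating are assumptions rather than MimIC's released implementation.

```python
import torch
import torch.nn as nn

class QueryDependentShift(nn.Module):
    """Illustrative module: a learnable shift vector per attention head,
    added after the attention layer with a query-dependent magnitude.
    This is a toy rendition of the 'shift effect' idea, not MimIC itself."""

    def __init__(self, num_heads: int, head_dim: int):
        super().__init__()
        self.shift = nn.Parameter(torch.zeros(num_heads, head_dim))  # one shift vector per head
        self.gate = nn.Linear(head_dim, 1)                           # query-dependent scalar magnitude

    def forward(self, attn_out: torch.Tensor, queries: torch.Tensor) -> torch.Tensor:
        # attn_out, queries: (batch, num_heads, seq_len, head_dim)
        magnitude = torch.sigmoid(self.gate(queries))                # (batch, num_heads, seq_len, 1)
        return attn_out + magnitude * self.shift[None, :, None, :]   # broadcast the per-head shift
```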
VoCo-LLaMA: Towards Vision Compression with Large Language Models
Xubing Ye · Yukang Gan · Xiaoke Huang · Yixiao Ge · Yansong Tang
Vision-Language Models (VLMs) have achieved remarkable success in various multi-modal tasks, but they are often bottlenecked by the limited context window and high computational cost of processing high-resolution image inputs and videos. Vision compression can alleviate this problem by reducing the vision token count. Previous approaches compress vision tokens with external modules and force LLMs to understand the compressed ones, leading to visual information loss. However, the LLMs' understanding paradigm of vision tokens is not fully utilised in the compression learning process. We propose VoCo-LLaMA, the first approach to compress vision tokens using LLMs. By introducing Vision Compression tokens during the vision instruction tuning phase and leveraging attention distillation, our method distills how LLMs comprehend vision tokens into their processing of VoCo tokens. VoCo-LLaMA facilitates effective vision compression and improves the computational efficiency during the inference stage. Specifically, our method can achieve a 576 times compression rate while maintaining 83.7% performance. Furthermore, through continuous training using time-series compressed token sequences of video frames, VoCo-LLaMA demonstrates the ability to understand temporal correlations, outperforming previous methods on popular video question-answering benchmarks. Our approach presents a promising way to unlock the full potential of VLMs' contextual window, enabling more scalable multi-modal applications.
Harnessing Frozen Unimodal Encoders for Flexible Multimodal Alignment
Mayug Maniparambil · Raiymbek Akshulakov · YASSER ABDELAZIZ DAHOU DJILALI · Sanath Narayan · Ankit Singh · Noel O'Connor
Recent contrastive multimodal vision-language models like CLIP have demonstrated robust open-world semantic understanding, becoming the standard image backbones for vision-language applications. However, recent findings suggest high semantic similarity between well-trained unimodal encoders, which raises a key question: Are semantically similar embedding spaces separated only by simple projection transformations? To validate this, we propose a novel framework that aligns vision and language using frozen unimodal encoders. It involves selecting semantically similar encoders in the latent space, curating a concept-rich dataset of image-caption pairs, and training simple MLP projectors. We evaluated our approach on various tasks involving both strong unimodal vision (0-shot localization) and language encoders (multi-lingual, long context) and show that simple projectors retain unimodal capabilities in the joint embedding space. Furthermore, our best model, utilizing DINOv2 and the All-Roberta-Large text encoder, achieves 76% accuracy on ImageNet with a 20-fold reduction in data and 65-fold reduction in compute requirements compared to multimodal alignment where models are trained from scratch. The proposed framework enhances the accessibility of multimodal model development while enabling flexible adaptation across diverse scenarios. Code and curated datasets will be released soon.
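The recipe described above essentially amounts to training small MLP projectors on frozen unimodal features with a CLIP-style contrastive objective. The sketch below is a generic rendition of that recipe; hidden sizes, the symmetric InfoNCE loss, and all names are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Projector(nn.Module):
    """Simple MLP projector mapping frozen unimodal features into a shared
    space; dimensions and depth are placeholders."""

    def __init__(self, in_dim: int, out_dim: int, hidden: int = 2048):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.GELU(), nn.Linear(hidden, out_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)  # unit-norm embeddings

def clip_style_loss(img_feats: torch.Tensor, txt_feats: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Standard symmetric InfoNCE over a batch of paired (already projected
    and normalized) image/text features."""
    logits = img_feats @ txt_feats.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
```

Only the two projectors would receive gradients in this setup; the frozen encoders supply `img_feats` and `txt_feats` inputs.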
Towards Understanding How Knowledge Evolves in Large Vision-Language Models
Sudong Wang · Yunjian Zhang · Yao Zhu · Jianing Li · Zizhe Wang · Yanwei Liu · Xiangyang Ji
Large Vision-Language Models (LVLMs) are gradually becoming the foundation for many artificial intelligence applications. However, understanding their internal working mechanisms has continued to puzzle researchers, which in turn limits the further enhancement of their capabilities. In this paper, we seek to investigate how multimodal knowledge evolves and eventually gives rise to natural language in LVLMs. We design a series of novel strategies for analyzing internal knowledge within LVLMs, and delve into the evolution of multimodal knowledge from three levels, including single token probabilities, token probability distributions, and feature encodings. In this process, we identify two key nodes in knowledge evolution: the critical layers and the mutation layers, dividing the evolution process into three stages: rapid evolution, stabilization, and mutation. Our research is the first to reveal the trajectory of knowledge evolution in LVLMs, providing a fresh perspective for understanding their underlying mechanisms.
Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction
Shiyu Zhao · Zhenting Wang · Felix Juefei-Xu · Xide Xia · Miao Liu · Xiaofang Wang · Mingfu Liang · Ning Zhang · Dimitris N. Metaxas · Licheng Yu
Prevailing Multimodal Large Language Models (MLLMs) encode the input image(s) as vision tokens and feed them into the language backbone, similar to how Large Language Models (LLMs) process the text tokens. However, the number of vision tokens increases quadratically with image resolution, leading to huge computational costs. In this paper, we consider improving MLLM efficiency in two scenarios: (I) reducing computational cost without degrading performance, and (II) improving performance within a given budget. We start with our main finding that the ranking of each vision token sorted by attention scores is similar in each layer except the first layer. Based on it, we assume that the number of essential top vision tokens does not increase along layers. Accordingly, for Scenario I, we propose a greedy search algorithm (G-Search) to find the least number of vision tokens to keep at each layer from the shallow to the deep. Interestingly, G-Search is able to reach the optimal reduction strategy based on our assumption. For Scenario II, based on the reduction strategy from G-Search, we design a parametric sigmoid function (P-Sigmoid) to guide the reduction at each layer of the MLLM, whose parameters are optimized by Bayesian Optimization. Extensive experiments demonstrate that our approach can significantly accelerate those popular MLLMs, e.g., LLaVA and InternVL2 models, by more than $2 \times$ without performance drops. Our approach also far outperforms other token reduction methods when budgets are limited, achieving a better trade-off between efficiency and effectiveness.
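To make the P-Sigmoid idea concrete, the helper below is one hypothetical way a parametric sigmoid could map layer depth to the fraction of vision tokens to keep; the parameterization (steepness `alpha`, midpoint `beta`, floor `lo`) is purely illustrative, with the actual parameters intended to be found by Bayesian optimization as the abstract states.

```python
import math

def keep_ratio(layer_idx: int, num_layers: int, alpha: float, beta: float,
               lo: float = 0.1, hi: float = 1.0) -> float:
    """Illustrative schedule: the fraction of vision tokens kept decays
    smoothly from `hi` (shallow layers) toward `lo` (deep layers) following
    a parametric sigmoid in normalized depth. Not the paper's exact form."""
    t = layer_idx / max(num_layers - 1, 1)            # normalized depth in [0, 1]
    s = 1.0 / (1.0 + math.exp(-alpha * (t - beta)))   # sigmoid rises with depth
    return hi - (hi - lo) * s                         # keep-ratio falls from hi to lo
```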
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
ziang yan · Zhilin Li · Yinan He · Chenting Wang · Kunchang Li · Xinhao Li · Xiangyu Zeng · Zilei Wang · Yali Wang · Yu Qiao · Limin Wang · Yi Wang
Current multimodal large language models (MLLMs) struggle with fine-grained or precise understanding of visuals, even though they provide comprehensive perception and reasoning across a spectrum of vision applications. Recent studies either develop tool-using approaches or unify specific visual tasks into the autoregressive framework, often at the expense of overall multimodal performance. To address this issue and enhance MLLMs with visual tasks in a scalable fashion, we propose Task Preference Optimization (TPO), a novel method that utilizes differentiable task preferences derived from typical fine-grained visual tasks. TPO introduces learnable task tokens that establish connections between multiple task-specific heads and the MLLM. By leveraging rich visual labels during training, TPO significantly enhances the MLLM's multimodal capabilities and task-specific performance. Through multi-task co-training within TPO, we observe synergistic benefits that elevate individual task performance beyond what is achievable through single-task training methodologies. Our instantiation of this approach with VideoChat and LLaVA demonstrates an overall 14.6\% improvement in multimodal performance compared to baseline models. Additionally, MLLM-TPO demonstrates robust zero-shot capabilities across various tasks, performing comparably to state-of-the-art supervised models.
HalLoc: Token-level Localization of Hallucinations for Vision Language Models
Eunkyu Park · Minyeong Kim · Gunhee Kim
Hallucinations pose a significant challenge to the reliability of large vision-language models, making their detection essential for ensuring accuracy in critical applications. Current detection methods often rely on computationally intensive models, leading to high latency and resource demands. Their definitive outcomes also fail to account for real-world scenarios where the line between hallucinated and truthful information is unclear. To address these issues, we propose HalLoc, a dataset designed for efficient, probabilistic hallucination detection. It features 150K token-level annotated samples, including hallucination types, across Visual Question Answering (VQA), instruction-following, and image captioning tasks. This dataset facilitates the development of models that detect hallucinations with graded confidence, enabling more informed user interactions. Additionally, we introduce a baseline model trained on HalLoc, offering low-overhead, concurrent hallucination detection during generation. The model can be seamlessly integrated into existing VLMs, improving reliability while preserving efficiency. The prospect of a robust plug-and-play hallucination detection module opens new avenues for enhancing the trustworthiness of vision-language models in real-world applications.
Octopus: Alleviating Hallucination via Dynamic Contrastive Decoding
Wei Suo · Lijun Zhang · Mengyang Sun · Lin Yuanbo Wu · Peng Wang · Yanning Zhang
Large Vision-Language Models (LVLMs) have obtained impressive performance in visual content understanding and multi-modal reasoning. Unfortunately, these large models suffer from serious hallucination problems and tend to generate fabricated responses. Recently, several Contrastive Decoding (CD) strategies have been proposed to alleviate hallucination by introducing disturbed inputs. Although great progress has been made, these CD strategies mostly apply a one-size-fits-all approach to all input conditions. In this paper, we revisit this process through extensive experiments. The results show that the causes of hallucination are hybrid and that each generative step faces a unique hallucination challenge. Leveraging these meaningful insights, we introduce a simple yet effective Octopus-like framework that enables the model to adaptively identify hallucination types and create a dynamic CD workflow. Our Octopus framework not only outperforms existing methods across four benchmarks but also demonstrates excellent deployability and expansibility. Our code will be released.
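For readers unfamiliar with contrastive decoding, a plain (non-adaptive) CD step combines the next-token distributions from the clean and disturbed inputs as sketched below; Octopus's contribution, choosing the disturbance adaptively at each step, is not reproduced here, and the snippet is only a generic baseline.

```python
import torch

def contrastive_decoding_step(logits_clean: torch.Tensor,
                              logits_disturbed: torch.Tensor,
                              alpha: float = 1.0) -> torch.Tensor:
    """One generic contrastive-decoding step: boost tokens favored by the
    clean input while penalizing tokens the disturbed input also favors.
    `alpha` controls the strength of the contrast (an assumed knob)."""
    log_p = torch.log_softmax(logits_clean, dim=-1)
    log_q = torch.log_softmax(logits_disturbed, dim=-1)
    adjusted = (1.0 + alpha) * log_p - alpha * log_q
    return adjusted.argmax(dim=-1)  # greedy choice over the adjusted distribution
```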
Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention
Wenbin An · Feng Tian · Sicong Leng · Jiahao Nie · Haonan Lin · QianYing Wang · Ping Chen · Xiaoqin Zhang · Shijian Lu
Despite great success across various multimodal tasks, Large Vision-Language Models (LVLMs) often encounter object hallucinations with generated textual responses being inconsistent with the actual objects in images. We examine different LVLMs and pinpoint that one root cause of object hallucinations lies with deficient attention on discriminative image features. Specifically, LVLMs often predominantly attend to prompt-irrelevant global features instead of prompt-relevant local features, undermining their visual grounding capacity and leading to object hallucinations. We propose Assembly of Global and Local Attention (AGLA), a training-free and plug-and-play approach that mitigates hallucinations by assembling global features for response generation and local features for visual discrimination simultaneously. Specifically, we introduce an image-prompt matching scheme that captures prompt-relevant local features from images, leading to an augmented view of the input image where prompt-relevant content is highlighted while irrelevant distractions are suppressed. Hallucinations can thus be mitigated with a calibrated logit distribution derived from the generative global features of the original image and the discriminative local features of the augmented image. Extensive experiments show the superiority of AGLA in LVLM hallucination mitigation, demonstrating its wide applicability across both discriminative and generative tasks. Our data and code will be released.
BadToken: Token-level Backdoor Attacks to Multi-modal Large Language Models
Zenghui Yuan · Jiawen Shi · Pan Zhou · Neil Zhenqiang Gong · Lichao Sun
Multi-modal large language models (MLLMs) extend large language models (LLMs) to process multi-modal information, enabling them to generate responses to image-text inputs. MLLMs have been incorporated into diverse multi-modal applications, such as autonomous driving and medical diagnosis, via plug-and-play without fine-tuning. This deployment paradigm increases the vulnerability of MLLMs to backdoor attacks. However, existing backdoor attacks against MLLMs achieve limited effectiveness and stealthiness. In this work, we propose $\textit{BadToken}$, the first token-level backdoor attack to MLLMs. BadToken introduces two novel backdoor behaviors: $\textit{Token-substitution}$ and $\textit{Token-addition}$, which enable flexible and stealthy attacks by making token-level modifications to the original output for backdoored inputs. We formulate a general optimization problem that considers the two backdoor behaviors to maximize the attack effectiveness. We evaluate BadToken on two open-source MLLMs and various tasks. Our results show that our attack maintains the model's utility while achieving high attack success rates and stealthiness. We also show the real-world threats of BadToken in two scenarios, i.e., autonomous driving and medical diagnosis. Furthermore, we consider defenses including fine-tuning and input purification. Our results highlight the threat of our attack.
Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy
Joonhyun Jeong · Seyun Bae · Yeonsung Jung · Jaeryong Hwang · Eunho Yang
Despite the remarkable versatility of Large Language Models (LLMs) and Multimodal LLMs (MLLMs) to generalize across both language and vision tasks, LLMs and MLLMs have shown vulnerability to jailbreaking, generating textual outputs that undermine safety, ethical, and bias standards when exposed to harmful or sensitive inputs. With the recent advancement of safety-alignment via preference-tuning from human feedback, LLMs and MLLMs have been equipped with safety guardrails to yield safe, ethical, and fair responses with regard to harmful inputs. However, despite the significance of safety-alignment, its vulnerabilities remain largely underexplored. In this paper, we investigate the unexplored vulnerability of safety-alignment, examining its ability to consistently provide safety guarantees for out-of-distribution (OOD)-ifying harmful inputs that may fall outside the aligned data distribution. Our key observation is that OOD-ifying the vanilla harmful inputs highly increases the uncertainty of the model to discern the malicious intent within the input, leading to a higher chance of being jailbroken. Exploiting this vulnerability, we propose JOOD, a new Jailbreak framework via OOD-ifying inputs beyond the safety-alignment. We explore various off-the-shelf visual and textual transformation techniques for OOD-ifying the harmful inputs. Notably, we observe that even simple mixing-based techniques such as image mixup prove highly effective in increasing the uncertainty of the model, thereby facilitating the bypass of the safety-alignment. Experimental results across diverse jailbreak scenarios demonstrate that JOOD effectively jailbreaks recent proprietary LLMs and MLLMs such as GPT-4 and GPT-4V, which previous attack approaches have consistently struggled to jailbreak, with a high attack success rate.
Steering Away from Harm: An Adaptive Approach to Defending Vision Language Model Against Jailbreaks
Han Wang · Gang Wang · Huan Zhang
Vision Language Models (VLMs) can produce unintended and harmful content when exposed to adversarial attacks, particularly because their vision capabilities create new vulnerabilities. Existing defenses, such as input preprocessing, adversarial training, and response evaluation-based methods, are often impractical for real-world deployment due to their high costs. To address this challenge, we propose ASTRA, an efficient and effective defense by adaptively steering models away from adversarial feature directions to resist VLM attacks. Our key procedures involve finding transferable steering vectors representing the direction of harmful responses and applying adaptive activation steering to remove these directions at inference time. To create effective steering vectors, we randomly ablate the visual tokens from the adversarial images and identify those most strongly associated with jailbreaks. These tokens are then used to construct steering vectors. During inference, we perform the adaptive steering method that involves the projection between the steering vectors and calibrated activation, resulting in little performance degradation on benign inputs while strongly avoiding harmful outputs under adversarial inputs. Extensive experiments across multiple models and baselines demonstrate our state-of-the-art performance and high efficiency in mitigating jailbreak risks. Additionally, ASTRA exhibits good transferability, defending against both unseen attacks at design time (i.e., structured-based attacks) and adversarial images from diverse distributions.
R-TPT: Improving Adversarial Robustness of Vision-Language Models through Test-Time Prompt Tuning
Lijun Sheng · Jian Liang · Zilei Wang · Ran He
Vision-language models (VLMs), such as CLIP, have gained significant popularity as foundation models, with numerous fine-tuning methods developed to enhance performance on downstream tasks. However, due to their inherent vulnerability and the common practice of selecting from a limited set of open-source models, VLMs suffer from a higher risk of adversarial attacks than traditional visual models. Existing defense techniques typically rely on adversarial fine-tuning during training, which requires labeled data and is often difficult to generalize across tasks. To address these limitations, we propose robust test-time prompt tuning (R-TPT), which mitigates the impact of adversarial attacks during the inference stage. We first reformulate the classic marginal entropy objective by eliminating the term that introduces conflicts under adversarial conditions, retaining only the pointwise entropy minimization. Furthermore, we introduce a plug-and-play reliability-based weighted ensembling strategy, which aggregates useful information from reliable augmented views to strengthen the defense. R-TPT enhances defense against adversarial attacks without requiring labeled training data while offering high flexibility for inference tasks. Extensive experiments on widely used benchmarks with various attacks demonstrate the effectiveness of R-TPT. The code is available in supplementary materials.
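A minimal sketch of the two ingredients named above, pointwise entropy minimization over augmented views and a reliability-weighted ensemble, is given below; scoring reliability by negative entropy is an assumption about the paper's weighting, not its exact formula.

```python
import torch

def rtpt_style_objective(logits_views: torch.Tensor, temperature: float = 1.0):
    """Sketch of the test-time quantities described in the abstract.

    logits_views: (num_views, num_classes) class logits for augmented views
    of a single test image. Returns the pointwise-entropy loss to minimize
    and a reliability-weighted ensemble prediction (assumed weighting)."""
    probs = torch.softmax(logits_views / temperature, dim=-1)        # (V, C)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)    # (V,)
    loss = entropy.mean()                                            # pointwise entropy minimization
    weights = torch.softmax(-entropy, dim=0)                         # low-entropy views count more
    ensemble = (weights[:, None] * probs).sum(dim=0)                 # (C,) aggregated prediction
    return loss, ensemble
```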
ResCLIP: Residual Attention for Training-free Dense Vision-language Inference
Jinhong Deng · Yuhang Yang · Wen Li · Lixin Duan
While vision-language models like CLIP have shown remarkable success in open-vocabulary tasks, their application is currently confined to image-level tasks, and they still struggle with dense predictions. Recent works often attribute such deficiency in dense predictions to the self-attention layers in the final block, and have achieved commendable results by modifying the original query-key attention to self-correlation attention (e.g., query-query and key-key attention). However, these methods overlook the cross-correlation attention (query-key) properties, which capture the rich spatial correspondence. In this paper, we reveal that the cross-correlation of the self-attention in CLIP's non-final layers also exhibits localization properties. Therefore, we propose the Residual Cross-correlation Self-attention (RCS) module, which leverages the cross-correlation self-attention from intermediate layers to remold the attention in the final block. The RCS module effectively reorganizes spatial information, unleashing the localization potential within CLIP for dense vision-language inference. Furthermore, to enhance the focus on regions of the same categories and local consistency, we propose the Semantic Feedback Refinement (SFR) module, which utilizes semantic segmentation maps to further adjust the attention scores. By integrating these two strategies, our method, termed ResCLIP, can be easily incorporated into existing approaches as a plug-and-play module, significantly boosting their performance in dense vision-language inference. Extensive experiments across multiple standard benchmarks demonstrate that our method surpasses state-of-the-art training-free methods, validating the effectiveness of the proposed approach.
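The core RCS operation can be pictured as reusing query-key (cross-correlation) attention from an intermediate layer to aggregate the final block's values. The sketch below shows only that aggregation; how it is residually combined with the original final-block attention is left to the paper.

```python
import torch

def cross_correlation_attention(q_mid: torch.Tensor,
                                k_mid: torch.Tensor,
                                v_final: torch.Tensor) -> torch.Tensor:
    """Aggregate final-block values with query-key attention computed from
    an intermediate layer's queries and keys. Shapes: (..., seq_len, dim).
    Illustrative only; the residual weighting used by ResCLIP is omitted."""
    scale = q_mid.size(-1) ** -0.5
    attn = torch.softmax(q_mid @ k_mid.transpose(-2, -1) * scale, dim=-1)  # (..., L, L)
    return attn @ v_final
```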
Enhanced OoD Detection through Cross-Modal Alignment of Multi-Modal Representations
Jeonghyeon Kim · Sangheum Hwang
Prior research on out-of-distribution detection (OoDD) has primarily focused on single-modality models. Recently, with the advent of large-scale pretrained vision-language models such as CLIP, OoDD methods utilizing such multi-modal representations through zero-shot and prompt learning strategies have emerged. However, these methods typically involve either freezing the pretrained weights or only partially tuning them, which can be suboptimal for downstream datasets. In this paper, we highlight that multi-modal fine-tuning (MMFT) can achieve notable OoDD performance. Despite some recent works demonstrating the impact of fine-tuning methods for OoDD, there remains significant potential for performance improvement. We investigate the limitation of naive fine-tuning methods, examining why they fail to fully leverage the pretrained knowledge. Our empirical analysis suggests that this issue could stem from the modality gap within in-distribution (ID) embeddings. To address this, we propose a training objective that enhances cross-modal alignment by regularizing the distances between image and text embeddings of ID data. This adjustment helps in better utilizing pretrained textual information by aligning similar semantics from different modalities (i.e., text and image) more closely in the hyperspherical representation space. We theoretically demonstrate that the proposed regularization corresponds to the maximum likelihood estimation of an energy-based model on a hypersphere. Utilizing ImageNet-1k OoD benchmark datasets, we show that our method, combined with post-hoc OoDD approaches leveraging pretrained knowledge (e.g., NegLabel), significantly outperforms existing methods, achieving state-of-the-art OoDD performance and leading ID accuracy.
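In its simplest form, a regularizer of the kind described, pulling paired in-distribution image and text embeddings together on the hypersphere, can be written as a mean cosine distance; the sketch below is that simplest form and omits the weighting and the combination with the fine-tuning objective.

```python
import torch
import torch.nn.functional as F

def cross_modal_alignment_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
    """Illustrative regularizer: mean cosine distance between L2-normalized
    image and text embeddings of paired in-distribution samples, shrinking
    the modality gap on the hypersphere. Not the paper's exact objective."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    return (1.0 - (img * txt).sum(dim=-1)).mean()
```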
Rethinking Few-Shot Adaptation of Vision-Language Models in Two Stages
Matteo Farina · Massimiliano Mancini · Giovanni Iacca · Elisa Ricci
An old-school recipe for training a classifier is to (i) learn a good feature extractor and (ii) optimize a linear layer atop. When only a handful of samples are available per category, as in Few-Shot Adaptation (FSA), data are insufficient to fit a large number of parameters, rendering the above impractical. This is especially true with large pre-trained Vision-Language Models (VLMs), which motivated successful research at the intersection of Parameter-Efficient Fine-tuning (PEFT) and FSA. In this work, we start by analyzing the learning dynamics of PEFT techniques when trained on few-shot data from only a subset of categories, referred to as the “base” classes. We show that such dynamics naturally splits into two distinct phases: (i) task-level feature extraction and (ii) specialization to the available concepts. To accommodate this dynamic, we then depart from prompt- or adapter-based methods and tackle FSA differently. Specifically, given a fixed computational budget, we split it to (i) learn a task-specific feature extractor via PEFT and (ii) train a linear classifier on top. We call this scheme Two-Stage Few-Shot Adaptation (2SFS). Differently from established methods, our scheme enables a novel form of selective inference at a category level, i.e., at test time, only novel categories are embedded by the adapted text encoder, while embeddings of base categories are available within the classifier. Results with fixed hyperparameters across two settings, three backbones, and eleven datasets, show that 2SFS matches or surpasses the state-of-the-art, while established methods degrade significantly across settings.
Bayesian Test-Time Adaptation for Vision-Language Models
Lihua Zhou · Mao Ye · Shuaifeng Li · Nianxin Li · Xiatian Zhu · Lei Deng · Hongbin Liu · Zhen Lei
Test-time adaptation with pre-trained vision-language models, such as CLIP, aims to adapt the model to new, potentially out-of-distribution test data. Existing methods calculate the similarity between visual embedding and learnable class embeddings, which are initialized by text embeddings, for zero-shot image classification. In this work, we first analyze this process based on Bayes theorem, and observe that the core factors influencing the final prediction are the likelihood and the prior. However, existing methods essentially focus on adapting class embeddings to adapt likelihood, but they often ignore the importance of prior. To address this gap, we propose a novel approach, \textbf{B}ayesian \textbf{C}lass \textbf{A}daptation (BCA), which in addition to continuously updating class embeddings to adapt likelihood, also uses the posterior of incoming samples to continuously update the prior for each class embedding. This dual updating mechanism allows the model to better adapt to distribution shifts and achieve higher prediction accuracy. Our method not only surpasses existing approaches in terms of performance metrics but also maintains superior inference rates and memory usage, making it highly efficient and practical for real-world applications.
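A toy rendition of the dual update, a similarity-based likelihood plus a running class prior refreshed from the posteriors of incoming test samples, is sketched below; the exponential-moving-average prior update and the temperature are illustrative assumptions, not BCA's exact rules.

```python
import torch

class BayesianClassAdaptationToy:
    """Toy sketch: posterior = softmax(likelihood logits + log prior), with the
    prior continuously updated from incoming-sample posteriors (assumed EMA)."""

    def __init__(self, class_embeds: torch.Tensor, momentum: float = 0.99):
        # class_embeds: (num_classes, dim), e.g. initialized from text embeddings
        self.class_embeds = class_embeds / class_embeds.norm(dim=-1, keepdim=True)
        self.log_prior = torch.zeros(class_embeds.size(0))  # start from a uniform prior
        self.momentum = momentum

    def predict(self, img_embed: torch.Tensor, temperature: float = 0.01) -> torch.Tensor:
        # img_embed: (batch, dim) visual embeddings of incoming test samples
        img = img_embed / img_embed.norm(dim=-1, keepdim=True)
        log_lik = (img @ self.class_embeds.T) / temperature     # similarity-based likelihood
        posterior = torch.softmax(log_lik + self.log_prior, dim=-1)
        # update the class prior from the batch posterior (illustrative EMA rule)
        prior = torch.softmax(self.log_prior, dim=-1)
        prior = self.momentum * prior + (1 - self.momentum) * posterior.mean(dim=0)
        self.log_prior = prior.clamp_min(1e-12).log()
        return posterior
```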
Cropper: Vision-Language Model for Image Cropping through In-Context Learning
Seung Hyun Lee · Jijun jiang · Yiran Xu · Zhuofang Li · Junjie Ke · Yinxiao Li · Junfeng He · Steven Hickson · Katie Datsenko · Sangpil Kim · Ming-Hsuan Yang · Irfan Essa · Feng Yang
The goal of image cropping is to identify visually appealing crops in an image. Conventional methods are trained on specific datasets and fail to adapt to new requirements. Recent breakthroughs in large vision-language models (VLMs) enable visual in-context learning without explicit training. However, downstream tasks with VLMs remain underexplored. In this paper, we propose an effective approach to leverage VLMs for image cropping. First, we propose an efficient prompt retrieval mechanism for image cropping to automate the selection of in-context examples. Second, we introduce an iterative refinement strategy to iteratively enhance the predicted crops. The proposed framework, which we refer to as Cropper, is applicable to a wide range of cropping tasks, including free-form cropping, subject-aware cropping, and aspect ratio-aware cropping. Extensive experiments demonstrate that Cropper significantly outperforms state-of-the-art methods across several benchmarks.
ImagineFSL: Self-Supervised Pretraining Matters on Imagined Base Set for VLM-based Few-shot Learning
Haoyuan Yang · Xiaoou Li · Jiaming Lv · Xianjun Cheng · Qilong Wang · Peihua Li
Adapting CLIP models for few-shot recognition has recently attracted significant attention. Despite considerable progress, these adaptations remain hindered by the pervasive challenge of data scarcity. Text-to-image models, capable of generating abundant photorealistic labeled images, offer a promising solution. However, existing approaches treat synthetic images merely as complements to real images, rather than as standalone knowledge repositories stemming from distinct foundation models. To overcome this limitation, we reconceptualize synthetic images as an imagined base set, i.e., a unique, large-scale synthetic dataset encompassing diverse concepts. We introduce a novel CLIP adaptation methodology called ImagineFSL, involving pretraining on the imagined base set followed by fine-tuning on downstream few-shot tasks. We find that, compared to no pretraining, both supervised and self-supervised pretraining are beneficial, with the latter providing better performance. Building on this finding, we propose an improved self-supervised method tailored for few-shot scenarios, enhancing the transferability of representations from synthetic to real image domains. Additionally, we present an image generation pipeline that employs chain-of-thought and in-context learning techniques, harnessing foundation models to automatically generate diverse, realistic images. Our methods are validated across eleven datasets, consistently outperforming state-of-the-art methods by substantial margins.
SCAP: Transductive Test-Time Adaptation via Supportive Clique-based Attribute Prompting
Chenyu Zhang · Kunlun Xu · Zichen Liu · Yuxin Peng · Jiahuan Zhou
Vision-language models (VLMs) exhibit promising generalization capabilities, yet face considerable challenges when adapting to domain shifts stemming from changes in data distributions. Test-time adaptation (TTA) has thus emerged as a promising approach for enhancing VLM performance under such conditions. In practice, test data often arrives in batches, which has led to increasing interest in the transductive TTA setting. Existing TTA methods, however, are typically limited by focusing solely on individual test samples, thereby overlooking the critical cross-sample correlations within a batch. While recent ViT-based TTA methods have started to incorporate batch-level adaptation, they remain suboptimal for VLMs due to insufficient integration of the essential text modality. To bridge key gaps in TTA for VLMs, we propose a novel transductive TTA framework called Supportive Clique-based Attribute Prompting (SCAP), which effectively combines visual and textual information to enhance adaptation by generating fine-grained attribute prompts across test batches. SCAP first forms supportive cliques of test samples in an unsupervised manner based on visual similarity and learns an attribute prompt for each clique, capturing shared attributes critical for adaptation. For each test sample, SCAP aggregates attribute prompts from its associated cliques, providing enriched contextual information. To ensure adaptability over time, we incorporate a retention module that dynamically updates attribute prompts and their associated attributes as new data arrives. Comprehensive experiments across multiple benchmarks demonstrate that SCAP outperforms existing state-of-the-art methods, significantly advancing VLM generalization under domain shifts. The code will be released.
Interpreting Object-level Foundation Models via Visual Precision Search
Ruoyu Chen · Siyuan Liang · Jingzhi Li · Shiming Liu · Maosen Li · Zhen Huang · Hua Zhang · Xiaochun Cao
Advances in multimodal pre-training have propelled object-level foundation models, such as Grounding DINO and Florence-2, in tasks like visual grounding and object detection. However, interpreting these models’ decisions has grown increasingly challenging. Existing interpretable attribution methods for object-level task interpretation have notable limitations: (1) gradient-based methods lack precise localization due to visual-textual fusion in foundation models, and (2) perturbation-based methods produce noisy saliency maps, limiting fine-grained interpretability. To address these, we propose a Visual Precision Search method that generates accurate attribution maps with fewer regions. Our method bypasses internal model parameters to overcome attribution issues from multimodal fusion, dividing inputs into sparse sub-regions and using consistency and collaboration scores to accurately identify critical decision-making regions. We also conducted a theoretical analysis of the boundary guarantees and scope of applicability of our method. Experiments on RefCOCO, MS COCO, and LVIS show our approach enhances object-level task interpretability over SOTA for Grounding DINO and Florence-2 across various evaluation metrics, with faithfulness gains of 23.7\%, 31.6\%, and 20.1\% on MS COCO, LVIS, and RefCOCO for Grounding DINO, and 102.9\% and 66.9\% on MS COCO and RefCOCO for Florence-2. Additionally, our method can interpret failures in visual grounding and object detection tasks, surpassing existing methods across multiple evaluation metrics.
Towards Fine-Grained Interpretability: Counterfactual Explanations for Misclassification with Saliency Partition
ZHANG LINTONG · Kang Yin · Seong-Whan Lee
Attribution-based explanation techniques capture key patterns to enhance visual interpretability. However, these patterns often lack the granularity needed for insight in fine-grained tasks, particularly in cases of model misclassification, where explanations may be insufficiently detailed. To address this limitation, we propose a fine-grained counterfactual explanation framework that generates both object-level and part-level interpretability, addressing two fundamental questions: (1) which fine-grained features contribute to model misclassification, and (2) where dominant local features influence counterfactual adjustments. Our approach yields explainable counterfactuals in a non-generative manner by quantifying similarity and weighting component contributions within regions of interest between correctly classified and misclassified samples. Furthermore, we introduce an importance-isolation module grounded in Shapley value contributions, isolating features with region-specific relevance. Extensive experiments demonstrate the superiority of our approach in capturing more granular, intuitively meaningful regions, surpassing coarse-grained methods.
Show and Tell: Visually Explainable Deep Neural Nets via Spatially-Aware Concept Bottleneck Models
Itay Benou · Tammy Riklin Raviv
Modern deep neural networks have now reached human-level performance across a variety of tasks. However, unlike humans they lack the ability to explain their decisions by showing where and telling what concepts guided them. In this work, we present a unified framework for transforming any vision neural network into a spatially and conceptually interpretable model. We introduce a spatially-aware concept bottleneck layer that projects “black-box” features of pre-trained backbone models into interpretable concept maps, without requiring human labels. By training a classification layer over this bottleneck, we obtain a self-explaining model that articulates which concepts most influenced its prediction, along with heatmaps that ground them in the input image. Accordingly, we name this method “Spatially-Aware and Label-Free Concept Bottleneck Model” (SALF-CBM). Our results show that the proposed SALF-CBM: (1) Outperforms non-spatial CBM methods, as well as the original backbone, on a variety of classification tasks; (2) Produces high-quality spatial explanations, outperforming widely used heatmap-based methods on a zero-shot segmentation task; (3) Facilitates model exploration and debugging, enabling users to query specific image regions and refine the model's decisions by locally editing its concept maps.
VL2Lite: Task-Specific Knowledge Distillation from Large Vision-Language Models to Lightweight Networks
Jinseong Jang · Chunfei Ma · Byeongwon Lee
Deploying high-performing neural networks in resource-constrained environments poses a significant challenge due to the computational demands of large-scale models. We introduce VL2Lite, a knowledge distillation framework designed to enhance the performance of lightweight neural networks in image classification tasks by leveraging the rich representational knowledge from Vision-Language Models (VLMs). VL2Lite directly integrates multi-modal knowledge from VLMs into compact models during training, effectively compensating for the limited computational and modeling capabilities of smaller networks. By transferring high-level features and complex data representations, our approach improves the accuracy and efficiency of image classification tasks without increasing computational overhead during inference. Experimental evaluations demonstrate that VL2Lite achieves up to a 7% improvement in classification performance across various datasets. This method addresses the challenge of deploying accurate models in environments with constrained computational resources, offering a balanced solution between model complexity and operational efficiency.
DUNE: Distilling a Universal Encoder from Heterogeneous 2D and 3D Teachers
Mert Bülent Sarıyıldız · Philippe Weinzaepfel · Thomas Lucas · Pau de Jorge · Diane Larlus · Yannis Kalantidis
Recent multi-teacher distillation methods have successfully unified the encoders of several foundation models into a single encoder capable of competitive performance on core computer vision tasks, such as classification, segmentation, and depth estimation. This led us to ask: Could similar success be achieved when the pool of teachers also includes vision models specialized in diverse tasks across 2D and 3D perception? In this paper, we define and investigate the problem of heterogeneous teacher distillation, or co-distillation -- a challenging multi-teacher distillation scenario where teacher models vary significantly in both (a) their design objectives and (b) the data they were trained on. We explore strategies for data sharing and encoding teacher-specific information and as a result, we obtain a single encoder that excels in challenging tasks spanning 3D understanding, 3D human perception, and 2D vision. The resulting model exhibits strong generalization capabilities and performs on par with its teachers, each one state-of-the-art for a specialized task. Notably, our model outperforms all known methods on the Map-free Visual Relocalization dataset with a highly compact encoder.
Probing the Mid-level Vision Capabilities of Self-Supervised Learning
Xuweiyi Chen · Markus Marks · Zezhou Cheng
Mid-level vision capabilities — such as generic object localization and 3D geometric understanding — are not only fundamental to human vision but are also crucial for many real-world applications of computer vision. These abilities emerge with minimal supervision during the early stages of human visual development. Despite their significance, current self-supervised learning (SSL) approaches are primarily designed and evaluated for high-level recognition tasks, leaving their mid-level vision capabilities largely unexamined. In this study, we introduce a suite of benchmark protocols to systematically assess mid-level vision capabilities and present a comprehensive, controlled evaluation of 22 prominent SSL models across 8 mid-level vision tasks. Our experiments reveal a weak correlation between mid-level and high-level task performance. We also identify several SSL methods with highly imbalanced performance across mid-level and high-level capabilities, as well as some that excel in both. Additionally, we investigate key factors contributing to mid-level vision performance, such as pretraining objectives and network architectures. Our study provides a holistic and timely view of what SSL models have learned, complementing existing research that primarily focuses on high-level vision tasks. We hope our findings guide future SSL research to benchmark models not only on high-level vision tasks but on mid-level as well.
Parameter Efficient Mamba Tuning via Projector-targeted Diagonal-centric Linear Transformation
Seokil Ham · Hee-Seon Kim · Sangmin Woo · Changick Kim
Despite the growing interest in Mamba architecture as a potential replacement for Transformer architecture, parameter-efficient fine-tuning (PEFT) approaches for Mamba remain largely unexplored. In our study, we introduce two key insight-driven strategies for PEFT in the Mamba architecture: (1) While state-space models (SSMs) have been regarded as the cornerstone of the Mamba architecture, and are thus expected to play the primary role in transfer learning, our findings reveal that Projectors---not SSMs---are the predominant contributors to transfer learning; and (2) based on our observation that adapting pretrained Projectors to new tasks can be effectively approximated through a near-diagonal linear transformation, we propose a novel PEFT method specialized to the Mamba architecture: Projector-targeted Diagonal-centric Linear Transformation (ProDiaL). ProDiaL focuses on optimizing only diagonal-centric linear transformation matrices, without directly fine-tuning the pretrained Projector weights. This targeted approach allows efficient task adaptation, utilizing less than 1% of the total parameters, and exhibits strong performance across both vision and language Mamba models, highlighting its versatility and effectiveness.
COAP: Memory-Efficient Training with Correlation-Aware Gradient Projection
Jinqi Xiao · Shen Sang · Tiancheng Zhi · Jing Liu · Qing Yan · Linjie Luo · Bo Yuan
Training large-scale neural networks in vision and multimodal domains demands substantial memory resources, primarily due to the storage of optimizer states. While LoRA, a popular parameter-efficient method, reduces memory usage, it often suffers from suboptimal performance due to the constraints of low-rank updates. Low-rank gradient projection methods (e.g., GaLore, Flora) reduce optimizer memory by projecting gradients and moment estimates into low-rank spaces via singular value decomposition or random projection. However, they fail to account for inter-projection correlation, causing performance degradation, and their projection strategies often incur high computational costs. In this paper, we present COAP (Correlation-Aware Gradient Projection), a memory-efficient method that minimizes computational overhead while maintaining training performance. Evaluated across various vision, language, and multimodal tasks, COAP outperforms existing methods in both training speed and model performance. For LLaMA-1B, it reduces optimizer memory by 61\% with only 2\% additional time cost, achieving the same PPL as AdamW. With 8-bit quantization, COAP cuts optimizer memory by 81\% and achieves 4x speedup over GaLore for LLaVA-v1.5-7B fine-tuning, while delivering higher accuracy.
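For orientation, the snippet below shows the generic low-rank gradient projection that GaLore-style methods (referenced above) build on: the optimizer state lives in a projected space and updates are mapped back to the full parameter shape. COAP's correlation-aware projection and its cheaper projector updates are not reproduced here.

```python
import torch

def project_gradient(grad: torch.Tensor, rank: int):
    """Generic low-rank gradient projection (not COAP's rule): keep only the
    top-`rank` left singular directions of a 2-D gradient so the optimizer's
    moment estimates can be stored in a much smaller space."""
    U, _, _ = torch.linalg.svd(grad, full_matrices=False)
    P = U[:, :rank]                 # (m, rank) projector
    low_rank_grad = P.T @ grad      # (rank, n) compressed gradient fed to the optimizer
    return P, low_rank_grad

def project_back(P: torch.Tensor, low_rank_update: torch.Tensor) -> torch.Tensor:
    """Map the optimizer's low-rank update back to the full (m, n) shape."""
    return P @ low_rank_update
```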
Sample- and Parameter-Efficient Auto-Regressive Image Models
Elad Amrani · Leonid Karlinsky · Alex M. Bronstein
We introduce $\textbf{XTRA}$, a vision model pre-trained with a novel auto-regressive objective that significantly enhances both sample and parameter efficiency compared to previous auto-regressive image models. Unlike contrastive or masked image modeling methods, which have not been demonstrated as having consistent scaling behavior on unbalanced internet data, auto-regressive vision models exhibit scalable and promising performance as model and dataset size increase. In contrast to standard auto-regressive models, XTRA employs a Block Causal Mask, where each Block represents $k \times k$ tokens rather than relying on a standard causal mask. By reconstructing pixel values block by block, XTRA captures higher-level structural patterns over larger image regions. Predicting on blocks allows the model to learn relationships across broader areas of pixels, enabling more abstract and semantically meaningful representations than traditional next-token prediction. This simple modification yields two key results. First, $\textbf{XTRA is sample-efficient}$. Despite being trained on 152$\times$ fewer samples (13.1M vs. 2B), XTRA ViT-H/14 surpasses the top-1 average accuracy of the previous state-of-the-art auto-regressive model across 15 diverse image recognition benchmarks. Second, $\textbf{XTRA is parameter-efficient}$. Compared to auto-regressive models trained on ImageNet-1k, XTRA ViT-B/16 outperforms in linear and attentive probing tasks, using 7–16$\times$ fewer parameters (85M vs. 1.36B/0.63B).
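The Block Causal Mask can be stated in a few lines: tokens within the same block of $k \times k$ patches attend to each other, and every block attends to all preceding blocks. The helper below is an illustrative construction of such a boolean mask, not the paper's code.

```python
import torch

def block_causal_mask(num_tokens: int, block_tokens: int) -> torch.Tensor:
    """Boolean (num_tokens, num_tokens) mask, True where attention is allowed.
    `block_tokens` is the number of tokens per block (k * k for k x k blocks):
    a token may attend to every token in its own block and in earlier blocks."""
    block_id = torch.arange(num_tokens) // block_tokens
    return block_id[:, None] >= block_id[None, :]
```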
Subnet-Aware Dynamic Supernet Training for Neural Architecture Search
Jeimin Jeon · Youngmin Oh · Junghyup Lee · Donghyeon Baek · Dohyung Kim · Chanho Eom · Bumsub Ham
N-shot neural architecture search (NAS) exploits a supernet containing all candidate subnets for a given search space. The subnets are typically trained with a static training strategy (e.g., using the same learning rate (LR) scheduler and optimizer for all subnets). This, however, does not consider that individual subnets have distinct characteristics, leading to two problems: (1) The supernet training is biased towards the low-complexity subnets (unfairness); (2) the momentum update in the supernet is noisy (noisy momentum). We present a dynamic supernet training technique to address these problems by adjusting the training strategy adaptive to the subnets. Specifically, we introduce a complexity-aware LR scheduler (CaLR) that controls the decay ratio of LR adaptive to the complexities of subnets, which alleviates the unfairness problem. We also present a momentum separation technique (MS). It groups the subnets with similar structural characteristics and uses a separate momentum for each group, avoiding the noisy momentum problem. Our approach is applicable to various N-shot NAS methods at marginal cost, while improving the search performance drastically. We validate the effectiveness of our approach on various search spaces (e.g., NAS-Bench-201, Mobilenet spaces) and datasets (e.g., CIFAR-10/100, ImageNet). Our code will be available online.
DeepCompress-ViT: Rethinking Model Compression to Enhance Efficiency of Vision Transformers at the Edge
Sabbir Ahmed · Abdullah Al Arafat · Deniz Najafi · Akhlak Mahmood · Mamshad Nayeem Rizve · Mohaiminul Al Nahian · RANYANG ZHOU · Shaahin Angizi · Adnan Rakin Rakin
Vision Transformers (ViTs) excel in tackling complex vision tasks, yet their substantial size poses significant challenges for applications on resource-constrained edge devices. The increased size of these models leads to higher overhead (e.g., energy, latency) when transmitting model weights between the edge device and the server. Hence, ViTs are not ideal for edge devices where the entire model may not fit on the device. Current model compression techniques often achieve high compression ratios at the expense of performance degradation, particularly for ViTs. To overcome the limitations of existing works, we rethink the model compression strategy for ViTs from first principles and develop an orthogonal strategy called DeepCompress-ViT. The objective of DeepCompress-ViT is to encode the model weights into a highly compressed representation using a novel training method, denoted as Unified Compression Training (UCT). The proposed UCT is accompanied by a decoding mechanism during inference, which helps to recover any loss of accuracy due to the high compression ratio. We further optimize this decoding step by re-ordering the decoding operation using the associative property of matrix multiplication, ensuring that the compressed weights can be decoded during inference without incurring any computational overhead. Our extensive experiments across multiple ViT models on modern edge devices show that DeepCompress-ViT can successfully compress ViTs at high compression ratios ($>14\times$). DeepCompress-ViT enables the entire model to be stored on the edge device, resulting in unprecedented reductions in energy consumption ($>1500\times$) and latency ($>200\times$) for edge ViT inference.
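The associativity argument is worth spelling out: if a layer's weight is stored in a factorized encoded form, applying the factors to the activations directly avoids ever materializing the decoded full-size weight. The toy example below uses a simple two-factor form purely for illustration; DeepCompress-ViT's actual encoding (UCT) is different and not reproduced here.

```python
import torch

# Toy demonstration of re-ordering a decode step via associativity.
# Assumption for illustration only: the "encoded" weight is a low-rank pair (U, V).
d_out, d_in, rank, batch = 768, 768, 32, 4
U = torch.randn(d_out, rank)
V = torch.randn(rank, d_in)
x = torch.randn(d_in, batch)

y_naive = (U @ V) @ x   # decodes the full (d_out, d_in) weight before applying it
y_fast = U @ (V @ x)    # same result by associativity, far fewer multiply-adds
assert torch.allclose(y_naive, y_fast, rtol=1e-4, atol=1e-4)
```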
Effective SAM Combination for Open-Vocabulary Semantic Segmentation
Minhyeok Lee · Suhwan Cho · Jungho Lee · Sunghun Yang · Heeseung Choi · Ig-Jae Kim · Sangyoun Lee
Open-vocabulary semantic segmentation aims to assign pixel-level labels to images across an unlimited range of classes. Traditional methods address this by sequentially connecting a powerful mask proposal generator, such as the Segment Anything Model (SAM), with a pre-trained vision-language model like CLIP. However, these two-stage approaches often suffer from high computational costs and memory inefficiencies. In this paper, we propose ESC-Net, a novel one-stage open-vocabulary segmentation model that leverages the SAM decoder blocks for class-agnostic segmentation within an efficient inference framework. By embedding pseudo prompts generated from image-text correlations into SAM’s promptable segmentation framework, ESC-Net achieves refined spatial aggregation for accurate mask predictions. Additionally, a Vision-Language Fusion (VLF) module enhances the final mask prediction through image and text guidance. ESC-Net achieves superior performance on standard benchmarks, including ADE20K, PASCAL-VOC, and PASCAL-Context, outperforming prior methods in both efficiency and accuracy. Comprehensive ablation studies further demonstrate its robustness across challenging conditions.
Adventurer: Optimizing Vision Mamba Architecture Designs for Efficiency
Feng Wang · Timing Yang · Yaodong Yu · Sucheng Ren · Guoyizhe Wei · Angtian Wang · Wei Shao · Yuyin Zhou · Alan L. Yuille · Cihang Xie
In this work, we introduce the Adventurer series models where we treat images as sequences of patch tokens and employ uni-directional language models to learn visual representations. This modeling paradigm allows us to process images in a recurrent formulation with linear complexity relative to the sequence length, which can effectively address the memory and computation explosion issues posed by high-resolution and fine-grained images. In detail, we introduce two simple designs that seamlessly integrate image inputs into the causal inference framework: a global pooling token placed at the beginning of the sequence and a flipping operation between every two layers. Extensive empirical studies highlight that compared with the existing plain architectures such as DeiT and Vim, Adventurer offers an optimal efficiency-accuracy trade-off. For example, our Adventurer-Base attains a competitive test accuracy of 84.3% on the standard ImageNet-1k benchmark with 216 images/s training throughput, which is 3.8x and 6.2x faster than Vim and DeiT to achieve the same result. As Adventurer offers great computation and memory efficiency and allows scaling with linear complexity, we hope this architecture can benefit future explorations in modeling long sequences for high-resolution or fine-grained images.
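The two designs named above are simple enough to sketch: a global pooling token prepended to the patch sequence, and a flip of the token order between every two layers so a uni-directional model covers both scan directions. Whether the pooling token participates in the flip is an assumption here, and the layers themselves are placeholders.

```python
import torch
import torch.nn as nn

class AdventurerStyleEncoder(nn.Module):
    """Illustrative wrapper (not the released architecture): a global pooling
    token at the start of the sequence, and a flip of the patch tokens
    between every two layers."""

    def __init__(self, layers: nn.ModuleList):
        super().__init__()
        self.layers = layers  # any stack mapping (batch, seq_len, dim) -> (batch, seq_len, dim)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        pool = patch_tokens.mean(dim=1, keepdim=True)        # global pooling token
        x = torch.cat([pool, patch_tokens], dim=1)           # (batch, 1 + seq_len, dim)
        for i, layer in enumerate(self.layers):
            x = layer(x)
            if i % 2 == 1:                                   # flip between every two layers
                # keep the pooling token in place, reverse only the patch tokens (assumed choice)
                x = torch.cat([x[:, :1], x[:, 1:].flip(dims=[1])], dim=1)
        return x[:, 0]                                       # pooled representation
```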
Dynamic Group Normalization: Spatio-Temporal Adaptation to Evolving Data Statistics
Yair Smadar · Assaf Hoogi
Deep neural networks remain vulnerable to statistical variations in data despite advances in normalization techniques. Current approaches rely on fixed static normalization sets, fundamentally limiting their ability to adapt to dynamic data distributions. We introduce Dynamic Group Normalization (DGN), which treats channel grouping as a learnable component and leverages statistical awareness to form coherent groups adaptively. By employing an efficient spatio-temporal mechanism that continuously evaluates inter-channel relationships both within layers and across training epochs, DGN enables robust adaptation to evolving data distributions. Extensive evaluations across 24 architectures and 8 computer vision benchmarks demonstrate DGN's consistent superiority. Beyond achieving significant accuracy gains in classification, detection, and segmentation tasks while maintaining computational efficiency, DGN particularly excels in challenging scenarios where traditional methods struggle—notably in Out-Of-Distribution generalization and imbalanced data distributions.
Frequency Dynamic Convolution for Dense Image Prediction
Linwei Chen · Lin Gu · Liang Li · Chenggang Yan · Ying Fu
While Dynamic Convolution (DY-Conv) has shown promising performance by enabling adaptive weight selection through multiple parallel weights combined with an attention mechanism, the frequency response of these weights tends to exhibit high similarity, resulting in high parameter costs but limited adaptability. In this work, we introduce Frequency Dynamic Convolution (FDConv), a novel approach that mitigates these limitations by learning a fixed parameter budget in the Fourier domain. FDConv divides this budget into frequency-based groups with disjoint Fourier indices, enabling the construction of frequency-diverse weights without increasing the parameter cost. To further enhance adaptability, we propose Kernel Spatial Modulation (KSM) and Frequency Band Modulation (FBM). KSM dynamically adjusts the frequency response of each filter at the spatial level, while FBM decomposes weights into distinct frequency bands in the frequency domain and modulates them dynamically based on local content. Extensive experiments on object detection, segmentation, and classification validate the effectiveness of FDConv. We demonstrate that when applied to ResNet-50, FDConv achieves superior performance with a modest increase of +3.6M parameters, outperforming previous methods that require substantial increases in parameter budgets (e.g., CondConv +90M, KW +76.5M). Moreover, FDConv seamlessly integrates into a variety of architectures, including ConvNeXt and Swin Transformer, offering a flexible and efficient solution for modern vision tasks. The code will be made publicly available upon acceptance.
Faster Parameter-Efficient Tuning with Token Redundancy Reduction
Kwonyoung Kim · Jungin Park · Jin Kim · Hyeongjun Kwon · Kwanghoon Sohn
Parameter-efficient tuning (PET) aims to transfer pre-trained foundation models to downstream tasks by learning a small number of parameters. In practice, PET requires much smaller storage and transmission cost for each task than traditional fine-tuning methods, which require updating whole parameters, regardless of exponentially increasing pre-trained model capacity. However, most existing PET methods inherit the latency associated with their large backbones and often require additional computation during inference due to extra modules (e.g., adapters), making them less practical for computation-intensive applications. In this paper, we propose a Faster Parameter-Efficient Tuning (FPET) method to achieve high inference speed and computation efficiency while keeping storage efficiency high. Specifically, we introduce a plug-and-play token redundancy reduction module delicately engineered for PET. The proposed module refines tokens from the self-attention layer using an adapter to learn the accurate similarity between tokens and cuts off the token count through a token merging strategy. We formulate token merging to be fully differentiable using a straight-through estimator, making token redundancy reduction optimal. Experimental results prove that our FPET achieves faster inference and higher memory efficiency than the pre-trained backbone while keeping competitive performance on par with state-of-the-art PET methods.
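The differentiability trick named above, a straight-through estimator for a hard token-selection decision, follows a standard pattern; a generic version for a top-k keep mask is shown below. How FPET scores token similarity with its adapter and how merged tokens are combined are not reproduced.

```python
import torch

def straight_through_topk_mask(scores: torch.Tensor, k: int) -> torch.Tensor:
    """Generic straight-through estimator (not FPET's exact formulation):
    a hard top-k keep/drop mask over the last dimension whose forward value
    is discrete while gradients flow through the soft (sigmoid) scores."""
    soft = torch.sigmoid(scores)
    hard = torch.zeros_like(soft)
    hard.scatter_(-1, soft.topk(k, dim=-1).indices, 1.0)   # discrete keep decision
    return hard + (soft - soft.detach())                   # equals `hard` in value, differentiable via `soft`
```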
Discovering Fine-Grained Visual-Concept Relations by Disentangled Optimal Transport Concept Bottleneck Models
Yan Xie · Zequn Zeng · Hao Zhang · Yucheng Ding · Yi Wang · Zhengjue Wang · Bo Chen · Hongwei Liu
Concept Bottleneck Models (CBMs) try to make the decision-making process transparent by exploring an intermediate concept space between the input image and the output prediction. Existing CBMs learn only coarse-grained relations between the whole image and the concepts, largely ignoring local image information, which leads to two main drawbacks: i) they often produce spurious visual-concept relations, hence decreasing model reliability; and ii) though CBMs could explain the importance of every concept to the final prediction, it is still challenging to tell which visual region produces the prediction. To solve these problems, this paper proposes a Disentangled Optimal Transport CBM (DOT-CBM) framework to explore fine-grained visual-concept relations between local image patches and concepts. Specifically, we model the concept prediction process as a transportation problem between the patches and concepts, thereby achieving explicit fine-grained feature alignment. We also incorporate orthogonal projection losses within the modality to enhance local feature disentanglement. To further address the shortcut issues caused by statistical biases in the data, we utilize the visual saliency map and concept label statistics as transportation priors. Thus, DOT-CBM can visualize inversion heatmaps, provide more reliable concept predictions, and produce more accurate class predictions. Comprehensive experiments demonstrate that our proposed DOT-CBM achieves SOTA performance on several tasks, including image classification, local part detection, and out-of-distribution generalization. Code is available in the supplementary material.
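Since the abstract frames concept prediction as a transportation problem between patches and concepts, a minimal entropic optimal-transport (Sinkhorn) sketch in Python may help fix ideas; it is a generic illustration under uniform marginals, not the DOT-CBM code, and the names are invented.

import torch

def sinkhorn_transport(patch_feats, concept_feats, eps=0.05, n_iters=50):
    # patch_feats: [P, D] local patch embeddings; concept_feats: [C, D] concept embeddings.
    # Returns a transport plan T of shape [P, C]; T[p, c] is read as the strength
    # of the fine-grained patch-to-concept relation.
    cost = 1.0 - torch.nn.functional.normalize(patch_feats, dim=-1) @ \
                 torch.nn.functional.normalize(concept_feats, dim=-1).T   # cosine cost
    K = torch.exp(-cost / eps)                        # Gibbs kernel
    P, C = K.shape
    r = torch.full((P,), 1.0 / P)                     # uniform patch marginal
    c = torch.full((C,), 1.0 / C)                     # uniform concept marginal
    u = torch.ones(P)
    for _ in range(n_iters):                          # Sinkhorn scaling iterations
        v = c / (K.T @ u)
        u = r / (K @ v)
    return torch.diag(u) @ K @ torch.diag(v)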
TIDE: Training Locally Interpretable Domain Generalization Models Enables Test-time Correction
Aishwarya Agarwal · Srikrishna Karanam · Vineet Gandhi
We consider the problem of single-source domain generalization. Existing methods typically rely on extensive augmentations to synthetically cover diverse domains during training. However, they struggle with semantic shifts (e.g., background and viewpoint changes), as they often learn global features instead of local concepts that tend to be domain invariant. To address this gap, we propose an approach that compels models to leverage such local concepts during prediction. Since no suitable dataset with per-class concepts and localization maps exists, we first develop a novel pipeline to generate annotations by exploiting the rich features of diffusion and large-language models. Our next innovation is TIDE, a novel training scheme with a concept saliency alignment loss that ensures the model focuses on the right per-concept regions and a local concept contrastive loss that promotes learning domain-invariant concept representations. This not only yields a robust model but also one whose predictions can be visually interpreted using the predicted concept saliency maps. Given these maps at test time, our final contribution is a new correction algorithm that uses the corresponding local concept representations to iteratively refine the prediction until it aligns with prototypical concept representations that we store at the end of model training. We evaluate our approach extensively on four standard DG benchmark datasets and substantially outperform the current state-of-the-art (12% improvement on average) while also demonstrating that our predictions can be visually interpreted.
Open-vocabulary semantic segmentation models aim to accurately assign a semantic label to each pixel in an image from a set of arbitrary open-vocabulary texts. In order to learn such pixel-level alignment, current approaches typically rely on a combination of (i) image-level VL model (e.g. CLIP), (ii) ground truth masks, (iii) custom grouping encoders, and (iv) the Segment Anything Model (SAM). In this paper, we introduce S-Seg, a simple model that can achieve surprisingly strong performance without depending on any of the above elements. S-Seg leverages pseudo-masks and language to train a MaskFormer, and can be easily trained from publicly available image-text datasets. Contrary to prior works, our model directly trains for pixel-level features and language alignment. Once trained, S-Seg generalizes well to multiple testing datasets without requiring fine-tuning. In addition, S-Seg has the extra benefits of scaling with data and consistently improving when augmented with self-training. We believe that our simple yet effective approach will serve as a solid baseline for future research. Our code and demo will be made publicly available soon.
POPEN: Preference-Based Optimization and Ensemble for LVLM-Based Reasoning Segmentation
Lanyun Zhu · Tianrun Chen · Qianxiong Xu · Xuanyi Liu · Deyi Ji · Haiyang Wu · De Soh Soh · Jun Liu
Existing LVLM-based reasoning segmentation methods often suffer from imprecise segmentation results and hallucinations in their text responses. This paper introduces POPEN, a novel framework designed to address these issues and achieve improved results. POPEN includes a preference-based optimization method to finetune the LVLM, aligning it more closely with human preferences and thereby generating better text responses and segmentation results. Additionally, POPEN introduces a preference-based ensemble method for inference, which integrates multiple outputs from the LVLM using a preference-score-based attention mechanism for refinement. To better adapt to the segmentation task, we incorporate several task-specific designs in our POPEN framework, including a new approach for collecting segmentation preference data with a curriculum learning mechanism, and a novel preference optimization loss to refine the segmentation capability of the LVLM. Experiments demonstrate that our method achieves state-of-the-art performance in reasoning segmentation, exhibiting minimal hallucination in text responses and the highest segmentation accuracy compared to previous advanced methods like LISA and PixelLM.
Multi-Label Prototype Visual Spatial Search for Weakly Supervised Semantic Segmentation
Songsong Duan · Xi Yang · Nannan Wang
Existing Weakly Supervised Semantic Segmentation (WSSS) relies on CNN-based Class Activation Maps (CAM) and Transformer-based self-attention maps to generate class-specific masks for semantic segmentation. However, CAM and self-attention maps usually cause incomplete segmentation due to the classification bias issue. To address this issue, we propose a Multi-Label Prototype Visual Spatial Search (MuP-VSS) method with a spatial query mechanism, which learns a set of class token vectors as queries to search for similar visual tokens among the image patch tokens. Specifically, MuP-VSS consists of two key components: \textbf{multi-label prototype representation} and \textbf{multi-label prototype optimization}. The former designs a global embedding to learn the global tokens from the images, and then proposes a Prototype Embedding Module (PEM) to interact with patch tokens to understand the local semantic information. The latter utilizes the exclusivity and consistency principles of the multi-label prototypes to design three prototype losses to optimize them, which contain cross-class prototype (CCP) contrastive loss, cross-image prototype (CIP) contrastive loss, and patch-to-prototype (P2P) consistency loss. CCP loss models the exclusivity of multi-label prototypes learned from a single image to better enhance the discriminative properties of each class. CIP loss learns the consistency of the same class-specific prototypes extracted from multiple images to enhance semantic consistency. P2P loss is proposed to control the semantic response of the prototypes to the image patches. Experimental results on Pascal VOC 2012 and MS COCO show that MuP-VSS significantly outperforms recent methods and achieves state-of-the-art performance.
HistoFS: Non-IID Histopathologic Whole Slide Image Classification via Federated Style Transfer with RoI-Preserving
Farchan Hakim Raswa · Chun-Shien Lu · Jia-Ching Wang
Federated learning for pathological whole slide image (WSI) classification allows multiple clients to train a global multiple instance learning (MIL) model without sharing their privacy-sensitive WSIs. To accommodate the non-independent and identically distributed (non-i.i.d.) feature shifts, cross-client style transfer has been popularly used but is subject to two fundamental issues: (1) WSIs contain multiple morphological structures due to tissue heterogeneity, and (2) the regions of interest (RoIs) are not guaranteed to be preserved, particularly after augmenting local WSI data through style transfer. To address these challenges, we propose HistoFS, a federated learning framework for computational pathology on non-i.i.d. feature shifts in WSI classification. Specifically, we introduce pseudo bag styles that capture multiple style variations within a single WSI. In addition, an authenticity module is introduced to ensure that RoIs are preserved, allowing local models to learn WSIs with diverse styles while maintaining essential RoIs. Extensive experiments validate the superiority of HistoFS over state-of-the-art methods on three clinical datasets.
FFR: Frequency Feature Rectification for Weakly Supervised Semantic Segmentation
Ziqian Yang · Xinqiao Zhao · Xiaolei Wang · Quan Zhang · Jimin Xiao
Image-level Weakly Supervised Semantic Segmentation (WSSS) has garnered significant attention due to its low annotation costs. Current single-stage state-of-the-art WSSS methods mainly rely on ViT to extract features from input images, generating more complete segmentation results based on comprehensive semantic information. However, these ViT-based methods often suffer from over-smoothing issues in segmentation results. In this paper, we identify that attenuated high-frequency features mislead the decoder of ViT-based WSSS models, resulting in over-smoothed false segmentation. To address this, we propose a Frequency Feature Rectification (FFR) framework. Quantitative and qualitative experimental results demonstrate that our FFR framework can effectively address the over-smoothed segmentation issue caused by attenuated high-frequency features and achieve new state-of-the-art WSSS performance. Codes will be released.
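To make the frequency-domain intuition concrete, here is a small, hedged PyTorch sketch that boosts the high-frequency band of a feature map with a 2-D FFT; the abstract does not specify the actual FFR module, so the mask shape, boost factor, and radius below are illustrative assumptions rather than the paper's design.

import torch

def rectify_high_frequency(feat: torch.Tensor, boost: float = 1.5, radius: float = 0.25) -> torch.Tensor:
    # feat: [B, C, H, W]. Amplify spectral components outside a low-frequency disc.
    B, C, H, W = feat.shape
    spec = torch.fft.fftshift(torch.fft.fft2(feat), dim=(-2, -1))
    yy, xx = torch.meshgrid(torch.linspace(-1, 1, H, device=feat.device),
                            torch.linspace(-1, 1, W, device=feat.device), indexing="ij")
    high = ((xx ** 2 + yy ** 2).sqrt() > radius).to(spec.dtype)    # 1 on high frequencies, 0 elsewhere
    spec = spec * (1 + (boost - 1) * high)                         # low frequencies left untouched
    return torch.fft.ifft2(torch.fft.ifftshift(spec, dim=(-2, -1))).real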
Prototype-Based Image Prompting for Weakly Supervised Histopathological Image Segmentation
Qingchen Tang · Lei Fan · Maurice Pagnucco · Yang Song
Weakly supervised image segmentation with image-level labels has drawn attention due to the high cost of pixel-level annotations. Traditional methods using Class Activation Maps (CAMs) often highlight only the most discriminative regions, leading to incomplete masks. Recent approaches that introduce textual information struggle with histopathological images due to inter-class homogeneity and intra-class heterogeneity. In this paper, we propose a prototype-based image prompting framework for histopathological image segmentation. It constructs an image bank from the training set using clustering, extracting multiple prototype features per class to capture intra-class heterogeneity. By designing a matching loss between input features and class-specific prototypes using contrastive learning, our method addresses inter-class homogeneity and guides the model to generate more accurate CAMs. Experiments on four datasets (LUAD-HistoSeg, BCSS-WSSS, GCSS, and BCSS) show that our method outperforms existing weakly supervised segmentation approaches, setting new benchmarks in histopathological image segmentation.
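A rough sketch of the two steps described above, clustering per-class training features into prototype banks and applying a contrastive matching loss against class-specific prototypes, is given below in Python; the class keys, number of prototypes, and temperature are assumptions, and the real framework's loss may differ.

import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def build_prototype_bank(features_per_class, k=4):
    # features_per_class: {class_id: [N_cls, D] numpy array} -> {class_id: [k, D] tensor}
    bank = {}
    for cls, feats in features_per_class.items():
        km = KMeans(n_clusters=k, n_init=10).fit(feats)
        bank[cls] = torch.tensor(km.cluster_centers_, dtype=torch.float32)
    return bank

def prototype_matching_loss(feat, label, bank, temperature=0.1):
    # feat: [D] image feature; pull it toward its nearest same-class prototype,
    # push it away from prototypes of other classes (InfoNCE-style).
    feat = F.normalize(feat, dim=-1)
    protos = torch.cat([F.normalize(bank[c], dim=-1) for c in sorted(bank)], dim=0)
    owner = torch.cat([torch.full((bank[c].shape[0],), c) for c in sorted(bank)])
    sims = protos @ feat / temperature
    pos = sims[owner == label].max()                  # nearest same-class prototype
    return -pos + torch.logsumexp(sims, dim=0)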
Pay Attention to the Foreground in Object-Centric Learning
Pinzhuo Tian · Shengjie Yang · Hang Yu · Alex C. Kot
The slot attention-based method is widely used in unsupervised object-centric learning, which aims to decompose scenes into interpretable objects and associate them with slots. However, complex backgrounds in real images can disrupt the model’s focus, leading it to excessively segment background stuff into different regions based on low-level information such as color or texture variations. Consequently, the elaborate segmentation of foreground objects, which requires detailed shape or geometric information, will be neglected. To address this issue, we introduce a contrastive learning-based indicator designed to differentiate between foreground and background. Integrating this indicator into the slot attention-based method allows the model to focus more effectively on segmenting foreground objects and minimize background distractions. During the testing phase, we utilize a spectral clustering mechanism to refine the results and mitigate oversegmentation according to the similarity between the slots. Experimental results show that incorporating our method with various state-of-the-art models significantly improves their performance on both simulated data and real-world datasets. Furthermore, multiple sets of ablation experiments confirm the effectiveness of each proposed component. Our code will be made available.
Attribute-formed Class-specific Concept Space: Endowing Language Bottleneck Model with Better Interpretability and Scalability
Jianyang Zhang · Qianli Luo · Guowu Yang · Wenjing Yang · Weide Liu · Guosheng Lin · Fengmao Lv
Language Bottleneck Models (LBMs) are proposed to achieve interpretable image recognition by classifying images based on textual concept bottlenecks. However, current LBMs simply list all concepts together as the bottleneck layer, leading to the spurious cue inference problem and preventing generalization to unseen classes. To address these limitations, we propose the Attribute-formed Language Bottleneck Model (ALBM). ALBM organizes concepts in the attribute-formed class-specific space, where concepts are descriptions of specific attributes for specific classes. In this way, ALBM can avoid the spurious cue inference problem by classifying solely based on the essential concepts of each class. In addition, the cross-class unified attribute set also ensures that the concept spaces of different classes have strong correlations; as a result, the learned concept classifier can be easily generalized to unseen classes. Moreover, to further improve interpretability, we propose Visual Attribute Prompt Learning (VAPL) to extract visual features on fine-grained attributes. Furthermore, to avoid labor-intensive concept annotation, we propose the Description, Summary, and Supplement (DSS) strategy to automatically generate high-quality concept sets with complete and precise attributes. Extensive experiments on 8 widely used few-shot benchmarks demonstrate the interpretability, transferability, and performance of our approach. The code and collected concept set will be released upon acceptance.
LOGICZSL: Exploring Logic-induced Representation for Compositional Zero-shot Learning
Peng Wu · Xiankai Lu · Hao Hu · Yongqin Xian · Jianbing Shen · Wenguan Wang
Compositional zero-shot learning (CZSL) aims to recognize unseen attribute-object compositions by learning the primitive concepts (i.e., attribute and object) from the training set. While recent works achieve impressive results in CZSL by leveraging large vision-language models like CLIP, they ignore the rich semantic relationships between primitive concepts and their compositions. In this work, we propose LOGICZSL, a novel logic-induced learning framework to explicitly model the semantic relationships. Our logic-induced learning framework formulates the relational knowledge constructed from large language models as a set of logic rules, and grounds them onto the training data. Our logic-induced losses are complementary to the widely used CZSL losses, therefore can be employed to inject the semantic information into any existing CZSL methods. Extensive experimental results show that our method brings significant performance improvements across diverse datasets (i.e., CGQA, UT-Zappos50K, MIT-States) with strong CLIP-based methods and settings (i.e., Close World, Open World). Codes will be publicly released.
CLIP-driven Coarse-to-fine Semantic Guidance for Fine-grained Open-set Semi-supervised Learning
Xiaokun Li · Yaping Huang · Qingji Guan
Fine-grained open-set semi-supervised learning (OSSL) investigates a practical scenario where unlabeled data may contain fine-grained out-of-distribution (OOD) samples. Due to the subtle visual differences among in-distribution (ID) samples, as well as between ID and OOD samples, it is extremely challenging to separate ID and OOD samples. Recent Vision-Language Models, such as CLIP, have shown excellent generalization capabilities. However, CLIP tends to focus on general attributes and is thus insufficient to distinguish fine-grained details. To tackle these issues, in this paper, we propose a novel CLIP-driven coarse-to-fine semantic-guided framework, named CFSG-CLIP, which progressively filters and focuses on distinctive fine-grained cues. Specifically, CFSG-CLIP comprises a coarse-guidance module and a fine-guidance module derived from the pre-trained CLIP model. In the coarse-guidance module, we design a semantic filtering strategy to initially filter local visual features under cross-modality guidance. Then, in the fine-guidance module, we further design a visual-semantic injection strategy, which embeds category-related visual cues into the visual encoder to further refine the local visual features. Through this dual-guidance framework, subtle local cues are progressively discovered to distinguish the subtle differences between ID and OOD samples. Extensive experiments demonstrate that CFSG-CLIP not only improves the reliability of the fine-grained semi-supervised training process but also achieves competitive performance on multiple fine-grained datasets.
Less Attention is More: Prompt Transformer for Generalized Category Discovery
Wei Zhang · Baopeng Zhang · Zhu Teng · Wenxin Luo · Junnan Zou · Jianping Fan
Generalized Category Discovery (GCD) typically relies on the pre-trained Vision Transformer (ViT) to extract features from a global receptive field, followed by contrastive learning to simultaneously classify unlabeled known classes and unknown classes without priors. Owing to ViT's limited modeling capacity for inner-patch local information, current methods primarily focus on discriminative features at the global level. This results in a model with more yet scattered attention, where neither excessive nor insufficient focus can grasp the subtle differences needed to classify fine-grained unknown and known categories. To address this issue, we propose AptGCD to deliver apt attention for GCD. It mimics how the human brain leverages visual perception to refine local attention and comprehend global context, through a Meta Visual Prompt (MVP) and a Prompt Transformer (PT). MVP is introduced into GCD for the first time, refining channel-level attention while adaptively self-learning unique inner-patch features as prompts to achieve local visual modeling for our prompt transformer. Yet, relying solely on detailed features can lead to skewed judgments. Hence, PT harmonizes local and global representations, guiding the model's interpretation of features through broader contexts, thereby capturing more useful details with less attention. Extensive experiments on seven datasets demonstrate that AptGCD outperforms current methods, achieving an average proportional 'New' accuracy improvement of approximately 9.2% over the SOTA method on all four fine-grained datasets and establishing a new standard in the field.
Open-World Objectness Modeling Unifies Novel Object Detection
Shan Zhang · Yao Ni · Jinhao Du · Yuan Xue · Philip H.S. Torr · Piotr Koniusz · Anton van den Hengel
The challenge in open-world object detection, as in many few- and zero-shot learning problems, is to generalize beyond the class distribution of the training data. We thus propose a general class-agnostic objectness measure to reduce bias toward labeled samples. To prevent previously unseen objects from being filtered as background or misclassified as known categories by classifiers, we explicitly model the joint distribution of objectness and category labels using variational approximation. Without sufficient labeled data, however, minimizing the KL divergence between the estimated posterior and a static normal prior fails to converge. Theoretical analysis illuminates the root cause and motivates adopting a Gaussian prior with variance dynamically adapted to the estimated posterior as a surrogate. To further reduce misclassification, we introduce an energy-based margin loss that encourages unknown objects to move toward high-density regions of the distribution, thus reducing the uncertainty of unknown detections. Together, these components form our energy-based Open-World OBJectness modeling (OWOBJ), which boosts novel object detection, especially in low-data settings. As a flexible plugin, OWOBJ outperforms baselines in open-world, few-shot, and zero-shot open-vocabulary object detection. Code will be released upon acceptance.
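For readers who want the shapes of the two losses, here is a hedged PyTorch sketch: the closed-form KL between the estimated objectness posterior and a zero-mean Gaussian prior whose variance is supplied from detached posterior statistics rather than fixed at one, plus a simple hinge on the free energy of unknown proposals. The head, threshold, and weighting are hypothetical, not the paper's exact formulation.

import torch

def adaptive_prior_kl(mu: torch.Tensor, logvar: torch.Tensor, prior_var: torch.Tensor) -> torch.Tensor:
    # KL( N(mu, var) || N(0, prior_var) ), closed form for diagonal Gaussians.
    var = logvar.exp()
    kl = 0.5 * (torch.log(prior_var / var) + (var + mu ** 2) / prior_var - 1.0)
    return kl.mean()

def energy_margin_loss(unknown_logits: torch.Tensor, margin: float = -5.0) -> torch.Tensor:
    # Free energy E(x) = -logsumexp(logits); penalize unknown proposals whose energy
    # exceeds a threshold, nudging them toward low-energy (high-density) regions.
    energy = -torch.logsumexp(unknown_logits, dim=-1)
    return torch.relu(energy - margin).mean()

# Usage sketch with a hypothetical objectness head:
# mu, logvar = objectness_head(features)
# prior_var = logvar.exp().mean().detach()            # prior variance adapted to the posterior
# loss = adaptive_prior_kl(mu, logvar, prior_var) + energy_margin_loss(unknown_logits)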
Activating Sparse Part Concepts for 3D Class Incremental Learning
Zhenya Tian · Jun Xiao · Liu lupeng · Haiyong Jiang
This work tackles the challenge of 3D Class-Incremental Learning (CIL), where a model must learn to classify new 3D objects while retaining knowledge of previously learned classes. Existing methods often struggle with catastrophic forgetting, misclassifying old objects due to overreliance on shortcut local features. Our approach addresses this issue by learning a set of part concepts for part-aware features. In particular, we activate only a small subset of part concepts for the representation of each part-aware feature. This facilitates better generalization across categories and mitigates catastrophic forgetting. We further improve task-wise classification through a part relation-aware Transformer design. Finally, we devise learnable affinities to fuse task-wise classification heads and avoid confusion among different tasks. We evaluate our method on three 3D CIL benchmarks, achieving state-of-the-art performance. (Code and data will be released)
Learning Endogenous Attention for Incremental Object Detection
Xiang Song · Yuhang He · Jingyuan Li · Qiang Wang · Yihong Gong
In this paper, we focus on the challenging Incremental Object Detection (IOD) problem. Existing IOD methods follow an image-to-annotation alignment paradigm, which attempts to complete the annotations for old categories and subsequently learns both new and old categories in new tasks. This paradigm inherently introduces missing/redundant/inaccurate annotations of old categories, resulting in suboptimal performance. Instead, we propose a novel annotation-to-instance alignment IOD paradigm and develop a corresponding method named Learning Endogenous Attention (LEA). Inspired by the human brain, LEA enables the model to focus on annotated task-specific objects while ignoring irrelevant ones, thus solving the incomplete-annotation problem in IOD. Concretely, our LEA consists of Endogenous Attention Modules (EAMs) and an Energy-based Task Modulator (ETM). During training, we add dedicated EAMs for each new task and train them to focus on the new categories. During testing, the ETM predicts task IDs using energy functions, directing the model to detect task-specific objects. The detection results corresponding to all task IDs are combined as the final output, thereby alleviating the catastrophic forgetting of old knowledge. Extensive experiments on COCO 2017 and Pascal VOC 2007 demonstrate the effectiveness of our method.
UCOD-DPL: Unsupervised Camouflaged Object Detection via Dynamic Pseudo-label Learning
Weiqi Yan · Lvhai Chen · Huaijia Kou · Shengchuan Zhang · Yan Zhang · Liujuan Cao
Unsupervised Camouflaged Object Detection (UCOD) has gained attention since it does not rely on extensive pixel-level labels. Existing UCOD methods typically generate pseudo-labels using fixed strategies and train $1 \times 1$ convolutional layers as a simple decoder, leading to low performance compared to fully-supervised methods. We emphasize two drawbacks in these approaches: 1) the model is prone to fitting incorrect knowledge because the pseudo-labels contain substantial noise; 2) the simple decoder fails to capture and learn the semantic features of camouflaged objects, especially small-sized ones, due to the low-resolution pseudo-labels and severe confusion between foreground and background pixels. To this end, we propose a UCOD method with a teacher-student framework via Dynamic Pseudo-label Learning, called UCOD-DPL, which contains an Adaptive Pseudo-label Module (APM), a Dual-Branch Adversarial (DBA) decoder, and a Look-Twice mechanism. The APM adaptively combines pseudo-labels generated by fixed strategies and the teacher model to prevent the model from overfitting incorrect knowledge while preserving the ability for self-correction; the DBA decoder performs adversarial learning over different segmentation objectives, guiding the model to overcome the foreground-background confusion of camouflaged objects; and the Look-Twice mechanism mimics the human tendency to zoom in on camouflaged objects, performing secondary refinement on small-sized objects. Extensive experiments show that our method demonstrates outstanding performance, even surpassing some existing fully supervised methods. Our code will be released soon.
Feature Information Driven Position Gaussian Distribution Estimation for Tiny Object Detection
Jinghao Bian · Mingtao Feng · Weisheng Dong · Fangfang Wu · Jianqiao Luo · Yaonan Wang · Guangming Shi
Tiny object detection remains challenging in spite of the success of generic detectors. The dramatic performance degradation of generic detectors on tiny objects is mainly due to the weak representations of extremely limited pixels. To address this issue, we propose a plug-and-play architecture to enhance the extinguished regions. For the first time, we identify the regions to be enhanced from the perspective of the pixel-wise amount of information. Specifically, we model the feature information of all image pixels by minimizing an Information Entropy loss, generating an information map that attentively highlights weakly activated regions in an unsupervised way. To effectively assist the above phase with more attention to tiny objects, we next introduce the Position Gaussian Distribution Map, explicitly modeled using a Gaussian Mixture distribution, where each Gaussian component's parameters depend on the position and size of object instance labels, serving as supervision for further feature enhancement. Taking the information map as prior knowledge guidance, we construct a multi-scale Position Gaussian Distribution Map prediction module, simultaneously modulating the information map and distribution map to focus on tiny objects during training. Extensive experiments on three public tiny object datasets demonstrate the superiority of our method over current state-of-the-art competitors. The code is available at https://anonymous.4open.science/r/Information_TOD-19F1.
A Unified, Resilient, and Explainable Adversarial Patch Detector
Vishesh Kumar · Akshay Agarwal
Deep Neural Networks (DNNs), the backbone architecture of almost every computer vision task, are vulnerable to adversarial attacks, particularly physical out-of-distribution (OOD) adversarial patches. Existing models often struggle with interpreting these attacks in ways that align with human visual perception. Our proposed AdvPatchXAI introduces a generalized, robust, and explainable defense algorithm specifically designed to defend DNNs against physical adversarial threats. AdvPatchXAI employs a novel patch decorrelation loss that reduces feature redundancy and enhances the distinctiveness of patch representations, enabling better generalization across unseen adversarial scenarios. It learns prototypical parts in a self-supervised fashion, enhancing interpretability and correlation with human vision. The model utilizes a sparse linear layer for classification, making the decision-making process globally interpretable through a set of learned prototypes and locally explainable by pinpointing relevant prototypes within an image. Our comprehensive evaluation shows that AdvPatchXAI not only closes the "semantic" gap between latent space and pixel space but also effectively handles unseen adversarial patches even when perturbed with unseen corruptions, thereby significantly advancing DNN robustness in practical settings.
Bayesian Prompt Flow Learning for Zero-Shot Anomaly Detection
Zhen Qu · Xian Tao · Xinyi Gong · ShiChen Qu · Qiyu Chen · Zhengtao Zhang · Xingang Wang · Guiguang Ding
Recently, vision-language models (e.g., CLIP) have demonstrated remarkable performance in zero-shot anomaly detection (ZSAD). By leveraging auxiliary data during training, these models can directly perform cross-category anomaly detection on target datasets, such as detecting defects on industrial product surfaces or identifying tumors in organ tissues. Existing approaches typically construct text prompts through either manual design or the optimization of learnable prompt vectors. However, these methods face several challenges: 1) Hand-crafted text prompts depend heavily on expert knowledge and require extensive trial and error; 2) Single-form learnable prompts are insufficient to capture the complex semantics of anomalies; and 3) The prompt space is poorly constrained, leading to suboptimal generalization performance on unseen categories. To address these issues, we propose Bayesian Prompt Flow Learning (Bayes-PFL), which models the prompt space as a learnable probability distribution from a Bayesian perspective. Specifically, a prompt flow module is designed to learn both image-specific and image-agnostic distributions, which are jointly utilized to regularize the text prompt space and enhance the model's generalization to unseen categories. These learned distributions are then sampled to generate diverse text prompts, effectively covering the prompt space. Additionally, a residual cross-attention (RCA) module is introduced to better align dynamic text embeddings with fine-grained image features. Experimental results demonstrate that our method achieves state-of-the-art performance in ZSAD across 15 public industrial and medical anomaly detection datasets. Code will be released upon acceptance.
Towards Visual Discrimination and Reasoning of Real-World Physical Dynamics: Physics-Grounded Anomaly Detection
wenqiao Li · Yao Gu · Xintao Chen · Xiaohao Xu · Ming Hu · Xiaonan Huang · Yingna Wu
Humans detect real-world object anomalies by perceiving, interacting, and reasoning based on object-conditioned physical knowledge. The long-term goal of Industrial Anomaly Detection (IAD) is to enable machines to autonomously replicate this skill. However, current IAD algorithms are largely developed and tested on static, semantically simple datasets, which diverge from real-world scenarios where physical understanding and reasoning are essential. To bridge this gap, we introduce the Physics Anomaly Detection (Phys-AD) dataset, the first large-scale, real-world, physics-grounded video dataset for industrial anomaly detection. Collected using a real robot arm and motor, Phys-AD provides a diverse set of dynamic, semantically rich scenarios. The dataset includes more than 6400 videos across 22 real-world object categories, interacting with robot arms and motors, and exhibits 47 types of anomalies. Anomaly detection in Phys-AD requires visual reasoning, combining both physical knowledge and video content to determine object abnormality. We benchmark state-of-the-art anomaly detection methods under three settings: unsupervised AD, weakly-supervised AD, and video-understanding AD, highlighting their limitations in handling physics-grounded anomalies. Additionally, we introduce the Physics Anomaly Explanation (PAEval) metric, designed to assess the ability of visual-language foundation models to not only detect anomalies but also provide accurate explanations for their underlying physical causes. Our dataset and benchmark will be publicly available.
Dual-Interrelated Diffusion Model for Few-Shot Anomaly Image Generation
Ying Jin · Jinlong Peng · Qingdong He · Teng Hu · Jiafu Wu · Hao Chen · Haoxuan Wang · wenbing zhu · Mingmin Chi · Jun Liu · Yabiao Wang
The performance of anomaly inspection in industrial manufacturing is constrained by the scarcity of anomaly data. To overcome this challenge, researchers have started employing anomaly generation approaches to augment the anomaly dataset. However, existing anomaly generation methods suffer from limited diversity in the generated anomalies and struggle to blend the generated anomaly seamlessly with the original image. Moreover, the generated mask is usually not aligned with the generated anomaly. In this paper, we overcome these challenges from a new perspective, simultaneously generating the overall image and the corresponding anomaly part as a pair. We propose DualAnoDiff, a novel diffusion-based few-shot anomaly image generation model, which can generate diverse and realistic anomaly images by using a dual-interrelated diffusion model, where one branch is employed to generate the whole image while the other generates the anomaly part. Moreover, we extract background and shape information to mitigate the distortion and blurriness phenomenon in few-shot image generation. Extensive experiments demonstrate the superiority of our proposed model over state-of-the-art methods in terms of diversity, realism, and mask accuracy. Overall, our approach significantly improves the performance of downstream anomaly inspection tasks, including anomaly detection, anomaly localization, and anomaly classification.
LotusFilter: Fast Diverse Nearest Neighbor Search via a Learned Cutoff Table
Yusuke Matsui
Approximate nearest neighbor search (ANNS) is an essential building block for applications like RAG but can sometimes yield results that are overly similar to each other. In certain scenarios, it is desirable for search results to be similar to the query and diverse among themselves. We propose LotusFilter, a post-processing module to diversify ANNS results. We precompute a cutoff table summarizing vectors that are close to each other. During filtering, LotusFilter greedily looks up the table to delete redundant vectors from the candidates. We demonstrate that the proposed filter is fast (0.02 ms/query) in settings resembling real-world RAG applications, using features such as OpenAI embeddings.
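The mechanism is simple enough to sketch end to end. The following Python code is illustrative only, with an O(N^2) table build and invented names: it precomputes the cutoff table and then greedily filters a ranked ANNS candidate list.

import numpy as np

def build_cutoff_table(vectors: np.ndarray, epsilon: float) -> list[set[int]]:
    # For every database vector, store the set of other vectors within squared L2
    # distance epsilon (the "cutoff table"); built offline in the actual system.
    d2 = ((vectors[:, None, :] - vectors[None, :, :]) ** 2).sum(-1)
    return [set(np.nonzero(row < epsilon)[0].tolist()) - {i} for i, row in enumerate(d2)]

def lotus_filter(candidate_ids: list[int], table: list[set[int]], k: int) -> list[int]:
    # Greedy diversification: walk the candidates in ranked order and drop any
    # candidate that is too close (per the table) to an already kept result.
    kept, blocked = [], set()
    for cid in candidate_ids:
        if cid in blocked:
            continue
        kept.append(cid)
        blocked |= table[cid]          # everything near cid becomes redundant
        if len(kept) == k:
            break
    return kept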
FedBiP: Heterogeneous One-Shot Federated Learning with Personalized Latent Diffusion Models
Haokun Chen · Hang Li · Yao Zhang · Jinhe Bi · Gengyuan Zhang · Yueqi Zhang · Philip H.S. Torr · Jindong Gu · Denis Krompaß · Volker Tresp
One-Shot Federated Learning (OSFL), a special decentralized machine learning paradigm, has recently gained significant attention. OSFL requires only a single round of client data or model upload, which reduces communication costs and mitigates privacy threats compared to traditional FL. Despite these promising prospects, existing methods face challenges due to client data heterogeneity and limited data quantity when applied to real-world OSFL systems. Recently, Latent Diffusion Models (LDM) have shown remarkable advancements in synthesizing high-quality images through pretraining on large-scale datasets, thereby presenting a potential solution to overcome these issues. However, directly applying pretrained LDM to heterogeneous OSFL results in significant distribution shifts in synthetic data, leading to performance degradation in classification models trained on such data. This issue is particularly pronounced in rare domains, such as medical imaging, which are underrepresented in LDM's pretraining data. To address this challenge, we propose Federated Bi-Level Personalization (FedBiP), which personalizes the pretrained LDM at both instance-level and concept-level. In this way, FedBiP synthesizes images that follow each client's local data distribution without violating privacy regulations. FedBiP is also the first approach to simultaneously address feature space heterogeneity and client data scarcity in OSFL. Our method is validated through extensive experiments on three OSFL benchmarks with feature space heterogeneity, as well as on challenging medical and satellite image datasets with label heterogeneity. The results demonstrate the effectiveness of FedBiP, which substantially outperforms other OSFL methods.
Emphasizing Discriminative Features for Dataset Distillation in Complex Scenarios
Kai Wang · Zekai Li · Zhi-Qi Cheng · Samir Khaki · Ahmad Sajedi · Ramakrishna Vedantam · Konstantinos N. Plataniotis · Alexander G. Hauptmann · Yang You
Dataset distillation has demonstrated strong performance on simple datasets like CIFAR, MNIST, and TinyImageNet but struggles to achieve similar results in more complex scenarios. In this paper, we propose a novel approach that \textbf{e}mphasizes the \textbf{d}iscriminative \textbf{f}eatures (obtained by Grad-CAM) for dataset distillation, called \textbf{EDF}. Our approach is inspired by a key observation: in simple datasets, high-activation areas typically occupy most of the image, whereas in complex scenarios, the size of these areas is much smaller. Unlike previous methods that treat all pixels equally when synthesizing images, EDF uses Grad-CAM activation maps to enhance high-activation areas. From a supervision perspective, we downplay supervision signals that have lower losses, as they contain common patterns. Additionally, to help the DD community better explore complex scenarios, we build the Complex Dataset Distillation (Comp-DD) benchmark by meticulously selecting sixteen subsets, eight easy and eight hard, from ImageNet-1K. Notably, EDF consistently outperforms SOTA results in complex scenarios, such as ImageNet-1K subsets. Hopefully, more researchers will be inspired and encouraged to enhance the practicality and efficacy of DD. Our code and benchmark will be made public.
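As one possible reading of "using Grad-CAM activation maps to enhance high-activation areas", the sketch below re-weights a per-pixel feature-matching loss by a normalized CAM so that discriminative regions dominate the distillation signal; this is an illustration under assumptions, not the paper's exact objective, and all names are hypothetical.

import torch
import torch.nn.functional as F

def discriminative_matching_loss(syn_feat, real_feat, cam):
    # syn_feat, real_feat: [B, C, H, W] features of synthetic / real batches.
    # cam:                 [B, 1, h, w]  Grad-CAM maps computed on the real batch.
    w = F.interpolate(cam, size=syn_feat.shape[-2:], mode="bilinear", align_corners=False)
    w = w / (w.sum(dim=(-2, -1), keepdim=True) + 1e-6)      # normalize to a spatial weighting
    per_pixel = (syn_feat - real_feat).pow(2).mean(dim=1, keepdim=True)   # [B, 1, H, W]
    return (w * per_pixel).sum(dim=(-2, -1)).mean()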
Hierarchical Features Matter: A Deep Exploration of Progressive Parameterization Method for Dataset Distillation
Xinhao Zhong · Hao Fang · Bin Chen · Xulin Gu · Meikang Qiu · Shuhan Qi · Shu-Tao Xia
Dataset distillation is an emerging dataset reduction method, which condenses large-scale datasets while maintaining task accuracy. Current parameterization methods achieve enhanced performance under extremely high compression ratios by optimizing the synthetic dataset in an informative feature domain. However, they limit themselves to a fixed optimization space for distillation, neglecting the diverse guidance across different informative latent spaces. To overcome this limitation, we propose a novel parameterization method dubbed Hierarchical Parameterization Distillation (H-PD), to systematically explore hierarchical features within the provided feature space (e.g., layers within pre-trained generative adversarial networks). We verify the correctness of our insights by applying the hierarchical optimization strategy to a GAN-based parameterization method. In addition, we introduce a novel class-relevant feature distance metric to alleviate the computational burden associated with synthetic dataset evaluation, bridging the gap between synthetic and original datasets. Experimental results demonstrate that the proposed H-PD achieves a significant performance improvement under various settings with equivalent time consumption, and even surpasses current generative distillation using diffusion models under extreme compression ratios (IPC = 1 and IPC = 10).
EVOS: Efficient Implicit Neural Training via EVOlutionary Selector
Weixiang Zhang · Shuzhao Xie · Chengwei Ren · Siyi Xie · Chen Tang · Shijia Ge · Mingzi Wang · Zhi Wang
We propose EVOlutionary Selector (EVOS), an efficient training paradigm for accelerating Implicit Neural Representation (INR). Unlike conventional INR training that feeds all samples through the neural network in each iteration, our approach restricts training to strategically selected points, reducing computational overhead by eliminating redundant forward passes. Specifically, we treat each sample as an individual in an evolutionary process, where only the fittest survive and merit inclusion in training, adaptively evolving with the neural network dynamics. While this is conceptually similar to Evolutionary Algorithms, their distinct objectives (selection for acceleration vs. iterative solution optimization) require a fundamental redefinition of evolutionary mechanisms for our context. In response, we design sparse fitness evaluation, frequency-guided crossover, and augmented unbiased mutation to comprise EVOS. These components respectively guide sample selection with reduced computational cost, enhance performance through frequency-domain balance, and mitigate selection bias from cached evaluation. Extensive experiments demonstrate that our method achieves approximately 48%-66% reduction in training time while ensuring superior convergence without additional cost, establishing state-of-the-art acceleration among recent sampling-based strategies.
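A stripped-down sketch of the selection loop (fitness = current per-sample error, top-k survival, a small random "mutation" term; the frequency-guided crossover is omitted) might look like the following Python; the function and variable names are made up.

import torch

def evolutionary_select(coords, targets, model, keep_ratio=0.4, noise=1e-3):
    # coords: [N, d_in] sample coordinates; targets: [N, d_out] ground-truth values.
    with torch.no_grad():                                   # sparse fitness evaluation
        fitness = (model(coords) - targets).pow(2).mean(dim=-1)
    fitness = fitness + noise * torch.rand_like(fitness)    # mutation: avoid starving easy points
    k = max(1, int(keep_ratio * coords.shape[0]))
    idx = torch.topk(fitness, k).indices                    # only the fittest survive
    return coords[idx], targets[idx]

# Training-step sketch (hypothetical names):
# sel_x, sel_y = evolutionary_select(all_coords, all_pixels, inr)
# loss = (inr(sel_x) - sel_y).pow(2).mean(); loss.backward()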
Learning from Neighbors: Category Extrapolation for Long-Tail Learning
Shizhen Zhao · Xin Wen · Jiahui Liu · Chuofan Ma · Chunfeng Yuan · Xiaojuan Qi
Balancing training on long-tail data distributions remains a long-standing challenge in deep learning. While methods such as re-weighting and re-sampling help alleviate the imbalance issue, limited sample diversity continues to hinder models from learning robust and generalizable feature representations, particularly for tail classes. In contrast to existing methods, we offer a novel perspective on long-tail learning, inspired by an observation: datasets with finer granularity tend to be less affected by data imbalance. In this paper, we investigate this phenomenon through both quantitative and qualitative studies, showing that increased granularity enhances the generalization of learned features in tail categories. Motivated by these findings, we propose a method to increase dataset granularity through category extrapolation. Specifically, we introduce open-set fine-grained classes that are related to existing ones, aiming to enhance representation learning for both head and tail classes. To automate the curation of auxiliary data, we leverage large language models (LLMs) as knowledge bases to search for auxiliary categories and retrieve relevant images through web crawling. To prevent the overwhelming presence of auxiliary classes from disrupting training, we introduce a neighbor-silencing loss that encourages the model to focus on class discrimination within the target dataset. During inference, the classifier weights for auxiliary categories are masked out, leaving only the target class weights for use. Extensive experiments on three standard long-tail benchmarks demonstrate the effectiveness of our approach, notably outperforming strong baseline methods that use the same amount of data. The code will be made publicly available.
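Two of the ingredients above are easy to sketch: a cross-entropy variant that down-weights the auxiliary (web-crawled) class logits, in the spirit of the neighbor-silencing loss, and inference-time masking of the auxiliary classifier outputs. The code below is an interpretation under assumptions, not the authors' loss.

import torch
import torch.nn.functional as F

def silenced_cross_entropy(logits, labels, aux_mask, silence: float = 0.1):
    # aux_mask: [num_classes] bool, True for auxiliary categories.
    scale = torch.ones(logits.shape[-1], device=logits.device)
    scale[aux_mask] = silence                         # damp competition from auxiliary classes
    return F.cross_entropy(logits * scale, labels)

def predict_target_only(logits, aux_mask):
    # At inference, auxiliary-class weights are masked out; predict among target classes only.
    masked = logits.masked_fill(aux_mask.unsqueeze(0), float("-inf"))
    return masked.argmax(dim=-1)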
PLeaS - Merging Models with Permutations and Least Squares
Anshul Nasery · Jonathan Hayase · Pang Wei Koh · Sewoong Oh
The democratization of machine learning systems has made the process of fine-tuning accessible to practitioners, leading to a wide range of open-source models fine-tuned on specialized tasks and datasets. Recent work has proposed to merge such models to combine their functionalities. However, prior approaches are usually restricted to models that are fine-tuned from the same base model. Furthermore, the final merged model is typically required to be of the same size as the original models. In this work, we propose a new two-step algorithm to merge models---termed PLeaS---which relaxes these constraints. First, leveraging the Permutation symmetries inherent in the two models, PLeaS partially matches nodes in each layer by maximizing alignment. Next, PLeaS computes the weights of the merged model as a layer-wise Least Squares solution to minimize the approximation error between the features of the merged model and the permuted features of the original models. PLeaS allows a practitioner to merge two models sharing the same architecture into a single performant model of a desired size, even when the two original models are fine-tuned from different base models. We also demonstrate how our method can be extended to address a challenging scenario where no data is available from the fine-tuning domains. We demonstrate our method to merge ResNet and ViT models trained with shared and different label spaces, and show improvement over the state-of-the-art merging methods of up to 15 percentage points for the same target compute while merging models trained on DomainNet and fine-grained classification tasks.
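The least-squares step, as described, admits a compact per-layer sketch: given calibration activations and the two models' permutation-aligned layer outputs, solve for the merged weights in closed form. The Python below is a schematic rendering with assumed shapes, not the released implementation.

import torch

def merge_layer_least_squares(acts_in, acts_out_a, acts_out_b, alpha=0.5):
    # acts_in:    [N, D_in]  inputs to this layer collected on calibration data.
    # acts_out_a: [N, D_out] outputs of model A's layer (already permuted to match B).
    # acts_out_b: [N, D_out] outputs of model B's layer.
    # Returns W of shape [D_out, D_in] for the merged layer.
    target = alpha * acts_out_a + (1 - alpha) * acts_out_b
    # torch.linalg.lstsq solves argmin_X ||A X - B||_F; transpose to obtain W.
    W = torch.linalg.lstsq(acts_in, target).solution.T
    return W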
Everything to the Synthetic: Diffusion-driven Test-time Adaptation via Synthetic-Domain Alignment
Jiayi Guo · Zhao Junhao · Chaoqun Du · Yulin Wang · Chunjiang Ge · Zanlin Ni · Shiji Song · Humphrey Shi · Gao Huang
Test-time adaptation (TTA) aims to improve the performance of source-domain pre-trained models on previously unseen, shifted target domains. Traditional TTA methods primarily adapt model weights based on target data streams, making model performance sensitive to the amount and order of target data. The recently proposed diffusion-driven TTA methods mitigate this by adapting model inputs instead of weights, where an unconditional diffusion model, trained on the source domain, transforms target-domain data into a synthetic domain that is expected to approximate the source domain. However, in this paper, we reveal that although the synthetic data in diffusion-driven TTA seems indistinguishable from the source data, it is unaligned with, or even markedly different from, the latter for deep networks. To address this issue, we propose a Synthetic-Domain Alignment (SDA) framework. Our key insight is to fine-tune the source model with synthetic data to ensure better alignment. Specifically, we first employ a conditional diffusion model to generate labeled samples, creating a synthetic dataset. Subsequently, we use the aforementioned unconditional diffusion model to add noise to and denoise each sample before fine-tuning. This Mix of Diffusion (MoD) process mitigates the potential domain misalignment between the conditional and unconditional models. Extensive experiments across classifiers, segmenters, and multimodal large language models (MLLMs, e.g., LLaVA) demonstrate that SDA achieves superior domain alignment and consistently outperforms existing diffusion-driven TTA methods. Our code will be open-sourced.
SURGEON: Memory-Adaptive Fully Test-Time Adaptation via Dynamic Activation Sparsity
Ke Ma · Jiaqi Tang · Bin Guo · Fan Dang · Sicong Liu · Zhui Zhu · Lei Wu · Cheng Fang · Ying-Cong Chen · Zhiwen Yu · Yunhao Liu
Despite the growing integration of deep models into mobile and embedded terminals, the accuracy of these models often declines significantly during inference due to various deployment interferences. Test-time adaptation (TTA) has emerged as an effective strategy to improve the performance of deep models by adapting them to unlabeled target data online. Yet, the significant memory cost, particularly in memory-constrained IoT terminals, impedes the effective deployment of most backward-propagation-based TTA methods. To tackle memory constraints, we introduce SURGEON, a method that substantially reduces memory cost while preserving comparable accuracy improvements during fully test-time adaptation (FTTA) without relying on specific network architectures or modifications to the original training procedure. Specifically, we propose a novel dynamic activation sparsity strategy that directly prunes activations at layer-specific dynamic ratios, allowing for flexible control of learning ability and memory cost in a data-sensitive manner during adaptation. Within this strategy, two metrics, Gradient Importance and Layer Activation Memory, are used to determine the layer-wise activation pruning ratios, reflecting accuracy contribution and memory efficiency, respectively. Experimentally, our method surpasses previous TTA baselines by not only reducing memory usage but also achieving superior accuracy, delivering SOTA performance across diverse datasets, network architectures, and tasks.
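To illustrate what "pruning activations at layer-specific dynamic ratios" could look like mechanically, here is a hedged PyTorch sketch that keeps only the largest-magnitude fraction of an activation tensor before it is cached for backpropagation; the per-layer ratios shown are invented placeholders for the metric-driven ones in the paper.

import torch

def prune_activations(act: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    # Zero all but the largest-magnitude keep_ratio fraction of the activation tensor.
    flat = act.abs().flatten()
    k = max(1, int(keep_ratio * flat.numel()))
    threshold = torch.topk(flat, k, largest=True).values.min()
    return act * (act.abs() >= threshold)

# Assigning different (hypothetical) ratios per layer:
# ratios = {"layer1": 0.5, "layer2": 0.3, "layer3": 0.1}
# cached = {name: prune_activations(a, ratios[name]) for name, a in activations.items()}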
Hierarchical Knowledge Prompt Tuning for Multi-task Test-Time Adaptation
Qiang Zhang · Mengsheng Zhao · Jiawei Liu · Fanrui Zhang · Yongchao Xu · Zheng-Jun Zha
Test-time adaptation using vision-language models (such as CLIP) to quickly adjust to distributional shifts in downstream tasks has shown great potential. Despite significant progress, existing methods are still limited to single-task test-time adaptation scenarios and have not effectively explored the issue of multi-task adaptation. To address this practical problem, we propose a novel Hierarchical Knowledge Prompt Tuning (HKPT) method, which achieves joint adaptation to multiple target domains by mining more comprehensive source domain discriminative knowledge and hierarchically modeling task-specific and task-shared knowledge. Specifically, HKPT constructs a CLIP prompt distillation framework that utilizes the broader source domain knowledge of a large teacher CLIP to guide prompt tuning for a lightweight student CLIP from multiple views during testing. Meanwhile, HKPT establishes task-specific dual dynamic knowledge graphs to capture fine-grained contextual knowledge from continuous test data. To fully exploit the complementarity among multiple target tasks, HKPT employs an adaptive task grouping strategy to achieve inter-task knowledge sharing. Furthermore, HKPT can seamlessly transfer to basic single-task test-time adaptation scenarios while maintaining robust performance. Extensive experimental results in both multi-task and single-task test-time adaptation settings demonstrate that our HKPT significantly outperforms state-of-the-art methods.
CL-LoRA: Continual Low-Rank Adaptation for Rehearsal-Free Class-Incremental Learning
Jiangpeng He · Zhihao Duan · Fengqing Zhu
Class-Incremental Learning (CIL) aims to learn new classes sequentially while retaining the knowledge of previously learned classes. Recently, pre-trained models (PTMs) combined with parameter-efficient fine-tuning (PEFT) have shown remarkable performance in rehearsal-free CIL without requiring exemplars from previous tasks. However, existing adapter-based methods, which incorporate lightweight learnable modules into PTMs for CIL, create new adapters for each new task, leading to both parameter redundancy and failure to leverage shared knowledge across tasks. In this work, we propose ContinuaL Low-Rank Adaptation (CL-LoRA), which introduces a novel dual-adapter architecture combining task-shared adapters to learn cross-task knowledge and task-specific adapters to capture the unique feature of each new task. Specifically, the shared adapters utilize random orthogonal matrices and leverage knowledge distillation with gradient reassignment to preserve essential shared knowledge. In addition, we introduce learnable block-wise weights for task-specific adapters, which mitigates inter-task interference while maintaining the model's plasticity. Through comprehensive experiments across multiple benchmark datasets, we demonstrate that CL-LoRA consistently outperforms state-of-the-art methods while using fewer trainable parameters, establishing a more efficient and scalable paradigm for continual learning with pre-trained models.
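A minimal sketch of the dual-adapter idea, assuming a frozen random orthogonal down-projection for the shared branch and a learnable scalar block-wise weight for the task-specific branch, is given below; the wiring, ranks, and initialization are illustrative, and the distillation and gradient-reassignment parts are omitted.

import torch
import torch.nn as nn

class DualLoRA(nn.Module):
    # Task-shared low-rank branch (frozen orthogonal down-projection) plus a
    # task-specific low-rank branch scaled by a learnable block-wise weight.
    def __init__(self, dim: int, rank: int = 8):
        super().__init__()
        q, _ = torch.linalg.qr(torch.randn(dim, rank))               # orthogonal columns
        self.shared_down = nn.Parameter(q, requires_grad=False)      # frozen, shared across tasks
        self.shared_up = nn.Parameter(torch.zeros(rank, dim))
        self.task_down = nn.Parameter(torch.randn(dim, rank) * 0.01) # new pair per task
        self.task_up = nn.Parameter(torch.zeros(rank, dim))
        self.block_weight = nn.Parameter(torch.ones(1))              # learnable block-wise weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:              # x: [..., dim]
        shared = x @ self.shared_down @ self.shared_up
        specific = self.block_weight * (x @ self.task_down @ self.task_up)
        return x + shared + specific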
Dynamic Integration of Task-Specific Adapters for Class Incremental Learning
Jiashuo Li · Shaokun Wang · Bo Qian · Yuhang He · Xing Wei · Qiang Wang · Yihong Gong
Non-exemplar class Incremental Learning (NECIL) enables models to continuously acquire new classes without retraining from scratch and storing old task exemplars, addressing privacy and storage issues. However, the absence of data from earlier tasks exacerbates the challenge of catastrophic forgetting in NECIL. In this paper, we propose a novel framework called Dynamic Integration of task-specific Adapters (DIA), which comprises two key components: Task-Specific Adapter Integration (TSAI) and Patch-Level Model Alignment. TSAI boosts compositionality through a patch-level adapter integration strategy, aggregating richer task-specific information while maintaining low computation costs. Patch-Level Model Alignment maintains feature consistency and accurate decision boundaries via two specialized mechanisms: Patch-Level Distillation Loss (PDL) and Patch-Level Feature Reconstruction method (PFR). Specifically, on the one hand, the PDL preserves feature-level consistency between successive models by implementing a distillation loss based on the contributions of patch tokens to new class learning. On the other hand, the PFR promotes classifier alignment by reconstructing old class features from previous tasks that adapt to new task knowledge, thereby preserving well-calibrated decision boundaries. Extensive experiments validate the effectiveness of our DIA, revealing significant improvements on NECIL benchmark datasets while maintaining an optimal balance between computational complexity and accuracy. The full code implementation will be made publicly available upon the publication of this paper.
Task-Specific Gradient Adaptation for Few-Shot One-Class Classification
Yunlong Li · Xiabi Liu · Liyuan Pan · Yuchen Ren
Optimization-based meta-learning methods for few-shot one-class classification (FS-OCC) aim to fine-tune a meta-trained model to classify the positive and negative samples using only a few positive samples by adaptation. However, recent approaches primarily focus on adjusting existing meta-learning algorithms for FS-OCC, while overlooking issues stemming from the misalignment between the cross-entropy loss and OCC tasks during adaptation. This misalignment, combined with the limited availability of one-class samples and the restricted diversity of task-specific adaptation, can significantly exacerbate the adverse effects of gradient instability and generalization. To address these challenges, we propose a novel \textbf{T}ask-\textbf{S}pecific \textbf{G}radient \textbf{A}daptation (\textbf{TSGA}) for FS-OCC. Without extra supervision, TSGA learns to generate appropriate, stable gradients by leveraging label prediction and feature representation details of one-class samples and refines the adaptation process by recalibrating task-specific gradients and regularization terms. We evaluate TSGA on three challenging datasets and a real-world CNC Milling Machine application and demonstrate consistent improvements over baseline methods. Furthermore, we illustrate the critical impact of gradient instability and task-agnostic adaptation. Notably, TSGA achieves state-of-the-art results by effectively addressing these issues.
Multi-Granularity Class Prototype Topology Distillation for Class-Incremental Source-Free Unsupervised Domain Adaptation
Peihua Deng · Jiehua Zhang · Xichun Sheng · Chenggang Yan · Yaoqi Sun · Ying Fu · Liang Li
This paper explores the Class-Incremental Source-Free Unsupervised Domain Adaptation (CI-SFUDA) problem, where the unlabeled target data come incrementally without access to labeled source instances. This problem poses two challenges: the disturbance of similar source-class knowledge to target-class representation learning, and the interference of new target knowledge with old knowledge. To address them, we propose the Multi-Granularity Class Prototype Topology Distillation (GROTO) algorithm, which effectively transfers the source knowledge to the unlabeled class-incremental target domain. Concretely, we design the multi-granularity class prototype self-organization module and the prototype topology distillation module. Firstly, the positive classes are mined by modeling two accumulation distributions. Then, we generate reliable pseudo-labels by introducing multi-granularity class prototypes, and use them to promote the positive-class target feature self-organization. Secondly, the positive-class prototypes are leveraged to construct the topological structures of the source and target feature spaces. Then, we perform the topology distillation to continually mitigate the interference of new target knowledge with old knowledge. Extensive experiments demonstrate that our proposed method achieves state-of-the-art performance on three public datasets.
Balanced Direction from Multifarious Choices: Arithmetic Meta-Learning for Domain Generalization
Xiran Wang · Jian Zhang · Lei Qi · Yinghuan Shi
Domain generalization is proposed to address distribution shift arising from statistical disparities between training source and unseen target domains. Widely used first-order meta-learning algorithms demonstrate strong performance for domain generalization by leveraging gradient matching theory, which aims to establish balanced parameters across source domains to reduce overfitting to any particular domain. However, our analysis reveals that there are actually numerous directions that achieve gradient matching, and current methods represent just one possible path. These methods also overlook another critical factor: the balanced parameters should be close to the centroid of the optimal parameters of the individual source domains. To address this, we propose a simple yet effective arithmetic meta-learning approach with arithmetic-weighted gradients. While adhering to the principles of gradient matching, this approach promotes a more precise balance by estimating the centroid of the domain-specific optimal parameters. Experimental results on ten datasets validate the effectiveness of our strategy. Our code is available in the supplementary material.
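A minimal PyTorch sketch of the first-order idea of stepping along the arithmetic mean of per-domain gradients, which moves the parameters toward a balanced point between the source domains. The function name and the plain (unweighted) average are illustrative assumptions; the paper's arithmetic weighting may be more elaborate.

```python
import torch

def arithmetic_meta_step(model, domain_batches, loss_fn, lr=1e-3):
    """One update along the arithmetic mean of per-domain gradients.

    domain_batches: list of (inputs, targets), one batch per source domain.
    Assumes every trainable parameter participates in each domain's loss.
    """
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]
    for x, y in domain_batches:
        grads = torch.autograd.grad(loss_fn(model(x), y), params)
        summed = [s + g for s, g in zip(summed, grads)]
    with torch.no_grad():
        for p, s in zip(params, summed):
            p -= lr * s / len(domain_batches)   # step toward the centroid direction
```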
ADU: Adaptive Detection of Unknown Categories in Black-Box Domain Adaptation
Yushan Lai · Guowen Li · Haoyuan Liang · Juepeng Zheng · Zhiyu Ye
Black-box Domain Adaptation (BDA) utilizes a black-box predictor of the source domain to label target domain data, addressing privacy concerns in Unsupervised Domain Adaptation (UDA). However, BDA assumes identical label sets across domains, which is unrealistic. To overcome this limitation, we study BDA with unknown classes in the target domain: a black-box predictor labels target data and identifies "unknown" categories, without requiring access to source domain data or predictor parameters, thus addressing both the data privacy and category shift issues of traditional UDA. Existing methods face two main challenges: (i) noisy pseudo-labels in knowledge distillation (KD) accumulate prediction errors, and (ii) relying on a preset threshold fails to adapt to varying category shifts. To address these, we propose ADU, a framework that allows the target domain to autonomously learn from pseudo-labels guided by their quality and uses an adaptive threshold to identify "unknown" categories. Specifically, ADU consists of Selective Amplification Knowledge Distillation (SAKD) and Entropy-Driven Label Differentiation (EDLD). SAKD improves KD by focusing on high-quality pseudo-labels, mitigating the impact of noisy labels. EDLD categorizes pseudo-labels by quality and applies tailored training strategies to distinguish "unknown" categories, improving detection accuracy and adaptability. Extensive experiments show that ADU achieves state-of-the-art results, outperforming the best existing method by 3.1\% on VisDA in the OPBDA scenario.
Unlocking the Potential of Unlabeled Data in Semi-Supervised Domain Generalization
Dongkwan Lee · Kyomin Hwang · Nojun Kwak
We address the problem of semi-supervised domain generalization (SSDG), where the distributions of training and test data differ and only a small amount of labeled data, along with a larger amount of unlabeled data, is available during training. Existing SSDG methods that leverage only the unlabeled samples for which the model's predictions are highly confident (confident-unlabeled samples) limit the full utilization of the available unlabeled data. To the best of our knowledge, we are the first to explore a method for incorporating the unconfident-unlabeled samples that were previously disregarded in the SSDG setting. To this end, we propose UPCSC, a method for utilizing these unconfident-unlabeled samples in SSDG that consists of two modules: 1) an Unlabeled Proxy-based Contrastive learning (UPC) module, which treats unconfident-unlabeled samples as additional negative pairs, and 2) a Surrogate Class learning (SC) module, which generates positive pairs for unconfident-unlabeled samples using their confusing class set. These modules are plug-and-play, do not require any domain labels, and can be easily integrated into existing approaches. Experiments on four widely used SSDG benchmarks demonstrate that our approach consistently improves performance when attached to baselines and outperforms competing plug-and-play methods. We also analyze the role of our method in SSDG, showing that it enhances class-level discriminability and mitigates domain gaps.
Distilling Long-tailed Datasets
Zhenghao Zhao · Haoxuan Wang · Yuzhang Shang · Kai Wang · Yan Yan
Dataset distillation (DD) aims to synthesize a small information-rich dataset from a large dataset for efficient neural network training. However, existing dataset distillation methods struggle with long-tailed datasets, which are prevalent in real-world scenarios. By investigating the reasons behind this unexpected result, we identified two main causes: 1) The distillation process on imbalanced datasets develops biased gradients, leading to the synthesis of similarly imbalanced distilled datasets. 2) The experts trained on such datasets perform suboptimally on tail classes, resulting in misguided distillation supervision and poor-quality soft-label initialization. To address these issues, we first propose Distribution-agnostic Matching to avoid directly matching the biased expert trajectories. It reduces the distance between the student and the biased expert trajectories and prevents the tail class bias from being distilled to the synthetic dataset. Moreover, we improve the distillation guidance with Expert Decoupling, which jointly matches the decoupled backbone and classifier to improve the tail class performance and initialize reliable soft labels. This work pioneers the field of long-tailed dataset distillation (LTDD), marking the first effective effort to distill long-tailed datasets.
Open Set Label Shift with Test Time Out-of-Distribution Reference
Changkun Ye · Russell Tsuchida · Lars Petersson · Nick Barnes
Open set label shift (OSLS) occurs when label distributions change from a source to a target distribution, and the target distribution has an additional out-of-distribution (OOD) class. In this work, we build estimators for both source and target open set label distributions using a source-domain in-distribution (ID) classifier and an ID/OOD classifier. With reasonable assumptions on the ID/OOD classifier, the estimators are assembled into a sequence of three stages: 1) an estimate of the source label distribution of the OOD class, 2) an EM algorithm for maximum likelihood estimates (MLE) of the target label distribution, and 3) an estimate of the target label distribution of the OOD class under relaxed assumptions on the OOD classifier. The sampling errors of the estimates in 1) and 3) are quantified with a concentration inequality. The estimation result allows us to correct the ID classifier trained on the source distribution to the target distribution without retraining. Experiments on a variety of open set label shift settings demonstrate the effectiveness of our model in both estimation error and classification accuracy.
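For reference, stage 2 can be grounded in the classic EM estimator for target label priors under label shift, sketched below with NumPy. This sketch covers only the ID classes given source-classifier posteriors; how the paper folds the OOD class into the estimate is not reproduced here.

```python
import numpy as np

def em_target_priors(posteriors, source_priors, n_iter=100, tol=1e-8):
    """Maximum-likelihood estimate of target label priors under label shift.

    posteriors: (N, C) source-classifier probabilities p_s(y | x) on target data.
    source_priors: (C,) label distribution of the source domain.
    """
    target_priors = source_priors.copy()
    for _ in range(n_iter):
        # E-step: reweight posteriors by the current prior ratio
        w = posteriors * (target_priors / source_priors)
        w /= w.sum(axis=1, keepdims=True)
        # M-step: new priors are the average responsibilities
        new_priors = w.mean(axis=0)
        if np.abs(new_priors - target_priors).max() < tol:
            break
        target_priors = new_priors
    return target_priors
```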
OODD: Test-time Out-of-Distribution Detection with Dynamic Dictionary
Yifeng Yang · Lin Zhu · Zewen Sun · Hengyu Liu · Qinying Gu · Nanyang Ye
Out-of-distribution (OOD) detection remains challenging for deep learning models, particularly when test-time OOD samples differ significantly from training outliers. We propose OODD, a novel test-time OOD detection method that dynamically maintains and updates an OOD dictionary without fine-tuning. Our approach leverages a priority queue-based dictionary that accumulates representative OOD features during testing, combined with an informative inlier sampling strategy for in-distribution (ID) samples. To ensure stable performance during early testing, we propose a dual OOD stabilization mechanism that leverages strategically generated outliers derived from ID data. Extensive experiments on the OpenOOD benchmark demonstrate that OODD significantly outperforms existing methods, achieving a 26.0\% improvement in FPR95 on CIFAR-100 Far OOD detection compared to the state-of-the-art approach. Furthermore, we present an optimized variant of the KNN-based OOD detection framework that achieves a 3x speedup while maintaining detection performance.
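A schematic sketch of the priority-queue dictionary idea: keep the most OOD-like test features, evict the least, and score new queries by their similarity to the stored entries. The class name, capacity, and cosine-similarity KNN score are illustrative assumptions rather than the paper's exact design.

```python
import heapq
import numpy as np

class OODDictionary:
    """Fixed-size priority queue of OOD features: high-confidence OOD samples
    are kept, and the least OOD-like entry is evicted when the queue is full."""

    def __init__(self, capacity=512):
        self.capacity = capacity
        self.heap = []        # (ood_score, counter, feature); min-heap on ood_score
        self._counter = 0     # tie-breaker so features are never compared

    def push(self, feature, ood_score):
        item = (ood_score, self._counter, feature)
        self._counter += 1
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, item)
        elif ood_score > self.heap[0][0]:
            heapq.heapreplace(self.heap, item)   # evict the least OOD-like entry

    def knn_similarity(self, feature, k=10):
        """Mean cosine similarity of a query to its k nearest dictionary entries."""
        if not self.heap:
            return 0.0
        feats = np.stack([f for _, _, f in self.heap])
        feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
        q = feature / np.linalg.norm(feature)
        sims = feats @ q
        k = min(k, len(sims))
        return float(np.sort(sims)[-k:].mean())
```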
pFedMxF: Personalized Federated Class-Incremental Learning with Mixture of Frequency Aggregation
Yifei Zhang · Hao Zhu · Alysa Ziying Tan · Dianzhi Yu · Longtao Huang · Han Yu
Federated learning (FL) has emerged as a promising paradigm for privacy-preserving collaborative machine learning. However, extending FL to class-incremental learning settings introduces three key challenges: 1) spatial heterogeneity due to non-IID data distributions across clients, 2) temporal heterogeneity due to the sequential arrival of tasks, and 3) resource heterogeneity due to diverse client capabilities. Existing approaches generally address these challenges in isolation, potentially leading to interference between updates, catastrophic forgetting, or excessive communication overhead. In this paper, we propose personalized Federated class-incremental parameter-efficient fine-tuning with Mixture of Frequency aggregation (pFedMixF), a novel framework that simultaneously addresses all three heterogeneity challenges through frequency-domain decomposition. Our key insight is that assigning orthogonal frequency components to different clients and tasks enables interference-free learning with minimal communication costs. We further design an Auto-Task Agnostic Classifier that automatically routes samples to task-specific classifiers while adapting to heterogeneous class distributions. We conduct extensive experiments on three benchmark datasets, comparing our approach with eight state-of-the-art methods. The results demonstrate that pFedMixF achieves comparable test accuracy while requiring only 25% of the full model parameters and incurring significantly lower communication costs than baseline methods.
FedAWA: Adaptive Optimization of Aggregation Weights in Federated Learning Using Client Vectors
Changlong Shi · He Zhao · Bingjie Zhang · Mingyuan Zhou · Dandan Guo · Yi Chang
Federated Learning (FL) has emerged as a promising framework for distributed machine learning, enabling collaborative model training without sharing local data, thereby preserving privacy and enhancing security. However, data heterogeneity resulting from differences across user behaviors, preferences, and device characteristics poses a significant challenge for federated learning. Most previous works overlook the adjustment of aggregation weights, relying solely on dataset size for weight assignment, which often leads to unstable convergence and reduced model performance. Recently, several studies have sought to refine aggregation strategies by incorporating dataset characteristics and model alignment. However, adaptively adjusting aggregation weights while ensuring data security—without requiring additional proxy data—remains a significant challenge. In this work, we propose Federated learning with Adaptive Weight Aggregation (FedAWA), a novel method that adaptively adjusts aggregation weights based on client vectors during the learning process. The client vector captures the direction of model updates, reflecting local data variations, and is used to optimize the aggregation weight without requiring additional datasets or violating privacy. By assigning higher aggregation weights to local models whose updates align closely with the global optimization direction, FedAWA enhances the stability and generalization of the global model. Extensive experiments under diverse scenarios demonstrate the superiority of our method, providing a promising solution to the challenges of data heterogeneity in federated learning.
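The core aggregation idea can be sketched as follows: treat each client's parameter update as a client vector, measure its alignment with the mean update direction, and convert the alignments into aggregation weights. The softmax weighting and the use of the mean direction as the global optimization direction are simplifying assumptions; FedAWA's actual weight optimization may differ.

```python
import torch
import torch.nn.functional as F

def fedawa_aggregate(global_params, client_params, temperature=1.0):
    """Aggregate client models, weighting each client by how well its update
    direction (client vector) aligns with the mean update direction.

    global_params: flat tensor of the current global model parameters.
    client_params: list of flat tensors, one per client, after local training.
    """
    vectors = torch.stack([cp - global_params for cp in client_params])   # client vectors
    mean_dir = vectors.mean(dim=0)
    cos = F.cosine_similarity(vectors, mean_dir.unsqueeze(0), dim=1)      # alignment per client
    weights = torch.softmax(cos / temperature, dim=0)                     # adaptive aggregation weights
    return global_params + (weights.unsqueeze(1) * vectors).sum(dim=0)
```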
Unlearning through Knowledge Overwriting: Reversible Federated Unlearning via Selective Sparse Adapter
Zhengyi Zhong · Weidong Bao · Ji Wang · Shuai Zhang · Jingxuan Zhou · Lingjuan Lyu · Wei Yang Bryan Lim
Federated Learning is a promising paradigm for privacy-preserving collaborative model training. In practice, it is essential not only to continuously train the model to acquire new knowledge but also to guarantee old knowledge the right to be forgotten (i.e., federated unlearning), especially for privacy-sensitive information or harmful knowledge. However, current federated unlearning methods face several challenges, including indiscriminate unlearning of cross-client knowledge, irreversibility of unlearning, and significant unlearning costs. To this end, we propose a method named FUSED, which first identifies critical layers by analyzing each layer’s sensitivity to knowledge and constructs sparse unlearning adapters for sensitive ones. Then, the adapters are trained without altering the original parameters, overwriting the unlearning knowledge with the remaining knowledge. This knowledge overwriting process enables FUSED to mitigate the effects of indiscriminate unlearning. Moreover, the introduction of independent adapters makes unlearning reversible and significantly reduces the unlearning costs. Finally, extensive experiments on five datasets across three unlearning scenarios demonstrate that FUSED's effectiveness is comparable to Retraining, surpassing all other baselines while greatly reducing unlearning costs. The code is available at https://anonymous.4open.science/r/FUSED-4E8E.
Jailbreaking the Non-Transferable Barrier via Test-Time Data Disguising
Yongli Xiang · Ziming Hong · Lina Yao · Dadong Wang · Tongliang Liu
Non-transferable learning (NTL) has been proposed to protect model intellectual property (IP) by creating a "non-transferable barrier" that restricts generalization from authorized to unauthorized domains. Recently, a well-designed attack, which restores unauthorized-domain performance by fine-tuning NTL models on a few authorized samples, has highlighted the security risks of NTL-based applications. However, such an attack requires modifying model weights and is therefore invalid in the black-box scenario. This raises a critical question: can we trust the security of NTL models deployed as black-box systems? In this work, we reveal the first loophole of black-box NTL models by proposing a novel attack method (dubbed JailNTL) that jailbreaks the non-transferable barrier through test-time data disguising. The main idea of JailNTL is to disguise unauthorized data so it can be identified as authorized by the NTL model, thereby bypassing the non-transferable barrier without modifying the NTL model weights. Specifically, JailNTL encourages unauthorized-domain disguising at two levels: (i) data-intrinsic disguising (DID), which eliminates domain discrepancy and preserves class-related content at the input level, and (ii) model-guided disguising (MGD), which mitigates output-level statistical differences of the NTL model. Empirically, when attacking state-of-the-art (SOTA) NTL models in the black-box scenario, JailNTL achieves an accuracy increase of up to 54.3% in the unauthorized domain using only 1% of authorized samples, largely exceeding existing SOTA white-box attacks.
Improving the Training of Data-Efficient GANs via Quality Aware Dynamic Discriminator Rejection Sampling
Zhaoyu Zhang · Yang Hua · Guanxiong Sun · Hui Wang · Seán F. McLoone
Data-Efficient Generative Adversarial Networks (DE-GANs) have become increasingly popular in recent years. Existing methods apply data augmentation, noise injection, and pre-trained models to increase the number of training samples as much as possible, thereby improving the training of DE-GANs. However, none of these methods considers sample quality during training, which can also significantly influence DE-GAN training. Focusing on sample quality during training, in this paper we are the first to incorporate discriminator rejection sampling (DRS) into the training process and introduce a novel method called quality-aware dynamic discriminator rejection sampling (QADDRS). Specifically, QADDRS consists of two steps: (1) the sample quality aware step, which obtains the sorted critic scores, i.e., the ordered discriminator outputs, on real/fake samples in the current training stage; (2) the dynamic rejection step, which obtains the dynamic rejection number $N$, where $N$ is controlled by the overfitting degree of $D$ during training. When updating the parameters of $D$, the $N$ real samples with the highest critic scores and the $N$ fake samples with the lowest critic scores in the minibatch are rejected, dynamically, based on the overfitting degree of $D$. As a result, QADDRS prevents $D$ from becoming overly confident in distinguishing both real and fake samples, thereby alleviating the overfitting of $D$ during training. Extensive experiments on several datasets demonstrate that QADDRS achieves better performance across different DE-GANs and delivers state-of-the-art performance compared with other GANs and diffusion models.
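A schematic sketch of the dynamic rejection step: given critic scores and a measured overfitting degree of $D$, drop the easiest real and fake samples before the discriminator update. How the overfitting degree is computed, and the cap on the rejected fraction, are assumptions for illustration.

```python
import torch

def qaddrs_select(real_scores, fake_scores, overfit_degree, max_frac=0.5):
    """Return boolean masks selecting the samples kept for the D update,
    after rejecting the N highest-scoring real and N lowest-scoring fake
    samples, with N growing with the overfitting degree of D.

    real_scores, fake_scores: critic outputs of D on the minibatch.
    overfit_degree: scalar in [0, 1], e.g. derived from D's train/val gap
        (an assumption; the paper defines its own measure).
    """
    n_real, n_fake = real_scores.shape[0], fake_scores.shape[0]
    n_reject = int(max_frac * min(n_real, n_fake) * float(overfit_degree))
    keep_real = torch.ones(n_real, dtype=torch.bool)
    keep_fake = torch.ones(n_fake, dtype=torch.bool)
    if n_reject > 0:
        keep_real[torch.topk(real_scores, n_reject).indices] = False    # drop easiest reals
        keep_fake[torch.topk(-fake_scores, n_reject).indices] = False   # drop easiest fakes
    return keep_real, keep_fake
```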
EntropyMark: Towards More Harmless Backdoor Watermark via Entropy-based Constraint for Open-source Dataset Copyright Protection
Ming Sun · Rui Wang · Zixuan Zhu · Lihua Jing · Yuanfang Guo
High-quality open-source datasets are essential for advancing deep neural networks. However, the unauthorized commercial use of these datasets has raised significant concerns about copyright protection. One promising approach is backdoor watermark-based dataset ownership verification (BW-DOV), in which dataset protectors implant specific backdoors into illicit models through dataset watermarking, enabling tracing these models through abnormal prediction behaviors. Unfortunately, the targeted nature of these BW-DOV methods can be maliciously exploited, potentially leading to harmful side effects. While existing harmless methods attempt to mitigate these risks, watermarked datasets can still negatively affect prediction results, partially compromising dataset functionality. In this paper, we propose a more harmless backdoor watermark, called EntropyMark, which improves prediction confidence without altering the final prediction results. For this purpose, an entropy-based constraint is introduced to regulate the probability distribution. Specifically, we design an iterative clean-label dataset watermarking framework. Our framework employs gradient matching and adaptive data selection to optimize backdoor injection. In parallel, we introduce a hypothesis test method grounded in entropy inconsistency to verify dataset ownership. Extensive experiments on benchmark datasets demonstrate the effectiveness, transferability, and defense resistance of our approach.
Towards Million-Scale Adversarial Robustness Evaluation With Stronger Individual Attacks
Yong Xie · Weijie Zheng · Hanxun Huang · Guangnan Ye · Xingjun Ma
As deep learning models are increasingly deployed in safety-critical applications, evaluating their vulnerabilities to adversarial perturbations is essential for ensuring their reliability and trustworthiness. Over the past decade, a large number of white-box adversarial robustness methods (i.e., attacks) have been proposed, ranging from single-step to multi-step methods and from individual to ensemble methods. Despite these advances, challenges remain in conducting meaningful and comprehensive robustness evaluations, particularly when it comes to large-scale testing and ensuring evaluations reflect real-world adversarial risks. In this work, we focus on image classification models and propose a novel individual attack method, Probability Margin Attack (PMA), which defines the adversarial margin in the probability space rather than the logits space. We analyze the relationship between PMA and existing cross-entropy or logits-margin-based attacks, showing that PMA outperforms the current state-of-the-art individual methods. Building on PMA, we propose two types of ensemble attacks that balance effectiveness and efficiency. Furthermore, we create a million-scale dataset, CC1M, derived from the existing CC3M dataset, and use it to conduct the first million-scale white-box adversarial robustness evaluation of adversarially-trained ImageNet models. Our findings provide valuable insights into the robustness gaps between individual versus ensemble attacks and small-scale versus million-scale evaluations.
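A minimal sketch of a probability-space margin of the kind PMA builds on: the gap between the probability of the true class and the highest probability among the other classes. The exact loss PMA optimizes (and its sign and normalization conventions) may differ; this is only the natural reading of a probability margin.

```python
import torch
import torch.nn.functional as F

def probability_margin(logits, labels):
    """Margin in probability space: p_true - max over other classes.
    An attack can drive this quantity negative to cross the decision boundary.

    logits: (B, C) model outputs; labels: (B,) int64 ground-truth classes.
    """
    probs = F.softmax(logits, dim=1)
    p_true = probs.gather(1, labels.unsqueeze(1)).squeeze(1)
    probs_other = probs.clone()
    probs_other.scatter_(1, labels.unsqueeze(1), float("-inf"))   # mask the true class
    p_runnerup = probs_other.max(dim=1).values
    return p_true - p_runnerup   # negative once the sample is misclassified
```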
Balancing Two Classifiers via A Simplex ETF Structure for Model Calibration
Jiani Ni · He Zhao · Jintong Gao · Dandan Guo · Hongyuan Zha
In recent years, deep neural networks (DNNs) have demonstrated state-of-the-art performance across various domains. Despite their success, they often face calibration issues, particularly in safety-critical applications such as autonomous driving and healthcare, where unreliable predictions can have serious consequences. Recent research has begun to improve model calibration from the perspective of the classifier, but the design of classifiers for model calibration remains under-explored, and most existing methods ignore calibration errors arising from underconfidence. In this work, we propose BalCAL, a novel method that Balances a learnable classifier and an ETF classifier to address both overconfidence and underconfidence in model Calibration. By introducing a confidence-tunable module and a dynamic adjustment method, we ensure better alignment between model confidence and true accuracy. Extensive experimental validation shows that BalCAL significantly improves model calibration performance while maintaining high predictive accuracy, outperforming existing techniques. This provides a novel solution to the calibration challenges commonly encountered in deep learning.
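For reference, the fixed simplex ETF classifier that such methods balance against a learnable classifier can be constructed as below (the standard neural-collapse ETF: equal-norm class vectors with pairwise cosine -1/(C-1)). How BalCAL combines this with the learnable classifier and the confidence-tunable module is not reproduced here.

```python
import torch

def simplex_etf_weights(num_classes, feat_dim):
    """Construct a fixed simplex ETF weight matrix of shape (feat_dim, C).
    Requires feat_dim >= num_classes."""
    assert feat_dim >= num_classes
    rand = torch.randn(feat_dim, num_classes)
    U, _ = torch.linalg.qr(rand)                       # orthonormal columns, (feat_dim, C)
    I = torch.eye(num_classes)
    ones = torch.ones(num_classes, num_classes) / num_classes
    scale = (num_classes / (num_classes - 1)) ** 0.5
    return scale * U @ (I - ones)                      # class vectors as columns

# usage: logits_etf = features @ simplex_etf_weights(num_classes=10, feat_dim=512)
```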
Incomplete Multi-View Multi-label Learning via Disentangled Representation and Label Semantic Embedding
Xu Yan · Jun Yin · Jie Wen
In incomplete multi-view multi-label learning scenarios, it is crucial to extract consistent and specific representations from different data sources despite missing views, and to fully utilize the incomplete label information. However, most previous approaches ignore the problem of separating view-shared information from view-specific information. To address this problem, in this paper we build an approach that separates view-consistent features from view-specific features under the Variational Autoencoder (VAE) framework. Specifically, we first introduce cross-view reconstruction to learn view-consistent features and extract the shared information in each view through unsupervised pre-training. Subsequently, we develop a disentangling module that learns disentangled specific features by minimizing an upper bound on the mutual information between the consistent and specific features. Finally, we utilize a priori label relevance to guide the learning of label semantic embeddings, aggregating relevant semantic embeddings and maintaining the topology of label relevance in the semantic space. In extensive experiments, our model outperforms existing state-of-the-art algorithms on several real-world datasets, fully validating its strong adaptability in the absence of views and labels.
ROLL: Robust Noisy Pseudo-label Learning for Multi-View Clustering with Noisy Correspondence
Yuan Sun · Yongxiang Li · Zhenwen Ren · Guiduo Duan · Dezhong Peng · Peng Hu
Multi-view clustering (MVC) aims to exploit complementary information from diverse views to enhance clustering performance. Since pseudo-labels can provide additional semantic information, many MVC methods have been proposed to guide unsupervised multi-view learning through pseudo-labels. These methods implicitly assume that the predicted pseudo-labels are correct. However, because training a flawless unsupervised model is challenging, this assumption can easily be violated, leading to the Noisy Pseudo-label Problem (NPP). Moreover, these existing approaches typically rely on the assumption of perfect cross-view alignment. In practice, this assumption is frequently compromised by noise or sensor differences, resulting in the Noisy Correspondence Problem (NCP). Based on the above observations, we reveal and study unsupervised multi-view learning under NPP and NCP. To this end, we propose Robust Noisy Pseudo-label Learning (ROLL) to prevent the overfitting caused by both NPP and NCP. Specifically, we first adopt traditional contrastive learning to warm up the model, thereby generating pseudo-labels in a self-supervised manner. Afterward, we propose noise-tolerant pseudo-label learning to handle the noise in the predicted pseudo-labels, thereby achieving robustness against NPP. To further mitigate overfitting, we present robust multi-view contrastive learning to mitigate the negative impact of NCP. Extensive experiments on five multi-view datasets demonstrate the superior clustering performance of ROLL compared to 11 state-of-the-art methods.
Feature selection is crucial for pinpointing relevant features in high-dimensional datasets, mitigating the 'curse of dimensionality,' and enhancing machine learning performance. Traditional feature selection methods for classification use data from all classes to select features for each class. This paper explores feature selection methods that select features for each class separately, using class models based on low-rank generative methods and introducing a signal-to-noise ratio (SNR) feature selection criterion. This novel approach theoretically guarantees true feature recovery under certain assumptions and is shown to outperform some existing feature selection methods on standard classification datasets.
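A generic per-class SNR filter of the kind described above can be sketched as follows: contrast a feature's mean inside the class against its mean outside, normalized by the pooled spread, and keep the top-scoring features for that class. The paper derives its criterion from low-rank generative class models, so treat this only as an illustrative stand-in.

```python
import numpy as np

def snr_feature_scores(X, y, target_class, eps=1e-8):
    """Per-class SNR score for each feature: |mean inside class - mean outside|
    divided by the summed standard deviations.

    X: (N, D) data matrix; y: (N,) labels; target_class: class to score for.
    """
    in_c = X[y == target_class]
    out_c = X[y != target_class]
    signal = np.abs(in_c.mean(axis=0) - out_c.mean(axis=0))
    noise = in_c.std(axis=0) + out_c.std(axis=0) + eps
    return signal / noise

# select the k highest-SNR features for class c:
# idx = np.argsort(-snr_feature_scores(X, y, c))[:k]
```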
Multi-modal Contrastive Learning with Negative Sampling Calibration for Phenotypic Drug Discovery
Jiahua Rao · Hanjing Lin · Leyu Chen · Jiancong Xie · Shuangjia Zheng · Yuedong Yang
Phenotypic drug discovery presents a promising strategy for identifying first-in-class drugs by bypassing the need for specific drug targets. Recent advances in cell-based phenotypic screening tools, including Cell Painting and the LINCS L1000, provide essential cellular data that capture biological responses to compounds. While the integration of the multi-modal data enhances the use of contrastive learning (CL) methods for molecular phenotypic representation, these approaches treat all negative pairs equally, failing to discriminate molecules with similar phenotypes. To address these challenges, we introduce a foundational framework MINER that dynamically estimates the likelihoods of sample pairs as negative pairs based on uni-modal disentangled representations. In addition, our approach incorporates a mixture fusion strategy to effectively integrate multimodal data, even in cases where certain modalities are missing. Extensive experiments demonstrate that our method enhances both molecular property prediction and molecule-phenotype retrieval accuracy. Moreover, it successfully recommends drug candidates from phenotype for complex diseases documented in the literature. These findings underscore MINER’s potential to advance drug discovery by enabling deeper insights into disease mechanisms and improving drug candidate recommendations.
Multi-modal Medical Diagnosis via Large-small Model Collaboration
Wanyi Chen · Zihua Zhao · Jiangchao Yao · Ya Zhang · Jiajun Bu · Haishuai Wang
Recent advances in medical AI have shown a clear trend towards large models in healthcare. However, developing large models for multi-modal medical diagnosis remains challenging due to a lack of sufficient modal-complete medical data. Most existing multi-modal diagnostic models are relatively small and struggle with limited feature extraction capabilities. To bridge this gap, we propose AdaCoMed, an adaptive collaborative-learning framework that synergistically integrates the off-the-shelf medical single-modal large models with multi-modal small models. Our framework first employs a mixture-of-modality-experts (MoME) architecture to combine features extracted from multiple single-modal medical large models, and then introduces a novel adaptive co-learning mechanism to collaborate with a multi-modal small model. This co-learning mechanism, guided by an adaptive weighting strategy, dynamically balances the complementary strengths between the MoME-fused large model features and the cross-modal reasoning capabilities of the small model. Extensive experiments on two representative multi-modal medical datasets (MIMIC-IV-MM and MMIST ccRCC) across six modalities and four diagnostic tasks demonstrate consistent improvements over state-of-the-art baselines, making it a promising solution for real-world medical diagnosis applications.
Towards All-in-One Medical Image Re-Identification
Yuan Tian · Kaiyuan Ji · Rongzhao Zhang · Yankai Jiang · Chunyi Li · Xiaosong Wang · Guangtao Zhai
Medical image re-identification (MedReID) is under-explored so far, despite its critical applications in personalized healthcare and privacy protection. In this paper, we introduce a thorough benchmark and a unified model for this problem. First, to handle various medical modalities, we propose a novel Continuous Modality-based Parameter Adapter (ComPA). ComPA condenses medical content into a continuous modality representation and dynamically adjusts the modality-agnostic model with modality-specific parameters at runtime, allowing a single model to adaptively learn and process diverse modality data. Furthermore, we integrate medical priors into our model by aligning it with a bag of pre-trained medical foundation models in terms of differential features. Compared to single-image features, modeling inter-image differences better fits the re-identification problem, which involves discriminating between multiple images. We evaluate the proposed model against 25 foundation models and 8 large multi-modal language models across 11 image datasets, demonstrating consistently superior performance. Additionally, we deploy the proposed MedReID technique in two real-world applications, i.e., history-augmented personalized diagnosis and medical privacy protection.
FactCheXcker: Mitigating Measurement Hallucinations in Chest X-ray Report Generation Models
Alice Heiman · Xiaoman Zhang · Emma Chen · Sung Eun Kim · Pranav Rajpurkar
Medical vision-language models often struggle to generate accurate quantitative measurements in radiology reports, leading to hallucinations that undermine clinical reliability. We introduce FactCheXcker, a modular framework that de-hallucinates radiology report measurements by leveraging an improved query-code-update paradigm. Specifically, FactCheXcker employs specialized modules and the code-generation capabilities of large language models to solve measurement queries generated from the original report. After extracting measurable findings, the results are incorporated into an updated report. We evaluate FactCheXcker on endotracheal tube placement, which accounts for an average of 78\% of report measurements, using the MIMIC-CXR dataset and 11 medical report-generation models. Our results show that FactCheXcker significantly reduces hallucinations, improves measurement precision, and maintains the quality of the original reports. Specifically, FactCheXcker improves the performance of all 11 models and achieves an average improvement of 94.0\% in reducing measurement hallucinations, measured by mean absolute error.
Interactive Medical Image Analysis with Concept-based Similarity Reasoning
Ta Duc Huy · Sen Kim Tran · Phan Nguyen · Nguyen Hoang Tran · Tran Bao Sam · Anton van den Hengel · Zhibin Liao · Johan Verjans · Minh-Son To · Vu Minh Hieu Phan
The ability to interpret and intervene in model decisions is important for the adoption of computer-aided diagnosis methods in clinical workflows. Recent concept-based methods link model predictions with interpretable concepts and modify their activation scores to interact with the model. However, these concepts operate at the image level, which hinders the model from pinpointing the exact patches where the concepts are activated. Alternatively, prototype-based methods learn representations from training image patches and compare them with test image patches, using the similarity scores for the final class prediction. However, interpreting the underlying concepts of these patches can be challenging and often necessitates post-hoc guesswork. To address this issue, this paper introduces the novel Concept-based Similarity Reasoning network (CSR), which offers (i) patch-level prototypes with intrinsic concept interpretation, and (ii) spatial interactivity. First, the proposed CSR provides localized explanations by grounding prototypes of each concept on image regions. Second, our model introduces novel spatial-level interaction, allowing doctors to engage directly with specific image areas, making it an intuitive and transparent tool for medical imaging. CSR improves upon prior state-of-the-art interpretable methods by up to 4\% across three biomedical datasets.
Unsupervised Foundation Model-Agnostic Slide-Level Representation Learning
Tim Lenz · Peter Neidlinger · Marta Ligero · Georg Wölflein · Marko van Treeck · Jakob Nikolas Kather
Representation learning of pathology whole-slide images (WSIs) has primarily relied on weak supervision with Multiple Instance Learning (MIL). This approach leads to slide representations highly tailored to a specific clinical task. Self-supervised learning (SSL) has been successfully applied to train histopathology foundation models (FMs) for patch embedding generation. However, generating patient- or slide-level embeddings remains challenging. Existing approaches for slide representation learning extend the principles of SSL from patch-level learning to entire slides by aligning different augmentations of the slide or by utilizing multimodal data. By integrating tile embeddings from multiple FMs, we propose a new single-modality SSL method in feature space that generates useful slide representations. Our contrastive pretraining strategy, called COBRA, employs multiple FMs and an architecture based on Mamba-2. COBRA exceeds the performance of state-of-the-art slide encoders on four different public CPTAC cohorts by at least +3.8% AUC on average, despite being pretrained on only 3048 WSIs from TCGA. Additionally, COBRA is readily compatible at inference time with previously unseen feature extractors.
Fast and Accurate Gigapixel Pathological Image Classification with Hierarchical Distillation Multi-Instance Learning
Jiuyang Dong · Junjun Jiang · Kui Jiang · Jiahan Li · Yongbing Zhang
Although multi-instance learning (MIL) has succeeded in pathological image classification, it faces the challenge of high inference costs due to processing numerous patches from gigapixel whole slide images (WSIs). To address this, we propose HDMIL, a hierarchical distillation multi-instance learning framework that achieves fast and accurate classification by eliminating irrelevant patches. HDMIL consists of two key components: a dynamic multi-instance network (DMIN) and a lightweight instance pre-screening network (LIPN). DMIN operates on high-resolution WSIs, while LIPN operates on the corresponding low-resolution counterparts. During training, DMIN is trained for WSI classification while generating attention-score-based masks that indicate irrelevant patches. These masks then guide the training of LIPN to predict the relevance of each low-resolution patch. During testing, LIPN first determines the useful regions within low-resolution WSIs, which indirectly enables us to eliminate irrelevant regions in high-resolution WSIs, thereby reducing inference time without causing performance degradation. In addition, we design the first Chebyshev-polynomial-based Kolmogorov-Arnold classifier in computational pathology, which enhances the performance of HDMIL through learnable activation layers. Extensive experiments on three public datasets demonstrate that HDMIL outperforms previous state-of-the-art methods, e.g., achieving an improvement of 3.13\% in AUC while reducing inference time by 28.6\% on the Camelyon16 dataset. The project will be made available.
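A compact sketch of a Chebyshev-polynomial Kolmogorov-Arnold layer of the kind used as the classifier: each input feature is squashed to [-1, 1], expanded with the Chebyshev recurrence, and mixed through learned per-(input, output, degree) coefficients. The initialization, normalization, and degree are assumptions; HDMIL's classifier may differ in detail.

```python
import torch
import torch.nn as nn

class ChebyKANLayer(nn.Module):
    """Kolmogorov-Arnold style layer with learnable Chebyshev activations."""

    def __init__(self, in_dim, out_dim, degree=4):
        super().__init__()
        self.degree = degree
        init_scale = (in_dim * (degree + 1)) ** -0.5
        self.coeff = nn.Parameter(torch.randn(in_dim, out_dim, degree + 1) * init_scale)

    def forward(self, x):                       # x: (B, in_dim)
        x = torch.tanh(x)                       # squash into the Chebyshev domain [-1, 1]
        T = [torch.ones_like(x), x]
        for _ in range(2, self.degree + 1):
            T.append(2 * x * T[-1] - T[-2])     # recurrence T_k = 2x T_{k-1} - T_{k-2}
        T = torch.stack(T[: self.degree + 1], dim=-1)      # (B, in_dim, degree + 1)
        return torch.einsum("bik,iok->bo", T, self.coeff)  # (B, out_dim)
```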
ASIGN: An Anatomy-aware Spatial Imputation Graphic Network for 3D Spatial Transcriptomics
Junchao Zhu · Ruining Deng · Tianyuan Yao · Juming Xiong · Chongyu Qu · Junlin Guo · Siqi Lu · Mengmeng Yin · Yu Wang · Shilin Zhao · Haichun Yang · Yuankai Huo
Spatial transcriptomics (ST) is an emerging technology that enables medical computer vision scientists to automatically interpret the molecular profiles underlying morphological features. Currently, however, most deep learning-based ST analyses are limited to two-dimensional (2D) sections, which can introduce diagnostic errors due to the heterogeneity of pathological tissues across 3D sections. Expanding ST to three-dimensional (3D) volumes is challenging due to the prohibitive costs; a 2D ST acquisition already costs over 50 times more than whole slide imaging (WSI), and a full 3D volume with 10 sections can be an order of magnitude more expensive. To reduce costs, scientists have attempted to predict ST data directly from WSI without performing actual ST acquisition. However, these methods typically yield unsatisfying results. To address this, we introduce a novel problem setting: 3D ST imputation using 3D WSI histology sections combined with a single 2D ST slide. To do so, we present the Anatomy-aware Spatial Imputation Graph Network (ASIGN) for more precise, yet affordable, 3D ST modeling. The ASIGN architecture extends existing 2D spatial relationships into 3D by leveraging cross-layer overlap and similarity-based expansion. Moreover, a multi-level spatial attention graph network integrates features comprehensively across different data sources. We evaluated ASIGN on three public spatial transcriptomics datasets, with experimental results demonstrating that ASIGN achieves state-of-the-art performance on both 2D and 3D scenarios.
beta-FFT: Nonlinear Interpolation and Differentiated Training Strategies for Semi-Supervised Medical Image Segmentation
Ming Hu · Jianfu Yin · Zhuangzhuang Ma · Jianheng Ma · Feiyu Zhu · Bingbing Wu · Ya Wen · Meng Wu · C Hu · Bingliang Hu · Quan Wang
Co-training has achieved significant success in the field of semi-supervised learning; however, the homogenization phenomenon, which arises from multiple models tending towards similar decision boundaries, remains inadequately addressed. To tackle this issue, we propose a novel algorithm called β-FFT from the perspectives of data diversity and training structure. First, from the perspective of data diversity, we introduce a nonlinear interpolation method based on the Fast Fourier Transform (FFT). This method generates more diverse training samples by swapping low-frequency components between pairs of images, thereby enhancing the model's generalization capability. Second, from the structural perspective, we propose a differentiated training strategy to alleviate the homogenization issue in co-training. In this strategy, we apply additional training with labeled data to one model in the co-training framework, while employing linear interpolation based on the Beta (β) distribution for the unlabeled data as a regularization term for the additional training. This approach enables us to effectively utilize the limited labeled data while simultaneously improving the model's performance on unlabeled data, ultimately enhancing the overall performance of the system. Experimental results demonstrate that β-FFT outperforms current state-of-the-art (SOTA) methods on three public medical image datasets.
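The FFT-based nonlinear interpolation can be sketched directly: transform a pair of images, swap the low-frequency band around the DC component, and invert the transform. The band radius and the hard rectangular mask are illustrative choices; β-FFT's exact band selection and mixing rule may differ.

```python
import torch

def fft_low_freq_swap(img_a, img_b, radius=0.1):
    """Swap the low-frequency band between two images via FFT.

    img_a, img_b: (C, H, W) tensors; radius: fraction of the spectrum
        (around the centered DC component) treated as low frequency.
    """
    Fa = torch.fft.fftshift(torch.fft.fft2(img_a), dim=(-2, -1))
    Fb = torch.fft.fftshift(torch.fft.fft2(img_b), dim=(-2, -1))
    _, H, W = img_a.shape
    cy, cx = H // 2, W // 2
    ry, rx = int(radius * H / 2), int(radius * W / 2)
    mask = torch.zeros(1, H, W, dtype=torch.bool)
    mask[:, cy - ry:cy + ry + 1, cx - rx:cx + rx + 1] = True   # low-frequency band
    Fa_new = torch.where(mask, Fb, Fa)   # a receives b's low frequencies
    Fb_new = torch.where(mask, Fa, Fb)   # b receives a's low frequencies
    out_a = torch.fft.ifft2(torch.fft.ifftshift(Fa_new, dim=(-2, -1))).real
    out_b = torch.fft.ifft2(torch.fft.ifftshift(Fb_new, dim=(-2, -1))).real
    return out_a, out_b
```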
DyCON: Dynamic Uncertainty-aware Consistency and Contrastive Learning for Semi-supervised Medical Image Segmentation
Maregu Assefa · Muzammal Naseer · IYYAKUTTI IYAPPAN GANAPATHI · Syed Sadaf Ali · Mohamed L Seghier · Naoufel Werghi
Semi-supervised learning in medical image segmentation leverages unlabeled data to reduce annotation burdens through consistency learning. However, current methods struggle with class imbalance and high uncertainty from pathology variations, leading to inaccurate segmentation in 3D medical images. To address these challenges, we present DyCON, a Dynamic Uncertainty-aware Consistency and Contrastive Learning framework that enhances the generalization of consistency methods with two complementary losses: an Uncertainty-aware Consistency Loss (UnCL) and a Focal Entropy-aware Contrastive Loss (FeCL). UnCL enforces global consistency by dynamically weighting the contribution of each voxel to the consistency loss based on its uncertainty, preserving high-uncertainty regions instead of filtering them out. Initially, UnCL prioritizes learning from uncertain voxels with lower penalties, encouraging the model to explore challenging regions. As training progresses, the penalty shifts towards confident voxels to refine predictions and ensure global consistency. Meanwhile, FeCL enhances local feature discrimination in imbalanced regions by introducing dual focal mechanisms and adaptive confidence adjustments into the contrastive principle. These mechanisms jointly prioritize hard positives and negatives while focusing on uncertain sample pairs, effectively capturing subtle lesion variations under class imbalance. Extensive evaluations on four diverse medical image segmentation datasets (ISLES'22, BraTS'19, LA, Pancreas) show DyCON's superior performance against SOTA methods.
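A schematic version of an uncertainty-weighted consistency term in the spirit of UnCL: weight each voxel's student-teacher disagreement by a function of the teacher's predictive entropy, with a scheduling parameter that shifts emphasis from uncertain to confident voxels over training. The exponential weighting and the MSE consistency form are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def uncertainty_weighted_consistency(logits_s, logits_t, beta):
    """Voxel-wise consistency between student and teacher predictions,
    reweighted by the teacher's predictive entropy. With beta < 0 uncertain
    voxels are emphasized (early training); with beta > 0 confident voxels
    dominate (late training).

    logits_s, logits_t: (B, C, D, H, W) student / teacher logits.
    """
    p_t = F.softmax(logits_t, dim=1)
    entropy = -(p_t * torch.log(p_t + 1e-8)).sum(dim=1)            # (B, D, H, W)
    weight = torch.exp(-beta * entropy)                            # dynamic per-voxel weight
    mse = ((F.softmax(logits_s, dim=1) - p_t) ** 2).mean(dim=1)    # per-voxel disagreement
    return (weight * mse).sum() / (weight.sum() + 1e-8)
```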
Rethinking Decoder Design: Improving Biomarker Segmentation Using Depth-to-Space Restoration and Residual Linear Attention
Saad Wazir · Daeyoung Kim
Segmenting biomarkers in medical images is crucial for various biotech applications. Despite advances, Transformer and CNN based methods often struggle with variations in staining and morphology, limiting feature extraction. In medical image segmentation, where datasets often have limited sample availability, recent state-of-the-art (SOTA) methods achieve higher accuracy by leveraging pre-trained encoders, whereas end-to-end methods tend to underperform. This is due to challenges in effectively transferring rich multiscale features from encoders to decoders, as well as limitations in decoder efficiency. To address these issues, we propose an architecture that captures multi-scale local and global contextual information and a novel decoder design, which effectively integrates features from the encoder, emphasizes important channels and regions, and reconstructs spatial dimensions to enhance segmentation accuracy. Our method, compatible with various encoders, outperforms SOTA methods, as demonstrated by experiments on four datasets and ablation studies. Specifically, our method achieves absolute performance gains of 2.76\% on MoNuSeg, 3.12\% on DSB, 2.87\% on Electron Microscopy, and 4.03\% on TNBC datasets compared to existing SOTA methods. The necessary codes and checkpoints for reproduction will be released publicly later.
LesionLocator: Zero-Shot Universal Tumor Segmentation and Tracking in 3D Whole-Body Imaging
Maximilian Rokuss · Yannick Kirchhoff · Seval Akbal · Balint Kovacs · Saikat Roy · Constantin Ulrich · Tassilo Wald · Lukas T. Rotkopf · Heinz-Peter Schlemmer · Klaus Maier-Hein
In this work, we present LesionLocator, a framework for zero-shot longitudinal lesion tracking and segmentation in 3D medical imaging, establishing the first end-to-end model capable of 4D tracking with dense spatial prompts. Our model leverages an extensive dataset of 23,262 annotated medical scans, as well as synthesized longitudinal data across diverse lesion types. The diversity and scale of our dataset significantly enhances model generalizability to real-world medical imaging challenges and addresses key limitations in longitudinal data availability. LesionLocator outperforms all existing promptable models in lesion segmentation by nearly 10 dice points, reaching human-level performance, and achieves state-of-the-art results in lesion tracking, with superior lesion retrieval and segmentation accuracy. LesionLocator not only sets a new benchmark in universal promptable lesion segmentation and automated longitudinal lesion tracking but also provides the first open-access solution of its kind, releasing our synthetic 4D dataset and model to the community, empowering future advancements in medical imaging. Code will be made available at: www.github.com/anonymous
DAMM-Diffusion: Learning Divergence-Aware Multi-Modal Diffusion Model for Nanoparticles Distribution Prediction
Junjie Zhou · Shouju Wang · Yuxia Tang · Qi Zhu · Daoqiang Zhang · WEI SHAO
The prediction of nanoparticle (NP) distribution is crucial for the diagnosis and treatment of tumors. Recent studies indicate that the heterogeneity of the tumor microenvironment (TME) strongly affects the distribution of NPs across tumors. Hence, generating the NP distribution with the aid of multi-modal TME components has become a research hotspot. However, the distribution divergence among multi-modal TME components may cause side effects, i.e., the best uni-modal model may outperform the joint generative model. To address this issue, we propose a Divergence-Aware Multi-Modal Diffusion model (DAMM-Diffusion) that adaptively generates prediction results from uni-modal and multi-modal branches in a unified network. In detail, the uni-modal branch is composed of a U-Net architecture, while the multi-modal branch extends it by introducing two novel fusion modules: the Multi-Modal Fusion Module (MMFM) and the Uncertainty-Aware Fusion Module (UAFM). Specifically, the MMFM fuses features from multiple modalities, while the UAFM learns the uncertainty map used for cross-attention computation. Following the individual prediction results from each branch, the Divergence-Aware Multi-Modal Predictor (DAMMP) module assesses the consistency of the multi-modal data with the uncertainty map, which determines whether the final prediction comes from the multi-modal or the uni-modal branch. We predict the NP distribution given the TME components of tumor vessels and cell nuclei, and the experimental results show that DAMM-Diffusion generates the distribution of NPs with higher accuracy than competing methods. Additional results on the multi-modal brain image synthesis task further validate the effectiveness of the proposed method.
DeformCL: Learning Deformable Centerline Representation for Vessel Extraction in 3D Medical Image
Ziwei Zhao · Zhixing Zhang · Yuhang Liu · Zhao Zhang · Haojun Yu · Dong Wang · Liwei Wang
In the field of 3D medical imaging, accurately extracting and representing the blood vessels with curvilinear structures holds paramount importance for clinical diagnosis. Previous methods have commonly relied on discrete representation like mask, often resulting in local fractures or scattered fragments due to the inherent limitations of the per-pixel classification paradigm. In this work, we introduce DeformCL, a new continuous representation based on Deformable Centerlines, where centerline points act as nodes connected by edges that capture spatial relationships. Compared with previous representations, DeformCL offers three key advantages: natural connectivity, noise robustness, and interaction facility. We present a comprehensive training pipeline structured in a cascaded manner to fully exploit these favorable properties of DeformCL. Extensive experiments on four 3D vessel segmentation datasets demonstrate the effectiveness and superiority of our method. Furthermore, the visualization of curved planar reformation images validates the clinical significance of the proposed framework.
MultiMorph: On-demand Atlas Construction
Mazdak Abulnaga · Andrew Hoopes · Neel Dey · Malte Hoffmann · Bruce Fischl · John Guttag · Adrian V. Dalca
We present a method for constructing anatomical atlases on the fly. An atlas is an image that represents the prototypical structure of a collection of images. Among other uses, atlases play a key role in studies of anatomical variability across populations. Existing atlas construction methods are computationally prohibitive, requiring days to weeks of computation. Consequently, many scientific studies are forced to use suboptimal atlases constructed for different population groups, negatively impacting downstream analyses. In this work, we present MultiMorph, a model that rapidly produces 3D anatomical atlases for any set of brain MRI images. MultiMorph enables medical researchers with no machine learning background to rapidly construct high-quality population-specific atlases in a single forward network pass, without requiring any fine-tuning or optimization. MultiMorph is based on a linear group-interaction layer that aggregates and shares features within the group of input images. We demonstrate that MultiMorph outperforms state-of-the-art optimization-based and machine-learning-based atlas construction methods in both small and large population settings. It generates better atlases with a 100-fold reduction in computational time. Further, we demonstrate generalization to new imaging modalities and population groups at test time.
Anatomical Consistency and Adaptive Prior-informed Transformation for Multi-contrast MR Image Synthesis via Diffusion Model
Yejee Shin · Yeeun Lee · Hanbyol Jang · Geonhui Son · Hyeongyu Kim · Dosik Hwang
Multi-contrast magnetic resonance (MR) images offer critical diagnostic information but are limited by long scan times and high cost. While diffusion models (DMs) excel in medical image synthesis, they often struggle to maintain anatomical consistency and utilize the diverse characteristics of multi-contrast MR images effectively. We propose APT, a unified diffusion model designed to generate accurate and anatomically consistent multi-contrast MR images. APT introduces a mutual information fusion module and an anatomical consistency loss to preserve critical anatomical structures across multiple contrast inputs. To enhance synthesis, APT incorporates a two-stage inference process: in the first stage, a prior codebook provides coarse anatomical structures by selecting appropriate guidance based on precomputed similarity mappings and Bézier curve transformations. The second stage applies iterative unrolling with weighted averaging to refine the initial output, enhancing fine anatomical details and ensuring structural consistency. This approach enables the preservation of both global structures and local details, resulting in realistic and diagnostically valuable synthesized images. Extensive experiments on public multi-contrast MR brain images demonstrate that our approach significantly outperforms state-of-the-art methods.
CrossSDF: 3D Reconstruction of Thin Structures From Cross-Sections
Thomas Walker · Salvatore Esposito · Daniel Rebain · Amir Vaxman · Arno Onken · Changjian Li · Oisin Mac Aodha
Reconstructing complex structures from planar cross-sections is a challenging problem, with wide-reaching applications in medical imaging, manufacturing, and topography. Out-of-the-box point cloud reconstruction methods can often fail due to the data sparsity between slicing planes, while current bespoke methods struggle to reconstruct thin geometric structures and preserve topological continuity. This is important for medical applications where thin vessel structures are present in CT and MRI scans. This paper introduces CrossSDF, a novel approach for extracting a 3D signed distance field from 2D signed distances generated from planar contours. Our approach makes the training of neural SDFs contour-aware by using losses designed for the case where geometry is known within 2D slices. Our results demonstrate a significant improvement over existing methods, effectively reconstructing thin structures and producing accurate 3D models without the interpolation artifacts or over-smoothing of prior approaches.