CVPR 2026 Orals

Oral

A Style is Worth One Code: Unlocking Code-to-Style Image Generation with Discrete Style Space

Huijie Liu ⋅ Shuhao Cui ⋅ Haoxiang Cao ⋅ Shuai Ma ⋅ Kai Wu ⋅ Guoliang Kang

Jun 5, 9:15 AM - 9:30 AM Bluebird Ballroom

Innovative visual stylization is a cornerstone of artistic creation, yet generating novel and consistent visual styles remains a significant challenge. Existing generative approaches typically rely on lengthy textual prompts, reference images, or parameter-efficient fine-tuning to guide style-aware image generation, but often struggle with style consistency, limited creativity, and complex style representations. In this paper, we consider the code-to-style image generation task, which aims to produce images with novel and consistent visual styles specified by only a numerical code. To date, this field has been explored primarily by industry (e.g., Midjourney), with no open-source research from the academic community. To fill this gap, we propose CoTyle, the first open-source method for this task. Specifically, we first train a discrete style codebook from a collection of images to extract style embeddings. These embeddings serve as conditions for a text-to-image diffusion model (T2I-DM) to generate stylistic images. Subsequently, we train an autoregressive style generator on the discrete style embeddings to model their distribution, allowing the synthesis of novel style embeddings. During inference, a numerical style code is mapped to a unique style embedding by the style generator, and this embedding guides the T2I-DM to generate images in the corresponding style. Extensive experiments validate that CoTyle effectively converts a numerical code into a style controller, demonstrating that a style is worth one code. Compared to existing methods, the stylized images generated by our method are more diverse and consistent, unlocking a vast space of reproducible styles from minimal input.

Oral

Advancing Image Classification with Discrete Diffusion Classification Modeling

Omer Belhasin ⋅ Shelly Golan ⋅ Ran El-Yaniv ⋅ Michael Elad

Jun 5, 9:15 AM - 9:27 AM Mile High Ballroom 1A - 2A

Image classification is a well-studied task in computer vision, and yet it remains challenging under high-uncertainty conditions, such as when input images are corrupted or training data are limited. Conventional classification approaches typically train models to directly predict class labels from input images, but this might lead to suboptimal performance in such scenarios. To address this issue, we propose Discrete Diffusion Classification Modeling (DiDiCM), a novel framework that leverages a diffusion-based procedure to model the posterior distribution of class labels conditioned on the input image. DiDiCM supports diffusion-based predictions either on class probabilities or on discrete class labels, providing flexibility in computation and memory trade-offs. We conduct a comprehensive empirical study demonstrating the superior performance of DiDiCM over standard classifiers, showing that a few diffusion iterations achieve higher classification accuracy on the ImageNet dataset compared to baselines, with accuracy gains increasing as the task becomes more challenging.

Oral

Customized Fusion: A Closed-Loop Dynamic Network for Adaptive Multi-Task-Aware Infrared-Visible Image Fusion

Zengyi Yang ⋅ Yu Liu ⋅ Juan Cheng ⋅ Zhiqin Zhu ⋅ Yafei Zhang ⋅ Huafeng Li

Jun 5, 9:15 AM - 9:27 AM Mile High Ballroom 3A - 4A

Infrared-visible image fusion aims to integrate complementary information for robust visual understanding, but existing fusion methods struggle with simultaneously adapting to multiple downstream tasks. To address this issue, we propose a Closed-Loop Dynamic Network (CLDyN) that can adaptively respond to the semantic requirements of diverse downstream tasks for task-customized image fusion. Specifically, CLDyN introduces a closed-loop optimization mechanism that establishes a semantic transmission chain to achieve explicit feedback from downstream tasks to the fusion network through a Requirement-driven Semantic Compensation (RSC) module. The RSC module leverages a Basis Vector Bank (BVB) and an Architecture-Adaptive Semantic Injection (A2SI) block to customize the network architecture according to task requirements, thereby enabling task-specific semantic compensation and allowing the fusion network to actively adapt to diverse tasks without retraining. To promote accurate semantic compensation, a reward-penalty strategy is introduced to reward or penalize the RSC module based on task performance variations. Experiments on the M3FD, FMB, and VT5000 datasets demonstrate that CLDyN not only maintains high fusion quality but also exhibits strong multi-task adaptability.

Oral

Black-box Membership Inference Attacks on the Pre-training Data of Image-generation Models

Tao Qi ⋅ Huili Wang ⋅ Yuanhong Huang ⋅ Wendan Wang ⋅ Lianchao Zhao ⋅ Jinrui Wang ⋅ Zichen Qin ⋅ Shangguang Wang ⋅ Yongfeng Huang

Jun 5, 9:15 AM - 9:27 AM Four Seasons Ballroom

The rapid advancement of diffusion-based image generation models has raised serious concerns regarding potential copyright and privacy infringements involving human-created data. Membership inference attacks (MIAs) have emerged as a promising tool for identifying unauthorized data usage during model training. Existing methods typically assess the ability of a model to denoise perturbed suspect images as an indicator of membership status. However, the discriminative power of such features is highly dependent on the degree of model memorization and deteriorates significantly when applied to less exposed data (e.g., pre-training data). Although several methods attempt to enhance detection by leveraging internal model features, these features are generally inaccessible in mainstream closed-source image generation platforms, limiting their practicality. In this paper, we demonstrate that analyzing how a black-box diffusion model denoises a target image and corresponding perturbed textual instructions can reveal more distinctive membership cues. Based on this insight, we propose a black-box membership inference attack framework (named SD-MIA) that leverages a cross-modal data perturbation mechanism to detect pre-training data in diffusion models. We conduct extensive experiments on both a public benchmark dataset and a newly constructed dataset, each comprising pre-training membership and non-membership samples with identical distributions. Experimental results demonstrate that SD-MIA achieves superior performance compared to existing baselines, including those with the unfair advantage of accessing internal model features.

Oral

Does YOLO Really Need to See Every Training Image in Every Epoch?

Xingxing Xie ⋅ Jiahua Dong ⋅ Junwei Han ⋅ Gong Cheng

Jun 5, 9:27 AM - 9:40 AM Mile High Ballroom 1A - 2A

YOLO detectors are known for their fast inference speed, yet training them remains unexpectedly time-consuming due to their exhaustive pipeline that processes every training image in every epoch, even when many images have already been sufficiently learned. This stands in clear contrast to the efficiency suggested by the "You Only Look Once" philosophy. This naturally raises an important question: Does YOLO really need to see every training image in every epoch? To explore this, we propose an Anti-Forgetting Sampling Strategy (AFSS) that dynamically determines which images should be used and which can be skipped during each epoch, allowing the detector to learn more effectively and efficiently. Specifically, AFSS measures the learning sufficiency of each training image as the minimum of its detection recall and precision, and dynamically categorizes training images into easy, medium, or hard levels accordingly. Easy training images are sparsely resampled during training in a continuous review manner, with priority given to those that have not been used for a long time to reduce redundancy and prevent forgetting. Medium training images are partially selected, with priority given to recently unused ones and the remaining quota filled randomly to ensure short-term coverage and prevent forgetting. Hard training images are fully sampled in every epoch to ensure sufficient learning. The learning sufficiency of each training image is periodically updated, enabling detectors to adaptively shift their focus toward the informative training images over time while progressively discarding redundant ones. On widely used natural image detection benchmarks (MS COCO 2017 and PASCAL VOC 2007) and remote sensing detection datasets (DOTA-v1.0 and DIOR-R), AFSS achieves more than 1.43× training speedup for YOLO-series detectors (e.g., YOLOv8, YOLOv10, YOLO11, YOLO12) while also improving detection accuracy.
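
The sampling rule described in this abstract can be illustrated with a minimal Python sketch. Only the sufficiency score (the minimum of recall and precision) and the three-level treatment come from the abstract; the thresholds, the sampling fractions, and every function name below are hypothetical choices made for illustration, not values from the paper.

```python
import random


def learning_sufficiency(recall, precision):
    """Sufficiency of one image: the minimum of its detection recall and precision."""
    return min(recall, precision)


def categorize(score, easy_thr=0.85, hard_thr=0.55):
    """Map a sufficiency score to a level. Both thresholds are assumed values."""
    if score >= easy_thr:
        return "easy"
    if score < hard_thr:
        return "hard"
    return "medium"


def sample_epoch(scores, last_used, epoch, easy_frac=0.25, medium_frac=0.5, seed=0):
    """Pick the image ids to train on in this epoch.

    scores: {image_id: sufficiency}; last_used: {image_id: epoch last sampled}.
    The fractions are assumptions; the abstract does not state them.
    """
    rng = random.Random(seed)
    groups = {"easy": [], "medium": [], "hard": []}
    for img, score in scores.items():
        groups[categorize(score)].append(img)

    # Hard images are sampled in every epoch.
    chosen = list(groups["hard"])

    # Easy images: a sparse subset, longest-unused first.
    n_easy = int(len(groups["easy"]) * easy_frac)
    by_staleness = sorted(groups["easy"], key=lambda i: last_used.get(i, -1))
    chosen += by_staleness[:n_easy]

    # Medium images: those skipped in the previous epoch first, then a random fill.
    n_medium = int(len(groups["medium"]) * medium_frac)
    stale = [i for i in groups["medium"] if last_used.get(i, -1) < epoch - 1]
    stale_set = set(stale)
    fresh = [i for i in groups["medium"] if i not in stale_set]
    picked = stale[:n_medium]
    picked += rng.sample(fresh, min(len(fresh), n_medium - len(picked)))
    chosen += picked

    # Record usage so later epochs can prioritize images that were skipped.
    for img in chosen:
        last_used[img] = epoch
    return chosen
```

In this sketch the sufficiency scores are passed in as a plain dictionary; in practice they would be recomputed periodically from the detector's predictions, as the abstract notes.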

Oral

Data Leakage Detection and De-duplication in Large Scale Geospatial Image Datasets

Yeshwanth Kumar Adimoolam ⋅ Charalambos Poullis ⋅ Melinos Averkiou

Jun 5, 9:27 AM - 9:40 AM Four Seasons Ballroom

In our study, we conducted a comprehensive analysis of three widely used datasets in the domain of building footprint extraction using deep neural networks: the INRIA Aerial Image Labelling dataset, SpaceNet 2: Building Detection v2, and the AICrowd Mapping Challenge datasets. Our experiments revealed several issues in the AICrowd Mapping Challenge dataset, where nearly 90% (about 250k) of the training split images had identical copies, indicating a high level of duplicate data. Additionally, we found that approximately 56k of the 60k images in the validation split were also present in the training split, amounting to 93% data leakage. Furthermore, we present a data validation pipeline to address these issues of duplication and data leakage, which hinder the performance of models trained on such datasets. Employing perceptual hashing techniques, this pipeline is designed for efficient de-duplication and leakage identification. It aims to thoroughly evaluate the quality of datasets before their use, thereby ensuring the reliability and robustness of the trained models.
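
As a rough illustration of how perceptual hashing can flag train/validation leakage, here is a minimal Python sketch. The abstract does not say which hash the authors use; the average hash below is a stand-in chosen for simplicity, and the function names, the 8×8 hash size, and the distance threshold are all assumptions.

```python
def average_hash(gray, hash_size=8):
    """Average hash of a grayscale image given as a 2D list of intensities.

    Assumes the height and width are divisible by hash_size.
    """
    height, width = len(gray), len(gray[0])
    bh, bw = height // hash_size, width // hash_size
    cells = []
    for by in range(hash_size):
        for bx in range(hash_size):
            block = [
                gray[y][x]
                for y in range(by * bh, (by + 1) * bh)
                for x in range(bx * bw, (bx + 1) * bw)
            ]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for cell in cells:
        # One bit per cell: is it brighter than the image-wide mean?
        bits = (bits << 1) | (1 if cell > mean else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def find_leaks(train, val, max_dist=0):
    """Return ids of validation images whose hash is within max_dist of a training hash."""
    train_hashes = {average_hash(img) for img in train.values()}
    leaks = []
    for vid, img in val.items():
        hv = average_hash(img)
        if any(hamming(hv, ht) <= max_dist for ht in train_hashes):
            leaks.append(vid)
    return leaks
```

With `max_dist=0` only images with identical hashes are reported; a small positive threshold would also catch near-duplicates, at the cost of more false matches.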

Oral

Dual Band Thermal Videography: Separating Time-Varying Reflection and Emission Near Ambient Conditions

Sriram Narayanan ⋅ Mani Ramanagopal ⋅ Srinivasa G. Narasimhan

Jun 5, 9:27 AM - 9:40 AM Mile High Ballroom 3A - 4A

Long-wave infrared radiation captured by a thermal camera includes (a) emission from an object governed by its temperature and emissivity, and (b) reflected radiation from the surrounding environment. Separating these components is a long-standing challenge in thermography. Even when using multiple bands, the problem is under-determined without priors on emissivity. This difficulty is amplified in near ambient conditions, where emitted and reflected signals are of comparable magnitude. We present a dual-band video thermography framework that reduces this ambiguity by combining two complementary ideas at a per-pixel level: (i) spectral cues (ratio of emissivity between bands is unknown but fixed), and (ii) temporal cues (object radiation changes smoothly while background radiation changes rapidly). We derive an image formation model and an algorithm to jointly estimate the object's emissivity at each band, and the time-varying object and background temperatures. Experiments with calibrated and uncalibrated emissivities in everyday scenes (e.g., coffee pot heating up, palm print on mirrors dissipating, reflections of moving people), demonstrate robust separation and recovery of temperature fields. We will release code and data upon acceptance.

Oral

ANTS: Adaptive Negative Textual Space Shaping for OOD Detection via Test-Time MLLM Understanding and Reasoning

Wenjie Zhu ⋅ Yabin Zhang ⋅ Xin Jin ⋅ Wenjun Zeng ⋅ Lei Zhang

Jun 5, 9:30 AM - 9:45 AM Bluebird Ballroom

The introduction of negative labels (NLs) has proven effective in enhancing Out-of-Distribution (OOD) detection. However, existing methods often lack an understanding of OOD images, making it difficult to construct an accurate negative space. Furthermore, the absence of negative labels semantically similar to ID labels constrains their capability in near-OOD detection. To address these issues, we propose shaping an Adaptive Negative Textual Space (ANTS) by leveraging the understanding and reasoning capabilities of multimodal large language models (MLLMs). Specifically, we cache images likely to be OOD samples from the historical test images and prompt the MLLM to describe these images, generating expressive negative sentences that precisely characterize the OOD distribution and enhance far-OOD detection. For the near-OOD setting, where OOD samples resemble the in-distribution (ID) subset, we cache the subset of ID classes that are visually similar to historical test images and then leverage MLLM reasoning to generate visually similar negative labels tailored to this subset, effectively reducing false negatives and improving near-OOD detection. To balance these two types of negative textual spaces, we design an adaptive weighted score that enables the method to handle different OOD task settings (near-OOD and far-OOD), making it highly adaptable in open environments. On the ImageNet benchmark, our ANTS significantly reduces the FPR95 by 3.1%, establishing a new state-of-the-art. Furthermore, our method is training-free and zero-shot, enabling high scalability.
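
For readers unfamiliar with the FPR95 metric cited in this abstract, the following Python sketch shows one common way to compute it. It assumes that a higher score means "more in-distribution" and that ID samples are the positive class; the function name and tie handling are illustrative assumptions, not the authors' evaluation code.

```python
import math


def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """False positive rate on OOD samples at the threshold that keeps `tpr` of ID samples."""
    ranked = sorted(id_scores, reverse=True)
    # Smallest number of top-ranked ID samples that reaches the target TPR.
    k = max(1, math.ceil(tpr * len(ranked) - 1e-9))
    threshold = ranked[k - 1]
    # OOD samples scoring at or above the threshold are wrongly accepted as ID.
    false_pos = sum(1 for s in ood_scores if s >= threshold)
    return false_pos / len(ood_scores)
```

A lower value is better: it means fewer OOD images are mistaken for in-distribution ones when 95% of ID images are correctly accepted.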

Oral

Fine-grained Image Aesthetic Assessment: Learning Discriminative Scores from Relative Ranks

Zhichao Yang ⋅ Jianjie Wang ⋅ Zhixianhe Zhang ⋅ Pangu Xie ⋅ Xiangfei Sheng ⋅ Pengfei Chen ⋅ Leida Li

Jun 5, 9:40 AM - 9:52 AM Mile High Ballroom 1A - 2A

Image aesthetic assessment (IAA) has extensive applications in content creation, album management, and recommendation systems. In such applications, one often needs to pick out the most aesthetically pleasing image from a series of images with subtle aesthetic variations, a topic we refer to as fine-grained IAA. Unfortunately, state-of-the-art IAA models are typically designed for coarse-grained evaluation, where images with notable aesthetic differences are evaluated independently on an absolute scale. These models are inherently limited in discriminating fine-grained aesthetic differences. To address the dilemma, we contribute FGAesthetics, a fine-grained IAA database with 32,217 images organized into 10,028 series, which are sourced from diverse categories including Natural, AIGC, and Cropping. Annotations are collected via pairwise comparisons within each series. We also devise Series Refinement and Rank Calibration to ensure the reliability of data and labels. Based on FGAesthetics, we further propose FGAesQ, a novel IAA framework that learns discriminative aesthetic scores from relative ranks through Difference-preserved Tokenization (DiffToken), Comparative Text-assisted Alignment (CTAlign), and Rank-aware Regression (RankReg). FGAesQ enables accurate aesthetic assessment in fine-grained scenarios while still maintaining competitive performance in coarse-grained evaluation. Extensive experiments and comparisons demonstrate the superiority of the proposed method. Data and model will be made publicly available.

Oral

RAVEN: Erasing Invisible Watermarks via Novel View Synthesis

Fahad Shamshad ⋅ Nils Lukas ⋅ Karthik Nandakumar

Jun 5, 9:40 AM - 9:52 AM Four Seasons Ballroom

Invisible watermarking has become a critical mechanism for authenticating AI-generated image content, with major platforms deploying watermarking schemes at scale. However, evaluating the vulnerability of these schemes against sophisticated removal attacks remains essential to assess their reliability and guide robust design. In this work, we expose a fundamental vulnerability in invisible watermarks by reformulating watermark removal as a view synthesis problem. Our key insight is that generating a perceptually consistent alternative "view" of the same semantic content, akin to re-observing a scene from a shifted perspective, naturally removes the embedded watermark while preserving visual fidelity. This reveals a critical gap: watermarks robust to pixel-space and frequency-domain attacks remain vulnerable to semantic-preserving viewpoint transformations. We introduce a zero-shot diffusion-based framework that applies controlled geometric transformations in latent space, augmented with view-guided correspondence attention to maintain structural consistency during reconstruction. Operating on frozen pre-trained models without detector access or watermark knowledge, our method achieves state-of-the-art watermark suppression across 15 watermarking methods, outperforming 14 baseline attacks while maintaining superior perceptual quality across multiple datasets.

Oral

MetaSpectra+: A Compact Broadband Metasurface Camera for Snapshot Hyperspectral+ Imaging

Yuxuan Liu ⋅ Wei Xu ⋅ Qi Guo

Jun 5, 9:40 AM - 9:52 AM Mile High Ballroom 3A - 4A

We present MetaSpectra+, a compact multifunctional camera that supports two operating modes: (1) snapshot HDR + hyperspectral or (2) snapshot polarization + hyperspectral imaging. It utilizes a novel metasurface-refractive assembly that splits the incident beam into multiple channels and independently controls each channel’s dispersion, exposure, and polarization. Unlike prior multifunctional metasurface imagers restricted to narrow (10–100 nm) bands, MetaSpectra+ operates over nearly the entire visible spectrum (250 nm). Relative to snapshot hyperspectral imagers, it achieves the shortest total track length and the highest reconstruction accuracy on benchmark datasets. The demonstrated prototype reconstructs high-quality hyperspectral datacubes and either an HDR image or two orthogonal polarization channels from a snapshot measurement.

Oral

ARGUS: Defending Against Multimodal Indirect Prompt Injection via Steering Instruction-Following Behavior

Weikai Lu ⋅ Ziqian Zeng ⋅ Kehua Zhang ⋅ Haoran Li ⋅ Huiping Zhuang ⋅ Ruidong Wang ⋅ Cen Chen ⋅ Hao Peng

Jun 5, 9:45 AM - 10:00 AM Bluebird Ballroom

Multimodal Large Language Models (MLLMs) are increasingly vulnerable to multimodal Indirect Prompt Injection (IPI) attacks, which embed malicious instructions in images, videos, or audio to hijack model behavior. Existing defenses, designed primarily for text-only LLMs, are unsuitable for countering these multimodal threats, as they are easily bypassed, modality-dependent, or generalize poorly. Inspired by activation steering research, we hypothesize that a robust, general defense independent of modality can be achieved by steering the model's behavior in the representation space. Through extensive experiments, we discover that the instruction-following behavior of MLLMs is encoded in a subspace. Steering along directions within this subspace can enforce adherence to user instructions, forming the basis of a defense. However, we also found that a naive defense direction could be coupled with a utility-degrading direction, and excessive intervention strength harms model performance. To address this, we propose ARGUS, which searches for an optimal defense direction within the safety subspace that decouples from the utility degradation direction, further combining adaptive strength steering to achieve a better safety-utility trade-off. ARGUS also introduces a lightweight injection detection stage to activate the defense on demand, and a post-filtering stage to verify defense success. Experimental results show that ARGUS can achieve robust defense against multimodal IPI while maximally preserving the MLLM's utility.
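
The general idea of steering a hidden representation along a defense direction that has been decoupled from a utility-degrading direction can be sketched as follows. This is a generic activation-steering illustration, not the ARGUS search procedure: the orthogonal projection used for decoupling, the fixed steering strength, and all names are assumptions made for this example.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def decouple(defense_dir, utility_dir):
    """Remove from defense_dir its component along utility_dir (one Gram-Schmidt step)."""
    scale = dot(defense_dir, utility_dir) / dot(utility_dir, utility_dir)
    return [d - scale * u for d, u in zip(defense_dir, utility_dir)]


def steer(hidden, direction, strength):
    """Shift a hidden state along a direction: h' = h + strength * direction."""
    return [h + strength * d for h, d in zip(hidden, direction)]
```

In a real model the hidden state would be a layer activation and the strength would be chosen adaptively, as the abstract describes; here it is a plain argument to keep the example self-contained.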

Oral

NuWa: Deriving Lightweight Class-Specific Vision Transformers for Edge Devices

Ziteng Wei ⋅ Qiang He ⋅ Bing Li ⋅ Feifei Chen ⋅ Hai Jin ⋅ Yun Yang

Jun 5, 9:52 AM - 10:05 AM Mile High Ballroom 1A - 2A

Vision Transformers (ViTs) often need to be compressed for deployment on resource-constrained edge devices like drones and smart vehicles. However, existing model compression methods ignore that many edge devices only require the knowledge of specific classes for their applications. As a result, the derived all-class ViTs retain redundant knowledge and perform suboptimally on these classes. We discovered that simply replacing the calibration dataset with class-specific data does not suffice to address this issue, as these methods face two fundamental limitations. First, they overlook the existence of class-detrimental weights, which interfere with specialization, while removing them can improve class-specific performance. Second, the diversity of target classes and resource constraints on edge devices demand numerous customized models. Existing methods are time-consuming and computationally expensive, thus unscalable. In this work, we present NuWa, a cost-efficient method that addresses these challenges by deriving small ViTs from base ViTs for edge devices with specific class requirements. NuWa performs self-knowledge purification to prune class-detrimental weights and efficiently derives compact ViTs through closed-form optimization. Without post-pruning retraining, the derived edge ViTs surpass the base ViT in class-specific accuracy and accelerate inference. Comprehensive experiments demonstrate that NuWa outperforms state-of-the-art training-free pruning methods on class-specific tasks by up to 29.00% in accuracy. Compared with the best-performing training-dependent pruning method, NuWa achieves a 33.69× pruning speedup and reduces pruning cost by up to 99.83%, with only a 0.61% average accuracy loss.

Oral

LDP-Slicing: Local Differential Privacy for Images via Randomized Bit-Plane Slicing

Yuanming Cao ⋅ Chengqi Li ⋅ Wenbo He

Jun 5, 9:52 AM - 10:05 AM Four Seasons Ballroom

Local Differential Privacy (LDP) is the gold standard trust model for privacy-preserving machine learning by guaranteeing privacy at the data source. However, its application to image data has long been considered impractical due to the high dimensionality of pixel space. Canonical LDP mechanisms are designed for low-dimensional data, resulting in severe utility degradation when applied to high-dimensional pixel spaces. This paper demonstrates that this utility loss is not inherent to LDP, but stems from its application to an inappropriate data representation. We introduce LDP-Slicing, a lightweight, training-free framework that resolves this domain mismatch. Our key insight is to decompose pixel values into a sequence of binary bit-planes. This transformation allows us to apply the LDP mechanism directly to the bit-level representation. To further strengthen privacy and preserve utility, we integrate a perceptual obfuscation module that mitigates human-perceivable leakage and an optimization-based privacy budget allocation strategy. This pipeline satisfies rigorous pixel-level $\varepsilon$-LDP while producing images that retain high utility for downstream tasks. Extensive experiments on face recognition and image classification demonstrate that LDP-Slicing outperforms existing DP/LDP baselines under comparable privacy budgets, with negligible computational overhead.
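
The bit-plane idea in this abstract can be illustrated with a small Python sketch that splits an 8-bit pixel into bits and perturbs each bit with binary randomized response, a standard LDP mechanism. The abstract does not name the mechanism or the budget allocation the authors use, so the randomized-response choice, the per-plane budget list, and all function names are assumptions; the perceptual obfuscation module is not modeled.

```python
import math
import random


def to_bit_planes(pixel):
    """Split an 8-bit value into its bits, most significant first."""
    return [(pixel >> k) & 1 for k in range(7, -1, -1)]


def from_bit_planes(bits):
    """Reassemble an integer from bits given most significant first."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value


def randomized_response(bit, eps, rng):
    """Keep the bit with probability e^eps / (1 + e^eps), otherwise flip it."""
    keep = math.exp(eps) / (1.0 + math.exp(eps))
    return bit if rng.random() < keep else 1 - bit


def privatize_pixel(pixel, eps_per_plane, rng):
    """Perturb each bit-plane of one pixel with its own budget (assumed allocation)."""
    bits = to_bit_planes(pixel)
    noisy = [randomized_response(b, e, rng) for b, e in zip(bits, eps_per_plane)]
    return from_bit_planes(noisy)
```

Under basic sequential composition, the per-pixel budget in this sketch is the sum of the eight per-plane budgets, so giving more budget to the high-order planes preserves coarse intensity while the low-order planes absorb most of the noise.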

Oral

TEAR: Temporal-aware Automated Red-teaming for Text-to-Video Models

Jiaming He ⋅ Guanyu Hou ⋅ Hongwei Li ⋅ Zhicong Huang ⋅ Kangjie Chen ⋅ Yi Yu ⋅ Wenbo Jiang ⋅ Guowen Xu ⋅ Tianwei Zhang

Jun 5, 10:00 AM - 10:15 AM Bluebird Ballroom

Text-to-Video (T2V) models are capable of synthesizing high-quality, temporally coherent dynamic video content, but the diverse generation also inherently introduces critical safety challenges. Existing safety evaluation methods, which focus on static image and text generation, are insufficient to capture the complex temporal dynamics in video generation. To address this, we propose TEAR, a TEmporal-aware Automated Red-teaming framework designed to uncover safety risks specifically linked to the dynamic temporal sequencing of T2V models. TEAR employs a temporal-aware test generator optimized via a two-stage approach (initial generator training followed by temporal-aware online preference learning) to craft textually innocuous prompts that exploit temporal dynamics to elicit policy-violating video output. A refine model is also adopted to cyclically improve prompt stealthiness and adversarial effectiveness. Extensive experimental evaluation demonstrates the effectiveness of TEAR across open-source and commercial T2V systems with an attack success rate above 80%, a significant boost over the prior best result of 57%.

Oral

Towards Photorealistic and Efficient Bokeh Rendering via Diffusion Framework

Linxiao Shi ⋅ Siming Zheng ⋅ Zerong Wang ⋅ Hao Zhang ⋅ Jinwei Chen ⋅ Bo Li ⋅ Shifeng Chen ⋅ Peng-Tao Jiang

Jun 5, 10:05 AM - 10:17 AM Mile High Ballroom 3A - 4A

Existing mobile devices are constrained by compact optical designs, such as small apertures, which make it difficult to produce natural, optically realistic bokeh effects. Although recent learning-based methods have shown promising results, they still struggle with photos captured under high digital zoom levels, which often suffer from reduced resolution and loss of fine details. A naive solution is to enhance image quality before applying bokeh rendering, yet this two-stage pipeline reduces efficiency and introduces unnecessary error accumulation. To overcome these limitations, we propose MagicBokeh, a unified diffusion-based framework designed for high-quality and efficient bokeh rendering. Through an alternative training strategy and a focus-aware masked attention mechanism, our method jointly optimizes bokeh rendering and super-resolution, substantially improving both controllability and visual fidelity. Furthermore, we introduce a degradation-aware depth module to enable more accurate depth estimation from low-quality inputs. Experimental results demonstrate that MagicBokeh efficiently produces photorealistic bokeh effects, particularly on real-world low-resolution images, paving the way for future advancements in bokeh rendering. The code will be released publicly.

Oral

NOWA: Null-space Optical Watermark for Invisible Capture Fingerprinting and Tamper Localization

Edwin Vargas ⋅ Jhon Lopez ⋅ Henry Arguello ⋅ Ashok Veeraraghavan

Jun 5, 10:05 AM - 10:17 AM Four Seasons Ballroom

Ensuring the authenticity and ownership of digital images is increasingly challenging as modern editing tools enable highly realistic forgeries. Existing image protection systems mainly rely on digital watermarking, which is susceptible to sophisticated digital attacks. To address this limitation, we propose a hybrid optical-digital framework that incorporates physical authentication cues during image formation and preserves them through a learned reconstruction process. At the optical level, a phase mask in the camera aperture produces a Null-space Optical Watermark (NOWA) that lies in the null space of the imaging operator and therefore remains invisible in the captured image. Then, a Null-Space Network (NSN) performs measurement-consistent reconstruction that delivers high-quality protected images while preserving the NOWA signature. The proposed design enables tamper localization by projecting the image onto the camera's null space and detecting pixel-level inconsistencies. Our design preserves perceptual quality, resists common degradations such as compression, and establishes a structural security asymmetry: without access to the optical or NSN parameters, adversaries cannot forge the NOWA signature. Experiments with simulations and a prototype camera demonstrate competitive performance in terms of image quality preservation and tamper localization accuracy compared to state-of-the-art digital watermarking and learning-based authentication methods.

Oral

Plant Taxonomy Meets Plant Counting: A Fine-Grained, Taxonomic Dataset for Counting Hundreds of Plant Species

Jinyu Xu ⋅ Tianqi Hu ⋅ Xiaonan Hu ⋅ Letian Zhou ⋅ Songliang Cao ⋅ Meng Zhang ⋅ Hao Lu

Jun 5, 10:05 AM - 10:17 AM Mile High Ballroom 1A - 2A

Visually cataloging and quantifying the natural world requires pushing the boundaries of both detailed visual classification and counting at scale. Despite significant progress, particularly in crowd and traffic analysis, fine-grained, taxonomy-aware plant counting remains underexplored in vision. In contrast to crowds, plants are complicated by nonrigid morphologies and physical appearance variations across growth stages and environments. To fill this gap, we present TPC-268, the first plant counting benchmark taking plant taxonomy into account. Our dataset couples instance-level point annotations with complete Linnaean labels (kingdom → species) and organ categories, enabling hierarchical reasoning and species-aware evaluation. The dataset features 10,000 images with 678,090 point annotations, includes 268 countable plant categories over 242 plant species in Plantae and Fungi, and spans observation scales from canopy-level remote sensing imagery to tissue-level microscopy. We follow the problem setting of class-agnostic counting (CAC), provide taxonomy-consistent, scale-aware data splits, and benchmark state-of-the-art regression- and detection-based CAC approaches. By capturing the biodiversity, hierarchical structure, and multi-scale nature of botanical and mycological taxa, TPC-268 provides a biologically grounded testbed to advance fine-grained class-agnostic counting.

Oral

ViT^3: Unlocking Test-Time Training in Vision

Dongchen Han ⋅ Yining Li ⋅ Tianyu Li ⋅ Zixuan Cao ⋅ Ziming Wang ⋅ Jun Song ⋅ Cheng Yu ⋅ Bo Zheng ⋅ Gao Huang

Jun 5, 10:15 AM - 10:30 AM Bluebird Ballroom

Test-Time Training (TTT) has recently emerged as a promising direction for efficient sequence modeling. TTT reformulates the attention operation as an online learning problem, constructing a compact inner model from key–value pairs at test time. This reformulation opens a rich and flexible design space while achieving linear computational complexity. However, crafting a powerful visual TTT design remains challenging: fundamental choices for the inner module and inner training lack comprehensive understanding and practical guidelines. To bridge this critical gap, in this paper, we present a systematic empirical study of TTT designs for visual sequence modeling. From a series of experiments and analyses, we distill six practical insights that establish design principles for effective visual TTT and illuminate paths for future improvement. These findings culminate in the Vision Test-Time Training (ViT$^3$) model, a pure TTT architecture that achieves linear complexity and parallelizable computation. We evaluate ViT$^3$ across diverse visual tasks, including image classification, image generation, object detection, and semantic segmentation. Results show that ViT$^3$ consistently matches or outperforms advanced linear-complexity models (e.g., Mamba and linear attention variants) and effectively narrows the gap to highly optimized vision Transformers. We hope this study and the ViT$^3$ baseline can facilitate future work on visual TTT models. Code will be released.
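
To make the "attention as online learning" view concrete, here is a minimal Python sketch of a generic TTT layer with a linear inner model. It is not the ViT$^3$ design, whose inner module and training recipe the abstract does not specify; the linear inner model, the single gradient step per token, the squared-error inner loss, and the learning rate are all assumptions for illustration.

```python
def ttt_linear(queries, keys, values, lr=0.1):
    """Process a sequence with a linear inner model W trained online on (key, value) pairs.

    For each token: take one gradient step on 0.5 * ||W k - v||^2, then read out W q.
    The cost grows linearly with sequence length because W has a fixed size.
    """
    d_k, d_v = len(keys[0]), len(values[0])
    W = [[0.0] * d_k for _ in range(d_v)]  # inner model, initialized to zero
    outputs = []
    for q, k, v in zip(queries, keys, values):
        # Inner training step: prediction error of the current inner model on (k, v).
        pred = [sum(W[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
        err = [pred[i] - v[i] for i in range(d_v)]
        for i in range(d_v):
            for j in range(d_k):
                W[i][j] -= lr * err[i] * k[j]
        # Readout: apply the updated inner model to the query.
        outputs.append([sum(W[i][j] * q[j] for j in range(d_k)) for i in range(d_v)])
    return outputs
```

With a zero initialization and `lr=1.0`, the first update sets W to the outer product of the value and key, which is why this family of models is often described as a generalization of linear attention.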

Oral

UnReflectAnything: RGB-Only Highlight Removal by Rendering Synthetic Specular Supervision

Alberto Rota ⋅ Mert Kiray ⋅ Mert Asim Karaoglu ⋅ Patrick Ruhkamp ⋅ Elena De Momi ⋅ Nassir Navab ⋅ Benjamin Busam

Jun 5, 10:17 AM - 10:30 AM Mile High Ballroom 3A - 4A

Specular highlights distort appearance, obscure texture, and hinder geometric reasoning in both natural and surgical imagery. We present UnReflectAnything, an RGB-only framework that removes highlights from a single image by predicting a highlight map together with a reflection-free diffuse reconstruction. The model uses a frozen vision transformer encoder to extract multi-scale features, a lightweight head to localize specular regions, and a token-level inpainting module that restores corrupted feature patches before producing the final diffuse image. To overcome the lack of paired supervision, we introduce a Virtual Highlight Synthesis pipeline that renders physically plausible specularities using monocular geometry, Fresnel-aware shading, and randomized lighting, which enables training on arbitrary RGB images with correct geometric structure. UnReflectAnything generalizes across natural and surgical domains, where non-Lambertian surfaces and non-uniform lighting create severe highlights, and it achieves performance competitive with state-of-the-art results on several benchmarks.

Oral

Rethinking Dataset Distillation: Hard Truths about Soft Labels

Priyam Dey ⋅ Aditya Sahdev ⋅ Sunny Bhati ⋅ Konda Reddy Mopuri ⋅ R. Venkatesh Babu

Jun 5, 10:17 AM - 10:30 AM Mile High Ballroom 1A - 2A

Despite the perceived success of large-scale dataset distillation (DD) methods, recent evidence \cite{qin2024a} finds that simple random image baselines perform on par with state-of-the-art DD methods like SRe2L \cite{yin2024squeezerecoverrelabeldataset} due to the use of soft labels during downstream model training. This is in contrast with the findings in the coreset literature, where high-quality coresets consistently outperform random subsets in the hard-label (HL) setting. To understand this discrepancy, we perform a detailed scalability analysis to examine the role of data quality under different label regimes, ranging from abundant soft labels (termed the SL+KD regime) to fixed soft labels (SL) and hard labels (HL). Our analysis reveals that high-quality coresets fail to convincingly outperform the random baseline in both the SL and SL+KD regimes. In the SL+KD setting, performance further approaches near-optimal levels relative to the full dataset, regardless of subset size or quality, for a given compute budget. This performance saturation calls into question the widespread practice of using soft labels for model evaluation, where, unlike in the HL setting, subset quality has negligible influence. A subsequent systematic evaluation of five large-scale and four small-scale DD methods in the HL setting reveals that only RDED \cite{sun2024diversityrealismdistilleddataset} reliably outperforms random baselines on ImageNet-1K, but it can still lag behind strong coreset methods due to its over-reliance on easy sample patches. Based on this, we introduce CAD-Prune, a compute-aware pruning metric that efficiently identifies samples of optimal difficulty for a given compute budget, and use it to develop CA2D, a compute-aligned DD method that outperforms current DD methods on ImageNet-1K at various IPC settings. Together, our findings uncover many insights into current DD research and establish useful tools to advance data-efficient learning for both coresets and DD.

Oral

Revisiting Geometric Obfuscation with Dual Convergent Lines for Privacy-Preserving Image Queries in Visual Localization

Jeonggon Kim ⋅ Heejoon Moon ⋅ Je Hyeong Hong

Jun 5, 10:17 AM - 10:30 AM Four Seasons Ballroom

Privacy-Preserving Image Queries (PPIQ) are an emerging mechanism for cloud-based visual localization, enabling pose estimation from obfuscated features instead of private images or raw keypoints. However, the main approaches for PPIQ, primarily geometry-based and segmentation-based obfuscation, both suffer from vulnerabilities to recent privacy attacks. In particular, a fundamental limitation of geometry-based obfuscation is that the spatial distribution of obfuscated neighboring lines still effectively surrounds the original keypoint location, providing exploitable cues for recovering the original points. We revisit this geometric paradigm and introduce Dual Convergent Lines (DCL), a novel keypoint obfuscation method demonstrating strong resilience against such attacks. DCL places two fixed anchors on a central partition line and lifts each keypoint to a line originating from one of them, with the active anchor determined by the keypoint's location. This arrangement invalidates the geometry-recovery attack by making its optimization ill-posed: neighboring lines either misleadingly converge to one anchor, yielding a trivial solution, or become near-parallel at the partition boundary, yielding an unstable high-variance solution. Both outcomes thwart point recovery. DCL is also compatible with an existing line-based solver, enabling deployment in traditional localization pipelines. Experiments on both indoor and large-scale outdoor datasets demonstrate DCL's robustness against privacy attacks, efficiency, and scalability, while achieving practical localization performance.

Oral

Hearing the Room Through the Shape of the Drum: Modal-Guided Sound Recovery from Multi-Point Surface Vibrations

Shai Bagon ⋅ Matan Kichler ⋅ Mark Sheinin

Jun 5, 10:30 AM - 10:42 AM Mile High Ballroom 3A - 4A

Optical vibration sensing enables recovering the scene sound directly from the surface vibration of nearby objects, turning everyday objects into "visual microphones". However, most prior methods have focused on capturing the vibrations of specific objects with highly favorable vibration responses. These include objects where the surface vibrations are generated by the object itself (e.g., a speaker membrane or guitar body) or objects consisting of a thin membrane that is highly reactive to sound (e.g., a chip bag or the leaf of a plant). In this paper, we tackle sound recovery for a more challenging class of solid objects whose vibration responses are poor or highly resonant. We simultaneously capture vibrations for multiple surface points on the object using a speckle-based vibrometry imaging system. Then, we derive a novel physics-guided vibration formation model that relates the scene sound source to the captured multi-point multi-axis vibrations via the object's vibrational modes. The model is then used to reverse the resonant transfer function of the vibrating object, fusing the plurality of vibration signals to estimate the original sound source of the scene. We evaluate our approach by recovering sound from a variety of everyday objects, demonstrating that it significantly outperforms traditional single-point speckle vibrometry in challenging scenarios where the latter performs poorly.

Oral

4D Primitive-Mâché: Glueing Primitives for Persistent 4D Scene Reconstruction

Kirill Mazur ⋅ Marwan Taher ⋅ Andrew J. Davison

Jun 5, 1:00 PM - 1:12 PM Mile High Ballroom 3A - 4A

We present a dynamic reconstruction system that receives a casual monocular RGB video as input, and outputs a complete and persistent reconstruction of the scene. In other words, we reconstruct not only the currently visible parts of the scene, but also all previously viewed parts, which enables replaying the complete reconstruction across all timesteps. Our method decomposes the scene into a set of rigid 3D primitives, which are assumed to be moving throughout the scene. Using estimated dense 2D correspondences, we jointly infer the rigid motion of these primitives through an optimisation pipeline, yielding a 4D reconstruction of the scene, i.e., providing 3D geometry dynamically moving through time. To achieve this, we also introduce a mechanism to extrapolate motion for objects that become invisible, employing motion-grouping techniques to maintain continuity. The resulting system enables 4D spatio-temporal awareness, offering capabilities such as replayable 3D reconstructions of articulated objects through time, multi-object scanning, and object permanence. On object scanning and multi-object datasets, our system significantly outperforms existing methods both quantitatively and qualitatively.

Oral

MAMMA: Markerless Accurate Multi-person Motion Acquisition

Hanz Cuevas Velasquez ⋅ Anastasios Yiannakidis ⋅ Soyong Shin ⋅ Giorgio Becherini ⋅ Markus Höschle ⋅ Joachim Tesch ⋅ Taylor Obersat ⋅ Tsvetelina Alexiadis ⋅ Eni Halilaj ⋅ Michael J. Black

Jun 5, 1:00 PM - 1:12 PM Bluebird Ballroom

We present MAMMA, a markerless motion-capture pipeline that accurately recovers SMPL-X parameters from multi-view video. Traditional motion-capture systems rely on physical markers. Although they offer high accuracy, their requirements of specialised hardware, manual marker placement, and extensive post-processing make them costly and time-consuming. Recent learning-based methods attempt to overcome these limitations, but most are designed for single-person capture, rely on sparse keypoints, or struggle with occlusions and physical interactions. In this work, we introduce a method that predicts dense 2D surface landmarks conditioned on segmentation masks, enabling person-specific correspondence estimation even under heavy occlusion. We employ a novel architecture that exploits learnable queries for each landmark. We demonstrate that our approach can handle complex person-person interaction and offers greater accuracy than existing methods. To train our network, we construct a large, synthetic multi-view dataset combining human motions from diverse sources, including extreme poses, hand motions, and close interactions. Our dataset yields high-variability synthetic sequences with rich body contact and occlusion, and includes SMPL-X ground-truth annotations with dense 2D landmarks. The result is a system capable of accurately capturing human motion without the need for markers. Our approach offers competitive reconstruction quality compared to commercial marker-based motion-capture solutions, without the extensive manual cleanup. Finally, we address the absence of common benchmarks for dense-landmark prediction and markerless motion capture by introducing two evaluation settings built from real multi-view sequences. We will release our dataset, method, code, and model weights for research purposes.

Oral

3DReflecNet: A Large-Scale Dataset for 3D Reconstruction of Reflective, Transparent, and Low-Texture Objects

Zhicheng Liang ⋅ Haoyi Yu ⋅ Boyan Li ⋅ Dayou Zhang ⋅ Zijian Cao ⋅ Tianyi Gong ⋅ Junhua Liu ⋅ Shuguang Cui ⋅ Fangxin Wang

Jun 5, 1:00 PM - 1:12 PM Four Seasons Ballroom

Accurate 3D reconstruction of objects with reflective, transparent, or low-texture surfaces remains a significant challenge. Such materials often violate key assumptions in multi-view reconstruction pipelines, such as photometric consistency and the reliance on distinct geometric texture cues. Existing datasets primarily focus on diffuse, textured objects, thereby offering limited insight into performance under real-world material complexities. In this paper, we introduce 3DReflecNet, a large-scale hybrid dataset exceeding 22 TB that is specifically designed to benchmark and advance 3D vision methods for these challenging materials. 3DReflecNet combines two types of data: over 100,000 synthetic instances generated via physically-based rendering of more than 10,000 shapes, and over 1,000 real-world objects scanned using consumer RGB-D devices. Together, these data comprise more than 7 million multi-view frames. The dataset encompasses diverse materials, complex lighting conditions, and a wide range of geometric forms, including shapes generated from both real and LLM-synthesized 2D images using diffusion-based methods. To support robust evaluation, we design benchmarks for four core tasks: image matching, reflection removal, structure-from-motion, and novel view synthesis. Through extensive experiments, we show that state-of-the-art methods struggle to maintain accuracy across these settings, highlighting the need for more resilient 3D vision models. We release the dataset, baselines, and evaluation suite to facilitate progress in this direction; they can be accessed in the supplementary materials.

Oral

Energy-GS: Image Energy-guided Pose Alignment Gaussian Splatting with redesigned pose gradient flow

Yu Gao ⋅ Lutong Su ⋅ Ruixiang Huang ⋅ Tianji Jiang ⋅ Jiadong Tang ⋅ Yufeng Yue ⋅ Yi Yang

Jun 5, 1:00 PM - 1:12 PM Mile High Ballroom 1A - 2A

High-quality 3D scene representation in radiance fields relies on accurate camera poses, which are often difficult to acquire in real-world scenarios. An effective solution is to use RGB images for the joint optimization of radiance fields and camera poses, an approach that has been well explored in NeRF-based methods. However, unlike NeRF, joint optimization in 3D Gaussian Splatting (3DGS) often requires additional regularization or prior spatial knowledge to reach comparable performance. To eliminate these dependencies, we introduce Energy-GS, a pose-aware Gaussian splatting framework that jointly optimizes scene representation and camera poses using only RGB images. We observe that pose gradients in joint optimization are unstable due to the point-based rendering mechanism. Furthermore, unlike NeRF’s spatial sampling framework that enables coarse-to-fine pose alignment, rasterization-based 3DGS lacks controllable sampling and thus cannot support progressive pose refinement. To address these challenges, we redesign the optimization strategy of Gaussian primitives and introduce an image-energy-guided constraint that encourages progressive alignment of camera poses. Experiments on both synthetic and real-world datasets show that Energy-GS can effectively optimize the scene reconstruction and resolve camera pose misalignment at the same time. Because it relies only on RGB images, we believe this work provides promising insights for visual localization and dense mapping applications such as SLAM.

Oral

Natural Human Motion Recovery by Aligning High-Order Temporal Dynamics from Monocular Videos

Dingkun Wei ⋅ Zehong Shen ⋅ Yan Xia ⋅ Yujun Shen ⋅ Georgios Pavlakos ⋅ Xiaowei Zhou

Jun 5, 1:12 PM - 1:25 PM Bluebird Ballroom

Human motion recovered from monocular videos often appears overly smooth or dynamically inconsistent, even when joint positions are numerically accurate. We observe that this limitation stems from the absence of reliable high-order temporal cues (velocity and acceleration), which are essential for reconstructing motion that exhibits realistic momentum, timing, and high-frequency detail. We introduce HTD-Refine, a post-processing framework that augments existing Human Motion Recovery (HMR) pipelines using explicitly estimated high-order temporal dynamics. At the core of our system is PVA-Net, a temporal transformer that infers per-joint 2D positions, velocities, and accelerations directly from a monocular video. These predicted dynamics serve as soft yet informative constraints in a global optimization procedure that refines camera-space and world-space trajectories, significantly reducing jitter, suppressing oversmoothing, and restoring physically plausible motion profiles. Extensive experiments on challenging in-the-wild benchmarks show that HTD-Refine consistently improves state-of-the-art HMR methods, yielding more accurate global trajectories and substantially more natural motion dynamics. Our results highlight the critical role of high-order temporal modeling in advancing monocular human motion recovery.
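To make the "high-order temporal cues" concrete, the sketch below computes per-joint velocity and acceleration from a sequence of 2D joint positions by finite differences. It only illustrates what the quantities are; according to the abstract, PVA-Net infers them directly from video rather than differencing positions. The function name `finite_difference_dynamics` and the `(T, J, 2)` array layout are assumptions made for this example.

```python
import numpy as np

def finite_difference_dynamics(positions, fps):
    """Estimate velocity and acceleration from sampled joint positions.

    positions: array of shape (T, J, 2), 2D position of J joints over T frames.
    fps: frame rate in frames per second.
    Returns (velocity, acceleration) with shapes (T-1, J, 2) and (T-2, J, 2).
    """
    dt = 1.0 / fps
    velocity = np.diff(positions, n=1, axis=0) / dt          # first difference
    acceleration = np.diff(positions, n=2, axis=0) / dt**2   # second difference
    return velocity, acceleration
```

Differencing amplifies per-frame noise in the positions, which is one plausible reason to estimate these quantities explicitly rather than derive them afterwards.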

Oral

GLINT: Modeling Scene-Scale Transparency via Gaussian Radiance Transport

Youngju Na ⋅ Jaeseong Yun ⋅ Soohyun Ryu ⋅ Hyunsu Kim ⋅ Sung-Eui Yoon ⋅ Suyong Yeon

Jun 5, 1:12 PM - 1:25 PM Four Seasons Ballroom

While 3D Gaussian splatting has emerged as a powerful paradigm, it fundamentally fails to model transparent surfaces such as glass panels, which are prevalent in everyday environments. The core challenge lies in decoupling the intertwined radiance contributions from transparent interfaces and the transmitted geometry observed through the glass. We present GLINT, a framework that models scene-scale transparency through an explicit decomposed Gaussian representation. GLINT reconstructs the primary interface and separates outgoing radiance into reflection and transmission components according to its optical properties, enabling coherent Gaussian radiance transport. During optimization, GLINT bootstraps transparency localization by combining geometry separation cues that emerge from our decomposition with the geometry and material priors from a pre-trained video relighting model. Extensive experiments demonstrate that GLINT achieves state-of-the-art performance in 3D reconstruction of complex transparent scenes. Our code will be released publicly.

Oral

Efficiently Reconstructing Dynamic Scenes One D4RT at a Time

Chuhan Zhang ⋅ Guillaume Le Moing ⋅ Skanda Koppula ⋅ Ignacio Rocco ⋅ Liliane Momeni ⋅ Junyu Xie ⋅ Shuyang Sun ⋅ Rahul Sukthankar ⋅ Joëlle K. Barral ⋅ Raia Hadsell ⋅ Zoubin Ghahramani ⋅ Andrew Zisserman ⋅ Junlin Zhang ⋅ Mehdi S. M. Sajjadi

Jun 5, 1:12 PM - 1:25 PM Mile High Ballroom 3A - 4A

Understanding and reconstructing the complex geometry and motion of dynamic 4D scenes from video remains a formidable challenge in computer vision. This paper introduces D4RT, a simple yet powerful feedforward network designed to efficiently solve this task. D4RT utilizes a unified transformer architecture to jointly infer depth, spatio-temporal correspondence, and full camera parameters from a single video. Its core innovation is a novel mechanism that sidesteps the heavy computation of dense, per-frame decoding and the complexity of managing multiple, task-specific decoders. Our unified decoding interface allows the model to independently and efficiently probe the 3D position of any point in space and time. The result is a lightweight and highly scalable method that enables remarkably efficient training and inference. We demonstrate that our approach sets a new state of the art, outperforming previous methods across a wide spectrum of 4D reconstruction tasks.

Oral

MeshSplatting: Differentiable Rendering with Opaque Meshes

Jan Held ⋅ Sanghyun Son ⋅ Renaud Vandeghen ⋅ Daniel Rebain ⋅ Matheus Gadelha ⋅ Yi Zhou ⋅ Anthony Cioppa ⋅ Ming C. Lin ⋅ Marc Van Droogenbroeck ⋅ Andrea Tagliasacchi

Jun 5, 1:12 PM - 1:25 PM Mile High Ballroom 1A - 2A

Primitive-based splatting methods like 3D Gaussian Splatting (3DGS) have revolutionized novel view synthesis with real-time rendering. However, their point-based representations remain incompatible with mesh-based pipelines that power AR/VR and game engines. We present MeshSplatting, a mesh-based reconstruction approach that jointly optimizes geometry and appearance through differentiable rendering. By enforcing connectivity via restricted Delaunay triangulation and refining surface consistency, MeshSplatting creates end-to-end smooth, visually high-quality meshes that render efficiently in real-time 3D engines. On Mip-NeRF360, it boosts PSNR by +0.69 dB over the current state-of-the-art MiLo for mesh-based novel view synthesis, while training 2× faster and using 2× less memory, bridging neural rendering and interactive 3D graphics for seamless real-time scene interaction.
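For readers unfamiliar with the metric, the standard definition of PSNR gives a sense of scale for the reported +0.69 dB. The relation below holds for a single image with peak pixel value $\mathrm{MAX}$; benchmark figures are typically averaged over many images, so the implied error ratio is only indicative, and the subscripts "base" and "new" are labels introduced here for illustration.

```latex
\mathrm{PSNR} = 10 \log_{10}\!\left(\frac{\mathrm{MAX}^2}{\mathrm{MSE}}\right),
\qquad
\Delta\mathrm{PSNR} = 10 \log_{10}\!\left(\frac{\mathrm{MSE}_{\text{base}}}{\mathrm{MSE}_{\text{new}}}\right)
\;\Rightarrow\;
\frac{\mathrm{MSE}_{\text{new}}}{\mathrm{MSE}_{\text{base}}} = 10^{-0.69/10} \approx 0.853
```

Under that single-image reading, a gain of 0.69 dB corresponds to roughly 15% lower mean squared error.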

Oral

Proxy-GS: Unified Occlusion Priors for Training and Inference in Structured 3D Gaussian Splatting

Yuanyuan Gao ⋅ YUNING GONG ⋅ Yifei Liu ⋅ Jingfeng Li ⋅ Dan Xu ⋅ Yanci Zhang ⋅ Dingwen Zhang ⋅ Xiao Sun ⋅ Zhihang Zhong

Jun 5, 1:25 PM - 1:37 PM Mile High Ballroom 1A - 2A

3D Gaussian Splatting (3DGS) has emerged as an efficient approach for achieving photorealistic rendering. Recent MLP-based variants further improve visual fidelity but introduce substantial decoding overhead during rendering. To alleviate computation cost, several pruning strategies and level-of-detail (LOD) techniques have been introduced, aiming to effectively reduce the number of Gaussian primitives in large-scale scenes. However, our analysis reveals that significant redundancy still remains due to the lack of occlusion awareness. In this work, we propose Proxy-GS, a novel pipeline that exploits a proxy to introduce Gaussian occlusion awareness from any view. At the core of our approach is a fast proxy system capable of producing precise occlusion depth maps at resolution 1000$\times$1000 in under 1 ms. This proxy serves two roles. First, it guides the culling of anchors and Gaussians to accelerate rendering. Second, it guides the densification towards surfaces during training, avoiding inconsistencies in occluded regions and improving the rendering quality. In heavily occluded scenarios, such as the MatrixCity Streets dataset, Proxy-GS not only equips MLP-based Gaussian splatting with stronger rendering capability but also achieves faster rendering speed than the original 3DGS. Specifically, it achieves more than $2.5\times$ speedup over Octree-GS, and consistently delivers substantially higher rendering quality.
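The culling role of an occlusion depth map can be sketched generically: project each anchor into the view and discard it if it lies behind the depth stored at its pixel. The code below is a minimal sketch of that depth test, assuming a pinhole camera and nearest-pixel lookup; it is not the Proxy-GS implementation, and `cull_occluded`, the intrinsics `fx, fy, cx, cy`, and the tolerance `eps` are hypothetical names chosen for illustration.

```python
import numpy as np

def cull_occluded(points_cam, depth_map, fx, fy, cx, cy, eps=0.05):
    """Return a boolean mask of points that pass a depth test (assumed setup).

    points_cam: (N, 3) points in camera coordinates, z pointing forward.
    depth_map: (H, W) depth of the nearest occluder at each pixel.
    A point is kept if it projects inside the image and its depth does not
    exceed the stored occluder depth by more than eps.
    """
    H, W = depth_map.shape
    z = points_cam[:, 2]
    in_front = z > 0
    safe_z = np.where(in_front, z, 1.0)  # avoid dividing by non-positive depth
    u = np.round(fx * points_cam[:, 0] / safe_z + cx).astype(int)
    v = np.round(fy * points_cam[:, 1] / safe_z + cy).astype(int)
    in_image = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    visible = np.zeros(len(points_cam), dtype=bool)
    idx = np.nonzero(in_image)[0]
    visible[idx] = z[idx] <= depth_map[v[idx], u[idx]] + eps
    return visible
```

Points that fail the test can be skipped before any per-primitive decoding, which is where the savings would come from in an MLP-based pipeline.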

Oral

PoseGAM: Robust Unseen Object Pose Estimation via Geometry-Aware Multi-View Reasoning

Jianqi Chen ⋅ Biao Zhang ⋅ Xiangjun Tang ⋅ Peter Wonka

Jun 5, 1:25 PM - 1:37 PM Bluebird Ballroom

6D object pose estimation, which predicts the transformation of an object relative to the camera, remains challenging for unseen objects. Existing approaches typically rely on explicitly constructing feature correspondences between the query image and either the object model or template images. In this work, we propose PoseGAM, a geometry-aware multi-view framework that directly predicts object pose from a query image and multiple template images, eliminating the need for explicit matching. Built upon recent multi-view-based foundation model architectures, the method integrates object geometry information through two complementary mechanisms: explicit point-based geometry and learned features from geometry representation networks. In addition, we construct a large-scale synthetic dataset containing more than 190k objects under diverse environmental conditions to enhance robustness and generalization. Extensive evaluations across multiple benchmarks demonstrate our state-of-the-art performance, yielding an average AR improvement of 5.1% over prior methods and achieving up to 17.6% gains on individual datasets, indicating strong generalization to unseen objects.

Oral

FUSER: Feed-Forward Multiview 3D Registration Transformer and SE(3)^N Diffusion Refinement

Haobo Jiang ⋅ Jin Xie ⋅ Jian Yang ⋅ Liang Yu ⋅ Jianmin Zheng

Jun 5, 1:25 PM - 1:37 PM Mile High Ballroom 3A - 4A

Registration of multiview point clouds typically depends on extensive pairwise matching to build a pose graph for global synchronization, which is computationally expensive and ill-posed without holistic geometric constraints. In this paper, we propose FUSER, the first feed-forward multi-view registration transformer that processes all scans jointly in a unified, compact latent space to directly predict global poses without any pairwise estimation. To maintain tractability, FUSER employs a sparse 3D CNN to encode each scan into low-resolution superpoint features preserving absolute translation cues, followed by a Geometric Alternating Attention module for efficient intra- and inter-scan reasoning. Particularly, we transfer 2D attention priors from off-the-shelf foundation models (i.e., $\pi^3$) to enhance 3D feature attention. Building upon FUSER and its estimates, we further introduce FUSER-DF, an SE(3) diffusion refinement framework to correct FUSER's estimates through a denoising process over the joint SE(3)$^N$ space. Here, FUSER serves as a surrogate multiview register to model the denoiser, and a prior-conditioned SE(3)$^N$ variational lower bound is derived for denoising supervision. Extensive experiments on 3DMatch and ScanNet confirm the superior registration accuracy and efficiency of our method.

Oral

Neural Field-Based 3D Surface Reconstruction of Microstructures from Multi-Detector Signals in Scanning Electron Microscopy

Shuo Chen ⋅ Yijin Li ⋅ Xi Zheng ⋅ Guofeng Zhang

Jun 5, 1:25 PM - 1:37 PM Four Seasons Ballroom

The 3D characterization of microstructures is crucial for understanding and designing functional materials. However, the scanning electron microscope (SEM), widely used in scientific research, captures only 2D electron intensity distributions. Existing SEM 3D reconstruction methods struggle with textureless regions, shadowing artifacts, and calibration dependencies, whereas advanced learning-based approaches fail to generalize to microscopic SEM domains due to the lack of physical priors and domain-specific data. To address these challenges, we introduce NFH-SEM, a neural field-based hybrid reconstruction framework that recovers high-fidelity 3D surfaces from multi-view, multi-detector SEM images. NFH-SEM integrates coarse multi-view geometry with photometric stereo cues from detector signals through a continuous neural field, incorporating a learnable forward model that embeds SEM imaging physics for self-calibrated, shadow-robust reconstruction. NFH-SEM achieves precise recovery across diverse specimens, revealing 478 nm layered features in two-photon lithography samples, 782 nm surface textures on pollen grains, and 1.559 μm fracture steps on silicon carbide particles, demonstrating its accuracy and broad applicability.

Oral

PhyGaP: Physically-Grounded Gaussians with Polarization Cues

Jiale Wu ⋅ Xiaoyang Bai ⋅ Zongqi He ⋅ Weiwei Xu ⋅ YIFAN PENG

Jun 5, 1:37 PM - 1:50 PM Four Seasons Ballroom

Recent advances in 3D Gaussian Splatting (3DGS) have demonstrated great success in modeling reflective 3D objects and their interaction with the environment via deferred rendering (DR). However, existing methods often struggle with correctly reconstructing physical attributes such as albedo and reflectance, and therefore they do not support high-fidelity relighting. Observing that this limitation stems from the lack of shape and material information in RGB images, we present PhyGaP, a physically-grounded 3DGS method that leverages polarization cues to facilitate precise reflection decomposition and visually consistent relighting of reconstructed objects. Specifically, we design a polarimetric deferred rendering (PolarDR) process to model polarization by reflection, and a self-occlusion-aware environment map building technique (GridMap) to resolve indirect lighting of non-convex objects. We validate on multiple synthetic and real-world scenes, including those featuring only partial polarization cues, that PhyGaP not only excels in reconstructing the appearance and surface normal of reflective 3D objects (~2 dB in PSNR and 45.7% in Cosine Distance better than existing RGB-based methods on average), but also achieves state-of-the-art inverse rendering and relighting capability.

Oral

RetimeGS: Continuous-Time Reconstruction of 4D Gaussian Splatting

Xuezhen Wang ⋅ Li Ma ⋅ Yulin Shen ⋅ Zeyu Wang ⋅ Pedro V. Sander

Jun 5, 1:37 PM - 1:50 PM Mile High Ballroom 1A - 2A

Temporal retiming, the ability to reconstruct and render dynamic scenes at arbitrary timestamps, is crucial for applications such as slow-motion playback, temporal editing, and post-production. However, most existing 4D Gaussian Splatting (4DGS) methods overfit to discrete frame indices and struggle to represent continuous-time frames, leading to ghosting artifacts when interpolating between timestamps. We identify this limitation as a form of temporal aliasing and propose RetimeGS, a simple yet effective 4DGS representation that explicitly defines the temporal behavior of the 3D Gaussian and mitigates temporal aliasing. To achieve smooth and consistent interpolation, we incorporate optical flow–guided initialization and supervision, triple-rendering supervision, and other targeted strategies. Together, these components enable ghost-free, temporally coherent rendering even under large motions. Experiments on datasets featuring fast motion, non-rigid deformation, and severe occlusions demonstrate that RetimeGS achieves superior quality and coherence over state-of-the-art methods.

Oral

SAM 3D Body: Robust Full-Body Human Mesh Recovery

Xitong Yang ⋅ Devansh Kukreja ⋅ Don Pinkus ⋅ Taosha Fan ⋅ Jinhyung Park ⋅ Soyong Shin ⋅ Jinkun Cao ⋅ Jia-Wei Liu ⋅ Nicolás Ugrinovic ⋅ Anushka Sagar ⋅ Jitendra Malik ⋅ Matt Feiszli ⋅ Piotr Dollár ⋅ Kris Kitani

Jun 5, 1:37 PM - 1:50 PM Bluebird Ballroom

We introduce SAM 3D Body (3DB), a promptable model for single-image full-body 3D human mesh recovery (HMR) that demonstrates state-of-the-art performance, with strong generalization and consistent accuracy in diverse in-the-wild conditions. 3DB estimates the human pose of the body, feet, and hands. It is the first model to use a new parametric mesh representation, Momentum Human Rig (MHR), which decouples skeletal pose and body shape. 3DB employs an encoder–decoder architecture and supports auxiliary prompts, including 2D keypoints and masks, enabling user-guided inference similar to the SAM family of models. We derive high-quality annotations from a multi-stage annotation pipeline that uses various combinations of manual keypoint annotation, differentiable optimization, multi-view geometry, and dense keypoint detection. Our data engine efficiently selects and processes data to ensure data diversity, collecting unusual poses and rare imaging conditions. We present a new evaluation dataset organized by pose and appearance categories, enabling nuanced analysis of model behavior. Our experiments demonstrate superior generalization and substantial improvements over prior methods in both qualitative user preference studies and traditional quantitative analysis. Both 3DB and MHR are open-source.

Oral

Residual Primitive Fitting of 3D Shapes with SuperFrusta

Aditya Ganeshan ⋅ Matheus Gadelha ⋅ Thibault Groueix ⋅ Zhiqin Chen ⋅ Siddhartha Chaudhuri ⋅ Vladimir G. Kim ⋅ Wang Yifan ⋅ Daniel Ritchie

Jun 5, 1:37 PM - 1:50 PM Mile High Ballroom 3A - 4A

We introduce a framework for converting 3D shapes into compact and editable assemblies of analytic primitives, directly addressing the persistent trade-off between reconstruction fidelity and parsimony. Our approach combines two key contributions: a novel primitive, termed SuperFrustum, and an iterative inference algorithm, Residual Primitive Fitting (ResFit). SuperFrustum is an analytical primitive that is simultaneously (1) expressive, being able to express various common solids such as cylinders, spheres, cones, and their tapered and bent forms, (2) editable, being compactly parameterized with 8 parameters, and (3) optimizable, with a signed distance field differentiable w.r.t. its parameters almost everywhere. ResFit is an unsupervised procedure that interleaves global shape analysis with local optimization, iteratively fitting primitives to the unexplained residual of a shape to discover a parsimonious yet accurate decomposition for each input shape. On diverse 3D benchmarks, our method achieves state-of-the-art results, improving IoU by over 9 points while using nearly half as many primitives as prior work. The resulting assemblies bridge the gap between dense 3D data and human-controllable design, producing high-fidelity and editable shape programs.

Oral

PPISP: Physically-Plausible Compensation and Control of Photometric Variations in Radiance Field Reconstruction

Isaac Deutsch ⋅ Nicolas Moënne-Loccoz ⋅ Gavriel State ⋅ Žan Gojčič

Jun 5, 1:50 PM - 2:02 PM Four Seasons Ballroom

Multi-view 3D reconstruction methods remain highly sensitive to photometric inconsistencies arising from camera optical characteristics and variations in image signal processing (ISP). Existing mitigation strategies such as per-frame latent variables or affine color corrections lack physical grounding and generalize poorly to novel views. We propose the Physically-Plausible ISP (PPISP) correction module, which disentangles camera-intrinsic and capture-dependent effects through physically based and interpretable transformations. A dedicated PPISP controller, trained on the input views, predicts ISP parameters for novel viewpoints, analogous to auto exposure and auto white balance in real cameras. This design enables realistic and fair evaluation on novel views without access to ground-truth images. PPISP achieves SoTA performance on standard benchmarks, while providing intuitive control and supporting the integration of metadata when available.

Oral

SAM 3D: 3Dfy Anything in Images

Xingyu Chen ⋅ Fu-Jen Chu ⋅ Pierre Gleize ⋅ Kevin J Liang ⋅ Alexander Sax ⋅ Hao Tang ⋅ Weiyao Wang ⋅ Michelle Guo ⋅ Thibaut Hardin ⋅ Xiang Li ⋅ Aohan Lin ⋅ Jia-Wei Liu ⋅ Ziqi Ma ⋅ Anushka Sagar ⋅ Bowen Song ⋅ Xiaodong Wang ⋅ Jianing "Jed" Yang ⋅ Bowen Zhang ⋅ Piotr Dollár ⋅ Georgia Gkioxari ⋅ Matt Feiszli ⋅ Jitendra Malik

Jun 5, 1:50 PM - 2:02 PM Bluebird Ballroom

We present SAM 3D, a generative model for visually grounded 3D object reconstruction, predicting geometry, texture, and layout from a single image. SAM 3D excels in natural images, where occlusion and scene clutter are common and visual recognition cues from context play a larger role. We achieve this with a human- and model-in-the-loop pipeline for annotating object shape, texture, and pose, providing visually grounded 3D reconstruction data at unprecedented scale. We learn from this data in a modern, multi-stage training framework that combines synthetic pretraining with real-world alignment, breaking the 3D "data barrier". We obtain significant gains over recent work, with at least a $5:1$ win rate in human preference tests on real-world objects and scenes. We will release our code and model weights, an online demo, and a new challenging benchmark for in-the-wild 3D object reconstruction.

Oral

Selfi: Self-improving Reconstruction Engine via 3D Geometric Feature Alignment

Youming Deng ⋅ Songyou Peng ⋅ Junyi Zhang ⋅ Kathryn Heal ⋅ Tiancheng Sun ⋅ John Flynn ⋅ Steve Marschner ⋅ Lucy Chai

Jun 5, 1:50 PM - 2:02 PM Mile High Ballroom 1A - 2A

Novel View Synthesis (NVS) has traditionally relied on models with explicit 3D inductive biases combined with known camera parameters from Structure-from-Motion (SfM) beforehand. Recent vision foundation models like VGGT take an orthogonal approach -- 3D knowledge is gained implicitly through training data and loss objectives, enabling feed-forward prediction of both camera parameters and 3D representations directly from a set of uncalibrated images. While flexible, VGGT features lack explicit multi-view geometric consistency, and we find that improving such 3D feature consistency benefits both NVS and pose estimation tasks. We introduce Selfi, a self-improving 3D reconstruction pipeline via feature alignment, transforming a VGGT backbone into a high-fidelity 3D reconstruction engine by leveraging its own outputs as pseudo-ground-truth. Specifically, we train a lightweight feature adapter using a reprojection-based consistency loss, which distills VGGT outputs into a new geometrically-aligned feature space that captures spatial proximity in 3D. This enables state-of-the-art performance in both NVS and camera pose estimation, demonstrating that feature alignment is a highly beneficial step for downstream 3D reasoning.

Oral

SmokeSVD: Smoke Reconstruction from A Single View via Progressive Novel View Synthesis and Refinement with Diffusion Models

Chen Li ⋅ Shanshan Dong ⋅ Sheng Qiu ⋅ Jianmin Han ⋅ Yibo Zhao ⋅ Zan Gao ⋅ Taku Komura ⋅ Kemeng Huang

Jun 5, 1:50 PM - 2:02 PM Mile High Ballroom 3A - 4A

Reconstructing dynamic fluids from sparse views is a long-standing and challenging problem, due to the severe lack of 3D information from insufficient view coverage. While several pioneering approaches have attempted to address this issue using differentiable rendering or novel view synthesis, they are often limited by time-consuming optimization under ill-posed conditions. We propose SmokeSVD, an efficient and effective framework to progressively reconstruct dynamic smoke from a single video by integrating the generative capabilities of diffusion models with physically guided consistency optimization. Specifically, we first propose a physically guided side-view synthesizer based on diffusion models, which explicitly incorporates velocity field constraints to generate spatio-temporally consistent side-view images frame by frame, significantly alleviating the ill-posedness of single-view reconstruction. Subsequently, we iteratively refine novel-view images and reconstruct 3D density fields through a progressive multi-stage process that renders and enhances images from increasing viewing angles, generating high-quality multi-view sequences. Finally, we estimate fine-grained density and velocity fields via differentiable advection by leveraging the Navier-Stokes equations. Our approach supports re-simulation and downstream applications while achieving superior reconstruction quality and computational efficiency compared to state-of-the-art methods.

Oral

SPARK: Sim-ready Part-level Articulated Reconstruction with VLM Knowledge

Yumeng He ⋅ Ying Jiang ⋅ Jiayin Lu ⋅ Yin Yang ⋅ Chenfanfu Jiang

Jun 5, 2:02 PM - 2:15 PM Bluebird Ballroom

Articulated 3D objects are critical for embodied AI, robotics, and interactive scene understanding, yet creating simulation-ready assets remains labor-intensive and requires expert modeling of part hierarchies and motion structures. We introduce SPARK, a framework for reconstructing physically consistent, kinematic part-level articulated objects from a single RGB image. Given an input image, we first leverage VLMs to extract coarse URDF parameters and generate part-level reference images. We then integrate the part-image guidance and the inferred structure graph into a generative diffusion transformer to synthesize consistent part and complete shapes of articulated objects. To further refine the URDF parameters, we incorporate differentiable forward kinematics and differentiable rendering to optimize joint types, axes, and origins under VLM-generated open-state supervision. Extensive experiments show that SPARK produces high-quality, simulation-ready articulated assets across diverse categories, enabling downstream applications such as robotic manipulation and interaction modeling.

Oral

SparseWorld-TC: Trajectory-Conditioned Sparse Occupancy World Model

Jiayuan Du ⋅ Yiming Zhao ⋅ Zhenglong Guo ⋅ Yong Pan ⋅ Wenbo Hou ⋅ Zhihui Hao ⋅ Kun Zhan ⋅ Qijun Chen

Jun 5, 2:02 PM - 2:15 PM Mile High Ballroom 3A - 4A

This paper introduces a novel architecture for trajectory-conditioned forecasting of future 3D scene occupancy. In contrast to methods that rely on variational autoencoders (VAEs) to generate discrete occupancy tokens, which inherently limit representational capacity, our approach predicts multi-frame future occupancy in an end-to-end manner directly from raw image features. Inspired by the success of attention-based transformer architectures in foundational vision and language models such as GPT and VGGT, we employ a sparse occupancy representation that bypasses the intermediate bird’s eye view (BEV) projection and its explicit geometric priors. This design allows the transformer to capture spatiotemporal dependencies more effectively. By avoiding both the finite-capacity constraint of discrete tokenization and the structural limitations of BEV representations, our method achieves state-of-the-art performance on the nuScenes benchmark for 1‒3 second occupancy forecasting, outperforming existing approaches by a significant margin. Furthermore, it demonstrates robust scene dynamics understanding, consistently delivering high accuracy under arbitrary future trajectory conditioning.

Oral

Z-Order Transformer for Feed-Forward Gaussian Splatting

Can Wang ⋅ Lei Liu ⋅ Wei Jiang ⋅ Dong Xu

Jun 5, 2:02 PM - 2:15 PM Mile High Ballroom 1A - 2A

Recent advances in 3D Gaussian Splatting (3DGS) have enabled significant progress in photorealistic novel view synthesis. However, traditional 3DGS relies on a slow, iterative optimization process, which limits its use in scenarios demanding real-time results. To overcome this bottleneck, recent feed-forward methods aim to predict Gaussian attributes directly from images, but they often struggle with the redundancy of Gaussian primitives and rendering quality. In this paper, we introduce a transformer-based architecture specifically designed for feed-forward Gaussian Splatting. Our key insight is that spatial and semantic relationships among Gaussians can be effectively captured through a sparse attention mechanism, enabled by a Z-order strategy that organizes the unstructured Gaussian set into a spatially coherent sequence. Furthermore, we incorporate this Z-order strategy to adaptively suppress redundancy while preserving critical structural details. This allows the transformer to efficiently model context, compress Gaussian primitives, and predict Gaussian attributes in a single forward pass. Comprehensive experiments demonstrate that our method achieves fast and high-quality novel view synthesis with fewer Gaussian primitives.
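
To make the ordering step concrete, here is a minimal sketch that serializes an unstructured set of 3D points (for example, Gaussian centers) along a Z-order curve. It assumes "Z-order" refers to the standard Morton code (bit interleaving of quantized coordinates); the function names and the 10-bit grid are illustrative choices, not details from the paper.

```python
from typing import List, Sequence


def morton3(x: int, y: int, z: int, bits: int = 10) -> int:
    """Interleave the low `bits` bits of x, y, z into one Morton code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code


def z_order_sort(points: Sequence[Sequence[float]], bits: int = 10) -> List[int]:
    """Return point indices sorted along the Z-order curve."""
    if not points:
        return []
    lo = [min(p[a] for p in points) for a in range(3)]
    hi = [max(p[a] for p in points) for a in range(3)]
    cells = (1 << bits) - 1  # largest grid index per axis

    def key(index: int) -> int:
        q = []
        for a in range(3):
            span = hi[a] - lo[a]
            t = (points[index][a] - lo[a]) / span if span > 0 else 0.0
            q.append(int(t * cells))  # quantize to the integer grid
        return morton3(q[0], q[1], q[2], bits)

    return sorted(range(len(points)), key=key)
```

Points that are close in 3D tend to land close together in the resulting sequence, which is what makes windowed or sparse attention over that sequence a plausible fit.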

Oral

SeeGroup: Multi-Layer Depth Estimation of Transparent Surfaces via Self-Determined Grouping

Hongyu Wen ⋅ Jia Deng

Jun 5, 2:02 PM - 2:15 PM Four Seasons Ballroom

Transparent objects are common in daily life, and understanding their multi-layer depth information, including both the transparent surface and the objects behind it, is crucial for real-world applications that interact with transparent materials. However, existing depth methods produce only a single depth map, which is inherently ambiguous for transparent surfaces. In this work, we propose a multi-layer depth estimation method, SeeGroup, consisting of a novel recurrent decomposition module design and an intensity-based formulation for multi-layer depth. Experiments demonstrate that our method significantly improves the state of the art in multi-layer depth estimation, improving quadruplet relative depth accuracy on the LayeredDepth benchmark from 61.34% to 70.67%.

Oral

Differentiable Vector Quantization for Rate-Distortion Optimization of Generative Image Compression

SHIYIN JIANG ⋅ Wei Long ⋅ Minghao Han ⋅ Zhenghao Chen ⋅ Ce Zhu ⋅ Shuhang Gu

Jun 6, 9:00 AM - 9:12 AM Mile High Ballroom 3A - 4A

The proliferation of visual data under tight storage and bandwidth budgets makes extremely low-bitrate generative image compression increasingly important. Vector quantization (VQ) is compelling in this regime because codebooks encode cross-channel correlations and dataset-level semantics, enabling perceptually faithful reconstructions when bits are scarce. We propose RDVQ, a VQ-based generative image compression method designed for extremely low bitrates. End-to-end learned image codecs rely on a differentiable rate term for rate–distortion (RD) optimization, but naïvely integrating VQ introduces non-differentiability and is not directly compatible with entropy modeling, forcing prior work to regulate bitrate only indirectly. We resolve this by defining a distance-aware soft posterior over codebook indices and training a conditional autoregressive entropy model to predict it. The cross-entropy between the approximate and predicted posteriors then yields a differentiable rate loss, restoring a gradient pathway from rate to the encoder via codeword distances. This predicted codebook index distribution enables prefix-only transmission at inference, with the model imputing the rest of the indices, delivering retraining-free bitrate control over a practical range. Our end-to-end RD-optimized RDVQ outperforms all baseline methods in terms of DISTS and CLIPIQA on the Kodak, DIV2K and CLIC2020 datasets, reflecting superior structural restoration and better alignment with human visual perception.
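
The rate term described above can be sketched in a few lines. The abstract does not give the exact form of the soft posterior, so this example assumes a softmax over negative squared distances with a hypothetical temperature `tau`; the rate is then the cross-entropy (in bits) between that posterior and whatever distribution an entropy model predicts.

```python
import math
from typing import List, Sequence


def soft_posterior(z: Sequence[float], codebook: Sequence[Sequence[float]], tau: float = 1.0) -> List[float]:
    """Distance-aware soft assignment: softmax(-||z - c_k||^2 / tau)."""
    dists = [sum((zi - ci) ** 2 for zi, ci in zip(z, c)) for c in codebook]
    d_min = min(dists)  # subtract the minimum for numerical stability
    weights = [math.exp(-(d - d_min) / tau) for d in dists]
    total = sum(weights)
    return [w / total for w in weights]


def rate_bits(posterior: Sequence[float], predicted: Sequence[float], eps: float = 1e-12) -> float:
    """Cross-entropy H(posterior, predicted) in bits.

    Smooth in both arguments, so a gradient can reach the encoder through
    the codeword distances that define `posterior`.
    """
    return -sum(q * math.log2(max(p, eps)) for q, p in zip(posterior, predicted))
```

A predictor that matches the posterior exactly attains the posterior's entropy, which is the lowest rate this loss can report for that latent.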

Oral

Breaking Semantic Boundaries: Distribution-Guided Semantic Exploration for Creative Generation

Fu Feng ⋅ Yucheng Xie ⋅ Ruixiao Shi ⋅ Xu Yang ⋅ Jing Wang ⋅ Xin Geng

Jun 6, 9:00 AM - 9:12 AM Bluebird Ballroom

Text-to-image (T2I) diffusion models effectively produce semantically aligned images, but their reliance on training distributions constrains their capacity for synthesizing truly novel, out-of-distribution concepts. Existing methods attempt to enhance creativity through semantic exploration, such as fusing known concept pairs, but the resulting images remain linguistically describable and confined to familiar semantic spaces. Inspired by the soft probabilistic outputs of classifiers on novel or out-of-distribution inputs, we propose Distribution-Conditional Generation, a paradigm that models novel concepts as image synthesis conditioned on class distributions, enabling controllable yet semantically unconstrained creative generation. Building on this, we propose DisTok, an encoder–decoder framework that unifies conditional and unconditional creative generation by decoding latent representations—either randomly sampled or mapped from conditions (e.g., class distributions)—into tokens representing novel concepts. DisTok is trained by iteratively sampling and fusing concept pairs from a dynamic pool to model progressively complex distributions, while enforcing semantic consistency through a vision-language model that aligns the class distributions of generated images with the input distributions. Extensive experiments demonstrate that DisTok enables efficient and flexible semantic exploration for token-level creative synthesis, achieving state-of-the-art text–image alignment and human preference.

Oral

ComPose: A Unified Completion-Pose Framework for Robust Category-Level Object Pose Estimation

Huan Ren ⋅ Yihan Chen ⋅ Chuxin Wang ⋅ Nailong Liu ⋅ Wenfei Yang ⋅ Tianzhu Zhang

Jun 6, 9:00 AM - 9:12 AM Four Seasons Ballroom

Category-level object pose estimation aims to predict the pose and size of arbitrary objects in specific categories. Existing methods struggle with the inherent incompleteness of observed point clouds, which limits their ability to capture complete object shapes for robust pose reasoning. While point cloud completion offers a promising solution, naively treating it as a separate preprocessing step for partial observations introduces compounding errors and additional computational overhead, ultimately hindering both accuracy and efficiency. To address these challenges, we propose ComPose, a novel unified framework that tightly integrates shape completion to provide complete geometric cues for enhanced pose estimation. At the core of ComPose is a keypoint-based progressive completion module, which recovers full shape representations by progressively predicting a sparse set of keypoints and their surrounding dense point sets, empowering the keypoints to capture holistic object geometries. A geometric relation encoding module further enriches keypoint features with both local and global geometric context. In addition, we introduce a novel geometric relation consistency loss to enforce structural alignment between observed keypoints and their predicted NOCS coordinates, ensuring globally coherent coordinate transformations. Extensive experiments on standard benchmarks demonstrate that our method outperforms state-of-the-art approaches without relying on category-level shape priors. Our method pioneers a new direction for future research by effectively and efficiently integrating shape completion into category-level object pose estimation. Code will be released.

Oral

3D-LATTE: Latent Space 3D Editing from Textual Instructions

Maria Parelli ⋅ Michael Oechsle ⋅ Michael Niemeyer ⋅ Federico Tombari ⋅ Andreas Geiger

Jun 6, 9:00 AM - 9:12 AM Mile High Ballroom 1A - 2A

Despite the recent success of multi-view diffusion models for text/image-based 3D asset generation, instruction-based editing of 3D assets lags surprisingly far behind the quality of generation models. The main reason is that recent approaches using 2D priors suffer from view-inconsistent editing signals. Going beyond 2D prior distillation methods and multi-view editing strategies, we propose a training-free editing method that operates within the latent space of a native 3D diffusion model, allowing us to directly manipulate 3D geometry. We guide the edit synthesis by blending 3D attention maps from the generation with the source object. Coupled with geometry-aware regularization guidance, a spectral modulation strategy in the Fourier domain, and a refinement step for 3D enhancement, our method outperforms previous 3D editing methods, enabling high-fidelity and precise edits across a wide range of shapes and semantic manipulations. Code will be publicly released.

Oral

AnchorFlow: Training-Free 3D Editing via Latent Anchor-Aligned Flows

Zhenglin Zhou ⋅ Fan Ma ⋅ Chengzhuo Gui ⋅ Xiaobo Xia ⋅ Hehe Fan ⋅ Yi Yang ⋅ Tat-seng Chua

Jun 6, 9:12 AM - 9:25 AM Mile High Ballroom 1A - 2A

Training-free 3D editing aims to modify 3D shapes based on human instructions without model finetuning. It plays a crucial role in 3D content creation. However, existing approaches often struggle to produce strong or geometrically stable edits, largely due to inconsistent latent anchors introduced by timestep-dependent noise during diffusion sampling. To address these limitations, we introduce AnchorFlow, which is built upon the principle of latent anchor consistency. Specifically, AnchorFlow establishes a global latent anchor shared between the source and target trajectories, and enforces coherence using a relaxed anchor-alignment loss together with an anchor-aligned update rule. This design ensures that transformations remain stable and semantically faithful throughout the editing process. By stabilizing the latent reference space, AnchorFlow enables more pronounced semantic modifications. Moreover, AnchorFlow is mask-free. Without mask supervision, it effectively preserves geometric fidelity. Experiments on the Eval3DEdit benchmark show that AnchorFlow consistently delivers semantically aligned and structurally robust edits across diverse editing types. The code and models will be made publicly available.

Oral

FINER: MLLMs Hallucinate under Fine-grained Negative Queries

Rui Xiao ⋅ Sanghwan Kim ⋅ Yongqin Xian ⋅ Zeynep Akata ⋅ Stephan Alaniz

Jun 6, 9:12 AM - 9:25 AM Mile High Ballroom 3A - 4A

Multimodal large language models (MLLMs) struggle with hallucinations, particularly with fine-grained queries, a challenge underrepresented by existing benchmarks that focus on coarse image-related questions. We introduce **FI**ne-grained **NE**gative que**R**ies (**FINER**), alongside two benchmarks: **FINER-CompreCap** and **FINER-DOCCI**. Using FINER, we analyze hallucinations across four settings: multi-object, multi-attribute, multi-relation, and “what” questions. Our benchmarks reveal that MLLMs hallucinate when fine-grained mismatches co-occur with genuinely present elements in the image. To address this, we propose **FINER-Tuning**, leveraging Direct Preference Optimization (DPO) on FINER-inspired data. Finetuning four frontier MLLMs with FINER-Tuning yields up to 24.2% gains (InternVL3.5-14B) on hallucinations from our benchmarks, while simultaneously improving performance on eight existing hallucination suites, and enhancing general multimodal capabilities across six benchmarks. Benchmarks, training data, code and model checkpoints will be released.
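
For readers unfamiliar with Direct Preference Optimization, the per-pair objective it minimizes can be written compactly. The sketch below shows the standard DPO loss for one preference pair; the input log-probabilities and the value of `beta` are illustrative, and nothing here is specific to how FINER-Tuning constructs its preference data.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss: -log sigmoid(beta * (chosen margin - rejected margin)).

    Each margin is the policy log-probability minus the frozen reference
    model's log-probability for the same response.
    """
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # Numerically stable -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss equals log 2 when the policy and reference agree, and falls as the policy shifts probability toward the preferred (for example, non-hallucinated) answer.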

Oral

Guiding a Diffusion Model by Swapping Its Tokens

Weijia Zhang ⋅ Yuehao Liu ⋅ Shanyan Guan ⋅ Wu Ran ⋅ Yanhao Ge ⋅ Wei Li ⋅ Chao Ma

Jun 6, 9:12 AM - 9:25 AM Bluebird Ballroom

Classifier-Free Guidance (CFG) is a widely used inference-time technique to boost the image quality of diffusion models. Yet, its reliance on text conditions prevents its use in unconditional generation. We propose a simple method to enable CFG-like guidance for both conditional and unconditional generation. The key idea is to generate a perturbed prediction via simple token swap operations, and use the direction between it and the clean prediction to steer sampling toward higher-fidelity distributions. In practice, we swap pairs of most semantically dissimilar tokens in either spatial or channel dimensions. Unlike existing methods that apply perturbation in a global or less constrained manner, our approach modifies only selected tokens, allowing finer control over perturbation and its influence on generated samples. Experiments on MS-COCO 2014, MS-COCO 2017, and ImageNet datasets demonstrate that our Self-Swap Guidance (SSG), when applied to state-of-the-art diffusion models, outperforms previous condition-free methods in image fidelity and prompt alignment under different set-ups. Its fine-grained perturbation granularity also improves robustness, reducing side-effects across a wider range of perturbation strengths. Overall, SSG extends CFG to a broader scope of applications including both conditional and unconditional generation, and can be readily inserted into any diffusion model as a plug-in to gain immediate improvements.
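
A minimal sketch of the two ingredients named above, under stated assumptions: token dissimilarity is measured by cosine similarity, only the single most dissimilar pair is swapped along the spatial (token) axis, and the guided output uses a CFG-style extrapolation with a hypothetical `scale`. The paper's actual selection rule and update may differ.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def swap_most_dissimilar(tokens: Sequence[Sequence[float]]) -> List[List[float]]:
    """Return a copy of `tokens` with the least-similar pair exchanged."""
    best = None  # (similarity, i, j) of the most dissimilar pair so far
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            sim = cosine(tokens[i], tokens[j])
            if best is None or sim < best[0]:
                best = (sim, i, j)
    out = [list(t) for t in tokens]
    if best is not None:
        _, i, j = best
        out[i], out[j] = out[j], out[i]
    return out


def guide(clean: Sequence[Sequence[float]], perturbed: Sequence[Sequence[float]], scale: float) -> List[List[float]]:
    """Push the clean prediction away from the perturbed one."""
    return [
        [c + scale * (c - p) for c, p in zip(c_row, p_row)]
        for c_row, p_row in zip(clean, perturbed)
    ]
```

In a sampler, `clean` and `perturbed` would be the denoiser outputs computed from the original and token-swapped features respectively; no text condition is needed to form the guidance direction.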

Oral

CoSMo3D: Open-World Promptable 3D Semantic Segmentation through LLM-Guided Canonical Spatial Modeling

Li Jin ⋅ Weikai Chen ⋅ Yujie Wang ⋅ Yingda Yin ⋅ Zeyu HU ⋅ Runze Zhang ⋅ Keyang Luo ⋅ Shengju Qian ⋅ Xin Wang ⋅ Xueying Qin

Jun 6, 9:12 AM - 9:25 AM Four Seasons Ballroom

Open-world promptable 3D semantic segmentation remains brittle because semantics are inferred in the input sensor coordinates. Humans, in contrast, interpret parts via functional roles in a canonical space -- wings extend laterally, handles protrude to the side, and legs support from below. Psychophysical evidence shows that we mentally rotate objects into canonical frames to reveal these roles. To fill this gap, we propose CoSMo3D, which attains canonical space perception by inducing a latent canonical reference frame learned directly from data. By construction, we create a unified canonical dataset through LLM-guided intra- and cross-category alignment, exposing canonical spatial regularities across 200 categories. By induction, we realize canonicality inside the model through a dual-branch architecture with canonical map anchoring and canonical box calibration, collapsing pose variation and symmetry into a stable canonical embedding. This shift from input pose space to canonical representation yields far more stable and transferable part semantics. Experimental results show that CoSMo3D establishes a new state of the art in open-world promptable 3D segmentation.

Oral

PixelDiT: Pixel Diffusion Transformers for Image Generation

Yongsheng Yu ⋅ Wei Xiong ⋅ Weili Nie ⋅ Yichen Sheng ⋅ Shiqiu Liu ⋅ Jiebo Luo

Jun 6, 9:25 AM - 9:37 AM Bluebird Ballroom

Latent-space modeling has been the standard for Diffusion Transformers (DiTs). However, it relies on a two-stage pipeline where the pretrained autoencoder introduces lossy reconstruction, leading to error accumulation while hindering joint optimization. To address these issues, we propose PixelDiT, a single-stage, end-to-end model that eliminates the need for the autoencoder and learns the diffusion process directly in the pixel space. PixelDiT adopts a fully transformer-based architecture shaped by a dual-level design: a patch-level DiT that captures global semantics and a pixel-level DiT that refines texture details, enabling efficient training of a pixel-space diffusion model while preserving fine details. PixelDiT achieves 1.61 FID on ImageNet 256 and 2.21 FID on ImageNet 512, surpassing existing pixel generative models by a large margin. We further extend PixelDiT to text-to-image generation and pretrain it at the $1024^{2}$ resolution in pixel space. It achieves 0.74 on GenEval and 83.5 on DPG-bench, approaching the best latent diffusion models.

Oral

ChordEdit: One-Step Low-Energy Transport for Image Editing

Liangsi Lu ⋅ Xuhang Chen ⋅ Minzhe Guo ⋅ Shichu Li ⋅ Jingchao Wang ⋅ Yang Shi

Jun 6, 9:25 AM - 9:37 AM Mile High Ballroom 1A - 2A

The advent of one-step text-to-image (T2I) models offers unprecedented synthesis speed. However, their application to text-guided image editing remains severely hampered, as forcing existing training-free editors into a single inference step fails. This failure manifests as severe object distortion and a critical loss of consistency in non-edited regions, resulting from the high-energy, erratic trajectories produced by naive vector arithmetic on the models' structured fields. To address this problem, we introduce ChordEdit, a model-agnostic, training-free, and inversion-free method that facilitates high-fidelity one-step editing. We recast editing as a transport problem between the source and target distributions defined by the source and target text prompts. Leveraging dynamic optimal transport theory, we derive a principled, low-energy control strategy. This strategy yields a smoothed, variance-reduced editing field that is inherently stable, allowing the field to be traversed in a single, large integration step. This theoretically grounded and experimentally validated approach allows ChordEdit to deliver fast, lightweight and precise edits, finally achieving true real-time editing on these challenging models.

Oral

MDCS-MoAME: Multi-directional Composite Scanning with Mixture of Attention and Mamba Experts for Cancer Survival Prediction

Linjie Qu ⋅ Jin Xiao ⋅ Xiangrong Liu ⋅ Changming Sun ⋅ Hui Cui ⋅ Yuqi Fang ⋅ Ran Su ⋅ Qiangguo Jin ⋅ leyi wei

Jun 6, 9:25 AM - 9:37 AM Mile High Ballroom 3A - 4A

Multi-modal learning approaches that integrate pathological images with genomic profiles have significantly enhanced the accuracy of survival prediction tasks. However, previous methods often struggle to effectively process long-range gigapixel whole slide images (WSIs) and sparse genomic profiles due to the limitations of conventional scanning strategies to serialize data and the complex and heterogeneous nature of the modalities. Inspired by recent advancements in Mamba and mixture of experts (MoE), we propose a novel multi-directional composite scanning strategy with mixture of attention and Mamba experts (MDCS-MoAME) for cancer survival prediction. Specifically, we introduce a multi-directional composite scanning (MDCS) strategy to both WSIs and genomic profiles, and use the Mamba encoder to process intra-modal representations at the region, patch, and gene level, ensuring sufficient utilization of the intrinsic information within each modality. To further capture heterogeneous inter-modal representations, we introduce mixture of attention and Mamba experts (MoAME), which dynamically selects tailored experts to model complex inter-modal correlations, flexibly focusing on the interactions between modalities. Finally, we introduce alignment constraints to recalibrate inter-modal interactions and reduce intra- and inter-modal representation redundancy, enhancing its discriminative power for comprehensive survival analysis. Experimental results on five publicly available datasets demonstrate that our method outperforms existing approaches, achieving state-of-the-art performance. Our code is included in the supplementary material.

Oral

GeoViS: Geospatially Rewarded Visual Search for Remote Sensing Visual Grounding

Peirong Zhang ⋅ Yidan Zhang ⋅ Luxiao Xu ⋅ Jinliang Lin ⋅ Zonghao Guo ⋅ Fengxiang Wang ⋅ Xue Yang ⋅ Kaiwen Wei ⋅ Lei Wang

Jun 6, 9:25 AM - 9:37 AM Four Seasons Ballroom

Recent advances in multimodal large language models (MLLMs) have led to remarkable progress in visual grounding, enabling fine-grained cross-modal alignment between textual queries and image regions. However, transferring such capabilities to remote sensing imagery remains challenging, as targets are often extremely small within kilometer-scale scenes, and queries typically involve intricate geospatial relations such as relative positions, spatial hierarchies, or contextual dependencies across distant objects. To address these challenges, we propose GeoViS, a Geospatially Rewarded Visual Search framework that reformulates remote sensing visual grounding as a progressive search-and-reasoning process. Rather than directly predicting the target location in a single step, GeoViS actively explores the global image through a tree-structured sequence of visual cues, integrating multimodal perception, spatial reasoning, and reward-guided exploration to refine geospatial hypotheses iteratively. This design enables the model to detect subtle small-scale targets while maintaining holistic scene awareness. Extensive experiments on five remote sensing grounding benchmarks demonstrate that GeoViS achieves precise geospatial understanding and consistently surpasses existing methods across key visual grounding metrics, highlighting its strong cross-domain generalization and interpretability.

Oral

RobotSeg: A Model and Dataset for Segmenting Robots in Image and Video

Haiyang Mei ⋅ Qiming Huang ⋅ Hai Ci ⋅ Mike Zheng Shou

Jun 6, 9:37 AM - 9:50 AM Four Seasons Ballroom

Accurate robot segmentation is a fundamental capability for robotic perception. It enables precise construction of digital twins and world models for robotic applications, supports robot-centric data augmentation, and provides reliable cues for extracting robot actions and poses. Despite the strong capabilities of modern segmentation models, surprisingly it remains challenging to segment robots. This is due to robot embodiment diversity, appearance ambiguity, structural complexity, and rapid shape changes. Embracing these challenges, we introduce RobotSeg, a foundation model for robot segmentation in image and video. RobotSeg is built upon the versatile SAM 2 foundation model but addresses its three limitations for robot segmentation, namely the lack of adaptation to articulated robots, reliance on manual prompts, and the need for per-frame training mask annotations, by introducing a structure-enhanced memory associator, a robot prompt generator, and a label-efficient training strategy. These innovations collectively enable a structure-aware, automatic, and label-efficient solution. We further construct the video robot segmentation (VRS) dataset comprising over 2.8k videos (138k frames) with diverse robot embodiments and environments. Extensive experiments demonstrate that RobotSeg achieves state-of-the-art performance on both images and videos, establishing a strong foundation for future advances in robot perception.

Oral

Faithful Contouring: Near-Lossless 3D Voxel Representation Free from Iso-surface

Yihao Luo ⋅ Xianglong He ⋅ Chuanyu Pan ⋅ Yiwen Chen ⋅ Jiaqi Wu ⋅ Yangguang Li ⋅ Wanli Ouyang ⋅ Yuanming Hu ⋅ Guang Yang ⋅ Choon Hwai Yap

Jun 6, 9:37 AM - 9:50 AM Mile High Ballroom 1A - 2A

Accurate and efficient voxelized representations of 3D meshes are the foundation of 3D reconstruction and generation. However, existing representations based on iso-surfaces rely heavily on water-tightening or rendering optimization, which inevitably compromises geometric fidelity. We propose Faithful Contouring, a sparse voxelized representation that supports 2048+ resolutions for arbitrary meshes, requiring neither converting meshes to field functions nor extracting the iso-surface during remeshing. It achieves near-lossless fidelity by preserving sharpness and internal structures, even for challenging cases with complex geometry and topology. The proposed method also shows flexibility for texturing, manipulation, and editing. Beyond representation, we design a dual-mode autoencoder for Faithful Contouring, enabling scalable and detail-preserving shape reconstruction. Extensive experiments show that Faithful Contouring surpasses existing methods in accuracy and efficiency for both representation and reconstruction. For direct representation, it achieves distance errors at the $10^{-5}$ level; for mesh reconstruction, it yields a 93% reduction in Chamfer Distance and a 35% improvement in F-score over strong baselines, confirming superior fidelity as a representation for 3D learning tasks.

Oral

SeaCache: Spectral-Evolution-Aware Cache for Accelerating Diffusion Models

Jiwoo Chung ⋅ Sangeek Hyun ⋅ MinKyu Lee ⋅ Byeongju Han ⋅ Geonho Cha ⋅ Dongyoon Wee ⋅ Youngjun Hong ⋅ Jae-Pil Heo

Jun 6, 9:37 AM - 9:50 AM Bluebird Ballroom

Diffusion models are a strong backbone for visual generation, but their inherently sequential denoising process leads to slow inference. Previous methods accelerate sampling by caching and reusing intermediate outputs based on feature distances between adjacent timesteps. However, existing caching strategies typically rely on raw feature differences that entangle content and noise. This design overlooks spectral evolution, where low-frequency structure appears early and high-frequency detail is refined later. We introduce Spectral-Evolution-Aware Cache (SeaCache), a training-free cache schedule that bases reuse decisions on a spectrally aligned representation. Through theoretical and empirical analysis, we derive a Spectral-Evolution-Aware (SEA) filter that preserves content-relevant components while suppressing noise. Employing SEA-filtered input features to estimate redundancy leads to dynamic schedules that adapt to content while respecting the spectral priors of the underlying diffusion model. Extensive experiments on diverse visual generative models and the baselines show that SeaCache achieves state-of-the-art latency-quality trade-offs.

Oral

PAS: A Training-Free Stabilizer for Temporal Encoding in Video LLMs

Bowen Sun ⋅ Yujun Cai ⋅ Ming-Hsuan Yang ⋅ Hang Wu ⋅ Yiwei Wang

Jun 6, 9:37 AM - 9:50 AM Mile High Ballroom 3A - 4A

Video LLMs suffer from temporal inconsistency: small shifts in frame timing can flip attention and suppress relevant frames. We trace this instability to the common extension of Rotary Position Embeddings to video through multimodal RoPE. The induced inverse Fourier time kernel exhibits frame-scale ripples that multiply adjacent frames by different factors, which perturbs attention that should otherwise be governed by the raw query-key inner product. We present Phase Aggregated Smoothing (PAS), a simple, training-free mechanism that applies small opposed phase offsets across heads and then aggregates their outputs. PAS preserves the per-head spectrum magnitude, while the aggregation effectively smooths the temporal kernel and reduces phase sensitivity without changing the positional encoding structure. Our analysis shows three things: the RoPE-rotated logit can be approximated as a content dot product scaled by a time kernel; smoothing this kernel yields Lipschitz stability of attention to small temporal shifts; and multi-phase averaging attenuates high-frequency ripples while preserving per-head spectra under Nyquist-valid sampling. Experiments on multiple video understanding benchmarks under matched token budgets show consistent improvements with negligible computational overhead. PAS provides a plug-and-play upgrade for robust temporal encoding in Video LLMs.
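Why averaging opposed offsets smooths a time kernel can be checked numerically with a basic trigonometric identity. The sketch below is not the paper's formulation: it assumes the offsets act as small temporal shifts of `+shift` and `-shift` on a cosine kernel, and the frequency values are arbitrary.

```python
import numpy as np


def averaged_kernel(freqs, dt, shift):
    """Average per-frequency cosine kernels evaluated at two opposed shifts."""
    return 0.5 * (np.cos(freqs * (dt + shift)) + np.cos(freqs * (dt - shift)))


freqs = np.array([0.1, 1.0, 5.0])  # illustrative rotary frequencies (assumed values)
dt, shift = 3.0, 0.2

smoothed = averaged_kernel(freqs, dt, shift)
# Identity: the average equals the original kernel scaled by cos(freq * shift),
# so higher frequencies are attenuated more while freq * shift stays below pi / 2.
attenuation = np.cos(freqs * shift)
original = np.cos(freqs * dt)
```

Here `attenuation` is close to 1 for the lowest frequency and noticeably smaller for the highest, which is the low-pass behaviour the abstract describes as attenuating high-frequency ripples.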

Oral

SenCache: Accelerating Diffusion Model Inference via Sensitivity-Aware Caching

Yasaman Haghighi ⋅ Alex Alahi

Jun 6, 9:50 AM - 10:02 AM Bluebird Ballroom

Diffusion models achieve state-of-the-art video generation, but their many sequential denoising steps create a major computational bottleneck. Existing acceleration methods reuse cached model outputs at fixed timesteps chosen through heuristics, requiring heavy tuning and failing to adapt to each sample’s complexity. We address this with a principled, sensitivity-aware caching framework. We first formalize the caching problem by analyzing the network's output sensitivity with respect to changes in its inputs, namely the noisy latent and the timestep. We demonstrate that this sensitivity is the key indicator of caching error. Building on this insight, we introduce Sensitivity-Aware Caching (SenCache), a dynamic strategy that adaptively selects which timesteps to cache on a per-sample basis. This allows for less caching on challenging samples and more aggressive acceleration on simpler ones. Our method provides a robust theoretical grounding for adaptive caching, offering an explanation for why previous empirical criteria are partially effective and extending them with a dynamic, sample-specific approach. Experiments on Wan 2.1, CogVideoX and LTX-Video models demonstrate that our method outperforms existing caching strategies in visual quality under similar computational budgets.
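A per-sample caching rule driven by input sensitivity might look like the sketch below. It assumes a first-order bound on output drift, `s_x * ||dx|| + s_t * |dt|`, where `s_x` and `s_t` are hypothetical sensitivity estimates with respect to the latent and the timestep. The abstract does not give the actual criterion, so every name and the tolerance rule are illustrative.

```python
import numpy as np


def predicted_drift(s_x, s_t, dx_norm, dt):
    """First-order estimate of how far the output moved since the last full evaluation."""
    return s_x * dx_norm + s_t * abs(dt)


def plan_steps(latents, timesteps, s_x, s_t, tolerance):
    """Return one boolean per step: True = run the network, False = reuse the cache."""
    compute = [True]  # the first step always runs
    ref_x, ref_t = latents[0], timesteps[0]
    for x, t in zip(latents[1:], timesteps[1:]):
        drift = predicted_drift(s_x, s_t, np.linalg.norm(x - ref_x), t - ref_t)
        if drift > tolerance:
            compute.append(True)
            ref_x, ref_t = x, t  # new reference point for later steps
        else:
            compute.append(False)
    return compute


# Toy trajectory: the latent moves slowly at first, then jumps.
latents = [np.zeros(4), 0.1 * np.ones(4), 0.2 * np.ones(4), 2.0 * np.ones(4)]
timesteps = [1.0, 0.9, 0.8, 0.7]
schedule = plan_steps(latents, timesteps, s_x=1.0, s_t=1.0, tolerance=0.5)
```

Because the decision depends on each sample's own latent trajectory, an easy sample that changes slowly skips more steps than a hard one, which is the adaptive behaviour the abstract describes.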

Oral

S^2AM3D: Scale-controllable Part Segmentation of 3D Point Clouds

Han Su ⋅ Tianyu Huang ⋅ Zichen Wan ⋅ Xiaohe Wu ⋅ Wangmeng Zuo

Jun 6, 9:50 AM - 10:02 AM Four Seasons Ballroom

Part-level point cloud segmentation has recently attracted significant attention in 3D computer vision. Nevertheless, existing research is constrained by two major challenges: native 3D models lack generalization due to data scarcity, while introducing 2D pre-trained knowledge often leads to inconsistent segmentation results across different views. To address these challenges, we propose S$^2$AM3D, which incorporates 2D segmentation priors with 3D consistent supervision. We design a point-consistent part encoder that aggregates multi-view 2D features through native 3D contrastive learning, producing globally consistent point features. A scale-aware prompt decoder is then proposed to enable real-time adjustment of segmentation granularity via continuous scale signals. Simultaneously, we introduce a large-scale, high-quality part-level point cloud dataset with more than 100k samples, providing ample supervision signals for model training. Extensive experiments demonstrate that S$^2$AM3D achieves leading performance across multiple evaluation settings, exhibiting exceptional robustness and controllability when handling complex structures and parts with significant size variations.

Oral

PAVAS: Physics-Aware Video-to-Audio Synthesis

Oh Hyun-Bin ⋅ Yuhta Takida ⋅ Toshimitsu Uesaka ⋅ Tae-Hyun Oh ⋅ Yuki Mitsufuji

Jun 6, 9:50 AM - 10:02 AM Mile High Ballroom 3A - 4A

Recent advances in Video-to-Audio (V2A) generation have achieved impressive perceptual quality and temporal synchronization, yet most models remain appearance-driven, capturing visual-acoustic correlations without considering the physical factors that shape real-world sounds. We present Physics-Aware Video-to-Audio Synthesis (PAVAS), a method that incorporates physical reasoning into latent diffusion-based V2A generation through the Physics-Driven Audio Adapter (Phy-Adapter). The adapter receives object-level physical parameters estimated by the Physical Parameter Estimator (PPE), which uses a Vision Language Model (VLM) to infer the moving-object mass and a segmentation-based dynamic 3D reconstruction module to recover its motion trajectory for velocity computation. These physical cues enable the model to synthesize sounds that reflect underlying physical factors. To assess physical realism, we curate VGG-Impact, a benchmark focusing on object–object interactions, and introduce Audio-Physics Correlation Coefficient (APCC), an evaluation metric that measures consistency between physical and auditory attributes. Comprehensive experiments show that PAVAS produces physically plausible and perceptually coherent audio, outperforming existing V2A models in both quantitative and qualitative evaluations.

Oral

Native and Compact Structured Latents for 3D Generation

Jianfeng XIANG ⋅ Xiaoxue Chen ⋅ Sicheng Xu ⋅ Ruicheng Wang ⋅ Zelong Lv ⋅ Yu Deng ⋅ Hongyuan Zhu ⋅ Yue Dong ⋅ Hao Zhao ⋅ Nicholas Jing Yuan ⋅ Jiaolong Yang

Jun 6, 9:50 AM - 10:02 AM Mile High Ballroom 1A - 2A

Recent advancements in 3D generative modeling have significantly improved generation realism, yet the field is still hampered by existing representations, which struggle to capture assets with complex topologies and detailed appearance. This paper presents an approach for learning a structured latent representation from native 3D data to address this challenge. At its core is a new sparse voxel structure called O-Voxel, an omni-voxel representation that encodes both geometry and appearance. O-Voxel can robustly model arbitrary topology, including open, non-manifold, and fully enclosed surfaces, while capturing comprehensive surface attributes beyond texture color, such as physically-based rendering parameters. Based on O-Voxel, we design a Sparse Compression VAE which provides a high spatial compression rate and a compact latent space. We train large-scale flow-matching models comprising 4B parameters for 3D generation using diverse public 3D asset datasets. Despite their scale, inference remains highly efficient. Meanwhile, the geometry and material quality of our generated assets far exceed those of existing models. We believe our approach offers a significant advancement in 3D generative modeling.

Oral

SliderEdit: Continuous Image Editing with Fine-Grained Instruction Control

Arman Zarei ⋅ Samyadeep Basu ⋅ Mobina Pournemat ⋅ Sayan Nag ⋅ Ryan A. Rossi ⋅ Soheil Feizi

Jun 6, 10:02 AM - 10:15 AM Mile High Ballroom 1A - 2A

Instruction-based image editing models have recently achieved impressive performance, enabling complex edits to an input image from a multi-instruction prompt. However, these models apply each instruction in the prompt with a fixed strength, limiting the user’s ability to precisely and continuously control the intensity of individual edits. We introduce *SliderEdit*, a framework for continuous image editing with fine-grained, interpretable instruction control. Given a multi-part edit instruction, SliderEdit disentangles the individual instructions and exposes each as a globally trained slider, allowing smooth adjustment of its strength. Unlike prior works that introduced slider-based attribute controls in text-to-image generation, typically requiring separate training or fine-tuning for each attribute or concept, our method learns a *single* set of low-rank adaptation matrices that generalize across diverse edits, attributes, and compositional instructions. This enables continuous interpolation along individual edit dimensions while preserving both spatial locality and global semantic consistency. We apply SliderEdit to state-of-the-art editing models, including FLUX-Kontext and Qwen-Image-Edit, and observe substantial improvements in edit controllability, visual consistency, and user steerability. We are the first to explore and propose a framework for continuous, fine-grained instruction control in image editing models. Our results pave the way for interactive, instruction-driven image manipulation with continuous and compositional control.

Oral

ProPhy: Progressive Physical Alignment for Dynamic World Simulation

Zijun Wang ⋅ Panwen Hu ⋅ Jing Wang ⋅ Terry Jingchen Zhang ⋅ Yuhao Cheng ⋅ Long Chen ⋅ Yiqiang Yan ⋅ Zutao Jiang ⋅ Hanhui Li ⋅ Xiaodan Liang

Jun 6, 10:02 AM - 10:15 AM Mile High Ballroom 3A - 4A

Recent advances in video generation have shown remarkable potential for constructing world simulators. However, current models still struggle to produce physically consistent results, particularly when handling large-scale or complex dynamics. This limitation arises primarily because existing approaches respond isotropically to physical prompts and neglect the fine-grained alignment between generated content and localized physical cues. To address these challenges, we propose ProPhy, a Progressive Physical Alignment Framework that enables explicit physics-aware conditioning and anisotropic generation. ProPhy employs a two-stage Mixture-of-Physics-Experts (MoPE) mechanism for discriminative physical prior extraction, where Semantic Experts infer semantic-level physical principles from textual descriptions, and Refinement Experts capture token-level physical dynamics. This mechanism allows the model to learn fine-grained, physics-aware video representations that better reflect underlying physical laws. Furthermore, we introduce a physical alignment strategy that transfers the physical reasoning capabilities of vision-language models (VLMs) into the Refinement Experts, facilitating a more accurate representation of dynamic physical phenomena. Extensive experiments on physics-aware video generation benchmarks demonstrate that ProPhy produces more realistic, dynamic, and physically coherent results than existing state-of-the-art methods.

Oral

Streaming Diffusion Model for Fast Infrared and Visible Video Fusion

Jinyuan Liu ⋅ Ludan Sun ⋅ Tengyu Ma ⋅ Chunyan Yang ⋅ Zhiying Jiang ⋅ Long Ma ⋅ Risheng Liu ⋅ Xin Fan

Jun 6, 10:02 AM - 10:15 AM Bluebird Ballroom

Infrared and visible video fusion is pivotal for robust perceptual systems, aiming to synthesize a comprehensive video stream that leverages both thermal resilience and textured details. However, prevailing methods, by treating video as independent frames, inherently introduce temporal incoherence, such as flickering and ghosting artifacts. While diffusion models possess strong generative priors to remedy this, their iterative nature is prohibitively slow for video. To resolve this fundamental dilemma, we propose a streaming diffusion model for efficient infrared and visible video fusion, termed SDMFusion. Our key insight is to distill the generative prior of a pre-trained diffusion model into a one-step sampling framework, while explicitly modeling temporal dynamics. We design a memory-augmented latent pipeline where a temporal aggregation adapter aligns and propagates cross-frame features to ensure coherence, supported by a dedicated temporal consistency loss. This approach effectively decouples the challenge of achieving high fidelity from maintaining temporal stability. Extensive experiments on four benchmarks demonstrate that our method establishes a new state-of-the-art, generating fused videos with exceptional spatio-temporal consistency at a speed suitable for real-time application.

Oral

Scalable Multi-View Subspace Clustering with Tensorized Anchor Guidance

Miao Jia ⋅ Xingchen Hu ⋅ Jiyuan Liu ⋅ Siwei Wang ⋅ Min Wang ⋅ Zijian Chen

Jun 6, 10:02 AM - 10:15 AM Four Seasons Ballroom

Anchor-based multi-view clustering methods have gained significant attention in recent years for their effectiveness in handling large-scale datasets. The performance of these methods is highly dependent on anchor quality. However, current methods neglect the interactive relationships among cross-view anchors, failing to effectively discover and exploit consistent and complementary information, leading to noisy or suboptimal anchor representations. In this paper, we propose a novel scalable multi-view subspace clustering method with tensorized anchor guidance (SMVS-TAG), which directly couples anchors across views to improve clustering performance. Specifically, we construct a third-order anchor tensor from view-specific anchors in a low-dimensional latent space. By imposing a tensor Schatten p-norm constraint on the anchor tensor, we can explicitly capture cross-view low-rank structure and jointly exploit consistency and complementarity information among anchors. Moreover, the tensorized anchor regularizer is independent of the number of samples, which reduces both time and space complexity. Experimental results on seven datasets demonstrate that SMVS-TAG achieves superior effectiveness and stability compared to state-of-the-art large-scale MVC methods.

Oral

Breaking the Scalability Limit of Multi-Projector Calibration with Embedded Cameras

Takumi Kawano ⋅ Kohei Miura ⋅ Daisuke Iwai

Jun 6, 2:00 PM - 2:12 PM Mile High Ballroom 1A - 2A

Conventional multi-projector calibration requires projecting and capturing structured light patterns for each projector sequentially, causing calibration time and effort to increase linearly with the number of projectors. This scalability bottleneck has long limited the deployment of large-scale projection mapping systems. We present a new calibration framework that breaks this limitation by embedding cameras into the surface of the calibration target. The embedded cameras directly capture the incoming projection light, enabling the separation of simultaneously projected structured light patterns from multiple projectors according to their incident directions. Our method establishes correspondences between the optical centers of the embedded cameras and the projector pixels, allowing the intrinsic and extrinsic parameters of all projectors to be simultaneously estimated. We further introduce a correction technique for small misalignments between the calibration board and camera optical centers. As a result, our system achieves calibration accuracy comparable to conventional methods while reducing the required number of projection-capture cycles from linear to nearly constant with respect to the number of projectors, dramatically improving scalability for large multi-projector environments.

Oral

Chorus: Multi-Teacher Pretraining for Holistic 3D Gaussian Scene Encoding

Yue Li ⋅ Qi Ma ⋅ Runyi Yang ⋅ Mengjiao Ma ⋅ Bin Ren ⋅ Nikola Popovic ⋅ Nicu Sebe ⋅ Theo Gevers ⋅ Luc Van Gool ⋅ Danda Paudel ⋅ Martin R. Oswald

Jun 6, 2:00 PM - 2:12 PM Bluebird Ballroom

While 3DGS has emerged as a high-fidelity scene representation, encoding rich, general-purpose features directly from its primitives remains under-explored. We address this gap by introducing Chorus, a multi-teacher pretraining framework that learns a holistic feed-forward 3D Gaussian Splatting (3DGS) scene encoder by distilling complementary signals from 2D foundation models. Chorus employs a shared 3D encoder and teacher-specific projectors to learn from language-aligned, generalist, and object-aware teachers, encouraging a shared embedding space that captures signals from high-level semantics to fine-grained structure. We evaluate Chorus on a wide range of tasks: open-vocabulary semantic and instance segmentation, linear and decoder probing, as well as data-efficient supervision. Besides 3DGS, we also test Chorus on several benchmarks that only support point clouds by pretraining a variant using only Gaussians’ centers, colors, and estimated normals as inputs. Interestingly, this encoder shows strong transfer and outperforms the point cloud baseline while using $39.9\times$ fewer training scenes. Finally, we propose a render-and-distill adaptation that facilitates out-of-domain finetuning. Our code and model will be released upon publication.

Oral

CodeV: Code with Images for Faithful Visual Reasoning via Tool-Aware Policy Optimization

Xinhai Hou ⋅ Shaoyuan Xu ⋅ Manan Biyani ⋅ Moyan Li ⋅ Jia Liu ⋅ Todd C. Hollon ⋅ Bryan Wang

Jun 6, 2:00 PM - 2:12 PM Four Seasons Ballroom

Agentic vision–language models are increasingly trained to “think with images” by calling image operations. However, we show that high final-answer accuracy often hides unfaithful visual reasoning: models may invoke tools on irrelevant regions or ignore tool outputs entirely, yet still guess the correct answer. In this work, we first propose a faithfulness evaluation protocol that measures whether intermediate visual tool outputs (e.g., crops) actually contain the queried evidence. This reveals that recent visual agents achieve high final-answer accuracy but exhibit low rates of faithful tool-use on visual search benchmarks. We then introduce CodeV, a code-based visual agent trained with Tool-Aware Policy Optimization (TAPO). TAPO is a process-level RL framework that augments GRPO with dense rewards defined directly on visual tool inputs and outputs, rather than on chain-of-thought tokens, making supervision easier to verify and less susceptible to reward hacking. CodeV represents visual tools as executable Python code, and TAPO assigns step-wise rewards based solely on the question and tool output, encouraging both necessary and evidence-consistent tool use. In a two-stage SFT+RL pipeline, CodeV achieves competitive or superior accuracy while substantially increasing faithful tool-use rates on related visual search benchmarks. Beyond visual search, CodeV attains strong performance on a range of multimodal reasoning and math benchmarks, suggesting that explicitly supervising intermediate tool behavior is crucial for building trustworthy, agentic visual reasoning systems.

Oral

INSID3: Training-Free In-Context Segmentation with DINOv3

Claudia Cuttano ⋅ Gabriele Trivigno ⋅ Christoph Reich ⋅ Daniel Cremers ⋅ Carlo Masone ⋅ Stefan Roth

Jun 6, 2:00 PM - 2:12 PM Mile High Ballroom 3A - 4A

In-context segmentation (ICS) aims to segment arbitrary concepts, objects, parts, or personalized instances given a few annotated visual examples. Existing work either (i) fine-tunes vision foundation models (VFMs), which improves in-domain results but limits generalization, or (ii) combines multiple frozen VFMs, which preserves generalization but yields architectural complexity and fixed segmentation granularities. We revisit ICS from a minimalist perspective and ask: Can a single self-supervised backbone support both semantic matching and segmentation, without any supervision or auxiliary models? We show that scaled-up dense self-supervised features from DINOv3 exhibit strong spatial structure and semantic correspondence. We introduce INSID3, a training-free approach that segments concepts at varying granularities using only frozen DINOv3 features, given an in-context example. INSID3 achieves state-of-the-art results across one-shot semantic, part, and personalized segmentation, outperforming previous work by +6.1% mIoU, while using 3× fewer parameters and without any mask or category-level supervision.

Oral

NitroGen: An Open Foundation Model for Generalist Gaming Agents

Loïc Magne ⋅ Anas Awadalla ⋅ Guanzhi Wang ⋅ Yinzhen Xu ⋅ Joshua Belofsky ⋅ Fengyuan Hu ⋅ Joohwan Kim ⋅ Ludwig Schmidt ⋅ Georgia Gkioxari ⋅ Jan Kautz ⋅ Yisong Yue ⋅ Yejin Choi ⋅ Yuke Zhu ⋅ Jim Fan

Jun 6, 2:12 PM - 2:25 PM Four Seasons Ballroom

We introduce NitroGen, a video-action foundation model for generalist gaming agents, trained on 40,000 hours of gameplay videos across more than 1000 games. We incorporate three key ingredients: 1) an internet-scale video-action dataset constructed by automatically extracting player actions from publicly available gameplay videos, 2) a multi-game benchmark environment that can measure cross-game generalization, and 3) a unified vision-action model trained with large-scale behavior cloning. NitroGen exhibits strong competence across diverse domains, including combat encounters in 3D action games, high-precision control in 2D platformers, and exploration in procedurally generated worlds. It transfers effectively to unseen games, achieving up to 52% relative improvement in success rates over models trained from scratch. We release the dataset, benchmark, and model weights to advance research on generalist embodied agents.

Oral

Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners

Nikita Araslanov ⋅ Martin Sundermeyer ⋅ Hidenobu Matsuki ⋅ David Joseph Tan ⋅ Federico Tombari

Jun 6, 2:12 PM - 2:25 PM Bluebird Ballroom

One of the most exciting applications of vision models involves pixel-level reasoning. Despite the abundance of vision foundation models, we still lack representations that effectively embed spatio-temporal properties of visual scenes at the pixel level. Existing frameworks either train on image-based pretext tasks, which do not account for dynamic elements, or on video sequences for action-level reasoning, which does not scale to dense pixel-level prediction. We present LILA, a framework that learns pixel-accurate feature descriptors from videos. The core element of our training framework is linear in-context learning. LILA leverages spatio-temporal cue maps (depth and motion) estimated with off-the-shelf networks. Despite the noisy nature of those cues, LILA trains effectively on uncurated video datasets, embedding semantic and geometric properties in a temporally consistent manner. We demonstrate compelling empirical benefits of the learned representation across a diverse suite of vision tasks: video object segmentation, surface normal estimation and semantic segmentation.

Oral

MARCO: Navigating the Unseen Space of Semantic Correspondence

Claudia Cuttano ⋅ Gabriele Trivigno ⋅ Carlo Masone ⋅ Stefan Roth

Jun 6, 2:12 PM - 2:25 PM Mile High Ballroom 3A - 4A

Recent advances in semantic correspondence rely on dual-encoder architectures, combining DINOv2 with diffusion backbones. While accurate, these billion-parameter models generalize poorly beyond training keypoints, revealing a gap between benchmark performance and real-world usability, where queried points rarely match those seen during training. Building upon DINOv2, we introduce MARCO, a unified model for generalizable correspondence driven by a novel training framework that enhances both fine-grained localization and semantic generalization. By coupling a coarse-to-fine objective that refines spatial precision with a self-distillation framework, which extends sparse supervision beyond annotated regions, our approach transforms a handful of keypoints into dense, semantically coherent correspondences. MARCO sets a new state of the art on SPair-71k, AP-10K and PF-PASCAL, with gains that amplify at fine-grained localization thresholds (+10.3 PCK@0.01) and the strongest generalization to unseen keypoints (+3.8, SPair-U) and categories (+5.6, MP-100), while remaining 3× smaller and 10× faster than diffusion-based approaches.
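The PCK@0.01 figure above refers to the Percentage of Correct Keypoints at a tight threshold. A minimal sketch of the metric follows, assuming the common bounding-box convention in which a prediction counts as correct when it lies within `alpha * max(box height, box width)` of the ground truth. The function name and the toy coordinates are illustrative, and individual benchmarks may normalise differently.

```python
import numpy as np


def pck(pred, gt, bbox_hw, alpha):
    """Fraction of predicted keypoints within alpha * max(bbox_hw) of the ground truth."""
    dist = np.linalg.norm(pred - gt, axis=-1)
    return float((dist <= alpha * max(bbox_hw)).mean())


# Two keypoints on an object whose bounding box is 100 x 200 pixels.
pred = np.array([[10.0, 10.0], [50.0, 50.0]])
gt = np.array([[10.0, 11.0], [60.0, 50.0]])

strict = pck(pred, gt, bbox_hw=(100, 200), alpha=0.01)  # threshold: 2 pixels
loose = pck(pred, gt, bbox_hw=(100, 200), alpha=0.1)    # threshold: 20 pixels
```

The second keypoint is 10 pixels off, so it passes only the looser threshold. This is why gains at small `alpha` are read as evidence of fine-grained localization.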

Oral

GaussianFluent: Gaussian Simulation for Dynamic Scenes with Mixed Materials

Bei Huang ⋅ Yixin Chen ⋅ Ruijie Lu ⋅ Gang Zeng ⋅ Hongbin Zha ⋅ Yuru Pei ⋅ Siyuan Huang

Jun 6, 2:12 PM - 2:25 PM Mile High Ballroom 1A - 2A

3D Gaussian Splatting (3DGS) has emerged as a prominent 3D representation for high-fidelity and real-time rendering. Prior work has coupled physics simulation with Gaussians, but predominantly targets soft, deformable materials, leaving brittle fracture largely unresolved. This stems from two key obstacles: the lack of volumetric interiors with coherent textures in the GS representation, and the absence of fracture-aware simulation methods for Gaussians. To address these challenges, we introduce GaussianFluent, a unified framework for realistic simulation and rendering of dynamic object states. First, it synthesizes photorealistic interiors by densifying internal Gaussians guided by generative models. Second, it integrates an optimized Continuum Damage Material Point Method (CD-MPM) to enable brittle fracture simulation at remarkably high speed. Our approach handles complex scenarios including mixed-material objects and multi-stage fracture propagation, achieving results infeasible with previous methods. Experiments clearly demonstrate GaussianFluent's capability for photorealistic, real-time rendering with structurally consistent interiors, highlighting its potential for downstream applications such as VR and robotics.

Oral

InfiniBench: Infinite Benchmarking for Visual Spatial Reasoning with Customizable Scene Complexity

Haoming Wang ⋅ Qiyao Xue ⋅ Wei Gao

Jun 6, 2:25 PM - 2:37 PM Mile High Ballroom 1A - 2A

Modern vision-language models (VLMs) are expected to be capable of spatial reasoning across diverse scene complexities, but evaluating such abilities is difficult due to the lack of benchmarks that are not only diverse and scalable but also fully customizable. Existing benchmarks offer limited customizability over the scene complexity and are incapable of isolating and analyzing specific VLM failure modes under distinct spatial conditions. To address this gap, instead of individually presenting benchmarks for different scene complexities, in this paper we present InfiniBench, a fully automated, customizable and user-friendly benchmark generator that can synthesize a theoretically infinite variety of 3D scenes with parameterized control on scene complexity. InfiniBench uniquely translates scene descriptions in natural language into photo-realistic videos with complex and physically plausible 3D layouts. This is achieved through three key innovations: 1) an LLM-based agentic framework that iteratively refines procedural scene constraints from scene descriptions; 2) a flexible cluster-based layout optimizer that generates dense and cluttered scenes previously intractable for procedural methods; and 3) a task-aware camera trajectory optimization method that renders scenes into videos with full object coverage as VLM input. Experiments demonstrate that InfiniBench outperforms state-of-the-art procedural and LLM-based 3D generation methods in prompt fidelity and physical plausibility, especially in high-complexity scenarios. We further showcase the usefulness of InfiniBench by generating benchmarks for representative spatial reasoning tasks including measurement, perspective-taking and spatiotemporal tracking.

Oral

PAI-Bench: A Comprehensive Benchmark For Physical AI

Fengzhe Zhou ⋅ Jiannan Huang ⋅ Jialuo Li ⋅ Deva Ramanan ⋅ Humphrey Shi

Jun 6, 2:25 PM - 2:37 PM Four Seasons Ballroom

Physical AI aims to develop models that can perceive and predict real-world dynamics; yet, the extent to which current multi-modal large language models and video generative models support these abilities is insufficiently understood. We introduce Physical AI Bench (PAI-Bench), a unified and comprehensive benchmark that evaluates perception and prediction capabilities across video generation, conditional video generation, and video understanding, comprising 2,808 real-world cases with task-aligned metrics designed to capture physical plausibility and domain-specific reasoning. Our study provides a systematic assessment of recent models and shows that video generative models, despite strong visual fidelity, often struggle to maintain physically coherent dynamics, while multi-modal large language models exhibit limited performance in forecasting and causal interpretation. These observations suggest that current systems are still at an early stage in handling the perceptual and predictive demands of Physical AI. In summary, PAI-Bench establishes a realistic foundation for evaluating Physical AI and highlights key gaps that future systems must address.

Oral

PR-MaGIC: Prompt Refinement Via Mask Decoder Gradient Flow For In-Context Segmentation

Minjae Lee ⋅ Sungwoo Hur ⋅ Soojin Hwang ⋅ Won Hwa Kim

Jun 6, 2:25 PM - 2:37 PM Mile High Ballroom 3A - 4A

Visual Foundation Models (VFMs) such as the Segment Anything Model (SAM) have significantly advanced broad use of image segmentation. However, SAM and its variants necessitate substantial manual effort for prompt generation and additional training for specific applications. Recent approaches address these limitations by integrating SAM into in-context (one/few shot) segmentation, enabling auto-prompting through semantic alignment between query and support images. Despite these efforts, they still generate sub-optimal prompts that degrade segmentation quality due to visual inconsistencies between support and query images. To tackle this limitation, we introduce PR-MaGIC (Prompt Refinement via Mask Decoder Gradient Flow for In-Context Segmentation), a training-free test-time framework that refines prompts via gradient flow derived from SAM’s mask decoder. PR-MaGIC seamlessly integrates into in-context segmentation frameworks, being theoretically grounded yet practically stabilized through a simple top-1 selection strategy that ensures robust performance across samples. Extensive evaluations demonstrate that PR-MaGIC consistently improves segmentation quality across various benchmarks, effectively mitigating inadequate prompts without requiring additional training or architectural modifications.

Oral

From Pairs to Sequences: Track-Aware Policy Gradients for Keypoint Detection

yepeng liu ⋅ Hao Li ⋅ Liwen Yang ⋅ Fangzhen Li ⋅ Xudi Ge ⋅ Yuliang Gu ⋅ kuang Gao ⋅ Bing Wang ⋅ Guang Chen ⋅ Hangjun Ye ⋅ Yongchao Xu

Jun 6, 2:25 PM - 2:37 PM Bluebird Ballroom

Keypoint-based matching is a fundamental component of modern 3D vision systems, such as Structure-from-Motion (SfM) and SLAM. Most existing learning-based methods are trained on image pairs, a paradigm that fails to explicitly optimize for the long-term trackability of keypoints across sequences under challenging viewpoint and illumination changes. In this paper, we reframe keypoint detection as a sequential decision-making problem. We introduce TraqPoint, a novel, end-to-end Reinforcement Learning (RL) framework designed to optimize the Track-quality (Traq) of keypoints directly on image sequences. Our core innovation is a track-aware reward mechanism that jointly encourages the consistency and distinctiveness of keypoints across multiple views, guided by a policy gradient method. Extensive evaluations on sparse matching benchmarks, including relative pose estimation and 3D reconstruction, demonstrate that TraqPoint significantly outperforms some state-of-the-art keypoint detection and description methods.

Oral

MAGICIAN: Efficient Long-Term Planning with Imagined Gaussians for Active Mapping

Shiyao Li ⋅ Antoine Guédon ⋅ Shizhe Chen ⋅ Vincent Lepetit

Jun 6, 2:37 PM - 2:50 PM Mile High Ballroom 1A - 2A

Active mapping aims to determine how an agent should move to efficiently reconstruct an unknown environment. Most existing approaches rely on greedy next-best-view prediction, resulting in inefficient exploration and incomplete scene reconstruction. To address this limitation, we introduce MAGICIAN, a novel long-term planning framework that maximizes accumulated surface coverage gain through Imagined Gaussians, a predicted scene representation derived from a pre-trained occupancy network with strong structural priors. This representation enables efficient computation of coverage gain for any novel viewpoint via fast volumetric rendering. The resulting speedup allows the integration of the gain metric into a tree-search algorithm for planning long-horizon paths. We update Imagined Gaussians and refine the planned trajectory in a closed-loop manner. Our method achieves state-of-the-art performance across indoor and outdoor benchmarks with varying action spaces, demonstrating the critical advantage of long-term planning in active mapping.

Oral

RefAV: Towards Planning-Centric Scenario Mining

Cainan Davidson ⋅ Deva Ramanan ⋅ Neehar Peri

Jun 6, 2:37 PM - 2:50 PM Four Seasons Ballroom

Autonomous Vehicles (AVs) collect and pseudo-label terabytes of multi-modal data localized to HD maps during normal fleet testing. However, identifying interesting and safety-critical scenarios from uncurated driving logs remains a significant challenge. Traditional scenario mining techniques are error-prone and prohibitively time-consuming, often relying on hand-crafted structured queries. In this work, we revisit spatio-temporal scenario mining through the lens of recent vision-language models (VLMs) to detect whether a described scenario occurs in a driving log and, if so, precisely localize it in both time and space. To address this problem, we introduce RefAV, a large-scale dataset of $10,000$ diverse natural language queries that describe complex multi-agent interactions relevant to motion planning derived from $1000$ driving logs in the Argoverse 2 Sensor dataset. We evaluate several referential multi-object trackers and present an empirical analysis of our baselines. Notably, we find that naively repurposing off-the-shelf VLMs yields poor performance, suggesting that scenario mining presents unique challenges. Lastly, we discuss our recently held competition and share insights from the community.

Oral

Linear Fundamental Matrix Estimation from 7 or 5 Points

Taci Ata Kucukpinar ⋅ Juan Mogollon ⋅ Joshua Fraser ⋅ Timothy Duff ⋅ Kannappan Palaniappan

Jun 6, 2:37 PM - 2:50 PM Bluebird Ballroom

We revisit the problem of estimating the fundamental matrix of a pair of perspective cameras, a cornerstone of geometric computer vision. As is well-known, linear solvers require at least 8 point correspondences, whereas nonlinear minimal solvers require just 7 in the uncalibrated case or 5 in the calibrated case. In this paper, we consider a special case of the 7-point problem where 5 of the points are configured to lie on two lines, which has previously been shown to have a unique solution. As a theoretical contribution, we offer an analysis of how this uniqueness manifests in the standard 7-point algorithm. On a practical level, we provide the first practical linear solver for the minimal problem associated to this special configuration. Additionally, we evaluate a heuristic 5-point fundamental matrix solver based on the construction of virtual midpoints. When combined with early non-minimal fitting, the runtime and accuracy of our solver are competitive with the state-of-the-art (SoTA) on multiple benchmarks.
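
For context on the 8-correspondence linear baseline mentioned in the abstract, the following is a minimal NumPy sketch of the classical normalized 8-point algorithm. It is not the solver proposed in this paper; the function names `normalize` and `eight_point` are illustrative, and the convention assumed is `x2^T F x1 = 0`.

```python
import numpy as np

def normalize(pts):
    """Hartley normalization: zero mean, mean distance sqrt(2) from the origin."""
    mean = pts.mean(axis=0)
    scale = np.sqrt(2) / np.linalg.norm(pts - mean, axis=1).mean()
    T = np.array([[scale, 0.0, -scale * mean[0]],
                  [0.0, scale, -scale * mean[1]],
                  [0.0, 0.0, 1.0]])
    homog = np.column_stack([pts, np.ones(len(pts))])
    return (T @ homog.T).T, T

def eight_point(x1, x2):
    """Estimate F (x2^T F x1 = 0) from N >= 8 correspondences given as N x 2 arrays."""
    p1, T1 = normalize(x1)
    p2, T2 = normalize(x2)
    # Each correspondence contributes one linear equation in the 9 entries of F
    # (row-major order), so 8 generic points fix F up to scale.
    A = np.column_stack([
        p2[:, 0] * p1[:, 0], p2[:, 0] * p1[:, 1], p2[:, 0],
        p2[:, 1] * p1[:, 0], p2[:, 1] * p1[:, 1], p2[:, 1],
        p1[:, 0], p1[:, 1], np.ones(len(p1)),
    ])
    # The solution is the right singular vector of the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    # Enforce the rank-2 constraint by zeroing the smallest singular value.
    U, S, Vt = np.linalg.svd(F)
    F = U @ np.diag([S[0], S[1], 0.0]) @ Vt
    # Undo the normalization and fix the overall scale.
    F = T2.T @ F @ T1
    return F / np.linalg.norm(F)
```

The minimal 7-point and 5-point solvers discussed in the abstract need fewer correspondences than this baseline, at the cost of solving nonlinear constraints in the general case.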

Oral

R^2-Seg: Training-Free OOD Medical Tumor Segmentation via Anatomical Reasoning and Statistical Rejection

Shuaike Shen ⋅ Ke Liu ⋅ Jiaqing Xie ⋅ Shangde Gao ⋅ Chunhua Shen ⋅ Ge Liu ⋅ Mireia Crispin-Ortuzar ⋅ Shangqi Gao

Jun 6, 2:37 PM - 2:50 PM Mile High Ballroom 3A - 4A

Foundation models for medical image segmentation struggle under out-of-distribution (OOD) shifts, often producing fragmented false positives on OOD tumors. We introduce **R$^2$-Seg**, a **training-free** framework for robust OOD tumor segmentation that operates via a two-stage **Reason-and-Reject** process. First, the **Reason** step employs an LLM-guided anatomical reasoning planner to localize organ anchors and generate multi-scale ROIs. Second, the **Reject** step applies two-sample statistical testing to candidates generated by a frozen foundation model (BiomedParse) within these ROIs. This statistical rejection filter retains only candidates significantly different from normal tissue, effectively suppressing false positives. Our framework requires no parameter updates, making it compatible with zero-update test-time augmentation and avoiding catastrophic forgetting. On multi-center and multi-modal tumor segmentation benchmarks, **R$^2$-Seg** substantially improves Dice, specificity, and sensitivity over strong baselines and the original foundation models.

Oral

SoccerMaster: A Vision Foundation Model for Soccer Understanding

Haolin Yang ⋅ Jiayuan Rao ⋅ Haoning Wu ⋅ Weidi Xie

Jun 6, 2:50 PM - 3:02 PM Four Seasons Ballroom

Soccer understanding has recently garnered growing research interest due to its domain-specific complexity and unique challenges. However, prior works typically rely on task-specific expert models, which are resource-intensive and hinder a holistic view of the game. This paper proposes a unified framework that enables a single model to handle diverse soccer visual understanding tasks, spanning both fine-grained perception (e.g., athlete detection) and semantic reasoning (e.g., event classification). Concretely, we make the following contributions: (i) we present **SoccerMaster**, the first soccer-specific vision foundation model that unifies comprehensive understanding tasks within a single framework via **supervised multi-task pretraining**; (ii) we consolidate multiple existing soccer video datasets and develop an automated data curation pipeline, termed **SoccerFactory**, to produce scalable multi-task training annotations; and (iii) we conduct extensive experiments demonstrating that SoccerMaster consistently outperforms task-specific expert models across diverse downstream tasks, underscoring its breadth and superiority. The data, code, and model will be publicly available to the research community.

Oral

OccuFly: A 3D Vision Benchmark for Semantic Scene Completion from the Aerial Perspective

Markus Gross ⋅ Sai B. Matha ⋅ Aya Fahmy ⋅ Rui Song ⋅ Daniel Cremers ⋅ Henri Meeß

Jun 6, 2:50 PM - 3:02 PM Bluebird Ballroom

Semantic Scene Completion (SSC) is crucial for 3D perception in mobile robotics, as it enables holistic scene understanding by jointly estimating dense volumetric occupancy and per-voxel semantics. Although SSC has been widely studied in terrestrial domains such as autonomous driving, aerial scenarios like autonomous flying remain largely unexplored, thereby limiting progress on downstream applications. Furthermore, LiDAR sensors represent the primary modality for SSC data generation, which poses challenges for most uncrewed aerial vehicles (UAVs) due to flight regulations, mass and energy constraints, and the sparsity of LiDAR-based point clouds from elevated viewpoints. To address these limitations, we introduce OccuFly, the first real-world, camera-based aerial SSC benchmark, captured at altitudes of 50m, 40m, and 30m during spring, summer, fall, and winter. OccuFly covers urban, industrial, and rural scenarios, provides 22 semantic classes, and the data format adheres to established conventions to facilitate seamless integration with existing research. Crucially, we propose a LiDAR-free data generation framework that is based on camera modality, which is ubiquitous on modern UAVs. By utilizing traditional 3D reconstruction, our framework automates label transfer by projecting annotated 2D masks into the reconstructed 3D point cloud, thereby minimizing manual 3D annotation effort. Finally, we benchmark several state-of-the-art SSC methods on OccuFly using standard metrics, and highlight challenges specific to aerial viewpoints, yielding a comprehensive aerial vision benchmark that fosters holistic aerial 3D scene understanding.

Oral

Memory-Augmented Scene Understanding and Exploration for Open-World Aerial Object-Goal Navigation

Jiacong Zhou ⋅ Jiaxu Miao ⋅ Yourun Lin ⋅ Xianyun Wang ⋅ Jun Xiao ⋅ Jun Yu

Jun 6, 2:50 PM - 3:02 PM Mile High Ballroom 1A - 2A

Aerial object-goal navigation (Aerial ObjectNav) requires an Unmanned Aerial Vehicle (UAV) to navigate to target objects in large-scale outdoor environments using only visual observations and high-level object descriptions, without detailed step-by-step instructions. Existing approaches rely on local observations or short-term history, lacking comprehensive scene understanding and efficient spatial exploration strategies, which constrains their navigation capability in complex aerial scenarios. To address these challenges, we propose OctMem-Agent, an octree memory-augmented framework for aerial object-goal navigation. Specifically, we introduce an Adaptive Octree Memory that incrementally aggregates RGB-D observations into a hierarchical 3D representation, capturing both explored regions and unexplored frontiers across large-scale aerial environments. We further propose an Instruction-Guided Memory Query module that extracts task-relevant scene and exploration tokens through instruction-modulated queries. By integrating these tokens with visual observations and language instructions, OctoMem-Agent achieves comprehensive scene understanding and effective spatial exploration for target localization. Extensive experiments on the Aerial ObjectNav benchmark UAV-ON demonstrate that our method achieves a significant 7.5% improvement in success rate over existing methods, validating the effectiveness of our design.

Oral

The SA-FARI Dataset: Segment Anything in Footage of Animals for Recognition and Identification

Dante Wasmuht ⋅ Otto Brookes ⋅ Maximilian Schall ⋅ Pablo Palencia ⋅ Christopher Beirne ⋅ Tilo Burghardt ⋅ Majid Mirmehdi ⋅ Hjalmar Kühl ⋅ Mimi Arandjelovic ⋅ Sam Pottie ⋅ Peter Bermant ⋅ Brandon Asheim ⋅ Yi Jin Toh ⋅ Adam Elzinga ⋅ Jason Allan Holmberg ⋅ Andrew Whitworth ⋅ Eleanor Flatt ⋅ Laura Gustafson ⋅ Chaitanya Ryali ⋅ Yuan-Ting Hu ⋅ Baishan Guo ⋅ Andrew Westbury ⋅ Kate Saenko ⋅ Dídac Surís

Jun 6, 2:50 PM - 3:02 PM Mile High Ballroom 3A - 4A

Automated video analysis is critical for wildlife conservation. A foundational task in this domain is multi-animal tracking (MAT), which underpins applications such as individual re-identification and behavior recognition. However, existing datasets are limited in scale, constrained to a few species, or lack sufficient temporal and geographical diversity -- leaving no suitable benchmark for training general-purpose MAT models applicable across wild animal populations. To address this, we introduce SA-FARI, the largest open-source MAT dataset for wild animals. It comprises 11,609 camera trap videos collected over approximately 10 years (2014-2024) from 741 locations across 4 continents, spanning 99 species categories. Each video is exhaustively annotated, culminating in $\sim$46 hours of densely annotated footage containing 16,224 masklet identities and 942,702 individual bounding boxes, segmentation masks, and species labels. Alongside the task-specific annotations, we publish anonymized camera trap locations for each video. Finally, we present comprehensive benchmarks on SA-FARI using state-of-the-art vision-language models for detection and tracking, including SAM 3, evaluated with both species-specific and generic animal prompts. We also compare against vision-only methods developed specifically for wildlife analysis. SA-FARI is the first large-scale dataset to combine high species diversity, multi-region coverage, and high-quality spatio-temporal annotations, offering a new foundation for advancing generalizable multi-animal tracking in the wild. The dataset is available at [ANONYMIZED]

Oral

VGGT-Ω

Jianyuan Wang ⋅ Minghao Chen ⋅ Shangzhan Zhang ⋅ Nikita Karaev ⋅ Johannes Schönberger ⋅ Patrick Labatut ⋅ Piotr Bojanowski ⋅ David Novotny ⋅ Andrea Vedaldi ⋅ Christian Rupprecht

Jun 6, 3:02 PM - 3:15 PM Bluebird Ballroom

We present VGGT-Ω, a feed-forward model for 3D reconstruction that substantially advances the state of the art in accuracy, efficiency, and capability for both static and dynamic scenes. Prior models such as VGGT have shown that feed-forward 3D reconstruction can already be competitive with traditional optimization-based methods. Here, we further demonstrate that the accuracy and robustness of these models scale predictably with model capacity and data size. To enable training 3D reconstruction models at an unprecedented scale, we introduce a high-quality data annotation pipeline that handles dynamic scenes, a self-supervised learning protocol, and architectural changes that greatly reduce memory requirements. We significantly simplify VGGT’s architecture by replacing multiple dense prediction heads with loss-driven multitask learning, removing unstable DPT blocks, and introducing more efficient global attention via scene tokens. These changes allow us to efficiently train VGGT-Ω with 20$\times$ more supervised data and 100$\times$ more unsupervised data than prior work, while requiring only 30% of VGGT’s memory and running 1.6$\times$ faster at inference. As a result, VGGT-Ω establishes a new state of the art for 3D reconstruction on both static and dynamic scenes across a wide range of benchmarks, e.g., improving the camera estimation accuracy by 77% on the Sintel dataset. Models and code will be publicly released.

Oral

Monocular Open Vocabulary Occupancy Prediction for Indoor Scenes

Changqing Zhou ⋅ Yueru Luo ⋅ Han Zhang ⋅ Zeyu Jiang ⋅ Changhao Chen

Jun 6, 3:02 PM - 3:15 PM Mile High Ballroom 1A - 2A

Open-vocabulary 3D occupancy is vital for embodied agents, which need to understand complex indoor environments where semantic categories are abundant and evolve beyond fixed taxonomies. While recent work has explored open-vocabulary occupancy in outdoor driving scenarios, such methods transfer poorly indoors, where geometry is denser, layouts are more intricate, and semantics are far more fine-grained. To address these challenges, we adopt a geometry-only supervision paradigm that uses only binary occupancy labels (occupied vs. free). Our framework builds upon 3D Language-Embedded Gaussians, which serve as a unified intermediate representation coupling fine-grained 3D geometry with a language-aligned semantic embedding. On the geometry side, we find that existing Gaussian-to-Occupancy operators fail to converge under such weak supervision, and we introduce an opacity-aware, Poisson-based approach that stabilizes volumetric aggregation. On the semantic side, direct alignment between rendered features and open-vocabulary segmentation features suffers from feature mixing; we therefore propose a Progressive Temperature Decay schedule that gradually sharpens opacities during splatting, strengthening Gaussian–language alignment. On Occ-ScanNet, our framework achieves 59.50 IoU and 21.05 mIoU in the open-vocabulary setting, surpassing all existing occupancy methods in IoU and outperforming prior open-vocabulary approaches by a large margin in mIoU. Code will be released.

Oral

VGGT-Segmentor: Geometry-Enhanced Cross-View Segmentation

Yulu Gao ⋅ Bohao Zhang ⋅ Zongheng Tang ⋅ Jitong Liao ⋅ wenjun wu ⋅ Si Liu

Jun 6, 3:02 PM - 3:15 PM Mile High Ballroom 3A - 4A

Instance-level object segmentation across disparate egocentric and exocentric views is a fundamental challenge in visual understanding, critical for applications in embodied AI and remote collaboration. This task is exceptionally difficult due to severe changes in scale, perspective, and occlusion, which destabilize direct pixel-level matching. While recent geometry-aware models like VGGT provide a strong foundation for feature alignment, we find they often fail at dense prediction tasks due to significant pixel-level projection drift, even when their internal object-level attention remains consistent. To bridge this gap, we introduce VGGT-Segmentor (VGGT-S), a framework that unifies robust geometric modeling with pixel-accurate semantic segmentation. VGGT-S leverages VGGT's powerful cross-view feature representation and introduces a novel Union Segmentation Head. This head operates in three stages: mask prompt fusion, coarse point-guided prediction, and iterative mask refinement, effectively translating high-level feature alignment into a precise segmentation mask. Furthermore, we propose a single-image self-supervised training strategy that eliminates the need for paired annotations and enables strong zero-shot generalization. On the challenging Ego–Exo4D benchmark, VGGT-S sets a new state-of-the-art, achieving 67.7% and 68.0% average IoU for Ego→Exo and Exo→Ego tasks, respectively, significantly outperforming prior methods. Notably, our zero-shot model surpasses most fully-supervised baselines, demonstrating the effectiveness and scalability of our approach.

Oral

VS-Bench: Evaluating VLMs for Strategic Abilities in Multi-Agent Environments

Zelai Xu ⋅ Zhexuan Xu ⋅ Xiangmin Yi ⋅ Huining Yuan ⋅ Mo Guang ⋅ Kaiwen Long ⋅ Xinlei Chen ⋅ Yi Wu ⋅ Chao Yu ⋅ Yu Wang

Jun 6, 3:02 PM - 3:15 PM Four Seasons Ballroom

Recent advancements in Vision Language Models (VLMs) have expanded their capabilities to interactive agent tasks, yet existing benchmarks remain limited to single-agent or text-only environments. In contrast, real-world scenarios often involve multiple agents interacting within rich visual and textual contexts, posing challenges with both multimodal observations and strategic interactions. To bridge this gap, we introduce Visual Strategic Bench (VS-Bench), a multimodal benchmark that evaluates VLMs for strategic abilities in multi-agent environments. VS-Bench comprises ten vision-grounded environments that cover cooperative, competitive, and mixed-motive interactions. The performance of VLM agents is evaluated across three dimensions: perception measured by element recognition accuracy; strategic reasoning measured by next-action prediction accuracy; and decision-making measured by normalized episode return. Extensive experiments on fifteen leading VLMs show that, although current models exhibit strong perception abilities, there remains a significant gap to optimal performance in reasoning and decision-making, with the best-performing model attaining 46.6% prediction accuracy and 31.4% normalized return. We further analyze the key factors influencing performance, conduct human studies, and examine failure modes to provide a deeper understanding of VLMs' strategic abilities. By standardizing the evaluation and highlighting the limitations of existing models, we envision VS-Bench as a foundation for future research on strategic multimodal agents.

Oral

Evidential Neural Radiance Fields

Ruxiao Duan ⋅ Alex Wong

Jun 7, 9:00 AM - 9:12 AM Bluebird Ballroom

Understanding sources of uncertainty is fundamental to trustworthy three-dimensional scene modeling. While recent advances in neural radiance fields (NeRFs) achieve impressive accuracy in scene reconstruction and novel view synthesis, the lack of uncertainty estimation significantly limits their deployment in safety-critical settings. Existing uncertainty quantification methods for NeRFs fail to capture both aleatoric and epistemic uncertainty. Among those that do quantify one or the other, many of them either compromise rendering quality or incur significant computational overhead to obtain uncertainty estimates. To address these issues, we introduce Evidential Neural Radiance Fields, a probabilistic approach that seamlessly integrates with the NeRF rendering process and enables direct quantification of both aleatoric and epistemic uncertainty from a single forward pass. We compare multiple uncertainty quantification methods on three standardized benchmarks, where our approach demonstrates state-of-the-art scene reconstruction fidelity and uncertainty estimation quality.

Oral

AT-VLA: Adaptive Tactile Injection for Enhanced Feedback Reaction in Vision-Language-Action Models

Xiaoqi Li ⋅ Muhe Cai ⋅ Jiadong Xu ⋅ Juan Zhu ⋅ Hongwei Fan ⋅ Yan Shen ⋅ Guangrui Ren ⋅ Hao Dong

Jun 7, 9:00 AM - 9:12 AM Mile High Ballroom 1A - 2A

Vision-Language-Action (VLA) models have significantly advanced robotic agents capable of executing diverse tasks; however, they remain limited in contact-rich manipulation scenarios that require precise physical interactions. To address this limitation, recent studies have attempted to incorporate tactile signals during downstream tasks, enabling pretrained VLAs to interpret tactile feedback. Nevertheless, introducing new modalities during finetuning, which are rarely present in the pretraining stage, may disrupt the pretrained capabilities of VLAs. In addition, the inherently slow inference speed of VLAs hampers real-time responsiveness and limits the effective utilization of tactile feedback for action adjustment. To overcome these challenges, we propose Adaptive Tactile Vision-Language-Action (AT-VLA), which introduces a novel Adaptive Tactile Injection mechanism. This mechanism dynamically determines the appropriate timing and locations for tactile injection, incorporating tactile signals only when they significantly contribute to action generation, thereby minimizing interference with pretrained representations. Furthermore, to enable rapid and accurate tactile responses, we propose a Tactile Reaction Dual-Stream mechanism, which decouples sensory processing into a slow visual-language stream for low-frequency perceptual reasoning and a fast tactile control stream for high-frequency physical interaction understanding, achieving real-time closed-loop responses within 0.04 s. Real-world experiments thoroughly validate the effectiveness of AT-VLA in contact-rich manipulation tasks.

Oral

AToken: A Unified Tokenizer for Vision

Jiasen Lu ⋅ Liangchen Song ⋅ Mingze Xu ⋅ Byeongjoo Ahn ⋅ Yanjun Wang ⋅ Chen Chen ⋅ Afshin Dehghan ⋅ Yinfei Yang

Jun 7, 9:00 AM - 9:12 AM Four Seasons Ballroom

We present AToken, the first unified visual tokenizer that achieves both high-fidelity reconstruction and semantic understanding across images, videos, and 3D assets. Unlike existing tokenizers that specialize in either reconstruction or understanding for single modalities, AToken encodes these diverse visual inputs into a shared 4D latent space, unifying both tasks and modalities in a single framework. Specifically, we introduce a pure transformer architecture with 4D rotary position embeddings to process visual inputs of arbitrary resolutions and temporal durations. To ensure stable training, we introduce an adversarial-free training objective that combines perceptual and Gram matrix losses, achieving state-of-the-art reconstruction quality. By employing a progressive training curriculum, AToken gradually expands from single images to videos and 3D, and supports both continuous and discrete latent tokens. AToken achieves 0.21 rFID with 82.2% ImageNet accuracy for images, 3.01 rFVD with 40.2% MSRVTT retrieval for videos, and 28.28 PSNR with 90.9% classification accuracy for 3D. In downstream applications, AToken enables both visual generation tasks (e.g., image generation with continuous and discrete tokens, text-to-video generation, image-to-3D synthesis) and understanding tasks (e.g., multimodal LLMs), achieving competitive performance across all benchmarks. These results shed light on the next-generation multimodal AI systems built upon unified visual tokenization.

Oral

BoostSLT: Boosting Sign Language Translation via a Plug-and-Play Diffusion-Based Semantic Enhancer

Changzhou Han ⋅ Wanlun Ma ⋅ XI TANG ⋅ Kun Hu ⋅ Sheng Wen ⋅ Yang Xiang

Jun 7, 9:00 AM - 9:12 AM Mile High Ballroom 3A - 4A

Sign Language Translation (SLT) converts continuous sign videos into spoken language text, yet current models, whether gloss-based or gloss-free, struggle with long or discourse-level inputs. Recent architectures such as TwoStreamNetwork and CV-SLT have nearly saturated short-sentence accuracy, but their performance degrades on long sentences and multi-sentence paragraphs. In real scenarios such as news, interviews or daily conversations, signers naturally produce extended signing sequences with complex contextual dependencies. Moreover, identifying precise gloss boundaries remains a key obstacle, while gloss-based methods, though often superior, incur heavy annotation costs. The community therefore needs a solution that mitigates gloss dependency while preserving translation quality. We present **BoostSLT**, a context-aware framework enhancing semantic consistency over long sign sequences without gloss supervision. Instead of requiring explicit gloss segmentation, BoostSLT introduces an *Energy-Aware Temporal Segmentation (EAT-Seg)* module that dynamically partitions videos into semantically coherent fragments, followed by a *Diffusion-based Semantic Reconstruction (DSR)* module that stitches and refines fragment-level translations into globally fluent paragraphs. The framework is plug-and-play and model-agnostic, seamlessly integrating with existing gloss-based or gloss-free pipelines across languages. Experiments on PHOENIX-2014T, CSL-Daily, and Auslan-Daily show consistent BLEU and Rouge-L gains, confirming that diffusion-driven semantic reconstruction effectively bridges local accuracy and global coherence in long-form SLT.

Oral

ImmerIris: A Large-Scale Dataset and Benchmark for Off-Axis and Unconstrained Iris Recognition in Immersive Applications

Yuxi Mi ⋅ Qiuyang Yuan ⋅ Zhizhou Zhong ⋅ Xuan Zhao ⋅ Jiaogen Zhou ⋅ Fubao Zhu ⋅ Jihong Guan ⋅ Shuigeng Zhou

Jun 7, 9:12 AM - 9:25 AM Mile High Ballroom 3A - 4A

Recently, iris recognition has been regaining prominence in immersive applications such as extended reality as a means of seamless user identification. This application scenario introduces unique challenges compared to traditional iris recognition under controlled setups, as the ocular images are primarily captured off-axis and less constrained, causing perspective distortion, intra-subject variation, and quality degradation in iris textures. Datasets capturing these challenges remain limited. This paper fills this gap by presenting a large-scale iris dataset collected via head-mounted displays, termed ImmerIris. It contains 499,791 ocular images from 564 subjects, and is, to our knowledge, the largest public iris dataset to date and among the first dedicated to immersive applications. It is accompanied by a comprehensive set of evaluation protocols that benchmark recognition systems under various challenging conditions. This paper also draws attention to a shared obstacle of current recognition methods: the reliance on a pre-processing normalization stage, which is fallible in off-axis and unconstrained setups. To this end, this paper further proposes a normalization-free paradigm that directly learns from minimally adjusted ocular images. Despite its simplicity, it outperforms normalization-based prior art, indicating a promising direction for robust iris recognition.

Oral

Global-Aware Edge Prioritization for Pose Graph Initialization

Tong Wei ⋅ Giorgos Tolias ⋅ Jiri Matas ⋅ Daniel Barath

Jun 7, 9:12 AM - 9:25 AM Bluebird Ballroom

The pose graph is a core component of Structure-from-Motion (SfM), where images act as nodes and edges encode relative poses. Since geometric verification is expensive, SfM pipelines restrict the pose graph to a sparse set of candidate edges, making initialization critical. Existing methods rely on image retrieval to connect each image to its $k$ nearest neighbors, treating pairs independently and ignoring global consistency. We address this limitation through the concept of edge prioritization, ranking candidate edges by their utility for SfM. Our approach has three components: (1) a GNN trained with SfM-derived supervision to predict globally consistent edge reliability; (2) multi-minimal-spanning-tree-based pose graph construction guided by these ranks; and (3) connectivity-aware score modulation that reinforces weak regions and reduces graph diameter. This globally informed initialization yields more reliable and compact pose graphs, improving reconstruction accuracy in sparse and high-speed settings and outperforming SOTA retrieval methods on ambiguous scenes. Code and models will be released.
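
As a rough illustration of building a sparse pose graph from ranked candidate edges with several spanning trees, here is a minimal pure-Python sketch. It assumes one plausible reading of component (2): repeatedly extract a maximum-score spanning forest with Kruskal's algorithm and remove its edges before the next pass. The function names and the `num_trees` parameter are hypothetical, and the scores stand in for the GNN-predicted edge reliabilities; the authors' actual construction may differ.

```python
def spanning_forest(num_nodes, edges):
    """Kruskal over (score, u, v) edges, highest score first; returns chosen edges."""
    parent = list(range(num_nodes))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    chosen = []
    for score, u, v in sorted(edges, reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:  # the edge joins two components, so it creates no cycle
            parent[ru] = rv
            chosen.append((score, u, v))
    return chosen

def multi_spanning_tree_graph(num_nodes, edges, num_trees=2):
    """Union of up to num_trees edge-disjoint spanning forests."""
    remaining = list(edges)
    selected = []
    for _ in range(num_trees):
        tree = spanning_forest(num_nodes, remaining)
        if not tree:
            break
        selected.extend(tree)
        used = set(tree)
        remaining = [e for e in remaining if e not in used]
    return selected
```

With `num_trees=1` this reduces to a single spanning tree, the sparsest connected graph; each extra pass adds redundancy for geometric verification at the cost of more edges to verify.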

Oral

Confusion-Aware Spectral Regularizer for Long-Tailed Recognition

Ziquan Zhu ⋅ Gaojie Jin ⋅ Hanruo Zhu ⋅ Si-Yuan Lu ⋅ Yunxiao Zhang ⋅ ZEYU FU ⋅ Ronghui Mu ⋅ Guoqiang Zhang ⋅ Zhao Sun ⋅ Yuhang Xia ⋅ Jiaxing Shang ⋅ Xiang Li ⋅ Lu Liu ⋅ Tianjin Huang

Jun 7, 9:12 AM - 9:25 AM Four Seasons Ballroom

Long-tailed image classification remains a long-standing challenge, as real-world data typically follow highly imbalanced distributions where a few head classes dominate and many tail classes contain only limited samples. This imbalance biases feature learning toward head categories and leads to significant degradation on rare classes. Although recent studies have proposed re-sampling, re-weighting, and decoupled learning strategies, the improvement on the most underrepresented classes still remains marginal compared with overall accuracy. In this work, we present a confusion-centric perspective for long-tailed recognition that explicitly focuses on worst-class generalization. We first establish a new theoretical framework of class-specific error analysis, which shows that the worst-class error can be tightly upper-bounded by the spectral norm of the frequency-weighted confusion matrix and a model-dependent complexity term. Guided by this insight, we propose the Confusion-Aware Spectral Regularizer (CAR) that minimizes the spectral norm of the confusion matrix during training to reduce inter-class confusion and enhance tail-class generalization. To enable stable and efficient optimization, CAR integrates a Differentiable Confusion Matrix Surrogate and an EMA-based Confusion Estimator to maintain smooth and low-variance estimates across mini-batches. Extensive experiments across multiple long-tailed benchmarks demonstrate that CAR substantially improves both worst-class accuracy and overall performance. When combined with ConCutMix augmentation, CAR consistently surpasses existing state-of-the-art long-tailed learning methods under both the training-from-scratch setting (by $2.37\% \sim 4.83\%$) and the fine-tuning-from-pretrained setting (by $2.42\% \sim 4.17\%$) across ImageNet-LT, CIFAR100-LT, and iNaturalist datasets.
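
To make the quantity being regularized more concrete, the sketch below computes a soft confusion matrix from predicted probabilities, its spectral norm, and an exponential moving average across batches. This is an illustrative NumPy version under stated assumptions, not the authors' implementation: the row-wise frequency weighting, the removal of the diagonal, and all function names are hypothetical choices made here.

```python
import numpy as np

def soft_confusion(probs, labels, num_classes):
    """Row i is the mean predicted distribution over samples whose true class is i."""
    C = np.zeros((num_classes, num_classes))
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            C[c] = probs[mask].mean(axis=0)
    return C

def confusion_spectral_norm(C, class_freq):
    """Largest singular value of the frequency-weighted off-diagonal confusion.

    Assumption: rows are scaled by class frequency and correct predictions
    (the diagonal) are excluded, so only inter-class confusion is measured.
    """
    off_diag = C - np.diag(np.diag(C))
    weighted = class_freq[:, None] * off_diag
    return np.linalg.norm(weighted, 2)

def ema_update(C_ema, C_batch, momentum=0.9):
    """Smooth the per-batch estimate to reduce variance across mini-batches."""
    return momentum * C_ema + (1.0 - momentum) * C_batch
```

In a training loop the same computation would be written with a differentiable framework so that the norm can be added to the loss; NumPy is used here only to show the arithmetic.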

Oral

Learning Diffeomorphism for Medical Image Registration with Time-Embedded Architectures Using Semigroup Regularization

Mohammadjavad Matinkia ⋅ Nilanjan Ray

Jun 7, 9:12 AM - 9:25 AM Mile High Ballroom 1A - 2A

Diffeomorphic image registration (DIR) seeks topology-preserving transformations and is fundamental in medical imaging. Existing DIR methods rely on integration schemes (e.g., scaling-and-squaring) and multiple regularizers to enforce invertibility. We introduce **SGDIR**, a continuous-time registration framework, parameterized by known time-embedded backbones, that models diffeomorphisms using only a single semigroup-based regularization, eliminating explicit integration and auxiliary constraints. We mathematically prove that this formulation directly learns the flow of an underlying ODE, inherently enforcing inverse and cycle consistencies. We evaluate on eight 2D and 3D MR and CT datasets. Under strict semigroup enforcement, our model achieves near-perfect diffeomorphism (near-zero folding) and significantly outperforms existing diffeomorphic methods, while remaining competitive with leading non-diffeomorphic deformable models. When the regularization is relaxed, the same architecture functions as a deformable method and substantially surpasses state-of-the-art non-diffeomorphic approaches in registration accuracy. These results demonstrate that continuous-time deformation modeling, guided solely by our semigroup-based regularization, yields a unified framework capable of both rigorously diffeomorphic mapping and state-of-the-art deformable registration.
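
The semigroup property behind this kind of regularization states that a flow satisfies φ(φ(x, t), s) = φ(x, s + t). The toy sketch below, in plain Python, measures the violation of that identity for two hand-written 2D maps; it is not the SGDIR loss itself. The names `rotation_flow`, `linear_blend`, and `semigroup_penalty` are illustrative, and in practice φ would be a time-embedded network evaluated on image coordinates.

```python
import math

def rotation_flow(x, t):
    """Exact flow of dx/dt = (-x2, x1): rotation of the point x by angle t."""
    c, s = math.cos(t), math.sin(t)
    return (c * x[0] - s * x[1], s * x[0] + c * x[1])

def linear_blend(x, t):
    """Not a flow: straight-line interpolation toward the rotated point at t = 1."""
    y = rotation_flow(x, 1.0)
    return ((1 - t) * x[0] + t * y[0], (1 - t) * x[1] + t * y[1])

def semigroup_penalty(phi, points, s, t):
    """Mean squared gap between phi(x, s + t) and phi(phi(x, t), s)."""
    total = 0.0
    for x in points:
        a = phi(x, s + t)
        b = phi(phi(x, t), s)
        total += (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    return total / len(points)
```

The penalty vanishes (up to rounding) for `rotation_flow`, which is a genuine ODE flow, and is clearly positive for `linear_blend`, which interpolates displacements without respecting composition.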

View full details

Oral

QuadSync: Quadrifocal Tensor Synchronization via Tucker Decomposition

Daniel Miao ⋅ Gilad Lerman ⋅ Joe Kileel

Jun 7, 9:25 AM - 9:37 AM Mile High Ballroom 1A - 2A

In structure from motion, quadrifocal tensors capture more information than their pairwise counterparts (essential matrices), yet they have often been thought of as impractical and only of theoretical interest. In this work, we challenge such beliefs by providing a new framework to recover $n$ cameras from the corresponding collection of quadrifocal tensors. We form the block quadrifocal tensor and show that it admits a Tucker decomposition whose factor matrices are the stacked camera matrices, and which thus has a multilinear rank of (4,4,4,4) independent of $n$. We develop the first synchronization algorithm for quadrifocal tensors, using Tucker decomposition, alternating direction method of multipliers, and iteratively reweighted least squares. We further establish relationships between the block quadrifocal, trifocal, and bifocal tensors, and introduce an algorithm that jointly synchronizes these three entities. Numerical experiments demonstrate the effectiveness of our methods on modern datasets, indicating the potential and importance of using higher-order information in synchronization.

View full details

Oral

OLATverse: A Large-scale Real-world Object Dataset with Precise Lighting Control

Xilong Zhou ⋅ Jianchun Chen ⋅ Pramod Rao ⋅ Timo Teufel ⋅ Linjie Lyu ⋅ Tigran Minasian ⋅ Oleksandr Sotnychenko ⋅ Xiaoxiao Long ⋅ Marc Habermann ⋅ Christian Theobalt

Jun 7, 9:25 AM - 9:37 AM Mile High Ballroom 3A - 4A

We introduce OLATverse, a large-scale dataset comprising around 9M images of 765 real-world objects, captured from multiple viewpoints under a diverse set of precisely controlled lighting conditions. While recent advances in object-centric inverse rendering, novel view synthesis, and relighting have shown promising results, most techniques still rely heavily on synthetic datasets for training and small-scale real-world datasets for benchmarking, which limits their realism and generalization. To address this gap, OLATverse offers two key advantages over existing datasets: large-scale coverage of real objects and high-fidelity appearance under precisely controlled illuminations. Specifically, OLATverse contains 765 common and uncommon real-world objects, spanning a wide range of material categories. Each object is captured using 35 DSLR cameras and 331 individually controlled light sources, enabling the simulation of diverse illumination conditions. In addition, for each object, we provide well-calibrated camera parameters, accurate object masks, photometric surface normals, and diffuse albedo as auxiliary resources. We also construct an extensive evaluation set, establishing the first comprehensive real-world object-centric benchmark for inverse rendering and normal estimation. We believe that OLATverse represents a pivotal step toward integrating the next generation of inverse rendering and relighting methods with real-world data.

View full details

Oral

Learning Latent Concepts for Detecting Out-of-Distribution Objects

Ting Peng ⋅ Junhao Dong ⋅ Yew-Soon Ong

Jun 7, 9:25 AM - 9:37 AM Four Seasons Ballroom

Detecting out-of-distribution (OOD) objects is indispensable for safely deploying object detectors in the wild. Current approaches enable the unknown-aware ability by regularizing the instance-level feature space, such as outlier synthesis. Despite the general efficacy, it is challenging to truly learn the concept of 'unknown' in the absence of real unknown data. In this paper, we propose UNO-Adapter, a simple yet highly effective framework tailored for OOD object detection. Our key insight is that in object detection, where in-distribution (ID) and OOD objects may coexist within the same context, we need global abstraction and reasoning to help the detector learn their differences, i.e., unknown injection. UNO-Adapter consists of two key steps: unsupervised concept discovery and neural concept binder. The former introduces an object-centric learning paradigm to abstract and model the holistic image, including both ID and OOD, obtaining sparse and compressed slot-based representations with relational constraints. The latter dynamically combines slots with object candidates extracted by the detector, binding the concept of unknown to the de facto detector. During inference, we introduce an image-guided OOD object score to reinforce the distinction between ID and OOD. Experiments on standard benchmarks demonstrate the superiority of the proposed method. In particular, UNO-Adapter reduces the FPR95 by up to 11.96% compared to the previous best OOD object detection method.

View full details

Oral

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

Christopher Clark ⋅ Jieyu Zhang ⋅ Zixian Ma ⋅ Jae Sung Park ⋅ Rohun Tripathi ⋅ Sangho Lee ⋅ Reza Salehi ⋅ Jason Ren ⋅ Chris Dongjoo Kim ⋅ Yinuo Yang ⋅ Vincent Shao ⋅ Yue Yang ⋅ Weikai Huang ⋅ Ziqi Gao ⋅ Taira Anderson ⋅ Jianrui Zhang ⋅ Jitesh Jain ⋅ George Stoica ⋅ Ali Farhadi ⋅ Ranjay Krishna

Jun 7, 9:25 AM - 9:37 AM Bluebird Ballroom

Today’s strongest video-language models (VLMs) remain proprietary. The strongest open-weight models either rely on synthetic data from proprietary VLMs, effectively distilling from them, or do not disclose their training data or recipe. As a result, the open-source community lacks the foundations needed to improve on the state-of-the-art video (and image) language models. Crucially, many downstream applications require more than just high-level video understanding; they require grounding, either by pointing or by tracking in pixels. Even proprietary models lack this capability. We present Molmo2, a new family of VLMs that are state-of-the-art amongst open-source models and demonstrate exceptional new capabilities in point-driven grounding in single image, multi-image, and video tasks. Our key contribution is a collection of 7 new video datasets and 2 multi-image datasets, including a dataset of highly detailed video captions for pre-training, a free-form video Q&A dataset for fine-tuning, a new object tracking dataset with complex queries, and an innovative new video pointing dataset, all collected without the use of closed VLMs. We also present a training recipe for this data utilizing an efficient packing and message-tree encoding scheme and show that bi-directional attention on vision tokens and a novel token-weight strategy improve performance. Our best-in-class 8B model outperforms others in the class of open weight and data models on short videos, counting, and captioning, and is competitive on long videos. On video grounding, Molmo2 outperforms larger proprietary models, including 32.9% (Molmo2) vs 17% (Gemini 2.5 Pro) on video pointing.

View full details

Oral

OpenDance: Multimodal Controllable 3D Dance Generation with Large-scale Internet Data

Jinlu Zhang ⋅ Zixi Kang ⋅ Libin Liu ⋅ Jianlong Chang ⋅ Qi Tian ⋅ Feng Gao ⋅ Yizhou Wang

Jun 7, 9:37 AM - 9:50 AM Mile High Ballroom 3A - 4A

Music-driven 3D dance generation offers significant creative potential, yet practical applications demand versatile and multimodal control. Because dance is highly dynamic, complex human motion spanning various styles and genres, dance generation must satisfy diverse conditions beyond music alone (e.g., spatial trajectories, keyframe gestures, or style descriptions). However, the absence of a large-scale and richly annotated dataset severely hinders progress. In this paper, we build OpenDanceSet, an extensive human dance dataset comprising over 100 hours across 14 genres and 147 subjects. Each sample has rich annotations to facilitate robust cross-modal learning: 3D motion, paired music, 2D keypoints, trajectories, and expert-annotated text descriptions. Furthermore, we propose OpenDanceNet, a unified masked modeling framework for controllable dance generation, including a disentangled auto-encoder and a multimodal joint-prediction Transformer. OpenDanceNet supports generation conditioned on music and arbitrary combinations of text, keypoints, or trajectories. Comprehensive experiments demonstrate that our work achieves high-fidelity synthesis with strong diversity and realistic physical contacts, while also offering flexible control over spatial and stylistic conditions.

View full details

Oral

Optical Flow Matching: Reframing Optical Flow as Continuous Transport Dynamics

Ao Luo ⋅ XIN LI ⋅ Fan Yang ⋅ Yuezun Li ⋅ Zhaoquan Yuan ⋅ SHAN ZHAO ⋅ Bing Su ⋅ Xiao WU

Jun 7, 9:37 AM - 9:50 AM Bluebird Ballroom

Modern optical flow estimation, though empowered by deep neural architectures, remains rooted in the discrete correspondence paradigm inherited from classical vision. Most networks infer frame-to-frame displacements or correlation volumes, capturing where pixels move but not how motion evolves continuously through time. Yet physical motion in the real world follows smooth dynamics governed by underlying velocity fields, as long established in fluid mechanics and transport theory. To bridge this gap, we introduce Optical Flow Matching (OFM), a continuous formulation that learns a time-dependent velocity field to transport pixel coordinates along motion distribution coherent trajectories. A key component of our OFM is Triangle Velocities Synergy (TVS), a lightweight geometric mechanism that provides a stable and physically meaningful velocity construction, ensuring that continuous transport remains well-defined. Combined with an Euler-based ODE solver, OFM yields flow fields that are temporally smooth, geometrically consistent, and process-interpretable. Experiments on Sintel, KITTI, and Spring demonstrate that OFM achieves state-of-the-art accuracy, enhanced temporal stability, and notably stronger cross-dataset generalization, advancing optical flow estimation from correspondence inference to continuous dynamical reasoning. All code and trained models will be released upon acceptance to facilitate further research.
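
To make the idea of transporting pixel coordinates along a velocity field concrete, here is a minimal sketch of forward-Euler integration from $t=0$ to $t=1$. The hand-written rotational `velocity` function is an assumption standing in for a learned time-dependent field, and the Triangle Velocities Synergy mechanism is not modeled.

```python
import math

def velocity(x, y, t):
    """Hypothetical velocity field: rigid rotation about the origin, 90 degrees per unit time."""
    omega = math.pi / 2
    return -omega * y, omega * x

def euler_transport(x, y, steps=1000):
    """Integrate d(x, y)/dt = velocity(x, y, t) from t=0 to t=1 with forward Euler."""
    dt = 1.0 / steps
    for k in range(steps):
        vx, vy = velocity(x, y, k * dt)
        x, y = x + dt * vx, y + dt * vy
    return x, y

x0, y0 = 1.0, 0.0
x1, y1 = euler_transport(x0, y0)
flow_vector = (x1 - x0, y1 - y0)  # displacement between the two frames
print(flow_vector)
```

The end-to-end displacement plays the role of the classical flow vector, while the intermediate positions describe how the point moved in between. Forward Euler accumulates a small error, so the endpoint only approximates the exact rotation.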

View full details

Oral

Learning Like Humans: Analogical Concept Learning for Generalized Category Discovery

Jizhou Han ⋅ Chenhao Ding ⋅ Yuhang He ⋅ Qiang Wang ⋅ Shaokun Wang ⋅ SongLin Dong ⋅ Yihong Gong

Jun 7, 9:37 AM - 9:50 AM Four Seasons Ballroom

Generalized Category Discovery (GCD) seeks to uncover novel categories in unlabeled data while preserving recognition of known categories, yet prevailing visual-only pipelines and the loose coupling between supervised learning and discovery often yield brittle boundaries on fine-grained, look-alike categories. We introduce the Analogical Textual Concept Generator (ATCG), a plug-and-play module that analogizes from labeled knowledge to new observations, forming textual concepts for unlabeled samples. Fusing these analogical textual concepts with visual features turns discovery into a visual–textual reasoning process, transferring prior knowledge to novel data and sharpening category separation. ATCG attaches to both parametric and clustering style GCD pipelines and requires no changes to their overall design. Across six benchmarks, ATCG consistently improves overall, known-class, and novel-class performance, with the largest gains on fine-grained data.

View full details

Oral

SocialNav: Training Human-Inspired Foundation Model for Socially-Aware Embodied Navigation

Ziyi Chen ⋅ Yingnan Guo ⋅ Zedong Chu ⋅ Minghua Luo ⋅ Yanfen Shen ⋅ Mingchao Sun ⋅ Junjun Hu ⋅ Shichao Xie ⋅ Yang Kuan ⋅ Pei Shi ⋅ Zhining Gu ⋅ Lu Liu ⋅ Honglin Han ⋅ Xiaolong Wu ⋅ Mu Xu ⋅ Yu Zhang

Jun 7, 9:37 AM - 9:50 AM Mile High Ballroom 1A - 2A

Embodied navigation that adheres to social norms remains an open research challenge. Our SocialNav is a foundational model for socially-aware navigation with a hierarchical "brain-action" architecture, capable of understanding high-level social norms and generating low-level, socially compliant trajectories. To enable such dual capabilities, we construct the SocNav Dataset, a large-scale collection of 7 million samples, comprising (1) a Cognitive Activation Dataset providing social reasoning signals such as chain-of-thought explanations and social traversability prediction, and (2) an Expert Trajectories Pyramid aggregating diverse navigation demonstrations from internet videos, simulated environments, and real-world robots. A multi-stage training pipeline is proposed to gradually inject and refine navigation intelligence: we first inject general navigation skills and social norms understanding into the model via imitation learning, and then refine such skills through a deliberately designed Socially-Aware FlowExploration GRPO (SAFE-GRPO), the first flow-based reinforcement learning framework for embodied navigation that explicitly rewards socially compliant behaviors. SocialNav achieves +38% success rate and +46% social compliance rate compared to the state-of-the-art method, demonstrating strong gains in both navigation performance and social compliance. Data and code will be made publicly available.

View full details

Oral

POLAR: A Portrait OLAT Dataset and Generative Framework for Illumination-Aware Face Modeling

Zhuo Chen ⋅ Chengqun Yang ⋅ Zhuo Su ⋅ Zheng Lv ⋅ Jingnan Gao ⋅ Xiaoyuan Zhang ⋅ Xiaokang Yang ⋅ Yichao Yan

Jun 7, 9:50 AM - 10:02 AM Mile High Ballroom 3A - 4A

Face relighting aims to synthesize realistic portraits under novel illumination while preserving identity and geometry. However, progress remains constrained by the limited availability of large-scale, physically consistent illumination data. To address this, we introduce POLAR, a large-scale and physically calibrated One-Light-at-a-Time (OLAT) dataset containing over 200 subjects captured under 156 lighting directions, multiple views, and diverse expressions. Building upon POLAR, we develop a flow-based generative model POLARNet that predicts per-light OLAT responses from a single portrait, capturing fine-grained and direction-aware illumination effects while preserving facial identity. Unlike diffusion or background-conditioned methods that rely on statistical or contextual cues, our formulation models illumination as a continuous, physically interpretable transformation between lighting states, enabling scalable and controllable relighting. Together, POLAR and POLARNet form a unified illumination learning framework that links real data, generative synthesis, and physically grounded relighting, establishing a self-sustaining “chicken-and-egg” cycle for scalable and reproducible portrait illumination.

View full details

Oral

Structural Action Transformer for 3D Dexterous Manipulation

Xiaohan Lei ⋅ Min Wang ⋅ Bohong Weng ⋅ Wengang Zhou ⋅ Houqiang Li

Jun 7, 9:50 AM - 10:02 AM Mile High Ballroom 1A - 2A

Achieving human-level dexterity in robots via imitation learning from heterogeneous datasets is hindered by the challenge of cross-embodiment skill transfer, particularly for high-DoF robotic hands. Existing methods, often relying on 2D observations and temporal-centric action representation, struggle to capture 3D spatial relations and fail to handle embodiment heterogeneity. This paper proposes the Structural Action Transformer (SAT), a new 3D dexterous manipulation policy that challenges this paradigm by introducing a structural-centric perspective. We reframe each action chunk not as a temporal sequence, but as a variable-length, unordered sequence of joint-wise trajectories. This structural formulation allows a Transformer to natively handle heterogeneous embodiments, treating the joint count as a variable sequence length. To encode structural priors and resolve ambiguity, we introduce an Embodied Joint Codebook that embeds each joint's functional role and kinematic properties. Our model learns to generate these trajectories from 3D point clouds via a continuous-time flow matching objective. We validate our approach by pre-training on large-scale heterogeneous datasets and fine-tuning on simulation and real-world dexterous manipulation tasks. Our method consistently outperforms all baselines, demonstrating superior sample efficiency and effective cross-embodiment skill transfer. This structural-centric representation offers a new path toward scaling policies for high-DoF, heterogeneous manipulators.

View full details

Oral

SEATrack: Simple, Efficient, and Adaptive Multimodal Tracker

Junbin Su ⋅ Ziteng Xue ⋅ Shihui Zhang ⋅ Kun Chen ⋅ Weiming Hu ⋅ Zhipeng Zhang

Jun 7, 9:50 AM - 10:02 AM Bluebird Ballroom

Parameter-efficient fine-tuning (PEFT) in multimodal tracking reveals a concerning trend where recent performance gains are often achieved at the cost of inflated parameter budgets, which fundamentally erodes PEFT's efficiency promise. In this work, we introduce SEATrack, a Simple, Efficient, and Adaptive two-stream multimodal tracker that tackles this performance-efficiency dilemma from two complementary perspectives. We first prioritize cross-modal alignment of matching responses, an underexplored yet pivotal factor that we argue is essential for breaking the trade-off. Specifically, we observe that modality-specific biases in existing two-stream methods generate conflicting matching attention maps, thereby hindering effective joint representation learning. To mitigate this, we propose AMG-LoRA, which seamlessly integrates Low-Rank Adaptation (LoRA) for domain adaptation with Adaptive Mutual Guidance (AMG) to dynamically refine and align attention maps across modalities. We then depart from conventional local fusion approaches by introducing a Hierarchical Mixture of Experts (HMoE) that enables efficient global relation modeling, effectively balancing expressiveness and computational efficiency in cross-modal fusion. Experiments show that AMG-LoRA alone establishes a remarkably simple yet strong baseline, outperforming SDSTrack on LasHeR by 3.3% in PR and 1.9% in SR with only 0.4% of its parameters (0.14M vs. 14.8M), while significantly boosting cross-modal fusion with negligible additional latency. Equipped with these innovations, SEATrack makes notable progress over state-of-the-art methods in balancing performance with efficiency across RGB-T, RGB-D, and RGB-E tracking tasks. Code will be released.

View full details

Oral

Understanding and Enforcing Weight Disentanglement in Task Arithmetic

Shangge Liu ⋅ Yuehan Yin ⋅ Lei Wang ⋅ Qi Fan ⋅ Yinghuan Shi ⋅ Wenbin Li ⋅ Yang Gao ⋅ Dacheng Tao

Jun 7, 9:50 AM - 10:02 AM Four Seasons Ballroom

Task arithmetic provides an efficient, training-free way to edit pre-trained models, yet lacks a fundamental theoretical explanation for its success. The existing concept of "weight disentanglement" describes the ideal outcome of non-interfering task composition but does not reveal its underlying cause. Crucially, what intrinsic properties of the pre-trained model ($\theta_0$) or the task vectors ($\tau_t$) enable this disentanglement remains underexplored. In this paper, we introduce Task-Feature Specialization (TFS), a model's ability to allocate distinct internal features to different tasks, as the fundamental principle. We first prove that TFS is a sufficient condition for weight disentanglement. More importantly, we find that TFS also gives rise to an observable geometric consequence: weight vector orthogonality. This positions TFS as the common cause for both the desired functional outcome (disentanglement) and a measurable geometric property (orthogonality). This relationship provides the key insight for our method: since the abstract TFS property is intractable to enforce directly, we can instead promote weight disentanglement by shaping its concrete geometric consequence, orthogonality. Therefore, we propose OrthoReg, a simple and effective regularization method that actively enforces an internal orthogonal structure on weight updates ($\Delta W$) that constitute $\tau_t$ during fine-tuning. We also theoretically prove that OrthoReg promotes disentanglement. Extensive experiments demonstrate that OrthoReg consistently and significantly enhances the performance of various task arithmetic methods.
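
For readers new to the setting, the sketch below shows task arithmetic on flat weight lists: a task vector is $\tau_t = \theta_t - \theta_0$, and a merged model adds scaled task vectors back onto $\theta_0$. The cosine similarity between task vectors is one simple way to inspect orthogonality. All weight values and function names are made up for illustration, and this is not OrthoReg, which acts on the weight updates during fine-tuning.

```python
import math

def task_vector(theta_t, theta_0):
    """tau_t = theta_t - theta_0, computed element-wise."""
    return [a - b for a, b in zip(theta_t, theta_0)]

def merge(theta_0, task_vectors, alpha=1.0):
    """Task arithmetic: theta_0 + alpha * sum of task vectors."""
    merged = list(theta_0)
    for tau in task_vectors:
        merged = [w + alpha * d for w, d in zip(merged, tau)]
    return merged

def cosine(u, v):
    """Cosine similarity; 0 means the two vectors are orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

theta_0 = [0.5, -0.5, 1.0, 0.0]   # hypothetical pre-trained weights
theta_a = [1.5, 1.5, 1.0, 0.0]    # fine-tuned on task A (changes the first two weights)
theta_b = [0.5, -0.5, 4.0, -1.0]  # fine-tuned on task B (changes the last two weights)

tau_a = task_vector(theta_a, theta_0)
tau_b = task_vector(theta_b, theta_0)
merged = merge(theta_0, [tau_a, tau_b])
print(cosine(tau_a, tau_b), merged)
```

Because the two toy task vectors touch disjoint coordinates, they are orthogonal and the merged weights reproduce each task's change without interference.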

View full details

Oral

Understanding Task Transfer in Vision-Language Models

Bhuvan Sachdeva ⋅ Karan Uppal ⋅ Abhinav Java ⋅ Vineeth Balasubramanian

Jun 7, 10:02 AM - 10:15 AM Four Seasons Ballroom

Vision–Language Models (VLMs) perform well on multimodal benchmarks but lag behind humans and specialized models on visual perception tasks like depth estimation or object counting. Finetuning on one task can unpredictably affect performance on others, making task-specific finetuning challenging. In this paper, we address this challenge through a systematic study of task transferability. We examine how finetuning a VLM on one perception task affects its zero-shot performance on others. To quantify these effects, we introduce Perfection Gap Factor (PGF), a metric that captures both the breadth and magnitude of transfer. Using three open-weight VLMs evaluated across 13 perception tasks, we construct a task-transfer graph that reveals previously unobserved relationships among perception tasks. Our analysis uncovers patterns of positive and negative transfer, identifies groups of tasks that mutually influence each other, organizes tasks into personas based on their transfer behavior, and demonstrates how PGF can guide data selection for more efficient training. These findings highlight both opportunities for positive transfer and risks of negative interference, offering actionable guidance for advancing VLMs.

View full details

Oral

U^2Flow: Uncertainty-Aware Unsupervised Optical Flow Estimation

Xunpei Sun ⋅ Wenwei Lin ⋅ Yi Chang ⋅ Gang Chen

Jun 7, 10:02 AM - 10:15 AM Bluebird Ballroom

Existing unsupervised optical flow methods typically lack reliable uncertainty estimation, limiting their robustness and interpretability. We propose U$^{2}$Flow, the first recurrent unsupervised framework that jointly estimates optical flow and per-pixel uncertainty. The core innovation is a decoupled learning strategy that derives uncertainty supervision from augmentation consistency via a Laplace-based maximum likelihood objective, enabling stable training without ground truth. The predicted uncertainty is further integrated into the network to guide adaptive flow refinement and dynamically modulate the regional smoothness loss. Furthermore, we introduce an uncertainty-guided bidirectional flow fusion mechanism that enhances robustness in challenging regions. Extensive experiments on KITTI and Sintel demonstrate that U$^{2}$Flow achieves state-of-the-art performance among unsupervised methods while producing highly reliable uncertainty maps, validating the effectiveness of our joint estimation paradigm.
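
As background for the Laplace-based maximum likelihood objective mentioned above, the negative log-likelihood of a residual $r$ under a zero-mean Laplace distribution with scale $b$ is $|r|/b + \log(2b)$. The sketch below evaluates it for a single pixel. Treating $r$ as the discrepancy between flow predictions on augmented and original inputs is an assumption based on the abstract, and the function names are hypothetical.

```python
import math

def laplace_nll(residual, scale):
    """Negative log-likelihood of `residual` under a zero-mean Laplace with scale b > 0."""
    return abs(residual) / scale + math.log(2.0 * scale)

def best_scale(residual):
    """For a fixed residual, the NLL is minimized at b = |residual|."""
    return abs(residual)

r = 2.0
for b in (1.0, best_scale(r), 4.0):
    print(b, laplace_nll(r, b))
```

The loss penalizes both overconfidence (small $b$ with a large residual) and blanket underconfidence (large $b$ everywhere), so the predicted scale can serve as a per-pixel uncertainty.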

View full details

Oral

Relightable Holoported Characters: Capturing and Relighting Dynamic Human Performance from Sparse Views

Kunwar Maheep Singh ⋅ Jianchun Chen ⋅ Vladislav Golyanik ⋅ Stephan Garbin ⋅ Thabo Beeler ⋅ Rishabh Dabral ⋅ Marc Habermann ⋅ Christian Theobalt

Jun 7, 10:02 AM - 10:15 AM Mile High Ballroom 3A - 4A

We present _Relightable Holoported Characters_ (RHC), a novel person-specific method for free-view rendering and relighting of full-body and highly dynamic humans solely observed from sparse-view RGB videos at inference. In contrast to classical one-light-at-a-time (OLAT)-based human relighting, our transformer-based RelightNet predicts relit appearance within a single network pass, avoiding costly OLAT-basis capture and generation. For training such a model, we introduce a new capture strategy and dataset recorded in a multi-view lightstage, where we alternate frames lit by random environment maps with uniformly lit tracking frames, simultaneously enabling accurate motion tracking and diverse illumination as well as dynamics coverage. Inspired by the rendering equation, we derive physics-informed features that encode geometry, albedo, shading, and the virtual camera view from a coarse human mesh proxy and the input views. Our RelightNet then takes these features as input and cross-attends them with a novel lighting condition, and regresses the relit appearance in the form of texel-aligned 3D Gaussian splats attached to the coarse mesh proxy. Consequently, our RelightNet implicitly learns to efficiently compute the rendering equation for novel lighting conditions within a single feed-forward pass. Experiments demonstrate our method’s superior visual fidelity and lighting reproduction compared to state-of-the-art approaches.

View full details

Oral

TESO: Online Tracking of Essential Matrix by Stochastic Optimization

Jaroslav Moravec ⋅ Radim Sara ⋅ Akihiro Sugimoto

Jun 7, 10:02 AM - 10:15 AM Mile High Ballroom 1A - 2A

Reliable perception of autonomous systems relies on fusion of data from multiple sensors, which requires maintaining accurate geometric calibration during operation. This work aims to track the drift of the calibration parameters caused by mechanical stress, thermal effects, or minor accidents. We focus on five parameters of the essential matrix and propose TESO, whose core mechanisms are: 1) a robust loss function based on kernel correlation over tentative correspondences instead of robust matching and estimators, 2) an adaptive online stochastic optimization on the essential manifold. Both contribute to reduced CPU and memory requirements. TESO relies on a few hyperparameters and eliminates the need for data-driven training, enabling use in resource-constrained online perception systems. We evaluated TESO based on the geometric precision of the tracked extrinsic parameters, the rectification quality, and the stereo depth consistency with respect to a 3D LiDAR. In the large-scale MAN TruckScenes dataset, TESO tracks drift with 0.12° precision in the rotation around Y, which is critical for stereo accuracy, while the other two rotation angles are tracked with five times better precision. Sequences with simulated drift are tracked with similar precision as the no-drift ones, suggesting that the tracker is unbiased. Applied to the KITTI dataset, TESO reported systematic inconsistencies in extrinsic parameters across all stereo pairs, confirming observations made by other authors. We verify that these errors were partly caused by intrinsic decalibration, which manifested in the contradictory performance of two metrics: the epipolar error and the depth estimation accuracy. With corrected calibration parameters, TESO improved its rotation precision around the hardest Y-axis by approximately twentyfold, reaching 0.025°. In the depth estimation, there was a fiftyfold improvement.
Despite its lightweight nature, we show that the combination of SIFT features and the proposed TESO loss function achieves accuracy comparable to published single-frame methods that rely on neural network models.

View full details

Oral

Differentiable Laplacian Matrix Guided Superpixel Segmentation

Jeremy Juybari ⋅ Joshua Hamilton ⋅ Shuvra Das ⋅ Chaofan Chen ⋅ Andre Khalil ⋅ Yifeng Zhu

Jun 7, 2:00 PM - 2:15 PM Bluebird Ballroom

Superpixels partition an image into perceptually coherent regions, reducing the cost of downstream vision tasks. Modern deep learning methods excel at superpixel generation but often yield irregular boundaries and isolated pixels, necessitating non-differentiable post-processing to enforce connectivity. This undermines the end-to-end learning capabilities. We propose a simple, fully differentiable graph-Laplacian loss that encourages spatial regularity and connectivity during training. The loss is model-agnostic and can be seamlessly integrated into the training of existing architectures to improve the quality of superpixels. In addition, we introduce two novel metrics, the average stray pixel count and excess component count, to measure the quality of superpixels. We demonstrate both qualitative and quantitative improvements over state-of-the-art methods with and without enforced connectivity. Our approach represents a significant step toward eliminating non-differentiable post-processing.
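
Graph-Laplacian regularizers generally build on the identity $x^\top L x = \sum_{(i,j)} w_{ij}(x_i - x_j)^2$, which is small when neighboring nodes carry similar values. The sketch below applies that quadratic form to soft superpixel assignments on a 4-connected pixel grid with unit edge weights. This is an assumed, simplified form for illustration; the loss proposed in the paper may be constructed differently.

```python
def laplacian_energy(assign):
    """Sum over 4-connected neighbor pairs of squared differences in soft assignments.

    `assign[r][c]` is a list of per-superpixel membership probabilities for pixel (r, c).
    With unit edge weights this equals trace(S^T L S) for the grid-graph Laplacian L.
    """
    h, w = len(assign), len(assign[0])
    energy = 0.0
    for r in range(h):
        for c in range(w):
            for dr, dc in ((0, 1), (1, 0)):  # right and down neighbors: each edge counted once
                rr, cc = r + dr, c + dc
                if rr < h and cc < w:
                    energy += sum((a - b) ** 2 for a, b in zip(assign[r][c], assign[rr][cc]))
    return energy

A, B = [1.0, 0.0], [0.0, 1.0]          # hard memberships in superpixel 0 or 1
clean = [[A, A, B, B], [A, A, B, B]]   # two compact regions
stray = [[A, B, A, B], [A, A, B, B]]   # isolated pixels break up the regions

print(laplacian_energy(clean), laplacian_energy(stray))
```

The fragmented labeling has the higher energy, so minimizing a term of this kind discourages stray pixels. Since it is a sum of squares of the assignments, it remains differentiable when the memberships are soft network outputs.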

View full details

Oral

CURE: Curriculum-guided Multi-task Training for Reliable Anatomy Grounded Report Generation

Pablo Messina ⋅ Andrés Villa ⋅ Juan León Alcázar ⋅ Karen Sanchez ⋅ Carlos Hinojosa ⋅ Denis Parra ⋅ Alvaro Soto ⋅ Bernard Ghanem

Jun 7, 2:00 PM - 2:12 PM Mile High Ballroom 1A - 2A

Medical vision–language models can automate the generation of radiology reports but struggle with accurate visual grounding and factual consistency. Existing models often misalign textual findings with visual evidence, leading to unreliable or weakly grounded predictions. We present CURE, an error-aware curriculum learning framework that improves grounding and report quality without any additional data. CURE tunes a multimodal instructional model on phrase grounding, grounded report generation, and anatomy-grounded report generation using public datasets. The method dynamically adjusts sampling based on model performance, emphasizing harder samples to improve spatial and textual alignment. CURE improves grounding accuracy by +0.37 IoU, boosts report quality by +0.188 CXRFEScore, and reduces hallucinations by 18.6%. CURE is a data-efficient framework that enhances both grounding accuracy and report reliability.

View full details

Oral

Efficient Unrolled Networks for Large-Scale 3D Inverse Problems

Romain Vo ⋅ Julián Tachella

Jun 7, 2:00 PM - 2:15 PM Mile High Ballroom 3A - 4A

Deep learning-based methods have revolutionized the field of imaging inverse problems, yielding state-of-the-art performance across various imaging domains. The best performing networks incorporate the imaging operator within the network architecture, typically in the form of deep unrolling. However, in large-scale problems, such as 3D imaging, most existing methods fail to incorporate the operator in the architecture due to the prohibitive amount of memory required by global forward operators, which hinder typical patching strategies. In this work, we present a domain partitioning strategy and normal operator approximations that enable the training of end-to-end reconstruction models incorporating forward operators of arbitrarily large problems into their architecture. The proposed method achieves state-of-the-art performance on 3D X-ray cone-beam tomography and 3D multi-coil accelerated MRI, while requiring only a single GPU for both training and inference.

View full details

Oral

CineBrain: A Large-Scale Multi-Modal Audiovisual Brain Dataset for Brain-Conditioned Video Generation

Jianxiong Gao ⋅ Yichang Liu ⋅ baofeng yang ⋅ Jianfeng Feng ⋅ Yanwei Fu

Jun 7, 2:00 PM - 2:15 PM Four Seasons Ballroom

Most research decoding brain signals into images, often using them as priors for generative models, has focused only on visual content. This overlooks the brain's natural ability to integrate auditory and visual information; for instance, sound strongly influences how we perceive visual scenes. To investigate this, we propose a new task of reconstructing continuous video stimuli from multimodal brain signals recorded during audiovisual stimulation. To enable this, we introduce CineBrain, the first large-scale dataset that synchronizes fMRI and EEG during audiovisual viewing, featuring six hours of The Big Bang Theory episodes for cross-modal alignment. We also conduct the first systematic exploration of combining fMRI and EEG for video reconstruction and present CineSync, a framework for reconstructing dynamic video using a Multi-Modal Fusion Encoder and a Neural Latent Decoder. CineSync achieves state-of-the-art performance in dynamic reconstruction, leveraging the complementary strengths of fMRI and EEG to improve visual fidelity. Our analysis shows that auditory cortical activations enhance decoding accuracy, highlighting the role of auditory input in visual perception.

View full details

Oral

DK-DDIL: Adaptive Knowledge Retention for Dynamic Domain-Incremental Learning in Medical Imaging

Yuxi Ma ⋅ Sujie Liu ⋅ Jing Yang ⋅ Jiacheng Wang ⋅ Yiping Chen ⋅ Baptiste Magnier ⋅ Liansheng Wang

Jun 7, 2:12 PM - 2:25 PM Mile High Ballroom 1A - 2A

Large-scale foundation models pretrained on massive datasets have demonstrated strong generalization capabilities in medical image analysis. However, they are typically trained on static datasets and struggle to cope with the continuously evolving nature of clinical data, where new imaging devices, institutions, and disease subtypes constantly emerge. While domain-incremental learning (DIL) provides a solution for sequential adaptation without revisiting historical data, existing methods typically assume fixed label spaces and limited domain heterogeneity, restricting their applicability to real-world clinical scenarios. To address these challenges, we propose DK-DDIL, a rehearsal-free framework for dynamic DIL that integrates two synergistic modules: a Dynamic Adaptation Module (DAM) employing dynamic rank selection and adaptive regularization to flexibly allocate model capacity under domain shifts, and a Knowledge Inheritance and Refinement (KIR) module that stabilizes cross-domain knowledge transfer through selective adapter fusion and prototype-level contrastive refinement. Experiments on the Skin Pathology Diagnosis dataset, the Cyst-X 3D MRI cohort, and the OfficeHome benchmark demonstrate that DK-DDIL consistently outperforms state-of-the-art DIL approaches, highlighting its effectiveness and versatility across dynamic 2D medical, 3D medical, and natural image domains.

View full details

Oral

Spectrum from Defocus: Fast Spectral Imaging with Chromatic Focal Stack

M. Kerem Aydin ⋅ Yi-Chun Hung ⋅ Jaclyn Pytlarz ⋅ Qi Guo ⋅ Emma Alexander

Jun 7, 2:15 PM - 2:30 PM Four Seasons Ballroom

Hyperspectral cameras rely on spectral filters, dispersive optics, or coded apertures, which reduce light throughput and increase hardware complexity. These systems face harsh trade-offs between spatial, spectral, and temporal resolution in inherently low-photon conditions. Computational imaging systems break through these trade-offs with compressive sensing, but have typically required complex optics and/or extensive computation. We present Spectrum from Defocus (SfD), a chromatic focal sweep method that achieves state-of-the-art hyperspectral imaging using only two off-the-shelf lenses, a grayscale sensor, and less than one second of reconstruction time. By capturing a chromatically-aberrated focal stack that preserves nearly all incident light, and reconstructing it with a fast physics-based iterative algorithm, SfD delivers sharp, accurate hyperspectral images. The combination of photon efficiency, optical simplicity, and physical interpretability makes SfD a promising solution for fast, compact, and interpretable hyperspectral imaging.

View full details

Oral

FedAdamom: Adaptive Momentum for Improved Generalization in Federated Optimization

Wenjie Hou ⋅ Tianxiang Chen ⋅ Feng Wang ⋅ Tiantong Wu ⋅ Zhiming Zheng ⋅ Shaoting Tang ⋅ Wei Yang Bryan Lim

Jun 7, 2:15 PM - 2:30 PM Mile High Ballroom 3A - 4A

Federated learning (FL) has emerged as a widely adopted training paradigm for privacy-preserving machine learning. Despite the past success of SGD-based methods, they still suffer from severe data heterogeneity and a lack of adaptivity in practical applications. While several adaptive federated optimization methods (such as FedAdam) have been proposed and demonstrated to achieve faster convergence, they fail to show significant improvements in generalization performance under highly heterogeneous data distributions, and their optimization and generalization mechanisms remain insufficiently understood. To fill this gap, we introduce diffusion theory into the adaptive federated optimization framework and analyze the distinct effects of adaptive learning rate and global momentum from the perspectives of saddle-point escaping and flat-minima selection. Theoretical results show that although FedAdam outperforms FedAvg/FedAvgM in escaping saddle points, the latter escapes sharp minima more efficiently. The root cause is that adaptive learning rates, while enhancing saddle-point escape, weaken the preference for flat minima. Motivated by these insights, we propose FedAdamom, a new adaptive federated optimization algorithm that adapts the momentum hyperparameter rather than the learning rate. FedAdamom maintains strong saddle-point escaping capability while enhancing flat-minima selection. We further establish its convergence guarantees under non-convex objectives. Extensive experiments demonstrate that FedAdamom significantly outperforms existing adaptive federated optimization methods in terms of convergence speed, generalization performance, and preference for flat minima.
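
To make the contrast between global momentum and adaptive learning rates concrete, here is a minimal sketch of the two server-side baselines named in the abstract, FedAvgM and FedAdam, as they are commonly formulated. It is not the FedAdamom update, which the abstract does not specify. The function names, default hyperparameters, and the convention that `delta` is the averaged client update (client model minus global model) are assumptions made for illustration.

```python
import math

def fedavgm_step(x, delta, m, lr=1.0, beta=0.9):
    """One FedAvgM server step: global momentum, fixed learning rate.

    x     -- current global parameters (list of floats)
    delta -- averaged client update, client model minus global model (assumed convention)
    m     -- server momentum buffer
    """
    m = [beta * mi + di for mi, di in zip(m, delta)]
    x = [xi + lr * mi for xi, mi in zip(x, m)]
    return x, m

def fedadam_step(x, delta, m, v, lr=0.1, beta1=0.9, beta2=0.99, tau=1e-3):
    """One FedAdam server step: per-coordinate adaptive learning rate.

    v tracks the second moment of delta; tau controls the degree of adaptivity.
    """
    m = [beta1 * mi + (1 - beta1) * di for mi, di in zip(m, delta)]
    v = [beta2 * vi + (1 - beta2) * di * di for vi, di in zip(v, delta)]
    x = [xi + lr * mi / (math.sqrt(vi) + tau) for xi, mi, vi in zip(x, m, v)]
    return x, m, v

# Toy usage on a single scalar parameter.
x_m, m_buf = fedavgm_step([0.0], [1.0], [0.0])
x_a, m_a, v_a = fedadam_step([0.0], [1.0], [0.0], [0.0])
```

In this sketch the FedAvgM step scales linearly with the size of `delta`, while the FedAdam step is normalized by the running second moment, which is the kind of rescaling the abstract's analysis attributes the different saddle-point and flat-minima behavior to.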

View full details

Oral

FILTR: Extracting Topological Features from Pretrained 3D Models

Louis Martinez ⋅ Maks Ovsjanikov

Jun 7, 2:15 PM - 2:30 PM Bluebird Ballroom

Recent advances in pretraining 3D point cloud encoders (e.g., Point-BERT, Point-MAE) have produced powerful models, whose abilities are typically evaluated on geometric or semantic tasks. At the same time, topological descriptors have been shown to provide informative summaries of a shape's multiscale structure. In this paper we ask whether topological information can be derived from features produced by 3D encoders. To address this question, we first introduce DONUT, a synthetic benchmark with controlled topological complexity, and propose FILTR (Filtration Transformer), a learnable framework to predict persistence diagrams directly from frozen encoders. FILTR adapts a transformer decoder to treat diagram generation as a set prediction task. Our analysis on DONUT reveals that existing encoders retain only limited global topological signals, yet FILTR successfully leverages information produced by these encoders to approximate persistence diagrams. Our approach enables, for the first time, data-driven extraction of persistence diagrams from raw point clouds through an efficient learnable feed-forward mechanism.

View full details

Oral

Dual-level Adapter Boosting Prompt-free Curvilinear Structure Segmentation

Kai Zhu ⋅ Li Chen ⋅ Jun Cheng

Jun 7, 2:25 PM - 2:37 PM Mile High Ballroom 1A - 2A

Curvilinear structure segmentation is essential in domains such as medical imaging, remote sensing, and materials science. Existing methods often require extensive domain-specific training and lack generalization to novel domains. To overcome these limitations, we propose the Segment Anything Curve Model (SACM) — a universal, curvilinear segmentation framework built upon the pretrained Segment Anything Model (SAM). SACM introduces a dual-level adapter architecture that enables both fine-grained and domain-adaptive enhancement: block-level internal adapters refine local structural representations, while external adapters facilitate cross-domain feature alignment. Specifically, the internal adapters are embedded within each Transformer block to locally adapt and refine features for thin and intricate curvilinear patterns, while the external adapters operate across blocks to capture global, multi-layer contextual information and facilitate domain adaptation. Furthermore, SACM introduces a feature fusion mechanism that aggregates multi-layer features from all external adapters and fuses them via a Feed-Forward Network (FFN) module, and a dual-stage refinement process in the mask decoder to enhance topology and connectivity. This design enables prompt-free, data-efficient fine-tuning and achieves robust cross-domain generalization when trained with only 18 annotated images. Extensive experiments across twelve diverse curvilinear datasets validate that SACM achieves state-of-the-art performance.

View full details

Oral

SDTrack: A Baseline for Event-based Tracking via Spiking Neural Networks

Yimeng Shan ⋅ Zhenbang Ren ⋅ Haodi Wu ⋅ Wenjie Wei ⋅ Rui-Jie Zhu ⋅ Shuai Wang ⋅ Dehao Zhang ⋅ Yichen Xiao ⋅ Jieyuan Zhang ⋅ Kexin Shi ⋅ Jingzhinan Wang ⋅ Jason K. Eshraghian ⋅ Haicheng Qu ⋅ Malu Zhang

Jun 7, 2:30 PM - 2:45 PM Four Seasons Ballroom

Event cameras provide superior temporal resolution, dynamic range, energy efficiency, and pixel bandwidth. Spiking Neural Networks (SNNs) naturally complement event data through discrete spike signals, making them ideal for event-based tracking. However, current approaches combining Artificial Neural Networks (ANNs) and SNNs suffer from suboptimal architectures that compromise energy efficiency and limit tracking performance. To address these limitations, we propose the first Transformer-based Spike-Driven Tracking (SDTrack) pipeline. It incorporates a novel event frame aggregation method called Global Trajectory Prompt (GTP) and a Transformer-based tracker. The GTP method effectively captures global trajectory information and aggregates it with event streams into event frames to enhance spatiotemporal representation. The Transformer-based tracker comprises a fully spike-driven SNN backbone and a simple tracking head. The SDTrack pipeline operates end-to-end without data augmentation or post-processing. Extensive experiments demonstrate that our SDTrack-Tiny pipeline achieves competitive accuracy with only 19.61M parameters and 8.16 mJ energy consumption, while our Base version achieves state-of-the-art accuracy across three datasets. Our work establishes a solid foundation for future neuromorphic vision research.

View full details

Oral

Learning Convex Decomposition via Feature Fields

Yuezhi Yang ⋅ Qixing Huang ⋅ Mikaela Angelina Uy ⋅ Nicholas Sharp

Jun 7, 2:30 PM - 2:45 PM Bluebird Ballroom

This work proposes a new formulation of the long-standing problem of convex decomposition through learning feature fields, enabling the first feed-forward model for open-world learning of convex decomposition. Our method produces high-quality decompositions of 3D shapes into a union of convex bodies, which are essential to accelerate collision detection in physical simulation, amongst many other applications. The key insight is to adopt a feature learning approach and learn a continuous feature field that can later be clustered to yield a good convex decomposition via our self-supervised, purely-geometric objective derived from the classical definition of convexity. Our formulation can be used for single shape optimization, but more importantly, feature prediction unlocks scalable, self-supervised learning on large datasets, resulting in the first learned open-world model for convex decomposition. Experiments show that our decompositions are higher-quality than alternatives and generalize across open-world objects as well as across representations to meshes, CAD models, and even Gaussian splats.
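
The abstract says only that the objective is derived from the classical definition of convexity: a set is convex if, for any two of its points, the whole segment between them stays inside the set. As a rough illustration of that definition, and not of the paper's actual loss, the sketch below scores a candidate part by the fraction of point pairs whose connecting segment leaves the shape. The function name, the `inside` occupancy callback, and the sampled interpolation values `ts` are all hypothetical.

```python
from itertools import combinations

def convexity_violation(points, inside, ts=(0.25, 0.5, 0.75)):
    """Fraction of point pairs whose connecting segment leaves the shape.

    points -- sample points assumed to lie inside the candidate part
    inside -- callable returning True if a point lies inside the shape (hypothetical)
    ts     -- interpolation parameters at which each segment is probed
    A convex part gives 0.0; larger values indicate stronger non-convexity.
    """
    pairs = list(combinations(points, 2))
    if not pairs:
        return 0.0
    bad = 0
    for a, b in pairs:
        for t in ts:
            p = tuple((1 - t) * u + t * v for u, v in zip(a, b))
            if not inside(p):
                bad += 1
                break  # count each violating pair once
    return bad / len(pairs)

# Toy 2D usage: a square is convex, an L-shape is not.
grid = [(x, y) for x in (0.25, 0.75, 1.25, 1.75) for y in (0.25, 0.75, 1.25, 1.75)]
in_square = lambda p: 0 <= p[0] <= 2 and 0 <= p[1] <= 2
in_l_shape = lambda p: in_square(p) and not (p[0] > 1 and p[1] > 1)
l_points = [p for p in grid if in_l_shape(p)]

square_score = convexity_violation(grid, in_square)
l_score = convexity_violation(l_points, in_l_shape)
```

A decomposition method could, in principle, prefer groupings of points for which such a score is low within each part; how the paper turns the definition into a differentiable objective on feature fields is not stated in the abstract.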

View full details

Oral

SimScale: Learning to Drive via Real-World Simulation at Scale

Haochen Tian ⋅ Tianyu Li ⋅ Haochen Liu ⋅ Jiazhi Yang ⋅ Yihang Qiu ⋅ Guang Li ⋅ Junli Wang ⋅ Yinfeng Gao ⋅ Zhang Zhang ⋅ Liang Wang ⋅ Hangjun Ye ⋅ Long Chen ⋅ Hongyang Li

Jun 7, 2:30 PM - 2:45 PM Mile High Ballroom 3A - 4A

Achieving fully autonomous driving systems requires learning rational decisions in a wide span of scenarios, including safety-critical and out-of-distribution ones. However, such cases are underrepresented in real-world corpora collected by human experts. To compensate for the lack of data diversity, we introduce a novel and scalable simulation framework capable of synthesizing massive numbers of these crucial unseen states from existing driving logs. Our pipeline utilizes advanced neural rendering with a reactive environment to generate high-fidelity multi-view observations controlled by ego trajectory perturbations. Furthermore, we develop a pseudo-expert trajectory generation mechanism to provide feasible action supervision for these newly simulated states. Using the synthesized data, we find that a simple co-training strategy on both real-world and simulated samples can lead to significant improvements in both robustness and generalization for various planning methods on challenging real-world benchmarks, up to +6.8 EPDMS on navhard and +2.9 on navtest. More importantly, such policy improvement scales smoothly by increasing simulation data only, even without extra real-world data streaming in. We further reveal crucial findings of such a sim-real paradigm, including the design of pseudo-experts and the scaling properties for different policy architectures. Simulation data and code will be released.

View full details

Oral

LATA: Laplacian-Assisted Transductive Adaptation for Conformal Uncertainty in Medical VLMs

Behzad Bozorgtabar ⋅ Dwarikanath Mahapatra ⋅ Sudipta Roy ⋅ Muzammal Naseer ⋅ Imran Razzak ⋅ Zongyuan Ge

Jun 7, 2:37 PM - 2:50 PM Mile High Ballroom 1A - 2A

Medical vision-language models (VLMs) are strong zero-shot recognizers for medical imaging, but their reliability under domain shift hinges on calibrated uncertainty with guarantees. Split conformal prediction (SCP) offers finite-sample coverage, yet prediction sets often become large (low efficiency) and class-wise coverage unbalanced (a high class-conditioned coverage gap, CCV), especially in few-shot, imbalanced regimes; moreover, naively adapting to calibration labels breaks exchangeability and voids guarantees. We propose LATA (Laplacian-Assisted Transductive Adaptation), a training- and label-free refinement that operates on the joint calibration and test pool by smoothing zero-shot probabilities over an image–image kNN graph using a small number of CCCP mean-field updates, preserving SCP validity via a deterministic transform. We further introduce a failure-aware conformal score that plugs into the vision-language uncertainty (ViLU) framework, providing instance-level difficulty and label plausibility to improve prediction set efficiency and class-wise balance at fixed coverage. LATA is black-box (no VLM updates), compute-light (windowed transduction, no backprop), and includes an optional prior knob that can run strictly label-free or, if desired, in a label-informed variant using calibration marginals once. Across three medical VLMs and nine downstream tasks, LATA consistently reduces set size and CCV while matching or tightening target coverage, outperforming prior transductive baselines and narrowing the gap to label-using methods, while using far less compute. Comprehensive ablations and qualitative analyses show that LATA sharpens zero-shot predictions without compromising exchangeability.
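
For readers unfamiliar with split conformal prediction, the following is a minimal sketch of the standard SCP procedure that the abstract builds on; it does not implement LATA's graph smoothing or its failure-aware score. It assumes the common nonconformity score of one minus the probability assigned to the true class; the function names and toy numbers are illustrative only.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha):
    """Calibrate the score threshold for target coverage 1 - alpha.

    cal_probs  -- per-example class-probability lists from the calibration split
    cal_labels -- true class index for each calibration example
    Score (assumed): 1 - probability of the true class.
    """
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    if k > n:
        return float("inf")  # too few calibration points: include every class
    return scores[k - 1]

def prediction_set(probs, qhat):
    """All classes whose score does not exceed the calibrated threshold."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= qhat]

# Toy usage: 7 binary calibration examples, all with true label 0.
cal_probs = [[p, 1.0 - p] for p in (0.875, 0.75, 0.625, 0.5, 0.375, 0.25, 0.125)]
cal_labels = [0] * 7
qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.25)
confident_set = prediction_set([0.75, 0.125, 0.125], qhat)
uncertain_set = prediction_set([0.5, 0.25, 0.25], qhat)
```

Under exchangeability of calibration and test points, sets built this way contain the true label with probability at least 1 - alpha. Flatter probability vectors yield larger sets, which is why sharpening zero-shot probabilities without breaking exchangeability can improve efficiency.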

View full details

Oral

Thinking with Drafts: Speculative Temporal Reasoning for Efficient Long Video Understanding

Pengfei Hu ⋅ Meng Cao ⋅ Yingyao Wang ⋅ Yi Wang ⋅ Jiahua Dong ⋅ Jun Song ⋅ Cheng Yu ⋅ Bo Zheng ⋅ Xiaodan Liang

Jun 7, 2:45 PM - 3:00 PM Four Seasons Ballroom

Long video understanding is essential for human-like intelligence, enabling coherent perception and reasoning over extended temporal contexts. While the emerging thinking-with-frames paradigm—which alternates between global temporal reasoning and local frame examination—has advanced the reasoning capabilities of video multi-modal large language models (MLLMs), it suffers from a significant efficiency bottleneck due to the progressively growing and redundant multi-modal context. To address this, we propose SpecTemp, a reinforcement learning-based Speculative Temporal reasoning framework that decouples temporal perception from reasoning via a cooperative dual-model design. In SpecTemp, a lightweight draft MLLM rapidly explores and proposes salient frames from densely sampled temporal regions, while a powerful target MLLM focuses on temporal reasoning and verifies the draft’s proposals, iteratively refining its attention until convergence. This design mirrors the collaborative pathways of the human brain, balancing efficiency with accuracy. To support training, we construct the SpecTemp-80K dataset, featuring synchronized dual-level annotations for coarse evidence spans and fine-grained frame-level evidence. Experiments across multiple video understanding benchmarks demonstrate that SpecTemp not only maintains competitive accuracy but also significantly accelerates inference compared with existing thinking-with-frames methods.

View full details

Oral

Texvent: Asynchronous Event Data Simulation via Text Prompt

Ruofei Wang ⋅ Peiqi Duan ⋅ Ka Chun Cheung ⋅ Simon See ⋅ Boxin Shi ⋅ Renjie Wan

Jun 7, 2:45 PM - 3:00 PM Mile High Ballroom 3A - 4A

Current event simulation methods focus on employing videos to synthesize new event data, suffering from costly video capture and limited scalability across viewpoints, motions, and lighting. To address this, we propose a Text-to-event simulation framework (Texvent) that can directly generate asynchronous event data from simple text prompts. Texvent first renders prompt-driven videos via multimodal large language models and subsequently applies a new physical simulator to generate event streams. Specifically, an adaptive brightness-aware frame interpolation approach is proposed to enhance the temporal resolution of the rendered videos. A balanced logarithmic intensity comparison strategy and a cache-based voltage refreshment mechanism are introduced into the simulator to generate event data. To narrow the sim-to-real gap, we also introduce background activity noise injection and dense time stamp reconstruction operations. Extensive experiments demonstrate Texvent’s superior computational efficiency and its ability to generate more realistic event data than existing simulators.
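
As background for the simulator stage, the sketch below shows the generic contrast-threshold event model that video-to-event simulators are typically built on: a pixel emits an event whenever its log intensity moves by at least a threshold from a stored reference. This is not Texvent's balanced comparison strategy or its voltage refreshment mechanism, which the abstract does not detail; the function name, the threshold value, and the per-frame timestamping are assumptions.

```python
import math

def simulate_events(frames, timestamps, threshold=0.2, eps=1e-3):
    """Generate (t, x, y, polarity) events from a grayscale frame sequence.

    frames     -- list of 2D lists of non-negative intensities
    timestamps -- one timestamp per frame
    threshold  -- log-intensity contrast needed to fire one event (assumed value)
    eps        -- small offset that keeps log() finite for dark pixels
    """
    # Per-pixel reference log intensity, initialized from the first frame.
    ref = [[math.log(v + eps) for v in row] for row in frames[0]]
    events = []
    for frame, t in zip(frames[1:], timestamps[1:]):
        for y, row in enumerate(frame):
            for x, v in enumerate(row):
                diff = math.log(v + eps) - ref[y][x]
                n = int(abs(diff) // threshold)  # threshold crossings since last update
                if n:
                    pol = 1 if diff > 0 else -1
                    # All crossings get the frame timestamp in this simplified sketch.
                    events.extend((t, x, y, pol) for _ in range(n))
                    ref[y][x] += pol * n * threshold
    return events

# Toy usage: one pixel brightening from 1.0 to 1.65 (log change of about 0.5).
events = simulate_events([[[1.0]], [[1.65]]], [0.0, 1.0], threshold=0.2, eps=0.0)
```

Because every crossing here is stamped with the frame time, the temporal resolution of the output is bounded by the frame rate, which is the limitation that frame interpolation and dense timestamp reconstruction are meant to reduce.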

View full details

Oral

Learning Eigenstructures of Unstructured Data Manifolds

Roy Velich ⋅ Arkadi Piven ⋅ David Bensaid ⋅ Daniel Cremers ⋅ Thomas Dagès ⋅ Ron Kimmel

Jun 7, 2:45 PM - 3:00 PM Bluebird Ballroom

We introduce a novel framework that directly learns a spectral basis for shape and manifold analysis from unstructured data, eliminating the need for traditional operator selection, discretization, and eigensolvers. Grounded in optimal-approximation theory, we train a network to decompose an implicit approximation operator by minimizing the reconstruction error in the learned basis over a chosen distribution of probe functions. For suitable distributions, the learned decomposition can be seen as an approximation of the Laplacian operator and its eigendecomposition, which are fundamental in geometry processing. Furthermore, our method recovers in a unified manner not only the spectral basis, but also the implicit metric's sampling density and the eigenvalues of the underlying operator. Notably, our unsupervised method makes no assumption on the data manifold, such as meshing or manifold dimensionality, allowing it to scale to arbitrary datasets of any dimension. On point clouds lying on surfaces in 3D and high-dimensional image manifolds, our approach yields meaningful spectral bases that can resemble those of the Laplacian, without explicit construction of an operator. By replacing the traditional operator selection, construction, and eigendecomposition with a learning-based approach, our framework offers a principled, data-driven alternative to conventional pipelines. This opens new possibilities in geometry processing for unstructured data, particularly in high-dimensional spaces.

View full details

Oral

Medic-AD: Towards Medical Vision-Language Model's Clinical Intelligence

Woohyeon Park ⋅ Jaeik Kim ⋅ Sunghwan Steve Cho ⋅ Pa Hong ⋅ Wookyoung Jeong ⋅ Yoojin Nam ⋅ Namjoon Kim ⋅ Ginny Y. Wong ⋅ Ka Chun Cheung ⋅ Jaeyoung Do

Jun 7, 2:50 PM - 3:02 PM Mile High Ballroom 1A - 2A

Lesion detection, symptom tracking, and visual explainability are central to real-world medical image analysis, yet current medical Vision-Language Models (VLMs) still lack mechanisms that translate their broad knowledge into clinically actionable outputs. To bridge this gap, we present Medic-AD, a clinically oriented VLM that strengthens these three capabilities through a stage-wise framework. First, learnable anomaly-aware tokens (Ano) encourage the model to focus on abnormal regions and build more discriminative lesion-centered representations. Second, inter-image difference tokens (Diff) explicitly encode temporal changes between studies, allowing the model to distinguish worsening, improvement, and stability in disease burden. Finally, a dedicated explainability stage trains the model to generate heatmaps that highlight lesion-related regions, offering clear visual evidence that is consistent with the model's reasoning. Through our staged design, Medic-AD steadily boosts performance across anomaly detection, symptom tracking, and anomaly segmentation, achieving state-of-the-art results compared with both closed-source and medical-specialized baselines. Evaluations on longitudinal clinical data collected from real hospital workflows further show that Medic-AD delivers stable predictions and clinically faithful explanations in practical patient-monitoring and decision-support settings.

View full details

Oral

Wan-Weaver: Interleaved Multi-modal Generation via Decoupled Training

Jinbo Xing ⋅ Zeyinzi Jiang ⋅ Yuxiang Tuo ⋅ Chaojie Mao ⋅ Xiaotang Gai ⋅ Xi Chen ⋅ Jingfeng Zhang ⋅ Yulin Pan ⋅ Zhen Han ⋅ Jie Xiao ⋅ Keyu Yan ⋅ Chenwei Xie ⋅ Chongyang Zhong ⋅ Kai Zhu ⋅ Tong Shen ⋅ Lianghua Huang ⋅ Yu Liu ⋅ Yujiu Yang

Jun 7, 3:00 PM - 3:15 PM Four Seasons Ballroom

Recent unified multi-modal models have made unprecedented progress in understanding and generation, yet they largely support multi-modal inputs with single-modality outputs, struggling to produce complex interleaved text–image content due to data scarcity and the difficulty of modeling long-range cross-modal context. We introduce Weaver, which frames interleaved generation as an autoregressive planning–visualization process within a unified multi-modal architecture. A planner, i.e., understanding expert, digests rich text–image context to produce visualization triggers and their dense textual guidance except for plain text, while a visualizer, i.e., generation expert, produces images conditioned on the planner’s textual guidance and visual references. This design enables decoupled learning: we train the two experts on large collections of textual planning and reference-guided image data in parallel, yielding powerful interleaved multi-modal generation capability at inference. Moreover, training the planner with datasets from diverse understanding and generation tasks equips the model with automatic task inference. To analyze and evaluate the model from multiple dimensions, we further introduce a benchmark that covers a range of everyday use cases. Extensive experiments show that, even with little or no training on real interleaved data, Weaver achieves superior performance on interleaved multi-modal generation.

View full details

Oral

WorldLens: Full-Spectrum Evaluations of Driving World Models in Real World

Ao Liang ⋅ Lingdong Kong ⋅ Tianyi Yan ⋅ Hongsi Liu ⋅ Yu Yang ⋅ Ziqi Huang ⋅ Wei Yin ⋅ Jialong Zuo ⋅ Yixuan Hu ⋅ Dekai Zhu ⋅ Dongyue Lu ⋅ Youquan Liu ⋅ Guangfeng Jiang ⋅ Linfeng Li ⋅ Xiangtai Li ⋅ Long Zhuo ⋅ Lai Xing Ng ⋅ Benoit R. Cottereau ⋅ Changxin Gao ⋅ Liang Pan ⋅ Wei Tsang Ooi ⋅ Ziwei Liu

Jun 7, 3:00 PM - 3:15 PM Mile High Ballroom 3A - 4A

Generative world models are reshaping embodied AI, enabling agents to synthesize realistic 4D driving environments that look convincing but often fail physically or behaviorally. Despite rapid progress, the field still lacks a unified way to assess whether generated worlds preserve geometry, obey physics, or support reliable control. We introduce WorldLens, a full-spectrum benchmark evaluating how well a model builds, understands, and behaves within its generated world. It spans five aspects (Generation, Reconstruction, Action-Following, Downstream Task, and Human Preference), jointly covering visual realism, geometric consistency, physical plausibility, and functional reliability. Across these dimensions, no existing world model excels universally: those with strong textures often violate physics, while geometry-stable ones lack behavioral fidelity. To align objective metrics with human judgment, we further construct WorldLens-26K, a large-scale dataset of human-annotated videos with numerical scores and textual rationales, and develop WorldLens-Agent, an evaluation model distilled from these annotations to enable scalable, explainable scoring. Together, the benchmark, dataset, and agent form a unified ecosystem for measuring world fidelity, standardizing how future models are judged not only by how real they look, but by how real they behave.

View full details

Oral

Mapping Networks

Lord Sen ⋅ Shyamapada Mukherjee

Jun 7, 3:00 PM - 3:15 PM Bluebird Ballroom

The escalating parameter counts in modern deep learning models pose a fundamental challenge to efficient training and resolution of overfitting. We address this by introducing Mapping Networks, which replace the high-dimensional weight space with a compact, trainable latent vector, based on the hypothesis that the trained parameters of large networks reside on smooth, low-dimensional manifolds. The Mapping Theorem, enforced by a dedicated Mapping Loss, shows the existence of a mapping from this latent space to the target weight space both theoretically and in practice. Mapping Networks significantly reduce overfitting and achieve comparable or better performance than the target network across complex vision and sequence tasks, including Image Classification, Deepfake Detection etc., with 99.5%, i.e., around 500× reduction in trainable parameters.

View full details

Oral

SegMoTE: Token-Level Mixture of Experts for Medical Image Segmentation

Yujie Lu ⋅ Jingwen Li ⋅ Sibo Ju ⋅ Yanzhou Su ⋅ He Yao ⋅ Yisong Liu ⋅ Min Zhu ⋅ Junlong Cheng

Jun 7, 3:02 PM - 3:15 PM Mile High Ballroom 1A - 2A

Medical image segmentation is vital for clinical diagnosis and quantitative analysis, yet remains challenging due to the heterogeneity of imaging modalities and the high cost of pixel-level annotations. Although general interactive segmentation models like SAM have achieved remarkable progress, their transfer to medical imaging still faces two key bottlenecks: (i) the lack of adaptive mechanisms for modality- and anatomy-specific tasks, which limits generalization in out-of-distribution medical scenarios; and (ii) the tendency of current medical adaptation methods to fine-tune on large, heterogeneous datasets without selection, leading to noisy supervision, higher cost, and negative transfer. To address these issues, we propose SegMoTE, an efficient and adaptive framework for medical image segmentation. SegMoTE preserves SAM’s original prompt interface, efficient inference, and zero-shot generalization while introducing only a small number of learnable parameters to dynamically adapt across modalities and tasks. In addition, we design a progressive prompt tokenization mechanism that enables fully automatic segmentation, significantly reducing annotation dependence. Trained on MedSeg-HQ, a curated dataset less than 1% the size of existing large-scale datasets, SegMoTE achieves SOTA performance across diverse imaging modalities and anatomical tasks. It represents the first efficient, robust, and scalable adaptation of general segmentation models to the medical domain under extremely low annotation cost, advancing the practical deployment of foundation vision models in clinical applications.

View full details

Oral

Adversarial Style Optimization: Enhancing VLM Jailbreaks by GRPO-based Stylistic Triggers Optimization

Bingjun Luo ⋅ Jialin Guo ⋅ Yue Yao ⋅ Xinpeng Ding

Multimodal Large Language Models (MLLMs) have achieved impressive performance, but their safety alignment remains vulnerable to jailbreak attacks. Existing content-based jailbreaks are often inconsistent and show low attack success rates (ASR) against commercial closed-source MLLMs, failing to exploit non-content-based vulnerabilities. Unlike previous research, we empirically find that MLLMs exhibit a Stylistic Inconsistency between their comprehension ability and safety ability. That is, from the perspective of comprehension, MLLMs can robustly understand content regardless of visual style (e.g., "pencil sketch"). However, from the perspective of safety ability, their defense mechanisms can be easily bypassed by these specific stylistic triggers, leading to harmful responses. Based on this finding, we propose Adversarial Style Optimization (ASO), a plug-and-play enhancement module to amplify existing visual jailbreaks. ASO fine-tunes an image-editing model to superimpose an optimized stylistic modification onto a given adversarial image. We apply a Group Relative Policy Optimization (GRPO) agent, guided by a Structurally-Tiered Reward Function. This function uniquely combines a logit-based signal for detecting explicit refusals with a high-fidelity semantic evaluation from a powerful judge model, mapping outcomes to distinct, non-overlapping reward tiers to select the most potent stylistic parameters. Extensive experiments show that ASO significantly enhances the ASR of SOTA attacks. The GRPO agent automatically discovers optimal, non-intuitive parameters, demonstrating that stylistic biases are a scalable and modular vector for red-teaming MLLMs.

View full details